
Parallizing data in a flat file

Posted: Thu Jun 30, 2011 9:58 am
by blekota74
I have a flat file with content like this (CSV):

Code:

field_A;field_B
1;a
2;b
3;c
4;d
5;e
6;f
7;g


I need to convert this file into a flat file with (for instance) three sets of columns:

Code:

field_A1;field_B1;field_A2;field_B2;field_A3;field_B3
1;a;4;d;7;g
2;b;5;e;;
3;c;6;f;;


Normally, to achieve this I perform the following steps:
1. Check the number of records (lines) in the file.
2. If the number of records is not divisible by 3 (in my case, but it can be any number), append the missing blank records (2 in my case).
3. Put the first third of the records into column set 1, the second third into column set 2, and the last third into column set 3.

This gives a structure like this (a copy of the example above):

Code:

field_A1;field_B1;field_A2;field_B2;field_A3;field_B3
1;a;4;d;7;g
2;b;5;e;;
3;c;6;f;;


Does anybody have an idea how to do this in CloverETL? At present I have to do this in an external tool.

Re: Parallizing data in a flat file

Posted: Thu Jun 30, 2011 2:45 pm
by avackova
Uff, I've got it :wink:, although it was not easy at all.
The attached graph implements the following algorithm:
  • read input data
  • add column number to each input record
  • sort the records according to this number
  • format each group of records to one record
    • add the empty record if needed
  • store data in new format
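
The graph itself is attached to the post and not shown here, so the following is only a plain-Java sketch of the same idea, outside CloverETL. It assumes that the number added to each record is its target output row (record index modulo the number of output rows), which is consistent with the sample output in the first post. The class and method names (ColumnReflow, reflow) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ColumnReflow {

    /**
     * Reflows records into `sets` side-by-side column sets, filling column-first.
     * `emptyRecord` is the padding used when a column set runs out of records.
     */
    static List<String> reflow(List<String> records, int sets, String emptyRecord, String delimiter) {
        // Rows needed so that `sets` column sets can hold every record (ceiling division).
        int rows = (records.size() + sets - 1) / sets;

        // Key each record with its target output row (assumed key, see the lead-in).
        // TreeMap keeps the groups sorted by key, standing in for the sort step.
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (int i = 0; i < records.size(); i++) {
            int key = i % rows;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(records.get(i));
        }

        // Format each group as one output line, padding short groups with empty records.
        List<String> lines = new ArrayList<>();
        for (List<String> group : groups.values()) {
            while (group.size() < sets) {
                group.add(emptyRecord);
            }
            lines.add(String.join(delimiter, group));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1;a", "2;b", "3;c", "4;d", "5;e", "6;f", "7;g");
        // An empty two-field record is just the delimiter between its two empty fields.
        for (String line : reflow(input, 3, ";", ";")) {
            System.out.println(line);
        }
    }
}
```

With the seven sample records and three column sets this prints the three data lines from the first post (without the header line).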

Re: Parallizing data in a flat file

Posted: Fri Jul 01, 2011 3:43 pm
by blekota74
Thank you for your fast response, but I get an error when I execute the graph:

Code:

INFO  [main] - ***  CloverETL framework/transformation graph, (c) 2002-2011 Javlin a.s, released under GNU Lesser General Public License  ***
INFO  [main] - Running with CloverETL library version 3.1.0 build#17 compiled 16/06/2011 16:06:35
INFO  [main] - Running on 2 CPU(s), OS Windows XP, architecture x86, Java version 1.6.0_26, max available memory for JVM 253440 KB
INFO  [main] - Loading default properties from: defaultProperties
INFO  [main] - Graph definition file: graph/Parallizing.grf
INFO  [main] - Graph revision: 1.48 Modified by: user Modified: Thu Jun 30 17:32:23 CEST 2011
INFO  [main] - Checking graph configuration...
INFO  [main] - Graph configuration is valid.
INFO  [main] - Graph initialization (Parallizing)
INFO  [main] - [Clover] Initializing phase: 0
INFO  [main] - Compiling dynamic class FormatInput...
ERROR [main] - Error during graph initialization !
Element [1309427766212:Parallizing]-Phase 0 can't be initilized.
   at org.jetel.graph.TransformationGraph.init(TransformationGraph.java:458)
   at org.jetel.graph.runtime.EngineInitializer.initGraph(EngineInitializer.java:202)
   at org.jetel.graph.runtime.EngineInitializer.initGraph(EngineInitializer.java:165)
   at org.jetel.main.runGraph.runGraph(runGraph.java:364)
   at org.jetel.main.runGraph.main(runGraph.java:328)
Caused by: DENORMALIZER0 ...FATAL ERROR !
Reason: Used Java Platform doesn't provide any java compiler!
   at org.jetel.graph.Phase.init(Phase.java:174)
   at org.jetel.graph.TransformationGraph.init(TransformationGraph.java:456)
   ... 4 more
Caused by: java.lang.IllegalStateException: Used Java Platform doesn't provide any java compiler!
   at org.jetel.util.compile.DynamicCompiler.compile(DynamicCompiler.java:109)
   at org.jetel.util.compile.DynamicJavaClass.instantiate(DynamicJavaClass.java:66)
   at org.jetel.component.Denormalizer.createDenormalizerDynamic(Denormalizer.java:216)
   at org.jetel.component.Denormalizer.createRecordDenormalizer(Denormalizer.java:269)
   at org.jetel.component.Denormalizer.init(Denormalizer.java:241)
   at org.jetel.graph.Phase.init(Phase.java:165)
   ... 5 more

Re: Parallizing data in a flat file

Posted: Thu Jul 07, 2011 9:51 am
by jurban
Hi,

Are you running CloverETL with a JRE or a JDK? A JDK is required to run Java transformations, and such a transformation is used in the graph provided by Agata.

Best regards,
Jaro
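
A quick way to check which kind of Java installation a process is running on is to ask the standard javax.tools API for the system compiler. This is only a standalone sketch; whether CloverETL's DynamicCompiler performs exactly this check is an assumption, and the class name CompilerCheck is made up for the example.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class CompilerCheck {

    /** Returns true when the running Java platform exposes a system Java compiler. */
    static boolean hasSystemCompiler() {
        // getSystemJavaCompiler() returns null when no compiler is provided,
        // which is the typical situation on a bare JRE.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    public static void main(String[] args) {
        // java.home shows which installation the JVM was started from.
        System.out.println("java.home = " + System.getProperty("java.home"));
        System.out.println(hasSystemCompiler()
                ? "A Java compiler is available (JDK)."
                : "No Java compiler found; this looks like a JRE.");
    }
}
```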

Re: Parallizing data in a flat file

Posted: Wed Jul 13, 2011 10:43 am
by blekota74
I set the path to the JDK and it works. I tried to understand the Java code in the example, in the object named 'Format many to one' (component type: Denormalizer), and I think the documentation is lacking. For instance, I can't find references for classes like DataFormatter or ByteArrayOutputStream. I googled for cloveretl DataFormatter and couldn't find any information.
I deal mainly with UTF-8, and when I use non-English characters in the source file (encoded as UTF-8) I get an error (I set 'Denormalize source set' to utf-8):

Code:

ERROR [WatchDog] - Node DENORMALIZER0 finished with status: Error occurred in nested transformation: ERROR caused by: Message: Denormalization failed! caused by: java.lang.RuntimeException: Exception when converting the field value: g zażółć gęślą jaźń a koń pędź (field name: 'field_B') to ISO-8859-1. (original cause: Input length = 1)


Below is the full content of my example data file:

Code:

field_A;field_B
1;a
2;b
3;c
4;d
5;e
6;f
7;g zażółć gęślą jaźń a koń pędź

Re: Parallizing data in a flat file

Posted: Wed Jul 13, 2011 11:12 am
by avackova
Hello,
  1. To handle Polish characters you need to set the proper charset on the Writer.
  2. Javadoc and source files (of the open source part of CloverETL Engine) can be downloaded from the CloverETL on Sourceforge page.

Re: Parallizing data in a flat file

Posted: Wed Jul 13, 2011 11:57 am
by blekota74
Well, I checked the input data (debug) on the DENORMALIZER0 object and it is OK, but there is no data at the output. To me it looks like a problem in a class that can't handle multibyte characters (when I remove all the Polish characters it works properly).
Exception when converting the field value: g zażółć gęślą jaźń a koń pędź (field name: 'field_B') to ISO-8859-1. (original cause: Input length = 1)

Re: Parallizing data in a flat file

Posted: Wed Jul 13, 2011 12:14 pm
by avackova
Please change the charset on the Writer:
[Screenshot: UniversalDataWriter.png]

The charset in the Denormalizer is used only for decoding an external transformation source.

Re: Parallizing data in a flat file

Posted: Wed Jul 13, 2011 12:26 pm
by blekota74
Of course I did that, and the error still exists. In my opinion the problem is in node DENORMALIZER0. Please try with my input file.

Code:

INFO  [main] - ***  CloverETL framework/transformation graph, (c) 2002-2011 Javlin a.s, released under GNU Lesser General Public License  ***
INFO  [main] - Running with CloverETL library version 3.1.0 build#17 compiled 16/06/2011 16:06:35
INFO  [main] - Running on 2 CPU(s), OS Windows XP, architecture x86, Java version 1.6.0_21, max available memory for JVM 253440 KB
INFO  [main] - Loading default properties from: defaultProperties
INFO  [main] - Graph definition file: graph/Parallizing.grf
INFO  [main] - Graph revision: 1.66 Modified by: informatyk Modified: Wed Jul 13 13:23:27 CEST 2011
INFO  [main] - Checking graph configuration...
INFO  [main] - Graph configuration is valid.
INFO  [main] - Graph initialization (Parallizing)
INFO  [main] - [Clover] Initializing phase: 0
INFO  [main] - Compiling dynamic class FormatInput...
INFO  [main] - Dynamic class FormatInput successfully compiled and instantiated.
INFO  [main] - [Clover] phase: 0 initialized successfully.
INFO  [main] - register MBean with name:org.jetel.graph.runtime:type=CLOVERJMX_1309427766212_0
INFO  [WatchDog] - Starting up all nodes in phase [0]
INFO  [WatchDog] - Successfully started all nodes in phase!
ERROR [WatchDog] - Graph execution finished with error
ERROR [WatchDog] - Node DENORMALIZER0 finished with status: Error occurred in nested transformation: ERROR caused by: Message: Denormalization failed! caused by: java.lang.RuntimeException: Exception when converting the field value: g zażółć gęślą jaźń a koń pędź (field name: 'field_B') to ISO-8859-1. (original cause: Input length = 1)

Record: #0|field_A|S->7
#1|field_B|S->g zażółć gęślą jaźń a koń pędź
#2|key|i->0

ERROR [WatchDog] - Node DENORMALIZER0 error details:
org.jetel.exception.TransformException: Message: Denormalization failed! caused by: java.lang.RuntimeException: Exception when converting the field value: g zażółć gęślą jaźń a koń pędź (field name: 'field_B') to ISO-8859-1. (original cause: Input length = 1)

Record: #0|field_A|S->7
#1|field_B|S->g zażółć gęślą jaźń a koń pędź
#2|key|i->0

   at org.jetel.component.denormalize.DataRecordDenormalize.appendOnError(DataRecordDenormalize.java:54)
   at org.jetel.component.Denormalizer.processInput(Denormalizer.java:381)
   at org.jetel.component.Denormalizer.execute(Denormalizer.java:452)
   at org.jetel.graph.Node.run(Node.java:425)
   at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Exception when converting the field value: g zażółć gęślą jaźń a koń pędź (field name: 'field_B') to ISO-8859-1. (original cause: Input length = 1)

Record: #0|field_A|S->7
#1|field_B|S->g zażółć gęślą jaźń a koń pędź
#2|key|i->0

   at org.jetel.data.formatter.DataFormatter.write(DataFormatter.java:263)
   at FormatInput.append(FormatInput.java from JavaSourceFileObject:57)
   at org.jetel.component.Denormalizer.processInput(Denormalizer.java:379)
   ... 3 more
Caused by: java.nio.charset.UnmappableCharacterException: Input length = 1
   at java.nio.charset.CoderResult.throwException(Unknown Source)
   at java.nio.charset.CharsetEncoder.encode(Unknown Source)
   at org.jetel.data.DataField.toByteBuffer(DataField.java:278)
   at org.jetel.data.formatter.DataFormatter.write(DataFormatter.java:228)
   ... 5 more
INFO  [WatchDog] - [Clover] Post-execute phase finalization: 0
INFO  [WatchDog] - [Clover] phase: 0 post-execute finalization successfully.
INFO  [WatchDog] - Execution of phase [0] finished with error - elapsed time(sec): 0
ERROR [WatchDog] - !!! Phase finished with error - stopping graph run !!!
INFO  [WatchDog] - -----------------------** Summary of Phases execution **---------------------
INFO  [WatchDog] - Phase#            Finished Status         RunTime(sec)    MemoryAllocation(KB)
INFO  [WatchDog] - 0                 ERROR                              0             15867
INFO  [WatchDog] - ------------------------------** End of Summary **---------------------------
INFO  [WatchDog] - WatchDog thread finished - total execution time: 5 (sec)
INFO  [main] - Freeing graph resources.
ERROR [main] - Execution of graph failed !
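
The innermost cause in the trace, java.nio.charset.UnmappableCharacterException with "Input length = 1", can be reproduced with the JDK alone: a CharsetEncoder for ISO-8859-1 reports an error as soon as it meets a character such as ż that the charset cannot represent. The sketch below is independent of CloverETL; the class name EncodeCheck and the method tryEncode are invented for the illustration.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

class EncodeCheck {

    /**
     * Returns null if `text` can be encoded in `charsetName`,
     * otherwise the exception's simple class name and message.
     */
    static String tryEncode(String text, String charsetName) {
        try {
            // A fresh CharsetEncoder reports unmappable characters instead of replacing them.
            Charset.forName(charsetName).newEncoder().encode(CharBuffer.wrap(text));
            return null;
        } catch (CharacterCodingException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // "g zażółć" written with Unicode escapes so the source file encoding does not matter.
        String value = "g za\u017c\u00f3\u0142\u0107";
        System.out.println("ISO-8859-1 -> " + tryEncode(value, "ISO-8859-1"));
        System.out.println("UTF-8      -> " + tryEncode(value, "UTF-8"));
    }
}
```

The first call fails on ż, the second succeeds and returns null, which matches the observation that the graph works once the Polish characters are removed.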

Re: Parallizing data in a flat file

Posted: Wed Jul 13, 2011 12:59 pm
by avackova
Yes, you're right. Change line 22 of the transformation to:

Code:

   DataFormatter formatter = new DataFormatter("UTF-8");

Re: Parallizing data in a flat file

Posted: Wed Jul 13, 2011 1:47 pm
by blekota74
Now I have no errors, but the result file looks as if it were UTF-8 encoded twice. When I set the encoding to UTF-8 in my editor I see:
1;a;4;d;7;g zażółć gęślą jaźń a koń pędź
2;b;5;e;;
3;c;6;f;;

To me the text looks as if I were not using UTF-8 encoding.

But when I copied the text above into a text editor (without UTF-8 encoding), saved it and viewed it as UTF-8, it was OK.
1;a;4;d;7;g zażółć gęślą jaźń a koń pędź

Re: Parallizing data in a flat file

Posted: Wed Jul 13, 2011 2:31 pm
by avackova
Do you have the same charset everywhere? Attached graph works for me.

Re: Parallizing data in a flat file

Posted: Thu Jul 14, 2011 10:10 am
by blekota74
I still get wrong results when I execute your graph.

My input file (ANSI Windows, code page 1250; when I switch the encoding to UTF-8 the content is displayed properly):
field_A;field_B
1;a
2;b
3;c
4;d
5;e
6;f
7;g zażółć gęślą jaźń a koń pędź


output:
1;a
;4;d
;7;g zażółć gęślą jaźń a koń pędź
2;b
;5;e
;;
3;c
;6;f
;;


To me the problem is in the DENORMALIZER: the input is correct (I can see all the characters properly in debug mode) but the output is wrong.

Re: Parallizing data in a flat file

Posted: Thu Jul 14, 2011 12:05 pm
by avackova
I think I've found where the problem is: in the Denormalizer, the charset used to format the data must be the same as the charset used to convert the bytes back to a string before sending it to the next Writer (it doesn't matter what charset is set on the Reader or Writer). Alternatively, we can send the data as bytes. The first solution means that the charset used with DataFormatter (line 22: DataFormatter formatter = new DataFormatter("UTF-8");) needs to be the same as the charset used for converting the ByteArrayOutputStream to a string (line 75: value = output.toString("UTF-8");).
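
The effect of mismatched charsets can be seen without CloverETL. In this sketch, String.getBytes stands in for the DataFormatter writing bytes, and ByteArrayOutputStream.toString(charsetName) is the same JDK call quoted for line 75. The class name CharsetRoundTrip and the method roundTrip are hypothetical and are not part of the attached transformation.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

class CharsetRoundTrip {

    /** Encodes `text` with `writeCharset` into a byte buffer, then decodes the buffer with `readCharset`. */
    static String roundTrip(String text, String writeCharset, String readCharset) {
        try {
            // Stand-in for the formatter: turn the text into bytes using writeCharset.
            byte[] bytes = text.getBytes(writeCharset);
            ByteArrayOutputStream output = new ByteArrayOutputStream();
            output.write(bytes, 0, bytes.length);
            // Convert the buffered bytes back to a string using readCharset.
            return output.toString(readCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset name", e);
        }
    }

    public static void main(String[] args) {
        // "zażółć" written with Unicode escapes so the source file encoding does not matter.
        String value = "za\u017c\u00f3\u0142\u0107";

        // Same charset on both sides: the text survives unchanged.
        System.out.println(value.equals(roundTrip(value, "UTF-8", "UTF-8")));

        // Different charsets: every UTF-8 byte becomes its own character,
        // so a later UTF-8 Writer would encode the text a second time.
        System.out.println(value.equals(roundTrip(value, "UTF-8", "ISO-8859-1")));
    }
}
```

The first comparison prints true and the second prints false, which is the "double UTF-8" symptom described earlier in the thread.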

Re: Parallizing data in a flat file

Posted: Thu Jul 14, 2011 1:22 pm
by blekota74
Now it is OK, :)
Dziękuję (thank you).