I have a .grf file that parses string-delimited info.
In my delimited info I have 3 dates:
1. Start date.
2. Middle date.
3. End date.
My graph calculates end date - start date in one Reformat object, and end date - middle date in another Reformat object.
The 2 Reformat objects have the exact same implementation, except that one uses the start date and the other uses the middle date.
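To make the setup concrete, here is a minimal standalone Java sketch of the kind of date arithmetic the two Reformat objects perform. The class name, date pattern and sample values are assumptions for illustration only, not the actual transform code from the graph:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Standalone sketch of the "end - start" / "end - middle" date arithmetic.
// The date pattern and sample values are assumptions; only the subtraction
// idea itself comes from the description above.
public class DateDiffSketch {

    private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyy-MM-dd");

    // Difference between two date strings in whole days (ignores DST edge cases).
    static long diffInDays(String from, String to) throws ParseException {
        Date fromDate = FMT.parse(from);
        Date toDate = FMT.parse(to);
        return (toDate.getTime() - fromDate.getTime()) / (24L * 60 * 60 * 1000);
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(diffInDays("2012-01-01", "2012-03-01")); // end - start
        System.out.println(diffInDays("2012-02-01", "2012-03-01")); // end - middle
    }
}
```

Each Reformat runs the same kind of computation, just against a different pair of fields.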
Everything works, except that I have a major performance issue:
With a constant size of input data, on the same machine, I ran 3 modifications of the graph:
1. Disabled the end date - middle date calculation - the run took 7.5 minutes.
2. Disabled the end date - start date calculation - the run took 7.5 minutes.
3. Enabled both of them - the run took 22 minutes.
I can't figure out why enabling the 2 calculations causes such a dramatic drop in performance (going from 7.5 to 22 minutes is roughly 3 times slower - that's a lot), especially when running each one of them separately gives me the same result. That suggests there is no problem in my implementation, but rather some kind of problem with the CloverETL calculations when both of them run.
As for the results - they are fine on all runs.
My only problem is the performance.
Any help will be greatly appreciated.
Thanks
Hi, odedbobi,
Can you post your graph and sample input data? It would be very helpful in solving your issue, as would any information about the CloverETL Designer (and Server, if present) version, Java version and vendor, memory settings of Designer (Server), OS version, components used, number of input records, etc. Thank you in advance.
Best regards,
Hi.
Here is some more info:
1. OS - Red Hat Enterprise Linux Server release 5.7 (Tikanga) - 64-bit
2. Java version - java version "1.4.2", gij (GNU libgcj) version 4.1.2 20080704 (Red Hat 4.1.2-51)
3. CloverETL version - how do I check it?
I did make some progress in my investigation (and I must say that debugging CloverETL and understanding what it does is very difficult):
I have attached 2 graphs and a sample data file:
1. Glr.grf - this is my full graph, where the run time is 19 minutes.
2. partGrf.grf - this graph is the same as Glr.grf, with a small modification: to one of the filters I added 'and false', which causes one of the aggregations not to function. On this run, with the same data, I get 7.5 minutes.
3. qna-glr-1… - a sample data file. Of course, the 7.5- and 19-minute runs use a much larger data input.
The strange behavior I see is that it does not matter which aggregation I disable (by adding the 'false and …' term to a filter) - I get the 7.5-minute run. Furthermore, on one of the aggregations I added a filter before and after the aggregation, with the after-filter set to false. When the aggregation received data - slow run; when the data was blocked in the before-filter - 7.5-minute run.
So I figure I have passed some kind of aggregation limit.
I have tried workarounds, but nothing seems to help so far.
Thanks.
Hello again,
Thanks for the info. You can find out the version of CloverETL Designer by clicking on Help → About CloverETL Designer.
I have a few notes:
1. We support only Java versions 6 and 7. Version 1.4 is too old and is also a possible source of the slow processing.
2. I have noticed that you have your transformations written in Java. This is very hard to support, especially without the source code of the transformations. You may, for example, import something inappropriate which slows the whole graph run down.
3. You can divide your graph into phases, e.g. one phase per graph branch. This way you can see in the graph run log which phase took most of the processing time, which may help you locate the source of your issue. Then you can send me a simplified graph with just the problematic part of the original graph.
Best regards,
A general performance-related piece of advice - DO NOT use GNU Java - the "gij (GNU libgcj)" - regardless of the version. It is just painfully slow. Use the Oracle/Sun JVM, which is properly tuned. In the worst case, use IBM Java, which also has its issues but is still a better choice.
Thank you guys for the advice.
But this graph is running in production environments for huge clients.
Changing the Java version, the CloverETL version, or any other component is out of the question.
Plus, you are forgetting that the performance issue got worse once I added some more aggregations and DB writing.
Now I have broken the problem down to the following amazing fact:
My graph computes 19 unique counts for a parameter.
Once I remove one of them, no matter which one, my run time is half of what it is when working with all 19 unique counts.
Now, I suppose this has to do with memory or something like that.
I can't remove any one of my unique counts, but I can separate the graph into 2 different graphs.
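To illustrate why memory is the likely suspect: each "count unique" aggregation has to remember every distinct value it has seen per group key before it can report a count. Below is a rough standalone Java sketch of that idea - it is not CloverETL's actual implementation, just an assumption about the general mechanism. With 19 such counters, the graph holds 19 sets of distinct values per key at once, and crossing the point where they no longer fit comfortably in the heap can make a run suddenly much slower:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Rough sketch of a "count unique values per group" aggregation.
// Not CloverETL code: it only illustrates that every unique count must keep
// a set of all distinct values seen so far, so memory grows with the number
// of counters and the number of distinct values.
public class UniqueCountSketch {

    // group key -> distinct values seen so far (one such map per unique count)
    private final Map<String, Set<String>> distinctValues =
            new HashMap<String, Set<String>>();

    void add(String groupKey, String value) {
        Set<String> seen = distinctValues.get(groupKey);
        if (seen == null) {
            seen = new HashSet<String>();
            distinctValues.put(groupKey, seen);
        }
        seen.add(value);
    }

    int uniqueCount(String groupKey) {
        Set<String> seen = distinctValues.get(groupKey);
        return seen == null ? 0 : seen.size();
    }
}
```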
Thanks.