Large, Wide File Performance hints?

Heya,

Sorry to ask yet another performance related question, but I was wondering if you had any hints on what I should explore.

I have a 5.4G fixed-width file with 1,492 fields and a record length of 16,867 characters. The graph is:


INPUT --> INPUT_PARSER--> TRANSFORM --> REMOVE_ROW_NUM_OUTPUT --> OUTPUT

The INPUT_PARSER node narrows the source fields if some are not in use (this one uses all of them) and records the raw source row number. The TRANSFORM node runs our Java code to carry over the fields and, for this test, just blank out one field. The REMOVE_ROW_NUM_OUTPUT node just removes the raw row number before the record is sent to the output file.

Using our third-party ETL tool to create a similar process, it takes about 3 minutes to produce the output file. On the box where we are running the Clover Engine, it’s taking about 4 hours with both Clover v3.0.1 and Clover v3.4.0. I’ve been told that these two servers are identical, so it shouldn’t be an OS/hardware difference. I know the third-party tool is very aggressive about resources, but I wouldn’t think that Clover would be out-performed *that* badly. I figure there must be some settings holding something back. I can see it working hard to write the file to disk, but it runs very slowly.

We allow the JVM to have 1G of memory - I’m going to try to increase that to the max (1.5G on 32-bit Java) and I’ve tried playing with some of the memory settings in defaultProperties. I’m also going to try removing that INPUT_PARSER to see if it’s our code being the bottleneck. We’ve also been running tests with a smaller input file that seem to indicate the performance gets better when there are fewer fields in the input file. Is there something in WatchDog or whatnot that I could look at to see where it is slowing down?
I posted the actual node/edge section of the graph below…

Thanks,
Anna


<Phase number="0">
<Node id="INPUT_0" type="DATA_READER" fileURL="/xxx/HI003_source.cv" dataPolicy="Controlled" skipLeadingBlanks="false" trim="false"/>
<Node id="INPUT_PARSER_0" type="REFORMAT" transformClass="com.facorelogic.core.etl.transform.NarrowTrueFixedWidthInputData" fileFormatCode="W" constantMap="ASSESSED_YEAR=;TAX_YEAR=;" includeInOrphanRowReport="true" transformFileURL="/xxx/FILE_HI003_source_cv_INPUT_PARSER.tfm" fileName="HI003_source.cv" />

<Edge id="INPUT_PARSER_EDGE_0" fromNode="INPUT_0:0" toNode="INPUT_PARSER_0:0" metadata="INPUT_PARSER_0"/>

<Node id="TRANSFORM_0" type="REFORMAT" transformClass="com.facorelogic.core.etl.transform.TransformRules" constantMap="ASSESSED_YEAR=;TAX_YEAR=;" includeInOrphanRowReport="true" transformFileURL="/xxx/FILE_HI003_cv_OUTPUT.tfm" />
<Node id="REMOVE_ROW_NUM_OUTPUT_0" type="REFORMAT" transformClass="com.facorelogic.core.etl.transform.CarryOverNarrowTransform" />
<Node id="OUTPUT_0" type="FIXLEN_DATA_WRITER" fileURL="/xxx/HI003.cv" fieldFiller=" " leftAlign="false" />


<Edge id="TRANSFORM_EDGE_0" fromNode="INPUT_PARSER_0:0" toNode="TRANSFORM_0:0" metadata="INPUT_PARSER_METADATA_0"/>
<Edge id="IN_SORT_EDGE_1" fromNode="TRANSFORM_0:0" toNode="REMOVE_ROW_NUM_OUTPUT_0:0" metadata="OUTPUT_METADATA_0"/>
<Edge id="OUTPUT_EDGE_0" fromNode="REMOVE_ROW_NUM_OUTPUT_0:0" toNode="OUTPUT_0:0" metadata="OUTPUT_REMOVE_ROW_NUM_METADATA_0"/>
</Phase>

Hi, Anna,

Thank you for your interesting question. Could you also post a graph run log, please? I hope it will show us the bottleneck component(s).

Best regards,

Heya,

I provided some tracking log snippets below. I don’t know if this is the sort of information you were looking for me to provide, so please let me know if something else would be better. I think I’ve made some progress. :)

I turned tracking on and found some interesting information. The bottlenecks look to be our nodes - they really kill the number of records per second. I did the following runs:

1. With INPUT_PARSER_0 + TRANSFORM_0 + REMOVE_ROW_NUM_OUTPUT_0 ~ 4 hours to complete
2. With INPUT_PARSER_0 removed ~ didn’t let it complete since the numbers looked like it would take a long time
3. With TRANSFORM_0 removed ~ didn’t let it complete since the numbers looked like it would take a long time
4. With INPUT_PARSER_0 and TRANSFORM_0 removed ~ 11 minutes to complete
5. With INPUT_PARSER_0 and TRANSFORM_0 removed and REMOVE_ROW_NUM_OUTPUT_0 replaced with a SimpleCopy node ~ 6 minutes

Mind, I had to alter the FMTs a bit for runs 1-5, but each one would have produced a 5.4G file, and getting the exact file isn’t my goal (yet). Since (5) is all Clover nodes and (4) was the simplest run with our code, I’ve concentrated on (4). What REMOVE_ROW_NUM_OUTPUT_0 does is take an all-String “source” that has a raw row number, convert it to a target with the correct field types (string, numeric, etc.), and strip out the raw source row number.

In our transform() method, we had:


for (String fieldName : targetFieldNames) {
    Object value = TransformUtils.getDataFieldValue(fieldName, _source[0], false);
    TransformUtils.setDataFieldValue(value, fieldName, _target[0], false, null);
}

where TransformUtils.getDataFieldValue() and TransformUtils.setDataFieldValue() are wrappers around GetVal.java and SetVal.java. In this case, I knew that the fields in the “source” and “target” were in the same order (with the source having one extra initial field), so I changed it to use the field number instead of the field name:


for (int fieldNum = 0; fieldNum < targetFieldNames.length; fieldNum++) {
    Object value = TransformUtils.getDataFieldValue(fieldNum + 1, _source[0], false);
    TransformUtils.setDataFieldValue(value, fieldNum, _target[0], false, null);
}

I now get near the results of (5), approximately 7 minutes! I know there are a LOT of places in our code where we get fields by their name instead of by field position. In the Clover code, it looks like it accesses a HashMap every time this happens, whereas getting a field by position can just return the field directly (in fact, there is a note in GetVal.java that says “methods which takes field’s position within record as input parameter are much faster than those accepting field’s name”). For our TransformUtils.getDataFieldValue() and TransformUtils.setDataFieldValue(), that probably meant a get on that HashMap 6 * 1,492 fields * 342,514 rows - not good.
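For anyone hitting the same wall: the fix generalizes even when source and target fields are not in the same order. The pattern can be sketched in plain Java (this is an illustration with hypothetical names, not the CloverETL API): do the name-to-position lookups once up front, cache the int indices, and then the per-record loop only does array indexing.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the index-caching pattern (plain Java, not the CloverETL API):
// resolve each field name to its position once, then use the cached int index
// inside the per-record loop instead of a HashMap lookup per field per row.
public class FieldIndexCache {
    private final int[] targetToSourceIndex;

    public FieldIndexCache(String[] sourceFieldNames, String[] targetFieldNames) {
        // One-time name -> position map for the source record layout.
        Map<String, Integer> sourcePos = new HashMap<>();
        for (int i = 0; i < sourceFieldNames.length; i++) {
            sourcePos.put(sourceFieldNames[i], i);
        }
        // Cache: for each target field, the matching source field's index.
        targetToSourceIndex = new int[targetFieldNames.length];
        for (int i = 0; i < targetFieldNames.length; i++) {
            targetToSourceIndex[i] = sourcePos.get(targetFieldNames[i]);
        }
    }

    // Per-record copy now does only array indexing - no map access at all.
    public void copy(Object[] source, Object[] target) {
        for (int i = 0; i < target.length; i++) {
            target[i] = source[targetToSourceIndex[i]];
        }
    }
}
```

The map lookup cost is paid once per field at init time instead of once per field per row, which is exactly the 6 * 1,492 * 342,514 factor above.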

So, I am concentrating on that. I did have one question about this approach - I use GetVal.java and SetVal.java because it looks like you cannot do things like:


_target[0].getField(fieldNum).setValue(_source[0].getField(fieldNum+1).getValue());
_target[0].getField(fieldNum).setValue(_source[0].getField(fieldNum+1));

if the source DataField is a String and the target DataField is numeric - even if the “string” value could be converted to a numeric. Is this true? Is there a better way for me to set the value across if I am not modifying it?

It will take me a while to change all the areas in our code where we look up fields via their name, but I will post my results if it clears up our performance issue.

Thanks,
Anna



1.  With INPUT_PARSER_0 + TRANSFORM_0 + REMOVE_ROW_NUM_OUTPUT_0  ~4 hours
[2013-07-02 09:28:59,442] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 09:28:59,443] INFO - Time: 02/07/13 09:28:59
[2013-07-02 09:28:59,443] INFO - Node                   ID         Port      #Records         #KB  Rec/s    KB/s
[2013-07-02 09:28:59,443] INFO - ---------------------------------------------------------------------------------
[2013-07-02 09:28:59,443] INFO -                        INPUT_0                                          RUNNING
[2013-07-02 09:28:59,443] INFO -  %cpu:0.02                        Out:0          886       10406     21     248
[2013-07-02 09:28:59,443] INFO -                        INPUT_PARSER_0                                   RUNNING
[2013-07-02 09:28:59,443] INFO -  %cpu:0.96                         In:0          886       10406     21     248
[2013-07-02 09:28:59,443] INFO -                                   Out:0          856       26191     21     654
[2013-07-02 09:28:59,443] INFO -                        TRANSFORM_0                                      RUNNING
[2013-07-02 09:28:59,443] INFO -  %cpu:0.27                         In:0          856       26191     21     654
[2013-07-02 09:28:59,443] INFO -                                   Out:0          848       25946     20     636
[2013-07-02 09:28:59,443] INFO -                        REMOVE_ROW_NUM_OUTPUT_0                          RUNNING
[2013-07-02 09:28:59,443] INFO -  %cpu:0.06                         In:0          848       25946     20     636
[2013-07-02 09:28:59,443] INFO -                                   Out:0          840       25695     20     636
[2013-07-02 09:28:59,443] INFO -                        OUTPUT_0                                         RUNNING
[2013-07-02 09:28:59,443] INFO -  %cpu:0.02                         In:0          840       25695     20     636
[2013-07-02 09:28:59,443] INFO - ---------------------------------** End of Log **--------------------------------


2.  With INPUT_PARSER_0 removed

[2013-07-02 13:11:32,118] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 13:11:32,118] INFO - Time: 02/07/13 13:11:32
[2013-07-02 13:11:32,118] INFO - Node                   ID         Port      #Records         #KB  Rec/s    KB/s
[2013-07-02 13:11:32,118] INFO - ---------------------------------------------------------------------------------
[2013-07-02 13:11:32,118] INFO -                        INPUT_0                                          RUNNING
[2013-07-02 13:11:32,118] INFO -  %cpu:0.08                        Out:0         4887       59942     93    1140
[2013-07-02 13:11:32,118] INFO -                        TRANSFORM_0                                      RUNNING
[2013-07-02 13:11:32,118] INFO -  %cpu:0.98                         In:0         4887       59942     93    1140
[2013-07-02 13:11:32,118] INFO -                                   Out:0         4851      148429     93    2851
[2013-07-02 13:11:32,118] INFO -                        REMOVE_ROW_NUM_OUTPUT_0                          RUNNING
[2013-07-02 13:11:32,118] INFO -  %cpu:0.27                         In:0         4851      148429     93    2851
[2013-07-02 13:11:32,118] INFO -                                   Out:0         4848      148299     94    2880
[2013-07-02 13:11:32,118] INFO -                        OUTPUT_0                                         RUNNING
[2013-07-02 13:11:32,118] INFO -  %cpu:0.11                         In:0         4848      148299     94    2880
[2013-07-02 13:11:32,119] INFO - ---------------------------------** End of Log **--------------------------------


3.  With TRANSFORM_0 removed 


[2013-07-02 13:30:16,959] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 13:30:16,959] INFO - Time: 02/07/13 13:30:16
[2013-07-02 13:30:16,960] INFO - Node                   ID         Port      #Records         #KB  Rec/s    KB/s
[2013-07-02 13:30:16,960] INFO - ---------------------------------------------------------------------------------
[2013-07-02 13:30:16,960] INFO -                        INPUT_0                                          RUNNING
[2013-07-02 13:30:16,960] INFO -  %cpu:0.02                        Out:0          737        8661     20     246
[2013-07-02 13:30:16,960] INFO -                        INPUT_PARSER_0                                   RUNNING
[2013-07-02 13:30:16,960] INFO -  %cpu:0.96                         In:0          737        8661     20     246
[2013-07-02 13:30:16,960] INFO -                                   Out:0          712       21785     21     673
[2013-07-02 13:30:16,960] INFO -                        REMOVE_ROW_NUM_OUTPUT_0                          RUNNING
[2013-07-02 13:30:16,960] INFO -  %cpu:0.06                         In:0          712       21785     21     673
[2013-07-02 13:30:16,960] INFO -                                   Out:0          704       21535     20     636
[2013-07-02 13:30:16,960] INFO -                        OUTPUT_0                                         RUNNING
[2013-07-02 13:30:16,960] INFO -  %cpu:0.02                         In:0          704       21535     20     636
[2013-07-02 13:30:16,960] INFO - ---------------------------------** End of Log **--------------------------------


4.  With INPUT_PARSER_0 and TRANSFORM_0 removed. ~11 minutes to complete

[2013-07-02 13:37:23,799] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 13:37:23,799] INFO - Time: 02/07/13 13:37:23
[2013-07-02 13:37:23,799] INFO - Node                   ID         Port      #Records         #KB  Rec/s    KB/s
[2013-07-02 13:37:23,799] INFO - ---------------------------------------------------------------------------------
[2013-07-02 13:37:23,799] INFO -                        INPUT_0                                          RUNNING
[2013-07-02 13:37:23,799] INFO -  %cpu:0.40                        Out:0         4536       55664    458    5806
[2013-07-02 13:37:23,799] INFO -                        REMOVE_ROW_NUM_OUTPUT_0                          RUNNING
[2013-07-02 13:37:23,799] INFO -  %cpu:0.94                         In:0         4536       55664    458    5806
[2013-07-02 13:37:23,799] INFO -                                   Out:0         4508       38087    458    4049
[2013-07-02 13:37:23,799] INFO -                        OUTPUT_0                                         RUNNING
[2013-07-02 13:37:23,799] INFO -  %cpu:0.44                         In:0         4508       38087    458    4049
[2013-07-02 13:37:23,799] INFO - ---------------------------------** End of Log **--------------------------------

5. With INPUT_PARSER_0 and TRANSFORM_0 removed and REMOVE_ROW_NUM_OUTPUT_0 replaced with a SimpleCopy node ~6 minutes

[2013-07-02 13:56:03,127] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 13:56:03,128] INFO - Time: 02/07/13 13:56:03
[2013-07-02 13:56:03,128] INFO - Node                   ID         Port      #Records         #KB  Rec/s    KB/s
[2013-07-02 13:56:03,128] INFO - ---------------------------------------------------------------------------------
[2013-07-02 13:56:03,128] INFO -                        INPUT_0                                          RUNNING
[2013-07-02 13:56:03,128] INFO -  %cpu:0.85                        Out:0       178885     2206163    857   11003
[2013-07-02 13:56:03,128] INFO -                        COPY_0                                           RUNNING
[2013-07-02 13:56:03,128] INFO -  %cpu:..                           In:0       178885     2206163    857   11003
[2013-07-02 13:56:03,128] INFO -                                   Out:0       178870     2205969    858   11008
[2013-07-02 13:56:03,128] INFO -                        OUTPUT_0                                         RUNNING
[2013-07-02 13:56:03,128] INFO -  %cpu:0.61                         In:0       178870     2205969    858   11008
[2013-07-02 13:56:03,128] INFO - ---------------------------------** End of Log **--------------------------------

Your investigation seems to be right. A lookup in an array is definitely faster than a lookup in a map, so I recommend using an integer index instead of a string name to access a record’s field.

Regarding the SetVal and GetVal classes, I do not see any big advantage in using them. The fastest way to copy a value from one field to another is:

recordOut.getField(i).setValue(recordIn.getField(i))

and a slightly slower way:

recordOut.getField(i).setValue(recordIn.getField(i).getValue())

No magical conversions are performed - that could be a desirable but very dangerous feature. Converting from one type to another usually needs some other input, for example a ‘format’ for a date-to-string conversion.
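In other words, when a String field must land in a numeric field, the parse has to be written out explicitly. A minimal sketch in plain Java (hypothetical helper, not part of the CloverETL API) of what that explicit conversion step looks like, so the failure mode is visible at the call site:

```java
// Sketch of the "do the conversion yourself" approach (plain Java, not the
// CloverETL API): when the source value is a String and the target is numeric,
// parse explicitly instead of expecting setValue() to convert for you.
public class ExplicitConvertCopy {
    // Hypothetical helper: turn a string field value into a numeric value.
    public static Object toNumeric(Object sourceValue) {
        if (sourceValue == null) {
            return null; // propagate nulls unchanged
        }
        String s = sourceValue.toString().trim();
        if (s.isEmpty()) {
            return null; // treat fixed-width blanks as null, by choice
        }
        // A NumberFormatException here surfaces bad data instead of
        // silently coercing it - the "dangerous" part made explicit.
        return Double.valueOf(s);
    }
}
```

The blank-to-null and exception-on-bad-data choices are exactly the kind of extra input a generic magical conversion could not decide for you.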

Martin