Heya,
I provided some tracking log snippets below. I don’t know if this is the sort of information you were looking for me to provide, so please let me know if something else would be better. I think I’ve made some progress.
I turned tracking on and found some interesting information. The clogs look to be our nodes - they really kill the number of records per second. I did the following runs:
1. With INPUT_PARSER_0 + TRANSFORM_0 + REMOVE_ROW_NUM_OUTPUT_0 ~4 hours to complete
2. With INPUT_PARSER_0 removed ~Didn’t let it complete since the numbers looked like they’d take a long time
3. With TRANSFORM_0 removed ~Didn’t let it complete since the numbers looked like they’d take a long time
4. With INPUT_PARSER_0 and TRANSFORM_0 removed. ~11 minutes to complete
5. With INPUT_PARSER_0 and TRANSFORM_0 removed and REMOVE_ROW_NUM_OUTPUT_0 replaced with a SimpleCopy node ~ 6 minutes
Mind, I had to alter the FMT’s a bit for 1-5, but each one would have produced a 5.4G file and getting the exact file isn’t my goal (yet). Since (5) is all Clover nodes and (4) was the simplest with our code, I’ve concentrated on it. What REMOVE_ROW_NUM_OUTPUT_0 does is take an all String “source” that has a raw row number and converts it to a target with the correct field types (string, numeric, etc) and strips out the raw source number.
In our transform() method, we had:
for (String fieldName : targetFieldNames) {
Object value = TransformUtils.getDataFieldValue(fieldName, _source[0],false);
TransformUtils.setDataFieldValue(value, fieldName, _target[0], false, null);
}
where the TransformUtils.getDataFieldValue() and TransformUtils.setDataFieldValue() are wrappers around GetVal.java and SetVal.java. In this case, I knew that the fields in the “source” and “target” were in the same order (with the source having one extra initial field), so when I changed it to use the field number instead of the field name:
for (int fieldNum =0;fieldNum < targetFieldNames.length; fieldNum++) {
Object value = TransformUtils.getDataFieldValue(fieldNum+1, _source[0],false);
TransformUtils.setDataFieldValue(value, fieldNum, _target[0], false,null);
}
I now get near the results of (5), appximately 7 minutes! I know there are a LOT of places in our code where we get fields by their name instead of by field position. In the Clover code, it looks like it accesses a HashMap every time this happens whereas getting it by position can just return the field (In fact, there is is note in GetVal.java that says “methods which takes field’s position within record as input parameter are much faster than those accepting field’s name”). For our TransformUtils.getDataFieldValue() and TransformUtils.setDataFieldValue() it was probably calling the get on that HashMap 6 * 1,492 fields * 342514 rows - not good.
So, I am concentrating on that. I did have one question about this approach - I use GetVal.java and SetVal.java becausse it looks like you cannot do things like:
_target[0].getField(fieldNum).setValue(_source[0].getField(fieldNum+1).getValue());
_target[0].getField(fieldNum).setValue(_source[0].getField(fieldNum+1));
if one of the source DataField is String and target DataField is numeric - even if the “string” value could be converted to numeric. Is this true? Is there a better way for me to set the value across if I am not modifying it?
It will take me a while to change all the areas in our code where we look up fields via their name, but I will post my results if it clears up our performance issue.
Thanks,
Anna
1. With INPUT_PARSER_0 + TRANSFORM_0 ~4 hours
[2013-07-02 09:28:59,442] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 09:28:59,443] INFO - Time: 02/07/13 09:28:59
[2013-07-02 09:28:59,443] INFO - Node ID Port #Records #KB Rec/s KB/s
[2013-07-02 09:28:59,443] INFO - ---------------------------------------------------------------------------------
[2013-07-02 09:28:59,443] INFO - INPUT_0 RUNNING
[2013-07-02 09:28:59,443] INFO - %cpu:0.02 Out:0 886 10406 21 248
[2013-07-02 09:28:59,443] INFO - INPUT_PARSER_0 RUNNING
[2013-07-02 09:28:59,443] INFO - %cpu:0.96 In:0 886 10406 21 248
[2013-07-02 09:28:59,443] INFO - Out:0 856 26191 21 654
[2013-07-02 09:28:59,443] INFO - TRANSFORM_0 RUNNING
[2013-07-02 09:28:59,443] INFO - %cpu:0.27 In:0 856 26191 21 654
[2013-07-02 09:28:59,443] INFO - Out:0 848 25946 20 636
[2013-07-02 09:28:59,443] INFO - REMOVE_ROW_NUM_OUTPUT_0 RUNNING
[2013-07-02 09:28:59,443] INFO - %cpu:0.06 In:0 848 25946 20 636
[2013-07-02 09:28:59,443] INFO - Out:0 840 25695 20 636
[2013-07-02 09:28:59,443] INFO - OUTPUT_0 RUNNING
[2013-07-02 09:28:59,443] INFO - %cpu:0.02 In:0 840 25695 20 636
[2013-07-02 09:28:59,443] INFO - ---------------------------------** End of Log **--------------------------------
2. With INPUT_PARSER_0 removed
[2013-07-02 13:11:32,118] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 13:11:32,118] INFO - Time: 02/07/13 13:11:32
[2013-07-02 13:11:32,118] INFO - Node ID Port #Records #KB Rec/s KB/s
[2013-07-02 13:11:32,118] INFO - ---------------------------------------------------------------------------------
[2013-07-02 13:11:32,118] INFO - INPUT_0 RUNNING
[2013-07-02 13:11:32,118] INFO - %cpu:0.08 Out:0 4887 59942 93 1140
[2013-07-02 13:11:32,118] INFO - TRANSFORM_0 RUNNING
[2013-07-02 13:11:32,118] INFO - %cpu:0.98 In:0 4887 59942 93 1140
[2013-07-02 13:11:32,118] INFO - Out:0 4851 148429 93 2851
[2013-07-02 13:11:32,118] INFO - REMOVE_ROW_NUM_OUTPUT_0 RUNNING
[2013-07-02 13:11:32,118] INFO - %cpu:0.27 In:0 4851 148429 93 2851
[2013-07-02 13:11:32,118] INFO - Out:0 4848 148299 94 2880
[2013-07-02 13:11:32,118] INFO - OUTPUT_0 RUNNING
[2013-07-02 13:11:32,118] INFO - %cpu:0.11 In:0 4848 148299 94 2880
[2013-07-02 13:11:32,119] INFO - ---------------------------------** End of Log **--------------------------------
3. With TRANSFORM_0 removed
[2013-07-02 13:30:16,959] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 13:30:16,959] INFO - Time: 02/07/13 13:30:16
[2013-07-02 13:30:16,960] INFO - Node ID Port #Records #KB Rec/s KB/s
[2013-07-02 13:30:16,960] INFO - ---------------------------------------------------------------------------------
[2013-07-02 13:30:16,960] INFO - INPUT_0 RUNNING
[2013-07-02 13:30:16,960] INFO - %cpu:0.02 Out:0 737 8661 20 246
[2013-07-02 13:30:16,960] INFO - INPUT_PARSER_0 RUNNING
[2013-07-02 13:30:16,960] INFO - %cpu:0.96 In:0 737 8661 20 246
[2013-07-02 13:30:16,960] INFO - Out:0 712 21785 21 673
[2013-07-02 13:30:16,960] INFO - REMOVE_ROW_NUM_OUTPUT_0 RUNNING
[2013-07-02 13:30:16,960] INFO - %cpu:0.06 In:0 712 21785 21 673
[2013-07-02 13:30:16,960] INFO - Out:0 704 21535 20 636
[2013-07-02 13:30:16,960] INFO - OUTPUT_0 RUNNING
[2013-07-02 13:30:16,960] INFO - %cpu:0.02 In:0 704 21535 20 636
[2013-07-02 13:30:16,960] INFO - ---------------------------------** End of Log **--------------------------------
[2013-07-02 13:30:21,960] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
4. With INPUT_PARSER_0 and TRANSFORM_0 removed. ~11 minutes to complete
[2013-07-02 13:37:23,799] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 13:37:23,799] INFO - Time: 02/07/13 13:37:23
[2013-07-02 13:37:23,799] INFO - Node ID Port #Records #KB Rec/s KB/s
[2013-07-02 13:37:23,799] INFO - ---------------------------------------------------------------------------------
[2013-07-02 13:37:23,799] INFO - INPUT_0 RUNNING
[2013-07-02 13:37:23,799] INFO - %cpu:0.40 Out:0 4536 55664 458 5806
[2013-07-02 13:37:23,799] INFO - REMOVE_ROW_NUM_OUTPUT_0 RUNNING
[2013-07-02 13:37:23,799] INFO - %cpu:0.94 In:0 4536 55664 458 5806
[2013-07-02 13:37:23,799] INFO - Out:0 4508 38087 458 4049
[2013-07-02 13:37:23,799] INFO - OUTPUT_0 RUNNING
[2013-07-02 13:37:23,799] INFO - %cpu:0.44 In:0 4508 38087 458 4049
[2013-07-02 13:37:23,799] INFO - ---------------------------------** End of Log **--------------------------------
5. With INPUT_PARSER_0 and TRANSFORM_0 removed REMOVE_ROW_NUM_OUTPUT_0 replaced with a SimpleCopy node ~ 6 minutes
[2013-07-02 13:56:03,127] INFO - ---------------------** Start of tracking Log for phase [0] **-------------------
[2013-07-02 13:56:03,128] INFO - Time: 02/07/13 13:56:03
[2013-07-02 13:56:03,128] INFO - Node ID Port #Records #KB Rec/s KB/s
[2013-07-02 13:56:03,128] INFO - ---------------------------------------------------------------------------------
[2013-07-02 13:56:03,128] INFO - INPUT_0 RUNNING
[2013-07-02 13:56:03,128] INFO - %cpu:0.85 Out:0 178885 2206163 857 11003
[2013-07-02 13:56:03,128] INFO - COPY_0 RUNNING
[2013-07-02 13:56:03,128] INFO - %cpu:.. In:0 178885 2206163 857 11003
[2013-07-02 13:56:03,128] INFO - Out:0 178870 2205969 858 11008
[2013-07-02 13:56:03,128] INFO - OUTPUT_0 RUNNING
[2013-07-02 13:56:03,128] INFO - %cpu:0.61 In:0 178870 2205969 858 11008
[2013-07-02 13:56:03,128] INFO - ---------------------------------** End of Log **--------------------------------