Partition: cannot compare DataRecords using equals()?

Hi,

I want to reject records that have the same field values and send them to port #1.

I have a Partition node that has the PartitionClass attribute set to the java class below:

public class DuplicateRowPartitioner implements PartitionFunction {

    private static final Logger logger = Logger.getLogger(DuplicateRowPartitioner.class);
    DataRecord previous;

    /**
     * unique records are sent to output port 0. exact duplicate records are sent
     * to output port 1.
     */
    public int getOutputPort(DataRecord record) {

        if (previous == null) {
            logger.info(new StringBuffer("Comparing record : '").append(
                    record.toString()).append("' to previous record : NULL…").toString());
        }
        else {
            logger.info(new StringBuffer("Comparing record : '").append(
                    record.toString()).append("' to previous record : '").append(
                    previous.toString()).append("'…").toString());
        }

        // send exact duplicate records to a different output port
        if (previous != null && record.equals(previous)) {
            logger.info(new StringBuffer("Found duplicate record…").toString());
            return 1;
        }

        logger.info(new StringBuffer("Found different record…").toString());
        previous = record.duplicate();
        return 0;

    } // end of getOutputPort()

    public void init(int numPartitions, RecordKey partitionKey)
            throws ComponentNotReadyException {
    } // end init()

}

I can see that I have a record whose field values are exactly the same as those in the previous record (see the logger output below), but Clover still treats them as different records?

INFO [PARTITION_0] - Comparing record : '#0|REFERENCE|S->10273
#1|POSITION|S->5
#2|AMOUNT_1|N->0.0
' to previous record : '#0|REFERENCE|S->10273
#1|POSITION|S->5
#2|AMOUNT_1|N->0.0
'…
INFO [PARTITION_0] - Found different record…

Does it mean that I cannot use record.equals(previous) (see Java class above) to check whether they have exact duplicate field values? What should I do instead?

Thanks,
al

Hi,

Does anyone have any thoughts on this issue, or a solution for it?

Thanks,
al

Hello Achan,

I do not understand why you are using the Partition component. If you used ExtSort and Dedup, you would obtain the desired result.

You should paste the ExtSort and Dedup components into your graph and connect the output port of ExtSort to the input port of Dedup with an edge.

After that, you only need to define the Sort key in ExtSort (select all fields from the input metadata) and define the same Dedup key in the Dedup component (the same fields in the same order).

You can propagate metadata through ExtSort.

Then you must connect an edge to output port 0 of Dedup and another edge to its output port 1.

You should propagate metadata through Dedup as well. The metadata of both of these edges will be the same as the metadata on the Dedup input.

By default, for each group of records with the same field values, Dedup sends one record through output port 0 and all the other records of that group through output port 1.

This is the simplest solution.
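
To illustrate the idea outside of Clover, here is a minimal plain-Java sketch of "sort on all fields, then dedup on the same key". It is only a model: the class and method names (SortThenDedupSketch, sortThenDedup, Result) are made up for this example, rows are plain String arrays rather than DataRecords, and it assumes that no field is null.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java model of "sort on all fields, then dedup on the same key". Not Clover code. */
final class SortThenDedupSketch {

    /** Holds the two output streams: port 0 (first of each group) and port 1 (repeats). */
    static final class Result {
        final List<String[]> port0 = new ArrayList<>();
        final List<String[]> port1 = new ArrayList<>();
    }

    /** Compares two rows field by field, in field order. Assumes no null fields. */
    static final Comparator<String[]> ALL_FIELDS = (a, b) -> {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = a[i].compareTo(b[i]);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    static Result sortThenDedup(List<String[]> rows) {
        // The "ExtSort" step: identical rows end up next to each other.
        List<String[]> sorted = new ArrayList<>(rows);
        sorted.sort(ALL_FIELDS);

        // The "Dedup" step: the first row of each group goes to port 0, the rest to port 1.
        Result result = new Result();
        String[] previous = null;
        for (String[] row : sorted) {
            if (previous != null && ALL_FIELDS.compare(previous, row) == 0) {
                result.port1.add(row);
            } else {
                result.port0.add(row);
            }
            previous = row;
        }
        return result;
    }
}
```

Because the sort key and the dedup key are the same fields in the same order, comparing each row with the one directly before it is enough to find every duplicate.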

Best regards,

Tomas Waller

Hi Tomas,

Thanks for your suggestion, I will give it a try…

However, that still does not explain why the equals() method of DataRecord does not work… Please verify that…

Thanks,
al

Hi,
could you please attach the graph, with its input data, metadata etc., so we can test this? Comparing DataRecords should work as you describe.

Thanks!
Jaro

Hi Tomas,

I remember now why I use PARTITION rather than SORT and DEDUP: my data can have up to 2000 fields, so it's not practical to set the sortKey and dedupKey to 2000 fields, right? Also, can Clover handle such a large number of keys for SORT and DEDUP? That is why I resorted to using PARTITION: I don't have to specify the key and can do the comparison with the equals() method of the DataRecord object in the partitionClass. Then I found out that the equals() method did not work as I expected…

Hi Jaro,

The stripped-down version of my graph is INPUT → REFORMAT → SORT → PARTITION → OUTPUT, looks like this:

<?xml version="1.0" encoding="UTF-8"?>

INPUT_METADATA_0 looks like this :

<?xml version="1.0" encoding="UTF-8"?>

INPUT_PARSER_METADATA_0 looks like:

<?xml version="1.0" encoding="UTF-8"?>

The ParseInputData class basically takes the long single-string input and parses it into separate fields, according to the “quoteChar” in the INPUT_PARSER_0 node in my graph and the fieldDelimiter in the INPUT_PARSER_METADATA_0 file (it works like a StringTokenizer)…

my INPUT_0 looks like:

REFERENCE, POSITION, AMOUNT_1
"10272", "1", "100"
"10273", "2", "0"
"10273", "2", "0"
"10274", "3", "200"

I expected data rows 2 and 3 with "10273, 2, 0" to be exact duplicates according to the equals() method of the DataRecord class (see my DuplicateRowPartitioner in the previous posting), but the result is that they are different records?

However, if I reduce my graph to INPUT_PARSER → SORT → PARTITION → OUTPUT, then the equals() method in my Partition class works, resulting in this output:

INFO [PARTITION_0] - Comparing record : '#0|REFERENCE|S->10273
#1|POSITION|S->5
#2|AMOUNT_1|N->0.0
' to previous record : '#0|REFERENCE|S->10273
#1|POSITION|S->5
#2|AMOUNT_1|N->0.0
'…
INFO [PARTITION_0] - Found duplicate record…

I cannot imagine this is an issue with Clover's SORT node, right? I am using the equals() method to compare the previous record with the current record coming into PARTITION_0, so it should not matter whether the input to SORT (and then PARTITION) comes from INPUT or from REFORMAT, right? I am guessing it's the equals() call in PARTITION that somehow does not compare the previous record and the current record (both have the same metadata and data) correctly. Maybe it is a reference pointer issue, since I used record.duplicate() to set the previous record (see my DuplicateRowPartitioner in the previous posting)?

Thank you to both of you for your time and help,
albert :)

I checked your code again, and I still do not fully understand everything you observed. Nonetheless, I have a few suggestions. You have to be very careful whenever you use the DataRecord.equals() method because, according to our internal rules, two empty/null fields are considered different. So two ‘same’ records with at least one null field are considered different.

For this kind of record comparison, I would recommend the RecordComparator class instead of a simple equals() call.
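
The difference between the two policies can be sketched in plain Java. This is only a model of the behaviour described above, not the Clover implementation: the class NullAwareCompareSketch and its methods strictEquals and keyEquals are hypothetical names, and records are modelled as Object arrays.

```java
import java.util.Objects;

/** Plain-Java model of two comparison policies for records with null fields. Not Clover code. */
final class NullAwareCompareSketch {

    /** Strict policy: a null field never matches anything, not even another null. */
    static boolean strictEquals(Object[] a, Object[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = 0; i < a.length; i++) {
            if (a[i] == null || b[i] == null || !a[i].equals(b[i])) {
                return false;
            }
        }
        return true;
    }

    /**
     * Key-based policy: only the fields listed in keyFields are compared,
     * and the equalNulls flag decides whether null matches null.
     */
    static boolean keyEquals(Object[] a, Object[] b, int[] keyFields, boolean equalNulls) {
        for (int k : keyFields) {
            if (a[k] == null && b[k] == null) {
                if (!equalNulls) {
                    return false;
                }
            } else if (!Objects.equals(a[k], b[k])) {
                return false;
            }
        }
        return true;
    }
}
```

With two records that both hold {"10273", null, 0.0}, strictEquals returns false, while keyEquals over fields {0, 1, 2} returns true once equalNulls is set to true.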

The second suggestion is a small performance enhancement. Try replacing the very slow DataRecord.duplicate() (a new Java object has to be created on every call) with a simple DataRecord.copyFrom(). Of course, you still need to create the temporary record with a single duplicate() call first.
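
The "duplicate once, then copy in place" pattern might look roughly like this. It is a sketch under assumptions: Rec is a hypothetical stand-in for a mutable record (it is not the Clover DataRecord class), sameValues is an invented helper that treats two nulls as equal, and the allocations counter exists only to make the effect visible.

```java
import java.util.Arrays;

/** Plain-Java model of a partitioner that allocates its "previous" buffer only once. */
final class ReusePreviousSketch {

    /** Hypothetical stand-in for a mutable record; not the Clover DataRecord class. */
    static final class Rec {
        final String[] fields;

        Rec(String... fields) {
            this.fields = fields.clone();
        }

        /** Allocates a new object holding a copy of the values. */
        Rec duplicate() {
            return new Rec(fields);
        }

        /** Overwrites this record's values in place; assumes the same number of fields. */
        void copyFrom(Rec other) {
            System.arraycopy(other.fields, 0, fields, 0, fields.length);
        }

        /** Field-by-field comparison; two null fields count as equal here. */
        boolean sameValues(Rec other) {
            return Arrays.equals(fields, other.fields);
        }
    }

    private Rec previous;   // allocated on the first record, then reused
    int allocations = 0;    // only here to show how often duplicate() is called

    int getOutputPort(Rec record) {
        if (previous != null && previous.sameValues(record)) {
            return 1;                       // duplicate of the previous record
        }
        if (previous == null) {
            previous = record.duplicate();  // the single duplication call
            allocations++;
        } else {
            previous.copyFrom(record);      // no new object on later records
        }
        return 0;
    }
}
```

Keeping a private copy, rather than just storing the reference, stays correct even if the caller reuses and overwrites the record object it passes in.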

Martin

Hi Martin,

Thanks for clarifying the usage of DataRecord.equals(). I have data fields that are empty/null (I am just showing a stripped-down data set in my previous postings here), so that explains why records with the same values (and some empty/null fields) are treated as different although the non-null field values are the same.

Is there a hit in performance if I compare values in each field (between the current record and previous record) in my Partition class instead of using RecordComparator?

Thanks for your second suggestion too!

albert :)

I don’t think so. I definitely recommend using the ready-made RecordComparator, for example like this:

RecordComparator recordComparator = new RecordComparator(new int[] {0, 1, 2}); // you have to specify the key fields that the comparison will be based on
recordComparator.setEqualNULLs(true);
if (recordComparator.compare(previousRecord, currentRecord) == 0) {
	//records are equal
}

That should be the least error-prone way.

Martin

Hi,

One more thing from me:
You can select all fields by pressing Ctrl+A in the Fields pane and then copy them all with a single click on the Right arrow button in the wizard. This way all 2000 fields are moved at once to the pane on the right to create the Sort key.
In the same way, you can copy them to the Dedup key.

Best regards,

Tomas Waller

Hi Martin,

If my data has 2000 fields and 1 key field, do I just instantiate the RecordComparator with the key field, like this:

RecordComparator recordComparator = new RecordComparator(new int[] {0});

and the recordComparator.compare(previousRecord, currentRecord) would compare all 2000 fields?

Hi Tomas,

I am not using the Clover GUI; I am generating the graph file programmatically via my web application, depending on what my user needs.

Thanks to both of you :)

al

No, the constructor parameter (the array of integers) specifies the list of fields to compare. So if you need to compare two records on all 2000 fields, you have to prepare the array ‘new int[] {0, 1, 2, …, 1999}’ (the indexes are zero-based, as in the three-field example above).
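
You do not have to type such an array by hand. A small helper can build it, assuming the indexes are zero-based as in the {0, 1, 2} example earlier in the thread. The class and method names here (AllFieldIndexesSketch, allFieldIndexes) are made up for illustration, and where the field count comes from (presumably the record or its metadata) is left to the caller.

```java
import java.util.stream.IntStream;

/** Builds the index array {0, 1, ..., fieldCount - 1} for comparing on every field. */
final class AllFieldIndexesSketch {

    static int[] allFieldIndexes(int fieldCount) {
        // range() is end-exclusive, so this yields 0 .. fieldCount - 1.
        return IntStream.range(0, fieldCount).toArray();
    }
}
```

The resulting array would then be passed to the RecordComparator constructor in place of the hand-written one.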

Martin

Got it. Thanks, Martin!

al