Remove duplicate records and sorting in DEDUP?

Hi,

How do I remove duplicate records without specifying all the fields? My record metadata has 2000 fields…
Here is a subset of my input data, sorted by REFERENCE (primary key):

"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1"
"000000010271 ","WFB ","1"
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"

i want an output result like this:

"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1" (removed the duplicate)
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"

I know I can use DEDUP and set dedupKey="REFERENCE;NAME;NO" to get this output, but if my input data has 2000 fields, I do not want to list all 2000 of them in dedupKey, right? Moreover, can dedupKey even be set to such a long string? So, is there a way to tell CloverETL to remove duplicate records when there are 2000 fields to match?

I would think DEDUP would just need a flag, say remove_only_if_all_fields_matches, set to true, and could then take the list of fields from the FMT… if the values of all the respective fields match, the record is a duplicate and gets removed… that way, dedupKey would not need to be set to a large number of field names… right?

just to make sure, DEDUP does not sort the records, right?

thanks,
al

Does anyone have an idea for a better solution than putting all 2000 fields in the "key"?

This is an urgent matter for me, so any help/suggestion would be greatly appreciated :)

al

Hello,
The only idea I have is to use the Partition component instead of Dedup: in the partition function you can compare the current record with the previous one:


int getOutputPort(DataRecord record){
  if (record.compareTo(previous) != 0) {
     previous = record;
     return 0;
  }else {
     return 1;
  }
}

and then on port 0 you will have only distinct records.
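
For anyone who wants to try the idea outside CloverETL, here is a rough plain-Java sketch of the same adjacent-record comparison. It is not CloverETL code: the class AdjacentDedup and the method route are made-up names, rows are modeled as List<String> instead of DataRecord, and it assumes the input is already sorted so that identical rows sit next to each other. No field names are listed anywhere; the whole row is compared at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AdjacentDedup {
    // Hypothetical helper: returns one "port" per row,
    // 0 for a row that differs from the row before it, 1 for an adjacent repeat.
    static List<Integer> route(List<List<String>> rows) {
        List<Integer> ports = new ArrayList<>();
        List<String> previous = null; // nothing seen yet
        for (List<String> row : rows) {
            if (row.equals(previous)) {   // compares every field, in order
                ports.add(1);
            } else {
                ports.add(0);
                previous = row;           // safe here: each row is its own object
            }
        }
        return ports;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("000000010271 ", "WFB ", "1"),
            Arrays.asList("000000010271 ", "WFB ", "1"),
            Arrays.asList("000000010272 ", "ABC ", "1"),
            Arrays.asList("000000010272 ", "ABC ", "2"));
        System.out.println(route(rows)); // prints [0, 1, 0, 0]
    }
}
```

Only neighbouring rows are compared, so a duplicate that is not adjacent to its twin would still go to port 0; that is why the sorted input matters.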

Thanks for the suggestion :)

I had to fix one thing: change
"previous = record;" to "previous = record.duplicate();".

If not, previous will always hold the current record's values, since the two variables are basically the same "pointer" or "address", and the comparison always finds them equal.
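
In case it helps someone else, the effect can be reproduced in plain Java. This is only an illustration under assumptions: Rec is a made-up stand-in for a record object that gets refilled for every row (which is what the behaviour above suggests the framework does), and ReusedRecordDemo, fill, sameAs and route are hypothetical names, not CloverETL API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReusedRecordDemo {
    // Hypothetical stand-in for a record whose contents are overwritten per row.
    static final class Rec {
        String[] fields = new String[0];
        void fill(String... values) { fields = values.clone(); }
        Rec duplicate() { Rec copy = new Rec(); copy.fields = fields.clone(); return copy; }
        boolean sameAs(Rec other) { return other != null && Arrays.equals(fields, other.fields); }
    }

    // Routes each row to port 0 (distinct) or port 1 (repeat of the previous row).
    // copyPrevious = false keeps a reference to the shared buffer (the bug);
    // copyPrevious = true stores an independent copy (the fix).
    static List<Integer> route(String[][] rows, boolean copyPrevious) {
        Rec buffer = new Rec();   // one object, reused for every row
        Rec previous = null;
        List<Integer> ports = new ArrayList<>();
        for (String[] row : rows) {
            buffer.fill(row);
            if (!buffer.sameAs(previous)) {
                previous = copyPrevious ? buffer.duplicate() : buffer;
                ports.add(0);
            } else {
                ports.add(1);
            }
        }
        return ports;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"000000010271 ", "WFB ", "1"},
            {"000000010271 ", "WFB ", "1"},
            {"000000010272 ", "ABC ", "1"},
            {"000000010272 ", "ABC ", "2"}};
        System.out.println(route(rows, false)); // prints [0, 1, 1, 1]
        System.out.println(route(rows, true));  // prints [0, 1, 0, 0]
    }
}
```

Without the copy, previous and the current record are the same object after the first row, so every later row compares equal to "itself" and lands on port 1.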