URGENT: remove duplicate records and sorting in DEDUP?

Hi,

How do I remove duplicate records without specifying all the fields? My record metadata has 2000 fields…
Here is a subset of my input data, sorted by REFERENCE (primary key):

"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1"
"000000010271 ","WFB ","1"
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"

I want output like this:

"REFERENCE","NAME","NO"
"000000010271 ","WFB ","1" (removed the duplicate)
"000000010272 ","ABC ","1"
"000000010272 ","ABC ","2"

I know I can use DEDUP and set dedupKey="REFERENCE;NAME;NO" to get this output, but if my input data has 2000 fields, I do not want to list all 2000 of them in dedupKey. Moreover, can dedupKey even be set to such a long string? So, is there a way to tell CloverETL to remove duplicate records when there are 2000 fields to match?

I would think DEDUP would just need a flag, say remove_only_if_all_fields_matches, set to true, and could then reference the FMT for the list of fields. If the values of all the respective fields match, the record is a duplicate and gets removed. That way, DEDUP would not need dedupKey to be set to a large number of field names, right?

Just to make sure, DEDUP does not sort the records, right?

Any help would be greatly appreciated.

thanks,
al

Hello,
I've created a new issue for your request (http://bug.cloveretl.com/view.php?id=3401).
As for sorting: Dedup doesn't sort records, but it expects the records to be sorted by the key fields. If they are not, it only deduplicates within each run of consecutive input records that share the same key.
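
To illustrate the point about sorting, here is a minimal Python sketch of adjacent-only deduplication. It is not CloverETL code; the helper name `dedup_adjacent` and the sample rows are made up for the illustration, and it assumes the first record of each run is the one kept.

```python
from itertools import groupby


def dedup_adjacent(records, key_fields):
    """Keep the first record of each run of consecutive records with an equal key.

    Hypothetical helper: it mimics a dedup step that never sorts its input.
    """
    def key(rec):
        return tuple(rec[f] for f in key_fields)

    return [next(group) for _, group in groupby(records, key=key)]


# Using every field as the key, as asked in the original question.
fields = ["REFERENCE", "NAME", "NO"]
rows = [
    {"REFERENCE": "000000010271", "NAME": "WFB", "NO": "1"},
    {"REFERENCE": "000000010272", "NAME": "ABC", "NO": "1"},
    {"REFERENCE": "000000010271", "NAME": "WFB", "NO": "1"},  # duplicate, but not adjacent
]

# Unsorted input: the duplicate is not next to its twin, so it survives.
unsorted_result = dedup_adjacent(rows, fields)

# Sorting by the same key first brings duplicates together, so they are removed.
sorted_rows = sorted(rows, key=lambda r: tuple(r[f] for f in fields))
sorted_result = dedup_adjacent(sorted_rows, fields)
```

This is why the sort key and the dedup key need to agree: the sort step is what makes equal records adjacent.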

Hello Achan,

You can click inside the left pane of the Edit key dialog (the Fields pane), then press Ctrl+A (all fields will be selected and turn blue) and click the Right arrow key.

This way all the fields will be moved to the pane on the right. You only need to confirm this by clicking OK.

Before this, do the same in the ExtSort component, so that the records reach Dedup sorted by the same key.

I think this is what you wanted.

Best regards,

Tomas Waller

Reviving this thread from a long time ago, since I am facing a similar issue. My problem is not the number of metadata fields (100), but rather the number of separate graphs. We have 400 files that we will be uploading via CloverETL, and each needs to be in a separate graph so that the files can be run individually.

I have used a shared metadata resource, database connection and SQL query file to make those elements common among all graphs. But I cannot figure out how to make an automatically updated or shared component for the sort/dedup. I have read that dedup has some functionality when the key is not specified, but that it still expects sorted input data. Is there a way to create a shared resource for the sort components? Or is there a sort that works without any specified key and just sorts the fields in order?

A 'shared resource', or a sort without a key feeding a dedup without a key, would I think fix my problem, but I can't figure out how to make it work. Thank you for your help!

Hi Tom,

I would recommend using parameters in workspace.prm together with ${} parameter substitution. That will allow you to control the sort and dedup keys from one place. Most component parameters can be passed this way. Please see the attached sample.

In workspace.prm, please define parameters like:

SORT_KEY=field1(a);field2(a)
DEDUP_KEY=field1(a)
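
As a rough illustration of why this helps with many graphs, the Python sketch below mimics ${} substitution with the standard library's `string.Template`. It is not how CloverETL resolves parameters internally; the helpers `load_params` and `resolve` are hypothetical, and the sketch assumes each graph sets its sort and dedup key attributes to `${SORT_KEY}` and `${DEDUP_KEY}`.

```python
from string import Template


def load_params(text):
    """Parse NAME=value lines (hypothetical reader for a .prm-style file)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.partition("=")
        params[name.strip()] = value.strip()
    return params


def resolve(attribute_value, params):
    """Replace ${NAME} placeholders with the values defined in params."""
    return Template(attribute_value).substitute(params)


# The two definitions from workspace.prm shown above.
PRM_TEXT = """
SORT_KEY=field1(a);field2(a)
DEDUP_KEY=field1(a)
"""

params = load_params(PRM_TEXT)

# Every graph would carry only the placeholders, never the field lists.
sort_key = resolve("${SORT_KEY}", params)
dedup_key = resolve("${DEDUP_KEY}", params)
```

Because each graph holds only the placeholder, changing the key in workspace.prm changes it for all graphs at once.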