Dealing with duplicate records

kasturi · November 22, 2011, 12:00am

I am working with a large amount of data (web data). When I load into the fact table, I would like to check whether a row is already there before I insert it. I use incremental loads, so normally there is no chance of having duplicates. Only when the graph fails for some other reason and I need to rerun it do I need to be able to resume from the point where it failed. Using an intersection slows down the process quite a bit. The fact table already has millions of rows. I was wondering if there is a way to use a transformation where I update the row if there is a duplicate. I would appreciate any ideas on how to avoid the intersection and still accomplish the goal.
Thanks,
Kasturi

mvarecha · November 23, 2011, 2:44pm

Hi,

Since you only need to handle the rerun after a graph failure, I believe it shouldn’t occur too often.

You can avoid the intersection and simply try to insert all input records. If a record already exists, DBOutputTable can’t insert it again and rejects the record.
The rejected record is sent to output port 0.
(Please don’t use DBOutputTable in batch mode; otherwise all records in the batch would be evaluated as rejected.)
You may also connect output port 0 to another DBOutputTable, which can update the existing record.
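
To illustrate the idea outside of CloverDX, here is a minimal Python sketch of the same insert-first, update-on-reject pattern. It is not DBOutputTable itself: it uses an in-memory SQLite database, and the table `fact`, its columns `id` and `value`, and the function `load_with_update_fallback` are all hypothetical names chosen for the example. It also assumes the table has a primary key (or unique constraint), since that is what makes the database refuse the duplicate insert in the first place.

```python
import sqlite3


def load_with_update_fallback(conn, rows):
    """Insert each row; if the key already exists, update the existing row.

    Returns a tuple (inserted_count, updated_count).
    """
    inserted, updated = 0, 0
    for key, value in rows:
        try:
            # One statement per row (no batching), so a rejection
            # identifies exactly one record.
            conn.execute(
                "INSERT INTO fact (id, value) VALUES (?, ?)", (key, value)
            )
            inserted += 1
        except sqlite3.IntegrityError:
            # The "rejected" record: the key is already present,
            # so route it to an update instead.
            conn.execute(
                "UPDATE fact SET value = ? WHERE id = ?", (value, key)
            )
            updated += 1
    conn.commit()
    return inserted, updated


# Hypothetical fact table with a primary key on id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, value TEXT)")

# First load, then a rerun that overlaps with it on id 2.
first = load_with_update_fallback(conn, [(1, "a"), (2, "b")])
rerun = load_with_update_fallback(conn, [(2, "b2"), (3, "c")])
```

In the normal incremental case every row takes the insert path, so the extra cost is only paid for rows that were already loaded before a failed run.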

Best regards,
Martin
