Easiest way to check a history file and only process new records

pintail · February 15, 2016, 12:00am

Hi,

Does anyone out there have a good example of a process to check a history file of processed records to see which incoming records haven’t already been processed? I’m using this in some processing where we get new sets of data that often include records which were previously supplied and don’t need to be processed again. The entire history is written out to a clover file at the end of the initial process, so we have that to check against.

I’m currently working on a solution using joins, but wondering if there is an easier way to do it. It would be easy to do with a SQL Query (Select * FROM NewData where KeyField NOT IN (select KeyField from HISTORY), but we don’t have the data written out to databases.

Any ideas what’s the most efficient way to do this using the clover components?

thanks!

Lukas_Cholasta · February 19, 2016, 11:56am

Hi pintail

Probably the most suitable solution is to use the DataIntersection component which is designed to this kind of job. However, you will need to sort the inputs, so if you process large amount of data it can be memory or HDD space consuming (depending on which sort component you choose, your memory settings and number of records). I attached a sample graph, where it is shown.

The DataIntersection has 3 outputs. Following is the layout of the DataIntersection outputs in the sample graph.

1. Records that weren’t processed yet.
2. Records that are processed again.
3. Records that were already processed, but aren’t processed again.

Here are some hints, you may find useful. If you don’t need to update the status (or contents) of already processed records, you can save only the key values, which are used to match processed and unprocessed records. This can save you memory or HDD space when the sort component process the records. Using the Clover file format can speed up processing the file. However, if you plan to use it outside of the CloverETL software, it will be wise to use some universal file format.

Best regards

pintail.rar

Topic		Replies	Views
Compare records and gather metrics CloverDX Platform	3	2	July 2, 2013
Compare data from next row with current row CloverDX Platform	6	2	June 9, 2014
Process only unprocessed files CloverDX Platform	3	1	October 30, 2009
How do you keep track of a file that has been processed? CloverDX Platform	1	1	September 20, 2010
DBExecute Stored Procedure Incremental Snapshot Processing CloverDX Platform	1	2	April 23, 2014

Easiest way to check a history file and only process new records

Related topics