Does anyone out there have a good example of a process to check a history file of processed records to see which incoming records haven’t already been processed? I’m using this in some processing where we get new sets of data that often include records which were previously supplied and don’t need to be processed again. The entire history is written out to a clover file at the end of the initial process, so we have that to check against.
I’m currently working on a solution using joins, but wondering if there is an easier way to do it. It would be easy to do with a SQL Query (Select * FROM NewData where KeyField NOT IN (select KeyField from HISTORY), but we don’t have the data written out to databases.
Any ideas what’s the most efficient way to do this using the clover components?
Probably the most suitable solution is to use the DataIntersection component which is designed to this kind of job. However, you will need to sort the inputs, so if you process large amount of data it can be memory or HDD space consuming (depending on which sort component you choose, your memory settings and number of records). I attached a sample graph, where it is shown.
The DataIntersection has 3 outputs. Following is the layout of the DataIntersection outputs in the sample graph.
1. Records that weren’t processed yet.
2. Records that are processed again.
3. Records that were already processed, but aren’t processed again.
Here are some hints, you may find useful. If you don’t need to update the status (or contents) of already processed records, you can save only the key values, which are used to match processed and unprocessed records. This can save you memory or HDD space when the sort component process the records. Using the Clover file format can speed up processing the file. However, if you plan to use it outside of the CloverETL software, it will be wise to use some universal file format.