Compare records and gather metrics

Hello all:

I am very new to CloverETL and have been able to perform simple tasks so far. However, I am now faced with my most challenging task and I’m hoping that one of you kind folks will be able to help me out.

I have two files, A and B. The records in both files share the same format and the first n characters of a record is its unique identifier. The record is of fixed length format and consists of m fields (field1, field2, field3, …fieldm). File A contains records that represent a point in time snapshot of a data set. File B contains new records and records in file A that have changed since the date of the initial data extract to file A. How can I use cloverETL to determine which fields have changed in a record that appears in both files?

Also, how can I gather metrics on the frequency of changes for individual fiels. For example, I would like to know how many records had changes in fieldm.

Thank you.

Hi, amd2002a,

Regarding your first question, I would use DataIntersection component. For more details, see: http://doc.cloveretl.com/documentation/ … ction.html
But it depends on what you want to do with the answer. Do you want some list of changed fields? Do you want to update the fields with the newest values? Do you want to count number of changes? Do you want to store all previous values somewhere?

Regarding your second question, again, it depends on what exactly you need. If you just want to count records on one of the outputs of DataIntersection, you can use Aggregate component. If you want some sophisticated counter, you can code it using CTL code and global variables or sequences. Or you can use some Filter + Aggregate combination.

If you could be more specific in description of your task and desired output, I will prepare some example graph for you.

Best regards,

Hi, amd2002a,

Regarding your first question, I would use DataIntersection component. For more details, see: http://doc.cloveretl.com/documentation/ … ction.html
But it depends on what you want to do with the answer. Do you want some list of changed fields? Do you want to update the fields with the newest values? Do you want to count number of changes? Do you want to store all previous values somewhere?

Regarding your second question, again, it depends on what exactly you need. If you just want to count records on one of the outputs of DataIntersection, you can use Aggregate component. If you want some sophisticated counter, you can code it using CTL code and global variables or sequences. Or you can use some Filter + Aggregate combination.

If you could be more specific in description of your task and desired output, I will prepare some example graph for you.

Best regards,

“imriskal”

Hello imriskal:

Thank you for your reply. I will use an example to better explain the scenario. Let us assume that the files have a pipe delimited format with 3 fields: id, firstName, lastName.

File A:
1|john|james
2|sammy|davis
3|peter|marx
4|pete|angstadt
5|jackie|david
6|paul|samuel
7|sam|kirsteyen
8|tim|tyler
9|justin|mccann
10|jack|mccue

File B:
3|peter|murray
5|jack|david
8|timothy|thompson
11|clark|kent

Notice that in file B we have:
1. one record where last name only has been updated (3).
2. one record where first name only has been updated (5).
3. one record where first name and last name have been updated (8).
4. one new record (11)

I need a graph that will tell me:
1. the number of new records (this should be one using the example above).
2. the number of changed records (this should be three using the example above).
3. for the changed records, the number of times a each field has been updated (using the example above, first name = 2 times, last name = 2 times).

I hope that makes my question clearer.

Thank you.

Hello again,

I have prepared an example graph for you. I hope it is clear enough but do not hesitate to ask in case of any troubles.

Best regards,