DataReader Policy

Heya,

I can see from the Wiki that DataReader has three data policies: Strict, Controlled, Lenient. Has there been any thought about adding a data policy (or a condition) that allows you to keep rows where "a parser found unexpected record/field delimiter or end of file"? Our legacy commercial ETL tool handles this sort of issue by allowing our users to treat the remaining fields as null (if they flag them appropriately).

As a workaround, we are currently parsing the row as a single field and then splitting the data ourselves, but it would be nice if Clover didn't treat this as an error. Just thought I'd ask whether there has been any movement in this area, since I'll be able to remove one of our other workarounds when we move to 2.8.0…

Thanks,
Anna

This looks like an interesting idea. Can you elaborate a bit more? Ideally, post an example of the input and what should happen on output… What does "Lenient" mode do when you run it on your data?

Heya,

It has been a while since we used it, but I believe "Lenient" skips rows with formatting errors, "Controlled" logs the bad record but still skips it, and "Strict" aborts the graph.

The common example given to me by our business folks is a comma-separated (CSV) file that was created via Excel (not sure if it's an older version or not); it tends not to put commas in for trailing empty cells, so when they get data like:

CUSTOMER_ID|LAST_NAME     |FIRST_NAME    |MIDDLE_INITIAL       |ADDRESS
001        |Smith         |Joe           |                     |
002        |Jones         |Mary          |P                    |
003        |Jackson       |Ken           |W                    |Mr.
004        |Hill          |Rose          |                     |Mrs.


it gets converted to a CSV file like:

CUSTOMER_ID,LAST_NAME,FIRST_NAME,MIDDLE_INITIAL,ADDRESS
001,Smith,Joe
002,Jones,Mary,P
003,Jackson,Ken,W,Mr.
004,Hill,Rose,Mrs.

With all three policies, rows 001 and 002 get rejected with formatting errors.

Our legacy ETL tool allows us to mark the MIDDLE_INITIAL and ADDRESS fields as "not required/treat them as null" - we tend to mark all fields that way. The rows are then not rejected and carry through the transform. For our Clover graph, we use a custom transform that parses the record as a single string, breaks it up into the fields, and treats the missing fields as NULL (it logs an error and aborts the graph if the field in question is required); there's a sketch of that logic after the sample below. We get the same results as if the CSV file had been:

CUSTOMER_ID,LAST_NAME,FIRST_NAME,MIDDLE_INITIAL,ADDRESS
001,Smith,Joe,
002,Jones,Mary,P,
003,Jackson,Ken,W,Mr.
004,Hill,Rose,Mrs.
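
For anyone curious, this is roughly what our custom transform does, sketched in plain Java - the names ShortRowPadder, REQUIRED, and padFields are made up for illustration and aren't our actual code:

import java.util.Arrays;

public class ShortRowPadder {
    // Illustrative: CUSTOMER_ID, LAST_NAME and FIRST_NAME are required here.
    private static final boolean[] REQUIRED = {true, true, true, false, false};
    private static final int FIELD_COUNT = REQUIRED.length;

    // Split the raw record on commas and pad missing trailing fields with null;
    // abort (throw) if a required field is missing, as our transform does.
    static String[] padFields(String rawRecord) {
        String[] parsed = rawRecord.split(",", -1);  // -1 keeps trailing empty fields
        for (int i = parsed.length; i < FIELD_COUNT; i++) {
            if (REQUIRED[i]) {
                throw new IllegalStateException(
                        "Required field #" + i + " missing in: " + rawRecord);
            }
        }
        return Arrays.copyOf(parsed, FIELD_COUNT);  // missing slots become null
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(padFields("001,Smith,Joe")));
        // prints: [001, Smith, Joe, null, null]
    }
}

It's crude, but it gives us the padded-with-NULL behavior shown above.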

That is the way we use it - I looked at our legacy ETL tool, and it provides the following actions:

Global “Missing Columns” Actions:

  • Pad with “null” value: Missing columns are initialized to zero or spaces, depending on the data type.

  • Pad with empty value: Missing columns are initialized to an empty string.

  • Pad with NULL: Missing columns are initialized to NULL.

For each column, you can specify:

Column Contains Insufficient Data to Match the Meta Data

  • Error: Abort the job as soon as such a row is found.

  • Discard + Warn: Discard the current data row and issue a warning.

  • Replace + Warn: Pad the short column in accordance with the missing-columns action; write a warning to the log file.

  • Retain + Warn: Pass the data on as it is, but issue a warning.

  • Replace: Pad the short column in accordance with the missing-columns action.

  • Retain: Pass the data on as it is.
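
To make the request a bit more concrete, here's a rough model of those options in plain Java - all of the names (PaddingMode, Action, handleShortRow) are invented for illustration, and nothing like this exists in Clover today:

import java.util.Arrays;

public class MissingColumnDemo {

    enum PaddingMode { NULL_TOKEN, EMPTY, SQL_NULL }  // the three global actions
    enum Action { ERROR, DISCARD_WARN, REPLACE_WARN, RETAIN_WARN, REPLACE, RETAIN }

    // Applies one of the per-column actions to a short row; returns null to discard it.
    static String[] handleShortRow(String[] row, int expected,
                                   Action action, PaddingMode mode) {
        if (row.length >= expected) {
            return row;  // nothing missing
        }
        switch (action) {
            case ERROR:
                throw new IllegalStateException("Short row: " + Arrays.toString(row));
            case DISCARD_WARN:
                System.err.println("WARN: discarding short row " + Arrays.toString(row));
                return null;
            case RETAIN_WARN:
                System.err.println("WARN: retaining short row " + Arrays.toString(row));
                return row;
            case RETAIN:
                return row;
            case REPLACE_WARN:
                System.err.println("WARN: padding short row " + Arrays.toString(row));
                // fall through to REPLACE
            case REPLACE:
            default:
                String[] padded = Arrays.copyOf(row, expected);
                for (int i = row.length; i < expected; i++) {
                    padded[i] = pad(mode);
                }
                return padded;
        }
    }

    // "Pad with null value" would really be zero-or-spaces by data type;
    // this toy version only deals with string fields.
    static String pad(PaddingMode mode) {
        switch (mode) {
            case NULL_TOKEN: return " ";  // spaces for a string field
            case EMPTY:      return "";
            case SQL_NULL:
            default:         return null;
        }
    }

    public static void main(String[] args) {
        String[] row = {"001", "Smith", "Joe"};
        System.out.println(Arrays.toString(
                handleShortRow(row, 5, Action.REPLACE, PaddingMode.SQL_NULL)));
        // prints: [001, Smith, Joe, null, null]
    }
}

Something along those lines on DataReader - even just the Replace behavior - would let us drop our workaround entirely.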

Thanks,
Anna