Preprocessing a file to remove invalid characters

Support/help with CloverETL (4.9) and CloverDX (5.0 or newer) implementation problems

proberge
Posts: 2
Joined: Tue Oct 06, 2015 6:12 pm

Preprocessing a file to remove invalid characters

Postby proberge » Tue Nov 17, 2015 6:34 pm

I am attempting to build a graph that does some fairly simple operations (this will be a pattern for a number of graphs):

  1. Remove a set of invalid characters from a gzip'ed data file (CSV)
  1. Read/parse the CSV data
  1. Add fields to the data stream using Reformat
  1. Write the data stream to a database table

I'm stuck at 1. It seems that the only ways to do this are a) employ a system command and use sed, perl, etc to create a fixed version of the file, then read the file normally, or b) read the CSV then iterate through all of the fields, removing the invalid characters from each.

I'm not wildly crazy about either. What I'd like to do is the following, but I don't see a combination of components and/or functions that will allow me to do it:

  1. Read the file s.t. each line is 1 record
  1. Remove invalid characters from each record with replace()
  1. Parse each record as a CSV into the necessary fields
  1. Do the balance of the transformation...

Perhaps I'm making this too hard and should just accept having to do the preprocessing using a shell command.

Thanks in advance,
-Phil

slechtaj
Posts: 192
Joined: Wed Aug 15, 2012 8:18 am

Re: Preprocessing a file to remove invalid characters

Postby slechtaj » Wed Nov 18, 2015 5:35 pm

Hi Phil,

What you can do is to read the whole records into one field and then parse the record in another component (Reformat). In other words, the output edge connected to the reader should contain just one field (the whole record with field delimiters).
The process might look like as follows:
  1. Read rows from gzipped CSV files.
    • You might use appropriate URL format so you don't have to extract the files.
    • Make sure the output metadata has just one field with record delimiter – this way you will read the whole record into one field.
  2. Attach the reader's output edge to Reformat component.
  3. Using Reformat's Transform attribute parse the record the way you want.
    • Here you can call replace() function to get rid of all unwanted character combinations. Please note the replace() function works basically the same way as Java replaceAll() method.
Jan Slechta
CloverCARE Support
CloverETL | Rapid Data Integration

Visit us online at http://www.cloveretl.com

How to speed up communication with CloverCARE support

proberge
Posts: 2
Joined: Tue Oct 06, 2015 6:12 pm

Re: Preprocessing a file to remove invalid characters

Postby proberge » Wed Nov 18, 2015 6:06 pm

Jan,

Thanks for the response. I do use the gzip:() form of the file URL in the UniversalDataReader.

I have contemplated what you suggest, but once I have done the reformat/replace to eliminate the invalid characters on entire records, I need to parse each (CSV) record into the various fields of the final record. Do I need to do a READ->REFORMAT->WRITE(to file) followed by a READ(corrected records, parsing CSV)->...balance of processing..., or is there a function/component where I can do the CSV parse of the corrected records. I don't see that Reformat will do that unless I use something like split(), which is not really intended to parse a CSV.

Thanks,
-Phil

slechtaj
Posts: 192
Joined: Wed Aug 15, 2012 8:18 am

Re: Preprocessing a file to remove invalid characters

Postby slechtaj » Fri Nov 20, 2015 8:29 pm

Hi Phil,

You have basically two options:
  • First is what you have already mentioned - parsing the record manually using the Reformat transformation.
  • The second option, which is from my point of view better solution, is sending the cleaned records to UniversalDataReader directly from the Reformat. You don't have to write the data to an output file, since all readers can read data from the input port. In this case, the File URL attribute of the UniversalDataReader would have the following value:

    Code: Select all

    port:$0.content:discrete

You might find some useful information in Clover documentation under the following links:

I have also created an example that might help you better understand the solution.

ParseData.zip
Example project
(4.16 KiB) Downloaded 194 times
Jan Slechta
CloverCARE Support
CloverETL | Rapid Data Integration

Visit us online at http://www.cloveretl.com

How to speed up communication with CloverCARE support