I am attempting to build a graph that does some fairly simple operations (this will be a pattern for a number of graphs):
- Remove a set of invalid characters from a gzip’ed data file (CSV)
- Read/parse the CSV data
- Add fields to the data stream using Reformat
- Write the data stream to a database table
I’m stuck at the first step. It seems that the only ways to do this are a) employ a system command and use sed, perl, etc. to create a fixed version of the file, then read that file normally, or b) read the CSV and then iterate through all of the fields, removing the invalid characters from each.
I’m not wildly crazy about either. What I’d like to do is the following, but I don’t see a combination of components and/or functions that will allow me to do it:
- Read the file such that each line is one record
- Remove invalid characters from each record with replace()
- Parse each record as a CSV into the necessary fields
- Do the balance of the transformation…
Perhaps I’m making this too hard and should just accept having to do the preprocessing using a shell command.
Thanks in advance,
-Phil
Hi Phil,
What you can do is read the whole record into one field and then parse the record in another component (Reformat). In other words, the output edge connected to the reader should contain just one field (the whole record, including field delimiters).
The process might look as follows:
- Read rows from the gzipped CSV files.
  - You might use an appropriate URL format so you don’t have to extract the files.
  - Make sure the output metadata has just one field with a record delimiter – this way you will read the whole record into one field.
- Attach the reader’s output edge to a Reformat component.
- Using Reformat’s Transform attribute, parse the record the way you want.
  - Here you can call the replace() function to get rid of all unwanted character combinations. Please note the replace() function works basically the same way as the Java replaceAll() method.
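To make that last point concrete, here is a plain Java sketch of what such a cleanup amounts to, since replace() in the transformation behaves like Java’s replaceAll(). The character class below is only an example, not taken from Phil’s data; adjust it to whatever “invalid characters” means in your files:

```java
public class CleanRecord {

    // Strip ASCII control characters (except tab) from one raw record line.
    // The regex is a hypothetical example of "unwanted character combinations";
    // adapt the character class to your own data.
    static String clean(String record) {
        return record.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");
    }

    public static void main(String[] args) {
        System.out.println(clean("foo\u0001bar,baz\u0007qux")); // prints "foobar,bazqux"
    }
}
```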
Jan,
Thanks for the response. I do use the gzip:() form of the file URL in the UniversalDataReader.
I have contemplated what you suggest, but once I have done the Reformat/replace to eliminate the invalid characters on entire records, I still need to parse each (CSV) record into the various fields of the final record. Do I need to do a READ->REFORMAT->WRITE(to file) followed by a READ(corrected records, parsing CSV)->…balance of processing…, or is there a function/component where I can do the CSV parse of the corrected records? I don’t see that Reformat will do that unless I use something like split(), which is not really intended to parse a CSV.
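(As a quick illustration of why split() falls short: a single quoted field containing the delimiter already defeats it. The record below is a made-up example.)

```java
public class SplitLimitation {
    public static void main(String[] args) {
        // A CSV record whose second (quoted) field contains a comma.
        String record = "id1,\"Smith, John\",42";
        String[] parts = record.split(",");
        // A naive split yields 4 pieces, not the 3 logical CSV fields,
        // because it ignores the quoting rules of CSV.
        System.out.println(parts.length); // prints 4
    }
}
```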
Thanks,
-Phil
Hi Phil,
You have basically two options:
- The first is what you have already mentioned - parsing the record manually in the Reformat transformation.
- The second option, which is in my view the better solution, is to send the cleaned records to a UniversalDataReader directly from the Reformat. You don’t have to write the data to an output file, since all readers can read data from the input port. In this case, the File URL attribute of the UniversalDataReader would have the following value:
port:$0.content:discrete
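The Reformat feeding that port then only needs to emit the cleaned line into a single string field – here hypothetically named content to match the File URL value above. A rough CTL sketch (the field names and the regex are assumptions, not taken from any actual graph):

```
// CTL2 sketch; $in.0.rawLine and $out.0.content are hypothetical field names.
function integer transform() {
    // Remove unwanted characters from the whole record, then pass it on
    // so the downstream UniversalDataReader can parse it as CSV.
    $out.0.content = replace($in.0.rawLine, "[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");
    return ALL;
}
```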
You might find some useful information in Clover documentation under the following links:
I have also created an example that might help you better understand the solution.
ParseData.zip