Bug in DataParser

This is a follow-up to this: http://forum.cloveretl.com/viewtopic.php?f=4&t=4569

This issue continues to plague me. DataParser seems to work OK when the data is comma-delimited, but when I specify other field and record delimiters, I am frequently unable to parse perfectly valid delimited data files.

I noticed that if I revert to 2.8.1, I can parse the file fine; it fails with 3.1.0 and 3.1.2. The situation is the same as I described in my earlier post: the parser seems to end up eating the last field on the row along with the delimiter, resulting in an exception. The exception is usually raised at DataParser L437, calling parsingErrorNotFound.

Replacing that with the following logic seems to resolve the issue for me. The end-of-record delimiter handling was not taking into account content still sitting in the delimiter searcher/buffer.


// length of the field once the matched delimiter characters are removed
int fieldLength = fieldBuffer.length() - delimiterSearcher.getMatchLength();
if (fieldLength > 0) {
	// strip the matched delimiter characters from the end of the field
	fieldBuffer.delete(fieldLength, fieldBuffer.length());
	break;
} else if (fieldLength == 0 && fieldCounter == numFields - 1) {
	// last field, but empty
	break;
} else {
	return parsingErrorFound("Unexpected record delimiter, probably record has too few fields.", record, fieldCounter);
}
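
To make the three branches of the snippet above easier to reason about outside DataParser, here is a minimal standalone sketch of the same decision. The class name, the Outcome enum and the decide method are hypothetical and are not part of CloverETL; matchLength stands in for delimiterSearcher.getMatchLength(), and the sketch assumes the matched delimiter characters sit at the end of the field buffer, as the snippet implies.

```java
class RecordDelimiterPatchSketch {
    // Hypothetical names, used only to label the three branches of the posted snippet.
    enum Outcome { TRIM_AND_FINISH, FINISH_EMPTY_LAST_FIELD, ERROR }

    static Outcome decide(StringBuilder fieldBuffer, int matchLength,
                          int fieldCounter, int numFields) {
        // Length of the field once the matched delimiter characters are removed.
        int fieldLength = fieldBuffer.length() - matchLength;
        if (fieldLength > 0) {
            // Non-empty field: cut the delimiter off the end and finish the record.
            fieldBuffer.delete(fieldLength, fieldBuffer.length());
            return Outcome.TRIM_AND_FINISH;
        } else if (fieldLength == 0 && fieldCounter == numFields - 1) {
            // Empty field, accepted only in the last position (buffer left untouched,
            // mirroring the posted snippet).
            return Outcome.FINISH_EMPTY_LAST_FIELD;
        }
        // Record delimiter seen too early: the record has too few fields.
        return Outcome.ERROR;
    }
}
```

For example, a buffer holding "abc\r\n" with a match length of 2 is trimmed to "abc", while a buffer holding only "\r\n" is accepted when it belongs to the last field and reported as an error otherwise.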

I’d like to avoid maintaining a forked version of Clover with this fix included. Is there a workaround I can use?

Thanks

Could you please send an example of a valid delimited file that UniversalDataReader cannot parse?

Thanks a lot.

In your previous post you mentioned this:

The delimiter on the last field of my metadata is defined as delimiter=",\\|;\\|:"

I think this is the problem. If you specify an explicit delimiter for the last field, it overrides the default delimiter, which is usually the record delimiter (\r\n in your case). So when CloverETL parses the file, it will:

  • parse the first field

  • find the delimiter of the first field (comma = ,)

  • parse the second (and last) field

  • while parsing the second field, it will not find its delimiter (specified by ",\\|;\\|:")

  • instead, it will run into the "\r\n" sequence, which is the record delimiter

  • therefore, you will get the “unexpected record delimiter” error
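
The steps above can be reproduced with a toy parser. This is only a sketch of the behaviour described, not CloverETL's DataParser: the class ToyDelimitedParser and its methods are hypothetical, the sample record "a,b\r\n" is made up for illustration, and each field simply carries the list of delimiters that may terminate it.

```java
import java.util.ArrayList;
import java.util.List;

class ToyDelimitedParser {
    // Returns the candidate delimiter that starts at pos, or null if none does.
    static String firstMatch(String data, int pos, String[] candidates) {
        for (String c : candidates) {
            if (data.startsWith(c, pos)) return c;
        }
        return null;
    }

    // Parses one record; delimiters[i] lists the accepted terminators of field i.
    static List<String> parseRecord(String data, String[][] delimiters, String recordDelimiter) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < delimiters.length; i++) {
            int start = pos;
            boolean closed = false;
            while (pos < data.length() && !closed) {
                String hit = firstMatch(data, pos, delimiters[i]);
                if (hit != null) {
                    // The field's own delimiter was found: close the field.
                    fields.add(data.substring(start, pos));
                    pos += hit.length();
                    closed = true;
                } else if (data.startsWith(recordDelimiter, pos)) {
                    // Record delimiter reached before the field's own delimiter.
                    throw new IllegalStateException("Unexpected record delimiter in field " + i);
                } else {
                    pos++;
                }
            }
            if (!closed) throw new IllegalStateException("Unexpected end of data in field " + i);
        }
        return fields;
    }

    public static void main(String[] args) {
        String data = "a,b\r\n";
        // Last field terminated by the record delimiter: parses into two fields.
        System.out.println(parseRecord(data, new String[][] {{","}, {"\r\n"}}, "\r\n"));
        // Last field with an explicit delimiter list of , ; : instead: the parser
        // hits "\r\n" first and reports an unexpected record delimiter.
        try {
            parseRecord(data, new String[][] {{","}, {",", ";", ":"}}, "\r\n");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The second call fails for the reason given in the list: the explicit delimiter list replaces the record delimiter as the terminator of the last field, so "\r\n" is no longer an acceptable way to end it.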

So I think CloverETL behaves correctly and the error message is alright.

I am not sure why you have set such a delimiter for your last field. The solution would be to remove the explicit delimiter from your last field completely and let it default to the implicit record delimiter "\r\n", because in CSV the last field is always terminated by the record delimiter.