I have been doing some testing and found that the ‘Delimited Data Reader’ reads csv files significantly faster than the ‘Universal Data Reader’. While that may not come as a surprise, the documentation makes no reference to the performance benefits of one over the other.
I also don’t understand why one outperforms the other… it seems that, based on the metadata or component properties, the ‘Universal’ reader should be able to determine behind the scenes which parser to run for the best performance.
I also found that the Iced Tea 1.7.0.0 JVM runs significantly faster than the Java JRE 1.6.0.3, specifically on file I/O.
Just some thoughts and notes from my testing so far… Hope it helps someone.
Shane
Hello Shane!
Thanks a lot for your effort on this topic, we really appreciate it!
Regarding your questions:
The Delimited Data Reader component is about 10% faster than the Universal Data Reader according to our internal testing. We are aware of this; the main reason is the different delimiter-searching algorithm: a full implementation of Aho-Corasick string searching. This ensures correct handling of multi-character delimiters and the record delimiter. There are a few other improvements, such as better error recognition and error reporting.
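To illustrate the idea, here is a minimal sketch of Aho-Corasick-style matching used to split text on several multi-character delimiters in a single pass. It is not the engine's actual parser code: the class name `DelimiterSearch` and its `split` method are hypothetical, delimiters are assumed to be non-empty, and when one delimiter is a prefix of another the shorter one is matched first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical illustration of Aho-Corasick delimiter search; not the engine's parser. */
public class DelimiterSearch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;        // longest proper suffix of this path that is also a trie path
        int match = -1;   // index of the longest delimiter ending here, or -1 if none
    }

    private final Node root = new Node();
    private final String[] delimiters;

    public DelimiterSearch(String... delimiters) {
        this.delimiters = delimiters.clone();
        // Step 1: insert every delimiter into a trie.
        for (int i = 0; i < delimiters.length; i++) {
            if (delimiters[i].isEmpty()) {
                throw new IllegalArgumentException("delimiters must be non-empty");
            }
            Node n = root;
            for (char c : delimiters[i].toCharArray()) {
                n = n.next.computeIfAbsent(c, k -> new Node());
            }
            n.match = i;
        }
        // Step 2: compute failure links breadth-first, so that every parent
        // is finished before its children are processed.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node n = queue.remove();
            for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                Node child = e.getValue();
                Node f = n.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit a match from the suffix if this node has none of its own.
                if (child.match < 0) {
                    child.match = child.fail.match;
                }
                queue.add(child);
            }
        }
    }

    /** Splits text at every delimiter occurrence, reading each character once. */
    public List<String> split(String text) {
        List<String> fields = new ArrayList<>();
        Node state = root;
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail;
            }
            Node n = state.next.get(c);
            state = (n != null) ? n : root;
            if (state.match >= 0) {
                int delimStart = i + 1 - delimiters[state.match].length();
                fields.add(text.substring(start, delimStart));
                start = i + 1;
                state = root; // restart matching after the delimiter
            }
        }
        fields.add(text.substring(start));
        return fields;
    }
}
```

For example, `new DelimiterSearch(";", "\r\n").split("a;b\r\nc;d")` yields the four fields `a`, `b`, `c` and `d`. The extra bookkeeping (failure links, map lookups per character) is the kind of overhead that a simpler single-delimiter scan avoids.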
You are right, all three flat-file readers (delimited, fixlen and universal) could be joined into one component that decides, based only on the metadata, which of the available parsers to use. Unfortunately, the current state exists for historical reasons. However, rest assured that we are considering this possibility.
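A rough sketch of what such metadata-based selection might look like follows. Everything here is hypothetical (the `ParserSelector` class, the `FieldMeta` type and the `ParserKind` names are invented for illustration and are not the engine's API), and it assumes that a field counts as delimited when it has a delimiter and as fixed-length otherwise.

```java
import java.util.List;

/** Hypothetical sketch: pick a parser from record metadata alone. */
public class ParserSelector {
    enum ParserKind { DELIMITED, FIXLEN, UNIVERSAL }

    /** Invented field description: delimiter is null for fixed-length fields. */
    static final class FieldMeta {
        final String name;
        final String delimiter; // null means the field is fixed-length
        final int size;         // only meaningful for fixed-length fields

        FieldMeta(String name, String delimiter, int size) {
            this.name = name;
            this.delimiter = delimiter;
            this.size = size;
        }

        boolean isDelimited() {
            return delimiter != null;
        }
    }

    static ParserKind choose(List<FieldMeta> fields) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("metadata has no fields");
        }
        boolean allDelimited = fields.stream().allMatch(FieldMeta::isDelimited);
        boolean allFixed = fields.stream().noneMatch(FieldMeta::isDelimited);
        if (allDelimited) {
            return ParserKind.DELIMITED; // specialised parser for delimited records
        }
        if (allFixed) {
            return ParserKind.FIXLEN;    // specialised parser for fixed-length records
        }
        return ParserKind.UNIVERSAL;     // mixed records fall back to the general parser
    }
}
```

With this arrangement the user would configure a single reader, and the faster specialised parser would be chosen automatically whenever the metadata allows it.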
We are also very interested in your experience running the clover engine on the Iced Tea 1.7.0.0 JVM. May I ask you to send us the results of your testing?
Thanks again for your insight.
Martin