I have a graph that requires several large (~5GB) csv files as inputs. I would like to run the graph recursively, by selecting data in monthly increments. I read some postings about the use of incrementalKey and incrementalFile, but can’t figure out how to apply these concepts to csv files. (For instance, what would I use instead of a SQL ‘where’ clause?) Also, I have no idea how to make the graph iterate through every month in my data. Please help. I’m new to CloverETL, so please set verbose=true. Thanks.
Hi jolness,
incremental reading of db records works differently from incremental reading of csv files. Unfortunately, with csv files you have no option to filter the data you want to read. You just have a small “bookmark” that indicates where your reading ended last time. This bookmark contains just an offset pointing to the last record read; the next graph run starts from this bookmark, reads only the new records, and moves the bookmark forward. There is no additional functionality for it.
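To make the bookmark idea more concrete, here is a minimal Python sketch of the same mechanism outside CloverETL (the file names are made up purely for illustration): store a byte offset after each run and, on the next run, read only what was appended since then.

```python
import os

# Hypothetical file names, purely for illustrating the bookmark mechanism.
DATA_FILE = "invoices.csv"
BOOKMARK_FILE = "invoices.csv.bookmark"

def read_new_records():
    """Read only the records appended since the previous run."""
    # Load the stored offset (the "bookmark"); start at 0 on the first run.
    offset = 0
    if os.path.exists(BOOKMARK_FILE):
        with open(BOOKMARK_FILE) as f:
            offset = int(f.read().strip() or 0)

    with open(DATA_FILE, newline="") as f:
        f.seek(offset)              # continue where the last run ended
        new_lines = f.readlines()   # only the newly appended records
        new_offset = f.tell()

    # Move the bookmark forward so the next run skips these records.
    with open(BOOKMARK_FILE, "w") as f:
        f.write(str(new_offset))

    return new_lines
```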
You can optimize your data access in the following way:
1. Use a UniversalDataReader component with the File URL set to something like “*.csv”. That will read all the input files, one by one.
2. Add a new field to the metadata that holds a value you can use to differentiate months. For example, if a record contains the date 10/22/2010, store the value 10/2010 in the new field. If you already have such a field, use it as it is.
3. Using the output-file partitioning feature of the UniversalDataWriter, you can split the large files into smaller ones, one for every month, e.g. “10-2010-input-data.csv” (a rough sketch of this split follows the list).
4. This will let you work with the small files for whichever month you choose instead of the large ones. The data size will of course be doubled, but working with the small files will be much more comfortable.
5. You can find more info here: http://doc.cloveretl.com/documentation/ … ading.html
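If it helps, the same idea expressed outside CloverETL, as a rough Python sketch (the `invoice_date` column name and the MM/DD/YYYY format are assumptions based on the 10/22/2010 example), looks like this:

```python
import csv
import glob
from datetime import datetime

writers = {}   # one CSV writer per month key, created on the fly
handles = {}

# Step 1: read all matching input files, one by one (like a "*.csv" File URL).
for path in glob.glob("*.csv"):
    with open(path, newline="") as src:
        for row in csv.DictReader(src):
            # Step 2: derive the month key, e.g. 10/22/2010 -> "10-2010".
            month = datetime.strptime(row["invoice_date"], "%m/%d/%Y").strftime("%m-%Y")

            # Step 3: partition into one small file per month,
            # e.g. "10-2010-input-data.csv".
            if month not in writers:
                out = open(f"{month}-input-data.csv", "w", newline="")
                handles[month] = out
                writers[month] = csv.DictWriter(out, fieldnames=list(row.keys()))
                writers[month].writeheader()
            writers[month].writerow(row)

for out in handles.values():
    out.close()
```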
Please let me know if anything was unclear.
Best regards,
Thanks Lubos. I should have been clearer about one thing: My input files must be used in parallel, not in series. They are joined together during the transformation process.
I know how to limit the files in such a way that the joins can still be performed, but it seems that Clover does not. Instead of performing the transformation on manageable subsets of the data, Clover appears to be reading in all of the data from all of the files, presumably because of the joins.
Your suggestion to partition the data might be workable but I would have to partition each of the large input files (of which there are four). And if I understood your suggestion correctly, I would have to create a distinct partition for every month that is represented in my data. As new data accumulates, I would have to add new partitions. This seems like a messy solution to the problem. Am I understanding you correctly?
-Jim
Hi jolness,
there are several aspects to your answer. It would be best if you could provide us with a sample graph so we can analyse it.
1] “I would have to add new partitions” - if you mean manually, then no. Just define the partitioning key as Lubos suggested, and Clover creates the new partitioned files for you on the fly, whenever the next month's value appears in the partitioning key.
2] If your data are appended to existing files, you can use Incremental Reading. It remembers the position of the last transformed record and in the next run reads just the new ones.
3] “Clover appears to be reading in all of the data from all of the files” - this depends heavily on your graph structure. A sample graph would be really helpful here. For example, if you use an ExtSort or FastSort component anywhere, the whole dataset on its input port is read and kept in memory or a temp file, because sorting requires it. If your join keys are pre-sorted and you use ExtMergeJoin, merging should happen in parallel with reading (a small sketch of such a streaming merge follows).
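For illustration only (a Python sketch, not CloverETL code), a streaming merge join over two inputs already sorted by the join key looks roughly like this; it shows why nothing has to be buffered when the keys are pre-sorted:

```python
def merge_join(left, right, key):
    """Join two record streams that are already sorted by `key`.

    Only the current record from each side is held at a time, so no
    sorting buffer or temp file is needed (join key assumed unique
    on both sides to keep the sketch short).
    """
    left_it, right_it = iter(left), iter(right)
    l, r = next(left_it, None), next(right_it, None)
    while l is not None and r is not None:
        if l[key] < r[key]:
            l = next(left_it, None)          # advance the smaller side
        elif l[key] > r[key]:
            r = next(right_it, None)
        else:
            yield {**l, **r}                 # keys match: emit the joined record
            l, r = next(left_it, None), next(right_it, None)
```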
I hope this helps.
Thanks for your help. I’m attaching a sample graph. It reads data from three input csv files: Invoices, Subscriptions, and Accounts. The first join key is Subscription ID, the second join key is Account ID. (I haven’t supplied any metadata so the graph is just a visual depiction of what I want to do.) The real files contain several GB each. Depending on circumstances, I might need to read the data in from a database instead of from csv files, so I’d like to know how to handle both situations.
Should I be sorting the input files on the join keys? Would that enable me to run the graph without creating intermediate partitioned files?
I’m not at all clear on how to use the Partition component. I see how to select a partition key, but don’t see how to specify what should be done based on the values of the partition key.
-Jim
P.S. I accidentally used Merge joins in the sample graph. The real graph uses Hash joins. Sorry about that. I suspect that you’re going to tell me to use Merge joins. Freudian slip?
Hi Jim,
ExtMergeJoin is good for large inputs of approximately the same size; both inputs have to be sorted by the join key. ExtHashJoin, on the other hand, works well when one input is significantly smaller than the other. That small input is held in memory for the whole time of processing, so I would recommend a size of a few thousand records at most; the inputs do not have to be sorted at all. So it is up to you which one is better for your project.
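As a rough illustration of the hash-join idea (again just a Python sketch, not CloverETL code): the whole small input is kept in a lookup table while the large input is streamed, which is exactly why the small side should stay at a few thousand records.

```python
def hash_join(small, large, key):
    """Join a small input against a large one without any sorting.

    The small side is loaded completely into a dictionary, so it should
    stay at a few thousand records; the large side is streamed row by row.
    """
    lookup = {rec[key]: rec for rec in small}   # whole small input in memory
    for rec in large:
        match = lookup.get(rec[key])
        if match is not None:
            yield {**rec, **match}              # combine the matching records
```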
Regarding the reading of your inputs, may I ask how exactly you want to process them? I am still not sure what you want to do.
The first option is that you want to read and process all the large files just once and never read the same records again, only the records that are later appended to the input files. In that case incremental reading is all you need, see http://doc.cloveretl.com/documentation/ … ading.html
The second option is that you want to read those large files multiple times and work with them as groups sharing the same month. In this case the best solution would be to split them into smaller files as described above. The Partition component is not what you need in this situation; when I talked about partitioning, I meant partitioning of output files according to http://doc.cloveretl.com/documentation/ … files.html
The third option is that you have your input data in some database. In that case you can use a WHERE clause when selecting your input, and your issue with the input files is solved (see the query sketch below).
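For example, a month-bounded WHERE clause could look like the following sketch (the table and column names are hypothetical, and the parameter placeholder style depends on your database driver). Iterating through every month in your data is then just a loop over (year, month) pairs.

```python
from datetime import date

def month_bounds(year, month):
    """First day of the given month and first day of the following month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

# Hypothetical table and column names, just to show the shape of the WHERE clause.
QUERY = """
    SELECT *
    FROM invoices
    WHERE invoice_date >= ? AND invoice_date < ?
"""

def read_month(cursor, year, month):
    start, end = month_bounds(year, month)
    cursor.execute(QUERY, (start, end))   # only one month's records are fetched
    return cursor.fetchall()
```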
If none of these options describes your situation, I would politely ask you to send a closer description of your project and maybe I will come up with a better solution.
Best regards,