Using RunGraph with Large Amounts of Data

Hello,

We are working with a large amount of data and have a number of people working on the project. There is a logical piece of processing that we want to put in a subgraph to promote reuse. It performs a large number of generic transformations, but we would like to use them on separate data streams.

Our idea right now is to have a series of specialized graphs to cover the specific loading concerns of each data source, then run the data through a transformation graph that will take the place of the data mapping portion of the processing in a generic, centrally-defined way.

My immediate concern is that the only way I can see to pass data to the subgraph is via the graph name and some parameters. So if I have 300 GB of data to pass through, I’m likely going to have to write it to a temp Clover Metadata file, and pass that filename to the subgraph. That obviously incurs a 300 GB write/read cycle, something we are loath to do.

Several times I feel like I’m not understanding the Tao of CloverETL. How should I be approaching this problem in a Clover-y way?

Thanks,
Brad

Hello Brad,
This is the right solution. The only way to reuse part of a graph is to write a separate graph that does the job and run it with the RunGraph component. Use the Clover internal format (CloverDataReader/CloverDataWriter) for the input and output of such a graph. Data in this format is read and written very fast, but unfortunately it requires a lot of disk space (much more than flat data). You can pack the data into a zip file, but, of course, reading and writing compressed data is slower than reading and writing uncompressed data.
Real subgraphs are planned for CloverETL 4.
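
To get a feel for the disk-space side of the staging trade-off described above before committing 300 GB to it, a small stand-in experiment may help. The sketch below is plain Python, not CloverETL: it writes newline-delimited text rather than the Clover internal format, and the function names `stage_records` and `read_staged` and the member name `data.txt` are made up for illustration. It only shows the general pattern of writing a staging file, optionally zipped, and streaming it back.

```python
import os
import zipfile

MEMBER = "data.txt"  # hypothetical name of the single entry inside the zip


def stage_records(records, path, compress=False):
    """Write records one per line to a staging file; return its size in bytes."""
    if compress:
        with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            # force_zip64 allows the entry to grow past 2 GiB when streaming.
            with zf.open(MEMBER, "w", force_zip64=True) as out:
                for rec in records:
                    out.write(rec.encode("utf-8") + b"\n")
    else:
        with open(path, "wb") as out:
            for rec in records:
                out.write(rec.encode("utf-8") + b"\n")
    return os.path.getsize(path)


def read_staged(path, compress=False):
    """Stream records back from a staging file written by stage_records."""
    if compress:
        with zipfile.ZipFile(path) as zf:
            with zf.open(MEMBER) as src:
                for line in src:
                    yield line.rstrip(b"\n").decode("utf-8")
    else:
        with open(path, "rb") as src:
            for line in src:
                yield line.rstrip(b"\n").decode("utf-8")
```

Running both variants on a representative sample of your own data, and timing them, gives a rough idea of how much space zipping saves and how much speed it costs. The numbers for the real Clover format will differ, so treat this only as a way to frame the measurement.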