Erroring out reading huge data

Hi, I’m facing an issue while trying to read a flat file (a CSV) that is 47.5 GB in total size. I’m trying to read the file, filter it based on some conditions, and load it into the respective database table. But I’m not able to read even 30 million rows, out of a file with 800+ million rows. I have tried increasing the heap size (initial 16384 MB, max 60000 MB), but there is no progress; the graph errors out with a space issue. How can I resolve this? If anyone has a hint, let me know.

Hi Balaji,

CloverDX normally processes data in a streaming way: in your case, it would read your CSV row by row, filter each row, and immediately send it to the DB for writing.
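The streaming model described above can be sketched in plain Python (illustrative only; in CloverDX the readers, filters, and writers are configured as graph components, not hand-written code). The key property is that each row is read, tested, and passed on immediately, so memory use stays constant no matter how large the file is:

```python
import csv
import io

def filter_rows(lines, predicate):
    """Yield matching CSV rows one at a time. Nothing is accumulated,
    so memory use is constant regardless of input size."""
    for row in csv.reader(lines):
        if predicate(row):
            yield row

# Hypothetical sample data: keep rows whose second column is "ACTIVE".
data = io.StringIO("1,ACTIVE\n2,CLOSED\n3,ACTIVE\n")
for row in filter_rows(data, lambda r: r[1] == "ACTIVE"):
    print(row)  # ['1', 'ACTIVE'] then ['3', 'ACTIVE']
```

As long as every component in the pipeline behaves like this generator, a 47.5 GB file is no harder than a 47.5 MB one; problems start when something in the middle has to hold records back.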

If it fails with a space issue, it is probably not a memory issue. One possibility: if you run this in Designer, each edge connecting the components runs in debug mode and captures every record flowing through it for debugging purposes.

A quick thing you can try if you are executing this from Designer is to go to:
Run→Run Configurations→your_graph_name and check “Disable edge debugging” to make sure no debug data are stored.

But you should post the console output with the error Clover displays, and ideally also your transformation graph (the .grf file), so it can be analyzed further.

David

This is not a “need more heap” issue; CloverDX can handle 47.5 GB and 800M+ rows. When a graph dies with a “space” error around 30 million rows, it’s nearly always something breaking streaming.

I’ve seen a few things bite people badly here:

Running with edge debugging enabled from Designer (David has already pointed this out). That alone can quickly consume the disk or heap.

Sort, aggregate, non-streaming joins, lookups with “load all,” or a writer with auto-commit turned off: these are all components that must cache data rather than stream it.
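The difference between these caching components and streaming ones can be shown with a toy contrast in Python (illustrative only, not CloverDX internals): a filter can emit each record as it arrives, but a sort cannot emit anything until it has consumed the entire input, so the whole dataset must be materialized in memory or spilled to disk.

```python
def stream_filter(rows):
    # Streaming: each row is emitted as soon as it is seen; O(1) memory.
    for r in rows:
        if r % 2 == 0:
            yield r

def blocking_sort(rows):
    # Blocking: nothing can be emitted until ALL input is consumed,
    # so the full dataset is held at once (in memory or spilled to disk).
    return sorted(rows)

print(list(stream_filter(range(10))))  # [0, 2, 4, 6, 8]
print(blocking_sort([3, 1, 2]))        # [1, 2, 3]
```

One blocking component anywhere in the pipeline is enough to turn a constant-memory graph into one whose footprint grows with the row count.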

DBOutput with a large commit size or auto-commit disabled → Clover buffers the rows that are awaiting commit.
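Why a large commit size matters: every uncommitted row sits in the open transaction until the commit fires. A minimal illustration of the principle using Python’s sqlite3 (not CloverDX’s DBOutput, and the table name is made up): a second connection sees nothing until the writer commits, at which point the whole buffered batch lands at once.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE t (id INTEGER)")
writer.commit()

reader = sqlite3.connect(path)

# Insert a "batch" of rows without committing: they sit in the open
# transaction, invisible to other connections and pinned by the DB.
for i in range(1000):
    writer.execute("INSERT INTO t VALUES (?)", (i,))

before = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
writer.commit()  # the whole buffered batch becomes visible at once
after = reader.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(before, after)  # 0 1000
```

The larger the batch between commits, the more the database and driver have to hold in flight, which is why a huge commit size can look like a “space” problem on a long-running load.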

A full temporary directory (Clover spills to disk, not heap). Verify the available disk space and -Djava.io.tmpdir.
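A quick, generic way to check which temp directory the OS reports and how much space is free on that volume (a sketch; the directory CloverDX actually uses is whatever -Djava.io.tmpdir, or its own temp-space configuration, points at):

```python
import shutil
import tempfile

tmp = tempfile.gettempdir()      # what the OS reports as the temp dir
usage = shutil.disk_usage(tmp)   # total/used/free bytes on that volume

print(f"temp dir: {tmp}")
print(f"free: {usage.free / 1024**3:.1f} GiB of {usage.total / 1024**3:.1f} GiB")
```

If that volume is small, pointing the JVM at a bigger one via -Djava.io.tmpdir is usually the fix; a 47.5 GB input can easily need tens of GB of spill space once a blocking component is involved.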

Your heap is already enormous at 60 GB. If this were actually heap-related, you would have seen GC thrashing much earlier. Concentrate on edge debugging, the caching components, and temp disk space instead.