Optimizing ExtSort for a very large file

Hi all,

I have a graph that keeps failing with a GC out of memory error in the ExtSort component. The file to be sorted contains millions of records and is over 2GB in size. Is there a way to configure ExtSort itself to avoid this problem? I was looking at this page in the help:
http://doc.cloveretl.com/documentation/ … tsort.html

I noted the number of tapes setting: should I increase this dramatically to support sorting millions of records?

Any other suggestions?

Hi anyeone,

That is quite a strange issue. In most cases, ExtSort should be able to handle that number of records without causing a GC out of memory error. For the moment, I would recommend putting the ExtSort in a different phase, which should force disk swapping and allow some of the memory to be released. Could you also answer the following questions:

  • CloverETL Designer and Server version

  • Your current memory settings

  • If possible, please attach your graph (remove any sensitive data)

The graphs involved are extremely complicated, and ExtSort occurs in many places in the process, so it would be hard to send them to you in a usable form.

We upped the JVM heap to 8GB and that seems to have resolved the issue at least for now, but it seems to me that there should be a way to optimize it, and if we ever get even larger data sets we could end up having a ridiculously sized heap requirement.

I will try putting the sorts in their own phases and see if that helps as well. I did notice that the sort sometimes runs in parallel with a bunch of other tasks, so that could be affecting it.

Hi Anye,

Yes, indeed: as the size of the files being processed increases, more resources should be available to the JVM. Increasing the JVM heap memory assigned to the process is therefore the right decision. Just please be aware that you should not use the whole capacity of the physical RAM (especially if CloverETL is not the only thing running on that physical machine). The general recommendation is to assign half of the physical RAM to the CloverETL heap memory, but it depends on many circumstances (direct memory enabled/disabled and so on).
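
As a rough illustration of that rule of thumb, here is a small Python sketch that turns an amount of physical RAM into a JVM -Xmx value. The function name and the default fraction parameter are only for illustration and are not part of CloverETL; treat the result as a starting point, not a fixed rule.

```python
def suggested_heap_flag(physical_ram_gb: float, fraction: float = 0.5) -> str:
    """Return a JVM -Xmx flag sized to a fraction of physical RAM.

    Hypothetical helper: the 0.5 default mirrors the "half of the
    physical RAM" recommendation and should be lowered when other
    software shares the machine.
    """
    if physical_ram_gb <= 0:
        raise ValueError("physical_ram_gb must be positive")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    heap_mb = int(physical_ram_gb * 1024 * fraction)
    return f"-Xmx{heap_mb}m"


print(suggested_heap_flag(16))  # prints -Xmx8192m
```

On a 16GB machine this gives the 8GB heap mentioned above; on a machine that also hosts other services, a smaller fraction would be the safer choice.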

I would also like to point out again that using phases in the design might help distribute the resources more efficiently, as Pedro suggested in his update.

Nevertheless, if you want to learn more about the ExtSort component and possibly other sorting options in CloverETL, you can read the following blog posts:

https://blog.cloveretl.com/sorting-data-extsort-vs-fastsort
https://blog.cloveretl.com/sorting-data-extsort-vs-fastsort-part-2

Note that these articles are from 2010 and might not be up to date in all details, but the principle hasn’t changed.
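
To give a rough picture of that principle, below is a minimal Python sketch of a generic external merge sort: sort chunks that fit in memory, write each one to a temporary file, then merge the sorted files. This is not CloverETL's implementation. The function name and the chunk_size parameter are made up for the example, and treating each temporary run file as something like a "tape" is my assumption.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(records, chunk_size):
    """Yield string records in sorted order, keeping at most
    chunk_size records in memory during the chunk-sorting pass.

    Illustrative only: records must be strings without newlines.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    run_paths = []
    try:
        # Pass 1: sort fixed-size chunks in memory and spill each to disk.
        source = iter(records)
        while True:
            chunk = sorted(islice(source, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".run", text=True)
            with os.fdopen(fd, "w", encoding="utf-8") as run_file:
                run_file.writelines(record + "\n" for record in chunk)
            run_paths.append(path)

        # Pass 2: merge the sorted runs, reading them lazily line by line.
        files = [open(path, encoding="utf-8") for path in run_paths]
        try:
            runs = [(line.rstrip("\n") for line in f) for f in files]
            yield from heapq.merge(*runs)
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary run files once the merge is done.
        for path in run_paths:
            os.remove(path)


print(list(external_sort(["pear", "apple", "fig", "kiwi"], chunk_size=2)))
# prints ['apple', 'fig', 'kiwi', 'pear']
```

The point of the sketch is that memory use is governed by the chunk size and the number of runs being merged, not by the total size of the input, which is why a sort like this should not in principle need a heap that grows with the file.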

Please let me know if this is what you have been looking for or if any follow-up question arises.

Best Regards, Eva