HashJoin in Community edition

So I have two files with 10 million+ records and a couple of hundred columns each, both sorted by a common key. The slave file may have records where the key is not populated (I’d handle those with an outer join if possible; otherwise I’ll just not carry them). I get the impression that HashJoin is very memory intensive: it’s failing with a heap error. Is there any way I can leverage my files’ pre-sorted status to optimize this? It seems like the join would run very fast with that assumption in mind. Or is that another joiner in the non-Community edition?

If the answer is the latter and I were to buy the Desktop edition, can I still use the runGraph feature with graphs I create in the Desktop edition? Or is that an Enterprise function?

Thanks for your help.

Jeff

Hi Jeff,

You’re correct that HashJoin can be very memory intensive - it caches the slave records in memory. To join large data sets, you need to use the MergeJoin component, which leverages the pre-sorted status of your data. Please see Joining Data for more info on joiners. However, the MergeJoin component is not available in CloverETL Community - it contains only the HashJoin joiner component.
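
To illustrate why pre-sorted input matters, here is a minimal sketch of a merge-style left outer join in plain Python. It is not CloverETL code and says nothing about how MergeJoin is implemented internally; the function name `merge_join`, the `key` parameter, and the tuple record layout are made up for the example. It assumes both inputs are sorted ascending by key and that every master record has a populated key. Slave records with an empty key are skipped, since they can never match.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(master, slave, key=itemgetter(0)):
    """Left outer join of two iterables already sorted ascending by key.

    Only the slave records for the current key are held in memory,
    whereas a hash join has to cache the whole slave input first.
    """
    # Slave records with an unpopulated key can never match, so drop them.
    slave_groups = groupby((r for r in slave if key(r) is not None), key=key)
    slave_key, slave_rows = None, []
    exhausted = False

    for m in master:
        k = key(m)
        # Advance the slave side until its key catches up with the master key.
        while not exhausted and (slave_key is None or slave_key < k):
            try:
                slave_key, grp = next(slave_groups)
                slave_rows = list(grp)
            except StopIteration:
                exhausted = True
                slave_key, slave_rows = None, []
        if slave_key is not None and slave_key == k:
            for s in slave_rows:
                yield m, s
        else:
            # Outer join: keep the master record even without a match.
            yield m, None


master = [(1, "a"), (2, "b"), (4, "d")]
slave = [(None, "x"), (1, "A"), (3, "C"), (4, "D"), (4, "DD")]
joined = list(merge_join(master, slave))
```

Because both sides are consumed in a single forward pass, memory use depends on the largest group of slave records sharing one key rather than on the total size of the slave file.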

CloverETL Desktop contains the RunGraph component - is this the runGraph feature you meant?

Best regards,
Jaro

I just want to verify: if I buy the Desktop edition for Windows, can I take my graph over to my Linux box and run it from the command line (as I’ve been able to do with the Community edition)? I don’t have any kind of GUI desktop setup for accessing my more powerful Linux box.

Thanks!

Hi Jeff,

To be able to answer your question: how exactly were you running the graphs on Linux?

Best regards,
Jaro