I have two files with 10 million+ records and a couple of hundred columns each, both sorted by a common key. The slave file may have records where the key is not populated (I'd like to use an outer join if possible; otherwise I'll just not carry those). I get the impression that HashJoin is very memory intensive - it's failing with a heap error. Is there any way I can leverage my files' pre-sorted status to optimize this? It seems like it would run very fast with that assumption in mind. Or is that another join component only in the non-community edition?
If the answer is the latter and I were to buy the Desktop edition, can I still use the runGraph feature with graphs I create in the Desktop edition? Or is that an Enterprise function?
You're correct that HashJoin can be very memory intensive - it caches slave records in memory. To join large data sets, you need to use the MergeJoin component, which leverages the pre-sorted status of your data. Please see Joining Data for more info on the joiner components. However, the MergeJoin component is not available in CloverETL Community - it contains only the HashJoin joiner component.
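For intuition on why a merge join stays within memory limits, here is a minimal sketch in Python (not CloverETL) of joining two key-sorted files while buffering only one slave key group at a time. The file layout (CSV), key column index, string-compared keys, and file names are assumptions for illustration only, not how the MergeJoin component is implemented internally.

```python
import csv
from itertools import groupby

def merge_left_outer_join(master_path, slave_path, key_col=0):
    """Stream a left-outer merge join of two CSV files that are both
    sorted ascending on the same key column (compared as strings).
    Only the slave rows for the current key are held in memory,
    so memory use stays roughly constant regardless of file size."""
    with open(master_path, newline="") as mf, open(slave_path, newline="") as sf:
        master = csv.reader(mf)
        # Group consecutive slave rows by key; skip slave rows with an empty key.
        slave_groups = groupby(
            (row for row in csv.reader(sf) if row[key_col]),
            key=lambda row: row[key_col],
        )
        slave_key, slave_rows = None, []

        def advance():
            # Pull the next key group from the slave file, if any remain.
            nonlocal slave_key, slave_rows
            try:
                slave_key, rows = next(slave_groups)
                slave_rows = list(rows)  # buffer just this one key's rows
            except StopIteration:
                slave_key, slave_rows = None, []

        advance()
        for m in master:
            k = m[key_col]
            # Discard slave groups whose key sorts before the current master key.
            while slave_key is not None and slave_key < k:
                advance()
            if slave_key == k:
                for s in slave_rows:
                    yield m + s   # matched master/slave pair
            else:
                yield m          # unmatched master row, kept (outer join)

# Hypothetical usage:
# for joined in merge_left_outer_join("master.csv", "slave.csv", key_col=0):
#     print(joined)
```

The key point is that both inputs are read sequentially exactly once, which is why pre-sorted data makes this kind of join fast and cheap on memory, whereas a hash join must first load the entire slave side.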
CloverETL Desktop contains the RunGraph component - is this the runGraph feature you meant?
I just want to verify: if I buy the Desktop edition for Windows, can I take my graph over to my Linux box and run it via the command line (as I've been able to do with the Community edition)? I don't have any kind of GUI desktop set up for accessing my more powerful Linux box.