I have two files with 10 million+ records and a couple of hundred columns each, both sorted by a common key. The slave file may have records where the key is not populated (I'd like to use an outer join if possible; otherwise I'll just not carry those). I get the impression that HashJoin is very memory intensive - it's failing with a heap error. Is there any way I can leverage my files' pre-sorted status to optimize this? It seems like it would run very fast with that assumption in mind. Or is that another join component only in the non-community edition?
If the answer is the latter and I were to buy the Desktop edition, can I still use the runGraph feature with graphs I create in the Desktop edition? Or is that an Enterprise function?
You're correct that HashJoin can be very memory intensive - it caches slave records in memory. To join large data sets, you need to use the MergeJoin component, which leverages the pre-sorted status of your data. Please see Joining Data for more info on the joiner components. However, the MergeJoin component is not available in CloverETL Community - it contains only the HashJoin joiner component.
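For intuition on why a merge join stays within memory limits, here is a minimal sketch in Python (not CloverETL) of joining two key-sorted files while buffering only one slave key group at a time. The file layout (CSV), key column index, string-compared keys, and file names are assumptions for illustration only, not how the MergeJoin component is implemented internally.

```python
import csv
from itertools import groupby

def merge_left_outer_join(master_path, slave_path, key_col=0):
    """Stream a left-outer merge join of two CSV files that are both
    sorted ascending on the same key column (compared as strings).
    Only the slave rows for the current key are held in memory,
    so memory use stays roughly constant regardless of file size."""
    with open(master_path, newline="") as mf, open(slave_path, newline="") as sf:
        master = csv.reader(mf)
        # Group consecutive slave rows by key; skip slave rows with an empty key.
        slave_groups = groupby(
            (row for row in csv.reader(sf) if row[key_col]),
            key=lambda row: row[key_col],
        )
        slave_key, slave_rows = None, []

        def advance():
            # Pull the next key group from the slave file, if any remain.
            nonlocal slave_key, slave_rows
            try:
                slave_key, rows = next(slave_groups)
                slave_rows = list(rows)  # buffer just this one key's rows
            except StopIteration:
                slave_key, slave_rows = None, []

        advance()
        for m in master:
            k = m[key_col]
            # Discard slave groups whose key sorts before the current master key.
            while slave_key is not None and slave_key < k:
                advance()
            if slave_key == k:
                for s in slave_rows:
                    yield m + s   # matched master/slave pair
            else:
                yield m          # unmatched master row, kept (outer join)

# Hypothetical usage:
# for joined in merge_left_outer_join("master.csv", "slave.csv", key_col=0):
#     print(joined)
```

The key point is that both inputs are read sequentially exactly once, which is why pre-sorted data makes this kind of join fast and cheap on memory, whereas a hash join must first load the entire slave side.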
CloverETL Desktop contains the RunGraph component - is this the runGraph feature you meant?
I just want to verify: if I buy the Desktop edition for Windows, can I take my graph over to my Linux box and run it via the command line (as I've been able to do with the Community edition)? I don't have any kind of GUI desktop set up for accessing my more powerful Linux box.