How to Proceed with Large Joins

I have a series of large joins I would like to perform on Windows Server 2008 R2.

The first join seems to be presenting some problems and I would like some suggestions on how to proceed. Joining a small amount of data appears to work just fine. Once I increase the data set, however, I begin to hit Java heap errors. I increased the heap substantially, but I get the error:

Error occurred during initialization of VM
Could not reserve enough space for object heap

I’ve lowered my heap to about 1.5 GB, which now seems to let the graph start, but I run out of heap space about halfway through my first join.
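(For anyone reproducing this, a minimal check of what the running JVM can actually address looks like the sketch below; sun.arch.data.model is a HotSpot-specific property, and the 1.2–1.5 GB figure is the typical contiguous address space a 32-bit JVM can reserve on Windows, which matches the failure above.)

```java
// Minimal heap/bitness check; sun.arch.data.model is HotSpot-specific.
// A 32-bit JVM on Windows typically cannot reserve much more than
// ~1.2-1.5 GB of contiguous address space for the heap, which matches
// the "Could not reserve enough space for object heap" error above.
public class HeapCheck {
    public static void main(String[] args) {
        System.out.println("JVM bits: "
                + System.getProperty("sun.arch.data.model"));
        System.out.println("Max heap (MB): "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```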

Let me explain my graph further. I have two universal file readers that flow into an ExtHashJoin, which then flows into a Trash component (for now). Port 1 (the slave) of the ExtHashJoin has about 725,000+ records, stored in about 750 MB of flat files; each record has about 50+ fields. Port 0 (the master) has about 60 GB of data in flat files; each record has just under 200 fields.

Next up, I have almost 200 other joins to perform, as well as other processing.

So, now to my questions:

  • I’m already running out of heap space. I’ve adjusted it about as far as I can in my current configuration. Is that correct?

  • As I understand it, I can switch JVMs to try to boost the amount of heap space available. Any recommendations or limitations for CloverETL?

  • Will switching OSes from Windows Server 2008 R2 to Linux overcome the RAM limitations?

  • Is there some other graph configuration that would be more amenable to the data volume I need to work with?

Thanks,
Brad

Hello Brad,
ExtHashJoin should only be used for joining data with a small slave set, as it reads all slave data into memory. For joining large data sets you should use ExtMergeJoin instead. It requires its inputs to be sorted, but sorting the data with ExtSort or FastSort is still better than fighting OutOfMemory errors. Another solution would be LookupJoin, if you are able to store the slave data in a lookup table: a database lookup if the key has duplicates, or PersistentLookup if the key is unique.
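(To illustrate why the slave side dominates memory use, here is a minimal sketch of the general hash-join pattern; this is not CloverETL’s actual implementation, and the Row type and map layout are made up for illustration. Note also that in-memory Java object representations of flat-file data often inflate severalfold, so 750 MB of slave flat files can easily exhaust a 1.5 GB heap.)

```java
import java.util.*;

public class HashJoinSketch {
    record Row(String key, String payload) {}

    static void hashJoin(Iterable<Row> slave, Iterable<Row> master) {
        // Build phase: the ENTIRE slave set is held on the heap, so
        // memory use scales with slave size, not master size.
        Map<String, List<Row>> built = new HashMap<>();
        for (Row s : slave) {
            built.computeIfAbsent(s.key(), k -> new ArrayList<>()).add(s);
        }
        // Probe phase: master rows stream through one at a time and can
        // be discarded immediately after they are matched.
        for (Row m : master) {
            for (Row s : built.getOrDefault(m.key(), List.of())) {
                System.out.println(m.payload() + " | " + s.payload());
            }
        }
    }

    public static void main(String[] args) {
        hashJoin(
            List.of(new Row("k1", "slave-1"), new Row("k2", "slave-2")),
            List.of(new Row("k1", "master-1"), new Row("k3", "master-3")));
        // Prints only the k1 match (inner-join semantics in this sketch).
    }
}
```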

Hi Agata,

I ran a set of test data through using both the Hash method and the FastSort-to-MergeJoin method. The Hash method significantly outperformed the MergeJoin: 2,084 versus 2,732 seconds.

So the switch obviously isn’t a performance win, but I’m wondering if you could speculate as to why that might be. I’ve since switched to a 64-bit JRE, and we have a lot of RAM available on the server (100+ GB). Would that be a sufficient explanation for the difference?

I tried using FastSort with the InMemoryOnly flag on, but that seemed to take even longer.

The data we are joining is already in a database, but I have been asked to investigate the performance implications of moving the join into Clover. Do you have a gut reaction to that endeavour?

Thank you,
Brad

Any thoughts?

Based upon what I’ve seen, I suspect the DB will win the head-to-head join. Where I suspect Clover will begin to take back ground is on the parallelization front: its ability to perform many joins and transformations in parallel, instead of being constrained to running them one after another.

Any sense of when the advantages of parallelization will overcome a series of single-threaded DB joins?

Brad

Hi Brad,

Database vs. Clover join
I wouldn’t be too surprised if a join in the database performs better than in Clover. The reason is that the data is already stored in the database, so an in-database join can be really efficient, provided the database indexes are built and up to date. On the other hand, reading the data out of the database to join it externally in Clover may be less efficient. In general it depends on:

  • What the join selectivity is. Does the join produce significantly less data than what is available in the two data sources?

  • What the target system is.

  • What the subsequent processing of the joined data looks like, if there is any at all.

On the other hand, joining data in-database may cause other complications. In such cases, Clover can offer similar or better performance. Below I list just a few typical scenarios where using Clover for external joins is beneficial:

  • (Unexpected) database server utilization during in-database joins may be undesirable.

  • Database tables may be missing some of the indexes required for efficient in-database joins. If it is not possible to build these indexes (due to server utilization, space, etc.), an in-database join may be very inefficient.

  • Joining data from various sources would require loading the data into the database first in order to join it.

  • Performing further transformations would require considerable implementation effort, e.g. coding in PL/SQL.

  • Pipeline processing of the data throughout the whole transformation. Although the join itself may be slower in Clover, the whole transformation, including further processing, may be faster than an equivalent one implemented in the database in PL/SQL.

Heap Space Issues
After switching to 64-bit Java, you should be able to allocate several gigabytes of memory for it. If you are running a JDK, I would also recommend using the -server option, which enables better optimizations and offers higher performance.
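(For example, a launch along the lines sketched below; -server, -Xms and -Xmx are standard HotSpot flags, while the 8g value and the main class are placeholders, not tuned recommendations.)

```java
// Illustrative launch (heap size and main class are placeholders):
//   java -server -Xms8g -Xmx8g -cp clover.jar com.example.RunGraph
// A quick way to confirm which VM variant is actually running:
public class VmInfo {
    public static void main(String[] args) {
        // On HotSpot, a name containing "Server VM" confirms -server.
        System.out.println(System.getProperty("java.vm.name"));
    }
}
```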

Best Regards,
Jan

Hi Jan,

Thanks for the response.

The joins are almost always left joins (sometimes up to 190 of them) and will therefore produce a lot more data. There is some filtering done, and it seems there will definitely be benefits to reordering components to maximize its impact.

The database is definitely up to date, and it is only being used as an ETL alternative. All indices identified by query optimizers and execution plans have been built. As the data comes out of the DB there is a small amount of processing to perform, but nothing substantial. The target system is a server process, and the data is communicated from CloverETL to this server process via web services.

Thoughts?
Brad

Hi Brad,

Join optimization
You are absolutely correct: if you can, filter out data before performing any sorts, joins or other processing. Reducing data volume improves performance significantly (depending on the selectivity of the filtering condition). Similarly, if there is some work to be done on the not-yet-joined data, it may be worth doing it before the joins, as joins are likely to multiply the number of records on the output.
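(A toy illustration of the principle, in plain Java rather than a Clover graph: both pipelines below produce the same result, but the first sorts only the rows that survive the filter, so the O(n log n) step sees far fewer records.)

```java
import java.util.*;
import java.util.stream.*;

public class FilterEarly {
    public static void main(String[] args) {
        List<Integer> rows = new Random(42).ints(1_000_000, 0, 1000)
                .boxed().collect(Collectors.toList());

        // Efficient: filter first, then sort the (much smaller) remainder.
        List<Integer> good = rows.stream()
                .filter(r -> r < 10)
                .sorted()
                .collect(Collectors.toList());

        // Wasteful: sort everything, then throw most of it away.
        List<Integer> bad = rows.stream()
                .sorted()
                .filter(r -> r < 10)
                .collect(Collectors.toList());

        System.out.println(good.equals(bad)); // true: same output, less work
    }
}
```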

Outbound processing
It seems rather strange to me that you plan to push such a huge volume of data through a web service interface. If a batch interface is available, I would expect it to be more efficient.

Anyway, do I understand correctly that you would like to move all your processing from the database into an ETL tool? If so, will the source still be the database? Is there anything else I can help with?

Regards,
Jan

Hi Jan,

Thanks for the response.

In the “reducing data” vein, is there any benefit to making the records as small as possible during most of the processing (say, only the critical join key information), then performing the joins we need (~190 of them), and then fattening the records back up at the end with a final series of joins?

All data sent to the platform we are working with goes over web services (I believe; that detail is abstracted away from us). The platform does batch incoming records, and that batch size is a configurable parameter on the CloverETL component that has been provided.

Yes, we are evaluating how much, and precisely which processing it makes sense to move into CloverETL. At this point, I think we’re going into the whole thing with an open mind either way.

The source is originally flat files, but we currently load them into a database, where they live. The one benefit of that is that we build indices once rather than every time a set of data passes through for processing (at the expense of additional space, of course).

Cheers,
Brad

Thank you for the answers.

Hi Brad,

You are correct about reducing the number and size of records during processing. The general recommendations for efficient transformation development are:

  • Do filtering as soon as possible. Typical scenarios include filtering before sorting, and filtering before complex reformats and joins.

  • Drop unused fields. If you won’t need some of the fields coming from the source, get rid of them as soon as possible. Reducing record size optimizes memory and disk usage while the transformation runs, which allows you to run the transformation faster and/or run more transformations in parallel with the same HW resources. (See the sketch after this list.)

  • Avoid breaking pipeline processing. Some components break pipeline processing - typically sorts and unsorted aggregations. Use such components only when necessary.

  • Avoid unnecessary sorting. Many components preserve the order of incoming records, so if the input is sorted by the first field, the output of the component is still sorted by the first field. Some components break this rule - see the Component Reference section in our CloverETL documentation for details.
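(As a sketch of the “drop unused fields” point, in plain Java rather than a Clover graph; the record shapes are made up for illustration. Projecting a wide record down to just the fields the join needs shrinks every buffer, sort run and hash table downstream.)

```java
import java.util.stream.*;

public class DropFieldsEarly {
    record Wide(String key, String a, String b, String c, String unusedBlob) {}
    record Narrow(String key, String a) {}

    // Keep only the join key and the one field needed downstream.
    static Stream<Narrow> project(Stream<Wide> in) {
        return in.map(w -> new Narrow(w.key(), w.a()));
    }

    public static void main(String[] args) {
        Stream<Wide> in = Stream.of(
                new Wide("k1", "a1", "b1", "c1", "<large unused blob>"));
        project(in).forEach(n -> System.out.println(n.key() + " " + n.a()));
    }
}
```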

If I understand your last post correctly, you suggest first dropping all fields except the join key, doing the joins, and then joining the required fields back on at the end. I don’t think this would be faster than doing the joins with all required fields directly.

Hope this helps.

Regards,
Jan