I am attempting to ‘join’ a CSV file with a database table to produce another CSV file. The table has the columns that I need; the CSV file is used to ‘filter’ the table’s records, essentially including only the rows from the table that match the names in the CSV file.
When you use ExtHashJoin, keep in mind that this joiner caches the slave data in memory. Because of this, you should avoid using it when the input on the slave port is large. That is also why you see this issue only on large datasets.
I can see that you have very few records coming into ExtHashJoin through the master port. You can resolve this issue by swapping the ports (so the master becomes the slave, and the slave becomes the master). Then you will have more records on the master port and fewer on the slave port.
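To illustrate why the port assignment matters, here is a minimal sketch of a hash join in plain Python. It is not ExtHashJoin's actual implementation, only the general technique: the small input is cached in memory and the large input is streamed past it. The function name `hash_join_filter`, the column names, and the sample data are all hypothetical.

```python
import csv
import io


def hash_join_filter(small_rows, large_rows, key):
    """Yield the rows of large_rows whose key value appears in small_rows."""
    # Build phase: only the small input (the "slave" side) is held in memory.
    wanted = {row[key] for row in small_rows}
    # Probe phase: the large input (the "master" side) is streamed row by row.
    for row in large_rows:
        if row[key] in wanted:
            yield row


# Hypothetical sample data standing in for the CSV file and the table.
names_csv = io.StringIO("name\nalice\ncarol\n")
table_csv = io.StringIO("name,city\nalice,Oslo\nbob,Rome\ncarol,Lima\n")

matches = list(
    hash_join_filter(
        csv.DictReader(names_csv),  # small input: the filter names
        csv.DictReader(table_csv),  # large input: the table rows
        key="name",
    )
)
```

In this sketch, memory use grows with the number of distinct keys in the small input, not with the size of the large one, which is the reason for putting the smaller input on the cached side.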
That worked. The video tutorial didn’t mention the caching bit; it makes sense, however. When I was testing with a CSV representation of the table (21 MB), I would get Java heap exceptions, which should have given me a clue.