Heya,
I discovered that we had a significant slowdown when using lookup tables (dbLookup type), so I am doing some research on how to make this better. I have one possible solution, but want to make sure (1) I understand what it is doing, (2) it is the best solution and (3) it won’t cause any issues with multithreading or anything like that.
We were not using a cache and were calling free() on the lookup table after every lookup - not good; lots of CPU time spent sleeping instead of working. BUT, if we didn't call free(), we were getting a "maximum number of cursors" error because we were creating a new Lookup object for every lookup. Based on the example here (http://wiki.clovergui.net/doku.php?id=g … s#dblookup - needs updating for newer Clover versions, I am using v2.8.0), in init() I initialized my LookupTable and Lookup objects and created a DataRecord object to use in Lookup.seek(), then just called Lookup.seek(DataRecord) in transform(), substituting in the values I needed in the DataRecord. In finished(), I called LookupTable.free(). In the lookup table's configuration, I set maxCached to a reasonable number and set storeNulls to true. For my example graph, the average run went from 35 minutes to 13 minutes - much better!
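To make the lifecycle concrete, here is a self-contained stand-in for the pattern I'm describing (plain Java with hypothetical names - the real classes are CloverETL's LookupTable/Lookup/DataRecord, which I can't paste here): create everything once in init(), reuse it for every record in transform(), and free once in finished().

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for the init()/transform()/finished() lookup lifecycle.
// All names here are hypothetical, not CloverETL API.
class LookupLifecycleSketch {
    static Map<String, String> table;           // stands in for the LookupTable
    static Map<String, String> cache;           // stands in for the Lookup's maxCached cache
    static final int MAX_CACHED = 1000;         // analogous to maxCached

    // init(): open the table and create the reusable lookup/cache ONCE.
    static void init() {
        table = new HashMap<>();
        table.put("42", "answer");
        cache = new HashMap<>();
    }

    // transform(): reuse the same lookup; only the key value changes per record.
    static String seek(String key) {
        if (cache.containsKey(key)) {
            return cache.get(key);              // cache hit; containsKey also covers cached nulls (storeNulls=true)
        }
        String value = table.get(key);          // stands in for the expensive DB round trip
        if (cache.size() < MAX_CACHED) {
            cache.put(key, value);
        }
        return value;
    }

    // finished(): free the table once, after all records are processed.
    static void free() {
        table = null;
        cache = null;
    }
}
```

The point is just that the expensive open/free happens exactly once per run, not once per record.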
Here is where it gets complicated and I am getting confused.
We have 4 REFORMAT nodes that need to use these lookups - they all extend an abstract class that extends DataRecordTransform. The abstract class has an init() method that initializes all the LookupTable objects when the first node class calls init(), and the last node's class has its own finished() method that calls LookupTable.free() to clear everything.
Currently, each node class has its own set of Lookup and DataRecord objects (stored in private Map instance members in the base class), initialized in their init() methods (they all call super.init() to make sure the LookupTable was initialized!). Doesn't this mean that each node has its own Lookup cache? If I'm reading it right, node 1 could cache, say, 100 results that I could reuse in node 2, but node 2 has its own Lookup object and will store those same 100 results again.
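Here's a tiny sketch of the duplication I mean (plain Java, hypothetical names - each "node" stands in for one REFORMAT node with its own private cache):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates per-node caches: the same key looked up in two nodes
// is fetched and cached twice, once in each node's private cache.
class PerNodeCacheSketch {
    static class Node {
        final Map<String, String> cache = new HashMap<>(); // one cache per node, as in my current setup

        String seek(String key) {
            // computeIfAbsent stands in for "cache miss -> hit the database".
            return cache.computeIfAbsent(key, k -> "value-for-" + k);
        }
    }

    // Both nodes miss on the same key and each caches its own copy:
    // one distinct key ends up occupying two cache entries.
    static int totalCached(Node a, Node b) {
        a.seek("X");
        b.seek("X");
        return a.cache.size() + b.cache.size();
    }
}
```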
I could give them a common set by making the Maps static class members in my base class instead, but I'm concerned that if I do, I could end up with bad lookup results when two nodes try to use the same Lookup/DataRecord objects at the same time (judging by the example code, this doesn't seem to be a problem within a single node?).
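For what it's worth, the usual way to get a shared cache without sharing the mutable key record is something like the following sketch (plain Java, hypothetical names, not CloverETL API): the cache itself is a thread-safe shared structure, while each thread/node gets its own private copy of the mutable "record" it fills in before seeking.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: one cache shared by all nodes, but the mutable key object is per-thread.
class SharedCacheSketch {
    // Shared across all nodes - safe because ConcurrentHashMap handles concurrent access.
    static final Map<String, String> SHARED_CACHE = new ConcurrentHashMap<>();

    // Stands in for the DataRecord each node mutates before calling seek().
    // Sharing ONE mutable record across threads would race; ThreadLocal gives
    // each thread its own instance instead.
    static final ThreadLocal<StringBuilder> KEY_RECORD =
            ThreadLocal.withInitial(StringBuilder::new);

    static String seek(String key) {
        StringBuilder rec = KEY_RECORD.get();
        rec.setLength(0);
        rec.append(key);                        // per-thread mutation, no race with other nodes
        // computeIfAbsent stands in for "cache miss -> hit the database".
        return SHARED_CACHE.computeIfAbsent(rec.toString(), k -> "value-for-" + k);
    }
}
```

The design point: the race isn't in the cache reads themselves, it's in two threads stomping on one mutable key record between filling it in and seeking with it, so that's the part that must stay per-thread.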
If I don't do this, there doesn't seem to be a way to clear a Lookup object (DBLookup.close() is not part of the Lookup interface), and you can't get a Lookup object back from its enclosing LookupTable. Does this mean that each Lookup and its cache stick around until I call LookupTable.free() at the very end? So far I haven't run out of memory, but I could see it becoming a problem for larger lookup tables.
Hope the above makes sense! I am pretty happy with the speed-up, but am worried that I could run out of memory with my current solution, or hit multithreading issues if I try to share the same Lookup/DataRecord objects across 4 nodes.
The other solutions I've been tasked to look at are to dump the lookup table to a flat file and read it in with a DataReader, or to figure out whether I can pull the entire lookup table into memory (DB_INPUT_TABLE or LOOKUP_TABLE_READER_WRITER?). Is there any documentation you can point me to, or do you think the path I'm going down is better than those options?
Thanks,
Anna