Question about LookupTable

Heya,

I discovered that we had a significant slowdown when using lookup tables (dbLookup type), so I am doing some research on how to make this better. I have one possible solution, but want to make sure (1) I understand what it is doing, (2) it is the best solution and (3) it won’t cause any issues with multithreading or anything like that. :slight_smile:

We were not using a cache and were calling free() on the lookup table after every lookup - not good; lots of CPU sleeping. BUT, if we didn’t call free(), we were getting a “max number of cursors” error because we were creating a Lookup object each time we did a lookup. Based on the example here (http://wiki.clovergui.net/doku.php?id=g … s#dblookup - needs updating for newer Clover versions; I am using v2.8.0), in init() I initialized my LookupTable and Lookup objects and created a DataRecord object to use in Lookup.seek(), then just called Lookup.seek(DataRecord) in transform(), substituting in the values I needed in the DataRecord. In finished(), I called LookupTable.free(). In the lookup table definition, I set maxCached to a reasonable number and set storeNulls to true. For my example graph, it went from an average 35-minute run to a 13-minute run - much better!
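To make sure I understand what maxCached is buying us, here is a little plain-Java sketch of the idea (hypothetical names, not the CloverETL classes): a bounded, least-recently-used cache in front of the expensive DB call, so repeated keys skip the round trip.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Conceptual sketch of a bounded lookup cache (like dbLookup's maxCached):
 *  repeated keys skip the expensive DB round trip. Names are hypothetical,
 *  not CloverETL API. */
class CachedLookup<K, V> {
    private final Function<K, V> backend;   // stands in for the DB query
    private final Map<K, V> cache;
    private int hits = 0, misses = 0;

    CachedLookup(Function<K, V> backend, int maxCached) {
        this.backend = backend;
        // access-ordered LinkedHashMap that evicts the least recently used
        // entry once the cache grows past maxCached
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, V> e) {
                return size() > maxCached;
            }
        };
    }

    V seek(K key) {
        if (cache.containsKey(key)) {
            hits++;                      // with storeNulls, cached nulls
            return cache.get(key);       // would count as hits too
        }
        misses++;
        V value = backend.apply(key);    // "query the database"
        cache.put(key, value);
        return value;
    }

    int hits()   { return hits; }
    int misses() { return misses; }
}
```

With this in place, two seeks for the same key cost one backend call, which is essentially where the 35-minute-to-13-minute drop came from.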

Here is where it gets complicated and I am getting confused.

We have 4 REFORMAT nodes that need to use these lookups - they all extend an abstract class that extends DataRecordTransform. The abstract class has an init() method that initializes all the LookupTable objects when the first node class calls init(), and then the last node’s class has its own finished() method that calls LookupTable.free() to clear everything.

Currently, each node class has its own set of Lookup and DataRecord objects (stored in private Map instance members in the base class), initialized in their init() methods (they all call super.init() to make sure LookupTable was initialized!). This means that each node has its own Lookup cache? If I’m looking at this right, node 1 could cache, say, 100 results that could be reused by node 2, but node 2 has its own Lookup object and will store its own copy of those 100 results.

I could give them a common set by creating static class Map members in my base class instead, but I’m concerned that if I do that I could end up with bad lookup results if two nodes are trying to use the same Lookup/DataRecord objects at the same time (this doesn’t seem to be an issue with a single node, per the example code?).

If I don’t do this, there doesn’t seem to be a way to clear a Lookup object (since DBLookup.close() is not an interface method of Lookup) and you can’t get a Lookup object from its enclosing LookupTable object. Does this mean that each Lookup with its cache remains until I call LookupTable.free() at the very end?? So far I haven’t run out of memory, but I could see it maybe being a problem for bigger lookups.

Hope the above makes sense! I am pretty happy with the speed up, but am worried that I could get out-of-memory using my current solution or run into multithreading issues if I try to use the same Lookup/DataRecord objects in 4 nodes.

The other solutions I’ve been tasked to look at are to pull the lookup table into a flat file and then read it in using a DataReader, or to figure out if I can pull the entire lookup table into memory (DB_INPUT_TABLE or LOOKUP_TABLE_READER_WRITER?). Is there any documentation you can point me to, or do you think the path I’m going down is better than those solutions?

Thanks,
Anna

Hello Anna,
In version 2.8, Lookups work as you describe (I’ve reported it to our bug tracking system: http://bug.cloveretl.com/view.php?id=3474), and you certainly shouldn’t use the same Lookup object in multiple nodes. Maybe the best way is to create a new LookupTable in each Reformat and free it in the finished() method.
Since version 2.9, some of these steps have moved to the preExecute() and postExecute() methods.

Heya Agata,

Thank you for the reply to my post!

I did some further research and it looks like the best solution may be to use SimpleLookup. I read the lookup table’s information into a LookupTable (simpleLookup type) in Phase 0 using a DB_INPUT_TABLE/LOOKUP_TABLE_READER_WRITER combo. The 4 nodes in Phase 1 all create their own Lookup/DataRecord objects in init() and the last node calls LookupTable.free() in its finished() method.

I used this example as a template for reading in the lookup information: http://wiki.clovergui.net/files/example … Reader.grf.

Although this method didn’t improve a smaller dataset all that much:

- 20,861 lookup records, 1 node using them.
- 69,841 records in the output file.
- 35 min [using DBLookup w/no cache] to 13 min [using DBLookup w/cache] to 12 min [using in-memory].

It really made a difference in a larger dataset:

- 42,369 lookup records, 2 nodes using them.
- 982,385 records in the output file.
- 92+ hrs [using DBLookup w/no cache] to 27 hr 20 min [using DBLookup w/cache] to 5 hr 33 min [using SimpleLookup in-memory].

(NOTE: I’m providing these numbers as an example. The run time between different datasets (small vs large) is really unimportant to us - we are interested in the time difference within a dataset when we switched the lookup implementation in its graph. I know I need to do repeated tests for more accurate performance analysis.) :slight_smile:

Since each SimpleLookup has a pointer to the same data Map, it seems to resolve my issue with multiple caches, too. Each node still needs to have its own Lookup/DataRecord objects (to avoid multithreading issues), but its lookup is done against the same data Map, and SimpleLookup does not create a “cache.” At least that’s what I think it’s doing…
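Here is a plain-Java sketch of how I picture that layout (hypothetical names, not the actual CloverETL classes): one shared, read-only data map for all nodes, plus a cheap per-node cursor, so concurrent nodes never share mutable state.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Sketch of the SimpleLookup-style layout: one shared, read-only data map
 *  plus a cheap per-node cursor. Hypothetical names, not CloverETL API. */
class SharedLookupTable {
    private final Map<String, String> data;   // one copy, shared by all nodes

    SharedLookupTable(Map<String, String> loaded) {
        // defensive copy, then wrap read-only: concurrent reads of a map
        // that is never modified are thread-safe
        this.data = Collections.unmodifiableMap(new HashMap<>(loaded));
    }

    /** Each Reformat node creates its own cursor in init(). */
    Cursor createCursor() { return new Cursor(); }

    /** Plays the role of the per-node Lookup/DataRecord pair: its state is
     *  private to one node, while the data it reads is shared. */
    class Cursor {
        private String lastKey;   // per-node mutable state

        String seek(String key) {
            lastKey = key;
            return data.get(key);   // reads the shared map, no per-node copy
        }

        String lastKey() { return lastKey; }
    }
}
```

The point of the split is that two nodes can seek different keys at the same time without stepping on each other, because the only mutable state lives inside each node's own cursor.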

This solution trades off higher memory usage for improved performance, so I will probably have to see how it does when there are multiple lookup tables or really large lookup datasets. I could see Phase 0 taking excessive time if there are a lot of lookup datasets to read into memory, and it could also cause my graph to run out of memory if the amount of data is huge.

Any thoughts? Does this seem like a valid approach?

Thanks,
Anna

P.S. For DBLookup, your bug item says that “Freeing cache of DBLookup should be done after finishing execute method, not in free only,” but in my specific scenario it seems like I would take a performance hit? For example, if node 1 caches 100 records and frees them after the node finishes, node 2 would have to re-retrieve and cache those 100 records again. That might be better than having to deal with synchronization issues on a global cache (which would probably have to be stored in DBLookupTable instead of DBLookup), but is there a best-of-both-worlds solution? Just putting the thought out there…
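To sketch what I mean by best-of-both-worlds (hypothetical names, plain Java, not CloverETL code): a single cache shared by every node, with the synchronization handled per key by ConcurrentHashMap instead of a global lock, so the DB is queried at most once per key across all nodes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of a cache shared by all nodes, with per-key
 *  synchronization from ConcurrentHashMap rather than a global lock.
 *  Not CloverETL code. */
class SharedDbCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> dbQuery;   // stands in for the real DB call

    SharedDbCache(Function<K, V> dbQuery) { this.dbQuery = dbQuery; }

    V lookup(K key) {
        // computeIfAbsent runs the query at most once per key, even when
        // several nodes request the same key concurrently; note it does
        // NOT cache null results (unlike storeNulls=true)
        return cache.computeIfAbsent(key, dbQuery);
    }
}
```

With something like this, node 2 would reuse the 100 records node 1 already fetched instead of hitting the database again.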

Hi Anna,
Thank you for the deep analysis. You can also try to use PersistentLookup instead of SimpleLookup. This would help you avoid the memory-exhaustion problem.
I’ve talked with our developers about the DBLookup cache behavior and they have promised to look into a solution for the 2.8.2 version, which is to be released in a few days.

Hello Anna,

I think your analysis of our lookup tables is pretty good and your elaboration is really thorough.

Currently we are not considering making Lookup objects synchronized, due to the performance cost of that solution. If you have four Reformats with similar functionality - looking up the same data - can you consolidate all these lookups into one component? That would probably be the fastest solution. Your latest approach with the simple lookup table is also very good and straightforward. If most of the database data will be used by the lookup, which is probably your case, it is definitely fastest to unload the data from the database at once and then use it to build a simple lookup (a hash join component is also possible to use; for big lookup datasets you can sort the data first and use the merge join component instead). To avoid the out-of-memory problem with this solution you can experiment with the persistent lookup table, as was already mentioned.

Theoretically it is possible to share the same instance of a Lookup object across all Reformats via our Dictionary, which is something like a key-value shared memory for inter-component communication. If you want to know more about this option, check out our documentation at http://www.cloveretl.com/documentation/UserGuide/topic/com.cloveretl.gui.docs/docs/dictionary.html or let me know for further information. You would also have to take care of synchronizing multi-threaded access to this shared instance.

If I’ve missed something, let me know and I will complete my answer. Martin

Heya,

Hopefully these posts can help other people if they find themselves in a similar situation. :slight_smile:

I implemented the SimpleLookup solution more-or-less as I outlined above. The only change I made was to call LookupTable.free() in a special “last node” in our graph that does some cleanup/statistics gathering after all the output data has passed through. I confirmed with my business folks that our lookup data is small enough that we shouldn’t have to worry about memory issues.

I will keep PersistentLookup in mind if I do run into any issues. We cannot combine the lookups into a single component or use JOINs because we use lookups inside a “rule engine,” not directly in the graph. I really appreciate your suggestions, though!

Thank you both for your replies!

Anna