Updating Dimensions

(I’m brand new to CloverETL and data warehousing in general, so feel free to let me know if I’m way off track.)

Anyhow, I’m planning on using CloverETL to populate a data-mart. I’m trying to figure out the best way to create and update some of my dimensions (which will be relatively small, but dynamic).

As I understand it now, I could do an outer join against the existing dimension table and locate the missing records for insertion. I would then have to remove duplicates (not sure how) and assign each record a unique id as primary key. I’d then have to re-run the outer join in order to locate the foreign keys for my fact table.

Since the dimensions are relatively small, I was thinking of speeding up processing by using a LookupTable that I could dynamically update during processing with any new entries. I can use the table size to safely allocate new unique ids, and will just need to write the new entries to an external file so I can upload them to the data-mart.

SimpleLookupTable doesn’t seem to support the extensions I require (or a method to get the table size), so I may have to implement something new. Is this a practical solution to what must be a common problem? Or have I missed something obvious?

Thanks in advance,
Peter

You might want to try the new “DataIntersection” component. It can give you the dimensional inserts, deletes, and updates on different output streams in a single pass.

Think about updating dimensions as a full outer join problem:

current FULL OUTER JOIN previous (on key) →

port 0 → records in current but not in previous – new (inserts)
port 1 → records in both current and previous – updates
port 2 → records in previous but not in current – deletes

I would slap each of the above streams into a file and bulk load them (really fast) into the database. Assign surrogate keys and the rest from within the database.

This is one of the fastest ways of updating your dimensions, and one I have used in the past. I used Ab Initio though, not Clover.

Single inserts/updates/deletes via JDBC are very slow.
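
To make the classification concrete, here is a rough plain-Java sketch of the logic (names and record shapes are hypothetical; in Clover you would use the join components rather than hand-written code):

```java
import java.util.*;

// Hypothetical sketch: classify dimension rows into inserts/updates/deletes
// by comparing the current snapshot against the previous one on the natural key.
public class DimensionDiff {
    public static void main(String[] args) {
        Map<String, String> previous = Map.of("A", "v1", "B", "v2");
        Map<String, String> current  = Map.of("B", "v2b", "C", "v3");

        List<String> inserts = new ArrayList<>(); // in current only  -> port 0
        List<String> updates = new ArrayList<>(); // in both          -> port 1
        List<String> deletes = new ArrayList<>(); // in previous only -> port 2

        for (String key : current.keySet()) {
            if (!previous.containsKey(key)) {
                inserts.add(key);
            } else {
                updates.add(key); // in practice, skip rows whose attributes are unchanged
            }
        }
        for (String key : previous.keySet()) {
            if (!current.containsKey(key)) deletes.add(key);
        }

        System.out.println("inserts=" + inserts + " updates=" + updates + " deletes=" + deletes);
    }
}
```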

Hi Peter !

In reverse order :wink:

B) You can have it within one graph. If you need to do it in two steps, use phases. Edges which cross a phase boundary are internally buffered, so you can keep sending records into them; the records are buffered until the next phase starts, at which point the edge starts delivering them.

A) Any approach which ensures that you generate unique artificial (surrogate) keys for your dimension is OK. A sequence seems to me to be less problematic than the hash-table size. For one (as you mentioned), when records are removed the size shrinks, so there is a potential danger of producing duplicate keys.
If you don’t want to access the DB during processing, then store the value in some file which you read in the transformation’s init() function. Then, within the reformat function, increment it by one for each new record.
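
A rough sketch of that counter in plain Java (illustrative names; this shows the idea, not Clover’s actual transform interface):

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch of a file-backed surrogate-key counter.
public class KeyCounter {
    private final Path stateFile;
    private long next;

    public KeyCounter(Path stateFile) { this.stateFile = stateFile; }

    // Call once from the transformation's init(): load the last used value.
    public void init() throws IOException {
        next = Files.exists(stateFile)
                ? Long.parseLong(Files.readString(stateFile).trim()) + 1
                : 1L;
    }

    // Call from the reformat function for each new dimension record.
    public long nextKey() { return next++; }

    // Call once at the end of the run: persist the last used value for next time.
    public void close() throws IOException {
        Files.writeString(stateFile, Long.toString(next - 1));
    }
}
```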

David.

Hi David,

reversing the order again!

A) Yes, that makes perfect sense.

B) Not sure how to do this. The REFORMAT component only seems to write the transformed records to the output record (target[0]). As a side effect of the transformation, I will also produce a list of new records that need to be added to the dimension.

Ideally, I’d like to be able to grab a reference to another output port for the REFORMAT Node and write the new records to that (so they can be processed as part of the next phase of the graph).

For the moment, I write them to a file using a Formatter and then re-read them.

C) One issue I have with using formatters is that there doesn’t seem to be a way of escaping data that may contain the delimiter character. I guess I should write an escape/unescape transform to apply before writing to / after reading from a delimited stream? Or is there an existing solution?
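
For instance, a minimal escape/unescape pair might look like this (backslash escaping is just my assumed convention, not an existing Clover facility):

```java
// Hypothetical backslash escaping so field data can safely contain the delimiter.
public class DelimiterEscape {
    private static final char DELIM = '|';  // assumed field delimiter
    private static final char ESC = '\\';

    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c == DELIM || c == ESC) sb.append(ESC); // protect delimiter and escape char
            sb.append(c);
        }
        return sb.toString();
    }

    public static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        boolean escaped = false;
        for (char c : s.toCharArray()) {
            if (!escaped && c == ESC) { escaped = true; continue; }
            sb.append(c);
            escaped = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "a|b\\c";
        String enc = escape(raw);
        System.out.println(enc + " -> " + unescape(enc)); // round-trips to the original
    }
}
```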

Thanks again,
Peter

Hi David,

Thanks for the prompt reply.

I guess the problem you are thinking of with using the size is that if records were removed from the dimension, the size would no longer yield a unique id? (Or is there something else I should be careful about?)

I’m trying to avoid needing online access to a database (to maintain the sequence), so for the moment I will scan the existing data in the lookup table at startup and start allocating ids above the highest one in use.

I was originally planning to write the new dimension records to a file, but it might be nicer if I could write them to one of the current node’s output edges, so they could stay part of the current transformation graph.

I don’t think this is possible in the REFORMAT component, though. Do you think it’s A) a good idea? B) easy to do?

Thanks,
Peter

Hi Akhilesh,

I did look at using it, but I guess I would need several passes to

a) identify the new rows for my dimension
b) assign a new artificial key to each unique row
c) re-process the data to pick up the new artificial keys

I would be interested in finding out how other people handle the problem.

The solution I have, essentially a dynamically updated SimpleLookupTable, allows me to update several relatively small, slowly changing dimensions and include the artificial keys in the fact table in a single pass through the data.
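
Roughly, the idea looks like this in plain Java (my own illustrative sketch, not SimpleLookupTable’s actual API): seed the map from the existing dimension, mint the next surrogate key on a miss, and collect the new entries for the upload file.

```java
import java.util.*;

// Hypothetical sketch of a dynamically updated dimension lookup.
public class DynamicDimensionLookup {
    private final Map<String, Long> keyByNaturalKey = new HashMap<>();
    private final List<String> newEntries = new ArrayList<>();
    private long nextKey;

    // Seed from the existing dimension; start counting above the highest used key.
    public DynamicDimensionLookup(Map<String, Long> existing) {
        keyByNaturalKey.putAll(existing);
        nextKey = existing.values().stream().mapToLong(Long::longValue).max().orElse(0L) + 1;
    }

    // Single pass: return the surrogate key, minting one if the value is new.
    public long lookupOrCreate(String naturalKey) {
        return keyByNaturalKey.computeIfAbsent(naturalKey, k -> {
            newEntries.add(k); // queue for the dimension upload file
            return nextKey++;
        });
    }

    public List<String> getNewEntries() { return newEntries; }

    public static void main(String[] args) {
        DynamicDimensionLookup dim =
                new DynamicDimensionLookup(Map.of("RED", 1L, "BLUE", 2L));
        System.out.println(dim.lookupOrCreate("RED"));   // 1 (existing entry)
        System.out.println(dim.lookupOrCreate("GREEN")); // 3 (newly minted key)
        System.out.println(dim.getNewEntries());         // [GREEN]
    }
}
```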

Thanks,
Peter

Hello Peter !

Your description of the steps required is quite correct. However, I would not suggest using any kind of size figure for assigning artificial keys.
The better approach would be to use some internal counter for generating the keys, and to store the final value somewhere for next time (so you know where to start counting from).
The other possibility is to use a database SEQUENCE: you could probably use DBLookupTable and, rather than searching for something, just execute a query which asks the SEQUENCE object for the next value.

Like: “select SEQ.nextval from dual” instead of
“select mykey from mytable where x=?”.

Unfortunately, the support for sequences in Clover is not yet finished…
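
If you do have database access, fetching the next value over plain JDBC is straightforward in the meantime (Oracle syntax; the sequence name is hypothetical):

```java
import java.sql.*;

// Hypothetical sketch: pull the next surrogate key from an Oracle SEQUENCE via JDBC.
public class SequenceFetch {
    public static long nextVal(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT my_seq.NEXTVAL FROM dual")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```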

David.
david.pavlis@centrum.cz

Hello Peter !

To solve your B) problem: the “standard” way, given the graph topology you describe, is to output all records (old & new) to the same port. Just add one field which indicates whether the record is old or new. Then connect some FILTER component after the REFORMAT which separates the old (those are discarded, i.e. sent to TRASH) from the new, which are inserted or further processed.
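
A rough plain-Java sketch of that flag-and-filter pattern (hypothetical record shape and names, not Clover’s API):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch: REFORMAT tags every record with an old/new flag on one
// stream; a FILTER-like step then discards the old and keeps the new.
public class FlagAndFilter {
    record Row(String naturalKey, boolean isNew) {}

    public static void main(String[] args) {
        Set<String> known = Set.of("A", "B");        // keys already in the dimension
        List<String> input = List.of("A", "C", "B", "D");

        // "REFORMAT": emit all records to the same port, flagged old or new.
        List<Row> tagged = input.stream()
                .map(k -> new Row(k, !known.contains(k)))
                .collect(Collectors.toList());

        // "FILTER": old rows would go to TRASH; new rows continue downstream.
        List<Row> newRows = tagged.stream().filter(Row::isNew).collect(Collectors.toList());

        System.out.println(newRows); // [Row[naturalKey=C, isNew=true], Row[naturalKey=D, isNew=true]]
    }
}
```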

C) Currently, there is really no standard way of escaping data (strings). You could probably modify one of the current DataFormatters to do it for you. Or, if you only need to process the data with Clover, use some other delimiter. As you may know, Clover can work with delimited data where every field is delimited by a different character (or set of characters, up to 32 of them).

David.