Hi,
I am currently evaluating the product and want to understand how to implement dimension management from transaction data. For example, my web transactions could have a domain and a browser. As I go through the transaction logs, I want to assign a new key to each new occurrence of a domain or browser. This one graph would load three tables:
domain_dim
- domain_id
- domain_name (key)
browser_dim
- browser_id
- browser_name (key)
- browser_os (key)
- browser_version (key)
web_visit_fact
- …
- domain_id
- browser_id
- …
The source file has these fields:
- ip_address
- user_agent
- etc.
Once you have the log parsed into fields (ip_address, user_agent, etc.), you have to parse the user_agent field in a Reformat component to derive the fields browser_name, browser_os, and browser_version. You also need to resolve ip_address to domain_name; this resolution has to be written as a transformation in a Reformat component in Java.
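To make the parsing step concrete, the body of such a Java transformation might look roughly like this. It is only a sketch: the class, the pattern, and the browser list are mine, and real user-agent parsing needs a far richer rule set.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Illustrative only: derives browser_name, browser_os and browser_version
    // from a raw user_agent string.
    public class UserAgentParser {

        // Hypothetical pattern covering a handful of common browser tokens.
        private static final Pattern UA =
            Pattern.compile("(?<name>Firefox|Chrome|MSIE|Safari|Opera)[/ ](?<version>[\\d.]+)");

        public static String[] parse(String userAgent) {
            String name = "unknown", os = "unknown", version = "unknown";
            Matcher m = UA.matcher(userAgent);
            if (m.find()) {
                name = m.group("name");
                version = m.group("version");
            }
            if (userAgent.contains("Windows")) os = "Windows";
            else if (userAgent.contains("Mac OS")) os = "Mac OS";
            else if (userAgent.contains("Linux")) os = "Linux";
            return new String[] { name, os, version }; // browser_name, browser_os, browser_version
        }
    }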
After these two Reformats you have the fields domain_name, browser_name, browser_os, browser_version, etc. Now you have to find out whether the records already exist in the database tables domain_dim and browser_dim. You can use two DBJoin components for this, with SQL queries of the form:
select domain_id from domain_dim where domain_name = ?
select browser_id from browser_dim where browser_name = ? AND browser_os = ? AND browser_version = ?
The records that are already stored in the database are sent to Port 0, and the other records are sent to Port 1. The records from Port 1 are new, so you have to insert them into the database table with a DBOutputTable component. I’m not sure how you want to generate the columns domain_id and browser_id; I prefer to use a database feature (sequences and triggers, or auto-filled columns), but you can also define sequences in CloverETL and assign the ids before inserting the records into domain_dim and browser_dim.
Now you have all records from the input file in the database tables browser_dim and domain_dim, and you need to fill the table web_visit_fact. For that you need to know the domain_id and browser_id for each input record. There are two approaches to getting them:
1. Look them up in the database with DBJoin components (see the SQL queries above). These DBJoins have to run in a later phase of the transformation graph than the inserts into browser_dim and domain_dim.
2. Use the DBOutputTable feature named “Auto generated columns”. It lets you retrieve the values of the auto-generated columns of the records that have just been inserted, but only if you use one of these databases: Oracle, MySQL, DB2, or Informix, and you do not use CloverETL sequences to generate browser_id and domain_id.
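DBJoin and DBOutputTable handle all of this for you; just to make the mechanics concrete, here is the equivalent lookup-or-insert step for domain_dim hand-written in plain JDBC (a sketch assuming domain_id is an auto-generated column, using the SQL from above):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class DomainDimLoader {

        // Returns the existing domain_id, or inserts the domain and returns
        // the id the database generated for it.
        public static long lookupOrInsert(Connection conn, String domainName)
                throws SQLException {
            try (PreparedStatement select = conn.prepareStatement(
                    "select domain_id from domain_dim where domain_name = ?")) {
                select.setString(1, domainName);
                try (ResultSet rs = select.executeQuery()) {
                    if (rs.next()) {
                        return rs.getLong(1); // already stored: the Port 0 case
                    }
                }
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "insert into domain_dim (domain_name) values (?)",
                    Statement.RETURN_GENERATED_KEYS)) {
                insert.setString(1, domainName);
                insert.executeUpdate();
                try (ResultSet keys = insert.getGeneratedKeys()) {
                    keys.next();
                    return keys.getLong(1); // new record: the Port 1 case
                }
            }
        }
    }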
As I expect to push large volumes through the logs, I want to avoid going to the database for ids. I get the gist of your explanation for the rest of the load, but my main concern is surrogate key management; I should have been clearer.
What I would like to do is take the dimensions, load them into memory, and use stream lookups. If a lookup cannot resolve an id, I would get the next key in the sequence, and that new key gets added to the lookup so the next reference to it in my stream will see it (I sketch this in code after the example below).
It might be something like this:
domain_dim
id name
1 A
2 B
3 C
The log comes through and I am able to resolve each ip_address to a domain:
domain
A → 1
A → 1
A → 1
B → 2
A → 1
B → 2
C → 3
C → 3
D → 4 (new key added to lookup)
D → 4
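In code terms, what I have in mind is roughly the following (plain Java, purely to illustrate; the names are mine):

    import java.util.HashMap;
    import java.util.Map;

    public class DomainKeyAssigner {

        private final Map<String, Integer> lookup = new HashMap<>();
        private int nextId;

        // Seed the lookup from the existing dimension rows and remember
        // the highest key already handed out.
        public DomainKeyAssigner(Map<String, Integer> existingDimension) {
            lookup.putAll(existingDimension);
            nextId = existingDimension.values().stream()
                    .max(Integer::compare).orElse(0);
        }

        // Resolve a domain to its surrogate key; an unseen domain gets the
        // next key, which is added to the lookup so that later records in
        // the stream resolve it without any database access.
        public int resolve(String domainName) {
            return lookup.computeIfAbsent(domainName, k -> ++nextId);
        }
    }

Seeded with {A=1, B=2, C=3} as above, resolve("D") hands out 4, and every later "D" in the stream gets 4 straight from the map.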
I want to be able to get that resolved in the stream, to avoid the database round trip; otherwise that will be a major bottleneck to my throughput, unless Clover implements some sort of caching.
I can replicate the stream to load the dimensions as well as the fact table with little risk of RI violations, and I can also avoid the database-access bottleneck.
The other approach would be for me to make two passes over the fact data: the first pass evaluates all the dimension attributes to load the dimensions, and the fact table is loaded afterwards.
CloverETL doesn’t provide a stream lookup as you described, but I can suggest a workaround:
At the beginning, after the input data are parsed, you copy them with a SimpleCopy component into three streams/branches: one for preparing the domain dimension with the new data, one for preparing the browser dimension with the new data, and one for filling the fact table.
First I’ll describe the “domain dimension branch” of the graph; the “browser dimension branch” is almost the same.
Domain dimension branch:
You have to sort the input data by domain_name and deduplicate them by domain_name in a Dedup component. After deduplication you have a sorted stream of unique domain_name values. You join this stream with the data loaded from the database table domain_dim in an ExtHashJoin component: the stream of data from the database is connected to the slave port (Port 1), and the sorted stream is connected to the master port (Port 0). The “Join type” parameter of the ExtHashJoin has to be set to “Left outer join” so that you can assign ids to the new records.
The key part of the transformation code, written in CTL, assigns a new id only when the record is new, i.e. when the slave side of the left outer join comes up empty.
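A minimal sketch of that logic in plain Java, assuming a sequence seeded above the highest domain_id already present in domain_dim (the names are illustrative, not the actual CTL):

    import java.util.concurrent.atomic.AtomicLong;

    public class DomainIdAssignment {

        // Sequence seeded above the highest domain_id already in domain_dim.
        private final AtomicLong sequence;

        public DomainIdAssignment(long maxExistingId) {
            sequence = new AtomicLong(maxExistingId);
        }

        // slaveDomainId comes from the domain_dim side of the left outer join;
        // it is null exactly when the domain_name was not found in the table.
        public long assignId(Long slaveDomainId) {
            if (slaveDomainId == null) {
                return sequence.incrementAndGet(); // new record: draw the next id
            }
            return slaveDomainId; // existing record: keep the stored id
        }
    }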
After the ExtHashJoin, the stream contains the up-to-date domain dimension data, which you can use in the “fact table branch” and for inserting the new records into the database table domain_dim.
In the “fact table branch” you simply use ExtHashJoins with the data from the dimension branches connected to the slave port (Port 1).
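Put together, the fact table branch then amounts to one in-memory lookup per record against the data the dimension branches prepared in an earlier phase; roughly, in plain Java terms (the composite browser key and all names are illustrative):

    import java.util.Map;

    public class FactEnricher {

        // domainIds and browserIds are built by the dimension branches above;
        // they are assumed to already contain every key seen in the input.
        public static long[] enrich(Map<String, Long> domainIds,
                                    Map<String, Long> browserIds,
                                    String domainName, String browserName,
                                    String browserOs, String browserVersion) {
            long domainId = domainIds.get(domainName);
            long browserId = browserIds.get(
                    browserName + "|" + browserOs + "|" + browserVersion);
            return new long[] { domainId, browserId }; // ids for web_visit_fact
        }
    }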