Non-sequential processing of large XML files

Hi,
I’m currently investigating whether CloverETL could be used for processing XML files. The file sizes range from a few MB to several GB.
The XML files would be subjected to about 40-50 functional validation rules before a file can be split and its data extracted and loaded into a database. Below are a couple of queries that I have:
1. The business validation rules apply over the entire XML file. Does CloverETL support XQuery functionality or something similar with which functional validation rules can be implemented (e.g. similar to what BaseX offers w.r.t. XQuery capability)? Ideally this would use stream-based processing instead of loading the complete XML into memory.
2. As part of the XML processing, a SHA hash for the entire XML file has to be generated. Do we need to use the Java Execute component and implement the stream-based hashing ourselves, or is there an alternative method of achieving the same?
3. The file processing activities are kicked off by a SOAP/JMS notification message sent to a queue on a TIBCO EMS server. I have tried using the JMSReader component to consume the JMS message, and these are my queries:
a. Is there an option for setting the number of sessions that can be created (e.g. 10 sessions per connection)? Each session consumes one message and kick-starts processing of a specific XML file. The SOAP/JMS notification message would contain the file name and file location.
b. Is there an option to use client ack. mode instead of auto ack. mode (so that the message is acknowledged on the JMS queue only after the file processing is completed)?
c. What are the options for setting up an SSL connection with the EMS server?
d. My observation has been that the JMSReader component uses a poll mechanism to fetch messages from the JMS queue instead of an event-based (Queue Receiver) or pull mechanism (Queue Consumer). If there are multiple consumers on the same queue (load balancing), is there a possibility of a race condition when processing messages from the JMS server?

Regards,
evn

Dear evn,

thank you for your questions:

  1. We have a few components for XML reading. For your purpose I would recommend XMLExtract.

XMLExtract
* http://doc.cloveretl.com/documentation/ … tract.html
* reads data in a SAX/streaming fashion
* memory efficient, can read XML files of a few GB without problems
* has a nice mapping designer UI

XMLReader
* http://doc.cloveretl.com/documentation/ … eader.html
* reads data into memory (DOM) and allows you to extract data using XPath selectors
* can’t be used for large files because of the very high memory usage of the DOM

XMLXPathReader
* uses DOM+XPath
* deprecated, replaced by XMLReader and XMLExtract
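
To illustrate why the SAX/stream approach stays memory efficient, here is a minimal sketch in plain Java (not CloverETL code). The class name `StreamingElementCount` and the sample element names are made up for this example. The handler only sees one parsing event at a time, so no DOM tree of the whole document is built.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only: counts elements while streaming.
public class StreamingElementCount {

    /** Counts elements with the given qualified name without building a DOM. */
    public static long countElements(InputStream in, String elementName) {
        final long[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Called once per opening tag as the parser moves through the input.
                if (qName.equals(elementName)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return count[0];
    }

    /** Convenience overload for small in-memory samples. */
    public static long countElements(String xml, String elementName) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return countElements(new ByteArrayInputStream(bytes), elementName);
    }

    public static void main(String[] args) {
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/><note/></orders>";
        System.out.println(countElements(xml, "order"));
    }
}
```

For a real file, the `InputStream` overload can be given a file stream. A validation rule that spans the whole file would need to keep its own running state (counters, sets of seen keys) inside the handler.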

  2. For the SHA calculation I would recommend using an external executable (http://linux.die.net/man/1/sha1sum) via http://doc.cloveretl.com/documentation/ … cript.html or http://doc.cloveretl.com/documentation/ … ecute.html instead of writing your own Java code.
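
If an external executable is not an option, stream-based hashing in plain Java is fairly short. The following is only a sketch using the standard `java.security.MessageDigest` API; the class name `StreamingSha` is invented for this example, and how it would be wired into a graph is not shown. The file is read in fixed-size chunks, so memory use does not grow with file size.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
public class StreamingSha {

    /** Hashes the stream chunk by chunk; algorithm is e.g. "SHA-1" or "SHA-256". */
    public static String hashHex(InputStream in, String algorithm) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                // Only the current chunk is held in memory.
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException | NoSuchAlgorithmException e) {
            throw new IllegalStateException("Hashing failed", e);
        }
    }

    /** Convenience overload for small in-memory samples. */
    public static String hashHex(String text, String algorithm) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return hashHex(new ByteArrayInputStream(bytes), algorithm);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            // Usage: java StreamingSha /path/to/file.xml
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                System.out.println(hashHex(in, "SHA-256"));
            }
        } else {
            System.out.println(hashHex("abc", "SHA-1"));
        }
    }
}
```

With `"SHA-1"` the result should match what `sha1sum` prints for the same file.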

a] I am not sure I understand the question. Every received message is sent immediately over the port to the next component. If the next component is in the same phase as JMSReader, processing also starts immediately. If you would like to stop the graph after receiving, e.g., 10 messages, you can use the “Max msg count” property (http://doc.cloveretl.com/documentation/ … eader.html).

b] I am afraid there is no support for client ACK mode; only AUTO ACK is available. You may be able to limit the risk of failure by writing the received message to a file first and processing it in a second step.

c] SSL should work; the configuration depends on your JMS vendor. For an example, see http://activemq.apache.org/how-do-i-use-ssl.html. Just use the “ssl://” prefix in the “URL” field of the JMS Connection wizard (http://doc.cloveretl.com/documentation/ … izard.html) and configure the trust/keystores.

d] I will answer this part later, as I need to check the details.

I hope you will find my answers useful.

Dear evn,

regarding question 3d: internally we use the receive(long) method, which is indeed a polling approach.

Regarding possible race conditions:
* Possible problems are discussed here. CloverETL should be free of them. Each JMSReader uses its own JMS session, so thread-based race conditions (caused by several threads sharing the same session) should not appear.
* A JMS queue itself is designed for concurrent access by multiple consumers, each with its own session; any single message should be delivered to only one consumer. Some information may be found here or here and here.
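
The competing-consumers idea can be sketched with an in-process analogy. This is not JMS and not CloverETL code: a `java.util.concurrent.BlockingQueue` stands in for the broker queue, `poll(timeout)` stands in for `receive(long)`, and the class name `CompetingConsumers` is invented for the example. Several polling threads drain one queue, and the sketch checks that no message is handed to two consumers.

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// In-process analogy only; a real JMS broker enforces this on the server side.
public class CompetingConsumers {

    /** Returns the number of messages that were delivered more than once. */
    public static int duplicateDeliveries(int messages, int consumers) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < messages; i++) {
            queue.add(i);
        }
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger deliveries = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int c = 0; c < consumers; c++) {
            pool.submit(() -> {
                try {
                    Integer msg;
                    // Timed poll: returns null if nothing arrives in time.
                    while ((msg = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
                        seen.add(msg);
                        deliveries.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("Consumers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
        if (seen.size() != messages) {
            throw new IllegalStateException("Some messages were never delivered");
        }
        return deliveries.get() - seen.size();
    }

    public static void main(String[] args) {
        System.out.println(duplicateDeliveries(1000, 4));
    }
}
```

Polling by itself does not introduce a race here, because the queue hands each element to exactly one caller. Note that with AUTO ACK a message is considered delivered as soon as it is received, so a failure during later file processing is a separate concern from duplicate delivery.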

Please let me know if you need more information.