I’m sorry for the unclear title, but I couldn’t find a way to be more specific…
I am working on a CloverETL project (v2.9) that includes several graphs which receive split data, process it, and then write it out.
The main problem I’m experiencing is that if I use the Clover server with wildcards in my input file names (something like zip:(*.zip)#*.xml), it will process all the files in a single graph run.
The thing is that we cannot stop input files from arriving in Clover’s input directory, so Clover will process several files at a time, which is not what I want.
The other problem is that I use temporary files, so I think I will have a problem (data being discarded) when several instances of the same graph run at once, because the naming of those files is static.
Is there a solution that lets me keep working with the CloverETL server under those conditions, or do I have to run the graphs from a script that generates a parameter file and then passes it to the graph?
Hello Adrien,
have you tried creating a File Event Listener? It starts the graph every time a new file appears. In version 2.9 it doesn’t work very well with wildcards, but maybe it would be enough for you to watch only the input directory. Another problem may occur with large files, when the listener starts the graph before the whole data file is uploaded.
My graphs are divided into 3 main parts: reading, processing, and writing (they are pretty big).
So I use temporary files between all those parts (read → write to temp files; read temp files → process → write to temp files; read temp files → write output).
The path where they are written is static and defined by the parameter file.
Is there another way to define the location of the temporary files?
If I use an empty trigger file to know when the data has been fully transferred to the Clover server (e.g. xxx.trg), do I have to name our zip file xxx.trg.zip to retrieve it in the graph? The only information I will have is the name of the file the event listener is waiting for.
The same goes for the output file: can I apply some processing to the output file name, or do I have to name it xxx.trg.zip, since xxx.trg will be the only ‘dynamic’ part of my graph run?
One more question about this: we will be using large XML files. If we run the graphs in parallel like this, is there a chance of an out-of-memory error due to the JVM heap size, or does Clover handle it well?
Thanks in advance and thanks again for the quick answers.
Hello Adrien,
I’m not sure if I understand you correctly, but I see the scenario as follows:
the fileURL in the Reader uses wildcards, let’s say zip:(*.zip)#*.xml
the trigger file has a unique name, something like xxx.trg, where xxx is a random number
temporary file names depend on the trigger file name, e.g. fileURL="${DATATMP_DIR}/${event_file_name}.phase0.tmp" (a rough sketch follows this list)
the problem can be with the output file name - here I see 2 possibilities:
- use the trigger file name as a template
- use the input file name (may be more complicated):
  - add an autofilling field (source_name) to your input metadata
  - put the input file name into the dictionary during the transformation
  - add a phase after the whole processing that renames the output file according to the name stored in the dictionary
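To make the naming scheme concrete, here is a rough sketch in plain Java (not Clover API; the paths and the "12345.trg" name are only examples) of how the temporary and output file names can all be derived from the trigger file name:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TriggerNames {
    public static void main(String[] args) {
        // the event listener hands the graph the trigger file name, e.g. "12345.trg"
        String eventFileName = "12345.trg";
        // strip the .trg suffix to get the unique run id
        String runId = eventFileName.replaceFirst("\\.trg$", "");

        // temporary file for phase 0, unique per run
        Path phase0Tmp = Paths.get("/data/tmp", eventFileName + ".phase0.tmp");
        // input zip named after the trigger file (the xxx.trg.zip convention)
        Path inputZip = Paths.get("/data/in", eventFileName + ".zip");
        // output file reuses the same run id
        Path output = Paths.get("/data/out", runId + ".out.zip");

        System.out.println(phase0Tmp + " / " + inputZip + " / " + output);
    }
}
```

Because every name is a function of the unique trigger file name, parallel runs of the same graph never collide on temporary or output files.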
Regarding the question about XML files and memory: it depends on the component you use for reading. XPathReader reads the whole file into memory, so when you try to process more files at once it can cause an OOM error. XMLExtract reads data sequentially, so different XMLExtracts can work in parallel with no fear of an OOM error.
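For anyone curious why the two components behave so differently: XPathReader works DOM-style (the whole document is materialized in memory), while XMLExtract streams. A minimal sketch of the streaming principle using the plain JDK StAX API (the element name "record" is just an example, not tied to any Clover metadata):

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class StreamingCount {
    public static void main(String[] args) throws Exception {
        // StAX pulls one event at a time, so memory use stays flat
        // regardless of file size -- the same principle XMLExtract relies on.
        // A DOM parser, by contrast, loads the entire document first.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream(args[0])) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            long records = 0;
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    records++; // count each <record> element as it streams by
                }
            }
            System.out.println("records: " + records);
        }
    }
}
```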
In fact, I will use this temporary solution while waiting for version 3.1, and then I will use another graph that runs my graphs in the same JVM, with parameters that will allow me to have a simpler naming scheme.
Thank you very much for the quick answers!
PS: thank you also for the description of XMLExtract, it is very helpful!
I am not totally clear on how you make sure the entire file has been transferred to the FTP server prior to kicking off the job flow with the file-appearance event listener. You mention something about using a “trigger” file. How do you do this?
The idea behind the trigger file is that the actual event is triggered not by the large file, but by the tiny one. The tiny file, however, must always be created after the large file is fully loaded.
It’s fine to add another (next) phase to the graph that loads data to FTP, and in this phase create just the tiny file. This way you can be sure the large file is already loaded, as the earlier phase has to finish before the second file is created.
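Expressed outside of Clover, the ordering guarantee looks like this in plain Java (the paths are examples; in a graph the two steps would be two consecutive phases):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class UploadWithTrigger {
    public static void main(String[] args) throws IOException {
        // "phase 1": write the large file; copy() only returns
        // once the whole payload has been written
        Files.copy(Paths.get("/local/out/data.zip"),
                   Paths.get("/remote/in/data.zip"));

        // "phase 2": only now create the empty trigger file, so any
        // listener watching for *.trg sees it strictly after the
        // large file is complete
        Files.createFile(Paths.get("/remote/in/data.trg"));
    }
}
```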
Thanks for the reply, Jan. I am still not totally clear. I hope I am not being dense :?
My scenario is that a customer FTPs a file to the FTP server and the file appears there. Once the file is completely copied to the FTP server, I want to use a file event listener to kick off the job flow that processes the file.
You mentioned using the tiny trigger file as the event that kicks off the job flow. What causes the trigger file to be created, and how is it created?
That is actually the thing: the tiny file has to be uploaded by the same party right after the big file is uploaded.
Let me give you an example. Using CloverETL we produce a large file that is written directly to a remote location. On that location we expect to receive this file, but since we don’t know when the file is completely uploaded, we create another (tiny) file within the same graph, in a later phase, to make sure the writer in the earlier phases has already finished. We then set up an event listener waiting for the appearance of the tiny file. When it appears, the large file is processed.
As you can see in this example, the tiny file is created by Clover using the same graph as the large file. It does not have to be created by Clover, of course, but you need to make sure the application starts creating the tiny file after (not before) the large file is fully copied to the remote location.
Hi Jan. The use case is as follows: on the Clover server we want a daily scheduled job that first calls a jobflow which executes a graph to call a webservice that generates a file, checks for a complete export file in a client’s outbound SFTP directory, and then FTPs it to them once it knows the file is complete. The process generating the file is a webservice called by Clover. The webservice is asynchronous, so it returns right away, before the processing even starts, and therefore has no way to know the file is complete in order to generate the tiny trigger file. Is there another way to know the file is complete prior to starting the next step in the jobflow?
I was solving this use case with Heather via email, but in case anyone else is interested in the solution, here it is.
In case you don’t control the transfer process, you need to set up a jobflow that periodically checks the size of the file and compares it to the size the file had during the previous iteration. You can do this using the Loop and Sleep components. Nevertheless, this process is prone to false positives due to network errors or other factors that may significantly slow the transfer down. Therefore, it is important to set the delay in the Sleep component high enough: if the size of the file doesn’t change over this delay, the file is considered fully transferred. Attached is an example that should give you a better idea.
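For anyone who wants the logic spelled out, here is the equivalent of the Loop/Sleep pattern in plain Java (the path and delay are example values; choose a delay long enough to ride out slow or stalled transfers):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WaitForStableSize {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path file = Paths.get("/ftp/outbound/export.csv"); // example path
        long delayMs = 60_000; // the "Sleep" delay; must exceed worst-case stalls

        long lastSize = -1;
        while (true) {                       // the "Loop" component
            long size = Files.exists(file) ? Files.size(file) : -1;
            if (size >= 0 && size == lastSize) {
                break;                       // size unchanged over one delay -> done
            }
            lastSize = size;
            Thread.sleep(delayMs);           // the "Sleep" component
        }
        System.out.println("file considered fully transferred: " + file);
    }
}
```

Note the trade-off mentioned above: a shorter delay reacts faster but risks declaring an in-progress transfer complete during a network stall.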