Having trouble with Streaming vs. In Memory

I have a problem with a graph where I read in several XLS files through a SpreadsheetDataReader. Depending on the order in which I select the spreadsheets, when using "Streaming" I run into a problem where one of my columns with a text metadata type (policy number) gets turned into a currency. It is still a text field, but out of nowhere a dollar sign has been added, along with commas and ".00" at the end. If I change the order in which I select the files, or if I just process each one individually, this doesn't happen.

I was going to use In Memory instead of Streaming, which fixes the problem, but only if I select two or three small files rather than all the files I want to process at once (10 or 15). When I select them all, I get this:

11:26:51,683 INFO [JobFinalizer_42] Cleaning up child job:43
11:26:54,823 INFO [JobFinalizer_42] local job aborted; runId=43 result:Response:ok:[RunRecord#43 SUBGRAPH STANDALONE ABORTED node01/Sandbox#1#Customers:shared/graph/StepSubs/Step2-SUB.sgrf parent:42] [Msg#info:MSG_GRAPH_KILLED [43, ABORTED]]
11:26:54,823 INFO [exNode_42_1496674929375_STEP2_SUB] local job aborted; runId=43 result:Response:ok:[RunRecord#43 SUBGRAPH STANDALONE ABORTED node01/Sandbox#1#Customers:shared/graph/StepSubs/Step2-SUB.sgrf parent:42] [Msg#info:MSG_GRAPH_KILLED [43, ABORTED]]
11:26:54,823 INFO [JobFinalizer_42] RunTime: 7 secs
11:26:54,823 ERROR [JobFinalizer_42]

I believe this error is caused by a lack of available memory, but I'm not sure. Could anyone help shed some light on possible solutions?

thanks,
Mike

Hi Mike,
there is an essential difference between the SpreadsheetDataReader read modes in how data is stored in memory. The In-memory mode loads the whole input file into memory, which allows faster reading and makes it a good candidate for smaller files. In the Streaming mode, the file is read continuously without being stored in memory; this mode is designed for reading large files without the risk of running out of memory. Although your use case seems to match the way these two modes are designed to work, I would not dare conclude that using one or the other is behind the error you encountered. In order to narrow down the cause of both the first issue (the incorrect number format) and the second issue (the reported error message), I would like to ask you for more details:
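To illustrate the trade-off (this is not CloverETL code, just a minimal Python sketch of the same idea with CSV data standing in for a spreadsheet): an in-memory reader materializes every row before you process anything, while a streaming reader yields one row at a time with roughly constant memory use.

```python
import csv
import io

# Small stand-in for an input file.
SAMPLE = "policy_number,amount\nPOL-001,100\nPOL-002,250\n"

def read_in_memory(text):
    """Load every row into a list first: fast to iterate afterwards,
    but memory use grows with file size (fine for small files)."""
    return list(csv.DictReader(io.StringIO(text)))

def read_streaming(text):
    """Yield one row at a time: roughly constant memory,
    suited to large files."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row

# Both approaches produce the same records; only the memory profile differs.
assert read_in_memory(SAMPLE) == list(read_streaming(SAMPLE))
```

Either way the data delivered downstream is identical, which is why the mode alone should not change a column's format.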

  • I would like to test run the graph on my end in order to replicate the same behavior of the system. Kindly attach the graph file(s), subgraph file(s), any externalized items (such as metadata) and sample input data which would enable me to run the graph and recreate the issue. Feel free to mask any sensitive information that might be contained within these items.

  • Are the input XLS files edited in Excel (or a similar program) by real users, or are they simply the output of third-party software or a process?

  • When you open these files in Excel (or a similar program) can you see any unusual/unexpected format already?

  • Please attach the whole stack trace of the graph run when you received the reported error message.

  • You mentioned a number of smaller input files and some other, supposedly bigger, files. Could you give me a rough estimate of the number of rows and columns in those files? In other words, what counts as a small input file in your use case?

  • You also mentioned that the outcome differs depending on the order in which you selected the spreadsheets. Could you clarify which order of files being read results in the proper column format and which does not?

  • What is your CloverETL Designer (and possibly CloverETL Server) version? For instance, 4.8.2.
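As background on the first issue: Excel stores a cell's raw value separately from its display (number) format, so a reader that returns the formatted string rather than the raw value can turn a numeric-looking policy number such as "12345" into "$12,345.00". The sketch below (hypothetical helper functions, not the CloverETL API) mimics that currency format and shows one way to normalize the value back:

```python
def format_as_currency(raw: str) -> str:
    """Mimic Excel's $#,##0.00 display format applied to a
    numeric-looking value."""
    return "${:,.2f}".format(float(raw))

def strip_currency(formatted: str) -> str:
    """Undo the currency formatting to recover the raw digits."""
    number = formatted.lstrip("$").replace(",", "")
    return str(int(float(number)))

assert format_as_currency("12345") == "$12,345.00"
assert strip_currency("$12,345.00") == "12345"
```

If the policy numbers really do arrive this way, stripping the formatting in a transform is only a workaround; the details requested above should reveal where the format is attached in the first place.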

Kind regards,