Reading jagged XML file: Oracle Endeca

Hi all,

I am rather new to CloverETL, using the technology to develop a datastore for an analytic web application (Oracle Endeca Information Discovery 2.3).

I am using the XMLExtract component to read a jagged file, essentially a scrape from a CMS directory. Each record in the file varies in width, carrying a different collection of source properties (anywhere from about 5 to 30).

Below is a snippet of the data:


[size=85]	
<RECORDS>

<RECORD>

		<PROP NAME="Source">
			<PVAL>CMS</PVAL>
		</PROP>
	
		<PROP NAME="Category">
			<PVAL></PVAL>
		</PROP>

		<PROP NAME="Product Active">
			<PVAL>No</PVAL>
		</PROP>
		
		<PROP NAME="Record Type-Legacy">
			<PVAL>No</PVAL>
		</PROP>
		
		<PROP NAME="Region">
			<PVAL>North America</PVAL>
		</PROP>
		
		<PROP NAME="Product Language">
			<PVAL>English</PVAL>
		</PROP>

</RECORD>

<RECORD>

[...]

</RECORD>

</RECORDS>[/size]

First, I tried it without a schema, using the following source code, but I cannot generate a tree structure to edit in the visual mapping editor:


[size=85]<Mappings>
	<Mapping element="RECORDS">
	</Mapping>
	<Mapping element="RECORD">
	</Mapping>
	<Mapping element="PROP"/>
	<Mapping element="PVAL">
	</Mapping>
</Mappings>[/size]

Also, I am unsure of how best to configure the metadata for this file.
I would like to map each PROP and its PVAL on a source record to a Clover field named after the PROP's NAME attribute. Is there a way to avoid doing this manually?
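To illustrate what I mean (this is plain Python, not Clover code, and all names here are my own), a rough sketch of the pivot I'm after: each RECORD's PROP NAME/PVAL pairs collapsed into one flat record keyed by property name:

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the sample above, for illustration.
SAMPLE = """
<RECORDS>
  <RECORD>
    <PROP NAME="Source"><PVAL>CMS</PVAL></PROP>
    <PROP NAME="Region"><PVAL>North America</PVAL></PROP>
  </RECORD>
</RECORDS>
"""

def pivot_records(xml_text):
    """Turn each RECORD into one dict keyed by its PROP NAME attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall("RECORD"):
        row = {}
        for prop in record.findall("PROP"):
            pval = prop.find("PVAL")
            # Jagged data: each record carries only the PROPs it has.
            row[prop.get("NAME")] = pval.text if pval is not None else None
        rows.append(row)
    return rows

print(pivot_records(SAMPLE))
# → [{'Source': 'CMS', 'Region': 'North America'}]
```

With jagged input, each resulting record simply carries whichever keys its PROPs define, which is why I'd rather not hand-build metadata for every possible property.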

I would appreciate any direction or insight anyone could provide on this.

Many thanks!

-J

Hello,

Yes, this is possible. However, since the graph reads the metadata at its startup, you have to split the work into two graphs. The first one generates your metadata through the XMLWriter component, and the second graph can then use it. You can chain them together with the RunGraph component (as seen in the example). Also, the metadata field names must not contain whitespace, so I've added a Reformat component that replaces all unwanted characters with underscores.
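For clarity, the cleanup rule that Reformat applies can be sketched in plain Python (illustration only; the component itself does the same substitution in CTL):

```python
import re

def sanitize_field_name(name):
    # Replace every character that is not a letter, digit, or underscore
    # with an underscore, so the generated metadata field names are valid.
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

print(sanitize_field_name("Record Type-Legacy"))  # → Record_Type_Legacy
print(sanitize_field_name("Product Active"))      # → Product_Active
```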
I’ve created a sample solution, which you can view.

Many thanks - this was an extremely helpful post.

-J

Marek,

If you have a chance, I am wondering about the following error:

ERROR [WatchDog] - Node REMOVE_ALL_WHITE_SPACES finished with status: Error occurred in nested transformation: ERROR caused by: Message: Transform failed! caused by: org.jetel.ctl.TransformLangExecutorRuntimeException: Interpreter runtime exception in function replace on line 6 column 21 - Unexpected null value.

Using your solution I was able to successfully deploy data from a couple of files. However, execution of the transform fails on a large file with the same XML structure.

Does this error indicate malformed XML? The file reads and maps properly in another environment (Endeca InFront). Anything else worth noting here?

Thanks,
-J

I believe this issue is due to empty properties in the XML file, like the empty Category PVAL in the snippet above.

Is there a transformation I can use to remove these properties?

Hello,

If there are null records in your data, you need to strip them out of the data flow. You could add an extra Reformat for this; I just incorporated it into the one that removes all whitespace. It throws away every record that has no NAME set (no property name). If you want to do something with those records, you can dump them to a second output port, for instance, or give them a generic name and include them back in the stream.
Open the RemoveAllWhitespaces component in both of your graphs, switch to the “source” tab, and replace the code with the one provided below.

function integer transform() {
	if ($in.0.Name != null) { // Is the record OK?
		$out.0.Pval = $in.0.Pval;
		$out.0.Name = replace($in.0.Name, "[^a-zA-Z0-9_]", "_"); // Get rid of all "bad" characters.
		$out.0.id = $in.0.id;
		return OK; // This means "pass the record on".
	} else {
		return -1; // This means "throw away this record".
	}
}