Reading jagged XML file: Oracle Endeca

Hi all,

I am rather new to CloverETL, using the technology to develop a datastore for an analytic web application (Oracle Endeca Information Discovery 2.3).

I am using the XMLExtract component to read a jagged file, essentially a scrape from a CMS directory. Each record in the file varies in width, carrying a different collection of source properties (anywhere from about 5 to 30).

Below is a snippet of the data:


[size=85]	
<RECORDS>

<RECORD>

		<PROP NAME="Source">
			<PVAL>CMS</PVAL>
		</PROP>
	
		<PROP NAME="Category">
			<PVAL></PVAL>
		</PROP>

		<PROP NAME="Product Active">
			<PVAL>No</PVAL>
		</PROP>
		
		<PROP NAME="Record Type-Legacy">
			<PVAL>No</PVAL>
		</PROP>
		
		<PROP NAME="Region">
			<PVAL>North America</PVAL>
		</PROP>
		
		<PROP NAME="Product Language">
			<PVAL>English</PVAL>
		</PROP>

</RECORD>

<RECORD>

[...]

</RECORD>

</RECORDS>[/size]

First, I tried it without a schema, using the following source code, but I cannot generate a tree structure to edit in the visual mapping editor:


[size=85]<Mappings>
	<Mapping element="RECORDS">
	</Mapping>
	<Mapping element="RECORD">
	</Mapping>
	<Mapping element="PROP"/>
	<Mapping element="PVAL">
	</Mapping>
</Mappings>[/size]

Also, I am unsure of how best to configure the metadata for this file.
I would like to map each PROP and its PVAL on a source record to a Clover field named after the PROP's NAME attribute. Is there a way to avoid doing this manually?
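To illustrate what I mean (this is plain Python, not Clover code, and all names here are my own), a rough sketch of the pivot I'm after: each RECORD's PROP NAME/PVAL pairs collapsed into one flat record keyed by property name:

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the sample above, for illustration.
SAMPLE = """
<RECORDS>
  <RECORD>
    <PROP NAME="Source"><PVAL>CMS</PVAL></PROP>
    <PROP NAME="Region"><PVAL>North America</PVAL></PROP>
  </RECORD>
</RECORDS>
"""

def pivot_records(xml_text):
    """Turn each RECORD into one dict keyed by its PROP NAME attributes."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.findall("RECORD"):
        row = {}
        for prop in record.findall("PROP"):
            pval = prop.find("PVAL")
            # Jagged data: each record carries only the PROPs it has.
            row[prop.get("NAME")] = pval.text if pval is not None else None
        rows.append(row)
    return rows

print(pivot_records(SAMPLE))
# → [{'Source': 'CMS', 'Region': 'North America'}]
```

With jagged input, each resulting record simply carries whichever keys its PROPs define, which is why I'd rather not hand-build metadata for every possible property.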

I would appreciate any direction or insight anyone could provide on this.

Many thanks!

-J

Hello,

Yes, this is possible. However, since the graph reads the metadata at its startup, you have to split the work into two graphs. The first one generates your metadata through the XMLWriter component, and the second graph can then use it. You can chain them together with the RunGraph component (as seen in the example). Also, the metadata field names must not contain whitespace, so I've added a Reformat component that replaces all unwanted characters with underscores.
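For clarity, the cleanup rule that Reformat applies can be sketched in plain Python (illustration only; the component itself does the same substitution in CTL):

```python
import re

def sanitize_field_name(name):
    # Replace every character that is not a letter, digit, or underscore
    # with an underscore, so the generated metadata field names are valid.
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

print(sanitize_field_name("Record Type-Legacy"))  # → Record_Type_Legacy
print(sanitize_field_name("Product Active"))      # → Product_Active
```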
I’ve created a sample solution, which you can view.

Many thanks - this was an extremely helpful post.

-J

Marek,

If you have a chance, I am wondering about the following error:

ERROR [WatchDog] - Node REMOVE_ALL_WHITE_SPACES finished with status: Error occurred in nested transformation: ERROR caused by: Message: Transform failed! caused by: org.jetel.ctl.TransformLangExecutorRuntimeException: Interpreter runtime exception in function replace on line 6 column 21 - Unexpected null value.

Using your solution I was able to successfully deploy data from a couple of files. However, execution of the transform fails on a large file with the same XML structure.

Does this error indicate malformed XML? The file reads and maps properly in another environment (Endeca InFront). Anything else worth noting here?

Thanks,
-J

I believe this issue is due to empty properties in the XML file, like the empty Category PVAL in the snippet above.

Is there a transformation I can use to remove these properties?

Hello,

If there are null records in your data, you need to strip them out of the data flow. You could add an extra Reformat for this; I just incorporated it into the one that removes all whitespace. It throws away every record that has no NAME set (no property name). If you want to do something with those records, you can dump them to a second output port, for instance, or give them a generic name and include them back in the stream.
Open the RemoveAllWhitespaces component in both of your graphs, switch to the “source” tab, and replace the code with the one provided below.

function integer transform() {
	if ($in.0.Name != null) { // Is the record OK?
		$out.0.Pval = $in.0.Pval;
		$out.0.Name = replace($in.0.Name, "[^a-zA-Z0-9_]", "_"); // Get rid of all "bad" characters.
		$out.0.id = $in.0.id;
		return OK; // This means "pass the record on".
	} else {
		return -1; // This means "throw away this record".
	}
}