How to read a flat file or XML file as one string field?

How to read the whole flat file or XML file as one field of string data type?
------------------------------------------------------------------------

If you want to read a flat file or an XML file as one string field, use UniversalDataReader and set the metadata on its output port as follows:


<Metadata id="Metadata0" >
<Record name="recordName1"  type="delimited">
<Field eofAsDelimiter="true" name="field1" type="string"/>
</Record>
</Metadata>

In other words, you must delete both the field delimiter and the record delimiter and set the EOF as delimiter attribute to true. This way the whole file will be read as a single record consisting of one field of string data type.
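To see the same idea outside a graph, here is a minimal Python sketch (not CloverETL code) of what "EOF as delimiter" amounts to: the entire file content becomes the value of one string field. The function name `read_file_as_one_field` and the sample file are assumptions made for illustration; only the field name `field1` comes from the metadata above.

```python
from pathlib import Path
import tempfile


def read_file_as_one_field(path, encoding="utf-8"):
    """Return the whole file as one record holding a single string field."""
    text = Path(path).read_text(encoding=encoding)
    return {"field1": text}


# Write a small sample file, then read it back as one record.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.xml"
    sample.write_text("<root>\n  <foo>bar</foo>\n</root>\n", encoding="utf-8")
    record = read_file_as_one_field(sample)

print(record["field1"])
```

No delimiter is ever searched for, so line breaks and markup inside the file stay part of the field value.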

Great example, but I have a slightly more complicated request:

How would one go about reading an XML node and all its children as plain text, preferably using one of the XML readers? For example:


<root>
  <node1>
    <node2>
       <foo>bar</foo>
    </node2>
  </node1>
</root>

Let’s say I wanted node2 sent to output port 0. I’d like to be able to define one of the metadata fields as a string and have it contain the following:


    <node2>
       <foo>bar</foo>
    </node2>

Which I could then pass to another XML reader or store in another document. I could always use the above example then split out the nodes I want using regular expressions, but I would prefer to avoid that method if at all possible.

Hello Mike,

I do not understand why you want to save such an intermediate XML file. You can store the original XML file and read whatever you want, whenever you want, using XMLExtract or XMLXPathReader. You do not need to store any subXML file before reading it.

Nevertheless, should you really insist on saving a subXML file, you could use our XSLTransformer component.

You would read the original XML file with UniversalDataReader, using metadata that has only one string field and “EOF as delimiter” as the only delimiter. The data would then go to XSLTransformer, where you could use the following transformation:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version='1.0'>

  <xsl:template match="node1">
    <xsl:copy-of select="*"/>
  </xsl:template>
</xsl:stylesheet>

The metadata on the XSLTransformer output would be the same as the metadata on its input.

The result will be what you want with the following header:


<?xml version="1.0" encoding="UTF-8"?>

You can write the result with UniversalDataWriter to a yoursubfile.xml file or, if you want to remove the header, you can insert another UniversalDataReader between XSLTransformer and this UniversalDataWriter. In this reader, File URL would be:


port:$0.yourstringfieldname:discrete

and the metadata on its output (between this UniversalDataReader and UniversalDataWriter) would have only one string field with \r\n as delimiter. Skip source rows would be set to 1.

This way you would get the result you requested.
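For comparison, the effect of the `xsl:copy-of select="*"` template can be sketched in plain Python with the standard library. This is only an illustration under stated assumptions, not what XSLTransformer does internally: the helper name `children_as_text` is hypothetical, and `xml.etree.ElementTree` stands in for the XSLT processor. Serializing with `encoding="unicode"` yields a string without the XML declaration, so no header has to be skipped afterwards.

```python
import xml.etree.ElementTree as ET

SOURCE = """<root>
  <node1>
    <node2>
       <foo>bar</foo>
    </node2>
  </node1>
</root>"""


def children_as_text(xml_text, parent_tag):
    """Serialize every child element of the first <parent_tag> as a string."""
    root = ET.fromstring(xml_text)
    parent = root if root.tag == parent_tag else root.find(f".//{parent_tag}")
    if parent is None:
        raise ValueError(f"no <{parent_tag}> element found")
    parts = []
    for child in parent:
        child.tail = None  # drop the whitespace that follows the closing tag
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Everything below <node1>, i.e. the <node2> subtree, as one string.
subxml = children_as_text(SOURCE, "node1")
print(subxml)
```

The resulting string is itself well-formed XML, so it can be handed to another parser or stored in a string field.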

(You could also use XMLExtract or XMLXPathReader - for reading the original XML file - connected to XMLWriter in which you would create such a subXML file.)

As you can see, what you wanted is possible, but parsing the original XML file directly is the better and recommended approach.

Best regards

Tomas Waller

Thanks for the help

The problem is that I’m going to be dealing with potentially 100s of output ports due to the complexity of the XML file. I’m not sure it’s reasonable to have a component with that many ports - it will become far too confusing, and if I change one port, all of them will get screwed up. This is to help split the process up into multiple readers, or even into multiple graphs. Let me know if you have any better suggestions, especially one with less of a performance hit.

Hello Mike,

Reading any XML file directly with XMLExtract or XMLXPathReader is much faster than going through an intermediate transformation.

However, if you need to split the original XML file into multiple subXML files, you can do so as described above.

For example, if you have a file with the following structure:


<root>
  <node>
    <node1>
      <node2>
        <foo>bar1</foo>
      </node2>
    </node1>
  </node>
  <node>
    <node1>
      <node2>
        <foo>bar2</foo>
      </node2>
    </node1>
  </node>
  <node>
    <node1>
      <node2>
        <foo>bar3</foo>
      </node2>
    </node1>
  </node>
  <node>
    <node1>
      <node2>
        <foo>bar4</foo>
      </node2>
    </node1>
  </node>
</root>

Create the following graph:

UniversalDataReader → XSLTransformer → UniversalDataWriter.

The first metadata has no delimiter other than “EOF as delimiter”.

XSLTransformer will use the following transformation code:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version='1.0'>

<xsl:output omit-xml-declaration="yes"/>

  <xsl:template match="node">
    <xsl:copy-of select="*"/>
  </xsl:template>

</xsl:stylesheet>

The second metadata will have only a record delimiter (delete the default delimiter). The record delimiter will be:

In UniversalDataWriter, Records per file will be set to 1.

File URL in UniversalDataWriter will be ${DATATMP_DIR}/yoursubxmlfile$$$.xml (the number of wildcard characters should be sufficient for the number of files created).

The resulting files will be named yoursubxmlfile000.xml to yoursubxmlfile050.xml, for example. The last one may contain only the delimiter (). If so, delete it.

Then you can use XMLExtract or XMLXPathReader where File URL is ${DATATMP_DIR}/yoursubxml*.xml.

This way, all these input files will be read one after another, and the number of your output edges will be smaller than if the original XML file were read directly.

But remember, this will be slower than if you read the original XML file.
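The splitting step can also be pictured with a short Python sketch. It is a stand-in for the reader, transformer and writer chain above, not CloverETL code: the function name `split_nodes` is hypothetical, the temporary directory plays the role of `${DATATMP_DIR}`, and the three-digit counter mirrors the `$$$` wildcard. The sample input is the four-node structure shown earlier, written without indentation.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Same structure as the example file, built compactly for the sketch.
SOURCE = "<root>" + "".join(
    f"<node><node1><node2><foo>bar{i}</foo></node2></node1></node>"
    for i in range(1, 5)
) + "</root>"


def split_nodes(xml_text, out_dir, tag="node", stem="yoursubxmlfile"):
    """Write the children of each <tag> element to its own numbered file."""
    root = ET.fromstring(xml_text)
    paths = []
    for index, node in enumerate(root.iter(tag)):
        parts = []
        for child in node:
            child.tail = None  # keep only the element itself
            parts.append(ET.tostring(child, encoding="unicode"))
        path = Path(out_dir) / f"{stem}{index:03d}.xml"
        path.write_text("".join(parts), encoding="utf-8")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = split_nodes(SOURCE, tmp)
    names = [p.name for p in written]
    contents = [p.read_text(encoding="utf-8") for p in written]

print(names)
```

Each output file holds one `<node1>` subtree and no XML declaration, so the files can be read back one after another with a wildcard pattern.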

Best regards,

Tomas Waller

Hello Mike,

You can also send all the data to the input port of XMLExtract or XMLXPathReader and process the whole transformed output directly as one large XML file, which is what you wanted.

Thus, you would have:

UniversalDataReader → XSLTransformer → XMLExtract(XMLXPathReader) -(many edges)-> many writers.

File URL attribute in XMLExtract or XMLXPathReader would be:


port:$0.yourstringfieldname:stream

Regards,

Tomas Waller