XML Parser failed due to illegal Unicode characters

rstark · September 2, 2011, 12:00am

I am having issues parsing an XML file containing Unicode control characters (e.g. \u0006). I have tried using regular expressions to remove them, without success. Any ideas?

avackova · September 5, 2011, 10:29am

Can you describe the problem in more detail? What did you try, and what went wrong?

rstark · September 12, 2011, 1:45pm

An example would be a control character such as Acknowledge (“\u0006”), which is not legal in XML.

rstark · September 12, 2011, 8:39pm

I have tried multiple regular expressions to remove these characters prior to parsing the XML.

avackova · September 13, 2011, 7:37am

Hello,
the following expression works for me:

regex = "([^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]|[\\u0092\\u007f]+)"
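
For readers who want to try the expression outside CloverDX, here is a minimal Java sketch. It assumes the expression is interpreted by java.util.regex; the class name XmlCharFilter and the method strip are hypothetical and only serve the illustration.

```java
import java.util.regex.Pattern;

final class XmlCharFilter {
    // Same expression as in the post above: any character outside the listed
    // XML 1.0 ranges, plus runs of U+0092 and U+007F.
    private static final Pattern INVALID = Pattern.compile(
        "([^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]|[\\u0092\\u007f]+)");

    // Returns the input with every match of the expression removed.
    public static String strip(String input) {
        return INVALID.matcher(input).replaceAll("");
    }
}
```

Calling XmlCharFilter.strip on a string that contains the Acknowledge character (U+0006) returns the string without it, while tabs, line feeds and carriage returns are kept. One caveat: the listed ranges stop at U+FFFD, so in java.util.regex this expression also removes supplementary characters (U+10000 and above), which XML 1.0 does allow.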

rstark · September 16, 2011, 1:00am

Agata,

The provided regex is still not working. Can you send me the Clover syntax, e.g. replace(, regex, ‘’);? Also, what other options do I have besides replacing explicit characters or ranges?

Thanks,
Ryan

avackova · September 16, 2011, 1:10pm

Hello Ryan,
the following code replaces the invalid characters with empty strings:

//#CTL2

// Transforms input record into output record.
function integer transform() {
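	// Matches any character outside the listed XML 1.0 ranges, plus U+0092 and U+007F.
	string regex = "([^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]|[\\u0092\\u007f]+)";
	// Remove every match from the Data field.
	$0.Data = replace($Data, regex, "");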

	return ALL;
}

I’ve also developed a Java transformation that removes an invalid element’s value and sends it, together with the number of the invalid record and the name of the invalid element, to another output port of Reformat (see the attached class).

RemoveInvalidChars.java
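
The attached class is not reproduced in this thread. As a rough illustration of the idea only, the plain Java sketch below blanks any element value that contains a character outside the XML 1.0 Char production and reports the record number and element name. It does not use the CloverDX API; the class InvalidElementReport, its methods, and the Map representation of a record are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class InvalidElementReport {
    // XML 1.0 Char production, checked per code point.
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // True when every code point of the value is a legal XML 1.0 character.
    public static boolean isValid(String value) {
        return value.codePoints().allMatch(InvalidElementReport::isXmlChar);
    }

    // Blanks each invalid value in place and returns one
    // "recordNo:elementName" entry per rejected element.
    public static List<String> clean(int recordNo, Map<String, String> record) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (e.getValue() != null && !isValid(e.getValue())) {
                rejected.add(recordNo + ":" + e.getKey());
                e.setValue("");
            }
        }
        return rejected;
    }
}
```

In a real graph the rejected entries would go to a second output port instead of a list. Unlike the regular expression, this per-code-point check keeps supplementary characters, which XML 1.0 permits.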

rstark · September 20, 2011, 8:30pm

Thanks Agata, this worked.
