XML parser fails due to illegal Unicode characters

I am having issues parsing an XML file containing Unicode control characters (e.g. \u0006). I have tried using regular expressions to remove them, without success. Any ideas?

Can you describe the problem in more detail? What did you try, and what went wrong?

An example would be a control character such as Acknowledge “\u0006”, which is not legal in XML.

I have tried multiple regular expressions to remove these before parsing the XML.

Hello,
the following expression works for me:

regex = "([^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]|[\\u0092\\u007f]+)"
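To check the expression outside of Clover, here is a minimal sketch that runs the same pattern through plain Java. The class name InvalidXmlRegexDemo and the method strip are made up for illustration, and I am assuming the Clover regex dialect behaves like java.util.regex here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Clover.
final class InvalidXmlRegexDemo {
    // Same expression as above. The doubled backslashes are Java string
    // escaping, so the regex engine receives the Unicode escapes itself.
    static final Pattern INVALID = Pattern.compile(
        "([^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]|[\\u0092\\u007f]+)");

    // Returns the text with every match of the pattern removed.
    static String strip(String text) {
        return INVALID.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // (char) 0x06 is the Acknowledge control character from the question.
        String dirty = "<a>x" + (char) 0x06 + "y</a>";
        System.out.println(strip(dirty)); // prints <a>xy</a>
    }
}
```

Tab, line feed and carriage return are kept; the second alternative additionally drops U+0092 and U+007F, which XML 1.0 permits but which are rarely wanted.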

Agata,

The provided regex is still not working. Can you send me the Clover syntax, e.g. replace(, regex, ‘’);? Also, what other options do I have besides replacing explicit characters or ranges?

Thanks,
Ryan

Hello Ryan,
the following code replaces the invalid parts with empty strings:

//#CTL2

// Transforms input record into output record.
function integer transform() {
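	// Matches any character outside the XML 1.0 ranges listed here, plus U+0092 and U+007F.
	string regex = "([^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]|[\\u0092\\u007f]+)";
	// Replace every match in the Data field with an empty string.
	$0.Data = replace($Data,regex,"");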

	return ALL;
}

I’ve also developed a transformation in Java that removes the value of an invalid element and sends it, together with the number of the invalid record and the name of the invalid element, to another output port of Reformat (see attached class).
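As for options other than a regex: a plain Java check against the character ranges from the XML 1.0 specification is one possibility. The following is only a minimal sketch of that idea, not the attached class; XmlCharFilter, isLegalXmlChar, strip and countIllegal are hypothetical names, and wiring it into a Reformat component is left out.

```java
// Hypothetical helper; illustrates a code-point check instead of a regex.
final class XmlCharFilter {
    // Ranges of the "Char" production in the XML 1.0 specification.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies only the legal code points into a new string.
    static String strip(String text) {
        StringBuilder clean = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlCharFilter::isLegalXmlChar)
            .forEach(clean::appendCodePoint);
        return clean.toString();
    }

    // Counts illegal code points, e.g. to decide whether to reject a record.
    static int countIllegal(String text) {
        return (int) text.codePoints()
            .filter(cp -> !isLegalXmlChar(cp))
            .count();
    }
}
```

Because it works on code points rather than single chars, this version keeps characters above U+FFFF and drops unpaired surrogates. Unlike the regex, it does not remove U+007F or U+0092, since both are legal in XML 1.0.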

Thanks, Agata, this worked.