Filtering nodes do not support multi-language character sets

hwhwhw · January 4, 2008, 12:00am

In filtering nodes, regular-expression character classes such as \w and \s appear to match only ASCII characters; multi-language character sets are not supported.

e.g.
$Field1 ~= "^[0-9]\w.*"
substring($Field2,0,2) == "黄威"

Can this be supported?

dpavlis · January 6, 2008, 11:26am

Can you try using Unicode escape sequences in place of the characters (both in the regex and in the substring comparison)?

Also, I am not sure what problem you are describing. Is it that 黄威 is not recognized as \w?

Sorry, I am not familiar with Asian alphabets and need a hint here.

hwhwhw · January 7, 2008, 7:50am

table structure:
create table t1 (f1 varchar(50), f2 varchar(50));

record content:
黄威 20071976北京
huangwei 20071976beijing

extFilter node expression:
$f2 ~= '^[0-9]{8}[a-z]*'

outPort (0) output record
huangwei 20071976beijing

outPort (1) output record
黄威 20071976北京
----------------------------------------------------------
I want outPort (0) to output the record below:
黄威 20071976北京
----------------------------------------------------------
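
For readers following along: the result above is consistent with how Java's `java.util.regex` behaves, assuming the `~=` operator requires the whole field to match (an assumption drawn from the output shown, not from Clover's source). The sketch below uses only the standard library; the class name `AsciiClassDemo` and the helper `fullMatch` are made up for illustration.

```java
import java.util.regex.Pattern;

public class AsciiClassDemo {
    // Whole-string match, which is what the filter output above suggests.
    static boolean fullMatch(String regex, String value) {
        return Pattern.matches(regex, value);
    }

    public static void main(String[] args) {
        String latin = "20071976beijing";
        // The second record's f2 value, written with Unicode escapes
        // so the source file stays ASCII-only.
        String chinese = "20071976\u5317\u4eac";

        // [a-z] covers ASCII lowercase letters only.
        System.out.println(fullMatch("^[0-9]{8}[a-z]*", latin));    // true
        System.out.println(fullMatch("^[0-9]{8}[a-z]*", chinese));  // false

        // By default \w is ASCII-only in java.util.regex,
        // while \p{L} matches letters from any script.
        System.out.println(fullMatch("^[0-9]{8}\\w*", chinese));    // false
        System.out.println(fullMatch("^[0-9]{8}\\p{L}*", chinese)); // true
    }
}
```

If the filter does hand the pattern to Java unchanged, a Unicode category such as \p{L} is one way to accept letters from any script.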

extFilter node expression:
$f2 ~= '^[0-9]{8}\p{InHanzi}*'
output error info:
ERROR [WatchDog] - EXT_FILTER_0 …FAILED !
Parser error when parsing expression: Encountered “\'^[0-9]{8}” at line 1, column

Was expecting:
<STRING_LITERAL> …

extFilter node expression:
substring($f3,8,2)=='北京'

outPort (0) output record 0
outPort (1) output record 2
黄威 20071976北京
huangwei 20071976beijing

dpavlis · January 7, 2008, 8:46am

If you use a backslash (\) in a regex string in the transform language, you have to escape it, like this:


$f2 ~= '^[0-9]{8}\\\\p{InHanzi}*'

The reason is that the backslash gets preprocessed twice: first when the expression is read from XML, where \\ is converted to \, and then again by the TL parser, which also converts \\ to \. Only then does the string reach Java’s regex evaluator.
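
The two passes described above can be simulated in plain Java. This is only a rough model of the explanation, not Clover's actual code: `unescapeOnce` is a hypothetical stand-in for a single preprocessing step that collapses each doubled backslash into one.

```java
public class EscapeLayersDemo {
    // Hypothetical stand-in for one preprocessing pass:
    // every pair of backslashes becomes a single backslash.
    static String unescapeOnce(String s) {
        return s.replace("\\\\", "\\");
    }

    public static void main(String[] args) {
        // Eight backslashes in a Java literal are four backslash characters,
        // i.e. the expression text as typed in the post above.
        String typed = "^[0-9]{8}\\\\\\\\p{InHanzi}*";
        String afterFirstPass = unescapeOnce(typed);
        String afterSecondPass = unescapeOnce(afterFirstPass);

        System.out.println(typed);           // four backslashes before p
        System.out.println(afterFirstPass);  // two backslashes before p
        System.out.println(afterSecondPass); // one backslash before p
    }
}
```

Under this model, four typed backslashes are needed for a single backslash to reach the regex engine. If only one pass actually runs, two backslashes survive and the engine sees a literal backslash instead of the start of \p{...}.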

We will try to fix this nuisance (present in 2.3.x and earlier) in the next release of Clover.

I will check the rest of the problem too, but check the updated expression above.

hwhwhw · January 7, 2008, 9:19am

ERROR [WatchDog] - EXT_FILTER_0 …FAILED !
Error when parsing expression: Illegal repetition near index 11
^[0-9]{8}\\p{InHanzi}*

--------------------------------

substring($f2,8,2)=='北京'

Why does the substring function not support "北京"?

dpavlis · January 7, 2008, 9:30am

Well, interesting problem with the regex… I will look into it…

As for the substring, try using Unicode escapes (\uxxxx) in place of the two characters; you will have to find their Unicode code points.
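
One way to look up those escapes is a few lines of Java. This is a minimal sketch using only the standard library; the class and method names are made up for illustration, and the default input is written in escaped form so the source file stays ASCII-only.

```java
public class UnicodeEscapeDemo {
    // Builds the backslash-u form of every UTF-16 code unit in the input.
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Pass the text as the first argument, or fall back to the two
        // characters from the question (given here in escaped form).
        String input = args.length > 0 ? args[0] : "\u5317\u4eac";
        System.out.println(toUnicodeEscapes(input));
    }
}
```

Run without arguments, this prints the escaped form of the two characters from the question, which can then be pasted into the filter expression.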

avackova · January 7, 2008, 12:03pm

I’ve found that such regex does not throw an exception:
"^[0-9]{8}[\\p{InHanzi}]*"
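
A caution on this workaround, checked against plain `java.util.regex`: assuming the regex engine receives the pattern text exactly as printed in the error message (two backslash characters before `p`), the bracketed form compiles only because everything inside the brackets becomes a set of literal characters. The sketch below is not Clover code; `BackslashRegexDemo` and `compileError` are illustrative names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BackslashRegexDemo {
    // Returns null when the pattern compiles, else the syntax error description.
    static String compileError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Pattern text as printed in the error: two backslash characters.
        // The engine reads a literal backslash, a literal p, then tries to
        // parse {InHanzi} as a repetition count.
        String reported = "^[0-9]{8}\\\\p{InHanzi}*";
        System.out.println(compileError(reported)); // Illegal repetition

        // Inside brackets the same text is a set of literal characters:
        // backslash, p, braces and the letters of "InHanzi".
        String bracketed = "^[0-9]{8}[\\\\p{InHanzi}]*";
        System.out.println(compileError(bracketed)); // null
        System.out.println(Pattern.matches(bracketed, "20071976\u5317\u4eac")); // false
        System.out.println(Pattern.matches(bracketed, "20071976Hanzi"));        // true

        // With a single backslash the engine looks for a Unicode block called
        // Hanzi, which java.util.regex does not define, so this fails as well.
        String singleBackslash = "^[0-9]{8}\\p{InHanzi}*";
        System.out.println(compileError(singleBackslash) != null); // true
    }
}
```

So under that assumption the bracketed pattern avoids the exception but still does not match Chinese characters.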

hwhwhw · January 8, 2008, 2:47am

Thank you for your response. The substring function issue has been resolved.

dpavlis · January 8, 2008, 9:35am

Cool,
can I ask how you solved it?

hwhwhw · January 9, 2008, 12:48am

The solution is a bit inconvenient. The process is this:

D:\javasoft\Jdk1.5.0_04\bin>native2ascii
北京
\u5317\u4eac

extFilter node expression:
substring($f2,8,2)=='\u5317\u4eac'

====================================
extFilter node expression:
$f2 ~= '^[0-9]{8}[\u4e00-\u9fa5]*'

[\u4e00-\u9fa5] represents the Asian (CJK) character range. Doing it this way is somewhat cumbersome.
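
For comparison, here is how the same two checks behave in plain Java. This sketch only illustrates `java.util.regex` and `String.substring`; whether Clover's filter passes a named block through unchanged is not confirmed in this thread. The block name `InCJKUnifiedIdeographs` is the standard Java name and is used here in place of the explicit range as an assumption about what might read more clearly. Note that Java's `substring` takes begin and end indexes, whereas the filter call above takes a start and a length.

```java
import java.util.regex.Pattern;

public class CjkRangeDemo {
    // Explicit code point range from the thread: U+4E00 to U+9FA5.
    static final Pattern RANGE =
        Pattern.compile("^[0-9]{8}[\\u4e00-\\u9fa5]*");

    // Named Unicode block, an alternative that avoids hard-coded bounds.
    static final Pattern BLOCK =
        Pattern.compile("^[0-9]{8}\\p{InCJKUnifiedIdeographs}*");

    public static void main(String[] args) {
        String chinese = "20071976\u5317\u4eac";
        String latin = "20071976beijing";

        System.out.println(RANGE.matcher(chinese).matches()); // true
        System.out.println(RANGE.matcher(latin).matches());   // false
        System.out.println(BLOCK.matcher(chinese).matches()); // true

        // Java's substring(8, 10) covers the same two characters as
        // substring($f2,8,2) in the filter expression.
        System.out.println(chinese.substring(8, 10).equals("\u5317\u4eac")); // true
    }
}
```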
