Filtering nodes do not support multi-language character sets

In filter nodes, regular expression character classes such as \w, \s, and so on do not support multi-language character sets.

For example:
$Field1 ~= "^[0-9]\w.*"
substring($Field2,0,2) == "黄威"

Can this be supported?

Can you try using Unicode escape sequences in place of the characters (both in the regex and in the substring)?

Also, I am not sure what problem you are describing. Is it that 黄威 is not recognized as \w?

Sorry, I am not familiar with Asian alphabets and need a hint here.

table structure:
create table t1 (f1 varchar(50), f2 varchar(50));

record content:
黄威 20071976北京
huangwei 20071976beijing

extFilter node expression:
$f2 ~= '^[0-9]{8}[a-z]*'

outPort (0) output record
huangwei 20071976beijing

outPort (1) output record
黄威 20071976北京
----------------------------------------------------------
I want outPort (0) to output the record below:
黄威 20071976北京
----------------------------------------------------------

extFilter node expression:
$f2 ~= '^[0-9]{8}\p{InHanzi}*'
output error info:
ERROR [WatchDog] - EXT_FILTER_0 …FAILED !
Parser error when parsing expression: Encountered “\'^[0-9]{8}” at line 1, column

Was expecting:
<STRING_LITERAL> …

extFilter node expression:
substring($f2,8,2)=='北京'

outPort (0) output records: 0
outPort (1) output records: 2
黄威 20071976北京
huangwei 20071976beijing

If you use \ (backslash) in your regex string in the transform language, you have to escape it, like this:


$f2 ~= '^[0-9]{8}\\\\p{InHanzi}*'

The reason is that the backslash gets preprocessed twice: first when the expression is read from XML, where \\ is reduced to \, and then again by the TL parser, which also reduces \\ to \. Only then does the string reach Java’s regex evaluator.
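
To make the double preprocessing concrete, here is a minimal Java sketch. The helper unescapeOnce is hypothetical: it only mimics a pass that collapses each pair of backslashes into one, and it is not Clover's actual parser code. The block name InCJKUnifiedIdeographs is an assumption swapped in for illustration, because it is a name Java's regex engine is known to accept.

```java
import java.util.regex.Pattern;

class DoubleUnescapeDemo {
    // Hypothetical stand-in for one preprocessing pass:
    // every pair of backslashes collapses into a single backslash.
    static String unescapeOnce(String s) {
        return s.replace("\\\\", "\\");
    }

    public static void main(String[] args) {
        // Text as typed into the expression: four backslashes before p.
        // (In Java source each backslash is itself doubled.)
        String typed = "^[0-9]{8}\\\\\\\\p{InCJKUnifiedIdeographs}*";
        String afterXml = unescapeOnce(typed);   // two backslashes remain
        String afterTl = unescapeOnce(afterXml); // one backslash remains

        System.out.println(typed);
        System.out.println(afterXml);
        System.out.println(afterTl);

        // The fully unescaped text is a valid Java regex.
        String sample = "20071976\u5317\u4eac"; // eight digits followed by two Chinese characters
        System.out.println(Pattern.compile(afterTl).matcher(sample).matches());
    }
}
```

If only one of the two passes actually runs, the regex engine would still see two backslashes, which it treats as a literal backslash rather than the start of \p{...}.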

We will try to fix this nuisance (present in 2.3.x and earlier) in the next release of Clover.

I will look into the rest of the problem too, but please try the updated expression above.

ERROR [WatchDog] - EXT_FILTER_0 …FAILED !
Error when parsing expression: Illegal repetition near index 11
^[0-9]{8}\\p{InHanzi}*

--------------------------------

substring($f2,8,2)=='北京'

Why does the substring function not support "北京"?

Well, interesting problem with the regex… I will look into it…

As for the substring, try using Unicode escapes (\uxxxx) in place of the two characters. You will have to look up their Unicode code points.

I’ve found that the following regex does not throw an exception:
"^[0-9]{8}[\\p{InHanzi}]*"
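
One possible reading of these two results, sketched in plain Java rather than in Clover: the error message echoes the pattern with two backslashes. If that is the text the regex engine received, then \\ is a literal backslash, p is a plain letter, and {InHanzi} is parsed as a malformed quantifier. Wrapping the same text in brackets turns all of those characters into members of a character class, so it compiles, but under that assumption it matches those literal characters rather than Chinese text. The class and method names below are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BackslashLiteralDemo {
    // Returns true when Java's regex compiler accepts the given regex text.
    static boolean compiles(String regexText) {
        try {
            Pattern.compile(regexText);
            return true;
        } catch (PatternSyntaxException e) {
            System.out.println("rejected: " + e.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        // Regex text with TWO backslashes before p (doubled again for Java source).
        String bare = "^[0-9]{8}\\\\p{InHanzi}*";
        String bracketed = "^[0-9]{8}[\\\\p{InHanzi}]*";

        System.out.println(compiles(bare));      // the brace after p is not a valid quantifier
        System.out.println(compiles(bracketed)); // inside brackets the same characters are literals

        Pattern p = Pattern.compile(bracketed);
        // Letters that happen to appear inside the brackets are accepted...
        System.out.println(p.matcher("20071976pain").matches());
        // ...while the two Chinese characters are not.
        System.out.println(p.matcher("20071976\u5317\u4eac").matches());
    }
}
```

So a pattern that merely stops throwing is not necessarily doing the intended match; it is worth checking it against a record that should pass.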

Thank you for your response. The substring function issue has been resolved.

Cool,
can I ask how you solved it?

The solution is a bit inconvenient. The process is this:

D:\javasoft\Jdk1.5.0_04\bin>native2ascii
北京
\u5317\u4eac
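
If native2ascii is not at hand, the same conversion can be sketched in a few lines of Java. The class and method names here are hypothetical, and the sketch covers only this one direction (characters to escapes), producing the lowercase \uxxxx form shown above.

```java
class UnicodeEscapeDemo {
    // Replaces every non-ASCII UTF-16 code unit with a backslash-u escape
    // made of four lowercase hex digits; ASCII characters pass through unchanged.
    static String toUnicodeEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The two Chinese characters from the record, written as escapes
        // so the source file itself stays ASCII-only.
        String beijing = "\u5317\u4eac";
        System.out.println(toUnicodeEscapes(beijing));
    }
}
```

Run on the two characters 北京, this prints the same two escapes that native2ascii produced.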

extFilter node expression:
substring($f2,8,2)=='\u5317\u4eac'

====================================
extFilter node expression:
$f2 ~= '^[0-9]{8}[\u4e00-\u9fa5]*'

[\u4e00-\u9fa5] represents the range of Chinese characters. This approach works, but it is somewhat cumbersome.
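
For comparison, here is a small plain-Java sketch of the explicit range next to a named Unicode block. The block name InCJKUnifiedIdeographs is an assumption substituted for illustration (it is a name Java's regex engine accepts); whether it can be passed through a Clover filter expression unchanged depends on the backslash handling discussed earlier in this thread. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class HanRangeDemo {
    // Explicit code point range, as in the filter expression above.
    static final Pattern BY_RANGE =
            Pattern.compile("^[0-9]{8}[\\u4e00-\\u9fa5]*");

    // Named Unicode block that contains the same range.
    static final Pattern BY_BLOCK =
            Pattern.compile("^[0-9]{8}\\p{InCJKUnifiedIdeographs}*");

    // True when the whole value is eight digits followed only by characters
    // accepted by the given pattern.
    static boolean accepts(Pattern pattern, String value) {
        return pattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        String chinese = "20071976\u5317\u4eac"; // eight digits plus two Chinese characters
        String latin = "20071976beijing";

        System.out.println(accepts(BY_RANGE, chinese));
        System.out.println(accepts(BY_BLOCK, chinese));
        System.out.println(accepts(BY_RANGE, latin));
        System.out.println(accepts(BY_BLOCK, latin));
    }
}
```

Both patterns accept the Chinese record and reject the Latin one when the whole value must match.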