Convert CDATA expression to Javascript RegExp

Max

Hello everyone!

Can anyone help me convert the CDATA expression "CDATA ::= (Char* -
(Char* ']]>' Char*))" to a JavaScript regular expression?

Thanks,

Max
 
Joseph Kesselman

Translation to English: A CDATA's value can contain any legal XML
characters except the three-character sequence ]]> (which is used to
terminate the value).

I don't do Javascript, so you'll have to translate it the rest of the
way yourself.
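
That description translates fairly directly into a check that needs no regular expression at all. A minimal sketch, assuming a hypothetical helper name isCDataContent and leaving aside the Char production's restrictions on control characters:

```javascript
// Hypothetical helper: true when `text` could be the content of a CDATA
// section, i.e. it does not contain the terminator "]]>".
// Assumption: the Char restrictions (illegal control characters) are not checked.
function isCDataContent(text) {
  return text.indexOf("]]>") === -1;
}

console.log(isCDataContent("a ] b ]] c ]> d")); // true
console.log(isCDataContent("a ]]> b"));         // false
```

A plain substring search like this may be enough when the text has already been isolated and only needs validating.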
 
usenet

Doing regular expressions that end with a string of characters is
slightly involved. You need to do something like:

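/^([^\]]|\][^\]]|\]{2,}[^\]>]|\]+$)*$/

Not the easiest thing to read! Maybe the best thing is to break it
into its component parts, e.g.:

// Backslashes are doubled because these are string literals.
var no_bracket = "[^\\]]";            // any character other than ]
var one_bracket = "\\][^\\]]";        // a single ] followed by a non-]
var two_brackets = "\\]{2,}[^\\]>]";  // two or more ] followed by neither ] nor >
var end_bracket = "\\]+$";            // trailing ] at the end of the input

// new RegExp() takes the pattern without the surrounding slashes.
var expr = new RegExp("^(" + no_bracket + "|" + one_bracket + "|" +
    two_brackets + "|" + end_bracket + ")*$");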

I'll admit I haven't tested it, but hopefully it gives you an idea!
(The $ anchor may not work where it is. Note that JavaScript regular
expressions do not support \Z; without the m flag, $ already matches
only at the very end of the input.)
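
One way to sidestep the bracket counting is a negative lookahead, which JavaScript regular expressions support. The sketch below assumes the goal is to validate a whole string as CDATA content. The name CDATA_CONTENT is made up for illustration, and the Char ranges are not checked:

```javascript
// Each character is either not a "]", or a "]" that does not start "]]>".
// [^\]] also matches newlines, so multi-line text needs no extra flag.
const CDATA_CONTENT = /^(?:[^\]]|\](?!\]>))*$/;

const samples = ["plain text", "a ] b ]] c", "ends with ]]", "]]]", "a ]]> b", "]]]>"];
for (const s of samples) {
  console.log(JSON.stringify(s), CDATA_CONTENT.test(s));
}
// The first four samples print true; "a ]]> b" and "]]]>" print false.
```

Because each alternative consumes exactly one character, there is only one way to match any given input, which avoids heavy backtracking on long strings.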

HTH,

Pete.
--
=============================================
Pete Cordell
Tech-Know-Ware Ltd
for XML to C++ data binding visit
http://www.tech-know-ware.com/lmx
http://www.codalogic.com/lmx
(or http://www.xml2cpp.com)
=============================================
 
usenet

I was thinking more about this overnight. The details of the regular
expression depend on what input string you want to apply the matching
to. If you could give an idea of the types of strings you want to
match against (e.g. a whole XML message, or element text), it might
be possible to come up with a better pattern.

Pete.
 
Max

Hello Pete!

I have written this regular expression:

<!\\[CDATA\\[(((?:\\u0009|\\u000A|\\u000D|[\\u0020-\\uD7FF]|[\\uE000-\\uFFFD]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF])*?)(]]>(?:\\u0009|\\u000A|\\u000D|[\\u0020-\\uD7FF]|[\\uE000-\\uFFFD]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF])*?)*)]]>

I broke it into these component parts:

// \uXXXX takes exactly four hex digits, so characters above U+FFFF are
// written as a surrogate pair rather than as [\u10000-\u10FFFF].
XParser.CHAR =
"(?:\\u0009|\\u000A|\\u000D|[\\u0020-\\uD7FF]|[\\uE000-\\uFFFD]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF])";
XParser.CDSTART = "<!\\[CDATA\\[";
XParser.CDATA = "((" + XParser.CHAR + "*?)(]]>" + XParser.CHAR + "*?)*)";
XParser.CDEND = "]]>";
XParser.CDSECT = XParser.CDSTART + XParser.CDATA + XParser.CDEND;

XML code example:

<![CDATA[this child is of <<<>nodeType CDATA]]>

The problem arose when I expanded the simple regular expression for
CDATA ('(" + XParser.CHAR + "*?)') so that it could also capture the
']]>' markup.
But written this way, it also captures two or more CDSECTs as one...

Example:
1 Tag: <![CDATA[this child is of <<<>nodeType CDATA]]>
Capture: this child is of <<<>nodeType CDATA

2 Tag: <![CDATA[this child is of <<<>nodeType CDATA]]><![CDATA[this
child is of <<<>nodeType CDATA]]>
Capture: this child is of <<<>nodeType CDATA]]><![CDATA[this child is of
<<<>nodeType CDATA

Is it possible to resolve this?

Thanks in advance,

Max
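
The over-capture appears to come from the optional group (]]>CHAR*?)*, which explicitly allows ']]>' inside the content. The group is greedy, so whenever another ']]>' follows later in the input, the engine prefers to absorb the first one and stop at a later one. A reduced sketch of the behaviour, assuming [\s\S] as a stand-in for XParser.CHAR to keep the pattern short:

```javascript
// Same shape as the pattern in the question, with [\s\S] standing in for CHAR.
const overCapturing = /<!\[CDATA\[(([\s\S]*?)(\]\]>[\s\S]*?)*)\]\]>/;

const one = "<![CDATA[first]]>";
const two = "<![CDATA[first]]><![CDATA[second]]>";

console.log(overCapturing.exec(one)[1]); // first
console.log(overCapturing.exec(two)[1]); // first]]><![CDATA[second
```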
 
Joseph Kesselman

This sounds like it's really a Javascript programming question rather
than an XML question, since the question is how to express something in
that language's reg-exp syntax rather than what to express. So you might
get better answers by asking in a Javascript newsgroup than here.
 
Joseph Kesselman

(After all, most of us just use an existing XML parser and let *it* deal
with syntax.)
 
usenet

Hi Max,

In this case I think you need to rework your XParser.CDATA rule along
the lines of the following:

// You could write these using a similar approach to your XParser.CHAR
// if you prefer.
var no_bracket = "[^\\]]";            // any character other than ]
var one_bracket = "\\][^\\]]";        // a single ] followed by a non-]
var two_brackets = "\\]{2,}[^\\]>]";  // two or more ] followed by neither ] nor >

// The outer group captures the whole content; the inner one is non-capturing.
XParser.CDATA = "((?:" + no_bracket + "|" + one_bracket + "|" +
    two_brackets + ")*" + "\\]*)";

The logic is basically:

if( current char is not ] ||
    current char is ] AND next char is NOT ] ||
    current char starts a run of two or more ] AND the char after the
    run is NOT > )
then OK;

which is more easily understood as:

if( current char is not ] ) then OK;
else if( current char is ] AND next char is NOT ] ) then OK;
else if( current char starts a run of two or more ] AND the char after
         the run is NOT > ) then OK;

The trailing "\\]*" just allows any number of ] characters at the end,
if necessary.
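
Put together with the start and end delimiters, the idea can be tried against input containing two sections, like the example in the question. This is only a sketch: the variable and function names are illustrative, CHAR validation is left out, and the two-bracket alternative is written as a run of two or more ] so that input such as ']]]>' is read as one ] of content followed by the terminator:

```javascript
// Pieces of the content pattern; backslashes are doubled inside string literals.
const noBracket = "[^\\]]";            // any character except ]
const oneBracket = "\\][^\\]]";        // a lone ] followed by a non-]
const twoBrackets = "\\]{2,}[^\\]>]";  // a run of ] followed by neither ] nor >

const cdStart = "<!\\[CDATA\\[";
const cdata = "((?:" + noBracket + "|" + oneBracket + "|" + twoBrackets + ")*\\]*)";
const cdEnd = "\\]\\]>";

// Collect the content of every CDATA section found in `xml`.
function cdataContents(xml) {
  const cdSect = new RegExp(cdStart + cdata + cdEnd, "g");
  const found = [];
  let m;
  while ((m = cdSect.exec(xml)) !== null) {
    found.push(m[1]);
  }
  return found;
}

const xml = "<![CDATA[this child is of <<<>nodeType CDATA]]>" +
            "<![CDATA[trailing bracket]]]>";
console.log(JSON.stringify(cdataContents(xml)));
// ["this child is of <<<>nodeType CDATA","trailing bracket]"]
```

Each section is reported separately because the content pattern can no longer run across a ']]>'.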

HTH,

Pete.
 
