parsing xml using perl regex help

Avalon1178

Hi,

I'm trying to write a somewhat simple parser in Perl to parse an XML
element node, without having to use one of the CPAN libraries like
XML::Simple or SAX, etc. My application is fairly simple and I thought
I could accomplish it using regex. Basically, I just want to extract
the value of an attribute within an XML node.

For example, I have the following xml:

<?xml version="1.0" encoding="UTF-8"?>
<main>
<theNode name="blah" type="blah2" opt="blah3">
<desc>Some node</desc>
</theNode>
</main>

In my app, all I want is to extract the attribute value of "type" and
"opt", using regex, but I'm having trouble with it.

For starters, I attempted to extract the attribute value for "name", so
I did the following:

my $regex = qr/^\s*<theNode\s*name=\"(\w+)\".*/;
open(XML, "file.xml");
while(<XML>) {
    if (/$regex/) {
        print "Regex match: $1\n";
    } else {
        print "NO Regex match!\n";
    }
}
close(XML);

Expecting $1 to be "blah" when it reaches the 3rd line of file.xml
(where file.xml is the sample xml that I posted above), instead I get a
NO Regex match.

Can anyone help? Thanks!

Avalon1178
 
Martijn Lievaart

Avalon1178 said:
Expecting $1 to be "blah" when it reaches the 3rd line of file.xml
(where file.xml is the sample xml that I posted above), instead I get a
NO Regex match.

Works for me.

[martijn@dexter tmp]$ ./test.pl
NO Regex match!
NO Regex match!
Regex match: blah
NO Regex match!
NO Regex match!
NO Regex match!
[martijn@dexter tmp]$

What's your output? (Which you should have posted already!)

M4
 
Avalon1178

Sherm said:
The goals you've stated here are contradictory - you say you want to keep
it simple, but you've already ruled out the simplest solution.

sherm--

Huh? How is that contradictory?

Anyway, when I mentioned simple, I meant my task is simple: my app
only needs to parse a known and specific attribute value, that's it.
Pulling in a general-purpose XML library like SAX or XML::Simple is, I
think, a little overkill for my needs, especially if I only want to
accomplish a single task. I also keep in mind that when I distribute
my app, I'm assuming that the system it will run on does not have
XML::Simple or SAX or DOM installed. That is why I'm trying to see if
I can do it using regex, no matter how complicated that regex is (I
never said the Perl code is going to be simple...I meant my task is
simple :) )
 
Avalon1178

Martijn said:
Works for me.

[martijn@dexter tmp]$ ./test.pl
NO Regex match!
NO Regex match!
Regex match: blah
NO Regex match!
NO Regex match!
NO Regex match!
[martijn@dexter tmp]$

What's your output? (Which you should have posted already!)

M4

Hmm...you're right....it does work....it turns out it's the source XML
that I'm using. The value of the attribute that I'm trying to extract
also contains a non-alphanumeric character, so obviously the (\w+) will
not match.

Modifying the regex to:

my $regex = qr/^\s*<theNode\s*name=\"(\S+)\".*/;

works :) Thanks!
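
To illustrate the difference, here is a minimal sketch using a hypothetical attribute value, "blah-1", which contains a hyphen (the value from the real file was not posted). \w+ matches only letters, digits and underscores, so it can never reach the closing quote, while \S+ can. A negated class such as [^"]+ is a third option that would also tolerate spaces inside the value.

```perl
use strict;
use warnings;

# Hypothetical line: the value "blah-1" contains a non-word character.
my $line = '<theNode name="blah-1" type="blah2" opt="blah3">';

my %pattern = (
    word      => qr/^\s*<theNode\s*name="(\w+)"/,
    non_space => qr/^\s*<theNode\s*name="(\S+)"/,
    non_quote => qr/^\s*<theNode\s*name="([^"]+)"/,
);

for my $label (sort keys %pattern) {
    if ($line =~ $pattern{$label}) {
        print "$label: matched '$1'\n";
    } else {
        print "$label: no match\n";
    }
}
```

Only the "word" pattern should report no match here; the other two capture the full value.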
 
Avalon1178

Great it works now...essentially by modifying (\w+) to (\S+)

So my new regex looks like the following if I want to extract the
attribute value of type and opt:

my $regex =
qr/^\s*<theNode\s*name=\"(\S+)\"\s*type=\"(\S+)\"\s*opt=\"(\S+)\".*/;

I have one other question though...and maybe this is because of my
fairly novice knowledge of regexes. Is there another way using regex
to parse the attributes when the order of the attribute names may not
always be the same? In other words, in my XML sample, I could have an
attribute order of type--opt--name, or opt--name--type, etc. Of
course, I could create a different regex for each order combination,
but that's a bit overkill (the naive approach). Is there a way to
represent it in a single regex?
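
One way to make the order irrelevant, sketched here under the assumption that the whole start tag sits on one line and every value is double-quoted: first capture everything between "<theNode" and ">", then collect each name="value" pair into a hash with a global match. The helper name parse_attrs is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of attribute => value for a
# single-line <theNode ...> start tag, or an empty list if none is found.
# Assumes double-quoted values and no '>' inside a value.
sub parse_attrs {
    my ($line) = @_;
    my ($attr_text) = $line =~ /^\s*<theNode\b([^>]*)>/
        or return;
    my %attr;
    # Each pass of the global match picks up the next name="value" pair.
    while ($attr_text =~ /(\w+)\s*=\s*"([^"]*)"/g) {
        $attr{$1} = $2;
    }
    return %attr;
}

for my $line (
    '<theNode name="blah" type="blah2" opt="blah3">',
    '<theNode opt="blah3" name="blah" type="blah2">',
) {
    my %attr = parse_attrs($line);
    print "type=$attr{type} opt=$attr{opt}\n";
}
```

This still breaks on single-quoted values, tags split across lines, and commented-out elements, so it is only a sketch for tightly controlled input.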

 
John Bokma

Avalon1178 said:
I have one other question though...and maybe this is because of my
fairly novice knowledge of regex's.

Nah, it has much more to do with missing the point and ignoring advice.
The road you're following leads to disaster after disaster. There is a
reason that there are XML modules on CPAN.

Avalon1178 said:
is there a way to represent it in a single regex?

Don't use regex when a parser is /needed/ even in very simple cases.
 
Tad McClellan

Avalon1178 said:
So my new regex looks like the following if I want to extract the
attribute value of type and opt:

my $regex =
qr/^\s*<theNode\s*name=\"(\S+)\"\s*type=\"(\S+)\"\s*opt=\"(\S+)\".*/;
                       ^      ^         ^      ^        ^      ^
                       ^      ^         ^      ^        ^      ^

Double quotes are not meta in regexes, so you should not backslash them.
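
A quick check of that point, as a sketch: the two patterns below differ only in the backslashes before the double quotes, and both capture the same value from the sample line.

```perl
use strict;
use warnings;

my $line = '<theNode name="blah" type="blah2" opt="blah3">';

# Same pattern with and without backslashes before the double quotes.
my $escaped   = qr/type=\"(\S+)\"/;
my $unescaped = qr/type="(\S+)"/;

my ($from_escaped)   = $line =~ $escaped;
my ($from_unescaped) = $line =~ $unescaped;

print "escaped:   $from_escaped\n";
print "unescaped: $from_unescaped\n";
```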

Avalon1178 said:
I have one other question though...


You only have one other question _now_, but you could end up with
all of these questions too:

How do I extract the attributes from these ones?

<theNode name='blah' type="blah2" opt='blah3'>

<!--
<theNode name="blah" type="blah2" opt="blah3">
-->

<theNode
name
=
"blah"
type
=
"blah2"
opt
="
blah3"

Avalon1178 said:
and maybe this is because of my
fairly novice knowledge of regex's.


No, rather it may be because of your fairly novice knowledge
of the grammar you need to parse (XML).

Avalon1178 said:
Is there another way using regex
to parse the attributes where the order of the attributes name may not
be always the same?


Yes, by making the regex and other code larger and more uncomely.

Then you need to add yet more regex and code changes when any
of the above cases arise.

Then you need to add yet *more* regex and code changes when the
inevitable others arise.

If you do it with a real XML Parsing module for this "simple" case,
then it will just silently keep working when any of the "one more"
things arise.
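
As a rough illustration of that point, here is a sketch using XML::LibXML, one of the CPAN XML modules. The thread does not name a specific module, so this choice is an assumption, and the module has to be installed. The document below is made up from the awkward cases listed above: single-quoted values, a commented-out element, and a start tag spread over several lines.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up test document covering the awkward cases.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<main>
<theNode name='blah' type="blah2" opt='blah3'>
<desc>Some node</desc>
</theNode>
<!--
<theNode name="hidden" type="hidden2" opt="hidden3"/>
-->
<theNode
name
=
"other"
type
=
"other2"
opt
=
"other3"
/>
</main>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# findnodes() only sees real elements, so the commented-out one is skipped.
for my $node ($doc->findnodes('//theNode')) {
    printf "name=%s type=%s opt=%s\n",
        map { $node->getAttribute($_) } qw(name type opt);
}
```

The lookup by attribute name does not depend on attribute order, quoting style or line breaks, which is the behaviour the regex approach has to rebuild case by case.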

Avalon1178 said:
but is
there a way to represent it in a single regex?


You should be asking what is the best way, rather than asking for
only a way that uses the Wrong Tool.



[ snip TOFU.
Please do not top-post.
]
 
