parsing xml using perl regex help

Avalon1178 · Jan 17, 2007

Hi,

I'm trying to write a fairly simple parser in Perl to parse an XML
element node without using one of the CPAN libraries like
XML::Simple or SAX. My application is fairly simple, and I thought
I could accomplish it with a regex. Basically, I just want to extract
the value of an attribute within an XML node.

For example, I have the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<main>
  <theNode name="blah" type="blah2" opt="blah3">
    <desc>Some node</desc>
  </theNode>
</main>

In my app, all I want is to extract the attribute values of "type" and
"opt" using a regex, but I'm having trouble with it.

For starters, I attempted to extract the attribute value for "name", so
I did the following:

my $regex = qr/^\s*<theNode\s*name=\"(\w+)\".*/;
open(XML, "file.xml") or die "Cannot open file.xml: $!";
while (<XML>) {
    if (/$regex/) {
        print "Regex match: $1\n";
    } else {
        print "NO Regex match!\n";
    }
}
close(XML);

I expected $1 to be "blah" when the loop reaches the 3rd line of
file.xml (where file.xml is the sample XML that I posted above), but
instead I get "NO Regex match!".

Can anyone help? Thanks!

Avalon1178

Martijn Lievaart · Jan 17, 2007

Avalon1178 said:
For starters, I attempted to extract the attribute value for "name", so
I did the following:

use strict;
use warnings;

my $regex = qr/^\s*<theNode\s*name=\"(\w+)\".*/;
open(XML, "file.xml") or die "Cannot open file.xml: $!";
while (<XML>) {
    if (/$regex/) {
        print "Regex match: $1\n";
    } else {
        print "NO Regex match!\n";
    }
}
close(XML);

I expected $1 to be "blah" when the loop reaches the 3rd line of
file.xml (where file.xml is the sample XML that I posted above), but
instead I get "NO Regex match!".

Works for me.

[martijn@dexter tmp]$ ./test.pl
NO Regex match!
NO Regex match!
Regex match: blah
NO Regex match!
NO Regex match!
NO Regex match!
[martijn@dexter tmp]$

What's your output? (Which you should have posted already!)

M4

Avalon1178 · Jan 17, 2007

Sherm said:
The goals you've stated here are contradictory - you say you want to keep
it simple, but you've already ruled out the simplest solution.

sherm--

Huh? How is that contradictory?

Anyway, when I said simple, I meant that my task is simple: my app
only needs to extract one known, specific attribute value. That's it.
Pulling in a general-purpose XML library like SAX or XML::Simple seems
like overkill when I only want to accomplish a single task. I'm also
assuming that the systems where my app will be distributed and run do
not have XML::Simple, SAX, or DOM installed. That is why I'm trying to
see whether I can do it with a regex, no matter how complicated that
regex is. (I never said the Perl code was going to be simple; I meant
my task is simple.)

Avalon1178 · Jan 17, 2007

Martijn said:
Works for me.

What's your output? (Which you should have posted already!)

M4

Hmm, you're right, it does work. It turns out the problem is the
source XML that I'm using. The value of the attribute that I'm trying
to extract also contains a non-alphanumeric character, so obviously
(\w+) will not match.

Modifying the regex to:

my $regex = qr/^\s*<theNode\s*name=\"(\S+)\".*/;

works.

Thanks!
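
The difference between (\w+) and (\S+) described here can be checked
with a small sketch. The helper name name_value and both sample lines
are hypothetical, and the third pattern, [^"]*, is not from the thread:
it is included only as one common way to accept any character up to the
closing quote, including spaces, which (\S+) rejects.

```perl
use strict;
use warnings;

# Try one capture pattern against a line.
# Returns the captured "name" value, or undef if the line does not match.
sub name_value {
    my ($line, $capture) = @_;
    my ($value) = $line =~ /^\s*<theNode\s+name="($capture)"/;
    return $value;
}

# Hypothetical inputs: one value with a hyphen, one with a space.
for my $line ('<theNode name="blah-1" type="blah2">',
              '<theNode name="blah 1" type="blah2">') {
    for my $capture ('\w+', '\S+', '[^"]*') {
        my $value = name_value($line, $capture);
        printf "%-6s on %s: %s\n", $capture, $line,
            defined $value ? "'$value'" : 'no match';
    }
}
```

Only the [^"]* variant captures both values; it still assumes double
quotes and a start tag that fits on one line.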

Avalon1178 · Jan 17, 2007

Great, it works now, essentially by changing (\w+) to (\S+).

So my new regex looks like the following if I want to extract the
attribute values of type and opt:

my $regex =
qr/^\s*<theNode\s*name=\"(\S+)\"\s*type=\"(\S+)\"\s*opt=\"(\S+)\".*/;

I have one other question though, and maybe this is because of my
fairly novice knowledge of regexes. Is there a way, using regex, to
parse the attributes when their order may not always be the same? In
other words, in my XML sample the attributes could appear in the order
type, opt, name or opt, name, type, etc. Of course, I could create a
different regex for each ordering, but that's the naive approach and a
bit overkill. Is there a way to represent it in a single regex?

John Bokma · Jan 17, 2007

Avalon1178 said:
I have one other question though...and maybe this is because of my
fairly novice knowledge of regex's.

Nah, it has much more to do with missing the point and ignoring advice.
The road you're following leads to disaster after disaster. There is a
reason that there are XML modules on CPAN.

Avalon1178 said:
... is there a way to represent it in a single regex?

Don't use regex when a parser is /needed/ even in very simple cases.

Tad McClellan · Jan 17, 2007

Avalon1178 said:
So my new regex looks like the following if I want to extract the
attribute value of type and opt:

my $regex =
qr/^\s*<theNode\s*name=\"(\S+)\"\s*type=\"(\S+)\"\s*opt=\"(\S+)\".*/;
                       ^      ^         ^      ^        ^      ^
                       ^      ^         ^      ^        ^      ^

Double quotes are not meta in regexes, so you should not backslash them.
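
A quick way to confirm this is to run the same pattern with and
without the backslashes. This is only a sketch; the sample line is
taken from the XML in the original post, and the variable names are
made up for the comparison.

```perl
use strict;
use warnings;

my $line = '<theNode name="blah" type="blah2" opt="blah3">';

# The same pattern, once with backslashed quotes and once without.
my ($escaped)   = $line =~ /type=\"(\S+)\"/;
my ($unescaped) = $line =~ /type="(\S+)"/;

print "escaped:   $escaped\n";
print "unescaped: $unescaped\n";
```

Both forms capture the same value, so the backslashes only add noise.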

Avalon1178 said:
I have one other question though...

You only have one other question _now_, but you could end up with
all of these questions too:

How do I extract the attributes from these ones?

<theNode name='blah' type="blah2" opt='blah3'>



<theNode
name
=
"blah"
type
=
"blah2"
opt
="
blah3"

Avalon1178 said:
and maybe this is because of my
fairly novice knowledge of regex's.

No, rather it may be because of your fairly novice knowledge
of the grammar you need to parse (XML).

Avalon1178 said:
Is there another way using regex
to parse the attributes where the order of the attributes name may not
be always the same?

Yes, by making the regex and other code larger and more uncomely.

Then you need to add yet more regex and code changes when any
of the above cases arise.

Then you need to add yet *more* regex and code changes when the
inevitable others arise.
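
As a rough illustration of the extra code this implies, here is a
minimal two-step sketch: first isolate the start tag, then collect
attribute pairs in whatever order they appear. The helper name
start_tag_attributes and the sample input are hypothetical. It assumes
the whole document has been read into one string (not line by line),
that no attribute value contains ">", and it does not decode entities
or skip comments and CDATA sections, so it covers only some of the
cases above.

```perl
use strict;
use warnings;

# Return a hash of attribute name => value for the first <$tag ...> start
# tag found in $xml, or an empty list if there is no such tag.
sub start_tag_attributes {
    my ($xml, $tag) = @_;

    # Step 1: isolate the inside of the start tag. [^>]* also spans newlines.
    my ($inside) = $xml =~ /<\Q$tag\E\b([^>]*)>/ or return;

    # Step 2: collect name="value" or name='value' pairs in any order.
    # \2 requires the closing quote to be the same kind as the opening one.
    my %attr;
    while ($inside =~ /([\w:.-]+)\s*=\s*(["'])(.*?)\2/gs) {
        $attr{$1} = $3;
    }
    return %attr;
}

# Hypothetical input: reordered attributes, mixed quotes, tag split in two.
my $xml = <<'XML';
<main>
<theNode opt='blah3'
    name = "blah" type="blah-2">
<desc>Some node</desc>
</theNode>
</main>
XML

my %attr = start_tag_attributes($xml, 'theNode');
print "$_ = $attr{$_}\n" for sort keys %attr;
```

Each new input variation tends to need another change to one of the
two patterns, which is the maintenance cost being described.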

If you do it with a real XML Parsing module for this "simple" case,
then it will just silently keep working when any of the "one more"
things arise.
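
For comparison, a minimal sketch of the parser route. XML::LibXML is
chosen here only for illustration; the thread itself names XML::Simple,
SAX, and DOM, and any of the CPAN XML modules would need to be
installed on the target system first. The sample input is hypothetical
and mixes quote styles, attribute order, and line breaks.

```perl
use strict;
use warnings;
use XML::LibXML;    # CPAN module, not part of core Perl

my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<main>
  <theNode opt='blah3'
      name = "blah" type="blah-2">
    <desc>Some node</desc>
  </theNode>
</main>
XML

# For a file on disk, load_xml(location => 'file.xml') can be used instead.
my $doc = XML::LibXML->load_xml(string => $xml);

# Look up attributes by name; their order in the source does not matter.
for my $node ($doc->findnodes('//theNode')) {
    my $type = $node->getAttribute('type');
    my $opt  = $node->getAttribute('opt');
    print "type=$type opt=$opt\n";
}
```

The lookup code stays the same whether the attributes are reordered,
single-quoted, or spread over several lines.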

Avalon1178 said:
but is
there a way to represent it in a single regex?

You should be asking what is the best way, rather than asking for
only a way that uses the Wrong Tool.

[ snip TOFU.
Please do not top-post.
]
