parsing XML using a regular expression

Leif Wessman

Hi!

I'm trying to parse some XML with a regular expression (yes, I know
that there are several XML modules that I can use).

My problem is that I'm not that good at creating regular expressions.
The following code does not work as expected. I have a list of items in
xml. Each item has an id and an optional name (no <name>-tag or
<name/>). Each item can also have other tags that I'm not interested
in.

I'm trying to parse this simple XML document so that I extract the id
for each item and the name (if it's there).

However, the output of my program only displays the ids, not any names.
That's my first problem. My second problem is that I would like to know
if it's possible to make my code more efficient (faster and using less
memory). In reality my xml-file can be quite large.

My code:
--------

#!/usr/bin/perl
use strict;
use warnings;

open (XML, "<items.xml") or die "open: $!";
my $xml;
while(my $line = <XML>) {
    $xml = $xml . $line;
}

while ($xml =~
    /<item>.*?<id>(.*?)<\/id>.*?(<name>(.*?)<\/name>)?.*?<\/item>/gs) {
    print "id : $1\n";
    if ($3) {
        print "name: $3\n";
    }
}

My xml-document:
----------------
<xml>
<item>
<id>mf3</id>
<color>blue</color>
<name>moto F3</name>
</item>
<item>
<id>nk1</id>
</item>
<item>
<id>jk8</id>
<name/>
</item>
<item>
<id>la2</id>
<name>labo 2</name>
</item>
</xml>

My output:
----------
id : mf3
id : nk1
id : jk8
id : la2


Leif
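
A note on both problems above. The name is never captured because the lazy `.*?` right after `<\/id>` first tries to match nothing, the optional `(<name>...)?` group is then attempted at that exact position, fails (the next character is a newline, not `<name>`), and is allowed to match nothing; the last `.*?` then runs on to `<\/item>`. So `$3` stays undefined for every item. What follows is only a minimal sketch, assuming items are never nested and the tags carry no attributes, comments or CDATA; it is not a general XML parser. It reads one item at a time by setting `$/` to `</item>` (which also avoids loading the whole file) and matches `<id>` and `<name>` separately. The helper name `read_items` and the in-memory sample are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one <item> record at a time from a filehandle
# and return a list of { id => ..., name => ... } hashes.
sub read_items {
    my ($fh) = @_;
    my @items;

    # Use "</item>" as the end-of-record marker instead of "\n", so each
    # read returns the text of a single item.
    local $/ = '</item>';

    while (my $record = <$fh>) {
        # Match each tag on its own, so tag order inside the item is irrelevant.
        my ($id) = $record =~ m{<id>(.*?)</id>}s;
        next unless defined $id;    # skips the trailing text after the last item
        my ($name) = $record =~ m{<name>(.*?)</name>}s;    # undef if absent or <name/>
        push @items, { id => $id, name => $name };
    }
    return @items;
}

# Sample data from the post (with a closing </xml>), read through an
# in-memory filehandle so the sketch runs without items.xml.
my $sample = <<'END_XML';
<xml>
<item>
<id>mf3</id>
<color>blue</color>
<name>moto F3</name>
</item>
<item>
<id>nk1</id>
</item>
<item>
<id>jk8</id>
<name/>
</item>
<item>
<id>la2</id>
<name>labo 2</name>
</item>
</xml>
END_XML

open my $fh, '<', \$sample or die "open: $!";
for my $item (read_items($fh)) {
    print "id : $item->{id}\n";
    print "name: $item->{name}\n"
        if defined $item->{name} && length $item->{name};
}
close $fh;
```

For a really large file, printing (or otherwise handling) each item inside the `while` loop instead of collecting them in `@items` would keep memory use at roughly one item at a time.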
 
Tad McClellan

Leif Wessman said:
I'm trying to parse some XML with a regular expression (yes, I know
that there are several XML modules that I can use).


You have headed off the 2nd question.

The 1st question is: why do you want to do it with regular expressions
rather than with a real parser?

If you tell us the constraints that prompt your approach, that will
help us a lot in providing advice...
 
ChrisO

Bernard said:
This


I'm trying to parse some XML with a regular expression (yes, I
know that there are several XML modules that I can use).



when put together with this


My problem is that I'm not that good at creating regular
expressions. [...]



suggests using one of the modules you claim to know about.

But he's not allowed to use a module because his professor has
specifically indicated that that is not an option for his assignment.
Seems pretty clear to me... ;-)

-ceo
 
Jeremy Bowers

But he's not allowed to use a module because his professor has
specifically indicated that that is not an option for his assignment.
Seems pretty clear to me... ;-)

Are you serious, or joking?

Is there really a professor teaching regexes to parse XML? And giving
assignments on it?

In that case, the *correct* solution is to drop the class while you can
still get a refund. All the wonderful examples for regexes in the world
and (s)he chooses the one that can instill deep and abiding bad habits.

(Caveat: This is acceptable if the end result is to teach that regexes
aren't sufficient, in the school of hard knocks. In which case I think I
admire him/her.)
 
ChrisO

Jeremy said:
Are you serious, or joking?

I can't think of any other reason why someone would want to parse XML
and specifically state that they "didn't want" to use the XML modules...
(twice)? It seems to me to say loudly that this is a classroom
assignment. I've seen worse requirements handed out...

-ceo
 
Helgi Briem

On 9 Sep 2004 00:44:18 -0700, "Leif Wessman" <[email protected]>
wrote:

Don't top-post. It annoys the regulars and severely damages
your chances of receiving useful help with your questions.

I was trying to find a general solution to parsing both HTML and
XML files. And I didn't know that regular expressions were such a bad
idea when parsing XML. Now I know, and now I will build a solution
using regular expressions for HTML and an XML parser for the XML files.

Using regular expressions to parse HTML is just as bad
as using them to parse XML. HTML, after all, is a close
relative of XML (both descend from SGML). Use the
appropriate modules to parse HTML.

For details on why this is a bad idea, read the FAQ:

perldoc -q "remove HTML"

--
Helgi Briem hbriem AT simnet DOT is

Never worry about anything that you see on the news.
To get on the news it must be sufficiently rare
that your chances of being involved are negligible!
 
