Parse XML that isn't well formed


Milo Thurston

I have some XML looking like the following, other than being very much
larger (some files are up to 2GB):

<?xml version="1.0" encoding="UTF-8"?>

I've tried a few xml parsers such as xml-simple, libxml and quixml, but
all reject this data as badly formed. One answer would, of course, be
for the data to be re-generated using properly formed xml. Meanwhile, is
there anything that could be done with the existing files? Is it a case
of having to write regexps to parse this sort of thing?

Alex LeDonne

Note that there should be no </xml> - the line at the top is a
declaration, not an opening tag. Where did </xml> come from? What
happens if you remove that from the data?


Milo Thurston

Alex said:
Note that there should be no </xml> - the line at the top is a
declaration, not an opening tag. Where did </xml> come from? What
happens if you remove that from the data?

Good point about the XML. Unfortunately, these are the files I have
received, and I have to deal with them for now.

Removing the final tag gives:

file.xml:3: parser error : Extra content at the end of the document
rake aborted!

Jano Svitok

You should have done two things: 1. add a root node <server> (with the
closing </server> just before </xml>) AND 2. remove the trailing </xml>.

Then it'll be fine.

In your case it's easy:

data.gsub('?>', '?><server>').gsub('</xml>', '</server>')
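
For the 2GB files, the same substitution can be applied without holding the whole document in a single string. What follows is only a minimal sketch of that idea, not a tested fix for these particular files: the helper name wrap_in_root and its root: parameter are hypothetical, and it assumes the XML declaration and the stray </xml> each sit entirely on one line. It uses sub rather than gsub for the declaration so that only the first "?>" is touched.

```ruby
require 'stringio'

# Hypothetical helper (not from the thread): copies +input+ to +output+ one
# line at a time, opening a root element right after the XML declaration and
# turning the stray </xml> into the matching closing tag.
def wrap_in_root(input, output, root: 'server')
  declaration_seen = false
  input.each_line do |line|
    if !declaration_seen && line.include?('?>')
      # sub replaces only the first "?>" on the line, i.e. the declaration's end.
      line = line.sub('?>', "?><#{root}>")
      declaration_seen = true
    end
    output << line.gsub('</xml>', "</#{root}>")
  end
  output
end

# Small in-memory demonstration; the <name> and <port> elements are made up.
source = StringIO.new(<<~XML)
  <?xml version="1.0" encoding="UTF-8"?>
  <name>alpha</name>
  <port>80</port>
  </xml>
XML
puts wrap_in_root(source, StringIO.new).string
```

With real files, input and output would be two File objects opened for reading and writing, so memory use stays at roughly one line at a time.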

Milo Thurston

Jano Svitok wrote:
You should have done two things: 1. add a root node <server> (with the
closing </server> just before </xml>) AND 2. remove the trailing </xml>.

Great, thanks.
That should sort out the "legacy" files, and future ones can be

I have also been parsing each line with IO.foreach and
/<(.+)[^>]*>(.+?)<(\/.+)>/, which though not as nice as a proper XML
parser does avoid loading huge files into memory in one go.
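
A possible tightening of the line-by-line approach, offered as a sketch rather than a replacement for a real parser: in the pattern above, the greedy (.+) can swallow more than one tag when a line holds several elements (for <a>1</a><b>2</b> the first group becomes "a>1</a><b"). The variant below stops the tag name at the first space or ">" and uses a backreference so the closing tag must match the opening one. The constant SIMPLE_ELEMENT and the method each_simple_element are hypothetical names, and it assumes each element of interest opens and closes on the same line with no nested tags.

```ruby
require 'stringio'

# Matches <tag optional-attributes>text</tag> on a single line.
# Group 1 is the tag name, group 2 the text; \1 forces the closing tag to match.
SIMPLE_ELEMENT = %r{<([^\s>/]+)[^>]*>(.+?)</\1>}

# Hypothetical helper: yields [name, text] for every simple element found,
# reading the IO one line at a time so large files are never fully loaded.
def each_simple_element(io)
  io.each_line do |line|
    line.scan(SIMPLE_ELEMENT) { |name, text| yield name, text }
  end
end

# In-memory demonstration; the element names here are made up.
sample = StringIO.new("<name>alpha</name>\n<a>1</a><b>2</b>\n<server>\n")
each_simple_element(sample) { |name, text| puts "#{name}: #{text}" }
```

For a file on disk, File.foreach(path) could feed the same scan call in place of io.each_line.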
