XML::Twig doctype and entity handling


Zed Pobre

I'm writing a program that needs to extract a clump of XML metadata
stored inside of a noncompliant HTML file and then perform a number of
operations on that metadata. (Specifically, for those curious, this
is part of a Mobipocket .prc to IPDF .epub ebook converter.)

The HTML file in question has no doctype declaration, and XHTML
entities may be found in the metadata portion. In particular, &copy;
is the first entity that XML::Parser will choke on in my current test
data.

Could someone please provide me with an example of how to get
XML::Twig to recognize XHTML entities? (Or even just &copy;, to get me
started?) I came up with a workaround that involves slurping the input
file, using a regular expression to split the metadata out into a
temporary file, and then running tidy on it, but it's something of an
evil hack, given that I then have to read the results back into
XML::Twig anyway.

My last attempt at getting XML::Twig to read this looks like this:

$mobihtmltwig = XML::Twig->new(
    load_DTD                 => 1,
    twig_roots               => { 'metadata' => 1 },
    twig_handlers            => { 'metadata' => \&twig_cut_metadata },
    output_encoding          => 'utf8',
    pretty_print             => 'indented',
    twig_print_outside_roots => 'HTML'
);

$mobihtmltwig->set_doctype(
    'package',
    "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd",
    "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN"
);

$mobihtmltwig->entity_list->add_new_ent(copy => "©");

print $mobihtmltwig->entity_names,"\n";

$mobihtmltwig->parsefile($mobihtmlfile);


It dies at the parsefile command with:

undefined entity at line 1, column 413, byte 413 at
/usr/lib/perl5/XML/Parser.pm line 187

Byte 413 is the first &copy;. This is despite 'copy' being present in
the entity list.

Thanks for any help,
 

Peter J. Holzer

["Followup-To:" header set to comp.lang.perl.misc.]
I'm writing a program that needs to extract a clump of XML metadata
stored inside of a noncompliant HTML file and then perform a number of
operations on that metadata. (Specifically, for those curious, this
is part of a Mobipocket .prc to IPDF .epub ebook converter.)

The HTML file in question has no doctype declaration, and XHTML
entities may be found in the metadata portion. In particular, &copy;
is the first entity that XML::Parser will choke on in my current test
data.

Could someone please provide me with an example of how to get
XML::Twig to recognize XHTML entities?

Just prepend a declaration. For example, here is a snippet from one of
my scripts that deals with a similar situation:

while ($lines[0] =~ /\s*<use /) {
    shift @lines;
}
my $encoding = "utf-8";
if ($lines[0] =~ / charset=["'](.*?)["']/) {
    $encoding = $1;
}
my $text = join('', (
    "<?xml version='1.0' encoding='$encoding' ?>\n",
    "<!DOCTYPE protokoll SYSTEM 'http://www.luga.at/dtd/protokoll.dtd'\n",
    " [\n",
    "  <!ENTITY euro '&#8364;'>\n",
    "  <!ENTITY mdash '&#8212;'>\n",
    "  <!ENTITY rArr '&#8658;'>\n",
    " ]\n",
    ">\n",
    @lines
));

This first strips off a few extra lines (which start with "<use "), then
extracts the encoding from the first remaining line and then prepends an
XML declaration with the encoding and a doctype declaration with a few
entities.
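Adapted to the original problem, the same idea might look like this. This is only a sketch: the entity list is an assumption (you would declare whichever XHTML entities your metadata actually uses), and the parse step is left as a comment.

```perl
use strict;
use warnings;

# Build an XML prologue declaring the XHTML entities the metadata is
# expected to contain, so a strict XML parser can resolve them.
# The entity list here is illustrative, not exhaustive.
sub xml_prologue {
    my ($encoding) = @_;
    return join '',
        "<?xml version='1.0' encoding='$encoding' ?>\n",
        "<!DOCTYPE html [\n",
        "  <!ENTITY copy  '&#169;'>\n",    # &copy;
        "  <!ENTITY nbsp  '&#160;'>\n",    # &nbsp;
        "  <!ENTITY mdash '&#8212;'>\n",   # &mdash;
        "]>\n";
}

# Prepend the prologue to the document text before handing it to the
# parser (e.g. XML::Twig->parse):
my $doc = xml_prologue('utf-8')
        . "<html><head><metadata>&copy; 2008</metadata></head></html>\n";
print $doc;
```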

hp
 

Zed Pobre

Peter J. Holzer said:
["Followup-To:" header set to comp.lang.perl.misc.]
I'm writing a program that needs to extract a clump of XML metadata
stored inside of a noncompliant HTML file and then perform a number of
operations on that metadata. (Specifically, for those curious, this
is part of a Mobipocket .prc to IPDF .epub ebook converter.)

The HTML file in question has no doctype declaration, and XHTML
entities may be found in the metadata portion. In particular, &copy;
is the first entity that XML::Parser will choke on in my current test
data.

Could someone please provide me with an example of how to get
XML::Twig to recognize XHTML entities?

Just prepend a declaration. For example here is a snippet from one of my
scripts which deals with a similar situation:

Thanks for the suggestion, but I think you misunderstand the situation
-- the input file looks something like this (and I don't have control
over its generation):

<html><head><metadata> <dc-metadata [...] </metadata></head><body>[...]

The goal is to avoid slurping the file and instead extract and separate
the <metadata>...</metadata> block from the HTML via XML::Twig,
outputting HTML with the metadata block removed, then parsing and
modifying the XML metadata block and outputting that as a separate
file. The source files involved average half a megabyte in size and
can reach several megabytes.

My hope was to use XML::Twig to keep memory usage down, and certainly
to avoid a twig root encompassing the entire HTML+XML metadata
structure. At least, the Twig documentation implied that it could do
this in a low-memory fashion, pulling out only the parts needed. The
documentation also lists functions (that are either buggy or that I am
apparently using incorrectly) to define an entity list or assign a
doctype prior to a parse. I'm hoping that someone can give an example
of correct usage.

My current workaround is actually somewhat similar to yours, except at
the file level: I have a subroutine that slurps the file, regexps out
the metadata block, saves the metadata block to a new file with a
proper XML header and doctype prepended, saves everything else to an
HTML-only file, and then returns, so I can call XML::Twig only on the
resulting XML file. This works, but still allocates a potentially
huge amount of memory during the splitting process, even if that
memory is available to Twig after the subroutine returns.
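For reference, the file-level split described above can be sketched as follows. The subroutine name and prologue contents are hypothetical (the real code would also write the two output files rather than returning strings):

```perl
use strict;
use warnings;

# Split a slurped Mobipocket HTML document into the metadata block
# (with an XML prologue prepended) and the remaining HTML.
# split_metadata() is an illustrative name, not from the original program.
sub split_metadata {
    my ($html) = @_;
    my ($meta) = $html =~ m{(<metadata.*?</metadata>)}s
        or return (undef, $html);
    (my $rest = $html) =~ s{<metadata.*?</metadata>}{}s;
    my $xml = "<?xml version='1.0' encoding='utf-8' ?>\n$meta\n";
    return ($xml, $rest);
}

my ($xml, $rest) = split_metadata(
    "<html><head><metadata><dc-metadata/></metadata></head><body/></html>"
);
print $xml;    # the well-formed part, ready for an XML parser
print $rest;   # the HTML with the metadata block removed
```

The obvious cost, as noted above, is holding the whole document in `$html` while the split happens.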

I've been contemplating bludgeoning out a low-memory solution with
sysread, since the metadata will always be at the top of the file and
has never so far been larger than about 8kb, but was hoping to see if
someone knew how to get Twig working first.

Thanks again,
 

John Bokma

Zed Pobre said:
I've been contemplating bludgeoning out a low-memory solution with
sysread, since the metadata will always be at the top of the file and
has never so far been larger than about 8kb, but was hoping to see if
someone knew how to get Twig working first.

If you want to reduce memory to a minimum, you can't avoid using a
streaming solution. I would probably use XML::Parser or SAX.

It's not clear if by HTML you actually mean XHTML (I guess yes;
otherwise you might bump into problems with XML parsing).
 

Zed Pobre

John Bokma said:
If you want to reduce memory to a minimum, you can't avoid using a
streaming solution. I would probably use XML::Parser or SAX.

It's not clear if by HTML you actually mean XHTML (I guess yes;
otherwise you might bump into problems with XML parsing).

Unfortunately, I really do mean HTML, and very badly formed HTML at
that. The only part that can be relied upon to be well-formed is the
<metadata>...</metadata> clump that I was trying to extract with
twig_roots without actually parsing the rest of the file.

It turns out that this isn't possible even with XML::Twig.

One of the kind monks over at perlmonks.org pointed out that there's
nothing stopping me from passing parsefile() a pipe, so I got past the
doctype problem by passing it 'cat oeb12doctype.xml input.html |', at
which point the parse cheerfully got as far as splitting off the
HTML with all of the metadata elements removed before dying horribly
on a mismatched tag. According to the Twig documentation, there is no
way to proceed and get the extracted elements anyway, so this entire
technique has been a dead end. Amusingly, it does still work for
splitting out the HTML without the <metadata> elements, since
twig_print_outside_roots will finish up before the parser dies on the
mismatched tag. That probably isn't reliable, though.
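The pipe trick can be demonstrated end to end with stand-in files (the names and contents below are made up). The trailing '|' works because XML::Parser's parsefile() uses Perl's two-argument open, as the monks noted; the explicit pipe open shown here is the portable equivalent:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Create two small stand-in files: a doctype prologue and an "input"
# document that uses an entity the prologue declares.
my $dir = tempdir( CLEANUP => 1 );
open my $d, '>', "$dir/oeb12doctype.xml" or die $!;
print {$d} "<?xml version='1.0'?>\n",
           "<!DOCTYPE package [ <!ENTITY copy '&#169;'> ]>\n";
close $d;
open my $i, '>', "$dir/input.html" or die $!;
print {$i} "<package><metadata>&copy; 2008</metadata></package>\n";
close $i;

# parsefile("cat a b |") relies on two-argument open; the explicit,
# portable version is a piped filehandle:
open my $fh, '-|', "cat $dir/oeb12doctype.xml $dir/input.html"
    or die "cannot fork: $!";
my $combined = do { local $/; <$fh> };
close $fh;

# $combined is now what the XML parser would see: doctype first,
# then the document, so &copy; resolves.
print $combined;
```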

I'll have to constrain the memory use by doing the initial split in
10k chunks, I guess.

Thanks for the help.
 

Peter J. Holzer

["Followup-To:" header set to comp.lang.perl.misc.]
Unfortunately, I really do mean HTML, and very badly formed HTML at
that.

Then you cannot, or at least should not, use an XML parser. XML is
designed to have a strict syntax, and all XML parsers I know rely on
this and enforce it (more or less).

The only part that can be relied upon to be well-formed is the
<metadata>...</metadata> clump that I was trying to extract with
twig_roots without actually parsing the rest of the file.

It turns out that this isn't possible even with XML::Twig.

One of the kind monks over at perlmonks.org pointed out that there's
nothing stopping me from passing parsefile() a pipe, so I got past the
doctype problem by passing it 'cat oeb12doctype.xml input.html|',

I was going to suggest that when I read about your memory constraints
(although I think slurping a few MB into a single string won't hurt
you - the big problem with large XML files is usually the parsed
representation, a tree with hundreds of thousands of nodes for a few
MB).

However, if you know that everything inside your metadata elements is
well-formed XML and that the metadata elements aren't nested and that
there are no CDATA sections which might contain the string "<metadata>"
or "</metadata>", you can easily extract those sections:

$/ = "</metadata>";

print $preamble, "<fakeroot>";
while (<>) {
    # each record ends with </metadata>,
    # so we can throw everything up to <metadata> away
    # (/s so the match can cross newlines):
    s/.*(?=<metadata\s)//s;
    print $_;
}
print "</fakeroot>";

put that in a subprocess and pass the pipe to XML::Twig. (or maybe
XML::Twig has a method with which you can feed it chunks of input - I
think it does, but a quick scanning of the man page didn't reveal it).
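Here is a runnable variant of that loop, reading from an in-memory stand-in document instead of <>. Note the /s modifier on the substitution so the strip can cross newlines, and the lookahead widened to `[\s>]` so attribute-less tags also match:

```perl
use strict;
use warnings;

# Stand-in input with two metadata elements and junk between them.
my $input = "<html><head>junk\n<metadata id='1'>one</metadata>\nmore"
          . "<metadata id='2'>two</metadata>trailing</html>";
open my $in, '<', \$input or die $!;   # read the string like a file

local $/ = "</metadata>";              # one "record" per metadata element
my $out = "<fakeroot>";
while (my $rec = <$in>) {
    # the final read returns the trailing junk, which has no terminator
    last unless $rec =~ m{</metadata>$};
    # throw everything before <metadata away (/s: match across newlines)
    $rec =~ s/.*(?=<metadata[\s>])//s;
    $out .= $rec;
}
$out .= "</fakeroot>";
print $out, "\n";
```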

at which point the parse() cheerfully got so far as splitting off the
HTML with all of the metadata elements removed before die-ing horribly
on a mismatched tag. According to the Twig documentation, there is no
way to proceed and get the extracted elements anyway,

I think you can get the extracted elements, but there is no way to
continue after the first parse error. So unless you are certain that
the first error occurs after the sections you are interested in, you
can't use that approach.

hp
 

Peter J. Holzer

However, if you know that everything inside your metadata elements is
well-formed XML and that the metadata elements aren't nested and that
there are no CDATA sections which might contain the string "<metadata>"
or "</metadata>", you can easily extract those sections:

$/ = "</metadata>";

print $preamble, "<fakeroot>";
while (<>) {
    # each record ends with </metadata>,
    # so we can throw everything up to <metadata> away
    # (/s so the match can cross newlines):
    s/.*(?=<metadata\s)//s;
    print $_;
}
print "</fakeroot>";

put that in a subprocess and pass the pipe to XML::Twig. (or maybe
XML::Twig has a method with which you can feed it chunks of input - I
think it does, but a quick scanning of the man page didn't reveal it).

Sorry, I missed that there is only one metadata element and that it is
always near the start of the document. In that case it's a lot simpler.
You don't need a loop as you only need to read one element. And you can
just call XML::Twig->parse with that element and don't need to wrap it
in a fake root element.
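As a sketch, the simplified single-element version might look like this (stand-in input; the real code would read from the actual file and hand $record to XML::Twig->parse):

```perl
use strict;
use warnings;

# Stand-in for the Mobipocket HTML; the metadata element is near the top.
my $input = "<html><head><metadata id='m'>&#169; 2008</metadata>"
          . "</head><body>lots of bad HTML...</body></html>";
open my $in, '<', \$input or die $!;

local $/ = "</metadata>";   # read up to and including the close tag
my $record = <$in>;         # one read gets "<html><head><metadata...</metadata>"
$record =~ s/.*(?=<metadata[\s>])//s;   # drop the HTML before the element

# $record is now a single well-formed element - a complete XML document
# on its own, so no fake root is needed:
# XML::Twig->new->parse($record);
print $record, "\n";
```

Only the bytes up to the first </metadata> are ever read, so memory use stays proportional to the metadata block, not the file.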
 
