Zed Pobre
I'm writing a program that needs to extract a clump of XML metadata
stored inside of a noncompliant HTML file and then perform a number of
operations on that metadata. (Specifically, for those curious, this
is part of a Mobipocket .prc to IDPF .epub ebook converter.)
The HTML file in question has no doctype declaration, and XHTML
entities may be found in the metadata portion. In particular, &copy;
is the first entity that XML::Parser will choke on in my current test
data.
Could someone please provide me with an example of how to get
XML::Twig to recognize XHTML entities? (Or even just &copy; to get me
started?) I came up with a workaround that involves slurping the input
file, using a regular expression to split the metadata out into a
temporary file, and then running tidy on it, but it's something of an
evil hack, given that I just have to read the results back into
XML::Twig anyway.
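To make the idea concrete, here is a minimal sketch of that kind of
workaround. It is not my actual code: it assumes the metadata sits in a
single <metadata>...</metadata> block, the sub name extract_metadata and
the small entity table are made up for illustration, and it rewrites
named entities as numeric character references instead of running tidy.

```perl
use strict;
use warnings;

# Hypothetical entity table: only a few XHTML names are listed here.
my %xhtml_entity = (
    copy => 169,
    nbsp => 160,
    reg  => 174,
);

# Pull the first <metadata>...</metadata> block out of a string of HTML
# and rewrite known named entities as numeric character references,
# which an XML parser accepts without any DTD.
sub extract_metadata {
    my ($html) = @_;

    # Non-greedy, case-insensitive match across newlines.
    my ($metadata) = $html =~ m{(<metadata\b.*?</metadata>)}si
        or return;

    # Names missing from the table (including &amp; and &lt;) are left alone.
    $metadata =~ s/&(\w+);/exists $xhtml_entity{$1} ? "&#$xhtml_entity{$1};" : "&$1;"/ge;

    return $metadata;
}
```

The returned string could then be handed to $twig->parse($string)
rather than written out to a temporary file.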
My last attempt at getting XML::Twig to read this looks like this:
    $mobihtmltwig = XML::Twig->new(
        load_DTD => 1,
        twig_roots => { 'metadata' => 1 },
        twig_handlers => { 'metadata' => \&twig_cut_metadata },
        output_encoding => 'utf8',
        pretty_print => 'indented',
        twig_print_outside_roots => 'HTML'
        );
    $mobihtmltwig->set_doctype(
        'package',
        "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd",
        "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN");
    $mobihtmltwig->entity_list->add_new_ent(copy => "©");
    print $mobihtmltwig->entity_names,"\n";
    $mobihtmltwig->parsefile($mobihtmlfile);
It dies at the parsefile command with:
    undefined entity at line 1, column 413, byte 413 at
    /usr/lib/perl5/XML/Parser.pm line 187
Byte 413 is the first &copy;. This is despite 'copy' being present in
the entity list.
Thanks for any help,