XML::Twig doctype and entity handling

Discussion in 'Perl Misc' started by Zed Pobre, Sep 5, 2008.

  1. Zed Pobre

    Zed Pobre Guest

    I'm writing a program that needs to extract a clump of XML metadata
    stored inside of a noncompliant HTML file and then perform a number of
    operations on that metadata. (Specifically, for those curious, this
    is part of a Mobipocket .prc to IDPF .epub ebook converter.)

    The HTML file in question has no doctype declaration, and XHTML
    entities may be found in the metadata portion. In particular, &copy;
    is the first entity that XML::Parser will choke on in my current test
    data.

    Could someone please provide me with an example of how to get
    XML::Twig to recognize XHTML entities? (Or even just &copy; to get me
    started?) I came up with a workaround involving slurping the input
    file and using a regular expression to split the metadata out into a
    temporary file, then run tidy on it, but it's something of an evil
    hack, given that I have to just read the results of that back into
    XML::Twig anyway.

    My last attempt at getting XML::Twig to read this looks like this:

    $mobihtmltwig = XML::Twig->new(
        load_DTD => 1,
        twig_roots => { 'metadata' => 1 },
        twig_handlers => { 'metadata' => \&twig_cut_metadata },
        output_encoding => 'utf8',
        pretty_print => 'indented',
        twig_print_outside_roots => 'HTML'
    );

    $mobihtmltwig->set_doctype(
        'package',
        "http://openebook.org/dtds/oeb-1.2/oebpkg12.dtd",
        "+//ISBN 0-9673008-1-9//DTD OEB 1.2 Package//EN");

    $mobihtmltwig->entity_list->add_new_ent(copy => "©");

    print $mobihtmltwig->entity_names,"\n";

    $mobihtmltwig->parsefile($mobihtmlfile);


    It dies at the parsefile command with:

    undefined entity at line 1, column 413, byte 413 at
    /usr/lib/perl5/XML/Parser.pm line 187

    Byte 413 is the first &copy;. This is despite 'copy' being present in
    the entity list.

    Thanks for any help,

    --
    Zed Pobre <> a.k.a. Zed Pobre <>
    PGP key and fingerprint available on finger; encrypted mail welcomed.
     
    Zed Pobre, Sep 5, 2008
    #1

  2. ["Followup-To:" header set to comp.lang.perl.misc.]
    On 2008-09-04 23:11, Zed Pobre <> wrote:
    > I'm writing a program that needs to extract a clump of XML metadata
    > stored inside of a noncompliant HTML file and then perform a number of
    > operations on that metadata. (Specifically, for those curious, this
    > is part of a Mobipocket .prc to IDPF .epub ebook converter.)
    >
    > The HTML file in question has no doctype declaration, and XHTML
    > entities may be found in the metadata portion. In particular, &copy;
    > is the first entity that XML::Parser will choke on in my current test
    > data.
    >
    > Could someone please provide me with an example of how to get
    > XML::Twig to recognize XHTML entities?


    Just prepend a declaration. For example here is a snippet from one of my
    scripts which deals with a similar situation:

    while ($lines[0] =~ /\s*<use /) {
        shift @lines;
    }
    my $encoding = "utf-8";
    if ($lines[0] =~ / charset=["'](.*?)["']/) {
        $encoding = $1;
    }
    my $text = join('', (
        "<?xml version='1.0' encoding='$encoding' ?>\n",
        "<!DOCTYPE protokoll SYSTEM 'http://www.luga.at/dtd/protokoll.dtd'\n",
        " [\n",
        " <!ENTITY euro '€'>\n",
        " <!ENTITY mdash '—'>\n",
        " <!ENTITY rArr '⇒'>\n",
        " ]\n",
        ">\n",
        @lines
        )
    );

    This first strips off a few extra lines (which start with "<use "), then
    extracts the encoding from the first remaining line and then prepends an
    XML declaration with the encoding and a doctype declaration with a few
    entities.
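
    A self-contained sketch of the same idea, assuming the root element
    name and the entities needed are known in advance. The helper name
    with_entity_prologue and the sample entity table are hypothetical, not
    part of any module; the values are written as numeric character
    references so the declarations do not depend on the file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: build an XML declaration plus a DOCTYPE whose internal
# subset declares the named entities, and prepend both to an XML fragment.
sub with_entity_prologue {
    my ($root, $entities, $xml, $encoding) = @_;
    $encoding ||= 'utf-8';

    # One <!ENTITY ...> line per name, sorted for reproducible output.
    my $decls = join '',
        map { qq{  <!ENTITY $_ "$entities->{$_}">\n} } sort keys %$entities;

    return qq{<?xml version="1.0" encoding="$encoding"?>\n}
         . qq{<!DOCTYPE $root [\n$decls]>\n}
         . $xml;
}

# Sample entity table (an assumption; extend it as the data requires).
my %xhtml_entities = (
    copy  => '&#169;',
    euro  => '&#8364;',
    mdash => '&#8212;',
);

my $text = with_entity_prologue(
    'metadata', \%xhtml_entities,
    '<metadata><rights>&copy; 2008</rights></metadata>',
);
print $text, "\n";
```

    The resulting string can be handed to the parse() method of an
    XML::Twig object, which accepts a string as well as a filehandle.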

    hp
     
    Peter J. Holzer, Sep 6, 2008
    #2

  3. Zed Pobre

    Zed Pobre Guest

    Peter J. Holzer <> wrote:
    >
    >
    > ["Followup-To:" header set to comp.lang.perl.misc.]
    > On 2008-09-04 23:11, Zed Pobre <> wrote:
    >> I'm writing a program that needs to extract a clump of XML metadata
    >> stored inside of a noncompliant HTML file and then perform a number of
    >> operations on that metadata. (Specifically, for those curious, this
    >> is part of a Mobipocket .prc to IDPF .epub ebook converter.)
    >>
    >> The HTML file in question has no doctype declaration, and XHTML
    >> entities may be found in the metadata portion. In particular, &copy;
    >> is the first entity that XML::Parser will choke on in my current test
    >> data.
    >>
    >> Could someone please provide me with an example of how to get
    >> XML::Twig to recognize XHTML entities?

    >
    > Just prepend a declaration. For example here is a snippet from one of my
    > scripts which deals with a similar situation:


    Thanks for the suggestion, but I think you misunderstand the situation
    -- the input file looks something like this (and I don't have control
    over its generation):

    <html><head><metadata> <dc-metadata [...] </metadata></head><body>[...]

    The goal is to avoid slurping the file, but extract and separate the
    <metadata>...</metadata> block from the HTML via XML::Twig, outputting
    HTML with the metadata block removed, parsing and modifying the XML
    metadata block, then outputting that as a separate file. The source
    files involved average half a megabyte in size, and can reach several
    megabytes.

    My hope was to use XML::Twig to keep memory usage down, and certainly
    to avoid a twig root involving the entire HTML+XML metadata structure. At
    least, the Twig documentation implied that it could do this in a
    low-memory fashion, pulling out only the parts needed. The
    documentation also lists functions (that are either buggy or that I am
    apparently using incorrectly) to define an entity list or assign a
    doctype prior to a parse. I'm hoping that someone can give an example
    of correct usage.

    My current workaround is actually somewhat similar to yours, except at
    a file level: I have a subroutine that slurps the file, regexps out
    the metadata block, saves the metadata block to a new file with a
    proper XML header and doctype prepended, saves everything else to a
    HTML-only file, and then returns, so I can call XML::Twig only on the
    outputted XML file. This works, but still allocates a potentially
    huge amount of memory during the splitting process, even if that
    memory is available to Twig after it returns.

    I've been contemplating bludgeoning out a low-memory solution with
    sysread, since the metadata will always be at the top of the file and
    has never so far been larger than about 8kb, but was hoping to see if
    someone knew how to get Twig working first.

    Thanks again,

    --
    Zed Pobre <> a.k.a. Zed Pobre <>
    PGP key and fingerprint available on finger; encrypted mail welcomed.
     
    Zed Pobre, Sep 7, 2008
    #3
  4. John Bokma

    John Bokma Guest

    Zed Pobre <> wrote:

    > I've been contemplating bludgeoning out a low-memory solution with
    > sysread, since the metadata will always be at the top of the file and
    > has never so far been larger than about 8kb, but was hoping to see if
    > someone knew how to get Twig working first.


    If you want to reduce memory to a minimum you can't avoid using a
    streaming solution. I probably would use XML::Parser or SAX.

    It's not clear if by HTML you actually mean XHTML (I guess yes, otherwise
    you might bump into problems with XML parsing)

    --
    John http://johnbokma.com/ - Hacking & Hiking in Mexico

    Perl help in exchange for a gift:
    http://johnbokma.com/perl/help-in-exchange-for-a-gift.html
     
    John Bokma, Sep 7, 2008
    #4
  5. Zed Pobre

    Zed Pobre Guest

    John Bokma <> wrote:
    >
    > Zed Pobre <> wrote:
    >
    >> I've been contemplating bludgeoning out a low-memory solution with
    >> sysread, since the metadata will always be at the top of the file and
    >> has never so far been larger than about 8kb, but was hoping to see if
    >> someone knew how to get Twig working first.

    >
    > If you want to reduce memory to a minimum you can't avoid using a
    > streaming solution. I probably would use XML::Parser or SAX.
    >
    > It's not clear if by HTML you actually mean XHTML (I guess yes, otherwise
    > you might bump into problems with XML parsing)


    Unfortunately, I really do mean HTML, and very badly formed HTML at
    that. The only part that can be relied upon to be well-formed is the
    <metadata>...</metadata> clump that I was trying to extract with
    twig_roots without actually parsing the rest of the file.

    It turns out that this isn't possible even with XML::Twig.

    One of the kind monks over at perlmonks.org pointed out that there's
    nothing stopping me from passing parsefile() a pipe, so I got past the
    doctype problem by passing it 'cat oeb12doctype.xml input.html|', at
    which point the parse() cheerfully got so far as splitting off the
    HTML with all of the metadata elements removed before dying horribly
    on a mismatched tag. According to the Twig documentation, there is no
    way to proceed and get the extracted elements anyway, so this entire
    technique has been a dead end, though amusingly this technique does
    work to split out the HTML without the <metadata> elements, since
    twig_print_outside_roots will finish up before the parser dies from
    mismatched tags. That probably isn't reliable, though.

    I'll have to constrain the memory use by doing the initial split in
    10k chunks, I guess.
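
    A minimal sketch of what such a chunked split could look like, assuming
    a single, non-nested <metadata> element near the top of the file. The
    function name split_metadata and the in-memory filehandles in the demo
    are hypothetical. If no metadata block is ever found, the buffer grows
    to the size of the whole file, so this only saves memory when the
    assumption holds.

```perl
use strict;
use warnings;

# Read $in in fixed-size chunks until the closing </metadata> tag has been
# seen, return the metadata block, and copy everything else to $html_out.
sub split_metadata {
    my ($in, $html_out, $chunk_size) = @_;
    $chunk_size ||= 10 * 1024;

    my $buffer = '';
    my $metadata;
    while (read($in, my $chunk, $chunk_size)) {
        $buffer .= $chunk;
        # The buffer keeps growing until the block is found, so a tag that
        # straddles a chunk boundary is still matched.
        if ($buffer =~ s{(<metadata[\s>].*?</metadata>)}{}s) {
            $metadata = $1;
            last;
        }
    }
    print {$html_out} $buffer;

    # Stream the remainder of the HTML without accumulating it.
    while (read($in, my $chunk, $chunk_size)) {
        print {$html_out} $chunk;
    }
    return $metadata;
}

# Demo on an in-memory document with a deliberately tiny chunk size.
my $html = '<html><head><metadata><dc-metadata/></metadata></head>'
         . '<body>text</body></html>';
open my $in, '<', \$html or die $!;
my $html_only = '';
open my $out, '>', \$html_only or die $!;
my $metadata = split_metadata($in, $out, 16);
close $out;
print "$metadata\n$html_only\n";
```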

    Thanks for the help.

    --
    Zed Pobre <> a.k.a. Zed Pobre <>
    PGP key and fingerprint available on finger; encrypted mail welcomed.
     
    Zed Pobre, Sep 8, 2008
    #5
  6. ["Followup-To:" header set to comp.lang.perl.misc.]
    On 2008-09-08 19:17, Zed Pobre <> wrote:
    > John Bokma <> wrote:
    >> Zed Pobre <> wrote:
    >>> I've been contemplating bludgeoning out a low-memory solution with
    >>> sysread, since the metadata will always be at the top of the file and
    >>> has never so far been larger than about 8kb, but was hoping to see if
    >>> someone knew how to get Twig working first.

    >>
    >> If you want to reduce memory to a minimum you can't avoid using a
    >> streaming solution. I probably would use XML::Parser or SAX.
    >>
    >> It's not clear if by HTML you actually mean XHTML (I guess yes,
    >> otherwise you might bump into problems with XML parsing)

    >
    > Unfortunately, I really do mean HTML, and very badly formed HTML at
    > that.


    Then you cannot/should not use an XML parser. XML is designed to have a
    strict syntax, and all XML parsers I know rely on this and enforce it
    (more or less).

    > The only part that can be relied upon to be well-formed is the
    ><metadata>...</metadata> clump that I was trying to extract with
    > twig_roots without actually parsing the rest of the file.
    >
    > It turns out that this isn't possible even with XML::Twig.
    >
    > One of the kind monks over at perlmonks.org pointed out that there's
    > nothing stopping me from passing parsefile() a pipe, so I got past the
    > doctype problem by passing it 'cat oeb12doctype.xml input.html|',


    I was going to suggest that when I read about your memory constraints
    (although I think slurping a few MB into a single string won't
    hurt you; the big problem with large XML files is usually the parsed
    representation: a tree with hundreds of thousands of nodes for a few
    MB).

    However, if you know that everything inside your metadata elements is
    well-formed XML and that the metadata elements aren't nested and that
    there are no CDATA sections which might contain the string "<metadata>"
    or "</metadata>", you can easily extract those sections:

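    $/ = "</metadata>";

    print $preamble, "<fakeroot>";
    while (<>) {
        # each complete record ends with </metadata>;
        # skip the trailing remainder of the file, which does not:
        next unless m{</metadata>\z};
        # throw away everything up to <metadata
        # (/s lets . match newlines; [\s>] also accepts a bare <metadata>):
        s/.*(?=<metadata[\s>])//s;
        print $_;
    }
    print "</fakeroot>";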

    put that in a subprocess and pass the pipe to XML::Twig. (or maybe
    XML::Twig has a method with which you can feed it chunks of input - I
    think it does, but a quick scanning of the man page didn't reveal it).


    > at which point the parse() cheerfully got so far as splitting off the
    > HTML with all of the metadata elements removed before dying horribly
    > on a mismatched tag. According to the Twig documentation, there is no
    > way to proceed and get the extracted elements anyway,


    I think you can get the extracted elements, but there is no way to
    continue after the first parse error. So unless you are certain that the
    first error occurs after the sections you are interested in you can't
    use that.
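
    A small sketch of that behaviour, assuming XML::Twig is installed: the
    handler fires as soon as </metadata> is seen, so whatever it collected
    is still available after parse() dies on a later error. The sample
    input and the @metadata array are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my @metadata;    # serialized <metadata> elements seen before any error

my $twig = XML::Twig->new(
    twig_roots    => { metadata => 1 },
    twig_handlers => {
        metadata => sub {
            my ($t, $elt) = @_;
            push @metadata, $elt->sprint;
        },
    },
);

# Well-formed up to and including </metadata>, then a mismatched tag.
my $input = '<html><head><metadata><title>Example</title></metadata>'
          . '</head><body><p>unclosed</body></html>';

# parse() dies on the first well-formedness error, so trap it with eval.
my $ok    = eval { $twig->parse($input); 1 };
my $error = $ok ? '' : $@;

print "parse error: $error\n" unless $ok;
print "recovered: $_\n" for @metadata;
```

    XML::Twig also provides safe_parse and safe_parsefile, which wrap the
    parse in an eval and leave the error message in $@.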

    hp
     
    Peter J. Holzer, Sep 9, 2008
    #6
  7. On 2008-09-09 13:25, Peter J. Holzer <> wrote:
    > However, if you know that everything inside your metadata elements is
    > well-formed XML and that the metadata elements aren't nested and that
    > there are no CDATA sections which might contain the string "<metadata>"
    > or "</metadata>", you can easily extract those sections:
    >
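    > $/ = "</metadata>";
    >
    > print $preamble, "<fakeroot>";
    > while (<>) {
    >     # each complete record ends with </metadata>;
    >     # skip the trailing remainder of the file, which does not:
    >     next unless m{</metadata>\z};
    >     # throw away everything up to <metadata
    >     # (/s lets . match newlines; [\s>] also accepts a bare <metadata>):
    >     s/.*(?=<metadata[\s>])//s;
    >     print $_;
    > }
    > print "</fakeroot>";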
    >
    > put that in a subprocess and pass the pipe to XML::Twig. (or maybe
    > XML::Twig has a method with which you can feed it chunks of input - I
    > think it does, but a quick scanning of the man page didn't reveal it).


    Sorry, I missed that there is only one metadata element and that it is
    always near the start of the document. In that case it's a lot simpler.
    You don't need a loop as you only need to read one element. And you can
    just call XML::Twig->parse with that element and don't need to wrap it
    in a fake root element.
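
    A minimal sketch of the single-element case, assuming one non-nested
    <metadata> element and no CDATA section containing the closing tag.
    The function name read_metadata_element and the in-memory filehandle
    are hypothetical.

```perl
use strict;
use warnings;

# Read up to the first </metadata> and return just that element, or nothing
# if the input contains no complete <metadata> element.
sub read_metadata_element {
    my ($fh) = @_;
    local $/ = '</metadata>';    # one "line" now ends at the close tag
    my $record = <$fh>;
    return unless defined $record && $record =~ m{</metadata>\z};

    # Drop everything before the opening tag; /s lets . cross newlines.
    $record =~ s/\A.*?(?=<metadata[\s>])//s or return;
    return $record;
}

my $html = "<html><head>\n<metadata>\n<title>Example</title>\n</metadata>"
         . "</head><body>";
open my $fh, '<', \$html or die $!;
my $element = read_metadata_element($fh);
print $element, "\n";
```

    Only the bytes up to the first </metadata> are read, so memory use is
    bounded by the position of that tag rather than by the file size. The
    returned string can be passed to the parse() method of an XML::Twig
    object once a doctype with the needed entity declarations is prepended.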
     
    Peter J. Holzer, Sep 9, 2008
    #7