Need more info about problem resolving entity reference

Discussion in 'Perl Misc' started by David Karr, May 23, 2013.

  1. David Karr

    David Karr Guest

    I have a Cygwin Perl script that makes numerous REST API calls to a local service, parses the results, and makes further calls with that data. It also runs some of these calls in multiple threads, using LWP::UserAgent.

    It mostly works, but I sometimes get errors like this:

    -----------------------
    caught error:
    500 Can't connect to www.w3.org:80 (Operation now in progress) http://www.w3.org/TR/html4/strict.dtd
    Handler couldn't resolve external entity at line 1, column 90, byte 92
    error in processing external entity reference at line 1, column 90, byte 92:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    =========================================================================================^
    <html>
    <head>
    at /usr/lib/perl5/vendor_perl/5.14/i686-cygwin-threads-64int/XML/Parser.pm line 187 thread 2
    ----------------------

    That's the entire error message. I have no idea where in the script this gets called from, and I'm not really sure what this error is telling me.
    David Karr, May 23, 2013
    #1

  2. David Karr

    David Karr Guest

    On Thursday, May 23, 2013 7:34:04 PM UTC-7, Ben Morrow wrote:
    > Quoth David Karr <>:
    > > I have a Cygwin Perl script makes numerous REST api calls to a local
    > > service, parses the results from those, and makes other calls with that
    > > data. It also runs some of these calls in multiple threads, using
    > > LWP::UserAgent.
    > >
    > > It mostly works, but I sometimes get errors like this:
    > >
    > > -----------------------
    > > caught error:
    > > 500 Can't connect to www.w3.org:80 (Operation now in progress)
    > > http://www.w3.org/TR/html4/strict.dtd
    > > Handler couldn't resolve external entity at line 1, column 90, byte 92
    > > error in processing external entity reference at line 1, column 90, byte 92:
    > > <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    > > "http://www.w3.org/TR/html4/strict.dtd">
    > > ======================================================================
    > > ===================^
    > > <html>
    > > <head>
    > > at
    > > /usr/lib/perl5/vendor_perl/5.14/i686-cygwin-threads-64int/XML/Parser.pm
    > > line 187 thread 2
    > > ----------------------
    > >
    > > That's the entire error message. I have no idea where in the script
    > > this gets called from, and I'm not really sure what this error is
    > > telling me.
    >
    > This error comes from XML::Parser. I assume you are invoking that
    > directly, to parse the REST response? What's happening is that
    > XML::Parser sees a DOCTYPE declaration like
    >
    > <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    > "http://www.w3.org/TR/html4/strict.dtd">
    >
    > and, like a good little SGML-derived XML parser, tries to fetch the DTD
    > (using LWP) so it can validate the rest of the file. For some reason,
    > when it tries to connect to www.w3.org to download the DTD file, the
    > connection is failing with EINPROGRESS. Since LWP isn't expecting that
    > error code, it throws an error.
    >
    > So, what's the real problem? Well, first, that's an HTML doctype. You
    > can't, in general, parse HTML with an XML parser, so are you sure you're
    > getting the responses you expect? REST services are usually pretty good
    > about getting their Content-types right, so you ought to be able to
    > check for an XML Content-type before passing the data to XML::Parser.

    I'm certain that in these anomalous cases I'm not getting the response I expect. The problem with this error message is that it gives me no clue where in the script this is happening. I'm guessing that our back-end server gets confused in some cases, but it's hard to diagnose when I don't know what URL was being attempted, or where in the script it was done.
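
One way to get both the URL and the call site into the error is to check the Content-type first and to wrap each parse in a helper that re-throws with a stack trace. The following is only a sketch under assumptions: the helper names is_xml_content_type and with_context are hypothetical, and the URL is a placeholder. It uses only core Perl and Carp.

```perl
use strict;
use warnings;
use Carp qw(confess);

# Hypothetical helper: does this Content-Type header value name an XML type?
sub is_xml_content_type {
    my ($content_type) = @_;
    # Drop parameters such as "; charset=utf-8" and normalise case.
    my ($media_type) = split /;/, lc($content_type // '');
    $media_type //= '';
    $media_type =~ s/^\s+|\s+$//g;
    return $media_type =~ m{^(?:text|application)/(?:xml|[\w.+-]+\+xml)$} ? 1 : 0;
}

# Hypothetical helper: run $code and, if it dies, re-throw with the URL
# and a full stack trace so the failing call can be located in the script.
sub with_context {
    my ($url, $code) = @_;
    my $result;
    my $ok = eval { $result = $code->(); 1 };
    confess "While handling $url: $@" unless $ok;
    return $result;
}

# Usage with placeholder values; a real script would pass the response
# headers and body obtained from LWP::UserAgent.
my $body = with_context('http://localhost/example', sub {
    die "not XML\n" unless is_xml_content_type('application/xml; charset=utf-8');
    return '<ok/>';
});
print "$body\n";
```

Because confess appends the call stack, the message identifies the line in the script that started the failing request, not only the line inside XML/Parser.pm.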

    > Second, you really don't want to keep fetching the DTDs like that. Does
    > the XML you're actually trying to parse use external DTDs? If not, then
    > you want to pass the NoLWP option to XML::Parser, so that it doesn't
    > even try to fetch DTDs from the network. In the case of a public DTD
    > like HTML the attempt to load it as a local file will fail, of course,
    > but the parsing wasn't going to succeed anyway, because it wasn't XML.


    That "NoLWP" option sounds useful, but it's somewhat moot here.
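
For reference, NoLWP is passed to the XML::Parser constructor. This is a minimal sketch, assuming XML::Parser is installed; the sample document and the choice of the Tree style are illustrative only and are not taken from the script discussed here.

```perl
use strict;
use warnings;
use XML::Parser;

# NoLWP forces the file-based external entity handler, so the parser
# does not try to fetch a DTD over the network.
my $parser = XML::Parser->new(NoLWP => 1, Style => 'Tree');

# Illustrative document; a real script would pass the REST response body.
my $tree = $parser->parse('<items><item id="1">first</item></items>');

# With the Tree style the result is [ root_tag, content ].
print "root element: $tree->[0]\n";
```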

    > However, I'm slightly confused here, because the XML::Parser
    > documentation seems to say it doesn't parse external DTDs by default.
    > It's possible I'm misunderstanding; I don't think I've used XML::Parser
    > myself. Are you passing ParseParamEnt, and if so, why?


    I don't know what "ParseParamEnt" is, so I imagine I'm not.

    > Third, you probably don't want to be using XML::Parser at all. As you
    > can see, it's old and rather cronky, and while it's extremely solid code
    > it also takes a rather SGMLish approach to parsing XML. Most of the
    > time, with modern XML use, DTDs are not used, and instead the XML just
    > needs to be well-formed and properly namespaced. For this sort of thing
    > (small documents) I would use XML::LibXML (which, incidentally, also
    > includes a reasonable HTML parser); if a streaming model is more
    > appropriate, either because your documents may be ridiculously large or
    > simply because your program is structured that way, I would use one of
    > the SAX modules.


    The funny thing about searching in CPAN is that there are no packages (I'm guessing) that say "do not use this, use something better". I'll take a look at XML::LibXML to see what it does for me.
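
As a rough idea of what the XML::LibXML route looks like, here is a short sketch, assuming the module is installed. The document and the XPath expression are made up for illustration; load_ext_dtd and no_network are XML::LibXML parser options that keep the parser from loading external DTDs or touching the network.

```perl
use strict;
use warnings;
use XML::LibXML;

# Illustrative document; a real script would use the REST response body.
my $xml = '<items><item id="1">first</item><item id="2">second</item></items>';

my $doc = XML::LibXML->load_xml(
    string       => $xml,
    load_ext_dtd => 0,    # do not load an external DTD subset
    no_network   => 1,    # forbid network access while parsing
);

# Walk the parsed document with XPath.
my @items = $doc->findnodes('/items/item');
for my $item (@items) {
    printf "%s => %s\n", $item->getAttribute('id'), $item->textContent;
}
```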

    > Finally, fourth, I have no idea where that EINPROGRESS is coming from.
    > That error is supposed to be returned if a socket is connected while in
    > non-blocking mode, and the connection cannot be completed without
    > blocking; it's basically the equivalent of EAGAIN for connect(). This
    > means it shouldn't be possible to get that error without having asked
    > for it by setting nonblocking mode on the socket, which LWP does not
    > (normally) do.
    >
    > Are you doing something peculiar which might cause this to happen?
    > Alternatively, it's possible this is some sort of Cygwin peculiarity,
    > which unfortunately may be difficult to track down; if you can isolate
    > the conditions where the error occurs it would be useful. (For instance,
    > does it tend to occur when the network goes down? When the network is
    > overloaded? When the DNS doesn't respond promptly?)


    The script runs for perhaps 30-40 minutes, basically walking the entire data model of a REST API. It sends hundreds of requests to the (load-balanced) service, some from multiple threads. This kind of error happens several times during a run, which means that the vast majority of requests work well enough. I ended up putting a hack into my "sendGet" sub that checks for "DOCTYPE HTML" in the output and simply tries again, with a reasonable limit on retries. Almost all of the calls that hit this once or twice eventually get good data.
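
The retry hack can be sketched roughly as follows. This is not the actual "sendGet" code: get_with_retry, the code-ref fetcher, the default limit of three attempts, and the URL are all assumptions made for illustration. The fetcher is injected so the sketch runs without a network; in a real script it would wrap the LWP::UserAgent call.

```perl
use strict;
use warnings;

# Hypothetical retry wrapper. $fetch takes a URL and returns the body.
sub get_with_retry {
    my ($fetch, $url, $max_tries) = @_;
    $max_tries //= 3;
    for my $attempt (1 .. $max_tries) {
        my $body = $fetch->($url) // '';
        # An HTML doctype means an error page came back instead of data.
        return $body unless $body =~ /<!DOCTYPE\s+HTML/i;
        warn "attempt $attempt for $url returned HTML\n";
    }
    die "giving up on $url after $max_tries attempts\n";
}

# Stub fetcher that returns an HTML error page twice, then good data.
my @replies = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">',
    '<ok/>',
);
my $body = get_with_retry(sub { shift @replies }, 'http://localhost/example');
print "$body\n";
```

Dying once the limit is reached, with the URL in the message, keeps a persistently failing request from being silently treated as good data.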
    David Karr, May 24, 2013
    #2