Partly erratic wrong behaviour, Python 3, lxml

Discussion in 'Python' started by Jussi Piitulainen, Mar 4, 2010.

  1. Dear group,

    I am observing weird semi-erratic behaviour that involves Python 3 and
    lxml, is extremely sensitive to changes in the input data, and only
    occurs when I name a partial result. I would like some help with this,
    please. (Python 3.1.1; GNU/Linux; how do I find lxml version?)

    The test script regress/Tribug is at the end of this message, with a
    snippet to show the form of regress/tridata.py where the XML is.

    What I observe is this. Parsing an XML document (wrapped in BytesIO)
    with lxml.etree.parse and then extracting certain elements with xpath
    sometimes fails so that I get three times the correct number of
    elements. On the first run of the script, it fails in one way, and on
    each subsequent run in another way: subsequent runs are repeatable.

    Second, the bug only occurs when I give a name to the result from
    lxml.etree.parse! This is seen below in the lines labeled "name1" or
    "name2" that sometimes exhibit the bug, and lines labeled "nest1" or
    "nest2" that never do. That is, this fails in some complex way:

    result = etree.parse(BytesIO(body))
    n = len(result.xpath(title))

    This fails to fail:

    n = len(etree.parse(BytesIO(body)).xpath(title))

    I have failed to observe the error interactively. I believe the
    erroneus result lists are of the form [x x x y y y z z z] when they
    should be [x y z] but I do not know if the x's are identical or
    copies. I will know more later, of course, when I have written more
    complex tests, unless somebody can lead me to a more profitable way of
    debugging this.

    Two versions of the test runs follow, before and after a trivial
    change to the test data. Since the numbers are repeated n's of the
    above snippets, they should all be the same: 5 observed 1000 times.

    A first run after removing regress/tridata.pyc:

    [1202] $ regress/Tribug
    name1: size 5 observed 969 times
    name1: size 15 observed 31 times
    name2: size 5 observed 1000 times
    nest1: size 5 observed 1000 times
    nest2: size 5 observed 1000 times

    All subsequent runs, with regress/tridata.pyc recreated:

    [1203] $ regress/Tribug
    name1: size 5 observed 1000 times
    name2: size 5 observed 978 times
    name2: size 15 observed 22 times
    nest1: size 5 observed 1000 times
    nest2: size 5 observed 1000 times

    Adding an empty comment <!-- --> to the XML document;
    a first run:

    [1207] $ regress/Tribug
    name1: size 5 observed 992 times
    name1: size 15 observed 8 times
    name2: size 5 observed 1000 times
    nest1: size 5 observed 1000 times
    nest2: size 5 observed 1000 times

    And subsequent runs:

    [1208] $ regress/Tribug
    name1: size 5 observed 991 times
    name1: size 15 observed 9 times
    name2: size 5 observed 998 times
    name2: size 15 observed 2 times
    nest1: size 5 observed 1000 times
    nest2: size 5 observed 1000 times

    ---start of regress/Tribug---
    #! /bin/env python3
    # -*- mode: Python; -*-

    from io import BytesIO
    from lxml import etree
    from tridata import body, title

    def naming():
    sizes = dict()
    for k in range(0,1000):
    result = etree.parse(BytesIO(body))
    n = len(result.xpath(title))
    sizes[n] = 1 + sizes.get(n, 0)
    return sizes

    def nesting():
    sizes = dict()
    for k in range(0,1000):
    n = len(etree.parse(BytesIO(body)).xpath(title))
    sizes[n] = 1 + sizes.get(n, 0)
    return sizes

    def report(label, sizes):
    for size, count in sizes.items():
    print('{}: size {} observed {} times'
    .format(label, size, count))

    report('name1', naming())
    report('name2', naming())
    report('nest1', nesting())
    report('nest2', nesting())
    ---end of regress/Tribug---

    The file regress/tridata.py contains only the two constants. I omit
    most of the XML. It would be several screenfuls.

    ---start of regress/tridata.py---
    body = b'''<OAI-PMH xmlns="http://www.opena...
    ....
    </OAI-PMH>
    '''

    title = '//*[name()="record"]//*[name()="dc:title"]'
    ---end of regress/tridata.py---
     
    Jussi Piitulainen, Mar 4, 2010
    #1
    1. Advertising

  2. Jussi Piitulainen, 04.03.2010 11:46:
    > I am observing weird semi-erratic behaviour that involves Python 3 and
    > lxml, is extremely sensitive to changes in the input data, and only
    > occurs when I name a partial result. I would like some help with this,
    > please. (Python 3.1.1; GNU/Linux; how do I find lxml version?)


    Here's how to find the version:

    http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do

    I'll give your test code a try when I get to it. However, note that the
    best place to discuss this is the lxml mailing list:

    http://codespeak.net/mailman/listinfo/lxml-dev

    Stefan
     
    Stefan Behnel, Mar 4, 2010
    #2
    1. Advertising

  3. Stefan Behnel writes:
    > Jussi Piitulainen, 04.03.2010 11:46:
    > > I am observing weird semi-erratic behaviour that involves Python 3
    > > and lxml, is extremely sensitive to changes in the input data, and
    > > only occurs when I name a partial result. I would like some help
    > > with this, please. (Python 3.1.1; GNU/Linux; how do I find lxml
    > > version?)

    >
    > Here's how to find the version:
    >
    > http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do


    Ok, thank you. Here's the results:

    >>> print(et.LXML_VERSION, et.LIBXML_VERSION,

    .... et.LIBXML_COMPILED_VERSION, et.LIBXSLT_VERSION,
    .... et.LIBXSLT_COMPILED_VERSION)
    (2, 2, 4, 0) (2, 6, 26) (2, 6, 26) (1, 1, 17) (1, 1, 17)

    > I'll give your test code a try when I get to it. However, note that
    > the best place to discuss this is the lxml mailing list:
    >
    > http://codespeak.net/mailman/listinfo/lxml-dev


    Thank you. Two things, however. First, I snipped out most of the XML
    document in that post, so it won't be runnable as is. As I think I
    said, my attempts to edit it down to size made the bug hide
    itself. Second, it's very sensitive to any changes in that XML.

    Oh, and a third thing. I'm not at all sure yet that the bug is in
    lxml. It seems to me that Python itself does impossible things - I
    hope I'm just being blind to something obvious, really.

    But if you like to try it out, I'll post the full test data as a
    followup to this. It's just bogus test data.
     
    Jussi Piitulainen, Mar 4, 2010
    #3
  4. This is the full data file on which my regress/Tribug exhibits the
    behaviour that I find incomprehensible, described in the first post in
    this thread. The comment in the beginning of the file below was
    written before I commented out some records in the data, so the actual
    numbers now are not ten expected, thirty sometimes observed, but the
    wrong number is always the correct number tripled (5 and 15, I think).

    ---regress/tridata.py follows---
    # Exercise lxml.etree.parse(body).xpath(title)
    # which I think should always return a list of
    # ten elements but sometimes returns thirty,
    # with each of the ten in triplicate. And this
    # seems impossible to me. Yet I see it happening.

    body = b'''<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2010-03-02T09:38:47Z</responseDate>
    <request verb="ListRecords" from="2004-01-01T00:00:00Z" until="2004-12-31T23:59:59Z" metadataPrefix="oai_dc">http://localhost/pmh/que</request>
    <ListRecords>
    <record>
    <header><!-- x --><!-- -->
    <identifier>jrc32003R0055-pl.xml/2/0</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>RozporzÄ…dzenie</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    <!-- <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/2/1</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>Komisji</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/2/2</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>(WE)</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/2/3</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>nr</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/2/4</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>55/2003</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/3/0</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>z</dc:title>
    </oai_dc:dc>
    </metadata>
    </record> -->
    <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/3/1</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>dnia</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/3/2</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>13</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/3/3</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>stycznia</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    <record>
    <header>
    <identifier>jrc32003R0055-pl.xml/3/4</identifier>
    <datestamp>2004-08-15T19:45:00Z</datestamp>
    <setSpec>pl</setSpec>
    </header>
    <metadata>
    <oai_dc:dc xmlns:eek:ai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>2003</dc:title>
    </oai_dc:dc>
    </metadata>
    </record>
    </ListRecords>
    </OAI-PMH>
    '''

    title = '//*[name()="record"]//*[name()="dc:title"]'
     
    Jussi Piitulainen, Mar 4, 2010
    #4
  5. Jussi Piitulainen, 04.03.2010 22:40:
    > Stefan Behnel writes:
    >> Jussi Piitulainen, 04.03.2010 11:46:
    >>> I am observing weird semi-erratic behaviour that involves Python 3
    >>> and lxml, is extremely sensitive to changes in the input data, and
    >>> only occurs when I name a partial result. I would like some help
    >>> with this, please. (Python 3.1.1; GNU/Linux; how do I find lxml
    >>> version?)

    >>
    >> Here's how to find the version:
    >>
    >> http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do

    >
    > Ok, thank you. Here's the results:
    >
    > >>> print(et.LXML_VERSION, et.LIBXML_VERSION,

    > ... et.LIBXML_COMPILED_VERSION, et.LIBXSLT_VERSION,
    > ... et.LIBXSLT_COMPILED_VERSION)
    > (2, 2, 4, 0) (2, 6, 26) (2, 6, 26) (1, 1, 17) (1, 1, 17)


    I can't reproduce this with the latest lxml trunk (and Py3.2 trunk) and
    libxml2 2.7.6, even after running your test script for quite a while. I'd
    try to upgrade the libxml2 version.

    Stefan
     
    Stefan Behnel, Mar 5, 2010
    #5
  6. Stefan Behnel writes:
    > Jussi Piitulainen, 04.03.2010 22:40:
    > > Stefan Behnel writes:
    > >> Jussi Piitulainen, 04.03.2010 11:46:
    > >>> I am observing weird semi-erratic behaviour that involves Python 3
    > >>> and lxml, is extremely sensitive to changes in the input data, and
    > >>> only occurs when I name a partial result. I would like some help
    > >>> with this, please. (Python 3.1.1; GNU/Linux; how do I find lxml
    > >>> version?)
    > >>
    > >> Here's how to find the version:
    > >>
    > >> http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do

    > >
    > > Ok, thank you. Here's the results:
    > >
    > > >>> print(et.LXML_VERSION, et.LIBXML_VERSION,

    > > ... et.LIBXML_COMPILED_VERSION, et.LIBXSLT_VERSION,
    > > ... et.LIBXSLT_COMPILED_VERSION)
    > > (2, 2, 4, 0) (2, 6, 26) (2, 6, 26) (1, 1, 17) (1, 1, 17)

    >
    > I can't reproduce this with the latest lxml trunk (and Py3.2 trunk)
    > and libxml2 2.7.6, even after running your test script for quite a
    > while. I'd try to upgrade the libxml2 version.


    Thank you much. I suppose that is good news. It's a big server with
    many users - I will ask the administrators to consider an upgrade when
    I get around to it.

    Turns out that lxml documentation warns not to use libxml2 version
    2.6.27 if I want to use xpath, and that is just a bit newer than we
    have. On that cue, I seem to have found a workaround: I replaced the
    xpath expression with findall(titlef) where

    titlef = ( '//{http://www.openarchives.org/OAI/2.0/}record'
    '//{http://purl.org/dc/elements/1.1}title' )

    In the previously broken naming() function I now have:

    result = etree.parse(BytesIO(body))
    n = len(result.findall(titlef))

    And in the previously working nesting() function:

    n = len(etree.parse(BytesIO(body)).findall(titlef))

    With these changes, the test script gives consistently the result that
    I expect, and the more complicated real test script where I first met
    the problem also appears to work without a hitch. So, this works.

    The other, broken behaviour is totally scary, though.
     
    Jussi Piitulainen, Mar 5, 2010
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Rusty
    Replies:
    2
    Views:
    429
    Hywel
    Feb 21, 2004
  2. John Fitzsimons

    Partly consistent web pages - How ?

    John Fitzsimons, Jul 23, 2004, in forum: HTML
    Replies:
    3
    Views:
    405
    Karl Groves
    Jul 23, 2004
  3. Maurizio Berlusconi

    recover a partly destroyed pickled structure

    Maurizio Berlusconi, Sep 25, 2004, in forum: Python
    Replies:
    1
    Views:
    281
    Mike C. Fletcher
    Sep 25, 2004
  4. Emanuele D'Arrigo

    etree, lxml unexpected behaviour

    Emanuele D'Arrigo, Jun 17, 2009, in forum: Python
    Replies:
    2
    Views:
    274
    Emanuele D'Arrigo
    Jun 17, 2009
  5. osmanjaro

    erratic JS behaviour

    osmanjaro, Dec 14, 2005, in forum: Javascript
    Replies:
    4
    Views:
    122
    BootNic
    Dec 21, 2005
Loading...

Share This Page