html5lib not thread safe. Is the Python SAX library thread-safe?

Discussion in 'Python' started by John Nagle, Mar 11, 2012.

  1. John Nagle

    John Nagle Guest

    "html5lib" is apparently not thread safe.
    (see "http://code.google.com/p/html5lib/issues/detail?id=189")
    Looking at the code, I've only found about three problems.
    They're all the usual "cached in a global without locking" bug.
    A few locks would fix that.
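For readers unfamiliar with the pattern, here is a minimal sketch of a "cached in a global" lookup made safe with a lock. It is not html5lib's actual code; the names `_cache`, `_cache_lock` and `cached_lookup` are hypothetical and exist only for illustration.

```python
import threading

# Module-level cache shared by every thread (the "global" in question).
_cache = {}
# One lock guarding every read-modify-write of _cache.
_cache_lock = threading.Lock()


def cached_lookup(key, compute):
    """Return the cached value for key, calling compute(key) at most once."""
    with _cache_lock:
        if key not in _cache:
            # Only one thread can be here at a time, so the value is
            # computed and stored exactly once per key.
            _cache[key] = compute(key)
        return _cache[key]


if __name__ == "__main__":
    threads = [
        threading.Thread(target=cached_lookup, args=("div", str.upper))
        for _ in range(8)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(_cache)
```

Holding the lock while computing is the simplest choice; it serializes callers, which is usually acceptable when the cached value is cheap to build.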

    But html5lib calls the XML SAX parser. Is that thread-safe?
    Or is there more trouble down at the bottom?

    (I run a multi-threaded web crawler, and currently use BeautifulSoup,
    which is thread safe, although dated. I'm looking at converting to
    html5lib.)

    John Nagle
     
    John Nagle, Mar 11, 2012
    #1

  2. On 11Mar2012 13:30, John Nagle wrote:
    | "html5lib" is apparently not thread safe.
    | (see "http://code.google.com/p/html5lib/issues/detail?id=189")
    | Looking at the code, I've only found about three problems.
    | They're all the usual "cached in a global without locking" bug.
    | A few locks would fix that.
    |
    | But html5lib calls the XML SAX parser. Is that thread-safe?
    | Or is there more trouble down at the bottom?
    |
    | (I run a multi-threaded web crawler, and currently use BeautifulSoup,
    | which is thread safe, although dated. I'm looking at converting to
    | html5lib.)

    IIRC, BeautifulSoup4 may do that for you:

    http://www.crummy.com/software/BeautifulSoup/bs4/doc/

    http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
    "Beautiful Soup 4 uses html.parser by default, but you can plug in
    lxml or html5lib and use that instead."
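A minimal sketch of what selecting the parser looks like with Beautiful Soup 4. The second argument to `BeautifulSoup` names the parser; `"html5lib"` and `"lxml"` only work if those packages are installed. The `parse_html` wrapper is a hypothetical helper, not part of bs4.

```python
try:
    from bs4 import BeautifulSoup  # third-party: beautifulsoup4
except ImportError:
    BeautifulSoup = None


def parse_html(markup, parser="html.parser"):
    """Parse markup with the named backend.

    parser may be "html.parser" (standard library), "lxml" or "html5lib";
    the latter two require the corresponding package to be installed.
    """
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is not installed")
    return BeautifulSoup(markup, parser)


if __name__ == "__main__" and BeautifulSoup is not None:
    soup = parse_html("<p>Hello <b>world</b></p>")
    print(soup.b.get_text())
```

Note that this only swaps the parser underneath Beautiful Soup; it says nothing by itself about whether the chosen backend is thread safe.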

    Just for interest, re locking, I wrote a little decorator the other day,
    thus:

        @locked_property
        def foo(self):
            compute foo here ...
            return foo value

    and am rolling its use out amongst my classes. Code:

        def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
            ''' A property whose access is controlled by a lock if unset.
            '''
            if prop_name is None:
                # func.__name__ works on Python 2.6+ and 3; func_name is Python 2 only.
                prop_name = '_' + func.__name__
            def getprop(self):
                ''' Attempt lockless fetch of property first.
                    Use lock if property is unset.
                '''
                p = getattr(self, prop_name)
                if p is unset_object:
                    with getattr(self, lock_name):
                        # Re-check under the lock: another thread may have set it.
                        p = getattr(self, prop_name)
                        if p is unset_object:
                            p = func(self)
                            setattr(self, prop_name, p)
                return p
            return property(getprop)

    It tries to be lockless in the common case. I suspect it is only safe in
    CPython where there is a GIL. If raw python assignments and fetches can
    overlap (eg Jython I think?) I probably need a shared "read" lock around
    the first "p = getattr(self, prop_name)". Any remarks?
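A usage sketch for the decorator described above, restated here so the block runs on its own. The `Page` class, its `computations` counter and the `read_from_threads` helper are hypothetical. The object must create `_lock` and set `_words` to the unset value before the property is first read.

```python
import threading


def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    """Property computed once, under a lock, on first access."""
    if prop_name is None:
        prop_name = '_' + func.__name__

    def getprop(self):
        # Fast path: no lock if the value is already set.
        p = getattr(self, prop_name)
        if p is unset_object:
            with getattr(self, lock_name):
                # Re-check: another thread may have filled it in meanwhile.
                p = getattr(self, prop_name)
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p
    return property(getprop)


class Page(object):
    def __init__(self, text):
        self._lock = threading.Lock()
        self._words = None        # the "unset" marker for the property
        self.text = text
        self.computations = 0     # counts how often words is computed

    @locked_property
    def words(self):
        self.computations += 1
        return self.text.split()


def read_from_threads(page, n=8):
    """Read page.words from n threads and collect what each one saw."""
    results = []

    def worker():
        results.append(page.words)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    page = Page("a b c")
    print(read_from_threads(page), page.computations)
```

Because the computation happens inside the lock after a second check, `words` is computed once however many threads race on the first read. Whether the unlocked first read is safe on interpreters without a GIL is exactly the open question in the post.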

    Cheers,
    --
    Cameron Simpson DoD#743
    http://www.cskk.ezoshosting.com/cs/

    Ed Campbell's pointers for long trips:
    1. lay out the bare minimum of stuff that you need to take with you, then
    put at least half of it back.
     
    Cameron Simpson, Mar 11, 2012
    #2

  3. John Nagle

    John Nagle Guest

    On 3/11/2012 2:45 PM, Cameron Simpson wrote:
    > On 11Mar2012 13:30, John Nagle wrote:
    > | "html5lib" is apparently not thread safe.
    > | (see "http://code.google.com/p/html5lib/issues/detail?id=189")
    > | Looking at the code, I've only found about three problems.
    > | They're all the usual "cached in a global without locking" bug.
    > | A few locks would fix that.
    > |
    > | But html5lib calls the XML SAX parser. Is that thread-safe?
    > | Or is there more trouble down at the bottom?
    > |
    > | (I run a multi-threaded web crawler, and currently use BeautifulSoup,
    > | which is thread safe, although dated. I'm looking at converting to
    > | html5lib.)
    >
    > IIRC, BeautifulSoup4 may do that for you:
    >
    > http://www.crummy.com/software/BeautifulSoup/bs4/doc/
    >
    > http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
    > "Beautiful Soup 4 uses html.parser by default, but you can plug in
    > lxml or html5lib and use that instead."


    I want to use HTML5 standard parsing of bad HTML. (HTML5 formally
    defines how to parse bad comments, for example.) I currently have
    a modified version of BeautifulSoup that's more robust than the
    standard one, but it doesn't handle errors the same way browsers do.

    John Nagle
     
    John Nagle, Mar 12, 2012
    #3
  4. Paul Rubin

    Paul Rubin Guest

    John Nagle writes:
    > But html5lib calls the XML SAX parser. Is that thread-safe?
    > Or is there more trouble down at the bottom?


    According to

    http://xmlbench.sourceforge.net/results/features200303/index.html

    libxml and expat both purport to be thread-safe. I've used the python
    expat library (not from multiple threads) and it works fine, though the
    python calls slow it down by worse than an order of magnitude.
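Whatever the underlying library guarantees, a conservative pattern is to give each thread its own parser and handler and share nothing between them. A minimal sketch using the standard `xml.sax` module (which is backed by expat); `TagCounter`, `count_tags` and `count_in_threads` are hypothetical names, and this is not a claim about how html5lib drives the parser.

```python
import threading
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """SAX handler that counts start tags; one instance per parse."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_tags(document):
    """Parse a bytes document with a fresh parser and a fresh handler."""
    handler = TagCounter()
    # parseString builds a new parser on every call, so no parser
    # state is shared between threads.
    xml.sax.parseString(document, handler)
    return handler.count


def count_in_threads(documents):
    """Parse each document in its own thread; return counts by index."""
    results = {}

    def worker(index, doc):
        results[index] = count_tags(doc)

    threads = [
        threading.Thread(target=worker, args=(i, d))
        for i, d in enumerate(documents)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(count_in_threads([b"<a><b/><c/></a>", b"<x/>"]))
```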
     
    Paul Rubin, Mar 12, 2012
    #4
  5. John Nagle, 11.03.2012 21:30:
    > "html5lib" is apparently not thread safe.
    > (see "http://code.google.com/p/html5lib/issues/detail?id=189")
    > Looking at the code, I've only found about three problems.
    > They're all the usual "cached in a global without locking" bug.
    > A few locks would fix that.
    >
    > But html5lib calls the XML SAX parser. Is that thread-safe?
    > Or is there more trouble down at the bottom?
    >
    > (I run a multi-threaded web crawler, and currently use BeautifulSoup,
    > which is thread safe, although dated. I'm looking at converting to
    > html5lib.)


    You may also consider moving to lxml. BeautifulSoup supports it as a parser
    backend these days, so you wouldn't even have to rewrite your code to use
    it. And performance-wise, well ...

    http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

    Stefan
     
    Stefan Behnel, Mar 12, 2012
    #5
  6. John Nagle

    John Nagle Guest

    On 3/12/2012 3:05 AM, Stefan Behnel wrote:
    > John Nagle, 11.03.2012 21:30:
    >> "html5lib" is apparently not thread safe.
    >> (see "http://code.google.com/p/html5lib/issues/detail?id=189")
    >> Looking at the code, I've only found about three problems.
    >> They're all the usual "cached in a global without locking" bug.
    >> A few locks would fix that.
    >>
    >> But html5lib calls the XML SAX parser. Is that thread-safe?
    >> Or is there more trouble down at the bottom?
    >>
    >> (I run a multi-threaded web crawler, and currently use BeautifulSoup,
    >> which is thread safe, although dated. I'm looking at converting to
    >> html5lib.)

    >
    > You may also consider moving to lxml. BeautifulSoup supports it as a parser
    > backend these days, so you wouldn't even have to rewrite your code to use
    > it. And performance-wise, well ...
    >
    > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
    >
    > Stefan


    I want to move to html5lib because it handles HTML errors as
    specified by the HTML5 spec, which is what all newer browsers do.
    The HTML5 spec actually specifies, in great detail, how to parse
    common errors in HTML. It's amusing seeing that formalized.
    Malformed comments ( <!- instead of <!-- ) are now handled in
    a standard way, for example. So I'm trying to get html5parser
    fixed for thread safety.

    John Nagle
     
    John Nagle, Mar 12, 2012
    #6
