BeautifulSoup

Discussion in 'Python' started by yamamoto, Jan 13, 2010.

  1. yamamoto

    yamamoto Guest

    Hi,
    I am new to Python. I'd like to extract the "a" tags from a website
    using the "BeautifulSoup" module,
    but it doesn't work!

    # sample.py

    from BeautifulSoup import BeautifulSoup as bs
    import urllib
    url="http://www.d-addicts.com/forum/torrents.php"
    doc=urllib.urlopen(url).read()
    soup=bs(doc)
    result=soup.findAll("a")
    for i in result:
        print i


    Traceback (most recent call last):
      File "C:\Users\falcon\workspace\p\pyqt\ex1.py", line 8, in <module>
        soup=bs(doc)
      File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1499, in __init__
        BeautifulStoneSoup.__init__(self, *args, **kwargs)
      File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1230, in __init__
        self._feed(isHTML=isHTML)
      File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1263, in _feed
        self.builder.feed(markup)
      File "C:\Python26\lib\HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "C:\Python26\lib\HTMLParser.py", line 226, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "C:\Python26\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
        self.error("malformed start tag")
      File "C:\Python26\lib\HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36

    any suggestion?
    thanks in advance
     
    yamamoto, Jan 13, 2010
    #1

  2. yamamoto

    Peter Otten Guest

    yamamoto wrote:

    > Hi,
    > I am new to Python. I'd like to extract "a" tag from a website by
    > using "beautifulsoup" module.
    > but it doesnt work!
    >
    > //sample.py
    >
    > from BeautifulSoup import BeautifulSoup as bs
    > import urllib
    > url="http://www.d-addicts.com/forum/torrents.php"
    > doc=urllib.urlopen(url).read()
    > soup=bs(doc)
    > result=soup.findAll("a")
    > for i in result:
    >     print i
    >
    >
    > Traceback (most recent call last):
    > File "C:\Users\falcon\workspace\p\pyqt\ex1.py", line 8, in <module>
    > soup=bs(doc)
    > File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1499, in
    > __init__
    > BeautifulStoneSoup.__init__(self, *args, **kwargs)
    > File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1230, in
    > __init__
    > self._feed(isHTML=isHTML)
    > File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1263, in
    > _feed
    > self.builder.feed(markup)
    > File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    > self.goahead(0)
    > File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    > k = self.parse_starttag(i)
    > File "C:\Python26\lib\HTMLParser.py", line 226, in parse_starttag
    > endpos = self.check_for_whole_start_tag(i)
    > File "C:\Python26\lib\HTMLParser.py", line 301, in
    > check_for_whole_start_tag
    > self.error("malformed start tag")
    > File "C:\Python26\lib\HTMLParser.py", line 115, in error
    > raise HTMLParseError(message, self.getpos())
    > HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36
    >
    > any suggestion?


    When BeautifulSoup encounters an error that it cannot fix, the first thing
    you need is a better error message:


    from BeautifulSoup import BeautifulSoup as bs
    import urllib
    import HTMLParser

    url = "http://www.d-addicts.com/forum/torrents.php"
    doc = urllib.urlopen(url).read()

    #doc = doc.replace("\>", "/>")

    try:
        soup=bs(doc)
    except HTMLParser.HTMLParseError as e:
        # show the offending line with a caret under the reported column
        lines = doc.splitlines(True)
        print lines[e.lineno-1].rstrip()
        print " " * e.offset + "^"
    else:
        result = soup.findAll("a")
        for i in result:
            print i
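
The error-reporting part of the snippet above can be tried without fetching the page. This is only a sketch: `point_at`, the sample markup, and the position (2, 17) are made up for illustration and are not what the parser reports for the real site. It assumes a 1-based line number and a 0-based offset, which is how HTMLParser counts positions, and it uses plain string handling so it should run on both Python 2.6 and Python 3.

```python
def point_at(doc, lineno, offset):
    """Return the offending line with a caret under the reported column.

    lineno is 1-based and offset is 0-based, the same convention as the
    lineno/offset attributes used in the snippet above.
    """
    lines = doc.splitlines()
    line = lines[lineno - 1].rstrip()
    return line + "\n" + " " * offset + "^"


# Hypothetical two-line document with a stray backslash in the second tag.
sample = '<p>ok</p>\n<img src="x.png" \\>'

# Position chosen by hand so that the caret lands on the backslash.
report = point_at(sample, 2, 17)
print(report)
```

Seeing the line and the caret together is usually enough to tell which character the parser could not handle.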

    Once you know the origin of the problem you can devise a manual fix. Here
    you could uncomment the line

    doc = doc.replace("\>", "/>")

    Keep in mind though that what fixes this broken document may break another
    (valid) one.
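
Both halves of that warning can be seen on small strings. A rough sketch, assuming the broken tags end in a backslash followed by ">"; the function name and both sample strings are invented for illustration.

```python
def repair_backslash_close(markup):
    """Replace every backslash-plus-'>' pair with '/>'."""
    return markup.replace("\\>", "/>")


# Hypothetical broken markup: the <img> tag ends in "\>" instead of "/>".
broken = '<a href="/t/1"><img src="dl.png" \\></a>'
fixed = repair_backslash_close(broken)

# Hypothetical valid markup that happens to contain the same two characters
# as ordinary text. The blind replace changes it as well.
legit = '<pre>C:\\></pre>'
altered = repair_backslash_close(legit)

print(fixed)
print(altered)
```

The second case is why a fix like this is better applied only to pages known to be broken in this particular way.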

    Peter
     
    Peter Otten, Jan 13, 2010
    #2

  3. Hi,

    You can also check out a high-level framework for scraping:
    http://scrapy.org/

    Their docs include an example of extracting torrent data from mininova:
    http://doc.scrapy.org/intro/overview.html

    You will need to understand regular expressions, XPath expressions,
    callbacks, etc.
    The FAQ explains how Scrapy compares to BeautifulSoup:
    http://doc.scrapy.org/faq.html#how-does-scrapy-compare-to-beautifulsoul-or-lxml

    Regards,

    On Wed, Jan 13, 2010 at 8:46 AM, yamamoto <> wrote:
    > Hi,
    > I am new to Python. I'd like to extract "a" tag from a website by
    > using "beautifulsoup" module.
    > but it doesnt work!
    >

    [snip]

    --
    Rolando Espinoza La fuente
    www.rolandoespinoza.info
     
    Rolando Espinoza La Fuente, Jan 13, 2010
    #3
  4. yamamoto

    Phlip Guest

    John Nagle wrote:

    > It's just somebody pirating movies. Ineptly. Ignore.


    Anyone who leaves their movies hanging out in <a> tags, without a daily download
    limit or a daily hashtag, deserves to be taught a lesson!

    --
    Phlip
     
    Phlip, Jan 15, 2010
    #4
  5. yamamoto

    John Nagle Guest

    It's just somebody pirating movies. Ineptly. Ignore.

    John Nagle

    yamamoto wrote:
    > Hi,
    > I am new to Python. I'd like to extract "a" tag from a website by
    > using "beautifulsoup" module.
    > but it doesnt work!
    >
    > //sample.py
    >
    > from BeautifulSoup import BeautifulSoup as bs
    > import urllib
    > url="http://www.d-addicts.com/forum/torrents.php"
    > doc=urllib.urlopen(url).read()
    > soup=bs(doc)
    > result=soup.findAll("a")
    > for i in result:
    >     print i
    >
    >
    > Traceback (most recent call last):
    > File "C:\Users\falcon\workspace\p\pyqt\ex1.py", line 8, in <module>
    > soup=bs(doc)
    > File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1499, in
    > __init__
    > BeautifulStoneSoup.__init__(self, *args, **kwargs)
    > File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1230, in
    > __init__
    > self._feed(isHTML=isHTML)
    > File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1263, in
    > _feed
    > self.builder.feed(markup)
    > File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    > self.goahead(0)
    > File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    > k = self.parse_starttag(i)
    > File "C:\Python26\lib\HTMLParser.py", line 226, in parse_starttag
    > endpos = self.check_for_whole_start_tag(i)
    > File "C:\Python26\lib\HTMLParser.py", line 301, in
    > check_for_whole_start_tag
    > self.error("malformed start tag")
    > File "C:\Python26\lib\HTMLParser.py", line 115, in error
    > raise HTMLParseError(message, self.getpos())
    > HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36
    >
    > any suggestion?
    > thanks in advance
    >
     
    John Nagle, Jan 15, 2010
    #5
  6. yamamoto

    John Bokma Guest

    yamamoto <> writes:

    > Hi,
    > I am new to Python. I'd like to extract "a" tag from a website by
    > using "beautifulsoup" module.
    > but it doesnt work!


    [..]

    > check_for_whole_start_tag
    > self.error("malformed start tag")
    > File "C:\Python26\lib\HTMLParser.py", line 115, in error
    > raise HTMLParseError(message, self.getpos())
    > HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36
    >
    > any suggestion?


    I guess you're using 3.1.0. If yes, see:
    http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

    You might want to do:

    sudo easy_install -U "BeautifulSoup==3.0.7a"

    and try again.
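
To check which version is installed before downgrading, something along these lines may help. It is a sketch that assumes the old 3.x module, which is imported as `BeautifulSoup` and carries a `__version__` string; the helper name `uses_htmlparser` is made up here, and it only looks at the first two parts of the version number.

```python
def uses_htmlparser(version):
    """Guess whether a BeautifulSoup 3.x version string is in the 3.1 series."""
    return version.split(".")[:2] == ["3", "1"]


try:
    import BeautifulSoup              # the old 3.x module, as used in this thread
    installed = BeautifulSoup.__version__
except ImportError:
    installed = None                  # not installed under that name

if installed is None:
    print("BeautifulSoup 3.x not found")
elif uses_htmlparser(installed):
    print("3.1 series installed: " + installed)
else:
    print("installed version: " + installed)
```

If it reports the 3.1 series, the downgrade above is worth trying.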

    --
    John Bokma j3b

    Hacking & Hiking in Mexico - http://johnbokma.com/
    http://castleamber.com/ - Perl & Python Development
     
    John Bokma, Jan 15, 2010
    #6
  7. yamamoto

    John Bokma Guest

    John Nagle <> writes:

    > It's just somebody pirating movies. Ineptly. Ignore.


    Wow, what a childish reply. You should've followed your own advice and
    ignored the OP instead of replying with a top post + full quote (!).

    --
    John Bokma j3b

    Hacking & Hiking in Mexico - http://johnbokma.com/
    http://castleamber.com/ - Perl & Python Development
     
    John Bokma, Jan 15, 2010
    #7
  8. yamamoto

    John Bokma Guest

    John Bokma <> writes:

    > yamamoto <> writes:
    >
    >> Hi,
    >> I am new to Python. I'd like to extract "a" tag from a website by
    >> using "beautifulsoup" module.
    >> but it doesnt work!

    >
    > [..]
    >
    >> check_for_whole_start_tag
    >> self.error("malformed start tag")
    >> File "C:\Python26\lib\HTMLParser.py", line 115, in error
    >> raise HTMLParseError(message, self.getpos())
    >> HTMLParser.HTMLParseError: malformed start tag, at line 276, column 36
    >>
    >> any suggestion?

    >
    > I guess you're using 3.1.0. If yes, see:
    > http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
    >
    > You might want to do:
    >
    > sudo easy_install -U "BeautifulSoup==3.0.7a"
    >
    > and try again.


    Forgot to add, see also:
    http://johnbokma.com/mexit/2009/09/26/python-downgrading-beatifulsoup.html

    --
    John Bokma j3b

    Hacking & Hiking in Mexico - http://johnbokma.com/
    http://castleamber.com/ - Perl & Python Development
     
    John Bokma, Jan 15, 2010
    #8
  9. yamamoto

    Phlip Guest

    John Bokma wrote:

    > John Nagle writes:
    >
    >> It's just somebody pirating movies. Ineptly. Ignore.

    >
    > Wow, what a childish reply. You should've followed your own advice and
    > ignored the OP instead of replying with a top post + full quote (!).


    Mr Manners reminds the Gentle Poster(s) that...

    A> as Google vs China shows, all programmers should resist hacking, no
    matter how inept it may be, by any means necessary

    B> John should not have attempted to leave a dead trail in the archives.
    Searches for BeautifulSoup should always return answered questions.

    --
    Phlip
     
    Phlip, Jan 15, 2010
    #9
