Extracting text from a Webpage using BeautifulSoup

Discussion in 'Python' started by Magnus.Moraberg@gmail.com, May 27, 2008.

  1. Guest

    Hi,

    I wish to extract all the words on a set of webpages and store them in
    a large dictionary. I then wish to produce a list with the most common
    words for the language under consideration. So, my code below reads
    the page -

    http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm

    a Welsh-language page. I hope then to establish the 1000 most commonly
    used words in Welsh. The problem I'm having is that
    soup.findAll(text=True) is returning the likes of -

    u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'

    and -

    <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"

    Any suggestions how I might overcome this problem?

    Thanks,

    Barry.


    Here's my code -

    import urllib
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
    # opener = urllib2.build_opener(proxy_support)
    # urllib2.install_opener(opener)

    page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
    soup = BeautifulSoup(page)

    pageText = soup.findAll(text=True)
    print pageText
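    Once the text nodes are extracted, the word tally itself is straightforward. A minimal sketch of that step, assuming the strings have already been pulled out of the soup (the `texts` list below is hypothetical sample data standing in for the result of `soup.findAll(text=True)`):

    ```python
    from collections import Counter
    import re

    # Hypothetical sample strings standing in for soup.findAll(text=True).
    texts = [u'Newyddion y BBC', u'y newyddion diweddaraf', u'o Gymru a y byd']

    counts = Counter()
    for chunk in texts:
        # Lower-case, then keep only runs of letters so punctuation is ignored.
        for word in re.findall(r'[a-z]+', chunk.lower()):
            counts[word] += 1

    # counts.most_common(1000) would give the 1000 most frequent words.
    print(counts.most_common(3))
    ```

    On the real page text, `counts.most_common(1000)` yields the frequency list the post is after.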
     
    , May 27, 2008
    #1

  2. On Tue, 27 May 2008 03:01:30 -0700, Magnus.Moraberg wrote:

    > I wish to extract all the words on a set of webpages and store them in
    > a large dictionary. I then wish to produce a list with the most common
    > words for the language under consideration. So, my code below reads
    > the page -
    >
    > http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm
    >
    > a Welsh-language page. I hope then to establish the 1000 most commonly
    > used words in Welsh. The problem I'm having is that
    > soup.findAll(text=True) is returning the likes of -
    >
    > u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'


    Just extract the text from the body of the document.

    body_texts = soup.body(text=True)
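    The reason this helps: the doctype declaration and the `<head>` contents sit outside `<body>`, so restricting the walk to the body drops them. The same idea can be sketched with nothing but the standard library (Python 3's `html.parser` here, purely for illustration; the thread's code uses BeautifulSoup, and the sample HTML is made up):

    ```python
    from html.parser import HTMLParser

    class BodyText(HTMLParser):
        """Collect text inside <body>, skipping <script>/<style> content."""
        def __init__(self):
            super().__init__()
            self.in_body = False
            self.skip_depth = 0   # >0 while inside a script/style element
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag == 'body':
                self.in_body = True
            elif tag in ('script', 'style'):
                self.skip_depth += 1

        def handle_endtag(self, tag):
            if tag == 'body':
                self.in_body = False
            elif tag in ('script', 'style') and self.skip_depth:
                self.skip_depth -= 1

        def handle_data(self, data):
            # Doctype and comments never reach handle_data, so only real
            # body text accumulates here.
            if self.in_body and not self.skip_depth:
                self.chunks.append(data)

    html = ('<!DOCTYPE html><html><head><title>t</title></head>'
            '<body><script>var x=1;</script><p>Croeso i Gymru</p></body></html>')
    p = BodyText()
    p.feed(html)
    text = ' '.join(c.strip() for c in p.chunks if c.strip())
    print(text)  # Croeso i Gymru
    ```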

    > and -
    >
    > <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
    >
    > Any suggestions how I might overcome this problem?


    Ask the BBC to produce HTML that's less buggy. ;-)

    http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
    or closing tags without opening ones and so on.

    Ciao,
    Marc 'BlackJack' Rintsch
     
    Marc 'BlackJack' Rintsch, May 27, 2008
    #2

  3. Guest

    On 27 May, 12:54, Marc 'BlackJack' Rintsch <> wrote:
    > Just extract the text from the body of the document.
    >
    > body_texts = soup.body(text=True)
    >
    > Ask the BBC to produce HTML that's less buggy. ;-)
    >
    > http://validator.w3.org/ reports bugs like "'body' tag not allowed here"
    > or closing tags without opening ones and so on.
    >
    > Ciao,
    > Marc 'BlackJack' Rintsch


    Great, thanks!
     
    , May 27, 2008
    #3
  4. Paul McGuire Guest

    On May 27, 5:01 am, wrote:
    > Hi,
    >
    > I wish to extract all the words on a set of webpages and store them in
    > a large dictionary. I then wish to produce a list with the most common
    > words for the language under consideration. So, my code below reads
    > the page -
    >
    > http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm
    >
    > a Welsh-language page. I hope then to establish the 1000 most commonly
    > used words in Welsh. The problem I'm having is that
    > soup.findAll(text=True) is returning the likes of -
    >
    > u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'
    >
    > and -
    >
    > <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
    >
    > Any suggestions how I might overcome this problem?
    >
    > Thanks,
    >
    > Barry.
    >
    > Here's my code -
    >
    > import urllib
    > import urllib2
    > from BeautifulSoup import BeautifulSoup
    >
    > # proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
    > # opener = urllib2.build_opener(proxy_support)
    > # urllib2.install_opener(opener)
    >
    > page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
    > soup = BeautifulSoup(page)
    >
    > pageText = soup.findAll(text=True)
    > print pageText


    As an alternative data point, you can try out the htmlStripper example
    on the pyparsing wiki: http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py
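    In the same spirit as that example, a crude tag stripper can be sketched with regular expressions alone (this is illustration only, not the pyparsing code, and it is far less robust against the malformed markup discussed above):

    ```python
    import re

    def strip_html(source):
        # Drop script/style elements wholesale, including their contents.
        source = re.sub(r'(?is)<(script|style).*?</\1>', ' ', source)
        # Drop HTML comments.
        source = re.sub(r'(?s)<!--.*?-->', ' ', source)
        # Drop any remaining tags and the doctype declaration.
        source = re.sub(r'(?s)<[^>]+>', ' ', source)
        # Collapse the leftover whitespace.
        return ' '.join(source.split())

    print(strip_html('<p>Bore <!-- hi --><b>da</b></p><script>x()</script>'))
    # Bore da
    ```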

    -- Paul
     
    Paul McGuire, May 28, 2008
    #4
