Using Beautiful Soup to entangle bookmarks.html

Discussion in 'Python' started by Francach, Sep 7, 2006.

  1. Francach

    Francach Guest

    Hi,

    I'm trying to use the Beautiful Soup package to parse through the
    "bookmarks.html" file which Firefox exports all your bookmarks into.
    I've been struggling with the documentation trying to figure out how to
    extract all the urls. Has anybody got a couple of longer examples using
    Beautiful Soup I could play around with?

    Thanks,
    Martin.
     
    Francach, Sep 7, 2006
    #1

  2. Francach wrote:
    > Hi,
    >
    > I'm trying to use the Beautiful Soup package to parse through the
    > "bookmarks.html" file which Firefox exports all your bookmarks into.
    > I've been struggling with the documentation trying to figure out how to
    > extract all the urls. Has anybody got a couple of longer examples using
    > Beautiful Soup I could play around with?


    Why do you use BeautifulSoup on that? It's generated content, and I
    suppose it is well-formed, most probably even xml. So use a standard
    parser here, better yet something like lxml/elementtree

    Diez
     
    Diez B. Roggisch, Sep 7, 2006
    #2

  3. Francach

    waylan Guest

    waylan, Sep 7, 2006
    #3
  4. Diez B. Roggisch wrote:
    > Francach wrote:
    >
    >> Hi,
    >>
    >> I'm trying to use the Beautiful Soup package to parse through the
    >> "bookmarks.html" file which Firefox exports all your bookmarks into.
    >> I've been struggling with the documentation trying to figure out how to
    >> extract all the urls. Has anybody got a couple of longer examples using
    >> Beautiful Soup I could play around with?

    >
    >
    > Why do you use BeautifulSoup on that? It's generated content, and I
    > suppose it is well-formed, most probably even xml. So use a standard
    > parser here, better yet something like lxml/elementtree
    >
    > Diez


    Some time ago I wrote some code on this subject for my own purposes,
    so maybe it can be used as a starter (tested a bit, but consider it
    a kind of alpha release):

    <code>
    from urllib import urlopen
    from sgmllib import SGMLParser

    class mySGMLParserClassProvidingListOf_HREFs(SGMLParser):
        # provides only HREFs <a href="someURL"> for links to other pages, skipping
        # references to:
        # - internal links on same page : "#..."
        # - email addresses : "mailto:..."
        # and skipping part with appended internal link info, so that e.g.:
        # - "LinkSpec#internalLinkID" will be listed as "LinkSpec" only
        # ---
        # reset() overrides SGMLParser.reset() to also clear the list of HREFs
        def reset(self):
            SGMLParser.reset(self)
            self.A_HREFs = []
        #: def reset(self)

        # start_a() is called by SGMLParser each time it detects an <a ...> tag
        # within the fed HTML document:
        def start_a(self, tagAttributes_asListOfNameValuePairs):
            for attrName, attrValue in tagAttributes_asListOfNameValuePairs:
                if attrName=='href':
                    # slices instead of [0] so an empty href does not raise IndexError
                    if attrValue[:1] != '#' and attrValue[:7] != 'mailto:':
                        if attrValue.find('#') >= 0:
                            attrValue = attrValue[:attrValue.find('#')]
                        #: if
                        self.A_HREFs.append(attrValue)
                    #: if
                #: if
            #: for
        #: def start_a(self, attributes_NamesAndValues_AsListOfTuples)
    #: class mySGMLParserClassProvidingListOf_HREFs(SGMLParser)
    # ------------------------------------------------------------------------------
    # ---
    # Execution block:
    # urlopen() needs the scheme; without "http://" it looks for a local file
    fileLikeObjFrom_urlopen = urlopen('http://www.google.com') # set URL
    mySGMLParserClassObj_withListOfHREFs = mySGMLParserClassProvidingListOf_HREFs()
    mySGMLParserClassObj_withListOfHREFs.feed(fileLikeObjFrom_urlopen.read())
    mySGMLParserClassObj_withListOfHREFs.close()
    fileLikeObjFrom_urlopen.close()

    for href in mySGMLParserClassObj_withListOfHREFs.A_HREFs:
        print href
    #: for
    </code>
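
For readers on Python 3, where sgmllib no longer exists, the same idea can be sketched with the standard library's html.parser. This is only an approximation of the code above, not a drop-in replacement: the names HrefCollector and extract_hrefs and the sample markup are made up for illustration, and the sample is assumed to resemble what Firefox exports.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags, with the same filtering as above."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us.
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            # Skip same-page links and e-mail addresses.
            if value.startswith("#") or value.startswith("mailto:"):
                continue
            # Drop any fragment part: "page.html#top" -> "page.html"
            self.hrefs.append(value.split("#", 1)[0])


def extract_hrefs(html_text):
    """Return the list of filtered href values found in html_text."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical sample in the style of an exported bookmarks file.
sample = (
    '<DL><p>\n'
    '<DT><A HREF="http://www.python.org/#news">Python</A>\n'
    '<DT><A HREF="#local">skip me</A>\n'
    '<DT><A HREF="mailto:someone@example.com">skip me too</A>\n'
    '</DL><p>'
)
print(extract_hrefs(sample))
```

To run it on a real export, pass the contents of the file instead of the sample string, e.g. extract_hrefs(open('bookmarks.html', encoding='utf-8').read()); the encoding is an assumption.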

    Claudio Grondi
     
    Claudio Grondi, Sep 7, 2006
    #4
  5. waylan wrote:
    > Diez B. Roggisch wrote:
    >> suppose it is well-formed, most probably even xml.

    >
    > Maybe not. Otherwise, why would there be a script like this one[1]?
    > Anyway, I found that and other scripts that work with firefox
    > bookmarks.html files with a quick search [2]. Perhaps you will find
    > something there that is helpful.


    I have to admit: I didn't check on that file, and simply couldn't
    believe it was so badly written as it apparently is.

    But I was at least capable of shoving it through HTMLParser. But I'm not
    sure if that is of any use.

    Sorry for causing confusion.

    Diez
     
    Diez B. Roggisch, Sep 7, 2006
    #5
  6. Francach

    Adam Jones Guest

    Francach wrote:
    > Hi,
    >
    > I'm trying to use the Beautiful Soup package to parse through the
    > "bookmarks.html" file which Firefox exports all your bookmarks into.
    > I've been struggling with the documentation trying to figure out how to
    > extract all the urls. Has anybody got a couple of longer examples using
    > Beautiful Soup I could play around with?
    >
    > Thanks,
    > Martin.


    If the only thing you want out of the document is the URLs, why not
    search for: href="..." ? You could get a regular expression that
    matches that pretty easily. I think this should just about get you
    there, but my regular expressions have gotten very rusty.

    /href=\".+\"/
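
One caveat with the pattern as written: ".+" is greedy, so on a line carrying several quoted attributes it runs on to the last double quote. A small sketch of a variant that stops at the first closing quote, using Python's re module; the sample line and the name HREF_RE are illustrative assumptions, not taken from a real export.

```python
import re

# Match everything up to the next double quote; ignore case because the
# sample below writes the attribute name in upper case.
HREF_RE = re.compile(r'href="([^"]*)"', re.IGNORECASE)

# Hypothetical bookmark line with more than one quoted attribute.
line = '<DT><A HREF="http://www.python.org/" ADD_DATE="1157600000">Python</A>'

# The greedy pattern swallows the following attribute as well.
greedy = re.search(r'href=".+"', line, re.IGNORECASE).group(0)
print(greedy)

# The character-class version returns just the URL.
print(HREF_RE.findall(line))
```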
     
    Adam Jones, Sep 7, 2006
    #6
  7. Francach

    Tim Williams Guest

    On 7 Sep 2006 14:30:25 -0700, Adam Jones <> wrote:
    >
    > Francach wrote:
    > > Hi,
    > >
    > > I'm trying to use the Beautiful Soup package to parse through the
    > > "bookmarks.html" file which Firefox exports all your bookmarks into.
    > > I've been struggling with the documentation trying to figure out how to
    > > extract all the urls. Has anybody got a couple of longer examples using
    > > Beautiful Soup I could play around with?
    > >
    > > Thanks,
    > > Martin.

    >
    > If the only thing you want out of the document is the URL's why not
    > search for: href="..." ? You could get a regular expression that
    > matches that pretty easily. I think this should just about get you
    > there, but my regular expressions have gotten very rusty.
    >
    > /href=\".+\"/
    >


    I doubt the bookmarks file is huge so something simple like

    f = open('bookmarks.html').readlines()
    data = [x for x in f if x.strip().startswith('<DT><A ')]

    would get you started.

    On my exported Firefox bookmarks, this gives me all the URLs; they
    just need to be parsed a bit more accurately. I might be tempted to
    just use a couple of split() calls to keep it really simple.
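
A rough sketch of what the "couple of splits" might look like, written for Python 3. It assumes each bookmark sits on one line starting with '<DT><A ' and that the attribute is spelled HREF="..." in upper case; the function name links_from_lines and the sample lines are made up for illustration.

```python
def links_from_lines(lines):
    """Return (url, title) pairs from lines that start with '<DT><A '."""
    links = []
    for line in lines:
        line = line.strip()
        if not line.startswith('<DT><A ') or 'HREF="' not in line:
            continue
        # Everything between HREF=" and the next double quote is the URL.
        url = line.split('HREF="', 1)[1].split('"', 1)[0]
        # The title sits between the end of the <A ...> tag and </A>.
        # This assumes no '>' appears inside the attribute values.
        title = line.split('>', 2)[2].split('</A>', 1)[0]
        links.append((url, title))
    return links


# Hypothetical lines in the style of an exported bookmarks file.
sample = [
    '<DT><H3 ADD_DATE="1157600000">Python</H3>',
    '    <DT><A HREF="http://www.python.org/" ADD_DATE="1157600000">Python home</A>',
    '    <DT><A HREF="http://example.org/soup">Soup notes</A>',
]
print(links_from_lines(sample))
```

This stays simple at the cost of robustness: a bookmark wrapped over two lines, or one with a '>' in its URL, would be missed or mangled.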

    HTH


    --

    Tim Williams
     
    Tim Williams, Sep 7, 2006
    #7
  8. Francach wrote:
    > Hi,
    >
    > I'm trying to use the Beautiful Soup package to parse through the
    > "bookmarks.html" file which Firefox exports all your bookmarks into.
    > I've been struggling with the documentation trying to figure out how to
    > extract all the urls. Has anybody got a couple of longer examples using
    > Beautiful Soup I could play around with?
    >
    > Thanks,
    > Martin.


    from BeautifulSoup import BeautifulSoup
    urls = [tag['href'] for tag in
    BeautifulSoup(open('bookmarks.html')).findAll('a')]

    Regards,
    George
     
    George Sakkis, Sep 8, 2006
    #8
  9. Francach

    Francach Guest

    Hi,

    thanks for the helpful reply.
    I wanted to do two things - learn to use Beautiful Soup and bring out
    all the information
    in the bookmarks file to import into another application. So I need to
    be able to travel down the tree in the bookmarks file. The file seems
    to use header tags which can then contain <a> tags where the href
    attributes are. What I don't understand is how to create objects which
    can then be used to return the information in the next level of the
    tree.

    Thanks again,
    Martin.



    George Sakkis wrote:
    > Francach wrote:
    > > Hi,
    > >
    > > I'm trying to use the Beautiful Soup package to parse through the
    > > "bookmarks.html" file which Firefox exports all your bookmarks into.
    > > I've been struggling with the documentation trying to figure out how to
    > > extract all the urls. Has anybody got a couple of longer examples using
    > > Beautiful Soup I could play around with?
    > >
    > > Thanks,
    > > Martin.

    >
    > from BeautifulSoup import BeautifulSoup
    > urls = [tag['href'] for tag in
    > BeautifulSoup(open('bookmarks.html')).findAll('a')]
    >
    > Regards,
    > George
     
    Francach, Sep 8, 2006
    #9
  10. Francach wrote:
    > George Sakkis wrote:
    > > Francach wrote:
    > > > Hi,
    > > >
    > > > I'm trying to use the Beautiful Soup package to parse through the
    > > > "bookmarks.html" file which Firefox exports all your bookmarks into.
    > > > I've been struggling with the documentation trying to figure out how to
    > > > extract all the urls. Has anybody got a couple of longer examples using
    > > > Beautiful Soup I could play around with?
    > > >
    > > > Thanks,
    > > > Martin.

    > >
    > > from BeautifulSoup import BeautifulSoup
    > > urls = [tag['href'] for tag in
    > > BeautifulSoup(open('bookmarks.html')).findAll('a')]

    > Hi,
    >
    > thanks for the helpful reply.
    > I wanted to do two things - learn to use Beautiful Soup and bring out
    > all the information
    > in the bookmarks file to import into another application. So I need to
    > be able to travel down the tree in the bookmarks file. bookmarks seems
    > to use header tags which can then contain a tags where the href
    > attributes are. What I don't understand is how to create objects which
    > can then be used to return the information in the next level of the
    > tree.
    >
    > Thanks again,
    > Martin.


    I'm not sure I understand what you want to do. Originally you asked to
    extract all urls and BeautifulSoup can do this for you in one line. Why
    do you care about intermediate objects or if the anchor tags are nested
    under header tags or not ? Read and embrace BeautifulSoup's philosophy:
    "You didn't write that awful page. You're just trying to get some data
    out of it. Right now, you don't really care what HTML is supposed to
    look like."

    George
     
    George Sakkis, Sep 8, 2006
    #10
  11. Francach

    Francach Guest

    Hi George,

    Firefox lets you group the bookmarks along with other information into
    directories and sub-directories. Firefox uses header tags for this
    purpose. I'd like to get this grouping information out as well.

    Regards,
    Martin.


    the idea is to extract.
    George Sakkis wrote:
    > Francach wrote:
    > > George Sakkis wrote:
    > > > Francach wrote:
    > > > > Hi,
    > > > >
    > > > > I'm trying to use the Beautiful Soup package to parse through the
    > > > > "bookmarks.html" file which Firefox exports all your bookmarks into.
    > > > > I've been struggling with the documentation trying to figure out how to
    > > > > extract all the urls. Has anybody got a couple of longer examples using
    > > > > Beautiful Soup I could play around with?
    > > > >
    > > > > Thanks,
    > > > > Martin.
    > > >
    > > > from BeautifulSoup import BeautifulSoup
    > > > urls = [tag['href'] for tag in
    > > > BeautifulSoup(open('bookmarks.html')).findAll('a')]

    > > Hi,
    > >
    > > thanks for the helpful reply.
    > > I wanted to do two things - learn to use Beautiful Soup and bring out
    > > all the information
    > > in the bookmarks file to import into another application. So I need to
    > > be able to travel down the tree in the bookmarks file. bookmarks seems
    > > to use header tags which can then contain a tags where the href
    > > attributes are. What I don't understand is how to create objects which
    > > can then be used to return the information in the next level of the
    > > tree.
    > >
    > > Thanks again,
    > > Martin.

    >
    > I'm not sure I understand what you want to do. Originally you asked to
    > extract all urls and BeautifulSoup can do this for you in one line. Why
    > do you care about intermediate objects or if the anchor tags are nested
    > under header tags or not ? Read and embrace BeautifulSoup's philosophy:
    > "You didn't write that awful page. You're just trying to get some data
    > out of it. Right now, you don't really care what HTML is supposed to
    > look like."
    >
    > George
     
    Francach, Sep 8, 2006
    #11
  12. Francach

    Paul Boddie Guest

    Francach wrote:
    >
    > Firefox lets you group the bookmarks along with other information into
    > directories and sub-directories. Firefox uses header tags for this
    > purpose. I'd like to get this grouping information out as well.


    import libxml2dom # http://www.python.org/pypi/libxml2dom
    d = libxml2dom.parse("bookmarks.html", html=1)
    for node in d.xpath("html/body//dt/*[1]"):
        if node.localName == "h3":
            print "Section:", node.nodeValue
        elif node.localName == "a":
            print "Link:", node.getAttribute("href")

    One exercise, using the above code as a starting point, would be to
    reproduce the hierarchy exactly, rather than just showing the section
    names and the links which follow them. Ultimately, you may be looking
    for a way to just convert the HTML into a simple XML document or into
    another hierarchical representation which excludes the HTML baggage and
    details irrelevant to your problem.
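
One possible take on that exercise, sketched with Python 3's standard html.parser rather than libxml2dom. It assumes the usual layout of an exported file: a folder is an <H3> heading followed by a <DL> holding its items, and links are <A HREF="..."> elements. The names BookmarkTreeBuilder and dump, the dict layout, and the sample markup are all invented for this illustration.

```python
from html.parser import HTMLParser


class BookmarkTreeBuilder(HTMLParser):
    """Build a nested folder/link structure in a single pass."""

    def __init__(self):
        super().__init__()
        self.root = {"name": None, "children": []}
        self.stack = [self.root]  # folders currently open
        self.pending = None       # folder named by the last <H3>, awaiting its <DL>
        self.text = None          # text buffer while inside <H3> or <A>
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.text = []
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []
        elif tag == "dl":
            # A <DL> right after a folder heading holds that folder's items;
            # the outermost <DL> has no heading and belongs to the root.
            self.stack.append(self.pending or self.stack[-1])
            self.pending = None

    def handle_data(self, data):
        if self.text is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self.text is not None:
            folder = {"name": "".join(self.text).strip(), "children": []}
            self.stack[-1]["children"].append(folder)
            self.pending = folder
            self.text = None
        elif tag == "a" and self.text is not None:
            link = {"title": "".join(self.text).strip(), "url": self.href}
            self.stack[-1]["children"].append(link)
            self.text = None
        elif tag == "dl" and len(self.stack) > 1:
            self.stack.pop()


def dump(node, depth=0):
    """Return indented lines showing the folder/link hierarchy."""
    lines = []
    for child in node["children"]:
        if "children" in child:  # a folder
            lines.append("  " * depth + "[" + child["name"] + "]")
            lines.extend(dump(child, depth + 1))
        else:                    # a link
            lines.append("  " * depth + child["title"] + " -> " + str(child["url"]))
    return lines


# Hypothetical sample in the style of an exported bookmarks file.
sample = """<DL><p>
    <DT><H3>Python</H3>
    <DL><p>
        <DT><A HREF="http://www.python.org/">Python home</A>
        <DT><H3>Parsing</H3>
        <DL><p>
            <DT><A HREF="http://example.org/soup">Soup notes</A>
        </DL><p>
    </DL><p>
    <DT><A HREF="http://example.org/">Top-level link</A>
</DL><p>"""

builder = BookmarkTreeBuilder()
builder.feed(sample)
builder.close()
print("\n".join(dump(builder.root)))
```

The stack is what keeps this to one pass: each <DL> pushes the folder it belongs to and each </DL> pops it, so a link is always attached to whichever folder is on top. The resulting dicts could be written out as XML or any other hierarchical format.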

    Paul
     
    Paul Boddie, Sep 8, 2006
    #12
  13. Francach wrote:
    > Hi George,
    >
    > Firefox lets you group the bookmarks along with other information into
    > directories and sub-directories. Firefox uses header tags for this
    > purpose. I'd like to get this grouping information out as well.
    >
    > Regards,
    > Martin.


    Here's what I came up with:
    http://rafb.net/paste/results/G91EAo70.html. Tested only on my
    bookmarks; see if it works for you.

    For each subfolder there is a recursive call that walks the respective
    subtree, so it's probably not the most efficient solution, but I
    couldn't think of any one-pass way to do it using BeautifulSoup.

    George
     
    George Sakkis, Sep 9, 2006
    #13
  14. Francach

    Francach Guest

    Hello George,

    thanks a lot! This is exactly the direction I had in mind.
    Your script demonstrates nicely how Beautiful Soup works.

    Regards,
    Martin.

    George Sakkis wrote:
    > Francach wrote:
    > > Hi George,
    > >
    > > Firefox lets you group the bookmarks along with other information into
    > > directories and sub-directories. Firefox uses header tags for this
    > > purpose. I'd like to get this grouping information out as well.
    > >
    > > Regards,
    > > Martin.

    >
    > Here's what I came up with:
    > http://rafb.net/paste/results/G91EAo70.html. Tested only on my
    > bookmarks; see if it works for you.
    >
    > For each subfolder there is a recursive call that walks the respective
    > subtree, so it's probably not the most efficient solution, but I
    > couldn't think of any one-pass way to do it using BeautifulSoup.
    >
    > George
     
    Francach, Sep 9, 2006
    #14
  15. Francach

    robin Guest

    "George Sakkis" <> wrote:

    >Here's what I came up with:
    >http://rafb.net/paste/results/G91EAo70.html. Tested only on my
    >bookmarks; see if it works for you.


    That URL is dead. Got another?

    -----
    robin
    noisetheatre.blogspot.com
     
    robin, Sep 21, 2006
    #15
  16. George Sakkis, Sep 21, 2006
    #16
