Using Beautiful Soup to entangle bookmarks.html

Discussion in 'Python' started by Anthra Norell, Sep 7, 2006.

  1. > Hi,
    >
    > I'm trying to use the Beautiful Soup package to parse through the
    > "bookmarks.html" file which Firefox exports all your bookmarks into.
    > I've been struggling with the documentation trying to figure out how to
    > extract all the urls. Has anybody got a couple of longer examples using
    > Beautiful Soup I could play around with?
    >
    > Thanks,
    > Martin.



    Martin,

    SE is a stream editor that avoids the overhead and complications of a full parser, which is overkill for this job. See if it suits your needs:
    http://cheeseshop.python.org/pypi/SE/2.2 beta

    >>> import SE
    >>> Bookmark_Filter = SE.SE ('''

    <EAT> # delete all unmatched input
    "~(?i)<a.*?href.*?>~==\n" # keep hrefs and add a new line
    "~(?i)[^>]+/a>~==\n\n" # keep text till end of anchor and add two newlines
    | # run
    <a= <A= </a>= </A>= href\== HREF\== >= # delete the noise (extend to your liking)
    ''')

    >>> print Bookmark_Filter (r'C:\WINDOWS\Application Data\Mozilla\Profiles\default\wwaidm0p.slt\bookmarks.html', '') # 2nd parameter '' commands string output. Default is a file.
    ....

    "http://www.inksupply.com/index.cfm?source=html/main2.html" ADD_DATE="1016024829" LAST_VISIT="1039439802" LAST_CHARSET="ISO-8859-1"
    MIS Associates Inc.

    "http://www.weink.com/" ADD_DATE="1016034183" LAST_VISIT="1118782455" LAST_CHARSET="windows-1252"
    Inkjet, Laser, Copier, Fax Supplies

    "http://www.nextrend.com/analysis/content/pr_9-19-2000.asp" ADD_DATE="1018037196" LAST_VISIT="1126289805" LAST_CHARSET="ISO-8859-1"
    NexTrend - Press Releases

    "http://wp.netscape.com/escapes/search/netsearch_E.html" ADD_DATE="1021644432" LAST_VISIT="1023182857" LAST_CHARSET="ISO-8859-1"
    Net Search Page - Google

    "http://www.python.org/" ADD_DATE="1021651575" LAST_VISIT="1121690494" LAST_CHARSET="ISO-8859-1"
    Python Language Website

    "http://www.teldir.com/real/frame.asp?page=http://www.whitepages.ch" ADD_DATE="1027354641" LAST_VISIT="1115386846" LAST_CHARSET="windows-1252"
    http://www.teldir.com/real/frame.asp?page=http://www.whitepages.ch

    .... etc.
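
For comparison, the same kind of URL-and-title listing can be sketched with nothing but Python's standard library. This is not SE and not how SE works internally; it is a minimal illustration using `html.parser`, written for Python 3. The names `BookmarkCollector` and `extract_bookmarks` and the `sample` string are made up for this example, and it assumes each bookmark is an ordinary `<A HREF="...">title</A>` element.

```python
from html.parser import HTMLParser


class BookmarkCollector(HTMLParser):
    """Collect (url, title) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []   # finished (url, title) pairs
        self._href = None     # href of the anchor currently open, if any
        self._text = []       # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.bookmarks.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_bookmarks(html_text):
    """Return a list of (url, title) tuples found in html_text."""
    parser = BookmarkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks


# Hypothetical two-line excerpt in the style of a Firefox bookmarks.html file.
sample = (
    '<DT><A HREF="http://www.python.org/" ADD_DATE="1021651575">'
    'Python Language Website</A>\n'
    '<DT><A HREF="http://www.weink.com/" ADD_DATE="1016034183">'
    'Inkjet, Laser, Copier, Fax Supplies</A>\n'
)

for url, title in extract_bookmarks(sample):
    print(url)
    print(title)
    print()
```

Because the attributes arrive already split into name/value pairs, extras such as ADD_DATE and LAST_VISIT never reach the output, so no separate "delete the noise" pass is needed.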


    You can refine this further by adding more deletions or substitutions. Add them one at a time and examine the output after each
    change. The SE object accepts strings as well as file names, and when given a string it returns a string by default, so you can
    develop the filter interactively in an IDLE window against a sample data string, one step at a time, which is fast and
    painless.
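
The one-step-at-a-time workflow is not specific to SE. As a rough illustration only, here is the same idea with Python's `re` module (Python 3) run against a sample string: first keep only the anchors, check the result, then add a second step that splits each anchor into URL and title. The function names and the `sample` string are invented for this sketch, and the second pattern assumes the href value is double-quoted, as it is in the output shown above.

```python
import re

# Hypothetical sample data: one bookmark line and one folder heading.
sample = (
    '<DT><A HREF="http://www.python.org/" ADD_DATE="1021651575" '
    'LAST_CHARSET="ISO-8859-1">Python Language Website</A>\n'
    '<DT><H3>Some folder</H3>\n'
)

# Step 1: keep only complete anchor elements and drop everything else.
ANCHOR = re.compile(r'(?is)<a\b[^>]*\bhref[^>]*>.*?</a>')


def keep_anchors(text):
    """Return every <a ... href ...>...</a> element found in text."""
    return ANCHOR.findall(text)


# Step 2: pull the URL and the link text out of one kept anchor.
# Assumes a double-quoted href value.
PARTS = re.compile(r'(?is)<a\b[^>]*?\bhref\s*=\s*"([^"]*)"[^>]*>(.*?)</a>')


def split_anchor(anchor):
    """Return (url, title) for a single anchor element."""
    match = PARTS.match(anchor)
    return match.group(1), match.group(2).strip()


# Step 3: once steps 1 and 2 look right on the sample, chain them.
def bookmark_filter(text):
    """Return a list of (url, title) pairs for all anchors in text."""
    return [split_anchor(anchor) for anchor in keep_anchors(text)]


print(keep_anchors(sample))      # inspect step 1 on its own
print(bookmark_filter(sample))   # then the full chain
```

Printing the intermediate result of each step on a short sample makes it obvious which pattern to adjust when the output is not what you expected.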

    >>> Bookmark_Filter.save ('bookmark_filter.se') # Save definitions to an editable text file
    >>> Bookmark_Filter = SE.SE ('bookmark_filter.se') # Next time, naming the definition file recreates the same object


    Regards

    Frederic
    Anthra Norell, Sep 7, 2006
