PyParsing module or HTMLParser

Discussion in 'Python' started by Lad, Mar 28, 2005.

  1. Lad

    Lad Guest

    I came across the pyparsing module by Paul McGuire. It seems nice, but
    I am not sure if it is the best fit for my needs.
    I need to extract some text from an HTML page. The text is in tables, and a
    table can be inside another table.
    Is it better and easier to use the pyparsing module or HTMLParser?

    Thanks for suggestions.
    La.
     
    Lad, Mar 28, 2005
    #1
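
For the HTMLParser option raised in the question, nested tables can be handled by counting how many tables are currently open. The following is only a minimal sketch under stated assumptions: it uses the Python 3 module name (`html.parser`; in Python 2 the module was called `HTMLParser`), and the class name `TableTextExtractor` and the sample markup are made up for illustration rather than taken from any real page.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect text found inside tables, tagged with the table nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <table> elements currently open
        self.cells = []   # (depth, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-blank text that appears inside at least one table.
        if text and self.depth > 0:
            self.cells.append((self.depth, text))


# Hypothetical sample: an outer table with a second table nested in a cell.
sample = """
<table>
  <tr><td>Company:</td><td>Example Co.</td></tr>
  <tr><td>
    <table><tr><td>Phone:</td><td>555 0100</td></tr></table>
  </td></tr>
</table>
"""

parser = TableTextExtractor()
parser.feed(sample)
parser.close()
for depth, text in parser.cells:
    print(depth, text)
```

The depth value makes it possible to tell outer-table text from inner-table text afterwards; badly malformed HTML (for example, unclosed tables) would throw the counter off, which is one reason the replies below suggest other tools.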

  2. Lad

    Bill Mill Guest

    On 28 Mar 2005 12:01:34 -0800, Lad wrote:
    > I came across the pyparsing module by Paul McGuire. It seems nice, but
    > I am not sure if it is the best fit for my needs.
    > I need to extract some text from an HTML page. The text is in tables, and a
    > table can be inside another table.
    > Is it better and easier to use the pyparsing module or HTMLParser?
    >


    You might want to check out BeautifulSoup at:
    http://www.crummy.com/software/BeautifulSoup/ .

    Peace
    Bill Mill
    bill.mill at gmail.com
     
    Bill Mill, Mar 28, 2005
    #2

  3. Lad

    EuGeNe Guest

    Lad wrote:
    > I came across the pyparsing module by Paul McGuire. It seems nice, but
    > I am not sure if it is the best fit for my needs.
    > I need to extract some text from an HTML page. The text is in tables, and a
    > table can be inside another table.
    > Is it better and easier to use the pyparsing module or HTMLParser?
    >
    > Thanks for suggestions.
    > La.
    >


    Check BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/);
    it did the job for me!

    --
    EuGeNe

    [----
    www.boardkulture.com
    www.actiphot.com
    www.xsbar.com
    ----]
     
    EuGeNe, Mar 28, 2005
    #3
  4. Lad

    Paul McGuire Guest

    La -

    In general, I have shied away from doing general-purpose HTML parsing
    with pyparsing. It's a crowded field, and it's likely that there are
    better candidates out there for your problem. I've heard good things
    about BeautifulSoup, but I've also heard from at least one person that
    they prefer pyparsing to BS.

    I personally have had good luck with *simple* HTML scraping with
    pyparsing, such as extracting data from tables. It just depends on how
    variable your source text is. Tables within tables may be a bit
    challenging, but we'll never know unless you provide more to go on. If
    you post a URL or some sample HTML, I could give you a more definitive
    answer (possibly even a working code sample, you never know).

    -- Paul
     
    Paul McGuire, Mar 29, 2005
    #4
  5. Lad

    Lad Guest

    Paul,
    Thank you for your reply.

    Here is a test page that I would like to test with pyparsing:

    http://www.ourglobalmarket.com/Test.htm

    From that page I would like to extract:

    the title (it is below "Lanjin Electronics Co., Ltd."):
    "Sell 2.4GHz Wireless Mini Color Camera With Audio Function"

    the description (below the title, next to the picture)
    Contact person
    Company name
    Address
    fax
    phone
    Website Address

    Do you think that the PyParsing will work for that?

    Best regards,
    Lad.
     
    Lad, Mar 30, 2005
    #5
  6. Lad

    Paul McGuire Guest

    Lad -

    Well, here's what I've got so far. I'll leave the extraction of the
    description to you as an exercise, but as a clue, it looks like it is
    delimited by "<b>View Detail</b></a></td></tr></tbody></table> <br>" at
    the beginning, and "Quantity: 500<br>" at the end, where 500 could be
    any number. This program will print out:

    ['Title:', 'Sell 2.4GHz Wireless Mini Color Camera With Audio Function
    Manufacturers Hong Kong - Exporters, Suppliers, Factories, Seller']
    ['Contact:', 'Mr. Simon Cheung']
    ['Company:', 'Lanjin Electronics Co., Ltd.']
    ['Address:', 'Rm 602, 6/F., Tung Ning Bldg., 2 Hillier Street, Sheung
    Wan , Hong Kong\n , HK\n ( Hong Kong
    )']
    ['Phone:', '852 35763877']
    ['Fax:', '852 31056238']
    ['Mobile:', '852-96439737']

    So I think pyparsing will get you pretty far along the way. Code
    attached below (unfortunately, I am posting thru Google Groups, which
    strips leading whitespace, so I have inserted '.'s to preserve code
    indentation; just strip the leading '.' characters).

    -- Paul

    ===================================
    from pyparsing import *
    import urllib

    # get input data
    url = "http://www.ourglobalmarket.com/Test.htm"
    page = urllib.urlopen( url )
    pageHTML = page.read()
    page.close()

    #~ I would like to extract the title ( it is below Lanjin Electronics
    #~ Co., Ltd. )
    #~ (Sell 2.4GHz Wireless Mini Color Camera With Audio Function )

    #~ description - below the title next to the picture
    #~ Contact person
    #~ Company name
    #~ Address
    #~ fax
    #~ phone
    #~ Website Address

    LANGBRK = Literal("<")
    RANGBRK = Literal(">")
    SLASH = Literal("/")
    tagAttr = Word(alphanums) + "=" + dblQuotedString

    # helpers for defining HTML tag expressions
    def startTag( tagname ):
    .....return ( LANGBRK + CaselessLiteral(tagname) + \
    ................ZeroOrMore(tagAttr) + RANGBRK ).suppress()
    def endTag( tagname ):
    .....return ( LANGBRK + SLASH + CaselessLiteral(tagname) + \
    ................RANGBRK ).suppress()
    def makeHTMLtags( tagname ):
    .....return startTag(tagname), endTag(tagname)
    def strong( expr ):
    .....return strongStartTag + expr + strongEndTag

    strongStartTag, strongEndTag = makeHTMLtags("strong")
    titleStart, titleEnd = makeHTMLtags("title")
    tdStart, tdEnd = makeHTMLtags("td")
    h1Start, h1End = makeHTMLtags("h1")

    title = titleStart + SkipTo( titleEnd ).setResultsName("title") + \
    ................titleEnd
    contactPerson = tdStart + h1Start + \
    ................SkipTo( h1End ).setResultsName("contact")
    company = ( tdStart + strong("Company:") + tdEnd + tdStart ) + \
    ................SkipTo( tdEnd ).setResultsName("company")
    address = ( tdStart + strong("Address:") + tdEnd + tdStart ) + \
    ................SkipTo( tdEnd ).setResultsName("address")
    phoneNum = ( tdStart + strong("Phone:") + tdEnd + tdStart ) + \
    ................SkipTo( tdEnd ).setResultsName("phoneNum")
    faxNum = ( tdStart + strong("Fax:") + tdEnd + tdStart ) + \
    ................SkipTo( tdEnd ).setResultsName("faxNum")
    mobileNum = ( tdStart + strong("Mobile:") + tdEnd + tdStart ) + \
    ................SkipTo( tdEnd ).setResultsName("mobileNum")
    webSite = ( tdStart + strong("Website Address:") + tdEnd + tdStart ) + \
    ................SkipTo( tdEnd ).setResultsName("webSite")
    scrapes = title | contactPerson | company | address | phoneNum | \
    ................faxNum | mobileNum | webSite

    # use parse actions to remove hyperlinks
    linkStart, linkEnd = makeHTMLtags("a")
    linkExpr = linkStart + SkipTo( linkEnd ) + linkEnd
    def stripHyperLink(s,l,t):
    .....return [ t[0], linkExpr.transformString( t[1] ) ]
    company.setParseAction( stripHyperLink )

    # use parse actions to add labels for data elements that don't
    # have labels in the HTML
    def prependLabel(pre):
    .....def prependAction(s,l,t):
    .........return [pre] + t[:]
    .....return prependAction
    title.setParseAction( prependLabel("Title:") )
    contactPerson.setParseAction( prependLabel("Contact:") )

    for tokens,start,end in scrapes.scanString( pageHTML ):
    .....print tokens
     
    Paul McGuire, Mar 30, 2005
    #6
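
One possible way to approach the description exercise, following the delimiter hint in post #6. This is only a sketch and not Paul's solution: the names `descStart`, `descEnd`, and `extract_description` and the sample HTML string are made up for illustration, and it assumes the opening delimiter appears on the page exactly as quoted (a `Literal` will not tolerate different whitespace inside it). It needs the third-party pyparsing package and is written to run on Python 3.

```python
from pyparsing import Literal, SkipTo, Word, nums

# Opening delimiter as quoted in the post; the closing one allows any number.
descStart = Literal("<b>View Detail</b></a></td></tr></tbody></table> <br>")
descEnd = Literal("Quantity:") + Word(nums) + Literal("<br>")

# Suppress the opening delimiter and capture everything up to the closing one.
description = descStart.suppress() + SkipTo(descEnd).setResultsName("description")

# Hypothetical stand-in for the real page, which is not reproduced here.
sampleHTML = (
    '<table><tbody><tr><td><a href="#"><b>View Detail</b></a></td></tr>'
    '</tbody></table> <br>A tiny camera with audio. Quantity: 500<br>'
)


def extract_description(html):
    """Return the first description found in html, or None if there is none."""
    for tokens, start, end in description.scanString(html):
        return tokens["description"].strip()
    return None


print(extract_description(sampleHTML))
```

If the whitespace on the real page varies, the opening delimiter could instead be built from the tag helpers in the program above (`makeHTMLtags` and friends), since pyparsing skips whitespace between separate expressions.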
  7. Lad

    Lad Guest

    Paul, thanks a lot.
    It seems to work, but I will have to study the sample hard to be able to
    do the exercise (the extraction of the description) successfully. Is it
    possible to email you if I need some help with that exercise?
    Thanks again for your help.
    Lad.
     
    Lad, Mar 31, 2005
    #7
  8. Lad

    Paul McGuire Guest

    Yes, drop me a note if you get stuck.

    -- Paul
    base64.decodestring('cHRtY2dAYXVzdGluLnJyLmNvbQ==')
     
    Paul McGuire, Mar 31, 2005
    #8