parsing in python

Discussion in 'Python' started by Peter Sprenger, Jun 9, 2004.

  1. Hello,

    I hope somebody can help me with my problem. I am writing Zope python
    scripts that will do parsing on text for dynamic webpages: I am getting
    a text from an oracle database that contains different tags that have to
    be converted to a HTML expression. E.g. "<pic#>" ( # is an integer
    number) has to be converted to <img src="..."> where the image data
    comes also from a database table.
    Since strings are immutable, is there an effective way to parse such
    texts in Python? In the process of finding and converting the embedded
    tags I also would like to make a word wrap on the generated HTML output
    to increase the readability of the generated HTML source.
    Can I write an efficient parser in Python or should I extend Python with
    a C routine that will do this task in O(n)?

    Regards

    Peter Sprenger
     
    Peter Sprenger, Jun 9, 2004
    #1

  2. Gandalf (Guest)

    Peter Sprenger wrote:

    > Hello,
    >
    > I hope somebody can help me with my problem. I am writing Zope python
    > scripts that will do parsing on text for dynamic webpages: I am
    > getting a text from an oracle database that contains different tags
    > that have to
    > be converted to a HTML expression. E.g. "<pic#>" ( # is an integer
    > number) has to be converted to <img src="..."> where the image data
    > comes also from a database table.
    > Since strings are immutable, is there an effective way to parse such
    > texts in Python? In the process of finding and converting the embedded
    > tags I also would like to make a word wrap on the generated HTML
    > output to increase the readability of the generated HTML source.
    > Can I write an efficient parser in Python or should I extend Python
    > with a C routine that will do this task in O(n)?


    I do not know any search algorithm that can do string search in O(n).
    Do you?

    By the way, I'm almost sure that you do not need a fast program here. It
    seems you are developing an internet application.
    The HTML pages you generate are...

    1.) Downloaded by the client relatively slowly
    2.) They are read by the client even more slowly

    so I think that the bottleneck will be the network bandwidth. If you are
    developing a system for your intranet, the bottleneck can be the reading
    speed of humans. Or are you so lucky that you run a site with millions of
    hits a day? In that case, I would suggest setting up a set of web
    servers. Sometimes it is better to create a load balanced server than a
    single hard-coded, optimized server. The reasons:

    1.) It is extremely easy to create a load balanced web server (I'm not
    speaking about the database server, it can be a single computer)
    2.) If you do load balancing, then you will have redundancy. When your
    server blows up you still have other servers alive
    3.) You can develop your system in a higher level language. When there
    is a need to improve performance, you can add new servers anytime. This is
    more scalable, and of course when your site is that popular it will not
    be a problem to buy and add a new server....

    These were my thoughts; you can of course create optimized C code
    just for fun. ;-)

    Best,

    G
     
    Gandalf, Jun 9, 2004
    #2

  3. Duncan Booth (Guest)

    Peter Sprenger wrote:

    > I hope somebody can help me with my problem. I am writing Zope python
    > scripts that will do parsing on text for dynamic webpages: I am getting
    > a text from an oracle database that contains different tags that have to
    > be converted to a HTML expression. E.g. "<pic#>" ( # is an integer
    > number) has to be converted to <img src="..."> where the image data
    > comes also from a database table.
    > Since strings are immutable, is there an effective way to parse such
    > texts in Python? In the process of finding and converting the embedded
    > tags I also would like to make a word wrap on the generated HTML output
    > to increase the readability of the generated HTML source.
    > Can I write an efficient parser in Python or should I extend Python with
    > a C routine that will do this task in O(n)?


    You do realise that O(n) says nothing useful about how fast it will run?

    Answering your other questions, yes, there are lots of effective ways to
    parse text strings in Python. Were I in your position, I wouldn't even
    consider C until I had demonstrated that the most obvious and clean
    solution wasn't fast enough.

    You don't really describe your data in sufficient detail, so I can only
    give general suggestions:

    You could use a regular expression replace to convert <pic#> tags to the
    appropriate image tag.
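
A minimal sketch of the regular expression approach, assuming the tags look exactly like `<pic38>` with no whitespace or attributes. The names `IMAGE_FILES`, `PIC_TAG` and `replace_pic_tags` are made up for illustration, and the dictionary stands in for the database lookup described in the question.

```python
import re

# Hypothetical lookup table; in the thread's scenario this would be
# read from a database table.
IMAGE_FILES = {22: "flower.jpg", 17: "house.jpg", 38: "dog.jpg"}

# Matches <pic#> where # is one or more digits.
PIC_TAG = re.compile(r"<pic(\d+)>")


def replace_pic_tags(text, image_files=IMAGE_FILES, default="default.jpg"):
    """Replace every <pic#> tag with an <img src="..."> tag."""

    def to_img(match):
        # Called once per match; its return value replaces the tag.
        number = int(match.group(1))
        return '<img src="%s">' % image_files.get(number, default)

    return PIC_TAG.sub(to_img, text)


result = replace_pic_tags("Intro <pic38> middle <pic99> end")
```

Passing a function to `sub()` keeps the lookup logic in ordinary Python code, so the pattern itself can stay short.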

    You could use sgmllib to parse the data.

    You could use one of Python's many xml parsers to parse the data (provided
    it is valid xml, which it may not be).

    You could use the split method on strings to split the data on '<'. Each
    string (other than the first) then begins with a potential tag which you
    can match with the startswith method or a regular expression.
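
The split-based idea might look roughly like this. It is only a sketch under the assumption that a pic tag is exactly `pic` followed by digits; `convert_by_splitting` and its parameters are hypothetical names, not anything from the original posts.

```python
def convert_by_splitting(text, image_files, default="default.jpg"):
    """Convert <pic#> tags by splitting the text on '<'."""
    pieces = text.split("<")
    out = [pieces[0]]  # text before the first '<' cannot start with a tag
    for piece in pieces[1:]:
        # Each later piece begins with a potential tag name.
        tag, closed, rest = piece.partition(">")
        number = tag[3:]
        if closed and tag.startswith("pic") and number.isdecimal():
            out.append('<img src="%s">' % image_files.get(int(number), default))
            out.append(rest)
        else:
            # Not a pic tag: put back the '<' that split() removed.
            out.append("<" + piece)
    return "".join(out)


converted = convert_by_splitting(
    "a <pic17> b <b>c</b> 1 < 2", {17: "house.jpg"}
)
```

Anything that is not recognised as a pic tag, including a stray `<`, is passed through unchanged.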

    You could replace '<' with '%(' and '>' with ')s' then use the % operator
    to process all the replacements using a class with a custom __getitem__
    method.
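
A rough sketch of the % operator trick. It assumes every `<` and `>` in the text belongs to a tag and that tag names contain no parentheses; text that breaks those assumptions would need another approach. Literal `%` characters are escaped first so they survive formatting. `TagMapper` and `convert_with_percent` are illustrative names only.

```python
class TagMapper:
    """Mapping-like object that the % operator queries once per tag."""

    def __init__(self, image_files, default="default.jpg"):
        self.image_files = image_files
        self.default = default

    def __getitem__(self, key):
        # key is the text that stood between '<' and '>'.
        number = key[3:]
        if key.startswith("pic") and number.isdecimal():
            return '<img src="%s">' % self.image_files.get(int(number), self.default)
        # Any other tag is reproduced unchanged.
        return "<%s>" % key


def convert_with_percent(text, mapper):
    # Escape literal '%' first, then turn <tag> into %(tag)s.
    template = text.replace("%", "%%").replace("<", "%(").replace(">", ")s")
    return template % mapper


page = convert_with_percent(
    "100% <b>done</b> <pic22>", TagMapper({22: "flower.jpg"})
)
```

The appeal is that a single `%` operation performs all substitutions; the cost is the strict assumption about where angle brackets may appear.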

    If you want to word wrap and pretty print the HTML, then that is better
    done as a separate pass. Just get a general purpose HTML pretty printer
    (e.g. mxTidy) and call it. That way you can easily turn it off for
    production use if you really are concerned about speed.
     
    Duncan Booth, Jun 9, 2004
    #3
  4. Paul McGuire (Guest)

    Peter Sprenger wrote:
    > Hello,
    >
    > I hope somebody can help me with my problem. I am writing Zope python
    > scripts that will do parsing on text for dynamic webpages: I am getting
    > a text from an oracle database that contains different tags that have to
    > be converted to a HTML expression. E.g. "<pic#>" ( # is an integer
    > number) has to be converted to <img src="..."> where the image data
    > comes also from a database table.
    > Since strings are immutable, is there an effective way to parse such
    > texts in Python? In the process of finding and converting the embedded
    > tags I also would like to make a word wrap on the generated HTML output
    > to increase the readability of the generated HTML source.
    > Can I write an efficient parser in Python or should I extend Python with
    > a C routine that will do this task in O(n)?
    >
    > Regards
    >
    > Peter Sprenger


    Peter -

    Not sure how this holds up to "high-performance" requirements, but this
    should work as a prototype until you need something better. (Requires
    download of latest pyparsing 1.2beta3, at http://pyparsing.sourceforge.net.)
    Note that this grammar is tolerant of upper or lowercase PIC, plus
    inclusion of whitespace between tokens and tag attributes within the
    <pic###> tag.

    BTW, I'll be the first one to admit that this is a lot wordier (and very
    possibly slower) than something like re.sub(). But it is *very* productive
    from a programming standpoint, and implicitly takes care of nuisance issues
    like unexpected whitespace. It is also simple from a maintenance
    standpoint: adding support for caseless matching on 'pic', or for additional
    tag attributes, was very straightforward. I'm just not good enough with re's
    to make similar changes in as short a time, or in as readable a
    style.

    (While we are talking about performance, I'll also mention that
    transformString() does not use string concatenation to construct its output.
    As the input is processed, the transformed text fragments and intervening
    original text are accumulated into a list; at the end, the list is converted
    to a string using "".join(). )
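
The accumulate-then-join pattern described above can be sketched as follows. This is not pyparsing's actual implementation, only an illustration of the same idea using `re.finditer`; `transform`, `PIC_TAG` and the lookup dictionary are hypothetical.

```python
import re

PIC_TAG = re.compile(r"<pic(\d+)>")


def transform(text, image_files, default="default.jpg"):
    """Build the output from a list of fragments instead of repeated +=."""
    fragments = []
    last_end = 0
    for match in PIC_TAG.finditer(text):
        # Untouched text between the previous match and this one.
        fragments.append(text[last_end:match.start()])
        # Replacement for the matched tag.
        number = int(match.group(1))
        fragments.append('<img src="%s">' % image_files.get(number, default))
        last_end = match.end()
    # Whatever follows the final match.
    fragments.append(text[last_end:])
    return "".join(fragments)


joined = transform("x<pic38>y<pic1>z", {38: "dog.jpg"})
```

Because strings are immutable, each `+=` on a growing string may copy everything built so far; collecting fragments and joining once avoids that.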

    -- Paul

    ===================
    from pyparsing import CharsNotIn,Word,Literal,Optional,CaselessLiteral

    testdata = """
    <HTML>
    <BODY>
    <pic38>
    <pic22 align="left">
    < PIC 17 >

    < pic99 >
    </BODY></HTML>
    """

    # Define parse action to convert <pic###> tags to <img src=...> tags
    def convertPicNumToImgSrc(src,loc,toks):
        # Look up the image file by pic number, falling back to a default image
        imgFile = imageFiles.get( toks.picnum, "default.jpg" )
        retstring = '<img src="%s"%s>' % (imgFile, toks.picAttribs)
        return retstring

    # Define grammar for matching text pattern - don't forget that there might
    # be HTML tag attributes included in the <pic###> tag
    # Return parse results as:
    # picnum - the numeric part of the <pic###> tag, converted to an integer
    # picAttribs - optional HTML attributes that might be defined in the
    #              <pic###> tag
    #
    integer = Word("0123456789").setParseAction( lambda s,l,t: int(t[0]) )
    picTagDefn = ( Literal("<") +
                   CaselessLiteral("pic") +
                   integer.setResultsName("picnum") +
                   Optional( CharsNotIn(">") ).setResultsName("picAttribs") +
                   ">" ).setParseAction( convertPicNumToImgSrc )

    # Set up lookup table of pic #'s to image file names
    # (in reality, these would be read from database table)
    imageFiles = {
    22 : "flower.jpg",
    17 : "house.jpg",
    38 : "dog.jpg",
    }

    # Run transformString
    print(picTagDefn.transformString(testdata))

    ===================
    output:

    <HTML>
    <BODY>
    <img src="dog.jpg">
    <img src="flower.jpg" align="left">
    <img src="house.jpg" >

    <img src="default.jpg" >
    </BODY></HTML>
     
    Paul McGuire, Jun 9, 2004
    #4