Newbie regular expression and whitespace question

Discussion in 'Python' started by googleboy, Sep 22, 2005.

  1. googleboy

    googleboy Guest

    Hi.

    I am trying to collapse an html table into a single line. Basically,
    anytime I see ">" & "<" with nothing but whitespace between them, I'd
    like to remove all the whitespace, including newlines. I've read the
    how-to and I have tried a bunch of things, but nothing seems to work
    for me:

    --

    table = open(r'D:\path\to\tabletest.txt', 'rb')
    strTable = table.read()

    #Below find the different sort of things I have tried, one at a time:

    strTable = strTable.replace(">\s<", "><") #I got this from the module
    docs

    strTable = strTable.replace(">.<", "><")

    strTable = ">\s+<".join(strTable)

    strTable = ">\s<".join(strTable)

    print strTable

    --

    The table in question looks like this:

    <table width="80%" border="0">
    <tr>
    <td>&nbsp;</td>
    <td colspan="2">Introduction</td>
    <td><div align="right">3</div></td>
    </tr>
    <tr>
    <td>&nbsp;</td>
    </tr>
    <tr>
    <td><i>ONE</i></td>
    <td colspan="2">Childraising for Parrots</td>
    <td><div align="right">11</div></td>
    </tr>
    </table>



    For extra kudos (and I confess I have been so stuck on the above
    problem I haven't put much thought into how to do this one) I'd like to
    be able to measure the number of characters between the <p> & </p>
    tags, and then insert a newline character at the end of the next word
    after an arbitrary number of characters..... I am reading in to a
    script a bunch of paragraphs formatted for a webpage, but they're all
    on one big long line and I would like to split them for readability.

    TIA

    Googleboy
     
    googleboy, Sep 22, 2005
    #1
    1. Advertising

  2. googleboy

    Paul McGuire Guest

    "googleboy" <> wrote in message
    news:...
    > Hi.
    >
    > I am trying to collapse an html table into a single line. Basically,
    > anytime I see ">" & "<" with nothing but whitespace between them, I'd
    > like to remove all the whitespace, including newlines. I've read the
    > how-to and I have tried a bunch of things, but nothing seems to work
    > for me:
    >
    > --
    >
    > table = open(r'D:\path\to\tabletest.txt', 'rb')
    > strTable = table.read()
    >
    > #Below find the different sort of things I have tried, one at a time:
    >
    > strTable = strTable.replace(">\s<", "><") #I got this from the module
    > docs
    >
    > strTable = strTable.replace(">.<", "><")
    >
    > strTable = ">\s+<".join(strTable)
    >
    > strTable = ">\s<".join(strTable)
    >
    > print strTable
    >
    > --
    >
    > The table in question looks like this:
    >
    > <table width="80%" border="0">
    > <tr>
    > <td>&nbsp;</td>
    > <td colspan="2">Introduction</td>
    > <td><div align="right">3</div></td>
    > </tr>
    > <tr>
    > <td>&nbsp;</td>
    > </tr>
    > <tr>
    > <td><i>ONE</i></td>
    > <td colspan="2">Childraising for Parrots</td>
    > <td><div align="right">11</div></td>
    > </tr>
    > </table>
    >
    >
    >
    > For extra kudos (and I confess I have been so stuck on the above
    > problem I haven't put much thought into how to do this one) I'd like to
    > be able to measure the number of characters between the <p> & </p>
    > tags, and then insert a newline character at the end of the next word
    > after an arbitrary number of characters..... I am reading in to a
    > script a bunch of paragraphs formatted for a webpage, but they're all
    > on one big long line and I would like to split them for readability.
    >
    > TIA
    >
    > Googleboy
    >

    If you're absolutely stuck on using RE's, then others will have to step
    forward. Meanwhile, here's a pyparsing solution (get pyparsing at
    http://pyparsing.sourceforge.net):

    ---------------
    from pyparsing import *

    LT = Literal("<")
    GT = Literal(">")

    collapsableSpace = GT + LT # matches with or without intervening
    whitespace
    collapsableSpace.setParseAction( replaceWith("><") )

    print collapsableSpace.transformString(data)
    ---------------

    The reason this works is that pyparsing implicitly skips over whitespace
    while looking for matches of collapsable space (a '>' followed by a '<').
    When found, the parse action is triggered, which in this case, replaces
    whatever was matched with the string "><". Finally, the input data (in this
    case your HTML table, stored in the string variable, data) is passed to
    transformString, which scans for matches of the collapsableSpace expression,
    runs the parse action when they are found, and returns the final transformed
    string.

    As for word wrapping within <p>...</p> tags, there are at least two recipes
    in the Python Cookbook for word wrapping. Be careful, though, as many HTML
    pages are very bad about omitting the trailing </p> tags.

    -- Paul
     
    Paul McGuire, Sep 22, 2005
    #2
    1. Advertising

  3. Paul McGuire wrote:

    > If you're absolutely stuck on using RE's, then others will have to step
    > forward. Meanwhile, here's a pyparsing solution (get pyparsing at
    > http://pyparsing.sourceforge.net):


    so, let's see. using ...

    from pyparsing import *
    import re

    data = """ ... table example from op ... """

    def test1():
    LT = Literal("<")
    GT = Literal(">")
    collapsableSpace = GT + LT
    collapsableSpace.setParseAction( replaceWith("><") )
    return collapsableSpace.transformString(data)

    def test2():
    return re.sub(">\s+<", "><", data)

    I get

    > timeit -s "import test" "test.test1()"

    100 loops, best of 3: 6.8 msec per loop

    > timeit -s "import test" "test.test2()"

    10000 loops, best of 3: 33.3 usec per loop

    or in other words, five lines instead of one, and a 200x slowdown.

    but alright, maybe we should precompile the expressions to get a
    fair comparision. adding

    LT = Literal("<")
    GT = Literal(">")
    collapsableSpace = GT + LT
    collapsableSpace.setParseAction( replaceWith("><") )

    def test3():
    return collapsableSpace.transformString(data)

    p = re.compile(">\s+<")

    def test4():
    return p.sub("><", data)

    to the first program, I get

    > timeit -s "import test" "test.test3()"

    100 loops, best of 3: 6.73 msec per loop

    > timeit -s "import test" "test.test4()"

    10000 loops, best of 3: 27.8 usec per loop

    that's a 240x slowdown. hmm.

    </F>
     
    Fredrik Lundh, Sep 22, 2005
    #3
  4. googleboy a écrit :
    > Hi.
    >
    > I am trying to collapse an html table into a single line. Basically,
    > anytime I see ">" & "<" with nothing but whitespace between them, I'd
    > like to remove all the whitespace, including newlines. I've read the
    > how-to and I have tried a bunch of things, but nothing seems to work
    > for me:
    >
    > --
    >
    > table = open(r'D:\path\to\tabletest.txt', 'rb')
    > strTable = table.read()
    >
    > #Below find the different sort of things I have tried, one at a time:
    >
    > strTable = strTable.replace(">\s<", "><") #I got this from the module
    > docs


    From which module's doc ?

    ">\s<" is the litteral string ">\s<", not a regular expression. Please
    re-read the re module doc, and the re howto (you'll find a link to it in
    the re module's doc...)
     
    Bruno Desthuilliers, Sep 22, 2005
    #4
  5. "googleboy" <> wrote:
    > Hi.
    >
    > I am trying to collapse an html table into a single line. Basically,
    > anytime I see ">" & "<" with nothing but whitespace between them, I'd
    > like to remove all the whitespace, including newlines. I've read the
    > how-to and I have tried a bunch of things, but nothing seems to work
    > for me:
    >
    > [snip]


    As others have shown you already, you need to use the sub method of the re module:

    import re
    regex = re.compile(r'>\s*<')
    print regex.sub('><',data)

    > For extra kudos (and I confess I have been so stuck on the above
    > problem I haven't put much thought into how to do this one) I'd like to
    > be able to measure the number of characters between the <p> & </p>
    > tags, and then insert a newline character at the end of the next word
    > after an arbitrary number of characters..... I am reading in to a
    > script a bunch of paragraphs formatted for a webpage, but they're all
    > on one big long line and I would like to split them for readability.


    What I guess you want to do is wrap some text. Do not reinvent the wheel, there's already a module
    for that:

    import textwrap
    print textwrap.fill(oneBigLongLine, 60)

    HTH,
    George
     
    George Sakkis, Sep 23, 2005
    #5
  6. googleboy

    googleboy Guest

    Thanks for the great positive responses. I was close with what I was
    trying, I guess, but close only counts in horseshoes and um..
    something else that close counts in.

    :)

    googleboy
     
    googleboy, Sep 23, 2005
    #6
  7. googleboy

    Paul McGuire Guest

    "Fredrik Lundh" <> wrote in message
    news:...
    <snip - timing and sample code, comparing pyparsing (test3) with comparable
    regexp (test4)>
    >
    > > timeit -s "import test" "test.test3()"

    > 100 loops, best of 3: 6.73 msec per loop
    >
    > > timeit -s "import test" "test.test4()"

    > 10000 loops, best of 3: 27.8 usec per loop
    >
    > that's a 240x slowdown. hmm.
    >
    > </F>
    >
    >

    Well, what of it? How fast does it have to be? Is it a one-shot
    conversion? People tend to be willing to wait a bit longer for one-time
    conversion programs. What else is going on in this program? Is this the
    bottleneck? Are we reading the input over the Internet through HTTP?

    If I'm running this program and waiting for the results, 7 msec isn't
    perceptibly slower than 28 usec - both are going to seem pretty much
    instantaneous. On the other hand, if I'm processing 100 files, then this
    goes up to, um, .7 sec vs 3 msec.

    There is no question, regexp's beat the pants off of pyparsing in raw
    performance. But this newsgroup has visited the raw performance issue many
    times in the past, usually when responding to the "Python can't be very
    fast, it's interpreted" argument. Raw performance is just one aspect in
    determining suitability of a given technical approach.

    -- Paul
     
    Paul McGuire, Sep 23, 2005
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. VSK
    Replies:
    2
    Views:
    2,379
  2. Oli Filth
    Replies:
    9
    Views:
    3,365
    Uncle Pirate
    Jan 17, 2005
  3. Replies:
    10
    Views:
    798
    Eric Brunel
    Dec 16, 2008
  4. MRAB
    Replies:
    3
    Views:
    406
  5. geoffrobinson
    Replies:
    6
    Views:
    96
    Dr John Stockton
    Feb 7, 2006
Loading...

Share This Page