Program inefficiency?

Discussion in 'Python' started by hall.jeff@gmail.com, Sep 29, 2007.

  1. Guest

    I wrote the following simple program to loop through our help files
    and fix some errors (in case you can't see the subtle RE search that's
    happening, we're replacing spaces in bookmarks with _'s)

    the program works great except for one thing. It's significantly
    slower through the later files in the search than through the early
    ones... Before anyone criticizes, I recognize that the middle section
    could be simplified with a for loop... I just haven't cleaned it
    up...

    The problem is that the first 300 files take about 10-15 seconds and
    the last 300 take about 2 minutes... If we do more than about 1500
    files in one run, it just hangs up and never finishes...

    Is there a solution here that I'm missing? What am I doing that is so
    inefficient?

    # File: masseditor.py

    import re
    import os
    import time

    def massreplace():
        editfile = open("pathname\editfile.txt")
        filestring = editfile.read()
        filelist = filestring.splitlines()
    ##    errorcheck = re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
        for i in range(len(filelist)):
            source = open(filelist[i])
            starttext = source.read()
            interimtext = replacecycle(starttext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            interimtext = replacecycle(interimtext)
            finaltext = replacecycle(interimtext)
            source.close()
            source = open(filelist[i],"w")
            source.write(finaltext)
            source.close()
    ##        if errorcheck.findall(finaltext)!=[]:
    ##            print errorcheck.findall(finaltext)
    ##            print filelist[i]
            if i == 100:
                print "done 100"
                print time.clock()
            elif i == 300:
                print "done 300"
                print time.clock()
            elif i == 600:
                print "done 600"
                print time.clock()
            elif i == 1000:
                print "done 1000"
                print time.clock()
        print "done"
        print i
        print time.clock()

    def replacecycle(starttext):
        p1= re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
        p2= re.compile('(name=")+(.*)( )+(.*)(">)+')
        p3= re.compile('(href=|HREF=)+(.*)(#)+(.*)(\')+(.*)(">)+')
        p4= re.compile('(name=")+(.*)(\')+(.*)(">)+')
        p5= re.compile('(href=|HREF=)+(.*)(#)+(.*)(-)+(.*)(">)+')
        p6= re.compile('(name=")+(.*)(-)+(.*)(">)+')
        p7= re.compile('(href=|HREF=)+(.*)(#)+(.*)(<)+(.*)(">)+')
        p8= re.compile('(name=")+(.*)(<)+(.*)(">)+')
        p7= re.compile('(href=|HREF=")+(.*)(#)+(.*)(:)+(.*)(">)+')
        p8= re.compile('(name=")+(.*)(:)+(.*)(">)+')
        p9= re.compile('(href=|HREF=")+(.*)(#)+(.*)(\?)+(.*)(">)+')
        p10= re.compile('(name=")+(.*)(\?)+(.*)(">)+')
        p100= re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
        q1= r"\1\2\3\4_\6\7"
        q2= r"\1\2_\4\5"
        interimtext = p1.sub(q1, starttext)
        interimtext = p2.sub(q2, interimtext)
        interimtext = p3.sub(q1, interimtext)
        interimtext = p4.sub(q2, interimtext)
        interimtext = p5.sub(q1, interimtext)
        interimtext = p6.sub(q2, interimtext)
        interimtext = p7.sub(q1, interimtext)
        interimtext = p8.sub(q2, interimtext)
        interimtext = p9.sub(q1, interimtext)
        interimtext = p10.sub(q2, interimtext)
        interimtext = p100.sub(q2, interimtext)

        return interimtext

    massreplace()
     
    hall.jeff@gmail.com, Sep 29, 2007
    #1

  2. > [...]
    > the program works great except for one thing. It's significantly
    > slower through the later files in the search than through the early
    > ones... Before anyone criticizes, I recognize that the middle section
    > could be simplified with a for loop... I just haven't cleaned it
    > up...
    >
    > The problem is that the first 300 files take about 10-15 seconds and
    > the last 300 take about 2 minutes... If we do more than about 1500
    > files in one run, it just hangs up and never finishes...
    >
    > Is there a solution here that I'm missing? What am I doing that is so
    > inefficient?


    The only thing I see is that you compile all of the RE's every
    time you call replacecycle(). They really only need to be
    compiled once, but I don't know why that would cause the
    progressive slowing.

    FWIW, it seems to me like a shell+sed script would be the
    obvious solution to the problem.

    --
    Grant Edwards                   grante at visi.com
    Yow! Are you still SEXUALLY ACTIVE? Did you BRING th' REINFORCEMENTS?
     
    Grant Edwards, Sep 29, 2007
    #2

  3. On Sat, 2007-09-29 at 15:22 +0000, hall.jeff@gmail.com wrote:
    > [...]
    > def replacecycle(starttext):
    > p1= re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
    > p2= re.compile('(name=")+(.*)( )+(.*)(">)+')
    > p3= re.compile('(href=|HREF=)+(.*)(#)+(.*)(\')+(.*)(">)+')
    > p4= re.compile('(name=")+(.*)(\')+(.*)(">)+')
    > p5= re.compile('(href=|HREF=)+(.*)(#)+(.*)(-)+(.*)(">)+')
    > p6= re.compile('(name=")+(.*)(-)+(.*)(">)+')
    > p7= re.compile('(href=|HREF=)+(.*)(#)+(.*)(<)+(.*)(">)+')
    > p8= re.compile('(name=")+(.*)(<)+(.*)(">)+')
    > p7= re.compile('(href=|HREF=")+(.*)(#)+(.*)(:)+(.*)(">)+')
    > p8= re.compile('(name=")+(.*)(:)+(.*)(">)+')
    > p9= re.compile('(href=|HREF=")+(.*)(#)+(.*)(\?)+(.*)(">)+')
    > p10= re.compile('(name=")+(.*)(\?)+(.*)(">)+')
    > p100= re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
    > [...]


    One obvious opportunity for optimization is to compile those re's only
    once at the beginning of the program instead of every time
    replacecycle() is called (which is inexplicably called 13 times for each
    file).
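
    A sketch of what that looks like (illustrative, with shortened names
    and only two of the thirteen patterns):

    import re

    # compiled once, when the module loads, rather than on every call
    P_HREF = re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
    P_NAME = re.compile('(name=")+(.*)( )+(.*)(">)+')

    def replacecycle(text):
        # reuse the precompiled patterns on each call
        text = P_HREF.sub(r"\1\2\3\4_\6\7", text)
        text = P_NAME.sub(r"\1\2_\4\5", text)
        return text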

    --
    Carsten Haese
    http://informixdb.sourceforge.net
     
    Carsten Haese, Sep 29, 2007
    #3
  4. Guest

    I did try moving the re.compile's up and out of replacecycle()
    but it didn't impact the time in any meaningful way (2 seconds
    maybe)...

    I'm not sure what a shell+sed script is... I'm fairly new to Python
    and my only other coding experience is with VBA... This was my first
    Python program.

    In case it helps... We started with only 6 loops of replacecycle() but
    had to keep adding progressively more as we found more and more links
    with lots of spaces in them... As we did that, the program's run time
    grew, but only in proportion to the added number of cycles... This is
    exactly what I would have expected, and it leads me to believe that
    the problem does not lie in the replacecycle() def but in the
    massreplace() def... *shrug*
     
    hall.jeff@gmail.com, Sep 29, 2007
    #4
  5. wrote:
    > Is there a solution here that I'm missing? What am I doing that is so
    > inefficient?
    >


    Hi Jeff,

    Yes, it seems you have plenty of performance leaks.
    Please see my notes below.

    > def massreplace():
    > editfile = open("pathname\editfile.txt")
    > filestring = editfile.read()
    > filelist = filestring.splitlines()
    > ## errorcheck = re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
    > for i in range(len(filelist)):
    > source = open(filelist[i])
    >
    >


    Read this post:
    http://mail.python.org/pipermail/python-list/2004-August/275319.html
    Instead of reading the whole document, storing it in a variable,
    splitting it and then iterating, you could simply do:

    def massreplace():
        editfile = open("pathname\editfile.txt")
        for source in editfile:  # note: each line keeps its trailing newline


    > starttext = source.read()
    > interimtext = replacecycle(starttext)
    > (...)
    >


    Excuse me, but this is insane. Do just one call (or none at all, I don't
    see why you need to split this into two functions) and let the function
    manage the replacement "layers".

    I'm skipping the next part (don't want to understand all your logic now).

    > (...)
    >
    > def replacecycle(starttext):
    >



    Unneeded, IMHO.

    > p1= re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')
    > (...)
    > interimtext = p100.sub(q2, interimtext)
    >


    Same euphemism applies here. I might be wrong, but I'm pretty confident
    you can make all this in one simple regex.
    Anyway, although regexes are supposed to be cached, you don't need to
    define them every time the function gets called. Do it once, outside
    the function. At the very least you avoid one of the most important
    performance hits in Python: function calls. Read this:
    http://wiki.python.org/moin/PythonSpeed/PerformanceTips
    Also, if you are parsing HTML consider using BeautifulSoup or
    ElementTree, or something similar (especially if you don't feel
    particularly confident with regexes).


    Hope you find this helpful.
    Pablo
     
    Pablo Ziliani, Sep 29, 2007
    #5
  6. On 2007-09-29, hall.jeff@gmail.com wrote:

    > I'm not sure what a shell+sed script is...


    http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_05_01.html#sect_05_01_01
    http://tldp.org/LDP/Bash-Beginners-Guide/html/chap_05.html
    http://www.grymoire.com/Unix/Sed.html

    http://www.gnu.org/software/bash/
    http://en.wikipedia.org/wiki/Bash

    Unfortunately it appears you're using Windows (a particularly bad
    choice for this sort of file processing). You can, however,
    get bash and sed for Windows if you wish:

    http://www.cygwin.com/

    > In case it helps... We started with only 6 loops of replacecycle() but
    > had to keep adding progressively more as we found more and more links
    > with lots of spaces in them...


    I would think with the correct RE's you'd only have to call it
    once.

    > As we did that, the program's run time grew, but only in proportion
    > to the added number of cycles... This is exactly what I would have
    > expected, and it leads me to believe that the problem does not lie
    > in the replacecycle() def but in the massreplace() def... *shrug*


    As the program runs on progressively more files does the
    process's memory usage grow without bounds? Does the machine
    start swapping?
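
    Another thing worth checking: time each file separately, so you can
    see whether particular files (rather than accumulated state) are
    slow. An untested sketch, reusing replacecycle() from your program:

    import time

    def massreplace_timed(filelist):
        # report any file that takes more than a second, instead of
        # printing cumulative clock readings every few hundred files
        for name in filelist:
            start = time.clock()
            text = open(name).read()
            for _ in range(13):
                text = replacecycle(text)
            open(name, "w").write(text)
            elapsed = time.clock() - start
            if elapsed > 1.0:
                print "%6.1fs %s" % (elapsed, name)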

    --
    Grant Edwards                   grante at visi.com
    Yow! I'm pretending that we're all watching PHIL SILVERS instead of
    RICARDO MONTALBAN!
     
    Grant Edwards, Sep 29, 2007
    #6
  7. Guest

    no swaps... memory usage is about 14k (these are small Html files)...
    no hard drive cranking away or fan on my laptop going nutty... CPU
    usage isn't even pegged... that's what makes me think it's not some
    sort of bizarre memory leak... Unfortunately, it also means I'm out of
    ideas...
     
    hall.jeff@gmail.com, Sep 29, 2007
    #7
  8. Guest

    For anyone that cares, I figured out the "problem"... the webhelp
    files that it hits the wall on are the compiled search files... They
    are the only files in the system that have line lengths that are
    RIDICULOUS in length... I'm looking at one right now that has 32767
    characters all on one line...

    I'm absolutely certain that that's the problem...
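
    That would explain it: each of those patterns chains several greedy
    (.*) groups, and since '.' doesn't cross newlines, short lines give
    the engine little to backtrack through, while a single huge line with
    many partial matches but no closing "> makes the number of
    combinations explode. An illustrative timing sketch, using p1 from
    the program above:

    import re, time

    p1 = re.compile('(href=|HREF=)+(.*)(#)+(.*)( )+(.*)(">)+')

    unit = 'href=x#y z'            # an href, a '#' and a space, no '">'
    for n in (20, 40, 80):
        many = (unit + '\n') * n   # the text split over n short lines
        one = unit * n             # the same text as a single long line
        for label, text in ('short lines', many), ('one line', one):
            start = time.clock()
            p1.sub(r"\1\2\3\4_\6\7", text)
            print n, label, time.clock() - start

    The short-line timings stay flat; the one-line timings blow up each
    time n doubles.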

    Thanks for everyone's help
     
    hall.jeff@gmail.com, Sep 29, 2007
    #8
  9. thebjorn Guest

    On Sep 29, 5:22 pm, hall.jeff@gmail.com wrote:
    > I wrote the following simple program to loop through our help files
    > and fix some errors (in case you can't see the subtle RE search that's
    > happening, we're replacing spaces in bookmarks with _'s)
    >
    > the program works great except for one thing. It's significantly
    > slower through the later files in the search than through the early
    > ones... Before anyone criticizes, I recognize that the middle section
    > could be simplified with a for loop... I just haven't cleaned it
    > up...
    >
    > The problem is that the first 300 files take about 10-15 seconds and
    > the last 300 take about 2 minutes... If we do more than about 1500
    > files in one run, it just hangs up and never finishes...
    >
    > Is there a solution here that I'm missing? What am I doing that is so
    > inefficient?


    Ugh, that was entirely too many regexps for my taste :)

    How about something like:

    def attr_ndx_iter(txt, attribute):
        "Return all the start and end indices for the values of attribute."
        txt = txt.lower()
        attribute = attribute.lower() + '='
        alen = len(attribute)
        chunks = txt.split(attribute)
        if len(chunks) == 1:
            return

        start = len(chunks[0]) + alen
        end = -1

        for chunk in chunks[1:]:
            qchar = chunk[0]
            end = start + chunk.index(qchar, 1)
            yield start + 1, end
            start += len(chunk) + alen

    def substr_map(txt, indices, fn):
        "Apply fn to text within indices."
        res = []
        cur = 0

        for i, j in indices:
            res.append(txt[cur:i])
            res.append(fn(txt[i:j]))
            cur = j

        res.append(txt[cur:])
        return ''.join(res)

    def transform(s):
        "The transformation to do on the attribute values."
        return s.replace(' ', '_')

    def zap_spaces(txt, *attributes):
        for attr in attributes:
            txt = substr_map(txt, attr_ndx_iter(txt, attr), transform)
        return txt

    def mass_replace():
        import sys
        w = sys.stdout.write

        for f in open(r'pathname\editfile.txt'):
            f = f.strip()  # drop the trailing newline from the list entry
            try:
                # read first, then reopen for writing: opening with 'w'
                # truncates the file before its old contents can be read
                fixed = zap_spaces(open(f).read(), 'href', 'name')
                open(f, 'w').write(fixed)
                w('.')  # progress-meter :)
            except:
                print 'Error processing file:', f
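
    A quick check with a bookmark link like the ones in question:

    print zap_spaces('<a href="Web_Sites.htm#A Web Sites">', 'href', 'name')
    # prints: <a href="Web_Sites.htm#A_Web_Sites">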

    minimally-tested'ly y'rs
    -- bjorn
     
    thebjorn, Sep 29, 2007
    #9
  10. thebjorn wrote:
    > On Sep 29, 5:22 pm, hall.jeff@gmail.com wrote:
    >
    >> I wrote the following simple program to loop through our help files
    >> and fix some errors (in case you can't see the subtle RE search that's
    >> happening, we're replacing spaces in bookmarks with _'s)
    >> (...)
    >>

    >
    > Ugh, that was entirely too many regexps for my taste :)
    >
    > How about something like:
    >
    > def attr_ndx_iter(txt, attribute):
    > (...)
    > def substr_map(txt, indices, fn):
    > (...)
    > def transform(s):
    > (...)
    > def zap_spaces(txt, *attributes):
    > (...)
    > def mass_replace():
    > (...)


    Oh yeah, now it's clear as mud.
    I do think that the whole program shouldn't take more than 10 lines of
    code using one sensible regex (impossible to define without knowing the
    real input and output formats).
    And (sorry to tell) I'm convinced this is a problem for regexes, in
    spite of anybody's personal taste.

    Pablo
     
    Pablo Ziliani, Sep 29, 2007
    #10
  11. thebjorn Guest

    On Sep 29, 7:55 pm, Pablo Ziliani <> wrote:
    > thebjorn wrote:
    > > On Sep 29, 5:22 pm, hall.jeff@gmail.com wrote:

    >
    > >> I wrote the following simple program to loop through our help files
    > >> and fix some errors (in case you can't see the subtle RE search that's
    > >> happening, we're replacing spaces in bookmarks with _'s)
    > >> (...)

    >
    > > Ugh, that was entirely too many regexps for my taste :)

    >
    > > How about something like:

    >
    > > def attr_ndx_iter(txt, attribute):
    > > (...)
    > > def substr_map(txt, indices, fn):
    > > (...)
    > > def transform(s):
    > > (...)
    > > def zap_spaces(txt, *attributes):
    > > (...)
    > > def mass_replace():
    > > (...)

    >
    > Oh yeah, now it's clear as mud.


    I'm anxiously awaiting your beacon of clarity ;-)

    > I do think that the whole program shouldn't take more than 10 lines of
    > code


    Well, my mass_replace above is 10 lines, and the actual replacement
    code is a one liner. Perhaps you'd care to illustrate how you'd
    shorten that while still keeping it "clear"?

    > using one sensible regex


    I have no doubt that it would be possible to do with a single regex.
    Whether it would be sensible or not is another matter entirely...

    > (impossible to define without knowing the real input and output formats).


    Of course, but I don't think you can guess too terribly wrong. My
    version handles upper and lower case attributes, quoting with single
    (') and double (") quotes, and any number of spaces in attribute
    values. It maintains all other text as-is, and converts spaces to
    underscores in href and name attributes. Did I get anything majorly
    wrong?

    > And (sorry to tell) I'm convinced this is a problem for regexes, in
    > spite of anybody's personal taste.


    Well, let's see it then :)

    smack-smack'ly y'rs
    -- bjorn
     
    thebjorn, Sep 29, 2007
    #11
  12. Guest

    The search is trying to replace the spaces in our bookmarks (and the
    links that go to those bookmarks)...

    The link tag looks like this:

    <a href="Web_Sites.htm#A Web Sites">

    and the bookmark tag itself looks like this:

    <a name="A Web Sites"></a>

    some pitfalls I've already run up against...
    SOMETIMES (but not often) the a and the href (or name) are split
    across lines... this led me to just drop the "<a" from the front
    If there are no spaces, SOME (but again, not all) of the "<a name"
    tags don't have "'s... this is a problem because we're having to
    replace all special characters with _'s...
    Some of our bookmarks are quite wordy (we found one yesterday with 11
    spaces)
    href is sometimes all caps (HREF)

    As you can imagine, there are a lot of corner cases and I felt it was
    easier just to be inefficient and write out all the regex cases and
    loop through them repeatedly... I've also got to work around the stuff
    already in the system (for example, I need to make certain I'm looking
    behind the #'s in the bookmark links; otherwise I'll end up replacing
    legitimate -'s in external web site addresses)

    I think Pablo is correct that a single (or perhaps two) RE statements
    are all that is needed... perhaps:

    p1= re.compile('(href=|HREF=)+(.*)(#)+(.*)(\w\'\?-<: )+(.*)(">)+')
    and the corresponding name replace and then the one corner case we ran
    into of
    p100= re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
     
    hall.jeff@gmail.com, Sep 29, 2007
    #12
  13. Guest

    I think he's saying it should look like this:

    # File: masseditor.py

    import re
    import os
    import time

    p1= re.compile('(href=|HREF=)+(.*)(#)+(.*)(\w\'\?-<:)+(.*)(">)+')
    p2= re.compile('(name=")+(.*)(\w\'\?-<:)+(.*)(">)+')
    p100= re.compile('(a name=)+(.*)(-)+(.*)(></a>)+')
    q1= r"\1\2\3\4_\6\7"
    q2= r"\1\2_\4\5"

    def massreplace():
        editfile = open("C:\Program Files\Credit Risk Management\Masseditor\editfile.txt")
        filestring = editfile.read()
        filelist = filestring.splitlines()

        for i in range(len(filelist)):
            source = open(filelist[i])
            interimtext = source.read()
            source.close()

            # each pass only fixes one separator per bookmark (the greedy
            # (.*) groups pin the match to the last one), hence 13 passes
            for j in range(13):
                interimtext = p1.sub(q1, interimtext)
                interimtext = p2.sub(q2, interimtext)
                interimtext = p100.sub(q2, interimtext)

            source = open(filelist[i],"w")
            source.write(interimtext)
            source.close()

    massreplace()

    I'll try that and see how it works...
     
    hall.jeff@gmail.com, Sep 29, 2007
    #13
  14. thebjorn Guest

    On Sep 29, 8:32 pm, hall.jeff@gmail.com wrote:
    > I think he's saying it should look like this:
    >
    > (code snipped)
    >
    > I'll try that and see how it works...


    Ok, if you want a single RE... How about:


    test = '''
    <a href="Web_Sites.htm#A Web Sites">
    <a name="A Web Sites"></a>
    <a
    href="Web_Sites.htm#A Web Sites">
    <a
    name="A Web Sites"></a>
    <a HREF="Web_Sites.htm#A Web Sites">
    <a name=Quoteless></a>
    <a name = "oo ps"></a>
    '''

    import re

    r = re.compile(r'''
        (?:href=['"][^#]+[#]([^"']+)["'])
      | (?:name=['"]?([^'">]+))
    ''', re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)

    def zap_space(m):
        return m.group(0).replace(' ', '_')

    print r.sub(zap_space, test)

    It prints out

    <a href="Web_Sites.htm#A_Web_Sites">
    <a name="A_Web_Sites"></a>
    <a
    href="Web_Sites.htm#A_Web_Sites">
    <a
    name="A_Web_Sites"></a>
    <a HREF="Web_Sites.htm#A_Web_____________________________Sites">
    <a name=Quoteless></a>
    <a name = "oo ps"></a>

    -- bjorn
     
    thebjorn, Sep 29, 2007
    #14
  15. On Sat, 29 Sep 2007 18:24:53 -0000, hall.jeff@gmail.com declaimed the
    following in comp.lang.python:

    > The search is trying to replace the spaces in our bookmarks (and the
    > links that go to those bookmarks)...
    >
    > The link tag looks like this:
    >
    > <a href="Web_Sites.htm#A Web Sites">
    >
    > and the bookmark tag itself looks like this:
    >
    > <a name="A Web Sites"></a>
    >
    > some pitfalls I've already run up against...
    > SOMETIMES (but not often) the a and the href (or name) are split
    > across lines... this led me to just drop the "<a" from the front
    > If there are no spaces, SOME (but again, not all) of the "<a name"
    > tags don't have "'s... this is a problem because we're having to
    > replace all special characters with _'s...
    > Some of our bookmarks are quite wordy (we found one yesterday with 11
    > spaces)
    > href is sometimes all caps (HREF)
    >

    Sure sounds more like a use for an HTML parser that can walk through
    the file returning the elements for correction...
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Sep 29, 2007
    #15
  16. On Sep 29, 2:32 pm, hall.jeff@gmail.com wrote:

    > I think he's saying it should look like this:
    >
    > (line noise snipped)


    Or you can let BeautifulSoup do the dirty job for you and forget all
    this ugliness:


    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup(text)
    for a in soup.findAll('a'):
        for attr in 'href','name':
            val = a.get(attr)
            if val:
                a[attr] = val.replace(' ','_')
    print soup
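
    Hooked into the original file list it might look like this (sketch;
    note it replaces spaces anywhere in the href value, not just after
    the '#'):

    from BeautifulSoup import BeautifulSoup

    def fix_file(path):
        # parse, rewrite the two attributes, write the document back
        soup = BeautifulSoup(open(path).read())
        for a in soup.findAll('a'):
            for attr in 'href', 'name':
                val = a.get(attr)
                if val:
                    a[attr] = val.replace(' ', '_')
        open(path, 'w').write(str(soup))

    for line in open(r'pathname\editfile.txt'):
        fix_file(line.strip())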


    George
     
    George Sakkis, Sep 29, 2007
    #16
  17. thebjorn wrote:
    > On Sep 29, 7:55 pm, Pablo Ziliani <> wrote:
    >
    >> thebjorn wrote:
    >>
    >>> Ugh, that was entirely too many regexps for my taste :)

    >> Oh yeah, now it's clear as mud.
    >>

    >
    > I'm anxiously awaiting your beacon of clarity ;-)
    >


    Admittedly, that was a bit arrogant from my part. Sorry.

    >> I do think that the whole program shouldn't take more than 10 lines of
    >> code
    >>

    >
    > Well, my mass_replace above is 10 lines, and the actual replacement
    > code is a one liner. Perhaps you'd care to illustrate how you'd
    > shorten that while still keeping it "clear"?
    >


    I don't think the relevant code was only those 10 lines, but well, you
    have already answered the other question yourself in a subsequent
    post (thanks for saving me a lot of time).
    I think that "clear" is a compromise between code legibility (most of
    which you sacrifice when using regexes) and overall code length. Even
    regexes can be legible enough when they are well documented, not to
    mention that they are an idiom common to various languages.

    >> using one sensible regex
    >>

    >
    > I have no doubt that it would be possible to do with a single regex.
    > Whether it would be sensible or not is another matter entirely...
    >


    Putting it in those terms, I completely agree with you (that's why I
    suggested letting e.g. BeautifulSoup deal with them). But by "sensible"
    I meant something different, inherent to the regex itself.
    For instance, I don't think I need to explain to you why this is not
    sensible: (href=|HREF=)+(.*)(#)+(.*)(\w\'\?-<:)+(.*)(">)+


    >
    >> (impossible to define without knowing the real input and output formats).
    >>

    >
    > Of course, but I don't think you can guess too terribly wrong. My
    > version handles upper and lower case attributes, quoting with single
    > (') and double (") quotes, and any number of spaces in attribute
    > values. It maintains all other text as-is, and converts spaces to
    > underscores in href and name attributes. Did I get anything majorly
    > wrong?
    >


    Well, you spent some time interpreting his code. No doubt you are smart,
    but being a lazy person (not proud of that, unlike other people stating
    the same) I prefer leaving that part to the interested party.


    >
    >> And (sorry to tell) I'm convinced this is a problem for regexes, in
    >> spite of anybody's personal taste.
    >>

    >
    > Well, let's see it then :)


    IMO, your second example proves it well enough.

    FWIW I did some changes to your code (see attached), because it wasn't
    taking into account the tag name (<a>), and the names of the attributes
    (href, name) can appear in other tags as well, so it's a problem. It
    still doesn't solve the problem of one tag having both attributes with
    spaces (which can be easily fixed with a second regex, but that was out
    of question :p), and there can be a lot of other problems (both because
    I'm far from being an expert in regexes and because I only tested it
    against the given string), but should provide at least some guidance.
    I made it also match the id of the target anchor, since a fragment can
    point to either its name or its id, depending on the doctype.


    Regards,
    Pablo
     
    Pablo Ziliani, Sep 30, 2007
    #17
  18. On Sat, Sep 29, 2007 at 12:05:26PM -0700, thebjorn wrote:
    > Ok, if you want a single RE... How about:
    >...
    > r = re.compile(r'''
    > (?:href=['"][^#]+[#]([^"']+)["'])
    > | (?:name=['"]?([^'">]+))
    > ''', re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)


    maybe a little bit easier to read with ungreedy operators:

    r = re.compile(r'''
        (?:href=['"].+?[#](.+?)["'])
      | (?:name=['"]?(.+?)['">])
    ''', re.IGNORECASE | re.MULTILINE | re.DOTALL | re.VERBOSE)

    flo.
     
    Florian Schmidt, Oct 1, 2007
    #18