speed of string chunks file parsing

Discussion in 'Python' started by Hyunchul Kim, Apr 6, 2009.

  1. Hyunchul Kim

    Hyunchul Kim Guest

    Hi, all

    I have a simple script.
    Can you improve algorithm of following 10 line script, with a view point
    of speed ?
    Following script do exactly what I want but I want to improve the speed.

    This parse a file and accumulate lines till a line match a given regular
    expression.
    Then, when a line match a given regular expression, this function yield
    lines before the matched lines.

    ****************
    import re
    resultlist = []
    cp_regularexpression = re.compile('^a complex regular expression here$)
    for line in file(inputfile):
    if cp_regularexpression.match(line):
    if resultlist != []:
    yield resultlist
    resultlist = []
    resultlist.append(line)
    yield resultlist
    ****************

    Thank you in advance,

    Hyunchul
     
    Hyunchul Kim, Apr 6, 2009
    #1
    1. Advertising

  2. Hyunchul Kim

    Guest

    Hyunchul Kim:

    > Following script do exactly what I want but I want to improve the speed.


    This may be a bit faster, especially if sequences are long (code
    untested):

    import re
    from collections import deque

    def scanner1(deque=deque):
    result_seq = deque()
    cp_regular_expression = re.compile("^a complex regular expression
    here$")
    for line in file(inputfile):
    if cp_regular_expression.match(line) and result_seq:
    yield result_list
    result_seq = deque()
    result_seq.append(line)
    yield result_seq

    If the sequences are processed on the fly then you don't need to
    create new ones and you can clear old ones, this may be a bit faster:

    def scanner2(deque=deque):
    result_seq = deque()
    cp_regular_expression = re.compile("^a complex regular expression
    here$")
    for line in file(inputfile):
    if cp_regular_expression.match(line) and result_seq:
    yield result_list
    result_seq.clear()
    result_seq.append(line)
    yield result_seq

    Note that most of the time may be used by the regular expression,
    often there are ways to speed it up using string methods, even as a
    first faster approximate match too.

    Bye,
    bearophile
     
    , Apr 6, 2009
    #2
    1. Advertising

  3. Hyunchul Kim

    John Machin Guest

    On Apr 6, 11:48 pm, Hyunchul Kim <> wrote:
    > Hi, all
    >
    > I have a simple script.
    > Can you improve algorithm of following 10 line script, with a view point
    > of speed ?
    > Following script do exactly what I want but I want to improve the speed.


    So the first thing to do is to try to work out where it is spending
    the time.

    Step 1: read your code:

    uh-oh: re.compile('^a complex regular expression here$)

    By the way, you are using match() so the "^" is redundant.

    Step 2: Run some timing tests:

    (a) script as written
    (b) script modified to do nothing else but count the lines in the
    file. Is the time difference worth any further work? If so:
    (c) script modified to ignore the regex and pretend the match fails on
    each line
    (d) script modified to ignore the regex and pretend the match succeeds
    on each line
    (e) script modified to ignore the appending and yielding; just count
    how many matches

    So where is most of the time(a) - time(b) difference being taken:
    matching or appending/yielding?

    >
    > This parse a file and accumulate lines till a line match a given regular
    > expression.
    > Then, when a line match a given regular expression, this function yield
    > lines before the matched lines.


    I think your code documents the requirement slightly better. Before
    you start worrying about speed, it's a good idea to ensure that the
    requirements are correct. It would be useful to include explicit
    statements to this effect:

    Given N matching lines, there will be (N+1) yields. The last yield
    will contain the lines (if any) after the last matching line (if any).
    No other yield will be empty. Consequently: an empty file will yield
    [], and a file with no matching lines will yield [all the lines in the
    file].

    Presuming of course that is what you want; consider this:
    first yield: [line1, line2]
    second and final yield: [line3, line4]
    The caller has no way of knowing (without using the regex!) if the
    last line in the file matches the regex or not. Does this matter?

    >
    > ****************
    > import re
    > resultlist = []
    > cp_regularexpression = re.compile('^a complex regular expression here$)
    > for line in file(inputfile):
    >         if cp_regularexpression.match(line):
    >                 if resultlist != []:


    if resultlist: # slightly faster and much less ugly

    >                         yield resultlist
    >                         resultlist = []
    >         resultlist.append(line)
    > yield resultlist


    If it turns out that you don't want lines after the last matching
    line, you need to remove the above yield statement.

    HTH,
    John
     
    John Machin, Apr 7, 2009
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Nawaz Ijaz
    Replies:
    1
    Views:
    2,609
    Patrice
    Jun 8, 2005
  2. MackS
    Replies:
    6
    Views:
    322
    Steven D'Aprano
    Nov 15, 2005
  3. mdefoor

    split string into chunks

    mdefoor, Dec 8, 2005, in forum: C Programming
    Replies:
    1
    Views:
    711
    Joe Estock
    Dec 8, 2005
  4. Martin Marcher

    reading file objects in chunks

    Martin Marcher, Nov 12, 2007, in forum: Python
    Replies:
    1
    Views:
    311
    Marc 'BlackJack' Rintsch
    Nov 12, 2007
  5. Weng Lei-QCH1840
    Replies:
    1
    Views:
    192
    Thomas
    Aug 15, 2003
Loading...

Share This Page