Re: regex over files

Discussion in 'Python' started by Robin Becker, Apr 25, 2005.

  1. Robin Becker

    Robin Becker Guest

    Gerald Klix wrote:
    > Map the file into RAM by using the mmap module.
    > The file's contents are then available as a searchable string.
    >


    that's a good idea, but I wonder if it actually saves on memory? I just tried
    regexing through a 25Mb file and ended up with a 40Mb working set (it rose
    linearly as the loop progressed through the file). Am I actually saving anything
    by not letting normal vm do its thing?
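    For reference, the mmap approach under discussion amounts to something like
    this (a minimal sketch in modern Python 3 syntax, not the 2005-era code in
    this thread; the temporary file and the "XXXXX" pattern are made up for
    illustration):

```python
import mmap
import os
import re
import tempfile

# Hypothetical setup: write some sample data to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"alpha XXXXX beta XXXXX gamma")
    path = f.name

# Map the file and regex it like a bytestring -- no explicit read() loop;
# the OS pages parts of the file in as the regex engine touches them.
with open(path, "rb") as fh:
    with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pieces = re.split(b"XXXXX", mm)

os.unlink(path)
print(pieces)  # [b'alpha ', b' beta ', b' gamma']
```

    The regex engine accepts any bytes-like object for a bytes pattern, which
    is what makes the mapped file searchable without reading it in yourself.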

    > HTH,
    > Gerald
    >
    > Robin Becker wrote:
    >
    >> Is there any way to get regexes to work on non-string/unicode objects.
    >> I would like to split large files by regex and it seems relatively
    >> hard to do so without having the whole file in memory. Even with
    >> buffers it seems hard to get regexes to indicate that they failed
    >> because of buffer termination and getting a partial match to be
    >> resumable seems out of the question.
    >>
    >> What interface does re actually need for its src objects?

    >
    >



    --
    Robin Becker
     
    Robin Becker, Apr 25, 2005
    #1

  2. "Robin Becker" <> wrote in message
    news:...
    > Gerald Klix wrote:
    > > Map the file into RAM by using the mmap module.
    > > The file's contents are then available as a searchable string.
    > >

    >
    > that's a good idea, but I wonder if it actually saves on memory? I just tried
    > regexing through a 25Mb file and ended up with a 40Mb working set (it rose
    > linearly as the loop progressed through the file). Am I actually saving anything
    > by not letting normal vm do its thing?


    You aren't saving memory in that sense, no. If you have any RAM spare, the
    file will end up in it. However, if you are short on memory, mmapping the
    file gives the VM the opportunity to discard pages from the file, instead of paging
    them out. Try again with a 25Gb file and watch the difference ;) YMMV.
     
    Richard Brodie, Apr 26, 2005
    #2

  3. Robin Becker

    Robin Becker Guest

    Richard Brodie wrote:
    > "Robin Becker" <> wrote in message
    > news:...
    >
    >>Gerald Klix wrote:
    >>
    >>>Map the file into RAM by using the mmap module.
    >>>The file's contents are then available as a searchable string.
    >>>

    >>
    >>that's a good idea, but I wonder if it actually saves on memory? I just tried
    >>regexing through a 25Mb file and ended up with a 40Mb working set (it rose
    >>linearly as the loop progressed through the file). Am I actually saving anything
    >>by not letting normal vm do its thing?

    >
    >
    > You aren't saving memory in that sense, no. If you have any RAM spare, the
    > file will end up in it. However, if you are short on memory, mmapping the
    > file gives the VM the opportunity to discard pages from the file, instead of paging
    > them out. Try again with a 25Gb file and watch the difference ;) YMMV.
    >
    >


    :)

    So we avoid dirty page writes etc etc. However, I still think I could
    get away with a small window into the file which would be more efficient.
    --
    Robin Becker
     
    Robin Becker, Apr 26, 2005
    #3
  4. Steve Holden

    Steve Holden Guest

    Robin Becker wrote:
    > Richard Brodie wrote:
    >
    >> "Robin Becker" <> wrote in message
    >> news:...
    >>
    >>> Gerald Klix wrote:
    >>>
    >>>> Map the file into RAM by using the mmap module.
    >>>> The file's contents are then available as a searchable string.
    >>>>
    >>>
    >>> that's a good idea, but I wonder if it actually saves on memory? I just
    >>> tried regexing through a 25Mb file and ended up with a 40Mb working set
    >>> (it rose linearly as the loop progressed through the file). Am I
    >>> actually saving anything by not letting normal vm do its thing?

    >>
    >>
    >>
    >> You aren't saving memory in that sense, no. If you have any RAM spare,
    >> the file will end up in it. However, if you are short on memory,
    >> mmapping the file gives the VM the opportunity to discard pages from
    >> the file, instead of paging them out. Try again with a 25Gb file and
    >> watch the difference ;) YMMV.
    >>
    >>

    >
    > :)
    >
    > So we avoid dirty page writes etc etc. However, I still think I could
    > get away with a small window into the file which would be more efficient.


    I seem to remember that the Medusa code contains a fairly good
    overlapped search for a terminator string, if you want to chunk the file.

    Take a look at the handle_read() method of class async_chat in the
    standard library's asynchat.py.
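    (Not the Medusa/asynchat code itself, but the overlapped-search idea can be
    sketched like this in modern Python 3: read fixed-size chunks and carry the
    last len(terminator)-1 bytes across each boundary, so a terminator that
    straddles two chunks is still found. The helper name and chunk size are
    made up.)

```python
import io

def find_terminators(stream, terminator, chunksize=4096):
    """Yield the absolute offset of every occurrence of `terminator`,
    reading fixed-size chunks and overlapping len(terminator)-1 bytes
    at each boundary so straddling occurrences are not missed."""
    keep = len(terminator) - 1
    tail = b""   # carry-over bytes from the previous chunk
    base = 0     # absolute offset where `tail` begins
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        data = tail + chunk
        pos = data.find(terminator)
        while pos != -1:
            yield base + pos
            pos = data.find(terminator, pos + 1)
        tail = data[-keep:] if keep else b""
        base += len(data) - len(tail)

# Hypothetical usage: with chunksize=4 the first "XX" straddles a boundary.
offsets = list(find_terminators(io.BytesIO(b"aaaXXbbbbXXcc"), b"XX", chunksize=4))
print(offsets)  # [3, 9]
```

    Because the carried tail is one byte shorter than the terminator, no
    occurrence can lie entirely inside it, so nothing is counted twice.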

    regards
    Steve
    --
    Steve Holden +1 703 861 4237 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
    Python Web Programming http://pydish.holdenweb.com/
     
    Steve Holden, Apr 26, 2005
    #4
  5. Robin Becker

    Robin Becker Guest

    Steve Holden wrote:
    .......
    >
    > I seem to remember that the Medusa code contains a fairly good
    > overlapped search for a terminator string, if you want to chunk the file.
    >
    > Take a look at the handle_read() method of class async_chat in the
    > standard library's asynchat.py.

    ......
    thanks I'll give it a whirl

    --
    Robin Becker
     
    Robin Becker, Apr 26, 2005
    #5
  6. Steve Holden

    Steve Holden Guest

    Robin Becker wrote:
    > Steve Holden wrote:
    > ......
    >
    >>
    >> I seem to remember that the Medusa code contains a fairly good
    >> overlapped search for a terminator string, if you want to chunk the file.
    >>
    >> Take a look at the handle_read() method of class async_chat in the
    >> standard library's asynchat.py.

    >
    > .....
    > thanks I'll give it a whirl
    >

    Whoops, I don't think it's a regex search :-(

    You should be able to adapt the logic fairly easily, I hope.

    regards
    Steve
    --
    Steve Holden +1 703 861 4237 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
    Python Web Programming http://pydish.holdenweb.com/
     
    Steve Holden, Apr 26, 2005
    #6
  7. Robin Becker

    Robin Becker Guest

    Steve Holden wrote:
    .....
    >> .....
    >> thanks I'll give it a whirl
    >>

    > Whoops, I don't think it's a regex search :-(
    >
    > You should be able to adapt the logic fairly easily, I hope.
    >

    .....

    The buffering logic is half the problem; doing it quickly is the other half.
    The third half of the problem is getting re to co-operate, and probably halves 4,
    5, ... :)
    --
    Robin Becker
     
    Robin Becker, Apr 26, 2005
    #7
  8. Robin> So we avoid dirty page writes etc etc. However, I still think I
    Robin> could get away with a small window into the file which would be
    Robin> more efficient.

    It's hard to imagine how sliding a small window onto a file within Python
    would be more efficient than the operating system's paging system. ;-)

    Skip
     
    Skip Montanaro, Apr 26, 2005
    #8
  9. Robin Becker

    Robin Becker Guest

    Skip Montanaro wrote:
    > Robin> So we avoid dirty page writes etc etc. However, I still think I
    > Robin> could get away with a small window into the file which would be
    > Robin> more efficient.
    >
    > It's hard to imagine how sliding a small window onto a file within Python
    > would be more efficient than the operating system's paging system. ;-)
    >
    > Skip

    well it might be if I only want to scan forward through the file (think lexical
    analysis). Most lexical analyzers use a buffer and produce a stream of tokens, i.e.
    a compressed version of the input. There are problems crossing buffers etc, but
    we never normally need the whole file in memory.

    If the lexical analyzer reads the whole file into memory then we need more
    pages. The mmap thing might help as we need only read pages (for a lexical scanner).

    Scanners work by detecting the transitions between tokens, so even if the tokens
    are very long we don't need to store them twice (in the input stream and the token
    accumulator); I suppose that could be true of regex pattern matchers, but it
    doesn't seem to be for re, i.e. we need the entire match in the input before we
    can match and extract to an accumulator.
    --
    Robin Becker
     
    Robin Becker, Apr 26, 2005
    #9
  10. On Tue, 26 Apr 2005 19:32:29 +0100, Robin Becker wrote:

    > Skip Montanaro wrote:
    >> Robin> So we avoid dirty page writes etc etc. However, I still think I
    >> Robin> could get away with a small window into the file which would be
    >> Robin> more efficient.
    >>
    >> It's hard to imagine how sliding a small window onto a file within Python
    >> would be more efficient than the operating system's paging system. ;-)
    >>
    >> Skip

    > well it might be if I only want to scan forward through the file (think lexical
    > analysis). Most lexical analyzers use a buffer and produce a stream of tokens, i.e.
    > a compressed version of the input. There are problems crossing buffers etc, but
    > we never normally need the whole file in memory.


    I think you might have a misunderstanding here. mmap puts a file into
    *virtual* memory. It does *not* read the whole thing into physical memory;
    if it did, there would be no purpose to mmap support in the OS in the
    first place, as a thin wrapper around existing file calls would work.

    > If the lexical analyzer reads the whole file into memory then we need more
    > pages. The mmap thing might help as we need only read pages (for a lexical scanner).


    The read-write status of the pages is not why mmap is an advantage; the
    advantage is that the OS naturally and transparently takes care of
    loading just the portions you want, and intelligently discarding them when
    you are done (more intelligently than you could, even in theory, since it
    can take advantage of knowing the entire state of the system, which your
    program can't).

    In other words, as Skip was trying to tell you, mmap *already
    does* what you are saying might be better, and it does it better than you
    can, even in theory, from inside a process (as the OS will not reveal to
    you the data structures it has that you would need to match that
    performance).

    As you try to understand mmap, make sure your mental model can take into
    account the fact that it is easy and quite common to mmap a file several
    times larger than your physical memory, and it does not even *try* to read
    the whole thing in at any given time. You may benefit from
    reviewing/studying the difference between virtual memory and physical
    memory.
     
    Jeremy Bowers, Apr 26, 2005
    #10
  11. >> It's hard to imagine how sliding a small window onto a file within Python
    >> would be more efficient than the operating system's paging system. ;-)


    Robin> well it might be if I only want to scan forward through the file
    Robin> (think lexical analysis). Most lexical analyzers use a buffer and
    Robin> produce a stream of tokens, i.e. a compressed version of the
    Robin> input. There are problems crossing buffers etc, but we never
    Robin> normally need the whole file in memory.

    If I mmap() a file, it's not slurped into main memory immediately, though as
    you pointed out, it's charged to my process's virtual memory. As I access
    bits of the file's contents, it will page in only what's necessary. If I
    mmap() a huge file, then print out a few bytes from the middle, only the
    page containing the interesting bytes is actually copied into physical
    memory.

    Skip
     
    Skip Montanaro, Apr 26, 2005
    #11
  12. Robin Becker

    Robin Becker Guest

    Skip Montanaro wrote:
    ....
    > If I mmap() a file, it's not slurped into main memory immediately, though as
    > you pointed out, it's charged to my process's virtual memory. As I access
    > bits of the file's contents, it will page in only what's necessary. If I
    > mmap() a huge file, then print out a few bytes from the middle, only the
    > page containing the interesting bytes is actually copied into physical
    > memory.

    .....
    my simple, rather stupid experiment indicates that windows mmap at least
    will reserve 25Mb of page file for a linear scan through a 25Mb file. I
    probably only need 4096b to scan. That's a lot less than even the page
    table requirement. This isn't rocket science, just an old-style observation.
    --
    Robin Becker
     
    Robin Becker, Apr 26, 2005
    #12
  13. On Tue, 26 Apr 2005 20:54:53 +0000, Robin Becker wrote:

    > Skip Montanaro wrote:
    > ...
    >> If I mmap() a file, it's not slurped into main memory immediately, though as
    >> you pointed out, it's charged to my process's virtual memory. As I access
    >> bits of the file's contents, it will page in only what's necessary. If I
    >> mmap() a huge file, then print out a few bytes from the middle, only the
    >> page containing the interesting bytes is actually copied into physical
    >> memory.

    > ....
    > my simple, rather stupid experiment indicates that windows mmap at least
    > will reserve 25Mb of page file for a linear scan through a 25Mb file. I
    > probably only need 4096b to scan. That's a lot less than even the page
    > table requirement. This isn't rocket science, just an old-style observation.


    Are you trying to claim Skip is wrong, or what? There's little value in
    saying that by mapping a file of 25MB into VM pages, you've increased your
    allocated paged file space by 25MB. That's effectively tautological.

    If you are trying to claim Skip is wrong, you *do not understand* what you
    are talking about. Talk less, listen and study more. (This is my best
    guess, as like I said, observing that allocating things increases the
    number of things that are allocated isn't worth posting so my thought is
    you think you are proving something. If you really are just posting
    something tautological, my apologies and disregard this paragraph but,
    well, it's certainly not out of line at this point.)
     
    Jeremy Bowers, Apr 26, 2005
    #13
  14. Robin Becker

    Robin Becker Guest

    Jeremy Bowers wrote:
    > On Tue, 26 Apr 2005 20:54:53 +0000, Robin Becker wrote:
    >
    >
    >>Skip Montanaro wrote:
    >>...
    >>
    >>>If I mmap() a file, it's not slurped into main memory immediately, though as
    >>>you pointed out, it's charged to my process's virtual memory. As I access
    >>>bits of the file's contents, it will page in only what's necessary. If I
    >>>mmap() a huge file, then print out a few bytes from the middle, only the
    >>>page containing the interesting bytes is actually copied into physical
    >>>memory.

    >>
    >>....
    >>my simple, rather stupid experiment indicates that windows mmap at least
    >>will reserve 25Mb of page file for a linear scan through a 25Mb file. I
    >>probably only need 4096b to scan. That's a lot less than even the page
    >>table requirement. This isn't rocket science, just an old-style observation.

    >
    >
    > Are you trying to claim Skip is wrong, or what? There's little value in
    > saying that by mapping a file of 25MB into VM pages, you've increased your
    > allocated paged file space by 25MB. That's effectively tautological.
    >
    > If you are trying to claim Skip is wrong, you *do not understand* what you
    > are talking about. Talk less, listen and study more. (This is my best
    > guess, as like I said, observing that allocating things increases the
    > number of things that are allocated isn't worth posting so my thought is
    > you think you are proving something. If you really are just posting
    > something tautological, my apologies and disregard this paragraph but,
    > well, it's certainly not out of line at this point.)


    Well I obviously don't understand so perhaps you can explain these results

    I implemented a simple scanning algorithm in two ways. First buffered scan
    tscan0.py; second mmapped scan tscan1.py.

    For small file sizes the times are comparable.

    C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py bingo.pdf
    len=27916653 w=103 time=22.13

    C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py bingo.pdf
    len=27916653 w=103 time=22.20


    for large file sizes, when paging becomes of interest, buffered scan wins even
    though it has to execute a lot more python statements. If this were coded in C the
    results would be plainer still. As I said, this isn't about right or wrong; it's
    an observation. If I inspect the performance monitor, tscan0 is at 100%, but
    tscan1 is at 80-90% and all of memory gets used up, so paging is important. This
    may be an effect of the poor design of xp; if so, perhaps it won't hold for other
    OSes.

    C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
    len=139583265 w=103 time=110.91

    C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat
    len=139583265 w=103 time=140.53


    C:\code\reportlab\demos\gadflypaper>cat \tmp\tscan0.py
    import sys, time
    fn = sys.argv[1]
    f=open(fn,'rb')
    n=0
    w=0
    t0 = time.time()
    while 1:
        buf = f.read(4096)
        lb = len(buf)
        if not lb: break
        n += lb
        for i in xrange(lb):
            w ^= ord(buf[i])
    t1 = time.time()

    print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))

    C:\code\reportlab\demos\gadflypaper>cat \tmp\tscan1.py
    import sys, time, mmap, os
    fn = sys.argv[1]
    fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
    s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
    n=len(s)
    w=0
    t0 = time.time()
    for i in xrange(n):
        w ^= ord(s[i])
    t1 = time.time()

    print "len=%d w=%d time=%.2f" % (n, w, (t1-t0))




    --
    Robin Becker
     
    Robin Becker, Apr 27, 2005
    #14
  15. Robin> I implemented a simple scanning algorithm in two ways. First buffered scan
    Robin> tscan0.py; second mmapped scan tscan1.py.

    ...

    Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
    Robin> len=139583265 w=103 time=110.91

    Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat
    Robin> len=139583265 w=103 time=140.53

    I'm not sure why the mmap() solution is so much slower for you. Perhaps on
    some systems files opened for reading are mmap'd under the covers. I'm sure
    it's highly platform-dependent. (My results on MacOSX - see below - are
    somewhat better.)

    Let me return to your original problem though, doing regex operations on
    files. I modified your two scripts slightly:

    tscan0.py:

    import sys, time, re
    fn = sys.argv[1]
    f=open(fn,'rb')
    n=0
    t0 = time.time()
    while 1:
        buf = f.read(4096)
        if not buf: break
        for i in re.split("XXXXX", buf):
            n += 1
    t1 = time.time()

    print "n=%d time=%.2f" % (n, (t1-t0))

    tscan1.py:

    import sys, time, mmap, os, re
    fn = sys.argv[1]
    fh=os.open(fn,os.O_RDONLY)
    s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
    t0 = time.time()
    n = 0
    for mat in re.split("XXXXX", s):
        n += 1
    t1 = time.time()

    print "n=%d time=%.2f" % (n, (t1-t0))

    The mmap version is almost obviously correct, assuming what we want to do is
    split the file on "XXXXX". The buffered read version is almost certainly
    incorrect, given our understanding that corner cases lurk at buffer
    boundaries.

    I took the file from Bengt Richter's example and replicated it a bunch of
    times to get a 122MB file. I then ran the above two programs against it:

    % python tscan1.py splitX
    n=2112001 time=8.88
    % python tscan0.py splitX
    n=2139845 time=10.26

    So the mmap'd version is within 15% of the performance of the buffered read
    version and we don't have to solve the problem of any corner cases (note the
    different values of n). I'm happy to take the extra runtime in exchange for
    simpler code.
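    (Skip's point about corner cases is easy to demonstrate with a made-up
    example, in modern Python 3 syntax: a delimiter that straddles a chunk
    boundary is invisible to a per-chunk search. The tiny 7-byte chunk size is
    chosen only to force the effect.)

```python
import re

data = "aaaaaaXXXXXbbbbb" * 3          # three "XXXXX" delimiters in total
whole = len(re.findall("XXXXX", data))  # scan the string in one piece

# Naive chunked scan: search each 7-byte chunk independently, no overlap.
chunked = sum(
    len(re.findall("XXXXX", data[i:i + 7]))
    for i in range(0, len(data), 7)
)

print(whole, chunked)  # 3 1 -- two delimiters straddle a boundary and are missed
```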

    Skip
     
    Skip Montanaro, Apr 28, 2005
    #15
  16. On Wed, 27 Apr 2005 21:39:45 -0500, Skip Montanaro <> wrote:

    >
    > Robin> I implemented a simple scanning algorithm in two ways. First buffered scan
    > Robin> tscan0.py; second mmapped scan tscan1.py.
    >
    > ...
    >
    > Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan0.py dingo.dat
    > Robin> len=139583265 w=103 time=110.91
    >
    > Robin> C:\code\reportlab\demos\gadflypaper>\tmp\tscan1.py dingo.dat
    > Robin> len=139583265 w=103 time=140.53
    >
    >I'm not sure why the mmap() solution is so much slower for you. Perhaps on
    >some systems files opened for reading are mmap'd under the covers. I'm sure
    >it's highly platform-dependent. (My results on MacOSX - see below - are
    >somewhat better.)
    >
    >Let me return to your original problem though, doing regex operations on
    >files. I modified your two scripts slightly:
    >
    >tscan0.py:
    >
    > import sys, time, re
    > fn = sys.argv[1]
    > f=open(fn,'rb')
    > n=0
    > t0 = time.time()
    > while 1:
    >     buf = f.read(4096)
    >     if not buf: break
    >     for i in re.split("XXXXX", buf):

    To be fairer, I think you'd want to hoist the re compilation out of the loop.
    But also to be fairer, maybe include the overhead of splitting correctly, at
    least for the simple case regex in my example -- or is a you-goofed post
    for me in the usenet forwarding queues somewhere still? ;-)

    >         n += 1
    > t1 = time.time()
    >
    > print "n=%d time=%.2f" % (n, (t1-t0))
    >
    >tscan1.py:
    >
    > import sys, time, mmap, os, re
    > fn = sys.argv[1]
    > fh=os.open(fn,os.O_RDONLY)
    > s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
    > t0 = time.time()
    > n = 0
    > for mat in re.split("XXXXX", s):
    >     n += 1
    > t1 = time.time()
    >
    > print "n=%d time=%.2f" % (n, (t1-t0))
    >
    >The mmap version is almost obviously correct, assuming what we want to do is
    >split the file on "XXXXX". The buffered read version is almost certainly
    >incorrect, given our understanding that corner cases lurk at buffer
    >boundaries.
    >I took the file from Bengt Richter's example and replicated it a bunch of
    >times to get a 122MB file. I then ran the above two programs against it:
    >
    > % python tscan1.py splitX
    > n=2112001 time=8.88
    > % python tscan0.py splitX
    > n=2139845 time=10.26
    >
    >So the mmap'd version is within 15% of the performance of the buffered read

    with regex recompilation in loop ;-)
    >version and we don't have to solve the problem of any corner cases (note the
    >different values of n). I'm happy to take the extra runtime in exchange for
    >simpler code.
    >

    Agree. Hm, I wonder if the OS notices sequential page faults and schedules speculative
    read-ahead. Hm2, I wonder if you can just touch bytes from another coordinated thread
    to cause that, if it isn't happening ;-) Not for 15% though ;-)

    Regards,
    Bengt Richter
     
    Bengt Richter, Apr 28, 2005
    #16
  17. Robin Becker

    Robin Becker Guest

    Skip Montanaro wrote:
    .......
    >
    > I'm not sure why the mmap() solution is so much slower for you. Perhaps on
    > some systems files opened for reading are mmap'd under the covers. I'm sure
    > it's highly platform-dependent. (My results on MacOSX - see below - are
    > somewhat better.)
    >


    I'll have a go at doing the experiment on some other platforms I have
    available. The problem is certainly paging related. Perhaps the fact
    that we don't need to write dirty pages is moot when the system is
    actually writing out other processes' pages to make room for the
    incoming ones needed by the cpu hog. I do know that I cannot control
    that in detail. Also it's entirely possible that file caching/readahead
    etc etc can skew the results.

    All my old compiler texts recommend the buffered read approach, but that
    might be because mmap etc weren't around. Perhaps some compiler expert
    can say? Also I suspect that in a low level language the minor overhead
    caused by the book keeping is lower than that for the paging code.

    > Let me return to your original problem though, doing regex operations on
    > files. I modified your two scripts slightly:
    >

    .......
    > I took the file from Bengt Richter's example and replicated it a bunch of
    > times to get a 122MB file. I then ran the above two programs against it:
    >
    > % python tscan1.py splitX
    > n=2112001 time=8.88
    > % python tscan0.py splitX
    > n=2139845 time=10.26
    >
    > So the mmap'd version is within 15% of the performance of the buffered read
    > version and we don't have to solve the problem of any corner cases (note the
    > different values of n). I'm happy to take the extra runtime in exchange for
    > simpler code.
    >
    > Skip


    I will have a go at repeating this on my system. Perhaps with Bengt's
    code in the buffered case as that would be more realistic.

    It has been my experience that all systems crawl when driven into the
    swapping region and some users of our code seem anxious to run huge
    print jobs.
    --
    Robin Becker
     
    Robin Becker, Apr 28, 2005
    #17
  18. Robin Becker

    Robin Becker Guest

    Skip Montanaro wrote:
    ......


    >
    > Let me return to your original problem though, doing regex operations on
    > files. I modified your two scripts slightly:
    >

    ......
    > Skip

    I'm sure my results are dependent on something other than the coding style;
    I suspect the file/disk cache and paging operate here. Note that we now agree on
    total match length and split count. However, when the windows VM goes into
    paging mode the mmap thing falls off the world, as I would expect for a thrashing
    system.

    eg small memory (relatively)
    C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
    fn=xxx_100mb.dat n=1898737 l=90506416 time=3.55

    C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
    fn=xxx_100mb.dat n=1898737 l=90506416 time=8.25

    C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
    fn=xxx_100mb.dat n=1898737 l=90506416 time=9.77

    C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
    fn=xxx_100mb.dat n=1898737 l=90506416 time=5.09

    C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_100mb.dat
    fn=xxx_100mb.dat n=1898737 l=90506416 time=6.17

    C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_100mb.dat
    fn=xxx_100mb.dat n=1898737 l=90506416 time=4.64

    and large memory
    C:\code\reportlab\demos\gadflypaper>\tmp\sscan0.py xxx_200mb.dat
    fn=xxx_200mb.dat n=3797470 l=181012689 time=20.16

    C:\code\reportlab\demos\gadflypaper>\tmp\sscan1.py xxx_200mb.dat
    fn=xxx_200mb.dat n=3797470 l=181012689 time=136.42

    At the end of this run I had to wait quite a long time for other things to
    become responsive (ie things were entirely paged out).


    Here I've implemented slightly modified versions of the scanners that you put
    forward.

    eg

    #sscan0.py thanks to Bengt
    import sys, time, re
    fn = sys.argv[1]
    rxo = re.compile('XXXXX')

    def frxsplit(path, rxo, chunksize=4096):
        buffer = ''
        for chunk in iter((lambda f=open(path,'rb'): f.read(chunksize)),''):
            buffer += chunk
            pieces = rxo.split(buffer)
            for piece in pieces[:-1]: yield piece
            buffer = pieces[-1]
        yield buffer
    l=n=0
    t0 = time.time()
    for mat in frxsplit(fn,rxo):
        n += 1
        l += len(mat)
    t1 = time.time()

    print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))

    #sscan1.py thanks to Skip
    import sys, time, mmap, os, re
    fn = sys.argv[1]
    fh=os.open(fn,os.O_BINARY|os.O_RDONLY)
    s=mmap.mmap(fh,0,access=mmap.ACCESS_READ)
    l=n=0
    t0 = time.time()
    for mat in re.split("XXXXX", s):
        n += 1
        l += len(mat)
    t1 = time.time()

    print "fn=%s n=%d l=%d time=%.2f" % (fn, n, l, (t1-t0))
    --
    Robin Becker
     
    Robin Becker, Apr 28, 2005
    #18
  19. Robin Becker

    Robin Becker Guest

    Jeremy Bowers wrote:
    ......
    >
    > As you try to understand mmap, make sure your mental model can take into
    > account the fact that it is easy and quite common to mmap a file several
    > times larger than your physical memory, and it does not even *try* to read
    > the whole thing in at any given time. You may benefit from
    > reviewing/studying the difference between virtual memory and physical
    > memory.

    I've been using vm systems for 30 years and I suspect my mental model is a bit
    decrepit. However, as convincingly demonstrated by testing, my mental model seems
    able to predict low-memory problems. When systems run out of memory they tend to
    perform poorly. I'm not sure the horrible degradation I see with large files is
    necessary, but I know it occurs on at least one common vm implementation.
    --
    Robin Becker
     
    Robin Becker, Apr 28, 2005
    #19
  20. Bengt> To be fairer, I think you'd want to hoist the re compilation out
    Bengt> of the loop.

    The re module compiles and caches regular expressions, so I doubt it would
    affect the runtime of either version.
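    (The cache Skip mentions means a string pattern passed repeatedly to a
    module-level re function is compiled only once, so hoisting the compile
    mainly buys clarity, not speed. A quick sketch with a made-up string:)

```python
import re

pat = re.compile("XXXXX")   # hoisted, explicit compilation
s = "aXXXXXbXXXXXc"

# The module-level function compiles and caches the string pattern
# internally, so both spellings give identical results.
assert re.split("XXXXX", s) == pat.split(s) == ["a", "b", "c"]
print(pat.split(s))  # ['a', 'b', 'c']
```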

    Bengt> But also to be fairer, maybe include the overhead of splitting
    Bengt> correctly, at least for the simple case regex in my example -- or
    Bengt> is a you-goofed post for me in the usenet forwarding queues
    Bengt> somewhere still? ;-)

    I was just too lazy to incorporate (something like) your change. You will
    note that I was also lazy enough to simply steal your XXXXX file. <wink>

    Skip
     
    Skip Montanaro, Apr 28, 2005
    #20
