streaming a file object through re.finditer


Erick

Hello,

I've been looking for a while for an answer, but so far I haven't been
able to turn anything up yet. Basically, what I'd like to do is to use
re.finditer to search a large file (or a file stream), but I haven't
figured out how to get finditer to work without loading the entire file
into memory, or just reading one line at a time (or more complicated
buffering).

For example, say I do this:
cat a b c > blah

Then run this python script:

>>> import re
>>> for m in re.finditer("\w+", open("blah")):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):

>>> for m in re.finditer("\w+", open("blah").read()):
...     print m.group()
...
a
b
c

So, is there any way to do this?

Thanks,

-e
 

Erick

Ack, typo. What I meant was this:
cat a b c > blah

>>> import re
>>> for m in re.finditer("\w+", open("blah")):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):

>>> for m in re.finditer("\w+", open("blah").read()):
...     print m.group()
...
a
b
c
 

Daniel Bickett

The following example loads the file into memory only one line at a
time, so it should suit your purposes:

>>> import re
>>> f = open("blah", "w")
>>> f.write("this is important data")
>>> f.close()
>>> # now read it
>>> data = open("blah")
>>> line = data.readline()
>>> while line:
...     for x in re.finditer("\w+", line):
...         print x.group()
...     line = data.readline()
...
this
is
important
data
 

Daniel Bickett

Erick said:
True, but it doesn't work with multiline regular expressions :(

If your intent is for the expression to traverse multiple lines (and
possibly match *across* multiple lines,) then, as far as I know, you
have no choice but to load the whole file into memory.
 

Erik Johnson

Is it not possible to wrap your loop within an outer loop doing
file.read([size]) (or readline() or readlines([size])), reading the
file a chunk at a time and then running your re on a per-chunk basis?

-ej
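In current Python, Erik's suggestion might be sketched as below. The function name, the overlap parameter, and its default are mine, and the sketch assumes no single match is longer than overlap characters; a match spanning a chunk boundary is caught by carrying the tail of each buffer into the next read.

```python
import re

def iter_chunk_matches(fileobj, pattern, chunk_size=1 << 16, overlap=128):
    """Yield matches while reading the file a chunk at a time.

    Keeps the last `overlap` characters of each buffer so matches can
    straddle chunk boundaries.  Assumes no match is longer than
    `overlap` characters; a greedy match ending exactly at a chunk
    boundary may in principle still be reported short.
    """
    regex = re.compile(pattern)
    buf = ''
    while True:
        chunk = fileobj.read(chunk_size)
        buf += chunk
        if not chunk:                      # EOF: flush the remainder
            for m in regex.finditer(buf):
                yield m.group()
            return
        # Only trust matches that start before the overlap zone; later
        # ones are deferred until the next chunk arrives.
        cut = max(len(buf) - overlap, 0)
        pos = 0
        for m in regex.finditer(buf):
            if m.start() >= cut:
                break
            yield m.group()
            pos = m.end()
        buf = buf[max(pos, cut):]
```

With a word-like pattern and a small chunk size this yields the same matches that re.findall would find on the whole text, without ever holding more than one chunk plus the overlap in memory.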
 

Steven Bethard

Erick said:
Hello,

I've been looking for a while for an answer, but so far I haven't been
able to turn anything up yet. Basically, what I'd like to do is to use
re.finditer to search a large file (or a file stream), but I haven't
figured out how to get finditer to work without loading the entire file
into memory, or just reading one line at a time (or more complicated
buffering).

Can you use mmap?

http://docs.python.org/lib/module-mmap.html

"You can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped file."

Seems applicable, and it should keep your memory use down, but I'm not
very experienced with it...
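A minimal sketch of the mmap approach in current Python — the file name and contents here are just for illustration, and note that a pattern applied to a mapping must be bytes, not str:

```python
import mmap
import os
import re
import tempfile

# Create a small sample file to search (illustrative contents).
path = os.path.join(tempfile.mkdtemp(), "blah")
with open(path, "wb") as f:
    f.write(b"a\nb\nc\n")

# Map the file read-only; the OS pages data in on demand, so re can
# scan the whole file without it being read into a Python string.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    words = [m.group() for m in re.finditer(rb"\w+", mm)]
    mm.close()

print(words)   # [b'a', b'b', b'c']
```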

Steve
 

Steve Holden

Erick said:
Hello,

I've been looking for a while for an answer, but so far I haven't been
able to turn anything up yet. Basically, what I'd like to do is to use
re.finditer to search a large file (or a file stream), but I haven't
figured out how to get finditer to work without loading the entire file
into memory, or just reading one line at a time (or more complicated
buffering).

For example, say I do this:
cat a b c > blah

Then run this python script:

>>> import re
>>> for m in re.finditer("\w+", open("blah")):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):

>>> for m in re.finditer("\w+", open("blah").read()):
...     print m.group()
...
a
b
c

So, is there any way to do this?

Thanks,

-e
If you look at the code of Medusa (IIRC) you will find a piece of code
that very carefully checks for a given string in a buffered stream. You
could probably adapt this to reading chunks of a file fairly easily.

regards
Steve
 

Christos TZOTZIOY Georgiou

Daniel Bickett said:
If your intent is for the expression to traverse multiple lines (and
possibly match *across* multiple lines,) then, as far as I know, you
have no choice but to load the whole file into memory.

*If* the OP knows that their multiline re won't match more than, say, 4 lines at
a time, the code attached at the end of this post could be useful. Usage:

for group_of_lines in line_groups(<file>, line_count=4):
    # bla bla

The OP should take care to ignore multiple matches as the n-line window scans
through the input file; eg. if your re searches for '3\n4', it will match 3
times in the first example of my code.

import collections

def line_groups(fileobj, line_count=2):
    iterator = iter(fileobj)
    group = collections.deque()
    joiner = ''.join

    # Fill the first window; if the file is shorter than line_count
    # lines, yield whatever we got and stop.
    try:
        while len(group) < line_count:
            group.append(iterator.next())
    except StopIteration:
        yield joiner(group)
        return

    # Yield the first full window, then slide it one line at a time.
    yield joiner(group)
    for line in iterator:
        group.append(line)
        del group[0]
        yield joiner(group)

if __name__ == "__main__":
    import os, tempfile

    # create two temp files for 4-line groups

    # write n+3 lines in the first file
    testname1 = tempfile.mktemp()  # deprecated & insecure, but ok for this test
    testfile = open(testname1, "w")
    testfile.write('\n'.join(map(str, range(7))))
    testfile.close()

    # write n-2 lines in the second file
    testname2 = tempfile.mktemp()
    testfile = open(testname2, "w")
    testfile.write('\n'.join(map(str, range(2))))
    testfile.close()

    # now iterate over four-line groups

    for bunch_o_lines in line_groups(open(testname1), line_count=4):
        print repr(bunch_o_lines),
    print

    for bunch_o_lines in line_groups(open(testname2), line_count=4):
        print repr(bunch_o_lines),
    print

    os.remove(testname1); os.remove(testname2)
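The duplicate-match caveat above can be handled by keying each match on its absolute start offset in the file. A sketch in current Python — the helper name is mine, and it assumes no match spans more than line_count lines:

```python
import re
from collections import deque

def unique_matches(lines, pattern, line_count=4):
    """Run `pattern` over a sliding window of `line_count` lines,
    reporting each match once even though the windows overlap.
    Assumes no match spans more than `line_count` lines."""
    regex = re.compile(pattern)
    seen = set()           # absolute start offsets already reported
    window = deque()       # (absolute_offset, line) pairs
    offset = 0
    for line in lines:
        window.append((offset, line))
        offset += len(line)
        if len(window) > line_count:
            window.popleft()
        base = window[0][0]
        text = ''.join(l for _, l in window)
        for m in regex.finditer(text):
            start = base + m.start()
            # A repeat of an already-reported match (seen offset) is
            # skipped, so '3\n4' is reported once, not three times.
            if start not in seen:
                seen.add(start)
                yield m.group()
```

One design caveat: because deduplication is by start offset, a match that later grows longer in a fuller window is still dropped, so line_count should be chosen generously relative to the longest expected match.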
 

Erick

I did try to see if I could get that to work, but I couldn't figure it
out. I'll see if I can play around more with that api.

So say I did investigate a little more to see how much work it would
take to adapt the re module to accept an iterator (while leaving the
current string api as another code path). Depending on how complicated
a change this would be, how much interest would there be in other
people using this feature? From what I understand about regular
expressions, they're essentially stream processing and don't need
backtracking, so reading from an iterator should work too (right?).

Thanks,

-e
 
