streaming a file object through re.finditer


Erick

Hello,

I've been looking for a while for an answer, but so far I haven't been
able to turn anything up yet. Basically, what I'd like to do is to use
re.finditer to search a large file (or a file stream), but I haven't
figured out how to get finditer to work without loading the entire file
into memory, or just reading one line at a time (or more complicated
buffering).

For example, say I do this:
cat a b c > blah

Then run this python script:

>>> import re
>>> for m in re.finditer("\w+", open("blah")):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):

>>> for m in re.finditer("\w+", open("blah").read()):
...     print m.group()
...
a
b
c

So, is there any way to do this?

Thanks,

-e
 

Erick

Ack, typo. What I meant was this:
cat a b c > blah

>>> import re
>>> for m in re.finditer("\w+", open("blah")):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):

>>> for m in re.finditer("\w+", open("blah").read()):
...     print m.group()
...
a
b
c
 

Daniel Bickett

The following example loads the file into memory only one line at a
time, so it should suit your purposes:

>>> import re
>>> f = open("blah", "w")
>>> f.write("this is important data")
>>> f.close()
>>> # now read it
>>> data = open("blah")
>>> line = data.readline()
>>> while line:
...     for x in re.finditer("\w+", line):
...         print x.group()
...     line = data.readline()
...
this
is
important
data
 

Daniel Bickett

Erick said:
True, but it doesn't work with multiline regular expressions :(

If your intent is for the expression to traverse multiple lines (and
possibly match *across* multiple lines,) then, as far as I know, you
have no choice but to load the whole file into memory.
 

Erik Johnson

Is it not possible to wrap your loop within an outer loop doing
file.read([size]) (or readline() or readlines([size])), reading the
file a chunk at a time and then running your re on a per-chunk basis?

-ej
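In current Python, Erik's suggestion might be sketched as below. The function name, the overlap parameter, and its default are mine, and the sketch assumes no single match is longer than overlap characters; a match spanning a chunk boundary is caught by carrying the tail of each buffer into the next read.

```python
import re

def iter_chunk_matches(fileobj, pattern, chunk_size=1 << 16, overlap=128):
    """Yield matches while reading the file a chunk at a time.

    Keeps the last `overlap` characters of each buffer so matches can
    straddle chunk boundaries.  Assumes no match is longer than
    `overlap` characters; a greedy match ending exactly at a chunk
    boundary may in principle still be reported short.
    """
    regex = re.compile(pattern)
    buf = ''
    while True:
        chunk = fileobj.read(chunk_size)
        buf += chunk
        if not chunk:                      # EOF: flush the remainder
            for m in regex.finditer(buf):
                yield m.group()
            return
        # Only trust matches that start before the overlap zone; later
        # ones are deferred until the next chunk arrives.
        cut = max(len(buf) - overlap, 0)
        pos = 0
        for m in regex.finditer(buf):
            if m.start() >= cut:
                break
            yield m.group()
            pos = m.end()
        buf = buf[max(pos, cut):]
```

With a word-like pattern and a small chunk size this yields the same matches that re.findall would find on the whole text, without ever holding more than one chunk plus the overlap in memory.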
 

Steven Bethard

Erick said:
Hello,

I've been looking for a while for an answer, but so far I haven't been
able to turn anything up yet. Basically, what I'd like to do is to use
re.finditer to search a large file (or a file stream), but I haven't
figured out how to get finditer to work without loading the entire file
into memory, or just reading one line at a time (or more complicated
buffering).

Can you use mmap?

http://docs.python.org/lib/module-mmap.html

"You can use mmap objects in most places where strings are expected; for
example, you can use the re module to search through a memory-mapped file."

Seems applicable, and it should keep your memory use down, but I'm not
very experienced with it...
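A minimal sketch of the mmap approach in current Python — the file name and contents here are just for illustration, and note that a pattern applied to a mapping must be bytes, not str:

```python
import mmap
import os
import re
import tempfile

# Create a small sample file to search (illustrative contents).
path = os.path.join(tempfile.mkdtemp(), "blah")
with open(path, "wb") as f:
    f.write(b"a\nb\nc\n")

# Map the file read-only; the OS pages data in on demand, so re can
# scan the whole file without it being read into a Python string.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    words = [m.group() for m in re.finditer(rb"\w+", mm)]
    mm.close()

print(words)   # [b'a', b'b', b'c']
```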

Steve
 

Steve Holden

Erick said:
Hello,

I've been looking for a while for an answer, but so far I haven't been
able to turn anything up yet. Basically, what I'd like to do is to use
re.finditer to search a large file (or a file stream), but I haven't
figured out how to get finditer to work without loading the entire file
into memory, or just reading one line at a time (or more complicated
buffering).

For example, say I do this:
cat a b c > blah

Then run this python script:

>>> import re
>>> for m in re.finditer("\w+", open("blah")):
...     print m.group()
...
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: buffer object expected

Of course, this works fine, but it loads the file completely into
memory (right?):

>>> for m in re.finditer("\w+", open("blah").read()):
...     print m.group()
...
a
b
c

So, is there any way to do this?

Thanks,

-e
If you look at the code of Medusa (IIRC) you will find a piece of code
that very carefully checks for a given string in a buffered stream. You
could probably adapt this to reading chunks of a file fairly easily.

regards
Steve
 

Christos TZOTZIOY Georgiou

Daniel Bickett said:
If your intent is for the expression to traverse multiple lines (and
possibly match *across* multiple lines,) then, as far as I know, you
have no choice but to load the whole file into memory.

*If* the OP knows that their multiline re won't match more than, say, 4 lines at
a time, the code attached at the end of this post could be useful. Usage:

for group_of_lines in line_groups(<file>, line_count=4):
    # bla bla

The OP should take care to ignore multiple matches as the n-line window scans
through the input file; eg. if your re searches for '3\n4', it will match 3
times in the first example of my code.

import collections

def line_groups(fileobj, line_count=2):
    iterator = iter(fileobj)
    group = collections.deque()
    joiner = ''.join

    # Fill the first window; if the file is shorter than line_count
    # lines, yield whatever we got and stop.
    try:
        while len(group) < line_count:
            group.append(iterator.next())
    except StopIteration:
        yield joiner(group)
        return

    # Yield the first full window, then slide it one line at a time.
    yield joiner(group)
    for line in iterator:
        group.append(line)
        del group[0]
        yield joiner(group)

if __name__ == "__main__":
    import os, tempfile

    # create two temp files for 4-line groups

    # write n+3 lines in the first file
    testname1 = tempfile.mktemp()  # deprecated & insecure, but ok for this test
    testfile = open(testname1, "w")
    testfile.write('\n'.join(map(str, range(7))))
    testfile.close()

    # write n-2 lines in the second file
    testname2 = tempfile.mktemp()
    testfile = open(testname2, "w")
    testfile.write('\n'.join(map(str, range(2))))
    testfile.close()

    # now iterate over four-line groups

    for bunch_o_lines in line_groups(open(testname1), line_count=4):
        print repr(bunch_o_lines),
    print

    for bunch_o_lines in line_groups(open(testname2), line_count=4):
        print repr(bunch_o_lines),
    print

    os.remove(testname1); os.remove(testname2)
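The duplicate-match caveat above can be handled by keying each match on its absolute start offset in the file. A sketch in current Python — the helper name is mine, and it assumes no match spans more than line_count lines:

```python
import re
from collections import deque

def unique_matches(lines, pattern, line_count=4):
    """Run `pattern` over a sliding window of `line_count` lines,
    reporting each match once even though the windows overlap.
    Assumes no match spans more than `line_count` lines."""
    regex = re.compile(pattern)
    seen = set()           # absolute start offsets already reported
    window = deque()       # (absolute_offset, line) pairs
    offset = 0
    for line in lines:
        window.append((offset, line))
        offset += len(line)
        if len(window) > line_count:
            window.popleft()
        base = window[0][0]
        text = ''.join(l for _, l in window)
        for m in regex.finditer(text):
            start = base + m.start()
            # A repeat of an already-reported match (seen offset) is
            # skipped, so '3\n4' is reported once, not three times.
            if start not in seen:
                seen.add(start)
                yield m.group()
```

One design caveat: because deduplication is by start offset, a match that later grows longer in a fuller window is still dropped, so line_count should be chosen generously relative to the longest expected match.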
 

Erick

I did try to see if I could get that to work, but I couldn't figure it
out. I'll see if I can play around more with that api.

So say I did investigate a little more to see how much work it would
take to adapt the re module to accept an iterator (while leaving the
current string api as another code path). Depending on how complicated
a change this would be, how much interest would there be in other
people using this feature? From what I understand about regular
expressions, they're essentially stream processing and don't need
backtracking, so reading from an iterator should work too (right?).

Thanks,

-e
 
