Using regular expressions to extract substrings from files

Timothy Hume

Hi,

I am new to Python, and was wondering if it is possible to operate on
files using regular expressions.

What I mean is this:
- It is easy to search for a substring of a string using regular
expressions
- Can I also search for a substring inside a file using regular
expressions? The substring may span several lines (i.e. there may be
embedded newline and carriage return characters).

So far, the only way I know how to do this is to read the entire file into
a string, and then parse the resulting string with regular expressions.
This is OK for small files (in fact it is probably quite efficient,
because the disc I/O is done all at once). However, once the files get
large, there is the risk I will run out of memory. The closest UNIX tool I
can think of to do this sort of job is grep, but that doesn't have the
power and flexibility of Python.

Any ideas would be appreciated.

Tim Hume
Bureau of Meteorology Research Centre
Melbourne
Australia
 
Jason Lai

http://docs.python.org/lib/module-mmap.html
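
The mmap module linked above lets the regular expression engine search a file without first copying it into a Python string, since the operating system pages the data in on demand. Here is a minimal sketch of that idea, assuming Python 3, where the re functions accept bytes-like objects such as an mmap as long as the pattern is a bytes pattern. The function name find_first_match and its parameters are made up for illustration.

```python
import mmap
import re


def find_first_match(path, pattern, flags=0):
    """Search a file through a read-only memory map.

    `pattern` must be a bytes pattern (e.g. rb"..."), because the
    mapped file is exposed as bytes. Returns (start, end, matched_bytes)
    for the first match, or None if nothing matches.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            m = re.search(pattern, mm, flags)
            if m is None:
                return None
            # Copy the matched bytes out before the map is closed.
            return m.start(), m.end(), m.group(0)
```

A call such as find_first_match("data.txt", rb"BEGIN.*?END", re.DOTALL) would look for a block that may span several lines. Note that mmap raises ValueError for an empty file, so a caller may want to handle that case separately.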
 
Brian Szmyd

You could always call grep from Python if that will work for you. Otherwise
you'll probably have to read the file in through a buffer and check the
buffer each time. The problem is what to do when a match spans two buffers:
you would have to carry the tail of one buffer over into the next.
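
One way to handle a match that straddles two buffers is to hold back the tail of each buffer and search it again together with the next chunk. The sketch below is only an illustration under some loud assumptions: no match is longer than max_match_len bytes, the pattern never matches the empty string, and the pattern does not rely on lookbehind, lookahead or anchors such as ^ and $ near a buffer edge. The names iter_matches, chunk_size and max_match_len are hypothetical.

```python
import re


def iter_matches(path, pattern, flags=0, chunk_size=1 << 20, max_match_len=4096):
    """Yield (file_offset, matched_bytes) while reading the file in chunks.

    ASSUMPTION: no match is longer than max_match_len bytes. Matches that
    start in the last max_match_len bytes of the buffer are postponed until
    more data has been read, so they are never reported truncated.
    """
    regex = re.compile(pattern, flags)
    with open(path, "rb") as f:
        buf = b""
        base = 0  # file offset of buf[0]
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf += chunk
            # Only matches starting before `limit` are safe to report now.
            limit = len(buf) if at_eof else max(len(buf) - max_match_len, 0)
            keep_from = limit
            for m in regex.finditer(buf):
                if m.start() >= limit:
                    break  # might be cut off; retry with more data
                yield base + m.start(), m.group(0)
                keep_from = max(keep_from, m.end())
            if at_eof:
                return
            # Drop what has been fully searched and keep the tail.
            buf = buf[keep_from:]
            base += keep_from
```

Memory use stays around chunk_size + max_match_len bytes regardless of the file size. If the longest possible match is not known in advance, this approach is not safe and a memory map is probably the simpler choice.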

As for spanning lines, newline and carriage return characters fall under the
category of "whitespace", so allowing whitespace (\s) at those points in your
regular expression would be appropriate.
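
To make the whitespace point concrete, here is a small sketch with made-up sample text. It shows \s matching across line endings, and also that "." does not match a newline unless the re.DOTALL flag is given.

```python
import re

# Made-up sample with both \r\n and \n line endings.
text = "start of record\r\nTEMP:\r\n  23.5\r\nend of record\n"

# \s matches spaces, tabs, carriage returns and newlines,
# so the value is found even though it sits on the next line.
value = re.search(r"TEMP:\s+([\d.]+)", text).group(1)

# By default "." stops at a newline, so this finds nothing ...
without_dotall = re.search(r"TEMP:.*?end", text)

# ... while re.DOTALL lets "." cross line boundaries.
with_dotall = re.search(r"TEMP:.*?end", text, re.DOTALL).group(0)

print(value)
print(without_dotall)
print(repr(with_dotall))
```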

-regards
brian szmyd
 
