Reading a file, sans whitespace

Uri

I have a file that looks like this: (but longer, no wordwrap)

Name: Date: Time: Company: Employee Number:
Jim 2.03.04 12:00 JimEnt 4
Steve 3.04.32 03:00 SteveEnt 5

I want to load 'Jim' and '12:00' and similar values into
variables in my program. The only delimiter in the file is whitespace.
How do I do this?

I can do it with a string.split(" ",[0]) type line for a file that's
only delimited by single spaces, but how do I split on arbitrary
whitespace?

Thanks!
 
Michael Geary

Uri said:
I have a file that looks like this: (but longer, no wordwrap)

Name: Date: Time: Company: Employee Number:
Jim 2.03.04 12:00 JimEnt 4
Steve 3.04.32 03:00 SteveEnt 5

I want to load 'Jim' and '12:00' and those types of answers
into variables in my program, the only delimiter in the file
is whitespace. How do I do this?

I can do it with string.split(" ",[0]) type line for a file that's
only delimited by single spaces, but when I'm searching
for white space, how do I do it?

Use a regular expression. For speed, precompile it at the beginning of your
program:

reWhitespace = re.compile( r'\s+' )

Then, split each line with:

fields = reWhitespace.split( line )

-Mike
 
Rick L. Ratzel

How about this:
.... print re.split( "\s+", line.strip() )
....
['Name:', 'Date:', 'Time:', 'Company:', 'Employee', 'Number:']
['Jim', '2.03.04', '12:00', 'JimEnt', '4']
['Steve', '3.04.32', '03:00', 'SteveEnt', '5']
-Rick Ratzel
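
The strip() call in that snippet matters. As a small sketch in current Python 3 syntax (the sample line, with its mixed spaces, tab, and trailing newline, is made up for illustration), splitting on the regular expression without stripping leaves an empty field at the end:

```python
import re

# Hypothetical line as it might come out of a file: mixed spacing plus a newline.
line = "Jim   2.03.04\t12:00  JimEnt 4\n"

# Without strip(), the trailing newline matches \s+ and leaves an empty last field.
raw_fields = re.split(r"\s+", line)

# Stripping first removes leading/trailing whitespace, so no empty fields appear.
clean_fields = re.split(r"\s+", line.strip())

print(raw_fields)
print(clean_fields)
```

The plain line.split() call with no arguments gives the same list as the stripped version, since it ignores leading and trailing whitespace on its own.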

 
Tim Daneliuk

Uri said:
I have a file that looks like this: (but longer, no wordwrap)

Name: Date: Time: Company: Employee Number:
Jim 2.03.04 12:00 JimEnt 4
Steve 3.04.32 03:00 SteveEnt 5

I want to load 'Jim' and '12:00' and those types of answers into
variables in my program, the only delimiter in the file is whitespace.
How do I do this?

I can do it with string.split(" ",[0]) type line for a file that's
only delimited by single spaces, but when I'm searching for white
space, how do I do it?

Thanks!

Say you have read a line in the above format into variable 's'.
Then,

l = s.split()

will return a list containing each of the fields of the line as
an entry with the whitespace stripped out. Then,

VarName = l[0]
VarDate = l[1]
VarTime = l[2]
VarCo = l[3]
VarEmp = l[4]


Is this what you had in mind?
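
Put together for the whole file, the approach might look like the sketch below. It uses current Python 3 syntax; the sample text is copied from the question, while the parse_line helper, the records list, and the choice to convert the employee number to an int are assumptions made for illustration.

```python
# Sample data copied from the question; in practice this would come from a file.
SAMPLE = """Name: Date: Time: Company: Employee Number:
Jim 2.03.04 12:00 JimEnt 4
Steve 3.04.32 03:00 SteveEnt 5
"""

def parse_line(line):
    """Split one data line on any run of whitespace and unpack the five fields."""
    name, date, time, company, emp = line.split()
    return name, date, time, company, int(emp)

# Skip the header row, then parse every non-blank line.
records = [parse_line(line) for line in SAMPLE.splitlines()[1:] if line.strip()]

for record in records:
    print(record)
```

Unpacking into five names raises a ValueError if a line has more or fewer fields, which is a cheap way to notice malformed input such as a company name containing a space.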
 
Michael Geary

Uri said:
I have a file that looks like this: (but longer, no wordwrap)

Name: Date: Time: Company: Employee Number:
Jim 2.03.04 12:00 JimEnt 4
Steve 3.04.32 03:00 SteveEnt 5

I want to load 'Jim' and '12:00' and those types of answers into
variables in my program, the only delimiter in the file is whitespace.
How do I do this?

I can do it with string.split(" ",[0]) type line for a file that's
only delimited by single spaces, but when I'm searching for white
space, how do I do it?

Thanks!

Tim said:
Say you have read a line in the above format into variable 's'.
Then,

l = s.split()

will return a list containing each of the fields of the line as
an entry with the whitespace stripped out. Then,

VarName = l[0]
VarDate = l[1]
VarTime = l[2]
VarCo = l[3]
VarEmp = l[4]

D'oh! That's much better than the regular expression solution I posted.

The regular expression split is good to know about for more complicated
patterns, but for simple whitespace splitting there's no need for it.

Thanks,

-Mike
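
As one example of a "more complicated pattern" where the regular expression split earns its keep, here is a sketch in Python 3 syntax. The comma-or-semicolon format is hypothetical and is not the format of Uri's file.

```python
import re

# Hypothetical record: fields separated by commas or semicolons,
# each with optional whitespace around the separator.
line = "Jim, 2.03.04 ;12:00;JimEnt , 4"

# One pattern covers both separators and swallows the surrounding whitespace.
separator = re.compile(r"\s*[,;]\s*")
fields = separator.split(line.strip())

print(fields)
```

A plain str.split() can only split on one fixed separator string (or on runs of whitespace), so a mix like this would need several passes without the regular expression.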
 
Uri

Tim said:
Say you have read a line in the above format into variable 's'.
Then,

l = s.split()

will return a list containing each of the fields of the line as
an entry with the whitespace stripped out. Then,

VarName = l[0]
VarDate = l[1]
VarTime = l[2]
VarCo = l[3]
VarEmp = l[4]

D'oh! That's much better than the regular expression solution I posted.

The regular expression split is good to know about for more complicated
patterns, but for simple whitespace splitting there's no need for it.

Thanks,

-Mike

Thanks guys! Tim's idea seems like the easiest for a newbie to
implement, but I'll play around with Mike's pre-compiling thing, too.
I don't really understand what the compile part does. Could you
expound upon that?

Thanks for all your help guys!
 
Michael Geary

Uri said:
Thanks guys! Tim's idea seems like the easiest for a newbie
to implement, but I'll play around with Mike's pre-compiling
thing, too. I don't really understand what the compile part
does, could you expound upon that?

It's just a way to make a regular expression more efficient when you use it
repeatedly. When you use a regular expression, Python does two things:
First, it compiles the regular expression into a special customized function
that implements the string matching that the regular expression specifies.
Then, it runs that function on the string you're using. If you're going to
use the regular expression repeatedly, you can compile it every time, or
compile it once and use the precompiled version after that.

For example, these do exactly the same thing:

import re
for line in file( 'inputFile' ).readlines():
    print re.split( '\s+', line.strip() )

import re
reWhitespace = re.compile( '\s+' )
for line in file( 'inputFile' ).readlines():
    print reWhitespace.split( line.strip() )

But for a large file, the second version will be faster because the regular
expression is compiled only once instead of every time through the loop. It
may not make much difference for a simple regular expression like this (and
of course string.split is even simpler and probably faster), but for a
complicated regular expression it will make more of a difference in
performance.

-Mike
 
Jeff Epler

It looks like the impact of not compiling should be fairly small.
sre.split calls sre._compile() which has a 100-entry cache. The total
overhead is something like 3 function calls and a dictionary lookup.

You might switch from in-line REs to compiled ones to get the last
little bit of speed out of some code, but IMO the big reason is for
code clarity and to re-use the same RE in multiple places without
duplicating the RE string.

Jeff
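
A rough way to see both points in current Python 3 is sketched below. The identity check reflects how CPython's pattern cache happens to behave and should be read as an implementation detail, not a guarantee; the WHITESPACE, fields, and field_count names are made up for illustration.

```python
import re

# The re module caches recently used patterns.
# Observation on CPython (an implementation detail, not a guarantee):
# compiling the same pattern string twice hands back the same cached object.
first = re.compile(r"\s+")
second = re.compile(r"\s+")
print(first is second)

# Naming the pattern once lets several call sites share it
# without repeating the pattern string.
WHITESPACE = re.compile(r"\s+")

def fields(line):
    """Return the whitespace-separated fields of a line."""
    return WHITESPACE.split(line.strip())

def field_count(line):
    """Return how many fields a line contains."""
    return len(fields(line))

print(fields("Steve 3.04.32 03:00 SteveEnt 5"))
print(field_count("Jim 2.03.04 12:00 JimEnt 4"))
```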
 
Konstantin Veretennicov

Michael Geary said:
import re
reWhitespace = re.compile( '\s+' )
for line in file( 'inputFile' ).readlines():
    print reWhitespace.split( line.strip() )

But for a large file, the second version will be faster because the regular

And you'll want to use "for line in file('inputFile')"
instead of "for line in file('inputFile').readlines()",
especially for large files ;)

- kv
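
In current Python 3 the file() built-in no longer exists and open() takes its place, but the same advice applies. A minimal sketch, assuming a small temporary file created only so the example is self-contained (the rows name is also made up):

```python
import os
import tempfile

# Write a small temporary file so the sketch can run on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Name: Date: Time: Company: Employee Number:\n")
    tmp.write("Jim 2.03.04 12:00 JimEnt 4\n")
    tmp.write("Steve 3.04.32 03:00 SteveEnt 5\n")
    path = tmp.name

rows = []
with open(path) as handle:
    next(handle)  # skip the header row
    # Iterating the file object yields one line at a time,
    # so readlines() never has to build a list of every line first.
    for line in handle:
        rows.append(line.split())

os.remove(path)
print(rows)
```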
 
Terry Reedy

Michael Geary said:
For example, these do exactly the same thing:

import re
for line in file( 'inputFile' ).readlines():
    print re.split( '\s+', line.strip() )

import re
reWhitespace = re.compile( '\s+' )
for line in file( 'inputFile' ).readlines():
    print reWhitespace.split( line.strip() )

But for a large file, the second version will be faster because the regular
expression is compiled only once instead of every time through the loop.

I am curious whether you have actually timed this or seen others' timings.
My impression (from other posts and from reading the code a year ago) is
that the current re implementation caches compiled re's
(recache[hash(restring)] = re.compile(restring)) just so that the first
example will *not* recompile every time thru the loop. If so, I think one
should name an re for pretty much the same reasons as for anything else:
conceptual chunking and reuse in multiple places.

Terry J. Reedy
 
Michael Geary

Terry said:
I am curious whether you have actually timed this or seen others'
timings. My impression (from other posts and from reading the
code a year ago) is that the current re implementation caches
compiled re's (recache[hash(restring)] = re.compile(restring))
just so that the first example will *not* recompile every time thru
the loop. If so, I think one should name an re for pretty much the
same reasons as for anything else: conceptual chunking and reuse
in multiple places.

Oh man, is my face red! No, I didn't know about the caching, and I hadn't
timed this. One should never make assumptions about performance issues! :)

Also, as Konstantin pointed out, file( 'inputFile' ).readlines() should be
just file( 'inputFile' ), and I just noticed that I didn't use raw strings
for the regular expressions. '\s+' happens to work, but it would be better
to be in the habit of writing r'\s+' instead. This was not my day for
posting good code samples!

Now that you've shamed me into actually testing the performance, it turns
out that precompiling the regular expression does make a difference.
Consider these examples:

import re, time
input = []
for i in xrange( 1000000 ):
    input.append( '%d abc def ghi jkl mno pqr stu' % i )
start = time.time()
for line in input:
    result = re.split( r'\s+', line )
print time.time() - start

import re, time
input = []
for i in xrange( 1000000 ):
    input.append( '%d abc def ghi jkl mno pqr stu' % i )
start = time.time()
reWhitespace = re.compile( r'\s+' )
for line in input:
    result = reWhitespace.split( line )
print time.time() - start

On my PIII-1.2GHz system, the first version runs in 27 seconds, and the
second version runs in 18 seconds, quite an improvement. I would guess that
the hash lookup for the cached regular expression is what's taking the extra
time in the first version, but I don't want to assume that's what it is. :)

-Mike
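
For anyone who wants to repeat the measurement on a current interpreter, the comparison could be sketched with the standard timeit module as below. This is Python 3 syntax, the input is deliberately much smaller than the one-million-line test above so it runs quickly, and the three function names are made up for illustration. No timings are quoted here because they depend on the machine and the Python version.

```python
import re
import timeit

# Smaller input than the one-million-line test, to keep the run short.
lines = ["%d abc def ghi jkl mno pqr stu" % i for i in range(10000)]
whitespace = re.compile(r"\s+")

def split_inline():
    """Pass the pattern string to re.split() on every call."""
    return [re.split(r"\s+", line) for line in lines]

def split_compiled():
    """Reuse one precompiled pattern object."""
    return [whitespace.split(line) for line in lines]

def split_str():
    """Use the plain string method, no regular expression at all."""
    return [line.split() for line in lines]

for func in (split_inline, split_compiled, split_str):
    seconds = timeit.timeit(func, number=5)
    print("%-15s %.3f s" % (func.__name__, seconds))
```

All three functions return the same lists for this input, so any difference in the printed times comes from the splitting method alone.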
 
