Reading a file, sans whitespace

Uri · May 22, 2004

I have a file that looks like this: (but longer, no wordwrap)

Name: Date: Time: Company: Employee Number:
Jim 2.03.04 12:00 JimEnt 4
Steve 3.04.32 03:00 SteveEnt 5

I want to load 'Jim' and '12:00' and those types of answers into
variables in my program, the only delimiter in the file is whitespace.
How do I do this?

I can do it with string.split(" ",[0]) type line for a file that's
only delimited by single spaces, but when I'm searching for white
space, how do I do it?

Thanks!

Michael Geary · May 22, 2004

Uri said:
I have a file that looks like this: (but longer, no wordwrap)

Name: Date: Time: Company: Employee Number:
Jim 2.03.04 12:00 JimEnt 4
Steve 3.04.32 03:00 SteveEnt 5

I want to load 'Jim' and '12:00' and those types of answers
into variables in my program, the only delimiter in the file
is whitespace. How do I do this?

I can do it with string.split(" ",[0]) type line for a file that's
only delimited by single spaces, but when I'm searching
for white space, how do I do it?

Use a regular expression. For speed, precompile it at the beginning of your
program:

reWhitespace = re.compile( r'\s+' )

Then, split each line with:

fields = reWhitespace.split( line )

-Mike
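
A minimal sketch of the precompiled split, written in Python 3 syntax (which postdates this thread). The sample line and the helper name `split_fields` are made up for illustration. One detail worth knowing: a regular-expression split leaves an empty string at either end when the line starts or ends with whitespace (the trailing newline, for instance), so stripping the line first is usually what you want.

```python
import re

# Compile once, near the top of the program.
re_whitespace = re.compile(r'\s+')

def split_fields(line):
    """Split on runs of whitespace, ignoring leading/trailing whitespace."""
    return re_whitespace.split(line.strip())

# Hypothetical line with mixed spaces, a tab, and a trailing newline.
raw = 'Jim   2.03.04\t12:00  JimEnt   4\n'

print(re_whitespace.split(raw))   # the trailing newline produces an empty last field
print(split_fields(raw))          # stripping first avoids that
```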

Rick L. Ratzel · May 22, 2004

How about this:
...     print re.split( r"\s+", line.strip() )
...
['Name:', 'Date:', 'Time:', 'Company:', 'Employee', 'Number:']
['Jim', '2.03.04', '12:00', 'JimEnt', '4']
['Steve', '3.04.32', '03:00', 'SteveEnt', '5']
-Rick Ratzel

Tim Daneliuk · May 22, 2004

Uri said:
I have a file that looks like this: (but longer, no wordwrap)

Name: Date: Time: Company: Employee Number:
Jim 2.03.04 12:00 JimEnt 4
Steve 3.04.32 03:00 SteveEnt 5

I want to load 'Jim' and '12:00' and those types of answers into
variables in my program, the only delimiter in the file is whitespace.
How do I do this?

I can do it with string.split(" ",[0]) type line for a file that's
only delimited by single spaces, but when I'm searching for white
space, how do I do it?

Thanks!

Say you have read a line in the above format into variable 's'.
Then,

l = s.split()

will return a list containing each of the fields of the line as
an entry with the whitespace stripped out. Then,

VarName = l[0]
VarDate = l[1]
VarTime = l[2]
VarCo = l[3]
VarEmp = l[4]

Is this what you had in mind?
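
As a rough sketch of how this could be applied to the whole file (Python 3 syntax; the column spacing in `sample` and the name `parse_records` are assumptions for illustration, not taken from the original file): the header row is skipped because "Employee Number:" contains a space and therefore splits into six pieces rather than five.

```python
# Stand-in for the file contents; the exact spacing is assumed.
sample = """\
Name:    Date:    Time:  Company:   Employee Number:
Jim      2.03.04  12:00  JimEnt     4
Steve    3.04.32  03:00  SteveEnt   5
"""

def parse_records(text):
    """Return one (name, date, time, company, emp_number) tuple per data line."""
    records = []
    lines = text.splitlines()
    for line in lines[1:]:            # skip the header row
        fields = line.split()         # no argument: split on any run of whitespace
        if len(fields) != 5:          # skip blank or malformed lines
            continue
        name, date, time_, company, emp = fields
        records.append((name, date, time_, company, int(emp)))
    return records

records = parse_records(sample)
print(records)
```

Unpacking into named variables fails loudly if a line has the wrong number of fields, which is why the length check comes first.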

Michael Geary · May 22, 2004

Tim said:
Say you have read a line in the above format into variable 's'.
Then,

l = s.split()

will return a list containing each of the fields of the line as
an entry with the whitespace stripped out. Then,

VarName = l[0]
VarDate = l[1]
VarTime = l[2]
VarCo = l[3]
VarEmp = l[4]

D'oh! That's much better than the regular expression solution I posted.

The regular expression split is good to know about for more complicated
patterns, but for simple whitespace splitting there's no need for it.

Thanks,

-Mike
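
To illustrate the kind of "more complicated pattern" where a regular-expression split earns its keep, here is a small sketch in Python 3 syntax. The delimiter rule (commas or semicolons with optional surrounding whitespace) and the sample line are invented for the example; nothing in the original file uses them.

```python
import re

# Hypothetical format: fields separated by ',' or ';' with optional whitespace around them.
re_delim = re.compile(r'\s*[,;]\s*')

line = 'Jim , 2.03.04;12:00 ,JimEnt;  4\n'
fields = re_delim.split(line.strip())
print(fields)
```

A plain `str.split()` cannot express "either of two delimiters, plus optional padding" in one call, which is where the regular expression helps.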

Uri · May 23, 2004

Tim said:
Say you have read a line in the above format into variable 's'.
Then,

l = s.split()

will return a list containing each of the fields of the line as
an entry with the whitespace stripped out. Then,

VarName = l[0]
VarDate = l[1]
VarTime = l[2]
VarCo = l[3]
VarEmp = l[4]

Michael Geary said:

D'oh! That's much better than the regular expression solution I posted.

The regular expression split is good to know about for more complicated
patterns, but for simple whitespace splitting there's no need for it.

Thanks,

-Mike

Thanks guys! Tim's idea seems like the easiest for a newbie to
implement, but I'll play around with Mike's pre-compiling thing, too.
I don't really understand what the compile part does, could you
expound upon that?

Thanks for all your help guys!

Michael Geary · May 23, 2004

Uri said:
Thanks guys! Tim's idea seems like the easiest for a newbie
to implement, but I'll play around with Mike's pre-compiling
thing, too. I don't really understand what the compile part
does, could you expound upon that?

It's just a way to make a regular expression more efficient when you use it
repeatedly. When you use a regular expression, Python does two things:
First, it compiles the regular expression into a special customized function
that implements the string matching that the regular expression specifies.
Then, it runs that function on the string you're using. If you're going to
use the regular expression repeatedly, you can compile it every time, or
compile it once and use the precompiled version after that.

For example, these do exactly the same thing:

import re
for line in file( 'inputFile' ).readlines():
    print re.split( '\s+', line.strip() )

import re
reWhitespace = re.compile( '\s+' )
for line in file( 'inputFile' ).readlines():
    print reWhitespace.split( line.strip() )

But for a large file, the second version will be faster because the regular
expression is compiled only once instead of every time through the loop. It
may not make much difference for a simple regular expression like this (and
of course string.split is even simpler and probably faster), but for a
complicated regular expression it will make more of a difference in
performance.

-Mike

Jeff Epler · May 23, 2004

It looks like the impact of not compiling should be fairly small.
sre.split calls sre._compile() which has a 100-entry cache. The total
overhead is something like 3 function calls and a dictionary lookup.

You might switch from in-line REs to compiled ones to get the last
little bit of speed out of some code, but IMO the big reason is for
code clarity and to re-use the same RE in multiple places without
duplicating the RE string.

Jeff
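
The caching can be observed directly. The sketch below uses Python 3 and relies on CPython behaviour, so treat it as an illustration of an implementation detail rather than a guarantee: compiling the same pattern string twice normally hands back the same cached object, and `re.purge()` clears that cache.

```python
import re

pattern = r'\s+'

first = re.compile(pattern)
second = re.compile(pattern)

# In CPython both calls normally return the same cached pattern object,
# so the pattern text is only compiled once.
print(first is second)

# re.purge() empties the module's cache; the next compile builds a new object.
re.purge()
third = re.compile(pattern)
print(third is first)
```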

Konstantin Veretennicov · May 24, 2004

Michael Geary said:
import re
reWhitespace = re.compile( '\s+' )
for line in file( 'inputFile' ).readlines():
    print reWhitespace.split( line.strip() )

But for a large file, the second version will be faster because the regular

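And you'll want to use "for line in file('inputFile')"
instead of "for line in file('inputFile').readlines()",
especially for large files: iterating over the file object reads one
line at a time, while readlines() loads the whole file into a list first.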

- kv
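
A sketch of the same idea in Python 3, where `open()` replaces `file()`. The generator name `read_fields` and the temporary demo file are assumptions made so the example can run on its own.

```python
import os
import tempfile

def read_fields(path):
    """Yield the whitespace-separated fields of each non-blank line, one line at a time."""
    with open(path) as f:
        for line in f:                # iterates lazily; the file is never fully loaded
            fields = line.split()
            if fields:
                yield fields

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Jim 2.03.04 12:00 JimEnt 4\nSteve 3.04.32 03:00 SteveEnt 5\n')
    demo_path = tmp.name

rows = list(read_fields(demo_path))
os.remove(demo_path)
print(rows)
```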

Terry Reedy · May 24, 2004

Michael Geary said:
For example, these do exactly the same thing:

import re
for line in file( 'inputFile' ).readlines():
    print re.split( '\s+', line.strip() )

import re
reWhitespace = re.compile( '\s+' )
for line in file( 'inputFile' ).readlines():
    print reWhitespace.split( line.strip() )

But for a large file, the second version will be faster because the regular
expression is compiled only once instead of every time through the loop.

I am curious whether you have actually timed this or seen others' timings.
My impression (from other posts and from reading the code a year ago) is
that the current re implementation caches compiled re's
(recache[hash(restring)] = re.compile(restring)) just so that the first
example will *not* recompile every time thru the loop. If so, I think one
should name an re for pretty much the same reasons as for anything else:
conceptual chunking and reuse in multiple places.

Terry J. Reedy

Michael Geary · May 24, 2004

Terry said:
I am curious whether you have actually timed this or seen others'
timings. My impression (from other posts and from reading the
code a year ago) is that the current re implementation caches
compiled re's (recache[hash(restring)] = re.compile(restring))
just so that the first example will *not* recompile every time thru
the loop. If so, I think one should name an re for pretty much the
same reasons as for anything else: conceptual chunking and reuse
in multiple places.

Oh man, is my face red! No, I didn't know about the caching, and I hadn't
timed this. One should never make assumptions about performance issues!

Also, as Konstantin pointed out, file( 'inputFile' ).readlines() should be
just file( 'inputFile' ), and I just noticed that I didn't use raw strings
for the regular expressions. '\s+' happens to work, but it would be better
to be in the habit of writing r'\s+' instead. This was not my day for
posting good code samples!

Now that you've shamed me into actually testing the performance, it turns
out that precompiling the regular expression does make a difference.
Consider these examples:

import re, time
input = []
for i in xrange( 1000000 ):
    input.append( '%d abc def ghi jkl mno pqr stu' % i )
start = time.time()
for line in input:
    result = re.split( r'\s+', line )
print time.time() - start

import re, time
input = []
for i in xrange( 1000000 ):
    input.append( '%d abc def ghi jkl mno pqr stu' % i )
start = time.time()
reWhitespace = re.compile( r'\s+' )
for line in input:
    result = reWhitespace.split( line )
print time.time() - start

On my PIII-1.2GHz system, the first version runs in 27 seconds, and the
second version runs in 18 seconds, quite an improvement. I would guess that
the hash lookup for the cached regular expression is what's taking the extra
time in the first version, but I don't want to assume that's what it is.

-Mike
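
For anyone who wants to repeat the measurement on a current interpreter, here is a sketch of a similar comparison in Python 3 using the standard `timeit` module, with plain `str.split()` added as a third case. The function names and the smaller input size are choices made for this example, and no particular result is claimed: timings depend on the Python version and the machine.

```python
import re
import timeit

# Same shape of test data as above, but fewer lines to keep the run short.
lines = ['%d abc def ghi jkl mno pqr stu' % i for i in range(10000)]
re_whitespace = re.compile(r'\s+')

def split_inline():
    # Module-level function: looks the pattern up in the re cache on every call.
    return [re.split(r'\s+', line) for line in lines]

def split_precompiled():
    # Precompiled pattern object: no per-call cache lookup.
    return [re_whitespace.split(line) for line in lines]

def split_builtin():
    # Plain string method, no regular expression at all.
    return [line.split() for line in lines]

for func in (split_inline, split_precompiled, split_builtin):
    # Take the best of several repeats to reduce noise from other processes.
    seconds = min(timeit.repeat(func, number=5, repeat=3))
    print('%-18s %.4f s' % (func.__name__, seconds))
```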
