Stripping whitespace

ryan k · Jan 23, 2008

Hello. I have a string like 'LNAME
PASTA ZONE'. I want to create a list of those words and
basically replace all the whitespace between them with one space so i
could just do lala.split(). Thank you!

Ryan Kaskel

Marc 'BlackJack' Rintsch · Jan 23, 2008

Hello. I have a string like 'LNAME
PASTA ZONE'. I want to create a list of those words and
basically replace all the whitespace between them with one space so i
could just do lala.split(). Thank you!

You *can* just do ``lala.split()``:

In [97]: lala = 'LNAME PASTA ZONE'

In [98]: lala.split()
Out[98]: ['LNAME', 'PASTA', 'ZONE']

Ciao,
Marc 'BlackJack' Rintsch

Paul Rubin · Jan 23, 2008

ryan k said:
Hello. I have a string like 'LNAME
PASTA ZONE'. I want to create a list of those words and
basically replace all the whitespace between them with one space so i
could just do lala.split(). Thank you!

import re
s = 'LNAME PASTA ZONE'
re.split('\s+', s)

John Machin · Jan 23, 2008

Hello. I have a string like 'LNAME
PASTA ZONE'. I want to create a list of those words and
basically replace all the whitespace between them with one space so i
could just do lala.split(). Thank you!

Ryan Kaskel

So when you go to the Python interactive prompt and type firstly
lala = 'LNAME PASTA ZONE'
and then
lala.split()
what do you see, and what more do you need to meet your requirements?

James Matthews · Jan 23, 2008

Using the split method is the easiest!

Hello. I have a string like 'LNAME
PASTA ZONE'. I want to create a list of those words and
basically replace all the whitespace between them with one space so i
could just do lala.split(). Thank you!

Click to expand...

You *can* just do ``lala.split()``:

In [97]: lala = 'LNAME PASTA ZONE'

In [98]: lala.split()
Out[98]: ['LNAME', 'PASTA', 'ZONE']

Ciao,
Marc 'BlackJack' Rintsch

ryan k · Jan 23, 2008

You *can* just do ``lala.split()``:

Indeed you can thanks!

In [97]: lala = 'LNAME PASTA ZONE'

In [98]: lala.split()
Out[98]: ['LNAME', 'PASTA', 'ZONE']

Ciao,
Marc 'BlackJack' Rintsch

ryan k · Jan 23, 2008

I am taking a database class so I'm not asking for specific answers.
Well I have this text tile:

http://www.cs.tufts.edu/comp/115/projects/proj0/customer.txt

And this code:

# Table and row classes used for queries

class Row(object):
def __init__(self, column_list, row_vals):
print len(column_list)
print len(row_vals)
for column, value in column_list, row_vals:
if column and value:
setattr(self, column.lower(), value)

class Table(object):
def __init__(self, table_name, table_fd):
self.name = table_name
self.table_fd = table_fd
self.rows = []
self._load_table()

def _load_table(self):
counter = 0
for line in self.table_fd:
# Skip the second line
if not '-----' in line:
if counter == 0:
# This line contains the columns, parse it
column_list = line.split()
else:
# This is a row, parse it
row_vals = line.split()
# Create a new Row object and add it to the
table's
# row list
self.rows.append(Row(column_list, row_vals))
counter += 1

Because the addresses contain spaces, this won't work because there
are too many values being unpacked in row's __init__'s for loop. Any
suggestions for a better way to parse this file? I don't want to cheat
but just some general ideas would be nice. Thanks!

John Machin · Jan 23, 2008

import re
s = 'LNAME PASTA ZONE'
re.split('\s+', s)

That is (a) excessive for the OP's problem as stated and (b) unlike
str.split will cause him to cut you out of his will if his problem
turns out to include leading/trailing whitespace:

lala = ' LNAME PASTA ZONE '
import re
re.split(r'\s+', lala) ['', 'LNAME', 'PASTA', 'ZONE', '']
lala.split() ['LNAME', 'PASTA', 'ZONE']

Click to expand...

Click to expand...

John Machin · Jan 23, 2008

I am taking a database class so I'm not asking for specific answers.
Well I have this text tile:

http://www.cs.tufts.edu/comp/115/projects/proj0/customer.txt

Uh-huh, "column-aligned" output.

And this code:
[snip]

Because the addresses contain spaces, this won't work because there
are too many values being unpacked in row's __init__'s for loop. Any
suggestions for a better way to parse this file?

Tedious (and dumb) way:
field0 = line[start0:end0+1].rstrip()
field1 = line[start1:end1+1].rstrip()
etc

Why dumb: if the column sizes change, you have to suffer the tedium
again. While your sample appears to pad out each field to some
predetermined width, some reporting software (e.g. the Query Analyzer
that comes with MS SQL Server) will tailor the widths to the maximum
size actually observed in the data in each run of the report ... so
you write your program based on some tiny test output and next day you
run it for real and there's a customer whose name is Marmaduke
Rubberduckovitch-Featherstonehaugh or somesuch and your name is mud.

Smart way: note that the second line (the one with all the dashes)
gives you all the information you need to build lists of start and end
positions.

ryan k · Jan 23, 2008

I am taking a database class so I'm not asking for specific answers.
Well I have this text tile:

Click to expand...

http://www.cs.tufts.edu/comp/115/projects/proj0/customer.txt

Click to expand...

Uh-huh, "column-aligned" output.

And this code:
[snip]

Because the addresses contain spaces, this won't work because there
are too many values being unpacked in row's __init__'s for loop. Any
suggestions for a better way to parse this file?

Click to expand...

Tedious (and dumb) way:
field0 = line[start0:end0+1].rstrip()
field1 = line[start1:end1+1].rstrip()
etc

Why dumb: if the column sizes change, you have to suffer the tedium
again. While your sample appears to pad out each field to some
predetermined width, some reporting software (e.g. the Query Analyzer
that comes with MS SQL Server) will tailor the widths to the maximum
size actually observed in the data in each run of the report ... so
you write your program based on some tiny test output and next day you
run it for real and there's a customer whose name is Marmaduke
Rubberduckovitch-Featherstonehaugh or somesuch and your name is mud.

Smart way: note that the second line (the one with all the dashes)
gives you all the information you need to build lists of start and end
positions.

Thank you for your detailed response Mr. Machin. The teacher *said*
that the columns were supposed to be tab delimited but they aren't. So
yea i will just have to count dashes. Thank you!

John Machin · Jan 23, 2008

So yea i will just have to count dashes.

Read my lips: *you* counting dashes is dumb. Writing your code so that
*code* is counting dashes each time it opens the file is smart.

ryan k · Jan 23, 2008

Read my lips: *you* counting dashes is dumb. Writing your code so that
*code* is counting dashes each time it opens the file is smart.

Okay it's almost working ...

new parser function:

def _load_table(self):
counter = 0
for line in self.table_fd:
# Skip the second line
if counter == 0:
# This line contains the columns, parse it
column_list = line.split()
elif counter == 1:
# These are the dashes
line_l = line.split()
column_width = [len(i) for i in line_l]
print column_width
else:
# This is a row, parse it
marker = 0
row_vals = []
for col in column_width:
start = sum(column_width[:marker])
finish = sum(column_width[:marker+1])
print line[start:finish].strip()
row_vals.append(line[start:finish].strip())
marker += 1
self.rows.append(Row(column_list, row_vals))
counter += 1

Something obvious you can see wrong with my start finish code?

['rimon', 'rimon', 'Barr', 'Rimon', '22 Greenside Cres., Thornhill, ON
L3T 6W9', '2', '', 'm', '102', '100
-', '22.13 1234567890', '(e-mail address removed)', '']
['UNAME', 'PASSWD', 'LNAME', 'FNAME', 'ADDR', 'ZONE', 'SEX', 'AGE',
'LIMIT', 'BALANCE', 'CREDITCARD', 'EMAIL',
'ACTIVE']

Reedick, Andrew · Jan 23, 2008

-----Original Message-----

From: [email protected] [mailtoython-
[email protected]] On Behalf Of ryan k
Sent: Wednesday, January 23, 2008 3:24 PM
To: (e-mail address removed)
Subject: Re: Stripping whitespace

Read my lips: *you* counting dashes is dumb. Writing your code so that
*code* is counting dashes each time it opens the file is smart.

Click to expand...

Okay it's almost working ...

Why is it that so many Python people are regex adverse? Use the dashed
line as a regex. Convert the dashes to dots. Wrap the dots in
parentheses. Convert the whitespace chars to '\s'. Presto! Simpler,
cleaner code.

import re

state = 0
header_line = ''
pattern = ''
f = open('a.txt', 'r')
for line in f:
if line[-1:] == '\n':
line = line[:-1]

if state == 0:
header_line = line
state += 1
elif state == 1:
pattern = re.sub(r'-', r'.', line)
pattern = re.sub(r'\s', r'\\s', pattern)
pattern = re.sub(r'([.]+)', r'(\1)', pattern)
print pattern
state += 1

headers = re.match(pattern, header_line)
if headers:
print headers.groups()
else:
state = 2
m = re.match(pattern, line)
if m:
print m.groups()

f.close()

*****

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential, proprietary, and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from all computers. GA625

John Machin · Jan 23, 2008

Okay it's almost working ...

new parser function:

def _load_table(self):
counter = 0
for line in self.table_fd:
# Skip the second line

The above comment is a nonsense.

if counter == 0:
# This line contains the columns, parse it
column_list = line.split()

In generality, you would have to allow for the headings to contain
spaces as well -- this means *saving* a reference to the heading line
and splitting it *after* you've processed the line with the dashes.

elif counter == 1:
# These are the dashes
line_l = line.split()
column_width = [len(i) for i in line_l]

Whoops.
column_width = [len(i) + 1 for i in line_l]

print column_width
else:
# This is a row, parse it
marker = 0
row_vals = []
for col in column_width:
start = sum(column_width[:marker])
finish = sum(column_width[:marker+1])
print line[start:finish].strip()

If you had printed just line[start:finish], it would have been obvious
what the problem was. See below for an even better suggestion.

row_vals.append(line[start:finish].strip())
marker += 1

Using sum is a tad ugly. Here's an alternative:

row_vals = []
start = 0
for width in column_width:
finish = start + width
#DEBUG# print repr(line[start:finish].replace(' ', '~'))
row_vals.append(line[start:finish].strip())
start = finish

self.rows.append(Row(column_list, row_vals))
counter += 1

Something obvious you can see wrong with my start finish code?

See above.

Steven D'Aprano · Jan 23, 2008

import re
s = 'LNAME PASTA ZONE'
re.split('\s+', s)

Please tell me you're making fun of the poor newbie and didn't mean to
seriously suggest using a regex merely to split on whitespace?

import timeit
timeit.Timer("s.split()", "s = 'one two three four'").repeat() [1.4074358940124512, 1.3505148887634277, 1.3469438552856445]
timeit.Timer("re.split('\s+', s)", "import re;s = 'one two

Click to expand...

Click to expand...

three four'").repeat()
[7.9205508232116699, 7.8833441734313965, 7.9301259517669678]

John Machin · Jan 23, 2008

Why is it that so many Python people are regex adverse? Use the dashed
line as a regex. Convert the dashes to dots. Wrap the dots in
parentheses. Convert the whitespace chars to '\s'. Presto! Simpler,
cleaner code.

Woo-hoo! Yesterday was HTML day, today is code review day. Yee-haa!

import re

state = 0
header_line = ''
pattern = ''
f = open('a.txt', 'r')
for line in f:
if line[-1:] == '\n':
line = line[:-1]

if state == 0:
header_line = line
state += 1

state = 1

elif state == 1:
pattern = re.sub(r'-', r'.', line)
pattern = re.sub(r'\s', r'\\s', pattern)
pattern = re.sub(r'([.]+)', r'(\1)', pattern)

Consider this:
pattern = ' '.join('(.{%d})' % len(x) for x in line.split())

print pattern
state += 1

state = 2

headers = re.match(pattern, header_line)
if headers:
print headers.groups()
else:
state = 2

assert state == 2

ryan k · Jan 23, 2008

import re
s = 'LNAME PASTA ZONE'
re.split('\s+', s)

Click to expand...

Please tell me you're making fun of the poor newbie and didn't mean to
seriously suggest using a regex merely to split on whitespace?

[1.4074358940124512, 1.3505148887634277, 1.3469438552856445]>>> timeit.Timer("re.split('\s+', s)", "import re;s = 'one two

three four'").repeat()
[7.9205508232116699, 7.8833441734313965, 7.9301259517669678]

The main topic is not an issue anymore.

John Machin · Jan 23, 2008

Please tell me you're making fun of the poor newbie and didn't mean to
seriously suggest using a regex merely to split on whitespace?

Click to expand...

[1.4074358940124512, 1.3505148887634277, 1.3469438552856445]>>> timeit.Timer("re.split('\s+', s)", "import re;s = 'one two

Click to expand...

three four'").repeat()
[7.9205508232116699, 7.8833441734313965, 7.9301259517669678]

Click to expand...

The main topic is not an issue anymore.

We know that. This thread will continue with biffo and brickbats long
after your assignment has been submitted

ryan k · Jan 23, 2008

import re
s = 'LNAME PASTA ZONE'
re.split('\s+', s)

Click to expand...

Please tell me you're making fun of the poor newbie and didn't mean to
seriously suggest using a regex merely to split on whitespace?

[1.4074358940124512, 1.3505148887634277, 1.3469438552856445]>>> timeit.Timer("re.split('\s+', s)", "import re;s = 'one two

three four'").repeat()
[7.9205508232116699, 7.8833441734313965, 7.9301259517669678]

Much thanks to Machin for helping with the parsing job. Steven
D'Aprano, you are a prick.

John Machin · Jan 23, 2008

Steven D'Aprano, you are a prick.

And your reasons for coming to that stridently expressed conclusion
after reading a posting that was *not* addressed to you are .....?

Looking For Advice	1	Dec 10, 2022
sys.arg whitespace problem	1	Nov 18, 2007
join()	6	Mar 20, 2013
Need Advice	10	Dec 9, 2022
PEP 8 and extraneous whitespace	0	Jul 21, 2011
Select Eof extension files based on text list of filenames with if condition	0	May 4, 2022
Replacement Problem	1	Apr 28, 2022
Select files based on text list of filenames(part of the name:date) with condition	0	May 4, 2022

Stripping whitespace

ryan k

Marc 'BlackJack' Rintsch

Paul Rubin

John Machin

James Matthews

ryan k

ryan k

John Machin

John Machin

ryan k

John Machin

ryan k

Reedick, Andrew

John Machin

Steven D'Aprano

John Machin

ryan k

John Machin

ryan k

John Machin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads