Stripping whitespace

ryan k

Hello. I have a string like 'LNAME PASTA ZONE'. I want to create a
list of those words, basically by replacing all the whitespace between
them with a single space so that I can just do lala.split(). Thank you!

Ryan Kaskel
 
Marc 'BlackJack' Rintsch

You *can* just do ``lala.split()``:

In [97]: lala = 'LNAME PASTA ZONE'

In [98]: lala.split()
Out[98]: ['LNAME', 'PASTA', 'ZONE']
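
To spell out why this works: called with no arguments, str.split() treats any run of whitespace as a single separator and ignores leading and trailing whitespace, so no normalising step is needed first. A minimal sketch, where the sample string and its irregular spacing are an assumption about what the real data looks like:

```python
# Hypothetical sample with irregular whitespace (spaces, a tab, a newline).
lala = '  LNAME \t PASTA    ZONE\n'

# With no arguments, split() collapses whitespace runs and drops
# leading/trailing whitespace.
words = lala.split()

# If a single-spaced string is also wanted, join the words back together.
normalised = ' '.join(words)
```

So the "replace the whitespace with one space" step is only needed if the single-spaced string itself is wanted, not for building the list.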

Ciao,
Marc 'BlackJack' Rintsch
 
Paul Rubin

import re
s = 'LNAME PASTA ZONE'
re.split('\s+', s)
 
John Machin

So when you go to the Python interactive prompt and type firstly
lala = 'LNAME PASTA ZONE'
and then
lala.split()
what do you see, and what more do you need to meet your requirements?
 
James Matthews

Using the split method is the easiest!

 
ryan k

I am taking a database class so I'm not asking for specific answers.
Well, I have this text file:

http://www.cs.tufts.edu/comp/115/projects/proj0/customer.txt

And this code:

# Table and row classes used for queries

class Row(object):
    def __init__(self, column_list, row_vals):
        print len(column_list)
        print len(row_vals)
        for column, value in column_list, row_vals:
            if column and value:
                setattr(self, column.lower(), value)

class Table(object):
    def __init__(self, table_name, table_fd):
        self.name = table_name
        self.table_fd = table_fd
        self.rows = []
        self._load_table()

    def _load_table(self):
        counter = 0
        for line in self.table_fd:
            # Skip the second line
            if not '-----' in line:
                if counter == 0:
                    # This line contains the columns, parse it
                    column_list = line.split()
                else:
                    # This is a row, parse it
                    row_vals = line.split()
                    # Create a new Row object and add it to the table's
                    # row list
                    self.rows.append(Row(column_list, row_vals))
            counter += 1

Because the addresses contain spaces, this won't work because there
are too many values being unpacked in row's __init__'s for loop. Any
suggestions for a better way to parse this file? I don't want to cheat
but just some general ideas would be nice. Thanks!
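
One thing worth checking independently of the address problem: `for column, value in column_list, row_vals` loops over a two-item tuple that holds the two lists, rather than pairing the lists element by element. Each list is then unpacked into `column, value`, which raises "too many values to unpack" whenever a list has more than two items. A hedged sketch of the pairing with zip(), written to run on Python 3 and using made-up column names and values (not taken from customer.txt):

```python
class Row(object):
    def __init__(self, column_list, row_vals):
        # zip() pairs the n-th column name with the n-th value.
        for column, value in zip(column_list, row_vals):
            if column and value:
                setattr(self, column.lower(), value)


# Hypothetical data, only to show the attribute names that result.
row = Row(['UNAME', 'ZONE', 'SEX'], ['alice', '2', 'f'])
```

Note that zip() stops at the shorter list, so a row with too many or too few values is silently truncated; comparing the two lengths first would catch that.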
 
John Machin

import re
s = 'LNAME PASTA ZONE'
re.split('\s+', s)

That is (a) excessive for the OP's problem as stated and (b) unlike
str.split will cause him to cut you out of his will if his problem
turns out to include leading/trailing whitespace:
>>> lala = ' LNAME PASTA ZONE '
>>> import re
>>> re.split(r'\s+', lala)
['', 'LNAME', 'PASTA', 'ZONE', '']
>>> lala.split()
['LNAME', 'PASTA', 'ZONE']
 
John Machin

I am taking a database class so I'm not asking for specific answers.
Well, I have this text file:

http://www.cs.tufts.edu/comp/115/projects/proj0/customer.txt

Uh-huh, "column-aligned" output.
And this code:
[snip]


Because the addresses contain spaces, this won't work because there
are too many values being unpacked in row's __init__'s for loop. Any
suggestions for a better way to parse this file?


Tedious (and dumb) way:
field0 = line[start0:end0+1].rstrip()
field1 = line[start1:end1+1].rstrip()
etc

Why dumb: if the column sizes change, you have to suffer the tedium
again. While your sample appears to pad out each field to some
predetermined width, some reporting software (e.g. the Query Analyzer
that comes with MS SQL Server) will tailor the widths to the maximum
size actually observed in the data in each run of the report ... so
you write your program based on some tiny test output and next day you
run it for real and there's a customer whose name is Marmaduke
Rubberduckovitch-Featherstonehaugh or somesuch and your name is mud.

Smart way: note that the second line (the one with all the dashes)
gives you all the information you need to build lists of start and end
positions.
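
As a rough illustration of the "smart way", here is a minimal sketch that derives the start and end positions from the dashed line and reuses them for the heading line and every data row. The helper names (`column_spans`, `split_fixed`, `make_line`) and the three-column sample are hypothetical; the real file has more columns, and this assumes the dashes in it line up with the data.

```python
import re


def column_spans(dash_line):
    """Return a (start, end) slice position for each run of dashes."""
    return [m.span() for m in re.finditer(r'-+', dash_line)]


def split_fixed(line, spans):
    """Cut one line at the given positions and strip the padding."""
    return [line[start:end].strip() for start, end in spans]


# Build a small column-aligned sample (hypothetical widths and values).
widths = (6, 17, 4)


def make_line(cells):
    return ' '.join(cell.ljust(w) for cell, w in zip(cells, widths))


sample = [
    make_line(['LNAME', 'ADDR', 'ZONE']),
    make_line(['-' * w for w in widths]),
    make_line(['Barr', '22 Greenside Cres', '2']),
]

spans = column_spans(sample[1])
headers = split_fixed(sample[0], spans)
first_row = split_fixed(sample[2], spans)
```

Because the positions are recomputed from the dashed line each time, the code keeps working if the report tool widens a column on the next run.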
 
ryan k

Thank you for your detailed response, Mr. Machin. The teacher *said*
that the columns were supposed to be tab-delimited, but they aren't. So
yeah, I will just have to count dashes. Thank you!
 
John Machin

So yeah, I will just have to count dashes.

Read my lips: *you* counting dashes is dumb. Writing your code so that
*code* is counting dashes each time it opens the file is smart.
 
ryan k

Okay it's almost working ...

new parser function:

def _load_table(self):
    counter = 0
    for line in self.table_fd:
        # Skip the second line
        if counter == 0:
            # This line contains the columns, parse it
            column_list = line.split()
        elif counter == 1:
            # These are the dashes
            line_l = line.split()
            column_width = [len(i) for i in line_l]
            print column_width
        else:
            # This is a row, parse it
            marker = 0
            row_vals = []
            for col in column_width:
                start = sum(column_width[:marker])
                finish = sum(column_width[:marker+1])
                print line[start:finish].strip()
                row_vals.append(line[start:finish].strip())
                marker += 1
            self.rows.append(Row(column_list, row_vals))
        counter += 1

Something obvious you can see wrong with my start finish code?

['rimon', 'rimon', 'Barr', 'Rimon', '22 Greenside Cres., Thornhill, ON
L3T 6W9', '2', '', 'm', '102', '100
-', '22.13 1234567890', '(e-mail address removed)', '']
['UNAME', 'PASSWD', 'LNAME', 'FNAME', 'ADDR', 'ZONE', 'SEX', 'AGE',
'LIMIT', 'BALANCE', 'CREDITCARD', 'EMAIL',
'ACTIVE']
 
Reedick, Andrew

Why is it that so many Python people are regex averse? Use the dashed
line as a regex. Convert the dashes to dots. Wrap the dots in
parentheses. Convert the whitespace chars to '\s'. Presto! Simpler,
cleaner code.

import re

state = 0
header_line = ''
pattern = ''
f = open('a.txt', 'r')
for line in f:
    if line[-1:] == '\n':
        line = line[:-1]

    if state == 0:
        header_line = line
        state += 1
    elif state == 1:
        pattern = re.sub(r'-', r'.', line)
        pattern = re.sub(r'\s', r'\\s', pattern)
        pattern = re.sub(r'([.]+)', r'(\1)', pattern)
        print pattern
        state += 1

        headers = re.match(pattern, header_line)
        if headers:
            print headers.groups()
    else:
        state = 2
        m = re.match(pattern, line)
        if m:
            print m.groups()

f.close()



 
John Machin

Okay it's almost working ...

new parser function:

def _load_table(self):
    counter = 0
    for line in self.table_fd:
        # Skip the second line

The above comment is a nonsense.

        if counter == 0:
            # This line contains the columns, parse it
            column_list = line.split()

In generality, you would have to allow for the headings to contain
spaces as well -- this means *saving* a reference to the heading line
and splitting it *after* you've processed the line with the dashes.

        elif counter == 1:
            # These are the dashes
            line_l = line.split()
            column_width = [len(i) for i in line_l]

Whoops.
            column_width = [len(i) + 1 for i in line_l]

            print column_width
        else:
            # This is a row, parse it
            marker = 0
            row_vals = []
            for col in column_width:
                start = sum(column_width[:marker])
                finish = sum(column_width[:marker+1])
                print line[start:finish].strip()

If you had printed just line[start:finish], it would have been obvious
what the problem was. See below for an even better suggestion.

                row_vals.append(line[start:finish].strip())
                marker += 1

Using sum is a tad ugly. Here's an alternative:

            row_vals = []
            start = 0
            for width in column_width:
                finish = start + width
                #DEBUG# print repr(line[start:finish].replace(' ', '~'))
                row_vals.append(line[start:finish].strip())
                start = finish
            self.rows.append(Row(column_list, row_vals))
        counter += 1

Something obvious you can see wrong with my start finish code?

See above.
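
For anyone following along, the running-offset loop and the `+ 1` correction can be tried in isolation. This is only a sketch: the function name `split_by_widths`, the `gap` parameter and the sample line are made up for illustration, and it assumes exactly one space separates the runs of dashes.

```python
def split_by_widths(line, dash_line, gap=1):
    """Slice `line` into fields sized from the dashed line.

    Each column occupies its run of dashes plus `gap` separator characters.
    """
    widths = [len(dashes) + gap for dashes in dash_line.split()]
    values = []
    start = 0
    for width in widths:
        finish = start + width
        values.append(line[start:finish].strip())
        start = finish
    return values


# Hypothetical sample: columns of 4, 9 and 2 characters, one space apart.
dash_line = ' '.join(['-' * 4, '-' * 9, '-' * 2])
line = 'Barr 22 Elm St 2'

good = split_by_widths(line, dash_line)         # widths include the separator
bad = split_by_widths(line, dash_line, gap=0)   # the original off-by-one
```

With `gap=0` every column after the first starts one character too early per preceding column, which is the drift visible in the output posted earlier.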
 
Steven D'Aprano

import re
s = 'LNAME PASTA ZONE'
re.split('\s+', s)

Please tell me you're making fun of the poor newbie and didn't mean to
seriously suggest using a regex merely to split on whitespace?
>>> import timeit
>>> timeit.Timer("s.split()", "s = 'one two three four'").repeat()
[1.4074358940124512, 1.3505148887634277, 1.3469438552856445]
>>> timeit.Timer("re.split('\s+', s)", "import re;s = 'one two three four'").repeat()
[7.9205508232116699, 7.8833441734313965, 7.9301259517669678]
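
The comparison can be reproduced with a short script. This is a sketch rather than a benchmark: the iteration count is an arbitrary small value chosen so it finishes quickly, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

s = 'one two   three\tfour'

# For input without leading/trailing whitespace both give the same words.
assert s.split() == re.split(r'\s+', s)

# Take the best of several repeats for each approach.
t_split = min(timeit.repeat(
    "s.split()",
    setup="s = 'one two three four'",
    number=20000))
t_regex = min(timeit.repeat(
    r"re.split(r'\s+', s)",
    setup="import re; s = 'one two three four'",
    number=20000))

print('str.split: %.4f s   re.split: %.4f s' % (t_split, t_regex))
```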
 
John Machin

Why is it that so many Python people are regex averse? Use the dashed
line as a regex. Convert the dashes to dots. Wrap the dots in
parentheses. Convert the whitespace chars to '\s'. Presto! Simpler,
cleaner code.

Woo-hoo! Yesterday was HTML day, today is code review day. Yee-haa!

import re

state = 0
header_line = ''
pattern = ''
f = open('a.txt', 'r')
for line in f:
    if line[-1:] == '\n':
        line = line[:-1]

    if state == 0:
        header_line = line
        state += 1

state = 1

    elif state == 1:
        pattern = re.sub(r'-', r'.', line)
        pattern = re.sub(r'\s', r'\\s', pattern)
        pattern = re.sub(r'([.]+)', r'(\1)', pattern)

Consider this:
pattern = ' '.join('(.{%d})' % len(x) for x in line.split())

        print pattern
        state += 1

state = 2

        headers = re.match(pattern, header_line)
        if headers:
            print headers.groups()
    else:
        state = 2

assert state == 2
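
To see what the one-line pattern builder produces, here is a small sketch run against a hypothetical three-column sample. The name `build_pattern` and the sample data are assumptions for illustration, and the sample lines keep their full trailing padding.

```python
import re


def build_pattern(dash_line):
    # One capture group per run of dashes, each matching exactly that
    # many characters, with a literal space between groups.
    return ' '.join('(.{%d})' % len(x) for x in dash_line.split())


# Hypothetical sample: columns of 5, 9 and 4 characters, one space apart.
widths = (5, 9, 4)
dash_line = ' '.join('-' * w for w in widths)
row = ' '.join(cell.ljust(w)
               for cell, w in zip(['Barr', '22 Elm St', '2'], widths))

pattern = build_pattern(dash_line)
match = re.match(pattern, row)
fields = [group.strip() for group in match.groups()]
```

One caveat with this style of pattern: every group demands its full width, so a row whose trailing padding has been stripped will fail to match rather than return a short last field.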
 
ryan k

The main topic is not an issue anymore.
 
John Machin

The main topic is not an issue anymore.

We know that. This thread will continue with biffo and brickbats long
after your assignment has been submitted :)
 
ryan k

Much thanks to Machin for helping with the parsing job. Steven
D'Aprano, you are a prick.
 
John Machin

Steven D'Aprano, you are a prick.

And your reasons for coming to that stridently expressed conclusion
after reading a posting that was *not* addressed to you are .....?
 
