Stripping whitespace

Discussion in 'Python' started by ryan k, Jan 23, 2008.

  1. ryan k

    ryan k Guest

    Hello. I have a string like 'LNAME
    PASTA ZONE'. I want to create a list of those words and
    basically replace all the whitespace between them with one space so i
    could just do lala.split(). Thank you!

    Ryan Kaskel
     
    ryan k, Jan 23, 2008
    #1

  2. On Wed, 23 Jan 2008 10:50:02 -0800, ryan k wrote:

    > Hello. I have a string like 'LNAME
    > PASTA ZONE'. I want to create a list of those words and
    > basically replace all the whitespace between them with one space so i
    > could just do lala.split(). Thank you!


    You *can* just do ``lala.split()``:

    In [97]: lala = 'LNAME PASTA ZONE'

    In [98]: lala.split()
    Out[98]: ['LNAME', 'PASTA', 'ZONE']

    Ciao,
    Marc 'BlackJack' Rintsch
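
A minimal sketch of why this works: with no arguments, str.split() treats any run of whitespace (spaces, tabs, newlines) as a single separator and ignores leading and trailing whitespace. The sample string below is invented for illustration, since the exact spacing in the original post did not survive.

```python
# Hypothetical input: irregular runs of spaces, a tab and a newline.
lala = 'LNAME     PASTA\t   ZONE\n'

# No-argument split() collapses every whitespace run into one separator.
words = lala.split()
print(words)

# Splitting on a literal single space does NOT collapse runs,
# so empty strings show up between consecutive spaces.
naive = lala.split(' ')
print(naive)
```

The no-argument form is the one to reach for when the amount of whitespace between words is unknown.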
     
    Marc 'BlackJack' Rintsch, Jan 23, 2008
    #2

  3. ryan k

    Paul Rubin Guest

    ryan k <> writes:
    > Hello. I have a string like 'LNAME
    > PASTA ZONE'. I want to create a list of those words and
    > basically replace all the whitespace between them with one space so i
    > could just do lala.split(). Thank you!


    import re
    s = 'LNAME PASTA ZONE'
    re.split('\s+', s)
     
    Paul Rubin, Jan 23, 2008
    #3
  4. ryan k

    John Machin Guest

    On Jan 24, 5:50 am, ryan k <> wrote:
    > Hello. I have a string like 'LNAME
    > PASTA ZONE'. I want to create a list of those words and
    > basically replace all the whitespace between them with one space so i
    > could just do lala.split(). Thank you!
    >
    > Ryan Kaskel


    So when you go to the Python interactive prompt and type firstly
    lala = 'LNAME PASTA ZONE'
    and then
    lala.split()
    what do you see, and what more do you need to meet your requirements?
     
    John Machin, Jan 23, 2008
    #4
  5. Using the split method is the easiest!

    On 23 Jan 2008 19:04:38 GMT, Marc 'BlackJack' Rintsch <> wrote:
    > On Wed, 23 Jan 2008 10:50:02 -0800, ryan k wrote:
    >
    > > Hello. I have a string like 'LNAME
    > > PASTA ZONE'. I want to create a list of those words and
    > > basically replace all the whitespace between them with one space so i
    > > could just do lala.split(). Thank you!

    >
    > You *can* just do ``lala.split()``:
    >
    > In [97]: lala = 'LNAME PASTA ZONE'
    >
    > In [98]: lala.split()
    > Out[98]: ['LNAME', 'PASTA', 'ZONE']
    >
    > Ciao,
    > Marc 'BlackJack' Rintsch
    >
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >




     
    James Matthews, Jan 23, 2008
    #5
  6. ryan k

    ryan k Guest

    On Jan 23, 2:04 pm, Marc 'BlackJack' Rintsch <> wrote:
    > On Wed, 23 Jan 2008 10:50:02 -0800, ryan k wrote:
    > > Hello. I have a string like 'LNAME
    > > PASTA ZONE'. I want to create a list of those words and
    > > basically replace all the whitespace between them with one space so i
    > > could just do lala.split(). Thank you!

    >
    > You *can* just do ``lala.split()``:


    Indeed you can thanks!

    >
    > In [97]: lala = 'LNAME PASTA ZONE'
    >
    > In [98]: lala.split()
    > Out[98]: ['LNAME', 'PASTA', 'ZONE']
    >
    > Ciao,
    > Marc 'BlackJack' Rintsch
     
    ryan k, Jan 23, 2008
    #6
  7. ryan k

    ryan k Guest

    I am taking a database class so I'm not asking for specific answers.
    Well I have this text file:

    http://www.cs.tufts.edu/comp/115/projects/proj0/customer.txt

    And this code:

    # Table and row classes used for queries

    class Row(object):
        def __init__(self, column_list, row_vals):
            print len(column_list)
            print len(row_vals)
            for column, value in column_list, row_vals:
                if column and value:
                    setattr(self, column.lower(), value)

    class Table(object):
        def __init__(self, table_name, table_fd):
            self.name = table_name
            self.table_fd = table_fd
            self.rows = []
            self._load_table()

        def _load_table(self):
            counter = 0
            for line in self.table_fd:
                # Skip the second line
                if not '-----' in line:
                    if counter == 0:
                        # This line contains the columns, parse it
                        column_list = line.split()
                    else:
                        # This is a row, parse it
                        row_vals = line.split()
                        # Create a new Row object and add it to the
                        # table's row list
                        self.rows.append(Row(column_list, row_vals))
                counter += 1

    Because the addresses contain spaces, this won't work because there
    are too many values being unpacked in row's __init__'s for loop. Any
    suggestions for a better way to parse this file? I don't want to cheat
    but just some general ideas would be nice. Thanks!
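
One thing worth checking in Row.__init__, offered as a hedged observation and not a full answer: `for column, value in column_list, row_vals` iterates over a two-element tuple of lists, so each whole list gets unpacked into `column, value`. Pairing two lists element by element is usually done with zip(). The sample values below are invented.

```python
column_list = ['UNAME', 'LNAME', 'ZONE']   # hypothetical header names
row_vals = ['rimon', 'Barr', '2']          # hypothetical row values

# Iterating over the tuple (column_list, row_vals) yields each whole list,
# and unpacking a 3-item list into two names raises ValueError.
try:
    for column, value in (column_list, row_vals):
        pass
    failed = False
except ValueError:
    failed = True

# zip() pairs the i-th column name with the i-th value instead.
pairs = list(zip(column_list, row_vals))
print(failed, pairs)
```

This does not solve the separate problem of addresses containing spaces; it only explains the unpacking error.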
     
    ryan k, Jan 23, 2008
    #7
  8. ryan k

    John Machin Guest

    On Jan 24, 6:05 am, Paul Rubin <http://> wrote:
    > ryan k <> writes:
    > > Hello. I have a string like 'LNAME
    > > PASTA ZONE'. I want to create a list of those words and
    > > basically replace all the whitespace between them with one space so i
    > > could just do lala.split(). Thank you!

    >
    > import re
    > s = 'LNAME PASTA ZONE'
    > re.split('\s+', s)


    That is (a) excessive for the OP's problem as stated and (b) unlike
    str.split will cause him to cut you out of his will if his problem
    turns out to include leading/trailing whitespace:

    >>> lala = ' LNAME PASTA ZONE '
    >>> import re
    >>> re.split(r'\s+', lala)
    ['', 'LNAME', 'PASTA', 'ZONE', '']
    >>> lala.split()
    ['LNAME', 'PASTA', 'ZONE']
    >>>
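
If the goal really is a single string with each whitespace run replaced by one space, as the original question put it, a common idiom is to split and re-join. A small sketch, with an invented sample string:

```python
import re

lala = '  LNAME   PASTA \t ZONE  '   # hypothetical sample

# split() drops leading/trailing whitespace; join() puts back single spaces.
normalized = ' '.join(lala.split())
print(repr(normalized))

# re.split() needs an explicit strip() first to avoid empty edge strings.
parts = re.split(r'\s+', lala.strip())
print(parts)
```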
     
    John Machin, Jan 23, 2008
    #8
  9. ryan k

    John Machin Guest

    On Jan 24, 6:17 am, ryan k <> wrote:
    > I am taking a database class so I'm not asking for specific answers.
    > Well I have this text file:
    >
    > http://www.cs.tufts.edu/comp/115/projects/proj0/customer.txt


    Uh-huh, "column-aligned" output.

    >
    > And this code:
    >

    [snip]

    >
    > Because the addresses contain spaces, this won't work because there
    > are too many values being unpacked in row's __init__'s for loop. Any
    > suggestions for a better way to parse this file?



    Tedious (and dumb) way:
    field0 = line[start0:end0+1].rstrip()
    field1 = line[start1:end1+1].rstrip()
    etc

    Why dumb: if the column sizes change, you have to suffer the tedium
    again. While your sample appears to pad out each field to some
    predetermined width, some reporting software (e.g. the Query Analyzer
    that comes with MS SQL Server) will tailor the widths to the maximum
    size actually observed in the data in each run of the report ... so
    you write your program based on some tiny test output and next day you
    run it for real and there's a customer whose name is Marmaduke
    Rubberduckovitch-Featherstonehaugh or somesuch and your name is mud.

    Smart way: note that the second line (the one with all the dashes)
    gives you all the information you need to build lists of start and end
    positions.
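
The "smart way" can be sketched roughly as follows: find each run of dashes on the separator line, record its start and end offsets, and slice every other line at those offsets. The sample table and the function names column_spans and split_fixed are invented for illustration; the real file's layout may differ.

```python
import re

def column_spans(dash_line):
    """Return (start, end) slice bounds for each run of dashes."""
    return [m.span() for m in re.finditer(r'-+', dash_line)]

def split_fixed(line, spans):
    """Slice one line at the given spans and trim the padding."""
    return [line[start:end].strip() for start, end in spans]

# Hypothetical column-aligned report, built with a format string so the
# columns are guaranteed to line up.
fmt = '%-6s %-20s %-4s'
lines = [
    fmt % ('UNAME', 'ADDR', 'ZONE'),
    fmt % ('-' * 6, '-' * 20, '-' * 4),
    fmt % ('rimon', '22 Greenside Cres.', '2'),
    fmt % ('bob', '1 Main St., Apt 4', '13'),
]

spans = column_spans(lines[1])
header = split_fixed(lines[0], spans)
rows = [split_fixed(line, spans) for line in lines[2:]]
print(header)
print(rows)
```

Because the spans are recomputed from the dashed line on every run, the code keeps working if the report widens a column.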
     
    John Machin, Jan 23, 2008
    #9
  10. ryan k

    ryan k Guest

    On Jan 23, 2:53 pm, John Machin <> wrote:
    > On Jan 24, 6:17 am, ryan k <> wrote:
    >
    > > I am taking a database class so I'm not asking for specific answers.
    > > Well I have this text file:

    >
    > >http://www.cs.tufts.edu/comp/115/projects/proj0/customer.txt

    >
    > Uh-huh, "column-aligned" output.
    >
    >
    >
    > > And this code:

    >
    > [snip]
    >
    >
    >
    > > Because the addresses contain spaces, this won't work because there
    > > are too many values being unpacked in row's __init__'s for loop. Any
    > > suggestions for a better way to parse this file?

    >
    > Tedious (and dumb) way:
    > field0 = line[start0:end0+1].rstrip()
    > field1 = line[start1:end1+1].rstrip()
    > etc
    >
    > Why dumb: if the column sizes change, you have to suffer the tedium
    > again. While your sample appears to pad out each field to some
    > predetermined width, some reporting software (e.g. the Query Analyzer
    > that comes with MS SQL Server) will tailor the widths to the maximum
    > size actually observed in the data in each run of the report ... so
    > you write your program based on some tiny test output and next day you
    > run it for real and there's a customer whose name is Marmaduke
    > Rubberduckovitch-Featherstonehaugh or somesuch and your name is mud.
    >
    > Smart way: note that the second line (the one with all the dashes)
    > gives you all the information you need to build lists of start and end
    > positions.


    Thank you for your detailed response Mr. Machin. The teacher *said*
    that the columns were supposed to be tab delimited but they aren't. So
    yea i will just have to count dashes. Thank you!
     
    ryan k, Jan 23, 2008
    #10
  11. ryan k

    John Machin Guest

    On Jan 24, 6:57 am, ryan k <> wrote:

    > So yea i will just have to count dashes.


    Read my lips: *you* counting dashes is dumb. Writing your code so that
    *code* is counting dashes each time it opens the file is smart.
     
    John Machin, Jan 23, 2008
    #11
  12. ryan k

    ryan k Guest

    On Jan 23, 3:02 pm, John Machin <> wrote:
    > On Jan 24, 6:57 am, ryan k <> wrote:
    >
    > > So yea i will just have to count dashes.

    >
    > Read my lips: *you* counting dashes is dumb. Writing your code so that
    > *code* is counting dashes each time it opens the file is smart.


    Okay it's almost working ...

    new parser function:

    def _load_table(self):
        counter = 0
        for line in self.table_fd:
            # Skip the second line
            if counter == 0:
                # This line contains the columns, parse it
                column_list = line.split()
            elif counter == 1:
                # These are the dashes
                line_l = line.split()
                column_width = [len(i) for i in line_l]
                print column_width
            else:
                # This is a row, parse it
                marker = 0
                row_vals = []
                for col in column_width:
                    start = sum(column_width[:marker])
                    finish = sum(column_width[:marker+1])
                    print line[start:finish].strip()
                    row_vals.append(line[start:finish].strip())
                    marker += 1
                self.rows.append(Row(column_list, row_vals))
            counter += 1

    Something obvious you can see wrong with my start finish code?

    ['rimon', 'rimon', 'Barr', 'Rimon', '22 Greenside Cres., Thornhill, ON
    L3T 6W9', '2', '', 'm', '102', '100
    -', '22.13 1234567890', '', '']
    ['UNAME', 'PASSWD', 'LNAME', 'FNAME', 'ADDR', 'ZONE', 'SEX', 'AGE',
    'LIMIT', 'BALANCE', 'CREDITCARD', 'EMAIL',
    'ACTIVE']
     
    ryan k, Jan 23, 2008
    #12
  13. > -----Original Message-----
    > From: python-list-bounces+jr9445= [mailto:python-
    > list-bounces+jr9445=] On Behalf Of ryan k
    > Sent: Wednesday, January 23, 2008 3:24 PM
    > To:
    > Subject: Re: Stripping whitespace
    >
    > On Jan 23, 3:02 pm, John Machin <> wrote:
    > > On Jan 24, 6:57 am, ryan k <> wrote:
    > >
    > > > So yea i will just have to count dashes.

    > >
    > > Read my lips: *you* counting dashes is dumb. Writing your code so

    > that
    > > *code* is counting dashes each time it opens the file is smart.

    >
    > Okay it's almost working ...
    >


    Why is it that so many Python people are regex averse? Use the dashed
    line as a regex. Convert the dashes to dots. Wrap the dots in
    parentheses. Convert the whitespace chars to '\s'. Presto! Simpler,
    cleaner code.

    import re

    state = 0
    header_line = ''
    pattern = ''
    f = open('a.txt', 'r')
    for line in f:
        if line[-1:] == '\n':
            line = line[:-1]

        if state == 0:
            header_line = line
            state += 1
        elif state == 1:
            pattern = re.sub(r'-', r'.', line)
            pattern = re.sub(r'\s', r'\\s', pattern)
            pattern = re.sub(r'([.]+)', r'(\1)', pattern)
            print pattern
            state += 1

            headers = re.match(pattern, header_line)
            if headers:
                print headers.groups()
        else:
            state = 2
            m = re.match(pattern, line)
            if m:
                print m.groups()

    f.close()
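
For readers on Python 3, roughly the same idea can be sketched against in-memory lines instead of a.txt. The sample data and the helper name build_pattern are invented for illustration.

```python
import re

def build_pattern(dash_line):
    """Turn a dashed separator line into a regex with one group per column."""
    pattern = dash_line.replace('-', '.')        # each dash matches any char
    pattern = re.sub(r'\s', r'\\s', pattern)     # gaps must be whitespace
    return re.sub(r'([.]+)', r'(\1)', pattern)   # capture each run of dots

fmt = '%-5s %-8s'                                # hypothetical layout
lines = [
    fmt % ('NAME', 'CITY'),
    fmt % ('-' * 5, '-' * 8),
    fmt % ('ann', 'New York'),
]

pattern = build_pattern(lines[1])
print(pattern)
for line in [lines[0]] + lines[2:]:
    m = re.match(pattern, line)
    if m:
        print([field.strip() for field in m.groups()])
```

One caveat with this approach: a data line shorter than the dashed line (for example, one with trailing padding stripped) will not match the pattern at all.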



     
    Reedick, Andrew, Jan 23, 2008
    #13
  14. ryan k

    John Machin Guest

    On Jan 24, 7:23 am, ryan k <> wrote:
    > On Jan 23, 3:02 pm, John Machin <> wrote:
    >
    > > On Jan 24, 6:57 am, ryan k <> wrote:

    >
    > > > So yea i will just have to count dashes.

    >
    > > Read my lips: *you* counting dashes is dumb. Writing your code so that
    > > *code* is counting dashes each time it opens the file is smart.

    >
    > Okay it's almost working ...
    >
    > new parser function:
    >
    > def _load_table(self):
    > counter = 0
    > for line in self.table_fd:
    > # Skip the second line


    The above comment is a nonsense.

    > if counter == 0:
    > # This line contains the columns, parse it
    > column_list = line.split()


    In generality, you would have to allow for the headings to contain
    spaces as well -- this means *saving* a reference to the heading line
    and splitting it *after* you've processed the line with the dashes.

    > elif counter == 1:
    > # These are the dashes
    > line_l = line.split()
    > column_width = [len(i) for i in line_l]


    Whoops.
    column_width = [len(i) + 1 for i in line_l]

    > print column_width
    > else:
    > # This is a row, parse it
    > marker = 0
    > row_vals = []
    > for col in column_width:
    > start = sum(column_width[:marker])
    > finish = sum(column_width[:marker+1])
    > print line[start:finish].strip()


    If you had printed just line[start:finish], it would have been obvious
    what the problem was. See below for an even better suggestion.

    > row_vals.append(line[start:finish].strip())
    > marker += 1


    Using sum is a tad ugly. Here's an alternative:

    row_vals = []
    start = 0
    for width in column_width:
        finish = start + width
        #DEBUG# print repr(line[start:finish].replace(' ', '~'))
        row_vals.append(line[start:finish].strip())
        start = finish
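
Putting the two corrections together (widths that include the separating space, and a running start offset instead of repeated sums) gives something like the following sketch. The sample lines are invented, and it assumes exactly one space between columns.

```python
fmt = '%-5s %-8s %-3s'                        # hypothetical layout
dash_line = fmt % ('-' * 5, '-' * 8, '-' * 3)
line = fmt % ('ann', 'New York', '42')

# Each width covers the dashes plus the one-space gap that follows them.
column_width = [len(run) + 1 for run in dash_line.split()]

row_vals = []
start = 0
for width in column_width:
    finish = start + width
    row_vals.append(line[start:finish].strip())
    start = finish                             # next column starts here

print(column_width, row_vals)
```

Without the `+ 1`, every slice after the first drifts left by one more character per column, which is what garbled the later fields in the output posted above.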

    > self.rows.append(Row(column_list, row_vals))
    > counter += 1
    >
    > Something obvious you can see wrong with my start finish code?


    See above.
     
    John Machin, Jan 23, 2008
    #14
  15. On Wed, 23 Jan 2008 11:05:01 -0800, Paul Rubin wrote:

    > ryan k <> writes:
    >> Hello. I have a string like 'LNAME
    >> PASTA ZONE'. I want to create a list of those words and
    >> basically replace all the whitespace between them with one space so i
    >> could just do lala.split(). Thank you!

    >
    > import re
    > s = 'LNAME PASTA ZONE'
    > re.split('\s+', s)


    Please tell me you're making fun of the poor newbie and didn't mean to
    seriously suggest using a regex merely to split on whitespace?

    >>> import timeit
    >>> timeit.Timer("s.split()", "s = 'one two three four'").repeat()
    [1.4074358940124512, 1.3505148887634277, 1.3469438552856445]
    >>> timeit.Timer("re.split('\s+', s)", "import re;s = 'one two three four'").repeat()
    [7.9205508232116699, 7.8833441734313965, 7.9301259517669678]
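
The same comparison can be rerun on a current interpreter. Absolute timings vary by machine and Python version, so no numbers are shown here; this is only a sketch, using a smaller iteration count chosen arbitrarily.

```python
import re
import timeit

s = 'one two   three four'

# Both calls should produce the same list for this input.
assert s.split() == re.split(r'\s+', s)

n = 100_000   # iterations per measurement; an arbitrary choice
t_split = min(timeit.repeat('s.split()', globals={'s': s}, number=n))
t_regex = min(timeit.repeat(r"re.split(r'\s+', s)",
                            globals={'re': re, 's': s}, number=n))
print('str.split: %.3fs  re.split: %.3fs' % (t_split, t_regex))
```

Taking the minimum of the repeats is the usual way to reduce noise from other processes.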





    --
    Steven
     
    Steven D'Aprano, Jan 23, 2008
    #15
  16. ryan k

    John Machin Guest

    On Jan 24, 7:57 am, "Reedick, Andrew" <> wrote:
    >
    > Why is it that so many Python people are regex averse? Use the dashed
    > line as a regex. Convert the dashes to dots. Wrap the dots in
    > parentheses. Convert the whitespace chars to '\s'. Presto! Simpler,
    > cleaner code.


    Woo-hoo! Yesterday was HTML day, today is code review day. Yee-haa!

    >
    > import re
    >
    > state = 0
    > header_line = ''
    > pattern = ''
    > f = open('a.txt', 'r')
    > for line in f:
    > if line[-1:] == '\n':
    > line = line[:-1]
    >
    > if state == 0:
    > header_line = line
    > state += 1


    state = 1

    > elif state == 1:
    > pattern = re.sub(r'-', r'.', line)
    > pattern = re.sub(r'\s', r'\\s', pattern)
    > pattern = re.sub(r'([.]+)', r'(\1)', pattern)


    Consider this:
    pattern = ' '.join('(.{%d})' % len(x) for x in line.split())
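
To see what that one-liner produces, here is a small sketch on an invented dashed line. It assumes exactly one space between dash runs, since the pattern is joined with single spaces.

```python
import re

fmt = '%-5s %-8s'                              # hypothetical layout
dash_line = fmt % ('-' * 5, '-' * 8)
data_line = fmt % ('ann', 'New York')

# One counted group per dash run, e.g. (.{5}) for five dashes.
pattern = ' '.join('(.{%d})' % len(x) for x in dash_line.split())
print(pattern)

m = re.match(pattern, data_line)
print(m.groups())
```

The counted form is easier to read than a long run of dots, and the groups still need .strip() to remove padding.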

    > print pattern
    > state += 1


    state = 2

    >
    > headers = re.match(pattern, header_line)
    > if headers:
    > print headers.groups()
    > else:
    > state = 2


    assert state == 2

    > m = re.match(pattern, line)
    > if m:
    > print m.groups()
    >
    > f.close()
    >
     
    John Machin, Jan 23, 2008
    #16
  17. ryan k

    ryan k Guest

    On Jan 23, 5:37 pm, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Wed, 23 Jan 2008 11:05:01 -0800, Paul Rubin wrote:
    > > ryan k <> writes:
    > >> Hello. I have a string like 'LNAME
    > >> PASTA ZONE'. I want to create a list of those words and
    > >> basically replace all the whitespace between them with one space so i
    > >> could just do lala.split(). Thank you!

    >
    > > import re
    > > s = 'LNAME PASTA ZONE'
    > > re.split('\s+', s)

    >
    > Please tell me you're making fun of the poor newbie and didn't mean to
    > seriously suggest using a regex merely to split on whitespace?
    >
    > >>> import timeit
    > >>> timeit.Timer("s.split()", "s = 'one two three four'").repeat()
    > [1.4074358940124512, 1.3505148887634277, 1.3469438552856445]
    > >>> timeit.Timer("re.split('\s+', s)", "import re;s = 'one two three four'").repeat()
    > [7.9205508232116699, 7.8833441734313965, 7.9301259517669678]
    >
    > --
    > Steven


    The main topic is not an issue anymore.
     
    ryan k, Jan 23, 2008
    #17
  18. ryan k

    John Machin Guest

    On Jan 24, 9:47 am, ryan k <> wrote:
    > On Jan 23, 5:37 pm, Steven D'Aprano <st...@REMOVE-THIS-
    >
    >
    >
    > cybersource.com.au> wrote:
    > > On Wed, 23 Jan 2008 11:05:01 -0800, Paul Rubin wrote:
    > > > ryan k <> writes:
    > > >> Hello. I have a string like 'LNAME
    > > >> PASTA ZONE'. I want to create a list of those words and
    > > >> basically replace all the whitespace between them with one space so i
    > > >> could just do lala.split(). Thank you!

    >
    > > > import re
    > > > s = 'LNAME PASTA ZONE'
    > > > re.split('\s+', s)

    >
    > > Please tell me you're making fun of the poor newbie and didn't mean to
    > > seriously suggest using a regex merely to split on whitespace?

    >
    > > >>> import timeit
    > > >>> timeit.Timer("s.split()", "s = 'one two three four'").repeat()
    > > [1.4074358940124512, 1.3505148887634277, 1.3469438552856445]
    > > >>> timeit.Timer("re.split('\s+', s)", "import re;s = 'one two three four'").repeat()
    > > [7.9205508232116699, 7.8833441734313965, 7.9301259517669678]

    >
    > > --
    > > Steven

    >
    > The main topic is not an issue anymore.


    We know that. This thread will continue with biffo and brickbats long
    after your assignment has been submitted :)
     
    John Machin, Jan 23, 2008
    #18
  19. ryan k

    ryan k Guest

    On Jan 23, 5:37 pm, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Wed, 23 Jan 2008 11:05:01 -0800, Paul Rubin wrote:
    > > ryan k <> writes:
    > >> Hello. I have a string like 'LNAME
    > >> PASTA ZONE'. I want to create a list of those words and
    > >> basically replace all the whitespace between them with one space so i
    > >> could just do lala.split(). Thank you!

    >
    > > import re
    > > s = 'LNAME PASTA ZONE'
    > > re.split('\s+', s)

    >
    > Please tell me you're making fun of the poor newbie and didn't mean to
    > seriously suggest using a regex merely to split on whitespace?
    >
    > >>> import timeit
    > >>> timeit.Timer("s.split()", "s = 'one two three four'").repeat()
    > [1.4074358940124512, 1.3505148887634277, 1.3469438552856445]
    > >>> timeit.Timer("re.split('\s+', s)", "import re;s = 'one two three four'").repeat()
    > [7.9205508232116699, 7.8833441734313965, 7.9301259517669678]
    >
    > --
    > Steven


    Much thanks to Machin for helping with the parsing job. Steven
    D'Aprano, you are a prick.
     
    ryan k, Jan 23, 2008
    #19
  20. ryan k

    John Machin Guest

    On Jan 24, 9:50 am, ryan k <> wrote:

    > Steven D'Aprano, you are a prick.


    And your reasons for coming to that stridently expressed conclusion
    after reading a posting that was *not* addressed to you are .....?
     
    John Machin, Jan 23, 2008
    #20
