Newbie: how to transform text into lines of text

Discussion in 'Python' started by vsoler, Jan 25, 2009.

  1. vsoler

    vsoler Guest

    Hello,

    I've read a text file into variable "a"

    a=open('FicheroTexto.txt','r')
    a.read()

    "a" contains all the lines of the text separated by '\n' characters.

    Now, I want to work with each line separately, without the '\n'
    character.

    How can I get variable "b" as a list of such lines?

    Thank you for your help
    vsoler, Jan 25, 2009
    #1

  2. vsoler wrote:
    > Hello,
    >
    > I've read a text file into variable "a"
    >
    > a=open('FicheroTexto.txt','r')
    > a.read()
    >
    > "a" contains all the lines of the text separated by '\n' characters.


    No, it doesn't. "a.read()" *returns* the contents, but you don't assign
    it, so it is discarded.

    > Now, I want to work with each line separately, without the '\n'
    > character.
    >
    > How can I get variable "b" as a list of such lines?



    The idiomatic way would be iterating over the file-object itself - which
    will get you the lines:

    with open("foo.txt") as inf:
        for line in inf:
            print line


    The advantage is that this works even for large files that otherwise
    won't fit into memory. Your approach of reading the full contents can be
    used like this:

    content = a.read()
    for line in content.split("\n"):
        print line
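
    In terms of the OP's variable names, a minimal sketch of that second
    approach (the file name is simply the one from the original post) would be:

    a = open('FicheroTexto.txt', 'r')
    content = a.read()         # keep the string that read() returns
    b = content.split("\n")    # b is a list of the lines, without the '\n';
                               # a trailing '\n' leaves an empty last element
    a.close()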


    Diez
    Diez B. Roggisch, Jan 25, 2009
    #2

  3. vsoler

    Tim Chase Guest

    > The idiomatic way would be iterating over the file-object itself - which
    > will get you the lines:
    >
    > with open("foo.txt") as inf:
    >     for line in inf:
    >         print line


    In versions of Python before the "with" was introduced (as in the
    2.4 installations I've got at both home and work), this can simply be

    for line in open("foo.txt"):
        print line

    If you are processing lots of files, you can use

    f = open("foo.txt")
    for line in f:
        print line
    f.close()

    One other caveat here, "line" contains the newline at the end, so
    you might have

    print line.rstrip('\r\n')

    to remove them.


    > content = a.read()
    > for line in content.split("\n"):
    >     print line


    Strings have a "splitlines()" method for this purpose:

    content = a.read()
    for line in content.splitlines():
        print line

    -tkc
    Tim Chase, Jan 25, 2009
    #3
  4. vsoler

    vsoler Guest

    On Jan 25, 14:36, "Diez B. Roggisch" <> wrote:
    > vsoler wrote:
    > > Hello,
    > >
    > > I've read a text file into variable "a"
    > >
    > >      a=open('FicheroTexto.txt','r')
    > >      a.read()
    > >
    > > "a" contains all the lines of the text separated by '\n' characters.
    >
    > No, it doesn't. "a.read()" *returns* the contents, but you don't assign
    > it, so it is discarded.
    >
    > > Now, I want to work with each line separately, without the '\n'
    > > character.
    > >
    > > How can I get variable "b" as a list of such lines?
    >
    > The idiomatic way would be iterating over the file-object itself - which
    > will get you the lines:
    >
    > with open("foo.txt") as inf:
    >     for line in inf:
    >         print line
    >
    > The advantage is that this works even for large files that otherwise
    > won't fit into memory. Your approach of reading the full contents can be
    > used like this:
    >
    > content = a.read()
    > for line in content.split("\n"):
    >     print line
    >
    > Diez


    Thanks a lot. Very quick and clear
    vsoler, Jan 25, 2009
    #4
  5. vsoler

    John Machin Guest

    On Jan 26, 12:54 am, Tim Chase <> wrote:

    > One other caveat here, "line" contains the newline at the end, so
    > you might have
    >
    >   print line.rstrip('\r\n')
    >
    > to remove them.


    I don't understand the presence of the '\r' there. Any '\x0d' that
    remains after reading the file in text mode and is removed by that
    rstrip would be a strange occurrence in the data which the OP may
    prefer to find out about and deal with; it is not part of "the
    newline". Why suppress one particular data character in preference to
    others?

    The same applies in any case to the use of rstrip('\n'); if that finds
    more than one occurrence of '\x0a' to remove, it has exceeded the
    mandate of removing the newline (if any).

    So, we are left with the unfortunately awkward
    if line.endswith('\n'):
        line = line[:-1]
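
    For illustration only, with a hypothetical string holding a data '\x0a'
    followed by the newline:

    s = 'data\n\n'
    print repr(s.rstrip('\n'))    # 'data'   -- both characters stripped
    if s.endswith('\n'):
        s = s[:-1]
    print repr(s)                 # 'data\n' -- only the terminal newline removed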

    Cheers,
    John
    John Machin, Jan 25, 2009
    #5
  6. vsoler

    Tim Chase Guest

    >> One other caveat here, "line" contains the newline at the end, so
    >> you might have
    >>
    >> print line.rstrip('\r\n')
    >>
    >> to remove them.

    >
    > I don't understand the presence of the '\r' there. Any '\x0d' that
    > remains after reading the file in text mode and is removed by that
    > rstrip would be a strange occurrence in the data which the OP may
    > prefer to find out about and deal with; it is not part of "the
    > newline". Why suppress one particular data character in preference to
    > others?


    In an ideal world where everybody knew how to make a proper
    text-file, it wouldn't be an issue. Recreating the form of some
    of the data I get from customers/providers:

    >>> f = file('tmp/x.txt', 'wb')
    >>> f.write('headers\n') # headers in Unix format
    >>> f.write('data1\r\n') # data in Dos format
    >>> f.write('data2\r\n')
    >>> f.write('data3') # no trailing newline of any sort
    >>> f.close()


    Then reading it back in:

    >>> for line in file('tmp/x.txt'): print repr(line)
    ...
    'headers\n'
    'data1\r\n'
    'data2\r\n'
    'data3'
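
    Applying the rstrip('\r\n') from earlier to those same lines should give
    uniform results:

    >>> for line in file('tmp/x.txt'): print repr(line.rstrip('\r\n'))
    ...
    'headers'
    'data1'
    'data2'
    'data3'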

    As for wanting to know about stray '\r' characters, I only want
    the data -- I don't particularly like to be reminded of the
    incompetence of those who send me malformed text-files ;-)

    > The same applies in any case to the use of rstrip('\n'); if that finds
    > more than one occurrence of '\x0a' to remove, it has exceeded the
    > mandate of removing the newline (if any).


    I believe that using the formulaic "for line in file(FILENAME)"
    iteration guarantees that each "line" will have at most only one
    '\n' and it will be at the end (again, a malformed text-file with
    no terminal '\n' may cause it to be absent from the last line)

    > So, we are left with the unfortunately awkward
    > if line.endswith('\n'):
    >     line = line[:-1]


    You're welcome to it, but I'll stick with my more DWIM solution
    of "get rid of anything that resembles an attempt at a CR/LF".

    Thank goodness I haven't found any of my data-sources using
    "\n\r" instead, which would require me to left-strip '\r'
    characters as well. Sigh. My kingdom for competency. :-/

    -tkc
    Tim Chase, Jan 25, 2009
    #6
  7. vsoler

    John Machin Guest

    On 26/01/2009 10:34 AM, Tim Chase wrote:

    > I believe that using the formulaic "for line in file(FILENAME)"
    > iteration guarantees that each "line" will have at most only one '\n'
    > and it will be at the end (again, a malformed text-file with no terminal
    > '\n' may cause it to be absent from the last line)


    It seems that you are right -- not that I can find such a guarantee
    written anywhere. I had armchair-philosophised that writing
    "foo\n\r\nbar\r\n" to a file in binary mode and reading it on Windows in
    text mode would be strict and report the first line as "foo\n\n"; I was
    wrong.

    >
    >> So, we are left with the unfortunately awkward
    >> if line.endswith('\n'):
    >>     line = line[:-1]

    >
    > You're welcome to it, but I'll stick with my more DWIM solution of "get
    > rid of anything that resembles an attempt at a CR/LF".


    Thanks, but I don't want it. My point was that you didn't TTOPEWYM (tell
    the OP exactly what you meant).

    My approach to DWIM with data is, given
    norm_space = lambda s: u' '.join(s.split())
    to break up the line into fields first (just in case the field delimiter
    == '\t') then apply norm_space to each field. This gets rid of your '\r'
    at end (or start!) of line, and multiple whitespace characters are
    replaced by a single space. Whitespace includes NBSP (U+00A0) as an
    added bonus for being righteous and using Unicode :)
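
    A minimal sketch of that approach (the tab delimiter, the latin-1 decode
    and the file name are assumptions for illustration only):

    norm_space = lambda s: u' '.join(s.split())

    for line in open('customer_x.txt'):
        fields = line.decode('latin-1').split(u'\t')   # fields first...
        fields = [norm_space(f) for f in fields]       # ...then normalise each
        # work with the cleaned fields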

    > Thank goodness I haven't found any of my data-sources using "\n\r"
    > instead, which would require me to left-strip '\r' characters as well.
    > Sigh. My kingdom for competency. :-/


    Indeed. I actually got data in that format once from a *x programmer who
    was so kind as to do it that way just for me because he knew that I use
    Windows and he thought that's what Windows text files looked like. No
    kidding.

    Cheers,
    John
    John Machin, Jan 26, 2009
    #7
  8. On Sun, 25 Jan 2009 17:34:18 -0600, Tim Chase wrote:

    > Thank goodness I haven't found any of my data-sources using "\n\r"
    > instead, which would require me to left-strip '\r' characters as well.
    > Sigh. My kingdom for competency. :-/


    If I recall correctly, one of the accounting systems I used eight years
    ago gave you the option of exporting text files with either \r\n or \n\r
    as the end-of-line mark. Neither \n nor \r (POSIX or classic Mac) line
    endings were supported, as that would have been useful.

    (It may have been Arrow Accounting, but don't quote me on that.)

    I can only imagine the developer couldn't remember which order the
    characters were supposed to go, so rather than look it up, he made it
    optional.



    --
    Steven
    Steven D'Aprano, Jan 26, 2009
    #8
  9. vsoler

    Tim Chase Guest

    Scott David Daniels wrote:
    > Here's how I'd do it:
    > with open('deheap/deheap.py', 'rU') as source:
    >     for line in source:
    >         print line.rstrip() # Avoid trailing spaces as well.
    >
    > This should handle \n, \r\n, and \n\r lines.



    Unfortunately, a raw rstrip() eats other whitespace that may be
    important. I frequently get tab-delimited files, using the
    following pseudo-code:

    def clean_line(line):
        return line.rstrip('\r\n').split('\t')

    f = file('customer_x.txt')
    headers = clean_line(f.next())
    for line in f:
        field1, field2, field3 = clean_line(line)
        do_stuff()

    if field3 is empty in the source-file, using rstrip(None) as you
    suggest triggers errors on the tuple assignment because it eats
    the tab that defined it.
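
    A hypothetical record with an empty field3 shows the difference:

    line = 'alpha\tbeta\t\n'
    print line.rstrip('\r\n').split('\t')   # ['alpha', 'beta', '']  -- 3 fields
    print line.rstrip().split('\t')         # ['alpha', 'beta']      -- the tab is
                                            # eaten, so 3-tuple unpacking fails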

    I suppose if I were really smart, I'd dig a little deeper in the
    CSV module to sniff out the "right" way to parse tab-delimited files.

    -tkc
    Tim Chase, Jan 26, 2009
    #9
  10. On Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase
    <> wrote:

    > Unfortunately, a raw rstrip() eats other whitespace that may be
    > important. I frequently get tab-delimited files, using the following
    > pseudo-code:
    >
    > def clean_line(line):
    >     return line.rstrip('\r\n').split('\t')
    >
    > f = file('customer_x.txt')
    > headers = clean_line(f.next())
    > for line in f:
    >     field1, field2, field3 = clean_line(line)
    >     do_stuff()
    >
    > if field3 is empty in the source-file, using rstrip(None) as you suggest
    > triggers errors on the tuple assignment because it eats the tab that
    > defined it.
    >
    > I suppose if I were really smart, I'd dig a little deeper in the CSV
    > module to sniff out the "right" way to parse tab-delimited files.


    It's so easy that not doing it is just inexcusable laziness :)
    Your own example, written using the csv module:

    import csv

    f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
    headers = f.next()
    for line in f:
        field1, field2, field3 = line
        do_stuff()

    --
    Gabriel Genellina
    Gabriel Genellina, Jan 26, 2009
    #10
  11. vsoler

    John Machin Guest

    On Jan 26, 1:03 pm, "Gabriel Genellina" <>
    wrote:
    > On Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase
    > <> wrote:
    >
    > > Unfortunately, a raw rstrip() eats other whitespace that may be
    > > important.  I frequently get tab-delimited files, using the following
    > > pseudo-code:
    > >
    > >    def clean_line(line):
    > >        return line.rstrip('\r\n').split('\t')
    > >
    > >    f = file('customer_x.txt')
    > >    headers = clean_line(f.next())
    > >    for line in f:
    > >        field1, field2, field3 = clean_line(line)
    > >        do_stuff()
    > >
    > > if field3 is empty in the source-file, using rstrip(None) as you suggest
    > > triggers errors on the tuple assignment because it eats the tab that
    > > defined it.
    > >
    > > I suppose if I were really smart, I'd dig a little deeper in the CSV
    > > module to sniff out the "right" way to parse tab-delimited files.
    >
    > It's so easy that not doing it is just inexcusable laziness :)
    > Your own example, written using the csv module:
    >
    > import csv
    >
    > f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
    > headers = f.next()
    > for line in f:
    >     field1, field2, field3 = line
    >     do_stuff()
    >


    And where in all of that do you recommend that .decode(some_encoding)
    be inserted?
    John Machin, Jan 26, 2009
    #11
  12. On Mon, 26 Jan 2009 00:23:30 -0200, John Machin <>
    wrote:
    > On Jan 26, 1:03 pm, "Gabriel Genellina" <>
    > wrote:
    >
    >> It's so easy that not doing it is just inexcusable laziness :)
    >> Your own example, written using the csv module:
    >>
    >> import csv
    >>
    >> f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
    >> headers = f.next()
    >> for line in f:
    >>     field1, field2, field3 = line
    >>     do_stuff()

    >
    > And where in all of that do you recommend that .decode(some_encoding)
    > be inserted?


    For encodings that don't use embedded NUL bytes (latin1, utf8) I'd decode
    the fields right when extracting them:

    field1, field2, field3 = (field.decode('utf8') for field in line)
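
    Putting that together with the csv example above (utf-8 is assumed here
    purely for illustration):

    import csv

    reader = csv.reader(open('customer_x.txt', 'rb'), delimiter='\t')
    headers = reader.next()
    for row in reader:
        field1, field2, field3 = (field.decode('utf8') for field in row)
        # do_stuff()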

    For encodings that allow NUL bytes, I'd use any of the recipes in the csv
    module documentation.

    (That is, if I care about the encoding at all. Perhaps the file contains
    only numbers. Perhaps it contains only ASCII characters. Perhaps I'm only
    interested in some fields for which the encoding is irrelevant. Perhaps it
    is an internally generated file and it doesn't matter as long as I use the
    same encoding on output.)

    But I admit that, in general, "decode input early when reading, work in
    unicode, encode output late when writing" is the best practice.

    --
    Gabriel Genellina
    Gabriel Genellina, Jan 26, 2009
    #12
  13. vsoler

    Tim Rowe Guest

    2009/1/25 Tim Chase <>:

    > (again, a malformed text-file with no terminal '\n' may cause it
    > to be absent from the last line)


    Ahem. That may be "malformed" for some specific file specification,
    but it is only "malformed" in general if you are using an operating
    system that treats '\n' as a terminator (eg, Linux) rather than as a
    separator (eg, MS DOS/Windows).

    Perhaps what you don't /really/ want to be reminded of is the
    existence of operating systems other than your preferred one?

    --
    Tim Rowe
    Tim Rowe, Jan 26, 2009
    #13
  14. Diez B. Roggisch <> wrote:
    > [ ... ] Your approach of reading the full contents can be
    > used like this:
    >
    > content = a.read()
    > for line in content.split("\n"):
    >     print line
    >


    Or if you want the full content in memory but only ever access it on a
    line-by-line basis:

    content = a.readlines()

    (Just because we can now write "for line in file" doesn't mean that
    readlines() is *totally* redundant.)

    --
    \S -- -- http://www.chaos.org.uk/~sion/
    "Frankly I have no feelings towards penguins one way or the other"
    -- Arthur C. Clarke
    her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
    Sion Arrowsmith, Jan 26, 2009
    #14
  15. On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:

    > content = a.readlines()
    >
    > (Just because we can now write "for line in file" doesn't mean that
    > readlines() is *totally* redundant.)


    But ``content = list(a)`` is shorter. :)

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Jan 26, 2009
    #15
  16. On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch <>
    wrote:

    > On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
    >
    > > content = a.readlines()
    > >
    > > (Just because we can now write "for line in file" doesn't mean that
    > > readlines() is *totally* redundant.)

    >
    > But ``content = list(a)`` is shorter. :)
    >

    But much less clear, wouldn't you say?

    content is now what? A list of lines? Characters? Bytes? I-Nodes?
    Dates? Granted, it can be inferred from the fact that a file is its
    own iterator over its lines, but that is a mental step that readlines()
    frees you from doing.

    My ~0.0154 €.

    /W

    --
    My real email address is constructed by swapping the domain with the
    recipient (local part).
    Andreas Waldenburger, Jan 26, 2009
    #16
  17. On Mon, 26 Jan 2009 13:35:39 -0200, J. Cliff Dyer <>
    wrote:
    > On Sun, 2009-01-25 at 18:23 -0800, John Machin wrote:
    >> On Jan 26, 1:03 pm, "Gabriel Genellina" <>
    >> wrote:
    >> > On Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase
    >> > <> wrote:
    >> > > I suppose if I were really smart, I'd dig a little deeper in the CSV
    >> > > module to sniff out the "right" way to parse tab-delimited files.
    >> >
    >> > It's so easy that not doing it is just inexcusable laziness :)
    >> > Your own example, written using the csv module:
    >> >
    >> > import csv
    >> >
    >> > f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
    >> > headers = f.next()
    >> > for line in f:
    >> >     field1, field2, field3 = line
    >> >     do_stuff()
    >>
    >> And where in all of that do you recommend that .decode(some_encoding)
    >> be inserted?
    >
    > If encoding is an issue for your application, then I'd recommend you use
    > codecs.open('customer_x.txt', 'rb', encoding='ebcdic') instead of open()


    This would be the best way *if* the csv module could handle Unicode input,
    but unfortunately this is not the case. See my other reply.

    --
    Gabriel Genellina
    Gabriel Genellina, Jan 26, 2009
    #17
  19. On Mon, 26 Jan 2009 16:10:11 +0100, Andreas Waldenburger wrote:

    > On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch <>
    > wrote:
    >
    >> On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
    >>
    >> > content = a.readlines()
    >> >
    >> > (Just because we can now write "for line in file" doesn't mean that
    >> > readlines() is *totally* redundant.)

    >>
    >> But ``content = list(a)`` is shorter. :)
    >>

    > But much less clear, wouldn't you say?


    Okay, so let's make it clearer and even shorter: ``lines = list(a)``. :)

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Jan 26, 2009
    #19
  20. On 26 Jan 2009 22:12:43 GMT Marc 'BlackJack' Rintsch <>
    wrote:

    > On Mon, 26 Jan 2009 16:10:11 +0100, Andreas Waldenburger wrote:
    >
    > > On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch
    > > <> wrote:
    > >
    > >> On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
    > >>
    > >> > content = a.readlines()
    > >> >
    > >> > (Just because we can now write "for line in file" doesn't mean
    > >> > that readlines() is *totally* redundant.)
    > >>
    > >> But ``content = list(a)`` is shorter. :)
    > >>

    > > But much less clear, wouldn't you say?

    >
    > Okay, so let's make it clearer and even shorter: ``lines =
    > list(a)``. :)
    >

    OK, you win. :)

    /W

    --
    My real email address is constructed by swapping the domain with the
    recipient (local part).
    Andreas Waldenburger, Jan 26, 2009
    #20
