Newbie: how to transform text into lines of text


vsoler

Hello,

I've read a text file into variable "a":

a=open('FicheroTexto.txt','r')
a.read()

"a" contains all the lines of the text separated by '\n' characters.

Now, I want to work with each line separately, without the '\n'
character.

How can I get variable "b" as a list of such lines?

Thank you for your help
 

Diez B. Roggisch

vsoler said:
Hello,

I've read a text file into variable "a":

a=open('FicheroTexto.txt','r')
a.read()

"a" contains all the lines of the text separated by '\n' characters.

No, it doesn't. "a.read()" *returns* the contents, but you don't assign
it, so it is discarded.
Now, I want to work with each line separately, without the '\n'
character.

How can I get variable "b" as a list of such lines?


The idiomatic way would be iterating over the file-object itself - which
will get you the lines:

with open("foo.txt") as inf:
    for line in inf:
        print line


The advantage is that this works even for large files that otherwise
won't fit into memory. Your approach of reading the full contents can be
used like this:

content = a.read()
for line in content.split("\n"):
    print line


Diez
 

Tim Chase

The idiomatic way would be iterating over the file-object itself - which
will get you the lines:

with open("foo.txt") as inf:
    for line in inf:
        print line

In versions of Python before the "with" was introduced (as in the
2.4 installations I've got at both home and work), this can simply be

for line in open("foo.txt"):
    print line

If you are processing lots of files, you can use

f = open("foo.txt")
for line in f:
    print line
f.close()
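
If an exception could escape mid-loop, wrapping the loop in
try/finally still guarantees the close (a sketch):

f = open("foo.txt")
try:
    for line in f:
        print line
finally:
    f.close()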

One other caveat here, "line" contains the newline at the end, so
you might have

print line.rstrip('\r\n')

to remove them.

content = a.read()
for line in content.split("\n"):
    print line

Strings have a "splitlines()" method for this purpose:

content = a.read()
for line in content.splitlines():
    print line
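
A quick way to see the difference when the text ends with a
newline (a sketch):

content = 'a\nb\n'
print content.split('\n')     # ['a', 'b', ''] -- note the trailing empty string
print content.splitlines()    # ['a', 'b']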

-tkc
 

vsoler

No, it doesn't. "a.read()" *returns* the contents, but you don't assign
it, so it is discarded.



The idiomatic way would be iterating over the file-object itself - which
will get you the lines:

with open("foo.txt") as inf:
     for line in inf:
         print line

The advantage is that this works even for large files that otherwise
won't fit into memory. Your approach of reading the full contents can be
used like this:

content = a.read()
for line in content.split("\n"):
     print line

Diez

Thanks a lot. Very quick and clear.
 

John Machin

One other caveat here, "line" contains the newline at the end, so
you might have

  print line.rstrip('\r\n')

to remove them.

I don't understand the presence of the '\r' there. Any '\x0d' that
remains after reading the file in text mode and is removed by that
rstrip would be a strange occurrence in the data which the OP may
prefer to find out about and deal with; it is not part of "the
newline". Why suppress one particular data character in preference to
others?

The same applies in any case to the use of rstrip('\n'); if that finds
more than one occurrence of '\x0a' to remove, it has exceeded the
mandate of removing the newline (if any).
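
For instance (a sketch):

print repr('data\n\n\n'.rstrip('\n'))   # 'data' -- all three trailing '\n' gone, not just one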

So, we are left with the unfortunately awkward
if line.endswith('\n'):
    line = line[:-1]

Cheers,
John
 

Tim Chase

One other caveat here, "line" contains the newline at the end, so
you might have print line.rstrip('\r\n') to remove them.

I don't understand the presence of the '\r' there. Any '\x0d' that
remains after reading the file in text mode and is removed by that
rstrip would be a strange occurrence in the data which the OP may
prefer to find out about and deal with; it is not part of "the
newline". Why suppress one particular data character in preference to
others?

In an ideal world where everybody knew how to make a proper
text-file, it wouldn't be an issue. Recreating the form of some
of the data I get from customers/providers:
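
A minimal sketch of how such a file gets written (the file name
and the bytes here are invented):

f = open('from_customer.txt', 'wb')           # binary mode: bytes go out verbatim
f.write('headers\ndata1\r\ndata2\r\ndata3')   # mixed endings, no final newline
f.close()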

Then reading it back in:
...
'headers\n'
'data1\r\n'
'data2\r\n'
'data3'

As for wanting to know about stray '\r' characters, I only want
the data -- I don't particularly like to be reminded of the
incompetence of those who send me malformed text-files ;-)

The same applies in any case to the use of rstrip('\n'); if that finds
more than one occurrence of '\x0a' to remove, it has exceeded the
mandate of removing the newline (if any).

I believe that using the formulaic "for line in file(FILENAME)"
iteration guarantees that each "line" will have at most only one
'\n' and it will be at the end (again, a malformed text-file with
no terminal '\n' may cause it to be absent from the last line).

So, we are left with the unfortunately awkward

if line.endswith('\n'):
    line = line[:-1]

You're welcome to it, but I'll stick with my more DWIM solution
of "get rid of anything that resembles an attempt at a CR/LF".

Thank goodness I haven't found any of my data-sources using
"\n\r" instead, which would require me to left-strip '\r'
characters as well. Sigh. My kingdom for competency. :-/

-tkc
 

John Machin

I believe that using the formulaic "for line in file(FILENAME)"
iteration guarantees that each "line" will have at most only one '\n'
and it will be at the end (again, a malformed text-file with no terminal
'\n' may cause it to be absent from the last line)

It seems that you are right -- not that I can find such a guarantee
written anywhere. I had armchair-philosophised that writing
"foo\n\r\nbar\r\n" to a file in binary mode and reading it on Windows in
text mode would be strict and report the first line as "foo\n\n"; I was
wrong.

So, we are left with the unfortunately awkward

if line.endswith('\n'):
    line = line[:-1]

You're welcome to it, but I'll stick with my more DWIM solution of "get
rid of anything that resembles an attempt at a CR/LF".

Thanks, but I don't want it. My point was that you didn't TTOPEWYM (tell
the OP exactly what you meant).

My approach to DWIM with data is, given

    norm_space = lambda s: u' '.join(s.split())

to break the line up into fields first (just in case the field delimiter
== '\t') and then apply norm_space to each field. This gets rid of your
'\r' at the end (or start!) of the line, and multiple whitespace
characters are replaced by a single space. Whitespace includes NBSP
(U+00A0) as an added bonus for being righteous and using Unicode :)
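
A sketch of that order of operations (the helper name and the
sample line are invented; '\t' is assumed to be the delimiter):

norm_space = lambda s: u' '.join(s.split())

def clean_fields(line):
    # Split into fields first, then normalise whitespace inside each
    # field, so the '\t' delimiter isn't swallowed along the way.
    return [norm_space(field) for field in line.split(u'\t')]

print clean_fields(u'foo  bar\tbaz\r\n')   # [u'foo bar', u'baz']
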
Thank goodness I haven't found any of my data-sources using "\n\r"
instead, which would require me to left-strip '\r' characters as well.
Sigh. My kingdom for competency. :-/

Indeed. I actually got data in that format once from a *x programmer who
was so kind as to do it that way just for me because he knew that I use
Windows and he thought that's what Windows text files looked like. No
kidding.

Cheers,
John
 

Steven D'Aprano

Thank goodness I haven't found any of my data-sources using "\n\r"
instead, which would require me to left-strip '\r' characters as well.
Sigh. My kingdom for competency. :-/

If I recall correctly, one of the accounting systems I used eight years
ago gave you the option of exporting text files with either \r\n or \n\r
as the end-of-line mark. Neither \n nor \r (POSIX or classic Mac) line
endings were supported, as that would have been useful.

(It may have been Arrow Accounting, but don't quote me on that.)

I can only imagine the developer couldn't remember which order the
characters were supposed to go, so rather than look it up, he made it
optional.
 

Tim Chase

Scott said:
Here's how I'd do it:
with open('deheap/deheap.py', 'rU') as source:
    for line in source:
        print line.rstrip()  # Avoid trailing spaces as well.

This should handle \n, \r\n, and \n\r lines.


Unfortunately, a raw rstrip() eats other whitespace that may be
important. I frequently get tab-delimited files, which I process
with the following pseudo-code:

def clean_line(line):
    return line.rstrip('\r\n').split('\t')

f = file('customer_x.txt')
headers = clean_line(f.next())
for line in f:
    field1, field2, field3 = clean_line(line)
    do_stuff()

If field3 is empty in the source-file, using rstrip(None) as you
suggest triggers errors on the tuple assignment because it eats
the tab that defined it.

I suppose if I were really smart, I'd dig a little deeper in the
CSV module to sniff out the "right" way to parse tab-delimited files.
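
For what it's worth, csv.Sniffer can guess the dialect from a
sample (a sketch; the candidate delimiters are an assumption):

import csv

sample = open('customer_x.txt', 'rb').read(1024)
dialect = csv.Sniffer().sniff(sample, delimiters='\t,;')
reader = csv.reader(open('customer_x.txt', 'rb'), dialect)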

-tkc
 

Gabriel Genellina

On Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase wrote:

Unfortunately, a raw rstrip() eats other whitespace that may be
important. I frequently get tab-delimited files, which I process
with the following pseudo-code:

def clean_line(line):
    return line.rstrip('\r\n').split('\t')

f = file('customer_x.txt')
headers = clean_line(f.next())
for line in f:
    field1, field2, field3 = clean_line(line)
    do_stuff()

If field3 is empty in the source-file, using rstrip(None) as you suggest
triggers errors on the tuple assignment because it eats the tab that
defined it.

I suppose if I were really smart, I'd dig a little deeper in the CSV
module to sniff out the "right" way to parse tab-delimited files.

It's so easy that not doing it is just inexcusable laziness :)
Your own example, written using the csv module:

import csv

f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
headers = f.next()
for line in f:
    field1, field2, field3 = line
    do_stuff()
 

John Machin

On Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase <[email protected]> wrote:

It's so easy that not doing it is just inexcusable laziness :)
Your own example, written using the csv module:

import csv

f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
headers = f.next()
for line in f:
     field1, field2, field3 = line
     do_stuff()

And where in all of that do you recommend that .decode(some_encoding)
be inserted?
 

Gabriel Genellina

On Jan 26, 1:03 pm, "Gabriel Genellina" <[email protected]>
wrote:

And where in all of that do you recommend that .decode(some_encoding)
be inserted?

For encodings that don't use embedded NUL bytes (latin1, utf8) I'd decode
the fields right when extracting them:

field1, field2, field3 = (field.decode('utf8') for field in line)

For encodings that allow NUL bytes, I'd use any of the recipes in the csv
module documentation.

(That is, if I care about the encoding at all. Perhaps the file contains
only numbers. Perhaps it contains only ASCII characters. Perhaps I'm only
interested in some fields for which the encoding is irrelevant. Perhaps it
is an internally generated file and it doesn't matter as long as I use the
same encoding on output)
But I admit that, in general, "decode input early when reading, work in
unicode, encode output late when writing" is the best practice.
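
In miniature, that practice looks something like this (a sketch,
assuming UTF-8 and an invented output file name):

raw = open('customer_x.txt', 'rb').read()
text = raw.decode('utf8')       # decode once, at the input boundary
# ... all processing happens on unicode objects ...
open('output.txt', 'wb').write(text.encode('utf8'))   # encode once, on the way out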
 

Tim Rowe

2009/1/25 Tim Chase said:
(again, a malformed text-file with no terminal '\n' may cause it
to be absent from the last line)

Ahem. That may be "malformed" for some specific file specification,
but it is only "malformed" in general if you are using an operating
system that treats '\n' as a terminator (eg, Linux) rather than as a
separator (eg, MS DOS/Windows).
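
The difference is visible from Python (a sketch; the file name is
invented):

open('no_final_newline.txt', 'wb').write('foo\nbar')   # separator convention: no terminal '\n'
print list(open('no_final_newline.txt'))               # ['foo\n', 'bar']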

Perhaps what you don't /really/ want to be reminded of is the
existence of operating systems other than your preferred one?
 

Sion Arrowsmith

Diez B. Roggisch said:
[ ... ] Your approach of reading the full contents can be
used like this:

content = a.read()
for line in content.split("\n"):
    print line

Or if you want the full content in memory but only ever access it on a
line-by-line basis:

content = a.readlines()

(Just because we can now write "for line in file" doesn't mean that
readlines() is *totally* redundant.)
 

Marc 'BlackJack' Rintsch

content = a.readlines()

(Just because we can now write "for line in file" doesn't mean that
readlines() is *totally* redundant.)

But ``content = list(a)`` is shorter. :)

Ciao,
Marc 'BlackJack' Rintsch
 

Andreas Waldenburger

But ``content = list(a)`` is shorter. :)

But much less clear, wouldn't you say?

content is now what? A list of lines? Characters? Bytes? I-Nodes?
Dates? Granted, it can be inferred from the fact that a file is its
own iterator over its lines, but that is a mental step that readlines()
frees you from doing.

My ~0.0154 €.

/W
 

Gabriel Genellina

If encoding is an issue for your application, then I'd recommend you use
codecs.open('customer_x.txt', 'rb', encoding='ebcdic') instead of open()

This would be the best way *if* the csv module could handle Unicode input,
but unfortunately this is not the case. See my other reply.
 