Newbie: how to transform text into lines of text

Discussion in 'Python' started by vsoler, Jan 25, 2009.

  1. vsoler

    vsoler Guest

    Hello,

    I've read a text file into variable "a"

    a=open('FicheroTexto.txt','r')
    a.read()

    "a" contains all the lines of the text separated by '\n' characters.

    Now, I want to work with each line separately, without the '\n'
    character.

    How can I get variable "b" as a list of such lines?

    Thank you for your help
    vsoler, Jan 25, 2009
    #1

  2. vsoler wrote:
    > Hello,
    >
    > I've read a text file into variable "a"
    >
    > a=open('FicheroTexto.txt','r')
    > a.read()
    >
    > "a" contains all the lines of the text separated by '\n' characters.


    No, it doesn't. "a.read()" *returns* the contents, but you don't assign
    it, so it is discarded.

    > Now, I want to work with each line separately, without the '\n'
    > character.
    >
    > How can I get variable "b" as a list of such lines?



    The idiomatic way would be iterating over the file-object itself - which
    will get you the lines:

    with open("foo.txt") as inf:
        for line in inf:
            print line


    The advantage is that this works even for large files that otherwise
    won't fit into memory. Your approach of reading the full contents can be
    used like this:

    content = a.read()
    for line in content.split("\n"):
        print line
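
    In terms of the OP's variable names, a minimal sketch of that second
    approach (the file name is simply the one from the original post) would be:

    a = open('FicheroTexto.txt', 'r')
    content = a.read()         # keep the string that read() returns
    b = content.split("\n")    # b is a list of the lines, without the '\n';
                               # a trailing '\n' leaves an empty last element
    a.close()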


    Diez
    Diez B. Roggisch, Jan 25, 2009
    #2

  3. vsoler

    Tim Chase Guest

    > The idiomatic way would be iterating over the file-object itself - which
    > will get you the lines:
    >
    > with open("foo.txt") as inf:
    >     for line in inf:
    >         print line


    In versions of Python before the "with" was introduced (as in the
    2.4 installations I've got at both home and work), this can simply be

    for line in open("foo.txt"):
        print line

    If you are processing lots of files, you can use

    f = open("foo.txt")
    for line in f:
        print line
    f.close()

    One other caveat here, "line" contains the newline at the end, so
    you might have

    print line.rstrip('\r\n')

    to remove them.


    > content = a.read()
    > for line in content.split("\n"):
    >     print line


    Strings have a "splitlines()" method for this purpose:

    content = a.read()
    for line in content.splitlines():
        print line

    -tkc
    Tim Chase, Jan 25, 2009
    #3
  4. vsoler

    vsoler Guest

    On Jan 25, 14:36, "Diez B. Roggisch" <> wrote:
    > vsoler wrote:
    > > Hello,
    > >
    > > I've read a text file into variable "a"
    > >
    > >      a=open('FicheroTexto.txt','r')
    > >      a.read()
    > >
    > > "a" contains all the lines of the text separated by '\n' characters.
    >
    > No, it doesn't. "a.read()" *returns* the contents, but you don't assign
    > it, so it is discarded.
    >
    > > Now, I want to work with each line separately, without the '\n'
    > > character.
    > >
    > > How can I get variable "b" as a list of such lines?
    >
    > The idiomatic way would be iterating over the file-object itself - which
    > will get you the lines:
    >
    > with open("foo.txt") as inf:
    >     for line in inf:
    >         print line
    >
    > The advantage is that this works even for large files that otherwise
    > won't fit into memory. Your approach of reading the full contents can be
    > used like this:
    >
    > content = a.read()
    > for line in content.split("\n"):
    >     print line
    >
    > Diez


    Thanks a lot. Very quick and clear
    vsoler, Jan 25, 2009
    #4
  5. vsoler

    John Machin Guest

    On Jan 26, 12:54 am, Tim Chase <> wrote:

    > One other caveat here, "line" contains the newline at the end, so
    > you might have
    >
    >   print line.rstrip('\r\n')
    >
    > to remove them.


    I don't understand the presence of the '\r' there. Any '\x0d' that
    remains after reading the file in text mode and is removed by that
    rstrip would be a strange occurrence in the data which the OP may
    prefer to find out about and deal with; it is not part of "the
    newline". Why suppress one particular data character in preference to
    others?

    The same applies in any case to the use of rstrip('\n'); if that finds
    more than one occurrence of '\x0a' to remove, it has exceeded the
    mandate of removing the newline (if any).

    So, we are left with the unfortunately awkward
    if line.endswith('\n'):
        line = line[:-1]
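
    For illustration only, with a hypothetical string holding a data '\x0a'
    followed by the newline:

    s = 'data\n\n'
    print repr(s.rstrip('\n'))    # 'data'   -- both characters stripped
    if s.endswith('\n'):
        s = s[:-1]
    print repr(s)                 # 'data\n' -- only the terminal newline removed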

    Cheers,
    John
    John Machin, Jan 25, 2009
    #5
  6. vsoler

    Tim Chase Guest

    >> One other caveat here, "line" contains the newline at the end, so
    >> you might have
    >>
    >> print line.rstrip('\r\n')
    >>
    >> to remove them.

    >
    > I don't understand the presence of the '\r' there. Any '\x0d' that
    > remains after reading the file in text mode and is removed by that
    > rstrip would be a strange occurrence in the data which the OP may
    > prefer to find out about and deal with; it is not part of "the
    > newline". Why suppress one particular data character in preference to
    > others?


    In an ideal world where everybody knew how to make a proper
    text-file, it wouldn't be an issue. Recreating the form of some
    of the data I get from customers/providers:

    >>> f = file('tmp/x.txt', 'wb')
    >>> f.write('headers\n') # headers in Unix format
    >>> f.write('data1\r\n') # data in Dos format
    >>> f.write('data2\r\n')
    >>> f.write('data3') # no trailing newline of any sort
    >>> f.close()


    Then reading it back in:

    >>> for line in file('tmp/x.txt'): print repr(line)
    ...
    'headers\n'
    'data1\r\n'
    'data2\r\n'
    'data3'
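
    Applying the rstrip('\r\n') from earlier to those same lines should give
    uniform results:

    >>> for line in file('tmp/x.txt'): print repr(line.rstrip('\r\n'))
    ...
    'headers'
    'data1'
    'data2'
    'data3'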

    As for wanting to know about stray '\r' characters, I only want
    the data -- I don't particularly like to be reminded of the
    incompetence of those who send me malformed text-files ;-)

    > The same applies in any case to the use of rstrip('\n'); if that finds
    > more than one occurrence of '\x0a' to remove, it has exceeded the
    > mandate of removing the newline (if any).


    I believe that using the formulaic "for line in file(FILENAME)"
    iteration guarantees that each "line" will have at most only one
    '\n' and it will be at the end (again, a malformed text-file with
    no terminal '\n' may cause it to be absent from the last line)

    > So, we are left with the unfortunately awkward
    > if line.endswith('\n'):
    >     line = line[:-1]


    You're welcome to it, but I'll stick with my more DWIM solution
    of "get rid of anything that resembles an attempt at a CR/LF".

    Thank goodness I haven't found any of my data-sources using
    "\n\r" instead, which would require me to left-strip '\r'
    characters as well. Sigh. My kingdom for competency. :-/

    -tkc
    Tim Chase, Jan 25, 2009
    #6
  7. vsoler

    John Machin Guest

    On 26/01/2009 10:34 AM, Tim Chase wrote:

    > I believe that using the formulaic "for line in file(FILENAME)"
    > iteration guarantees that each "line" will have at most only one '\n'
    > and it will be at the end (again, a malformed text-file with no terminal
    > '\n' may cause it to be absent from the last line)


    It seems that you are right -- not that I can find such a guarantee
    written anywhere. I had armchair-philosophised that writing
    "foo\n\r\nbar\r\n" to a file in binary mode and reading it on Windows in
    text mode would be strict and report the first line as "foo\n\n"; I was
    wrong.

    >
    >> So, we are left with the unfortunately awkward
    >> if line.endswith('\n'):
    >>     line = line[:-1]

    >
    > You're welcome to it, but I'll stick with my more DWIM solution of "get
    > rid of anything that resembles an attempt at a CR/LF".


    Thanks, but I don't want it. My point was that you didn't TTOPEWYM (tell
    the OP exactly what you meant).

    My approach to DWIM with data is, given
    norm_space = lambda s: u' '.join(s.split())
    to break up the line into fields first (just in case the field delimiter
    == '\t') then apply norm_space to each field. This gets rid of your '\r'
    at end (or start!) of line, and multiple whitespace characters are
    replaced by a single space. Whitespace includes NBSP (U+00A0) as an
    added bonus for being righteous and using Unicode :)
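
    A minimal sketch of that approach (the tab delimiter, the latin-1 decode
    and the file name are assumptions for illustration only):

    norm_space = lambda s: u' '.join(s.split())

    for line in open('customer_x.txt'):
        fields = line.decode('latin-1').split(u'\t')   # fields first...
        fields = [norm_space(f) for f in fields]       # ...then normalise each
        # work with the cleaned fields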

    > Thank goodness I haven't found any of my data-sources using "\n\r"
    > instead, which would require me to left-strip '\r' characters as well.
    > Sigh. My kingdom for competency. :-/


    Indeed. I actually got data in that format once from a *x programmer who
    was so kind as to do it that way just for me because he knew that I use
    Windows and he thought that's what Windows text files looked like. No
    kidding.

    Cheers,
    John
    John Machin, Jan 26, 2009
    #7
  8. On Sun, 25 Jan 2009 17:34:18 -0600, Tim Chase wrote:

    > Thank goodness I haven't found any of my data-sources using "\n\r"
    > instead, which would require me to left-strip '\r' characters as well.
    > Sigh. My kingdom for competency. :-/


    If I recall correctly, one of the accounting systems I used eight years
    ago gave you the option of exporting text files with either \r\n or \n\r
    as the end-of-line mark. Neither \n nor \r (POSIX or classic Mac) line
    endings were supported, as that would have been useful.

    (It may have been Arrow Accounting, but don't quote me on that.)

    I can only imagine the developer couldn't remember which order the
    characters were supposed to go, so rather than look it up, he made it
    optional.



    --
    Steven
    Steven D'Aprano, Jan 26, 2009
    #8
  9. vsoler

    Tim Chase Guest

    Scott David Daniels wrote:
    > Here's how I'd do it:
    > with open('deheap/deheap.py', 'rU') as source:
    >     for line in source:
    >         print line.rstrip() # Avoid trailing spaces as well.
    >
    > This should handle \n, \r\n, and \n\r lines.



    Unfortunately, a raw rstrip() eats other whitespace that may be
    important. I frequently get tab-delimited files, using the
    following pseudo-code:

    def clean_line(line):
        return line.rstrip('\r\n').split('\t')

    f = file('customer_x.txt')
    headers = clean_line(f.next())
    for line in f:
        field1, field2, field3 = clean_line(line)
        do_stuff()

    if field3 is empty in the source-file, using rstrip(None) as you
    suggest triggers errors on the tuple assignment because it eats
    the tab that defined it.
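
    A hypothetical record with an empty field3 shows the difference:

    line = 'alpha\tbeta\t\n'
    print line.rstrip('\r\n').split('\t')   # ['alpha', 'beta', '']  -- 3 fields
    print line.rstrip().split('\t')         # ['alpha', 'beta']      -- the tab is
                                            # eaten, so 3-tuple unpacking fails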

    I suppose if I were really smart, I'd dig a little deeper in the
    CSV module to sniff out the "right" way to parse tab-delimited files.

    -tkc
    Tim Chase, Jan 26, 2009
    #9
  10. On Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase
    <> wrote:

    > Unfortunately, a raw rstrip() eats other whitespace that may be
    > important. I frequently get tab-delimited files, using the following
    > pseudo-code:
    >
    > def clean_line(line):
    >     return line.rstrip('\r\n').split('\t')
    >
    > f = file('customer_x.txt')
    > headers = clean_line(f.next())
    > for line in f:
    >     field1, field2, field3 = clean_line(line)
    >     do_stuff()
    >
    > if field3 is empty in the source-file, using rstrip(None) as you suggest
    > triggers errors on the tuple assignment because it eats the tab that
    > defined it.
    >
    > I suppose if I were really smart, I'd dig a little deeper in the CSV
    > module to sniff out the "right" way to parse tab-delimited files.


    It's so easy that not doing it is just inexcusable laziness :)
    Your own example, written using the csv module:

    import csv

    f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
    headers = f.next()
    for line in f:
        field1, field2, field3 = line
        do_stuff()

    --
    Gabriel Genellina
    Gabriel Genellina, Jan 26, 2009
    #10
  11. vsoler

    John Machin Guest

    On Jan 26, 1:03 pm, "Gabriel Genellina" <>
    wrote:
    > On Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase
    > <> wrote:
    >
    > > Unfortunately, a raw rstrip() eats other whitespace that may be
    > > important.  I frequently get tab-delimited files, using the following
    > > pseudo-code:
    > >
    > >    def clean_line(line):
    > >        return line.rstrip('\r\n').split('\t')
    > >
    > >    f = file('customer_x.txt')
    > >    headers = clean_line(f.next())
    > >    for line in f:
    > >        field1, field2, field3 = clean_line(line)
    > >        do_stuff()
    > >
    > > if field3 is empty in the source-file, using rstrip(None) as you suggest
    > > triggers errors on the tuple assignment because it eats the tab that
    > > defined it.
    > >
    > > I suppose if I were really smart, I'd dig a little deeper in the CSV
    > > module to sniff out the "right" way to parse tab-delimited files.
    >
    > It's so easy that not doing it is just inexcusable laziness :)
    > Your own example, written using the csv module:
    >
    > import csv
    >
    > f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
    > headers = f.next()
    > for line in f:
    >     field1, field2, field3 = line
    >     do_stuff()
    >


    And where in all of that do you recommend that .decode(some_encoding)
    be inserted?
    John Machin, Jan 26, 2009
    #11
  12. On Mon, 26 Jan 2009 00:23:30 -0200, John Machin <>
    wrote:
    > On Jan 26, 1:03 pm, "Gabriel Genellina" <>
    > wrote:
    >
    >> It's so easy that not doing it is just inexcusable laziness :)
    >> Your own example, written using the csv module:
    >>
    >> import csv
    >>
    >> f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
    >> headers = f.next()
    >> for line in f:
    >>     field1, field2, field3 = line
    >>     do_stuff()

    >
    > And where in all of that do you recommend that .decode(some_encoding)
    > be inserted?


    For encodings that don't use embedded NUL bytes (latin1, utf8) I'd decode
    the fields right when extracting them:

    field1, field2, field3 = (field.decode('utf8') for field in line)
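
    Putting that together with the csv example above (utf-8 is assumed here
    purely for illustration):

    import csv

    reader = csv.reader(open('customer_x.txt', 'rb'), delimiter='\t')
    headers = reader.next()
    for row in reader:
        field1, field2, field3 = (field.decode('utf8') for field in row)
        # do_stuff()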

    For encodings that allow NUL bytes, I'd use any of the recipes in the csv
    module documentation.

    (That is, if I care about the encoding at all. Perhaps the file contains
    only numbers. Perhaps it contains only ASCII characters. Perhaps I'm only
    interested in some fields for which the encoding is irrelevant. Perhaps it
    is an internally generated file and it doesn't matter as long as I use the
    same encoding on output.)

    But I admit that, in general, "decode input early when reading, work in
    unicode, encode output late when writing" is the best practice.

    --
    Gabriel Genellina
    Gabriel Genellina, Jan 26, 2009
    #12
  13. vsoler

    Tim Rowe Guest

    2009/1/25 Tim Chase <>:

    > (again, a malformed text-file with no terminal '\n' may cause it
    > to be absent from the last line)


    Ahem. That may be "malformed" for some specific file specification,
    but it is only "malformed" in general if you are using an operating
    system that treats '\n' as a terminator (eg, Linux) rather than as a
    separator (eg, MS DOS/Windows).

    Perhaps what you don't /really/ want to be reminded of is the
    existence of operating systems other than your preferred one?

    --
    Tim Rowe
    Tim Rowe, Jan 26, 2009
    #13
  14. Diez B. Roggisch <> wrote:
    > [ ... ] Your approach of reading the full contents can be
    > used like this:
    >
    > content = a.read()
    > for line in content.split("\n"):
    >     print line
    >


    Or if you want the full content in memory but only ever access it on a
    line-by-line basis:

    content = a.readlines()

    (Just because we can now write "for line in file" doesn't mean that
    readlines() is *totally* redundant.)

    --
    \S -- -- http://www.chaos.org.uk/~sion/
    "Frankly I have no feelings towards penguins one way or the other"
    -- Arthur C. Clarke
    her nu becomeþ se bera eadward ofdun hlæddre heafdes bæce bump bump bump
    Sion Arrowsmith, Jan 26, 2009
    #14
  15. On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:

    > content = a.readlines()
    >
    > (Just because we can now write "for line in file" doesn't mean that
    > readlines() is *totally* redundant.)


    But ``content = list(a)`` is shorter. :)

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Jan 26, 2009
    #15
  16. On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch <>
    wrote:

    > On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
    >
    > > content = a.readlines()
    > >
    > > (Just because we can now write "for line in file" doesn't mean that
    > > readlines() is *totally* redundant.)

    >
    > But ``content = list(a)`` is shorter. :)
    >

    But much less clear, wouldn't you say?

    content is now what? A list of lines? Characters? Bytes? I-Nodes?
    Dates? Granted, it can be inferred from the fact that a file is its
    own iterator over its lines, but that is a mental step that readlines()
    frees you from doing.

    My ~0.0154 €.

    /W

    --
    My real email address is constructed by swapping the domain with the
    recipient (local part).
    Andreas Waldenburger, Jan 26, 2009
    #16
  17. On Mon, 26 Jan 2009 13:35:39 -0200, J. Cliff Dyer <>
    wrote:
    > On Sun, 2009-01-25 at 18:23 -0800, John Machin wrote:
    >> On Jan 26, 1:03 pm, "Gabriel Genellina" <>
    >> wrote:
    >> > On Sun, 25 Jan 2009 23:30:33 -0200, Tim Chase
    >> > <> wrote:
    >> > > I suppose if I were really smart, I'd dig a little deeper in the CSV
    >> > > module to sniff out the "right" way to parse tab-delimited files.
    >> >
    >> > It's so easy that not doing it is just inexcusable laziness :)
    >> > Your own example, written using the csv module:
    >> >
    >> > import csv
    >> >
    >> > f = csv.reader(open('customer_x.txt','rb'), delimiter='\t')
    >> > headers = f.next()
    >> > for line in f:
    >> >     field1, field2, field3 = line
    >> >     do_stuff()
    >>
    >> And where in all of that do you recommend that .decode(some_encoding)
    >> be inserted?
    >
    > If encoding is an issue for your application, then I'd recommend you use
    > codecs.open('customer_x.txt', 'rb', encoding='ebcdic') instead of open()


    This would be the best way *if* the csv module could handle Unicode input,
    but unfortunately this is not the case. See my other reply.

    --
    Gabriel Genellina
    Gabriel Genellina, Jan 26, 2009
    #17
  19. On Mon, 26 Jan 2009 16:10:11 +0100, Andreas Waldenburger wrote:

    > On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch <>
    > wrote:
    >
    >> On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
    >>
    >> > content = a.readlines()
    >> >
    >> > (Just because we can now write "for line in file" doesn't mean that
    >> > readlines() is *totally* redundant.)

    >>
    >> But ``content = list(a)`` is shorter. :)
    >>

    > But much less clear, wouldn't you say?


    Okay, so let's make it clearer and even shorter: ``lines = list(a)``. :)

    Ciao,
    Marc 'BlackJack' Rintsch
    Marc 'BlackJack' Rintsch, Jan 26, 2009
    #19
  20. On 26 Jan 2009 22:12:43 GMT Marc 'BlackJack' Rintsch <>
    wrote:

    > On Mon, 26 Jan 2009 16:10:11 +0100, Andreas Waldenburger wrote:
    >
    > > On 26 Jan 2009 14:51:33 GMT Marc 'BlackJack' Rintsch
    > > <> wrote:
    > >
    > >> On Mon, 26 Jan 2009 12:22:18 +0000, Sion Arrowsmith wrote:
    > >>
    > >> > content = a.readlines()
    > >> >
    > >> > (Just because we can now write "for line in file" doesn't mean
    > >> > that readlines() is *totally* redundant.)
    > >>
    > >> But ``content = list(a)`` is shorter. :)
    > >>

    > > But much less clear, wouldn't you say?

    >
    > Okay, so let's make it clearer and even shorter: ``lines =
    > list(a)``. :)
    >

    OK, you win. :)

    /W

    --
    My real email address is constructed by swapping the domain with the
    recipient (local part).
    Andreas Waldenburger, Jan 26, 2009
    #20
