Reading in cooked mode (was Re: Python MSI not installing, log file showing name of a Vietnamese comm


Chris Angelico

But a *text file* is a concatenation of lines. The "text file" model is
important enough that nearly all programming languages offer a line-based
interface to files, and some (Python at least, possibly others) make it
the default interface so that iterating over the file gives you lines
rather than bytes -- even in "binary" mode.

And lines are delimited entities. A text file is a sequence of lines,
separated by certain characters.

There is: call strip('\n') on the line after reading it. Perl and Ruby
spell it chomp(). Other languages may spell it differently. I don't know
of any language that automatically strips newlines, probably because you
can easily strip the newline from the line, but if the language did it
for you, you cannot reliably reverse it.
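The irreversibility point can be checked in one line (a quick sketch):

```python
# Stripping the newline is easy for the caller, but not reversible:
# once stripped, a terminated and an unterminated line look the same.
assert "spam\n".rstrip("\n") == "spam".rstrip("\n") == "spam"
```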

That's not a tidy way to iterate, that's a way to iterate and then do
stuff. Compare:

for line in f:
    # process line with newline

for line in f:
    line = line.strip("\n")
    # process line without newline, as long as it doesn't have \r\n or something

for line in f:
    line = line.split("$")
    # process line as a series of dollar-delimited fields

The second one is more like the third than the first. Python does not
offer a tidy way to do the common thing, which is reading the content
of the line without its terminator.
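A tidy wrapper is only a few lines, though. A minimal sketch (`chomped` is a made-up name, not part of the stdlib):

```python
import io

def chomped(f):
    """Yield each line of f without its \\n or \\r\\n terminator."""
    for line in f:
        yield line.rstrip("\r\n")

# StringIO standing in for an open file:
lines = list(chomped(io.StringIO("a\nb\r\nc")))
assert lines == ["a", "b", "c"]
```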

I have no problem with that: when interpreting text as a record with
delimiters, e.g. from a CSV file, you normally exclude the delimiter.
Sometimes the line terminator does double-duty as a record delimiter as
well.

So why is the delimiter excluded when you treat the file as CSV, but
included when you treat the file as lines of text?

Reading from a file is considered a low-level operation. Reading
individual bytes in binary mode is the lowest level; reading lines in
text mode is the next level, built on top of the lower binary mode. You
build higher protocols on top of one or the other of those modes, e.g.
"read a zip file" would be built on top of binary mode, "read a csv file"
would be built on top of text mode.

I agree that reading a binary file is the lowest level. Reading a text
file is higher level, but to me "reading a text file" means "reading a
binary file and decoding it into Unicode text", and not "... and
dividing it into lines". Bear in mind that reading a CSV file can be
built on top of a Unicode decode, but not on a line-based iteration
(in case there are newlines inside quotes).
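The csv module illustrates the point: a quoted field may contain a newline, so one record is not always one physical line. A minimal sketch with an in-memory file:

```python
import csv
import io

# One record spread over two physical lines because of the quoted field;
# a naive line-by-line split would break this record in two.
data = 'a,"first\nsecond",c\n'
rows = list(csv.reader(io.StringIO(data)))
assert rows == [["a", "first\nsecond", "c"]]
```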

As a low-level protocol, you ought to be able to copy a file without
changing it by reading it in then writing it out:

for blob in infile:
    outfile.write(blob)


ought to work whether you are in text mode or binary mode, so long as the
infile and outfile are opened in the same mode. If Python were to strip
newlines, that would no longer be the case.
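Using in-memory streams as stand-ins for the two files, the round trip can be checked directly (a sketch):

```python
import io

# Because iteration keeps each terminator, a line-by-line copy
# reproduces the text exactly -- including an unterminated last line.
src = io.StringIO("a\nb\nc")      # last line has no newline
dst = io.StringIO()
for blob in src:
    dst.write(blob)
assert dst.getvalue() == "a\nb\nc"
```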

All you need is a "writeln" method that re-adds the newline, and then
it's correctly round-tripping, based on what you've already stated
about the file: that it's a series of lines of text. It might not be a
byte-equivalent round-trip if you're changing newline style, any more
than it already won't be for other reasons (file encoding, for
instance). By reading the file as a series of Unicode lines, you're
declaring that it contains lines of Unicode text, not arbitrary bytes,
and so a valid representation of those lines of Unicode text is a
faithful reproduction of the file. If you want a byte-for-byte
identical file, open it in binary mode to do the copy; that's what we
learn from FTPing files between Linux and Windows.

(Even high-level protocols should avoid unnecessary modifications to
files. One of the more annoying, if not crippling, limitations to the
configparser module is that reading an INI file in, then writing it out
again destroys the high-level structure of the file: comments and blank
lines are stripped, and records may be re-ordered.)

Precisely. If you read it as an INI file and then rewrite it as an INI
file, you risk damaging that sort of thing. If you parse a file as a
Python script, and then reconstitute it from the AST (with one of the
unparsers available), you have a guarantee that the result will
execute the exact same code. But it won't be the same file (although
Python's AST does guarantee order, unlike your INI file example).
Actually, this might be a useful transformation to do, sometimes -
part of a diff suite, maybe - if the old and new versions are
identical after an AST parse/unparse transformation, you don't need to
re-run tests, because there's no way a code bug can have been
introduced.
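With ast.unparse (available since Python 3.9, one of the unparsers mentioned above), the diff-suite check might look like this sketch:

```python
import ast

def canon(src):
    """Canonical form of a Python source: parse to AST, then unparse."""
    return ast.unparse(ast.parse(src))

# Two sources differing only in formatting canonicalize identically,
# so a formatting-only change cannot have introduced a code bug.
assert canon("x=1+2\n") == canon("x = (1 + 2)\n")
```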

ChrisA
 

Steven D'Aprano

And lines are delimited entities. A text file is a sequence of lines,
separated by certain characters.

Are they really separated, or are they terminated?

a\nb\n

Three lines or two? If you say three, then you consider \n to be a
separator; if you say two, you consider it a terminator.

The thing is, both points of view are valid. If \n is a terminator, then
the above is valid text, but this may not be:

a\nb\nc

since the last line is unterminated. (You might be generous and allow
that every line must be terminated except possibly the last. Or you might
be strict and consider the last line to be broken.)

In practice, most people swap between one point of view and the other
without warning: I might say that "a\nb\n" has two lines terminated with
\n, and then an instant later say that the file ends with a blank line,
which means it has three lines, not two. Or you might say that "a\nb\n"
has three lines separated by \n, and an instant later claim that the last
line contains the letter "b". So common language about text files tends
to be inconsistent and flip-flop between the two points of view, a bit
like the Necker Cube optical illusion.

Given that the two points of view are legitimate and useful, how should a
programming language treat lines? If the language treats the newline as
separator, and strips it, then those who want to treat it as terminator
are screwed -- you cannot tell if the last line is terminated or not. But
if the language treats the newline as a terminator, and so part of the
line, it is easy for the caller to remove it. The decision ought to be a
no-brainer: keep the newline in place, let the user strip it if they
don't want it.
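StringIO makes concrete what keeping the newline buys you (a quick sketch):

```python
import io

# With terminators kept, the caller can still see whether the final
# line was terminated -- information that stripping would destroy.
lines = list(io.StringIO("a\nb"))
assert lines == ["a\n", "b"]
assert not lines[-1].endswith("\n")   # last line unterminated
```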

Here's another thought for you: words are separated by spaces. Nobody
ever considers the space to be part of the word[1]. I think that nearly
everyone agrees that both "spam eggs" and "spam      eggs" contain two
words, "spam" and "eggs". I don't think anyone would say that the second
example includes seven words, five of which are blank. Would we like to
say that "spam\n\n\n\n\n\neggs" contains two lines rather than seven?


That's not a tidy way to iterate, that's a way to iterate and then do
stuff. Compare:

for line in f:
    # process line with newline

for line in f:
    line = line.strip("\n")
    # process line without newline, as long as it doesn't have \r\n or something

With universal newline support, you can completely ignore the difference
in platform-specific end-of-line markers. By default, Python will convert
them to and from \n when you read or write a text file, and you'll never
see any difference. Just program using \n in your source code, and let
Python do the right thing. (If you need to handle end of line markers
yourself, you can easily disable universal newline support.)
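Both behaviours can be demonstrated with StringIO, whose newline argument mirrors open()'s (a sketch):

```python
import io

text = "a\r\nb\r\n"
# newline=None (the default for open) folds \r\n down to \n on read...
assert list(io.StringIO(text, newline=None)) == ["a\n", "b\n"]
# ...while newline="" hands you the raw end-of-line markers.
assert list(io.StringIO(text, newline="")) == ["a\r\n", "b\r\n"]
```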

f = (line.rstrip('\n') for line in f)
for line in f:
    # process line

Everything[2] in computer science can be solved by an additional layer of
indirection :)


[...]
So why is the delimiter excluded when you treat the file as CSV, but
included when you treat the file as lines of text?

Because reading lines of text is more general than reading CSV records.
Therefore it has to make fewer modifications to the raw content.

I once had a Pascal compiler that would insert spaces, indentation, even
change the case of words. Regardless of what you actually typed, it would
pretty-print your code, then write the pretty-printed output when you
saved. Likewise, if you read in a Pascal source file from an external
editor, then saved it, it would overwrite the original with its
pretty-printed version. That sort of thing may or may not be appropriate for a
high-level tool which is allowed to impose whatever structure it likes on
its data files, but it would be completely inappropriate for a low-level
almost-raw process (more like lightly blanched than cooked) like reading
from a text file in Python.

I agree that reading a binary file is the lowest level. Reading a text
file is higher level, but to me "reading a text file" means "reading a
binary file and decoding it into Unicode text", and not "... and
dividing it into lines". Bear in mind that reading a CSV file can be
built on top of a Unicode decode, but not on a line-based iteration (in
case there are newlines inside quotes).

Of course you can build a CSV reader on top of line-based iteration. You
just need an accumulator inside your parser: if, at the end of the line,
you are still inside a quoted field, keep processing over the next line.
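A toy version of that accumulator (assuming simple quoting, with no concern for escaped quotes inside fields):

```python
def csv_records(lines):
    """Toy CSV record accumulator: if a line ends inside a quoted
    field (odd number of double quotes so far), pull in more lines."""
    buf = ""
    for line in lines:
        buf += line
        if buf.count('"') % 2 == 0:    # quotes balanced: record complete
            yield buf.rstrip("\n")
            buf = ""
    if buf:                            # unterminated trailing record
        yield buf.rstrip("\n")

recs = list(csv_records(['a,"first\n', 'second",c\n', 'd,e,f\n']))
assert recs == ['a,"first\nsecond",c', 'd,e,f']
```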

All you need is a "writeln" method that re-adds the newline, and then
it's correctly round-tripping, based on what you've already stated about
the file: that it's a series of lines of text.

No, that can't work. If the last line of the input file lacks a line
terminator, the writeln will add one. Let's make it simple: if your data
file consists of only a single line, "spam", the first blob you receive
will be "spam". If it consists of "spam\n" instead, the first blob you
receive will also be "spam". Should you call write() or writeln()?
Whichever you choose, you will get it wrong for some files.
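The ambiguity is easy to exhibit (a sketch; `copy_with_writeln` is a hypothetical strip-then-writeln copier, not a real API):

```python
import io

def copy_with_writeln(src):
    """Hypothetical strip-then-writeln copy: writes every line with
    exactly one trailing newline, whatever the input had."""
    out = io.StringIO()
    for line in src:
        out.write(line.rstrip("\n") + "\n")
    return out.getvalue()

# "spam" and "spam\n" collapse to the same output, so at least one of
# them cannot round-trip.
assert copy_with_writeln(io.StringIO("spam")) == "spam\n"
assert copy_with_writeln(io.StringIO("spam\n")) == "spam\n"
```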

It might not be a
byte-equivalent round-trip if you're changing newline style, any more
than it already won't be for other reasons (file encoding, for
instance).

Ignore encodings and newline style. They are irrelevant. So long as the
input and output writer use the same settings, the input will be copied
unchanged.

By reading the file as a series of Unicode lines, you're
declaring that it contains lines of Unicode text, not arbitrary bytes,
and so a valid representation of those lines of Unicode text is a
faithful reproduction of the file. If you want a byte-for-byte identical
file, open it in binary mode to do the copy; that's what we learn from
FTPing files between Linux and Windows.

Both "spam" and "spam\n" are valid Unicode. By stripping the newline, you
make it impossible to distinguish them on the last line.




[1] For some definition of "nobody". Linguists consider that some words
contain a space, e.g. "lawn tennis", "science fiction". This is called
the open or spaced form of compound words. However, the trailing space at
the end of the word is never considered part of the word.

[2] Except efficiency.
 

Chris Angelico

Are they really separated, or are they terminated?

a\nb\n

Three lines or two? If you say three, then you consider \n to be a
separator; if you say two, you consider it a terminator.

The thing is, both points of view are valid. If \n is a terminator, then
the above is valid text, but this may not be:

a\nb\nc

since the last line is unterminated. (You might be generous and allow
that every line must be terminated except possibly the last. Or you might
be strict and consider the last line to be broken.)

It is a problem, and the correct usage depends on context.

I'd normally say that the first consists of two lines, the first being
"a" and the second being "b", and there is no third blank line. The
first line still doesn't consist of "a\n", though. It's more like how
environment variables are provided to a C program: separated by \0 and
the last one has to be terminated too.

In some situations, you would completely ignore the "c" in the last
example. When you're watching a growing log file, buffering might mean
that you see half of a line. When you're reading MUD text from a
socket, a partial line probably means it's broken across two packets,
and the rest of the line is coming. Either way, you don't process the
"c" in case it's the beginning of a line; you wait till you see the
"\n" separator that says that you now have a complete line. Got some
out-of-band indication that there won't be any more (like an EOF
signal)? Assume that "c" is the whole line, or assume the file is
damaged, and proceed accordingly.
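That buffering discipline is only a few lines (a toy sketch, not a real socket API):

```python
def split_complete(buffer):
    """Return (complete_lines, partial_remainder) for a growing buffer:
    everything before the last "\\n" is complete, the rest is held back."""
    *complete, rest = buffer.split("\n")
    return complete, rest

done, rest = split_complete("a\nb\nc")
assert done == ["a", "b"]
assert rest == "c"        # held back until its "\n" (or EOF) arrives
```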

Given that the two points of view are legitimate and useful, how should a
programming language treat lines? If the language treats the newline as
separator, and strips it, then those who want to treat it as terminator
are screwed -- you cannot tell if the last line is terminated or not.

That's my point, though. If you want to treat a file as lines, you
usually won't care whether the last one is terminated or not. You'll
have some means of defining lines, which might mean discarding the
last, or whatever it is, but the stream "a\nb\nc" will either become
["a", "b", "c"] or ["a", "b"] or ValueError or something, and that
list of lines is really all you care about. Universal newlines, as you
mention, means that "a\r\nb\r\n" will become the exact same thing as
"a\nb\n", and there's no way to recreate that difference - because it
*does not matter*.

Here's another thought for you: words are separated by spaces. Nobody
ever considers the space to be part of the word[1]. I think that nearly
everyone agrees that both "spam eggs" and "spam      eggs" contain two
words, "spam" and "eggs". I don't think anyone would say that the second
example includes seven words, five of which are blank. Would we like to
say that "spam\n\n\n\n\n\neggs" contains two lines rather than seven?

Ahh, that's a tricky one. For the simple concept of iterating over the
lines in a file, I would have to say that it's seven lines, five of
which are blank, same as "spam      eggs".split(" ") returns a
seven-element list. The tricky bit is that the term "word" means
"*non-empty* sequence of characters", which means that after splitting
on spaces, you discard all empty tokens in the list; but normally
"line" does NOT have that non-empty qualifier. However, a double
newline often means "paragraph break" as opposed to "line break", so
there's additional meaning applied there; that might be four
paragraphs, the last one unterminated (and a paragraph might well be
terminated by a single newline rather than two), and in some cases
might be squished to just two paragraphs because the paragraph itself
is required to be non-empty.
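The two readings can be put side by side (a quick sketch):

```python
import re

text = "spam\n\n\n\n\n\neggs"
# As lines: seven, five of them blank.
assert len(text.split("\n")) == 7
# As paragraphs (runs of newlines acting as separators): two.
assert re.split(r"\n{2,}", text) == ["spam", "eggs"]
```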

With universal newline support, you can completely ignore the difference
in platform-specific end-of-line markers. By default, Python will convert
them to and from \n when you read or write a text file, and you'll never
see any difference. Just program using \n in your source code, and let
Python do the right thing. (If you need to handle end of line markers
yourself, you can easily disable universal newline support.)

So why should we have to explicitly disable universal newlines to undo
the folding of \r\n and \n down to a single "end of line" indication,
but automatically get handling of \n or absence at the end of the
file? Surely that's parallel. In each case, you're taking the set of
lines as your important content, and folding together distinctions
that don't matter.

I once had a Pascal compiler that would insert spaces, indentation, even
change the case of words. Regardless of what you actually typed, it would
pretty-print your code, then write the pretty-printed output when you
saved. Likewise, if you read in a Pascal source file from an external
editor, then saved it, it would overwrite the original with its
pretty-printed version. That sort of thing may or may not be appropriate for a
high-level tool which is allowed to impose whatever structure it likes on
its data files, but it would be completely inappropriate for a low-level
almost-raw process (more like lightly blanched than cooked) like reading
from a text file in Python.

GW-BASIC used to do something similar, always upper-casing keywords
like "print" and "goto", and putting exactly one space between the
line number and the code; in the file that it stored on the disk, and
probably what it stored in memory, those were stored as single tokens.
Obviously the process of turning "print" into a one-byte marker and
then back into a word is lossy, so the result comes out as "PRINT"
regardless of how you typed it. Not quite the same, but it does give a
justification for the conversion (hey, it was designed so you could
work off floppy disks, so space was important), and of course the
program would run just the same.

Of course you can build a CSV reader on top of line-based iteration. You
just need an accumulator inside your parser: if, at the end of the line,
you are still inside a quoted field, keep processing over the next line.

Sure, but that's reaching past the line-based iteration. You can't
give it a single line and get back the split version; it has to be a
stateful parser that comprehends the whole file. But you can give it
Unicode data and have it completely ignore the byte stream that
produced it - which you can't do with, say, a zip reader.

No, that can't work. If the last line of the input file lacks a line
terminator, the writeln will add one. Let's make it simple: if your data
file consists of only a single line, "spam", the first blob you receive
will be "spam". If it consists of "spam\n" instead, the first blob you
receive will also be "spam". Should you call write() or writeln()?
Whichever you choose, you will get it wrong for some files.

But you'll produce a file full of lines. You might not have something
perfectly identical, byte for byte, but it will have the same lines,
and the process will be idempotent.

Ignore encodings and newline style. They are irrelevant. So long as the
input and output writer use the same settings, the input will be copied
unchanged.

Newline style IS relevant. You're saying that this will copy a file perfectly:

out = open("out", "w")
for line in open("in"):
    out.write(line)

but it wouldn't if the iteration and write stripped and recreated
newlines? Incorrect, because this version will collapse \r\n into \n.
It's still a *text file copy*. (And yes, I know about 'with'. Shut
up.) It's idempotent, not byte-for-byte perfect.
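With StringIO standing in for the two files, both claims (the collapse and the idempotence) can be checked in a sketch:

```python
import io

def text_copy(data):
    """Line-by-line text-mode copy, StringIO standing in for files."""
    out = io.StringIO()
    for line in io.StringIO(data, newline=None):  # universal newlines
        out.write(line)
    return out.getvalue()

once = text_copy("a\r\nb\r\n")
assert once == "a\nb\n"            # \r\n collapsed to \n
assert text_copy(once) == once     # a second pass changes nothing
```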

ChrisA
 

Mark H Harris

Newline style IS relevant. You're saying that this will copy a file perfectly:

out = open("out", "w")
for line in open("in"):
    out.write(line)

but it wouldn't if the iteration and write stripped and recreated
newlines? Incorrect, because this version will collapse \r\n into \n.
It's still a *text file copy*. (And yes, I know about 'with'. Shut
up.) It's idempotent, not byte-for-byte perfect.

Which was my point in the first place about new-line standards. We all
know why it's important to collapse \r\n into \n, but why(?) in a
general way would this be the universal desired end? (rhetorical) Your
example of byte-for-byte perfect copy is one good case (they are not).
Another might be controller code (maybe ancient) where the \r is
'required' and collapsing it to \n won't work on the device (tty, or
other).

There does need to be a text file standard where what is desired is a
file of "lines". Iterating over the file object should return the
"lines" on any system platform, without the user being required to strip
off the line-end (newline \n) delimiter U+000a. The delimiter does not
matter.

What Python has done by collapsing the \r\n into \n is to hide the real
problem (non standard delimiters between platforms) and in the process
actually 'removes' possibly important information (\r). {lossy}

We don't really use real tty devices any longer which require one code
to bring the print head carriage back (\r) and one code to index the
paper platen (\n). Screen I/O doesn't work that way any longer either.
It's time to standardize the newline and/or text-file line-end delimiters.

marcus
 

Dennis Lee Bieber

Which was my point in the first place about new-line standards. We all
know why it's important to collapse \r\n into \n, but why(?) in a
general way would this be the universal desired end? (rhetorical) Your
example of byte-for-byte perfect copy is one good case (they are not).
Another might be controller code (maybe ancient) where the \r is
'required' and collapsing it to \n won't work on the device (tty, or
other).

In ancient times, having \r and \n perform different actions was
almost mandated...

Ever seen how a password prompt on a Teletype was handled? By using a
bare \r to overtype the input field with X, $, probably M, O, and 8 too...

{And I recall standard practice was to hit \r, to return the carriage, \n
for next line, and one RUBOUT to provide a delay while the carriage
returned to the left <G>}
 

Mark H Harris

{And I recall standard practice was to hit \r, to return the carriage, \n
for next line, and one RUBOUT to provide a delay while the carriage
returned to the left <G>}

Yes, yes... I remember well, there had to be a delay (of some type) to
wait for the horse and carriage to get from the right side of the field
to the left. Aaah, the good ol' days.

marcus
 

Dennis Lee Bieber

Yes, yes... I remember well, there had to be a delay (of some type) to
wait for the horse and carriage to get from the right side of the field
to the left. Aaah, the good ol' days.
A couple of us managed to "steal" the school login/password (don't
think we ever used it, but...)... The teaching assistant didn't notice the
paper tape punch was active when persuaded to login to let us run a short
program (high school BASIC class, with a dial-up teletype). Playing back
the tape and manually spinning the platen during the password
obscuration/input phase gave us the plain text.
 

Mark H Harris

A couple of us managed to "steal" the school login/password (don't
think we ever used it, but...)... The teaching assistant didn't notice the
paper tape punch was active when persuaded to login to let us run a short
program (high school BASIC class, with a dial-up teletype). Playing back
the tape and manually spinning the platen during the password
obscuration/input phase gave us the plain text.

I still have one of my old BASIC tapes from way back in the day; I
wanted to get the code back, or just remember why I had saved the tape?

One of my linux user group buddies locally rigged up an optical reader
(IF) to a single-board microcontroller... we pulled the tape by hand
using the center drive holes (sprocket holes) as a strobe and after a
couple of false attempts read the entire tape into a text file.

That tape still has the castor oil smell of the tty that produced it;
smell has a very strong memory association / I can still hear that
thing running in my head. ... haven't seen one physically in years.

marcus
 
 
