Reading from text files

Thomas Philips

In the course of playing around with file input and output, I came
across some behavior that is not quite intuitive. I created a simple
text file, test.txt, which contains only 3 lines, and which I expect
will have 5 characters (the digits 1, 2, and 3, and two newline
characters, the first after 1 and the second after 2). Here it is in
all its glory:
1
2
3

However, when I read it using open() and then view it using
file.read() and file.tell(), I get:
'1\n2\n3'
7L

Python thinks there are 7 characters in the file! If I type
file.seek(2); file.read(), I get
'\n2\n3'

but file.seek(3); file.read() gives me what I expected to get with
file.seek(2); file.read():
'2\n3'

It appears that Python sometimes counts each of the newline escape
sequences as 2 separate characters and at other times as 1 indivisible
character. What is the appropriate way to think about these
characters?

Thomas Philips
 
Jeff Epler

There are three solutions to this problem:
1. Don't use Windows
2. Only use offsets with file.seek() that were returned by file.tell()
3. Open the file in binary mode

Windows stores "\n" as the two-byte sequence "\r\n" (carriage return,
line feed) when writing text files, and transforms that two-byte
sequence back into "\n" when reading, for files opened as text files.
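
A minimal sketch of that translation, written for Python 3 (the thread
itself uses Python 2). It writes the Windows-style bytes explicitly so
the result does not depend on the platform it runs on. The file name
demo_crlf.txt is made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file; the raw bytes are written explicitly so the
# example behaves the same on any platform.
path = os.path.join(tempfile.mkdtemp(), "demo_crlf.txt")
with open(path, "wb") as f:
    f.write(b"1\r\n2\r\n3")  # the three-line file with Windows line endings

# Text mode: Python 3's default universal-newlines handling turns each
# "\r\n" into a single "\n" on read.
with open(path, "r") as f:
    text = f.read()

# Binary mode: no translation, so the carriage returns are visible.
with open(path, "rb") as f:
    raw = f.read()

print(repr(text), len(text))  # '1\n2\n3' 5
print(repr(raw), len(raw))    # b'1\r\n2\r\n3' 7
```

The 5 versus 7 here is the same mismatch as in the original question:
read() reports translated characters, while offsets count bytes on disk.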

file.seek() on Windows only knows about raw byte offsets, though, so
if you know the first line of a file is "a\n", you can't seek to 2 to
get to the second line, because that line actually starts at byte 3
(the value file.tell() would return after you read the first line).
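
Solutions 2 and 3 might look roughly like this in Python 3. This is a
sketch, not the only way to do it. The file name demo_seek.txt and the
variable names are invented for the example, and the file is written
with explicit "\r\n" bytes so it runs the same way on any platform.

```python
import os
import tempfile

# Hypothetical scratch file with two lines and Windows line endings.
path = os.path.join(tempfile.mkdtemp(), "demo_seek.txt")
with open(path, "wb") as f:
    f.write(b"a\r\nb\r\n")

# Solution 2: only seek to offsets that tell() handed back.
with open(path, "r") as f:
    first = f.readline()      # 'a\n' after newline translation
    second_start = f.tell()   # treat as opaque; do not compute it by hand
    f.read()                  # move somewhere else in the file
    f.seek(second_start)      # return to the start of the second line
    via_tell = f.readline()

# Solution 3: binary mode, where offsets are plain byte counts.
with open(path, "rb") as f:
    f.seek(3)                 # "a", "\r" and "\n" occupy bytes 0, 1 and 2
    via_bytes = f.readline()

print(repr(via_tell))   # 'b\n'
print(repr(via_bytes))  # b'b\r\n'
```

In binary mode the "\r" is not stripped, so the caller has to deal with
it; that is the price of getting predictable byte offsets.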


Jeff
 
Paul Watson

If you want to actually "see" what is in the file do a directory listing and
dump the file in hex.

On DOS/Windows do a 'dir test.txt' command and inspect the size of the file.
Then, do a 'debug test.txt' command. At the prompt, enter the 'r' command
and press enter. Examine the CX register. It will hold the size of the
file, shown in hexadecimal. Then do a 'd' command to dump the bytes out
and you can see exactly what is in the file.

On UNIX/Linux use 'ls -l test.txt' to see the directory listing containing
the size of the file. Use something like 'od -Ax -x test.txt' to see the
contents of the file. If that command does not produce something you like,
use 'man od' to find the parameters with which you are more comfortable.
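
The same inspection can also be done from Python itself. The sketch
below (Python 3) assumes the file has Windows line endings, as discussed
in this thread, and writes a stand-in copy to a temporary directory
rather than touching a real test.txt.

```python
import os
import tempfile

# Stand-in for test.txt, written with explicit Windows line endings.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "wb") as f:
    f.write(b"1\r\n2\r\n3")

# Size on disk, as a directory listing would report it.
size = os.path.getsize(path)

# Read the raw bytes and format each one as two hex digits.
with open(path, "rb") as f:
    data = f.read()
hex_dump = " ".join(format(b, "02x") for b in data)

print(size)      # 7
print(hex_dump)  # 31 0d 0a 32 0d 0a 33
```

Each 0d 0a pair is one carriage return plus line feed, which accounts
for the two extra bytes.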
 
