Is there a Unicode EOF mark, like the DOS ASCII Ctrl-Z or the Unix Ctrl-D?


Martin v. Löwis

I couldn't find one. (Hi Martin ;-)

No, there is no need to have one (neither is there a need to have one
for plain ASCII files): The end-of-file is when the file ends. Most
operating systems support a notion of a "file size", and the file ends
when file-size bytes have been consumed.

Why Microsoft decided to use Ctrl-Z in text files is beyond me; it does
not fulfil any useful function. Ctrl-D on Unix is *not* an EOF mark: no
file ever contains Ctrl-D (or if it did, it would not be interpreted
as an EOF mark). Instead, Ctrl-D signals the end of data entered into the
terminal (which does not have a pre-determined size), so Ctrl-D has its
usual EOT semantics in Unix.

So the question would only be meaningful if you had some device that
uses a character stream, instead of a byte stream. I'm not aware of
any such device - if you had one, recycling EOT would probably be a
good idea.

Regards,
Martin
 

Michael Geary

Martin said:
No, there is no need to have one (neither is there a need to have one
for plain ASCII files): The end-of-file is when the file ends. Most
operating systems support a notion of a "file size", and the file ends
when file-size bytes have been consumed.

Why Microsoft decided to use Ctrl-Z in text files is beyond me; it does
not fulfil any useful function...

It came from CP/M, which, believe it or not, had *no* way to specify an exact
file length. File lengths were measured in sectors, not bytes, so there had
to be some way to tell where a text file ended, and CP/M used Ctrl+Z.

MS-DOS picked up this convention, although if memory serves it always had
exact file lengths even in version 1.0.

Nobody uses Ctrl+Z in Windows/DOS text files any more, although I think the
COPY command still respects it if you use the /A switch or concatenate
files.

-Mike
 

Bob Gailer

On Win 2K the Task Scheduler writes a log file that appears to be encoded.
The first line is:

'\xff\xfe"\x00T\x00a\x00s\x00k\x00
\x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
\x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

My goal is to read this file and process it using Python string processing.

I am disappointed in the codecs module documentation. I had hoped to find
the answer there, but can't.

I presume this is an encoding, and that '\xff\xfe' defines the encoding.
How does one map '\xff\xfe' to an "encoding"?
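One way to do that mapping (a minimal sketch, not part of the original thread): the codecs module exposes the standard byte order marks as constants, so the file's first bytes can be compared against them to pick a codec name. The `BOM_TO_CODEC` table and `sniff_codec` helper below are illustrative names, not stdlib APIs.

```python
import codecs

# Map a leading byte order mark to a codec name.  The 4-byte UTF-32
# BOMs must be checked first, because UTF-32-LE starts with the same
# two bytes as UTF-16-LE.
BOM_TO_CODEC = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF8,     'utf-8-sig'),
]

def sniff_codec(raw, default='ascii'):
    """Return a codec name guessed from a BOM prefix, else a default."""
    for bom, name in BOM_TO_CODEC:
        if raw.startswith(bom):
            return name
    return default

first_bytes = b'\xff\xfe"\x00T\x00a\x00s\x00k\x00'
print(sniff_codec(first_bytes))   # -> utf-16-le
```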

Bob Gailer
(e-mail address removed)
303 442 2625
 

Piet van Oostrum

MG> It came from CP/M, which believe it or not had *no* way to specify an exact
MG> file length. File lengths were measured in sectors, not bytes. So there had
MG> to be some way to tell where a text file ended, and CP/M used Ctrl+Z.

MG> MS-DOS picked up this convention, although if memory serves it always had
MG> exact file lengths even in version 1.0.

MG> Nobody uses Ctrl+Z in Windows/DOS text files any more, although I think the
MG> COPY command still respects it if you use the /A switch or concatenate
MG> files.

I believe even stdio respects it when a file is opened in text mode. This
is a common problem when people read binary files without specifying the
"b" modifier: Apart from the stripped CR bytes they are often surprised
that their programs stop reading early in the file. This even happens in
Python.
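The habit Piet alludes to can be shown in a few lines. A sketch in modern Python, with the caveat that the early stop at Ctrl-Z only ever occurs in C-stdio text mode on Windows (and in Python 2's text mode there); binary mode is the portable fix:

```python
import os
import tempfile

# 0x1A is Ctrl-Z.  In Windows text mode (Python 2 / C stdio) a read
# in mode "r" would stop here; in binary mode it is just data.
payload = b'before\x1aafter\r\n'
path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
with open(path, 'wb') as f:
    f.write(payload)

# Binary mode: no Ctrl-Z early stop, no \r\n -> \n translation.
with open(path, 'rb') as f:
    data = f.read()

assert data == payload
```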
 

Gerhard Häring

Bob said:
On Win 2K the Task Scheduler writes a log file that appears to be
encoded. The first line is:

'\xff\xfe"\x00T\x00a\x00s\x00k\x00
\x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
\x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

I wrote that into a file, and used the 'file' utility from Cygwin, which
tells me:

$ file t
t: Little-endian UTF-16 Unicode character data, with CR line terminators

AFAIK the codecs module doesn't have autodetection of codecs.

HTH,

-- Gerhard
 

Bob Gailer

Gerhard said:
I wrote that into a file, and used the 'file' utility from Cygwin, which
tells me:

$ file t
t: Little-endian UTF-16 Unicode character data, with CR line terminators

That's a good start. I presume I need to use codecs.open(filename, mode[,
encoding[, errors[, buffering]]]) to read the file. What is the actual
value of the "encoding" parameter for "Little-endian UTF-16 Unicode
character data, with CR line terminators"?

Bob Gailer
(e-mail address removed)
303 442 2625
 

Duncan Booth

That's a good start. I presume I need to use codecs.open(filename,
mode[, encoding[, errors[, buffering]]]) to read the file. What is the
actual value of the "encoding" parameter for "Little-endian UTF-16
Unicode character data, with CR line terminators"?

Try:

myFile = codecs.open(filename, "r", "utf16")

If the file starts with a UTF-16 marker (either little or big endian) it
will be read correctly. If it doesn't start with either marker reading from
it will throw a UnicodeError.
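In a current Python this works as Duncan describes. A self-contained sketch (the file name is made up) that writes a BOM-prefixed little-endian UTF-16 line the way Win2K does, then reads it back through the "utf-16" codec, which consumes the BOM and detects the byte order:

```python
import codecs
import os
import tempfile

line = '"Task Scheduler Service"\r\n'
path = os.path.join(tempfile.mkdtemp(), 'log.txt')

# Write the log line the way Win2K does: BOM + UTF-16-LE bytes.
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + line.encode('utf-16-le'))

# The plain "utf-16" codec reads the BOM and picks the endianness.
# codecs.open works on the raw bytes, so the '\r\n' comes back intact.
with codecs.open(path, 'r', encoding='utf-16') as f:
    decoded = f.read()

assert decoded == line
```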
 

Bob Gailer

That's a good start. I presume I need to use codecs.open(filename,
mode[, encoding[, errors[, buffering]]]) to read the file. What is the
actual value of the "encoding" parameter for "Little-endian UTF-16
Unicode character data, with CR line terminators"?

Try:

myFile = codecs.open(filename, "r", "utf16")

If the file starts with a UTF-16 marker (either little or big endian) it
will be read correctly. If it doesn't start with either marker reading from
it will throw a UnicodeError.

Interesting error:

UnicodeError: UTF-16 decoding error: truncated data

Bob Gailer
(e-mail address removed)
303 442 2625
 

Gerhard Häring

Bob said:
[...] UnicodeError: UTF-16 decoding error: truncated data

If I remove the last character of the example line you posted, I can
successfully convert it to a Unicode string:

>>> unicode(s, "utf-16")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf16' codec can't decode byte 0xa in position 52:
truncated data
>>> unicode(s[:-1], "utf-16")
u'"Task Scheduler Service"\r'
>>>

I'm using Python 2.3, which apparently gives more useful encoding errors
(including the position of the error).

-- Gerhard
 

Peter Hansen

Duncan said:
That's a good start. I presume I need to use codecs.open(filename,
mode[, encoding[, errors[, buffering]]]) to read the file. What is the
actual value of the "encoding" parameter for "Little-endian UTF-16
Unicode character data, with CR line terminators"?

Try:

myFile = codecs.open(filename, "r", "utf16")

I don't do unicode, but might you not want "rb" instead of just "r"
in the above? Does that argument apply to the low-level "open" or
to the codec open? In other words, when would CR-LF translation be
happening if you specified just "r"?

-Peter
 

Colin S. Miller

Bob said:
That's a good start. I presume I need to use codecs.open(filename,
mode[, encoding[, errors[, buffering]]]) to read the file. What is the
actual value of the "encoding" parameter for "Little-endian UTF-16
Unicode character data, with CR line terminators"?

Try:

myFile = codecs.open(filename, "r", "utf16")

If the file starts with a UTF-16 marker (either little or big endian) it
will be read correctly. If it doesn't start with either marker, reading
from it will throw a UnicodeError.


Interesting error:

UnicodeError: UTF-16 decoding error: truncated data

Are you doing readline on the Unicode file?
I bashed my head off this problem a few months ago, and ended up doing
codecs.open(...).read().splitlines()

I think what happens is that the codecs readline calls the underlying
readline code, which doesn't respect Unicode and instead splits at the
first \r or \n it finds; in little-endian UTF-16 this results in a string
with an odd number of bytes.
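Colin's diagnosis can be reproduced directly. This sketch encodes the thread's example line as UTF-16-LE and cuts it where a byte-oriented readline would, right after the 0x0A byte, leaving an odd-length chunk the codec rejects:

```python
data = '"Task Scheduler Service"\r\n'.encode('utf-16-le')

# A byte-oriented readline stops just after the first 0x0A byte,
# dropping the 0x00 that completes the two-byte '\n' character.
cut = data.index(b'\n') + 1
chunk = data[:cut]

assert len(data) == 52 and len(chunk) == 51   # odd number of bytes

try:
    chunk.decode('utf-16-le')
except UnicodeDecodeError as exc:
    print(exc.reason)   # truncated data
```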

Colin Miller
 

Piet van Oostrum

BG> On Win 2K the Task Scheduler writes a log file that appears to be encoded.
BG> The first line is:

BG> '\xff\xfe"\x00T\x00a\x00s\x00k\x00
BG> \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
BG> \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

BG> My goal is to read this file and process it using Python string
BG> processing.

BG> I am disappointed in the codecs module documentation. I had hoped to find
BG> the answer there, but can't.

BG> I presume this is an encoding, and that '\xff\xfe' defines the encoding.
BG> How does one map '\xff\xfe' to an "encoding"?

It's Unicode, actually little-endian UTF-16, which is the standard encoding
on Win2K. The '\xff\xfe' is the byte order mark (BOM), which signifies it
as little-endian.

But there is a trailing 0 byte missing (the line should have an even number
of bytes, as each character occupies two bytes). Of course this happens
because you think a line ends with '\n', whereas in UTF-16LE it ends with
'\n\x00'. This also means you cannot read such files with methods like
readline(). With the missing byte restored, the line decodes to
'"Task Scheduler Service"\r\n'.
 
