Is there a Unicode EOF mark, like the DOS ASCII Ctrl-Z or the Unix Ctrl-D?


Martin v. Löwis

I couldn't find one. (Hi Martin ;-)

No, there is no need to have one (neither is there a need to have one
for plain ASCII files): The end-of-file is when the file ends. Most
operating systems support a notion of a "file size", and the file ends
when file-size bytes have been consumed.

Why Microsoft decided to use Ctrl-Z in text files is beyond me; it does
not fulfil any useful function. Ctrl-D on Unix is *not* an EOF mark: no
file ever contains Ctrl-D (or if it did, it would not be interpreted
as an EOF mark). Instead, Ctrl-D signals the end of data entered into the
terminal (which does not have a pre-determined size), so Ctrl-D has its
usual EOT semantics in Unix.

So the question would only be meaningful if you had some device that
uses a character stream, instead of a byte stream. I'm not aware of
any such device - if you had one, recycling EOT would probably be a
good idea.

Regards,
Martin
 

Michael Geary

Martin said:
No, there is no need to have one (neither is there a need to have one
for plain ASCII files): The end-of-file is when the file ends. Most
operating systems support a notion of a "file size", and the file ends
when file-size bytes have been consumed.

Why Microsoft decided to use Ctrl-Z in text files is beyond me; it does
not fulfil any useful function...

It came from CP/M, which, believe it or not, had *no* way to specify an exact
file length. File lengths were measured in sectors, not bytes, so there had
to be some way to tell where a text file ended, and CP/M used Ctrl+Z.

MS-DOS picked up this convention, although if memory serves it always had
exact file lengths even in version 1.0.

Nobody uses Ctrl+Z in Windows/DOS text files any more, although I think the
COPY command still respects it if you use the /A switch or concatenate
files.

-Mike
 

Bob Gailer

On Win 2K the Task Scheduler writes a log file that appears to be encoded.
The first line is:

'\xff\xfe"\x00T\x00a\x00s\x00k\x00
\x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
\x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

My goal is to read this file and process it using Python string processing.

I am disappointed in the codecs module documentation. I had hoped to find
the answer there, but can't.

I presume this is an encoding, and that '\xff\xfe' defines the encoding.
How does one map '\xff\xfe' to an "encoding"?
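One way to do that mapping (a minimal sketch, not part of the original thread): the codecs module exposes the standard byte order marks as constants, so the file's first bytes can be compared against them to pick a codec name. The `BOM_TO_CODEC` table and `sniff_codec` helper below are illustrative names, not stdlib APIs.

```python
import codecs

# Map a leading byte order mark to a codec name.  The 4-byte UTF-32
# BOMs must be checked first, because UTF-32-LE starts with the same
# two bytes as UTF-16-LE.
BOM_TO_CODEC = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF8,     'utf-8-sig'),
]

def sniff_codec(raw, default='ascii'):
    """Return a codec name guessed from a BOM prefix, else a default."""
    for bom, name in BOM_TO_CODEC:
        if raw.startswith(bom):
            return name
    return default

first_bytes = b'\xff\xfe"\x00T\x00a\x00s\x00k\x00'
print(sniff_codec(first_bytes))   # -> utf-16-le
```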

Bob Gailer
(e-mail address removed)
303 442 2625
 

Piet van Oostrum

MG> It came from CP/M, which believe it or not had *no* way to specify an exact
MG> file length. File lengths were measured in sectors, not bytes. So there had
MG> to be some way to tell where a text file ended, and CP/M used Ctrl+Z.

MG> MS-DOS picked up this convention, although if memory serves it always had
MG> exact file lengths even in version 1.0.

MG> Nobody uses Ctrl+Z in Windows/DOS text files any more, although I think the
MG> COPY command still respects it if you use the /A switch or concatenate
MG> files.

I believe even stdio respects it when a file is opened in text mode. This
is a common problem when people read binary files without specifying the
"b" modifier: Apart from the stripped CR bytes they are often surprised
that their programs stop reading early in the file. This even happens in
Python.
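The habit Piet alludes to can be shown in a few lines. A sketch in modern Python, with the caveat that the early stop at Ctrl-Z only ever occurs in C-stdio text mode on Windows (and in Python 2's text mode there); binary mode is the portable fix:

```python
import os
import tempfile

# 0x1A is Ctrl-Z.  In Windows text mode (Python 2 / C stdio) a read
# in mode "r" would stop here; in binary mode it is just data.
payload = b'before\x1aafter\r\n'
path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
with open(path, 'wb') as f:
    f.write(payload)

# Binary mode: no Ctrl-Z early stop, no \r\n -> \n translation.
with open(path, 'rb') as f:
    data = f.read()

assert data == payload
```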
 

Gerhard Häring

Bob said:
On Win 2K the Task Scheduler writes a log file that appears to be
encoded. The first line is:

'\xff\xfe"\x00T\x00a\x00s\x00k\x00
\x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
\x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

I wrote that into a file, and used the 'file' utility from Cygwin, which
tells me:

$ file t
t: Little-endian UTF-16 Unicode character data, with CR line terminators

AFAIK the codecs module doesn't have autodetection of codecs.

HTH,

-- Gerhard
 

Bob Gailer

Gerhard said:
I wrote that into a file, and used the 'file' utility from Cygwin, which
tells me:

$ file t
t: Little-endian UTF-16 Unicode character data, with CR line terminators

That's a good start. I presume I need to use codecs.open(filename, mode[,
encoding[, errors[, buffering]]]) to read the file. What is the actual
value of the "encoding" parameter for "Little-endian UTF-16 Unicode
character data, with CR line terminators"?

Bob Gailer
(e-mail address removed)
303 442 2625
 

Duncan Booth

That's a good start. I presume I need to use codecs.open(filename,
mode[, encoding[, errors[, buffering]]]) to read the file. What is the
actual value of the "encoding" parameter for "Little-endian UTF-16
Unicode character data, with CR line terminators"?

Try:

myFile = codecs.open(filename, "r", "utf16")

If the file starts with a UTF-16 marker (either little or big endian) it
will be read correctly. If it doesn't start with either marker reading from
it will throw a UnicodeError.
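In a current Python this works as Duncan describes. A self-contained sketch (the file name is made up) that writes a BOM-prefixed little-endian UTF-16 line the way Win2K does, then reads it back through the "utf-16" codec, which consumes the BOM and detects the byte order:

```python
import codecs
import os
import tempfile

line = '"Task Scheduler Service"\r\n'
path = os.path.join(tempfile.mkdtemp(), 'log.txt')

# Write the log line the way Win2K does: BOM + UTF-16-LE bytes.
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF16_LE + line.encode('utf-16-le'))

# The plain "utf-16" codec reads the BOM and picks the endianness.
# codecs.open works on the raw bytes, so the '\r\n' comes back intact.
with codecs.open(path, 'r', encoding='utf-16') as f:
    decoded = f.read()

assert decoded == line
```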
 

Bob Gailer

That's a good start. I presume I need to use codecs.open(filename,
mode[, encoding[, errors[, buffering]]]) to read the file. What is the
actual value of the "encoding" parameter for "Little-endian UTF-16
Unicode character data, with CR line terminators"?

Try:

myFile = codecs.open(filename, "r", "utf16")

If the file starts with a UTF-16 marker (either little or big endian) it
will be read correctly. If it doesn't start with either marker reading from
it will throw a UnicodeError.

Interesting error:

UnicodeError: UTF-16 decoding error: truncated data

Bob Gailer
(e-mail address removed)
303 442 2625
 

Gerhard Häring

Bob said:
[...] UnicodeError: UTF-16 decoding error: truncated data

If I remove the last character of the example line you posted, I can
successfully convert it to a Unicode string:

>>> unicode(s, "utf-16")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf16' codec can't decode byte 0xa in position 52:
truncated data
>>> unicode(s[:-1], "utf-16")
u'"Task Scheduler Service"\r'
>>>

I'm using Python 2.3, which apparently gives more useful encoding errors
(including the position of the error).

-- Gerhard
 

Peter Hansen

Duncan said:
That's a good start. I presume I need to use codecs.open(filename,
mode[, encoding[, errors[, buffering]]]) to read the file. What is the
actual value of the "encoding" parameter for "Little-endian UTF-16
Unicode character data, with CR line terminators"?

Try:

myFile = codecs.open(filename, "r", "utf16")

I don't do unicode, but might you not want "rb" instead of just "r"
in the above? Does that argument apply to the low-level "open" or
to the codec open? In other words, when would CR-LF translation be
happening if you specified just "r"?

-Peter
 

Colin S. Miller

Bob said:
That's a good start. I presume I need to use codecs.open(filename,
mode[, encoding[, errors[, buffering]]]) to read the file. What is the
actual value of the "encoding" parameter for "Little-endian UTF-16
Unicode character data, with CR line terminators"?

Try:

myFile = codecs.open(filename, "r", "utf16")

If the file starts with a UTF-16 marker (either little or big endian) it
will be read correctly. If it doesn't start with either marker, reading
from it will throw a UnicodeError.


Interesting error:

UnicodeError: UTF-16 decoding error: truncated data

Are you doing readline on the Unicode file?
I bashed my head off this problem a few months ago, and ended up doing
codecs.open(...).read().splitlines()

I think what happens is that the codecs readline calls the underlying
readline code, which doesn't respect Unicode and instead splits at the
first \r or \n it finds; in little-endian UTF-16 this results in a string
with an odd number of bytes.
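Colin's diagnosis can be reproduced directly. This sketch encodes the thread's example line as UTF-16-LE and cuts it where a byte-oriented readline would, right after the 0x0A byte, leaving an odd-length chunk the codec rejects:

```python
data = '"Task Scheduler Service"\r\n'.encode('utf-16-le')

# A byte-oriented readline stops just after the first 0x0A byte,
# dropping the 0x00 that completes the two-byte '\n' character.
cut = data.index(b'\n') + 1
chunk = data[:cut]

assert len(data) == 52 and len(chunk) == 51   # odd number of bytes

try:
    chunk.decode('utf-16-le')
except UnicodeDecodeError as exc:
    print(exc.reason)   # truncated data
```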

Colin Miller
 

Piet van Oostrum

BG> On Win 2K the Task Scheduler writes a log file that appears to be encoded.
BG> The first line is:

BG> '\xff\xfe"\x00T\x00a\x00s\x00k\x00
BG> \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
BG> \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

BG> My goal is to read this file and process it using Python string
BG> processing.

BG> I am disappointed in the codecs module documentation. I had hoped to find
BG> the answer there, but can't.

BG> I presume this is an encoding, and that '\xff\xfe' defines the encoding.
BG> How does one map '\xff\xfe' to an "encoding"?

It's Unicode, actually little-endian UTF-16, which is the standard encoding
on Win2K. The '\xff\xfe' is the byte order mark (BOM), which signifies it
as little-endian.

But there is a trailing 0 byte missing (the line should have an even number
of bytes, as each character occupies two bytes). Of course this happens
because you think a line ends with '\n', whereas in UTF-16LE it ends with
'\n\x00'. This also means you cannot read such files with methods like
readline(). With the missing byte restored, the line decodes to
'"Task Scheduler Service"\r\n'.
 
