Is there a Unicode EOF mark like DOS ASCII ctl-z or Unix ctl-d?

Discussion in 'Python' started by Bengt Richter, Sep 8, 2003.

  1. I couldn't find one. (Hi Martin ;-)

    Regards,
    Bengt Richter
    Bengt Richter, Sep 8, 2003
    #1

  2. Bengt Richter wrote:

    > I couldn't find one.


    Unicode subsumes the normal ASCII control characters, so U+0004 is EOT
    (end of transmission) just like 0x04 in ASCII is:

    http://www.unicode.org/charts/PDF/U0000.pdf

    Unicode also includes a _symbol_ for it at U+2404.

    --
    Erik Max Francis && && http://www.alcyone.com/max/
    __ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
    / \ The work will teach you how to do it.
    \__/ (an Estonian proverb)
    Erik Max Francis, Sep 8, 2003
    #2
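
A minimal sketch of the two code points mentioned above, assuming Python 3 and its standard unicodedata module. The helper name describe() is made up for illustration. It shows that U+0004 is an unnamed control character, while U+2404 is a separate, printable symbol.

```python
import unicodedata

EOT = "\u0004"          # the control character itself (same value as ASCII 0x04)
EOT_SYMBOL = "\u2404"   # a printable picture of that control character

def describe(ch):
    """Return (code point label, Unicode name or a placeholder) for one character."""
    label = "U+%04X" % ord(ch)
    # Control characters have no formal name in the Unicode database,
    # so a default is supplied instead of letting name() raise ValueError.
    return label, unicodedata.name(ch, "<control>")

print(describe(EOT))
print(describe(EOT_SYMBOL))
```

Neither character has any end-of-file meaning to Python itself; whether a program treats U+0004 as a terminator is purely an application convention.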

  3. Bengt Richter writes:

    > I couldn't find one. (Hi Martin ;-)


    No, there is no need to have one (neither is there a need to have one
    for plain ASCII files): The end-of-file is when the file ends. Most
    operating systems support a notion of a "file size", and the file ends
    when file-size bytes have been consumed.

    Why Microsoft decided to use ctl-z in text files is beyond me; it does
    not fulfil any useful function. ctl-d on Unix is *not* an EOF mark: no
    file ever contains ctl-d as an end marker (and if a file did contain
    that byte, it would not be interpreted as an EOF mark). Instead, ctl-d
    signals the end of data entered into the terminal (which does not have
    a pre-determined size), so ctl-d has its usual EOT semantics in Unix.

    So the question would only be meaningful if you had some device that
    uses a character stream, instead of a byte stream. I'm not aware of
    any such device - if you had one, recycling EOT would probably be a
    good idea.

    Regards,
    Martin
    Martin v. Löwis, Sep 8, 2003
    #3
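
The point that the file size alone marks the end can be checked with a rough sketch like the one below, assuming Python 3. The sample payload and the helper read_all_bytes() are invented for illustration. It writes a file with 0x04 and 0x1A in the middle and reads it back in binary mode.

```python
import os
import tempfile

def read_all_bytes(path):
    """Read a file in binary mode; the read stops only when the bytes run out."""
    with open(path, "rb") as f:
        return f.read()

# Hypothetical sample data with EOT (0x04) and Ctrl-Z (0x1A) in the middle.
payload = b"before\x04middle\x1aafter\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    data = read_all_bytes(path)
    size = os.path.getsize(path)
finally:
    os.remove(path)

# The size reported by the OS and the number of bytes read should agree,
# and the embedded control bytes should come back as ordinary data.
print(size, len(data), data == payload)
```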
  4. Martin v. Löwis wrote:
    > No, there is no need to have one (neither is there a need to have one
    > for plain ASCII files): The end-of-file is when the file ends. Most
    > operating systems support a notion of a "file size", and the file ends
    > when file-size bytes have been consumed.
    >
    > Why Microsoft decided to use ctr-z in text files is beyond me, it does
    > not fulfil any useful function...


    It came from CP/M, which believe it or not had *no* way to specify an exact
    file length. File lengths were measured in sectors, not bytes. So there had
    to be some way to tell where a text file ended, and CP/M used Ctrl+Z.

    MS-DOS picked up this convention, although if memory serves it always had
    exact file lengths even in version 1.0.

    Nobody uses Ctrl+Z in Windows/DOS text files any more, although I think the
    COPY command still respects it if you use the /A switch or concatenate
    files.

    -Mike
    Michael Geary, Sep 8, 2003
    #4
  5. Read file that starts with '\xff\xfe'

    On Win 2K the Task Scheduler writes a log file that appears to be encoded.
    The first line is:

    '\xff\xfe"\x00T\x00a\x00s\x00k\x00 \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00 \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

    My goal is to read this file and process it using Python string processing.

    I am disappointed in the codecs module documentation. I had hoped to find
    the answer there, but can't.

    I presume this is an encoding, and that '\xff\xfe' defines the encoding.
    How does one map '\xff\xfe' to an "encoding"?

    Bob Gailer

    303 442 2625


    ---
    Outgoing mail is certified Virus Free.
    Checked by AVG anti-virus system (http://www.grisoft.com).
    Version: 6.0.506 / Virus Database: 303 - Release Date: 8/1/2003
    Bob Gailer, Sep 8, 2003
    #5
  6. Re: Is there a Unicode EOF mark like DOS ASCII ctl-z or Unix ctl-d?

    >>>>> "Michael Geary" <> (MG) wrote:

    MG> Martin v. Löwis wrote:
    >> No, there is no need to have one (neither is there a need to have one
    >> for plain ASCII files): The end-of-file is when the file ends. Most
    >> operating systems support a notion of a "file size", and the file ends
    >> when file-size bytes have been consumed.
    >>
    >> Why Microsoft decided to use ctr-z in text files is beyond me, it does
    >> not fulfil any useful function...


    MG> It came from CP/M, which believe it or not had *no* way to specify an exact
    MG> file length. File lengths were measured in sectors, not bytes. So there had
    MG> to be some way to tell where a text file ended, and CP/M used Ctrl+Z.

    MG> MS-DOS picked up this convention, although if memory serves it always had
    MG> exact file lengths even in version 1.0.

    MG> Nobody uses Ctrl+Z in Windows/DOS text files any more, although I think the
    MG> COPY command still respects it if you use the /A switch or concatenate
    MG> files.

    I believe even stdio respects it when a file is opened in text mode. This
    is a common problem when people read binary files without specifying the
    "b" modifier: Apart from the stripped CR bytes they are often surprised
    that their programs stop reading early in the file. This even happens in
    Python.
    --
    Piet van Oostrum
    URL: http://www.cs.uu.nl/~piet [PGP]
    Piet van Oostrum, Sep 8, 2003
    #6
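
To see why a text-mode read of binary data can stop early, here is a small sketch that emulates the legacy behavior by hand instead of relying on any platform's text mode. The function legacy_text_read() and the sample bytes are made up for illustration, and the emulation is only an approximation of what a Ctrl-Z-aware C runtime does.

```python
CTRL_Z = b"\x1a"

def legacy_text_read(raw):
    """Emulate a CP/M-style text read: stop at the first Ctrl-Z, then map CR LF to LF."""
    end = raw.find(CTRL_Z)
    if end != -1:
        raw = raw[:end]  # everything after the marker is silently ignored
    return raw.replace(b"\r\n", b"\n")

# Hypothetical binary payload that happens to contain a CR LF pair and 0x1A.
blob = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

print(len(blob), len(legacy_text_read(blob)))
```

Opening such files with the "b" modifier avoids both effects, because no byte is given special meaning.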
  7. Re: Read file that starts with '\xff\xfe'

    Bob Gailer wrote:
    > On Win 2K the Task Scheduler writes a log file that appears to be
    > encoded. The first line is:
    >
    > '\xff\xfe"\x00T\x00a\x00s\x00k\x00
    > \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
    > \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'


    I wrote that into a file, and used the 'file' utility from Cygwin, which
    tells me:

    $ file t
    t: Little-endian UTF-16 Unicode character data, with CR line terminators

    AFAIK the codecs module doesn't have autodetection of codecs.

    HTH,

    -- Gerhard
    Gerhard Häring, Sep 8, 2003
    #7
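
Since the codecs module does not detect encodings by itself, a byte order mark can be checked by hand. The sketch below assumes Python 3. The helper sniff_encoding() and its table are hypothetical, and the only real API used is the set of BOM constants in the codecs module.

```python
import codecs

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def sniff_encoding(head, default=None):
    """Guess a codec name from the leading bytes of a file (hypothetical helper)."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default

print(sniff_encoding(b'\xff\xfe"\x00T\x00'))
```

This is a guess, not a proof: a UTF-16 LE file whose first character is U+0000 would be mistaken for UTF-32, and a file without any mark yields the default.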
  8. Re: Read file that starts with '\xff\xfe'

    At 06:13 AM 9/8/2003, Gerhard Häring wrote:

    >Bob Gailer wrote:
    >>On Win 2K the Task Scheduler writes a log file that appears to be
    >>encoded. The first line is:
    >>'\xff\xfe"\x00T\x00a\x00s\x00k\x00
    >>\x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
    >>\x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

    >
    >I wrote that into a file, and used the 'file' utility from Cygwin, which
    >tells me:
    >
    >$ file t
    >t: Little-endian UTF-16 Unicode character data, with CR line terminators


    That's a good start. I presume I need to use codecs.open(filename, mode[,
    encoding[, errors[, buffering]]]) to read the file. What is the actual
    value of the "encoding" parameter for "Little-endian UTF-16 Unicode
    character data, with CR line terminators"?

    Bob Gailer

    303 442 2625
    Bob Gailer, Sep 8, 2003
    #8
  9. Re: Read file that starts with '\xff\xfe'

    Bob Gailer wrote:

    > That's a good start. I presume I need to use codecs.open(filename,
    > mode[, encoding[, errors[, buffering]]]) to read the file. What is the
    > actual value of the "encoding[" parameter for "Little-endian UTF-16
    > Unicode character data, with CR line terminators"


    Try:

    import codecs
    myFile = codecs.open(filename, "r", "utf16")

    If the file starts with a UTF-16 marker (either little or big endian) it
    will be read correctly. If it doesn't start with either marker reading from
    it will throw a UnicodeError.

    --
    Duncan Booth
    int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
    "\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
    Duncan Booth, Sep 8, 2003
    #9
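
For readers on Python 3, roughly the same idea can be sketched with the built-in open(), which accepts an encoding argument. This is an assumption about a newer Python than the one used in the thread. The temporary file stands in for the Task Scheduler log, and the sample bytes are the posted line plus the final 0x00.

```python
import os
import tempfile

# Bytes as posted in the thread, plus the final 0x00 that completes the "\n".
raw = b'\xff\xfe' + '"Task Scheduler Service"\r\n'.encode("utf-16-le")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # newline="" turns off newline translation, so "\r\n" arrives unchanged.
    with open(path, "r", encoding="utf-16", newline="") as f:
        lines = f.readlines()
finally:
    os.remove(path)

print(lines)
```

The "utf-16" codec reads the byte order from the mark and drops it, so the decoded line starts directly with the quote character.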
  10. Re: Read file that starts with '\xff\xfe'

    At 07:31 AM 9/8/2003, Duncan Booth wrote:

    >Bob Gailer <> wrote in
    >news::
    >
    > > That's a good start. I presume I need to use codecs.open(filename,
    > > mode[, encoding[, errors[, buffering]]]) to read the file. What is the
    > > actual value of the "encoding[" parameter for "Little-endian UTF-16
    > > Unicode character data, with CR line terminators"

    >
    >Try:
    >
    > myFile = codecs.open(filename, "r", "utf16")
    >
    >If the file starts with a UTF-16 marker (either little or big endian) it
    >will be read correctly. If it doesn't start with either marker reading from
    >it will throw a UnicodeError.


    Interesting error:

    UnicodeError: UTF-16 decoding error: truncated data

    Bob Gailer

    303 442 2625
    Bob Gailer, Sep 8, 2003
    #10
  11. Re: Read file that starts with '\xff\xfe'

    Bob Gailer wrote:
    > [...] UniCodeError: UTF-16 decoding error: truncated data


    If I remove the last character of the example line you posted, I can
    successfully convert it to a Unicode string:

    >>> s = '\xff\xfe"\x00T\x00a\x00s\x00k\x00 \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00 \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'
    >>> unicode(s, "utf-16")
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'utf16' codec can't decode byte 0xa in position 52: truncated data
    >>> unicode(s[:-1], "utf-16")
    u'"Task Scheduler Service"\r'
    >>>


    I'm using Python 2.3, which apparently gives more useful encoding errors
    (including the position of the error).

    -- Gerhard
    Gerhard Häring, Sep 8, 2003
    #11
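
The same experiment can be sketched for Python 3, where bytes.decode() replaces unicode(). The helper try_decode() is made up for illustration, and the byte string is rebuilt from the posted line.

```python
# The full, correctly terminated line is 54 bytes long.
line = b'\xff\xfe' + '"Task Scheduler Service"\r\n'.encode("utf-16-le")
cut = line[:-1]  # 53 bytes: the line as posted, ending in a lone 0x0A

def try_decode(data):
    """Return the decoded text, or the error reason if decoding fails."""
    try:
        return data.decode("utf-16")
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason

print(len(cut), try_decode(cut))
print(len(cut[:-1]), repr(try_decode(cut[:-1])))
```

An odd byte count is the giveaway: every UTF-16 code unit takes two bytes, so a buffer that ends on a single byte cannot be complete.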
  12. Re: Read file that starts with '\xff\xfe'

    Duncan Booth wrote:
    >
    > Bob Gailer <> wrote in
    > news::
    >
    > > That's a good start. I presume I need to use codecs.open(filename,
    > > mode[, encoding[, errors[, buffering]]]) to read the file. What is the
    > > actual value of the "encoding[" parameter for "Little-endian UTF-16
    > > Unicode character data, with CR line terminators"

    >
    > Try:
    >
    > myFile = codecs.open(filename, "r", "utf16")


    I don't do unicode, but might you not want "rb" instead of just "r"
    in the above? Does that argument apply to the low-level "open" or
    to the codec open? In other words, when would CR-LF translation be
    happening if you specified just "r"?

    -Peter
    Peter Hansen, Sep 8, 2003
    #12
  13. Re: Read file that starts with '\xff\xfe'

    Bob Gailer wrote:
    > At 07:31 AM 9/8/2003, Duncan Booth wrote:
    >
    >> Bob Gailer <> wrote in
    >> news::
    >>
    >> > That's a good start. I presume I need to use codecs.open(filename,
    >> > mode[, encoding[, errors[, buffering]]]) to read the file. What is the
    >> > actual value of the "encoding[" parameter for "Little-endian UTF-16
    >> > Unicode character data, with CR line terminators"

    >>
    >> Try:
    >>
    >> myFile = codecs.open(filename, "r", "utf16")
    >>
    >> If the file starts with a UTF-16 marker (either little or big endian) it
    >> will be read correctly. If it doesn't start with either marker reading
    >> from
    >> it will throw a UnicodeError.

    >
    >
    > Interesting error:
    >
    > UniCodeError: UTF-16 decoding error: truncated data

    Are you doing readline() on the Unicode file?
    I bashed my head off this problem a few months ago, and ended up doing
    codecs.open(...).read().splitlines()

    I think what happens is that the codec's readline() calls the underlying
    byte-oriented readline code, which doesn't respect Unicode and instead
    splits at the first \r or \n byte it finds; in little-endian UTF-16 this
    results in a string with an odd number of bytes.

    Colin Miller
    Colin S. Miller, Sep 8, 2003
    #13
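
The byte-splitting effect described above can be reproduced with a short sketch, assuming Python 3. io.BytesIO stands in for a byte-oriented file object, and the second line of sample text is invented for illustration.

```python
import io

text = '"Task Scheduler Service"\r\nStarted\r\n'
raw = b'\xff\xfe' + text.encode("utf-16-le")

# A byte-oriented readline() stops right after the 0x0A byte, so the 0x00
# that completes the UTF-16 "\n" is pushed onto the start of the next chunk.
byte_lines = io.BytesIO(raw).readlines()
print([len(chunk) for chunk in byte_lines])

# Decoding the whole buffer first and splitting afterwards avoids the problem.
lines = raw.decode("utf-16").splitlines()
print(lines)
```

The first chunk has an odd length, and every later chunk is shifted by one byte, which is why decoding the chunks one at a time fails while read() followed by splitlines() works.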
  14. Re: Read file that starts with '\xff\xfe'

    >>>>> Bob Gailer <> (BG) wrote:

    BG> On Win 2K the Task Scheduler writes a log file that appears to be encoded.
    BG> The first line is:

    BG> '\xff\xfe"\x00T\x00a\x00s\x00k\x00
    BG> \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00
    BG> \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n'

    BG> My goal is to read this file and process it using Python string
    BG> processing.

    BG> I am disappointed in the codecs module documentation. I had hoped to find
    BG> the answer there, but can't.

    BG> I presume this is an encoding, and that '\xff\xfe' defines the encoding.
    BG> How does one map '\xff\xfe' to an "encoding".

    It's Unicode, actually Little Endian UTF-16, which is the standard encoding
    on Win2K. The '\xff\xfe' is the Byte Order mark (BOM) which signifies it
    as Little Endian.

    >>> import codecs
    >>> codecs.BOM_UTF16_LE
    '\xff\xfe'

    But there is a trailing 0 byte missing (it should have an even number of
    bytes, as each character occupies two bytes). Of course this comes because
    you think a line ends with '\n', whereas in UTF-16LE it ends with '\n\x00'.
    This also means you cannot read them with methods like readline().

    >>> st='\xff\xfe"\x00T\x00a\x00s\x00k\x00 \x00S\x00c\x00h\x00e\x00d\x00u\x00l\x00e\x00r\x00 \x00S\x00e\x00r\x00v\x00i\x00c\x00e\x00"\x00\r\x00\n\x00'
    >>> stu=unicode(st,"utf_16")
    >>> stu
    u'"Task Scheduler Service"\r\n'
    >>> stu.encode('iso-8859-1')
    '"Task Scheduler Service"\r\n'

    --
    Piet van Oostrum
    URL: http://www.cs.uu.nl/~piet [PGP]
    Piet van Oostrum, Sep 10, 2003
    #14
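
One detail worth checking when picking a codec name is how the byte order mark is handled. The sketch below assumes Python 3 and a shortened sample string. It compares the BOM-aware "utf-16" codec with the fixed-order "utf-16-le" codec on the same bytes.

```python
import codecs

data = b'\xff\xfe' + '"Task"\r\n'.encode("utf-16-le")

bom_aware = data.decode("utf-16")       # the mark selects the byte order, then is dropped
fixed_order = data.decode("utf-16-le")  # byte order is fixed; the mark stays as U+FEFF

# Stripping the mark by hand gives the same text as the BOM-aware codec.
manual = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")

print(repr(bom_aware))
print(repr(fixed_order))
print(manual == bom_aware)
```

With the fixed-order codec the leading U+FEFF has to be removed separately, otherwise it ends up in the text and cannot be encoded to a charset such as ISO-8859-1.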