Re: bugs at iter file() ?

Discussion in 'Python' started by Terry Reedy, Jul 15, 2004.

  1. Terry Reedy

    "dw" <> wrote in message news:...
    > Python 2.3.4, winxp:
    >
    > I have a large text file that unknowingly contains ascii
    > character 1A, or chr(26). And doing this:
    > for line in file(sys.argv[1]):
    >     print line
    > would stop iteration at the specific line containing ascii
    > char 1A, without raising exception or warning, although
    > there were still remaining lines which had not been
    > iterated.


    To add to what Tim said: From the viewpoint of Windows in its default mode,
    there are no remaining lines. ^Z is the end of file and anything after
    that is accidental junk filling out the remainder of the disk block.

    tjr
     
    Terry Reedy, Jul 15, 2004
    #1

  2. Terry Reedy wrote:
    > To add to what Tim said: From the viewpoint of Windows in
    > its default mode, there are no remaining lines. ^Z is the end
    > of file and anything after that is accidental junk filling out the
    > remainder of the disk block.


    Just to clarify one point... Windows itself does not have "text" or "binary"
    files, and it does not treat ^Z in a file in any special way. There are no
    special characters in files. A file is simply an array of arbitrary bytes
    with an exact length.

    For example, if you use Notepad to open a file with embedded ^Z characters,
    you will see those characters in the text (typically as right arrows,
    depending on the font). The file won't be truncated at the first ^Z.

    It's the C runtime that makes the distinction between text and binary files
    and treats ^Z specially. When you use fopen(filename,"rt") you get the
    special behavior of translating CRLF pairs to LF characters and stopping at
    the first ^Z.

    Of course, from the point of view of a Python program, it hardly matters
    whether it's Windows or the C runtime that is doing this. I just wanted to
    clarify where this special behavior is taking place--it's nothing
    fundamental to the operating system at all.
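
For a Python program that has to see the bytes after a ^Z, the usual way around the C runtime's text mode is to open the file in binary mode. A minimal sketch, assuming Python 3 syntax (the thread concerns Python 2.3.4, where file(path, "rb") would play the same role). The sample file contents and the helper name read_lines_binary are made up for illustration:

```python
import os
import tempfile

CTRL_Z = b"\x1a"  # chr(26), the ^Z byte discussed above


def read_lines_binary(path):
    """Return every line in the file as bytes, with no text-mode translation."""
    with open(path, "rb") as handle:
        # bytes.splitlines() splits on \n, \r and \r\n only, so ^Z is kept.
        return handle.read().splitlines()


# Build a small sample file with ^Z embedded in the second line.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b"first\r\nsec" + CTRL_Z + b"ond\r\nthird\r\n")

lines = read_lines_binary(sample_path)
os.remove(sample_path)
```

Here lines holds all three lines, including the one with the embedded ^Z. The trade-off is that binary mode does no CRLF translation either, which is why this sketch splits the lines itself.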

    -Mike
     
    Michael Geary, Jul 15, 2004
    #2

  3. Terry Reedy

    "Michael Geary" <> wrote in message
    news:...
    > Terry Reedy wrote:
    > > To add to what Tim said: From the viewpoint of Windows in
    > > its default mode, there are no remaining lines. ^Z is the end
    > > of file and anything after that is accidental junk filling out the
    > > remainder of the disk block.
    >
    > Just to clarify one point... Windows itself does not have "text" or
    > "binary" files, and it does not treat ^Z in a file in any special way.
    > There are no special characters in files.


    Sorry, but ^Z has meant end-of-file since, I presume, the first version of
    DOS, which I suspect copied the usage from something earlier. Example
    (Microsoft Basic manual, 1989): "When input is redirected [from terminal to
    a file], GW Basic continues to read from this source until a CTRL-Z is
    detected." Perhaps the usage has dimmed in non-DOS-based Windows, so that
    I should have said more carefully "from the viewpoint of DOS and perhaps
    DOS-based Windows and partially in modern non-DOS-based Windows ...".
    Still, in Windows XP, open a Command Prompt window and enter

    disk:\path> copy con: temp
    abd^Zdef

    where ^Z is control-Z and you get a file with 3, not 7 characters.

    The Windows version of the Python interactive interpreter exits on ^Z
    because that is, or at least was, standard behavior for interactive non-GUI
    DOS/Windows programs.

    > For example, if you use Notepad to open a file with embedded ^Z
    > characters, you will see those characters in the text (typically as
    > right arrows, depending on the font). The file won't be truncated at
    > the first ^Z.


    This surprises me a bit. Which version of Windows? Try type'ing the same
    file ('type filename') in an XP Home command prompt. Even now, it should
    be truncated (just tested this).

    07/15/2004 11:07 PM 7 temb
    ....
    C:\Documents and Settings\Terry>type temb
    abc

    I created temb as abc^Zdef with Python file.write (^Z=\032).
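
The temb experiment can be reproduced roughly as follows. This is a sketch under assumptions: it uses Python 3 syntax, writes in binary mode (the post does not say which mode was used), and puts the file in a scratch directory so that no real file is touched. The final split only imitates what 'type' showed above; it does not call 'type' itself.

```python
import os
import tempfile

payload = b"abc\x1adef"  # \x1a is the same byte as octal \032 (^Z)

# Scratch directory is an assumption; the original file lived in the
# poster's home directory.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "temb")

with open(path, "wb") as handle:
    handle.write(payload)

# The file system records all 7 bytes, matching the directory listing.
size_on_disk = os.path.getsize(path)

with open(path, "rb") as handle:
    raw = handle.read()

# What a ^Z-aware reader would display: everything before the first ^Z.
visible_to_type = raw.split(b"\x1a", 1)[0]

os.remove(path)
os.rmdir(scratch_dir)
```

The file is 7 bytes long on disk and a binary read returns all of them; only a reader that chooses to stop at ^Z sees just "abc".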

    > It's the C runtime that makes the distinction between text and binary
    > files and treats ^Z specially.


    The Microsoft Windows C runtime treats ^Z specially because that is, or at
    least was, the OS convention, possibly since before there was a C compiler
    for DOS.

    Terry J. Reedy
     
    Terry Reedy, Jul 16, 2004
    #3
  4. "Terry Reedy" <>:

    >"Michael Geary" <> wrote in message
    >news:...
    >> Terry Reedy wrote:
    >> > To add to what Tim said: From the viewpoint of Windows in
    >> > its default mode, there are no remaining lines. ^Z is the end
    >> > of file and anything after that is accidental junk filling out the
    >> > remainder of the disk block.

    >>
    >> Just to clarify one point... Windows itself does not have "text" or "binary"
    >> files, and it does not treat ^Z in a file in any special way. There are no
    >> special characters in files.


    >Sorry, but ^Z has meant end-of-file I presume from the first version of
    >DOS, which I suspect copied the usage from something previous.


    Sorry, but Michael got it right. Windows itself does not have 'text' or
    'binary' files or open modes. Have a look at CreateFile in the Platform
    SDK. You won't find anything like _TEXT or _BINARY there.

    ^Z is a carryover from CP/M to DOS, which, like crlf<->lf translation,
    got some support in various libraries, for obvious reasons. It's not
    part of the Win32 API.


    >Example
    >(Microsoft Basic manual, 1989): "When input is redirected [from terminal to
    >a file], GW Basic continues to read from this source until a CTRL-Z is
    >detected."


    So what? BASICA is an application, just like bash or sendmail.


    >Perhaps the usage has dimmed in non-DOS-based Windows, so that
    >I should have said more carefully "from the viewpoint of DOS and perhaps
    >DOS-based Windows and partially in modern non-DOS-based Windows ...".
    >Still, in Windows XP, open a Command Prompt window and enter
    >
    >disk:\path> copy con: temp
    >abd^Zdef
    >
    >where ^Z is control-Z and you get a file with 3, not 7 characters.
    >
    >The Windows version of the Python interactive interpreter exits on ^Z
    >because that is, or at least was, standard behavior for interactive non-gui
    >DOS/Windows programs


    You mean like terminating a program using a single dot on a line is, or
    at least was, standard behaviour for interactive non-gui UNIX/Linux
    applications? :)


    --
    Thank you for observing all safety precautions
     
    Wolfgang Strobl, Aug 7, 2004
    #4
