Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Discussion in 'Python' started by Peter Otten, Nov 30, 2010.

  1. Peter Otten Guest

    Albert Hopkins wrote:

    > On Tue, 2010-11-30 at 11:52 +0100, Peter Otten wrote:
    >> Dan Stromberg wrote:
    >>
    >> > I've got a couple of programs that read filenames from stdin, and
    >> > then open those files and do things with them. These programs sort of do
    >> > the *ix xargs thing, without requiring xargs.
    >> >
    >> > In Python 2, these work well. Irrespective of how filenames are
    >> > encoded, things are opened OK, because it's all just a stream of
    >> > single byte characters.
    >>
    >> I think you're wrong. The filenames' encoding as they are read from stdin
    >> must be the same as the encoding used by the file system. If the file
    >> system expects UTF-8 and you feed it ISO-8859-1 you'll run into errors.
    >>
    > I think this is wrong. In Unix there is no concept of filename
    > encoding. Filenames can have any arbitrary set of bytes (except '/' and
    > '\0'). But the filesystem itself neither knows nor cares about
    > encoding.


    I think you misunderstood what I was trying to say. If you write a list of
    filenames into files.txt using an encoding (ISO-8859-1, say) other than the
    one the shell uses to display file names (typically UTF-8 on Linux these
    days), and then write a Python script exist.py that reads those filenames
    and checks for the files' existence,
    $ python3 exist.py < files.txt

    will report that a file

    b'\xe4\xf6\xfc.txt'

    doesn't exist. The user looking at his editor with the encoding set to
    ISO-8859-1 seeing the line

    äöü.txt

    and then going to the console typing

    $ ls
    äöü.txt

    will be confused even though everything is working correctly.
    The system may be shuffling bytes, but the user thinks in codepoints and
    sometimes assumes that codepoints and bytes are the same.
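
    For concreteness, here is a rough sketch of what such an exist.py could
    look like (my reconstruction, not the actual script; it reads raw bytes
    from sys.stdin.buffer so nothing is reinterpreted on the way in):

    import os
    import sys

    # One filename per line; keep the names as bytes and report missing
    # files with the undecoded byte string.
    for raw in sys.stdin.buffer:
        name = raw.rstrip(b"\n")
        if not os.path.exists(name):
            print("does not exist:", name)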

    >> You always have to know either
    >>
    >> (a) both the file system's and stdin's actual encoding, or
    >> (b) that both encodings are the same.
    >>

    > If this is true, then I think that it is wrong to do in Python3. Any
    > language should be able to deal with the filenames that the host OS
    > allows.
    >
    > Anyway, going on with the OP.. can you open stdin so that you can accept
    > arbitrary bytes instead of strings and then open using the bytes as the
    > filename?


    You can access the underlying stdin.buffer that feeds you the raw bytes with
    no attempt to shoehorn them into codepoints. You can use filenames that are
    not valid in the encoding that the system uses to display filenames:

    $ ls
    $ python3
    Python 3.1.1+ (r311:74480, Nov 2 2009, 15:45:00)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> with open(b"\xe4\xf6\xfc.txt", "w") as f:

    .... f.write("hello\n")
    ....
    6
    >>>

    $ ls
    ???.txt
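
    To tie that back to the original xargs-like use case, a minimal sketch
    (my example; it assumes one filename per line on stdin) that opens each
    named file as bytes and never needs to know the filenames' encoding:

    import sys

    # Both the filename list and the filenames themselves stay bytes.
    for raw in sys.stdin.buffer:
        name = raw.rstrip(b"\n")
        with open(name, "rb") as f:
            print(name, len(f.read()))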

    > I don't have that much experience with Python3 to say for sure.


    Me neither.

    Peter
     
    Peter Otten, Nov 30, 2010
    #1

  2. Nobody Guest

    Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

    On Tue, 30 Nov 2010 18:53:14 +0100, Peter Otten wrote:

    >> I think this is wrong. In Unix there is no concept of filename
    >> encoding. Filenames can have any arbitrary set of bytes (except '/' and
    >> '\0'). But the filesystem itself neither knows nor cares about
    >> encoding.

    >
    > I think you misunderstood what I was trying to say. If you write a list of
    > filenames into files.txt, and use an encoding (ISO-8859-1, say) other than
    > that used by the shell to display file names (on Linux typically UTF-8 these
    > days) and then write a Python script exist.py that reads filenames and
    > checks for the files' existence,


    I think you misunderstood.

    In the Unix kernel, there aren't any encodings. Strings of bytes are
    /just/ strings of bytes. A text file containing a list of filenames
    doesn't /have/ an encoding. The filenames passed to API functions don't
    /have/ an encoding.

    This is why Unix filenames are case-sensitive: because there isn't any
    "case". The number 65 has no more in common with the number 97 than it
    does with the number 255. The fact that 65 is the ASCII code for "A" while
    97 is the ASCII code for "a" doesn't come into it. Case-insensitive
    filenames require knowledge of the encoding in order to determine when
    filenames are "equivalent". DOS/Windows tried this and never really got it
    right (it works fine on a standalone system, or within later versions of
    a Windows-only ecosystem, but becomes a nightmare when files get
    transferred between systems via older or non-Microsoft channels).

    Python 3.x's decision to treat filenames (and environment variables) as
    text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
    Python 2.x is still around when Python 4 is released.
     
    Nobody, Dec 1, 2010
    #2

  3. MRAB Guest

    On 01/12/2010 01:28, Nobody wrote:
    > On Tue, 30 Nov 2010 18:53:14 +0100, Peter Otten wrote:
    >
    >>> I think this is wrong. In Unix there is no concept of filename
    >>> encoding. Filenames can have any arbitrary set of bytes (except '/' and
    >>> '\0'). But the filesystem itself neither knows nor cares about
    >>> encoding.

    >>
    >> I think you misunderstood what I was trying to say. If you write a list of
    >> filenames into files.txt, and use an encoding (ISO-8859-1, say) other than
    >> that used by the shell to display file names (on Linux typically UTF-8 these
    >> days) and then write a Python script exist.py that reads filenames and
    >> checks for the files' existence,

    >
    > I think you misunderstood.
    >
    > In the Unix kernel, there aren't any encodings. Strings of bytes are
    > /just/ strings of bytes. A text file containing a list of filenames
    > doesn't /have/ an encoding. The filenames passed to API functions don't
    > /have/ an encoding.
    >
    > This is why Unix filenames are case-sensitive: because there isn't any
    > "case". The number 65 has no more in common with the number 97 than it
    > does with the number 255. The fact that 65 is the ASCII code for "A" while
    > 97 is the ASCII code for "a" doesn't come into it. Case-insensitive
    > filenames require knowledge of the encoding in order to determine when
    > filenames are "equivalent". DOS/Windows tried this and never really got it
    > right (it works fine on a standalone system, or within later versions of
    > a Windows-only ecosystem, but becomes a nightmare when files get
    > transferred between systems via older or non-Microsoft channels).
    >
    > Python 3.x's decision to treat filenames (and environment variables) as
    > text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
    > Python 2.x is still around when Python 4 is released.
    >

    If the filenames are to be shown to a user then there needs to be a
    mapping between bytes and glyphs. That's an encoding. If different
    users use different encodings then exchange of textual data becomes
    difficult. That's where encodings which can be used globally come in.
    By the time Python 4 is released I'd be surprised if Unix hadn't
    standardised on a single encoding like UTF-8.
     
    MRAB, Dec 1, 2010
    #3
  4. On Wed, 2010-12-01 at 02:14 +0000, MRAB wrote:
    > If the filenames are to be shown to a user then there needs to be a
    > mapping between bytes and glyphs. That's an encoding. If different
    > users use different encodings then exchange of textual data becomes
    > difficult.


    That's presentation, and that's separate. Indeed, I have my user encoding
    set to UTF-8, and if there is a filename that's not valid UTF-8 then my
    GUI (GNOME) will show "(invalid encoding)" and even let me rename it,
    and my shell (bash) will show '?' next to the invalid "characters" (and
    make it a little more challenging to rename ;)). And I can freely copy
    these "invalid" files across different (Unix) systems, because the OS
    doesn't care about encoding.

    But that's completely different from the actual name of the file. Unix
    doesn't care about presentation in filenames. It just cares about the
    data. There are not "glyphs" in Unix, only in the UI that runs on top
    of it.

    Or to put it another way, Unix's filename encoding is RAW-DATA. It's
    not "textual" data. The fact that most filenames contain mainly
    human-readable text is a convenient convention, but not required or
    enforced by the OS.

    > That's where encodings which can be used globally come in.
    > By the time Python 4 is released I'd be surprised if Unix hadn't
    > standardised on a single encoding like UTF-8.


    I have serious doubts about that. At least in the Linux world the
    kernel wants to stay out of encoding debates (except where it has to,
    like Windows filesystems). But the point is that:

    The world does not revolve around Python. Unix filenames have been
    encoding-agnostic long before Python was around. If Python3 does not
    support this then it's a regression on Python's part.
     
    Albert Hopkins, Dec 1, 2010
    #4
  5. Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

    > The world does not revolve around Python. Unix filenames have been
    > encoding-agnostic long before Python was around. If Python3 does not
    > support this then it's a regression on Python's part.


    Fortunately, Python 3 does support that.

    Regards,
    Martin
     
    Martin v. Loewis, Dec 1, 2010
    #5
  6. Nobody Guest

    Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

    On Wed, 01 Dec 2010 02:14:09 +0000, MRAB wrote:

    > If the filenames are to be shown to a user then there needs to be a
    > mapping between bytes and glyphs. That's an encoding. If different
    > users use different encodings then exchange of textual data becomes
    > difficult.


    OTOH, the exchange of binary data is unaffected. In the worst case, users
    see a few wrong glyphs, but the software doesn't care.

    > That's where encodings which can be used globally come in.
    > By the time Python 4 is released I'd be surprised if Unix hadn't
    > standardised on a single encoding like UTF-8.


    That's probably not a serious option in parts of the world which don't use
    a latin-based alphabet, i.e. outside western Europe and its former
    colonies. In countries with non-latin alphabets, existing encodings are
    often too heavily entrenched.

    There's also a lot of legacy software which can only handle unibyte
    encodings, and not much incentive to fix it if 98% of your market can get
    by with an ISO-8859-<whatever> locale (making software work in e.g. CJK
    locales often requires a lot more work than just dealing with encodings).

    And it doesn't help that Windows has negligible support for UTF-8. It's
    either UTF-16-LE (i.e. the in-memory format dumped directly to file) or
    one of Microsoft's non-standard encodings. At least the latter are mostly
    compatible with the corresponding ISO-8859-* encoding.

    Finally, ISO-8859-* encoding/decoding can't fail. The result might
    be complete gibberish, but converting to gibberish then back to bytes
    won't lose information.
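
    A quick check (my example) of that round-trip property:

    # ISO-8859-1 maps every byte value 0-255 to a codepoint, so decoding
    # and re-encoding any byte string gives back the original bytes.
    data = bytes(range(256))
    assert data.decode("iso-8859-1").encode("iso-8859-1") == data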
     
    Nobody, Dec 1, 2010
    #6
  7. On Tue, 30 Nov 2010 22:22:01 -0500
    Albert Hopkins <> wrote:
    > And I can freely copy
    > these "invalid" files across different (Unix) systems, because the OS
    > doesn't care about encoding.


    And so can Python, thanks to PEP 383.

    > > That's where encodings which can be used globally come in.
    > > By the time Python 4 is released I'd be surprised if Unix hadn't
    > > standardised on a single encoding like UTF-8.

    >
    > I have serious doubts about that. At least in the Linux world the
    > kernel wants to stay out of encoding debates (except where it has to
    > like Window filesystems).


    That doesn't matter. Vendors (Linux distributions) have to make a
    choice and that choice will probably standardize on UTF-8 in most
    situations. The kernel won't have a say, since it doesn't care
    about encodings anyway.

    > The world does not revolve around Python. Unix filenames have been
    > encoding-agnostic long before Python was around. If Python3 does not
    > support this then it's a regression on Python's part.


    Python 3 does support it, see other messages about using bytes
    filenames.
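
    For example, a minimal sketch of the bytes-filename API (my
    illustration, not from the original message):

    import os

    # With a bytes argument, os.listdir returns the entries as raw bytes;
    # those names can be passed straight back to open(), os.stat(), etc.,
    # even when they are not valid in the locale's encoding.
    for name in os.listdir(b"."):
        if os.path.isfile(name):
            with open(name, "rb") as f:
                f.read(1)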

    Regards

    Antoine.
     
    Antoine Pitrou, Dec 1, 2010
    #7
  8. Peter Otten Guest

    Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

    Nobody wrote:

    > Python 3.x's decision to treat filenames (and environment variables) as
    > text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
    > Python 2.x is still around when Python 4 is released.


    For filenames in Python 3 the user has the choice between "text" (str) and
    bytes. If the user chooses text that will be converted to bytes using a
    default encoding that hopefully matches that of the other tools on the
    machine that manipulate filenames.

    I see that you may run into problems with the text approach when you
    encounter byte sequences that are illegal in the chosen encoding.
    I therefore expect that lowlevel tools will use bytes to manipulate
    filenames while end user scripts will choose text.

    I don't see how a dogmatic bytes only restriction can improve the situation.

    Also, you can already provide unicode filenames in Python 2.x (and a script
    containing constant filenames becomes more portable if you do), so IMHO the
    situation in Python 2 and 3 is similar enough as to not hinder adoption of
    3.x.

    Peter
     
    Peter Otten, Dec 1, 2010
    #8
  9. Nobody Guest

    Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

    On Wed, 01 Dec 2010 10:34:24 +0100, Peter Otten wrote:

    >> Python 3.x's decision to treat filenames (and environment variables) as
    >> text even on Unix is, in short, a bug. One which, IMNSHO, will mean that
    >> Python 2.x is still around when Python 4 is released.

    >
    > For filenames in Python 3 the user has the choice between "text" (str) and
    > bytes. If the user chooses text that will be converted to bytes using a
    > default encoding that hopefully matches that of the other tools on the
    > machine that manipulate filenames.


    However, sys.argv and os.environ are automatically converted to text. If
    you want bytes, you have to convert them back explicitly.

    Also, I'm unsure as to how far the choice between bytes and str will
    extend beyond the core modules.

    > I see that you may run into problems with the text approach when you
    > encounter byte sequences that are illegal in the chosen encoding.


    This was actually a critical flaw in Python 3.0, as it meant that
    filenames which weren't valid in the locale's encoding simply couldn't be
    passed via argv or environ. 3.1 fixed this using the "surrogateescape"
    encoding, so now it's only an annoyance (i.e. you can recover the original
    bytes once you've spent enough time digging through the documentation).

    There could be a problem with encodings which aren't invertible (e.g.
    ISO-2022), but those tend to be quite rare and Python flat-out doesn't
    support those as system encodings anyhow.
     
    Nobody, Dec 1, 2010
    #9
  10. Peter Otten Guest

    Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

    Nobody wrote:

    > This was actually a critical flaw in Python 3.0, as it meant that
    > filenames which weren't valid in the locale's encoding simply couldn't be
    > passed via argv or environ. 3.1 fixed this using the "surrogateescape"
    > encoding, so now it's only an annoyance (i.e. you can recover the original
    > bytes once you've spent enough time digging through the documentation).


    Is it just that you need to harden your scripts against these byte sequences
    or do you actually encounter them? If the latter, can you give some
    examples?
     
    Peter Otten, Dec 2, 2010
    #10
  11. Nobody Guest

    Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

    On Thu, 02 Dec 2010 12:17:53 +0100, Peter Otten wrote:

    >> This was actually a critical flaw in Python 3.0, as it meant that
    >> filenames which weren't valid in the locale's encoding simply couldn't be
    >> passed via argv or environ. 3.1 fixed this using the "surrogateescape"
    >> encoding, so now it's only an annoyance (i.e. you can recover the original
    >> bytes once you've spent enough time digging through the documentation).

    >
    > Is it just that you need to harden your scripts against these byte sequences
    > or do you actually encounter them? If the latter, can you give some
    > examples?


    Assume that you have a Python3 script which takes filenames on the
    command-line. If any of the filenames contain byte sequences which
    aren't valid in the locale's encoding, the bytes will be decoded to
    characters in the range U+DC00 to U+DCFF.
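
    For instance (my example), a byte sequence that isn't valid UTF-8
    decodes to lone surrogates and encodes back to the original bytes:

    bad = b"\xe4\xf6\xfc.txt"   # not valid UTF-8
    name = bad.decode("utf-8", "surrogateescape")
    print(ascii(name))          # '\udce4\udcf6\udcfc.txt'
    assert name.encode("utf-8", "surrogateescape") == bad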

    To recover the original bytes, you need to use 'surrogateescape' as the
    error handler when encoding them back, e.g.:

    import sys

    enc = sys.getfilesystemencoding()
    argv_bytes = [arg.encode(enc, 'surrogateescape') for arg in sys.argv]

    Otherwise, it will complain about not being able to encode the surrogate
    characters.

    Similarly for os.environ.

    For anything else, you can just use sys.setfilesystemencoding('iso-8859-1')
    at the beginning of the script. Decoding as ISO-8859-1 will never fail,
    and encoding as ISO-8859-1 will give you the original bytes.

    But argv and environ are decoded before your script can change the
    encoding, so you need to know the "trick" to undo them if you want to
    write a robust Python 3 script which works with byte strings in an
    encoding-agnostic manner (i.e. a traditional Unix script).
     
    Nobody, Dec 2, 2010
    #11
