Re: Python 3 encoding question: Read a filename from stdin, subsequently open that filename

Discussion in 'Python' started by Peter Otten, Nov 30, 2010.

  1. Peter Otten

    Dan Stromberg wrote:

    > I've got a couple of programs that read filenames from stdin, and then
    > open those files and do things with them. These programs sort of do
    > the *ix xargs thing, without requiring xargs.
    >
    > In Python 2, these work well. Irrespective of how filenames are
    > encoded, things are opened OK, because it's all just a stream of
    > single byte characters.


    I think you're wrong. The encoding of the filenames as they are read from
    stdin must be the same as the encoding used by the file system. If the file
    system expects UTF-8 and you feed it ISO-8859-1 bytes, the names won't match
    and you'll run into errors.

    You always have to know either

    (a) both the file system's and stdin's actual encoding, or
    (b) that both encodings are the same.

    If byte strings work you are in situation (b) or just lucky. I'd guess the
    latter ;)
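
    To make the mismatch concrete, here is a minimal sketch of how the same
    character ends up as different bytes under the two encodings mentioned in
    this thread. The filename "mañana.txt" is only an illustrative assumption,
    not a name from the original programs.

```python
# The same character, n-tilde (U+00F1), under two encodings.
# "ma\u00f1ana.txt" is a hypothetical example filename.
name = "ma\u00f1ana.txt"

latin1_bytes = name.encode("iso-8859-1")  # n-tilde becomes the single byte 0xf1
utf8_bytes = name.encode("utf-8")         # n-tilde becomes the two bytes 0xc3 0xb1

print(latin1_bytes)
print(utf8_bytes)

# Decoding Latin-1 bytes as UTF-8 fails: 0xf1 announces a multi-byte
# sequence, but the byte that follows is not a continuation byte.
try:
    latin1_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

# The reverse never fails, because every byte is valid ISO-8859-1,
# but the result is a different (wrong) name.
mojibake = utf8_bytes.decode("iso-8859-1")
print(mismatch_detected, mojibake)
```

    A name that is one byte on disk but two bytes after a round trip through
    the wrong encoding will not match any directory entry, so open() cannot
    find the file.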

    > In Python 3, I'm finding that I have encoding issues with characters
    > with their high bit set. Things are fine with strictly ASCII
    > filenames. With high-bit-set characters, even if I change stdin's
    > encoding with:
    >
    > import io
    > STDIN = io.open(sys.stdin.fileno(), 'r', encoding='ISO-8859-1')


    I suppose you can handle (b) with

    STDIN = sys.stdin.buffer

    or

    STDIN = io.TextIOWrapper(sys.stdin.buffer,
                             encoding=sys.getfilesystemencoding())

    in Python 3. I'd prefer the latter because it makes your assumptions
    explicit. (Disclaimer: I'm not sure whether I'm using the io API as Guido
    intended it.)
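
    For the first variant, a rough sketch of the whole loop might look like
    the following. It assumes a POSIX system, one filename per line, and no
    newlines inside filenames. The helper names iter_filenames, first_bytes
    and main are made up for illustration.

```python
import sys


def iter_filenames(binary_stream):
    """Yield one filename per line as raw bytes, without the line ending."""
    for line in binary_stream:
        name = line.rstrip(b"\n")
        if name:  # skip empty lines
            yield name


def first_bytes(filename, count=16):
    """Return the first `count` bytes of the file called `filename`."""
    # open() accepts a bytes path; on POSIX it is handed to the OS
    # unchanged, so no decoding or encoding step can alter the name.
    with open(filename, "rb") as handle:
        return handle.read(count)


def main():
    # sys.stdin.buffer is the binary layer underneath the text-mode sys.stdin.
    for filename in iter_filenames(sys.stdin.buffer):
        print(filename, first_bytes(filename))
```

    Calling main() from a script fed by something like find or ls keeps the
    names as bytes from stdin all the way to open(), which is situation (b)
    without ever naming an encoding.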

    > ...even with that, when I read a filename from stdin with a
    > single-character Spanish n~, the program cannot open that filename
    > because the n~ is apparently internally converted to two bytes, but
    > remains one byte in the filesystem. I decided to try ISO-8859-1 with
    > Python 3, because I have a Java program that encountered a similar
    > problem until I used en_US.ISO-8859-1 in an environment variable to
    > set the JVM's encoding for stdin.
    >
    > Python 2 shows the n~ as 0xf1 in an os.listdir('.'). Python 3 with an
    > encoding of ISO-8859-1 wants it to be 0xc3 followed by 0xb1.
    >
    > Does anyone know what I need to do to read filenames from stdin with
    > Python 3.1 and subsequently open them, when some of those filenames
    > include characters with their high bit set?
    >
    > TIA!
     
    Peter Otten, Nov 30, 2010