sys.argv as a list of bytes

Discussion in 'Python' started by Olive, Jan 18, 2012.

  1. Olive

    Olive Guest

    In Unix, the operating system passes arguments as a list of C strings. But
    C strings correspond to the bytes notion of Python 3. Is it
    possible to have sys.argv as a list of bytes? What happens if I pass
    a program an argument containing funny "characters", for example
    (with a bash shell)?

    python -i ./test.py $'\x01'$'\x05'$'\xFF'
     
    Olive, Jan 18, 2012
    #1

  2. Peter Otten

    Peter Otten Guest

    Olive wrote:

    > In Unix, the operating system passes arguments as a list of C strings. But
    > C strings correspond to the bytes notion of Python 3. Is it
    > possible to have sys.argv as a list of bytes? What happens if I pass
    > a program an argument containing funny "characters", for example
    > (with a bash shell)?
    >
    > python -i ./test.py $'\x01'$'\x05'$'\xFF'


    Python has a special error handler, "surrogateescape", to deal with bytes that are not
    valid UTF-8. If you try to print such a string, you get an error:

    $ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
    '\x01\x05\udcff'
    $ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed

    It is still possible to get the original bytes:

    $ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
    b'\x01\x05\xff'
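
The round trip shown above can also be checked without a shell. This is only a minimal sketch, assuming the arguments were decoded as UTF-8; the helper names `decode_arg` and `encode_arg` are made up for this illustration and are not part of Python.

```python
def decode_arg(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw argument bytes; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def encode_arg(text: str, encoding: str = "utf-8") -> bytes:
    """Reverse decode_arg: escaped surrogates turn back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


raw = b"\x01\x05\xff"          # the bytes from the bash example
text = decode_arg(raw)

print(ascii(text))             # the 0xFF byte appears as a lone surrogate
print(encode_arg(text))        # the original bytes are recovered

# Encoding without the error handler fails, which is why print() raised above.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    print("strict encoding failed:", exc.reason)
```

The error handler has to be named on both the decoding and the encoding side; a plain `encode("utf-8")` rejects the surrogate.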
     
    Peter Otten, Jan 18, 2012
    #2

  3. Olive

    Olive Guest

    But is it safe even if the locale is not UTF-8? I would like to be able
    to pass a file name to a script. I can use bytes for file names in the
    open function. If I keep the file name as bytes everywhere, it will work
    reliably, whatever the locale and whatever strange characters the file
    name may contain.

    Olive
     
    Olive, Jan 18, 2012
    #3
  4. Peter Otten

    Peter Otten Guest

    Olive wrote:

    > But is it safe even if the locale is not UTF-8? I would like to be able
    > to pass a file name to a script. I can use bytes for file names in the
    > open function. If I keep the file name as bytes everywhere, it will work
    > reliably, whatever the locale and whatever strange characters the file
    > name may contain.


    I believe you need not convert back to bytes explicitly; you can open the
    file with open(sys.argv[1]). I don't know if there are corner cases where
    that won't work; maybe http://www.python.org/dev/peps/pep-0383/ can help you
    figure it out.
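
To see the idea in action, here is a small sketch that creates a file whose name contains a non-UTF-8 byte and reads it back through both a str path and a bytes path. It assumes a Unix filesystem that accepts arbitrary bytes in file names (typical on Linux; other systems may reject the name). The function `read_both_ways` and the file name are made up for the illustration.

```python
import os
import tempfile


def read_both_ways(raw_name: bytes, payload: bytes):
    """Write payload to a file called raw_name, then read it via str and bytes paths."""
    with tempfile.TemporaryDirectory() as tmp:
        bytes_path = os.path.join(os.fsencode(tmp), raw_name)
        with open(bytes_path, "wb") as handle:
            handle.write(payload)

        # os.fsdecode yields the kind of str a Unix sys.argv entry would hold:
        # undecodable bytes are carried as lone surrogates.
        str_path = os.fsdecode(bytes_path)
        with open(str_path, "rb") as handle:
            via_str = handle.read()
        with open(bytes_path, "rb") as handle:
            via_bytes = handle.read()
    return via_str, via_bytes


print(read_both_ways(b"odd\xffname.txt", b"hello"))
```

Both reads reach the same file, so on such a system the str form does not have to be converted back to bytes before calling open().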
     
    Peter Otten, Jan 18, 2012
    #4
  5. Nobody

    Nobody Guest

    On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

    >> Python has a special errorhandler, "surrogateescape" to deal with
    >> bytes that are not valid UTF-8.


    On Wed, 18 Jan 2012 11:16:27 +0100, Olive wrote:

    > But is it safe even if the locale is not UTF-8?


    Yes. Peter's reference to UTF-8 is misleading. The surrogateescape
    mechanism is used to represent anything which cannot be decoded according
    to the locale's encoding. E.g. in the "C" locale, any byte >= 128 will be
    decoded to a lone surrogate.
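
A rough way to picture the "C" locale case is to decode with the `ascii` codec plus surrogateescape. This is only a sketch under the assumption that the "C" locale behaves like ASCII here (newer Python versions may treat that locale differently); `decode_like_c_locale` is a hypothetical helper name.

```python
def decode_like_c_locale(raw: bytes) -> str:
    """Decode as 7-bit ASCII, escaping every byte >= 128 to a lone surrogate."""
    return raw.decode("ascii", "surrogateescape")


high_bytes = bytes(range(0x80, 0x100))
escaped = decode_like_c_locale(high_bytes)

# Each byte b in 0x80..0xFF maps to the code point 0xDC00 + b (U+DC80..U+DCFF).
for byte, char in list(zip(high_bytes, escaped))[:3]:
    print(hex(byte), "->", hex(ord(char)))

# Encoding with the same codec and handler restores the bytes.
print(escaped.encode("ascii", "surrogateescape") == high_bytes)
```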

    On Wed, 18 Jan 2012 09:05:42 +0100, Peter Otten wrote:

    > It is still possible to get the original bytes:
    >
    > python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))'


    Except it isn't, because the Python devs can't make up their minds about
    which encoding sys.argv uses, or even document it.

    AFAICT:

    On Windows, there never was a bytes version of sys.argv to start with
    (the OS supplies the command line using wide strings).

    On Mac OS X, the command line is always decoded using UTF-8.

    On Unix, the command line is decoded using mbstowcs(). There isn't a
    Python function to query which encoding this uses (if there even _is_ a
    corresponding Python encoding).

    Except on Windows (where OS APIs take wide string parameters), if a
    library function needs to pass a Unicode string to an API function, it
    will normally encode it using sys.getfilesystemencoding(), which isn't
    guaranteed to be the encoding which was used to fabricate sys.argv in
    the first place.
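
For comparison, the current Python documentation for sys.argv describes Unix arguments as decoded with the filesystem encoding and the "surrogateescape" handler, and suggests os.fsencode() to get the bytes back. The sketch below relies on that assumption, which may not hold for the 3.2-era interpreters discussed in this thread; `argv_as_bytes` is a made-up helper name.

```python
import os
import sys


def argv_as_bytes(argv=None):
    """Return the arguments as bytes, assuming they were decoded with the
    filesystem encoding and the surrogateescape error handler (Unix)."""
    if argv is None:
        argv = sys.argv
    return [os.fsencode(arg) for arg in argv]


print("filesystem encoding:", sys.getfilesystemencoding())
print("argv as bytes:", argv_as_bytes())
```

If the interpreter decoded sys.argv with some other encoding, the bytes returned here would not match what the operating system passed in, which is exactly the risk described above.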

    In short: if you need to write "system" scripts on Unix, and you need them
    to work reliably, you need to stick with Python 2.x.
     
    Nobody, Jan 19, 2012
    #5
  6. jmfauth

    jmfauth Guest

    >
    > In short: if you need to write "system" scripts on Unix, and you need them
    > to work reliably, you need to stick with Python 2.x.



    I think understanding the coding of the characters helps a bit.

    I cannot see why the example below could not be
    done on other systems.

    D:\tmp>chcp
    Active code page: 1252

    D:\tmp>c:\python32\python.exe sysarg.py a b é € \u0430 \u03b1 z
    arg: 1 unicode name: LATIN SMALL LETTER A
    arg: 2 unicode name: LATIN SMALL LETTER B
    arg: 3 unicode name: LATIN SMALL LETTER E WITH ACUTE
    arg: 4 unicode name: EURO SIGN
    arg: 5 unicode name: CYRILLIC SMALL LETTER A
    arg: 6 unicode name: GREEK SMALL LETTER ALPHA
    arg: 7 unicode name: LATIN SMALL LETTER Z
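
The script sysarg.py itself is not shown in the post. A hypothetical reconstruction that produces lines in the same format could look like the sketch below; it only handles arguments that arrive as a single character, and how the original script turned `\u0430` into a Cyrillic letter is not known.

```python
import sys
import unicodedata


def describe_args(args):
    """Return one 'arg: N unicode name: ...' line per argument."""
    lines = []
    for index, arg in enumerate(args, start=1):
        if len(arg) == 1:
            # The second parameter is the fallback for unnamed code points.
            name = unicodedata.name(arg, "<no name>")
        else:
            name = "<not a single character>"
        lines.append(f"arg: {index} unicode name: {name}")
    return lines


for line in describe_args(sys.argv[1:]):
    print(line)
```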

    jmf
     
    jmfauth, Jan 19, 2012
    #6
