sys.argv as a list of bytes

Olive

On Unix, the operating system passes arguments as a list of C strings. But
C strings correspond to Python 3's notion of bytes. Is it
possible to have sys.argv as a list of bytes? What happens if I pass
a program an argument containing funny "characters", for example
(with a bash shell):

python -i ./test.py $'\x01'$'\x05'$'\xFF'
 
Peter Otten

Python has a special error handler, "surrogateescape", to deal with bytes that are not
valid UTF-8. The repr() of such a string can be printed, but if you try to print the string itself you get an error:

$ python3 -c'import sys; print(repr(sys.argv[1]))' $'\x01'$'\x05'$'\xFF'
'\x01\x05\udcff'
$ python3 -c'import sys; print(sys.argv[1])' $'\x01'$'\x05'$'\xFF'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 2: surrogates not allowed

It is still possible to get the original bytes:

$ python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))' $'\x01'$'\x05'$'\xFF'
b'\x01\x05\xff'
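
The same round trip can be checked without a shell. This is a minimal sketch, assuming UTF-8 as the codec (as in the commands above); the variable names are only for illustration:

```python
# Bytes as they would arrive from the shell: two control characters and
# one byte (0xFF) that can never occur in valid UTF-8.
raw = b"\x01\x05\xff"

# Decoding with "surrogateescape" maps each undecodable byte 0xNN to the
# lone surrogate U+DCNN instead of raising UnicodeDecodeError.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# A strict encode refuses lone surrogates...
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but encoding with the same error handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
print(strict_ok, restored == raw)
```

The strict encode fails for the same reason print() does in the traceback above: the output codec is not allowed to emit lone surrogates.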
 
Olive

But is it safe even if the locale is not UTF-8? I would like to be able
to pass a file name to a script. I can use bytes for file names in the
open function. If I keep the file name as bytes everywhere, it will work
reliably whatever the locale and whatever strange characters the file name
may contain.

Olive
 
Peter Otten

I believe you need not convert back to bytes explicitly; you can open the
file with open(sys.argv[1]). I don't know if there are corner cases where
that won't work; maybe http://www.python.org/dev/peps/pep-0383/ can help you
figure it out.
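
A minimal sketch of both options, assuming a POSIX system and a recent Python 3. The helper names argv_as_bytes and read_file are hypothetical, and a temporary file stands in for sys.argv[1]. os.fsencode() and os.fsdecode() are standard library functions that, on POSIX, apply the filesystem encoding together with the "surrogateescape" handler, and open() accepts both str and bytes paths:

```python
import os
import sys
import tempfile


def argv_as_bytes(argv=None):
    """Return command-line arguments as bytes (hypothetical helper).

    On POSIX, os.fsencode() turns escaped surrogates back into the
    original bytes, so undecodable bytes come back unchanged.
    """
    if argv is None:
        argv = sys.argv
    return [os.fsencode(arg) for arg in argv]


def read_file(name):
    """Read a file whose name is given as str or as bytes."""
    with open(name, "rb") as f:
        return f.read()


# Demonstration: a temporary file stands in for the file named on the
# command line.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    with open(path, "wb") as f:
        f.write(b"hello")
    via_str = read_file(path)                 # like open(sys.argv[1])
    via_bytes = read_file(os.fsencode(path))  # file name kept as bytes

print(via_str == via_bytes)
```

Whether the encoding used by os.fsencode() always matches the one used to build sys.argv is exactly the kind of corner case to check against the documentation for the Python version in use.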
 
Nobody

Olive said:
But is it safe even if the locale is not UTF-8?

Yes. Peter's reference to UTF-8 is misleading. The surrogateescape
mechanism is used to represent anything which cannot be decoded according
to the locale's encoding. E.g. in the "C" locale, any byte >= 128 will be
represented as a lone surrogate (U+DC80 to U+DCFF).

Peter Otten said:
It is still possible to get the original bytes:

python3 -c'import sys; print(sys.argv[1].encode("utf-8", "surrogateescape"))'

Except, it isn't, because the Python devs can't make up their minds which
encoding sys.argv uses, or even document it.

AFAICT:

On Windows, there never was a bytes version of sys.argv to start with
(the OS supplies the command line using wide strings).

On Mac OS X, the command line is always decoded using UTF-8.

On Unix, the command line is decoded using mbstowcs(). There isn't a
Python function to query which encoding this uses (if there even _is_ a
corresponding Python encoding).

Except on Windows (where OS APIs take wide string parameters), if a
library function needs to pass a Unicode string to an API function, it
will normally decode it using sys.getfilesystemencoding(), which isn't
guaranteed to be the encoding which was used to fabricate sys.argv in
the first place.
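
For comparison, the encodings that Python does expose can be printed as below. This is only a sketch: it shows what can be queried, and does not by itself establish which encoding was used to build sys.argv on a given platform and Python version. sys.getfilesystemencodeerrors() assumes Python 3.6 or later.

```python
import locale
import sys

# Encoding (and error handler) used to convert between str and bytes
# file names.
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()  # Python 3.6+

# Encoding implied by the current locale settings (False: do not call
# setlocale() as a side effect).
locale_encoding = locale.getpreferredencoding(False)

print("filesystem encoding:", fs_encoding, "/", fs_errors)
print("locale encoding:    ", locale_encoding)
```

If the two values differ on a system, that is a hint that the mismatch described above may apply there.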

In short: if you need to write "system" scripts on Unix, and you need them
to work reliably, you need to stick with Python 2.x.
 
jmfauth

Nobody said:
In short: if you need to write "system" scripts on Unix, and you need them
to work reliably, you need to stick with Python 2.x.


I think understanding how the characters are encoded helps a bit.

I cannot see why the example below could not be
done on other systems.

D:\tmp>chcp
Active code page: 1252

D:\tmp>c:\python32\python.exe sysarg.py a b é € \u0430 \u03b1 z
arg: 1 unicode name: LATIN SMALL LETTER A
arg: 2 unicode name: LATIN SMALL LETTER B
arg: 3 unicode name: LATIN SMALL LETTER E WITH ACUTE
arg: 4 unicode name: EURO SIGN
arg: 5 unicode name: CYRILLIC SMALL LETTER A
arg: 6 unicode name: GREEK SMALL LETTER ALPHA
arg: 7 unicode name: LATIN SMALL LETTER Z
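
The post does not include sysarg.py itself. What follows is a guess at what such a script might look like, using unicodedata.name() from the standard library. The function name describe_args and the handling of multi-character arguments are assumptions, not jmfauth's code:

```python
import sys
import unicodedata


def describe_args(args):
    """Return one line per argument naming its characters (hypothetical)."""
    lines = []
    for i, arg in enumerate(args, start=1):
        # unicodedata.name() takes a single character; the default value
        # avoids a ValueError for characters without a name, such as the
        # lone surrogates produced by "surrogateescape".
        names = ", ".join(unicodedata.name(ch, "<unnamed>") for ch in arg)
        lines.append("arg: %d unicode name: %s" % (i, names))
    return lines


if __name__ == "__main__":
    for line in describe_args(sys.argv[1:]):
        print(line)
```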

jmf
 
