Why does the "".join(r) do this?


Jim Hefferon

Hello,

I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

I'm building up a web page by stuffing an array and then doing
"".join(r) at
the end. I intend to later encode it as 'latin1', so I'd like it to
just concatenate. While I can work around this error, the reason for
it escapes me.

Thanks,
Jim

================= program: try.py
#!/usr/bin/python2.3 -u
t="abc"+chr(174)+"def"
print(u"next: %s :there" % (t.decode('latin1'),))
print t
r=["x",'y',u'z']
r.append(t)
k="".join(r)
print k

================== command line (on my screen between the first abc and def
is a circle-R, while between the second two is a black oval with a white
question mark, in case anyone cares):
jim@joshua:~$ ./try.py
next: abc®def :there
abc�def
Traceback (most recent call last):
File "./try.py", line 7, in ?
k="".join(r)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position
3: ordinal not in range(128)
 

Peter Hansen

Jim said:
I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

It can't just concatenate because your list contains other
items which are unicode strings. Python is attempting to convert
your strings to unicode strings to do the join, and it fails
because your strings contain characters which don't have
meaning to the default decoder.
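
For instance, here is an illustrative Python 2 session (the strings are
stand-ins for Jim's):

>>> "".join(["x", "abc" + chr(174) + "def"])   # only plain strs: no decoding
'xabc\xaedef'
>>> "".join([u"z", "abc" + chr(174) + "def"])  # a unicode item forces ascii decoding
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 3:
ordinal not in range(128)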

-Peter
 

Skip Montanaro

Jim> I'm building up a web page by stuffing an array and then doing
Jim> "".join(r) at the end. I intend to later encode it as 'latin1', so
Jim> I'd like it to just concatenate. While I can work around this
Jim> error, the reason for it escapes me.

Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.
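
To check which codec that is, an interactive session will show the default
encoding (normally 'ascii' unless site.py has been customised):

>>> import sys
>>> sys.getdefaultencoding()
'ascii'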

Skip
 

Peter Otten

Jim said:
I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

Let's reduce the problem to its simplest case:

>>> unicode(chr(174))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0:
ordinal not in range(128)

So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:
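
For instance, in an illustrative session (the codecs here are arbitrary
choices), the very same byte comes out as a different character depending
on which codec you assume:

>>> chr(174).decode("latin-1")     # REGISTERED SIGN
u'\xae'
>>> chr(174).decode("iso8859-2")   # LATIN CAPITAL LETTER Z WITH CARON
u'\u017d'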

Use either unicode or str, but don't mix them. That should keep you out of
trouble.
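
Applied to the posted script, that could look something like this (a sketch
only, and it assumes latin1 really is the right codec for these strings):

#!/usr/bin/python2.3 -u
# Decode on the way in, work purely in unicode, encode once on the way out.
t = ("abc" + chr(174) + "def").decode('latin1')   # unicode from here on
r = [u"x", u"y", u"z"]
r.append(t)
k = u"".join(r)                 # nothing left for the ascii codec to choke on
print k.encode('latin1')        # back to bytes only at the very end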

Peter
 

Peter Otten

Skip said:
Try

u"".join(r)

instead. I think the join operation is trying to convert the Unicode bits
in your list of strings to strings by encoding using the default codec,
which appears to be ASCII.

This is bound to fail when the first non-ascii str occurs:

>>> u"".join(["a", "b"])
u'ab'
>>> u"".join(["a", chr(174)])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0:
ordinal not in range(128)

Apart from that, Python automatically switches to unicode if the list
contains unicode items:

>>> "".join(["a", u"o"])
u'ao'

Peter
 

moma

Jim said:
Hello,

I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

I'm building up a web page by stuffing an array and then doing
"".join(r) at
the end. I intend to later encode it as 'latin1', so I'd like it to
just concatenate. While I can work around this error, the reason for
it escapes me.

Thanks,
Jim

================= program: try.py
#!/usr/bin/python2.3 -u
t="abc"+chr(174)+"def"
print(u"next: %s :there" % (t.decode('latin1'),))
print t
r=["x",'y',u'z']
r.append(t)
k="".join(r)
print k

================== command line (on my screen between the first abc and def
is a circle-R, while between the second two is a black oval with a white
question mark, in case anyone cares):
jim@joshua:~$ ./try.py
next: abc®def :there
abc�def
Traceback (most recent call last):
File "./try.py", line 7, in ?
k="".join(r)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position
3: ordinal not in range(128)

What about unichr() ?


#!/usr/bin/python2.3 -u
t="abc"+unichr(174)+"def"
print t
print(u"next: %s :there" % (t),)
print t
r=["x",'y',u'z']
r.append(t)
k="".join(r)
print k
 

moma

Jim said:
Hello,

I'm getting an error join-ing strings and wonder if someone can
explain why the function is behaving this way? If I .join in a string
that contains a high character then I get an ascii codec decoding
error. (The code below illustrates.) Why doesn't it just
concatenate?

I'm building up a web page by stuffing an array and then doing
"".join(r) at
the end. I intend to later encode it as 'latin1', so I'd like it to
just concatenate. While I can work around this error, the reason for
it escapes me.

Thanks,
Jim

================= program: try.py
#!/usr/bin/python2.3 -u
t="abc"+chr(174)+"def"
print(u"next: %s :there" % (t.decode('latin1'),))
print t
r=["x",'y',u'z']
r.append(t)
k="".join(r)
print k

================== command line (on my screen between the first abc and def
is a circle-R, while between the second two is a black oval with a white
question mark, in case anyone cares):
jim@joshua:~$ ./try.py
next: abc®def :there
abc�def
Traceback (most recent call last):
File "./try.py", line 7, in ?
k="".join(r)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position
3: ordinal not in range(128)

What about unichr() ?


#!/usr/bin/python2.3 -u
t="abc"+unichr(174)+"def"
print t
print(u"next: %s :there" % (t),)
print t
r=["x",'y',u'z']
r.append(t)
# k=u"".join(r)
k="".join(r)
print k


// moma
http://www.futuredesktop.org
 

Skip Montanaro

Peter> This is bound to fail when the first non-ascii str occurs:

...

Yeah I realized that later. I missed that he was appending non-ASCII
strings to his list. I thought he was only appending unicode objects and
ASCII strings (in which case what he was trying should have worked). Serves
me right for trying to respond with a head cold.

Skip
 

Ivan Voras

Peter Otten said:
This is bound to fail when the first non-ascii str occurs:

Is there a way to change the default codec in a part of a program?
(Meaning that different parts of the program deal with strings they know
are in specific, different code pages?)
 

John Roth

Ivan Voras said:
Is there a way to change the default codec in a part of a program?
(Meaning that different parts of the program deal with strings they know
are in specific, different code pages?)

Does the encoding line (1st or second line of program) do this?
I don't remember if it does or not - although I'd suspect not.
Otherwise it seems like a reasonably straightforward function
to write.

John Roth
 

Peter Otten

John said:
Does the encoding line (1st or second line of program) do this?
I don't remember if it does or not - although I'd suspect not.
Otherwise it seems like a reasonably straightforward function
to write.

As a str does not preserve information about its encoding, the
# -*- coding: XXX -*-
comment does not help here. It does, however, control how unicode literals
in the source are decoded. I suppose using unicode for non-ascii literals
plus the above coding comment is as close as you can get to the desired
effect.
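
A short sketch of that point (assuming Python 2.3 and that the file really
is saved as latin-1):

# -*- coding: latin-1 -*-
# The coding comment tells the compiler how to decode the u"..." literal
# below; it attaches no encoding to the plain str literal.
u1 = u"abc®def"            # source byte 0xae decoded as latin-1 -> u'abc\xaedef'
s1 = "abc®def"             # plain str: still just the raw byte, 'abc\xaedef'
print repr(u1)              # u'abc\xaedef'
print repr(s1)              # 'abc\xaedef'
print u1 == s1.decode('latin-1')   # True, but only because we chose latin-1 here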

With some more work you could probably automate string conversion like it is
done with quixote's htmltext. Not sure if that would be worth the effort,
though.

Peter
 

Jim Hefferon

Peter Otten said:
So why doesn't it just concatenate? Because there is no way of knowing how
to properly decode chr(174) or any other non-ascii character to unicode:

Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.
Use either unicode or str, but don't mix them. That should keep you out of
trouble.

Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.

Thanks; I am often struck by how helpful this group is,
Jim
 

John Roth

Jim Hefferon said:
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.

Maybe I can simplify it? The result has to be a single unicode string
if any of the items is a unicode string. Plain 7-bit ASCII maps onto
unicode unambiguously, so there is no difficulty with the concatenation.
The other 8-bit encodings do not, so the concatenation checks that any
normal strings are, in fact, 7-bit ASCII. The decoding is really doing a
validity check, not an encoding conversion.

The only way the system could do a clean concatenation between
unicode and one of the 8-bit encodings is to know beforehand which
of the 8-bit encodings it is dealing with, and there is no way that it
currently has of knowing that.

The people who implemented unicode (in 2.0, I believe) seem to
have decided not to guess. That's in line with the "explicit is better
than implicit" principle.
Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.

Ah. The issue then is rather simple: what is the encoding of the normal
strings? I'd presume Latin-1. So simply run the list of strings through a
function that converts any normal string to unicode using the Latin-1
codec, and then they should concatenate fine.
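
A sketch of such a helper (the name to_unicode and the latin-1 default are
made up for illustration):

def to_unicode(items, encoding='latin-1'):
    """Return a new list with every plain str decoded to unicode."""
    result = []
    for item in items:
        if isinstance(item, str):
            item = item.decode(encoding)
        result.append(item)
    return result

r = ["x", "y", u"z", "abc" + chr(174) + "def"]
k = u"".join(to_unicode(r))        # every element is already unicode
print repr(k)                       # u'xyzabc\xaedef'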

As far as the web goes, I'd suggest you make sure you specify UTF-8
in both the HTTP headers and in a <meta> tag in the HTML header,
and make sure that what you write out is, indeed, UTF-8.
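
A sketch of that last step, assuming a plain CGI-style script (the names
here are invented for the example):

parts = [u"x", u"y", u"z", u"abc\xaedef"]     # everything already unicode
body = u"".join(parts)
page = (u'<html><head>'
        u'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        u'</head><body>%s</body></html>' % body)
print "Content-Type: text/html; charset=utf-8"
print
print page.encode('utf-8')                     # the bytes on the wire are UTF-8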

John Roth
 

Erik Max Francis

Jim said:
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now?

Because you're mixing normal strings and Unicode strings. To do that,
it needs to convert the normal strings to Unicode, and to do that it has
to know what encoding you want.
As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.

It's the process by which you turn an arbitrary string into a Unicode
string and back. When you're adding normal strings and Unicode strings,
you end up with a Unicode string, which means the normal strings have to
be implicitly converted. That's why you're getting the error.

Work with strings or Unicode strings, not a mixture, and you won't have
this problem.
 

Peter Otten

Jim said:
Forgive me, Peter, but you've only rephrased my question: I'm going to
decode them later, so why does the concatenator insist on decoding
them now? As I understand it (perhaps this is my error),
encoding/decoding is stuff that you do external to manipulating the
arrays of characters.

Perhaps another example will help in addition to the answers already given:

>>> 1 + 2.0
3.0

In the above 1 is converted to 1.0 before it can be added to 2.0, i.e. we
have

>>> 1.0 + 2.0
3.0

In the same spirit

>>> u"a" + "b"
u'ab'

"b" is converted to unicode before u"a" and u"b" can be concatenated. The
same goes for string formatting:

>>> u"a%s" % "b"
u'ab'

The following might be the conversion function:

>>> def tounicode(s, encoding="ascii"):
...     return s.decode(encoding)
...
>>> u"a" + tounicode("b")
u'ab'

Of course it would fail with non-ascii characters in the string to be
converted. Why not allow strings with all 256 chars? Again, as stated in
my above post, that would be ambiguous: the same byte stands for a
different character in each encoding.

By the way, in the real conversion routine the encoding isn't hardcoded, see
sys.get/setdefaultencoding() for the details. Therefore you _could_ modify
site.py to assume e. g. latin1 as the encoding of 8 bit strings. The
practical benefit of that is limited as you cannot make assumptions about
machines not under your control and therefore are stuck with ascii as the
least common denominator for scripts meant to be portable - which brings us
back to:

Or make all conversions explicit with the str.decode()/unicode.encode()
methods.
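
For completeness, the sys.setdefaultencoding() route mentioned above boils
down to a sitecustomize.py along these lines (a sketch, Python 2 only, with
exactly the portability drawback described):

# sitecustomize.py
import sys
sys.setdefaultencoding('latin-1')   # only callable during start-up;
                                    # site.py deletes it afterwards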
Well, I got this string as the filename of some kind of Macintosh file
(I'm on Linux but I'm working with an archive that contains some pre-X
Mac stuff) while calling some os and os.path functions. So I'm taking
strings from a Python library function (and using % to stuff them into
strings that will end up on the web, which should preserve
unicode-type-ness, right?) and then .join-ing them.

I didn't go into the whole story when posting, because I tried to boil
the question down. Perhaps I should have.

While details are often helpful to identify a problem that is different from
the poster's guess, unicode handling is pretty general, and it was rather
my post that was lacking clarity.

Peter
 

Jim Hefferon

Peter Otten said:
Of course it would fail with non-ascii characters in the string to be
converted. Why not allow strings with all 256 chars? Again, as stated in
my above post, that would be ambiguous:

Thanks, Peter and others, you have been enlightening. I understand
you to say that Python insists that I explicitly decide the decoding,
and not just smoosh the strings. Thanks.

I will write to the documentation person with the suggestion that the
documentation of .join(seq) at
http://docs.python.org/lib/string-methods.html#l2h-188 might be
updated from:
"Return a string which is the concatenation of the strings in the
sequence seq."
Or make all conversions explicit with the str.decode()/unicode.encode()
methods.
Now I only have to figure out whic codec's are available and
appropriate.
Thanks again,

Jim
 

Terry Reedy

Jim Hefferon said:
Thanks, Peter and others, you have been enlightening. I understand
you to say that Python insists that I explicitly decide the decoding,
and not just smoosh the strings. Thanks.

Abstractly, byte strings and unicode strings are different types of beasts.
If you forget what you know about the CPython implementation and linear
computer memories, it makes little sense to combine them. The result
would have to be some currently nonexistent byte-unicode string.

Terry J. Reedy
 

Tim Roberts

moma said:
What about unichr() ?

#!/usr/bin/python2.3 -u
t="abc"+unichr(174)+"def"

That's an easy trap to fall into, but it isn't right. unichr(174), which
is U+00AE, is the ® (R) registered trademark symbol. We don't have any
idea whether or not the \xae character in his original 8-bit string was
actually the registered trademark symbol.

The meaning of the original \xae, and therefore the Unicode equivalent of
that character, depends COMPLETELY on the character set of that original
string.
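
For instance (an illustrative session; codec names as Python spells them),
the byte \xae is a different character under the old Mac encoding than
under latin-1:

>>> chr(174).decode('latin-1')      # REGISTERED SIGN
u'\xae'
>>> chr(174).decode('mac_roman')    # LATIN CAPITAL LETTER AE
u'\xc6'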
 
