''.join() with encoded strings

Sandra-24

I'd love to know why calling ''.join() on a list of encoded strings
automatically results in converting to the default encoding. First of
all, it's undocumented, so if I didn't have non-ascii characters in my
utf-8 data, I'd never have known until one day I did, and then the code
would break. Secondly, you can't override (for valid reasons) the
default encoding, so that's not a way around it. So ''.join becomes
pretty useless when dealing with the real (non-ascii) world.

I won't miss the str class when it finally goes (in v3?).

How can I join my encoded strings efficiently?

Thanks,
-Sandra
 
Diez B. Roggisch

Sandra-24 said:
> I'd love to know why calling ''.join() on a list of encoded strings
> automatically results in converting to the default encoding. First of
> all, it's undocumented, so if I didn't have non-ascii characters in my
> utf-8 data, I'd never have known until one day I did, and then the code
> would break. Secondly, you can't override (for valid reasons) the
> default encoding, so that's not a way around it. So ''.join becomes
> pretty useless when dealing with the real (non-ascii) world.
>
> I won't miss the str class when it finally goes (in v3?).
>
> How can I join my encoded strings efficiently?

By not mixing unicode objects with ordinary byte strings. Use

u''.join(some_unicode_objects)

to get a joined unicode object.

Diez
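
A minimal sketch of the suggestion above, assuming the data arrives as UTF-8 byte strings. The sample values and variable names are made up for illustration; they are not from the original post.

```python
# Hypothetical sample data: UTF-8 encoded byte strings.
encoded_parts = [b'caf\xc3\xa9', b' ', b'na\xc3\xafve']

# Decode each byte string first, so the list holds only unicode objects.
unicode_parts = [part.decode('utf-8') for part in encoded_parts]

# Joining unicode objects with a unicode separator involves no implicit
# conversion through the default encoding.
joined = u''.join(unicode_parts)

# Encode once, at the output boundary, if bytes are needed again.
joined_bytes = joined.encode('utf-8')
```

If the result has to go back out as UTF-8, encoding once after the join keeps all intermediate handling in unicode.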
 
Fredrik Lundh

Sandra-24 said:
> I'd love to know why calling ''.join() on a list of encoded strings
> automatically results in converting to the default encoding. First of
> all, it's undocumented, so if I didn't have non-ascii characters in my
> utf-8 data, I'd never have known until one day I did, and then the code
> would break. Secondly, you can't override (for valid reasons) the
> default encoding, so that's not a way around it. So ''.join becomes
> pretty useless when dealing with the real (non-ascii) world.

if all strings in a sequence are encoded strings (byte buffers), join does
the right thing.

if all strings in a sequence are Unicode strings, join does the right thing.

if all strings are ascii strings, join does the right thing.

the only way to mess up is to mix byte buffers containing encoded data
with decoded strings. the solution is simple: make sure to *decode* all
data you're using, *before* using it.

</F>
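
The "decode before using" advice could be sketched roughly as follows. The helper names `to_text` and `join_decoded` are hypothetical, and UTF-8 is an assumed encoding; substitute whatever encoding the data really uses. On the Python 2 interpreters this thread is about, a mixed list with non-ASCII bytes fails with UnicodeDecodeError during the implicit conversion; on Python 3 the same mix is rejected with a TypeError instead.

```python
def to_text(value, encoding='utf-8'):
    """Return value as a decoded (unicode) string.

    Byte strings are decoded with the given encoding (an assumption of
    this sketch); values that are already decoded pass through unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def join_decoded(parts, encoding='utf-8'):
    """Decode every item explicitly, then join the decoded strings."""
    return u''.join(to_text(part, encoding) for part in parts)


# A mixed list: one UTF-8 byte string and one already-decoded string.
mixed = [b'sm\xc3\xb6rg\xc3\xa5s', u'bord']
result = join_decoded(mixed)
```

Decoding explicitly at the point where data enters the program means the join itself never has to guess an encoding.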
 
Sandra-24

Sorry, this was my mistake: I had some unicode strings in the list
without realizing it. I deleted the topic within 10 minutes, but
apparently I wasn't fast enough. You're right, join works the way it
should; I just wasn't aware I had the unicode strings in there.

-Sandra
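
For anyone chasing the same problem, a quick check like the one below can show where decoded strings have slipped into a list that was meant to hold only byte strings. This is only a sketch; `find_non_bytes` is a made-up helper name and the sample list is invented.

```python
def find_non_bytes(parts):
    """Return (index, type name) for every item that is not a byte string."""
    return [
        (index, type(part).__name__)
        for index, part in enumerate(parts)
        if not isinstance(part, bytes)
    ]


# Hypothetical list that was supposed to contain only encoded strings.
parts = [b'abc', u'def', b'ghi']
stray = find_non_bytes(parts)
stray_indexes = [index for index, _ in stray]
```

Running such a check just before the join makes the offending items visible instead of leaving them to surface as an encoding error later.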
 
