Becoming Unicode Aware

Michael Foord

I'm trying to become 'unicode-aware'... *sigh*. What's that quote - 'a
native speaker of ascii will never learn to speak unicode like a
native'. The trouble is I think I've been a native speaker of latin-1
without realising it.

My main problem with understanding unicode is what to do with
arbitrary text without an encoding specified. To the best of my
knowledge the technical term for this situation is 'buggered'. E.g. I
have a CGI guestbook script. Is the only way of knowing what encoding
the user is typing in, to ask them?

Anyway - ConfigObj reads config files from plain text files. Is there
a standard for specifying the encoding within the text file ? I know
python scripts have a method - should I just use that ?

Also - suppose I know the encoding, or let the programmer specify, is
the following sufficient for reading the files in :

def afunction(setoflines, encoding='ascii'):
    for line in setoflines:
        if encoding:
            line = line.decode(encoding)

Regards,


Fuzzy
http://www.voidspace.org.uk/atlantibots/pythonutils.html
 
Diez B. Roggisch

My main problem with understanding unicode is what to do with
arbitrary text without an encoding specified. To the best of my
knowledge the technical term for this situation is 'buggered'. E.g. I
have a CGI guestbook script. Is the only way of knowing what encoding
the user is typing in, to ask them?

Unfortunately the http standard seems to lack a specification of how form
data encoding is to be transferred. But it seems that most browsers that
understand the encoding your page is delivered in will use that for
replying.

Anyway - ConfigObj reads config files from plain text files. Is there
a standard for specifying the encoding within the text file ? I know
python scripts have a method - should I just use that ?

No idea what configobj is - is it your own config parser?

Also - suppose I know the encoding, or let the programmer specify, is
the following sufficient for reading the files in :

def afunction(setoflines, encoding='ascii'):
    for line in setoflines:
        if encoding:
            line = line.decode(encoding)

Yes, it should be - but why the if? It is unnecessary, as its condition will
always be true - and you _want_ it that way, as the result of afunction
should always be unicode objects, no matter what encoding was used.
 
Jim Hefferon

Michael Foord wrote:
I'm trying to become 'unicode-aware'... *sigh*. What's that quote - 'a
native speaker of ascii will never learn to speak unicode like a
native'. The trouble is I think I've been a native speaker of latin-1
without realising it.

It *is* odd, IMHO, that my database connector spits out string-like
things that have 8-bit data, so that when I
"".join(array_of_database_strings) them, I get a failure. I've
learned to convert them into unicode strings by hand, but it is odd.
Something like a pair (encoding, string) seems more natural to me, but
probably I just don't get the issues.

My main problem with understanding unicode is what to do with
arbitrary text without an encoding specified. To the best of my
knowledge the technical term for this situation is 'buggered'. E.g. I
have a CGI guestbook script. Is the only way of knowing what encoding
the user is typing in, to ask them?

I found this link
https://bugzilla.mozilla.org/show_bug.cgi?id=18643#c12
useful.

Jim
 
Bengt Richter

Unfortunately the http standard seems to lack a specification of how form
data encoding is to be transferred. But it seems that most browsers that
understand the encoding your page is delivered in will use that for
replying.



No idea what configobj is - is it your own config parser?


Yes, it should be - but why the if? It is unnecessary, as its condition will
always be true - and you _want_ it that way, as the result of afunction
should always be unicode objects, no matter what encoding was used.

Re "its condition will always be true":

afunction(lines, None)

would seem to be a feasible call ;-)

Regards,
Bengt Richter
 
Alex Martelli

Michael Foord said:
def afunction(setoflines, encoding='ascii'):
    for line in setoflines:
        if encoding:
            line = line.decode(encoding)

This snippet as posted is a complicated "no-op but raise an error for
invalidly encoded lines", if it's the whole function.

Assuming the so-called setoflines IS not a set but a list (order
normally matters in such cases), you may rather want:

def afunction(setoflines, encoding='ascii'):
    for i, line in enumerate(setoflines):
        setoflines[i] = line.decode(encoding)

The removal of the 'if' is just the same advice you were already given;
if you want to be able to explicitly pass encoding='' to AVOID the
decode (the whole purpose of the function), just insert a first line

    if not encoding: return

rather than repeating the test in the loop. But the key change is to
use enumerate to get indices as well as values, and assign into the
indexing in order to update 'setoflines' in-place; assigning to the
local variable 'line' (assuming, again, that you didn't snip your code
w/o a mention of that) is no good.

A good alternative might be

setoflines[:] = [line.decode(encoding) for line in setoflines]

assuming again that you want the change to happen in-place.


Alex
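
Pulling the advice so far together, here is a minimal sketch of the decoding step. It is written for Python 3, where the 8-bit strings discussed in this thread are `bytes` objects. The name `decode_lines` is made up for illustration and is not part of ConfigObj. Unlike the in-place versions above, it returns a new list, and it treats a false `encoding` as "leave the lines undecoded".

```python
def decode_lines(lines, encoding="ascii"):
    """Return a new list with every byte-string line decoded to text.

    If encoding is None or empty, the lines are returned undecoded.
    """
    if not encoding:
        # The caller explicitly opted out of decoding.
        return list(lines)
    # bytes.decode raises UnicodeDecodeError on invalidly encoded input.
    return [line.decode(encoding) for line in lines]


raw = [b"name = caf\xe9\n", b"colour = blue\n"]
decoded = decode_lines(raw, encoding="latin-1")
```

With the default `encoding="ascii"` the same input raises `UnicodeDecodeError`, because the byte `\xe9` is not ASCII. Whether to fail loudly or fall back to another encoding is a design choice for the caller.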
 
Carl Banks

My main problem with understanding unicode is what to do with
arbitrary text without an encoding specified. To the best of my
knowledge the technical term for this situation is 'buggered'. E.g. I
have a CGI guestbook script. Is the only way of knowing what encoding
the user is typing in, to ask them?

Generally speaking, you have to ask (either the user or the software).
There's no reliable way to tell what encoding you're looking at
without someone or something telling you; you might be able to make a
heuristic guess, but that's it.
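
One common form of such a heuristic guess, sketched here in Python 3 under stated assumptions: try a strict encoding such as UTF-8 first, and fall back to latin-1, which accepts any byte sequence. The function name `guess_decode` and the candidate list are illustrative only. A successful decode does not prove the guess was right.

```python
def guess_decode(data, candidates=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding).

    latin-1 maps every byte to a character, so it never fails and
    must come last if it is used as the fallback.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


utf8_guess = guess_decode("café".encode("utf-8"))
latin1_guess = guess_decode(b"caf\xe9")
```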

Anyway - ConfigObj reads config files from plain text files. Is there
a standard for specifying the encoding within the text file ? I know
python scripts have a method - should I just use that ?

It's a good method if you expect people to be editing the config file
with Emacs. It's a good enough method if you haven't any good reason
to use another method.
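
For illustration, a rough Python 3 sketch of borrowing the Python source convention for a config file. It assumes the config format uses `#` comments and that the declaration sits in one of the first two lines. The pattern is modelled on the one described in PEP 263 rather than copied from it, and `sniff_encoding` is a hypothetical helper, not a ConfigObj function.

```python
import re

# Matches declarations such as "# -*- coding: latin-1 -*-".
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def sniff_encoding(raw_lines, default="ascii"):
    """Look for an encoding declaration in the first two comment lines."""
    for line in raw_lines[:2]:
        if not line.lstrip().startswith(b"#"):
            # Only comment lines may carry the declaration.
            continue
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default


declared = sniff_encoding([b"# -*- coding: latin-1 -*-\n", b"key = value\n"])
undeclared = sniff_encoding([b"key = value\n"])
```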

Also - suppose I know the encoding, or let the programmer specify, is
the following sufficient for reading the files in :

def afunction(setoflines, encoding='ascii'):
    for line in setoflines:
        if encoding:
            line = line.decode(encoding)

For most encodings, this'll work fine. But there are some encodings,
for example UTF-16, that won't work with it. UTF-16 fails for two
reasons: the two-byte characters interfere with the line buffering,
and UTF-16 strings must be preceded by a two-byte code indicating
endianness, which would be at the beginning of the file but not of
each line.

Fortunately, most text files aren't in UTF-16. I mention this so that
you are aware that, although afunction works in most cases, it is not
universal.

I believe it's the purpose of the StreamReader and StreamWriter
classes in the codecs module to deal with such situations.
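
A small Python 3 sketch of the difference, using an in-memory buffer in place of a real file (an assumption made so the example is self-contained). Splitting the raw bytes on the newline byte and decoding each piece breaks for UTF-16, while a stream reader obtained from `codecs.getreader` decodes first and splits into lines afterwards.

```python
import codecs
import io

text = "first line\nsecond line\n"
data = text.encode("utf-16")  # byte order mark, then two bytes per character

# Naive approach: split the raw bytes on b"\n", then decode each piece.
naive_ok = True
try:
    for piece in data.split(b"\n"):
        piece.decode("utf-16")
except UnicodeDecodeError:
    # The split lands in the middle of a two-byte character.
    naive_ok = False

# Stream reader: decode the byte stream first, then split into lines.
reader = codecs.getreader("utf-16")(io.BytesIO(data))
lines = reader.readlines()
```

For a real file, opening it with an explicit encoding (for example `codecs.open(path, encoding="utf-16")`) gives the same kind of decoding reader.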
 
Diez B. Roggisch

afunction(lines, None)

would seem to be a feasible call ;-)

Ok, I admit that I didn't think of _that_ stupid possibility :)
Nevertheless: he wants unicode objects, so he should make sure he gets
them....
 
Michael Foord

This snippet as posted is a complicated "no-op but raise an error for
invalidly encoded lines", if it's the whole function.

:)
It wouldn't be the whole function...... glad you credit me with
some intelligence ;-)

Assuming the so-called setoflines IS not a set but a list (order
normally matters in such cases), you may rather want:

def afunction(setoflines, encoding='ascii'):
    for i, line in enumerate(setoflines):
        setoflines[i] = line.decode(encoding)

The removal of the 'if' is just the same advice you were already given;
if you want to be able to explicitly pass encoding='' to AVOID the
decode (the whole purpose of the function), just insert a first line

    if not encoding: return

rather than repeating the test in the loop. But the key change is to
use enumerate to get indices as well as values, and assign into the
indexing in order to update 'setoflines' in-place; assigning to the
local variable 'line' (assuming, again, that you didn't snip your code
w/o a mention of that) is no good.


The rest of the function (which I didn't show) would actually process
the lines one by one......

Regards,


Fuzzy
http://www.voidspace.org.uk/atlantibots/pythonutils.html

A good alternative might be

setoflines[:] = [line.decode(encoding) for line in setoflines]

assuming again that you want the change to happen in-place.


Alex
 
Egil Möller

Unfortunately the http standard seems to lack a specification of how form
data encoding is to be transferred. But it seems that most browsers that
understand the encoding your page is delivered in will use that for
replying.

Fortunately, this is utter bullshit :)

Send the Content-Type http header to the client, with the value
"text/html; charset=UTF-8". You may have to send it both as an HTTP
header and as a meta http-equiv HTML tag to get it to work with all
browsers though. Usually (I don't know if it is really in the standard
that the client has to behave this way), the client will reply in the
same encoding as you sent your page with the form. Anyway, the client
will probably set a similar tag upon reply, but I don't know about that,
and don't care, as just expecting the same encoding works for all major
browsers (mozilla, IE, opera).
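
A hedged Python 3 sketch of that approach for a CGI-style script: declare the charset when serving the form, then decode the submitted data on the assumption that the browser replied in the same encoding. The names `PAGE_ENCODING`, `response_headers` and `decode_form` are made up for this example; `urllib.parse.parse_qs` is the standard library call that does the decoding.

```python
from urllib.parse import parse_qs

PAGE_ENCODING = "utf-8"  # assumption: the form page is served as UTF-8


def response_headers(encoding=PAGE_ENCODING):
    """Header block a CGI script would print before the HTML body."""
    return "Content-Type: text/html; charset=%s\r\n\r\n" % encoding


def decode_form(query_string, encoding=PAGE_ENCODING):
    """Decode a urlencoded form submission.

    Assumes the browser replied in the encoding the page was served in;
    errors="strict" makes a wrong assumption fail loudly.
    """
    return parse_qs(query_string, encoding=encoding, errors="strict")


fields = decode_form("name=caf%C3%A9&msg=hi")
```

If the browser actually replied in another encoding, the strict decode raises `UnicodeDecodeError` instead of silently storing mangled text in the guestbook.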
 
Diez B. Roggisch

Egil said:
Fortunately, this is utter bullshit :)

Send the Content-Type http header to the client, with the value
"text/html; charset=UTF-8". You may have to send it both as an HTTP
header and as a meta http-equiv HTML tag to get it to work with all
browsers though. Usually (I don't know if it is really in the standard
that the client has to behave this way), the client will reply in the
same encoding as you sent your page with the form. Anyway, the client
will probably set a similar tag upon reply, but I don't know about that,
and don't care, as just expecting the same encoding works for all major
browsers (mozilla, IE, opera).


You claim that my statement is bullshit, and then paraphrase it - delivering
a page in a certain encoding means exactly that it contains the charset
header, as that is required unless you use iso-8859-1, which is the default:

http://www.w3.org/International/O-HTTP-charset

And then you point out that expecting the right encoding usually works,
but only because of experience, not because it's standardized to behave that
way - now where is that different from saying that most browsers will use
that for replying?

I've no problem being corrected, or having my statements clarified - but I
don't think they generally qualify as bullshit, and prefer not to be
accused of uttering it.
 
