[unicode] inconvenient unicode conversion of non-string arguments

Holger Joukl · Dec 13, 2006

Hi there,

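I consider the behaviour of unicode() inconvenient with respect to the
conversion of non-string arguments.
While you can do:

>>> unicode(17.3)
u'17.3'

you cannot do:

>>> unicode(17.3, 'ISO-8859-1', 'replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, float found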

This is somewhat annoying when you want to convert a mixed-type argument
list to unicode strings, e.g. for a logging system (that's where it bit
me), and want to make sure that any raw string arguments are also
converted to unicode without errors (albeit by force).
This is a performance-critical part of my application, so I really do not
want to wrap unicode() in some custom tounicode() function that handles
such cases by distinguishing argument types.

Any reason why unicode() with a non-string argument should not allow the
encoding and errors arguments?
Or some good solution to work around my problem?

(Currently running on python 2.4.3)
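
For reference, the kind of wrapper mentioned above could look roughly like
the sketch below. Only the name tounicode() comes from the post; the
signature, the default arguments and the text_type/bytes_type aliases are
assumptions made here so that the example runs on both Python 2 and
Python 3 (where str plays the role of unicode). It is an illustration, not
the poster's actual code.

```python
try:
    text_type, bytes_type = unicode, str   # Python 2: unicode is the text type
except NameError:
    text_type, bytes_type = str, bytes     # Python 3: str is the text type


def tounicode(value, encoding='ISO-8859-1', errors='replace'):
    """Hypothetical helper: return value as text, decoding byte strings."""
    if isinstance(value, text_type):
        return value                           # already text, nothing to do
    if isinstance(value, bytes_type):
        return value.decode(encoding, errors)  # raw bytes: decode "by force"
    return text_type(value)                    # floats, ints, other objects


# A mixed-type argument list; the middle item is a raw ISO-8859-1 byte string.
args = [17.3, u'caf\xe9'.encode('ISO-8859-1'), 42]
converted = [tounicode(a) for a in args]
```

The isinstance checks are exactly the per-type distinction the post hopes
to avoid in a hot path, so this shows the shape of the workaround rather
than a way around its cost.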

Regards,
Holger

Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene
Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde,
verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail
sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht
gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht
garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte
den Inhalt der E-Mail als Hardcopy an.

The contents of this e-mail are confidential. If you are not the named
addressee or if this transmission has been addressed to you in error,
please notify the sender immediately and then delete this e-mail. Any
unauthorized copying and transmission is forbidden. E-Mail transmission
cannot be guaranteed to be secure. If verification is required, please
request a hard copy version.

Leo Kislov · Dec 13, 2006

Holger said:
Any reason why unicode() with a non-string argument should not allow the
encoding and errors arguments?

There is a reason: an encoding is a property of bytes; it is not
applicable to other objects.

Or some good solution to work around my problem?

Do not put undecoded bytes in a mixed-type argument list. A rule of
thumb when working with unicode: decode as soon as possible, encode as
late as possible.

-- Leo
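
Leo's rule of thumb can be sketched as follows. The sample data, the
variable names and the choice of ISO-8859-1 for input and UTF-8 for output
are assumptions for illustration only; the text_type alias is there so
that the snippet runs on both Python 2 and Python 3.

```python
try:
    text_type = unicode   # Python 2
except NameError:
    text_type = str       # Python 3

# Input boundary: bytes arrive from outside and are decoded exactly once.
# (The encode() call merely simulates bytes read from a file or socket.)
raw = u'Gr\xfc\xdfe'.encode('ISO-8859-1')
greeting = raw.decode('ISO-8859-1')

# Inside the program the list holds only text and non-string objects,
# so the one-argument conversion is sufficient for every element.
args = [greeting, 17.3, 42]
message = u' '.join([text_type(a) for a in args])

# Output boundary: encode once, as late as possible.
payload = message.encode('UTF-8')
```

With the decoding done at the boundary, no encoding or errors argument is
needed at the point where the mixed-type list is converted.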

Fredrik Lundh · Dec 13, 2006

Holger said:
Ok, but I still don't see why these arguments shouldn't simply be silently
ignored for non-string arguments.

</F>

Leo Kislov · Dec 13, 2006

Holger said:
[email protected] wrote on 13.12.2006 11:02:30:

Holger said:

Hi there,

I consider the behaviour of unicode() inconvenient with respect to the
conversion of non-string arguments.
While you can do:

>>> unicode(17.3)
u'17.3'

you cannot do:

>>> unicode(17.3, 'ISO-8859-1', 'replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, float found

[...]
Any reason why unicode() with a non-string argument should not allow the
encoding and errors arguments?

There is reason: encoding is a property of bytes, it is not applicable
to other objects.

Ok, but I still don't see why these arguments shouldn't simply be silently
ignored for non-string arguments.

That's a rather bizarre and sloppy approach. Should

unicode(17.3, 'just-having-fun', 'I-do-not-like-errors')
unicode(17.3, 'sdlfkj', 'ewrlkj', 'eoirj', 'sdflkj')

work?

It's not always that easy when you deal with a tree data structure whose
elements contain different data types and your user may decide to output
root.element.subelement.whateverData.
I have this problem in a logging mechanism, and it would vanish if
unicode(<non-string>, encoding, errors) worked and just ignored the
superfluous arguments.

I don't really see from your example what stops you from putting
unicode instead of bytes into your tree, but I can believe some
libraries can cause some extra work. That's a problem with the
libraries, not with the builtin function unicode(). Would you be happy
if the floating point value 17.3 were stored as 8 bytes in your tree?
After all, that is how 17.3 is actually represented in computer memory.
Same story with unicode: if some library gives you raw bytes, *you* have
to do extra work later.

-- Leo
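
The float analogy can be made concrete with the standard struct module.
This is only an illustration and assumes the usual case where a Python
float is an IEEE 754 double; the variable names are made up here.

```python
import struct

# The raw 8-byte form of 17.3 (little-endian IEEE 754 double).
packed = struct.pack('<d', 17.3)

# Nobody stores this in a data structure and interprets it later;
# it is turned back into a float before the program works with it.
(value,) = struct.unpack('<d', packed)
```

Raw byte strings are in the same position: they need an explicit decoding
step, with a known encoding, before they can be treated as text.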

Marc 'BlackJack' Rintsch · Dec 13, 2006

Holger Joukl said:
Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene
Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde,
verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail
sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht
gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht
garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte
den Inhalt der E-Mail als Hardcopy an.

The contents of this e-mail are confidential. If you are not the named
addressee or if this transmission has been addressed to you in error,
please notify the sender immediately and then delete this e-mail. Any
unauthorized copying and transmission is forbidden. E-Mail transmission
cannot be guaranteed to be secure. If verification is required, please
request a hard copy version.

Maybe you should rethink if it really makes sense to add this huge block
of "nonsense" to a post to a newsgroup or public mailing list. If it's
confidential, just keep it secret. ;-)

Ciao,
Marc 'BlackJack' Rintsch

Ben Finney · Dec 13, 2006

Marc 'BlackJack' Rintsch said:
Holger Joukl said:

[a meaningless disclaimer text at the bottom of every message]

Maybe you should rethink if it really makes sense to add this huge
block of "nonsense" to a post to a newsgroup or public mailing list.
If it's confidential, just keep it secret. ;-)

In all likelihood, the OP isn't choosing specifically to attach it;
these things are often done to *every* outgoing message at an
organisational level by people who don't think the issue through very
well.

<URL:http://goldmark.org/jeff/stupid-disclaimers/>

Please, those with such badly-configured systems, discuss the issue of
public discussion forums with the boneheads who think these disclaimer
texts are a good idea and at least try to change that behaviour.

Alternatively, post from some other mail system that doesn't slap
these obnoxious blocks onto your messages.
