unicode mystery

Sean McIlroy · Jan 11, 2005

I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers
of the extra characters you need for French, and I found them all just
fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)

Peace,
STM

PS: I'm considering looking into pyscript as a means of making
diagrams for inclusion in LaTeX documents. If anyone can share an
opinion about pyscript, I'm interested to hear it.

Peace

John Lenton · Jan 11, 2005

I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers
of the extra characters you need for French, and I found them all just
fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)

Å“ isn't part of ISO 8859-1, so you can't get it that way. You can do
one of

u'\u0153'

or, if you must,

unicode("\305\223", "utf-8")

--
John Lenton ([email protected]) -- Random fortune:
Lisp, Lisp, Lisp Machine,
Lisp Machine is Fun.
Lisp, Lisp, Lisp Machine,
Fun for everyone.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFB42K4gPqu395ykGsRAuYHAKCWQPoNdtAaBm6XeKqN4/cdsVIhJgCggMRq
NlFH8U/HGRTNkYrZsFCulVg=
=47J7
-----END PGP SIGNATURE-----

John Machin · Jan 11, 2005

Sean said:
I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers
of the extra characters you need for French, and I found them all just
fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)

Characters that are in iso-8859-1 are mapped directly into Unicode.
That is, the first 256 characters of Unicode are identical to
iso-8859-1.

Consider this:
231

What you did with c_cedilla "worked" because it was effectively doing
nothing. However if you do unicode(char, encoding) where char is not in
encoding, it won't "work".

As John Lenton has pointed out, if you find a character in the Unicode
tables, you can just use it directly. There is no need in this
circumstance to use unicode().

HTH,
John

John Machin · Jan 11, 2005

Some poster wrote (in connexion with another topic):

... unicode("\347", "iso-8859-1") ...

Well, I haven't had a good rant for quite a while, so here goes:

I'm a bit of a retro specimen, being able (inter alia) to recall octal
opcodes from the ICT 1900 series (070=call, 072=exit, 074=branch, ...)
but nowadays I regard continued usage of octal as a pox and a
pestilence.

1. Octal notation is of use to systems programmers on computers where
the number of bits in a word is a multiple of 3. Are there any still in
production use? AFAIK word sizes were 12, 24, 36, 48, and 60 bits --
all multiples of 4, so hexadecimal could be used.

2. Consider the effect on the newbie who's never even heard of "octal":
File "<stdin>", line 1
datetime.date(2005,09,09)
^
SyntaxError: invalid token

[straight out of the "BOFH Manual of Po-faced Error Messages"]

3. Consider this extract from the docs for the re module:
"""
\number
Matches the contents of the group of the same number. Groups are
numbered starting from 1. For example, (.+) \1 matches 'the the' or '55
55', but not 'the end' (note the space after the group). This special
sequence can only be used to match one of the first 99 groups. If the
first digit of number is 0, or number is 3 octal digits long, it will
not be interpreted as a group match, but as the character with octal
value number. Inside the "[" and "]" of a character class, all numeric
escapes are treated as characters.
"""

I helped to straighten out this description a few years ago, but I fear
it's still not 100% accurate. Worse, take a peek at the code necessary
to implement this.

===

We (un-Pythonically) implicitly take a leading zero (or even just
\[0-7]) as meaning octal, instead of requiring something explicit as
with hexadecimal. The variable length idea in strings doesn't help:
"\18", "\128" and "\1238" are all strings of length 2.

I don't see any mention of octal in GvR's "Python Regrets" or AMK's
"PEP 3000". Why not? Is it not regretted?

PJDM · Jan 13, 2005

John said:
1. Octal notation is of use to systems programmers on computers where
the number of bits in a word is a multiple of 3. Are there any still in
production use? AFAIK word sizes were 12, 24, 36, 48, and 60 bits --
all multiples of 4, so hexadecimal could be used.

The PDP-11 was 16 bit, but used octal. With eight registers and eight
addressing modes in instructions, octal was a convenient base. (On the
11/70, the front switches were marked in alternate pink and purple

groups of three. said:
I don't see any mention of octal in GvR's "Python Regrets" or AMK's
"PEP 3000". Why not? Is it not regretted?

I suspect that, as in other places, Python just defers to the
underlying implementation. Since the C libraries treat "0n" as octal,
so does Python. (So does JavaScript.) More "it sucks, but it's too late
to change" than regrets.

Maybe P3K will have an integer literal like "n_b" for "the integer n in
base b".

PJDM

Stephen Thorne · Jan 13, 2005

Maybe P3K will have an integer literal like "n_b" for "the integer n in
base b".

I would actually like to see pychecker pick up conceptual errors like this:

import datetime
datetime.datetime(2005, 04,04)

Regards,
Stephen Thorne

and-google · Jan 13, 2005

John said:
I regard continued usage of octal as a pox and a pestilence.

Quite agree. I was disappointed that it ever made it into Python.

Octal's only use is:

a) umasks
b) confusing the hell out of normal non-programmers for whom a
leading zero is in no way magic

(a) does not outweigh (b).

In Mythical Future Python I would like to be able to use any base in
integer literals, which would be better. Example random syntax:

flags= 2x00011010101001
umask= 8x664
answer= 10x42
addr= 16x0E800004 # 16x == 0x
gunk= 36x8H6Z9A0X

But either way, I want rid of 0->octal.

Is it not regretted?

Maybe the problem just doesn't occur to people who have used C too
long.

OT: Also, if Google doesn't stop lstrip()ing my posts I may have to get
a proper news feed. What use is that on a Python newsgroup? Grr.

Tim Roberts · Jan 13, 2005

Stephen Thorne said:
I would actually like to see pychecker pick up conceptual errors like this:

import datetime
datetime.datetime(2005, 04,04)

Why is that a conceptual error? Syntactically, this could be a valid call
to a function. Even if you have parsed and executed datetime, so that you
know datetime.datetime is a class, it's quite possible that the creation
and destruction of an object might have useful side effects.

Peter Hansen · Jan 13, 2005

In Mythical Future Python I would like to be able to use any base in
integer literals, which would be better. Example random syntax:

flags= 2x00011010101001
umask= 8x664
answer= 10x42
addr= 16x0E800004 # 16x == 0x
gunk= 36x8H6Z9A0X

I think I kinda like this idea. Allowing arbitrary values,
however, would probably be pointless, as there are very
few bases in common enough use that a language should make
it easy to write literals in any of them. So I think "36x"
is silly, and would suggest limiting this to 2, 8, 10, and
16. At the very least, a range of 2-16 should be used.
(It would be cute but pointless to allow 1x000000000.

-Peter

Guest · Jan 13, 2005

Peter said:
I think I kinda like this idea. Allowing arbitrary values,
however, would probably be pointless, as there are very
few bases in common enough use that a language should make
it easy to write literals in any of them. So I think "36x"
is silly, and would suggest limiting this to 2, 8, 10, and
16. At the very least, a range of 2-16 should be used.
(It would be cute but pointless to allow 1x000000000.

-Peter

How about base 24 and 60, for hours and minutes/seconds?

Steve Holden · Jan 13, 2005

John Machin wrote:

Quite agree. I was disappointed that it ever made it into Python.

Octal's only use is:

a) umasks
b) confusing the hell out of normal non-programmers for whom a
leading zero is in no way magic

(a) does not outweigh (b).

In Mythical Future Python I would like to be able to use any base in
integer literals, which would be better. Example random syntax:

flags= 2x00011010101001
umask= 8x664
answer= 10x42
addr= 16x0E800004 # 16x == 0x
gunk= 36x8H6Z9A0X

But either way, I want rid of 0->octal.

Maybe the problem just doesn't occur to people who have used C too
long.

OT: Also, if Google doesn't stop lstrip()ing my posts I may have to get
a proper news feed. What use is that on a Python newsgroup? Grr.

I remember using a langauge (Icon?) in which arbitrary bases up to 36
could be used with numeric literals. IIRC, the literals had to begin
with the base in decimnal, folowed by a "b" followed by the digits of
the value using a through z for digits from ten to thirty-five. So

gunk = 36b8H6Z9A0X

would have been valid.

nothing-new-under-the-sun-ly y'rs - steve

Dan Sommers · Jan 13, 2005

I remember using a langauge (Icon?) in which arbitrary bases up to 36
could be used with numeric literals. IIRC, the literals had to begin
with the base in decimnal, folowed by a "b" followed by the digits of
the value using a through z for digits from ten to thirty-five. So

gunk = 36b8H6Z9A0X

would have been valid.

Lisp also allows for literals in bases from 2 through 36.

Lisp also allows programs to change the default (away from decimal), so
that an "identifier" like aa is read by the parser as a numeric constant
with the decimal value of 170. Obviously, this has to be used with
care, but makes reading external data files written in strange bases
very easy.

nothing-new-under-the-sun-ly y'rs - steve

every-language-wants-to-be-lisp-ly y'rs,
Dan

Leif K-Brooks · Jan 13, 2005

Tim said:
Why is that a conceptual error? Syntactically, this could be a valid call
to a function. Even if you have parsed and executed datetime, so that you
know datetime.datetime is a class, it's quite possible that the creation
and destruction of an object might have useful side effects.

I'm guessing that Stephen is saying that PyChecker should have special
knowledge of the datetime module and of the fact that dates are often
specified with a leading zero, and therefor complain that they shouldn't
be used that way in Python source code.

Bengt Richter · Jan 13, 2005

I think I kinda like this idea. Allowing arbitrary values,
however, would probably be pointless, as there are very
few bases in common enough use that a language should make
it easy to write literals in any of them. So I think "36x"
is silly, and would suggest limiting this to 2, 8, 10, and
16. At the very least, a range of 2-16 should be used.
(It would be cute but pointless to allow 1x000000000.

My concern is negative numbers when you are interested in the
bits of a typical twos-complement number. (BTW, please don't tell me
that's platform-specific hardware oriented stuff: Two's complement is
a fine abstraction for interpreting a bit vector, which is another
fine abstraction ;-)

One way to do it consistently is to have a sign digit as the first
digit after the x, which is either 0 or base-1 -- e.g., +3 and -3 would be

2x011 2x101
8x03 8x75
16x03 16xfd
10x03 10x97

Then the "sign digit" can be extended indefinitely to the left without
changing the value, noting that -3 == 97-100 == 997-1000) and similarly
for other bases: -3 == binary 101-1000 == 111101-1000000 etc. IOW, you just
subtract base**<number of digits in representation> if the first digit
is base-1, to get the negative value.

This would let us have a %<width>.<base>b format to generate the digits part
and would get around the ugly hack for writing hex literals of negative numbers.

def __repr__(self): return '<%s object at %08.16b>' %(type(self).__name__, id(self))

and you could write based literals in the above formats with e.g., with
'16x%.16b' % number
or
'2x%.2b' % number
etc.

Regards,
Bengt Richter

Jeff Epler · Jan 13, 2005

One way to do it consistently is to have a sign digit as the first
digit after the x, which is either 0 or base-1 -- e.g., +3 and -3 would be

2x011 2x101
8x03 8x75
16x03 16xfd
10x03 10x97

... so that 0x8 and 16x8 would be different? So that 2x1 and 2x01 would
be different?

Jeff

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFB5weEJd01MZaTXX0RAvDCAJ46GzXpJobv6dSiIKNGmyelRerFPwCfXmRs
9i51i1LelXbZO26izFwYv58=
=vjU0
-----END PGP SIGNATURE-----

Bengt Richter · Jan 14, 2005

--LQksG6bCIzRHxTLp
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

=2E.. so that 0x8 and 16x8 would be different? So that 2x1 and 2x01 would
be different?

I guess I didn't make the encoding clear enough, sorry ;-/

16x8 would be illegal as a literal, since the first digit after x is not 0 or f (base-1)
0x8 would be spelled 16x08 in <base>x format.

2x1 _could_ be allowed as a degenerate form of the general negative format for -1, where
any number of leading base-1 "sign digits" compute to the same value. Hence

2x1 == 1*2**0 - 2**1 == -1
2x11 == 1*2**0 + 1*2**1 - 2**2 == 1 + 2 -4 == -1
2x111 == 1*2**0 + 1*2**1 + 1*2**2 - 2**4 == 1 + 2 + 4 - 8 == -1

16f == 15*16**0 - 16**1 = 15 -16 == -1
16ff == 15*6**0 + 15*16**1 - 16**2 = 15 + 240 - 256 == -1

etc. Although IMO it will be more consistent to require a leading "sign digit" even if redundant.
The one exception would be zero, since 00 doesn't look too cool in company with a collection
of other numbers if output minimally without their (known) base prefixes, e.g.,
-2,-1,0,1,2 => 110 11 0 01 010

Of course these numbers can be output in constant width with leading pads of 0 or base-1 digit according
to sign, e.g., 1110 1111 0000 0001 0010 for the same set width 4. That would be format '%04.2b' and
if you wanted the literal-making prefix, '2x%04.2b' would produce legal fixed width literals 2x1110 2x1111 etc.

Regards,
Bengt Richter

Simon Brunning · Jan 14, 2005

I'm guessing that Stephen is saying that PyChecker should have special
knowledge of the datetime module and of the fact that dates are often
specified with a leading zero, and therefor complain that they shouldn't
be used that way in Python source code.

It would be useful if PyChecker warned you when you specify an octal
literal and where the value would differ from what you might expect if
you didn't realise that you were specifying an octal literal.

x = 04 # This doesn't need a warning: 04 == 4
#x = 09 # This doesn't need a warning: it will fail to compile
x= 012 # This *does* need a warning: 012 == 10

Reinhold Birkenfeld · Jan 14, 2005

Simon said:
It would be useful if PyChecker warned you when you specify an octal
literal and where the value would differ from what you might expect if
you didn't realise that you were specifying an octal literal.

x = 04 # This doesn't need a warning: 04 == 4
#x = 09 # This doesn't need a warning: it will fail to compile
x= 012 # This *does* need a warning: 012 == 10

Well, this would generate warnings for all octal literals except
01, 02, 03, 04, 05, 06 and 07.

However, I would vote +1 for adding such an option to PyChecker. For
code that explicitly uses octals, it can be turned off and it is _very_
confusing to newbies...

Reinhold

Reinhold Birkenfeld · Jan 14, 2005

Bengt said:
My concern is negative numbers when you are interested in the
bits of a typical twos-complement number. (BTW, please don't tell me
that's platform-specific hardware oriented stuff: Two's complement is
a fine abstraction for interpreting a bit vector, which is another
fine abstraction ;-)

One way to do it consistently is to have a sign digit as the first
digit after the x, which is either 0 or base-1 -- e.g., +3 and -3 would be

2x011 2x101
8x03 8x75
16x03 16xfd
10x03 10x97

Why not just -2x11? IMHO, Py2.4 does not produce negative values out of
hex or oct literals any longer, so your proposal would be inconsistent.

Reinhold

JCM · Jan 14, 2005

In Mythical Future Python I would like to be able to use any base in
integer literals, which would be better. Example random syntax:

flags= 2x00011010101001
umask= 8x664
answer= 10x42
addr= 16x0E800004 # 16x == 0x
gunk= 36x8H6Z9A0X

I'd prefer using the leftmost character as a two's complement
extension bit.

0x1 : 1 in hex notation
1xf : -1 in hex notation, or conceptually an infinitely long string of 1s
0c12 : 10 in octal noataion
1c12 : -54 in octal (I think)
0d12 : 12 in decimal
0b10 : 2 in binary
etc

I leave it to the reader to decide whether I'm joking.

Python Unicode handling wins again -- mostly	67	Nov 30, 2013
csv and mixed lists of unicode and numbers	6	Nov 24, 2009
unicode "table of character" implementation in python	5	Aug 22, 2006
content-type and unicode	15	Apr 8, 2007
Problem with PropertyResourceBundle and unicode	2	Mar 14, 2005
saving unicode in oracle using asp	1	Oct 5, 2005
HOWTO: Parsing email using Python part2	1	Jul 15, 2011
TextStreamReader with transparent unicode BOM Support	3	Jul 2, 2003

unicode mystery

Sean McIlroy

John Lenton

John Machin

John Machin

PJDM

Stephen Thorne

and-google

Tim Roberts

Peter Hansen

Guest

Steve Holden

Dan Sommers

Leif K-Brooks

Bengt Richter

Jeff Epler

Bengt Richter

Simon Brunning

Reinhold Birkenfeld

Reinhold Birkenfeld

JCM

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads