unicode mystery

S

Sean McIlroy

I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers
of the extra characters you need for French, and I found them all just
fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)

Peace,
STM

PS: I'm considering looking into pyscript as a means of making
diagrams for inclusion in LaTeX documents. If anyone can share an
opinion about pyscript, I'm interested to hear it.

Peace
 
J

John Lenton

I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers
of the extra characters you need for French, and I found them all just
fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)

Å“ isn't part of ISO 8859-1, so you can't get it that way. You can do
one of

u'\u0153'

or, if you must,

unicode("\305\223", "utf-8")

--
John Lenton ([email protected]) -- Random fortune:
Lisp, Lisp, Lisp Machine,
Lisp Machine is Fun.
Lisp, Lisp, Lisp Machine,
Fun for everyone.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFB42K4gPqu395ykGsRAuYHAKCWQPoNdtAaBm6XeKqN4/cdsVIhJgCggMRq
NlFH8U/HGRTNkYrZsFCulVg=
=47J7
-----END PGP SIGNATURE-----
 
J

John Machin

Sean said:
I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers
of the extra characters you need for French, and I found them all just
fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)

Characters that are in iso-8859-1 are mapped directly into Unicode.
That is, the first 256 characters of Unicode are identical to
iso-8859-1.

Consider this:
231

What you did with c_cedilla "worked" because it was effectively doing
nothing. However if you do unicode(char, encoding) where char is not in
encoding, it won't "work".

As John Lenton has pointed out, if you find a character in the Unicode
tables, you can just use it directly. There is no need in this
circumstance to use unicode().

HTH,
John
 
J

John Machin

Some poster wrote (in connexion with another topic):
... unicode("\347", "iso-8859-1") ...

Well, I haven't had a good rant for quite a while, so here goes:

I'm a bit of a retro specimen, being able (inter alia) to recall octal
opcodes from the ICT 1900 series (070=call, 072=exit, 074=branch, ...)
but nowadays I regard continued usage of octal as a pox and a
pestilence.

1. Octal notation is of use to systems programmers on computers where
the number of bits in a word is a multiple of 3. Are there any still in
production use? AFAIK word sizes were 12, 24, 36, 48, and 60 bits --
all multiples of 4, so hexadecimal could be used.

2. Consider the effect on the newbie who's never even heard of "octal":
File "<stdin>", line 1
datetime.date(2005,09,09)
^
SyntaxError: invalid token

[straight out of the "BOFH Manual of Po-faced Error Messages"]

3. Consider this extract from the docs for the re module:
"""
\number
Matches the contents of the group of the same number. Groups are
numbered starting from 1. For example, (.+) \1 matches 'the the' or '55
55', but not 'the end' (note the space after the group). This special
sequence can only be used to match one of the first 99 groups. If the
first digit of number is 0, or number is 3 octal digits long, it will
not be interpreted as a group match, but as the character with octal
value number. Inside the "[" and "]" of a character class, all numeric
escapes are treated as characters.
"""

I helped to straighten out this description a few years ago, but I fear
it's still not 100% accurate. Worse, take a peek at the code necessary
to implement this.

===

We (un-Pythonically) implicitly take a leading zero (or even just
\[0-7]) as meaning octal, instead of requiring something explicit as
with hexadecimal. The variable length idea in strings doesn't help:
"\18", "\128" and "\1238" are all strings of length 2.

I don't see any mention of octal in GvR's "Python Regrets" or AMK's
"PEP 3000". Why not? Is it not regretted?
 
P

PJDM

John said:
1. Octal notation is of use to systems programmers on computers where
the number of bits in a word is a multiple of 3. Are there any still in
production use? AFAIK word sizes were 12, 24, 36, 48, and 60 bits --
all multiples of 4, so hexadecimal could be used.

The PDP-11 was 16 bit, but used octal. With eight registers and eight
addressing modes in instructions, octal was a convenient base. (On the
11/70, the front switches were marked in alternate pink and purple
groups of three. said:
I don't see any mention of octal in GvR's "Python Regrets" or AMK's
"PEP 3000". Why not? Is it not regretted?

I suspect that, as in other places, Python just defers to the
underlying implementation. Since the C libraries treat "0n" as octal,
so does Python. (So does JavaScript.) More "it sucks, but it's too late
to change" than regrets.

Maybe P3K will have an integer literal like "n_b" for "the integer n in
base b".

PJDM
 
S

Stephen Thorne

Maybe P3K will have an integer literal like "n_b" for "the integer n in
base b".

I would actually like to see pychecker pick up conceptual errors like this:

import datetime
datetime.datetime(2005, 04,04)

Regards,
Stephen Thorne
 
A

and-google

John said:
I regard continued usage of octal as a pox and a pestilence.

Quite agree. I was disappointed that it ever made it into Python.

Octal's only use is:

a) umasks
b) confusing the hell out of normal non-programmers for whom a
leading zero is in no way magic

(a) does not outweigh (b).

In Mythical Future Python I would like to be able to use any base in
integer literals, which would be better. Example random syntax:

flags= 2x00011010101001
umask= 8x664
answer= 10x42
addr= 16x0E800004 # 16x == 0x
gunk= 36x8H6Z9A0X

But either way, I want rid of 0->octal.
Is it not regretted?

Maybe the problem just doesn't occur to people who have used C too
long.

OT: Also, if Google doesn't stop lstrip()ing my posts I may have to get
a proper news feed. What use is that on a Python newsgroup? Grr.
 
T

Tim Roberts

Stephen Thorne said:
I would actually like to see pychecker pick up conceptual errors like this:

import datetime
datetime.datetime(2005, 04,04)

Why is that a conceptual error? Syntactically, this could be a valid call
to a function. Even if you have parsed and executed datetime, so that you
know datetime.datetime is a class, it's quite possible that the creation
and destruction of an object might have useful side effects.
 
P

Peter Hansen

In Mythical Future Python I would like to be able to use any base in
integer literals, which would be better. Example random syntax:

flags= 2x00011010101001
umask= 8x664
answer= 10x42
addr= 16x0E800004 # 16x == 0x
gunk= 36x8H6Z9A0X

I think I kinda like this idea. Allowing arbitrary values,
however, would probably be pointless, as there are very
few bases in common enough use that a language should make
it easy to write literals in any of them. So I think "36x"
is silly, and would suggest limiting this to 2, 8, 10, and
16. At the very least, a range of 2-16 should be used.
(It would be cute but pointless to allow 1x000000000. :)

-Peter
 
G

Guest

Peter said:
I think I kinda like this idea. Allowing arbitrary values,
however, would probably be pointless, as there are very
few bases in common enough use that a language should make
it easy to write literals in any of them. So I think "36x"
is silly, and would suggest limiting this to 2, 8, 10, and
16. At the very least, a range of 2-16 should be used.
(It would be cute but pointless to allow 1x000000000. :)

-Peter
How about base 24 and 60, for hours and minutes/seconds?
 
S

Steve Holden

John Machin wrote:




Quite agree. I was disappointed that it ever made it into Python.

Octal's only use is:

a) umasks
b) confusing the hell out of normal non-programmers for whom a
leading zero is in no way magic

(a) does not outweigh (b).

In Mythical Future Python I would like to be able to use any base in
integer literals, which would be better. Example random syntax:

flags= 2x00011010101001
umask= 8x664
answer= 10x42
addr= 16x0E800004 # 16x == 0x
gunk= 36x8H6Z9A0X

But either way, I want rid of 0->octal.




Maybe the problem just doesn't occur to people who have used C too
long.
:)

OT: Also, if Google doesn't stop lstrip()ing my posts I may have to get
a proper news feed. What use is that on a Python newsgroup? Grr.

I remember using a langauge (Icon?) in which arbitrary bases up to 36
could be used with numeric literals. IIRC, the literals had to begin
with the base in decimnal, folowed by a "b" followed by the digits of
the value using a through z for digits from ten to thirty-five. So

gunk = 36b8H6Z9A0X

would have been valid.

nothing-new-under-the-sun-ly y'rs - steve
 
D

Dan Sommers

I remember using a langauge (Icon?) in which arbitrary bases up to 36
could be used with numeric literals. IIRC, the literals had to begin
with the base in decimnal, folowed by a "b" followed by the digits of
the value using a through z for digits from ten to thirty-five. So
gunk = 36b8H6Z9A0X
would have been valid.

Lisp also allows for literals in bases from 2 through 36.

Lisp also allows programs to change the default (away from decimal), so
that an "identifier" like aa is read by the parser as a numeric constant
with the decimal value of 170. Obviously, this has to be used with
care, but makes reading external data files written in strange bases
very easy.
nothing-new-under-the-sun-ly y'rs - steve

every-language-wants-to-be-lisp-ly y'rs,
Dan
 
L

Leif K-Brooks

Tim said:
Why is that a conceptual error? Syntactically, this could be a valid call
to a function. Even if you have parsed and executed datetime, so that you
know datetime.datetime is a class, it's quite possible that the creation
and destruction of an object might have useful side effects.

I'm guessing that Stephen is saying that PyChecker should have special
knowledge of the datetime module and of the fact that dates are often
specified with a leading zero, and therefor complain that they shouldn't
be used that way in Python source code.
 
B

Bengt Richter

I think I kinda like this idea. Allowing arbitrary values,
however, would probably be pointless, as there are very
few bases in common enough use that a language should make
it easy to write literals in any of them. So I think "36x"
is silly, and would suggest limiting this to 2, 8, 10, and
16. At the very least, a range of 2-16 should be used.
(It would be cute but pointless to allow 1x000000000. :)
My concern is negative numbers when you are interested in the
bits of a typical twos-complement number. (BTW, please don't tell me
that's platform-specific hardware oriented stuff: Two's complement is
a fine abstraction for interpreting a bit vector, which is another
fine abstraction ;-)

One way to do it consistently is to have a sign digit as the first
digit after the x, which is either 0 or base-1 -- e.g., +3 and -3 would be

2x011 2x101
8x03 8x75
16x03 16xfd
10x03 10x97

Then the "sign digit" can be extended indefinitely to the left without
changing the value, noting that -3 == 97-100 == 997-1000) and similarly
for other bases: -3 == binary 101-1000 == 111101-1000000 etc. IOW, you just
subtract base**<number of digits in representation> if the first digit
is base-1, to get the negative value.

This would let us have a %<width>.<base>b format to generate the digits part
and would get around the ugly hack for writing hex literals of negative numbers.

def __repr__(self): return '<%s object at %08.16b>' %(type(self).__name__, id(self))

and you could write based literals in the above formats with e.g., with
'16x%.16b' % number
or
'2x%.2b' % number
etc.

Regards,
Bengt Richter
 
J

Jeff Epler

One way to do it consistently is to have a sign digit as the first
digit after the x, which is either 0 or base-1 -- e.g., +3 and -3 would be

2x011 2x101
8x03 8x75
16x03 16xfd
10x03 10x97

... so that 0x8 and 16x8 would be different? So that 2x1 and 2x01 would
be different?

Jeff

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFB5weEJd01MZaTXX0RAvDCAJ46GzXpJobv6dSiIKNGmyelRerFPwCfXmRs
9i51i1LelXbZO26izFwYv58=
=vjU0
-----END PGP SIGNATURE-----
 
B

Bengt Richter

--LQksG6bCIzRHxTLp
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable



=2E.. so that 0x8 and 16x8 would be different? So that 2x1 and 2x01 would
be different?

I guess I didn't make the encoding clear enough, sorry ;-/

16x8 would be illegal as a literal, since the first digit after x is not 0 or f (base-1)
0x8 would be spelled 16x08 in <base>x format.

2x1 _could_ be allowed as a degenerate form of the general negative format for -1, where
any number of leading base-1 "sign digits" compute to the same value. Hence

2x1 == 1*2**0 - 2**1 == -1
2x11 == 1*2**0 + 1*2**1 - 2**2 == 1 + 2 -4 == -1
2x111 == 1*2**0 + 1*2**1 + 1*2**2 - 2**4 == 1 + 2 + 4 - 8 == -1

16f == 15*16**0 - 16**1 = 15 -16 == -1
16ff == 15*6**0 + 15*16**1 - 16**2 = 15 + 240 - 256 == -1

etc. Although IMO it will be more consistent to require a leading "sign digit" even if redundant.
The one exception would be zero, since 00 doesn't look too cool in company with a collection
of other numbers if output minimally without their (known) base prefixes, e.g.,
-2,-1,0,1,2 => 110 11 0 01 010

Of course these numbers can be output in constant width with leading pads of 0 or base-1 digit according
to sign, e.g., 1110 1111 0000 0001 0010 for the same set width 4. That would be format '%04.2b' and
if you wanted the literal-making prefix, '2x%04.2b' would produce legal fixed width literals 2x1110 2x1111 etc.

Regards,
Bengt Richter
 
S

Simon Brunning

I'm guessing that Stephen is saying that PyChecker should have special
knowledge of the datetime module and of the fact that dates are often
specified with a leading zero, and therefor complain that they shouldn't
be used that way in Python source code.

It would be useful if PyChecker warned you when you specify an octal
literal and where the value would differ from what you might expect if
you didn't realise that you were specifying an octal literal.

x = 04 # This doesn't need a warning: 04 == 4
#x = 09 # This doesn't need a warning: it will fail to compile
x= 012 # This *does* need a warning: 012 == 10
 
R

Reinhold Birkenfeld

Simon said:
It would be useful if PyChecker warned you when you specify an octal
literal and where the value would differ from what you might expect if
you didn't realise that you were specifying an octal literal.

x = 04 # This doesn't need a warning: 04 == 4
#x = 09 # This doesn't need a warning: it will fail to compile
x= 012 # This *does* need a warning: 012 == 10

Well, this would generate warnings for all octal literals except
01, 02, 03, 04, 05, 06 and 07.

However, I would vote +1 for adding such an option to PyChecker. For
code that explicitly uses octals, it can be turned off and it is _very_
confusing to newbies...

Reinhold
 
R

Reinhold Birkenfeld

Bengt said:
My concern is negative numbers when you are interested in the
bits of a typical twos-complement number. (BTW, please don't tell me
that's platform-specific hardware oriented stuff: Two's complement is
a fine abstraction for interpreting a bit vector, which is another
fine abstraction ;-)

One way to do it consistently is to have a sign digit as the first
digit after the x, which is either 0 or base-1 -- e.g., +3 and -3 would be

2x011 2x101
8x03 8x75
16x03 16xfd
10x03 10x97

Why not just -2x11? IMHO, Py2.4 does not produce negative values out of
hex or oct literals any longer, so your proposal would be inconsistent.

Reinhold
 
J

JCM

In Mythical Future Python I would like to be able to use any base in
integer literals, which would be better. Example random syntax:
flags= 2x00011010101001
umask= 8x664
answer= 10x42
addr= 16x0E800004 # 16x == 0x
gunk= 36x8H6Z9A0X

I'd prefer using the leftmost character as a two's complement
extension bit.

0x1 : 1 in hex notation
1xf : -1 in hex notation, or conceptually an infinitely long string of 1s
0c12 : 10 in octal noataion
1c12 : -54 in octal (I think)
0d12 : 12 in decimal
0b10 : 2 in binary
etc

I leave it to the reader to decide whether I'm joking.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top