a simple unicode question

George Trojan

A trivial one, this is the first time I have to deal with Unicode. I am
trying to parse a string s='''48° 13' 16.80" N'''. I know the charset is
"iso-8859-1". To get the degrees I did
>>> encoding='iso-8859-1'
>>> q=s.decode(encoding)
>>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80"', u'N']
>>> r=q.split()[0]
>>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

George
 
Diez B. Roggisch

George said:
A trivial one, this is the first time I have to deal with Unicode. I am
trying to parse a string s='''48° 13' 16.80" N'''. I know the charset is
"iso-8859-1". To get the degrees I did
encoding='iso-8859-1'
q=s.decode(encoding)
q.split()
[u'48\xc2\xb0', u"13'", u'16.80"', u'N']
r=q.split()[0]
int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

Instead of this rather convoluted way to specify a degree-sign, better do

# -*- coding: utf-8 -*-
...
int(r[:r.find(u"°")])


Please note that the utf-8 encoding has *nothing* to do with your string
- it's just the source-file encoding. Of course your editor must use
utf-8 when saving the file. Or you can use any other encoding you like,
as long as the declaration matches it.

Diez
 
beSTEfar

A trivial one, this is the first time I have to deal with Unicode. I am
trying to parse a string s='''48° 13' 16.80" N'''. I know the charset is
"iso-8859-1". To get the degrees I did
 >>> encoding='iso-8859-1'
 >>> q=s.decode(encoding)
 >>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80"', u'N']
 >>> r=q.split()[0]
 >>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

George

When parsing strings, use Regular Expressions. If you don't know how
to, spend some time teaching yourself; it is time well spent. A
great tool for playing around with REs is KODOS.

For the problem at hand you can e.g.:

import re
degrees = int(re.findall(r'\d+', s)[0])

That finds every run of consecutive digits, takes the first one and
passes it to int(). There is no need to know or care that the string
is Unicode, or which charset encoding lies underneath.
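
For readers on Python 3, here is a minimal sketch of the same idea that also pulls out minutes, seconds and hemisphere. The pattern and the `parse_dms` helper are illustrative assumptions, not something from the post, and the pattern only covers strings shaped like the one in the question.

```python
import re

DEGREE = "\N{DEGREE SIGN}"  # U+00B0, written in pure ASCII

# Hypothetical pattern for strings like: 48° 13' 16.80" N
DMS_PATTERN = re.compile(
    r"(\d+)" + re.escape(DEGREE) + r'''\s*(\d+)'\s*(\d+(?:\.\d+)?)"\s*([NSEW])'''
)

def parse_dms(text):
    """Return (degrees, minutes, seconds, hemisphere) or raise ValueError."""
    match = DMS_PATTERN.search(text)
    if match is None:
        raise ValueError("not a D/M/S coordinate: %r" % (text,))
    deg, minutes, seconds, hemi = match.groups()
    return int(deg), int(minutes), float(seconds), hemi

s = "48" + DEGREE + " 13' 16.80\" N"

# The one-liner from the post: first run of digits in the string.
degrees = int(re.findall(r"\d+", s)[0])

# The fuller, assumed pattern.
parsed = parse_dms(s)
print(degrees, parsed)
```

The one-liner is enough when only the degrees matter; the longer pattern fails loudly on input that does not look like a coordinate.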
 
Mark Tolonen

George Trojan said:
A trivial one, this is the first time I have to deal with Unicode. I am
trying to parse a string s='''48° 13' 16.80" N'''. I know the charset is
"iso-8859-1". To get the degrees I did
encoding='iso-8859-1'
q=s.decode(encoding)
q.split()
[u'48\xc2\xb0', u"13'", u'16.80"', u'N']
r=q.split()[0]
int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you
type non-ASCII characters in source code, make sure to declare the encoding
the file is *actually* saved in:

# coding: utf-8

s = '''48° 13' 16.80" N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80" N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)
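
The diagnosis can be checked with a short Python 3 sketch (an assumption on my part that a Python 3 reader wants to reproduce it; the thread itself uses Python 2). It decodes the same bytes both ways and shows why the ISO-8859-1 guess produced two characters where one was expected.

```python
# The bytes that were actually in the string: C2 B0 is UTF-8 for U+00B0.
raw = b"48\xc2\xb0 13' 16.80\" N"

as_latin1 = raw.decode("iso-8859-1")  # wrong guess: one character per byte
as_utf8 = raw.decode("utf-8")         # right guess: C2 B0 becomes one character

print(ascii(as_latin1.split()[0]))    # two stray characters after the digits
print(ascii(as_utf8.split()[0]))      # a single DEGREE SIGN after the digits

# With the right decoding, finding the degree sign is straightforward.
degrees = int(as_utf8[:as_utf8.find("\N{DEGREE SIGN}")])
print(degrees)
```

ISO-8859-1 never fails to decode, because every byte maps to some character, so a wrong guess shows up only as odd extra characters like these.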

-Mark
 
George Trojan

Thanks for all suggestions. It took me a while to find out how to
configure my keyboard to be able to type the degree sign. I prefer to
stick with pure ASCII if possible.
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Is that the place to look?

George
Mark said:
Is there a better way of getting the degrees?

It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If
you type non-ASCII characters in source code, make sure to declare the
encoding the file is *actually* saved in:

# coding: utf-8

s = '''48° 13' 16.80" N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80" N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)

Mark is right about the source, but you needn't write unicode source
to process unicode data. Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.313' 16.80" N

And if you are unsure of the name to use:
'DEGREE SIGN'
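
A minimal sketch of the named-escape approach described here, written for Python 3 (where every string literal is Unicode, so no `u` prefix is needed). It is an illustration under that assumption, not the session from the post.

```python
import unicodedata

# \N{...} names the character, so the source file stays pure ASCII.
s = "48\N{DEGREE SIGN} 13' 16.80\" N"
print(s)

# Going the other way: ask for the name of a character you already have.
name = unicodedata.name("\xb0")
print(name)
```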

--Scott David Daniels
 
Nobody

Thanks for all suggestions. It took me a while to find out how to
configure my keyboard to be able to type the degree sign. I prefer to
stick with pure ASCII if possible.
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Is that the place to look?

You can get them from the unicodedata module, e.g.:

import unicodedata
for i in xrange(0x10000):
    n = unicodedata.name(unichr(i),None)
    if n is not None:
        print i, n
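
A rough Python 3 equivalent of the loop above, as a sketch: `xrange` becomes `range`, `unichr` becomes `chr`, and `print` is a function. The `named_characters` generator is a hypothetical helper introduced only for this illustration.

```python
import unicodedata

def named_characters(limit=0x10000):
    """Yield (code point, name) for every named character below limit."""
    for code in range(limit):
        # The second argument is returned instead of raising ValueError
        # when the character has no name (e.g. control characters).
        name = unicodedata.name(chr(code), None)
        if name is not None:
            yield code, name

names = dict(named_characters())
print(hex(0xB0), names[0xB0])
```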
 
Martin v. Löwis

Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found

Correct - you are supposed to fill in a Unicode character name into
the \N escape. The specific list of names depends on the version of
the UCD that was used to build the specific Python version, but the
characters you are likely interested in have probably been defined
"forever".
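
To see which UCD version a given interpreter carries, and whether it knows a particular name, something like the following Python 3 sketch should work. `lookup_or_none` is a hypothetical helper, and the second name is deliberately made up to show the failure case.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

def lookup_or_none(name):
    """Return the character with this UCD name, or None if it is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None

degree = lookup_or_none("DEGREE SIGN")
missing = lookup_or_none("NO SUCH CHARACTER NAME")
print(ascii(degree), missing)
```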

Regards,
Martin
 
Mark Tolonen

Thanks for all suggestions. It took me a while to find out how to
configure my keyboard to be able to type the degree sign. I prefer to
stick with pure ASCII if possible.
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Is that the place to look?

George
Mark said:
Is there a better way of getting the degrees?

It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If
you type non-ASCII characters in source code, make sure to declare the
encoding the file is *actually* saved in:

# coding: utf-8

s = '''48° 13' 16.80" N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80" N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)

Mark is right about the source, but you needn't write unicode source
to process unicode data. Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.313' 16.80" N

And if you are unsure of the name to use:
'DEGREE SIGN'

It wouldn't be your favorite way if you were typing Chinese:

x = u'我是美国人。'

vs.

x = (u'\N{CJK UNIFIED IDEOGRAPH-6211}\N{CJK UNIFIED IDEOGRAPH-662F}'
     u'\N{CJK UNIFIED IDEOGRAPH-7F8E}\N{CJK UNIFIED IDEOGRAPH-56FD}'
     u'\N{CJK UNIFIED IDEOGRAPH-4EBA}\N{IDEOGRAPHIC FULL STOP}')

;^) Mark
 
Chris Jones

On Tue, 20 Oct 2009 17:56:21 +0000, George Trojan wrote:
[..]
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined?

You can get them from the unicodedata module, e.g.:

import unicodedata
for i in xrange(0x10000):
    n = unicodedata.name(unichr(i),None)
    if n is not None:
        print i, n

Python rocks!

Just curious, why did you choose to set the upper boundary at 0xffff?

CJ
 
Bruno Desthuilliers

beSTEfar a écrit :
(snip)
> When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful
enough to make REs overkill. And for any complex enough parsing (any
recursive construct for example - think XML, HTML, any programming
language etc), REs are just NOT enough by themselves - you need a full
blown parser.
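
For the degrees question in this thread, the string-method route might look like the Python 3 sketch below. It uses `str.partition`, which is one choice among several and is not taken from any post here.

```python
s = "48\N{DEGREE SIGN} 13' 16.80\" N"

# partition() splits once at the separator and returns three pieces;
# the middle piece is empty when the separator was not found.
head, sep, rest = s.partition("\N{DEGREE SIGN}")
if not sep:
    raise ValueError("no degree sign in %r" % (s,))

degrees = int(head)
print(degrees, repr(rest))
```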
 
Nobody

Python rocks!

Just curious, why did you choose to set the upper boundary at 0xffff?

Characters outside the 16-bit range aren't supported on all builds. They
won't be supported on most Windows builds, as Windows uses 16-bit Unicode
extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and
generate surrogate pairs for them:

>>> u'\N{LINEAR B SYLLABLE B008 A}'
u'\U00010000'
>>> len(_)
2

Whether or not using surrogates in this context is a good idea is open to
debate. What's the advantage of a multi-wchar string over a multi-byte
string?
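
To make the surrogate-pair remark concrete, here is a small Python 3 sketch. Python 3.3 and later no longer have narrow builds, so there the string holds one code point and the pair only appears once it is encoded as UTF-16. `surrogate_pair` is a hypothetical helper that applies the standard UTF-16 arithmetic.

```python
import struct

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

ch = "\N{LINEAR B SYLLABLE B008 A}"   # U+10000, first character past the BMP

# Encode as big-endian UTF-16 and read it back as 16-bit code units.
utf16 = ch.encode("utf-16-be")
units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)

print(len(ch), [hex(u) for u in units], len(ch.encode("utf-8")))
```

One character here costs two 16-bit units in UTF-16 and four bytes in UTF-8, which is the variable-width trade-off the question above is about.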
 
rurpy

beSTEfar a écrit :
(snip)
 > When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful
enough to make REs overkill. And for any complex enough parsing (any
recursive construct for example - think XML, HTML, any programming
language etc), REs are just NOT enough by themselves - you need a full
blown parser.

But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.
 
Terry Reedy

Nobody said:
Just curious, why did you choose to set the upper boundary at 0xffff?

Characters outside the 16-bit range aren't supported on all builds. They
won't be supported on most Windows builds, as Windows uses 16-bit Unicode
extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

In Python 3, if not 2.6, chr(0x10000) (what used to be unichr()) works
fine on Windows, and generates the appropriate surrogate pair.
 
Gabriel Genellina

But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.

I don't think so. Nesting isn't the only problem. RE's cannot handle
comments, for example. And you must support unquoted attributes, single and
double quotes, any attribute ordering, empty tags, arbitrary whitespace...
If you don't, you are not reading XML (or HTML), only a specific file
format that resembles XML but actually isn't.
 
Chris Jones

On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote:

[..]
Characters outside the 16-bit range aren't supported on all builds.
They won't be supported on most Windows builds, as Windows uses 16-bit
Unicode extensively:

I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)

Moot point on xterm anyway, since you'd be hard put to it to find a
decent terminal font that covers anything outside the BMP.

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and
generate surrogate pairs for them:

>>> u'\N{LINEAR B SYLLABLE B008 A}'
u'\U00010000'
>>> len(_)
2

Whether or not using surrogates in this context is a good idea is open to
debate. What's the advantage of a multi-wchar string over a multi-byte
string?

I don't understand this last remark, but since I'm only a GNU/Linux
hobbyist, I guess it doesn't make much difference.

Thanks for the code snippet and comments.

CJ
 
rurpy

I don't think so. Nesting isn't the only problem. RE's cannot handle
comments, for example. And you must support unquoted attributes, single and
double quotes, any attribute ordering, empty tags, arbitrary whitespace...
If you don't, you are not reading XML (or HTML), only a specific file
format that resembles XML but actually isn't.

OK, then let me rephrase my point as: in the real world it is often
not necessary to parse XML in its full generality; parsing, as you
put it, "a specific file format that resembles XML" is all that is
really needed.
 
Gabriel Genellina

OK, then let me rephrase my point as: in the real world it is often
not necessary to parse XML in its full generality; parsing, as you
put it, "a specific file format that resembles XML" is all that is
really needed.

Given that using a real XML parser like ElementTree is as easy as (or even
easier than) building a regular expression, and more robust, and more
likely to survive small changes in the input format, why use the worse
solution?
RE's are good at solving some problems, but parsing XML isn't one of those.
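
A small Python 3 sketch of that comparison. The XML document, the element and attribute names, and the naive regex are all made up for illustration; the point is only that a comment, different quoting and different attribute order trip up the regex but not the parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: one commented-out element, mixed quotes, mixed order.
doc = """<points>
  <!-- <point lat="0" lon="0"/> is commented out -->
  <point lon='16.37'   lat="48.22"/>
  <point lat="40.71" lon="-74.01"></point>
</points>"""

# Naive regex: assumes double quotes, one space, and lat before lon.
naive = re.findall(r'lat="([^"]+)" lon="([^"]+)"', doc)

# ElementTree: comments are skipped, quoting and ordering do not matter.
root = ET.fromstring(doc)
coords = [(float(p.get("lat")), float(p.get("lon"))) for p in root.iter("point")]

print(naive)
print(coords)
```

The regex picks up the commented-out point and misses the one with single quotes, while the parser returns exactly the two real points.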
 
Lie Ryan

Chris said:
On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote:

[..]
Characters outside the 16-bit range aren't supported on all builds.
They won't be supported on most Windows builds, as Windows uses 16-bit
Unicode extensively:

I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)

No, the best part about Unicode is there is no encoding!

Unicode does not define any encoding; what it defines is code points for
characters, and those are independent of how the characters are encoded
in files or in network transmission.
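
The distinction between a code point and its encoded bytes can be shown with a short Python 3 sketch, using the degree sign from the original question. The choice of encodings to compare is mine.

```python
ch = "\N{DEGREE SIGN}"

# One code point...
code_point = ord(ch)
print(hex(code_point))

# ...and several different byte sequences that all stand for it.
encoded = {
    name: ch.encode(name)
    for name in ("iso-8859-1", "utf-8", "utf-16-be", "utf-32-be")
}
for name, data in encoded.items():
    print(name, data)

# Decoding each byte sequence with its own codec gives the same character back.
round_trips = all(data.decode(name) == ch for name, data in encoded.items())
print(round_trips)
```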
 
Chris Jones

Chris Jones wrote:
[..]
Best part of Unicode is that there are multiple encodings, right? ;-)

No, the best part about Unicode is there is no encoding!
Unicode does not define any encoding;

RFC 3629:

"ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32."

what it defines is code-points for characters which is not related to
how characters are encoded in files or network transmission.

In other words, Unicode is "not related to any encoding" .. and yet the
UTF-8, UTF-16.. "encoding forms" are clearly "related" to Unicode.

How is that possible?

CJ
 
