a simple unicode question

George Trojan · Oct 19, 2009

A trivial one: this is the first time I have had to deal with Unicode. I am
trying to parse a string s='''48° 13' 16.80" N'''. I know the charset is
"iso-8859-1". To get the degrees I did

>>> encoding='iso-8859-1'
>>> q=s.decode(encoding)
>>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80"', u'N']
>>> r=q.split()[0]
>>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

George

Diez B. Roggisch · Oct 19, 2009

George said:
Is there a better way of getting the degrees?

Instead of this rather convoluted way to specify a degree-sign, better do

# -*- coding: utf-8 -*-
...
int(r[:r.find(u"°")])

Please note that the utf-8 encoding declared here has *nothing* to do with
your string - it is just the source-file encoding. Of course your editor must
actually save the file as utf-8. Or you can use any other encoding you like,
as long as the declaration matches.

Diez

beSTEfar · Oct 19, 2009

When parsing strings, use regular expressions. If you don't know how
to, spend some time teaching yourself - it is time well spent! A
great tool for playing around with REs is KODOS.

For the problem at hand you can e.g.:

import re
degrees = int(re.findall(r'\d+', s)[0])

In essence this collects every run of consecutive digits, takes the
first run and passes it to int(). There is no need to know or care that
the string is Unicode, or what the underlying charset encoding is.
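A minimal sketch of the regular-expression approach, written in Python 3 syntax (an assumption; the thread itself uses Python 2). The variable names and the second pattern, which also accepts a decimal part, are illustrative and not from the original post:

```python
import re

# Assumption: Python 3, so this literal is already a Unicode string.
s = '48\N{DEGREE SIGN} 13\' 16.80" N'

# re.search stops at the first run of digits instead of collecting them all.
match = re.search(r'\d+', s)
degrees = int(match.group()) if match else None

# All numeric fields at once; (?:\.\d+)? keeps the fractional seconds together.
fields = re.findall(r'\d+(?:\.\d+)?', s)

print(degrees, fields)
```

Using re.search rather than re.findall avoids an IndexError when the string contains no digits at all; the caller gets None instead.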

Mark Tolonen · Oct 20, 2009

George Trojan said:
Is there a better way of getting the degrees?

It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you
type non-ASCII characters in source code, make sure to declare the encoding
the file is *actually* saved in:

# coding: utf-8

s = '''48° 13' 16.80" N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80" N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)

-Mark
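To make the diagnosis concrete, here is a small sketch in Python 3 syntax (an assumption; the posts above are Python 2) of what happens when UTF-8 bytes are decoded with the wrong charset. The names raw, wrong and right are illustrative:

```python
# The text as intended, and the UTF-8 bytes it is stored as.
text = '48\N{DEGREE SIGN} 13\' 16.80" N'
raw = text.encode('utf-8')

# Decoding as iso-8859-1 maps every byte to one character, so the
# two-byte sequence C2 B0 turns into two separate characters.
wrong = raw.decode('iso-8859-1')

# Decoding as UTF-8 turns C2 B0 back into a single DEGREE SIGN.
right = raw.decode('utf-8')

print(ascii(wrong.split()[0]))
print(ascii(right.split()[0]))
```

The first result matches the u'48\xc2\xb0' seen in the original session, which is why the find() on '\xc2' happened to work.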

George Trojan · Oct 20, 2009

Thanks for all the suggestions. It took me a while to find out how to
configure my keyboard to be able to type the degree sign. I prefer to
stick with pure ASCII if possible.
Where are the literals (e.g. u'\N{DEGREE SIGN}') defined? I found
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Is that the place to look?

George

Mark is right about the source, but you needn't write unicode source
to process unicode data. Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.3
[..] 13' 16.80" N

And if you are unsure of the name to use:
[..] 'DEGREE SIGN'

--Scott David Daniels
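The interactive session in the post above did not survive intact. The following is not a reconstruction of it, only a sketch of the same idea in Python 3 syntax (an assumption), using the \N{...} escape together with unicodedata.name and unicodedata.lookup from the standard library:

```python
import unicodedata

# Pure-ASCII source: the degree sign is spelled by its Unicode name.
s = '48\N{DEGREE SIGN} 13\' 16.80" N'
degrees = int(s[:s.find('\N{DEGREE SIGN}')])

# Going from a character to its name, and from a name back to the character.
name = unicodedata.name('\xb0')
char = unicodedata.lookup('DEGREE SIGN')

print(degrees, name, ascii(char))
```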

Nobody · Oct 21, 2009

Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Is that the place to look?

You can get them from the unicodedata module, e.g.:

import unicodedata
for i in xrange(0x10000):
    n = unicodedata.name(unichr(i),None)
    if n is not None:
        print i, n
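For readers on Python 3, a rough equivalent of the loop above might look like the sketch below. The helper named_chars and the choice to scan only the first 256 code points are illustrative assumptions, not part of the original post:

```python
import unicodedata

def named_chars(start, stop):
    """Return {code point: name} for the named characters in [start, stop)."""
    names = {}
    for cp in range(start, stop):
        # The second argument is returned instead of raising ValueError
        # when a code point has no name (control characters, for example).
        n = unicodedata.name(chr(cp), None)
        if n is not None:
            names[cp] = n
    return names

latin1 = named_chars(0x00, 0x100)
print(latin1[0xB0])
```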

Martin v. Löwis · Oct 21, 2009

Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found

http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Is that the place to look?

Correct - you are supposed to put a Unicode character name inside
the \N escape. The exact list of names depends on the version of
the UCD that was used to build the specific Python version, but the
characters you are likely interested in have probably been defined
"forever".

Regards,
Martin
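If it matters which UCD version a given interpreter was built against, the unicodedata module reports it. A small sketch in Python 3 syntax (an assumption); the helper is_known_name is hypothetical and only wraps unicodedata.lookup:

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)

def is_known_name(name):
    """Return True if this interpreter's UCD knows the character name."""
    try:
        unicodedata.lookup(name)
    except KeyError:
        return False
    return True

print(is_known_name('DEGREE SIGN'))
```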

Mark Tolonen · Oct 21, 2009

Scott David Daniels said:
Mark is right about the source, but you needn't write unicode source
to process unicode data. Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:
[..]

It wouldn't be your favorite way if you were typing Chinese:

x = u'我是美国人。'

vs.

x = (u'\N{CJK UNIFIED IDEOGRAPH-6211}\N{CJK UNIFIED IDEOGRAPH-662F}'
     u'\N{CJK UNIFIED IDEOGRAPH-7F8E}\N{CJK UNIFIED IDEOGRAPH-56FD}'
     u'\N{CJK UNIFIED IDEOGRAPH-4EBA}\N{IDEOGRAPHIC FULL STOP}')

;^) Mark
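As a quick check that the two spellings really are the same string, here is a sketch in Python 3 syntax (an assumption). The third form, using \uXXXX escapes, is not mentioned in the post and is added only as another ASCII-only alternative:

```python
# Typed directly; relies on the source file being saved as UTF-8.
literal = '我是美国人。'

# Spelled by Unicode character name, pure ASCII.
named = ('\N{CJK UNIFIED IDEOGRAPH-6211}\N{CJK UNIFIED IDEOGRAPH-662F}'
         '\N{CJK UNIFIED IDEOGRAPH-7F8E}\N{CJK UNIFIED IDEOGRAPH-56FD}'
         '\N{CJK UNIFIED IDEOGRAPH-4EBA}\N{IDEOGRAPHIC FULL STOP}')

# Spelled by code point, also pure ASCII but much shorter.
escaped = '\u6211\u662f\u7f8e\u56fd\u4eba\u3002'

print(literal == named == escaped)
```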

Chris Jones · Oct 21, 2009

On Tue, 20 Oct 2009 17:56:21 +0000, George Trojan wrote:
[..]

Where are the literals (i.e. u'\N{DEGREE SIGN}') defined?

You can get them from the unicodedata module, e.g.:

import unicodedata
for i in xrange(0x10000):
    n = unicodedata.name(unichr(i),None)
    if n is not None:
        print i, n

Python rocks!

Just curious, why did you choose to set the upper boundary at 0xffff?

CJ

Bruno Desthuilliers · Oct 21, 2009

beSTEfar wrote:
(snip)

> When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful
enough to make REs overkill. And for any sufficiently complex parsing (any
recursive construct, for example - think XML, HTML, any programming
language etc.), REs are just NOT enough by themselves - you need a
full-blown parser.
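For the coordinate string in this thread, the string-method route could look like the sketch below, in Python 3 syntax (an assumption). The use of str.partition and the variable names are illustrative choices, not something proposed in the post:

```python
s = '48\N{DEGREE SIGN} 13\' 16.80" N'

# partition() splits once at the separator and reports whether it was found.
deg, sep, rest = s.partition('\N{DEGREE SIGN}')
if not sep:
    raise ValueError('no degree sign in %r' % s)

degrees = int(deg)
# int() tolerates the leading space left over in front of the minutes.
minutes = int(rest.split("'")[0])

print(degrees, minutes)
```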

Nobody · Oct 21, 2009

Python rocks!

Just curious, why did you choose to set the upper boundary at 0xffff?

Characters outside the 16-bit range aren't supported on all builds. They
won't be supported on most Windows builds, as Windows uses 16-bit Unicode
extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and
generate surrogate pairs for them:

>>> u'\N{LINEAR B SYLLABLE B008 A}'
u'\U00010000'
>>> len(_)
2

Whether or not using surrogates in this context is a good idea is open to
debate. What's the advantage of a multi-wchar string over a multi-byte
string?
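For anyone unfamiliar with surrogate pairs, the arithmetic behind them is short. The sketch below is in Python 3 syntax (an assumption; Python 3.3 and later no longer have narrow builds, so len() of the character is 1 there), and the helper to_surrogates is a hypothetical name:

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    v = cp - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

ch = '\U00010000'

# Cross-check against the UTF-16 codec: two 16-bit units for one character.
encoded = ch.encode('utf-16-be')
units = [int.from_bytes(encoded[i:i + 2], 'big')
         for i in range(0, len(encoded), 2)]

print([hex(u) for u in units], to_surrogates(ord(ch)))
```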

rurpy · Oct 21, 2009

Bruno Desthuilliers said:
For some simple parsing problems, Python's string methods are powerful
enough to make REs overkill. And for any complex enough parsing (any
recursive construct for example - think XML, HTML, any programming
language etc), REs are just NOT enough by themselves - you need a full
blown parser.

But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.

Terry Reedy · Oct 21, 2009

Nobody said:
Characters outside the 16-bit range aren't supported on all builds. They
won't be supported on most Windows builds, as Windows uses 16-bit Unicode
extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

In Python 3, if not 2.6, chr(0x10000) (what used to be unichr()) works
fine on Windows, and generates the appropriate surrogate pair.

Gabriel Genellina · Oct 22, 2009

rurpy said:
But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.

I don't think so. Nesting isn't the only problem. RE's cannot handle
comments, for example. And you must support unquoted attributes, single and
double quotes, any attribute ordering, empty tags, arbitrary whitespace...
If you don't, you are not reading XML (or HTML), only a specific file
format that resembles XML but actually isn't.

Chris Jones · Oct 22, 2009

On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote:

[..]

Characters outside the 16-bit range aren't supported on all builds.
They won't be supported on most Windows builds, as Windows uses 16-bit
Unicode extensively:

I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)

Moot point on xterm anyway, since you'd be hard put to it to find a
decent terminal font that covers anything outside the BMP.

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and
generate surrogate pairs for them:

>>> u'\N{LINEAR B SYLLABLE B008 A}'
u'\U00010000'
>>> len(_)
2

Whether or not using surrogates in this context is a good idea is open to
debate. What's the advantage of a multi-wchar string over a multi-byte
string?

I don't understand this last remark, but since I'm only a GNU/Linux
hobbyist, I guess it doesn't make much difference.

Thanks for the code snippet and comments.

CJ

rurpy · Oct 22, 2009

I don't think so. Nesting isn't the only problem. RE's cannot handle
comments, by example. And you must support unquoted attributes, single and
double quotes, any attribute ordering, empty tags, arbitrary whitespace....
If you don't, you are not reading XML (or HTML), only a specific file
format that resembles XML but actually isn't.

OK, then let me rephrase my point as: in the real world it is often
not necessary to parse XML in its full generality; parsing, as you
put it, "a specific file format that resembles XML" is all that is
really needed.

Gabriel Genellina · Oct 22, 2009

rurpy said:
OK, then let me rephrase my point as: in the real world it is often
not necessary to parse XML in its full generality; parsing, as you
put it, "a specific file format that resembles XML" is all that is
really needed.

Given that using a real XML parser like ElementTree is as easy as (or even
easier than) building a regular expression, and more robust, and more
likely to survive small changes in the input format, why use the worse
solution?
RE's are good at solving some problems, but parsing XML isn't one of those.
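To give a feel for how little code ElementTree needs, here is a sketch in Python 3 syntax (an assumption). The document, with its points and point elements and lat and lon attributes, is entirely made up for illustration; it mixes quote styles, attribute order, whitespace and a comment:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: same data written in deliberately inconsistent styles.
doc = """<points>
  <!-- comments are skipped by the parser -->
  <point lat='48' lon="16" />
  <point   lon = "17"
           lat = "49"></point>
</points>"""

root = ET.fromstring(doc)

# Attributes are looked up by name, so quoting and ordering do not matter.
coords = [(int(p.get('lat')), int(p.get('lon'))) for p in root.iter('point')]

print(coords)
```

Note that this only covers well-formed XML; unquoted attributes, as found in some HTML, would still be rejected by this parser.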

Lie Ryan · Oct 27, 2009

Chris said:
I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)

No, the best part about Unicode is there is no encoding!

Unicode does not define any encoding; what it defines is code points for
characters, which are not related to how characters are encoded in files
or network transmission.

Chris Jones · Oct 28, 2009

Chris Jones wrote:
[..]

Best part of Unicode is that there are multiple encodings, right? ;-)

No, the best part about Unicode is there is no encoding!

Unicode does not define any encoding;

RFC 3629:

"ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32."

what it defines is code-points for characters which is not related to
how characters are encoded in files or network transmission.

In other words, Unicode is "not related to any encoding" .. and yet the
UTF-8, UTF-16.. "encoding forms" are clearly "related" to Unicode.

How is that possible?

CJ
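One way to see how both statements can hold is to look at a single character: it has one code point, and each encoding form is a different way of serializing that same number. A small sketch in Python 3 syntax (an assumption), with illustrative variable names:

```python
ch = '\N{DEGREE SIGN}'

# The code point is an abstract number, independent of any byte layout.
code_point = ord(ch)

# Three encoding forms of that same code point.
forms = {
    'utf-8': ch.encode('utf-8'),
    'utf-16-be': ch.encode('utf-16-be'),
    'utf-32-be': ch.encode('utf-32-be'),
}

for name, data in forms.items():
    # Every form decodes back to the identical character.
    assert data.decode(name) == ch
    print(name, data.hex())
```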
