Python and decimal character entities over 128.

bsagert

Some web feeds use decimal character entities that seem to confuse
Python (or me). For example, the string "doesn't" may be coded as
"doesn&#8217;t", which should produce a right-leaning apostrophe.
Python hates decimal entities beyond 128 so it chokes unless you do
something like string.encode('utf-8'). Even then, what should have
been a right-leaning apostrophe ends up as "â€™". The following script
does just that. Look for the string "The Canuck iPhone: Apple
doesnâ€™t care" after running it.

# coding: UTF-8
import feedparser

s = ''
d = feedparser.parse('http://feeds.feedburner.com/Mathewingramcom/work')
title = d.feed.title
link = d.feed.link
for i in range(0, 4):
    title = d.entries[i].title
    link = d.entries[i].link
    s += title + '\n' + link + '\n'

f = open('c:/x/test.txt', 'w')
f.write(s.encode('utf-8'))
f.close()

This useless script is adapted from a "useful" script. Its only
purpose is to ask the Python community how I can deal with decimal
entities > 128. Thanks in advance, Bill
 

Marc 'BlackJack' Rintsch

> Some web feeds use decimal character entities that seem to confuse
> Python (or me).

I guess they confuse you. Python is fine.
> For example, the string "doesn't" may be coded as "doesn&#8217;t",
> which should produce a right-leaning apostrophe. Python hates decimal
> entities beyond 128 so it chokes unless you do something like
> string.encode('utf-8').

Python neither hates nor chokes on these entities. It just refuses to
guess which encoding you want if you try to write *unicode* objects into
a file. Files contain byte values, not characters.
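That boundary between text and bytes can be seen directly. The sketch below uses modern Python 3, where the encoding of a text file must be chosen explicitly; the file path and sample string are made up for illustration:

```python
import os
import tempfile

text = u"Apple doesn\u2019t care"  # unicode text containing U+2019

# A file on disk only ever stores bytes; state the encoding explicitly.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back in binary mode shows the raw byte values.
with open(path, "rb") as f:
    raw = f.read()

# U+2019 became the three UTF-8 bytes E2 80 99.
assert raw == b"Apple doesn\xe2\x80\x99t care"
assert raw.decode("utf-8") == text
```

Decoding those bytes with the same encoding round-trips back to the original unicode string; decoding them with a different one does not.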
> Even then, what should have been a right-leaning apostrophe ends up as
> "â€™". The following script does just that. Look for the string "The
> Canuck iPhone: Apple doesnâ€™t care" after running it.

Then you didn't tell the application you used to look at the result that
the text is UTF-8 encoded. I guess you are using Windows and the
application expects cp1252-encoded text, because a UTF-8 encoded
apostrophe looks like "â€™" in cp1252.

Choose the encoding you want the result to have and everything is fine,
unless you stumble over a feed using characters that can't be encoded
in the encoding of your choice. That's why UTF-8 might have been a good
idea.
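The diagnosis above is easy to verify: the same three UTF-8 bytes, misread as cp1252, reproduce exactly the garbage Bill saw. A minimal Python 3 sketch:

```python
s = "doesn\u2019t"              # "doesn't" with RIGHT SINGLE QUOTATION MARK
utf8_bytes = s.encode("utf-8")  # U+2019 encodes to the bytes E2 80 99

# A viewer assuming cp1252 maps E2 -> â, 80 -> €, 99 -> ™
mojibake = utf8_bytes.decode("cp1252")
assert mojibake == "doesn\u00e2\u20ac\u2122t"  # i.e. "doesnâ€™t"
```

So the bytes in the file were correct all along; only the viewer's assumed encoding was wrong.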

Ciao,
Marc 'BlackJack' Rintsch
 

Manuel Vazquez Acosta


> Some web feeds use decimal character entities that seem to confuse
> Python (or me). For example, the string "doesn't" may be coded as
> "doesn&#8217;t", which should produce a right-leaning apostrophe.
> Python hates decimal entities beyond 128 so it chokes unless you do
> something like string.encode('utf-8'). Even then, what should have
> been a right-leaning apostrophe ends up as "â€™". The following script
> does just that. Look for the string "The Canuck iPhone: Apple
> doesnâ€™t care" after running it.
>
> # coding: UTF-8
> import feedparser
>
> s = ''
> d = feedparser.parse('http://feeds.feedburner.com/Mathewingramcom/work')
> title = d.feed.title
> link = d.feed.link
> for i in range(0, 4):
>     title = d.entries[i].title
>     link = d.entries[i].link
>     s += title + '\n' + link + '\n'
>
> f = open('c:/x/test.txt', 'w')
> f.write(s.encode('utf-8'))
> f.close()
>
> This useless script is adapted from a "useful" script. Its only
> purpose is to ask the Python community how I can deal with decimal
> entities > 128. Thanks in advance, Bill


This is a two-fold issue: encodings/charsets and entities. Encodings are
a way to _encode_ charsets into a sequence of octets. Entities are a way
to avoid a (harder) encoding/decoding process at the expense of
readability: when you type &#8217; no one actually sees the intended
character, but entities are easily encoded in ASCII.

When dealing with multiple sources of information, as your script may
be, I always include a normalization step to Python's unicode type.
Web sites may use whatever encoding they please.

The whole process is like this:
1. Fetch the content.
2. Use whatever clue in the contents to guess the encoding used by the
document, e.g. the Content-Type HTTP header; <meta http-equiv="content-type"
...>; <?xml version="1.0" encoding="utf-8"?>; and so on.
3. If none is present, use chardet to guess an acceptable decoder.
4. Decode, ignoring those characters that cannot be decoded.
5. Further process the result to find entities and "decode" them to
actual Unicode characters. (See below.)
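Steps 2-4 of this process can be sketched roughly as follows. The helper name `guess_decode` and its fallback order are my own invention, chardet is treated as optional, and the sketch uses modern Python 3:

```python
def guess_decode(raw, declared=None):
    """Decode raw bytes to text, trying the declared charset first.

    Hypothetical helper: declared is whatever charset was found in the
    HTTP header or document prologue (step 2), chardet is the fallback
    detector (step 3), and errors='ignore' is the last resort (step 4).
    """
    candidates = [declared, "utf-8"]
    try:
        import chardet  # optional third-party detector
        candidates.append(chardet.detect(raw).get("encoding"))
    except ImportError:
        pass
    for enc in candidates:
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw.decode("utf-8", errors="ignore")
```

For example, `guess_decode("doesn\u2019t".encode("utf-8"))` decodes cleanly as UTF-8, while `guess_decode(b"caf\xe9", declared="latin-1")` honours the declared charset because the byte sequence is not valid UTF-8.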

You may find these helpful:
http://effbot.org/zone/unicode-objects.htm
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
http://www.amk.ca/python/howto/unicode

This is a function I have used to process entities:
Code:
from htmlentitydefs import name2codepoint

def __processhtmlentities__(text):
    assert type(text) is unicode, "Non-normalized text"
    html = []
    (buffer, amp, text) = text.partition('&')
    while amp:
        html.append(buffer)
        (entity, semicolon, text) = text.partition(';')
        if entity[0] != '#':
            # named entity, e.g. &rsquo;
            if entity in name2codepoint:
                html.append(unichr(name2codepoint[entity]))
        else:
            # numeric entity, e.g. &#8217;
            html.append(unichr(int(entity[1:])))
        (buffer, amp, text) = text.partition('&')
    html.append(buffer)
    return u''.join(html)
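For what it's worth, modern Python 3 ships this exact functionality in the standard library: html.unescape handles named, decimal, and hexadecimal entities in one call.

```python
from html import unescape

# Decimal, named, and basic entities all decode to real characters.
assert unescape("doesn&#8217;t") == "doesn\u2019t"   # decimal entity
assert unescape("doesn&rsquo;t") == "doesn\u2019t"   # named entity
assert unescape("A &amp; B") == "A & B"
```

(The Python 2 function above predates it; htmlentitydefs was renamed html.entities in Python 3.)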


Best regards,
Manuel.
 
