codec for html/xml entities!?

Martin Bless · Apr 18, 2008

Hi friends, I've been off Python for quite a while now and am glad
to be back, at least as far as work permits.

Q:
What's a good way to encode and decode entities like &euro; or
&#8364;?

I need isolated functions to process lines. Looking at the xml and
sgmllib stuff I didn't really get a clue as to what's the most pythonic
way. Are there library functions I didn't see?

FYI, here is what I hacked down and what will probably (hopefully...)
do the job.

Feel free to comment.

# -*- coding: iso-8859-1 -*-
"""\
entity_stuff.py, mb, 2008-03-14, 2008-03-18

"""

import htmlentitydefs
import re

# Matches anything from '&' up to the next ';' (non-greedy).
RE_OBJ_entity = re.compile('(&.+?;)')

def entity2uc(entity):
    """Convert entity like &#123; to unichr.

    Return (result,True) on success or (input string, False)
    otherwise. Example:
        entity2uc('&euro;')   -> (u'\u20ac',True)
        entity2uc('&#8364;')  -> (u'\u20ac',True)
        entity2uc('&#x20ac;') -> (u'\u20ac',True)
        entity2uc('&foobar;') -> ('&foobar;',False)
    """

    gotCodepoint = False
    gotUnichr = False
    if entity.startswith('&#'):
        # numeric reference: hexadecimal (&#x...;) or decimal (&#...;)
        if entity[2] == 'x':
            base = 16
            digits = entity[3:-1]
        else:
            base = 10
            digits = entity[2:-1]
        try:
            v = int(digits,base)
            gotCodepoint = True
        except ValueError:
            pass
    else:
        # named reference: look the name up without '&' and ';'
        v = htmlentitydefs.name2codepoint.get(entity[1:-1],None)
        if v is not None:
            gotCodepoint = True

    if gotCodepoint:
        try:
            v = unichr(v)
            gotUnichr = True
        except (ValueError, OverflowError):
            # code point outside the range unichr() accepts
            pass
    if gotUnichr:
        return v, gotUnichr
    else:
        return entity, gotUnichr

def line_entities_to_uc(line):
    result = []
    cntProblems = 0
    for e in RE_OBJ_entity.split(line):
        if e.startswith('&'):
            e,success = entity2uc(e)
            if not success:
                cntProblems += 1
        result.append(e)
    return u''.join(result), cntProblems

def uc2entity(uc):
    cp = ord(uc)
    if cp > 127:
        name = htmlentitydefs.codepoint2name.get(cp,None)
        if name:
            result = '&%s;' % name
        else:
            result = '&#x%x;' % cp
    else:
        result = chr(cp)
    return result

def encode_line(line):
    return ''.join([uc2entity(u) for u in line])

if 1 and __name__=="__main__":
    import codecs
    infile = 'temp.ascii.xml'
    outfile = 'temp.utf8.xml'
    of = codecs.open(outfile,'wb','utf-8')
    totalProblems = 0
    totalLines = 0
    for line in file(infile,'rb'):
        line2, cntProblems = line_entities_to_uc(line)
        of.write(line2)
        totalLines += 1
        totalProblems += cntProblems
    of.close()
    print
    print "Summary:"
    print " Infile : %s" % (infile,)
    print " Outfile: %s" % (outfile,)
    print ' %8d %s %s' % (totalLines,
        ['lines','line'][totalLines==1], 'written.')
    print ' %8d %s %s' % (totalProblems,
        ['entities','entity'][totalProblems==1], 'left unconverted.')
    print '%s' % ('Done.',)
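
For the encoding direction, one possible alternative to uc2entity is a codec error handler. What follows is only a sketch, assuming Python 3 (where htmlentitydefs lives on as html.entities). The handler name 'htmlentityreplace' and the function html_entity_replace are made up for illustration; they are not part of the standard library.

```python
import codecs
from html.entities import codepoint2name


def html_entity_replace(error):
    """Error handler: replace unencodable characters with entity references."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for ch in error.object[error.start:error.end]:
        name = codepoint2name.get(ord(ch))
        # Prefer a named entity, fall back to a hexadecimal reference.
        pieces.append("&%s;" % name if name else "&#x%x;" % ord(ch))
    # Return the replacement text and the position where encoding resumes.
    return "".join(pieces), error.end


# 'htmlentityreplace' is a hypothetical name chosen for this sketch.
codecs.register_error("htmlentityreplace", html_entity_replace)


def encode_line(line):
    """Encode to ASCII, turning every non-ASCII character into an entity."""
    return line.encode("ascii", "htmlentityreplace").decode("ascii")
```

Like uc2entity above, this only touches characters above 127, so '&', '<' and '>' pass through unescaped.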

Have a nice day and
ru, Martin
(read you, ;-)

Stefan Behnel · Apr 18, 2008

Martin said:
What's a good way to encode and decode entities like &euro; or
&#8364;?

Hmm, since you provide code, I'm not quite sure what your actual question is.

So I'll just comment on the code here.

def entity2uc(entity):
    """Convert entity like &#123; to unichr.

    Return (result,True) on success or (input string, False)
    otherwise. Example:
        entity2uc('&euro;')   -> (u'\u20ac',True)
        entity2uc('&#8364;')  -> (u'\u20ac',True)
        entity2uc('&#x20ac;') -> (u'\u20ac',True)
        entity2uc('&foobar;') -> ('&foobar;',False)
    """

Is there a reason why you return a tuple instead of just returning the
converted result and raising an exception if the conversion fails?

Stefan

Martin Bless · Apr 20, 2008

[Stefan Behnel] wrote & said:
Martin said:

What's a good way to encode and decode entities like &euro; or
&#8364;?


Hmm, since you provide code, I'm not quite sure what your actual question is.

- What's a GOOD way?
- Am I reinventing the wheel?
- Are there well tested, fast, state of the art, builtin ways?
- Is something like line.decode('htmlentities') out there?
- Am I in conformity with relevant RFCs? (I'm hoping so ...)
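
On the question of builtin ways, here is a rough sketch, assuming a Python 3.4+ interpreter rather than the Python 2 used in the code above. There the standard library covers both directions for the common cases: html.unescape decodes named, decimal and hexadecimal references, and the 'xmlcharrefreplace' error handler encodes non-ASCII characters as numeric references. The wrapper names decode_entities and encode_numeric are hypothetical.

```python
import html


def decode_entities(text):
    """Replace named and numeric character references with their characters."""
    return html.unescape(text)


def encode_numeric(text):
    """Replace every non-ASCII character with a decimal reference like &#8364;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(decode_entities("&euro; &#8364; &#x20ac;"))
    print(encode_numeric("5 \u20ac"))
```

Neither call reports how many references were left unconverted, so the counting done in line_entities_to_uc would still need its own code.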

So I'll just comment on the code here.

Is there a reason why you return a tuple instead of just returning the
converted result and raising an exception if the conversion fails?

Mainly a matter of style. When I use the function this way in the
future, it's unambiguously clear that there might have been
unconverted entities. But I don't have to deal with the details of how
this has been discovered. And maybe I'd like to change the algorithm
in the future? This way it's nicely encapsulated.

Have a nice day

Martin

Stefan Behnel · Apr 20, 2008

Martin said:
[Stefan Behnel] wrote & said:

def entity2uc(entity):
    """Convert entity like &#123; to unichr.

    Return (result,True) on success or (input string, False)
    otherwise. Example:
        entity2uc('&euro;')   -> (u'\u20ac',True)
        entity2uc('&#8364;')  -> (u'\u20ac',True)
        entity2uc('&#x20ac;') -> (u'\u20ac',True)
        entity2uc('&foobar;') -> ('&foobar;',False)
    """


Is there a reason why you return a tuple instead of just returning the
converted result and raising an exception if the conversion fails?


Mainly a matter of style. When I use the function this way in the
future, it's unambiguously clear that there might have been
unconverted entities. But I don't have to deal with the details of how
this has been discovered. And maybe I'd like to change the algorithm
in the future? This way it's nicely encapsulated.

The normal case is that the entity can be replaced. Failure is the exceptional
case, and there the caller has to deal with the problem in one way or another.
You are making the normal case more complicated, as the caller *always* has to
check the result indicator to see whether the return value is the expected
result or something different. I don't think there is any reason to require
that, except when the conversion really failed.
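
The exception-based style argued for here might look roughly like the sketch below. It assumes Python 3 (html.entities in place of htmlentitydefs). EntityError, entity_to_char, convert_line and the ENTITY_RE pattern are hypothetical names for illustration, not taken from the code posted in this thread.

```python
import re
from html.entities import name2codepoint

# Assumed pattern: '&', optional '#', word characters, ';'.
ENTITY_RE = re.compile(r"&#?\w+;")


class EntityError(ValueError):
    """Raised when an entity reference cannot be converted."""


def entity_to_char(entity):
    """Return the character for an entity, or raise EntityError."""
    body = entity[1:-1]  # strip '&' and ';'
    try:
        if body.startswith("#"):
            if body[1:2] in ("x", "X"):
                return chr(int(body[2:], 16))
            return chr(int(body[1:], 10))
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        raise EntityError(entity) from None


def convert_line(line):
    """Convert all entities in a line; collect the ones that failed."""
    problems = []

    def replace(match):
        try:
            return entity_to_char(match.group(0))
        except EntityError:
            problems.append(match.group(0))
            return match.group(0)  # leave unconvertible entities as they are

    return ENTITY_RE.sub(replace, line), problems
```

With this split, a caller who only converts single entities never looks at a flag, while convert_line can still report what was left unconverted, which is the bookkeeping the tuple return was meant to provide.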

Stefan
