Oliver Andrich
Hi everybody,
I have to write a little script that reads some nasty XML-formatted
files. "Nasty XML-formatted" means we have an XML-like syntax, no DTD,
HTML entities used without declaration, and so on. A task as I like it.
My task looks like this:
1. Read the data from the file.
2. Get rid of the HTML entities.
3. Parse the stuff to extract the content of two tags.
4. Create a new string from the extracted content.
5. Write it to a cp850 or, even better, macroman-encoded file.
Well, step 1 is easy and obvious. Step 2 is solved for me by
===== code =====
import string
from htmlentitydefs import entitydefs

# Build [entity, replacement] pairs such as ["&auml;", "\xe4"].
html2text = []
for k, v in entitydefs.items():
    if v[0] != "&":
        html2text.append(["&" + k + ";", v])
    else:
        # The replacement itself starts with "&", so drop the entity.
        html2text.append(["&" + k + ";", ""])

def remove_html_entities(data):
    for html, char in html2text:
        data = apply(string.replace, [data, html, char])
    return data
===== code =====
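For comparison, here is a minimal sketch of the same step that works on text only, so no byte strings get mixed in. It assumes Python 3 (where the module is called html.entities) and is not the code from my script. The names replace_html_entities and XML_PREDEFINED are made up for illustration. Unlike the version above, it leaves the five predefined XML entities alone so the parser still sees valid markup, and it converts entities outside Latin-1 instead of dropping them.

```python
import re
from html.entities import name2codepoint

# Hypothetical helper names; assumption: Python 3, input is already decoded text.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(text):
    """Replace named HTML entities with their characters in a single pass."""

    def _substitute(match):
        name = match.group(1)
        # Keep XML's own entities and anything unknown exactly as written.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    return _ENTITY_RE.sub(_substitute, text)


sample = replace_html_entities("K&ouml;ln &amp; &aring; &bogus;")
```

Doing the replacement in one regular-expression pass also avoids one replacement accidentally creating a new entity for a later one.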
Steps 3 and 4 also work fine so far. But step 5 drives me completely
crazy, because I get a lot of nice exceptions from the codecs module.
Hopefully someone can help me with that.
If my code for processing the file looks like this:
    def process_file(file_name):
        data = codecs.open(file_name, "r", "latin1").read()
        data = remove_html_entities(data)
        dom = parseString(data)
        print data
I get
Traceback (most recent call last):
  File "ag2blvd.py", line 46, in ?
    process_file(file_name)
  File "ag2blvd.py", line 33, in process_file
    data = remove_html_entities(data)
  File "ag2blvd.py", line 39, in remove_html_entities
    data = apply(string.replace, [data, html, char])
  File "/usr/lib/python2.4/string.py", line 519, in replace
    return s.replace(old, new, maxsplit)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
I am pretty sure that I have iso-latin-1 files, but after running
through my code everything looks pretty broken. If I remove the call
to remove_html_entities I get
Traceback (most recent call last):
  File "ag2blvd.py", line 46, in ?
    process_file(file_name)
  File "ag2blvd.py", line 35, in process_file
    print data
UnicodeEncodeError: 'ascii' codec can't encode character u'\x96' in position 2482: ordinal not in range(128)
And this continues when I try to write to a file in macroman encoding.
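To narrow this down, here is a small check of how the byte 0x96 behaves under different codecs. It is only a sketch, written for Python 3, and the idea that the files might really be cp1252 rather than iso-latin-1 is purely an assumption to test, not something I have confirmed. The helper can_encode is a made-up name.

```python
# Assumption: Python 3; the byte below corresponds to the u'\x96' in the traceback.
raw = b"\x96"

as_latin1 = raw.decode("latin-1")  # U+0096, a C1 control character
as_cp1252 = raw.decode("cp1252")   # U+2013, EN DASH


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for label, text in (("latin-1", as_latin1), ("cp1252", as_cp1252)):
    print(label, repr(text),
          "mac_roman:", can_encode(text, "mac_roman"),
          "cp850:", can_encode(text, "cp850"))
```

If the control character is what ends up in the string, the target codec has nothing to map it to, which would match the errors I see on output.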
As I am pretty sure that I am doing something completely wrong, and I
also haven't found a hint in the fantastic cookbook, I would like to ask
for help here.
I am also pretty sure that I am doing something wrong, as writing a
unicode string with German umlauts to a macroman file opened via the
codecs module works fine.
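The kind of test that works for me looks roughly like the following sketch. It is written for Python 3 here, the sample text and the temporary file name are made up for illustration, and "mac_roman" is the codec name I assume for macroman.

```python
import codecs
import os
import tempfile

# Illustrative sample text and path; not taken from the real data.
text = "Gr\u00fc\u00dfe aus K\u00f6ln: \u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df"
path = os.path.join(tempfile.mkdtemp(), "umlauts_macroman.txt")

# Write the text through the codecs module, which encodes on the way out.
with codecs.open(path, "w", encoding="mac_roman") as out:
    out.write(text)

# Read it back with the same codec to confirm the round trip.
with codecs.open(path, "r", encoding="mac_roman") as src:
    round_trip = src.read()

# Look at the raw bytes: one byte per character in a single-byte encoding.
with open(path, "rb") as fh:
    raw = fh.read()
```

So the codec itself handles umlauts; the trouble seems to come from whatever else is in my data.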
Hopefully someone can help me.
Best regards,
Oliver