newbie - HTML character codes

ardief

Hi

sorry if I'm asking something very obvious but I'm stumped. I have a
text that looks like this:

Sentence 401
4.00pm — We set off again; this time via Tony's home to collect
a variety of possessions, finally arriving at hospital no.3.
Sentence 402
4.55pm — Tony is ushered into a side ward with three doctors and
I stay outside with Mum.

And I want the HTML char codes to turn into their equivalent plain
text. I've looked at the newsgroup archives, the cookbook, the web in
general and can't manage to sort it out. I thought doing something like
this -

file = open('filename', 'r')
ofile = open('otherfile', 'w')

done = 0

while not done:
    line = file.readline()
    if 'THE END' in line:
        done = 1
    elif '—' in line:
        line.replace('—', '--')
        ofile.write(line)
    else:
        ofile.write(line)


would do it but it isn't....where am I going wrong?

many thanks
rachele
 
Roberto Bonvallet

ardief wrote:
[...]
And I want the HTML char codes to turn into their equivalent plain
text. I've looked at the newsgroup archives, the cookbook, the web in
general and can't manage to sort it out. I thought doing something like
this -

file = open('filename', 'r')

It's not a good idea to use 'file' as a variable name, since you are
shadowing the builtin type of the same name.

ofile = open('otherfile', 'w')

done = 0

while not done:
    line = file.readline()
    if 'THE END' in line:
        done = 1
    elif '—' in line:
        line.replace('—', '--')

The replace method doesn't modify the 'line' string, it returns a new string.

        ofile.write(line)
    else:
        ofile.write(line)

This should work (untested):

infile = open('filename', 'r')
outfile = open('otherfile', 'w')

for line in infile:
    outfile.write(line.replace('—', '--'))
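
To illustrate the point about replace returning a new string, here is a minimal sketch. It assumes the entity is written in the file as the literal text "&mdash;" (the exact form in the original file may differ), and the variable names are only illustrative.

```python
# Assumption: the entity appears in the text as the literal string "&mdash;".
line = "4.00pm &mdash; We set off again\n"

# Strings are immutable: this call builds a new string and throws it away.
line.replace("&mdash;", "--")
unchanged = line

# Keeping the return value is what actually gives the converted text.
fixed = line.replace("&mdash;", "--")

print(repr(unchanged))
print(repr(fixed))
```

This is why the loop version has to write the result of replace, not the original line.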

But I think the best approach is to use an existing application or library
that solves the problem. recode(1) can easily convert to and from HTML
entities:

recode html..utf-8 filename

Best regards.
 
Fredrik Lundh

ardief said:
sorry if I'm asking something very obvious but I'm stumped. I have a
text that looks like this:

Sentence 401
4.00pm — We set off again; this time via Tony's home to collect
a variety of possessions, finally arriving at hospital no.3.
Sentence 402
4.55pm — Tony is ushered into a side ward with three doctors and
I stay outside with Mum.

And I want the HTML char codes to turn into their equivalent plain
text. I've looked at the newsgroup archives, the cookbook, the web in
general and can't manage to sort it out.

file = open('filename', 'r')
ofile = open('otherfile', 'w')

done = 0

while not done:
    line = file.readline()
    if 'THE END' in line:
        done = 1
    elif '—' in line:
        line.replace('—', '--')

this returns a new line; it doesn't update the line in place.

        ofile.write(line)
    else:
        ofile.write(line)

for a more general solution to the actual replace problem, see:

http://effbot.org/zone/re-sub.htm#unescape-html
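
For readers on Python 3.4 or later, the standard library's html.unescape handles both named and numeric character references. The sketch below is not the code from the linked page; the helper name unescape_line and the sample text are made up for illustration.

```python
import html


def unescape_line(line):
    """Convert named and numeric HTML character references to plain text."""
    return html.unescape(line)


# Hypothetical sample mixing a numeric reference and a named one.
sample = "4.55pm &#8212; Tony &amp; Mum"
converted = unescape_line(sample)
print(converted)
```

The result contains real Unicode characters, so the output file should be opened with an encoding that can represent them, such as UTF-8.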

you may also want to look up the "fileinput" module in the library reference
manual.
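
A rough sketch of how fileinput might be used here, assuming the entity is stored as the literal text "&mdash;" and that rewriting the file in place is acceptable. The function name replace_in_place is hypothetical, and the demo runs on a temporary file so nothing real is overwritten.

```python
import fileinput
import os
import tempfile


def replace_in_place(path, old="&mdash;", new="--"):
    """Rewrite the file at path, replacing old with new on every line."""
    # With inplace=True, fileinput redirects print() output into the file.
    with fileinput.input(files=path, inplace=True) as lines:
        for line in lines:
            print(line.replace(old, new), end="")


# Demo on a throwaway file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("4.00pm &mdash; We set off again\n")

replace_in_place(demo_path)

with open(demo_path) as handle:
    result = handle.read()
os.remove(demo_path)
print(result, end="")
```

Since the original is overwritten, it may be worth passing a backup extension (the backup parameter of fileinput.input) when trying this on real data.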

</F>
 
ardief

thank you both - in the end I used recode, which I wasn't aware of.
Fredrik, I had come across your script while googling for solutions,
but failed to make it work....
 
