encoding problem with BeautifulSoup - problem when writing parsedtext to file

Greg · Oct 6, 2011

Hi, I am having some encoding problems when I first parse stuff from a
non-English website using BeautifulSoup and then write the results to
a txt file.

I have the text both as a normal (text) and as a unicode string
(utext):
print repr(text)
'Branie zak\xc2\xb3adnik\xc3\xb3w'

print repr(utext)
u'Branie zak\xb3adnik\xf3w'

print text or print utext (fileSoup.prettify() also shows 'wrong'
symbols):
Branie zak³adników

Now I am trying to save this to a file but I never get the encoding
right. Here is what I tried (plus lots of different things with encode,
decode...):
import codecs

# attempt 1: write the byte string as-is
outFile = open(filePath, "w")
outFile.write(text)
outFile.close()

# attempt 2: let codecs encode the unicode string as UTF-8
outFile = codecs.open(filePath, "w", "UTF8")
outFile.write(utext)
outFile.close()

Thanks!!

Steven D'Aprano · Oct 6, 2011

> Hi, I am having some encoding problems when I first parse stuff from a
> non-English website using BeautifulSoup and then write the results to a
> txt file.

If you haven't already read this, you should do so:

http://www.joelonsoftware.com/articles/Unicode.html

> I have the text both as a normal (text) and as a unicode string (utext):
> print repr(text)
> 'Branie zak\xc2\xb3adnik\xc3\xb3w'

This is pretty much meaningless, because we don't know how you got the
text and what it actually is. You're showing us a bunch of bytes, with no
clue as to whether they are the right bytes or not. Considering that your
Unicode text is also incorrect, I would say they are *not* right and your
description of the problem is 100% backwards: the problem is not
*writing* the text, but *reading* the bytes and decoding them.
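
The two reprs quoted above can be checked directly. The sketch below assumes the page bytes were ISO-8859-2 (as the follow-up in this thread reports) and that something decoded them as Latin-1 along the way; both are assumptions about Greg's setup, not something his post shows.

```python
# -*- coding: utf-8 -*-
# Assumption: these are the bytes as served by the site (ISO-8859-2).
raw = b'Branie zak\xb3adnik\xf3w'

wrong = raw.decode('latin-1')      # wrong codec: byte 0xB3 becomes U+00B3
right = raw.decode('iso-8859-2')   # right codec: byte 0xB3 becomes U+0142

print(repr(wrong))
print(repr(right))

# The byte string `text` is just the wrong characters encoded as UTF-8,
# so writing it out "correctly" still stores the wrong characters.
assert wrong.encode('utf-8') == b'Branie zak\xc2\xb3adnik\xc3\xb3w'

# Text garbled this way can be repaired by reversing the wrong decode.
repaired = wrong.encode('latin-1').decode('iso-8859-2')
assert repaired == right
```

If the assumptions hold, the damage happened while reading, which is why no choice of output encoding could fix it.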

You should do something like this:

(1) Inspect the web page to find out what encoding is actually used.

(2) If the web page doesn't know what encoding it uses, or if it uses
bits and pieces of different encodings, then the source is broken and you
shouldn't expect much better results. You could try guessing, but you
should expect mojibake in your results.

http://en.wikipedia.org/wiki/Mojibake

(3) Decode the web page into Unicode text, using the correct encoding.

(4) Do all your processing in Unicode, not bytes.

(5) Encode the text into bytes using UTF-8 encoding.

(6) Write the bytes to a file.
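
Steps (3) to (6) can be sketched as follows. The function name convert_to_utf8, the temporary file names and the strip() call standing in for real processing are all illustrative, and the source encoding is assumed to be already known from step (1).

```python
import io
import os
import tempfile


def convert_to_utf8(in_path, out_path, source_encoding):
    """Hypothetical helper: decode, process as Unicode, encode, write."""
    with io.open(in_path, 'rb') as f:
        raw = f.read()                    # the bytes exactly as stored
    utext = raw.decode(source_encoding)   # (3) bytes -> Unicode
    utext = utext.strip()                 # (4) placeholder for real processing
    data = utext.encode('utf-8')          # (5) Unicode -> UTF-8 bytes
    with io.open(out_path, 'wb') as f:
        f.write(data)                     # (6) write the bytes
    return utext


# Demo with a small ISO-8859-2 file created on the spot.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'page.txt')
dst = os.path.join(tmp, 'out.txt')
with io.open(src, 'wb') as f:
    f.write(b'  Branie zak\xb3adnik\xf3w\n')

result = convert_to_utf8(src, dst, 'iso-8859-2')
print(repr(result))
```

Keeping decode and encode at the edges means everything in between only ever sees Unicode text.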

[...]

> Now I am trying to save this to a file but I never get the encoding
> right. Here is what I tried (plus lots of different things with encode,
> decode...):
>
> outFile=codecs.open( filePath, "w", "UTF8" )
> outFile.write(utext)
> outFile.close()

That's the correct approach, but it won't help you if utext contains the
wrong characters in the first place. The critical step is taking the
bytes in the web page and turning them into text.

How are you generating utext?

Greg · Oct 6, 2011

Brilliant! It worked. Thanks!

Here is the final code for those who are struggling with similar
problems:

## open and decode file
# In this case, the encoding comes from the charset attribute of a meta tag,
# e.g. <meta charset="iso-8859-2">
fileObj = open(filePath, "r").read()         # raw bytes of the page
fileContent = fileObj.decode("iso-8859-2")   # bytes -> unicode
fileSoup = BeautifulSoup(fileContent)

## Do some BeautifulSoup magic and preserve unicode; presume the result is
## saved in 'text'

## write extracted text to file
f = open(outFilePath, 'w')
f.write(text.encode('utf-8'))                # unicode -> UTF-8 bytes
f.close()

Chris Angelico · Oct 6, 2011

> Brilliant! It worked. Thanks!
>
> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. <meta charset="iso-8859-2">
> fileContent = fileObj.decode("iso-8859-2")
> f.write(text.encode('utf-8'))

In other words, when you decode correctly into Unicode and encode
correctly onto the disk, it works!

This is why encodings are so important.

ChrisA

Ulrich Eckhardt · Oct 6, 2011

On 06.10.2011 05:40, Steven D'Aprano wrote:

> (4) Do all your processing in Unicode, not bytes.
>
> (5) Encode the text into bytes using UTF-8 encoding.
>
> (6) Write the bytes to a file.

Just wondering, why do you split the latter two parts? I would have used
codecs.open() to open the file and define the encoding in a single step.
Is there a downside to this approach?

Otherwise, I can only confirm that your overall approach is the easiest
way to get correct results.

Uli

Chris Angelico · Oct 6, 2011

> Just wondering, why do you split the latter two parts? I would have used
> codecs.open() to open the file and define the encoding in a single step. Is
> there a downside to this approach?

Those two steps still happen, even if you achieve them in a single
function call. What Steven described is language- and library-
independent.

ChrisA
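
A small check of that point, as a sketch: the file names are made up, and the claim being tested is only that codecs.open() produces the same bytes as encoding by hand and writing in binary mode.

```python
import codecs
import io
import os
import tempfile

utext = u'Branie zak\u0142adnik\xf3w'
tmp = tempfile.mkdtemp()
one_step = os.path.join(tmp, 'one_step.txt')   # illustrative names
two_step = os.path.join(tmp, 'two_step.txt')

# One call: codecs.open() encodes each write() before it reaches the disk.
with codecs.open(one_step, 'w', 'utf-8') as f:
    f.write(utext)

# Two explicit steps: encode to bytes, then write the bytes.
with io.open(two_step, 'wb') as f:
    f.write(utext.encode('utf-8'))

with io.open(one_step, 'rb') as f:
    one_bytes = f.read()
with io.open(two_step, 'rb') as f:
    two_bytes = f.read()

print(one_bytes == two_bytes)
```

Either spelling is fine; the single call is a convenience, not a different process.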

jmfauth · Oct 6, 2011

> Brilliant! It worked. Thanks!
>
> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. <meta charset="iso-8859-2">
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)
>
> ## Do some BeautifulSoup magic and preserve unicode, presume result is
> ## saved in 'text'
>
> ## write extracted text to file
> f = open(outFilePath, 'w')
> f.write(text.encode('utf-8'))
> f.close()

or (Python2/Python3)

... r = f.read()
...

... t = f.write(r)
...

True

jmf
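
Only fragments of that interactive session are shown above. As a rough sketch of the same idea with the io module (not a reconstruction of jmf's session: the file names, the sample bytes and both encodings are assumptions), io.open() takes an encoding argument and behaves the same on Python 2.6+ and Python 3:

```python
import io
import os
import tempfile

tmp = tempfile.mkdtemp()
in_path = os.path.join(tmp, 'in.txt')     # hypothetical input file
out_path = os.path.join(tmp, 'out.txt')   # hypothetical output file
with io.open(in_path, 'wb') as f:
    f.write(b'zak\xb3adnik\xf3w')         # sample ISO-8859-2 bytes

# Reading in text mode decodes the bytes into Unicode.
with io.open(in_path, 'r', encoding='iso-8859-2') as f:
    r = f.read()

# Writing in text mode encodes the Unicode text to the target encoding.
with io.open(out_path, 'w', encoding='utf-8') as f:
    t = f.write(r)    # write() returns the number of characters written

print(f.closed)
```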

xDog Walker · Oct 6, 2011

> or (Python2/Python3)
>
> ... r = f.read()
> ...
>
> ... t = f.write(r)
> ...
>
> True
>
> jmf

What is this io of which you speak?

John Gordon · Oct 6, 2011

xDog Walker said:
> What is this io of which you speak?

It was introduced in Python 2.6.

Nobody · Oct 8, 2011

> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. <meta charset="iso-8859-2">
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)

The fileObj.decode() step should be unnecessary, and is usually
undesirable; Beautiful Soup should be doing the decoding itself.

If you actually know the encoding (e.g. from a Content-Type header), you
can specify it via the fromEncoding parameter to the BeautifulSoup
constructor, e.g.:

fileSoup = BeautifulSoup(fileObj.read(), fromEncoding="iso-8859-2")

If you don't specify the encoding, it will be deduced from a meta tag if
one is present, or a Unicode BOM, or using the chardet library if
available, or using built-in heuristics, before finally falling back to
Windows-1252 (which seems to be the preferred encoding of people who don't
understand what an encoding is or why it needs to be specified).
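
To illustrate the kind of fallback chain described above, here is a standard-library sketch. It is not Beautiful Soup's actual code: guess_encoding and its regular expression are hypothetical, the chardet and heuristic stages are left out, and only a UTF-8 BOM is checked.

```python
import codecs
import re

# Hypothetical pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(br'<meta[^>]+charset=["\']?([\w-]+)', re.I)


def guess_encoding(raw, known=None):
    """Return an encoding name for the byte string `raw`."""
    if known:                                # e.g. from a Content-Type header
        return known
    match = META_CHARSET.search(raw)
    if match:                                # declared in the document itself
        return match.group(1).decode('ascii')
    if raw.startswith(codecs.BOM_UTF8):      # UTF-8 byte order mark
        return 'utf-8-sig'
    return 'windows-1252'                    # last-resort fallback


page = (b'<html><head><meta charset="iso-8859-2"></head>'
        b'<body>zak\xb3adnik\xf3w</body></html>')
encoding = guess_encoding(page)
utext = page.decode(encoding)
print(encoding)
```

A guess like this is only as good as the page's own declaration, which is why passing a known encoding explicitly is preferable when one is available.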
