encoding problem with BeautifulSoup - problem when writing parsed text to file

Discussion in 'Python' started by Greg, Oct 6, 2011.

  1. Greg

    Greg Guest

    Hi, I am having some encoding problems when I first parse stuff from a
    non-English website using BeautifulSoup and then write the results to
    a txt file.

    I have the text both as a normal (text) and as a unicode string
    (utext):
    print repr(text)
    'Branie zak\xc2\xb3adnik\xc3\xb3w'

    print repr(utext)
    u'Branie zak\xb3adnik\xf3w'

    print text or print utext (fileSoup.prettify() also shows 'wrong'
    symbols):
    Branie zak³adników


    Now I am trying to save this to a file but I never get the encoding
    right. Here is what I tried (+ lots of different things with encode,
    decode...):
    outFile=open(filePath,"w")
    outFile.write(text)
    outFile.close()

    outFile=codecs.open( filePath, "w", "UTF8" )
    outFile.write(utext)
    outFile.close()

    Thanks!!
     
    Greg, Oct 6, 2011
    #1

  2. Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

    On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:

    > Hi, I am having some encoding problems when I first parse stuff from a
    > non-English website using BeautifulSoup and then write the results to a
    > txt file.


    If you haven't already read this, you should do so:

    http://www.joelonsoftware.com/articles/Unicode.html



    > I have the text both as a normal (text) and as a unicode string (utext):
    > print repr(text)
    > 'Branie zak\xc2\xb3adnik\xc3\xb3w'


    This is pretty much meaningless, because we don't know how you got the
    text and what it actually is. You're showing us a bunch of bytes, with no
    clue as to whether they are the right bytes or not. Considering that your
    Unicode text is also incorrect, I would say it is *not* right and your
    description of the problem is 100% backwards: the problem is not
    *writing* the text, but *reading* the bytes and decoding it.
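
    A minimal sketch of how output like the above can arise. It assumes the
    page bytes are ISO-8859-2 (confirmed later in this thread) and that
    something decoded them as Latin-1 along the way; the Latin-1 step is a
    guess, and the variable names are illustrative only.

```python
# Bytes as an ISO-8859-2 page would store the phrase (assumed input).
page_bytes = b"Branie zak\xb3adnik\xf3w"

# Decoding with the wrong codec never fails for Latin-1; it just yields
# the wrong characters (0xB3 becomes SUPERSCRIPT THREE instead of l-stroke).
wrong = page_bytes.decode("latin-1")

# Decoding with the codec the page actually uses gives the intended text.
right = page_bytes.decode("iso-8859-2")

# Encoding the wrongly decoded text as UTF-8 reproduces the byte string
# shown in the original question.
double_encoded = wrong.encode("utf-8")

# If only the wrong text is left, reversing the bad decode recovers it.
repaired = wrong.encode("latin-1").decode("iso-8859-2")
```

    Both `wrong` and `double_encoded` match the reprs in the question, which
    is consistent with the damage happening at decode time, not write time.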


    You should do something like this:

    (1) Inspect the web page to find out what encoding is actually used.

    (2) If the web page doesn't know what encoding it uses, or if it uses
    bits and pieces of different encodings, then the source is broken and you
    shouldn't expect much better results. You could try guessing, but you
    should expect mojibake in your results.

    http://en.wikipedia.org/wiki/Mojibake

    (3) Decode the web page into Unicode text, using the correct encoding.

    (4) Do all your processing in Unicode, not bytes.

    (5) Encode the text into bytes using UTF-8 encoding.

    (6) Write the bytes to a file.
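
    Steps (3) to (6) above can be sketched roughly as follows. The function
    name `transcode_to_utf8`, the temporary file names and the `strip()`
    placeholder for "processing" are assumptions made for illustration; the
    source encoding still has to come from steps (1) and (2).

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, src_encoding):
    """Decode bytes to text, process the text, then write UTF-8 bytes."""
    with open(src_path, "rb") as src:
        raw = src.read()                 # bytes exactly as stored on disk
    text = raw.decode(src_encoding)      # (3) bytes -> Unicode text
    text = text.strip()                  # (4) placeholder for real processing
    data = text.encode("utf-8")          # (5) Unicode text -> UTF-8 bytes
    with open(dst_path, "wb") as dst:
        dst.write(data)                  # (6) write the bytes unchanged
    return data


# Usage with a throwaway input file containing ISO-8859-2 bytes.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "page.txt")
dst_path = os.path.join(tmp_dir, "out.txt")
with open(src_path, "wb") as f:
    f.write(b"Branie zak\xb3adnik\xf3w\n")

written = transcode_to_utf8(src_path, dst_path, "iso-8859-2")
```

    Opening both files in binary mode keeps the decode and encode steps
    explicit, so nothing is converted implicitly.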


    [...]
    > Now I am trying to save this to a file but I never get the encoding
    > right. Here is what I tried (+ lots of different things with encode,
    > decode...):


    > outFile=codecs.open( filePath, "w", "UTF8" )
    > outFile.write(utext)
    > outFile.close()


    That's the correct approach, but it won't help you if utext contains the
    wrong characters in the first place. The critical step is taking the
    bytes in the web page and turning them into text.

    How are you generating utext?



    --
    Steven
     
    Steven D'Aprano, Oct 6, 2011
    #2

  3. Greg

    Greg Guest

    Brilliant! It worked. Thanks!

    Here is the final code for those who are struggling with similar
    problems:

    ## open and decode file
    # In this case, the encoding comes from the charset argument in a
    # meta tag, e.g. <meta charset="iso-8859-2">
    fileObj = open(filePath,"r").read()
    fileContent = fileObj.decode("iso-8859-2")
    fileSoup = BeautifulSoup(fileContent)

    ## Do some BeautifulSoup magic and preserve unicode, presume result is
    ## saved in 'text' ##

    ## write extracted text to file
    f = open(outFilePath, 'w')
    f.write(text.encode('utf-8'))
    f.close()



     
    Greg, Oct 6, 2011
    #3
  4. On Thu, Oct 6, 2011 at 3:39 PM, Greg wrote:
    > Brilliant! It worked. Thanks!
    >
    > Here is the final code for those who are struggling with similar
    > problems:
    >
    > ## open and decode file
    > # In this case, the encoding comes from the charset argument in a
    > # meta tag, e.g. <meta charset="iso-8859-2">
    > fileContent = fileObj.decode("iso-8859-2")
    > f.write(text.encode('utf-8'))


    In other words, when you decode correctly into Unicode and encode
    correctly onto the disk, it works!

    This is why encodings are so important :)

    ChrisA
     
    Chris Angelico, Oct 6, 2011
    #4
  5. Am 06.10.2011 05:40, schrieb Steven D'Aprano:
    > (4) Do all your processing in Unicode, not bytes.
    >
    > (5) Encode the text into bytes using UTF-8 encoding.
    >
    > (6) Write the bytes to a file.


    Just wondering, why do you split the latter two parts? I would have used
    codecs.open() to open the file and define the encoding in a single step.
    Is there a downside to this approach?

    Otherwise, I can only confirm that your overall approach is the easiest
    way to get correct results.

    Uli
     
    Ulrich Eckhardt, Oct 6, 2011
    #5
  6. On Thu, Oct 6, 2011 at 8:29 PM, Ulrich Eckhardt wrote:
    > Just wondering, why do you split the latter two parts? I would have used
    > codecs.open() to open the file and define the encoding in a single step. Is
    > there a downside to this approach?
    >


    Those two steps still happen, even if you achieve them in a single
    function call. What Steven described is language- and library-
    independent.
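
    A small sketch of that point, assuming UTF-8 output and made-up file
    names: encoding by hand and writing bytes, or letting `io.open()` encode
    on write, should leave identical bytes on disk.

```python
import io
import os
import tempfile

text = u"Branie zak\u0142adnik\xf3w"

tmp_dir = tempfile.mkdtemp()
two_step = os.path.join(tmp_dir, "two_step.txt")
one_step = os.path.join(tmp_dir, "one_step.txt")

# Two explicit steps: encode to bytes, then write the bytes.
with open(two_step, "wb") as f:
    f.write(text.encode("utf-8"))

# One call: the file object encodes each write for us.
with io.open(one_step, "w", encoding="utf-8") as f:
    f.write(text)

# Compare what actually ended up on disk.
with open(two_step, "rb") as a, open(one_step, "rb") as b:
    same = a.read() == b.read()
```

    The sample text has no line breaks, so newline translation in text mode
    does not enter into the comparison.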

    ChrisA
     
    Chris Angelico, Oct 6, 2011
    #6
  7. jmfauth

    jmfauth Guest

    On 6 oct, 06:39, Greg wrote:
    > Brilliant! It worked. Thanks!
    >
    > Here is the final code for those who are struggling with similar
    > problems:
    >
    > ## open and decode file
    > # In this case, the encoding comes from the charset argument in a
    > # meta tag, e.g. <meta charset="iso-8859-2">
    > fileObj = open(filePath,"r").read()
    > fileContent = fileObj.decode("iso-8859-2")
    > fileSoup = BeautifulSoup(fileContent)
    >
    > ## Do some BeautifulSoup magic and preserve unicode, presume result is
    > ## saved in 'text' ##
    >
    > ## write extracted text to file
    > f = open(outFilePath, 'w')
    > f.write(text.encode('utf-8'))
    > f.close()
    >




    or (Python2/Python3)

    >>> import io
    >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
    ...     r = f.read()
    ...
    >>> repr(r)
    u'a\nb\nc\n'
    >>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
    ...     t = f.write(r)
    ...
    >>> f.closed
    True

    jmf
     
    jmfauth, Oct 6, 2011
    #7
  8. xDog Walker

    xDog Walker Guest

    On Thursday 2011 October 06 10:41, jmfauth wrote:
    > or  (Python2/Python3)
    >
    > >>> import io
    > >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
    > ...     r = f.read()
    > ...
    > >>> repr(r)
    > u'a\nb\nc\n'
    > >>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
    > ...     t = f.write(r)
    > ...
    > >>> f.closed
    > True
    >
    > jmf


    What is this io of which you speak?

    --
    I have seen the future and I am not in it.
     
    xDog Walker, Oct 6, 2011
    #8
  9. John Gordon

    John Gordon Guest

    xDog Walker writes:

    > What is this io of which you speak?


    It was introduced in Python 2.6.

    --
    John Gordon A is for Amy, who fell down the stairs
    B is for Basil, assaulted by bears
    -- Edward Gorey, "The Gashlycrumb Tinies"
     
    John Gordon, Oct 6, 2011
    #9
  10. Nobody

    Nobody Guest

    On Wed, 05 Oct 2011 21:39:17 -0700, Greg wrote:

    > Here is the final code for those who are struggling with similar
    > problems:
    >
    > ## open and decode file
    > # In this case, the encoding comes from the charset argument in a
    > # meta tag, e.g. <meta charset="iso-8859-2">
    > fileObj = open(filePath,"r").read()
    > fileContent = fileObj.decode("iso-8859-2")
    > fileSoup = BeautifulSoup(fileContent)


    The fileObj.decode() step should be unnecessary, and is usually
    undesirable; Beautiful Soup should be doing the decoding itself.

    If you actually know the encoding (e.g. from a Content-Type header), you
    can specify it via the fromEncoding parameter to the BeautifulSoup
    constructor, e.g.:

    fileSoup = BeautifulSoup(fileObj.read(), fromEncoding="iso-8859-2")

    If you don't specify the encoding, it will be deduced from a meta tag if
    one is present, or a Unicode BOM, or using the chardet library if
    available, or using built-in heuristics, before finally falling back to
    Windows-1252 (which seems to be the preferred encoding of people who don't
    understand what an encoding is or why it needs to be specified).
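
    To illustrate that kind of fallback chain, here is a much simplified,
    standard-library-only sketch. It is not Beautiful Soup's actual
    algorithm: `guess_encoding` is a hypothetical helper, it only looks at a
    meta tag and a byte order mark, and it skips chardet and the other
    heuristics entirely.

```python
import codecs
import re

# Matches <meta charset="..."> and <meta ... content="...; charset=...">.
_META_CHARSET = re.compile(
    br'<meta[^>]+charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding(raw, default="windows-1252"):
    """Return a codec name for raw page bytes: meta tag, then BOM, then default."""
    match = _META_CHARSET.search(raw[:4096])
    if match:
        name = match.group(1).decode("ascii")
        try:
            codecs.lookup(name)          # only trust names Python knows
            return name
        except LookupError:
            pass
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


page = b'<html><head><meta charset="iso-8859-2"></head>zak\xb3adnik\xf3w</html>'
encoding = guess_encoding(page)
decoded = page.decode(encoding)
```

    A declared charset can itself be wrong, so the result is still worth
    checking by eye on a page whose text you know.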
     
    Nobody, Oct 8, 2011
    #10