Handling text lines from files with some (few) strange chars

Discussion in 'Python' started by Paulo da Silva, Jun 6, 2010.

  1. I need to read text files and process each line using string
    comparisons and regexp.

    I have a python2 program that uses <file object>.readline to read each
    line as a string. Then, processing it was a trivial job.

    With python3 I got error messages like:
    File "./pp1.py", line 93, in RL
    line=inf.readline()
    File "/usr/lib64/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position
    4963-4965: invalid data

    How do I handle this?

    If I use <file object>.read on a file opened in binary mode, I get a <bytes>
    object. Then how do I handle it? Regexps, comparisons with strings, ...?

    Thanks for any help.
     
    Paulo da Silva, Jun 6, 2010
    #1

  2. Paulo da Silva

    Chris Rebert Guest

    On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
    <> wrote:
    > I need to read text files and process each line using string
    > comparisons and regexp.
    >
    > I have a python2 program that uses <file object>.readline to read each
    > line as a string. Then, processing it was a trivial job.
    >
    > With python3 I got error messages like:
    > File "./pp1.py", line 93, in RL
    >    line=inf.readline()
    >  File "/usr/lib64/python3.1/codecs.py", line 300, in decode
    >    (result, consumed) = self._buffer_decode(data, self.errors, final)
    > UnicodeDecodeError: 'utf8' codec can't decode bytes in position
    > 4963-4965: invalid data
    >
    > How do I handle this?


    Specify the encoding of the text when opening the file using the
    `encoding` parameter. For Windows-1252 for example:

    your_file = open("path/to/file.ext", 'r', encoding='cp1252')
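    A minimal sketch of this, assuming a small cp1252-encoded sample file. The
    file name, the sample data and the helper `read_lines` are made up for
    illustration and are not from the original post. Once the file is opened
    with the right `encoding`, each line is an ordinary str, so string
    comparisons and `re` work as they did in Python 2.

```python
import os
import re
import tempfile

def read_lines(path, encoding):
    """Return the lines of a text file decoded with the given encoding."""
    with open(path, "r", encoding=encoding) as inf:
        return [line.rstrip("\n") for line in inf]

# Build a small cp1252 sample file (hypothetical data, for illustration only).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "wb") as out:
    out.write("caf\u00e9 10\nplain 20\n".encode("cp1252"))

lines = read_lines(sample_path, "cp1252")

# Lines are str objects, so regexps and comparisons behave as usual.
matches = (re.search(r"(\d+)$", line) for line in lines)
numbers = [int(m.group(1)) for m in matches if m]
```

    Reading the same file with encoding="utf-8" raises UnicodeDecodeError,
    much like the traceback quoted above. If only a handful of bytes are
    invalid in the chosen encoding, open() also accepts an `errors` argument
    (for example errors="replace"), though that was not part of the
    suggestion above.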

    Cheers,
    Chris
    --
    http://blog.rebertia.com
     
    Chris Rebert, Jun 6, 2010
    #2

  3. Paulo da Silva

    Guest

    Chris,

    > Specify the encoding of the text when opening the file using the `encoding` parameter. For Windows-1252 for example:
    >
    > your_file = open("path/to/file.ext", 'r', encoding='cp1252')


    This looks similar to the codecs module's functionality. Do you know if
    the codecs module is still required in Python 3.x?

    Thank you,
    Malcolm
     
    , Jun 6, 2010
    #3
  4. On 06-06-2010 00:41, Chris Rebert wrote:
    > On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
    > <> wrote:

    ....

    >
    > Specify the encoding of the text when opening the file using the
    > `encoding` parameter. For Windows-1252 for example:
    >
    > your_file = open("path/to/file.ext", 'r', encoding='cp1252')
    >


    OK! This fixes my current problem. I used encoding="iso-8859-15". This
    is how my text files are encoded.
    But what about a more general case where the encoding of the text file
    is unknown? Is there anything like "autodetect"?
     
    Paulo da Silva, Jun 6, 2010
    #4
  5. Paulo da Silva

    MRAB Guest

    Paulo da Silva wrote:
    > On 06-06-2010 00:41, Chris Rebert wrote:
    >> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
    >> <> wrote:

    > ...
    >
    >> Specify the encoding of the text when opening the file using the
    >> `encoding` parameter. For Windows-1252 for example:
    >>
    >> your_file = open("path/to/file.ext", 'r', encoding='cp1252')
    >>

    >
    > OK! This fixes my current problem. I used encoding="iso-8859-15". This
    > is how my text files are encoded.
    > But what about a more general case where the encoding of the text file
    > is unknown? Is there anything like "autodetect"?
    >

    An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
    How could you tell which was the correct encoding?

    Well, if the file contained words in a certain language and some of the
    characters were wrong, then you'd know that the encoding was wrong. This
    does imply, though, that you'd need to know what the language should
    look like!

    You could try different encodings, and for each one try to identify what
    could be words, then look them up in dictionaries for various languages
    to see whether they are real words...
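    A small illustration of the ambiguity, as a sketch: the same bytes decode
    without error under both code pages but give different text. The sample
    bytes and the helper name `decodings` are invented here for illustration.

```python
def decodings(raw, candidates):
    """Map each candidate encoding to the decoded text, or None if decoding fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

raw = b"S\xe3o Paulo"  # hypothetical sample bytes
results = decodings(raw, ["cp1252", "cp1250", "utf-8"])
for name, text in results.items():
    print(name, repr(text))
```

    Both single-byte decodings succeed (only the UTF-8 attempt fails), so a
    decoding error alone cannot tell cp1252 and cp1250 apart. Only knowing
    what the language should look like shows that one of the results is
    wrong.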
     
    MRAB, Jun 6, 2010
    #5
  6. Paulo da Silva

    John Machin Guest

    On Jun 6, 12:14 pm, MRAB <> wrote:
    > Paulo da Silva wrote:
    > > On 06-06-2010 00:41, Chris Rebert wrote:
    > >> On Sat, Jun 5, 2010 at 4:03 PM, Paulo da Silva
    > >> <> wrote:

    > > ...

    >
    > >> Specify the encoding of the text when opening the file using the
    > >> `encoding` parameter. For Windows-1252 for example:

    >
    > >> your_file = open("path/to/file.ext", 'r', encoding='cp1252')

    >
    > > OK! This fixes my current problem. I used encoding="iso-8859-15". This
    > > is how my text files are encoded.
    > > But what about a more general case where the encoding of the text file
    > > is unknown? Is there anything like "autodetect"?

    >
    >  >
    > An encoding like 'cp1252' uses 1 byte/character, but so does 'cp1250'.
    > How could you tell which was the correct encoding?
    >
    > Well, if the file contained words in a certain language and some of the
    > characters were wrong, then you'd know that the encoding was wrong. This
    > does imply, though, that you'd need to know what the language should
    > look like!
    >
    > You could try different encodings, and for each one try to identify what
    > could be words, then look them up in dictionaries for various languages
    > to see whether they are real words...


    This has been automated (semi-successfully, with caveats) by the
    chardet package ... see http://chardet.feedparser.org/
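    A hedged sketch of how chardet is typically used: chardet.detect() takes
    a bytes object and returns a dict that includes an `encoding` guess and a
    `confidence` value. chardet is a third-party package and has to be
    installed separately. The wrapper `decode_with_guess`, its `detect`
    parameter and the latin-1 fallback are assumptions made here for
    illustration, not something from this thread.

```python
def decode_with_guess(raw, detect=None, fallback="latin-1"):
    """Decode bytes using a detector's best guess at the encoding.

    `detect` is any callable returning a dict with an "encoding" key;
    by default chardet.detect is used (third-party, imported lazily).
    Returns a (text, encoding_used) tuple.
    """
    if detect is None:
        import chardet  # third-party package, must be installed
        detect = chardet.detect
    guess = detect(raw)
    # The detector may report None when it has no idea; fall back in that case.
    encoding = guess.get("encoding") or fallback
    # errors="replace" keeps going even if the guess turns out to be wrong.
    return raw.decode(encoding, errors="replace"), encoding

# Possible usage with a file opened in binary mode (hypothetical path):
#   with open("path/to/file.ext", "rb") as inf:
#       text, used = decode_with_guess(inf.read())
```

    The guess is statistical, so on short inputs or closely related
    single-byte encodings it can be wrong; checking the reported confidence
    before trusting the result is a reasonable precaution.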
     
    John Machin, Jun 6, 2010
    #6
  7. On 06-06-2010 04:05, John Machin wrote:
    > On Jun 6, 12:14 pm, MRAB <> wrote:
    >> Paulo da Silva wrote:

    ....

    >>> OK! This fixes my current problem. I used encoding="iso-8859-15". This
    >>> is how my text files are encoded.
    >>> But what about a more general case where the encoding of the text file
    >>> is unknown? Is there anything like "autodetect"?

    >>

    ....

    >
    > This has been automated (semi-successfully, with caveats) by the
    > chardet package ... see http://chardet.feedparser.org/


    This seems nice!
    Thanks
     
    Paulo da Silva, Jun 6, 2010
    #7