Help with character encodings

Discussion in 'Python' started by A_H, May 20, 2008.

  1. A_H

    A_H Guest

    Help!

    I've scraped a PDF file for text and all the minus signs come back as
    u'\xad'.

    Is there any easy way I can change them all to plain old ASCII '-' ???

    str.replace complained about a missing codec.



    Hints?
     
    A_H, May 20, 2008
    #1

  2. Gary Herron

    Gary Herron Guest

    A_H wrote:
    > Help!
    >
    > I've scraped a PDF file for text and all the minus signs come back as
    > u'\xad'.
    >
    > Is there any easy way I can change them all to plain old ASCII '-' ???
    >
    > str.replace complained about a missing codec.
    >
    >
    >
    > Hints?
    >


    Encoding it into a 'latin1' encoded string seems to work:

    >>> print u'\xad'.encode('latin1')
    -




     
    Gary Herron, May 20, 2008
    #2

  3. J. Cliff Dyer

    On Tue, 2008-05-20 at 08:28 -0700, Gary Herron wrote:
    > A_H wrote:
    > > Help!
    > >
    > > I've scraped a PDF file for text and all the minus signs come back as
    > > u'\xad'.
    > >
    > > Is there any easy way I can change them all to plain old ASCII '-' ???
    > >
    > > str.replace complained about a missing codec.
    > >
    > >
    > >
    > > Hints?
    > >

    >
    > Encoding it into a 'latin1' encoded string seems to work:
    >
    > >>> print u'\xad'.encode('latin1')
    > -
    >

    Here's what I've found:

    >>> x = u'\xad'
    >>> x.replace('\xad','-')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 0: ordinal not in range(128)
    >>> x.replace(u'\xad','-')
    u'-'

    If you replace the *string* '\xad' in the first argument to replace with
    the *unicode object* u'\xad', Python won't complain anymore, because it
    no longer has to decode a byte string with the default ascii codec.
    (Mind you, you weren't using str.replace. You were using
    unicode.replace. Slight difference, but important.) If you do the
    replace on a plain string, it doesn't have to convert anything, so you
    don't get a UnicodeDecodeError.

    >>> x = x.encode('latin1')
    >>> x
    '\xad'
    >>> # Note the lack of a u before the ' above.
    >>> x.replace('\xad','-')
    '-'
    >>>
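
    For readers on Python 3, here is a minimal sketch of the same
    distinction. It rests on an assumption not in the session above: in
    Python 3 the unicode type is called str and the old byte string is
    called bytes. The variable names are only illustrative.

```python
# Python 3 sketch: str is text, bytes is encoded data.
# (Assumption: the session above is Python 2, where these types were
# called unicode and str respectively.)
text = "ABC\xadDEF"  # text containing U+00AD

# Text replaced with text: no codec is involved.
fixed_text = text.replace("\xad", "-")

# Bytes replaced with bytes: no codec is involved either.
raw = text.encode("latin-1")
fixed_bytes = raw.replace(b"\xad", b"-")

# Mixing the two kinds is the mistake. Python 3 refuses with a TypeError
# instead of attempting an implicit ASCII decode.
try:
    text.replace(b"\xad", "-")
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(fixed_text)
print(fixed_bytes)
print(type(mixed_error).__name__)
```

    The rule of thumb is unchanged: keep both arguments of replace the
    same kind of string as the object you call it on.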


    Cheers,
    Cliff
     
    J. Cliff Dyer, May 20, 2008
    #3
  4. Gary Herron

    Gary Herron Guest

    Gary Herron wrote:
    > A_H wrote:
    >> Help!
    >>
    >> I've scraped a PDF file for text and all the minus signs come back as
    >> u'\xad'.
    >>
    >> Is there any easy way I can change them all to plain old ASCII '-' ???
    >>
    >> str.replace complained about a missing codec.
    >>
    >>
    >>
    >> Hints?
    >>

    >
    > Encoding it into a 'latin1' encoded string seems to work:
    >
    > >>> print u'\xad'.encode('latin1')
    > -


    That might be what you want, but really, it was not a very
    well-thought-out answer. Here's a better one:



    Using the unicodedata module, I see that the character you have, u'\xad', is

    SOFT HYPHEN (codepoint 173=0xad)
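
    The lookup can be reproduced roughly like this. It is a small sketch
    written for Python 3 (an assumption; in Python 2 the literals would
    need a u prefix), and the two variable names are only illustrative.

```python
import unicodedata

# Print code point (hex and decimal) and official Unicode name for the
# character from the PDF and for the ordinary ASCII hyphen.
for ch in ("\xad", "-"):
    print(hex(ord(ch)), ord(ch), unicodedata.name(ch))
# 0xad 173 SOFT HYPHEN
# 0x2d 45 HYPHEN-MINUS

soft_hyphen_name = unicodedata.name("\xad")
hyphen_minus_name = unicodedata.name("-")
```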


    If you want to replace that with the more familiar HYPHEN-MINUS
    (codepoint 45), you can use the string replace, but stick with all
    unicode values so you don't provoke a conversion to an ascii-encoded
    string:

    >>> print u'ABC\xadDEF'.replace(u'\xad','-')
    ABC-DEF

    But does this really solve your problem? If there is the possibility
    for other unicode characters in your data, this is heading down the
    wrong track, and the question (which I can't answer) becomes: What are
    you going to do with the string?
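
    If it turns out that several dash-like characters do occur, one
    possible approach is a single translation table rather than a chain
    of replace calls. This is only a sketch in Python 3: the choice of
    characters in the table is an assumption about what the data might
    contain, and DASH_LIKE and normalize_dashes are hypothetical names.

```python
# Hypothetical table of dash-like code points to fold into '-'.
# Which characters belong here depends entirely on the actual data.
DASH_LIKE = {
    0x00AD: "-",  # SOFT HYPHEN
    0x2010: "-",  # HYPHEN
    0x2212: "-",  # MINUS SIGN
}


def normalize_dashes(text):
    """Return text with every code point in DASH_LIKE replaced by '-'."""
    return text.translate(DASH_LIKE)


print(normalize_dashes("a\xadb \u2212 c"))
```

    Any character not listed in the table passes through unchanged, so
    this does not by itself make the text ASCII.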

    If you are going to display it via a GUI that understands UTF-8, then
    encode the string as utf8 and display it -- no need to convert the
    hyphens.

    If you are trying to display it somewhere that is not unicode (or UTF-8)
    aware, then you'll have to convert it. In that case, encoding it as
    latin1 is probably a good choice, but beware: that does not convert the
    u'\xad' to a chr(45) (the usual HYPHEN-MINUS), but instead to chr(173),
    which (in latin1-aware applications) will display as the usual hyphen.
    In any case, it won't be ascii (in the strict sense that ascii is chr(0)
    through chr(127)). If you *really* *really* want straight strict
    ascii, replace chr(173) with chr(45).
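
    The three outcomes described above can be compared side by side. This
    is a sketch in Python 3 (an assumption; there, encode returns a bytes
    object), and the variable names are only illustrative.

```python
s = "ABC\xadDEF"

# UTF-8: the soft hyphen becomes the two bytes 0xC2 0xAD.
utf8_bytes = s.encode("utf-8")

# latin1: the soft hyphen becomes the single byte 173, not 45.
latin1_bytes = s.encode("latin-1")

# Strict ASCII cannot represent code point 173 at all.
try:
    s.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Replacing the character first makes the ASCII encoding succeed.
ascii_bytes = s.replace("\xad", "-").encode("ascii")

print(utf8_bytes)
print(latin1_bytes)
print(type(ascii_error).__name__)
print(ascii_bytes)
```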

    Gary Herron
     
    Gary Herron, May 20, 2008
    #4
