Help with character encodings

A_H · May 20, 2008

Help!

I've scraped a PDF file for text and all the minus signs come back as
u'\xad'.

Is there any easy way I can change them all to plain old ASCII '-' ???

str.replace complained about a missing codec.

Hints?

Gary Herron · May 20, 2008

Encoding it into a 'latin1' encoded string seems to work:
-

J. Cliff Dyer · May 20, 2008

Gary Herron said:
Encoding it into a 'latin1' encoded string seems to work:

-

Here's what I've found:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 0: ordinal not in range(128)
u'-'

If you replace the *string* '\xad' in the first argument to replace with
the *unicode object* u'\xad', python won't complain anymore. (Mind you,
you weren't using str.replace. You were using unicode.replace. Slight
difference, but important.) If you do the replace on a plain string, it
doesn't have to convert anything, so you don't get a UnicodeDecodeError.
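The thread is about Python 2, where mixing a byte string with a unicode object triggers an implicit ASCII decode. As a rough sketch of the same rule in Python 3 (an assumption on my part, since the original session was not preserved), mixing bytes and text fails with a TypeError instead, and keeping both arguments as text works:

```python
# Python 3 analogue of the mismatch described above (not the original session).
x = u"\xad"  # the SOFT HYPHEN character from the scraped text

try:
    # bytes arguments on a text string: the types do not match
    x.replace(b"\xad", b"-")
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Text arguments on a text string: no conversion is needed
fixed = x.replace(u"\xad", u"-")

print(mixed_ok)  # False
print(fixed)     # -
```

Either way the fix is the same: make the search and replacement values the same type as the string you call replace on.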

Cheers,
Cliff

Gary Herron · May 20, 2008

Gary said:
Encoding it into a 'latin1' encoded string seems to work:

-

That might be what you want, but really, it was not a very well
thought-out answer. Here's a better answer:

Using the unicodedata module, I see that the character you have u'\xad' is

SOFT HYPHEN (codepoint 173=0xad)
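A minimal sketch of that lookup, using the standard unicodedata module (the variable names are only for illustration):

```python
import unicodedata

ch = u"\xad"

name = unicodedata.name(ch)            # official Unicode name of the character
codepoint = ord(ch)                    # numeric code point
hyphen_name = unicodedata.name(u"-")   # name of the plain ASCII hyphen

print(name, codepoint, hex(codepoint))  # SOFT HYPHEN 173 0xad
print(hyphen_name, ord(u"-"))           # HYPHEN-MINUS 45
```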

If you want to replace that with the more familiar HYPHEN-MINUS
(codepoint 45) you can use the string replace, but stick with all
unicode values so you don't provoke a conversion to an ascii encoded string:
ABC-DEF
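The session that produced that output did not survive, so here is a sketch of what an all-unicode replace could look like. The sample string is an assumption chosen to match the output shown:

```python
# Hypothetical sample: 'ABC' and 'DEF' joined by a SOFT HYPHEN
sample = u"ABC\xadDEF"

# Both arguments are unicode/text, so no implicit conversion takes place
result = sample.replace(u"\xad", u"-")

print(result)  # ABC-DEF

# After the replacement this particular sample is pure ASCII
ascii_bytes = result.encode("ascii")
```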

But does this really solve your problem? If there is the possibility
for other unicode characters in your data, this is heading down the
wrong track, and the question (which I can't answer) becomes: What are
you going to do with the string?

If you are going to display it via a GUI that understands UTF-8, then
encode the string as utf8 and display it -- no need to convert the
hyphens.
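For the UTF-8 case, a small sketch of the encode step (the sample string is made up, and how the bytes reach the GUI depends on the toolkit):

```python
text = u"ABC\xadDEF"  # hypothetical sample containing a SOFT HYPHEN

# Encode for a UTF-8-aware consumer; the soft hyphen becomes two bytes
utf8_bytes = text.encode("utf-8")

# Decoding gives back exactly the original text, hyphen included
round_trip = utf8_bytes.decode("utf-8")

print(repr(utf8_bytes))
print(round_trip == text)  # True
```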

If you are trying to display it somewhere that is not unicode (or UTF-8)
aware, then you'll have to convert it. In that case, encoding it as
latin1 is probably a good choice, but beware: That does not convert the
u'\xad' to a chr(45) (the usual HYPHEN-MINUS), but instead to chr(173)
which (on latin1 aware applications) will display as the usual hyphen.
In any case, it won't be ascii (in the strict sense that ascii is chr(0)
through chr(127)). If you *really* *really* wanted straight strict
ascii, replace chr(173) with chr(45).
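A sketch of those two steps, written for Python 3, where encoded data is a bytes object (the sample string is an assumption):

```python
text = u"ABC\xadDEF"  # hypothetical sample containing a SOFT HYPHEN

# Step 1: latin1 keeps the character as the single byte 173 (0xad)
latin1_bytes = text.encode("latin-1")

# Step 2: for strict ASCII, swap byte 173 for byte 45
ascii_bytes = latin1_bytes.replace(b"\xad", b"-")

# Every byte is now below 128 (iterating bytes yields ints in Python 3)
is_ascii = all(b < 128 for b in ascii_bytes)

print(repr(latin1_bytes))
print(repr(ascii_bytes), is_ascii)
```

This only yields strict ASCII if the soft hyphen is the sole non-ASCII character in the data. Any other byte above 127 would still be present after the replace.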

Gary Herron
