Latest approach to controlling non-printable / multi-byte characters

metaperl

There is no end to the number of frantic pleas for help with
characters in the realm beyond ASCII.

However, in searching through them, I do not see a workable approach to
changing such characters into other things.

I am dealing with a file and in my Emacs editor, I see "MASSACHUSETTS-
AMHERST" ... in other words, there is a dash between MASSACHUSETTS and
AMHERST.

However, if I do a grep for the text the shell returns this:

MASSACHUSETTS–AMHERST

and od -tc returns this:

0000540 O F M A S S A C H U S E T T
0000560 S 342 200 223 A M H E R S T ; U N I


So, the conclusion is that the "dash" is actually 3 bytes, shown here in
octal. My goal is to take those 3 bytes and convert them to an ASCII
dash. Any idea how I might write such a filter? The closest I have got
is:

unicodedata.normalize('NFKD', s).encode('ASCII', 'replace')

but that puts a question mark there.
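
For reference, a minimal Python 3 sketch (not from the original post) of what the dump and the question mark amount to. It assumes the three octal values are UTF-8 bytes; the variable names are only illustrative.

```python
import unicodedata

# The three octal values from the od dump, taken as raw bytes.
raw = bytes([0o342, 0o200, 0o223])

# Assumption: the file is UTF-8, in which case the three bytes
# decode to a single character.
dash = raw.decode("utf-8")
print(hex(ord(dash)), unicodedata.name(dash))  # 0x2013 EN DASH

# NFKD leaves this character unchanged because it has no decomposition
# mapping, so the 'replace' error handler substitutes a question mark.
normalized = unicodedata.normalize("NFKD", dash)
ascii_bytes = normalized.encode("ascii", "replace")
print(ascii_bytes)  # b'?'
```

In other words, normalization only helps for characters that decompose into an ASCII base; a dash like this one has to be mapped explicitly.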
 
Peter Otten

metaperl said:
There is no end to the number of frantic pleas for help with
characters in the realm beyond ASCII.

And the answer is "first decode to unicode, then modify" in nine out of ten
cases.

No idea where the character references come from, but the dump suggests that
your text is in UTF-8. Decoding it and replacing the dash with a hyphen gives:

u'MASSACHUSETTS-AMHERST'

u"\u2013" is indeed a dash, by the way; its Unicode name is 'EN DASH'.

Peter
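
A possible way to turn the "decode first, then modify" advice into a small filter, sketched in Python 3 rather than the Python 2 of the thread. The function name `ascii_dashes`, the `DASHES` table, and the choice of which characters to fold into "-" are assumptions made for illustration, not something given in the reply.

```python
# Hypothetical table of dash-like code points to fold into an ASCII hyphen.
# Which characters belong here is a judgment call for your data.
DASHES = {
    0x2010: "-",  # HYPHEN
    0x2011: "-",  # NON-BREAKING HYPHEN
    0x2012: "-",  # FIGURE DASH
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x2212: "-",  # MINUS SIGN
}


def ascii_dashes(data: bytes, encoding: str = "utf-8") -> bytes:
    """Decode the bytes, replace dash-like characters, and re-encode."""
    text = data.decode(encoding)      # first decode to unicode ...
    text = text.translate(DASHES)     # ... then modify
    return text.encode(encoding)


# The bytes from the od dump, written with the same octal escapes.
sample = b"MASSACHUSETTS\342\200\223AMHERST"
result = ascii_dashes(sample)
print(result)  # b'MASSACHUSETTS-AMHERST'
```

Any other non-ASCII characters pass through untouched here, so the output stays valid UTF-8 instead of picking up question marks.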
 
