Latest approach to controlling non-printable / multi-byte characters

metaperl · Feb 8, 2007

There is no end to the number of frantic pleas for help with
characters in the realm beyond ASCII.

However, in searching thru them, I do not see a workable approach to
changing them into other things.

I am dealing with a file and in my Emacs editor, I see
"MASSACHUSETTS-AMHERST" ... in other words, there is a dash between
MASSACHUSETTS and AMHERST.

However, if I do a grep for the text the shell returns this:

MASSACHUSETTSâAMHERST

and od -tc returns this:

0000540 O F M A S S A C H U S E T T
0000560 S 342 200 223 A M H E R S T ; U N I

So, the conclusion is the "dash" is actually 3 bytes (shown in octal).
My goal is to take those 3 bytes and convert them to an ASCII dash. Any
idea how I might write such a filter? The closest I have got is:

unicodedata.normalize('NFKD', s).encode('ASCII', 'replace')

but that puts a question mark there.

Peter Otten · Feb 9, 2007

metaperl said:
There is no end to the number of frantic pleas for help with
characters in the realm beyond ASCII.

And the answer is "first decode to unicode, then modify" in nine out of ten
cases.

No idea where the character references come from, but the dump suggests that
your text is in UTF-8. Decoding it and replacing the en dash with a hyphen
gives:

u'MASSACHUSETTS-AMHERST'
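
A minimal sketch of the "decode first, then modify" step, assuming Python 3 (the thread predates it; in the Python 2 of the time the equivalent would be calling .decode("utf8") on a plain str). The variable names are illustrative only, and the byte string is built from the octal values in the od dump:

```python
# The three bytes from the od dump (octal 342 200 223), written as a bytes literal.
raw = b"MASSACHUSETTS\342\200\223AMHERST"

# First decode to text, then modify: U+2013 (EN DASH) becomes an ASCII hyphen.
text = raw.decode("utf-8")
fixed = text.replace("\u2013", "-")

print(fixed)
```

After the replacement the string contains only ASCII characters, so it can be encoded to ASCII without any error handler.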

u"\u2013" is indeed a dash, by the way: its Unicode name is 'EN DASH'.
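
As for why the normalize/encode attempt produced a question mark: EN DASH has no decomposition, so NFKD leaves it untouched and the 'replace' error handler then substitutes '?'. The sketch below, again assuming Python 3, shows that and one possible filter that maps dashes explicitly. DASH_TABLE and dashes_to_ascii are hypothetical names, and the set of code points in the table is only an example:

```python
import unicodedata

s = "MASSACHUSETTS\u2013AMHERST"

# EN DASH has no decomposition, so NFKD returns the string unchanged...
assert unicodedata.decomposition("\u2013") == ""
assert unicodedata.normalize("NFKD", s) == s
# ...and encode(..., 'replace') then substitutes '?' for it.
question_marked = unicodedata.normalize("NFKD", s).encode("ascii", "replace")

# Hypothetical table of dash-like code points to map explicitly; extend as needed.
DASH_TABLE = {0x2010: "-", 0x2012: "-", 0x2013: "-", 0x2014: "-", 0x2212: "-"}


def dashes_to_ascii(text):
    """Replace the dash-like characters listed in DASH_TABLE with '-'."""
    return text.translate(DASH_TABLE)


cleaned = dashes_to_ascii(s).encode("ascii", "replace")
```

Running the explicit mapping before encoding means 'replace' only has to deal with characters the table does not cover.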

Peter
