python tr equivalent (non-ascii)

kettle · Aug 13, 2008

Hi,
I was wondering how I ought to be handling character range
translations in python.

What I want to do is translate fullwidth numbers and roman alphabet
characters into their halfwidth ascii equivalents.
In perl I can do this pretty easily with tr:

tr/\x{ff00}-\x{ff5e}/\x{0020}-\x{007e}/;

and I think the string.translate method is what I need to use to
achieve the equivalent in python. Unfortunately the maketrans method
doesn't seem to accept character ranges and I'm also having trouble
with its interpretation of length. What I came up with was to first
fudge the ranges:

my_test_string = u"ＡＢＣＤＥＦＧ"
f_range = "".join([unichr(x) for x in range(ord(u"\uff00"),ord(u"\uff5e"))])
t_range = "".join([unichr(x) for x in range(ord(u"\u0020"),ord(u"\u007e"))])

then use these as input to maketrans:
my_trans_string = my_test_string.translate(string.maketrans(f_range,t_range))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-93: ordinal not in range(128)

but it generates an encoding error... and if I encode the ranges in
utf8 before passing them on I get a length error because maketrans is
counting bytes not characters and utf8 is variable width...
my_trans_string = my_test_string.translate(string.maketrans(f_range.encode("utf8"),t_range.encode("utf8")))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ValueError: maketrans arguments must have same length
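
The length mismatch can be reproduced directly. The following is a minimal sketch, assuming Python 3 (the thread itself uses Python 2, so chr stands in for unichr here). It builds the same two ranges and compares character counts with UTF-8 byte counts:

```python
# Python 3 sketch: the thread's two ranges, measured in characters and in bytes.
# range() excludes its end point, exactly as in the original post.
f_range = "".join(chr(x) for x in range(0xFF00, 0xFF5E))
t_range = "".join(chr(x) for x in range(0x0020, 0x007E))

f_bytes = f_range.encode("utf8")
t_bytes = t_range.encode("utf8")

# Both ranges hold the same number of characters...
print(len(f_range), len(t_range))
# ...but every code point in U+FF00..U+FF5D needs three bytes in UTF-8,
# while every ASCII character needs one, so the byte strings differ in length.
print(len(f_bytes), len(t_bytes))
```

A byte-oriented maketrans sees only the second pair of lengths, which is why it reports arguments of different length.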

kettle · Aug 13, 2008

OK, so I guess I was barking up the wrong tree. Searching for
"python 全角　半角" (fullwidth, halfwidth) quickly brought up a solution:

import unicodedata
my_test_string = u"フガホゲ-%*@AＢＣ－％＊＠１２3"
# my_test_string is already a unicode object, so no decode step is needed
print unicodedata.normalize('NFKC', my_test_string)
フガホゲ-%*@ABC-%*@123

still, it would be nice if there was a more general solution, or if
maketrans actually looked at chars instead of bytes methinks.
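
For reference, here is a minimal sketch of the NFKC approach, assuming Python 3 where every str is already Unicode. The helper name to_halfwidth is made up for illustration and is not part of unicodedata:

```python
import unicodedata

def to_halfwidth(text):
    """Fold compatibility characters, including fullwidth forms, via NFKC.

    Hypothetical helper name; only unicodedata.normalize is a real API.
    """
    return unicodedata.normalize("NFKC", text)

sample = "ＡＢＣ１２３"          # fullwidth letters and digits
print(to_halfwidth(sample))

# NFKC is broader than a width fold: halfwidth katakana goes the other way
# and becomes ordinary (fullwidth) katakana.
print(to_halfwidth("\uff8c"))   # halfwidth FU
```

Because NFKC also rewrites other compatibility characters, it may change more than intended when only the fullwidth Latin range should be touched.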

Fredrik Lundh · Aug 13, 2008

maketrans only works for byte strings.

as for translate itself, it has different signatures for byte strings
and unicode strings; in the former case, it takes a lookup table
represented as a 256-byte string (e.g. created by maketrans), in the
latter case, it takes a dictionary mapping from ordinals to ordinals or
unicode strings.

something like

# 0x5f entries cover U+FF00..U+FF5E, the range in the tr expression
lut = dict((0xff00 + ch, 0x0020 + ch) for ch in range(0x5f))

new_string = old_string.translate(lut)

could work (untested).
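
A slightly more general version of the dictionary idea, as a sketch that assumes Python 3. The helper make_range_table is a made-up name, not a standard library function. It builds the ordinal-to-ordinal mapping from an inclusive source range, much like a tr range. The extra entry for U+3000 is an addition not discussed in the thread: the ideographic (fullwidth) space lives there, not at U+FF00.

```python
def make_range_table(src_start, src_end, dst_start):
    """Map code points src_start..src_end (inclusive) onto dst_start onward.

    Hypothetical helper; the result is a plain dict usable with str.translate.
    """
    return {src_start + i: dst_start + i
            for i in range(src_end - src_start + 1)}

# Same range as the tr expression: U+FF00..U+FF5E -> U+0020..U+007E
table = make_range_table(0xFF00, 0xFF5E, 0x0020)

# Assumption beyond the thread: also fold the ideographic space to ASCII space.
table[0x3000] = 0x0020

print("ＡＢＣ　１２３".translate(table))
```

Characters without an entry in the table pass through unchanged, so mixed text such as katakana plus fullwidth digits only has the digits rewritten.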

</F>

kettle · Aug 13, 2008

Excellent. I didn't realize from the docs that I could do that. Thanks.
