String multi-replace


Sorin Schwimmer

Hi All,

I have to eliminate diacritics in a fairly large file.

Inspired by http://code.activestate.com/recipes/81330/, I came up with the following code:

#! /usr/bin/env python

import re

nodia={chr(196)+chr(130):'A', # mamaliga
       chr(195)+chr(130):'A', # A^
       chr(195)+chr(142):'I', # I^
       chr(195)+chr(150):'O', # OE
       chr(195)+chr(156):'U', # UE
       chr(195)+chr(139):'A', # AE
       chr(197)+chr(158):'S',
       chr(197)+chr(162):'T',
       chr(196)+chr(131):'a', # mamaliga
       chr(195)+chr(162):'a', # a^
       chr(195)+chr(174):'i', # i^
       chr(195)+chr(182):'o', # oe
       chr(195)+chr(188):'u', # ue
       chr(195)+chr(164):'a', # ae
       chr(197)+chr(159):'s',
       chr(197)+chr(163):'t'
      }
name="R\xc3\xa2\xc5\x9fca"

regex = re.compile("(%s)" % "|".join(map(re.escape, nodia.keys())))
print regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], name)

But it won't work; I end up with:

Traceback (most recent call last):
  File "multirep.py", line 25, in <module>
    print regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], name)
  File "multirep.py", line 25, in <lambda>
    print regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], name)
TypeError: 'type' object is not subscriptable

What am I doing wrong?

Thanks for your advice,
SxN
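
For reference, the traceback points at the name dict inside the lambda: that is Python's built-in type, not the nodia table, and a type cannot be indexed here. A minimal sketch of the same regex approach with the lookup going through nodia, written for Python 3 with byte strings (an assumption; the post uses Python 2) and with only a few entries of the table for brevity:

```python
import re

# UTF-8 byte sequences mapped to ASCII replacements.
# Only a handful of entries from the table above, for brevity.
nodia = {
    b"\xc4\x82": b"A",  # A with breve
    b"\xc3\x82": b"A",  # A with circumflex
    b"\xc3\xa2": b"a",  # a with circumflex
    b"\xc5\x9f": b"s",  # s with cedilla
    b"\xc5\xa3": b"t",  # t with cedilla
}

name = b"R\xc3\xa2\xc5\x9fca"

# One alternation that matches any of the byte sequences in the table.
regex = re.compile(b"|".join(map(re.escape, nodia)))

# Look the matched bytes up in nodia, not in the built-in dict type.
result = regex.sub(lambda mo: nodia[mo.group(0)], name)
print(result)  # b'Rasca'
```

mo.group(0) is equivalent to mo.string[mo.start():mo.end()] and a little easier to read.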
 

Steven D'Aprano

> Hi All,
>
> I have to eliminate diacritics in a fairly large file.

What's "fairly large"? Large to you is probably not large to your
computer. Anything less than a few dozen megabytes is small enough to be
read entirely into memory.


> Inspired by http://code.activestate.com/recipes/81330/, I came up with
> the following code:

If all you are doing is replacing single characters, then there's no need
for the 80lb sledgehammer of regular expressions when all you need is a
delicate tack hammer. Instead of this:

* read the file as bytes
* search for pairs of bytes like chr(195)+chr(130) using a regex
* replace them with single bytes like 'A'

do this:

* read the file as Unicode text
* search for characters like Â
* replace them with single characters like A using unicode.translate()

(or str.translate() in Python 3.x)
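
A rough sketch of the translate() approach, assuming Python 3 and the same set of characters as the table in the original post. The name NODIA is only illustrative. Note that the entry commented "AE" in the original table uses byte 139, which in UTF-8 gives Ë rather than Ä; the sketch assumes Ä was intended.

```python
# str.maketrans turns a {character: replacement} dict into the
# {code point: replacement} table that str.translate expects.
NODIA = str.maketrans({
    "\u0102": "A",  # A with breve
    "\u00c2": "A",  # A with circumflex
    "\u00ce": "I",  # I with circumflex
    "\u00d6": "O",  # O with diaeresis
    "\u00dc": "U",  # U with diaeresis
    "\u00c4": "A",  # A with diaeresis (assumed intent of the "AE" entry)
    "\u015e": "S",  # S with cedilla
    "\u0162": "T",  # T with cedilla
    "\u0103": "a",  # a with breve
    "\u00e2": "a",  # a with circumflex
    "\u00ee": "i",  # i with circumflex
    "\u00f6": "o",  # o with diaeresis
    "\u00fc": "u",  # u with diaeresis
    "\u00e4": "a",  # a with diaeresis
    "\u015f": "s",  # s with cedilla
    "\u0163": "t",  # t with cedilla
})

name = "R\u00e2\u015fca"
plain = name.translate(NODIA)
print(plain)  # Rasca
```

Characters that are not in the table pass through unchanged, so the same table can be applied to the whole text in one call.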


The only gotcha is that you need to know (or guess) the encoding to read
the file correctly.
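
To make the encoding explicit, something along these lines might do, assuming Python 3 and that the file is UTF-8 (a guess based on the byte pairs in the original post). The function name, the tiny table and the temporary file names are made up for illustration.

```python
import io
import os
import tempfile

# A deliberately tiny table; a real one would list every character to replace.
TABLE = str.maketrans({"\u00e2": "a", "\u015f": "s"})


def strip_diacritics_file(src_path, dst_path, table=TABLE, encoding="utf-8"):
    """Read src_path as text, replace characters via table, write dst_path."""
    # Decoding with an explicit encoding turns the raw bytes into Unicode text.
    with io.open(src_path, encoding=encoding) as src:
        text = src.read()
    with io.open(dst_path, "w", encoding=encoding) as dst:
        dst.write(text.translate(table))


# Small round trip using temporary files.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "in.txt")
dst_path = os.path.join(tmp_dir, "out.txt")

with io.open(src_path, "wb") as f:
    f.write(b"R\xc3\xa2\xc5\x9fca\n")  # the sample name, UTF-8 encoded

strip_diacritics_file(src_path, dst_path)

with io.open(dst_path, encoding="utf-8") as f:
    cleaned = f.read()
print(cleaned)
```

If the guessed encoding is wrong, decoding as UTF-8 will often fail with a UnicodeDecodeError, which is at least a loud failure. A wrong single-byte guess such as Latin-1 would decode without error and quietly leave the accented characters untouched.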
 
