Is there a module to transliterate Russian and Ukrainian Cyrillic Unicode to phonetic ASCII?

News123 · Apr 22, 2009

Hi,

The thread "Is there a better way to convert foreign characters?"
reminded me of a small problem that I'd like to solve.

I'd like to convert some Cyrillic file names into file names that are
ASCII only.

Is there any Perl module / command line tool / official transliteration
algorithm?

I know that French, German and English transliterate differently,
but do they differ systematically, or is it more or less pseudo-random?

How do native Russians/Ukrainians usually tackle this problem if they
live in a country where PCs are not set up with a Cyrillic code page?

Thanks in advance for any suggestions.

IIRC I already have a self-made transliteration table somewhere on a
disk of one of my old PCs, but it is probably not a really good one (and
I have to find it again).

bye

N
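
A self-made table of the kind mentioned above might look roughly like the following. This is only a minimal sketch: the mapping is an English-style romanization picked for illustration (it is not an official standard, and it treats the shared letters the Russian way, so Ukrainian 'г' and 'и' come out as 'g' and 'i'). The names %TRANSLIT, translit_char and translit are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals

# Hypothetical hand-made table: lower-case Cyrillic letter => ASCII.
# English-style romanization, chosen for illustration only.
my %TRANSLIT = (
    'а' => 'a',  'б' => 'b',    'в' => 'v',  'г' => 'g',  'д' => 'd',
    'е' => 'e',  'ё' => 'yo',   'ж' => 'zh', 'з' => 'z',  'и' => 'i',
    'й' => 'y',  'к' => 'k',    'л' => 'l',  'м' => 'm',  'н' => 'n',
    'о' => 'o',  'п' => 'p',    'р' => 'r',  'с' => 's',  'т' => 't',
    'у' => 'u',  'ф' => 'f',    'х' => 'kh', 'ц' => 'ts', 'ч' => 'ch',
    'ш' => 'sh', 'щ' => 'shch', 'ъ' => '',   'ы' => 'y',  'ь' => '',
    'э' => 'e',  'ю' => 'yu',   'я' => 'ya',
    # Letters used in Ukrainian but not in Russian
    'ґ' => 'g',  'є' => 'ye',   'і' => 'i',  'ї' => 'yi',
);

# Map one Cyrillic character; letters missing from the table become '_'.
sub translit_char {
    my ($ch) = @_;
    my $lc  = lc $ch;
    my $out = exists $TRANSLIT{$lc} ? $TRANSLIT{$lc} : '_';
    # Upper-case input letter: capitalise the first output letter.
    return $ch eq $lc ? $out : ucfirst $out;
}

# Replace every Cyrillic character; everything else passes through unchanged.
sub translit {
    my ($text) = @_;
    $text =~ s/(\p{Cyrillic})/translit_char($1)/ge;
    return $text;
}

print translit('Поиск страниц на русском'), "\n";   # prints "Poisk stranits na russkom"
```

A table like this is easy to adapt to the German or French conventions by swapping individual entries (for example 'ш' => 'sch'), which is one way to see that the schemes differ mostly letter by letter.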

RedGrittyBrick · Apr 22, 2009

News123 said:
Hi,

The thread "Is there a better way to convert foreign characters?"
reminded me of a small problem that I'd like to solve.

I'd like to convert some Cyrillic file names into file names that are
ASCII only.

I wonder why? AFAIK the most commonly used operating systems nowadays
use Unicode for filenames. Of course this doesn't help if you don't have
fonts and input methods for the character ranges in question.

Is there any perl module / command line tool / official transliteration
algorithm?

Well known tools include recode and iconv.
http://www.gnu.org/software/recode/
http://www.gnu.org/software/libiconv/

In the thread you mentioned, Ted Zlatanov said

"Unicode::Transliterate does at least some of this. It uses the IBM ICU
project; the ICU documentation section on transforms may be particularly
useful. For example, see the "Any->Accents" transliteration:
http://userguide.icu-project.org/transforms/general"

Ben Bullock · Apr 22, 2009

I'd like to convert some Cyrillic file names into file names that are
ASCII only.

Is there any perl module / command line tool / official transliteration
algorithm?

I find many on search.cpan.org.

I don't vouch for their quality.

News123 · Apr 23, 2009

RedGrittyBrick said:
I wonder why? AFAIK the most commonly used operating systems nowadays
use Unicode for filenames. Of course this doesn't help if you don't have
fonts and input methods for the character ranges in question.

Old portable MP3 players, old DVD players, car stereos and other stereos
capable of playing MP3s are not really 'gifted' at handling much
more than ASCII; some even refuse or skip Cyrillic file names.

Unicode::Transliterate sounds like the right thing. I'll look into it.

The media players on my PC are absolutely happy with Unicode/UTF-8.
They just struggle with Cyrillic non-Unicode tags / file names (windows
1252 coding or alike).

Dr.Ruud · Apr 24, 2009

News123 said:
I'd like to translate some cyrillic file names into file names, that are
ASCII only.

perl -Mstrict -Mutf8 -MText::Unidecode -wle'
my $s = <<'EOT';
Поиск в Интернете
Поиск страниц на русском
EOT
print Text::Unidecode::unidecode($s);
'
Poisk v Intiernietie
Poisk stranits na russkom
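
For the file-name use case, the one-liner above could be wrapped into a small script along these lines. This is a minimal sketch, assuming the Text::Unidecode module from CPAN is installed and that file names on disk are UTF-8 encoded bytes. The helper name ascii_name and the --rename switch are made up for this example and are not part of Text::Unidecode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use File::Basename qw(fileparse);
use File::Spec;
use Text::Unidecode qw(unidecode);    # CPAN module, not in the Perl core

# Hypothetical helper: turn a decoded (character) name into a safe ASCII name.
sub ascii_name {
    my ($name) = @_;
    my $ascii = unidecode($name);
    $ascii =~ s/[^A-Za-z0-9._-]+/_/g;    # spaces and other odd characters become '_'
    $ascii =~ s/^_+|_+$//g;              # trim leading/trailing underscores
    return length $ascii ? $ascii : '_';
}

# Dry run by default; pass --rename as the first argument to really rename.
my $do_rename = 0;
if (@ARGV && $ARGV[0] eq '--rename') {
    $do_rename = 1;
    shift @ARGV;
}

for my $path (@ARGV) {
    my ($base, $dir) = fileparse($path);
    # Assumption: the file system stores names as UTF-8 bytes.
    my $ascii = ascii_name(decode('UTF-8', $base));
    next if $ascii eq $base;             # already plain ASCII, nothing to do
    my $new = File::Spec->catfile($dir, $ascii);
    if (-e $new) {
        warn "skipping $path: $new already exists\n";
        next;
    }
    print "$path -> $new\n";
    if ($do_rename) {
        rename $path, $new or warn "could not rename $path: $!\n";
    }
}
```

Running it without --rename only prints the planned renames, which makes it easy to check the transliteration before anything is touched on disk.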

Ilya Zakharevich · Apr 25, 2009

Old portable MP3 players, old DVD players, car stereos and other stereos
capable of playing MP3s are not really 'gifted' at handling much
more than ASCII; some even refuse or skip Cyrillic file names.

I use

audio_rename -@csR .

Hope this helps,
Ilya

Ilya Zakharevich · May 8, 2009

In the first message of this thread
Let me summarize what I learned from this thread:

a) http://userguide.icu-project.org/transforms/general

is very educational on what kind of intelligent questions and
answers one could encounter in this topic. (This is slightly
offset by the terms `slash' and `backslash' being used
interchangeably - in a document on character semantics!)

b) http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

documentation (and organization - tables are broken into small
chunks) is very impressive. The date looks a little bit
suspicious - but it MAY be that there is nothing to fix!

c) http://search.cpan.org/~wollmers/Text-Undiacritic-0.02

(probably) uses Unicode data tables. (Probably) does not break
the tables into small chunks, so loading may be slow and take
quite a lot of resources. (But I did not look deep; it might be
that it uses Unicode tables at build time only to generate its
own "small tables".)

Myself, I would very much prefer the flexibility of "a" combined with
the simplicity of "b". If I were to code something like this, I
would special-case some small "hot" subset (like IBM's UGL, or M$'s
WGL), and would

A) machine-translate to this hot subset;

B) allow human-generated web of possible "equivalences" and
"simplifications" inside the hot subset...

Thanks to everybody,
Ilya
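
One possible reading of the A)/B) split above is a two-stage pipeline: a machine step driven by Unicode data, followed by a small human-maintained table. The sketch below is only an illustration of that structure, not anyone's actual implementation. It uses Unicode decomposition from the core Unicode::Normalize module as a stand-in for stage A, and the %SIMPLIFY table and the simplify function are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file contains UTF-8 string literals
use Unicode::Normalize qw(NFKD);

# Stage B: hypothetical human-maintained "simplifications" for characters
# that Unicode decomposition alone does not reduce.
my %SIMPLIFY = (
    'ß' => 'ss', 'ø' => 'o', 'Ø' => 'O',
    'æ' => 'ae', 'Æ' => 'AE', 'ł' => 'l', 'Ł' => 'L',
);

sub simplify {
    my ($text) = @_;

    # Stage A: machine step. Decompose, then drop combining marks,
    # so that e.g. an accented Latin letter becomes its base letter.
    # Caveat: this also turns Cyrillic 'й' into 'и' and 'ё' into 'е'.
    my $s = NFKD($text);
    $s =~ s/\p{Mn}+//g;

    # Stage B: apply the hand-made table to whatever non-ASCII is left;
    # characters without an entry are kept as they are.
    $s =~ s/([^\x00-\x7F])/exists $SIMPLIFY{$1} ? $SIMPLIFY{$1} : $1/ge;
    return $s;
}

print simplify('Crème brûlée'), "\n";   # prints "Creme brulee"
print simplify('Straße'), "\n";         # prints "Strasse"
```

For Cyrillic input the second stage would need a transliteration table of its own; the point here is only how a data-driven pass and a hand-edited pass can be layered.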
