Is there a module to transliterate Russian and Ukrainian Cyrillic Unicode to phonetic ASCII?

News123

Hi,


The thread "Is there a better way to convert foreign characters?"
reminded me of a small problem that I'd like to solve.

I'd like to translate some Cyrillic file names into file names that are
ASCII only.


Is there any perl module / command line tool / official transliteration
algorithm?

I know that French, German and English transliterate Cyrillic differently,
but do they differ systematically, or is it more or less pseudo-random?

How do native Russians/Ukrainians usually tackle this problem if they
live in a country where PCs are not set up with a Cyrillic code page?


Thanks in advance for any suggestions.


IIRC I already have a self-made transliteration table somewhere on a
disk in one of my old PCs, but it is probably not a very good one (and
I would have to find it again).
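
For reference, a self-made table of the kind mentioned above could look roughly like the sketch below. This is only an illustration: the %TABLE mapping and the translit() helper are hypothetical, the romanization is a loose English-style one that leans towards Russian readings (Ukrainian г and и are usually romanized differently), and it is not an official standard. It uses core Perl only.

```perl
use strict;
use warnings;
use utf8;    # the table below contains literal Cyrillic characters

# Hypothetical hand-made table (lowercase letters only), not a standard.
my %TABLE = (
    'а' => 'a',  'б' => 'b',  'в' => 'v',  'г' => 'g',  'д' => 'd',
    'е' => 'e',  'ё' => 'yo', 'ж' => 'zh', 'з' => 'z',  'и' => 'i',
    'й' => 'y',  'к' => 'k',  'л' => 'l',  'м' => 'm',  'н' => 'n',
    'о' => 'o',  'п' => 'p',  'р' => 'r',  'с' => 's',  'т' => 't',
    'у' => 'u',  'ф' => 'f',  'х' => 'kh', 'ц' => 'ts', 'ч' => 'ch',
    'ш' => 'sh', 'щ' => 'shch', 'ъ' => '', 'ы' => 'y',  'ь' => '',
    'э' => 'e',  'ю' => 'yu', 'я' => 'ya',
    # Letters used in Ukrainian but not in Russian
    'є' => 'ye', 'і' => 'i',  'ї' => 'yi', 'ґ' => 'g',
);

# Replace every Cyrillic character by its table entry; characters that are
# not Cyrillic pass through unchanged, unknown Cyrillic ones become "?".
sub translit {
    my ($text) = @_;
    $text =~ s{(\p{Cyrillic})}{
        my $ch  = $1;
        my $lc  = lc $ch;
        my $out = exists $TABLE{$lc} ? $TABLE{$lc} : '?';
        $ch ne $lc ? ucfirst $out : $out;    # keep an initial capital
    }ge;
    return $text;
}

print translit('Поиск страниц на русском'), "\n";
```

Capitalization is handled per letter, so an all-caps word comes out with mixed case for multi-letter entries; a real table would need to decide how to treat that.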

bye


N
 
RedGrittyBrick

News123 said:
Hi,


The thread "Is there a better way to convert foreign characters?"
reminded me of a small problem that I'd like to solve.

I'd like to translate some Cyrillic file names into file names that are
ASCII only.

I wonder why? AFAIK the most commonly used operating systems nowadays
use Unicode for filenames. Of course this doesn't help if you don't have
fonts and input methods for the character ranges in question.

Is there any perl module / command line tool / official transliteration
algorithm?

Well known tools include recode and iconv.
http://www.gnu.org/software/recode/
http://www.gnu.org/software/libiconv/


In the thread you mentioned, Ted Zlatanov said

"Unicode::Transliterate does at least some of this. It uses the IBM ICU
project; the ICU documentation section on transforms may be particularly
useful. For example, see the "Any->Accents" transliteration:
http://userguide.icu-project.org/transforms/general"
 
Ben Bullock

I'd like to translate some Cyrillic file names into file names that are
ASCII only.


Is there any perl module / command line tool / official transliteration
algorithm?

I find many on search.cpan.org.

I don't vouch for their quality.
 
News123

RedGrittyBrick said:
I wonder why? AFAIK the most commonly used operating systems nowadays
use Unicode for filenames. Of course this doesn't help if you don't have
fonts and input methods for the character ranges in question.

Old portable MP3 players, old DVD players, car stereos and other stereos
capable of playing MP3s are not really 'gifted' at displaying much
more than ASCII; some even refuse or skip Cyrillic file names.

Unicode::Transliterate sounds like the right thing. I'll look into it.


The media players on my PC are perfectly happy with Unicode/UTF-8.
They only struggle with Cyrillic non-Unicode tags and file names
(Windows-1251 encoding or the like).
 
Dr.Ruud

News123 said:
I'd like to translate some Cyrillic file names into file names that are
ASCII only.

perl -Mstrict -Mutf8 -MText::Unidecode -wle'
my $s = <<'EOT';
Поиск в Интернете
Поиск страниц на русском
EOT
print Text::Unidecode::unidecode($s);
'
Poisk v Intiernietie
Poisk stranits na russkom
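
Applied to the original file-name problem, the same call could be wrapped in a small rename script. The following is a minimal sketch, not a tested tool: safe_name() and main() are hypothetical helpers, the file names are assumed to be UTF-8 encoded and to live in the current directory (no path components), and Text::Unidecode has to be installed from CPAN.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Reduce an already transliterated name to a conservative ASCII set.
sub safe_name {
    my ($name) = @_;
    $name =~ s/[^A-Za-z0-9._-]+/_/g;    # anything unusual becomes "_"
    $name =~ s/^_+|_+$//g;              # trim leading/trailing "_"
    return $name;
}

sub main {
    my @files = @_;
    return unless @files;
    require Text::Unidecode;            # CPAN module, loaded only when needed
    for my $old (@files) {
        my $chars = decode('UTF-8', $old);    # assumption: UTF-8 file names
        my $new   = safe_name(Text::Unidecode::unidecode($chars));
        next if $new eq $old;
        if ($new eq '' or -e $new) {
            warn "skipping '$old': target '$new' is empty or already exists\n";
            next;
        }
        rename $old, $new or warn "cannot rename '$old': $!\n";
    }
}

main(@ARGV) unless caller;
```

Saved under a name of your choice, it would be run with the files to rename as arguments. Refusing to overwrite an existing target matters here, because different Cyrillic names can collapse to the same ASCII name.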
 
Ilya Zakharevich

Old portable MP3 players, old DVD players, car stereos and other stereos
capable of playing MP3s are not really 'gifted' at displaying much
more than ASCII; some even refuse or skip Cyrillic file names.

I use

audio_rename -@csR .

Hope this helps,
Ilya
 
Ilya Zakharevich

In the first message of this thread
Let me summarize what I learned from this thread:

a) http://userguide.icu-project.org/transforms/general

is very educational on what kind of intelligent questions and
answers one could encounter in this topic. (This is slightly
offset by the terms `slash' and `backslash' being used
interchangeably - in a document on character semantics!)

b) http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

documentation (and organization - tables are broken into small
chunks) is very impressive. The date looks a little bit
suspicious - but it MAY be that there is nothing to fix!

c) http://search.cpan.org/~wollmers/Text-Undiacritic-0.02

(probably) uses Unicode data tables. (Probably) does not break
the tables into small chunks, so loading may be slow and take
quite a lot of resources. (But I did not look deep; it might be
that it uses Unicode tables at build time only to generate its
own "small tables".)

Myself, I would very much prefer the flexibility of "a" combined with
the simplicity of "b". If I were to code something like this, I
would special-case some small "hot" subset (like IBM's UGL, or M$'s
WGL), and would

A) machine-translate to this hot subset;

B) allow human-generated web of possible "equivalences" and
"simplifications" inside the hot subset...

Thanks to everybody,
Ilya
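
The two-stage idea from the summary above (steps A and B) might be sketched as follows. Everything here is an assumption made for illustration: the "hot" subset is a stand-in consisting of printable ASCII, Latin-1 and the basic Cyrillic letters rather than UGL or WGL, step A is approximated with compatibility decomposition from the core Unicode::Normalize module, and %SIMPLIFY is a tiny hypothetical hand-written table.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical stand-in for the hot subset:
# printable ASCII, Latin-1 and the basic Cyrillic letters U+0410..U+044F.
my $HOT = qr/[\x{20}-\x{7E}\x{A0}-\x{FF}\x{410}-\x{44F}]/;

# Step B: human-maintained simplifications inside the hot subset.
my %SIMPLIFY = (
    "\x{DF}" => 'ss',    # sharp s
    "\x{E6}" => 'ae',    # ae ligature
    "\x{E9}" => 'e',     # e with acute
    "\x{AB}" => '"',     # left guillemet
    "\x{BB}" => '"',     # right guillemet
);

# Step A: machine-derived mapping of one character into the hot subset.
sub to_hot {
    my ($ch) = @_;
    return $ch if $ch =~ /\A$HOT\z/;
    (my $base = NFKD($ch)) =~ s/\p{Mn}+//g;    # decompose, drop combining marks
    return $base if $base =~ /\A$HOT+\z/;
    return '?';                                # nothing usable was found
}

sub simplify {
    my ($text) = @_;
    my $hot = join '', map { to_hot($_) } split //, $text;
    $hot =~ s/(.)/exists $SIMPLIFY{$1} ? $SIMPLIFY{$1} : $1/ges;
    return $hot;
}
```

With these tables, simplify("\x{FB01}n") gives "fin" (the ligature is decomposed by step A) and simplify("stra\x{DF}e") gives "strasse" (handled by step B). Cyrillic letters inside the hot subset are left alone, so a transliteration table would still be applied afterwards.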
 
