character mapping functions and UNICODE : remove accents, case, etc

An. Valula · Oct 19, 2003

Hi,

does anyone out there know about perl capabilities to convert rich text,
such as "étrangères" to "etrangere" (remove accents)?
Of course, tr/éè/ee/ would do, but I look for sth better: you do not
tr/a-z/A-Z/ for uc(), do you?

regards

James Willmore · Oct 19, 2003

does anyone out there know about perl capabilities to convert rich
text, such as "étrangères" to "etrangere" (remove accents)?
Of course, tr/éè/ee/ would do, but I look for sth better: you do not
tr/a-z/A-Z/ for uc(), do you?

I realize this doesn't answer the question directly, but have you
checked out RTF::Parser
(http://search.cpan.org/~pverd/RTF-Parser-1.07/)? That _may_ aid you
in what you want to accomplish.

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.


An. Valula · Oct 23, 2003

Hi,

thank you for your answer, but, no, I do not want to remove bold or
paragraph marks.

I want to convert "rich" text to "poor" text.
By "rich" text I mean text with accents, mixed case, etc.
For example: "Hêtre chétif".
"Poor" text, by contrast, has no accents and no casing (casing is easy to solve
with uc/lc). For example: "hetre chetif".

There must be someone else who wants to compare strings without diacritical
marks?!

regards


Alan J. Flavell · Oct 23, 2003

thank you for your answer, but, no, I do not want to remove bold or
paragraph marks.

But that *is* what the term "rich text" format normally refers to -
whether used in the generic sense or in particular reference to
Microsoft's "RTF" interchange specification.

I want to convert "rich" text to "poor" text.

Not really, and that's why you confused the previous respondent. You
need some better term. (Try a glossary of text processing if you
don't believe me).

There must be someone else who wants to compare strings without diacritical
signs ?!

Is there a problem? You already know one solution.

You probably should note that your tr/// and your uc() perform
*different* operations, in general - also depending on the locale
setting.
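
To make the difference concrete, here is a minimal sketch. It assumes Perl 5.12 or later (for the unicode_strings feature), a source file saved as UTF-8, and no "use locale" in effect; the variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;                          # assumption: this file is saved as UTF-8
use feature 'unicode_strings';     # uc() uses Unicode rules for all strings

my $text = "hêtre chétif";

# tr/a-z/A-Z/ only maps the 26 ASCII letters; ê and é are left untouched.
(my $ascii_only = $text) =~ tr/a-z/A-Z/;

# uc() applies Unicode case mapping, so the accented letters change too.
my $unicode_uc = uc $text;

binmode STDOUT, ':encoding(UTF-8)';
print "$ascii_only\n";    # HêTRE CHéTIF
print "$unicode_uc\n";    # HÊTRE CHÉTIF
```

Under "use locale" the result of uc() would instead depend on the locale's case tables, which is the other half of the point above.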

Anyhow, I don't have an answer to your requirement, other than the
obvious one. Well, perhaps I do: you could "do the Unicode
decomposition" thing, but it would seem distinctly inefficient
compared to a tr///.
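
One way the decomposition approach might look, as a sketch only: it uses the Unicode::Normalize module (bundled with Perl since 5.8) to split each accented letter into a base letter plus combining marks, then deletes the marks. The helper name strip_accents is hypothetical, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                          # assumption: this file is saved as UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: remove diacritics via canonical decomposition,
# then fold to lower case.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # é becomes e followed by U+0301
    $decomposed =~ s/\p{M}//g;     # delete every combining mark
    return lc $decomposed;
}

my $plain = strip_accents("Hêtre chétif");
print "$plain\n";    # hetre chetif
```

Letters that have no canonical decomposition, such as ø, æ or ß, pass through unchanged, so this covers accents but not every "special" letter.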

Have a look at, e.g., http://www.perldoc.com/perl5.8.0/pod/perlretut.html
and see whether you really want to fight this via Unicode-style regex
features. If you want to be sure of covering accents that you've
never even heard of, then I guess that's the way to go, but if you're
just looking for the usual Western-European accents then me, I'd go
with the tr/// I reckon. But this is all supposition - it's not a
requirement which I've needed myself.
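
For comparison, the plain tr/// route could be sketched as below. The character list is an assumption: it covers only the lowercase accented letters common in French and German, and anything outside it is left alone. The helper name strip_western_accents is hypothetical, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                          # assumption: this file is saved as UTF-8

# Hypothetical helper: map a fixed set of lowercase accented letters
# to their unaccented base letters. Characters not listed are kept.
sub strip_western_accents {
    my ($text) = @_;
    $text =~ tr/àâäçèéêëîïôöùûü/aaaceeeeiioouuu/;
    return $text;
}

# Fold case first, since the table above lists lowercase letters only.
my $plain = strip_western_accents(lc "Hêtre chétif");
print "$plain\n";    # hetre chetif
```

The trade-off is the one described above: the table is fast and obvious, but it silently ignores any accent nobody thought to list.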
