character mapping functions and UNICODE : remove accents, case, etc

An. Valula

Hi,

Does anyone out there know of Perl capabilities to convert accented text,
such as "étrangères" to "etrangeres" (remove accents)?
Of course, tr/éè/ee/ would do, but I am looking for something better: you do not
write tr/a-z/A-Z/ for uc(), do you?

regards
 
James Willmore

> Does anyone out there know of Perl capabilities to convert accented
> text, such as "étrangères" to "etrangeres" (remove accents)?
> Of course, tr/éè/ee/ would do, but I am looking for something better: you do not
> write tr/a-z/A-Z/ for uc(), do you?

I realize this doesn't answer the question directly, but have you
checked out RTF::Parser
(http://search.cpan.org/~pverd/RTF-Parser-1.07/)? That _may_ aid you
in what you want to accomplish.

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
Celebrate Hannibal Day this year. Take an elephant to lunch.
 
An. Valula

Hi,

thank you for your answer, but, no, I do not want to remove bold or
paragraph marks.

I want to convert "rich" text to "poor" text.
What I call "rich" text is, for example, text with accents, mixed case, etc.
For example: "Hêtre chétif".
Whereas "poor" text is without accents and without casing (casing is easy to solve
with uc/lc). For example: "hetre chetif".

There must be someone else who wants to compare strings without diacritical
marks?!

regards



 
Alan J. Flavell

> thank you for your answer, but, no, I do not want to remove bold or
> paragraph marks.

But that *is* what the term "rich text" format normally refers to -
whether used in the generic sense or in particular reference to
Microsoft's "RTF" interchange specification.

> I want to convert "rich" text to "poor" text.

Not really, and that's why you confused the previous respondent. You
need some better term. (Try a glossary of text processing if you
don't believe me).

> There must be someone else who wants to compare strings without diacritical
> signs ?!

Is there a problem? You already know one solution.

You probably should note that your tr/// and your uc() perform
*different* operations, in general - also depending on the locale
setting.
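
To make that difference concrete, here is a minimal sketch. It assumes a UTF-8 encoded source file and a Perl new enough (5.12 or later) for the unicode_strings feature, and it deliberately leaves `use locale` out of the picture:

```perl
use strict;
use warnings;
use utf8;                        # string literals below are UTF-8 source text
use feature 'unicode_strings';   # Unicode rules for uc/lc on every string

my $word = "étrangères";

# tr/a-z/A-Z/ maps only the 26 unaccented ASCII letters.
(my $via_tr = $word) =~ tr/a-z/A-Z/;

# uc() maps every character that has an uppercase form.
my $via_uc = uc $word;

binmode STDOUT, ':encoding(UTF-8)';
print "$via_tr\n";   # éTRANGèRES
print "$via_uc\n";   # ÉTRANGÈRES
```

Under `use locale` the result of uc() would depend on the locale in effect, which this sketch does not try to show.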

Anyhow, I don't have an answer to your requirement, other than the
obvious one. Well, perhaps I do: you could "do the Unicode
decomposition" thing, but it would seem distinctly inefficient
compared to a tr///.
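
For what it's worth, the decomposition approach could look roughly like the sketch below. It uses Unicode::Normalize (in the core distribution since 5.8); the helper name strip_marks is made up for illustration, and the source file is assumed to be UTF-8:

```perl
use strict;
use warnings;
use utf8;                          # string literals below are UTF-8 source text
use Unicode::Normalize qw(NFD);

# strip_marks is a hypothetical helper name, not part of any module.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{Mn}+//g;   # delete the nonspacing (combining) marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_marks("étrangères"), "\n";         # etrangeres
print lc(strip_marks("Hêtre chétif")), "\n";   # hetre chetif
```

This only removes marks that Unicode treats as separable: letters such as "ø", "æ" or "ß" have no canonical decomposition and pass through unchanged, so they would still need an explicit mapping.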

Have a look at e.g. http://www.perldoc.com/perl5.8.0/pod/perlretut.html
and see whether you really want to fight this via Unicode-style regex
features. If you want to be sure of covering accents that you've
never even heard of, then I guess that's the way to go, but if you're
just looking for the usual Western-European accents then me, I'd go
with the tr/// I reckon. But this is all supposition - it's not a
requirement which I've needed myself.
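
A sketch of that tr/// route, assuming a UTF-8 source file and only the accented letters commonly seen in French. The table and the name fold_french are illustrative, not a complete Western-European mapping:

```perl
use strict;
use warnings;
use utf8;                        # string literals and the tr/// table are UTF-8
use feature 'unicode_strings';   # Unicode rules for lc on every string

# fold_french is a hypothetical helper; extend the table for other languages.
sub fold_french {
    my ($text) = @_;
    $text = lc $text;            # fold case first, so the table needs lowercase only
    $text =~ tr/àâäéèêëîïôöùûüç/aaaeeeeiioouuuc/;   # 15 characters on each side
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';
print fold_french("Hêtre chétif"), "\n";   # hetre chetif
print fold_french("ÉTRANGÈRES"), "\n";     # etrangeres
```

The two sides of the tr/// must stay the same length, so each accented letter needs its own replacement character when you extend the table.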
 
