Clean out accents in French names

Patrick L. Nolan

I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.
It choked on the name "Jean-Paul le Fevre", where there
was an accent over the first e in Fevre. I don't know
how to type that on this keyboard. I edited the file
by hand, changing that character to a plain "e", and
all was OK. By the way, this isn't Unicode, it's just
extended ASCII.

I think I know how to identify "non-printing" characters
like that, but I would like to translate each one to
its nearest equivalent in the basic ASCII character
set. Thus the various e's with acute, grave and
circumflex accents would all go to "e", and so forth.
Has this problem been solved?
 
John Bokma

Patrick said:
I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.

Then use something like:

<?xml version="1.0" encoding="ISO-8859-1"?>

Patrick said:
Has this problem been solved?

Specify an encoding, and there shouldn't be any problem.
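
On the Perl side, this could be as small as printing the declaration before the rest of the document. The following is only a sketch: the helper name xml_with_declaration and the sample <person> element are made up for illustration, and it assumes the script's data is already Latin-1 bytes, as described in the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend an XML declaration that names the
# encoding the body is actually written in.
sub xml_with_declaration {
    my ($body, $encoding) = @_;
    $encoding = 'ISO-8859-1' unless defined $encoding;
    return qq{<?xml version="1.0" encoding="$encoding"?>\n} . $body;
}

# \xE8 is e-grave in Latin-1; the element names are invented.
my $doc = xml_with_declaration(
    "<person><name>Jean-Paul le F\xE8vre</name></person>\n");

binmode STDOUT;    # write the bytes through unchanged
print $doc;
```

The declaration only helps if it matches the bytes that follow it, so the encoding named here has to be the one the data really uses.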
 
Jürgen Exner

Patrick said:
I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.
It choked on the name "Jean-Paul le Fevre", where there
was an accent over the first e in Fevre. I don't know
how to type that on this keyboard. I edited the file
by hand, changing that character to a plain "e", and
all was OK. By the way, this isn't Unicode, it's just
extended ASCII.

Well, did you explicitly set the encoding attribute in the XML file to
ISO-8859-1 or Windows-1252 or to whatever you mean by "extended ASCII"?
If not, then the file would contain "extended ASCII" but the other
application would try to read UTF-8, because that's the default for XML if
nothing contrary is specified. Of course that's going to cause trouble.

Patrick said:
I think I know how to identify "non-printing" characters
like that, but I would like to translate each one to
its nearest equivalent in the basic ASCII character
set. Thus the various e's with acute, grave and
circumflex accents would all go to "e", and so forth.

Not a good idea. I for one am particular about my first name. It's Jürgen,
not Jurgen (which isn't any name at all).
And as has been discussed many times, just removing the accents etc.
transforms e.g. "to sleep" (?) into "whore" in Swedish.

Patrick said:
Has this problem been solved?

Often. Just use the proper encoding. Luckily XML was designed with support
for non-English characters in mind.

jue
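
Another way to "use the proper encoding" is to convert the Latin-1 data to UTF-8, which an XML parser assumes when no encoding is declared. A minimal sketch using the core Encode module follows; the function name latin1_to_utf8 is made up, and it assumes the input really is ISO-8859-1.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: reinterpret Latin-1 bytes as characters,
# then serialise those characters as UTF-8 bytes.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('ISO-8859-1', $bytes);
    return encode('UTF-8', $chars);
}

binmode STDOUT;    # the result is already encoded bytes
print latin1_to_utf8("Jean-Paul le F\xE8vre"), "\n";
```

The accented characters survive intact; each one simply takes two bytes in the output instead of one.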
 
Gunnar Hjalmarsson

Jürgen Exner said:
And as has been discussed many times just removing the accents etc.
transforms e.g. "to sleep" (?) into "whore" in Swedish.

Guess you mean "to hear". The Swedish for "hear" is "höra", and removing
those dots changes the meaning as you say. :)
 
Charles DeRykus

Jürgen Exner said:
Not a good idea. I for one am particular about my first name. It's Jürgen,
not Jurgen (which isn't any name at all).
And as has been discussed many times just removing the accents etc.
transforms e.g. "to sleep" (?) into "whore" in Swedish.

As I recall, 'hear' gets 'downgraded' to 'whore' with a
non-umlauted 'o'. (I'm not a Swede but I sit near one).

(Personally, I just hate it when 'DeRykus' gets rendered without
the capital 'R' too but that's another beef with no Perl content).
 
John

I can see where you'd want to use these encoding attributes from Day 1,
but I think that day has passed for the OP. Am I reading that
correctly?

His constraint is that the receiving application can't deal with those
characters, regardless of what header precedes it. Again, correct me if
I'm wrong, Patrick.

If I were trying to solve that problem, I'd probably prepare by looping
through the character set and printing each character alongside an
index (hex, dec, octal -- take your pick). After eyeballing the
selection to gauge the best match, I'd use those indices in a
translation, rather than try to type the accented character.
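
A rough sketch of that inspection loop, assuming the data is Latin-1 and only the upper range 0xA0-0xFF is of interest. The function name latin1_table is made up, and the characters will only display properly on a terminal that is itself set to Latin-1.

```perl
use strict;
use warnings;

# Hypothetical helper: one row per character in 0xA0..0xFF, giving the
# decimal, hex and octal index followed by the raw character.
sub latin1_table {
    my @rows;
    for my $code (0xA0 .. 0xFF) {
        push @rows, sprintf("%3d  0x%02X  %04o  %s",
                            $code, $code, $code, chr($code));
    }
    return @rows;
}

binmode STDOUT;    # emit the raw bytes unchanged
print "$_\n" for latin1_table();
```

The hex column can then be used directly in tr/// or s/// as \xHH escapes, which avoids having to type any accented character.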

You'd run afoul of Juergen (is that spelling a better approximation?),
but your target app wouldn't barf.
 
thundergnat

Patrick said:
I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.
It choked on the name "Jean-Paul le Fevre", where there
was an accent over the first e in Fevre. I don't know
how to type that on this keyboard. I edited the file
by hand, changing that character to a plain "e", and
all was OK. By the way, this isn't Unicode, it's just
extended ASCII.

I think I know how to identify "non-printing" characters
like that, but I would like to translate each one to
its nearest equivalent in the basic ASCII character
set. Thus the various e's with acute, grave and
circumflex accents would all go to "e", and so forth.
Has this problem been solved?

As has been pointed out in several other posts, there
are many reasons to avoid doing this, or at least to
do it very sparingly.

Nevertheless, I have written routines in the
past to do this, for when I need to alphabetize a list
of words which could contain Latin-1 characters > 127
but I could not be certain of a particular locale
setting.


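sub deaccent {
    my $phrase = shift;
    # Nothing to do unless the string holds a character in \xC0-\xFF.
    return $phrase unless $phrase =~ m/[\xC0-\xFF]/;
    # One-to-one replacements. The literal characters below assume this
    # source file is itself saved as Latin-1.
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    # Ligatures expand to two letters, which tr/// cannot do.
    $phrase =~ s/\xC6/AE/g;
    $phrase =~ s/\xE6/ae/g;
    return $phrase;
}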



That could then be used to sort a word list hash like:


sort { lc(deaccent($a)) cmp lc(deaccent($b)) } keys %wordlist


Again though, you should NOT use this lightly, as it can
completely change the meaning of words.
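
For comparison, here is a sketch of a different technique from the tr/// table above: decode the Latin-1 bytes, decompose them with the core Unicode::Normalize module, and drop the combining marks. The name strip_marks is made up. Note that it returns a character string rather than bytes, and that letters with no decomposition (Ø, Æ, ß, Þ, Ð) pass through unchanged, so it is not a drop-in replacement.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: split each accented letter into base letter plus
# combining mark (NFD), then delete the marks (\p{Mn}).
sub strip_marks {
    my ($bytes) = @_;
    my $text = NFD(decode('ISO-8859-1', $bytes));
    $text =~ s/\p{Mn}//g;
    return $text;
}

print strip_marks("Jean-Paul le F\xE8vre"), "\n";
```

The same caution applies: this changes spellings and can change meanings.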
 
Arndt Jonasson

thundergnat said:
Patrick said:
I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.
It choked on the name "Jean-Paul le Fevre", where there
was an accent over the first e in Fevre. I don't know
how to type that on this keyboard. I edited the file
by hand, changing that character to a plain "e", and
all was OK. By the way, this isn't Unicode, it's just
extended ASCII.
I think I know how to identify "non-printing" characters
like that, but I would like to translate each one to
its nearest equivalent in the basic ASCII character
set. Thus the various e's with acute, grave and
circumflex accents would all go to "e", and so forth.
Has this problem been solved?

As has been pointed out in several other posts, there
are many reasons to avoid doing this, or at least to
do it very sparingly.

Nevertheless, I have written routines in the
past to do this, for when I need to alphabetize a list
of words which could contain Latin-1 characters > 127
but I could not be certain of a particular locale
setting.


sub deaccent{
my $phrase = shift;
return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
$phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
$phrase =~ s/\xC6/AE/g;
$phrase =~ s/\xE6/ae/g;
return $phrase;
}

A few Latin-1 characters are not taken care of by the above function:

upper and lowercase Icelandic thorn (Þ, þ);
upper and lowercase Icelandic eth (Ð, ð);
German eszett (ß) (Latin-1 has no uppercase version).

The following additions may be appropriate:

$phrase =~ s/\xDE/TH/g;
$phrase =~ s/\xFE/th/g;
$phrase =~ s/\xD0/TH/g;
$phrase =~ s/\xF0/th/g;
$phrase =~ s/\xDF/ss/g;

Thorn and eth are certainly not equivalent, but I leave it to an
Icelandic speaker to say whether there is a better conversion.
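
Putting the pieces from this thread together, a combined version might look like the sketch below. The name deaccent_latin1 is made up. It writes every character as a \xHH escape so the result does not depend on how the source file is saved, and it keeps the thorn/eth/eszett mappings suggested above, which are a judgment call rather than an established convention.

```perl
use strict;
use warnings;

# Hypothetical combined routine for Latin-1 input.
sub deaccent_latin1 {
    my ($s) = @_;
    return $s unless $s =~ /[\xC0-\xFF]/;

    # When the replacement list is shorter than the search list, tr///
    # repeats its last character, so each range collapses to one letter.
    $s =~ tr/\xC0-\xC5/A/;
    $s =~ tr/\xE0-\xE5/a/;
    $s =~ tr/\xC7\xE7\xD1\xF1/CcNn/;
    $s =~ tr/\xC8-\xCB/E/;
    $s =~ tr/\xE8-\xEB/e/;
    $s =~ tr/\xCC-\xCF/I/;
    $s =~ tr/\xEC-\xEF/i/;
    $s =~ tr/\xD2-\xD6\xD8/O/;
    $s =~ tr/\xF2-\xF6\xF8/o/;
    $s =~ tr/\xD9-\xDC/U/;
    $s =~ tr/\xF9-\xFC/u/;
    $s =~ tr/\xDD/Y/;
    $s =~ tr/\xFD\xFF/y/;

    # One-to-many replacements, following the mappings in this thread.
    my %multi = (
        "\xC6" => 'AE', "\xE6" => 'ae',    # ae ligature
        "\xDE" => 'TH', "\xFE" => 'th',    # thorn
        "\xD0" => 'TH', "\xF0" => 'th',    # eth
        "\xDF" => 'ss',                    # eszett
    );
    $s =~ s/([\xC6\xE6\xDE\xFE\xD0\xF0\xDF])/$multi{$1}/g;
    return $s;
}

print deaccent_latin1("Jean-Paul le F\xE8vre"), "\n";
```

The multiplication and division signs (\xD7, \xF7) are deliberately left alone, since they are not letters.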
 
Patrick L. Nolan

John Bokma said:
Then use something like:
<?xml version="1.0" encoding="ISO-8859-1"?>
                              ^^^^^^^^^^
Specify an encoding, and there shouldn't be any problem.

Thanks. That works. I'm a bit surprised that it was that
simple. I got the XML format by reverse-engineering files
that were produced by the target application. Since they
didn't have an encoding specified, I thought they wouldn't
be able to deal with it. They must be using a standard
XML parser library.
 
