Clean out accents in French names

Patrick L. Nolan · May 17, 2005

I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.
It choked on the name "Jean-Paul le Fevre", where there
was an accent over the first e in Fevre. I don't know
how to type that on this keyboard. I edited the file
by hand, changing that character to a plain "e", and
all was OK. By the way, this isn't Unicode, it's just
extended ASCII.

I think I know how to identify "non-printing" characters
like that, but I would like to translate each one to
its nearest equivalent in the basic ASCII character
set. Thus the various e's with acute, grave and
circumflex accents would all go to "e", and so forth.
Has this problem been solved?

John Bokma · May 18, 2005

Patrick said:
I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.

Then use something like:

<?xml version="1.0" encoding="ISO-8859-1"?>

Has this problem been solved?

Specify an encoding, and there shouldn't be any problem.
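
A minimal Perl sketch of that advice, assuming the script already holds the names as Latin-1 byte strings. The element names, the helper names `xml_escape` and `build_xml`, and the sample name are all hypothetical; the thread never shows the real file format, and the grave accent (byte 0xE8) is a guess.

```perl
use strict;
use warnings;

# Hypothetical sample: the name as a Latin-1 byte string (0xE8 = e with grave).
my $name = "Jean-Paul le F\xE8vre";

# Escape the characters that are special in XML text content.
sub xml_escape {
    my $text = shift;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build the document; the declaration states which encoding the bytes use.
sub build_xml {
    my @names = @_;
    my $xml = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<people>\n};
    $xml .= "  <person>" . xml_escape($_) . "</person>\n" for @names;
    $xml .= "</people>\n";
    return $xml;
}

# Emit the bytes unchanged, so 0xE8 stays 0xE8; redirect to a file as needed.
binmode STDOUT;
print build_xml($name);
```

The declaration only helps if it matches the bytes actually written, which is why the output handle is left in binary mode here.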

Jürgen Exner · May 18, 2005

Patrick said:
I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.
It choked on the name "Jean-Paul le Fevre", where there
was an accent over the first e in Fevre. I don't know
how to type that on this keyboard. I edited the file
by hand, changing that character to a plain "e", and
all was OK. By the way, this isn't Unicode, it's just
extended ASCII.

Well, did you set the encoding attribute in the XML file explicitly to
ISO-8859-1 or Windows-1252 or to whatever you mean by "extended ASCII"?
If not then the file would contain "extended ASCII" but the other
application would try to read UTF-8, because that's the default for XML if
nothing contrary is specified. Of course that's going to cause trouble.
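
The mismatch described here can be checked from Perl with the core Encode module. This is only a sketch: the sample bytes are hypothetical (the thread does not say which accent was involved) and `is_valid_utf8` is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample bytes: the surname in ISO-8859-1 (0xE8 = e with grave).
my $bytes = "F\xE8vre";

# Returns 1 if the byte string is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my $copy = shift;    # decode() may modify its input when CHECK is set
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# A parser that assumes UTF-8 sees a malformed sequence here.
print "valid UTF-8? ", (is_valid_utf8($bytes) ? "yes" : "no"), "\n";

# Read as ISO-8859-1, the same byte is simply the character U+00E8.
my $text = decode('ISO-8859-1', $bytes);
printf "Latin-1 reading: U+%04X\n", ord(substr($text, 1, 1));
```

The byte 0xE8 announces a three-byte UTF-8 sequence, but the next byte is a plain "v", so a strict UTF-8 reader has to reject the input.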

I think I know how to identify "non-printing" characters
like that, but I would like to translate each one to
its nearest equivalent in the basic ASCII character
set. Thus the various e's with acute, grave and
circumflex accents would all go to "e", and so forth.

Not a good idea. I for one am particular about my first name. It's Jürgen,
not Jurgen (which isn't any name at all).
And as has been discussed many times just removing the accents etc.
transforms e.g. "to sleep" (?) into "whore" in Swedish.

Has this problem been solved?

Often. Just use the proper encoding. Luckily XML was designed with support
for non-English characters in mind.

jue

Gunnar Hjalmarsson · May 18, 2005

Jürgen Exner said:
And as has been discussed many times just removing the accents etc.
transforms e.g. "to sleep" (?) into "whore" in Swedish.

Guess you mean "to hear". The Swedish for "hear" is "höra", and removing
those dots changes the meaning as you say.

Charles DeRykus · May 18, 2005

Jürgen Exner said:
Not a good idea. I for one am particular about my first name. It's Jürgen,
not Jurgen (which isn't any name at all).
And as has been discussed many times just removing the accents etc.
transforms e.g. "to sleep" (?) into "whore" in Swedish.

As I recall, 'hear' gets 'downgraded' to 'whore' with a
non-umlauted 'o'. (I'm not a Swede but I sit near one).

(Personally, I just hate it when 'DeRykus' gets rendered without
the capital 'R' too but that's another beef with no Perl content).

John · May 18, 2005

I can see where you'd want to use these encoding attributes from Day 1,
but I think that day has passed for the OP. Am I reading that
correctly?

His constraint is that the receiving application can't deal with those
characters, regardless of what header precedes it. Again, correct me if
I'm wrong, Patrick.

If I was trying to solve that problem, I'd probably prepare by looping
through the character set and printing each character alongside an
index (hex, dec, octal -- take your pick). After eyeballing the
selection to gauge the best match, I'd use those indices in a
translation, rather than try to type the accented character.

You'd run afoul of Juergen (is that spelling a better approximation?),
but your target app wouldn't barf.
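
The table-first approach described above might look like this in Perl. It is a sketch that assumes the data is ISO-8859-1; `table_row` and `plain_e` are hypothetical names, and only three translations are filled in.

```perl
use strict;
use warnings;

# Format one table row for a code point: decimal, hex, octal, raw byte.
sub table_row {
    my $code = shift;
    return sprintf "%3d  0x%02X  0%03o  %s", $code, $code, $code, chr $code;
}

# 0xA0-0xFF is the printable upper half of ISO-8859-1.
binmode STDOUT;    # emit raw bytes; view in a Latin-1 terminal or editor
print table_row($_), "\n" for 0xA0 .. 0xFF;

# Indices picked by eye from the table go into a translation by escape,
# so no accented character has to be typed. Only the three e's are shown.
sub plain_e {
    my $s = shift;
    $s =~ tr/\xE8\xE9\xEA/eee/;    # e-grave, e-acute, e-circumflex
    return $s;
}
```

Writing the translation with `\x` escapes also keeps the script independent of the encoding of its own source file.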

thundergnat · May 18, 2005

Patrick said:
I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.
It choked on the name "Jean-Paul le Fevre", where there
was an accent over the first e in Fevre. I don't know
how to type that on this keyboard. I edited the file
by hand, changing that character to a plain "e", and
all was OK. By the way, this isn't Unicode, it's just
extended ASCII.

I think I know how to identify "non-printing" characters
like that, but I would like to translate each one to
its nearest equivalent in the basic ASCII character
set. Thus the various e's with acute, grave and
circumflex accents would all go to "e", and so forth.
Has this problem been solved?

As has been pointed out in several other posts, there
are many reasons to avoid doing this, or at least to
do it very sparingly.

Nevertheless, I have written routines in the
past to do this, for when I need to alphabetize a list
of words which could contain Latin-1 characters > 127
but I could not be certain of a particular locale
setting.

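sub deaccent {
    my $phrase = shift;
    # Nothing to do unless the string has a character in the 0xC0-0xFF range.
    return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
    # The literal characters below assume this file is saved as ISO-8859-1.
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    # The ligatures expand to two letters, so tr/// cannot handle them.
    $phrase =~ s/\xC6/AE/g;
    $phrase =~ s/\xE6/ae/g;
    return $phrase;
}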

That could then be used to sort a word list hash like:

sort {lc(deaccent($a)) cmp lc(deaccent($b))} keys %wordlist

Again though, you should NOT use this lightly, as it can
completely change the meaning of words.
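
A different technique from the hand-written table, for comparison: decode the bytes, decompose them with the core Unicode::Normalize module, and drop the combining marks. This is a sketch that assumes the input really is ISO-8859-1; `strip_marks` is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFD);

# Strip combining marks from Latin-1 bytes; returns a character string.
sub strip_marks {
    my $bytes = shift;
    my $text  = NFD(decode('ISO-8859-1', $bytes));  # e-grave -> e + U+0300
    $text =~ s/\p{Mn}+//g;                          # drop the combining marks
    return $text;
}

print strip_marks("F\xE8vre"), "\n";    # Fevre
```

Letters with no decomposition, such as Ø, Æ and ß, pass through unchanged, so this covers less than an explicit table does. The same caution about changed meanings applies.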

Arndt Jonasson · May 18, 2005

thundergnat said:
Patrick said:

I have a script that takes information, including people's
names, and builds an XML file. I just found that the
application that reads the XML is fussy about characters.
It choked on the name "Jean-Paul le Fevre", where there
was an accent over the first e in Fevre. I don't know
how to type that on this keyboard. I edited the file
by hand, changing that character to a plain "e", and
all was OK. By the way, this isn't Unicode, it's just
extended ASCII.
I think I know how to identify "non-printing" characters
like that, but I would like to translate each one to
its nearest equivalent in the basic ASCII character
set. Thus the various e's with acute, grave and
circumflex accents would all go to "e", and so forth.
Has this problem been solved?


As has been pointed out in several other posts, there
are many reasons to avoid doing this, or at least to
do it very sparingly.

Nevertheless, I have written routines in the
past to do this, for when I need to alphabetize a list
of words which could contain Latin-1 characters > 127
but I could not be certain of a particular locale
setting.

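sub deaccent {
    my $phrase = shift;
    # Nothing to do unless the string has a character in the 0xC0-0xFF range.
    return $phrase unless ($phrase =~ m/[\xC0-\xFF]/);
    # The literal characters below assume this file is saved as ISO-8859-1.
    $phrase =~ tr/ÀÁÂÃÄÅàáâãäåÇçÈÉÊËèéêëÌÍÎÏìíîïÒÓÔÕÖØòóôõöøÑñÙÚÛÜùúûüÝÿý/AAAAAAaaaaaaCcEEEEeeeeIIIIiiiiOOOOOOooooooNnUUUUuuuuYyy/;
    # The ligatures expand to two letters, so tr/// cannot handle them.
    $phrase =~ s/\xC6/AE/g;
    $phrase =~ s/\xE6/ae/g;
    return $phrase;
}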

A few Latin-1 characters are not taken care of by the above function:

upper and lowercase Icelandic thorn (Þ, þ);
upper and lowercase Icelandic eth (Ð, ð);
German ess-zet (ß) (there is no uppercase version).

The following additions may be appropriate:

$phrase =~ s/\xDE/TH/g;
$phrase =~ s/\xFE/th/g;
$phrase =~ s/\xD0/TH/g;
$phrase =~ s/\xF0/th/g;
$phrase =~ s/\xDF/ss/g;

Thorn and eth are certainly not equivalent, but I leave it to an
Icelandic speaker to say whether there is a better conversion.
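
Putting the original mappings and these additions together, one possible self-contained version uses only `\x` escapes, so it does not depend on how the script file itself is encoded. The names `deaccent_latin1` and `sort_plain` are hypothetical, and the TH/th and ss choices are the tentative ones suggested above, not settled conventions.

```perl
use strict;
use warnings;

# Each rule: a class of Latin-1 code points and its ASCII stand-in.
my @RULES = (
    [ qr/[\xC0-\xC5]/     => 'A'  ], [ qr/[\xE0-\xE5]/     => 'a'  ],
    [ qr/\xC6/            => 'AE' ], [ qr/\xE6/            => 'ae' ],
    [ qr/\xC7/            => 'C'  ], [ qr/\xE7/            => 'c'  ],
    [ qr/[\xC8-\xCB]/     => 'E'  ], [ qr/[\xE8-\xEB]/     => 'e'  ],
    [ qr/[\xCC-\xCF]/     => 'I'  ], [ qr/[\xEC-\xEF]/     => 'i'  ],
    [ qr/\xD1/            => 'N'  ], [ qr/\xF1/            => 'n'  ],
    [ qr/[\xD2-\xD6\xD8]/ => 'O'  ], [ qr/[\xF2-\xF6\xF8]/ => 'o'  ],
    [ qr/[\xD9-\xDC]/     => 'U'  ], [ qr/[\xF9-\xFC]/     => 'u'  ],
    [ qr/\xDD/            => 'Y'  ], [ qr/[\xFD\xFF]/      => 'y'  ],
    [ qr/[\xD0\xDE]/      => 'TH' ], [ qr/[\xF0\xFE]/      => 'th' ],  # eth, thorn
    [ qr/\xDF/            => 'ss' ],                                   # ess-zet
);

# Map Latin-1 letters above 0xBF to plain ASCII approximations.
sub deaccent_latin1 {
    my $s = shift;
    return $s unless $s =~ /[\xC0-\xFF]/;
    for my $rule (@RULES) {
        my ($re, $to) = @$rule;
        $s =~ s/$re/$to/g;
    }
    return $s;
}

# Sort by the de-accented, lowercased form, computing each key only once.
sub sort_plain {
    my %key = map { $_ => lc(deaccent_latin1($_)) } @_;
    return sort { $key{$a} cmp $key{$b} or $a cmp $b } @_;
}

binmode STDOUT;
print "$_\n" for sort_plain("Zo\xEB", "\xC9mile", "anna");
```

Lowercasing happens after de-accenting because, without a locale, `lc` leaves bytes above 127 alone. The multiplication and division signs (0xD7, 0xF7) are deliberately left untouched.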

Patrick L. Nolan · May 18, 2005

John Bokma said:
Patrick L. Nolan wrote:

Then use something like:

<?xml version="1.0" encoding="ISO-8859-1"?>
                    ^^^^^^^^^^^^^^^^^^^^^

Specify an encoding, and there shouldn't be any problem.

Thanks. That works. I'm a bit surprised that it was that
simple. I got the XML format by reverse-engineering files
that were produced by the target application. Since they
didn't have an encoding specified, I thought they wouldn't
be able to deal with it. They must be using a standard
XML parser library.
