Displaying 'umlaut' character

dn.perl · Sep 22, 2010

My aim is to display the ‘special’ (non-ASCII) German character with
the umlaut (diaeresis) diacritic correctly in a browser. The browser
calls a CGI Perl script which resides on a Linux server. The browser
displays Vietnamese characters from that script correctly (but not the
umlaut) without any special setting. The script sets the NLS_LANG
variable to AMERICAN_AMERICA.UTF8 and uses the utf8 module, but that’s
about it.

$ENV{'NLS_LANG'}='AMERICAN_AMERICA.UTF8';
Works for Vietnamese characters, but not with umlaut (ö).

But even before we get to the Perl script, perhaps the LC_CTYPE
environment variable needs to be set correctly. From my Windows laptop,
if I access Oracle through Oracle Query Server, I can see the umlaut.
But if I open a Linux window, start an sqlplus session, and run the
same SQL, I do not see the umlaut correctly. I have tried a few values
for LC_CTYPE (like iso_8859_1, en_US, en_US.iso88591), but with no
luck. The surprising thing is that the umlaut is a well-known
character and the Vietnamese characters are less well known, yet the
Vietnamese characters are displayed correctly.

What settings should I use in a perl-script or for a linux-window to
see the umlaut correctly? Please advise.

Frank van Bortel · Sep 22, 2010

Maybe this helps: (shameless self promotion)
http://vanbortel.blogspot.com/2009/04/special-characters-part-i.html
Last part is here:
http://vanbortel.blogspot.com/2010/01/special-characters-part-iv.html

Peter J. Holzer · Sep 22, 2010

You almost certainly don't want to do either of those. 'use utf8' does
exactly one thing: it tells Perl your script itself is written in UTF-8.
If that isn't the case you don't want to use it. Perl also doesn't take
any notice of NLS_LANG or any of the other locale envvars unless you ask
it to (and, normally, that's a bad idea). However, it's possible that
whatever database interface you're using does.

I don't think that's usually a valid locale on a Linux system. Usually
they are of the form 'en_US.UTF-8', but in any case if you need locales
at all you will want to check which locales are available on your
system.

The NLS_LANG environment variable is for Oracle. He does need that if he
wants to get anything but US-ASCII out of (or into) an Oracle database.
AMERICAN_AMERICA.UTF8 is a valid locale for Oracle, but for Oracle 9 or
later you should use .AL32UTF8 instead of .UTF8 (.AL32UTF8 is real
UTF-8, .UTF8 is a weird mixture of UTF-8 and UTF-16).

Whatever "a linux window" may be. Putty? An X server? A VM running on
the windows host? Whatever it is, NLS_LANG must match the character set
used by the terminal emulator.

OK. What is actually stored in the database (what data types are you
using, and how is the data encoded before being stored)? How are you
getting the data out of the database (the only correct answer here is
'DBI', or possibly a wrapper around that)? Have you read the DBI and
DBD::Oracle docs for anything concerning character encodings? Have you
read perlunitut and the other docs it refers you to?

FWIW when I do this sort of thing I use Postgres with DBD::Pg, I set the
database encoding to UTF-8 (this is a Pg-specific feature, but I
wouldn't be surprised if Ora has got something similar),

DBD::Oracle does this if NLS_LANG includes a UTF-8-like character set.
Since he has set that correctly he gets wide characters back from the
database. The umlauts all have character codes <= 0xFF, so they can be
printed as a single byte and Perl does that. The Vietnamese characters
have codes >= 0x0100, so Perl converts them to UTF-8 (I bet he has a lot
of "Wide character in print" warnings in his log file).
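The behaviour described above can be reproduced without a database. The following is only a sketch: bytes_without_layer is a made-up helper name, and the two sample characters stand in for whatever the driver actually returns. It prints a string to a handle that has no encoding layer and shows the bytes that come out.

```perl
use strict;
use warnings;

# Print $text to an in-memory handle that has NO encoding layer and
# return the bytes that were written, as a hex string.
# (bytes_without_layer is a hypothetical helper for this illustration.)
sub bytes_without_layer {
    my ($text) = @_;
    my $buf = '';
    open my $fh, '>', \$buf or die "cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # hide "Wide character in print" for the demo
        print {$fh} $text;
    }
    close $fh;
    return join ' ', map { sprintf '%02x', ord } split //, $buf;
}

my $umlaut     = "\x{F6}";      # o with diaeresis, U+00F6
my $vietnamese = "\x{1EC7}";    # e with circumflex and dot below, U+1EC7

print 'umlaut:     ', bytes_without_layer($umlaut),     "\n";   # f6
print 'vietnamese: ', bytes_without_layer($vietnamese), "\n";   # e1 bb 87
```

The single byte f6 is not a valid UTF-8 sequence, so a browser that decodes the page as UTF-8 cannot show it as an umlaut, while e1 bb 87 happens to be correct UTF-8.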

I push an :encoding(utf8) layer onto any filehandles, I make sure to
send a 'Content-type: text/html; charset=utf-8' header, and everything
Just Works. There are variations on that which work just as well, but
that's by far the simplest approach.

ACK. The OP is probably missing the :encoding(utf8) layer.
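A minimal sketch of that approach for a CGI script might look like the following. It assumes the text has already arrived as a decoded character string; emit_page is a hypothetical helper, and the sample text is made up rather than taken from the OP's database.

```perl
use strict;
use warnings;

# Hypothetical helper: write a minimal CGI response to the given handle.
# $text is assumed to be a decoded character string, for example one
# returned by the database driver. A real script would also HTML-escape it.
sub emit_page {
    my ($out, $text) = @_;
    binmode $out, ':encoding(UTF-8)';    # encode everything printed from now on
    print {$out} "Content-Type: text/html; charset=utf-8\n\n";
    print {$out} "<p>$text</p>\n";
    return;
}

# Both the umlaut and the Vietnamese character now leave as UTF-8.
emit_page(\*STDOUT, "sch\x{F6}n, Vi\x{1EC7}t Nam");
```

With the layer in place the umlaut is written as the two bytes c3 b6, which matches the charset announced in the header.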

hp

Frank van Bortel · Sep 22, 2010

Apart from what I replied earlier, the correct way to encode
is of course "&ouml;" (without the quotes...)
As this is all ASCII, no problems should arise.

Helmut Richter · Sep 22, 2010

And a web server normally
sends a HTML header with the page that may contain a line
like

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

That tells the browser which type of character set to use
when displaying the page it got from the server.

Caution: A Web server sends an HTTP header (this is *not* a part of the
text of the Web page; in particular, it does *not* have the form of an
HTML tag like <meta>) giving the MIME type, e.g. "text/html", and
*optionally* containing a charset specification, e.g. "charset=utf-8".
The Web page may *optionally* contain such a <meta> tag. The Web server
is not obliged to send the HTTP header suggested in the <meta> tag, and
most servers don't -- I am not sure any of them does.

In the special case that the page is generated by a CGI script, the output
of the script contains *both* the HTTP header and the HTML text.

If a character encoding is specified in the HTTP header, it takes
precedence. If there is none, the one in the <meta> tag is honoured.
Opinions are divided on whether one should use the <meta> tag, as it
does not always have the intended effect, namely when the server sends
an HTTP header with a different charset specification. I prefer using it
for documentation and for those cases where there is no other charset
specification. More important than its presence is that it is true if
present -- if true, it never does any harm. All these specifications
only describe which encoding is used in the content; they do not enforce
it.
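The precedence rule just described can be written down as a small function. This is only an illustration of the rule as stated here, not of what any particular browser implements; effective_charset is a made-up name and the regular expressions are deliberately simplistic.

```perl
use strict;
use warnings;

# Return the charset that applies under the rule above: the HTTP header
# wins, and the <meta> tag is consulted only if the header carries no
# charset. Returns undef if neither specifies one.
# (effective_charset is a hypothetical helper for this illustration.)
sub effective_charset {
    my ($http_content_type, $html) = @_;

    if (defined $http_content_type
        && $http_content_type =~ /;\s*charset\s*=\s*"?([\w.:-]+)/i) {
        return lc $1;
    }
    if (defined $html
        && $html =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i) {
        return lc $1;
    }
    return;
}

my $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';

# Header and <meta> disagree: the header decides.
print effective_charset('text/html; charset=ISO-8859-1', $meta), "\n";   # iso-8859-1

# Header has no charset: the <meta> tag is used.
print effective_charset('text/html', $meta), "\n";                       # utf-8
```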

Back to perl: Whatever your problem is (which is by no means obvious), you
won't be able to understand it, let alone fix it, before knowing what is
written in http://perldoc.perl.org/perlunitut.html.

Peter J. Holzer · Sep 22, 2010

Apart from what I replied earlier, the correct way to encode
is of course "&ouml;" (without the quotes...)

That's not *the* correct way, just *a* correct way. Encoding it in the
charset indicated in the Content-Type header or a meta tag is equally
correct (and preferable in most circumstances, IMHO).
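To make the two options concrete, here is a small sketch. The helper names to_char_refs and to_declared_charset are invented for this example; the first emits numeric character references (&#246; is the numeric counterpart of the named entity for o with diaeresis), the second encodes the text in whatever charset the page declares.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Option 1: replace every non-ASCII character with a numeric character
# reference, so the output is pure ASCII whatever charset is declared.
sub to_char_refs {
    my ($text) = @_;
    (my $out = $text) =~ s/([^\x00-\x7F])/sprintf '&#%d;', ord $1/ge;
    return $out;
}

# Option 2: encode the characters in the charset the page declares.
# The result is a byte string ready to be written to a raw handle.
sub to_declared_charset {
    my ($text, $charset) = @_;
    return encode($charset, $text);
}

my $word = "sch\x{F6}n";
print to_char_refs($word), "\n";    # sch&#246;n

my $bytes = to_declared_charset($word, 'UTF-8');
print length($bytes), " bytes as UTF-8\n";    # 6 bytes as UTF-8
```

Either output is correct HTML as long as, in the second case, the declared charset really is the one used for encoding.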

hp

joel garry · Sep 22, 2010

Maybe this helps: (shameless self promotion)
http://vanbortel.blogspot.com/2009/04/special-characters-part-i.html
Last part is here:
http://vanbortel.blogspot.com/2010/01/special-characters-part-iv.html

Thanks for that Frank, I'm always forgetting where I've seen the
excellent write-up.

It always needs to be emphasized that using the wrong database
character set creates a ticking time bomb, as Oracle is so
sophisticated about automatic conversions in various circumstances.

jg

Jürgen Exner · Sep 23, 2010

Frank van Bortel said:
Apart from what I replied earlier, the correct way to encode
is of course "&ouml;" (without the quotes...)

If that were true then I guess we wouldn't need Unicode and all the
gazillion other attempts to represent non-English letters.

jue

Dr.Ruud · Sep 23, 2010

If that were true then I guess we wouldn't need Unicode and all the
gazillion other attempts to represent non-English letters.

Non-English? The trema (diaeresis) is often used: cooperate reenact
zoology Brontë naïve. (Umlaut diacritics are not.)

Uri Guttman · Sep 23, 2010

BM> I don't think there are any native English words which need any
BM> non-ASCII letters.

for some definition of native english!
for other definitions, all of english is non-native.

uri

Jürgen Exner · Sep 23, 2010

Dr.Ruud said:
Non-English? The trema (diaeresis) is often used: cooperate reenact
zoology Brontë naïve. (Umlaut diacritics are not.)

Quite right. Nevertheless it's still not a character found in the
English alphabet.

jue

Randal L. Schwartz · Sep 23, 2010

Uri> for other definitions, all of english is non-native.

Really? What language was the origination of commonly-used "laser" and
"radar"?

Looks like native english to me.

Jürgen Exner · Sep 23, 2010

Uri> for other definitions, all of english is non-native.

Really? What language was the origination of commonly-used "laser" and
"radar"?
Looks like native english to me.

You are right with "radar", but "laser" is pure Americanese ;-)

jue

Frank van Bortel · Sep 26, 2010

Thanks for that Frank, I'm always forgetting where I've seen the
excellent write-up.

It always needs to be emphasized that using the wrong database
character set creates a ticking time bomb, as Oracle is so
sophisticated about automatic conversions in various circumstances.

jg

Thanks for the thumbs up.

However - one thing I was trying to clarify,
is the fact that
* you do not store characters; you store code points
* there's no such thing as a wrong database character set
(a.k.a. there's always one way to screw up, at least!)

Frank van Bortel · Sep 26, 2010

That's not *the* correct way, just *a* correct way. Encoding it in the
charset indicated in the Content-Type header or a meta tag is equally
correct (and preferable in most circumstances, IMHO).

hp

would you please read the HTML definition?

Peter J. Holzer · Sep 28, 2010

I disagree. It isn't excellent. It is at best didactically inept and at
worst dangerously wrong.

Thanks for the thumbs up.

However - one thing I was trying to clarify,
is the fact that
* you do not store characters; you store code points

Nope. Of course from a very low-level point of view you only store
bytes. But those bytes have a meaning for Oracle - in the case of a
varchar2 field they mean characters, just as they mean floating point
numbers in a number field or dates in date field.

* there's no such thing as a wrong database character set

The database character set is wrong if it isn't able to represent the
characters you want to store.

(a.k.a. there's always one way to screw up, at least!)

There is also a way to not screw up. That would be the way most people
prefer.

hp
