UTF-8 encoding decoding not working with Danish characters

LarsM

Hi all,
I am new to XML, but I use it for an RSS feed.

I have one problem, which I have really been struggling with.

My XML document is generated from the contents of a MySQL database. It is
UTF-8 encoded.

However, the Danish special characters appear wrong.

For example, the letter å becomes "Ã¥" and the letter ø becomes "Ã¸".

See an example here:
http://netm.dk/blog/rss/index_rss2.xml

I thought that it could be because the encoding was not set in the document,
so I added this:
<?xml version="1.0" encoding="UTF-8" ?>
However, that did not make any difference, as can be seen here:
http://netm.dk/blog/rss/test_rss2.xml

The text decodes correctly on my regular web pages on http://netm.dk/

What am I doing wrong?

Regards,
Lars
www.netm.dk
 
Malte

LarsM said:
However, the Danish special characters appear wrong.

For example, the letter å becomes "Ã¥" and the letter ø becomes "Ã¸".

This is not limited to XML. I send mail with JavaMail. When I do
this from a Windows PC, the Danish characters are garbled, but when I run the
exact same program on Linux, the characters get through fine.

Hope we get rid of those ¤%@£¥ darned NLS issues sometime in my lifetime,
but I doubt it.
 
Jürgen Kahrs

LarsM said:
My XML document is generated from the contents of a MySQL database. It is
UTF-8 encoded.

You have to take care that *every* tool in the toolchain
knows how to handle UTF-8 correctly. Could you give us
a list of the tools involved?

LarsM said:
The text decodes correctly on my regular web pages on http://netm.dk/

Your web page looks OK to me.
I bet it is in the database or shortly thereafter.
 
LarsM

Jürgen Kahrs said:
Could you give us a list of the tools involved?

Thanks Jürgen,
The RSS feed is being generated by the same Blog application
("Boastmachine"), which I use to generate the Web pages. As far as I know it
accesses the database in the same way as for the "real" pages.
But I will check up on that.
-Lars
 
Jürgen Kahrs

LarsM said:
The RSS feed is being generated by the same Blog application
("Boastmachine"), which I use to generate the Web pages. As far as I know it
accesses the database in the same way as for the "real" pages.

So the problem should be in the Blog application.

LarsM said:
But I will check up on that.

Good idea. Maybe there is simply a bug in the RSS
extraction mechanism.
 
Nick Kew

LarsM said:
My XML document is generated from the contents of a MySQL database. It is
UTF-8 encoded.

No. It's ASCII encoded before an agent even looks at the document itself.
See RFC 3023 for details.

The good news is that the fix is a single line in httpd.conf.
 
Malte

LarsM said:
However, the Danish special characters appear wrong.

For example, the letter å becomes "Ã¥" and the letter ø becomes "Ã¸".

Pointing my (Linux) Firefox browser at your web site with the
encoding set to UTF-8, I see your page fine. Setting the encoding to
ISO-8859-1 generates the "Ã¥" stuff. One never knows how the users'
browsers are set up.

Look at this page: www.vietbao.com

Great-looking, authentic Vietnamese text with UTF-8. Obviously it does not
look good with ISO-8859-1 (Vietnamese characters are not part of that character set).
 
LarsM

Nick Kew said:
The good news is that the fix is a single line in httpd.conf.

I don't have my own Apache server, but am using an ISP (Freepaq.dk). Where
can I make the configuration change, then?

-Lars
 
Henri Sivonen

LarsM said:
I don't have my own Apache server, but am using an ISP (Freepaq.dk). Where
can I make the configuration change, then?

In a .htaccess file if your host allows it. Failing that, you could ask
your host to map .xml to application/xml. Failing that, I recommend
switching to another host.
 
LarsM

Henri Sivonen said:
In a .htaccess file if your host allows it. Failing that, you could ask
your host to map .xml to application/xml. Failing that, I recommend
switching to another host.

I've been reading through the RFC, but please enlighten me. What would the
syntax be for setting this? Please be as specific as possible.

Regards,
Lars
 
Rob van der Putten

Hi there


Henri said:
In a .htaccess file if your host allows it. Failing that, you could ask
your host to map .xml to application/xml. Failing that, I recommend
switching to another host.

lynx -head http://netm.dk/blog/rss/index_rss2.xml
HTTP/1.0 200 OK
Date: Thu, 10 Feb 2005 14:46:31 GMT
Server: Apache/1.3.33 (Unix) mod_perl/1.29 DAV/1.0.3 mod_gzip/1.3.26.1a PHP/4.3.9
Last-Modified: Tue, 08 Feb 2005 08:03:13 GMT
ETag: "bd67c1-1141-42087241"
Accept-Ranges: bytes
Content-Length: 4417
Content-Type: application/xml
Age: 704
X-Cache: HIT from www.sput.nl
X-Cache-Lookup: HIT from www.sput.nl:8080
Proxy-Connection: close

lynx -head http://netm.dk/blog/rss/test_rss2.xml
HTTP/1.0 200 OK
Date: Thu, 10 Feb 2005 14:48:18 GMT
Server: Apache/1.3.33 (Unix) mod_perl/1.29 DAV/1.0.3 mod_gzip/1.3.26.1a PHP/4.3.9
Last-Modified: Mon, 07 Feb 2005 18:45:44 GMT
ETag: "11e2dc0-1022-4207b758"
Accept-Ranges: bytes
Content-Length: 4130
Content-Type: application/xml
Age: 624
X-Cache: HIT from www.sput.nl
X-Cache-Lookup: HIT from www.sput.nl:8080
Proxy-Connection: close

This one is on my box:
lynx -head http://www.sput.nl/software/leased-line/leased-line.xml
HTTP/1.1 200 OK
Date: Thu, 10 Feb 2005 15:00:02 GMT
Server: Apache/1.3.26 (Unix) Debian GNU/Linux PHP/4.1.2
Last-Modified: Sun, 30 Jan 2005 07:44:42 GMT
ETag: "2787c-4840-41fc906a"
Accept-Ranges: bytes
Content-Length: 18496
Connection: close
Content-Type: text/xml; charset=UTF-8

However, my browser does consider all these files to be UTF-8 XML.
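
A possible reading of the headers above: as far as I understand RFC 3023, a text/xml response with no charset parameter is to be treated as US-ASCII, while application/xml with no charset parameter leaves the decision to the XML declaration (or byte order mark). The Python sketch below only illustrates that rule. The function name and the "from-xml-declaration" marker are invented for this example, and other cases in the RFC (such as +xml types) are ignored.

```python
from email.message import Message


def effective_xml_charset(content_type: str) -> str:
    """Guess which charset a receiver should assume for an XML response.

    Simplified reading of RFC 3023; hypothetical helper, not a library API.
    """
    msg = Message()
    msg["Content-Type"] = content_type

    # An explicit charset parameter in the HTTP header wins.
    charset = msg.get_content_charset()
    if charset is not None:
        return charset

    # text/xml without a charset parameter defaults to US-ASCII.
    if msg.get_content_type() == "text/xml":
        return "us-ascii"

    # application/xml without a charset parameter: look inside the document.
    return "from-xml-declaration"


for header in ("application/xml", "text/xml; charset=UTF-8", "text/xml"):
    print(header, "->", effective_xml_charset(header))
```

If this reading is right, the two netm.dk feeds are served in a way that lets the XML declaration decide, which would point away from the HTTP header as the cause.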


Regards,
Rob
 
LarsM

Sorry, but exactly how do I make the setting that Nick Kew and Henri
Sivonen suggested?

I have been reading through the RFC, but it is not completely clear to me...

Cheers,
Lars
www.netm.dk
 
Stanimir Stamenkov

/LarsM/:
Sorry, but exactly how do I make the setting that Nick Kew and Henri
Sivonen suggested?

I have been reading through the RFC, but it is not completely clear to me...

Please, quote at least some relevant text from the post you're
replying to.

What I meant is that, AFAIK, MySQL versions prior to 4.1 don't handle
Unicode characters. I have no experience with version 4.1, but it
seems the encoding configuration could be tricky with it, too.

It could happen that text is inserted into the DB using one
encoding and read using another (depending on the connection driver
configuration), producing different results. So, I guess, the data is
somehow inserted UTF-8 encoded but then read as ISO-8859-1, for
example. Generally this has nothing to do with RFCs but with MySQL-specific
configuration.
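
To make that guess concrete, here is a small Python sketch of the mismatch. It does not touch MySQL at all. It only encodes a string one way and decodes it another, which is an assumption about what the driver might be doing. The function name is made up for illustration.

```python
def simulate_mismatch(text, write_enc="utf-8", read_enc="iso-8859-1"):
    """Encode text with one charset and decode the bytes with another."""
    stored = text.encode(write_enc)   # what ends up in the database
    return stored.decode(read_enc)    # what a misconfigured reader sees


for ch in ("å", "ø"):
    garbled = simulate_mismatch(ch)
    # Re-encoding the garbled text as UTF-8 gives the bytes a feed would carry.
    print(ch, "->", garbled, garbled.encode("utf-8"))
```

With the default arguments this reproduces the "Ã¥" and "Ã¸" sequences from the original post, which at least fits the hypothesis.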

I've worked on an application which used MySQL 4.0 as its data store, and
because it was targeted at the Japanese market we had to configure
the connection driver specifically to encode/decode using the
Shift_JIS encoding.
 
LarsM

"Stanimir Stamenkov" wrote :
What I meant is that, AFAIK, MySQL versions prior to 4.1 don't handle Unicode
characters. I have no experience with version 4.1, but it seems the
encoding configuration could be tricky with it, too.

Thank you Stanimir. I think my Web host is on 4.0 only. I will look into
that and maybe go for another encoding all the way through...
Sorry about not quoting correctly...

Regards,
Lars
www.netm.dk
 
Rob van der Putten

Hi there

LarsM said:
However, the Danish special characters appear wrong.

For example, the letter å becomes "Ã¥" and the letter ø becomes "Ã¸".

In ISO-8859-1, a-ring (å) is 0xE5; in UTF-8 it is 0xC3 0xA5.
Read as ISO-8859-1, 0xC3 0xA5 is A-tilde (Ã) followed by the yen sign (¥).
The same applies to the other example.

So maybe the data gets stored as UTF-8 but retrieved as ISO-8859-1 and
then converted to UTF-8.
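
If that is what happens, the damage should be reversible. The Python sketch below undoes one round of "UTF-8 bytes read as ISO-8859-1". The function name is hypothetical, and the fallback of returning the input unchanged is just one possible design choice.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes having been decoded as ISO-8859-1.

    Returns the input unchanged when it does not look double-encoded.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in ISO-8859-1, or the bytes are not valid UTF-8:
        # assume the text was fine to begin with.
        return text


print(repair_mojibake("Ã¥"))      # garbled a-ring from the feed
print(repair_mojibake("blåbær"))  # already correct text is left alone
```

Pure ASCII text passes through untouched as well, so running this over a whole feed item should be harmless. Fixing the database connection settings is still the cleaner solution.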


Vr.Gr,
Rob
 
Andreas Prilop

X-Newsreader: Microsoft Outlook Express 6.00.2900.2180

However, the Danish special characters appear wrong.
For example the letter ? becomes "??", the letter ? becomes "??"

As long as you are unable to post special, non-ASCII characters
with appropriate MIME header in your newsreader^W Outlook Express,
don't expect anything.

You need to make these settings:

Tools > Options > Send
Mail Sending Format > Plain Text Settings > Message format MIME
News Sending Format > Plain Text Settings > Message format MIME
Encode text using: None

Better yet, get a newsreader instead of OE.
 
