Maybe difficult for English Users - but I ask anyway

Reinhard Glauber

I fetch an HTML page, throw out all the <> tags so that I get plain text, search for some words and
save them to a MySQL database.

That all seems to work except the correct display of "ä" "ö" "ü"

Wörl is saved as WÃ¶rl

I tried UTF-8 decoding/encoding.

UTF-8 decoding shows the word correctly in the Linux shell, but not in the database :(
I am trying the script on a Kanotix (Debian) computer. Perl is up to date, MySQL is 5.0.18.

Linux-Shell: INSERT INTO Tabelle (NAME, STRASSE.....) VALUES ('Wörl GmbH','Karlsberg .........)

The charset of the database is configured as utf-8, collation as utf8_general_ci

I also tried the module HTML::FormatText, which converts HTML to ASCII, but got the same problem ...

I really need this script to work, so maybe someone has a good idea.

Thanks, thanks, thanks
 
Bodo Eing

Reinhard said:
I fetch an HTML page, throw out all the <> tags so that I get plain text, search for some words and
save them to a MySQL database.

That all seems to work except the correct display of "ä" "ö" "ü"

Wörl is saved as WÃ¶rl
^^ looks like utf8, but displayed by software set
to display iso-8859-1. *Where* does it look wrong: in your browser, in
your text editor ...?
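
A minimal sketch of that diagnosis, assuming the word arrives as UTF-8 bytes and some later step reads those bytes as iso-8859-1. The variable names are made up for illustration; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "W\x{F6}rl" is the character string "Wörl" (U+00F6 is o with diaeresis).
my $bytes = encode('UTF-8', "W\x{F6}rl");

# Misread the UTF-8 bytes as ISO-8859-1: every byte becomes one character.
my $misread = decode('ISO-8859-1', $bytes);

# Hex dump of the byte sequence, for comparison.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;

binmode(STDOUT, ':encoding(UTF-8)');
print "bytes:   $hex\n";      # 57 C3 B6 72 6C
print "misread: $misread\n";  # WÃ¶rl
```

The two bytes C3 B6 are one character in UTF-8 but two characters (A-tilde, pilcrow) in iso-8859-1, which is the pattern described above.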
[ snip]

Bodo
 
Reinhard Glauber

Bodo Eing said:
Reinhard Glauber wrote:



^^ looks like utf8, but displayed by software set
to display iso-8859-1. *Where* does it look wrong: in your browser, in
your text editor ...?

----------------------------------------------------------

OK, the problem is not the database, as I found out.
If I run an INSERT statement straight from the Linux shell, it works.

If I download the HTML page into a file and look at it with Dreamweaver, I also see WÃ¶rl.

That means......... ??

I solved the problem with $html =~ s/Ã¶/ö/gs;

Is it possible that this occurs because the website has no
META HTTP-EQUIV charset line in any of its HTML pages? Thanks
 
Alan J. Flavell

On Fri, 20 Jan 2006, Reinhard Glauber wrote:

(in a usenet posting advertised as Content-Type: text/plain;
charset="iso-8859-1")
----------------------------------------------------------

OK, the problem is not the database, as I found out.
If I run an INSERT statement straight from the Linux shell, it works.

If I download the HTML page into a file and look at it with
Dreamweaver, I also see WÃ¶rl.

That means......... ??

It means you don't appear to have a Perl problem as such.
It also means your "problem space partitioning" skills could use
some tuning...

I solved the problem with $html =~ s/Ã¶/ö/gs;

I wouldn't call that a "solution". It's hardly even a decent
workaround. "Kludge", as we say.

Is it possible that this occurs because the website has no
META HTTP-EQUIV charset line in any of its HTML pages? Thanks

You seem to have a question about the WWW.

There's a useful briefing here:
http://www.w3.org/International/O-charset.html

This is off-topic for Perl, but the best solution (if this is an HTTP
context) is to serve documents out with the correct character encoding
specified on their real HTTP Content-type header.

If and only if there is no such specification on the real HTTP header,
then, provided we're talking about text/html content type, it's
permissible to supply it on a meta http-equiv. It's not so simple for
XHTML, however.
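
For the fetching side, that order of precedence could be sketched roughly as follows. This is only an illustration under assumptions: `detect_charset` and `decode_page` are hypothetical helper names, the regexes are deliberately simple, and the fallback charset is whatever the caller chooses to pass in. Only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take the charset from the real HTTP Content-Type
# header if it names one; only otherwise look for a meta http-equiv in
# the body; only otherwise fall back to the caller's default.
sub detect_charset {
    my ($content_type, $raw_html, $default) = @_;
    if (defined $content_type
        && $content_type =~ /charset\s*=\s*"?([\w.:-]+)/i) {
        return $1;
    }
    if ($raw_html =~ /<meta[^>]+http-equiv\s*=\s*["']?content-type["']?[^>]*charset\s*=\s*["']?([\w.:-]+)/i) {
        return $1;
    }
    return $default;
}

# Hypothetical helper: turn the fetched bytes into a Perl character string.
sub decode_page {
    my ($content_type, $raw_html, $default) = @_;
    my $charset = detect_charset($content_type, $raw_html, $default);
    return decode($charset, $raw_html);
}

# Usage: UTF-8 bytes, charset announced on the HTTP header.
my $raw  = "<title>W\xC3\xB6rl GmbH</title>";
my $page = decode_page('text/html; charset=utf-8', $raw, 'iso-8859-1');
```

If the page is fetched with LWP, the `decoded_content` method of HTTP::Response may already do this job, so a hand-written helper like the one above is probably only worth it when the bytes come from somewhere else.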

Unfortunately there isn't a usenet group dedicated to HTTP protocol,
so the topic gets scattered around in discussions on
comp.infosystems.www.authoring.html, comp.infosystems.www.servers.*
and various other groups in the hierarchy.


In general terms, when you're discussing character encoding problems
on usenet, there are too many things that *can* go wrong when merely
copy/pasting from examples. It's generally useful also to *describe
in words* what you were seeing, and where, e.g. "my browser displayed
A-tilde followed by a pilcrow sign" (ideally, staying close to the
reference names of the characters in the Unicode database). As it
happens, in this case I think you've managed to get the message over
without that, but I've seen too many cases where it went wrong (take a
look at some of the i18n discussions in the Mozilla bugzilla to see
just how horribly wrong this can get, when different people are
posting differently-encoded Chinese, Russian, whatever, into the same
bug report page).
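
One way to get those reference names without typing them by hand is Perl's core charnames module. A small sketch, assuming the string has already been decoded into characters; `describe_chars` is a hypothetical helper name.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: describe each character as "U+XXXX NAME".
sub describe_chars {
    my ($string) = @_;
    return map {
        my $code = ord($_);
        my $name = charnames::viacode($code);
        sprintf 'U+%04X %s', $code, defined $name ? $name : '(unnamed)';
    } split //, $string;
}

# The two characters seen in place of the o-umlaut:
print "$_\n" for describe_chars("\x{C3}\x{B6}");
# U+00C3 LATIN CAPITAL LETTER A WITH TILDE
# U+00B6 PILCROW SIGN
```

Pasting that kind of output into a posting survives any re-encoding on the way, because it is plain ASCII.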

And I'd have to recommend putting the subject of your posting in the
"Subject:" of your posting, as recommended in the posting guidelines
for this group.

good luck.
 
Bodo Eing

Reinhard said:
----------------------------------------------------------

OK, the problem is not the database, as I found out.
If I run an INSERT statement straight from the Linux shell, it works.

If I download the HTML page into a file and look at it with Dreamweaver, I also see WÃ¶rl.

That means......... ??

That your Dreamweaver is set to assume iso-8859-1 as the encoding of the HTML source.

I solved the problem with $html =~ s/Ã¶/ö/gs;

Since you seem to be trying to convert utf-8 to iso-8859-1, don't do it
with self-written regexes because

- you most probably don't need to at all
- if you are sure you know why you want to do it, use the Encode module(s)
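
A minimal sketch of the Encode route, assuming the fetched bytes really are UTF-8 (the sample string and variable names are invented for illustration): decode once when the data comes in, work on the character string, and encode once when it goes out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: $raw holds the bytes exactly as fetched, and they are UTF-8.
my $raw = "W\xC3\xB6rl GmbH";

# Input boundary: bytes -> Perl character string.
my $text = decode('UTF-8', $raw);

# ... search and match on $text here; "ö" is now a single character ...

# Output boundary: encode into whatever the destination expects.
my $utf8_out   = encode('UTF-8', $text);       # e.g. for a UTF-8 database
my $latin1_out = encode('ISO-8859-1', $text);  # only if Latin-1 is really wanted
```

Unlike a hand-written substitution for one character, this covers ä, ü, ß and everything else in one step.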

Is it possible that this occurs because the website has no
META HTTP-EQUIV charset line in any of its HTML pages? Thanks

No need for more typing here, please read Alan's reply.

Bodo
 
