How to send utf-8 data using LWP::UserAgent?

Gert Brinkmann

Hello,

I am using LWP::UserAgent to send UTF-8 encoded XML data to a web server.

my $req = HTTP::Request->new(
    POST => "http://myhost:8181",
    HTTP::Headers->new(
        'content-type' => "text/xml; charset=utf-8",
    ),
    $xml_data,
);

my $ua = LWP::UserAgent->new;
my $resp = $ua->simple_request($req);

The problem is that LWP seems to convert the UTF-8 data to ISO Latin-1. I
have checked this by listening on port 8181 via "netcat -l -p 8181".
German umlauts appear there correctly readable as äöüß, but IMHO they
should not.

I have also checked that the terminal is not converting the data, by writing
a file with gedit that contains the string "gört" and netcat'ing it to
port 8181. The result is "gört", as expected.

What am I doing wrong?

Thanks,
Gert
 
Peter J. Holzer

I am using LWP::UserAgent to send UTF-8 encoded XML data to a web server.

my $req = HTTP::Request->new(
    POST => "http://myhost:8181",
    HTTP::Headers->new(
        'content-type' => "text/xml; charset=utf-8",
    ),
    $xml_data,
);

my $ua = LWP::UserAgent->new;
my $resp = $ua->simple_request($req);

The problem is that LWP seems to convert the UTF-8 data to ISO Latin-1. I
have checked this by listening on port 8181 via "netcat -l -p 8181".
German umlauts appear there correctly readable as äöüß, but IMHO they
should not. [...]
What am I doing wrong?

You are not providing a complete script to demonstrate your problem.
Where does $xml_data come from? How do you know that it contains UTF-8?

Dump $xml_data in hex to see what it really contains:

printf STDERR "%x ", ord($_) for split //, $xml_data;

If "gört" is printed as
67 f6 72 74
it's not UTF-8. It should be
67 c3 b6 72 74
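
(A self-contained way to see both byte sequences side by side, using the
%vx format to dump ordinals in hex; the variable names here are just for
illustration:)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

my $chars = "g\x{f6}rt";                      # "gört" as a character string
my $bytes = Encode::encode("UTF-8", $chars);  # "gört" as UTF-8 bytes

printf "%vx\n", $chars;   # 67.f6.72.74     -- Latin-1 ordinals, not UTF-8
printf "%vx\n", $bytes;   # 67.c3.b6.72.74  -- the UTF-8 byte sequence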

hp
 
Gert Brinkmann

Thank you, Peter, for your answer.
You are not providing a complete script to demonstrate your problem.

Yes, sorry. I was so sure that the input to LWP was correct... but it
is not.
Where does $xml_data come from? How do you know that it contains UTF-8?

I did a check by dumping the data into a file:

-----------------
binmode $fh;
print $fh "isutf8=",(Encode::is_utf8($text,0)?1:0), "; correct="
(Encode::is_utf8($text,1)?1:0),"; debugprint=$text\n";
-----------------

The result was:
-----------------
isutf8=1; correct=1; ...gört...
-----------------

I had only looked at the UTF-8 flag and the UTF-8-is-correct flag. But now,
after rechecking with your hexdump printout, I see that the fact that
"gört" is printed out readably is itself the mistake.

Why does the is_utf8($text,1) routine tell me that the UTF-8 string is
correct UTF-8 even if there is an ISO Latin-1 "ö" in the string?

Hmm, now I have to find out why the "ö" is not correctly set as UTF-8. This
charset/encoding topic is unbelievably complicated.

Thank you again,
Gert
 
Gert Brinkmann

Gert said:
Why does the is_utf8($text,1) routine tell me that the UTF-8 string is
correct UTF-8 even if there is an ISO Latin-1 "ö" in the string?

OK. The string is completely correct. It is tagged as utf8 and it contains
UTF-8. But the question is: why is the UTF-8 converted to ISO Latin-1 again
when writing it into the binmode'd file?

Here is a test-script:
-----------------------------------------------
#!/usr/bin/perl

use strict;
use warnings;
use Encode;

my $x = 'gört';
$x = Encode::encode("utf-8", $x);
Encode::_utf8_on($x);

open (my $fh, ">foo.log") or die "could not open foo.log";
binmode $fh;
print $fh "isutf8=", (Encode::is_utf8($x,0)?1:0),
"; correct=", (Encode::is_utf8($x,1)?1:0),";\n";
print $fh $x;
print $fh "\n";
close $fh;
-----------------------------------------------

Executing it gives the following:
$ perl utf8test.pl ; cat foo.log
isutf8=1; correct=1;
gört

I have also tried binmode with ":raw" or ":bytes", but it does not make
any difference.

Gert
 
Alan J. Flavell

OK. The string is completely correct. It is tagged as utf8 and it
contains UTF-8.

Without being able to tell you the precise answer, I suspect this is a
consequence of Perl's attempt to be compatible with earlier versions.
If your string contains nothing more than iso-8859-1 characters, then
in some circumstances it will be treated as such, even though a
utf8-ified version of the string is available to those who ask for it
nicely. If there had been just one character in the string that was
outside of the iso-8859-1 repertoire, I suspect you would have seen
different behaviour.

I *think* a careful perusal of perldoc perlunicode for the relevant
Perl version should help.

But there are some hunches in what I say above, and ICBW. Hope it's
vaguely useful.
 
Ben Morrow

Quoth Gert Brinkmann:
OK. The string is completely correct. It is tagged as utf8 and it contains
UTF-8. But the question is: why is the UTF-8 converted to ISO Latin-1 again
when writing it into the binmode'd file?

Here is a test-script:
-----------------------------------------------
#!/usr/bin/perl

use strict;
use warnings;
use Encode;

my $x = 'gört';
$x = Encode::encode("utf-8", $x);

This is wrong. (I'm surprised you didn't get an error.) encode converts
from characters to bytes; you want to convert from bytes (in whatever
encoding your source file is in, probably ISO-8859-1) into characters, so you want

$x = Encode::decode iso8859_1 => $x;

An alternative to this would be to use the encoding pragma to tell Perl
what charset your source file uses.
Encode::_utf8_on($x);

NO! You should never need to call the _utf8_o{n,ff} functions.
open (my $fh, ">foo.log") or die "could not open foo.log";

open my $fh, '>:encoding(utf8)', 'foo.log' or die...;

Tell Perl what you want, or it doesn't know what to give you.
:encoding(utf8) is (IMHO) preferable to :utf8 as you get better error
handling.
binmode $fh;

This says '$fh is for binary data'. That means that each character
printed to $fh will be written out as a single byte if possible; IOW
the string will be printed in ISO-8859-1. Characters above \xff will give
a 'wide character in print' warning, and (I think, but this situation is
Wrong anyway) utf8 output.
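
(A minimal sketch of that warning, assuming a character above \xff; the
euro sign here is just an example:)

#!/usr/bin/perl
use strict;
use warnings;

my $s = "g\x{f6}rt \x{20ac}";   # contains U+20AC, which is above \xff
binmode STDOUT;                 # byte-oriented handle, no encoding layer
print $s, "\n";                 # warns "Wide character in print", emits UTF-8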

print $fh "isutf8=", (Encode::is_utf8($x,0)?1:0),
"; correct=", (Encode::is_utf8($x,1)?1:0),";\n";

Again, you don't need to care about the state of the internal utf8 flag.
Just tell Perl you want $x to be characters, not bytes.

Ben
 
Gert Brinkmann

Thank you, Ben,

With this information I will have to reread the utf8 and Encode perldocs to
really internalize this topic.

Ben said:
NO! You should never need to call the _utf8_o{n,ff} functions.

But what do you do if you receive a CGI parameter that was sent from a
web browser in UTF-8? On the server side, AFAIK, you do not get information
from HTTP about which charset was used. If you know that the script is
running in your completely UTF-8-enabled web application, it should be
UTF-8. But is the $parameter CGI variable correctly tagged as UTF-8 by the
CGI module? In my understanding it receives UTF-8 text strings and stores
them in a non-UTF-8 variable that has to be UTF-8-tagged by yourself. Isn't
that so?

Thanks,
Gert
 
Ben Morrow

Quoth Gert Brinkmann:
Thank you, Ben,

with this information I have to reread the utf8- and Encode-perldocs to
really "internalize"(?) this topic.

The most important point (and I'm not sure the Perl docs currently make
this entirely clear) is that you always have to know whether a given
string is a sequence of *characters* or a sequence of *bytes*. This is
not the same as whether the perl-internal utf8 flag is on, due to perl's
back-compat stuff.

Basically, all input is in bytes, and all text data should be decoded to
characters before processing. Binary data obviously shouldn't. So on
input (from any source that doesn't do the decoding for you) you need to
determine (somehow) what charset the data is expected to be in, and
decode it. Then on output (again to any source that outputs bytes
directly) you need to decide (somehow) what charset you want and encode
the data before output.
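
(Applied to the LWP example that started the thread, a minimal sketch
might look like this, assuming $xml_data holds a decoded character string;
the host and header are taken from the original post, and the XML content
is just a stand-in:)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use HTTP::Headers;
use HTTP::Request;
use LWP::UserAgent;

my $xml_data = "<note>g\x{f6}rt</note>";   # character string (assumed example)

# Encode characters to UTF-8 bytes before handing the body to LWP.
my $req = HTTP::Request->new(
    POST => "http://myhost:8181",
    HTTP::Headers->new('content-type' => "text/xml; charset=utf-8"),
    Encode::encode("UTF-8", $xml_data),
);

my $ua   = LWP::UserAgent->new;
my $resp = $ua->simple_request($req);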

One way of making this easier is to push the :encoding layer onto a
filehandle (see PerlIO::encoding): this does the de/encoding for you
automatically so the filehandle now appears to be a stream of characters
rather than a stream of bytes.
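
(Applied to the earlier foo.log test script, a minimal corrected sketch:
no flag fiddling, the :encoding layer does the encoding on the way out:)

#!/usr/bin/perl
use strict;
use warnings;

my $x = "g\x{f6}rt";   # a character string; \x{f6} is LATIN SMALL LETTER O WITH DIAERESIS

open my $fh, '>:encoding(UTF-8)', 'foo.log' or die "could not open foo.log: $!";
print $fh $x, "\n";    # written out as the bytes 67 c3 b6 72 74 0a
close $fh;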

[Note to pacify Alan :): my use of the term 'charset' above (and yours
below) corresponds to the MIME parameter of the same name, rather than
to a 'character set' proper]
But what do you do if you receive a CGI parameter that was sent from a
web browser in UTF-8? On the server side, AFAIK, you do not get information
from HTTP about which charset was used. If you know that the script is
running in your completely UTF-8-enabled web application, it should be
UTF-8. But is the $parameter CGI variable correctly tagged as UTF-8 by the
CGI module? In my understanding it receives UTF-8 text strings and stores
them in a non-UTF-8 variable that has to be UTF-8-tagged by yourself. Isn't
that so?

I don't really understand the situation you're describing (but then my
knowledge of CGI programming is somewhat limited). Are you saying the
data is known to be in UTF8, or that you don't know what charset it's
in?

A string that contains a sequence of bytes that happen to be valid UTF8
is not at all the same thing as a string that contains the sequence of
characters represented by those bytes. In fact, converting from one to
the other is what the Encode::decode function is for.
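
(A small sketch of the difference, using the byte sequence from earlier in
the thread; length() counts bytes in one case and characters in the other:)

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

my $bytes = "g\xc3\xb6rt";                    # 5 bytes that happen to be valid UTF-8
my $chars = Encode::decode("UTF-8", $bytes);  # the 4 characters they represent

print length($bytes), "\n";   # 5
print length($chars), "\n";   # 4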

The internal utf8 flag *does not* mean 'this string is in UTF8' in any
sense that matters to a user of Perl. What it means is 'this string
contains characters rather than bytes, *AND* some of those characters
are above 0xff'. Or sometimes '... *AND* some of those characters used
to be above 0xff but aren't any more, but I haven't noticed that yet'.
Do you begin to see now why this is a property of the string you really
don't care about?
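
(A minimal demonstration that the flag is an internal detail, using the
core utf8::upgrade and utf8::is_utf8 functions; the strings compare equal
even though their internal representations differ:)

#!/usr/bin/perl
use strict;
use warnings;

my $latin = "g\x{f6}rt";   # all characters <= \xff, flag typically off
my $wide  = $latin;
utf8::upgrade($wide);      # same characters, internal utf8 flag now on

print utf8::is_utf8($latin) ? "flag on\n" : "flag off\n";   # flag off
print utf8::is_utf8($wide)  ? "flag on\n" : "flag off\n";   # flag on
print $latin eq $wide ? "equal\n" : "different\n";          # equal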

Ben
 
Alan J. Flavell

But what do you do if you receive a CGI parameter that was sent from a
web browser in UTF-8?

An interesting question - but not, I think, a question to which the
answer could ever be _utf8_on($x).
On the server side, AFAIK, you do not get information from HTTP about
which charset was used.

The simplest case (and the recommended one, except that the old NN4.* does
not handle it, if anybody still cares) is to send out the page which
contains the form as UTF-8; the browser will then respond by submitting
the form in UTF-8 encoding.

More complex things can happen if Accept-charset is used. I don't
think I would want to go there, as there seems to be no advantage in
it.

Some browsers, in some situations, unilaterally add to the submitted
data an extra name=value pair, with the name "_charset_" and the value
being the submission encoding that they are using. You can't rely on
getting this, though.
But is the $parameter CGI variable correctly tagged as UTF-8 by the
CGI module?

"tagging as utf-8" is something which Perl does behind the scenes when
you apply appropriate encode/decode operations on data. Except in
some very obscure situations, it's not something that it makes any
sense to set directly, as Ben has already shown.
In my understanding it receives UTF-8 text strings and stores them in
a non-UTF-8 variable that has to be UTF-8-tagged by yourself. Isn't
that so?

I thought Ben had already addressed that point. Ah, and via Google Groups
I see that he has already responded, although it hasn't yet reached my
news server. So I'll leave it there for now.
 
