How to send utf-8 data using LWP::UserAgent?

Discussion in 'Perl Misc' started by Gert Brinkmann, Jul 25, 2006.

  1. Hello,

    I am using LWP::UserAgent to send utf-8 encoded xml-data to a web-server.

    my $req = HTTP::Request->new (
    POST => "http://myhost:8181",
    HTTP::Headers->new (
    'content-type' => "text/xml; charset=utf-8",
    ),
    $xml_data,
    );

    my $ua = LWP::UserAgent->new;
    my $resp = $ua->simple_request($req);

    The problem is that LWP seems to convert the utf-8 data to iso-latin. I
    have checked this by listening on port 8181 via: "netcat -l -p 8181".
    German umlauts show up there readable as äöüß, but IMHO they should
    not.

    I also have checked that the terminal is not converting the data by writing
    a file using gedit that contains the string "gört" and netcat'ing it to the
    port 8181. The result is: "gört" as expected.

    What am I doing wrong?

    Thanks,
    Gert
     
    Gert Brinkmann, Jul 25, 2006
    #1

  2. On Tue, 25 Jul 2006 20:19:03 +0200, Gert Brinkmann wrote:
    > I am using LWP::UserAgent to send utf-8 encoded xml-data to a web-server.
    >
    > my $req = HTTP::Request->new (
    > POST => "http://myhost:8181",
    > HTTP::Headers->new (
    > 'content-type' => "text/xml; charset=utf-8",
    > ),
    > $xml_data,
    > );
    >
    > my $ua = LWP::UserAgent->new;
    > my $resp = $ua->simple_request($req);
    >
    > The problem is that LWP seems to convert the utf-8 data to iso-latin. I
    > have checked this by listening on port 8181 via: "netcat -l -p 8181".
    > German umlauts show up there readable as äöüß, but IMHO they should
    > not.

    [...]
    > What am I doing wrong?


    You are not providing a complete script to demonstrate your problem.
    Where does $xml_data come from? How do you know that it contains UTF-8?

    Dump $xml_data in hex to see what it really contains:

    printf STDERR "%x ", ord($_) for (split//, $xml_data);

    If "gört" is printed as
    67 f6 72 74
    it's not UTF-8. It should be
    67 c3 b6 72 74
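
    The check described above can be sketched as a small standalone script. The helper name hex_dump is made up for this illustration, and the string is written with an escape ("\x{f6}" for "ö") so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render each element of a string as two hex digits.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $string;
}

# The character string "gört", written with an escape so the example
# is independent of the source file's encoding.
my $chars      = "g\x{f6}rt";
my $utf8_bytes = encode('UTF-8', $chars);   # characters -> UTF-8 bytes

print hex_dump($chars), "\n";        # 67 f6 72 74
print hex_dump($utf8_bytes), "\n";   # 67 c3 b6 72 74
```

    Only the second line is what a UTF-8 aware receiver should see on the wire.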

    hp

    --
    _ | Peter J. Holzer | > Wieso sollte man etwas erfinden was nicht
    |_|_) | Sysadmin WSR | > ist?
    | | | | Was sonst wäre der Sinn des Erfindens?
    __/ | http://www.hjp.at/ | -- P. Einstein u. V. Gringmuth in desd
     
    Peter J. Holzer, Jul 25, 2006
    #2

  3. Thank you, Peter, for your answer.

    Peter J. Holzer wrote:
    > You are not providing a complete script to demonstrate your problem.


    Yes, Sorry. I have been so sure that the input to LWP was correct... but it
    is not.

    > Where does $xml_data come from? How do you know that it contains UTF-8?


    I did a check via dumping data into a file:

    -----------------
    binmode $fh;
    print $fh "isutf8=",(Encode::is_utf8($text,0)?1:0), "; correct=",
    (Encode::is_utf8($text,1)?1:0),"; debugprint=$text\n";
    -----------------

    the result was:
    -----------------
    isutf8=1; correct=1; ...gört...
    -----------------

    I only looked at the utf-8 flag and the utf-8-is-correct flag. But now,
    after rechecking with your hexdump printout, I see that it is a mistake
    that "gört" is printed out readable.

    Why does the is_utf8($text,1) routine tell me, that the utf-8 String is
    correct utf-8 even if there is an iso-latin "ö" in the string?

    Hmm, now I have to find out why the "ö" is not correctly set as utf-8. This
    charset/encoding topic is so unbelievably complicated.

    Thank you again,
    Gert
     
    Gert Brinkmann, Jul 26, 2006
    #3
  4. Gert Brinkmann wrote:

    > Why does the is_utf8($text,1) routine tell me, that the utf-8 String is
    > correct utf-8 even if there is an iso-latin "ö" in the string?


    Ok. The string is completely correct. It is tagged as utf8 and it contains
    utf8. But the question is: Why is utf8 converted to iso-latin again when
    writing it into the "binmode'd" file?

    Here is a test-script:
    -----------------------------------------------
    #!/usr/bin/perl

    use strict;
    use warnings;
    use Encode;

    my $x = 'gört';
    $x = Encode::encode("utf-8", $x);
    Encode::_utf8_on($x);

    open (my $fh, ">foo.log") or die "could not open foo.log";
    binmode $fh;
    print $fh "isutf8=", (Encode::is_utf8($x,0)?1:0),
    "; correct=", (Encode::is_utf8($x,1)?1:0),";\n";
    print $fh $x;
    print $fh "\n";
    close $fh;
    -----------------------------------------------

    Executing it gives the following:
    $ perl utf8test.pl ; cat foo.log
    isutf8=1; correct=1;
    gört

    I have also tried
    binmode $fh, ":raw"
    and ":bytes", but it does not make any difference.

    Gert
     
    Gert Brinkmann, Jul 26, 2006
    #4
  5. On Wed, 26 Jul 2006, Gert Brinkmann wrote:

    > Gert Brinkmann wrote:
    >
    > > Why does the is_utf8($text,1) routine tell me, that the utf-8
    > > String is correct utf-8 even if there is an iso-latin "ö" in the
    > > string?

    >
    > Ok. The string is completely correct. It is tagged as utf8 and it
    > contains utf8.


    Without being able to tell you the precise answer, I suspect this is a
    consequence of Perl's attempt to be compatible with earlier versions.
    If your string contains nothing more than iso-8859-1 characters, then
    in some circumstances it will be treated as such, even though a
    utf8-ified version of the string is available to those who ask for it
    nicely. If there had been just one character in the string that was
    outside of the iso-8859-1 repertoire, I suspect you would have seen
    different behaviour.

    I *think* a careful perusal of perldoc perlunicode for the relevant
    Perl version should help.

    But there are some hunches in what I say above, and ICBW. Hope it's
    vaguely useful.
     
    Alan J. Flavell, Jul 26, 2006
    #5
  6. Quoth Gert Brinkmann:
    > Gert Brinkmann wrote:
    >
    > > Why does the is_utf8($text,1) routine tell me, that the utf-8 String is
    > > correct utf-8 even if there is an iso-latin "ö" in the string?

    >
    > Ok. The string is completely correct. It is tagged as utf8 and it contains
    > utf8. But the question is: Why is utf8 converted to iso-latin again when
    > writing it into the "binmode'd" file?
    >
    > Here is a test-script:
    > -----------------------------------------------
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    > use Encode;
    >
    > my $x = 'gört';
    > $x = Encode::encode("utf-8", $x);


    This is wrong. (I'm surprised you didn't get an error.) encode converts
    from characters to bytes; you want to convert from bytes (in whatever
    your source file is in, probably iso8859-1) into characters, so you want

    $x = Encode::decode iso8859_1 => $x;

    An alternative to this would be to use the encoding pragma to tell Perl
    what charset your source file uses.

    > Encode::_utf8_on($x);


    NO! You should never need to call the _utf8_o{n,ff} functions.

    > open (my $fh, ">foo.log") or die "could not open foo.log";


    open my $fh, '>:encoding(utf8)', 'foo.log' or die...;

    Tell Perl what you want, or it doesn't know what to give you.
    :encoding(utf8) is (IMHO) preferable to :utf8 as you get better error
    handling.

    > binmode $fh;


    This says '$fh is for binary data'. That means that each character
    printed to $fh will be written out as a single byte if possible, IOW
    the string will be printed in ISO8859-1. Characters above \xff will give
    a 'wide character in print' warning, and (I think, but this situation is
    Wrong anyway) utf8 output.

    > print $fh "isutf8=", (Encode::is_utf8($x,0)?1:0),
    > "; correct=", (Encode::is_utf8($x,1)?1:0),";\n";


    Again, you don't need to care about the state of the internal utf8 flag.
    Just tell Perl you want $x to be characters, not bytes.
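
    Putting the corrections above together, the test script might look roughly like the sketch below. It assumes the original source file was ISO-8859-1 (so the literal arrives as the bytes 67 f6 72 74) and uses :encoding(UTF-8), the strict spelling, in place of :encoding(utf8). Reading the file back with :raw is only there to show what was really written.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the script file is ISO-8859-1, so 'gört' arrives as these
# four bytes. Written with an escape to be independent of the editor.
my $bytes = "g\xf6rt";
my $x     = decode('iso-8859-1', $bytes);   # bytes -> characters

my $file = 'foo.log';
open my $out, '>:encoding(UTF-8)', $file or die "could not open $file: $!";
print {$out} $x, "\n";                      # the layer encodes on output
close $out or die "could not close $file: $!";

# Read the file back as raw bytes to inspect what ended up on disk.
open my $in, '<:raw', $file or die "could not open $file: $!";
my $written = do { local $/; <$in> };
close $in;

printf "%02x ", ord($_) for split //, $written;
print "\n";   # on a Unix-like system: 67 c3 b6 72 74 0a
unlink $file;
```

    Note that there is no call to _utf8_on and no check of is_utf8 anywhere: the script only states which encoding comes in and which goes out.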

    Ben

    --
    I must not fear. Fear is the mind-killer. I will face my fear and
    I will let it pass through me. When the fear is gone there will be
    nothing. Only I will remain.
    Frank Herbert, 'Dune'
     
    Ben Morrow, Jul 26, 2006
    #6
  7. Thank you, Ben,

    with this information I have to reread the utf8- and Encode-perldocs to
    really "internalize"(?) this topic.

    Ben Morrow wrote:
    >> Encode::_utf8_on($x);

    >
    > NO! You should never need to call the _utf8_o{n,ff} functions.


    But what do you do if you receive a CGI parameter that was sent from a
    web browser in utf-8? On the server side AFAIK you do not get the information
    from http which charset was used. If you know that the script is working in
    your completely utf-8 enabled web application it should be utf-8. But is
    the $parameter CGI variable correctly tagged as utf-8 by the CGI module? In
    my understanding it receives utf-8 text strings and stores them in a
    non-utf-8 variable that you have to tag as utf-8 yourself. Isn't that so?

    Thanks,
    Gert
     
    Gert Brinkmann, Jul 27, 2006
    #7
  8. Quoth Gert Brinkmann:
    >
    > Thank you, Ben,
    >
    > with this information I have to reread the utf8- and Encode-perldocs to
    > really "internalize"(?) this topic.


    The most important point (and I'm not sure the Perl docs currently make
    this entirely clear) is that you always have to know whether a given
    string is a sequence of *characters* or a sequence of *bytes*. This is
    not the same as whether the perl-internal utf8 flag is on, due to perl's
    back-compat stuff.

    Basically, all input is in bytes, and all text data should be decoded to
    characters before processing. Binary data obviously shouldn't. So on
    input (from any source that doesn't do the decoding for you) you need to
    determine (somehow) what charset the data is expected to be in, and
    decode it. Then on output (again to any destination that takes bytes
    directly) you need to decide (somehow) what charset you want and encode
    the data before output.
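
    As a rough sketch of that decode, process, encode cycle, assuming the input is known to be ISO-8859-1 and the receiver expects UTF-8 (both encodings are picked purely for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: input arrives as ISO-8859-1 bytes, output must be UTF-8.
my $input_bytes = "g\xf6rt";                            # 4 bytes from "outside"
my $text        = decode('iso-8859-1', $input_bytes);   # now 4 characters

my $processed = ucfirst $text;                          # work on characters

my $output_bytes = encode('UTF-8', $processed);         # bytes, ready to send
print length($text), " characters, ", length($output_bytes), " bytes\n";
# prints "4 characters, 5 bytes"
```

    A byte string like $output_bytes is the kind of value that would be handed to HTTP::Request->new as the content in the original question, alongside the charset=utf-8 header.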

    One way of making this easier is to push the :encoding layer onto a
    filehandle (see PerlIO::encoding): this does the de/encoding for you
    automatically so the filehandle now appears to be a stream of characters
    rather than a stream of bytes.
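
    A minimal sketch of the layer approach follows. In-memory filehandles stand in for real files so the example is self-contained; the variable names and the choice of ISO-8859-1 for output are assumptions made for the illustration.

```perl
use strict;
use warnings;

# UTF-8 bytes for "gört\n", as they would sit in a file on disk.
my $source = "g\xc3\xb6rt\n";

open my $in, '<:encoding(UTF-8)', \$source or die "cannot open: $!";
my $line = <$in>;                 # decoded to characters on the way in
close $in;
chomp $line;
print length($line), "\n";        # 4 (characters, not bytes)

my $sink = '';
open my $out, '>:encoding(iso-8859-1)', \$sink or die "cannot open: $!";
print {$out} $line;               # encoded to bytes on the way out
close $out;
print length($sink), "\n";        # 4 (bytes 67 f6 72 74)
```

    The code between the two open calls never touches Encode directly; it only ever sees characters.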

    [Note to pacify Alan :): my use of the term 'charset' above (and yours
    below) corresponds to the MIME parameter of the same name, rather than
    to a 'character set' proper]

    > Ben Morrow wrote:
    > >> Encode::_utf8_on($x);

    > >
    > > NO! You should never need to call the _utf8_o{n,ff} functions.

    >
    > But what do you do if you receive a CGI parameter that was sent from a
    > web browser in utf-8? On the server side AFAIK you do not get the information
    > from http which charset was used. If you know that the script is working in
    > your completely utf-8 enabled web application it should be utf-8. But is
    > the $parameter CGI variable correctly tagged as utf-8 by the CGI module? In
    > my understanding it receives utf-8 text strings and stores them in a
    > non-utf-8 variable that you have to tag as utf-8 yourself. Isn't that so?


    I don't really understand the situation you're describing (but then my
    knowledge of CGI programming is somewhat limited). Are you saying the
    data is known to be in UTF8, or that you don't know what charset it's
    in?

    A string that contains a sequence of bytes that happen to be valid UTF8
    is not at all the same thing as a string that contains the sequence of
    characters represented by those bytes. In fact, converting from one to
    the other is what the Encode::decode function is for.

    The internal utf8 flag *does not* mean 'this string is in UTF8' in any
    sense that matters to a user of Perl. What it means is 'this string
    contains characters rather than bytes, *AND* some of those characters
    are above 0xff'. Or sometimes '... *AND* some of those characters used
    to be above 0xff but aren't any more, but I haven't noticed that yet'.
    Do you begin to see now why this is a property of the string you really
    don't care about?
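
    Both points can be seen in a short sketch: a byte string and its decoded character string are different values, while two strings that differ only in the internal flag are the same value. utf8::upgrade and utf8::is_utf8 are used here purely to poke at the internals for demonstration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "g\xc3\xb6rt";             # five bytes that happen to be valid UTF-8
my $chars = decode('UTF-8', $bytes);   # the four characters they represent

print length($bytes), "\n";            # 5
print length($chars), "\n";            # 4
my $same = $bytes eq $chars ? "same" : "different";
print "$same\n";                       # different

# The internal flag is a storage detail: changing the representation
# does not change the value of the string.
my $latin = "g\xf6rt";
my $copy  = $latin;
utf8::upgrade($copy);                  # same characters, other internal storage
my $verdict = $latin eq $copy ? "equal" : "not equal";
print "$verdict\n";                    # equal
printf "%d %d\n", utf8::is_utf8($latin) ? 1 : 0, utf8::is_utf8($copy) ? 1 : 0;
# prints "0 1"
```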

    Ben

    --
    Musica Dei donum optimi, trahit homines, trahit deos. |
    Musica truces mollit animos, tristesque mentes erigit.|
    Musica vel ipsas arbores et horridas movet feras. |
     
    Ben Morrow, Jul 27, 2006
    #8
  9. On Thu, 27 Jul 2006, Gert Brinkmann wrote:

    > Ben Morrow wrote:
    > >> Encode::_utf8_on($x);

    > >
    > > NO! You should never need to call the _utf8_o{n,ff} functions.

    >
    > But what do you do if you receive a CGI parameter that was sent
    > from a web browser in utf-8?


    An interesting question - but not, I think, a question to which the
    answer could ever be _utf8_on($x)

    > On server side AFAIK you do not get the information
    > from http which charset was used.


    The simplest case (and recommended, except that the old NN4.* does not
    work, if anybody still cares), is to send out the page which contains
    the form, as utf-8, and the browser will respond by submitting the
    form in utf-8 encoding.

    More complex things can happen if Accept-charset is used. I don't
    think I would want to go there, as there seems to be no advantage in
    it.

    Some browsers, in some situations, unilaterally add to the submitted
    data an extra name=value pair, with the name "_charset_" and the value
    being the submission encoding that they are using. You can't rely on
    getting this, though.

    > But is the $parameter CGI variable correctly tagged as utf-8 by the
    > CGI module?


    "tagging as utf-8" is something which Perl does behind the scenes when
    you apply appropriate encode/decode operations on data. Except in
    some very obscure situations, it's not something that it makes any
    sense to set directly, as Ben has already shown.
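
    For the CGI case this might look like the sketch below. The variable $raw_param is hypothetical and stands for a form value after URL-unescaping; the sketch assumes the application serves its forms as utf-8, so submissions are expected in utf-8 as well. Unlike flipping the flag by hand, decode also checks that the bytes really are valid.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical raw value: the field "gört" as submitted in UTF-8.
my $raw_param = "g\xc3\xb6rt";

# FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps decode
# from modifying $raw_param in place.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $param = eval { decode('UTF-8', $raw_param, $check) };
die "parameter is not valid UTF-8\n" unless defined $param;

print length($param), "\n";   # 4
```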

    > In my understanding it receives utf-8 text strings and stores them in
    > a non-utf-8 variable that you have to tag as utf-8 yourself. Isn't
    > that so?


    I thought Ben already addressed that point. Ah, and via googroups I
    see that he has already responded, although it hasn't yet reached my
    news swerver. So I'll leave it there for now.
     
    Alan J. Flavell, Jul 27, 2006
    #9
