"Wide character in syswrite" in writing an HTML form.

Ben Bullock

I have written a Perl script which accesses a WWW form, gets the text, does
some editing and then sends it back. I'm encountering a problem. I'm using
the latest version of Perl, 5.8.8, with the libwww and HTML::Form modules. I
keep getting the above error message "Wide character in syswrite" when my
code tries to update the page. It is in UTF8. Also, I have had some
characters mangled. I have tried extensive searches of Google about how to
solve this problem, with no luck so far. One thing I have tried is "use open
':utf8';", but it doesn't work. Can anyone here suggest some way to solve
this problem? It is very exasperating since otherwise the script is working
perfectly. Thanks.
 
Alan J. Flavell

> I have written a Perl script which accesses a WWW form, gets the
> text, does some editing and then sends it back. I'm encountering a
> problem. I'm using the latest version of Perl, 5.8.8, with the
> libwww and HTML::Form modules. I keep getting the above error
> message "Wide character in syswrite" when my code tries to update
> the page. It is in UTF8.

I don't know the answer, but it's an area that's of interest to me...

When you know the answer, maybe you'd be in a position to update this
bug: http://rt.cpan.org/Ticket/Display.html?id=17249 - SCNR. I found
that with a simple google - odd that you didn't mention it yourself.

I hadn't seen it before, so I can't really say what it implies yet.
Just how close would you say that report is to your own problem?

> Also, I have had some characters mangled.

Sorry, but this is not a useful report!

To get a worthwhile result, you need to boil the problem down into a
simple test case that we can reproduce for ourselves. Web forms
submission is particularly fraught with pitfalls and hurdles. Just
saying the equivalent of "it doesn't work" gets us no further.

> I have tried extensive searches of Google about how to solve this
> problem, with no luck so far.

Be explicit! Otherwise, people trying to help you are just going to
repeat the things you already found.

good luck
 
benkasminbullock

Alan said:
> I don't know the answer, but it's an area that's of interest to me...
>
> When you know the answer, maybe you'd be in a position to update this
> bug: http://rt.cpan.org/Ticket/Display.html?id=17249 - SCNR. I found
> that with a simple google - odd that you didn't mention it yourself.

I found that one, it looks similar to my situation. I don't know why
you think that it's odd that I didn't mention it though - I found
buckets of hits on Google for similar-looking things, and I don't know
which one is relevant.

> I hadn't seen it before, so I can't really say what it implies yet.
> Just how close would you say that report is to your own problem?
>
> Sorry, but this is not a useful report!

Hmm? Some non-ascii UTF8 characters got mangled into non-UTF8
compliant characters going out from my program back to the WWW form.

> To get a worthwhile result, you need to boil the problem down into a
> simple test case that we can reproduce for ourselves. Web forms
> submission is particularly fraught with pitfalls and hurdles. Just
> saying the equivalent of "it doesn't work" gets us no further.

I don't have a complete test program I can show you here,
unfortunately. I've tracked the bug to the following lines in my code:

use LWP::UserAgent;
use HTML::Form;

sub replace_text_in_form
{
    my $ua = $_[0]; # user agent
    my $form = $_[1]; # already-parsed form from HTML::Form
    my $newtext = $_[2]; # update the textbox with this new text
    $form->value ("textbox", $newtext);
    my $request = $form->click;
    my $response = $ua->request($request);
    return $response;
}

Sometimes I get a value in "$response->status_line" of "500 Wide
character in syswrite" error and it fails, and sometimes it works, but
either time some of the non-ascii characters get mangled. I checked
and the characters are mangled after they go out: they are OK going in
to the above.

> Be explicit! Otherwise, people trying to help you are just going to
> repeat the things you already found.

Yeah, well, people often say things like that on Usenet, but then you
give them more details to work on, and after all that you often find
they don't know the answer anyway :). Have a nice day.
 
Peter J. Holzer

Alan said:
> I have written a Perl script which accesses a WWW form, gets the
> text, does some editing and then sends it back. I'm encountering a
> problem. I'm using the latest version of Perl, 5.8.8, with the
> libwww and HTML::Form modules. I keep getting the above error
> message "Wide character in syswrite" when my code tries to update
> the page. It is in UTF8.
> [...]
> To get a worthwhile result, you need to boil the problem down into a
> simple test case that we can reproduce for ourselves. Web forms
> submission is particularly fraught with pitfalls and hurdles. Just
> saying the equivalent of "it doesn't work" gets us no further.
>
> I don't have a complete test program I can show you here,
> unfortunately. I've tracked the bug to the following lines in my code:
>
> use LWP::UserAgent;
> use HTML::Form;
>
> sub replace_text_in_form
> {
>     my $ua = $_[0]; # user agent
>     my $form = $_[1]; # already-parsed form from HTML::Form
>     my $newtext = $_[2]; # update the textbox with this new text
>     $form->value ("textbox", $newtext);

$form->value("textbox", encode($charset, $newtext));

where $charset must be the charset of the page containing the form (If
you know that's UTF-8 you can hardcode it in your script, but it is
probably safer to get it from the original page).

> Yeah, well, people often say things like that on Usenet, but then you
> give them more details to work on, and after all that you often find
> they don't know the answer anyway :). Have a nice day.

They can't know if they know the answer if they don't even know the
question!

hp
 
John Bokma

Alan J. Flavell wrote:

[ .. ]
> Yeah, well, people often say things like that on Usenet, but then you
> give them more details to work on, and after all that you often find
                            ^^^^

That is a very good choice of wording: *you* give them *work* and several
people try to do that work, for *free*.

> they don't know the answer anyway :). Have a nice day.

Even if people pay me to do work, if they don't give me enough details I
can't answer a very important question: can I help in the first place.
 
Ben Bullock

John Bokma said:
> Alan J. Flavell wrote:
>
> [ .. ]
>> Yeah, well, people often say things like that on Usenet, but then you
>> give them more details to work on, and after all that you often find
>                            ^^^^
>
> That is a very good choice of wording: *you* give them *work* and several
> people try to do that work, for *free*.

*I* only saw *one* post when I replied there. *Thanks* to the several other
people who tried to do the work for *free*.

Thanks also to *John* *Bokma* for all the asterisks.
 
Ben Bullock

Peter J. Holzer said:
>> I don't have a complete test program I can show you here,
>> unfortunately. I've tracked the bug to the following lines in my code:
>>
>> use LWP::UserAgent;
>> use HTML::Form;
>>
>> sub replace_text_in_form
>> {
>>     my $ua = $_[0]; # user agent
>>     my $form = $_[1]; # already-parsed form from HTML::Form
>>     my $newtext = $_[2]; # update the textbox with this new text
>>     $form->value ("textbox", $newtext);
>
> $form->value("textbox", encode($charset, $newtext));
>
> where $charset must be the charset of the page containing the form (If
> you know that's UTF-8 you can hardcode it in your script, but it is
> probably safer to get it from the original page).

No, I know that it's utf-8. Thanks very much for this tip. Surprisingly (to
me at least) it worked, so today's Perl superhero is Peter J. Holzer. Call
me ignorant (preferably with some added asterisks) but I don't really
understand why it wasn't working before, or what the above is doing. Anyway,
thanks. You're a lifesaver. In case some people don't know about "encode",
it was also necessary to write

use Encode;

at the top of the script. After reading "perldoc Encode" I found that
there is another function called "encode_utf8" which I used in the end to
save having to write the $charset variable in the above.
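
For anyone finding this thread later, here is a minimal sketch of the sub from earlier in the thread with Peter's fix folded in. The optional $charset argument and its UTF-8 default are assumptions added for this sketch and are not part of the original code. $ua and $form are expected to be the LWP::UserAgent and HTML::Form objects used above, so the sketch only loads Encode itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# $ua      - an LWP::UserAgent object (as in the thread)
# $form    - an already-parsed HTML::Form object (as in the thread)
# $newtext - the new text for the textbox, as a Perl character string
# $charset - charset of the page holding the form; this parameter and
#            its 'UTF-8' default are assumptions made for this sketch
sub replace_text_in_form {
    my ($ua, $form, $newtext, $charset) = @_;
    $charset = 'UTF-8' unless defined $charset;

    # Turn the character string into octets before it gets anywhere
    # near the socket, so syswrite never sees a wide character.
    $form->value('textbox', encode($charset, $newtext));

    my $request = $form->click;        # build the form submission
    return $ua->request($request);     # send it and hand back the response
}
```

When the page is known to be UTF-8, encode_utf8($newtext) does much the same job as encode('UTF-8', $newtext). Newer releases of HTML::Form may deal with the form's charset themselves, in which case encoding by hand as well could encode the text twice, so it is worth checking the documentation of the installed version.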
 
John Bokma

Ben Bullock said:
> *I* only saw *one* post when I replied there.

That's Usenet. It's not an instant help desk.

> *Thanks* to the several
> other people who tried to do the work for *free*.

:-D.

> Thanks also to *John* *Bokma* for all the asterisks.

You're welcome :-D I often test my shift key, because I need it a lot with
Perl programming :-D
 
Alan J. Flavell

> $form->value("textbox", encode($charset, $newtext));
> [...]
>
> Thanks very much for this tip. Surprisingly (to me
> at least) it worked, so today's Perl superhero is Peter J. Holzer.

Indeed.

> I don't really understand why it wasn't working before, or what the
> above is doing. Anyway, thanks. You're a lifesaver. In case some
> people don't know about "encode", it was also necessary to write
>
> use Encode;
>
> at the top of the script. After reading "perldoc Encode"
> [...]

Well, let's read the documentation of Encode to see what light it
throws on our understanding. (I think I learned something from this,
anyway).

    $octets = encode(ENCODING, $string [, CHECK])

    Encodes a string from Perl's internal form into ENCODING and returns
    a sequence of octets.

What that says is that you feed it a "string" (i.e. of characters
represented in Perl's internal format, which might include "wide"
unicode characters, and in this case actually did so), and it returns
a sequence of octets as they might be expected in the outside world.
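
A small sketch of that distinction, using only the core Encode module. The sample string is an assumption chosen for illustration: a half sign (U+00BD) followed by a euro sign (U+20AC).

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: U+00BD (half sign) followed by U+20AC (euro sign).
my $string = "\x{BD}\x{20AC}";

# encode() returns the octets that represent those characters in UTF-8.
my $octets = encode('UTF-8', $string);

# Render each octet as two hex digits so the result can be inspected.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;

printf "characters: %d\n", length $string;   # 2
printf "octets:     %d\n", length $octets;   # 5
printf "hex:        %s\n", $hex;             # C2 BD E2 82 AC
```

Every element of $octets is in the range 0..255, so it can be handed to a byte-oriented write without any "wide character" complaint.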

The complaint you were getting was that a "wide character" had been
fed to syswrite (which was open to a socket). If I take a look at the
documentation for syswrite, then towards the end it says:

    Note that if the filehandle has been marked as :utf8 , Unicode
    characters are written instead of bytes (the LENGTH, OFFSET, and the
    return value of syswrite() are in UTF-8 encoded Unicode characters).
    The :encoding(...) layer implicitly introduces the :utf8 layer.

It seems to me that the observed symptoms are saying that the
filehandle had not, in fact, been "marked as :utf8", yet it was
finding itself being fed with (Perl's internal representation of)
unicode data. By feeding it instead with a sequence of binary "octets"
- the output from encode() - we are smuggling our utf8-encoded data
into the syswrite() without Perl being explicitly aware of it ("as
binary data", if you will). HTTP is defined to be 8-bit clean, so I
guess this is OK. I interpret this as the approach mentioned under
binmode as "raw".

It's an open question whether this was the module author's intention.

At least, that's how I'm rationalising it - feel free to shoot this
down. It all seems to make coherent sense when the data is
represented in Perl's internal (utf8-based) form. Presumably some
different approach is needed when the web form in question is wanted
to be in one of the traditional 8-bit encodings (iso-8859-2,
windows-1251, whatever).

Does anyone have contact with the module author (Gisle Aas) - this
seems like something that could/should be explained in the module
documentation?

best
 
Ben Bullock

Alan J. Flavell said:
> It seems to me that the observed symptoms are saying that the
> filehandle had not, in fact, been "marked as :utf8", yet it was
> finding itself being fed with (Perl's internal representation of)
> unicode data. By feeding it instead with a sequence of binary "octets"
> - the output from encode() - we are smuggling our utf8-encoded data
> into the syswrite() without Perl being explicitly aware of it ("as
> binary data", if you will).

This helped me to understand what's going on, so thank you very much. As to
whether it's a bug or a feature, I'll leave such weighty matters for others
to decide. It was certainly "unexpected behaviour" from my point of view.
The funny thing is that I've been using that script since last July to send
utf8 encoded Japanese characters, and hadn't had a problem with it. The
mangled stuff was things like pound signs and unicode half signs.
 
Alan J. Flavell

> The funny thing is that I've been using that script since last July to
> send utf8 encoded Japanese characters, and hadn't had a problem with
> it.

Interesting.

> The mangled stuff was things like pound signs and unicode half
> signs.

You didn't say that before, and there might be something significant
in the detail. Witness this earlier discussion:

|> > Also, I have had some characters mangled.
|
|> Sorry, but this is not a useful report!
|
|Hmm? Some non-ascii UTF8 characters got mangled [...]

and compare it with the new information which you now provided.

On the basis of that new information, I'd say there's a possibility
that the code does not realise that it needs to use the utf8
representation, unless the characters are above 255.
 
Peter J. Holzer

Alan said:
>> The funny thing is that I've been using that script since last July to
>> send utf8 encoded Japanese characters, and hadn't had a problem with
>> it.
>
> Interesting.
>
>> The mangled stuff was things like pound signs and unicode half
>> signs.
>
> You didn't say that before, and there might be something significant
> in the detail. [...]
> On the basis of that new information, I'd say there's a possibility
> that the code does not realise that it needs to use the utf8
> representation, unless the characters are above 255.

Right. By default perl streams are in "byte-mode": Every character of a
perl string is written as one byte. This is for backwards-compatibility
with older versions of perl. So, if you write the string "½£" to a
stream, it will be printed as two bytes: 0xBD 0xA3. But if you try to
write "½€", perl would have to write 0xBD 0x20AC. Since there is no way
that perl can stuff the value 0x20AC into 8 bits, it converts the whole
string into UTF-8 and prints that instead: That's now 5 bytes: 0xC2 0xBD
0xE2 0x82 0xAC. Since that may not be what you wanted, it also gives you
the "Wide character in syswrite" warning.
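
The byte counts above can be checked with a short sketch. It uses only core modules; the temporary file stands in for the socket, and the variable and helper names are made up for illustration. One caveat, as far as I can tell: the convert-and-warn behaviour described above is what print does, whereas syswrite on a handle without a :utf8 layer refuses a string containing characters above 255 and dies with the "Wide character in syswrite" message. That would fit the "500 Wide character in syswrite" status line reported earlier in the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

my $latin = "\x{BD}\x{A3}";     # half sign, pound sign: both below 256
my $wide  = "\x{BD}\x{20AC}";   # half sign, euro sign: the euro is above 255

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# A plain byte-mode handle, standing in for the socket LWP writes to.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;

# Characters below 256 go out as one byte each, i.e. as Latin-1,
# which is not valid UTF-8. This is the "mangling".
syswrite($fh, $latin);

# A character above 255 cannot be written as a single byte.
my $ok    = eval { syswrite($fh, $wide); 1 };
my $error = $ok ? '' : $@;
close $fh;

# Read back what actually reached the file.
open my $in, '<:raw', $path or die "cannot reopen $path: $!";
my $written = do { local $/; <$in> };
close $in;

print "written: ", hex_bytes($written), "\n";                  # BD A3
print "error:   $error";                                       # mentions "Wide character"
print "encoded: ", hex_bytes(encode('UTF-8', $wide)), "\n";    # C2 BD E2 82 AC
```

Encoding first, as in Peter's fix, sidesteps both problems: the pound sign goes out as 0xC2 0xA3 instead of a bare 0xA3, and nothing above 255 ever reaches syswrite.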

hp
 
