UTF-8 read & print?


Tuxedo

I'm trying to read a file that may contain UTF-8 characters and print it to
a web browser. My first attempt is:

#!/usr/bin/perl -w

use warnings;
use strict;
use CGI qw(:standard);

print "Content-type: text/plain; charset=UTF-8\n\n";

open my $fh, "<:encoding(UTF-8)", 'UTF-8-demo.txt';
binmode STDOUT, ':utf-8';
while (my $line = <$fh>) {
    print $line;
}

The example file is this one:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt

Of course, different browsers and systems give different results depending
on which characters in the UTF-8 range they support (I guess). Most
characters in the above UTF-8-demo.txt display when reading the file as
above. However, some characters towards the end of the page, namely the ones
following the lowercase basic Latin alphabet (the British pound sign,
the copyright symbol and the remaining 9 characters on that same line), do
not display in an up-to-date web browser with the above read and print
procedure, although they display as they should when accessing the
UTF-8-demo.txt file directly in the same browser via the above URL. If,
however, I omit the "encoding(UTF-8)" part after my $fh, I find that those
particular characters print correctly.

While I guess UTF-8 compatibility is generally a broad topic, what are the
better or worse ways to read and print UTF-8 for maximum success in typical
web browsers?

Sorry if the question is a bit basic and has been asked many times before,
but any comments and examples are always much appreciated.

Many thanks,
Tuxedo
 
Helmut Richter

The example file is this one:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt

Of course, different browsers and systems give different results depending
on which characters in the UTF-8 range they support (I guess). Most
characters in the above UTF-8-demo.txt display when reading the file as
above. However, some characters towards the end of the page, namely the ones
following the lowercase basic Latin alphabet (the British pound sign,
the copyright symbol and the remaining 9 characters on that same line), do
not display in an up-to-date web browser with the above read and print
procedure, although they display as they should when accessing the
UTF-8-demo.txt file directly in the same browser via the above URL. If,
however, I omit the "encoding(UTF-8)" part after my $fh, I find that those
particular characters print correctly.

So you read the demo file and print it out again. If you print it to a
file, why not do a diff of the two files and see what has changed, if
anything? If the printing goes to HTTP output, why not give us the URL so
that we all can see whether your server serves exactly the same text as
the URL you gave us. We can hardly guess what happens when we are denied
access to the difference of the two versions.
 
Rainer Weikusat

Ben Morrow said:
Quoth Tuxedo:
[...]

If you're just copying a file, it's better to do it in blocks than
line-by-line.

local $/ = \4096;
while (...) { ... }

As soon as an application starts to do any explicit buffer management,
using the supposedly transparent buffer management embedded in the
buffered I/O subsystem is not only pointless but actually a bad idea
(one would assume that it should be self-evident that reading data
into a buffer of size x, copying it into a buffer of size y, copying
it into another buffer of size x and finally 'writing' it out isn't a
particularly sensible thing to do ...)

NB: It is interesting to observe the effect of using a larger buffer
size. For the test I made, 8192 seemed to be the best choice. It
improves the 'blocks' version significantly but the 'fread' version only
marginally (in the first case, the speed increase was 34% of the
slower speed; in the second, it was only 6%).

---------
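use Benchmark;

# Discard all output so that only the cost of reading and copying is measured.
open(my $out, '>', '/dev/null') or die "open /dev/null: $!";

# Expects a seekable file on STDIN; each sub rewinds it with seek().
timethese(-5,
{
    # line-by-line through buffered I/O
    lines => sub {
        my $line;

        seek(STDIN, 0, 0);
        print $out ($line) while $line = <>;
    },

    # fixed-size records through buffered I/O
    fread => sub {
        my $block;
        local $/ = \4096;

        seek(STDIN, 0, 0);
        print $out ($block) while $block = <>;
    },

    # explicit sysread/syswrite, bypassing the buffered I/O layer
    blocks => sub {
        my $block;

        seek(STDIN, 0, 0);
        syswrite($out, $block) while sysread(STDIN, $block, 4096);
    }});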
 
Tuxedo

Helmut said:

So you read the demo file and print it out again. If you print it to a
file, why not do a diff of the two files and see what has changed, if
anything? If the printing goes to HTTP output, why not give us the URL so
that we all can see whether your server serves exactly the same text as
the URL you gave us. We can hardly guess what happens when we are denied
access to the difference of the two versions.

No denial intended. I have no online version, although you are right that
the headers sent by different servers may vary, for example. I'm just trying
to gain a better understanding of the various issues in submitting, writing,
reading and printing UTF-8, and I have some difficulty doing all of that in
my localhost environment. However, I now understand that at least the most
basic part is to set the charset. Beyond that, I'm not sure if encoding and
decoding user input is always necessary, at least not for simply echoing
some UTF-8 user input. For this, the code below seems to work OK:

use strict;
use warnings;
use CGI ':standard';

print header(-charset => 'UTF-8'),
      start_html,
      start_form,
      textfield('unicode'),
      submit,
      end_form;

print param('unicode');
print end_html;
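
A small sketch of where decoding starts to matter: echoing passes the bytes through unchanged, but anything that counts or cuts characters sees a different string once the input is decoded. The helper name byte_and_char_length is made up for illustration, and the sample string stands in for what param('unicode') would return. That CGI.pm hands back undecoded bytes by default is an assumption here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: compare a form value as raw bytes with the same
# value decoded to characters. $raw stands in for param('unicode');
# that it arrives as undecoded UTF-8 bytes is an assumption of this sketch.
sub byte_and_char_length {
    my ($raw) = @_;
    my $text = decode('UTF-8', $raw);    # bytes -> characters
    return (length $raw, length $text);
}

my $raw = "\xE2\x82\xAC5";    # the euro sign followed by "5", as UTF-8 bytes
my ($bytes, $chars) = byte_and_char_length($raw);
print "bytes: $bytes, characters: $chars\n";
```

So a plain echo can get away without decoding, while length checks, substr or regexp character classes on the input probably cannot.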
 
Tuxedo

Ben said:
You don't need -w if you use warnings.


binmode STDOUT, ':utf8';

You should have got a warning about this. If you had been using autodie,
you would have got an error (which is better, IMHO).


If you're just copying a file, it's better to do it in blocks than
line-by-line.

local $/ = \4096;
while (...) { ... }

Ben

Thanks for these comments. I must have misunderstood utf-8 vs. utf8,
thinking utf-8 caters to a broader spectrum of unicode charsets. I don't
know what I'm doing with the file yet, as I'm just learning by testing.

I will look into autodie as well as skip the -w flag from now on.
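
For reference, a minimal sketch of how the first script might look with Ben's points applied. The sub name copy_utf8 is made up for illustration. The likely cause of the original symptom is that ':utf-8' is not a valid layer name, so binmode fails and STDOUT gets no encoding layer. The decoded characters below U+0100 (the pound sign, the copyright symbol and their neighbours) then go out as single Latin-1 bytes, which are not valid UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use autodie;    # open, binmode and close die with a message on failure

# Hypothetical helper: copy a UTF-8 file to an output handle, decoding on
# input and encoding on output. ':encoding(UTF-8)' is a valid layer name;
# ':utf-8' is not.
sub copy_utf8 {
    my ($path, $out) = @_;
    open my $in, '<:encoding(UTF-8)', $path;
    binmode $out, ':encoding(UTF-8)';
    local $/ = \4096;    # read fixed-size blocks instead of lines
    while (defined(my $block = <$in>)) {
        print {$out} $block;
    }
    close $in;
    return;
}

# Usage with the demo file from the question, if it is present.
if (-e 'UTF-8-demo.txt') {
    print "Content-Type: text/plain; charset=UTF-8\n\n";
    copy_utf8('UTF-8-demo.txt', \*STDOUT);
}
```

Omitting both layers and copying raw bytes would also work for a pure copy; the layers matter once the text is processed in between.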

Tuxedo
 
Rainer Weikusat

Tuxedo said:
Helmut said:

So you read the demo file and print it out again. If you print it to a
file, why not do a diff of the two files and see what has changed, if
anything? If the printing goes to HTTP output, why not give us the URL so
that we all can see whether your server serves exactly the same text as
the URL you gave us. We can hardly guess what happens when we are denied
access to the difference of the two versions.

No denial intended. I have no online version, although you are right that
the headers sent by different servers may vary, for example. I'm just trying
to gain a better understanding of the various issues in submitting, writing,
reading and printing UTF-8, and I have some difficulty doing all of that in
my localhost environment. However, I now understand that at least the most
basic part is to set the charset. Beyond that, I'm not sure if encoding and
decoding user input is always necessary, at least not for simply echoing
some UTF-8 user input.

Practically, encoding or decoding UTF-8 explicitly is not necessary
because perl was designed to work with UTF-8 encoded Unicode strings
which are supposed to be decoded (and possibly re-encoded) when and
if this has to be done because of a processing step which needs
this. Theoretically, this is considered to be too difficult to
implement correctly and hence, users of the language are encouraged to
behave as if Perl wasn't capable of working with UTF-8 and always use
the three-pass algorithm:

1. Decode all of the input into some internal representation the
   processing code can work with.
2. Perform whatever processing is necessary.
3. Re-encode all of the processed data into whatever output format
   happens to be desired.
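
The three passes can be sketched with the core Encode module roughly as follows. The sub name shout_utf8 and the sample word are made up for illustration, and the "processing" is just uppercasing and counting characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical example of the decode / process / encode sequence.
sub shout_utf8 {
    my ($bytes) = @_;
    my $text      = decode('UTF-8', $bytes);    # 1. bytes -> characters
    my $processed = uc $text;                   # 2. work on characters
    my $count     = length $processed;          #    (length counts characters)
    return (encode('UTF-8', $processed), $count);    # 3. characters -> bytes
}

# "caf" plus e-acute, given as UTF-8 bytes
my ($out, $chars) = shout_utf8("caf\xC3\xA9");
```

Here $out holds five bytes while $chars is 4, which is the kind of difference the explicit decode step is meant to make visible.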

The plan9 paper on UTF-8 support contains the following, nice
statement:

To decide whether to compute using runes or UTF-encoded byte
strings requires balancing the cost of converting the data
when read and written against the cost of converting relevant
text on demand. For programs such as editors that run a long
time with a relatively constant dataset, runes are the better
choice.

http://plan9.bell-labs.com/sys/doc/utf.html

Since most Perl programs run a relatively short time with a highly
variable data set, the statement above suggests that the
implementation choice to do on-demand decoding was sensible. E.g., let's
assume someone is using some Perl code to do log file analysis. Log
files are often big and since this will usually involve doing regexp
matches on all input lines, decoding the input while trying to match
the regexp in a single processing loop will possibly be a lot cheaper
than first decoding everything and then looking for matches: when a
line of input is discarded as not being of interest, the hitherto
undecoded remainder doesn't need to be touched anymore.
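
A rough illustration of that trade-off at the script level (not of what perl does internally): match on the raw bytes and decode only the lines that are kept. The sub name matching_lines and the sample log text are made up. The sketch assumes an ASCII-only pattern, which is safe on raw UTF-8 because ASCII bytes never occur inside a multi-byte sequence.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the decoded lines that match an ASCII-only
# pattern. Lines that do not match are never decoded.
sub matching_lines {
    my ($fh, $pattern) = @_;
    my @hits;
    while (defined(my $raw = <$fh>)) {
        next unless $raw =~ $pattern;         # match on raw bytes
        push @hits, decode('UTF-8', $raw);    # decode only what is kept
    }
    return @hits;
}

# Sample log held in memory, as UTF-8 bytes.
my $log = "ok caf\xC3\xA9\nERROR na\xC3\xAFve input\nok\n";
open my $fh, '<', \$log or die "open: $!";
my @errors = matching_lines($fh, qr/^ERROR/);
close $fh;
```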
 
Tuxedo

Thanks for the intel including the plan9 link, adding to my must-read-about
list of subjects....

Tuxedo
 
