UTF-8 read & print?

Discussion in 'Perl Misc' started by Tuxedo, Nov 25, 2012.

  1. Tuxedo

    Tuxedo Guest

    To read a file that may contain UTF-8 characters and print it to a web
    browser, my first attempt is:

    #!/usr/bin/perl -w

    use warnings;
    use strict;
    use CGI qw(:standard);

    print "Content-type: text/plain; charset=UTF-8\n\n";

    open my $fh, "<:encoding(UTF-8)", 'UTF-8-demo.txt';
    binmode STDOUT, ':utf-8';
    while (my $line = <$fh>) {
    print $line;
    }

    The example file is this one:
    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt

    Different browsers and systems presumably give different results depending
    on which characters in the UTF-8 range they support. Most characters in
    the above UTF-8-demo.txt display when reading the file as above. However,
    some characters towards the end of the page do not: the ones following the
    lowercase basic Latin alphabet, i.e. the British pound sign, the copyright
    symbol and the remaining 9 characters on that same line. They do not
    display in an up-to-date web browser with the above read and print
    procedure, while they do display as they should when accessing the
    UTF-8-demo.txt file directly in the same browser via the above URL. If
    however I omit the "encoding(UTF-8)" part after my $fh I find that those
    particular characters print correctly.

    While I guess UTF-8 compatibility is generally a broad topic, what are the
    better or worse ways to read and print UTF-8 for maximum success in typical
    web browsers?

    Sorry if the question is a bit basic and has been asked many times before, but
    any comments and examples are always much appreciated.

    Many thanks,
    Tuxedo
     
    Tuxedo, Nov 25, 2012
    #1

  2. On Sun, 25 Nov 2012, Tuxedo wrote:

    > The example file is this one:
    > http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
    >
    > Of course, different browsers and systems have different results depending
    > on supported characters in the UTF-8 range (I guess) and while most
    > characters in the above UTF-8-demo.txt display when reading the file as
    > above, some characters towards the end of the page, being the ones
    > following the lowercase basic Latin alphabet, i.e. the British pound sign,
    > the copyright symbol and the remaining 9 characters on that same line, do
    > not display in an up-to-date web browser with the above read and print
    > procedure, while they do display as they should when accessing the
    > UTF-8-demo.txt file directly in the same browser via the above URL. If
    > however I omit the "encoding(UTF-8)" part after my $fh I find that those
    > particular characters print correctly.


    So you read the demo file and print it out again. If you print it to a
    file, why not do a diff of the two files and see what has changed, if
    anything? If the printing goes to HTTP output, why not give us the URL so
    that we all can see whether your server serves exactly the same text as
    the URL you gave us. We can hardly guess what happens when we are denied
    access to the difference of the two versions.
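
    A byte-level comparison of the two versions could be sketched roughly as
    follows. It uses the core File::Compare module; the helper name
    write_bytes and the two sample files are made up for illustration, with
    one file holding the pound sign as UTF-8 and the other holding it as a
    single Latin-1 byte.

```perl
use strict;
use warnings;
use File::Compare qw(compare);
use File::Temp qw(tempfile);

# Hypothetical helper: write raw bytes to a temporary file, return its name.
sub write_bytes {
    my ($bytes) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;                       # no layers, write the bytes as given
    print {$fh} $bytes;
    close $fh or die "close failed: $!";
    return $name;
}

my $original = write_bytes("\xC2\xA3\n");   # pound sign encoded as UTF-8
my $mangled  = write_bytes("\xA3\n");       # pound sign as one Latin-1 byte

# compare() returns 0 if the files are equal, 1 if they differ, -1 on error.
my $status = compare($original, $mangled);
print $status == 0 ? "identical\n"
    : $status == 1 ? "files differ\n"
    :                "could not compare\n";
```

    For a CGI script, the same check works after saving the served response
    body to a file and comparing it with the source file.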

    --
    Helmut Richter
     
    Helmut Richter, Nov 25, 2012
    #2

  3. Ben Morrow <> writes:
    > Quoth Tuxedo <>:


    [...]

    > If you're just copying a file, it's better to do it in blocks than
    > line-by-line.
    >
    > local $/ = \4096;
    > while (...) { ... }


    As soon as an application starts to do any explicit buffer management,
    using the supposedly transparent buffer management embedded in the
    buffered I/O subsystem is not only pointless but actually a bad idea
    (one would assume that it should be self-evident that reading data
    into a buffer of size x, copying it into a buffer of size y, copying
    it into another buffer of size x and finally 'writing' it out isn't a
    particularly sensible thing to do ...)

    NB: It is interesting to observe the effect of using a larger buffer
    size. For the test I made, 8192 seemed to be the best choice and this
    improves the 'blocks' version significantly but the fread version only
    marginally (in the first case, the speed increase was 34% of the
    slower speed, for the second, it was only 6%).

    ---------
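    use Benchmark;

    # Run with a seekable file on standard input; each sub rewinds it first.
    open($out, '>', '/dev/null');

    timethese(-5,
    {
        # line-by-line copy through buffered I/O
        lines => sub {
            my $line;

            seek(STDIN, 0, 0);
            print $out ($line) while $line = <>;
        },

        # fixed-size records (4096 bytes) through buffered I/O
        fread => sub {
            my $block;
            local $/ = \4096;

            seek(STDIN, 0, 0);
            print $out ($block) while $block = <>;
        },

        # unbuffered sysread/syswrite
        blocks => sub {
            my $block;

            seek(STDIN, 0, 0);
            syswrite($out, $block) while sysread(STDIN, $block, 4096);
        },
    });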
     
    Rainer Weikusat, Nov 26, 2012
    #3
  4. Tuxedo

    Tuxedo Guest

    Helmut Richter wrote:

    > On Sun, 25 Nov 2012, Tuxedo wrote:


    [...]

    > So you read the demo file and print it out again. If you print it to a
    > file, why not do a diff of the two files and see what has changed, if
    > anything? If the printing goes to HTTP output, why not give us the URL so
    > that we all can see whether your server serves exactly the same text as
    > the URL you gave us. We can hardly guess what happens when we are denied
    > access to the difference of the two versions.


    No denial intended. I have no online version, although you are right, a
    header sent by different servers may vary for example. I'm just trying to
    gain a better understanding of the various issues in submitting, writing,
    reading and printing UTF-8 and have some difficulty doing all of that in
    my localhost environment. However, I now understand that at least the most
    basic part is to set the charset. Thereafter, I'm not sure if encoding and
    decoding user input is always necessary, at least not for simply echoing
    some UTF-8 user input for example. For this, the below seems to work OK:

    use strict;
    use warnings;
    use CGI ':standard';

    print header(-charset => 'UTF-8'),
          start_html,
          start_form,
          textfield('unicode'),
          submit,
          end_form;

    print param('unicode');
    print end_html;
     
    Tuxedo, Nov 26, 2012
    #4
  5. Tuxedo

    Tuxedo Guest

    Ben Morrow wrote:

    >
    > Quoth Tuxedo <>:
    > > To read a file that may contain UTF-8 characters and print it to a web
    > > browser, my first attempt is:
    > >
    > > #!/usr/bin/perl -w

    >
    > You don't need -w if you use warnings.
    >
    > >
    > > use warnings;
    > > use strict;
    > > use CGI qw(:standard);
    > >
    > > print "Content-type: text/plain; charset=UTF-8\n\n";
    > >
    > > open my $fh, "<:encoding(UTF-8)", 'UTF-8-demo.txt';
    > > binmode STDOUT, ':utf-8';

    >
    > binmode STDOUT, ':utf8';
    >
    > You should have got a warning about this. If you had been using autodie,
    > you would have got an error (which is better, IMHO).
    >
    > > while (my $line = <$fh>) {
    > > print $line;
    > > }

    >
    > If you're just copying a file, it's better to do it in blocks than
    > line-by-line.
    >
    > local $/ = \4096;
    > while (...) { ... }
    >
    > Ben
    >


    Thanks for these comments. I must have misunderstood utf-8 vs. utf8,
    thinking utf-8 caters to a broader spectrum of unicode charsets. I don't
    know what I'm doing with the file yet, as I'm just learning by testing.

    I will look into autodie as well as skip the -w flag from now on.
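
    Putting the quoted corrections together, a minimal sketch could look like
    the following. With the invalid ':utf-8' layer the binmode call fails, so
    STDOUT keeps no encoding layer and decoded characters below U+0100 (the
    pound sign, the copyright symbol and their neighbours) go out as single
    Latin-1 bytes, which a browser told to expect UTF-8 cannot display. The
    helper name copy_utf8 and the temporary-file self-check are assumptions
    for illustration, and the sketch uses ':encoding(UTF-8)' on both sides
    rather than ':utf8'.

```perl
use strict;
use warnings;
use autodie;                       # open/binmode/close die on failure
use File::Temp qw(tempfile);

# Hypothetical helper: copy a UTF-8 text file to an output handle that the
# caller has already given an encoding layer.
sub copy_utf8 {
    my ($path, $out) = @_;
    open my $in, '<:encoding(UTF-8)', $path;   # decode bytes to characters
    while (my $line = <$in>) {
        print {$out} $line;
    }
    close $in;
    return;
}

# In the CGI script from the thread this would be used roughly as:
#   binmode STDOUT, ':encoding(UTF-8)';
#   print "Content-type: text/plain; charset=UTF-8\n\n";
#   copy_utf8('UTF-8-demo.txt', \*STDOUT);

# Self-check: write Latin-1-range characters as UTF-8 bytes, copy them
# through the helper into an in-memory handle, and compare the bytes.
my $bytes = "\xC2\xA3\xC2\xA9\xC2\xB5\n";      # pound, copyright, micro sign
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} $bytes;
close $tmp_fh;

my $copied = '';
open my $out, '>:encoding(UTF-8)', \$copied;   # encode characters to bytes
copy_utf8($tmp_name, $out);
close $out;

print $copied eq $bytes ? "round trip ok\n" : "bytes differ\n";
```

    The encoding layer is set once by the caller rather than inside the
    helper, because calling binmode repeatedly on one handle stacks layers.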

    Tuxedo
     
    Tuxedo, Nov 26, 2012
    #5
  6. Tuxedo <> writes:
    > Helmut Richter wrote:
    >
    >> On Sun, 25 Nov 2012, Tuxedo wrote:

    >
    > [...]
    >
    >> So you read the demo file and print it out again. If you print it to a
    >> file, why not do a diff of the two files and see what has changed, if
    >> anything? If the printing goes to HTTP output, why not give us the URL so
    >> that we all can see whether your server serves exactly the same text as
    >> the URL you gave us. We can hardly guess what happens when we are denied
    >> access to the difference of the two versions.

    >
    > No denial intended. I have no online version, although you are right, a
    > header sent by different servers may vary for example. I'm just trying to gain
    > a better understanding of the various issues in submitting, writing,
    > reading and printing utf-8 and have some difficulty doing all of that in
    > my localhost environment. However, I now understand that at least the most
    > basic part is to set the charset. Thereafter, I'm not sure if encoding and
    > decoding user input is always necessary, at least not for simply echoing
    > some UTF-8 user input for example.


    Practically, encoding or decoding UTF-8 explicitly is not necessary
    because perl was designed to work with UTF-8 encoded Unicode strings
    which are supposed to be decoded (and possibly, re-encoded) when and
    if this has to be done because of a processing step which needs
    this. Theoretically, this is considered to be too difficult to
    implement correctly and hence, users of the language are encouraged to
    behave as if Perl wasn't capable of working with UTF-8 and always use
    the three pass algorithm 1. Decode all of the input into some internal
    representation the processing code can work with. 2. Perform whatever
    processing is necessary. 3. Re-encode all of the processed data into
    whatever output format happens to be desired.
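
    For reference, the three passes named above can be sketched with the core
    Encode module. The sample string and variable names are made up for
    illustration; this shows only the explicit decode/process/encode style,
    not a recommendation either way.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# 1. Decode: UTF-8 bytes in, character string out.
my $input_bytes = "na\xC3\xAFve caf\xC3\xA9";   # "naive cafe" with accents, as UTF-8
my $text        = decode('UTF-8', $input_bytes);

# 2. Process: operations now work on characters rather than bytes.
my $byte_count = length $input_bytes;   # counts bytes
my $char_count = length $text;          # counts characters
my $upper      = uc $text;              # uppercases the accented letters too

# 3. Encode: character string in, UTF-8 bytes out.
my $output_bytes = encode('UTF-8', $upper);

printf "%d bytes, %d characters\n", $byte_count, $char_count;
```

    The byte and character counts differ because each accented letter takes
    two bytes in UTF-8.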

    The plan9 paper on UTF-8 support contains the following, nice
    statement:

    To decide whether to compute using runes or UTF-encoded byte
    strings requires balancing the cost of converting the data
    when read and written against the cost of converting relevant
    text on demand. For programs such as editors that run a long
    time with a relatively constant dataset, runes are the better
    choice.

    http://plan9.bell-labs.com/sys/doc/utf.html

    Since most Perl programs run a relatively short time with a highly
    variable data set, the statement above suggests that the
    implementation choice to do on-demand decoding was sensible. Eg, let's
    assume someone is using some Perl code to do log file analysis. Log
    files are often big and since this will usually involve doing regexp
    matches on all input lines, decoding the input while trying to match
    the regexp in a single processing loop will possibly be a lot cheaper
    than first decoding everything and then looking for matches: When a
    line of input is discarded as not being of interest, the hitherto
    undecoded remainder doesn't need to be touched anymore.
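
    A hand-written version of that idea, as opposed to whatever perl does
    internally, might look like the sketch below: lines are tested as raw
    bytes and only the interesting ones are decoded. This relies on the
    marker being plain ASCII, which matches byte-for-byte inside UTF-8. The
    function name matching_lines and the sample log lines are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the decoded lines that contain an ASCII marker. Lines without the
# marker are discarded without ever being decoded.
sub matching_lines {
    my ($marker, @raw_lines) = @_;
    my @hits;
    for my $raw (@raw_lines) {
        next unless index($raw, $marker) >= 0;   # byte-level test only
        push @hits, decode('UTF-8', $raw);       # decode just the matches
    }
    return @hits;
}

# Sample log lines as raw UTF-8 bytes (made up for this example).
my @log = (
    "INFO start\n",
    "ERROR caf\xC3\xA9 not found\n",
    "INFO done\n",
);

my @errors = matching_lines('ERROR', @log);
printf "%d of %d lines decoded\n", scalar @errors, scalar @log;
```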
     
    Rainer Weikusat, Nov 26, 2012
    #6
  7. Tuxedo

    Tuxedo Guest

    Rainer Weikusat wrote:

    > Tuxedo <> writes:
    > > Helmut Richter wrote:
    > >
    > >> On Sun, 25 Nov 2012, Tuxedo wrote:

    > >
    > > [...]
    > >
    > >> So you read the demo file and print it out again. If you print it to a
    > >> file, why not do a diff of the two files and see what has changed, if
    > >> anything? If the printing goes to HTTP output, why not give us the URL
    > >> so that we all can see whether your server serves exactly the same text
    > >> as the URL you gave us. We can hardly guess what happens when we are
    > >> denied access to the difference of the two versions.

    > >
    > > No denial intended. I have no online version, although you are right, a
    > > header sent by different servers may vary for example. I'm just trying
    > > to gain a better understanding of the various issues in submitting,
    > > writing, reading and printing utf-8 and have some difficulty doing all
    > > of that in my localhost environment. However, I now understand that at
    > > least the most basic part is to set the charset. Thereafter, I'm not
    > > sure if encoding and decoding user input is always necessary, at least
    > > not for simply echoing some UTF-8 user input for example.

    >
    > Practically, encoding or decoding UTF-8 explicitly is not necessary
    > because perl was designed to work with UTF-8 encoded Unicode strings
    > which are supposed to be decoded (and possibly, re-encoded) when and
    > if this has to be done because of a processing step which needs
    > this. Theoretically, this is considered to be too difficult to
    > implement correctly and hence, users of the language are encouraged to
    > behave as if Perl wasn't capable of working with UTF-8 and always use
    > the three pass algorithm 1. Decode all of the input into some internal
    > representation the processing code can work with. 2. Perform whatever
    > processing is necessary. 3. Re-encode all of the processed data into
    > whatever output format happens to be desired.
    >
    > The plan9 paper on UTF-8 support contains the following, nice
    > statement:
    >
    > To decide whether to compute using runes or UTF-encoded byte
    > strings requires balancing the cost of converting the data
    > when read and written against the cost of converting relevant
    > text on demand. For programs such as editors that run a long
    > time with a relatively constant dataset, runes are the better
    > choice.
    >
    > http://plan9.bell-labs.com/sys/doc/utf.html
    >
    > Since most Perl programs run a relatively short time with a highly
    > variable data set, the statement above suggests that the
    > implementation choice to do on-demand decoding was sensible. Eg, let's
    > assume someone is using some Perl code to do log file analysis. Log
    > files are often big and since this will usually involve doing regexp
    > matches on all input lines, decoding the input while trying to match
    > the regexp in a single processing loop will possibly be a lot cheaper
    > than first decoding everything and then looking for matches: When a
    > line of input is discarded as not being of interest, the hitherto
    > undecoded remainder doesn't need to be touched anymore.


    Thanks for the intel including the plan9 link, adding to my must-read-about
    list of subjects....

    Tuxedo
     
    Tuxedo, Nov 26, 2012
    #7