Assigning another filehandle to STDOUT, using binmode.

Discussion in 'Perl Misc' started by Adam Funk, Jun 19, 2007.

  1. Adam Funk

    Adam Funk Guest

    I'm writing a program that uses File::Find to recurse through the
    files and directories specified as command-line arguments, and to call
    process_file() on each one.

    By default the program prints each file's results to STDOUT, but if I
    give it the -d DIRECTORY option, it should print each file's output to
    a file in DIRECTORY with ".txt" at the end of the name instead of
    ".xml". There are a lot of non-English UTF-8 characters in the input
    and output.

    At the moment, I have the following near the beginning of the program:

    binmode (STDOUT, ":utf8");
    *OUTPUT = *STDOUT ;


    and the following for each input file:


    sub process_file {
        # find is called with the no_chdir option set
        my $input_filename = $_;
        my $output_filename = $input_filename;

        if ($option{x} || ($input_filename =~ m!\.xml$!i ) ) {
            if ($option{d}) {
                # drop the ".xml" suffix
                $output_filename =~ s!\.xml$!!i ;
                # drop the relative path
                $output_filename =~ s!^.*/!! ;
                # add the new path and suffix
                $output_filename = $option{d} . "/" . $output_filename;
                $output_filename = $output_filename . ".txt";
                open(OUTPUT, ">" . $output_filename);
                binmode (OUTPUT, ":utf8");
            }

            print(STDERR "Reading : ", $input_filename, "\n");

            # ... CODE THAT CALLS OTHER SUBROUTINES TO READ THE
            # INPUT FILE, PROCESS IT, AND print(OUTPUT ...) A
            # LOT OF STUFF

            if ($option{d}) {
                print(STDERR "Wrote : ", $output_filename, "\n");
                close(OUTPUT);
            }

        }
        else {
            print(STDERR "Ignoring: ", $File::Find::name, "\n");
        }
    }


    As far as I can tell, this works and cleanly suppresses the "Wide
    character" warnings. Is this use of filehandle assignment OK, or am I
    likely to run into trouble later?

    Also, why is it necessary to set binmode on OUTPUT every time I open
    it?

    Thanks,
    Adam
     
    Adam Funk, Jun 19, 2007
    #1

  2. Adam Funk

    Joe Smith Guest

    Adam Funk wrote:

    > Also, why is it necessary to set binmode on OUTPUT every time I open
    > it?


    Each open() on a handle is independent of any previous I/O on that
    handle. What makes you think binmode() would last past any
    explicit (or implicit) close()?
    -Joe
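
    A minimal sketch of the point above, assuming a Unix-like system and no -C switch, PERL_UNICODE setting, or "use open" pragma in effect. The scratch file and the has_utf8_layer helper are hypothetical names introduced for illustration; PerlIO::get_layers is the built-in used to inspect the handle.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file; any writable path would do.
my (undef, $path) = tempfile(UNLINK => 1);

# Report whether a handle currently has the :utf8 layer.
sub has_utf8_layer {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh);
}

open(my $fh, '>', $path) or die "Cannot open $path: $!";
binmode($fh, ':utf8');
my $before = has_utf8_layer($fh);   # layer set by binmode
close($fh);

# Re-open the same handle variable: the earlier binmode does not carry over.
open($fh, '>', $path) or die "Cannot re-open $path: $!";
my $after = has_utf8_layer($fh);
close($fh);

print "utf8 layer after binmode: ", ($before ? "yes" : "no"), "\n";
print "utf8 layer after re-open: ", ($after  ? "yes" : "no"), "\n";
```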
     
    Joe Smith, Jun 20, 2007
    #2

  3. Adam Funk

    Adam Funk Guest

    On 2007-06-20, Joe Smith wrote:

    >> Also, why is it necessary to set binmode on OUTPUT every time I open
    >> it?

    >
    > Each open() on a handle is independent of any previous I/O on that
    > handle. What makes you think binmode() would last past any
    > explicit (or implicit) close()?


    It wasn't obvious to me, but thanks for clarifying that. I'm still
    wondering about a few things, though.

    Is using binmode the most correct way to suppress those annoying "Wide
    character" warnings?

    Why does Perl act surprised by UTF-8 characters in the output when I'm
    running the program with LANG=en_GB.UTF-8 in the environment?

    Thanks,
    Adam
     
    Adam Funk, Jun 21, 2007
    #3
  4. Adam Funk

    Dr.Ruud Guest

    Adam Funk schreef:

    > Is using binmode the most correct way to suppress those annoying "Wide
    > character" warnings?


    What is annoying about them? They just mean that you need to fix your
    program.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 22, 2007
    #4
  5. Adam Funk

    Adam Funk Guest

    On 2007-06-21, Dr.Ruud wrote:

    > Adam Funk schreef:
    >
    >> Is using binmode the most correct way to suppress those annoying "Wide
    >> character" warnings?

    >
    > What is annoying about them? They just mean that you need to fix your
    > program.


    OK, let me try a different set of questions: is using binmode the
    correct way to fix the error that causes those warnings?


    As I said, I'm running the program in a UTF-8 environment but getting
    thousands (I think) of identical warnings about "Wide characters"
    which actually refer to correct UTF-8 characters that Perl has read
    from input data files without a hiccup.

    Why is it unreasonable that I find this annoying?
    or
    What am I doing that constitutes an error?
     
    Adam Funk, Jun 22, 2007
    #5
  6. Adam Funk

    Klaus Guest

    On Jun 22, 8:18 pm, Adam Funk <> wrote:
    > On 2007-06-21, Dr.Ruud wrote:
    > > Adam Funk schreef:

    >
    > >> Is using binmode the most correct way to suppress those annoying "Wide
    > >> character" warnings?

    >
    > > What is annoying about them? They just mean that you need to fix your
    > > program.

    >
    > OK, let me try a different set of questions: is using binmode the
    > correct way to fix the error that causes those warnings?
    >
    > As I said, I'm running the program in a UTF-8 environment but getting
    > thousands (I think) of identical warnings about "Wide characters"
    > which actually refer to correct UTF-8 characters that Perl has read
    > from input data files without a hiccup.
    >
    > Why is it unreasonable that I find this annoying?
    > or
    > What am I doing that constitutes an error?


    try perl -C7

    see perldoc perlrun:

    ++ ===============================
    ++ -C [number/list]
    ++
    ++ The -C flag controls some of the Perl Unicode
    ++ features.
    ++
    ++ As of 5.8.1, the -C can be followed either by a
    ++ number or a list of option letters. The letters, their
    ++ numeric values, and effects are as follows; listing
    ++ the letters is equal to summing the numbers.
    ++
    ++ I 1 STDIN is assumed to be in UTF-8
    ++ O 2 STDOUT will be in UTF-8
    ++ E 4 STDERR will be in UTF-8
    ++ S 7 I + O + E
    ++ ...
    ++ ===============================
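
    As a rough check of how the letters add up, the sketch below starts a child perl with several -C switches and prints the ${^UNICODE} variable, which reflects the -C setting. It assumes a platform where the list form of a piped open works (for example, Unix); unicode_flags_for is a hypothetical helper name.

```perl
use strict;
use warnings;

# Run a child perl with the given -C switch and return its ${^UNICODE} value.
sub unicode_flags_for {
    my ($switch) = @_;
    open(my $child, '-|', $^X, $switch, '-e', 'print ${^UNICODE}')
        or die "Cannot run $^X: $!";
    my $value = <$child>;
    close($child);
    return $value;
}

# Letter lists and their numeric equivalents should report the same value.
for my $switch ('-CS', '-CIOE', '-C7', '-CSD', '-C31') {
    printf "%-6s => %s\n", $switch, unicode_flags_for($switch);
}
```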

    --
    Klaus
     
    Klaus, Jun 22, 2007
    #6
  7. Adam Funk

    Mumia W. Guest

    On 06/22/2007 01:18 PM, Adam Funk wrote:
    > On 2007-06-21, Dr.Ruud wrote:
    >
    >> Adam Funk schreef:
    >>
    >>> Is using binmode the most correct way to suppress those annoying "Wide
    >>> character" warnings?

    >> What is annoying about them? They just mean that you need to fix your
    >> program.

    >
    > OK, let me try a different set of questions: is using binmode the
    > correct way to fix the error that causes those warnings?
    >


    Yes.

    >
    > As I said, I'm running the program in a UTF-8 environment but getting
    > thousands (I think) of identical warnings about "Wide characters"
    > which actually refer to correct UTF-8 characters that Perl has read
    > from input data files without a hiccup.
    >
    > Why is it unreasonable that I find this annoying?
    > or
    > What am I doing that constitutes an error?


    You probably are assuming that open() configures your filehandles with
    binmode() for you. This isn't true.

    If you open a file, and it needs a special encoding, you need to call
    binmode(). If you close and re-open STDOUT, you need to call binmode()
    on it (if it needs encoding). If you close and re-open STDOUT when it's
    aliased as OUTPUT, you still need to set up the encoding.

    When you need an encoding, it's your responsibility to use binmode() to
    set it on each file handle. The only exception I'm aware of is when the
    "encoding" module is used. But that only sets up STDIN and STDOUT, and
    it only sets them once. Even if the encoding pragma is used, if STDOUT
    is closed and re-opened, binmode() must be called on it again.
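
    One way to avoid re-opening an alias of STDOUT at all is to pick a handle per input file, sketched below under some assumptions: output_handle_for and its $out_dir parameter are hypothetical names, and the path handling mirrors the original post. STDOUT gets its layer once, and each new file gets the layer as part of its open.

```perl
use strict;
use warnings;
use File::Spec;

# Set the encoding on STDOUT once; it stays open for the whole run.
binmode(STDOUT, ':utf8');

# Return a handle to write one input file's results to.
# $out_dir is undef when no -d option was given (hypothetical parameter).
sub output_handle_for {
    my ($input_filename, $out_dir) = @_;
    return \*STDOUT unless defined $out_dir;

    (my $base = $input_filename) =~ s!^.*/!!;   # drop the relative path
    $base =~ s!\.xml$!!i;                       # drop the ".xml" suffix
    my $output_filename = File::Spec->catfile($out_dir, "$base.txt");

    # The layer is part of the open, so every new file gets it.
    open(my $fh, '>:utf8', $output_filename)
        or die "Cannot write $output_filename: $!";
    return $fh;
}
```

    A caller would then write with print {$out} ... and close the handle only when it is not STDOUT.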
     
    Mumia W., Jun 22, 2007
    #7
  8. Mumia W. wrote:
    > On 06/22/2007 01:18 PM, Adam Funk wrote:
    >> On 2007-06-21, Dr.Ruud wrote:
    >>
    >>> Adam Funk schreef:
    >>>
    >>>> Is using binmode the most correct way to suppress those annoying "Wide
    >>>> character" warnings?
    >>> What is annoying about them? They just mean that you need to fix your
    >>> program.

    >>
    >> OK, let me try a different set of questions: is using binmode the
    >> correct way to fix the error that causes those warnings?

    >
    > Yes.
    >
    >> As I said, I'm running the program in a UTF-8 environment but getting
    >> thousands (I think) of identical warnings about "Wide characters"
    >> which actually refer to correct UTF-8 characters that Perl has read
    >> from input data files without a hiccup.
    >>
    >> Why is it unreasonable that I find this annoying?
    >> or
    >> What am I doing that constitutes an error?

    >
    > You probably are assuming that open() configures your filehandles with
    > binmode() for you. This isn't true.


    You mean like:

    open FH, '<:raw', 'filename';

    ??


    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
     
    John W. Krahn, Jun 23, 2007
    #8
  9. Adam Funk

    Mumia W. Guest

    On 06/22/2007 06:18 PM, John W. Krahn wrote:
    > Mumia W. wrote:
    >> [...]
    >> You probably are assuming that open() configures your filehandles with
    >> binmode() for you. This isn't true.

    >
    > You mean like:
    >
    > open FH, '<:raw', 'filename';
    >
    > ??
    >
    >
    > John


    Oh yeah.

    ;-)
     
    Mumia W., Jun 23, 2007
    #9
  10. On 2007-06-22 18:18, Adam Funk <> wrote:
    > On 2007-06-21, Dr.Ruud wrote:
    >> Adam Funk schreef:
    >>
    >>> Is using binmode the most correct way to suppress those annoying "Wide
    >>> character" warnings?

    >>
    >> What is annoying about them? They just mean that you need to fix your
    >> program.

    >
    > OK, let me try a different set of questions: is using binmode the
    > correct way to fix the error that causes those warnings?


    It is "a" correct way, not "the" correct way. There are other ways: The
    -C option (and its cousin, the PERL_UNICODE environment variable),
    specifying perl I/O layers for open, etc.

    I generally prefer

    open($fh, '<:utf8', $filename);

    to

    open($fh, '<', $filename);
    binmode $fh, ':utf8';

    because it is shorter and cleaner. So I use binmode only on STDIN,
    STDOUT and (rarely) STDERR, and then I might use -C instead.

    I used to use the PERL_UNICODE environment variable, but that bit me
    almost as often as it helped, so I don't do that any more.

    > As I said, I'm running the program in a UTF-8 environment but getting
    > thousands (I think) of identical warnings about "Wide characters"
    > which actually refer to correct UTF-8 characters that Perl has read
    > from input data files without a hiccup.
    >
    > Why is it unreasonable that I find this annoying?
    > or
    > What am I doing that constitutes an error?


    You are producing complete garbage. Consider this:

    ------------------------------------------------------------------------
    1 #!/usr/bin/perl
    2
    3 use warnings;
    4 use strict;
    5 use utf8;
    6
    7 my $s1 = "Rübezahl\n";
    8 my $s2 = "€ 200,--\n";
    9
    10 print $s1;
    11 print $s2;
    ------------------------------------------------------------------------
    hrunkner:~/tmp 21:55 193% ./foo | od -c
    Wide character in print at ./foo line 11.
    0000000 R 374 b e z a h l \n 342 202 254 2 0 0
    0000020 , - - \n
    0000024
    hrunkner:~/tmp 21:55 194%

    As you can see you get the warning only when printing $s2, but *not*
    when printing $s1. The "ü" in $s1 has a code of less than 256, so it can
    be printed as a single byte, and is. The € cannot be printed as a single
    byte, so it is encoded as UTF-8 and a warning is printed.

    The end result is that the output is a mixture of encodings. The first
    line is ISO-8859-1, the second is UTF-8. It is impossible to read this
    mess again. (And perl really cannot help this - in line 10 it doesn't
    know that it will be asked to print a euro sign in line 11, it doesn't
    even know it is printing text - it might print an image).

    Now if we add a -CO to the shebang line, the output is:

    hrunkner:~/tmp 22:04 198% ./foo | od -c
    0000000 R 303 274 b e z a h l \n 342 202 254 2 0
    0000020 0 , - - \n
    0000025

    And we now have both lines encoded in UTF-8.
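
    The same comparison can be reproduced without od, as a sketch that writes both strings to an in-memory file and dumps the resulting bytes. It assumes no -C switch or PERL_UNICODE setting is in effect; bytes_written is a hypothetical helper name, and the strings use escapes instead of "use utf8" literals.

```perl
use strict;
use warnings;

# Print both strings to an in-memory file opened with the given mode
# and return the bytes written plus the number of warnings raised.
sub bytes_written {
    my ($mode) = @_;
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };

    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "Cannot open in-memory file: $!";
    print {$fh} "R\x{fc}bezahl\n";     # "ü" is code point 252, fits in one byte
    print {$fh} "\x{20ac} 200,--\n";   # the euro sign does not
    close($fh);

    return ($buffer, $warnings);
}

# Compare a plain handle with one that has the :utf8 layer.
for my $mode ('>', '>:utf8') {
    my ($bytes, $warnings) = bytes_written($mode);
    printf "%-7s %d warning(s): %s\n", $mode, $warnings,
        join ' ', map { sprintf '%02x', ord } split //, $bytes;
}
```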

    hp


    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
     
    Peter J. Holzer, Jun 23, 2007
    #10
  11. Adam Funk

    Adam Funk Guest

    On 2007-06-22, Mumia W. wrote:

    >> As I said, I'm running the program in a UTF-8 environment but getting
    >> thousands (I think) of identical warnings about "Wide characters"
    >> which actually refer to correct UTF-8 characters that Perl has read
    >> from input data files without a hiccup.
    >>
    >> Why is it unreasonable that I find this annoying?
    >> or
    >> What am I doing that constitutes an error?

    >
    > You probably are assuming that open() configures your filehandles with
    > binmode() for you. This isn't true.
    >
    > If you open a file, and it needs a special encoding,


    By "special" you mean "anything other than ASCII", right?

    > you need to call
    > binmode(). If you close and re-open STDOUT, you need to call binmode()
    > on it (if it needs encoding). If you close and re-open STDOUT when it's
    > aliased as OUTPUT, you still need to set up the encoding.
    >
    > When you need an encoding, it's your responsibility to use binmode() to
    > set it on each file handle. The only exception I'm aware of is when the
    > "encoding" module is used. But that only sets up STDIN and STDOUT, and
    > it only sets them once. Even if the encoding pragma is used, if STDOUT
    > is closed and re-opened, binmode() must be called on it again.


    OK, thanks.
     
    Adam Funk, Jun 25, 2007
    #11
  12. Adam Funk

    Adam Funk Guest

    On 2007-06-22, John W. Krahn wrote:

    > Mumia W. wrote:


    >> You probably are assuming that open() configures your filehandles with
    >> binmode() for you. This isn't true.

    >
    > You mean like:
    >
    > open FH, '<:raw', 'filename';
    >
    > ??


    But to be fair to Mumia, the "simpler" form of open() doesn't do that,
    and I was expressing surprise that open() didn't assume the
    environment locale to be applicable.


    Is there any difference between

    open(OUTPUT, '>:utf8', $output_filename);

    and

    open(OUTPUT, ">" . $output_filename);
    binmode (OUTPUT, ":utf8");

    or should I just use whichever one I find more aesthetic?
     
    Adam Funk, Jun 25, 2007
    #12
  13. Adam Funk

    Adam Funk Guest

    On 2007-06-22, Klaus wrote:

    >> OK, let me try a different set of questions: is using binmode the
    >> correct way to fix the error that causes those warnings?
    >>
    >> As I said, I'm running the program in a UTF-8 environment but getting
    >> thousands (I think) of identical warnings about "Wide characters"
    >> which actually refer to correct UTF-8 characters that Perl has read
    >> from input data files without a hiccup.


    > try perl -C7
    >
    > see perldoc perlrun:


    Since I'm sometimes using output other than STDOUT, I think I need the
    more comprehensive -C31 (equivalent to IOEio). Thanks.
     
    Adam Funk, Jun 25, 2007
    #13
  14. Adam Funk

    Adam Funk Guest

    On 2007-06-23, Peter J. Holzer wrote:

    > It is "a" correct way, not "the" correct way. There are other ways: The
    > -C option (and its cousin, the PERL_UNICODE environment variable),
    > specifying perl I/O layers for open, etc.
    >
    > I generally prefer
    >
    > open($fh, '<:utf8', $filename);
    >
    > to
    >
    > open($fh, '<', $filename);
    > binmode $fh, ':utf8';
    >
    > because it is shorter and cleaner. So I use binmode only on STDIN,
    > STDOUT and (rarely) STDERR, and then I might use -C instead.


    As far as I can tell, I'm not getting errors or warnings reading the
    input files (but I'm not doing it directly with my own code --- I'm
    using XML::Twig's parsefile($input_filename) method; the input files
    are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
    environment into consideration, or assume UTF-8, for input but not
    output?
     
    Adam Funk, Jun 25, 2007
    #14
  15. On 2007-06-25 10:13, Adam Funk <> wrote:
    > On 2007-06-22, Mumia W. wrote:
    >>> As I said, I'm running the program in a UTF-8 environment but getting
    >>> thousands (I think) of identical warnings about "Wide characters"
    >>> which actually refer to correct UTF-8 characters that Perl has read
    >>> from input data files without a hiccup.
    >>>
    >>> Why is it unreasonable that I find this annoying?
    >>> or
    >>> What am I doing that constitutes an error?

    >>
    >> You probably are assuming that open() configures your filehandles with
    >> binmode() for you. This isn't true.
    >>
    >> If you open a file, and it needs a special encoding,

    >
    > By "special" you mean "anything other than ASCII", right?


    "Anything other than what happens to be the default in your perl
    implementation" actually. That might be EBCDIC :).

    It might be a good idea to always specify the intended encoding.

    If you want to get the current charset/encoding from the locale, you can
    use I18N::Langinfo:


    use I18N::Langinfo qw(langinfo CODESET);
    my $charset = langinfo(CODESET);

    [...]

    open(my $fh, "<:encoding($charset)", $filename);

    hp


    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
     
    Peter J. Holzer, Jun 25, 2007
    #15
  16. On 2007-06-25 10:28, Adam Funk <> wrote:
    > As far as I can tell, I'm not getting errors or warnings reading the
    > input files (but I'm not doing it directly with my own code --- I'm
    > using XML::Twig's parsefile($input_filename) method; the input files
    > are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
    > environment into consideration,


    No. By default it assumes (on Unix) binary input. You are reading and
    writing a stream of bytes, not a stream of characters.

    > or assume UTF-8, for input but not output?


    No. The XML parser gets the encoding from the XML file. If the XML file
    doesn't explicitly specify an encoding, it must be UTF-8. This is
    completely independent of the locale. XML files are supposed to be
    portable and must not be interpreted differently depending on the
    locale.

    hp


    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
     
    Peter J. Holzer, Jun 25, 2007
    #16
  17. On 2007-06-25 10:18, Adam Funk <> wrote:
    > But to be fair to Mumia, the "simpler" form of open() doesn't do that,
    > and I was expressing surprise that open() didn't assume the
    > environment locale to be applicable.


    open cannot know whether the file it opens is supposed to be a text file
    or a binary file. Since perl previously treated all files as binary on
    Unix, the decision was to keep that as the default. Changing the default
    would have broken lots of old scripts.

    > Is there any difference between
    >
    > open(OUTPUT, '>:utf8', $output_filename);
    >
    > and
    >
    > open(OUTPUT, ">" . $output_filename);
    > binmode (OUTPUT, ":utf8");
    >
    > or should I just use whichever one I find more aesthetic?


    AFAIK they are equivalent.
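
    A quick way to check this on a given system is to compare the layer stacks the two forms produce, as in the sketch below (the scratch file is hypothetical; PerlIO::get_layers is built in). Note that the sketch uses the three-argument open for both forms; the two-argument form in the question additionally treats special characters and surrounding whitespace in the filename as part of the mode, which is a separate difference from the encoding layer.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my (undef, $path) = tempfile(UNLINK => 1);   # hypothetical scratch file

# Form 1: layer named in the open mode.
open(my $one, '>:utf8', $path) or die "Cannot open $path: $!";
my @layers_one = PerlIO::get_layers($one);
close($one);

# Form 2: plain open followed by binmode.
open(my $two, '>', $path) or die "Cannot open $path: $!";
binmode($two, ':utf8');
my @layers_two = PerlIO::get_layers($two);
close($two);

print "open with layer: @layers_one\n";
print "open + binmode : @layers_two\n";
```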

    hp


    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
     
    Peter J. Holzer, Jun 25, 2007
    #17
  18. Adam Funk

    Adam Funk Guest

    On 2007-06-25, Peter J. Holzer wrote:

    > On 2007-06-25 10:28, Adam Funk <> wrote:
    >> As far as I can tell, I'm not getting errors or warnings reading the
    >> input files (but I'm not doing it directly with my own code --- I'm
    >> using XML::Twig's parsefile($input_filename) method; the input files
    >> are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
    >> environment into consideration,

    >
    > No. By default it assumes (on Unix) binary input. You are reading and
    > writing a stream of bytes, not a stream of characters.
    >
    >> or assume UTF-8, for input but not output?

    >
    > No. The XML parser gets the encoding from the XML file. If the XML file
    > doesn't explicitly specify an encoding, it must be UTF-8. This is
    > completely independent of the locale. XML files are supposed to be
    > portable and must not be interpreted differently depending on the
    > locale.


    Oh, of course! I got so caught up in this business of setting
    encodings that I forgot about the encoding specified explicitly in the
    XML file.
     
    Adam Funk, Jun 25, 2007
    #18
  19. Adam Funk

    Adam Funk Guest

    On 2007-06-25, Peter J. Holzer wrote:

    >>> You probably are assuming that open() configures your filehandles with
    >>> binmode() for you. This isn't true.
    >>>
    >>> If you open a file, and it needs a special encoding,

    >>
    >> By "special" you mean "anything other than ASCII", right?

    >
    > "Anything other than what happens to be the default in your perl
    > implementation" actually. That might be EBCDIC :).


    I've got enough trouble already, thanks. ;-)
     
    Adam Funk, Jun 25, 2007
    #19
  20. Adam Funk

    Adam Funk Guest

    On 2007-06-25, Peter J. Holzer wrote:

    > On 2007-06-25 10:18, Adam Funk <> wrote:
    >> But to be fair to Mumia, the "simpler" form of open() doesn't do that,
    >> and I was expressing surprise that open() didn't assume the
    >> environment locale to be applicable.

    >
    > open cannot know whether the file it opens is supposed to be a text file
    > or a binary file. Since perl previously treated all files as binary on
    > Unix, the decision was to keep that as the default. Changing the default
    > would have broken lots of old scripts.


    It's starting to make sense now.


    >> Is there any difference between
    >>
    >> open(OUTPUT, '>:utf8', $output_filename);
    >>
    >> and
    >>
    >> open(OUTPUT, ">" . $output_filename);
    >> binmode (OUTPUT, ":utf8");
    >>
    >> or should I just use whichever one I find more aesthetic?

    >
    > AFAIK they are equivalent.


    Thanks.
     
    Adam Funk, Jun 25, 2007
    #20