Assigning another filehandle to STDOUT, using binmode.

Discussion in 'Perl Misc' started by Adam Funk, Jun 19, 2007.

  1. Adam Funk

    Adam Funk Guest

    I'm writing a program that uses File::Find to recurse through the
    files and directories specified as command-line arguments, and to call
    process_file() on each one.

    By default the program prints each file's results to STDOUT, but if I
    give it the -d DIRECTORY option, it should print each file's output to
    a file in DIRECTORY with ".txt" at the end of the name instead of
    ".xml". There are a lot of non-English UTF-8 characters in the input
    and output.

    At the moment, I have the following near the beginning of the program:

    binmode (STDOUT, ":utf8");
    *OUTPUT = *STDOUT ;


    and the following for each input file:


    sub process_file {
        # find is called with the no_chdir option set
        my $input_filename = $_;
        my $output_filename = $input_filename;

        if ($option{x} || ($input_filename =~ m!\.xml$!i ) ) {
            if ($option{d}) {
                # drop the ".xml" suffix
                $output_filename =~ s!\.xml$!!i ;
                # drop the relative path
                $output_filename =~ s!^.*/!! ;
                # add the new path and suffix
                $output_filename = $option{d} . "/" . $output_filename;
                $output_filename = $output_filename . ".txt";
                open(OUTPUT, ">" . $output_filename);
                binmode (OUTPUT, ":utf8");
            }

            print(STDERR "Reading : ", $input_filename, "\n");

            # ... CODE THAT CALLS OTHER SUBROUTINES TO READ THE
            # INPUT FILE, PROCESS IT, AND print(OUTPUT ...) A
            # LOT OF STUFF

            if ($option{d}) {
                print(STDERR "Wrote : ", $output_filename, "\n");
                close(OUTPUT);
            }

        }
        else {
            print(STDERR "Ignoring: ", $File::Find::name, "\n");
        }
    }


    As far as I can tell, this works and cleanly suppresses the "Wide
    character" warnings. Is this use of filehandle assignment OK, or am I
    likely to run into trouble later?

    Also, why is it necessary to set binmode on OUTPUT every time I open
    it?

    Thanks,
    Adam
     
    Adam Funk, Jun 19, 2007
    #1

  2. Adam Funk

    Joe Smith Guest

    Each open() on a handle is independent of any previous I/O on that
    handle. What makes you think binmode() would last past any
    explicit (or implicit) close()?
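
    A minimal sketch of this point, assuming a scratch file created with
    File::Temp (not part of the original program). It lists the I/O layers
    on a handle after binmode(), then again after the same handle has been
    closed and reopened. The exact layer names vary by platform, and the -C
    switch or the PERL_UNICODE variable would change the defaults.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this demonstration.
my (undef, $path) = tempfile(UNLINK => 1);

open(my $out, '>', $path) or die "Cannot open $path: $!";
binmode($out, ':utf8');
my @before = PerlIO::get_layers($out);   # layers while :utf8 is in effect
close($out);

# Reopening the handle starts again from the default layers.
open($out, '>', $path) or die "Cannot reopen $path: $!";
my @after = PerlIO::get_layers($out);
close($out);

print "after binmode: @before\n";
print "after reopen:  @after\n";
```

    With default settings the "utf8" entry should appear only in the first
    list, which is why binmode() has to be repeated after every open().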
    -Joe
     
    Joe Smith, Jun 20, 2007
    #2

  3. Adam Funk

    Adam Funk Guest

    It wasn't obvious to me, but thanks for clarifying that. I'm still
    wondering about a few things, though.

    Is using binmode the most correct way to suppress those annoying "Wide
    character" warnings?

    Why does Perl act surprised by UTF-8 characters in the output when I'm
    running the program with LANG=en_GB.UTF-8 in the environment?

    Thanks,
    Adam
     
    Adam Funk, Jun 21, 2007
    #3
  4. Adam Funk

    Dr.Ruud Guest

    Adam Funk wrote:
    What is annoying about them? They just mean that you need to fix your
    program.
     
    Dr.Ruud, Jun 22, 2007
    #4
  5. Adam Funk

    Adam Funk Guest

    OK, let me try a different set of questions: is using binmode the
    correct way to fix the error that causes those warnings?


    As I said, I'm running the program in a UTF-8 environment but getting
    thousands (I think) of identical warnings about "Wide characters"
    which actually refer to correct UTF-8 characters that Perl has read
    from input data files without a hiccup.

    Why is it unreasonable that I find this annoying?
    or
    What am I doing that constitutes an error?
     
    Adam Funk, Jun 22, 2007
    #5
  6. Adam Funk

    Klaus Guest

    try perl -C7

    see perldoc perlrun:

    ++ ===============================
    ++ -C [number/list]
    ++
    ++ The -C flag controls some of the Perl Unicode
    ++ features.
    ++
    ++ As of 5.8.1, the -C can be followed either by a
    ++ number or a list of option letters. The letters, their
    ++ numeric values, and effects are as follows; listing
    ++ the letters is equal to summing the numbers.
    ++
    ++ I 1 STDIN is assumed to be in UTF-8
    ++ O 2 STDOUT will be in UTF-8
    ++ E 4 STDERR will be in UTF-8
    ++ S 7 I + O + E
    ++ ...
    ++ ===============================
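
    For comparison, a sketch of roughly the same effect done inside the
    script instead of on the command line: the three standard streams are
    given the :utf8 layer with binmode(). This is an approximation of
    -C7, not a statement of how perl implements the switch.

```perl
use strict;
use warnings;

# Roughly what -C7 (or -CS) requests at startup:
# treat STDIN, STDOUT and STDERR as UTF-8.
binmode($_, ':utf8') for \*STDIN, \*STDOUT, \*STDERR;

# A character above 255 can now be printed without a
# "Wide character" warning.
print "\x{20AC} 200,--\n";
```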
     
    Klaus, Jun 22, 2007
    #6
  7. Adam Funk

    Mumia W. Guest

    You probably are assuming that open() configures your filehandles with
    binmode() for you. This isn't true.

    If you open a file, and it needs a special encoding, you need to call
    binmode(). If you close and re-open STDOUT, you need to call binmode()
    on it (if it needs encoding). If you close and re-open STDOUT when it's
    aliased as OUTPUT, you still need to set up the encoding.

    When you need an encoding, it's your responsibility to use binmode() to
    set it on each file handle. The only exception I'm aware of is when the
    "encoding" module is used. But that only sets up STDIN and STDOUT, and
    it only sets them once. Even if the encoding pragma is used, if STDOUT
    is closed and re-opened, binmode() must be called on it again.
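
    One way to avoid re-opening an aliased STDOUT at all is to hand each
    call its own lexical filehandle. The sketch below is an assumption
    about how the original program could be restructured, not the poster's
    code: output_handle_for() is a hypothetical helper that returns either
    a duplicate of STDOUT or a new ".txt" file in the target directory, in
    both cases with the :utf8 layer already applied.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;

# Hypothetical helper: returns a handle that already has the :utf8 layer.
# With a directory it opens DIR/NAME.txt; without one it duplicates STDOUT.
sub output_handle_for {
    my ($input_filename, $dir) = @_;
    my $fh;
    if (defined $dir) {
        # Drop the relative path and the ".xml" suffix, then add ".txt".
        (my $name = basename($input_filename)) =~ s/\.xml$//i;
        my $path = File::Spec->catfile($dir, "$name.txt");
        open($fh, '>:utf8', $path) or die "Cannot write $path: $!";
    }
    else {
        # Duplicate STDOUT so that closing $fh leaves STDOUT itself open.
        open($fh, '>&', \*STDOUT) or die "Cannot dup STDOUT: $!";
        binmode($fh, ':utf8');
    }
    return $fh;
}
```

    A caller would then write print {$fh} ... and close($fh) for every
    input file, whether or not a directory was given.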
     
    Mumia W., Jun 22, 2007
    #7
  8. You mean like:

    open FH, '<:raw', 'filename';

    ??


    John
     
    John W. Krahn, Jun 23, 2007
    #8
  9. Adam Funk

    Mumia W. Guest

    Oh yeah.

    ;-)
     
    Mumia W., Jun 23, 2007
    #9
  10. It is "a" correct way, not "the" correct way. There are other ways: The
    -C option (and its cousin, the PERL_UNICODE environment variable),
    specifying perl I/O layers for open, etc.

    I generally prefer

    open($fh, '<:utf8', $filename);

    to

    open($fh, '<', $filename);
    binmode $fh, ':utf8';

    because it is shorter and cleaner. So I use binmode only on STDIN,
    STDOUT and (rarely) STDERR, and then I might use -C instead.

    I used to use the PERL_UNICODE environment variable, but that bit me
    almost as often as it helped, so I don't do that any more.
    You are producing complete garbage. Consider this:

    ------------------------------------------------------------------------
    1 #!/usr/bin/perl
    2
    3 use warnings;
    4 use strict;
    5 use utf8;
    6
    7 my $s1 = "Rübezahl\n";
    8 my $s2 = "€ 200,--\n";
    9
    10 print $s1;
    11 print $s2;
    ------------------------------------------------------------------------
    hrunkner:~/tmp 21:55 193% ./foo | od -c
    Wide character in print at ./foo line 11.
    0000000 R 374 b e z a h l \n 342 202 254 2 0 0
    0000020 , - - \n
    0000024
    hrunkner:~/tmp 21:55 194%

    As you can see you get the warning only when printing $s2, but *not*
    when printing $s1. The "ü" in $s1 has a code of less than 256, so it can
    be printed as a single byte, and is. The € cannot be printed as a single
    byte, so it is encoded as UTF-8 and a warning is printed.

    The end result is that the output is a mixture of encodings. The first
    line is ISO-8859-1, the second is UTF-8. It is impossible to read this
    mess again. (And perl really cannot help this - in line 10 it doesn't
    know that it will be asked to print a euro sign in line 11, it doesn't
    even know it is printing text - it might print an image).

    Now if we add a -CO to the shebang line, the output is:

    hrunkner:~/tmp 22:04 198% ./foo | od -c
    0000000 R 303 274 b e z a h l \n 342 202 254 2 0
    0000020 0 , - - \n
    0000025

    And we now have both lines encoded in UTF-8.
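
    The same comparison can be sketched without od, assuming in-memory
    filehandles stand in for the pipe. The strings are written with \x{...}
    escapes rather than "use utf8", and bytes_written() is a hypothetical
    helper introduced only for this illustration.

```perl
use strict;
use warnings;

my $s1 = "R\x{FC}bezahl\n";      # u-umlaut, code point below 256
my $s2 = "\x{20AC} 200,--\n";    # euro sign, code point above 255

# Hypothetical helper: print both strings to an in-memory file opened
# with the given layers and return the bytes that were written.
sub bytes_written {
    my ($layers) = @_;
    my $buffer = '';
    open(my $fh, ">$layers", \$buffer)
        or die "Cannot open in-memory file: $!";
    print {$fh} $s1, $s2;
    close($fh);
    return $buffer;
}

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $mixed = bytes_written('');        # no layer: one Latin-1 byte, then UTF-8
my $clean = bytes_written(':utf8');   # both lines encoded as UTF-8

printf "without layer: %d bytes, %d warning(s)\n", length $mixed, scalar @warnings;
printf "with :utf8:    %d bytes\n", length $clean;
```

    The one-byte difference in length is the "ü", which is written as a
    single byte without the layer and as two bytes with it.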

    hp
     
    Peter J. Holzer, Jun 23, 2007
    #10
  11. Adam Funk

    Adam Funk Guest

    By "special" you mean "anything other than ASCII", right?
    OK, thanks.
     
    Adam Funk, Jun 25, 2007
    #11
  12. Adam Funk

    Adam Funk Guest

    But to be fair to Mumia, the "simpler" form of open() doesn't do that,
    and I was expressing surprise that open() didn't assume the
    environment locale to be applicable.


    Is there any difference between

    open(OUTPUT, '>:utf8', $output_filename);

    and

    open(OUTPUT, ">" . $output_filename);
    binmode (OUTPUT, ":utf8");

    or should I just use whichever one I find more aesthetic?
     
    Adam Funk, Jun 25, 2007
    #12
  13. Adam Funk

    Adam Funk Guest

    Since I'm sometimes using output other than STDOUT, I think I need the
    more comprehensive -C31 (equivalent to IOEio). Thanks.
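
    The arithmetic can be checked with a short sketch. The letter values
    are the ones perlrun lists for -C (only the five relevant here are
    included), and c_flag_letters() is a hypothetical helper, not part of
    perl.

```perl
use strict;
use warnings;

# Letter values as listed in perlrun for the -C switch (subset).
my @flags = ([I => 1], [O => 2], [E => 4], [i => 8], [o => 16]);

# Hypothetical helper: spell out which letters a -C number switches on.
sub c_flag_letters {
    my ($number) = @_;
    return join '', map { $_->[0] } grep { $number & $_->[1] } @flags;
}

print c_flag_letters(7),  "\n";   # IOE
print c_flag_letters(31), "\n";   # IOEio
```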
     
    Adam Funk, Jun 25, 2007
    #13
  14. Adam Funk

    Adam Funk Guest

    As far as I can tell, I'm not getting errors or warnings reading the
    input files (but I'm not doing it directly with my own code --- I'm
    using XML::Twig's parsefile($input_filename) method; the input files
    are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
    environment into consideration, or assume UTF-8, for input but not
    output?
     
    Adam Funk, Jun 25, 2007
    #14
  15. "Anything other than what happens to be the default in your perl
    implementation" actually. That might be EBCDIC :).

    It might be a good idea to always specify the intended encoding.

    If you want to get the current charset/encoding from the locale, you can
    use I18N::Langinfo:


    use I18N::Langinfo qw(langinfo CODESET);
    my $charset = langinfo(CODESET);

    [...]

    open(my $fh, "<:encoding($charset)", $filename);

    hp
     
    Peter J. Holzer, Jun 25, 2007
    #15
  16. No. By default it assumes (on Unix) binary input. You are reading and
    writing a stream of bytes, not a stream of characters.
    No. The XML parser gets the encoding from the XML file. If the XML file
    doesn't explicitly specify an encoding, it must be UTF-8. This is
    completely independent of the locale. XML files are supposed to be
    portable and must not be interpreted differently depending on the
    locale.

    hp
     
    Peter J. Holzer, Jun 25, 2007
    #16
  17. open cannot know whether the file it opens is supposed to be a text file
    or a binary file. Since perl previously treated all files as binary on
    Unix, that was kept as the default. Changing the default would have
    broken lots of old scripts.
    AFAIK they are equivalent.
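
    A quick way to check this on a given installation, sketched under the
    assumption that two scratch files in a File::Temp directory are
    acceptable: write the same string through both forms, then compare the
    layer lists and the raw bytes. slurp_raw() is a hypothetical helper.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # scratch directory for the demonstration
my $text = "\x{20AC} 200,--\n";     # contains a character above 255

# Form 1: layer given directly in the open() mode.
my $file1 = "$dir/one.txt";
open(my $fh1, '>:utf8', $file1) or die "Cannot write $file1: $!";
my @layers1 = PerlIO::get_layers($fh1);
print {$fh1} $text;
close($fh1);

# Form 2: plain open() followed by binmode().
my $file2 = "$dir/two.txt";
open(my $fh2, '>', $file2) or die "Cannot write $file2: $!";
binmode($fh2, ':utf8');
my @layers2 = PerlIO::get_layers($fh2);
print {$fh2} $text;
close($fh2);

# Hypothetical helper: read a file back as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;
    return scalar <$in>;
}

print "same layers\n" if "@layers1" eq "@layers2";
print "same bytes\n"  if slurp_raw($file1) eq slurp_raw($file2);
```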

    hp
     
    Peter J. Holzer, Jun 25, 2007
    #17
  18. Adam Funk

    Adam Funk Guest

    Oh of course! I got so caught up in this business of setting
    encodings that I forgot about the encoding specified explicitly in the
    XML file.
     
    Adam Funk, Jun 25, 2007
    #18
  19. Adam Funk

    Adam Funk Guest

    I've got enough trouble already, thanks. ;-)
     
    Adam Funk, Jun 25, 2007
    #19
  20. Adam Funk

    Adam Funk Guest

    It's starting to make sense now.

    Thanks.
     
    Adam Funk, Jun 25, 2007
    #20