Assigning another filehandle to STDOUT, using binmode.

Adam Funk · Jun 19, 2007

I'm writing a program that uses File::Find to recurse through the
files and directories specified as command-line arguments, and to call
process_file() on each one.

By default the program prints each file's results to STDOUT, but if I
give it the -d DIRECTORY option, it should print each file's output to
a file in DIRECTORY with ".txt" at the end of the name instead of
".xml". There are a lot of non-English UTF-8 characters in the input
and output.

At the moment, I have the following near the beginning of the program:

binmode (STDOUT, ":utf8");
*OUTPUT = *STDOUT ;

and the following for each input file:

sub process_file {
    # find is called with the no_chdir option set
    my $input_filename = $_;
    my $output_filename = $input_filename;

    if ($option{x} || ($input_filename =~ m!\.xml$!i ) ) {
        if ($option{d}) {
            # drop the ".xml" suffix
            $output_filename =~ s!\.xml$!!i ;
            # drop the relative path
            $output_filename =~ s!^.*/!! ;
            # add the new path and suffix
            $output_filename = $option{d} . "/" . $output_filename;
            $output_filename = $output_filename . ".txt";
            open(OUTPUT, ">" . $output_filename);
            binmode (OUTPUT, ":utf8");
        }

        print(STDERR "Reading : ", $input_filename, "\n");

        # ... CODE THAT CALLS OTHER SUBROUTINES TO READ THE
        # INPUT FILE, PROCESS IT, AND print(OUTPUT ...) A
        # LOT OF STUFF

        if ($option{d}) {
            print(STDERR "Wrote : ", $output_filename, "\n");
            close(OUTPUT);
        }

    }
    else {
        print(STDERR "Ignoring: ", $File::Find::name, "\n");
    }
}

As far as I can tell, this works and cleanly suppresses the "Wide
character" warnings. Is this use of filehandle assignment OK, or am I
likely to run into trouble later?

Also, why is it necessary to set binmode on OUTPUT every time I open
it?

Thanks,
Adam

Joe Smith · Jun 20, 2007

Adam said:
Also, why is it necessary to set binmode on OUTPUT every time I open
it?

Each open() on a handle is independent of any previous I/O on that
handle. What makes you think binmode() would last past any
explicit (or implicit) close()?
-Joe
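
To make Joe's point concrete, here is a minimal sketch (not code from the thread) that reopens one handle and inspects its I/O layers each time. The temporary directory and the file names a.txt and b.txt are assumptions made only for the illustration; PerlIO::get_layers() is the built-in way to list the layers on a handle.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory so the example leaves nothing behind (illustrative only).
my $dir = tempdir(CLEANUP => 1);

# First open: add the :utf8 layer with binmode, then record the layers.
open(my $out, '>', "$dir/a.txt") or die "Cannot open a.txt: $!";
binmode($out, ':utf8');
my @first = PerlIO::get_layers($out);

# Second open on the same handle: perl closes it implicitly and starts
# again with the default layers, so the earlier binmode is gone.
open($out, '>', "$dir/b.txt") or die "Cannot open b.txt: $!";
my @second = PerlIO::get_layers($out);
close($out);

print "after first open + binmode: @first\n";
print "after second open:          @second\n";
```

On a typical Unix build the first list should end in utf8 and the second should not, which is why the binmode call has to be repeated after every open.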

Adam Funk · Jun 21, 2007

Each open() on a handle is independent of any previous I/O on that
handle. What makes you think binmode() would last past any
explicit (or implicit) close()?

It wasn't obvious to me, but thanks for clarifying that. I'm still
wondering about a few things, though.

Is using binmode the most correct way to suppress those annoying "Wide
character" warnings?

Why does Perl act surprised by UTF-8 characters in the output when I'm
running the program with LANG=en_GB.UTF-8 in the environment?

Thanks,
Adam

Dr.Ruud · Jun 21, 2007

Adam Funk wrote:

Is using binmode the most correct way to suppress those annoying "Wide
character" warnings?

What is annoying about them? They just mean that you need to fix your
program.

Adam Funk · Jun 22, 2007

Dr.Ruud wrote:

What is annoying about them? They just mean that you need to fix your
program.

OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?

As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?

Klaus · Jun 22, 2007

OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?

As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?

try perl -C7

see perldoc perlrun:

++ ===============================
++ -C [number/list]
++
++ The -C flag controls some of the Perl Unicode
++ features.
++
++ As of 5.8.1, the -C can be followed either by a
++ number or a list of option letters. The letters, their
++ numeric values, and effects are as follows; listing
++ the letters is equal to summing the numbers.
++
++ I 1 STDIN is assumed to be in UTF-8
++ O 2 STDOUT will be in UTF-8
++ E 4 STDERR will be in UTF-8
++ S 7 I + O + E
++ ...
++ ===============================

Mumia W. · Jun 22, 2007

OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?

Yes.

As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?

You probably are assuming that open() configures your filehandles with
binmode() for you. This isn't true.

If you open a file, and it needs a special encoding, you need to call
binmode(). If you close and re-open STDOUT, you need to call binmode()
on it (if it needs encoding). If you close and re-open STDOUT when it's
aliased as OUTPUT, you still need to set up the encoding.

When you need an encoding, it's your responsibility to use binmode() to
set it on each file handle. The only exception I'm aware of is when the
"encoding" module is used. But that only sets up STDIN and STDOUT, and
it only sets them once. Even if the encoding pragma is used, if STDOUT
is closed and re-opened, binmode() must be called on it again.
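
Because `*OUTPUT = *STDOUT` makes OUTPUT an alias for the same glob, `open(OUTPUT, ...)` also reopens STDOUT itself, and `close(OUTPUT)` closes it. One way to sidestep that, sketched here under assumptions, is to keep the destination in a lexical handle and leave STDOUT alone. The helper name open_output() and the sample file name are hypothetical and not part of the original program.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: returns the handle to print to, a flag saying
# whether the caller should close it, and the output path (if any).
sub open_output {
    my ($dir, $input_filename) = @_;

    # No directory given: write to STDOUT and never close it.
    return (\*STDOUT, 0, undef) unless defined $dir;

    (my $base = $input_filename) =~ s!^.*/!!;   # drop the relative path
    $base =~ s!\.xml$!!i;                       # drop the ".xml" suffix
    my $path = "$dir/$base.txt";

    # The layer is given in the open call, so no separate binmode is needed.
    open(my $fh, '>:utf8', $path) or die "Cannot open $path: $!";
    return ($fh, 1, $path);
}

binmode(STDOUT, ':utf8');           # set once; STDOUT is never reopened

my $dir = tempdir(CLEANUP => 1);    # stands in for the -d DIRECTORY option
my ($out, $must_close, $path) = open_output($dir, 'data/report.XML');

print {$out} "\x{20ac} 200,--\n";   # a euro sign, written as UTF-8
if ($must_close) {
    close($out) or die "Cannot close $path: $!";
}
```

With this shape the subroutines that produce output would take the handle as an argument (or print to a file-scoped lexical) instead of printing to the bareword OUTPUT.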

John W. Krahn · Jun 22, 2007

Mumia said:
You probably are assuming that open() configures your filehandles with
binmode() for you. This isn't true.

You mean like:

open FH, '<:raw', 'filename';

??

John

Mumia W. · Jun 22, 2007

Mumia said:

[...]
You probably are assuming that open() configures your filehandles with
binmode() for you. This isn't true.

You mean like:

open FH, '<:raw', 'filename';

??

John

Oh yeah.

;-)

Peter J. Holzer · Jun 23, 2007

OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?

It is "a" correct way, not "the" correct way. There are other ways: The
-C option (and its cousin, the PERL_UNICODE environment variable),
specifying perl I/O layers for open, etc.

I generally prefer

open($fh, '<:utf8', $filename);

to

open($fh, '<', $filename);
binmode $fh, ':utf8';

because it is shorter and cleaner. So I use binmode only on STDIN,
STDOUT and (rarely) STDERR, and then I might use -C instead.

I used to use the PERL_UNICODE environment variable, but that bit me
almost as often as it helped, so I don't do that any more.

As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?

You are producing complete garbage. Consider this:

------------------------------------------------------------------------
1 #!/usr/bin/perl
2
3 use warnings;
4 use strict;
5 use utf8;
6
7 my $s1 = "Rübezahl\n";
8 my $s2 = "€ 200,--\n";
9
10 print $s1;
11 print $s2;
------------------------------------------------------------------------
hrunkner:~/tmp 21:55 193% ./foo | od -c
Wide character in print at ./foo line 11.
0000000 R 374 b e z a h l \n 342 202 254 2 0 0
0000020 , - - \n
0000024
hrunkner:~/tmp 21:55 194%

As you can see you get the warning only when printing $s2, but *not*
when printing $s1. The "ü" in $s1 has a code of less than 256, so it can
be printed as a single byte, and is. The € cannot be printed as a single
byte, so it is encoded as UTF-8 and a warning is printed.

The end result is that the output is a mixture of encodings. The first
line is ISO-8859-1, the second is UTF-8. It is impossible to read this
mess again. (And perl really cannot help this - in line 10 it doesn't
know that it will be asked to print a euro sign in line 11, it doesn't
even know it is printing text - it might print an image).

Now if we add a -CO to the shebang line, the output is:

hrunkner:~/tmp 22:04 198% ./foo | od -c
0000000 R 303 274 b e z a h l \n 342 202 254 2 0
0000020 0 , - - \n
0000025

And we now have both lines encoded in UTF-8.

hp
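
The same effect can be reproduced without a terminal or od. This is a minimal sketch, assuming an in-memory filehandle stands in for STDOUT; the helper name bytes_written() is made up for the illustration, and the two strings are the ones from the example above written with escapes.

```perl
use strict;
use warnings;

# "Rübezahl" and "€ 200,--", written with escapes so the source stays ASCII.
my @lines = ("R\x{fc}bezahl\n", "\x{20ac} 200,--\n");

# Hypothetical helper: print @lines through the given open mode into an
# in-memory file and return the resulting bytes as a hex dump.
sub bytes_written {
    my ($mode) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "Cannot open in-memory file: $!";
    {
        no warnings 'utf8';     # silence "Wide character in print" here
        print {$fh} $_ for @lines;
    }
    close($fh);
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

# No layer: the u-umlaut goes out as the single byte fc, the euro as e2 82 ac.
print "no layer: ", bytes_written('>'), "\n";

# With :utf8 the u-umlaut becomes c3 bc, so both lines are UTF-8.
print ":utf8:    ", bytes_written('>:utf8'), "\n";
```

The two dumps differ only in how the ü is written, which matches the two od listings above (octal 374 versus 303 274).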

Adam Funk · Jun 25, 2007

You probably are assuming that open() configures your filehandles with
binmode() for you. This isn't true.

If you open a file, and it needs a special encoding,

By "special" you mean "anything other than ASCII", right?

you need to call
binmode(). If you close and re-open STDOUT, you need to call binmode()
on it (if it needs encoding). If you close and re-open STDOUT when it's
aliased as OUTPUT, you still need to set up the encoding.

When you need an encoding, it's your responsibility to use binmode() to
set it on each file handle. The only exception I'm aware of is when the
"encoding" module is used. But that only sets up STDIN and STDOUT, and
it only sets them once. Even if the encoding pragma is used, if STDOUT
is closed and re-opened, binmode() must be called on it again.

OK, thanks.

Adam Funk · Jun 25, 2007

John W. Krahn wrote:

You mean like:

open FH, '<:raw', 'filename';

??

But to be fair to Mumia, the "simpler" form of open() doesn't do that,
and I was expressing surprise that open() didn't assume the
environment locale to be applicable.

Is there any difference between

open(OUTPUT, '>:utf8', $output_filename);

and

open(OUTPUT, ">" . $output_filename);
binmode (OUTPUT, ":utf8");

or should I just use whichever one I find more aesthetic?

Adam Funk · Jun 25, 2007

try perl -C7

see perldoc perlrun:

Since I'm sometimes using output other than STDOUT, I think I need the
more comprehensive -C31 (equivalent to IOEio). Thanks.

Adam Funk · Jun 25, 2007

It is "a" correct way, not "the" correct way. There are other ways: The
-C option (and its cousin, the PERL_UNICODE environment variable),
specifying perl I/O layers for open, etc.

I generally prefer

open($fh, '<:utf8', $filename);

to

open($fh, '<', $filename);
binmode $fh, ':utf8';

because it is shorter and cleaner. So I use binmode only on STDIN,
STDOUT and (rarely) STDERR, and then I might use -C instead.

As far as I can tell, I'm not getting errors or warnings reading the
input files (but I'm not doing it directly with my own code --- I'm
using XML::Twig's parsefile($input_filename) method; the input files
are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
environment into consideration, or assume UTF-8, for input but not
output?

Peter J. Holzer · Jun 25, 2007

By "special" you mean "anything other than ASCII", right?

"Anything other than what happens to be the default in your perl
implementation" actually. That might be EBCDIC.

It might be a good idea to always specify the intended encoding.

If you want to get the current charset/encoding from the locale, you can
use I18N::Langinfo:

use I18N::Langinfo qw(langinfo CODESET);
my $charset = langinfo(CODESET);

[...]

open(my $fh, "<:encoding($charset)", $filename);

hp

Peter J. Holzer · Jun 25, 2007

As far as I can tell, I'm not getting errors or warnings reading the
input files (but I'm not doing it directly with my own code --- I'm
using XML::Twig's parsefile($input_filename) method; the input files
are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
environment into consideration,

No. By default it assumes (on Unix) binary input. You are reading and
writing a stream of bytes, not a stream of characters.

or assume UTF-8, for input but not output?

No. The XML parser gets the encoding from the XML file. If the XML file
doesn't explicitly specify an encoding, it must be UTF-8. This is
completely independent of the locale. XML files are supposed to be
portable and must not be interpreted differently depending on the
locale.

hp

Peter J. Holzer · Jun 25, 2007

But to be fair to Mumia, the "simpler" form of open() doesn't do that,
and I was expressing surprise that open() didn't assume the
environment locale to be applicable.

open cannot know whether the file it opens is supposed to be a text file
or a binary file. Since perl previously treated all files as binary on
Unix, it made sense to keep that as the default. Changing the default
would have broken lots of old scripts.

Is there any difference between

open(OUTPUT, '>:utf8', $output_filename);

and

open(OUTPUT, ">" . $output_filename);
binmode (OUTPUT, ":utf8");

or should I just use whichever one I find more aesthetic?

AFAIK they are equivalent.

hp
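
One way to check the equivalence on a given perl is to compare the layer stacks that the two forms produce. This is only a sketch: the temporary directory and file names are assumptions, and the result is what a typical Unix build is expected to show rather than a guarantee for every platform.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);    # scratch directory for the example

# Form 1: layer given in the three-argument open.
open(my $one, '>:utf8', "$dir/one.txt") or die "Cannot open one.txt: $!";

# Form 2: plain open followed by binmode.
open(my $two, '>', "$dir/two.txt") or die "Cannot open two.txt: $!";
binmode($two, ':utf8');

# PerlIO::get_layers lists the layers currently on each handle.
my $layers_one = join ' ', PerlIO::get_layers($one);
my $layers_two = join ' ', PerlIO::get_layers($two);

print "open with layer:  $layers_one\n";
print "open + binmode:   $layers_two\n";
print "same layer stack\n" if $layers_one eq $layers_two;

close($one);
close($two);
```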

Adam Funk · Jun 25, 2007

No. By default it assumes (on Unix) binary input. You are reading and
writing a stream of bytes, not a stream of characters.

No. The XML parser gets the encoding from the XML file. If the XML file
doesn't explicitly specify an encoding, it must be UTF-8. This is
completely independent of the locale. XML files are supposed to be
portable and must not be interpreted differently depending on the
locale.

Oh of course! I got so caught up in this business of setting
encodings that I forgot about the encoding specified explicitly in the
XML file.

Adam Funk · Jun 25, 2007

"Anything other than what happens to be the default in your perl
implementation" actually. That might be EBCDIC.

I've got enough trouble already, thanks. ;-)

Adam Funk · Jun 25, 2007

open cannot know whether the file it opens is supposed to be a text file
or a binary file. Since perl previously treated all files as binary on
Unix, it made sense to keep that as the default. Changing the default
would have broken lots of old scripts.

It's starting to make sense now.

AFAIK they are equivalent.

Thanks.
