Assigning another filehandle to STDOUT, using binmode.

Adam Funk

I'm writing a program that uses File::Find to recurse through the
files and directories specified as command-line arguments, and to call
process_file() on each one.

By default the program prints each file's results to STDOUT, but if I
give it the -d DIRECTORY option, it should print each file's output to
a file in DIRECTORY with ".txt" at the end of the name instead of
".xml". There are a lot of non-English UTF-8 characters in the input
and output.

At the moment, I have the following near the beginning of the program:

binmode (STDOUT, ":utf8");
*OUTPUT = *STDOUT ;


and the following for each input file:


sub process_file {
    # find is called with the no_chdir option set
    my $input_filename = $_;
    my $output_filename = $input_filename;

    if ($option{x} || ($input_filename =~ m!\.xml$!i ) ) {
        if ($option{d}) {
            # drop the ".xml" suffix
            $output_filename =~ s!\.xml$!!i ;
            # drop the relative path
            $output_filename =~ s!^.*/!! ;
            # add the new path and suffix
            $output_filename = $option{d} . "/" . $output_filename;
            $output_filename = $output_filename . ".txt";
            open(OUTPUT, ">" . $output_filename);
            binmode (OUTPUT, ":utf8");
        }

        print(STDERR "Reading : ", $input_filename, "\n");

        # ... CODE THAT CALLS OTHER SUBROUTINES TO READ THE
        # INPUT FILE, PROCESS IT, AND print(OUTPUT ...) A
        # LOT OF STUFF

        if ($option{d}) {
            print(STDERR "Wrote : ", $output_filename, "\n");
            close(OUTPUT);
        }

    }
    else {
        print(STDERR "Ignoring: ", $File::Find::name, "\n");
    }
}


As far as I can tell, this works and cleanly suppresses the "Wide
character" warnings. Is this use of filehandle assignment OK, or am I
likely to run into trouble later?

Also, why is it necessary to set binmode on OUTPUT every time I open
it?

Thanks,
Adam
 
Joe Smith

Adam said:
Also, why is it necessary to set binmode on OUTPUT every time I open
it?

Each open() on a handle is independent of any previous I/O on that
handle. What makes you think binmode() would last past any
explicit (or implicit) close()?
-Joe
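
To see Joe's point in practice, here is a minimal sketch that inspects the
I/O layers on OUTPUT before and after it is reopened. It assumes a
Unix-like system with no -C switch, no PERL_UNICODE setting and no "use
open" pragma in effect; the temporary file exists only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration (removed when the program exits).
my (undef, $path) = tempfile(UNLINK => 1);

# First open: add the :utf8 layer with binmode, then record the layers.
open(OUTPUT, '>', $path) or die "Cannot open $path: $!";
binmode(OUTPUT, ':utf8');
my @first = PerlIO::get_layers(*OUTPUT);
close(OUTPUT);

# Second open of the same handle name: no binmode this time.
open(OUTPUT, '>', $path) or die "Cannot reopen $path: $!";
my @second = PerlIO::get_layers(*OUTPUT);
close(OUTPUT);

print "after binmode: @first\n";
print "after reopen : @second\n";
```

Under those assumptions the first list should include utf8 and the second
should not, which is why the binmode call has to be repeated after every
open.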
 
Adam Funk

Each open() on a handle is independent of any previous I/O on that
handle. What makes you think binmode() would last past any
explicit (or implicit) close()?

It wasn't obvious to me, but thanks for clarifying that. I'm still
wondering about a few things, though.

Is using binmode the most correct way to suppress those annoying "Wide
character" warnings?

Why does Perl act surprised by UTF-8 characters in the output when I'm
running the program with LANG=en_GB.UTF-8 in the environment?

Thanks,
Adam
 
Dr.Ruud

Adam Funk wrote:
Is using binmode the most correct way to suppress those annoying "Wide
character" warnings?

What is annoying about them? They just mean that you need to fix your
program.
 
Adam Funk

Dr.Ruud wrote:

What is annoying about them? They just mean that you need to fix your
program.

OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?


As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?
 
Klaus

OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?

As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?

try perl -C7

see perldoc perlrun:

++ ===============================
++ -C [number/list]
++
++ The -C flag controls some of the Perl Unicode
++ features.
++
++ As of 5.8.1, the -C can be followed either by a
++ number or a list of option letters. The letters, their
++ numeric values, and effects are as follows; listing
++ the letters is equal to summing the numbers.
++
++ I 1 STDIN is assumed to be in UTF-8
++ O 2 STDOUT will be in UTF-8
++ E 4 STDERR will be in UTF-8
++ S 7 I + O + E
++ ...
++ ===============================
 
Mumia W.

OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?

Yes.


As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?

You probably are assuming that open() configures your filehandles with
binmode() for you. This isn't true.

If you open a file, and it needs a special encoding, you need to call
binmode(). If you close and re-open STDOUT, you need to call binmode()
on it (if it needs encoding). If you close and re-open STDOUT when it's
aliased as OUTPUT, you still need to set up the encoding.

When you need an encoding, it's your responsibility to use binmode() to
set it on each file handle. The only exception I'm aware of is when the
"encoding" module is used. But that only sets up STDIN and STDOUT, and
it only sets them once. Even if the encoding pragma is used, if STDOUT
is closed and re-opened, binmode() must be called on it again.
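
One way to follow this advice without reopening an alias of STDOUT at all
is to pick the output handle per file and name the layer in each open().
The sketch below is only an illustration: open_output() is a hypothetical
helper, not part of the original program, and it assumes the same
%option{d} convention and ".xml" to ".txt" renaming described in the first
post.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Set the layer on STDOUT once, at startup (":utf8" as used in the thread).
binmode(STDOUT, ':utf8');

# Hypothetical helper: return the handle that process_file() should print to.
# With no directory, that is STDOUT; otherwise a fresh file in $dir.
sub open_output {
    my ($dir, $input_filename) = @_;
    return \*STDOUT unless defined $dir;

    # Drop the path and the ".xml" suffix, then add the new path and suffix.
    (my $name = basename($input_filename)) =~ s/\.xml\z//i;
    my $path = "$dir/$name.txt";

    # Every newly opened handle gets its own layer.
    open(my $fh, '>:utf8', $path) or die "Cannot write $path: $!";
    return $fh;
}

# Possible use inside process_file():
#   my $out = open_output($option{d}, $input_filename);
#   print {$out} "...";
#   close($out) if $option{d};
```

Because each file gets its own lexical handle, STDOUT itself is never
closed or reopened, so its layer only needs to be set once.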
 
John W. Krahn

Mumia said:
You probably are assuming that open() configures your filehandles with
binmode() for you. This isn't true.

You mean like:

open FH, '<:raw', 'filename';

??


John
 
Peter J. Holzer

OK, let me try a different set of questions: is using binmode the
correct way to fix the error that causes those warnings?

It is "a" correct way, not "the" correct way. There are other ways: The
-C option (and its cousin, the PERL_UNICODE environment variable),
specifying perl I/O layers for open, etc.

I generally prefer

open($fh, '<:utf8', $filename);

to

open($fh, '<', $filename);
binmode $fh, ':utf8';

because it is shorter and cleaner. So I use binmode only on STDIN,
STDOUT and (rarely) STDERR, and then I might use -C instead.

I used to use the PERL_UNICODE environment variable, but that bit me
almost as often as it helped, so I don't do that any more.

Adam Funk wrote:
As I said, I'm running the program in a UTF-8 environment but getting
thousands (I think) of identical warnings about "Wide characters"
which actually refer to correct UTF-8 characters that Perl has read
from input data files without a hiccup.

Why is it unreasonable that I find this annoying?
or
What am I doing that constitutes an error?

You are producing complete garbage. Consider this:

------------------------------------------------------------------------
1 #!/usr/bin/perl
2
3 use warnings;
4 use strict;
5 use utf8;
6
7 my $s1 = "Rübezahl\n";
8 my $s2 = "€ 200,--\n";
9
10 print $s1;
11 print $s2;
------------------------------------------------------------------------
hrunkner:~/tmp 21:55 193% ./foo | od -c
Wide character in print at ./foo line 11.
0000000 R 374 b e z a h l \n 342 202 254 2 0 0
0000020 , - - \n
0000024
hrunkner:~/tmp 21:55 194%

As you can see you get the warning only when printing $s2, but *not*
when printing $s1. The "ü" in $s1 has a code of less than 256, so it can
be printed as a single byte, and is. The € cannot be printed as a single
byte, so it is encoded as UTF-8 and a warning is printed.

The end result is that the output is a mixture of encodings. The first
line is ISO-8859-1, the second is UTF-8. It is impossible to read this
mess again. (And perl really cannot help this - in line 10 it doesn't
know that it will be asked to print a euro sign in line 11, it doesn't
even know it is printing text - it might print an image).

Now if we add a -CO to the shebang line, the output is:

hrunkner:~/tmp 22:04 198% ./foo | od -c
0000000 R 303 274 b e z a h l \n 342 202 254 2 0
0000020 0 , - - \n
0000025

And we now have both lines encoded in UTF-8.

hp
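
The same effect can be reproduced without od. The following is a minimal
sketch, assuming a Unix-like system with no -C switch or PERL_UNICODE
setting; bytes_written() and hex_dump() are hypothetical helpers written
for this illustration, and the strings use \x{...} escapes so the script
does not depend on its own source encoding.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: print @strings through a handle opened with $mode,
# then return the raw bytes that ended up in the file.
sub bytes_written {
    my ($mode, @strings) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close($tmp);

    open(my $out, $mode, $path) or die "Cannot write $path: $!";
    print {$out} $_ for @strings;
    close($out) or die "Cannot close $path: $!";

    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;                        # slurp the whole file
    my $bytes = <$in>;
    close($in);
    return $bytes;
}

# Hypothetical helper: show each byte as two hex digits.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split(//, $bytes);
}

# U+00FC (u with diaeresis) fits in one byte; U+20AC (euro sign) does not.
my @lines = ("R\x{FC}bezahl\n", "\x{20AC} 200,--\n");

# Collect warnings instead of printing them, so they can be counted.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $mixed = bytes_written('>',      @lines);   # no layer
my $clean = bytes_written('>:utf8', @lines);   # UTF-8 layer

print "no layer: ", hex_dump($mixed), "\n";
print ":utf8   : ", hex_dump($clean), "\n";
print "warnings: ", scalar(@warnings), "\n";
```

Without a layer the first line contains the single byte fc and only the
second print triggers a "Wide character" warning; with the :utf8 layer the
same character becomes c3 bc, matching the two od dumps above.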
 
Adam Funk

You probably are assuming that open() configures your filehandles with
binmode() for you. This isn't true.

If you open a file, and it needs a special encoding,

By "special" you mean "anything other than ASCII", right?

you need to call
binmode(). If you close and re-open STDOUT, you need to call binmode()
on it (if it needs encoding). If you close and re-open STDOUT when it's
aliased as OUTPUT, you still need to set up the encoding.

When you need an encoding, it's your responsibility to use binmode() to
set it on each file handle. The only exception I'm aware of is when the
"encoding" module is used. But that only sets up STDIN and STDOUT, and
it only sets them once. Even if the encoding pragma is used, if STDOUT
is closed and re-opened, binmode() must be called on it again.

OK, thanks.
 
Adam Funk

John W. Krahn wrote:

You mean like:

open FH, '<:raw', 'filename';

??

But to be fair to Mumia, the "simpler" form of open() doesn't do that,
and I was expressing surprise that open() didn't assume the
environment locale to be applicable.


Is there any difference between

open(OUTPUT, '>:utf8', $output_filename);

and

open(OUTPUT, ">" . $output_filename);
binmode (OUTPUT, ":utf8");

or should I just use whichever one I find more aesthetic?
 
Adam Funk

try perl -C7

see perldoc perlrun:

Since I'm sometimes using output other than STDOUT, I think I need the
more comprehensive -C31 (equivalent to IOEio). Thanks.
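
The arithmetic behind "-C31 is IOEio" can be checked with a small sketch.
c_letters() is a hypothetical helper written for this illustration; the
values for I, O and E come from the perlrun excerpt quoted earlier in the
thread, while i = 8 and o = 16 are taken from the rest of that perlrun
table, which the excerpt elides.

```perl
use strict;
use warnings;

# Letter/value pairs for the -C switch (only the five relevant here).
my @flags = (
    [ 'I', 1 ],    # STDIN is assumed to be in UTF-8
    [ 'O', 2 ],    # STDOUT will be in UTF-8
    [ 'E', 4 ],    # STDERR will be in UTF-8
    [ 'i', 8 ],    # UTF-8 is the default layer for input streams
    [ 'o', 16 ],   # UTF-8 is the default layer for output streams
);

# Hypothetical helper: turn a -C number into its option letters.
sub c_letters {
    my ($number) = @_;
    return join '', map { ($number & $_->[1]) ? $_->[0] : () } @flags;
}

print "-C7  = ", c_letters(7),  "\n";   # IOE
print "-C31 = ", c_letters(31), "\n";   # IOEio
```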
 
Adam Funk

It is "a" correct way, not "the" correct way. There are other ways: The
-C option (and its cousin, the PERL_UNICODE environment variable),
specifying perl I/O layers for open, etc.

I generally prefer

open($fh, '<:utf8', $filename);

to

open($fh, '<', $filename);
binmode $fh, ':utf8';

because it is shorter and cleaner. So I use binmode only on STDIN,
STDOUT and (rarely) STDERR, and then I might use -C instead.

As far as I can tell, I'm not getting errors or warnings reading the
input files (but I'm not doing it directly with my own code --- I'm
using XML::Twig's parsefile($input_filename) method; the input files
are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
environment into consideration, or assume UTF-8, for input but not
output?
 
Peter J. Holzer

By "special" you mean "anything other than ASCII", right?

"Anything other than what happens to be the default in your perl
implementation" actually. That might be EBCDIC :).

It might be a good idea to always specify the intended encoding.

If you want to get the current charset/encoding from the locale, you can
use I18N::Langinfo:


use I18N::Langinfo qw(langinfo CODESET);
my $charset = langinfo(CODESET);

[...]

open(my $fh, "<:encoding($charset)", $filename);

hp
 
Peter J. Holzer

As far as I can tell, I'm not getting errors or warnings reading the
input files (but I'm not doing it directly with my own code --- I'm
using XML::Twig's parsefile($input_filename) method; the input files
are XML with Cyrillic UTF-8 PCDATA) --- does Perl by default take the
environment into consideration,

No. By default it assumes (on Unix) binary input. You are reading and
writing a stream of bytes, not a stream of characters.

Adam Funk wrote:
or assume UTF-8, for input but not output?

No. The XML parser gets the encoding from the XML file. If the XML file
doesn't explicitly specify an encoding, it must be UTF-8. This is
completely independent of the locale. XML files are supposed to be
portable and must not be interpreted differently depending on the
locale.

hp
 
Peter J. Holzer

But to be fair to Mumia, the "simpler" form of open() doesn't do that,
and I was expressing surprise that open() didn't assume the
environment locale to be applicable.

open cannot know whether the file it opens is supposed to be a text file
or a binary file. Since perl previously treated all files as binary on
Unix, it made sense to keep that as the default. Changing the default
would have broken lots of old scripts.

Adam Funk wrote:
Is there any difference between

open(OUTPUT, '>:utf8', $output_filename);

and

open(OUTPUT, ">" . $output_filename);
binmode (OUTPUT, ":utf8");

or should I just use whichever one I find more aesthetic?

AFAIK they are equivalent.

hp
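
As a rough check of that equivalence, the layers on the two handles can
be compared directly. This is a minimal sketch, assuming a Unix-like
system and no -C switch or PERL_UNICODE setting; it uses lexical handles
and temporary files purely for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Two scratch files, one per style of open.
my (undef, $path_a) = tempfile(UNLINK => 1);
my (undef, $path_b) = tempfile(UNLINK => 1);

# Style 1: layer named in the open mode.
open(my $fh_a, '>:utf8', $path_a) or die "Cannot open $path_a: $!";

# Style 2: plain open followed by binmode.
open(my $fh_b, '>', $path_b) or die "Cannot open $path_b: $!";
binmode($fh_b, ':utf8');

my $layers_a = join ' ', PerlIO::get_layers($fh_a);
my $layers_b = join ' ', PerlIO::get_layers($fh_b);

print "open with layer: $layers_a\n";
print "open + binmode : $layers_b\n";
print $layers_a eq $layers_b ? "same layers\n" : "different layers\n";

close($fh_a);
close($fh_b);
```

The comparison only covers the I/O layers. The two-argument form
open(OUTPUT, ">" . $output_filename) additionally lets perl interpret
special characters in the filename itself, which the three-argument form
avoids.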
 
Adam Funk

No. By default it assumes (on Unix) binary input. You are reading and
writing a stream of bytes, not a stream of characters.


No. The XML parser gets the encoding from the XML file. If the XML file
doesn't explicitly specify an encoding, it must be UTF-8. This is
completely independent of the locale. XML files are supposed to be
portable and must not be interpreted differently depending on the
locale.

Oh of course! I got so caught up in this business of setting
encodings that I forgot about the encoding specified explicitly in the
XML file.
 
Adam Funk

"Anything other than what happens to be the default in your perl
implementation" actually. That might be EBCDIC :).

I've got enough trouble already, thanks. ;-)
 
Adam Funk

open cannot know whether the file it opens is supposed to be a text file
or a binary file. Since perl previously treated all files as binary on
Unix, it made sense to keep that as the default. Changing the default
would have broken lots of old scripts.

It's starting to make sense now.

AFAIK they are equivalent.

Thanks.
 
