UTF8 strings and filesystem access


Gary E. Ansok

One way to access the files in a directory is

opendir DH, $dir or die "opendir: $!";
while (my $file = readdir DH) {
    next unless -f "$dir/$file";
    # do whatever needs to be done with "$dir/$file";
}

However, this fails given the combination of two facts:
1) $dir is encoded internally in UTF8 (even if $dir doesn't
contain any non-ASCII characters)
2) $file contains non-ASCII characters

The string "$dir/$file" becomes UTF8-encoded, and while it
prints correctly, and compares equal to the same string not
UTF8-encoded, apparently the internal encoding is used
in a stat() (or open()) call, which then fails with $! being
"No such file".

Is there a way to work around this without needing to
transcode all strings that might be UTF8-encoded? $dir is
being read in from a config file using a module (XML::Simple),
so I don't have a lot of control over how it's initialized.

I know I could recast the code to chdir() to $dir, but that
would be a significant change given the current code structure.

This is on Solaris, using 5.8.0, though I've verified
similar behavior on Windows with 5.8.7. I've tried different
settings for LC_ALL, and it doesn't seem to make a difference.

Below is a more complete program to demonstrate the bug. It
assumes that a directory "t2" already exists, with a
suitably-named file in it (I used "fil\351.txt").

Thanks,
Gary Ansok

#! /opt/perl/5.8.0/bin/perl

use strict;
use warnings;

my $show_bug = 1;

my $dir = 't2';
if ($show_bug) {    # force $dir to be UTF8-encoded
    $dir .= "\x{100}";
    chop $dir;
}

print "Opening dir '$dir'\n";
opendir DH, $dir or die "opendir: $!";

while (my $file = readdir DH) {
    print "Checking file '$dir/$file'\n";
    next unless -f "$dir/$file";
    print "Found file '$dir/$file'\n";
}
 

Ben Morrow

Quoth (e-mail address removed) (Gary E. Ansok):
One way to access the files in a directory is

opendir DH, $dir or die "opendir: $!";
while (my $file = readdir DH) {
    next unless -f "$dir/$file";
    # do whatever needs to be done with "$dir/$file";
}

However, this fails given the combination of two facts:
1) $dir is encoded internally in UTF8 (even if $dir doesn't
contain any non-ASCII characters)
2) $file contains non-ASCII characters

The string "$dir/$file" becomes UTF8-encoded, and while it
prints correctly, and compares equal to the same string not
UTF8-encoded, apparently the internal encoding is used
in a stat() (or open()) call, which then fails with $! being
"No such file".

Is there a way to work around this without needing to
transcode all strings that might be UTF8-encoded?

No, not with current versions of perl. All interactions with the system
use raw byte-strings[1], so you will need to encode them correctly in
your local character set for open, and decode them from readdir.
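
For example, a minimal sketch of that approach, assuming the on-disk file
names are ISO-8859-1 (substitute whatever encoding your locale actually
uses), with the directory name from the original post:

use strict;
use warnings;
use Encode qw(encode decode);

my $fs_enc = 'iso-8859-1';              # assumption: the filesystem's byte encoding
my $dir    = 't2';                      # may carry the UTF8 flag, e.g. when it comes from XML::Simple

my $dir_bytes = encode($fs_enc, $dir);  # byte string for all system calls
opendir my $dh, $dir_bytes or die "opendir: $!";
while (defined(my $file_bytes = readdir $dh)) {
    my $path_bytes = "$dir_bytes/$file_bytes";      # stay in bytes for -f/open/stat
    next unless -f $path_bytes;
    my $path_text = decode($fs_enc, $path_bytes);   # decode only where text semantics are needed
    print "Found file '$path_text'\n";
}
closedir $dh;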

Ben

[1] The -C switch used to switch to the Unicode API on Win32, but no one
used it and the switch was removed in 5.8.1.
 

Peter J. Holzer

Quoth (e-mail address removed) (Gary E. Ansok):

1) $dir is encoded internally in UTF8 (even if $dir doesn't
contain any non-ASCII characters)

Then why is it a wide string?

2) $file contains non-ASCII characters

The string "$dir/$file" becomes UTF8-encoded, and while it
prints correctly, and compares equal to the same string not
UTF8-encoded, apparently the internal encoding is used
in a stat() (or open()) call, which then fails with $! being
"No such file".

Is there a way to work around this without needing to
transcode all strings that might be UTF8-encoded?

No, not with current versions of perl. All interactions with the system
use raw byte-strings[1], so you will need to encode them correctly in
your local character set for open, and decode them from readdir.

Or, alternatively, treat file names as opaque byte strings.

[1] The -C switch used to switch to the Unicode API on Win32, but no one
used it and the switch was removed in 5.8.1.

The switch is still there, but it does something different now: it
controls whether the standard I/O streams and the command-line parameters
are treated as UTF-8 (S covers STDIN/STDOUT/STDERR, A covers @ARGV, and L
makes those settings conditional on a UTF-8 locale). I use

#!/usr/bin/perl -CSAL

quite often.

hp
 

Gary E. Ansok

Then why is it a wide string?

It's read in using XML::Simple from a config file that does not
contain any non-ASCII characters, or any encoding specification in
the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).

Now that I've dug a little deeper, I think upgrading some of our
module versions may help avoid this problem -- a recent change to
XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
independent of document encoding".

The module versions we're using:
XML::Simple 2.16, XML::SAX 0.12, XML::LibXML 1.52, libxml2.so.2.6.26

Gary
 

Peter J. Holzer

It's read in using XML::Simple from a config file that does not
contain any non-ASCII characters, or any encoding specification in
the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).

The prolog really can't (or at least shouldn't) make any difference: It
specifies how the file is encoded, but the result of parsing the file is
always text which possibly contains wide characters.

You should decide whether you want to treat filenames as text or as
byte strings within your script.

If you want to treat them as text (e.g. because you want to do
operations like case-mapping, substrings, etc. on them), explicitly
encode them with the local character set just before using them in open,
stat, etc.

use Encode qw(encode);

$dir_as_text       = $xml_simple->{foo}{dir};
$filename_as_text  = $xml_simple->{foo}{bar}[42]{title};
$filename_as_text  = lc(substr($filename_as_text, 0, 20));
$filename_as_text  = "$dir_as_text/$filename_as_text.pdf";
$filename_as_bytes = encode('us-ascii', $filename_as_text);
open($fh, '<', $filename_as_bytes);

If you want to treat them as byte strings, explicitly encode any text
string you get from a different source (in your case, from an XML file)
as early as possible.

$dir_as_bytes = encode('us-ascii', $xml_simple->{foo}{dir});
$filename_as_bytes = "$dir_as_text/$basename_as_bytes.pdf";
open($fh, '<', $filename_as_bytes);
Now that I've dug a little deeper, I think upgrading some of our
module versions may help avoid this problem -- a recent change to
XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
independent of document encoding".

You omitted an important piece here: The entry reads
"strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
$node->toString returns a piece of XML, which always should be a series
of bytes, not characters. I haven't looked at the source code of
XML::Simple, but it probably uses $text->data or $node->nodeValue.

hp
 

Gary E. Ansok

You omitted an important piece here: The entry reads
"strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
$node->toString returns a piece of XML, which always should be a series
of bytes, not characters. I haven't looked at the source code of
XML::Simple, but it probably uses $text->data or $node->nodeValue.

I've worked around the problem by switching from XML::LibXML to
XML::SAX::PurePerl as the underlying parser -- now, the string
read in from the configuration file no longer has the UTF8 flag
set, and the problem does not appear.
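
For anyone following along: one way to pin XML::Simple to a particular SAX
parser is its PREFERRED_PARSER hook (a sketch, assuming the XML::SAX backend
is installed; 'config.xml' is just a placeholder):

use strict;
use warnings;
use XML::Simple;

# Force a specific SAX parser before calling XMLin; the same can be done
# via the XML_SIMPLE_PREFERRED_PARSER environment variable.
$XML::Simple::PREFERRED_PARSER = 'XML::SAX::PurePerl';

my $config = XMLin('config.xml');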

I still think it's a bug that a string that can successfully opendir()
a directory, combined (including the appropriate separator) with a
file name read in by readdir(), does not result in a string that can
be used to open() or stat() the file. Especially since the path appears
correct when printed as part of an error message, and it's difficult
to diagnose the problem without resorting to something like Devel::Peek.
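
A quick way to check for this situation (a sketch; the path value is made up):

use Devel::Peek qw(Dump);
use Encode qw(is_utf8);

my $path = "t2/fil\x{e9}.txt";                        # hypothetical path
print "UTF8 flag: ", (is_utf8($path) ? "on" : "off"), "\n";
Dump($path);   # the FLAGS line shows UTF8 when set; PV shows the bytes perl hands to the OS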

Thanks for the assistance,
Gary Ansok
 

Peter J. Holzer

Why did you quote this paragraph? You don't seem to reply to it.
I've worked around the problem by switching from XML::LibXML to
XML::SAX::PurePerl as the underlying parser -- now, the string
read in from the configuration file no longer has the UTF8 flag
set, and the problem does not appear.

Probably because you now have two bugs which cancel each other out.
The charset handling of XML::SAX::PurePerl is severely broken[0]; don't
use it.

I still think it's a bug that a string that can successfully opendir()
a directory, combined (including the appropriate separator) with a
file name read in by readdir(), does not result in a string that can
be used to open() or stat() the file.

I agree. However, the opendir() only worked accidentally in your code
because the directory name just happened to contain only characters <=
0x7F. If it had contained a character >= 0x80 (like the file name you
read) it would have failed, too. It is the nature of buggy code that it
appears to work sometimes. The real fix is to explicitly encode/decode
strings as required.
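
To illustrate, a sketch that assumes a directory whose on-disk name is the
Latin-1 byte string "d\xE9mo" (the name is made up):

use strict;
use warnings;
use Encode qw(encode);

my $name = "d\x{e9}mo";      # character string; stored internally as Latin-1 bytes
print "native:   ", (opendir(my $dh1, $name) ? "opened" : $!), "\n";   # bytes happen to match

utf8::upgrade($name);        # same characters, now stored internally as UTF-8
print "upgraded: ", (opendir(my $dh2, $name) ? "opened" : $!), "\n";   # perl 5.8 passes the UTF-8 bytes, so this fails

my $bytes = encode('iso-8859-1', $name);   # encode explicitly to the on-disk encoding
print "encoded:  ", (opendir(my $dh3, $bytes) ? "opened" : $!), "\n";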
Especially since the path appears correct when printed as part of an
error message, and it's difficult to diagnose the problem without
resorting to something like Devel::peek.

I think that open should work the same whether the filename argument
is a wide or narrow string. But I'm not sure how it should behave: There
are arguments for viewing a file name as a sequence of bytes and for
viewing it as a sequence of characters. The latter is usually more
convenient, but it makes some tasks impossible (e.g., renaming files
with "illegal" byte sequences). Maybe we need the equivalent of IO
layers for filenames, too. Or at least a flag "take filename encoding
from the locale".

hp

[0] Actually just outdated: The current release is older than perl 5.8,
so it doesn't know about perl 5.8 Unicode support.
 
