UTF8 strings and filesystem access

Discussion in 'Perl Misc' started by Gary E. Ansok, Oct 11, 2007.

  1. One way to access the files in a directory is

    opendir(DH, $dir) or die "opendir: $!";
    while (my $file = readdir DH) {
        next unless -f "$dir/$file";
        # do whatever needs to be done with "$dir/$file";
    }

    However, this fails given the combination of two facts:
    1) $dir is encoded internally in UTF8 (even if $dir doesn't
    contain any non-ASCII characters)
    2) $file contains non-ASCII characters

    The string "$dir/$file" becomes UTF8-encoded, and while it
    prints correctly, and compares equal to the same string not
    UTF8-encoded, apparently the internal encoding is used
    in a stat() (or open()) call, which then fails with $! being
    "No such file".

    Is there a way to work around this without needing to
    transcode all strings that might be UTF8-encoded? $dir is
    being read in from a config file using a module (XML::Simple),
    so I don't have a lot of control over how it's initialized.

    I know I could recast the code to chdir() to $dir, but that
    would be a significant change given the current code structure.

    This is on Solaris, using 5.8.0, though I've verified
    similar behavior on Windows with 5.8.7. I've tried different
    settings for LC_ALL, and it doesn't seem to make a difference.

    Below is a more complete program to demonstrate the bug. It
    assumes that a directory "t2" already exists, with a
    suitably named file in it (I used "fil\351.txt").

    Thanks,
    Gary Ansok

    #! /opt/perl/5.8.0/bin/perl

    use strict;
    use warnings;

    my $show_bug = 1;

    my $dir = 't2';
    if ($show_bug) { # force $dir to be UTF8-encoded
        $dir .= "\x{100}";
        chop $dir;
    }

    print "Opening dir '$dir'\n";
    opendir(DH, $dir) or die "opendir: $!";

    while (my $file = readdir DH) {
        print "Checking file '$dir/$file'\n";
        next unless -f "$dir/$file";
        print "Found file '$dir/$file'\n";
    }
     
    Gary E. Ansok, Oct 11, 2007
    #1

  2. Ben Morrow

    Quoth (Gary E. Ansok):
    > One way to access the files in a directory is
    >
    > opendir(DH, $dir) or die "opendir: $!";
    > while (my $file = readdir DH) {
    >     next unless -f "$dir/$file";
    >     # do whatever needs to be done with "$dir/$file";
    > }
    >
    > However, this fails given the combination of two facts:
    > 1) $dir is encoded internally in UTF8 (even if $dir doesn't
    > contain any non-ASCII characters)
    > 2) $file contains non-ASCII characters
    >
    > The string "$dir/$file" becomes UTF8-encoded, and while it
    > prints correctly, and compares equal to the same string not
    > UTF8-encoded, apparently the internal encoding is used
    > in a stat() (or open()) call, which then fails with $! being
    > "No such file".
    >
    > Is there a way to work around this without needing to
    > transcode all strings that might be UTF8-encoded?


    No, not with current versions of perl. All interactions with the system
    use raw byte-strings[1], so you will need to encode them correctly in
    your local character set for open, and decode them from readdir.

    Ben

    [1] The -C switch used to switch to the Unicode API on Win32, but no one
    used it and the switch was removed in 5.8.1.
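
    A minimal sketch of that advice, assuming file names on the system are stored as ISO-8859-1 (as the "fil\351.txt" example suggests). The helper names fs_bytes, fs_text and list_files are hypothetical, not part of any module:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: the file system stores names in this encoding.
my $fs_encoding = 'iso-8859-1';

# Text string (possibly UTF8-flagged) -> bytes for opendir/open/stat.
sub fs_bytes {
    my ($text) = @_;
    return encode($fs_encoding, $text);
}

# Bytes as returned by readdir -> text string.
sub fs_text {
    my ($bytes) = @_;
    return decode($fs_encoding, $bytes);
}

# Return the plain files in $dir_text as text paths.
sub list_files {
    my ($dir_text) = @_;
    opendir(my $dh, fs_bytes($dir_text)) or die "opendir: $!";
    my @found;
    while (defined(my $entry = readdir $dh)) {
        my $path_text = "$dir_text/" . fs_text($entry);
        # encode again right before the file test
        push @found, $path_text if -f fs_bytes($path_text);
    }
    closedir $dh;
    return @found;
}
```

    With this, list_files($dir) behaves the same whether or not $dir carries the UTF8 flag, because every string is encoded to bytes before it reaches the system.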
     
    Ben Morrow, Oct 11, 2007
    #2

  3. On 2007-10-11 00:31, Ben Morrow wrote:
    >
    > Quoth (Gary E. Ansok):
    >> One way to access the files in a directory is
    >>
    >> opendir(DH, $dir) or die "opendir: $!";
    >> while (my $file = readdir DH) {
    >>     next unless -f "$dir/$file";
    >>     # do whatever needs to be done with "$dir/$file";
    >> }
    >>
    >> However, this fails given the combination of two facts:
    >> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
    >> contain any non-ASCII characters)


    Then why is it a wide string?

    >> 2) $file contains non-ASCII characters
    >>
    >> The string "$dir/$file" becomes UTF8-encoded, and while it
    >> prints correctly, and compares equal to the same string not
    >> UTF8-encoded, apparently the internal encoding is used
    >> in a stat() (or open()) call, which then fails with $! being
    >> "No such file".
    >>
    >> Is there a way to work around this without needing to
    >> transcode all strings that might be UTF8-encoded?

    >
    > No, not with current versions of perl. All interactions with the system
    > use raw byte-strings[1], so you will need to encode them correctly in
    > your local character set for open, and decode them from readdir.


    or alternatively, treat file names as opaque byte strings.
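
    One way to keep a name as bytes, when all of its characters fit in 0x00-0xFF, is utf8::downgrade, which changes only the internal representation. A sketch, assuming the code points in the config value are meant to be the literal bytes of the name; as_bytes is a hypothetical helper:

```perl
use strict;
use warnings;

# Return a copy of $name stored as plain bytes (UTF8 flag off), so that
# concatenating it with readdir results does not upgrade them.
sub as_bytes {
    my ($name) = @_;
    # With a true second argument, utf8::downgrade returns false instead
    # of dying when the string holds characters above 0xFF.
    utf8::downgrade($name, 1)
        or die "name contains characters above 0xFF\n";
    return $name;
}
```

    Used as my $dir = as_bytes($config_dir); once, right after reading the config, the rest of the code can keep building "$dir/$file" as before.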

    > [1] The -C switch used to switch to the Unicode API on Win32, but no one
    > used it and the switch was removed in 5.8.1.


    The switch is still there but it does something different now: It
    controls whether I/O streams and command line parameters are in UTF-8.
    I use

    #!/usr/bin/perl -CSAL

    quite often.

    hp


    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
     
    Peter J. Holzer, Oct 11, 2007
    #3
  4. Peter J. Holzer wrote:
    >> Quoth (Gary E. Ansok):
    >>>
    >>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
    >>> contain any non-ASCII characters)

    >
    >Then why is it a wide string?


    It's read in using XML::Simple from a config file that does not
    contain any non-ASCII characters, or any encoding specification in
    the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).

    Now that I've dug a little deeper, I think upgrading some of our
    module versions may help avoid this problem -- a recent change to
    XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
    independent of document encoding".

    The module versions we're using:
    XML::Simple 2.16, XML::SAX 0.12, XML::LibXML 1.52, libxml2.so.2.6.26

    Gary
     
    Gary E. Ansok, Oct 11, 2007
    #4
  5. On 2007-10-11 22:22, Gary E. Ansok wrote:
    > Peter J. Holzer wrote:
    >>> Quoth (Gary E. Ansok):
    >>>>
    >>>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
    >>>> contain any non-ASCII characters)

    >>
    >>Then why is it a wide string?

    >
    > It's read in using XML::Simple from a config file that does not
    > contain any non-ASCII characters, or any encoding specification in
    > the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).


    The prolog really can't (or at least shouldn't) make any difference: It
    specifies how the file is encoded, but the result of parsing the file is
    always text which possibly contains wide characters.

    You should decide whether you want to treat filenames as text or as
    byte strings within your script.

    If you want to treat them as text (e.g. because you want to do
    operations like case-mapping, substrings, etc. on them), explicitly
    encode them with the local character set just before using them in open,
    stat, etc.

    use Encode qw(encode);

    my $dir_as_text      = $xml_simple->{foo}{dir};
    my $filename_as_text = $xml_simple->{foo}{bar}[42]{title};
    $filename_as_text = lc(substr($filename_as_text, 0, 20));
    $filename_as_text = "$dir_as_text/$filename_as_text.pdf";
    # encode to bytes only at the point of the system call
    my $filename_as_bytes = encode('us-ascii', $filename_as_text);
    open(my $fh, '<', $filename_as_bytes) or die "open: $!";

    If you want to treat them as byte strings, explicitly encode any text
    string you get from a different source (in your case, from an XML file)
    as early as possible.

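    use Encode qw(encode);

    my $dir_as_bytes = encode('us-ascii', $xml_simple->{foo}{dir});
    # $basename_as_bytes is assumed to be a byte string already
    my $filename_as_bytes = "$dir_as_bytes/$basename_as_bytes.pdf";
    open(my $fh, '<', $filename_as_bytes) or die "open: $!";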
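
    One caveat with encode as called above: without a CHECK argument it replaces characters the target encoding cannot represent with a substitution character, so the resulting name can silently refer to a different file. A sketch of a stricter variant; filename_bytes is a hypothetical helper and the encoding names are only examples:

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Encode a text file name to bytes, dying if a character cannot be
# represented in $encoding instead of substituting it.
sub filename_bytes {
    my ($text, $encoding) = @_;
    # encode() with a CHECK value may modify its input, so it is given
    # the local copy $text rather than the caller's variable.
    return encode($encoding, $text, FB_CROAK);
}
```

    For example, filename_bytes($name, 'iso-8859-1') returns the Latin-1 bytes for a name containing U+00E9, while filename_bytes($name, 'us-ascii') dies for the same name.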

    > Now that I've dug a little deeper, I think upgrading some of our
    > module versions may help avoid this problem -- a recent change to
    > XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
    > independent of document encoding".


    You omitted an important piece here: The entry reads
    "strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
    $node->toString returns a piece of XML, which always should be a series
    of bytes, not characters. I haven't looked at the source code of
    XML::Simple, but it probably uses $text->data or $node->nodeValue.

    hp


     
    Peter J. Holzer, Oct 14, 2007
    #5
  6. Peter J. Holzer wrote:
    >On 2007-10-11 22:22, Gary E. Ansok wrote:
    >> Peter J. Holzer wrote:
    >>>> Quoth (Gary E. Ansok):
    >>>>>
    >>>>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
    >>>>> contain any non-ASCII characters)
    >>>
    >>>Then why is it a wide string?

    >>
    >> It's read in using XML::Simple from a config file that does not
    >> contain any non-ASCII characters, or any encoding specification in
    >> the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).

    >
    >> Now that I've dug a little deeper, I think upgrading some of our
    >> module versions may help avoid this problem -- a recent change to
    >> XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
    >> independent of document encoding".

    >
    >You omitted an important piece here: The entry reads
    >"strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
    >$node->toString returns a piece of XML, which always should be a series
    >of bytes, not characters. I haven't looked at the source code of
    >XML::Simple, but it probably uses $text->data or $node->nodeValue.


    I've worked around the problem by switching from XML::LibXML to
    XML::SAX::PurePerl as the underlying parser -- now, the string
    read in from the configuration file no longer has the UTF8 flag
    set, and the problem does not appear.

    I still think it's a bug that a string that can successfully opendir()
    a directory, combined (including the appropriate separator) with a
    file name read in by readdir(), does not result in a string that can
    be used to open() or stat() the file. Especially since the path appears
    correct when printed as part of an error message, and it's difficult
    to diagnose the problem without resorting to something like Devel::Peek.
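
    For diagnosis without Devel::Peek, the core utf8:: functions are enough to see which bytes perl will hand to the system. A small sketch (describe_storage is a hypothetical helper name):

```perl
use strict;
use warnings;

# Report the UTF8 flag and the bytes perl holds internally for $s.
sub describe_storage {
    my ($s) = @_;
    my $flag = utf8::is_utf8($s) ? 'on' : 'off';
    my $internal = $s;
    # For a flagged string the internal buffer is the UTF-8 encoding of
    # its characters; for an unflagged one it is the characters themselves.
    utf8::encode($internal) if utf8::is_utf8($internal);
    my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $internal;
    return "UTF8 flag $flag, bytes: $hex";
}
```

    Printing describe_storage("$dir/$file") next to the error message shows whether a character such as \351 is stored as the single byte e9 or as the two bytes c3 a9, even though both strings print the same and compare equal.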

    Thanks for the assistance,
    Gary Ansok
     
    Gary E. Ansok, Oct 15, 2007
    #6
  7. On 2007-10-15 17:03, Gary E. Ansok wrote:
    > Peter J. Holzer wrote:
    >>On 2007-10-11 22:22, Gary E. Ansok wrote:
    >>> Peter J. Holzer wrote:
    >>>>> Quoth (Gary E. Ansok):
    >>>>>>
    >>>>>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
    >>>>>> contain any non-ASCII characters)
    >>>>
    >>>>Then why is it a wide string?
    >>>
    >>> It's read in using XML::Simple from a config file that does not
    >>> contain any non-ASCII characters, or any encoding specification in
    >>> the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).

    >>
    >>> Now that I've dug a little deeper, I think upgrading some of our
    >>> module versions may help avoid this problem -- a recent change to
    >>> XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
    >>> independent of document encoding".

    >>
    >>You omitted an important piece here: The entry reads
    >>"strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
    >>$node->toString returns a piece of XML, which always should be a series
    >>of bytes, not characters. I haven't looked at the source code of
    >>XML::Simple, but it probably uses $text->data or $node->nodeValue.


    Why did you quote this paragraph? You don't seem to reply to it.

    > I've worked around the problem by switching from XML::LibXML to
    > XML::SAX::PurePerl as the underlying parser -- now, the string
    > read in from the configuration file no longer has the UTF8 flag
    > set, and the problem does not appear.


    Probably because you now have two bugs which cancel each other out.
    The charset handling of XML::SAX::PurePerl is severely broken[0] - don't
    use it.


    > I still think it's a bug that a string that can successfully opendir()
    > a directory, combined (including the appropriate separator) with a
    > file name read in by readdir(), does not result in a string that can
    > be used to open() or stat() the file.


    I agree. However, the opendir() only worked accidentally in your code
    because the directory name just happened to contain only characters <=
    0x7F. If it had contained a character >= 0x80 (like the file name you
    read) it would have failed, too. It is the nature of buggy code that it
    appears to work sometimes. The real fix is to explicitly encode/decode
    strings as required.
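
    Why the ASCII-only directory name got away with it can be checked directly: for characters up to 0x7F the UTF8-flagged and unflagged representations are the same bytes, and from 0x80 upward they are not. A sketch; upgrade_changes_bytes is a hypothetical helper:

```perl
use strict;
use warnings;

# True if storing $name with the UTF8 flag on would put different bytes
# in perl's buffer than storing it with the flag off.
sub upgrade_changes_bytes {
    my ($name) = @_;

    my $narrow = $name;
    utf8::downgrade($narrow);    # dies if $name has characters above 0xFF

    my $wide_bytes = $name;
    utf8::upgrade($wide_bytes);  # flagged representation
    utf8::encode($wide_bytes);   # expose its buffer as a byte string

    return $narrow ne $wide_bytes;
}
```

    This returns false for 't2' and true for "fil\351.txt", which matches the observed behaviour: the flagged directory name still opens, the flagged path to the file does not.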

    > Especially since the path appears correct when printed as part of an
    > error message, and it's difficult to diagnose the problem without
    > resorting to something like Devel::Peek.


    I think that open should work the same whether the filename argument
    is a wide or narrow string. But I'm not sure how it should behave: There
    are arguments for viewing a file name as a sequence of bytes and for
    viewing it as a sequence of characters. The latter is usually more
    convenient, but it makes some tasks impossible (e.g., renaming files
    with "illegal" byte sequences). Maybe we need the equivalent of IO
    layers for filenames, too. Or at least a flag "take filename encoding
    from the locale".

    hp

    [0] Actually just outdated: The current release is older than perl 5.8,
    so it doesn't know about perl 5.8 Unicode support.

     
    Peter J. Holzer, Oct 27, 2007
    #7
