reading filenames from stdin - with umlauts?

Discussion in 'Java' started by Dan Stromberg, Jul 27, 2008.

  1. I wrote a small java program to read filenames from stdin (produced by
    Linux' "find"), and then to divide those files up into like groups.

    Actually, it was originally a python program, but I've been wanting to
    expand my horizons a little, so I rewrote it in perl, and now I'm trying
    to redo it in java to celebrate java going opensource, and I'll likely
    rewrite it in Haskell and/or Objective Caml after the java version.

    The java version of the program seems to work pretty well, and I have a
    feeling it's going to prove faster than the python or perl versions
    (which are at
    http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html - and I
    hope to put the java version there too after it's working a little better).

    However, to my disappointment, the java version of the program can't seem
    to deal with filenames that have umlauts in them. Filenames using only
    characters in the English alphabet seem fine.

    I suspect the problem is that the file_name_, as it appears in a Linux
    ext3 filesystem, has an 8 bit per character representation, but java
    wants to convert the string I read from stdin to a 16 bit per character
    representation, and then doesn't reverse the conversion when I go to open
    the file by its name.

    I've googled about this for around 4 hours now, and found little but
    other people having similar issues - sometimes with files, sometimes with
    files inside zip archives.

    The error looks like:

    find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US java -jar equivs.jar equivs.main
    Encoding on isr is ISO8859_1
    IO error 1: java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or directory)
    java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at Sortable_file.get_prefix(Sortable_file.java:63)
    at Sortable_file.compareTo(Sortable_file.java:266)
    at Sortable_file.compareTo(Sortable_file.java:1)
    at java.util.Arrays.mergeSort(Arrays.java:1144)
    at java.util.Arrays.mergeSort(Arrays.java:1155)
    at java.util.Arrays.sort(Arrays.java:1079)
    at equivs.main(equivs.java:54)

    The code I'm reading filenames with looks like:

    InputStreamReader isr = null;
    try
    {
        isr = (new InputStreamReader(System.in, "ISO-8859-1"));
    }
    catch (UnsupportedEncodingException uee)
    {
        System.err.println("UnsupportedEncodingException: " + uee);
        uee.printStackTrace();
        java.lang.System.exit(1);
    }
    System.err.println("Encoding on isr is " + isr.getEncoding());
    BufferedReader stdin = new BufferedReader (isr);
    String line;

    try
    {
        while((line = stdin.readLine()) != null)
        {
            // System.out.println(line);
            // System.out.flush();
            lst.add(new Sortable_file(line));
        }
    }
    catch(java.io.IOException e)
    {
        System.err.println("IO error 0.5: " + e);
        e.printStackTrace();
        java.lang.System.exit(1);
    }

    ...and the code I'm opening the filenames with looks like:

    byte[] buffer = new byte[128];
    java.io.File this_file;
    try
    {
        this_file = new java.io.File(this.filename);
        java.io.FileInputStream file = new java.io.FileInputStream(this_file);
        file.read(buffer);
        // System.out.println("this.prefix.length " + this.prefix.length);
        file.close();
    }
    catch (java.io.IOException ioe)
    {
        System.out.println( "IO error 1: " + ioe );
        ioe.printStackTrace();
        java.lang.System.exit(1);
    }

    (this is just one small part of the compareTo function - the goal was to
    make things fast, and one of the optimizations is to compare just the
    first 128 bytes of a file early in the comparison, and keep it cached in
    memory to make the sort fast. Only if two files have the same prefix do
    we do the expensive md5 hash - etc.).

    Has anyone found a way to do:

    find <options> -print | ./java-prog

    ...and have java-prog act on the files coming from stdin - including
    opening them?

    Thanks!

    PS: I suspect I could write a class to read bytes and piece together
    strings, but 1) That'd probably be slow and 2) I want to use the
    established java class hierarchy where possible and 3) the byte arrays
    still might get upconverted to a different encoding upon converting them
    to a string anyway. But if that's the only way, that's fine.
    Dan Stromberg, Jul 27, 2008
    #1

  2. Dan Stromberg

    Stefan Ram Guest

    Dan Stromberg <> writes:
    >The error looks like:


    We need to isolate the problem (as in »SSCCE«).

    Try this:

    echo "\0344" | java Main

    With

    public class Main
    { public static void main( final java.lang.String[] args )
    throws java.lang.Throwable
    { final java.io.InputStreamReader inputStreamReader
    = new java.io.InputStreamReader( System.in, "ISO8859_1" );
    final java.io.BufferedReader bufferedReader
    = new java.io.BufferedReader( inputStreamReader );
    final java.lang.String string = bufferedReader.readLine();
    java.lang.System.out.println( "\u00E4".equals( string.substring( 0, 1 ))); }}

    If it prints »false«, post the output of

    echo "\0344" | od -h

    and also the hexadecimal codes of the String »string« at the
    end of the block above.

    Additional information:

    344 is the octal code of the letter LATIN SMALL LETTER A WITH DIAERESIS
    in ISO 8859-1.

    "\u00E4" is a Java String containing only the letter
    LATIN SMALL LETTER A WITH DIAERESIS.
    Stefan Ram, Jul 28, 2008
    #2

  3. Dan Stromberg wrote:
    > However, to my disappointment, the java version of the program can't seem
    > to deal with filenames that have umlauts in them. Filenames using only
    > characters in the English alphabet seem fine.
    >
    > I suspect the problem is that the file_name_, as it appears in a Linux
    > ext3 filesystem, has an 8 bit per character representation, but java
    > wants to convert the string I read from stdin to a 16 bit per character
    > representation, and then doesn't reverse the conversion when I go to open
    > the file by its name.
    >
    > I've googled about this for around 4 hours now, and found little but
    > other people having similar issues - sometimes with files, sometimes with
    > files inside zip archives.
    >
    > The error looks like:
    >
    > find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US
    > java -jar equivs.jar equivs.main
    > Encoding on isr is ISO8859_1
    > IO error 1: java.io.FileNotFoundException: /home/dstromberg/Sound/Music/
    > mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No
    > such file or directory)
    > java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?
    > rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or
    > directory)
    > at java.io.FileInputStream.open(Native Method)
    > at java.io.FileInputStream.<init>(FileInputStream.java:106)
    > at Sortable_file.get_prefix(Sortable_file.java:63)
    > at Sortable_file.compareTo(Sortable_file.java:266)
    > at Sortable_file.compareTo(Sortable_file.java:1)
    > at java.util.Arrays.mergeSort(Arrays.java:1144)
    > at java.util.Arrays.mergeSort(Arrays.java:1155)
    > at java.util.Arrays.sort(Arrays.java:1079)
    > at equivs.main(equivs.java:54)
    >
    > The code I'm reading filenames with looks like:
    >
    > InputStreamReader isr = null;
    > try
    > {
    > isr = (new InputStreamReader(System.in, "ISO-8859-1"));
    > }
    > catch (UnsupportedEncodingException uee)
    > {
    > System.err.println("UnsupportedEncodingException: " + uee);
    > uee.printStackTrace();
    > java.lang.System.exit(1);
    > }
    > System.err.println("Encoding on isr is " + isr.getEncoding());


    > Has anyone found a way to do:
    >
    > find <options> -print | ./java-prog
    >
    > ...and have java-prog act on the files coming from stdin - including
    > opening them?


    Have you tried "UTF-8" instead of "ISO-8859-1" ?

    Arne
    Arne Vajhøj, Jul 28, 2008
    #3
  4. Dan Stromberg

    Stefan Ram Guest

    Stefan Ram, Jul 28, 2008
    #4
  5. On Sun, 27 Jul 2008 23:25:01 +0000, Stefan Ram wrote:

    > Dan Stromberg <> writes:
    >>The error looks like:

    >
    > We need to isolate the problem (as in »SSCCE«).
    >
    > Try this:
    >
    > echo "\0344" | java Main
    >
    > With
    >
    > public class Main
    > { public static void main( final java.lang.String[] args )
    > throws java.lang.Throwable
    > { final java.io.InputStreamReader inputStreamReader
    > = new java.io.InputStreamReader( System.in, "ISO8859_1" ); final
    > java.io.BufferedReader bufferedReader = new java.io.BufferedReader(
    > inputStreamReader ); final java.lang.String string =
    > bufferedReader.readLine(); java.lang.System.out.println(
    > "\u00E4".equals( string.substring( 0, 1 ))); }}
    >
    > If prints »false«, post the output of
    >
    > echo "\0344" | od -h


    It printed false, but this prints true:

    printf '\344' | ./foo

    (This was with the gcj implementation of java. I get the same result
    using OpenJDK though).

    > and also the hexadecimal codes of the String »string« at the end of
    > the block above.
    >
    > Additional information:
    >
    > 344 is the octal code of the letter LATIN SMALL LETTER A WITH
    > DIAERESIS in ISO 8859-1.
    >
    > "\u00E4" is a Java String containing only the letter LATIN SMALL
    > LETTER A WITH DIAERESIS.
    Dan Stromberg, Jul 28, 2008
    #5
  6. On Sun, 27 Jul 2008 19:27:29 -0400, Arne Vajhøj wrote:

    > Dan Stromberg wrote:
    >> However, to my disappointment, the java version of the program can't
    >> seem to deal with filenames that have umlauts in them. Filenames using
    >> only characters in the English alphabet seem fine.
    >>
    >> I suspect the problem is that the file_name_, as it appears in a Linux
    >> ext3 filesystem, has an 8 bit per character representation, but java
    >> wants to convert the string I read from stdin to a 16 bit per character
    >> representation, and then doesn't reverse the conversion when I go to
    >> open the file by its name.
    >>
    >> I've googled about this for around 4 hours now, and found little but
    >> other people having similar issues - sometimes with files, sometimes
    >> with files inside zip archives.
    >>
    >> The error looks like:
    >>
    >> find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US
    >> java -jar equivs.jar equivs.main
    >> Encoding on isr is ISO8859_1
    >> IO error 1: java.io.FileNotFoundException:
    >> /home/dstromberg/Sound/Music/ mp3/Bjork/Bj?rk_The Music From Drawing
    >> Restraint 9_06_Shimenawa.mp3 (No such file or directory)
    >> java.io.FileNotFoundException:
    >> /home/dstromberg/Sound/Music/mp3/Bjork/Bj? rk_The Music From Drawing
    >> Restraint 9_06_Shimenawa.mp3 (No such file or directory)
    >> at java.io.FileInputStream.open(Native Method) at
    >> java.io.FileInputStream.<init>(FileInputStream.java:106) at
    >> Sortable_file.get_prefix(Sortable_file.java:63) at
    >> Sortable_file.compareTo(Sortable_file.java:266) at
    >> Sortable_file.compareTo(Sortable_file.java:1) at
    >> java.util.Arrays.mergeSort(Arrays.java:1144) at
    >> java.util.Arrays.mergeSort(Arrays.java:1155) at
    >> java.util.Arrays.sort(Arrays.java:1079) at
    >> equivs.main(equivs.java:54)
    >>
    >> The code I'm reading filenames with looks like:
    >>
    >> InputStreamReader isr = null;
    >> try
    >> {
    >> isr = (new InputStreamReader(System.in, "ISO-8859-1")); }
    >> catch (UnsupportedEncodingException uee)
    >> {
    >> System.err.println("UnsupportedEncodingException: " + uee);
    >> uee.printStackTrace();
    >> java.lang.System.exit(1);
    >> }
    >> System.err.println("Encoding on isr is " + isr.getEncoding());

    >
    >> Has anyone found a way to do:
    >>
    >> find <options> -print | ./java-prog
    >>
    >> ...and have java-prog act on the files coming from stdin - including
    >> opening them?

    >
    > Have you tried "UTF-8" instead of "ISO-8859-1" ?
    >
    > Arne


    I had tried a handful of encodings but not UTF-8. I've now tried it, and
    found that I got the same result as with other encodings - file not found.
    Dan Stromberg, Jul 28, 2008
    #6
  7. On Mon, 28 Jul 2008 00:32:23 +0000, Stefan Ram wrote:

    > Dan Stromberg <> writes:
    >>isr = (new InputStreamReader(System.in, "ISO-8859-1")

    >
    > Now, I become aware of another fact:
    > »java.lang.System.in« already has an encoding.
    >
    > You might try to use this as a base instead:
    >
    > http://download.java.net/jdk7/docs/api/java/io/FileDescriptor.html#in


    I tried this but still I get file not found with OpenJDK. gcj seems fine
    though:

    FileReader fr = null;
    // isr = (new InputStreamReader(System.in, "ISO-8859-1"));
    // isr = (new InputStreamReader(System.in, "UTF-8"));
    fr = (new FileReader(java.io.FileDescriptor.in));
    System.err.println("Encoding on fr is " + fr.getEncoding());
    //BufferedReader stdin = new BufferedReader (fr);
    StringBuffer line;

    char ch;
    int int_char;
    try
    {
        while (true)
        {
            line = new StringBuffer("");
            while(true)
            {
                int_char = fr.read();
                if (int_char == -1)
                {
                    break;
                }
                ch = (char)int_char;
                System.out.println("" + ch);
                if (ch == (char)10)
                {
                    break;
                }
                line.append(ch);
            }
            if (int_char == -1)
            {
                break;
            }
            System.out.println(new String(line));
            lst.add(new Sortable_file(new String(line)));
        }
    }
    catch(java.io.IOException e)
    {

    BTW, this code says the encoding is ASCII when I run it, whether using
    OpenJDK or gcj.

    Is the java String type -always- 16 bits per character? That is, if I
    try to stick an 8 bit value into a String, is it always going to be
    converted to a different encoding that maps back most of the time, but
    not always?

    Do java strings of any sort have an associated but variable encoding?
    Are there different string types that have different encodings?

    Is there any way of opening a filename that isn't stored in a String?
    Short of something like SWIG, JNI or ctypes that is?
    Dan Stromberg, Jul 28, 2008
    #7
  8. Dan Stromberg

    Stefan Ram Guest

    Dan Stromberg <> writes:
    >Is the java String type -always- 16 bits per character?


    Yes (if we ignore surrogate pairs, which are rare and not
    used for umlauts).

    >That is, if I try to stick an 8 bit value into a String, is it
    >always going to be converted to a different encoding that maps
    >back most of the time, but not always?


    The Reader objects already take care to convert between
    raw bytes and characters. Strings contain characters;
    strictly speaking, they have no »encoding«. They might
    be converted to/from byte[] or streams to en- or decode them.
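
    A minimal sketch of that point (not from the thread; the class name
    EncodingDemo and its helper are made up for illustration): the same
    one-character String produces different byte sequences depending on the
    charset chosen at conversion time.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Encode a String using an explicitly chosen charset.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00F6"; // LATIN SMALL LETTER O WITH DIAERESIS, one char

        // The String itself is charset-neutral; only the byte[] differs.
        System.out.println("chars:      " + oUmlaut.length());
        System.out.println("ISO-8859-1: " + encode(oUmlaut, StandardCharsets.ISO_8859_1).length + " byte(s)");
        System.out.println("UTF-8:      " + encode(oUmlaut, StandardCharsets.UTF_8).length + " byte(s)");

        // US-ASCII cannot represent the character, so getBytes substitutes '?'.
        byte[] ascii = encode(oUmlaut, StandardCharsets.US_ASCII);
        System.out.println("US-ASCII:   0x" + Integer.toHexString(ascii[0] & 0xFF));
    }
}
```

    The US-ASCII case may be related to the »Bj?rk« in the exception message,
    but that is only a guess; the »?« could equally come from the terminal.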

    >Do java strings of any sort have an associated but variable encoding?


    No. Ignoring surrogate pairs, a string is a sequence of
    characters; the value of each character /always/ is the
    corresponding Unicode code point.

    >Are there different string types that have different encodings?


    No (for the strings of the standard class »java.lang.String«).

    >Is there any way of opening a filename that isn't stored in a String?


    Not with the standard classes AFAIK.

    ~~

    To debug, try this:

    $mkdir d0
    $touch d0/ä
    $find d0 -name ä -print | od -h
    0000000 6430 2fe4 0a00
    0000005

    If the filesystem uses ISO 8859-1, you should see »e4« as above
    (»64302fe4« is »d0/ä«).

    Then, read the output of this find from Java and debug print
    it from Java to a sequence of hex codes.

    If it is »64302fe4«, then you have read it correctly (ISO
    8859-1 code points agree with Unicode code points here).
    Otherwise, you might post here what it is instead.

    You can also bypass the Reader class, read the »raw bytes«
    from the stream, and use their hex dump to get an idea of the
    apparent encoding of the stream (post the hexdump here).
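
    The two debugging steps above could be sketched roughly as follows. This
    is an illustration only: the class HexDump and its method names are
    invented here, and readAllBytes assumes Java 9 or later.

```java
import java.io.IOException;

final class HexDump {
    // Hex codes of raw bytes, e.g. "64 30 2f e4 0a".
    static String hexOfBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Hex codes of the 16-bit chars of an already decoded String.
    static String hexOfChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Bypass every Reader: dump exactly what arrives on stdin.
        byte[] raw = System.in.readAllBytes();
        System.out.println(hexOfBytes(raw));
    }
}
```

    It would be run as »find d0 -print | java HexDump«; hexOfChars can be
    applied to whatever readLine() returned to compare the decoded chars
    against the raw bytes.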
    Stefan Ram, Jul 28, 2008
    #8
  9. On 28/07/2008 07:05, Dan Stromberg allegedly wrote:
    > I had tried a handful of encodings but not UTF-8. I've now tried it, and
    > found that I got the same result as with other encodings - file not found.


    Have you tried not using any "encoding"? As others pointed out,
    System.in is a Reader, that is something which already has some kind of
    byte-to-char handling. Furthermore, if your solution ought to be
    portable, it would seem to me as a bad idea to hardcode the charset. You
    should rather rely on proper system configuration (java's file.encoding
    being the same as the shell's) -- or maybe a runtime parameter.

    --
    DF.
    Daniele Futtorovic, Jul 28, 2008
    #9
  10. Dan Stromberg

    Stefan Ram Guest

    Daniele Futtorovic <> writes:
    >(java's file.encoding being the same as the shell's)


    It is not always the same, for example under some versions of
    »Microsoft® Windows«, the console Window uses »cp437«, which
    is not the default encoding of java.lang.System.out there.

    Also, FileReader is not recommended (by several
    programmers), exactly because it uses a »default encoding«,
    which not always is appropriate for the task at hand.
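
    A sketch of the alternative, with the charset passed explicitly instead
    of relying on FileReader's default. The class name StdinNames and the
    idea of taking the charset name as the first command-line argument are
    assumptions made for this example, not something from the thread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class StdinNames {
    // Read all lines from a byte stream, decoding with the given charset.
    static List<String> readLines(InputStream in, Charset cs) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Runtime parameter if given, otherwise the platform default.
        Charset cs = args.length > 0 ? Charset.forName(args[0]) : Charset.defaultCharset();
        for (String name : readLines(System.in, cs)) {
            System.out.println(name);
        }
    }
}
```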
    Stefan Ram, Jul 28, 2008
    #10
  11. Dan Stromberg wrote:
    > However, to my disappointment, the java version of the program can't seem
    > to deal with filenames that have umlauts in them. Filenames using only
    > characters in the English alphabet seem fine.
    >
    > I suspect the problem is that the file_name_, as it appears in a Linux
    > ext3 filesystem, has an 8 bit per character representation, but java
    > wants to convert the string I read from stdin to a 16 bit per character
    > representation, and then doesn't reverse the conversion when I go to open
    > the file by its name.


    No. Java /always/ uses 16-bit characters; if it did that, it couldn't
    open files at all.

    Try running this program:

    import java.io.File;

    public final class DirScan {

        public static void main(final String[] args) {
            for (final String dirName : args) {
                System.out.println(dirName);
                final File dir = new File(dirName);
                final File[] files = dir.listFiles();
                for (final File file : files) {
                    final String fileName = file.toString();
                    System.out.printf(" %-25s ", fileName);
                    for (int i = 0; i < fileName.length(); ++i)
                        System.out.printf(" %04X", (int) fileName.charAt(i));
                    System.out.println();
                }
            }

        }

    }

    ...specifying one or more directories as arguments.


    --
    John W. Kennedy
    "Never try to take over the international economy based on a radical
    feminist agenda if you're not sure your leader isn't a transvestite."
    -- David Misch: "She-Spies", "While You Were Out"
    John W Kennedy, Jul 29, 2008
    #11
  12. On 29/07/2008 02:25, Lew allegedly wrote:
    > Daniele Futtorovic wrote:
    >> Have you tried not using any "encoding"? As others pointed out,
    >> System.in is a Reader, that is something which already has some kind of
    >> byte-to-char handling.

    >
    > Ahem:
    >> public static final InputStream in

    > <http://java.sun.com/javase/6/docs/api/java/lang/System.html#in>
    >


    <scratches head, walks to the nearest wall, bangs>

    --
    DF.
    Daniele Futtorovic, Jul 29, 2008
    #12
  13. Dan Stromberg

    Stefan Ram Guest

    Daniele Futtorovic <> writes:
    >> Daniele Futtorovic wrote:
    >>> Have you tried not using any "encoding"? As others pointed out,
    >>> System.in is a Reader, that is something which already has some kind of
    >>> byte-to-char handling.

    ><scratches head, walks to the nearest wall, bangs>


    My fault. It seems as if I would have assumed that there
    is a symmetry between System.in and System.out.

    A java.io.PrintStream really can have an encoding.
    Stefan Ram, Jul 29, 2008
    #13
  14. Dan Stromberg

    Stefan Ram Guest

    Daniele Futtorovic <> writes:
    ><scratches head, walks to the nearest wall, bangs>


    Still, allegedly java.lang.System.in sometimes /has/ some
    transcoding magic in it (based on a native method).

    For example:

    »Data read from [...] System.in, [...] are handled
    differently than data read from [...] other sources [...].

    [A] conversion is performed by the JVM on the data to
    convert from the normal character encoding of
    file.encoding to a CCSID matching the System i job CCSID.

    When System.in [...][is] redirected [...], this additional
    data conversion is not performed and the data remains in a
    character encoding matching file.encoding.«

    http://publib.boulder.ibm.com/infocenter/iseries/v5r4/topic/rzaha/charenc.htm
    Stefan Ram, Jul 29, 2008
    #14
  15. On 29/07/2008 03:41, Stefan Ram allegedly wrote:
    > Daniele Futtorovic <> writes:
    >>> Daniele Futtorovic wrote:
    >>>> Have you tried not using any "encoding"? As others pointed out,
    >>>> System.in is a Reader, that is something which already has some kind of
    >>>> byte-to-char handling.

    >> <scratches head, walks to the nearest wall, bangs>

    >
    > My fault. It seems as if I would have assumed that there
    > is a symmetry between System.in and System.out.


    No, mine really -- I should know the class of System.in by heart --, as
    well as accumulated frustration over too many mistakes in posts lately,
    perplexing me. I hate making mistakes. Especially in public. :)


    > Still, allegedly java.lang.System.in sometimes /has/ some
    > transcoding magic in it (based on a native method).
    >
    > For example:
    >
    > »Data read from [...] System.in, [...] are handled
    > differently than data read from [...] other sources [...].
    >
    > [A] conversion is performed by the JVM on the data to
    > convert from the normal character encoding of
    > file.encoding to a CCSID matching the System i job CCSID.
    >
    > When System.in [...][is] redirected [...], this additional
    > data conversion is not performed and the data remains in a
    > character encoding matching file.encoding.«
    >
    > http://publib.boulder.ibm.com/infocenter/iseries/v5r4/topic/rzaha/charenc.htm


    This appears to be specific to the iSeries. I can't find any other
    reference to System.in and encoding on the Sun site. Furthermore, the
    fact that System.in is an InputStream speaks squarely against any type
    of byte-to-char conversion (<=> "encoding"), doesn't it? Or should there
    be some magic hidden in the JVM that decides whether the process' input
    is text? I don't think that's likely. I don't even see why that
    would be a good idea.

    --
    DF.
    Daniele Futtorovic, Jul 29, 2008
    #15
  16. On Mon, 28 Jul 2008 05:53:20 +0000, Stefan Ram wrote:

    > Dan Stromberg <> writes:
    >>Is the java String type -always- 16 bits per character?

    >
    > Yes (if we ignore surrogate pairs, which are rare and not used for
    > umlauts).
    >
    >>That is, if I try to stick an 8 bit value into a String, is it always
    >>going to be converted to a different encoding that maps back most of the
    >>time, but not always?

    >
    > The Reader objects already take care to convert between raw bytes and
    > characters. Strings contain characters, strictly speaking, they have no
    > »encoding«. They might be converted to/from byte[] or streams to en-
    > or decode them.
    >
    >>Do java strings of any sort have an associated but variable encoding?

    >
    > No. Ignoring surrogate pairs, a string is a sequence of characters;
    > the value of each character /always/ is the corresponding Unicode code
    > point.
    >
    >>Are there different string types that have different encodings?

    >
    > No (for the strings of the standard class »java.lang.String«).
    >
    >>Is there any way of opening a filename that isn't stored in a String?

    >
    > Not with the standard classes AFAIK.
    >
    > ~~
    >
    > To debug, try this:
    >
    > $mkdir d0
    > $touch d0/ä
    > $find d0 -name ä -print | od -h
    > 0000000 6430 2fe4 0a00
    > 0000005
    >
    > If the filesystem uses ISO 8859-1, you should see »e4« as above
    > (»64302fe4« is »d0/ä«).
    >
    > Then, read the output of this find from Java and debug print it from
    > Java to a sequence of hex codes.
    >
    > If it is »64302fe4«, then you have read it correctly (ISO 8859-1 code
    > points agree with Unicode code points here). Otherwise, you might post
    > here what it is instead.
    >
    > You can also bypass the Reader class, read the »raw bytes« from the
    > stream, and use their hex dump to get an idea of the apparent encoding
    > of the stream (post the hexdump here).


    Often, at least on *ix, strace/truss/par/trace are a more direct route to
    a solution than endless test programs.

    I ran the OpenJDK version of my program under strace, and found that this
    is what's being read:

    [pid 11252] read(0, "/home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The
    Music From Drawing Restraint 9_06_Shimenawa.mp3\n/home/dstromberg/Sound/
    Music/mp3/Bjork/Bj\366rk_The Music From Drawing Restraint 9_10_Cetacea.mp3
    \n/home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
    Restraint 9_04_Bath.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
    \366rk_The Music From Drawing Restraint 9_05_Hunter Vessel.mp3\n/home/
    dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
    Restraint 9_01_Gratitude.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
    \366rk_The Music From Drawing Restraint 9_03_Ambergris March.mp3\n/home/
    dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
    Restraint 9_02_Pearl.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
    \366rk_The Music From Drawing Restraint 9_09_Bolographic Entrypoint.mp3\n/
    home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
    Restraint 9_08_Storm.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
    \366rk_The Music From Drawing Restraint 9_11_Antarctic Return.mp3\n/home/
    dstromberg/Sound/Music/mp3/Bjork/"..., 8192) = 1089

    ...and this is what it's trying to open:

    [pid 11252] open("/home/dstromberg/Sound/Music/mp3/Bjork/Bj�rk_The
    Music From Drawing Restraint 9_06_Shimenawa.mp3", O_RDONLY|O_LARGEFILE) =
    -1 ENOENT (No such file or directory)

    In case your newsreader unmunged that for you, the read has one non-ASCII
    byte for o+umlaut, and the open has 3 non-ASCII bytes for o+umlaut.
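
    One possible reading of those three bytes, which nothing in this thread
    confirms: if the single byte 0xF6 is decoded with a charset in which it
    is not valid (UTF-8 or US-ASCII), Java substitutes U+FFFD, and U+FFFD
    encoded as UTF-8 is three bytes. The class and method names below are
    made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementDemo {
    // Decode raw bytes with one charset, then re-encode with another.
    static byte[] decodeThenEncode(byte[] raw, Charset decodeWith, Charset encodeWith) {
        return new String(raw, decodeWith).getBytes(encodeWith);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] oUmlaut = { (byte) 0xF6 }; // octal \366, as seen in the read()

        // 0xF6 is not valid UTF-8, so decoding yields U+FFFD (ef bf bd in UTF-8).
        System.out.println(hex(decodeThenEncode(oUmlaut, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));
        // Decoded as ISO-8859-1 it is U+00F6, which is two bytes in UTF-8.
        System.out.println(hex(decodeThenEncode(oUmlaut, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));
        // Only ISO-8859-1 in both directions returns the original byte.
        System.out.println(hex(decodeThenEncode(oUmlaut, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1)));
    }
}
```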

    Any further suggestions, folks?
    Dan Stromberg, Jul 31, 2008
    #16
  17. Dan Stromberg

    Guest

    I found some good help with this over on OpenJDK's i18n-dev mailing
    list.

    It turns out that in java (and perhaps other languages with
    localization support) many locales do not guarantee correct round-trip
    conversion from 8 bit filenames to 16 bit and back to 8 bit - so
    you'll seem to get phantom files that seem to be there for one purpose
    but not another. en_US.ISO-8859-1 is one of the few that does make
    this guarantee - that is, no phantom files. I'd been trying that
    locale among a handful of others, but it wasn't working because I
    didn't have that locale configured on my system.
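
    The round-trip property can be checked per charset with a small test
    like the one below. It is only a sketch: it tests charsets directly
    rather than locales (how the JVM maps a locale to a filename charset is
    not shown here), and the class name RoundTrip is invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTrip {
    // True if every single byte value survives bytes -> String -> bytes.
    static boolean roundTripsAllBytes(Charset cs) {
        for (int b = 0; b < 256; b++) {
            byte[] original = { (byte) b };
            byte[] back = new String(original, cs).getBytes(cs);
            if (!Arrays.equals(original, back)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset[] candidates = {
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.US_ASCII
        };
        for (Charset cs : candidates) {
            System.out.println(cs.name() + ": " + roundTripsAllBytes(cs));
        }
    }
}
```

    ISO-8859-1 maps each of the 256 byte values to its own character, so it
    passes; UTF-8 and US-ASCII replace bytes they cannot decode and fail.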

    The python, perl and java versions of the program are now at
    http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html

    Thanks to all who took an interest in the project!

    On Jul 27, 3:54 pm, Dan Stromberg <> wrote:
    > I wrote a small java program to read filenames from stdin (produced by
    > Linux' "find"), and then to divide those files up into like groups.
    >
    > Actually, it was originally a python program, but I've been wanting to
    > expand my horizons a little, so I rewrote it in perl, and now I'm trying
    > to redo it in java to celebrate java going opensource, and I'll likely
    > rewrite it in Haskell and/or Objective Caml after the java version.
    >
    > The java version of the program seems to work pretty well, and I have a
    > feeling it's going to prove faster than the python or perl versions
    > (which are athttp://stromberg.dnsalias.org/~strombrg/equivalence-
    > classes.html - and I hope to put the java version there too after it's
    > working a little better).
    >
    > However, to my disappointment, the java version of the program can't seem
    > to deal with filenames that have umlauts in them.  Filenames using only
    > characters in the English alphabet seem fine.
    >
    > I suspect the problem is that the file_name_, as it appears in a Linux
    > ext3 filesystem, has an 8 bit per character representation, but java
    > wants to convert the string I read from stdin to a 16 bit per character
    > representation, and then doesn't reverse the conversion when I go to open
    > the file by its name.
    >
    > I've googled about this for around 4 hours now, and found little but
    > other people having similar issues - sometimes with files, sometimes with
    > files inside zip archives.
    >
    > The error looks like:
    >
    > find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US
    > java -jar equivs.jar equivs.main
    > Encoding on isr is ISO8859_1
    > IO error 1: java.io.FileNotFoundException: /home/dstromberg/Sound/Music/
    > mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No
    > such file or directory)
    > java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?
    > rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or
    > directory)
    >         at java.io.FileInputStream.open(Native Method)
    >         at java.io.FileInputStream.<init>(FileInputStream.java:106)
    >         at Sortable_file.get_prefix(Sortable_file.java:63)
    >         at Sortable_file.compareTo(Sortable_file.java:266)
    >         at Sortable_file.compareTo(Sortable_file.java:1)
    >         at java.util.Arrays.mergeSort(Arrays.java:1144)
    >         at java.util.Arrays.mergeSort(Arrays.java:1155)
    >         at java.util.Arrays.sort(Arrays.java:1079)
    >         at equivs.main(equivs.java:54)
    >
    > The code I'm reading filenames with looks like:
    >
    >       InputStreamReader isr = null;
    >       try
    >          {
    >          isr = (new InputStreamReader(System.in, "ISO-8859-1"));
    >          }
    >       catch (UnsupportedEncodingException uee)
    >          {
    >          System.err.println("UnsupportedEncodingException: " + uee);
    >          uee.printStackTrace();
    >          java.lang.System.exit(1);
    >          }
    >       System.err.println("Encoding on isr is " + isr.getEncoding());
    >       BufferedReader stdin = new BufferedReader (isr);
    >       String line;
    >
    >       try
    >          {
    >          while((line = stdin.readLine()) != null)
    >             {
    >             // System.out.println(line);
    >             // System.out.flush();
    >             lst.add(new Sortable_file(line));
    >             }
    >          }
    >       catch(java.io.IOException e)
    >          {
    >          System.err.println("IO error 0.5: " + e);
    >          e.printStackTrace();
    >          java.lang.System.exit(1);
    >          }
    >
    > ...and the code I'm opening the filenames with looks like:
    >
    >       byte[] buffer = new byte[128];
    >       java.io.File this_file;
    >       try
    >          {
    >          this_file = new java.io.File(this.filename);
    >          java.io.FileInputStream file = new java.io.FileInputStream(this_file);
    >          file.read(buffer);
    >          // System.out.println("this.prefix.length " + this.prefix.length);
    >          file.close();
    >          }
    >       catch (java.io.IOException ioe)
    >          {
    >          System.out.println( "IO error 1: " + ioe );
    >          ioe.printStackTrace();
    >          java.lang.System.exit(1);
    >          }
    >
    > (this is just one small part of the compareTo function - the goal was to
    > make things fast, and one of the optimizations is to compare just the
    > first 128 bytes of a file early in the comparison, and keep it cached in
    > memory to make the sort fast.  Only if two files have the same prefix do
    > we do the expensive md5 hash - etc.).
    >
    > Has anyone found a way to do:
    >
    > find <options> -print | ./java-prog
    >
    > ...and have java-prog act on the files coming from stdin - including
    > opening them?
    >
    > Thanks!
    >
    > PS: I suspect I could write a class to read bytes and piece together
    > strings, but 1) That'd probably be slow and 2) I want to use the
    > established java class hierarchy where possible and 3) the byte arrays
    > still might get upconverted to a different encoding upon converting them
    > to a string anyway.  But if that's the only way, that's fine.
    , Sep 14, 2008
    #17
  18. Dan Stromberg

    Roedy Green Guest

    On Sun, 27 Jul 2008 22:54:46 GMT, Dan Stromberg wrote, quoted or
    indirectly quoted someone who said:

    >
    >I suspect the problem is that the file_name_, as it appears in a Linux
    >ext3 filesystem, has an 8 bit per character representation, but java
    >wants to convert the string I read from stdin to a 16 bit per character
    >representation, and then doesn't reverse the conversion when I go to open
    >the file by its name.


    For background on your problem, see
    http://mindprod.com/jgloss/encoding.html

    I suggest you put your filenames in a file with UTF-8 encoding or some
    encoding that supports umlauts. Then read it with a Reader. See
    http://mindprod.com/applet/fileio.html for sample code.
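
    A minimal sketch of the "read it with a Reader" idea, assuming the
    list of names really is UTF-8. The class and method names
    (FilenameReader, readNames) are made up for illustration, and
    StandardCharsets needs Java 7 or later. Passing the charset
    explicitly means the result does not depend on the platform default:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original program.
final class FilenameReader {
    // Reads one filename per line, decoding the bytes with the given charset.
    static List<String> readNames(InputStream in, Charset charset) {
        List<String> names = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                names.add(line);
            }
        } catch (IOException e) {
            // Wrapped so callers do not have to declare a checked exception.
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

    It could be called as
    FilenameReader.readNames(System.in, StandardCharsets.UTF_8).
    Decoding correctly is only half of it, though: as far as I know, the
    JVM on Linux picks the encoding it uses for filenames from the locale
    it was started with, so the LANG setting still matters when the file
    is opened.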

    Alternatively, encode your umlauts in some unusual way for the
    console, e.g. u^, and convert them back.

    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Sep 16, 2008
    #18
  19. > I suggest you put your filenames in a file with UTF-8 encoding or some
    > encoding that supports umlauts. Then read it with a Reader. See
    > http://mindprod.com/applet/fileio.html for sample code.


    to the OP:

    My suggestion is that you "migrate" your system to UTF-8 by renaming
    all files whose names contain iso-8859-whatever umlauts to UTF-8
    encoded filenames, and by setting the system's LANG to something like
    de_AT.utf-8 or en_US.utf-8, or whatever applies to your location.

    When I did that a couple of years ago, I wrote a Tcl script to
    do the renaming. The script is available, but it isn't fool-proof
    (no GUI, no "usage:" screen), and it comes with no warranty
    whatsoever.
    Anyway, if you're still not scared or bored away, it's here:
    <http://www.logic.at/people/avl/stuff/convertNamesToUtf8.tcl>
    (tclsh should be available, if not preinstalled, on all Linux
    distributions.) Just go to the root of a tree that contains
    files with umlauts in their names and run the script from there,
    but of course only after having had a look at the script to verify
    it doesn't install a trojan.
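
    For what it's worth, the byte-level change such a rename makes to a
    name can be sketched in Java. This is not the Tcl script, only an
    illustration that assumes the old names are ISO-8859-1; the class and
    method names are made up, and it renames nothing on disk:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing only the re-encoding step of a rename.
final class NameTranscoder {
    // Interprets the bytes as ISO-8859-1 and re-encodes the text as UTF-8.
    static byte[] latin1ToUtf8(byte[] latin1Name) {
        String decoded = new String(latin1Name, StandardCharsets.ISO_8859_1);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }
}
```

    In ISO-8859-1 the "o" with umlaut is the single byte 0xF6; in UTF-8
    it becomes the two bytes 0xC3 0xB6, which is why a name written under
    one encoding doesn't match when looked up under the other.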
    Andreas Leitgeb, Sep 16, 2008
    #19