reading filenames from stdin - with umlauts?

Dan Stromberg · Jul 27, 2008

I wrote a small java program to read filenames from stdin (produced by
Linux' "find"), and then to divide those files up into like groups.

Actually, it was originally a python program, but I've been wanting to
expand my horizons a little, so I rewrote it in perl, and now I'm trying
to redo it in java to celebrate java going opensource, and I'll likely
rewrite it in Haskell and/or Objective Caml after the java version.

The java version of the program seems to work pretty well, and I have a
feeling it's going to prove faster than the python or perl versions
(which are at http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html
- and I hope to put the java version there too after it's working a little
better).

However, to my disappointment, the java version of the program can't seem
to deal with filenames that have umlauts in them. Filenames using only
characters in the English alphabet seem fine.

I suspect the problem is that the file_name_, as it appears in a Linux
ext3 filesystem, has an 8 bit per character representation, but java
wants to convert the string I read from stdin to a 16 bit per character
representation, and then doesn't reverse the conversion when I go to open
the file by its name.

I've googled about this for around 4 hours now, and found little but
other people having similar issues - sometimes with files, sometimes with
files inside zip archives.

The error looks like:

find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US java -jar equivs.jar equivs.main
Encoding on isr is ISO8859_1
IO error 1: java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or directory)
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:106)
at Sortable_file.get_prefix(Sortable_file.java:63)
at Sortable_file.compareTo(Sortable_file.java:266)
at Sortable_file.compareTo(Sortable_file.java:1)
at java.util.Arrays.mergeSort(Arrays.java:1144)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.sort(Arrays.java:1079)
at equivs.main(equivs.java:54)

The code I'm reading filenames with looks like:

InputStreamReader isr = null;
try
{
isr = (new InputStreamReader(System.in, "ISO-8859-1"));
}
catch (UnsupportedEncodingException uee)
{
System.err.println("UnsupportedEncodingException: " + uee);
uee.printStackTrace();
java.lang.System.exit(1);
}
System.err.println("Encoding on isr is " + isr.getEncoding());
BufferedReader stdin = new BufferedReader (isr);
String line;

try
{
while((line = stdin.readLine()) != null)
{
// System.out.println(line);
// System.out.flush();
lst.add(new Sortable_file(line));
}
}
catch(java.io.IOException e)
{
System.err.println("IO error 0.5: " + e);
e.printStackTrace();
java.lang.System.exit(1);
}

...and the code I'm opening the filenames with looks like:

byte[] buffer = new byte[128];
java.io.File this_file;
try
{
    this_file = new java.io.File(this.filename);
    java.io.FileInputStream file = new java.io.FileInputStream(this_file);
    file.read(buffer);
    // System.out.println("this.prefix.length " + this.prefix.length);
    file.close();
}
catch (java.io.IOException ioe)
{
    System.out.println( "IO error 1: " + ioe );
    ioe.printStackTrace();
    java.lang.System.exit(1);
}

(this is just one small part of the compareTo function - the goal was to
make things fast, and one of the optimizations is to compare just the
first 128 bytes of a file early in the comparison, and keep it cached in
memory to make the sort fast. Only if two files have the same prefix do
we do the expensive md5 hash - etc.).
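A minimal sketch of the "cheap prefix first, expensive hash only on a tie" idea described above. The class and method names (PrefixThenHash, prefix, md5, probablySame) are hypothetical and are not taken from the Sortable_file class in the post; the sketch also does not cache the prefix in memory the way the real program is said to.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical helper, not the Sortable_file class from the post. */
final class PrefixThenHash {
    static final int PREFIX_LEN = 128;

    /** Reads up to PREFIX_LEN bytes from the start of the file. */
    static byte[] prefix(Path p) throws IOException {
        try (InputStream in = Files.newInputStream(p)) {
            byte[] buf = new byte[PREFIX_LEN];
            int n = 0;
            int r;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (n < buf.length && (r = in.read(buf, n, buf.length - n)) != -1) {
                n += r;
            }
            return Arrays.copyOf(buf, n);
        }
    }

    /** MD5 over the whole file; only called when the prefixes match. */
    static byte[] md5(Path p) throws IOException, NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(Files.readAllBytes(p));
    }

    /** Cheap comparison first, expensive hash only if the prefixes are equal. */
    static boolean probablySame(Path a, Path b)
            throws IOException, NoSuchAlgorithmException {
        if (!Arrays.equals(prefix(a), prefix(b))) {
            return false;
        }
        return Arrays.equals(md5(a), md5(b));
    }
}
```

Note that the loop around read() matters: a single file.read(buffer) call is allowed to return fewer than 128 bytes even when more are available.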

Has anyone found a way to do:

find <options> -print | ./java-prog

...and have java-prog act on the files coming from stdin - including
opening them?

Thanks!

PS: I suspect I could write a class to read bytes and piece together
strings, but 1) That'd probably be slow and 2) I want to use the
established java class hierarchy where possible and 3) the byte arrays
still might get upconverted to a different encoding upon converting them
to a string anyway. But if that's the only way, that's fine.

Stefan Ram · Jul 27, 2008

Dan Stromberg said:
The error looks like:

We need to isolate the problem (as in »SSCCE«).

Try this:

echo "\0344" | java Main

With

public class Main
{ public static void main( final java.lang.String[] args )
throws java.lang.Throwable
{ final java.io.InputStreamReader inputStreamReader
= new java.io.InputStreamReader( System.in, "ISO8859_1" );
final java.io.BufferedReader bufferedReader
= new java.io.BufferedReader( inputStreamReader );
final java.lang.String string = bufferedReader.readLine();
java.lang.System.out.println( "\u00E4".equals( string.substring( 0, 1 ))); }}

If it prints »false«, post the output of

echo "\0344" | od -h

and also the hexadecimal codes of the String »string« at the
end of the block above.

Additional information:

344 is the octal code of the letter LATIN SMALL LETTER A WITH DIAERESIS
in ISO 8859-1.

"\u00E4" is a Java String containing only the letter
LATIN SMALL LETTER A WITH DIAERESIS.
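One way to get "the hexadecimal codes of the String" asked for above is sketched below. The class name StringHex and the method toHex are made up for illustration; the reader setup mirrors the ISO-8859-1 decoding used in the test program.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/** Hypothetical debugging helper: prints the UTF-16 code units of a line. */
final class StringHex {
    /** Formats each char (UTF-16 code unit) of s as four hex digits. */
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, "ISO-8859-1"));
        String line = in.readLine();
        if (line != null) {
            System.out.println(toHex(line));
        }
    }
}
```

If the byte 0xE4 arrived intact and was decoded as ISO 8859-1, the first code unit printed should be 00E4.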

Arne Vajhøj · Jul 27, 2008

Dan said:
However, to my disappointment, the java version of the program can't seem
to deal with filenames that have umlauts in them. Filenames using only
characters in the English alphabet seem fine.

I suspect the problem is that the file_name_, as it appears in a Linux
ext3 filesystem, has an 8 bit per character representation, but java
wants to convert the string I read from stdin to a 16 bit per character
representation, and then doesn't reverse the conversion when I go to open
the file by its name.

I've googled about this for around 4 hours now, and found little but
other people having similar issues - sometimes with files, sometimes with
files inside zip archives.

The error looks like:

find /home/dstromberg/Sound/Music/mp3/Bjork -type f -print | LANG=en_US
java -jar equivs.jar equivs.main
Encoding on isr is ISO8859_1
IO error 1: java.io.FileNotFoundException: /home/dstromberg/Sound/Music/
mp3/Bjork/Bj?rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No
such file or directory)
java.io.FileNotFoundException: /home/dstromberg/Sound/Music/mp3/Bjork/Bj?
rk_The Music From Drawing Restraint 9_06_Shimenawa.mp3 (No such file or
directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:106)
at Sortable_file.get_prefix(Sortable_file.java:63)
at Sortable_file.compareTo(Sortable_file.java:266)
at Sortable_file.compareTo(Sortable_file.java:1)
at java.util.Arrays.mergeSort(Arrays.java:1144)
at java.util.Arrays.mergeSort(Arrays.java:1155)
at java.util.Arrays.sort(Arrays.java:1079)
at equivs.main(equivs.java:54)

The code I'm reading filenames with looks like:

InputStreamReader isr = null;
try
{
isr = (new InputStreamReader(System.in, "ISO-8859-1"));
}
catch (UnsupportedEncodingException uee)
{
System.err.println("UnsupportedEncodingException: " + uee);
uee.printStackTrace();
java.lang.System.exit(1);
}
System.err.println("Encoding on isr is " + isr.getEncoding());

Has anyone found a way to do:

find <options> -print | ./java-prog

...and have java-prog act on the files coming from stdin - including
opening them?

Have you tried "UTF-8" instead of "ISO-8859-1" ?

Arne

Stefan Ram · Jul 27, 2008

Dan Stromberg said:
isr = (new InputStreamReader(System.in, "ISO-8859-1")

Now, I become aware of another fact:
»java.lang.System.in« already has an encoding.

You might try to use this as a base instead:

http://download.java.net/jdk7/docs/api/java/io/FileDescriptor.html#in

Dan Stromberg · Jul 28, 2008

Stefan Ram said:
We need to isolate the problem (as in »SSCCE«).

Try this:

echo "\0344" | java Main

With

public class Main
{ public static void main( final java.lang.String[] args )
  throws java.lang.Throwable
  { final java.io.InputStreamReader inputStreamReader
    = new java.io.InputStreamReader( System.in, "ISO8859_1" );
    final java.io.BufferedReader bufferedReader
    = new java.io.BufferedReader( inputStreamReader );
    final java.lang.String string = bufferedReader.readLine();
    java.lang.System.out.println( "\u00E4".equals( string.substring( 0, 1 ))); }}

If it prints »false«, post the output of

echo "\0344" | od -h

It printed false, but this prints true:

printf '\344' | ./foo

(This was with the gcj implementation of java. I get the same result
using OpenJDK though).

Dan Stromberg · Jul 28, 2008

Arne Vajhøj said:
Have you tried "UTF-8" instead of "ISO-8859-1"?

Arne

I had tried a handful of encodings but not UTF-8. I've now tried it, and
found that I got the same result as with other encodings - file not found.

Dan Stromberg · Jul 28, 2008

Stefan Ram said:
Now, I become aware of another fact:
»java.lang.System.in« already has an encoding.

You might try to use this as a base instead:

http://download.java.net/jdk7/docs/api/java/io/FileDescriptor.html#in

I tried this but still I get file not found with OpenJDK. gcj seems fine
though:

FileReader fr = null;
// isr = (new InputStreamReader(System.in, "ISO-8859-1"));
// isr = (new InputStreamReader(System.in, "UTF-8"));
fr = (new FileReader(java.io.FileDescriptor.in));
System.err.println("Encoding on fr is " + fr.getEncoding());
//BufferedReader stdin = new BufferedReader (fr);
StringBuffer line;

char ch;
int int_char;
try
{
    while (true)
    {
        line = new StringBuffer("");
        while(true)
        {
            int_char = fr.read();
            if (int_char == -1)
            {
                break;
            }
            ch = (char)int_char;
            System.out.println("" + ch);
            if (ch == (char)10)
            {
                break;
            }
            line.append(ch);
        }
        if (int_char == -1)
        {
            break;
        }
        System.out.println(new String(line));
        lst.add(new Sortable_file(new String(line)));
    }
}
catch(java.io.IOException e)
{

BTW, this code says the encoding is ASCII when I run it, whether using
OpenJDK or gcj.

Is the java String type -always- 16 bits per character? That is, if I
try to stick an 8 bit value into a String, is it always going to be
converted to a different encoding that maps back most of the time, but
not always?

Do java strings of any sort have an associated but variable encoding?
Are there different string types that have different encodings?

Is there any way of opening a filename that isn't stored in a String?
Short of something like SWIG, JNI or ctypes that is?

Stefan Ram · Jul 28, 2008

Dan Stromberg said:
Is the java String type -always- 16 bits per character?

Yes (if we ignore surrogate pairs, which are rare and not
used for umlauts).

That is, if I try to stick an 8 bit value into a String, is it
always going to be converted to a different encoding that maps
back most of the time, but not always?

The Reader objects already take care to convert between
raw bytes and characters. Strings contain characters;
strictly speaking, they have no »encoding«. They might
be converted to/from byte[] or streams to en- or decode them.

Do java strings of any sort have an associated but variable encoding?

No. Ignoring surrogate pairs, a string is a sequence of
characters; the value of each character /always/ is the
corresponding Unicode code point.

Are there different string types that have different encodings?

No (for the strings of the standard class »java.lang.String«).

Is there any way of opening a filename that isn't stored in a String?

Not with the standard classes AFAIK.

~~

To debug, try this:

$mkdir d0
$touch d0/ä
$find d0 -name ä -print | od -h
0000000 6430 2fe4 0a00
0000005

If the filesystem uses ISO 8859-1, you should see »e4« as above
(»64302fe4« is »d0/ä«).

Then, read the output of this find from Java and debug print
it from Java to a sequence of hex codes.

If it is »64302fe4«, then you have read it correctly (ISO
8859-1 code points agree with Unicode code points here).
Otherwise, you might post here what it is instead.

You can also bypass the Reader class, read the »raw bytes«
from the stream, and use their hex dump to get an idea of the
apparent encoding of the stream (post the hexdump here).
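A minimal sketch of the "bypass the Reader" suggestion: it reads System.in as an InputStream, so no charset decoding is involved, and prints each byte as two hex digits. The class name RawHexDump and the helper hex are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical debugging helper: hex-dumps stdin without any decoding. */
final class RawHexDump {
    /** Formats the first len bytes of buf as space-separated hex pairs. */
    static String hex(byte[] buf, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF because Java bytes are signed.
            sb.append(String.format("%02x", buf[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = System.in; // raw bytes, no Reader in between
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {
            System.out.println(hex(buf, n));
        }
    }
}
```

It could be run as "find d0 -print | java RawHexDump"; a filename stored in ISO 8859-1 should then show a single e4 byte for ä, whereas a UTF-8 name would show c3 a4.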

Daniele Futtorovic · Jul 28, 2008

I had tried a handful of encodings but not UTF-8. I've now tried it, and
found that I got the same result as with other encodings - file not found.

Have you tried not using any "encoding"? As others pointed out,
System.in is a Reader, that is something which already has some kind of
byte-to-char handling. Furthermore, if your solution ought to be
portable, it would seem to me as a bad idea to hardcode the charset. You
should rather rely on proper system configuration (java's file.encoding
being the same as the shell's) -- or maybe a runtime parameter.
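A sketch of the "runtime parameter" idea, assuming the charset name is passed as the first command-line argument and the JVM default is used otherwise. The class name CharsetChoice and the method choose are hypothetical; Charset.forName and Charset.defaultCharset are standard library calls.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

/** Hypothetical example: pick the stdin charset at run time. */
final class CharsetChoice {
    /** Uses the named charset if one is given, otherwise the JVM default. */
    static Charset choose(String[] args) {
        return args.length > 0 ? Charset.forName(args[0]) : Charset.defaultCharset();
    }

    public static void main(String[] args) throws IOException {
        Charset cs = choose(args);
        System.err.println("default charset: " + Charset.defaultCharset());
        System.err.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.err.println("reading stdin as " + cs);
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in, cs));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
    }
}
```

Usage would look like "find . -print | java CharsetChoice ISO-8859-1". This only controls how stdin is decoded; how the JVM encodes a filename when opening it is a separate setting derived from the locale.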

Stefan Ram · Jul 28, 2008

Daniele Futtorovic said:
(java's file.encoding being the same as the shell's)

It is not always the same, for example under some versions of
»Microsoft® Windows«, the console Window uses »cp437«, which
is not the default encoding of java.lang.System.out there.

Also, FileReader is not recommended (by several
programmers), exactly because it uses a »default encoding«,
which not always is appropriate for the task at hand.

John W Kennedy · Jul 28, 2008

Dan said:
However, to my disappointment, the java version of the program can't seem
to deal with filenames that have umlauts in them. Filenames using only
characters in the English alphabet seem fine.

I suspect the problem is that the file_name_, as it appears in a Linux
ext3 filesystem, has an 8 bit per character representation, but java
wants to convert the string I read from stdin to a 16 bit per character
representation, and then doesn't reverse the conversion when I go to open
the file by its name.

No. Java /always/ uses 16-bit characters; if it did that, it couldn't
open files at all.

Try running this program:

import java.io.File;

public final class DirScan {

    public static void main(final String[] args) {
        for (final String dirName : args) {
            System.out.println(dirName);
            final File dir = new File(dirName);
            final File[] files = dir.listFiles();
            for (final File file : files) {
                final String fileName = file.toString();
                System.out.printf(" %-25s ", fileName);
                for (int i = 0; i < fileName.length(); ++i)
                    System.out.printf(" %04X", (int) fileName.charAt(i));
                System.out.println();
            }
        }

    }

}

...specifying one or more directories as arguments.

Daniele Futtorovic · Jul 28, 2008

<http://java.sun.com/javase/6/docs/api/java/lang/System.html#in>

Stefan Ram · Jul 28, 2008

Daniele Futtorovic said:
<scratches head, walks to the nearest wall, bangs>

My fault. It seems as if I would have assumed that there
is a symmetry between System.in and System.out.

A java.io.PrintStream really can have an encoding.

Stefan Ram · Jul 28, 2008

Daniele Futtorovic said:
<scratches head, walks to the nearest wall, bangs>

Still, allegedly java.lang.System.in sometimes /has/ some
transcoding magic in it (based on a native method).

For example:

»Data read from [...] System.in, [...] are handled
differently than data read from [...] other sources [...].

[A] conversion is performed by the JVM on the data to
convert from the normal character encoding of
file.encoding to a CCSID matching the System i job CCSID.

When System.in [...][is] redirected [...], this additional
data conversion is not performed and the data remains in a
character encoding matching file.encoding.«

http://publib.boulder.ibm.com/infocenter/iseries/v5r4/topic/rzaha/charenc.htm

Daniele Futtorovic · Jul 29, 2008

My fault. It seems as if I would have assumed that there
is a symmetry between System.in and System.out.

No, mine really -- I should know the class of System.in by heart --, as
well as accumulated frustration over too many mistakes in posts lately,
perplexing me. I hate making mistakes. Especially in public.

Still, allegedly java.lang.System.in sometimes /has/ some
transcoding magic in it (based on a native method).

For example:

»Data read from [...] System.in, [...] are handled
differently than data read from [...] other sources [...].

[A] conversion is performed by the JVM on the data to
convert from the normal character encoding of
file.encoding to a CCSID matching the System i job CCSID.

When System.in [...][is] redirected [...], this additional
data conversion is not performed and the data remains in a
character encoding matching file.encoding.«

http://publib.boulder.ibm.com/infocenter/iseries/v5r4/topic/rzaha/charenc.htm

This appears to be specific to the iSeries. I can't find any other
reference to System.in and encoding on the Sun site. Furthermore, the
fact that System.in is an InputStream speaks squarely against any type
of byte-to-char conversion (<=> "encoding"), doesn't it? Or should there
be some magic hidden in the JVM that decides whether the process' input
is text? I don't think that's likely. I don't even see why that
would be a good idea.

Dan Stromberg · Jul 30, 2008

Stefan Ram said:
Dan Stromberg said:
Is the java String type -always- 16 bits per character?

Yes (if we ignore surrogate pairs, which are rare and not used for
umlauts).

That is, if I try to stick an 8 bit value into a String, is it always
going to be converted to a different encoding that maps back most of the
time, but not always?

The Reader objects already take care to convert between raw bytes and
characters. Strings contain characters; strictly speaking, they have no
»encoding«. They might be converted to/from byte[] or streams to
en- or decode them.

Do java strings of any sort have an associated but variable encoding?

No. Ignoring surrogate pairs, a string is a sequence of characters;
the value of each character /always/ is the corresponding Unicode code
point.

Are there different string types that have different encodings?

No (for the strings of the standard class »java.lang.String«).

Is there any way of opening a filename that isn't stored in a String?

Not with the standard classes AFAIK.

~~

To debug, try this:

$mkdir d0
$touch d0/ä
$find d0 -name ä -print | od -h
0000000 6430 2fe4 0a00
0000005

If the filesystem uses ISO 8859-1, you should see »e4« as above
(»64302fe4« is »d0/ä«).

Then, read the output of this find from Java and debug print it from
Java to a sequence of hex codes.

If it is »64302fe4«, then you have read it correctly (ISO 8859-1 code
points agree with Unicode code points here). Otherwise, you might post
here what it is instead.

You can also bypass the Reader class, read the »raw bytes« from the
stream, and use their hex dump to get an idea of the apparent encoding
of the stream (post the hexdump here).

Often, at least on *ix, strace/truss/par/trace are a more direct route to
a solution than endless test programs.

I ran the OpenJDK version of my program under strace, and found that this
is what's being read:

[pid 11252] read(0, "/home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The
Music From Drawing Restraint 9_06_Shimenawa.mp3\n/home/dstromberg/Sound/
Music/mp3/Bjork/Bj\366rk_The Music From Drawing Restraint 9_10_Cetacea.mp3
\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_04_Bath.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_05_Hunter Vessel.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_01_Gratitude.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_03_Ambergris March.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_02_Pearl.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_09_Bolographic Entrypoint.mp3\n/
home/dstromberg/Sound/Music/mp3/Bjork/Bj\366rk_The Music From Drawing
Restraint 9_08_Storm.mp3\n/home/dstromberg/Sound/Music/mp3/Bjork/Bj
\366rk_The Music From Drawing Restraint 9_11_Antarctic Return.mp3\n/home/
dstromberg/Sound/Music/mp3/Bjork/"..., 8192) = 1089

...and this is what it's trying to open:

[pid 11252] open("/home/dstromberg/Sound/Music/mp3/Bjork/Bjï¿½rk_The
Music From Drawing Restraint 9_06_Shimenawa.mp3", O_RDONLY|O_LARGEFILE) =
-1 ENOENT (No such file or directory)

In case your newsreader unmunged that for you, the read has one non-ASCII
byte for o+umlaut, and the open has 3 non-ASCII bytes for o+umlaut.
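The three bytes in the open() call match EF BF BD, which is U+FFFD REPLACEMENT CHARACTER encoded as UTF-8. The sketch below shows one way a lone 0xF6 byte can turn into those three bytes. It assumes (this is a guess that fits the trace, not something the trace proves) that the input was decoded as UTF-8 and the filename was encoded as UTF-8 again on open. The class name ReplacementDemo is hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: an invalid UTF-8 byte becomes U+FFFD and grows to 3 bytes. */
final class ReplacementDemo {
    /** Decodes raw as UTF-8, then encodes the resulting String as UTF-8 again. */
    static byte[] decodeThenEncodeUtf8(byte[] raw) {
        // new String(byte[], Charset) replaces malformed input instead of throwing.
        String s = new String(raw, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xF6 is o-umlaut in ISO 8859-1, but on its own it is not valid UTF-8.
        byte[] out = decodeThenEncodeUtf8(new byte[] {(byte) 0xF6});
        StringBuilder sb = new StringBuilder();
        for (byte b : out) {
            sb.append(String.format("%02x ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // prints "ef bf bd"
    }
}
```

Once the original byte has been replaced this way, the information is gone, so no later conversion can recover the real filename.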

Any further suggestions, folks?

strombrg · Sep 14, 2008

I found some good help with this over on OpenJDK's i18n-dev mailing
list.

It turns out that in java (and perhaps other languages with
localization support) many locales do not guarantee correct round-trip
conversion from 8 bit filenames to 16 bit and back to 8 bit - so
you'll seem to get phantom files that seem to be there for one purpose
but not another. en_US.ISO-8859-1 is one of the few that does make
this guarantee - that is, no phantom files. I'd been trying that
locale among a handful of others, but it wasn't working because I
didn't have that locale configured on my system.
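A small sketch of what "round-trip" means here, checked at the level of Java charsets rather than locales. It decodes each of the 256 possible single bytes and encodes the result back, counting how many come back different. The class name RoundTrip and the method failures are hypothetical, and testing single bytes only is a simplification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical check: how many single bytes fail a decode/encode round trip? */
final class RoundTrip {
    static int failures(Charset cs) {
        int bad = 0;
        for (int b = 0; b < 256; b++) {
            byte[] in = {(byte) b};
            // Decode to a String, then encode back with the same charset.
            byte[] out = new String(in, cs).getBytes(cs);
            if (!Arrays.equals(in, out)) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + failures(StandardCharsets.ISO_8859_1)); // 0
        System.out.println("US-ASCII:   " + failures(StandardCharsets.US_ASCII));   // 128
        System.out.println("UTF-8:      " + failures(StandardCharsets.UTF_8));      // 128
    }
}
```

ISO 8859-1 maps every byte value to a character and back, which is why a filename made of arbitrary bytes survives the trip through a String under that encoding but not under ASCII or UTF-8.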

The python, perl and java versions of the program are now at
http://stromberg.dnsalias.org/~strombrg/equivalence-classes.html

Thanks to all who took an interest in the project!

Roedy Green · Sep 15, 2008

I suspect the problem is that the file_name_, as it appears in a Linux
ext3 filesystem, has an 8 bit per character representation, but java
wants to convert the string I read from stdin to a 16 bit per character
representation, and then doesn't reverse the conversion when I go to open
the file by its name.

For background on your problem, see
http://mindprod.com/jgloss/encoding.html

I suggest you put your filenames in a file with UTF-8 encoding or some
encoding that supports umlauts. Then read it with a Reader. See
http://mindprod.com/applet/fileio.html for sample code.
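A minimal sketch of that suggestion, assuming the list of names has been saved as UTF-8 in a file passed as the first argument (the file name is whatever the caller supplies; nothing here comes from the linked page). The class name NameList and the method readNames are hypothetical.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example: read one filename per line from a UTF-8 stream. */
final class NameList {
    static List<String> readNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<String>();
        BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = r.readLine()) != null) {
            names.add(line);
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be a UTF-8 text file with one name per line.
        try (InputStream in = new FileInputStream(args[0])) {
            for (String name : readNames(in)) {
                System.out.println(name);
            }
        }
    }
}
```

This fixes only the reading side: the names will open correctly only if the bytes on disk match what the JVM produces when it encodes the String back into a filename.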

Alternatively, encode your umlauts in some weird way for the console,
e.g. u^, and convert them back.

Andreas Leitgeb · Sep 16, 2008

Roedy Green said:
I suggest you put your filenames in a file with UTF-8 encoding or some
encoding that supports umlauts. Then read it with a Reader. See
http://mindprod.com/applet/fileio.html for sample code.

to the OP:

My suggestion is, that you "migrate" your system to utf-8, by renaming
all files with iso-8859-whatever umlauts to utf-8 encoded filenames,
and having system's LANG set to something like de_AT.utf-8 or
en_US.utf-8 or whatever applies to your location.

When I did that a couple of years ago, I wrote some TCL-script to
do the renaming. The script is available, but isn't optimized for
fool-proof usage. (no GUI, no "usage:"-screen). Also, no warranties
and whatsoever.
Anyway, (if still not scared/bored away) it's here:
<http://www.logic.at/people/avl/stuff/convertNamesToUtf8.tcl>
(tclsh should be available (if not preinstalled) on all linux-
distributions, anyway.) Just go to the root of a tree that contains
files with umlauts in their names, and run the script from there,
but of course only after having had a look at the script to verify
it doesn't install a trojan.
