How does Java read character values greater than 128/255?

RC

I have the following Java code:

BufferedReader bufferedReader =
    new BufferedReader(new FileReader(inputfile_name));

int c;
while ((c = bufferedReader.read()) > -1 ) {
    if (c > (int)128) {
        System.err.println(
            (char)c + " " +
            c + " " +
            Integer.toOctalString(c) + " " +
            Integer.toHexString(c)
        );
    }
}
bufferedReader.close();

This works fine: it prints every character whose value is greater than
128.

Now I do the same in C:

if ((fp = fopen("inputfile_name", "r")) == NULL) {
    fprintf(stderr, "Can't open %s\n", argv[1]);
    exit(2);
}
int c;
while ((c = getc(fp)) != EOF) {
    if (c > 128) {
        printf("%c %d %o %x\n", c, c, c, c);
    }
}
fclose(fp);

But in C, reading the same file, nothing is printed: I never see a value
greater than 128.

I just wonder why. How does Java read those characters with values
greater than 128?
 
Lew

RC said:
int c;
while ((c = bufferedReader.read()) > -1 ) {
if (c > (int)128) {

128 is already an int, so casting it to int has no effect.

System.err.println(
(char)c + " " +
c + " " +
Integer.toOctalString(c) + " " +
Integer.toHexString(c)
);
}
}
bufferedReader.close();

This works fine: it prints every character whose value is greater than
128.

Now I do the same in C

if ((fp = fopen("inputfile_name", "r")) == NULL) {
fprintf(stderr, "Can't open %s\n", argv[1]);
exit(2);
}
int c;
while ((c = getc(fp)) != EOF) {

The C function getc() returns one byte at a time (an unsigned char value
converted to int), not a decoded 16-bit char value as Java's Reader.read() does.

if (c > 128) {
printf("%c %d %o %x\n", c, c, c, c);
}
}
fclose(fp);

But in C, reading the same file, nothing is printed: I never see a value
greater than 128.

I just wonder why. How does Java read those characters with values
greater than 128?

Java is likely not reading ASCII: FileReader decodes the file with the platform
default encoding, which may well be UTF-8. Have you tried the Java program
with an InputStreamReader whose encoding is set to "US-ASCII"?

For a fuller answer one would need to know the contents of the file.

Check out the API docs for java.io.InputStreamReader and java.nio.charset.Charset.
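
To make the suggestion above concrete, here is a minimal sketch of how the encoding passed to InputStreamReader changes what read() returns for the same bytes. The class name CharsetDemo, the helper readChars, and the three sample bytes are illustrative assumptions, not taken from the original file, whose contents are unknown.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CharsetDemo {
    // Decode the bytes with the given charset and collect every value Reader.read() returns.
    static List<Integer> readChars(byte[] data, Charset cs) {
        List<Integer> values = new ArrayList<>();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            int c;
            while ((c = reader.read()) != -1) {
                values.add(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed sample input: 'A' followed by U+00E9 encoded as UTF-8 (bytes 0xC3 0xA9).
        byte[] data = { 0x41, (byte) 0xC3, (byte) 0xA9 };

        // The same three bytes decode to different char sequences per charset.
        System.out.println("UTF-8:      " + readChars(data, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + readChars(data, StandardCharsets.ISO_8859_1));
        System.out.println("US-ASCII:   " + readChars(data, StandardCharsets.US_ASCII));
    }
}
```

With UTF-8 the two trailing bytes collapse into the single char 233, while ISO-8859-1 maps each byte to one char (195 and 169). To try this on a real file, a FileInputStream could be wrapped in place of the ByteArrayInputStream.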

- Lew
 
Oliver Wong

RC said:
But in C, reading the same file, nothing is printed: I never see a value
greater than 128.

I just wonder why. How does Java read those characters with values
greater than 128?

I think it's basically because C uses ASCII internally, while Java uses
a modified version of UTF-16 internally.

- Oliver
 
Mike Schilling

Oliver Wong said:
I think it's basically because C uses ASCII internally, while Java uses
a modified version of UTF-16 internally.

It's because the C code shown was reading bytes, while the Java code shown
was reading characters. Java that reads bytes, e.g.

InputStream strm;

int b = strm.read();

would never see anything outside the range [0..255] (apart from -1 at end of
stream), while C that reads "wide" characters, e.g.

wint_t c = getwc(stdin);

can see characters outside that range.
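
As a small sketch of the byte-reading case, the following shows what InputStream-style read() calls return for bytes with the high bit set. The class name ByteReadDemo, the helper readBytes, and the sample bytes are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;

public class ByteReadDemo {
    // Collect the int values that read() returns, one per byte, until end of stream (-1).
    static int[] readBytes(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int[] values = new int[data.length];
        int b;
        int i = 0;
        while ((b = in.read()) != -1) {
            values[i++] = b;
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed sample input: 'A' followed by two bytes above 127.
        byte[] data = { 0x41, (byte) 0xC3, (byte) 0xA9 };
        for (int value : readBytes(data)) {
            // read() yields 0..255; casting back to byte gives the signed view.
            System.out.println(value + " as byte: " + (byte) value);
        }
    }
}
```

The read() result is unsigned (195 for the byte 0xC3); only the cast to Java's signed byte type turns it into -61. No character decoding happens at this level.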
 
Timothy Bendfelt

These two bits of code do not do the same thing. The Java code has the
opportunity to use the file encoding, including multi-byte schemes
(e.g. UTF-8), to re-map bytes in the file stream to characters represented
as UTF-16 code units. The C code should just be consuming bytes and
returning them as unsigned char values.

Question: Do both of them read the same number of characters from the
stream?

Question: What does Java think your default file encoding and code page
is? You can force it to read US-ASCII or ISO-8859-1 (Latin-1) and run again.
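
Both questions can be checked with a few lines of Java. The sketch below prints the default charset and compares the byte count with the number of chars obtained under two encodings. The class name CountDemo, the helper countChars, and the fallback sample bytes are assumptions; pass a file path as the first argument to inspect a real file.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CountDemo {
    // Number of Java chars (UTF-16 code units) produced by decoding the bytes.
    static int countChars(byte[] data, Charset cs) {
        return new String(data, cs).length();
    }

    public static void main(String[] args) throws IOException {
        // Use the file named on the command line, or an assumed three-byte sample.
        byte[] data = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : new byte[] { 0x41, (byte) 0xC3, (byte) 0xA9 };

        System.out.println("default charset:     " + Charset.defaultCharset());
        System.out.println("bytes in input:      " + data.length);
        System.out.println("chars as UTF-8:      " + countChars(data, StandardCharsets.UTF_8));
        System.out.println("chars as ISO-8859-1: " + countChars(data, StandardCharsets.ISO_8859_1));
    }
}
```

If the char count under the default charset is smaller than the byte count, the Java program is combining several bytes into one character, which the byte-oriented C loop never does.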
 
Oliver Wong

Mike Schilling said:
Oliver Wong said:
I think it's basically because C uses ASCII internally, while Java
uses a modified version of UTF-16 internally.

It's because the C code shown was reading bytes, while the Java code shown
was reading characters. Java that reads bytes, e.g.

InputStream strm;

int b = strm.read();

would never see anything outside the range [0..255] (apart from -1 at end of
stream), while C that reads "wide" characters, e.g.

wint_t c = getwc(stdin);

can see characters outside that range.

I was referring to the language-built-in datatypes known as "char" in C
and "char" in Java. Both languages seem to assume that there is a finite
number of characters that will ever be used in computing (256 in the case of C,
65536 in the case of Java), and when they were shown wrong, libraries needed
to be added to support the extra characters.
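
The Java side of this point can be seen with a character beyond the 16-bit range. The sketch below is illustrative only; the class name SurrogateDemo, the helper utf16Units, and the choice of U+1F600 as the sample code point are assumptions.

```java
public class SurrogateDemo {
    // Number of Java char values (UTF-16 code units) needed for one Unicode code point.
    static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // assumed sample: a code point above U+FFFF
        String s = new String(Character.toChars(codePoint));

        // A single character occupies two chars (a surrogate pair) in a Java String.
        System.out.println("length():         " + s.length());
        System.out.println("codePointCount(): " + s.codePointCount(0, s.length()));
        System.out.println("high surrogate?   " + Character.isHighSurrogate(s.charAt(0)));
        System.out.println("units for 'A':    " + utf16Units('A'));
    }
}
```

Methods such as codePointCount and Character.toChars are the library additions that let code work with characters that do not fit in a single char.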

The OQ (Original Question) was informally phrased (e.g. contrasting C's
printing versus Java's reading -- I would further argue that Java doesn't
"read" characters at all in this scenario, but instead reads bytes, and then
does some behind the scenes conversions to characters), so I was sort of
guessing at what the OP was really asking.

- Oliver
 
Mike Schilling

Oliver Wong said:
Mike Schilling said:
Oliver Wong said:
But in C, reading the same file, nothing is printed: I never see a value
greater than 128.

I just wonder why. How does Java read those characters with values
greater than 128?

I think it's basically because C uses ASCII internally, while Java
uses a modified version of UTF-16 internally.

It's because the C code shown was reading bytes, while the Java code
shown was reading characters. Java that reads bytes, e.g.

InputStream strm;

int b = strm.read();

would never see anything outside the range [0..255] (apart from -1 at end of
stream), while C that reads "wide" characters, e.g.

wint_t c = getwc(stdin);

can see characters outside that range.

I was referring to the language-built-in datatypes known as "char" in C
and "char" in Java. Both languages seem to assume that there is a finite
number of characters that will ever be used in computing (256 in the case of
C, 65536 in the case of Java), and when they were shown wrong, libraries
needed to be added to support the extra characters.

Yes, but that's an apples-to-oranges comparison. Java has "byte" and "char"
for octets and character-set-members respectively. C has "char" and
"wchar_t" for those purposes. The confusion (if any) arises from the fact
that C and Java use the same name ("char") for two different things.

The OQ (Original Question) was informally phrased (e.g. contrasting C's
printing versus Java's reading -- I would further argue that Java doesn't
"read" characters at all in this scenario, but instead reads bytes, and
then does some behind the scenes conversions to characters),

Any language (or library) that handles multi-byte character sets has to do
the same.

so I was sort of guessing at what the OP was really asking.

I was too.
 
