Reading file: ASCII 161 (meta-space) converted to question mark

Michael Muller

I'm trying to read an HTML file that has been generated by MS excel.
When I use od -c to examine this file, I see lots of octal 240
(decimal 161) chars. This is supposedly a "meta space", whatever that
means. When I read the file in on Windows, everything works ok (the
characters stay as 240), but when I read the file in on linux (RH9),
the "meta-spaces" are converted to question marks, rendering the html
unreadable.

My LANG envar on unix is set to us_ENG.UTF-8. On windows, it's not
set. I tried unsetting and exporting LANG on linux -- no joy.

Help! I'm using 1.4.2 on Linux and 1.4.1 on windows. I sure hope
that's not the issue. The code that reads the file is appended.

Thanks in advance for any help anyone can offer,

-- Mike

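// Requires: import java.io.*;
private static String slurp(File file)
    throws IOException
{
    StringBuffer sb = new StringBuffer();
    char[] buf = new char[1024 * 4];
    // FileReader decodes bytes using the platform's default encoding.
    BufferedReader br = new BufferedReader(new FileReader(file));
    try
    {
        int charsRead; // read() returns a count of chars, not bytes
        while ((charsRead = br.read(buf, 0, buf.length)) != -1)
        {
            sb.append(buf, 0, charsRead);
        }
    }
    finally
    {
        br.close(); // release the file handle even if reading fails
    }

    return sb.toString();
}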
 
Neomorph

> I'm trying to read an HTML file that has been generated by MS excel.
> When I use od -c to examine this file, I see lots of octal 240
> (decimal 161) chars.

That should be: 160 decimal.
You can't have an even octal number becoming an odd decimal number ;-)
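
The conversion is easy to check in code. The following is only a small sketch; the class name OctalCheck and the helper fromOctal are made up for illustration. It parses the octal digits reported by od and prints the same value in decimal and in hex.

```java
public class OctalCheck {
    // Parse a string of octal digits (as printed by od -c) into an int.
    static int fromOctal(String octalDigits) {
        return Integer.parseInt(octalDigits, 8);
    }

    public static void main(String[] args) {
        int value = fromOctal("240");
        System.out.println(value);                      // prints 160
        System.out.println(Integer.toHexString(value)); // prints a0
    }
}
```

Since 8 is even, an octal number whose last digit is even is always an even number, which is why 161 could not have been right.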

> This is supposedly a "meta space", whatever that
> means.

It is usually used to 'connect' two words so that they are not split when
text is rewrapped, like the non-breaking space in HTML (coded as &nbsp;).

> When I read the file in on Windows, everything works ok (the
> characters stay as 240), but when I read the file in on linux (RH9),
> the "meta-spaces" are converted to question marks, rendering the html
> unreadable.

HTML should either contain only US-ASCII (32-127) or declare the
codepage/encoding it uses.
You should be replacing the 0240 (octal) with the code &nbsp; as long as
it's not part of a parameter value.
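
A minimal sketch of that replacement, assuming the file has already been decoded correctly so that the byte 0240 has become the char U+00A0 in the String. The class name NbspEscaper and the method escapeNbsp are hypothetical. This version replaces every occurrence and does not try to detect whether the character sits inside a parameter value, so that caveat still needs handling separately.

```java
public class NbspEscaper {
    // Replace each no-break space (U+00A0, octal 0240) with the HTML entity.
    // Assumption: the input was decoded with the right charset beforehand.
    static String escapeNbsp(String html) {
        StringBuffer out = new StringBuffer(html.length());
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '\u00A0') {
                out.append("&nbsp;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNbsp("10\u00A0kg")); // prints 10&nbsp;kg
    }
}
```

The loop avoids String.replace(CharSequence, CharSequence), which does not exist in the 1.4 releases mentioned in the question.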

> My LANG envar on unix is set to us_ENG.UTF-8. On windows, it's not
> set. I tried unsetting and exporting LANG on linux -- no joy.

Either way, the Linux font probably has no glyph mapped to that character
code.
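
Another possible cause, separate from fonts, is the decoding step itself. FileReader uses the platform's default encoding, and under UTF-8 a lone 0240 byte is not a valid sequence, so the decoder substitutes the replacement character U+FFFD, which often displays as a question mark. The sketch below assumes the Excel output is in a single-byte Western encoding such as ISO-8859-1 or windows-1252; that is a guess, and the charset declared in the file's own meta tag is the thing to check. The class name SlurpWithCharset and the extra charsetName parameter are not from the original code.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.Reader;

public class SlurpWithCharset {
    // Read a whole file, decoding its bytes with an explicitly named charset
    // instead of the platform default that FileReader would use.
    static String slurp(File file, String charsetName) throws IOException {
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[1024 * 4];
        Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charsetName));
        try {
            int charsRead;
            while ((charsRead = in.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, charsRead);
            }
        } finally {
            in.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny sample file containing the raw byte 0xA0 (octal 0240).
        File sample = File.createTempFile("nbsp", ".html");
        sample.deleteOnExit();
        OutputStream out = new FileOutputStream(sample);
        out.write(new byte[] { 'a', (byte) 0xA0, 'b' });
        out.close();

        String latin1 = slurp(sample, "ISO-8859-1");
        String utf8 = slurp(sample, "UTF-8");
        System.out.println(Integer.toHexString(latin1.charAt(1))); // prints a0
        System.out.println(Integer.toHexString(utf8.charAt(1)));   // prints fffd
    }
}
```

If this is what is happening, passing the file's real encoding to the reader should give the same String on Windows and Linux regardless of LANG.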

Cheers.
 