Help on setting UTF8 character set

Sandy80

Hi,

I have a batch program that is used for formatting some data. The
program uses a Java class that takes a .xls file as input, formats
the data from that file and converts it into a .csv file.


The issue that I am facing is that some data in the .xls file is not
represented correctly in the .csv file. For example, data like "Dékán"
in the .xls sheet is represented like "DÃ©kÃ¡n" in the formatted
file.


I would like to know how to set the character set to UTF-8 in the Java
class so that the data is represented correctly even after the
formatting.


Any help would be appreciated.


Regards,
Sandy
 
Andrew Thompson

I have a ..

..habit of multi-posting? It might be best
to curb that habit real soon, or the people
who are likely to help, might become very
*unlikely* to help.

(X-post to c.l.j.p./h., w/ f-u to c.l.j.h. only)
 
Andreas Leitgeb

Sandy80 said:
The issue that I am facing is that some data in the .xls file is not
represented correctly in the .csv file. For example data like "Dékán"
in the .xls sheet is represented like "DÃ©kÃ¡n" in the formatted
file.
I wanted to know how to set the character set to UTF8 in the java
class so that it represents the data correctly even after the
formatting.

Your .csv text looks like correctly encoded UTF-8,
but your .xls text looks like some 8-bit code page encoding.

What made you think that the .csv was not correct UTF-8?
 
Sandy80

Because the data in the .xls file is like "Dékán". The same data comes
out like "DÃ©kÃ¡n" after formatting.
 
Andreas Leitgeb

Sandy80 said:
Because the data in .xls file is like "Dékán". The same data comes
like "DÃ©kÃ¡n" after formatting.

In that case, it's very likely that your .csv file *really* is proper
UTF-8, but you see garbage because the program you view it with does
not interpret it as UTF-8.

So perhaps you would rather have your .csv in Windows-1252 encoding, to
make it appear correct to you.
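
A minimal sketch of that suggestion, assuming the Java class already holds the CSV text as a String. The class and method names (CsvCharsetSketch, writeCsv) are made up for illustration and are not from the original program; the point is only that the charset is chosen where the Writer is created.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the poster's program.
final class CsvCharsetSketch {

    // Writes the given text to the target file using the requested charset.
    static void writeCsv(Path target, String content, Charset charset) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, charset)) {
            out.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("formatted", ".csv");
        // "D\u00e9k\u00e1n" is "Dékán", written with escapes so the source file encoding does not matter.
        writeCsv(target, "D\u00e9k\u00e1n\n", Charset.forName("windows-1252"));
        System.out.println("wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Passing StandardCharsets.UTF_8 instead would produce the UTF-8 file. Keep in mind that Windows-1252 can only represent a limited set of characters, so text outside that set cannot be written this way.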
 
Andy Dingley

The issue that I am facing is that some data in the .xls file is not
represented correctly in the .csv file. For example data like "Dékán"
in the .xls sheet is represented like "DÃ©kÃ¡n" in the formatted
file.

Your CSV looks like it is correct UTF-8, but the tool reading it
doesn't recognise or support this. "Ã©" isn't "two wrong characters",
it's "two correct _octets_ (or bytes, if you insist) that together
represent the encoding of one correct UTF-8 character". This is good.
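
To see the effect for yourself, here is a small sketch that encodes the text as UTF-8 and then deliberately decodes the same octets as Windows-1252, which is presumably what the viewing tool is doing. The names MojibakeDemo and misreadAsWindows1252 are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it reproduces the symptom, it does not fix anything.
final class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes with the wrong charset.
    static String misreadAsWindows1252(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // U+00E9 is the letter e with acute accent; UTF-8 stores it as the octets C3 A9.
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // prints 2

        // Each accented letter turns into two characters when misread.
        String garbled = misreadAsWindows1252("D\u00e9k\u00e1n");
        System.out.println(garbled.length()); // prints 7
    }
}
```

The data in the file is the same in both cases; only the charset used to interpret the octets differs.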

CSV is an ancient format, long pre-dating UTF-8. Tools old and
primitive enough to require CSV are unlikely to support UTF-8,
correctly if at all.

To give such tools a hint that the content is UTF-8 and should be
treated as such, you could try serialising it as UTF-8Y instead,
also known as "UTF-8 with BOM (Byte Order Mark)". This sticks three
more octets (EF BB BF) on the start of the file. UTF-8 has no byte
order ambiguity, so here the mark serves only as a signature that the
file is UTF-8 rather than plain ASCII or ISO-8859-*. _SOME_ reader
tools will see this as a hint to work in UTF-8 mode, rather than ASCII.

Note that this means a UTF-8Y file that doesn't use non-ASCII
characters is no longer compatible with simple ASCII or ISO-8859-*
encodings, where a UTF-8 file (no BOM) would have remained so. If all
of your files contain some non-ASCII anyway (i.e. you or your data
aren't purely British or American) then that compatibility wouldn't
have been useful anyway.
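
A rough sketch of writing "UTF-8 with BOM" from Java, under the assumption that the CSV text is available as a String. Java's UTF-8 encoder does not add a BOM by itself, so the sketch writes the character U+FEFF first, which the encoder turns into the three octets EF BB BF. The names Utf8BomSketch and writeWithBom are hypothetical.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the poster's program.
final class Utf8BomSketch {

    // Writes a BOM followed by the content, all encoded as UTF-8.
    static void writeWithBom(Path target, String content) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write('\uFEFF'); // encoded as EF BB BF
            out.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("formatted-bom", ".csv");
        writeWithBom(target, "D\u00e9k\u00e1n");
        // 3 octets of BOM plus 7 octets for the text itself.
        System.out.println(Files.size(target)); // prints 10
    }
}
```

Whether the reading tool actually honours the BOM is something you would have to test with that tool.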

On the whole, I've always had dismal success in trying to mix non-
ASCII and CSV. Anything with enough Clue to work isn't still stuck
with CSV.

To have an easy life with UTF-8 encoding issues, switch to XML. Then
it all just works and you have to actively do something dumb to break
it.
 