Charset problem


winlin

Hi,

I am creating a PDF from the output received from a servlet. There are
special Swiss/German accented characters. One such character's output
from the servlet is received as "&#xE1;", for which the Unicode escape
is "\u00E1".

What I do here is replace all occurrences of "&#xE1;" with \u00E1, and
thus it displays properly. However, this only happens on Windows. When
I try to do the same thing on Linux machines, it gives me garbage
characters. Those garbage characters look like they are from the KOI8
character set.

Can anyone help me please?
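For clarity, the replacement being described can be sketched in Java. This is an illustrative snippet, not the actual servlet code; `String.replace(CharSequence, CharSequence)` is Java 5+, and on 1.4 a manual loop is needed (like the one in the program posted later in this thread).

```java
public class EntityReplaceSketch {
    public static void main(String[] args) {
        // Hypothetical servlet output containing the SGML numeric entity:
        String fromServlet = "St&#xE1;dt";
        // Replace the entity text with the actual Unicode character U+00E1:
        String fixed = fromServlet.replace("&#xE1;", "\u00E1");
        // 'fixed' now contains the literal character á in place of the entity.
        System.out.println(fixed);
    }
}
```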
 

winlin

Without more details, I can only guess that maybe your Linux box does
not have the right locales installed.

Hi,
I checked the locales and they all show up. On issuing the locale
command I can see the LANG env variable set to 'en_US.UTF-8'. I tried
changing it to de_DE.UTF-8 with no success.

If you need some other details please let me know.

# output for locale -a (gives all the locales installed)
af_ZA
af_ZA.iso88591
an_ES
an_ES.iso885915
ar_AE
ar_AE.iso88596
ar_AE.utf8
ar_BH
ar_BH.iso88596
ar_BH.utf8
ar_DZ
ar_DZ.iso88596
ar_DZ.utf8
ar_EG
ar_EG.iso88596
ar_EG.utf8
ar_IN
ar_IN.utf8
ar_IQ
ar_IQ.iso88596
ar_IQ.utf8
ar_JO
ar_JO.iso88596
ar_JO.utf8
ar_KW
ar_KW.iso88596
ar_KW.utf8
ar_LB
ar_LB.iso88596
ar_LB.utf8
ar_LY
ar_LY.iso88596
ar_LY.utf8
ar_MA
ar_MA.iso88596
ar_MA.utf8
ar_OM
ar_OM.iso88596
ar_OM.utf8
ar_QA
ar_QA.iso88596
ar_QA.utf8
ar_SA
ar_SA.iso88596
ar_SA.utf8
ar_SD
ar_SD.iso88596
ar_SD.utf8
ar_SY
ar_SY.iso88596
ar_SY.utf8
ar_TN
ar_TN.iso88596
ar_TN.utf8
ar_YE
ar_YE.iso88596
ar_YE.utf8
be_BY
be_BY.cp1251
be_BY.utf8
bg_BG
bg_BG.cp1251
bg_BG.utf8
bokmal
bokmål
br_FR
br_FR.iso88591
bs_BA
bs_BA.iso88592
C
ca_ES
ca_ES@euro
ca_ES.iso88591
ca_ES.iso885915@euro
ca_ES.utf8
ca_ES.utf8@euro
catalan
croatian
cs_CZ
cs_CZ.iso88592
cs_CZ.utf8
cy_GB
cy_GB.iso885914
cy_GB.utf8
czech
da_DK
da_DK.iso88591
da_DK.iso885915
da_DK.utf8
danish
dansk
de_AT
de_AT@euro
de_AT.iso88591
de_AT.iso885915@euro
de_AT.utf8
de_AT.utf8@euro
de_BE
de_BE@euro
de_BE.iso88591
de_BE.iso885915@euro
de_BE.utf8
de_BE.utf8@euro
de_CH
de_CH.iso88591
de_CH.utf8
de_DE
de_DE@euro
de_DE.iso88591
de_DE.iso885915@euro
de_DE.utf8
de_DE.utf8@euro
de_LU
de_LU@euro
de_LU.iso88591
de_LU.iso885915@euro
de_LU.utf8
de_LU.utf8@euro
deutsch
dutch
eesti
el_GR
el_GR.iso88597
el_GR.utf8
en_AU
en_AU.iso88591
en_AU.utf8
en_BW
en_BW.iso88591
en_BW.utf8
en_CA
en_CA.iso88591
en_CA.utf8
en_DK
en_DK.iso88591
en_DK.utf8
en_GB
en_GB.iso88591
en_GB.iso885915
en_GB.utf8
en_HK
en_HK.iso88591
en_HK.utf8
en_IE
en_IE@euro
en_IE.iso88591
en_IE.iso885915@euro
en_IE.utf8
en_IE.utf8@euro
en_IN
en_IN.utf8
en_NZ
en_NZ.iso88591
en_NZ.utf8
en_PH
en_PH.iso88591
en_PH.utf8
en_SG
en_SG.iso88591
en_SG.utf8
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8
en_ZA
en_ZA.iso88591
en_ZA.utf8
en_ZW
en_ZW.iso88591
en_ZW.utf8
es_AR
es_AR.iso88591
es_AR.utf8
es_BO
es_BO.iso88591
es_BO.utf8
es_CL
es_CL.iso88591
es_CL.utf8
es_CO
es_CO.iso88591
es_CO.utf8
es_CR
es_CR.iso88591
es_CR.utf8
es_DO
es_DO.iso88591
es_DO.utf8
es_EC
es_EC.iso88591
es_EC.utf8
es_ES
es_ES@euro
es_ES.iso88591
es_ES.iso885915@euro
es_ES.utf8
es_ES.utf8@euro
es_GT
es_GT.iso88591
es_GT.utf8
es_HN
es_HN.iso88591
es_HN.utf8
es_MX
es_MX.iso88591
es_MX.utf8
es_NI
es_NI.iso88591
es_NI.utf8
es_PA
es_PA.iso88591
es_PA.utf8
es_PE
es_PE.iso88591
es_PE.utf8
es_PR
es_PR.iso88591
es_PR.utf8
es_PY
es_PY.iso88591
es_PY.utf8
es_SV
es_SV.iso88591
es_SV.utf8
estonian
es_US
es_US.iso88591
es_US.utf8
es_UY
es_UY.iso88591
es_UY.utf8
es_VE
es_VE.iso88591
es_VE.utf8
et_EE
et_EE.iso88591
et_EE.utf8
eu_ES
eu_ES@euro
eu_ES.iso88591
eu_ES.iso885915@euro
eu_ES.utf8
eu_ES.utf8@euro
fa_IR
fa_IR.utf8
fi_FI
fi_FI@euro
fi_FI.iso88591
fi_FI.iso885915@euro
fi_FI.utf8
fi_FI.utf8@euro
finnish
fo_FO
fo_FO.iso88591
fo_FO.utf8
français
fr_BE
fr_BE@euro
fr_BE.iso88591
fr_BE.iso885915@euro
fr_BE.utf8
fr_BE.utf8@euro
fr_CA
fr_CA.iso88591
fr_CA.utf8
fr_CH
fr_CH.iso88591
fr_CH.utf8
french
fr_FR
fr_FR@euro
fr_FR.iso88591
fr_FR.iso885915@euro
fr_FR.utf8
fr_FR.utf8@euro
fr_LU
fr_LU@euro
fr_LU.iso88591
fr_LU.iso885915@euro
fr_LU.utf8
fr_LU.utf8@euro
ga_IE
ga_IE@euro
ga_IE.iso88591
ga_IE.iso885915@euro
ga_IE.utf8
ga_IE.utf8@euro
galego
galician
german
gl_ES
gl_ES@euro
gl_ES.iso88591
gl_ES.iso885915@euro
gl_ES.utf8
gl_ES.utf8@euro
greek
gv_GB
gv_GB.iso88591
gv_GB.utf8
hebrew
he_IL
he_IL.iso88598
he_IL.utf8
hi_IN
hi_IN.utf8
hr_HR
hr_HR.iso88592
hr_HR.utf8
hrvatski
hu_HU
hu_HU.iso88592
hu_HU.utf8
hungarian
icelandic
id_ID
id_ID.iso88591
id_ID.utf8
is_IS
is_IS.iso88591
is_IS.utf8
italian
it_CH
it_CH.iso88591
it_CH.utf8
it_IT
it_IT@euro
it_IT.iso88591
it_IT.iso885915@euro
it_IT.utf8
it_IT.utf8@euro
iw_IL
iw_IL.iso88598
iw_IL.utf8
ja_JP
ja_JP.eucjp
ja_JP.ujis
ja_JP.utf8
japanese
japanese.euc
ka_GE
ka_GE.georgianps
kl_GL
kl_GL.iso88591
kl_GL.utf8
ko_KR
ko_KR.euckr
ko_KR.utf8
korean
korean.euc
kw_GB
kw_GB.iso88591
kw_GB.utf8
lithuanian
lo_LA
lo_LA.utf8
lt_LT
lt_LT.iso885913
lt_LT.utf8
lv_LV
lv_LV.iso885913
lv_LV.utf8
mi_NZ
mi_NZ.iso885913
mk_MK
mk_MK.iso88595
mk_MK.utf8
mr_IN
mr_IN.utf8
ms_MY
ms_MY.iso88591
ms_MY.utf8
mt_MT
mt_MT.iso88593
mt_MT.utf8
nb_NO
nb_NO.ISO-8859-1
nl_BE
nl_BE@euro
nl_BE.iso88591
nl_BE.iso885915@euro
nl_BE.utf8
nl_BE.utf8@euro
nl_NL
nl_NL@euro
nl_NL.iso88591
nl_NL.iso885915@euro
nl_NL.utf8
nl_NL.utf8@euro
nn_NO
nn_NO.iso88591
nn_NO.utf8
no_NO
no_NO.iso88591
no_NO.utf8
norwegian
nynorsk
oc_FR
oc_FR.iso88591
pl_PL
pl_PL.iso88592
pl_PL.utf8
polish
portuguese
POSIX
pt_BR
pt_BR.iso88591
pt_BR.utf8
pt_PT
pt_PT@euro
pt_PT.iso88591
pt_PT.iso885915@euro
pt_PT.utf8
pt_PT.utf8@euro
romanian
ro_RO
ro_RO.iso88592
ro_RO.utf8
ru_RU
ru_RU.iso88595
ru_RU.koi8r
ru_RU.utf8
russian
ru_UA
ru_UA.koi8u
ru_UA.utf8
se_NO
se_NO.utf8
sk_SK
sk_SK.iso88592
sk_SK.utf8
slovak
slovene
slovenian
sl_SI
sl_SI.iso88592
sl_SI.utf8
spanish
sq_AL
sq_AL.iso88591
sq_AL.utf8
sr_YU
sr_YU@cyrillic
sr_YU.iso88592
sr_YU.iso88595@cyrillic
sr_YU.utf8
sr_YU.utf8@cyrillic
sv_FI
sv_FI@euro
sv_FI.iso88591
sv_FI.iso885915@euro
sv_FI.utf8
sv_FI.utf8@euro
sv_SE
sv_SE.iso88591
sv_SE.iso885915
sv_SE.utf8
swedish
ta_IN
ta_IN.utf8
te_IN
te_IN.utf8
tg_TJ
tg_TJ.koi8t
thai
th_TH
th_TH.tis620
th_TH.utf8
tl_PH
tl_PH.iso88591
tr_TR
tr_TR.iso88599
tr_TR.utf8
turkish
uk_UA
uk_UA.koi8u
uk_UA.utf8
ur_PK
ur_PK.utf8
uz_UZ
uz_UZ.iso88591
vi_VN
vi_VN.tcvn
vi_VN.utf8
wa_BE
wa_BE@euro
wa_BE.iso88591
wa_BE.iso885915@euro
yi_US
yi_US.cp1255
zh_CN
zh_CN.gb18030
zh_CN.gb2312
zh_CN.gbk
zh_CN.utf8
zh_HK
zh_HK.big5hkscs
zh_HK.utf8
zh_TW
zh_TW.big5
zh_TW.euctw
zh_TW.utf8
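The installed locales look complete. What matters to Java, though, is the default charset the JVM derives from the environment at startup; that is what every no-argument `String.getBytes()` and `new String(byte[])` call uses. A quick way to check it (`Charset.defaultCharset()` is Java 5+; on 1.4 only the `file.encoding` property is available):

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // The value the JVM picked up from LANG/LC_* (or the Windows ANSI codepage):
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
    }
}
```

Running this on both machines will show whether the two JVMs disagree (e.g. Cp1252 on Windows vs UTF-8 on Linux).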
 

magloca

winlin @ Thursday 28 August 2008 08:10:
Hi,

I am creating a PDF from the output received from a servlet. There are
special Swiss/German accented characters. One such character's output
from the servlet is received as "&#xE1;", for which the Unicode escape
is "\u00E1".

What I do here is replace all occurrences of "&#xE1;" with \u00E1, and
thus it displays properly. However, this only happens on Windows. When
I try to do the same thing on Linux machines, it gives me garbage
characters. Those garbage characters look like they are from the KOI8
character set.

Can anyone help me please?

Technically, KOI-8 isn't a character set; it's a character encoding. But
I assume what you mean is that Cyrillic characters appear in the
output. Since the KOI-8 encoding (as well as Windows-1251, BTW) assigns
Cyrillic characters to the byte values that in Unicode (and ISO Latin-1
et al.) map to the accented characters you want, it seems likely
that whatever it is you're using to generate the PDFs gets confused
about what encoding is in effect. Maybe you could tell us what PDF
generator you're using.

m.
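The point above can be demonstrated directly: take the UTF-8 bytes of "á" and decode them with KOI8-R, and Cyrillic garbage comes out. A minimal sketch (modern Java shown for brevity; on 1.4 use the String-named overloads like `new String(bytes, "KOI8-R")`):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String s = "\u00E1";                               // á
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);  // 0xC3 0xA1

        // Decoding correct UTF-8 bytes with the wrong charset produces
        // exactly the kind of garbage described in this thread:
        String wrong = new String(utf8, Charset.forName("KOI8-R"));
        System.out.println(wrong); // a Cyrillic letter plus an unrelated character
    }
}
```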
 

Andy Dingley

I am creating a PDF from the output received from a servlet. There are
special Swiss/German accented characters. One such character's output
from the servlet is received as "&#xE1;", for which the Unicode escape
is "\u00E1".

The "Unicode" for this is just U+00E1, any way you wish to represent
it. The difference between the two examples you quote is that they're
different syntactic representations for this same Unicode, one for
SGML / HTML / XML and the other for Java.

There's also the question of "encoding": how to represent characters
as a stream of bytes or octets. This isn't a problem here because both
of the forms you describe use a syntactic escaping mechanism at an
even higher level, such that Unicode characters can be represented in
an encoding (like ASCII's encoding) that doesn't support that
character. It's good practice to use UTF-8 encoding throughout (if
you can enforce it on the rest of the team, stuff starts to "just
work"); however, this isn't always feasible, owing to limitations of
some tools. Java properties files are one example.

HTML and Java always use Unicode characters for these numeric
entities, no matter what the encoding.


What I do here is replace all occurrences of "&#xE1;" with \u00E1, and
thus it displays properly.

That's to go from HTML to Java. Same character set (i.e. Unicode) and
the overall encoding doesn't matter because you're not dependent on it
(while these characters are wrapped up as numeric entities).
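The encoding-independence of numeric entities is easy to verify: the entity text is plain ASCII and survives any byte encoding, while the raw character does not even fit in ASCII. A sketch (modern Java for brevity):

```java
import java.nio.charset.StandardCharsets;

public class EntityVsRawChar {
    public static void main(String[] args) {
        // The numeric entity is pure ASCII, so it round-trips through any encoding:
        byte[] entityBytes = "&#xE1;".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(entityBytes, StandardCharsets.US_ASCII)); // &#xE1;

        // The raw character is unmappable in ASCII; getBytes() silently
        // substitutes '?' for it:
        byte[] rawBytes = "\u00E1".getBytes(StandardCharsets.US_ASCII);
        System.out.println((char) rawBytes[0]); // ?
    }
}
```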

However, this only happens on windows. When I try to do the same thing
on Linux machines, it gives me garbage characters.

That sounds like you're generating the right content, with correctly
encoded characters (probably as UTF-8), but the servlet is mis-
labelling this encoding as something else. Very easily done, and the
most common error of this type. However, your particular results
suggest the mis-labelling would be as KOI, which sounds unlikely.

Alternatively, the encoding process is broken (rare, but possible).
Your Unicode characters are being pulled out of their safe references
and converted to encoded characters, which are then getting mangled.
When looked at as UTF-8, their mangled remains look like a radically
different set of characters, i.e. KOI.

It's hard to diagnose this stuff. What you really need is a clear
understanding of the concepts and of your workflow; then check each
step and ensure that it's valid in each intermediate format (i.e. the
content always matches the encoding it was created with and the
encoding it's read with). Life is also simpler if you ignore ISO-8859-*
in favour of consistent Unicode / UTF-8 throughout.
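One cheap per-step check along those lines: if a byte stream was really produced with the charset you believe it was, decoding and re-encoding with that charset must reproduce the bytes exactly, with no replacement characters appearing. A sketch (it catches invalid byte sequences, though not cases where two single-byte charsets both happen to accept the same bytes):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripCheck {
    static boolean roundTrips(byte[] bytes, Charset cs) {
        String decoded = new String(bytes, cs);
        // U+FFFD is the replacement character inserted for undecodable bytes.
        return decoded.indexOf('\uFFFD') < 0
                && Arrays.equals(bytes, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00E1".getBytes(StandardCharsets.UTF_8);
        System.out.println(roundTrips(utf8, StandardCharsets.UTF_8));    // true
        System.out.println(roundTrips(utf8, StandardCharsets.US_ASCII)); // false
    }
}
```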


Wikipedia is quite readable on these topics.
 

winlin

Technically, KOI-8 isn't a character set; it's a character encoding. But
I assume what you mean is that Cyrillic characters appear in the
output. Since the KOI-8 encoding (as well as Windows-1251, BTW) assigns
Cyrillic characters to the byte values that in Unicode (and ISO Latin-1
et al.) map to the accented characters you want, it seems likely
that whatever it is you're using to generate the PDFs gets confused
about what encoding is in effect. Maybe you could tell us what PDF
generator you're using.

m.

Hi All,

First of all, thank you all for the effort you are taking to help
me out.
I have reduced the problem to a small program, which gives me
different output on Windows and Linux (running the same version of
Java, 1.4.2_16). The output on Windows shows the actual character as
expected; however, on Linux it shows garbage output, probably using
the KOI8-R encoding.

Please see if the program helps you get to the bottom of the problem.
I also read in the documentation of Character (version 5.0) that String
and char arrays use UTF-16 encoding, and I hope that's not a problem.

import java.io.UnsupportedEncodingException;

public class TestCharSet {

    /**
     * Default constructor.
     */
    public TestCharSet() {
        super();
    }

    /**
     * @param args
     * @throws UnsupportedEncodingException
     */
    public static void main( String[] args ) throws UnsupportedEncodingException {
        //System.out.println("Special Characters:" + "á â ä è é ê ë ï ò ó ö ú ü") ;
        String stateProvince = "&#xE1;"; //This is the character á
        System.out.println("State Province before conversion : " + stateProvince) ;
        String stateProvince_post = unescapeXMLSpecialCharacters( stateProvince );
        System.out.println("State Province after conversion : " + stateProvince_post) ;
    }

    /**
     * Replaces all occurrences of the substring in the data string with the
     * replacement string.
     *
     * @param data the string to check.
     * @param substring the substring to replace.
     * @param replacement the string the substring is replaced with.
     * @return the result of the replacement(s).
     */
    // @PMD:REVIEWED:AvoidReassigningParameters: Bajrang Gupta
    private static String replace( String data, final String substring,
            final String replacement ) {
        int index = data.indexOf( substring, 0 ) ;
        while ( index >= 0 ) {
            data = data.substring( 0, index ) + replacement
                    + data.substring( index + substring.length() ) ;
            index += replacement.length() ;
            index = data.indexOf( substring, index ) ;
        }
        return data ;
    }

    /**
     * Replaces the XML numeric entity "&#xE1;" in the string with the
     * literal character \u00E1 and returns the unescaped string.
     *
     * @param xmlData the data string to unescape.
     * @return the unescaped variant of the xml data.
     */
    public static String unescapeXMLSpecialCharacters( String xmlData )
            throws UnsupportedEncodingException {
        xmlData = replace( xmlData, "&#xE1;", "\u00E1" ) ;
        return xmlData ;
    }
}
 

Andy Dingley

        String stateProvince = "&#xE1;"; //This is the character á

It isn't.

It's an SGML(and friends)-only numeric entity that represents that
character in an SGML context. It makes no sense in Java or text files
(it's valid, it just doesn't mean anything).

It also _only_ works as "&#xE1;" in SGML. It doesn't work as
"&amp;#xE1;" any more (that should render to the literal string
"&#xE1;"). If you have "SGML" content that uses entities, then you
not only shouldn't but in fact must not run an SGML-entity-encoder
over them. Entity encoding like this isn't idempotent. Do it to
things that are either already entity-encoded, or are a deliberate use
of entities, and you'll break stuff (the symptom is that you see
"entities" appearing in the browser).
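The non-idempotence is easy to see with a minimal escaper (illustrative only, handling just '&' and '<'):

```java
public class DoubleEscapeDemo {
    // Minimal, illustrative entity encoder -- not a full SGML escaper.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        String entity = "&#xE1;";
        System.out.println(escape(entity));         // &amp;#xE1; -- now renders as literal text
        System.out.println(escape(escape(entity))); // &amp;amp;#xE1; -- worse on every pass
    }
}
```

Run it once over content that already contains entities and the browser (or PDF) shows "&#xE1;" as text instead of á, which matches the symptom described above.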
 

winlin

It isn't.

It's an SGML(and friends)-only numeric entity that represents that
character in an SGML context. It makes no sense in Java or text files
(it's valid, it just doesn't mean anything).

It also _only_ works as "&#xE1;" in SGML. It doesn't work as
"&amp;#xE1;" any more (that should render to the literal string
"&#xE1;"). If you have "SGML" content that uses entities, then you
not only shouldn't but in fact must not run an SGML-entity-encoder
over them. Entity encoding like this isn't idempotent. Do it to
things that are either already entity-encoded, or are a deliberate use
of entities, and you'll break stuff (the symptom is that you see
"entities" appearing in the browser).

Hi Andy,

I understand that "&#xE1;" does not have any meaning in Java and that
it's an SGML entity.
However, the SGML entity is generated by a servlet, which I need to
pass to the custom PDFWrapper over pdflib (version 7.0.2). The wrapper
parses the content sent by the servlet and displays it in the
PDF (using content type application/pdf and passing it to the
output stream). The whole conversion process is String and byte[]
based.
Thus, if I do not replace the "&#xE1;" with the Unicode equivalent,
the PDF would show up as "&#xE1;" - WHICH IS UNDESIRABLE :)
So I replace the String matching "&#xE1;" with \u00E1, which should
display the proper character.
However, it behaves differently on different platforms, which is
the problem I have tried to summarise in the small program above.

Please let me know if you were able to spot the problem.

Cheers
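Since the whole pipeline is String and byte[] based, one likely culprit is any no-argument `getBytes()` or `new String(byte[])` call along the way: those use the platform default charset, which differs between the Windows and Linux boxes. A hedged sketch of the fix, naming the charset at every boundary (modern Java shown; on 1.4 use the String-named overloads such as `getBytes("UTF-8")`):

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) {
        String s = "\u00E1";

        // Platform-dependent: same code, different bytes on a Cp1252 Windows
        // box (one byte, 0xE1) and a UTF-8 Linux box (two bytes, 0xC3 0xA1):
        byte[] platformBytes = s.getBytes();

        // Platform-independent: the charset is named explicitly, so both
        // machines produce identical bytes:
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(platformBytes.length + " byte(s) vs " + utf8Bytes.length + " byte(s)");
    }
}
```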
 

Andy Dingley

Thus, if I do not replace the "&#xE1;" with the Unicode equivalent,
the PDF would show up as "&#xE1;" - WHICH IS UNDESIRABLE :)

Any audience worth bothering with can read it as the entities ;-)

So I replace the String matching "&#xE1;" with \u00E1, which should
display the proper character.

Agreed. I think your characters are probably perfect; it's their
encoding as octets that's the problem.

Can you trap the output and dump it in hex? Is it genuine UTF-8,
just as it ought to be? If so, the problem is mis-labelling good UTF-8
as ISO-something (a very common fault) or possibly, in your case,
labelling it as KOI-something, which would be obscure.

If the hex-dumped content looks like the wrong octets for Unicode
(probably a single bare 0xE1 octet) then it was the encoding itself
that failed.
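A small helper along those lines, dumping the same character under the two candidate encodings so the real output can be compared against both (modern Java for brevity):

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b)); // %x prints Byte values unsigned
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toHex("\u00E1".getBytes(StandardCharsets.ISO_8859_1))); // e1
        System.out.println(toHex("\u00E1".getBytes(StandardCharsets.UTF_8)));      // c3 a1
    }
}
```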
 

winlin

Hi Andy,

Thanks for the advice. I checked with a hexdump. I got the output as
follows:

Windows: e1 0d 0a ...
Linux: c3 a1 0a ...

I am not very proficient with it... however, I can make out that on
Windows at least it is printing the e1 properly (not sure if I am
right).
No idea how to correct the goof-up you mentioned... I think I am
becoming a shameless creature in asking for help ;)

Let me know if you have something.

Cheers,
Bajrang
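Those two dumps actually pinpoint the issue (the trailing 0d 0a / 0a are just Windows vs Unix line endings). 0xE1 is U+00E1 in a single-byte Latin encoding such as ISO-8859-1 or Cp1252 (the likely Windows default), while 0xC3 0xA1 is the correct UTF-8 sequence for the very same character. So both machines are writing a valid "á"; the garbage appears only when a later stage decodes one of these with the other's charset. A sketch confirming both decode to the same character:

```java
import java.nio.charset.StandardCharsets;

public class InterpretDumps {
    public static void main(String[] args) {
        // The Windows dump: a single 0xE1 byte, decoded as ISO-8859-1:
        String fromWindows = new String(new byte[]{(byte) 0xE1},
                StandardCharsets.ISO_8859_1);
        // The Linux dump: 0xC3 0xA1, decoded as UTF-8:
        String fromLinux = new String(new byte[]{(byte) 0xC3, (byte) 0xA1},
                StandardCharsets.UTF_8);

        System.out.println(fromWindows.equals(fromLinux)); // true -- both are "á"
    }
}
```

The fix, then, is to make every stage of the pipeline agree on one encoding (ideally UTF-8) rather than relying on the platform default.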
 
