Charset problem

Discussion in 'Java' started by winlin, Aug 28, 2008.

  1. winlin

    winlin Guest

    Hi,

    I am creating a PDF from the output received from a servlet. The text
    contains special Swiss/German accented characters.
    One such character arrives from the servlet as "&#xE1;", for
    which the Unicode escape is "\u00E1".

    What I do here is replace all occurrences of "&#xE1;" with "\u00E1",
    and then it displays properly.
    However, this only works on Windows. When I do the same thing
    on Linux machines, I get garbage characters.
    They look like characters from the KOI8 character set.

    Can anyone help me please?
    winlin, Aug 28, 2008
    #1

  2. winlin

    Guest

    **** U!
    , Aug 28, 2008
    #2

  3. winlin

    Roedy Green Guest

    Roedy Green, Aug 28, 2008
    #3
  4. winlin

    winlin Guest

    On Aug 28, 3:45 pm, Sabine Dinis Blochberger <>
    wrote:
    > winlin wrote:
    > > Hi,
    > >
    > > I am creating a PDF from the output received from servlet. There are
    > > special swiss/german accented characters.
    > > One such character's output from servlet is received as &#xE1; for
    > > which the unicode is "\u00E1".
    > >
    > > What I do here, is replace all the occurrences of &#xE1; to \u00E1 and
    > > thus, it displays properly.
    > > However, this only happens on windows. When I try to do the same thing
    > > on Linux machines, it gives me garbage characters.
    > > Those garbage characters look like from the KOI8 character set.
    > >
    > > Can anyone help me please?
    >
    > Without more details, I can only guess that maybe your Linux box does
    > not have the right locales installed.
    > --
    > Sabine Dinis Blochberger
    >
    > Op3racional www.op3racional.eu


    Hi,
    I checked with the locale command, and all the locales appear to be
    installed. Running locale shows the LANG variable set to 'en_US.UTF-8'.
    I tried changing it to de_DE.UTF-8, with no success.

    If you need any other details, please let me know.
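
Since LANG alone may not tell the whole story, it can help to ask the JVM itself which default encoding it picked up. The following is a minimal sketch with a hypothetical class name. It assumes Java 5 or later, because `Charset.defaultCharset()` does not exist in 1.4.2; on 1.4.2 only the `file.encoding` system property is available.

```java
import java.nio.charset.Charset;

final class DefaultCharsetReport {

    /** Returns a one-line summary of the encoding the JVM uses by default. */
    static String report() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // Run this on both the Windows and the Linux machine and compare.
        System.out.println(report());
    }
}
```

If the two machines report different values, any code that converts between String and byte[] without naming an encoding will behave differently on them.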

    # output for locale -a (gives all the locales installed)
    af_ZA
    af_ZA.iso88591
    an_ES
    an_ES.iso885915
    ar_AE
    ar_AE.iso88596
    ar_AE.utf8
    ar_BH
    ar_BH.iso88596
    ar_BH.utf8
    ar_DZ
    ar_DZ.iso88596
    ar_DZ.utf8
    ar_EG
    ar_EG.iso88596
    ar_EG.utf8
    ar_IN
    ar_IN.utf8
    ar_IQ
    ar_IQ.iso88596
    ar_IQ.utf8
    ar_JO
    ar_JO.iso88596
    ar_JO.utf8
    ar_KW
    ar_KW.iso88596
    ar_KW.utf8
    ar_LB
    ar_LB.iso88596
    ar_LB.utf8
    ar_LY
    ar_LY.iso88596
    ar_LY.utf8
    ar_MA
    ar_MA.iso88596
    ar_MA.utf8
    ar_OM
    ar_OM.iso88596
    ar_OM.utf8
    ar_QA
    ar_QA.iso88596
    ar_QA.utf8
    ar_SA
    ar_SA.iso88596
    ar_SA.utf8
    ar_SD
    ar_SD.iso88596
    ar_SD.utf8
    ar_SY
    ar_SY.iso88596
    ar_SY.utf8
    ar_TN
    ar_TN.iso88596
    ar_TN.utf8
    ar_YE
    ar_YE.iso88596
    ar_YE.utf8
    be_BY
    be_BY.cp1251
    be_BY.utf8
    bg_BG
    bg_BG.cp1251
    bg_BG.utf8
    bokmal
    bokmål
    br_FR
    br_FR.iso88591
    bs_BA
    bs_BA.iso88592
    C
    ca_ES
    ca_ES@euro
    ca_ES.iso88591
    ca_ES.iso885915@euro
    ca_ES.utf8
    ca_ES.utf8@euro
    catalan
    croatian
    cs_CZ
    cs_CZ.iso88592
    cs_CZ.utf8
    cy_GB
    cy_GB.iso885914
    cy_GB.utf8
    czech
    da_DK
    da_DK.iso88591
    da_DK.iso885915
    da_DK.utf8
    danish
    dansk
    de_AT
    de_AT@euro
    de_AT.iso88591
    de_AT.iso885915@euro
    de_AT.utf8
    de_AT.utf8@euro
    de_BE
    de_BE@euro
    de_BE.iso88591
    de_BE.iso885915@euro
    de_BE.utf8
    de_BE.utf8@euro
    de_CH
    de_CH.iso88591
    de_CH.utf8
    de_DE
    de_DE@euro
    de_DE.iso88591
    de_DE.iso885915@euro
    de_DE.utf8
    de_DE.utf8@euro
    de_LU
    de_LU@euro
    de_LU.iso88591
    de_LU.iso885915@euro
    de_LU.utf8
    de_LU.utf8@euro
    deutsch
    dutch
    eesti
    el_GR
    el_GR.iso88597
    el_GR.utf8
    en_AU
    en_AU.iso88591
    en_AU.utf8
    en_BW
    en_BW.iso88591
    en_BW.utf8
    en_CA
    en_CA.iso88591
    en_CA.utf8
    en_DK
    en_DK.iso88591
    en_DK.utf8
    en_GB
    en_GB.iso88591
    en_GB.iso885915
    en_GB.utf8
    en_HK
    en_HK.iso88591
    en_HK.utf8
    en_IE
    en_IE@euro
    en_IE.iso88591
    en_IE.iso885915@euro
    en_IE.utf8
    en_IE.utf8@euro
    en_IN
    en_IN.utf8
    en_NZ
    en_NZ.iso88591
    en_NZ.utf8
    en_PH
    en_PH.iso88591
    en_PH.utf8
    en_SG
    en_SG.iso88591
    en_SG.utf8
    en_US
    en_US.iso88591
    en_US.iso885915
    en_US.utf8
    en_ZA
    en_ZA.iso88591
    en_ZA.utf8
    en_ZW
    en_ZW.iso88591
    en_ZW.utf8
    es_AR
    es_AR.iso88591
    es_AR.utf8
    es_BO
    es_BO.iso88591
    es_BO.utf8
    es_CL
    es_CL.iso88591
    es_CL.utf8
    es_CO
    es_CO.iso88591
    es_CO.utf8
    es_CR
    es_CR.iso88591
    es_CR.utf8
    es_DO
    es_DO.iso88591
    es_DO.utf8
    es_EC
    es_EC.iso88591
    es_EC.utf8
    es_ES
    es_ES@euro
    es_ES.iso88591
    es_ES.iso885915@euro
    es_ES.utf8
    es_ES.utf8@euro
    es_GT
    es_GT.iso88591
    es_GT.utf8
    es_HN
    es_HN.iso88591
    es_HN.utf8
    es_MX
    es_MX.iso88591
    es_MX.utf8
    es_NI
    es_NI.iso88591
    es_NI.utf8
    es_PA
    es_PA.iso88591
    es_PA.utf8
    es_PE
    es_PE.iso88591
    es_PE.utf8
    es_PR
    es_PR.iso88591
    es_PR.utf8
    es_PY
    es_PY.iso88591
    es_PY.utf8
    es_SV
    es_SV.iso88591
    es_SV.utf8
    estonian
    es_US
    es_US.iso88591
    es_US.utf8
    es_UY
    es_UY.iso88591
    es_UY.utf8
    es_VE
    es_VE.iso88591
    es_VE.utf8
    et_EE
    et_EE.iso88591
    et_EE.utf8
    eu_ES
    eu_ES@euro
    eu_ES.iso88591
    eu_ES.iso885915@euro
    eu_ES.utf8
    eu_ES.utf8@euro
    fa_IR
    fa_IR.utf8
    fi_FI
    fi_FI@euro
    fi_FI.iso88591
    fi_FI.iso885915@euro
    fi_FI.utf8
    fi_FI.utf8@euro
    finnish
    fo_FO
    fo_FO.iso88591
    fo_FO.utf8
    français
    fr_BE
    fr_BE@euro
    fr_BE.iso88591
    fr_BE.iso885915@euro
    fr_BE.utf8
    fr_BE.utf8@euro
    fr_CA
    fr_CA.iso88591
    fr_CA.utf8
    fr_CH
    fr_CH.iso88591
    fr_CH.utf8
    french
    fr_FR
    fr_FR@euro
    fr_FR.iso88591
    fr_FR.iso885915@euro
    fr_FR.utf8
    fr_FR.utf8@euro
    fr_LU
    fr_LU@euro
    fr_LU.iso88591
    fr_LU.iso885915@euro
    fr_LU.utf8
    fr_LU.utf8@euro
    ga_IE
    ga_IE@euro
    ga_IE.iso88591
    ga_IE.iso885915@euro
    ga_IE.utf8
    ga_IE.utf8@euro
    galego
    galician
    german
    gl_ES
    gl_ES@euro
    gl_ES.iso88591
    gl_ES.iso885915@euro
    gl_ES.utf8
    gl_ES.utf8@euro
    greek
    gv_GB
    gv_GB.iso88591
    gv_GB.utf8
    hebrew
    he_IL
    he_IL.iso88598
    he_IL.utf8
    hi_IN
    hi_IN.utf8
    hr_HR
    hr_HR.iso88592
    hr_HR.utf8
    hrvatski
    hu_HU
    hu_HU.iso88592
    hu_HU.utf8
    hungarian
    icelandic
    id_ID
    id_ID.iso88591
    id_ID.utf8
    is_IS
    is_IS.iso88591
    is_IS.utf8
    italian
    it_CH
    it_CH.iso88591
    it_CH.utf8
    it_IT
    it_IT@euro
    it_IT.iso88591
    it_IT.iso885915@euro
    it_IT.utf8
    it_IT.utf8@euro
    iw_IL
    iw_IL.iso88598
    iw_IL.utf8
    ja_JP
    ja_JP.eucjp
    ja_JP.ujis
    ja_JP.utf8
    japanese
    japanese.euc
    ka_GE
    ka_GE.georgianps
    kl_GL
    kl_GL.iso88591
    kl_GL.utf8
    ko_KR
    ko_KR.euckr
    ko_KR.utf8
    korean
    korean.euc
    kw_GB
    kw_GB.iso88591
    kw_GB.utf8
    lithuanian
    lo_LA
    lo_LA.utf8
    lt_LT
    lt_LT.iso885913
    lt_LT.utf8
    lv_LV
    lv_LV.iso885913
    lv_LV.utf8
    mi_NZ
    mi_NZ.iso885913
    mk_MK
    mk_MK.iso88595
    mk_MK.utf8
    mr_IN
    mr_IN.utf8
    ms_MY
    ms_MY.iso88591
    ms_MY.utf8
    mt_MT
    mt_MT.iso88593
    mt_MT.utf8
    nb_NO
    nb_NO.ISO-8859-1
    nl_BE
    nl_BE@euro
    nl_BE.iso88591
    nl_BE.iso885915@euro
    nl_BE.utf8
    nl_BE.utf8@euro
    nl_NL
    nl_NL@euro
    nl_NL.iso88591
    nl_NL.iso885915@euro
    nl_NL.utf8
    nl_NL.utf8@euro
    nn_NO
    nn_NO.iso88591
    nn_NO.utf8
    no_NO
    no_NO.iso88591
    no_NO.utf8
    norwegian
    nynorsk
    oc_FR
    oc_FR.iso88591
    pl_PL
    pl_PL.iso88592
    pl_PL.utf8
    polish
    portuguese
    POSIX
    pt_BR
    pt_BR.iso88591
    pt_BR.utf8
    pt_PT
    pt_PT@euro
    pt_PT.iso88591
    pt_PT.iso885915@euro
    pt_PT.utf8
    pt_PT.utf8@euro
    romanian
    ro_RO
    ro_RO.iso88592
    ro_RO.utf8
    ru_RU
    ru_RU.iso88595
    ru_RU.koi8r
    ru_RU.utf8
    russian
    ru_UA
    ru_UA.koi8u
    ru_UA.utf8
    se_NO
    se_NO.utf8
    sk_SK
    sk_SK.iso88592
    sk_SK.utf8
    slovak
    slovene
    slovenian
    sl_SI
    sl_SI.iso88592
    sl_SI.utf8
    spanish
    sq_AL
    sq_AL.iso88591
    sq_AL.utf8
    sr_YU
    sr_YU@cyrillic
    sr_YU.iso88592
    sr_YU.iso88595@cyrillic
    sr_YU.utf8
    sr_YU.utf8@cyrillic
    sv_FI
    sv_FI@euro
    sv_FI.iso88591
    sv_FI.iso885915@euro
    sv_FI.utf8
    sv_FI.utf8@euro
    sv_SE
    sv_SE.iso88591
    sv_SE.iso885915
    sv_SE.utf8
    swedish
    ta_IN
    ta_IN.utf8
    te_IN
    te_IN.utf8
    tg_TJ
    tg_TJ.koi8t
    thai
    th_TH
    th_TH.tis620
    th_TH.utf8
    tl_PH
    tl_PH.iso88591
    tr_TR
    tr_TR.iso88599
    tr_TR.utf8
    turkish
    uk_UA
    uk_UA.koi8u
    uk_UA.utf8
    ur_PK
    ur_PK.utf8
    uz_UZ
    uz_UZ.iso88591
    vi_VN
    vi_VN.tcvn
    vi_VN.utf8
    wa_BE
    wa_BE@euro
    wa_BE.iso88591
    wa_BE.iso885915@euro
    yi_US
    yi_US.cp1255
    zh_CN
    zh_CN.gb18030
    zh_CN.gb2312
    zh_CN.gbk
    zh_CN.utf8
    zh_HK
    zh_HK.big5hkscs
    zh_HK.utf8
    zh_TW
    zh_TW.big5
    zh_TW.euctw
    zh_TW.utf8
    winlin, Aug 28, 2008
    #4
  5. winlin

    magloca Guest

    winlin @ Thursday 28 August 2008 08:10:

    > Hi,
    >
    > I am creating a PDF from the output received from servlet. There are
    > special swiss/german accented characters.
    > One such character's output from servlet is received as &#xE1; for
    > which the unicode is "\u00E1".
    >
    > What I do here, is replace all the occurrences of &#xE1; to \u00E1 and
    > thus, it displays properly.
    > However, this only happens on windows. When I try to do the same thing
    > on Linux machines, it gives me garbage characters.
    > Those garbage characters look like from the KOI8 character set.
    >
    > Can anyone help me please?


    Technically, KOI-8 isn't a character set; it's a character encoding. But
    I assume what you mean is that Cyrillic characters appear in the
    output. Since the KOI-8 encoding (as well as Windows-1251, BTW) maps
    codepoints to Cyrillic characters that in Unicode (and ISO-Latin1 et
    al.) are mapped to the accented characters you want, it seems likely
    that whatever it is you're using to generate the PDFs gets confused
    about what encoding is in effect. Maybe you could tell us what PDF
    generator you're using.
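
To illustrate the point about one byte value meaning different characters, here is a minimal sketch (hypothetical class name) that decodes the single byte 0xE1 under two encodings. It assumes the JVM ships a KOI8-R charset, which is checked at run time because it is not one of the charsets every JVM must support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SameByteDifferentCharset {

    /** Decodes a single byte using the given charset. */
    static String decode(byte b, Charset charset) {
        return new String(new byte[] { b }, charset);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE1;

        // ISO-8859-1 maps each byte to the Unicode code point of the same value.
        String latin1 = decode(b, StandardCharsets.ISO_8859_1);
        System.out.printf("ISO-8859-1: U+%04X%n", (int) latin1.charAt(0));

        // KOI8-R is optional, so check for it before using it.
        if (Charset.isSupported("KOI8-R")) {
            String koi8 = decode(b, Charset.forName("KOI8-R"));
            System.out.printf("KOI8-R:     U+%04X%n", (int) koi8.charAt(0));
        }
    }
}
```

The first line reports U+00E1 (á); the second, where available, reports a code point in the Cyrillic block.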

    m.
    magloca, Aug 29, 2008
    #5
  6. winlin

    Andy Dingley Guest

    On 28 Aug, 07:10, winlin <> wrote:
    > I am creating a PDF from the output received from servlet. There are
    > special swiss/german accented characters.
    > One such character's output from servlet is received as &#xE1; for
    > which the unicode is "\u00E1".


    The "Unicode" for this is just U+00E1, any way you wish to represent
    it. The difference between the two examples you quote is that they're
    different syntactic representations for this same Unicode, one for
    SGML / HTML / XML and the other for Java.

    There's also the question of "encoding": how to represent characters
    as a stream of bytes or octets. This isn't a problem here because both
    of the forms you describe use a syntactic escaping mechanism at an
    even higher level, such that Unicode characters can be represented in
    an encoding (like ASCII's encoding) that doesn't support that
    character. It's good practice to use UTF-8 encoding throughout (if
    you can enforce it on the rest of the team, stuff starts to "just
    work"); however, this isn't always possible, owing to limitations of
    some tools. Java properties files are one example.

    HTML and Java always use Unicode characters for these numeric
    entities, no matter what the encoding.
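
Because both notations name the same Unicode code point, the HTML form can be turned into a Java character mechanically rather than one hard-coded string at a time. The sketch below is one possible way to do that and is not taken from the original poster's code; the class name and regular expression are illustrative assumptions. It handles hexadecimal (`&#xE1;`) and decimal (`&#225;`) references and leaves everything else, including named entities such as `&amp;`, untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericCharRefDecoder {

    // Group 1 captures hex digits (&#xE1;), group 2 captures decimal digits (&#225;).
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));");

    /** Replaces every numeric character reference with the character it names. */
    static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // References outside the Unicode range are copied through unchanged.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("Z&#xFC;rich &#225;");
        // Print code points rather than glyphs so the console encoding cannot interfere.
        for (int i = 0; i < decoded.length(); i++) {
            System.out.printf("U+%04X ", (int) decoded.charAt(i));
        }
        System.out.println();
    }
}
```

The result is an ordinary Java String; which bytes it becomes afterwards still depends on the encoding used when it is written out.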



    > What I do here, is replace all the occurrences of &#xE1; to \u00E1 and
    > thus, it displays properly.


    That converts from the HTML notation to the Java one. The character set
    is the same (i.e. Unicode), and the overall encoding doesn't matter
    because you're not dependent on it (while these characters are wrapped
    up as numeric entities).


    > However, this only happens on windows. When I try to do the same thing
    > on Linux machines, it gives me garbage characters.


    That sounds like you're generating the right content, with correctly
    encoded characters (probably as UTF-8), but the servlet is mis-
    labelling this encoding as something else. Very easily done, and the
    most common error of this type. However your particular results would
    suggest the mis-labelling would be as KOI, which sounds unlikely.

    Alternatively, the encoding process is broken (rare, but possible).
    Your Unicode characters are being pulled out of their safe references
    and converted to encoded characters, which are then getting mangled.
    When read as UTF-8, their mangled remains look like a radically
    different set of characters, i.e. KOI.

    It's hard to diagnose this stuff. What you really need is a clear
    understanding of the concepts and of your workflow, then to check each
    step and ensure that the data is valid in each intermediate format (i.e.
    the encoding used to create the content always matches the encoding
    assumed when it is read). Life is also simpler if you ignore
    ISO-8859-* in favour of consistent Unicode / UTF-8 throughout.


    Wikipedia is quite readable on these topics.
    Andy Dingley, Aug 29, 2008
    #6
  7. winlin

    winlin Guest

    On Aug 29, 2:14 pm, magloca <> wrote:
    > winlin @ Thursday 28 August 2008 08:10:
    >
    > > Hi,
    > >
    > > I am creating a PDF from the output received from servlet. There are
    > > special swiss/german accented characters.
    > > One such character's output from servlet is received as &#xE1; for
    > > which the unicode is "\u00E1".
    > >
    > > What I do here, is replace all the occurrences of &#xE1; to \u00E1 and
    > > thus, it displays properly.
    > > However, this only happens on windows. When I try to do the same thing
    > > on Linux machines, it gives me garbage characters.
    > > Those garbage characters look like from the KOI8 character set.
    > >
    > > Can anyone help me please?
    >
    > Technically, KOI-8 isn't a character set; it's a character encoding. But
    > I assume what you mean is that Cyrillic characters appear in the
    > output. Since the KOI-8 encoding (as well as Windows-1251, BTW) maps
    > codepoints to Cyrillic characters that in Unicode (and ISO-Latin1 et
    > al.) are mapped to the accented characters you want, it seems likely
    > that whatever it is you're using to generate the PDFs gets confused
    > about what encoding is in effect. Maybe you could tell us what PDF
    > generator you're using.
    >
    > m.


    Hi All,

    First of all, thank you all for the effort you are making to help
    me out.
    I have reduced the problem to a small program, which gives me
    different output on Windows and Linux (both running the same Java
    version, 1.4.2_16).
    On Windows the output shows the actual character, as expected;
    on Linux it shows garbage, probably from the KOI8-R encoding.

    Please see if the program helps you get to the bottom of the problem.
    I also read in the documentation of Character (version 5.0) that String
    and char arrays use UTF-16 encoding, and I hope that is not a problem.

    import java.io.UnsupportedEncodingException;

    public class TestCharSet {

        /**
         * Default constructor.
         */
        public TestCharSet() {
            super();
        }

        /**
         * @param args command-line arguments (unused)
         * @throws UnsupportedEncodingException declared but never thrown here
         */
        public static void main( String[] args ) throws UnsupportedEncodingException {
            //System.out.println("Special Characters:" + "á â ä è é ê ë ï ò ó ö ú ü") ;
            String stateProvince = "&#xE1;"; //This is the character á
            System.out.println("State Province before conversion : " + stateProvince) ;
            String stateProvince_post = unescapeXMLSpecialCharacters( stateProvince );
            System.out.println("State Province after conversion : " + stateProvince_post) ;
        }

        /**
         * Replaces all occurrences of the substring in the data string with the
         * replacement string.
         *
         * @param data the string to check.
         * @param substring the substring to replace.
         * @param replacement the string the substring is replaced with.
         * @return the result of the replacement(s).
         */
        // @PMD:REVIEWED:AvoidReassigningParameters: Bajrang Gupta
        private static String replace( String data, final String substring,
                final String replacement ) {
            int index = data.indexOf( substring, 0 ) ;
            while ( index >= 0 ) {
                data = data.substring( 0, index ) + replacement
                        + data.substring( index + substring.length() ) ;
                // Skip past the inserted text so it is never rescanned.
                index += replacement.length() ;
                index = data.indexOf( substring, index ) ;
            }
            return data ;
        }

        /**
         * Replaces the numeric character reference "&#xE1;" in the string
         * with the character U+00E1.
         *
         * @param xmlData the data string received from the servlet.
         * @return the string with the reference replaced.
         */
        public static String unescapeXMLSpecialCharacters( String xmlData )
                throws UnsupportedEncodingException {
            xmlData = replace( xmlData, "&#xE1;", "\u00E1" ) ;
            return xmlData ;
        }
    }
    winlin, Aug 29, 2008
    #7
  8. winlin

    Andy Dingley Guest

    On 29 Aug, 14:36, winlin <> wrote:

    >         String stateProvince = "&#xE1;"; //This is the character á


    It isn't.

    It's an SGML(and friends)-only numeric entity that represents that
    character in an SGML context. It makes no sense in Java or text files
    (it's valid, it just doesn't mean anything).

    It also _only_ works as "&#xE1;" in SGML. It doesn't work as
    "&amp;#xE1;" any more (that should render as the literal string
    "&#xE1;"). If you have "SGML" content that uses entities, then you
    not only shouldn't but in fact must not run an SGML-entity-encoder
    over it. Entity encoding like this isn't idempotent. Do it to
    things that are either already entity-encoded, or are a deliberate use
    of entities, and you'll break stuff (the symptom is that you see
    "entities" appearing in the browser).
    Andy Dingley, Aug 29, 2008
    #8
  9. winlin

    winlin Guest

    On Aug 29, 7:30 pm, Andy Dingley <> wrote:
    > On 29 Aug, 14:36, winlin <> wrote:
    >
    > >         String stateProvince = "&#xE1;"; //This is the character á
    >
    > It isn't.
    >
    > It's an SGML(and friends)-only numeric entity that represents that
    > character in an SGML context. It makes no sense in Java or text files
    > (it's valid, it just doesn't mean anything).
    >
    > It also _only_ works as "&#xE1;" in SGML. It doesn't work as
    > "&amp;#xE1;" any more (that should render to the literal string
    > "&#xE1;".   If you have "SGML" content that uses entities, then you
    > not only shouldn't but in fact must not run an SGML-entity-encoder
    > over them. Entity encoding like this isn't idempotetent. Do it to
    > things that are either already entity-encoded, or are a deliberate use
    > of entities, and you'll break stuff (Symptom is that you see
    > "entities" appearing in the browser).


    Hi Andy,

    I understand that &#xE1; does not have any meaning in Java and that it
    is an SGML entity.
    However, the SGML entity is generated by a servlet, and I need to
    pass its output to a custom PDFWrapper over pdflib (version 7.0.2). The
    wrapper parses the content sent by the servlet and displays it in the
    PDF (using content type application/pdf and writing it to the
    output stream). The whole conversion process is String and byte[]
    based.
    Thus, if I do not replace the &#xE1; with the Unicode equivalent, the
    PDF would show it as &#xE1; - WHICH IS UNDESIRABLE :)
    So I replace the String matching &#xE1; with \u00E1, which should
    display the proper character.
    However, it behaves differently on different platforms, which is
    the problem that I have tried to summarise in the small program above.

    Please let me know if you were able to understand the problem.

    Cheers
    winlin, Aug 29, 2008
    #9
  10. winlin

    Andy Dingley Guest

    On 29 Aug, 16:29, winlin <> wrote:

    > Thus, if I do not replace the &#xE1; with the unicode equivalent, the
    > PDF would show up as &#xE1 - WHICH IS UNDESIRABLE :)


    Any audience worth bothering with can read it as the entities ;-)

    > thus, I replace the String matching &#xE1; with \u00E1 which should
    > display the proper character.


    Agreed. I think your characters are probably perfect, it's their
    encoding as octets that's the problem.

    Can you trap the output and dump it in hex? Is this genuine UTF-8,
    just as it ought to be? If so, the problem is mis-labelling good UTF-8
    as ISO-something (a very common fault) or possibly, in your case,
    labelling it as KOI-something, which would be obscure.

    If the hex-dumped content looks like the wrong octets for Unicode
    (probably a single bare 0xE1 octet) then it was the encoding itself
    that failed.
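
If no hex-dump tool is at hand, the same check can be made from Java. This is a minimal sketch with a hypothetical helper that prints the octets a string becomes under a named encoding, so they can be compared with what actually arrives on the wire.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class HexDump {

    /** Encodes the text with the given charset and formats the bytes as hex pairs. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so that bytes above 0x7F are not sign-extended.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00E1";
        System.out.println("UTF-8:      " + hex(s, StandardCharsets.UTF_8));      // c3 a1
        System.out.println("ISO-8859-1: " + hex(s, StandardCharsets.ISO_8859_1)); // e1
    }
}
```

Two octets `c3 a1` indicate UTF-8; a single `e1` indicates ISO-8859-1 or a close relative such as windows-1252.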
    Andy Dingley, Aug 29, 2008
    #10
  11. winlin

    winlin Guest

    Hi Andy,

    Thanks for the advice. I checked with a hex dump and got the following
    output:

    Windows: e1 0d 0a ...
    Linux: c3 a1 0a ...

    I am not very proficient with this; however, I can make out that on
    Windows at least it is printing the e1 properly (not sure if I am
    right).
    I have no idea how to correct the goof-up you mentioned... I think I am
    becoming shameless in asking for help ;)

    Let me know if you have something.

    Cheers,
    Bajrang
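
Both dumps are consistent with U+00E1 having been produced correctly and then written in each platform's default encoding: a single `e1` followed by CR LF on Windows, and the UTF-8 pair `c3 a1` followed by LF on Linux. A minimal sketch, assuming the goal is identical bytes on both machines, is to name the encoding explicitly instead of relying on the default. The class and method names are hypothetical, and what encoding pdflib or the custom wrapper expects as input is not covered here.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

final class ExplicitEncodingOutput {

    /** Writes the text through a PrintStream that uses the named encoding. */
    static byte[] printWith(String text, String encodingName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            // This constructor is available from Java 1.4 onwards.
            PrintStream out = new PrintStream(sink, true, encodingName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encodingName);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = printWith("\u00E1", "UTF-8");
        byte[] latin1 = printWith("\u00E1", "ISO-8859-1");
        System.out.println("UTF-8 bytes: " + utf8.length + ", ISO-8859-1 bytes: " + latin1.length);
    }
}
```

The same idea applies to `String.getBytes(String)` and `new String(byte[], String)`: passing the encoding name removes the dependence on the platform.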


    On Aug 29, 9:51 pm, Andy Dingley <> wrote:
    > On 29 Aug, 16:29, winlin <> wrote:
    >
    > > Thus, if I do not replace the &#xE1; with the unicode equivalent, the
    > > PDF would show up as &#xE1 - WHICH IS UNDESIRABLE :)
    >
    > Any audience worth bothering with can read it as the entities ;-)
    >
    > > thus, I replace the String matching &#xE1; with \u00E1 which should
    > > display the proper character.
    >
    > Agreed. I think your characters are probably perfect, it's their
    > encoding as octets that's the problem.
    >
    > Can you trap the output and dump it in hex? Is this genuine UTF-8,
    > just as it ought to be? If so, the problem is mis-labelling goof UTF
    > as ISO-something (very common fault) or possibly in your case
    > labelling it as KOI-something, which would be obscure.
    >
    > If the hex-dumped content looks like the wrong octets for unicode
    > (probably a single bare 0xE1 octet) then it was the encoding itself
    > that failed.
    winlin, Sep 1, 2008
    #11