ascii char 26

Discussion in 'Java' started by bob, Sep 11, 2011.

  1. bob

    bob Guest

    Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

    I had to write this function to deal with this:

    public static String convertToAscii(String html) {
        html = html.replaceAll("\u2019", "'");
        html = html.replaceAll("\u201D", "\"");
        html = html.replaceAll("\u201C", "\"");

        byte[] b = null;
        try {
            b = html.getBytes("US-ASCII");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }

        // hyphen replace
        for (int ctr = 0; ctr < b.length; ctr++)
            if (b[ctr] == 26)
                b[ctr] = 45;

        html = new String(b);
        return html;
    }
     
    bob, Sep 11, 2011
    #1

  2. Arne Vajhøj

    Arne Vajhøj Guest

    On 9/11/2011 5:33 PM, bob wrote:
    > Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?
    >
    > I had to write this function to deal with this:
    >
    > public static String convertToAscii(String html) {
    > html = html.replaceAll("\u2019", "'");
    > html = html.replaceAll("\u201D", "\"");
    > html = html.replaceAll("\u201C", "\"");
    >
    > byte[] b = null;
    > try {
    > b = html.getBytes("US-ASCII");
    > } catch (UnsupportedEncodingException e) {
    > e.printStackTrace();
    > }
    >
    > // hyphen replace
    > for (int ctr = 0; ctr< b.length; ctr++)
    > if (b[ctr] == 26)
    > b[ctr] = 45;
    >
    > html = new String(b);
    > return html;
    > }


    ASCII code 26 is not in general replaced with hyphen.

    If you are asking why some code may do it, then in
    some contexts (usually on Windows platform) ASCII code
    26 indicates EOF.

    Arne
     
    Arne Vajhøj, Sep 11, 2011
    #2

  3. On 9/11/2011 4:33 PM, bob wrote:
    > Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?


    The US-ASCII encoder only properly encodes characters in the range of
    0-127, i.e., the characters that are present in ASCII. Any other
    character is replaced with some sort of substitution character; in this
    case, it looks like the charset has chosen to use ^Z as the "I don't
    know what this character is" character (I would have guessed '?'
    instead, but I suppose they decided to go with the less-commonly used
    variant).

    My guess is your input is using one of the characters like the minus
    sign, em dash, or perhaps an en dash instead (there may be others),
    which are visually close in appearance to a hyphen but do not share the
    same Unicode codepoint.
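
A minimal sketch for checking the substitution behaviour described above on whatever platform you are running. The class and method names are made up for illustration. It asks the US-ASCII encoder which byte it substitutes for unmappable characters, then encodes an em dash (U+2014) to see what actually comes out. On a desktop JDK I would expect both to be 63 ('?'); the original post suggests Android 2.1 reports 26 (SUB) instead.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class AsciiReplacementDemo {
    /** Returns the first byte the US-ASCII encoder substitutes for unmappable characters. */
    static byte replacementByte() {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.replacement()[0];
    }

    /** Encodes a single em dash (U+2014), which US-ASCII cannot represent. */
    static byte encodedEmDash() {
        return "\u2014".getBytes(StandardCharsets.US_ASCII)[0];
    }

    public static void main(String[] args) {
        // The two numbers should match; which number it is depends on the platform.
        System.out.println("replacement byte: " + replacementByte());
        System.out.println("em dash encodes to: " + encodedEmDash());
    }
}
```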

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
     
    Joshua Cranmer, Sep 11, 2011
    #3
  4. Roedy Green

    Roedy Green Guest

    On Sun, 11 Sep 2011 14:33:05 -0700 (PDT), bob <>
    wrote, quoted or indirectly quoted someone who said :
    >Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?
    >html = html.replaceAll("\u201C", "\"");


    \u0026 is replaced by an ampersand at compile time, as if you had
    typed one into the source code.

    I presume you are talking about

    26 0x1a ^Z SUB, substitute

    \u001a is not useful. It gets replaced by a ^z character, as if you
    had typed it into the source text, possibly creating a syntax error.
    If you want this char you probably want (char)0x001a

    This is true for ascii, UTF and UTF-8. If you see a -, it might just
    be some font's attempt to render a SUB char.

    You can use &#x241a; in HTML or \u241a in Java to render a tiny SUB
    glyph to represent the char.

    see
    http://mindprod.com/jgloss/ascii.html
    http://mindprod.com/jgloss/unicode.html
    http://mindprod.com/jgloss/utf.html
    http://mindprod.com/jgloss/literal.html
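
A small sketch of the two suggestions above: build SUB with a cast instead of a \u escape, and show control characters through the Unicode Control Pictures block (U+2400 to U+241F, where U+241A is the SUB symbol). The class name and the visualize helper are hypothetical, and whether the pictures actually display depends on the font.

```java
final class ControlPictures {
    /** Maps C0 control characters (U+0000..U+001F) to their Control Pictures (U+2400..U+241F). */
    static String visualize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Only the 32 C0 controls are remapped; everything else is copied unchanged.
            out.append(c < 0x20 ? (char) (0x2400 + c) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char sub = (char) 0x1a; // SUB, decimal 26, written as a cast rather than a \u escape
        System.out.println(visualize("a" + sub + "b"));
    }
}
```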
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is,
    the search for a superior moral justification for selfishness.
    ~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)
     
    Roedy Green, Sep 11, 2011
    #4
  5. Eric Sosman

    Eric Sosman Guest

    On 9/11/2011 5:52 PM, Joshua Cranmer wrote:
    > On 9/11/2011 4:33 PM, bob wrote:
    >> Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

    >
    > The US-ASCII encoder only properly encodes characters in the range of
    > 0-127, i.e., the characters that are present in ASCII. Any other
    > character is replaced with some sort of substitution character; in this
    > case, it looks like the charset has chosen to use ^Z as the "I don't
    > know what this character is" character (I would have guessed '?'
    > instead, but I suppose they decided to go with the less-commonly used
    > variant).


    It makes more sense when you think of 26 not as ^Z, but as SUB.

    --
    Eric Sosman
     
    Eric Sosman, Sep 11, 2011
    #5
  6. On 2011-09-11, bob <> wrote:
    > Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?


    Unicode has multiple different hyphens and hyphen-like characters.

    The traditional ASCII hyphen is the Unicode "hyphen-minus" which
    encodes to 0x2d in utf-8.

    http://www.fileformat.info/info/unicode/char/2d/index.htm suggests the
    following additional hyphen-like characters that you may actually be
    working with in your string, and that will probably be mapped to 26 in
    your case:

    hyphen U+2010
    non-breaking hyphen U+2011
    figure dash U+2012
    en dash U+2013
    minus sign U+2212
    roman uncia sign U+10191

    If hyphens are of particular interest to you it may be a better
    approach to replace non-ASCII-supported hyphens from the above list
    with "hyphen-minus", before you transcode to ASCII.

    One would tend to think there ought to be a library function somewhere
    to convert a unicode string to ASCII-supported variants of its various
    characters where possible, that you should be using instead. I don't
    know if such a function is easily available.
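
The "replace before you transcode" idea above can be sketched roughly as follows. HyphenFolder and its methods are hypothetical names, not a library API. The set of code points is the list above plus the em dash (U+2014), which is my own addition. The loop walks code points rather than chars because the roman uncia sign (U+10191) lies outside the Basic Multilingual Plane.

```java
final class HyphenFolder {
    /** True for the hyphen-like code points listed above, plus the em dash. */
    static boolean isHyphenLike(int cp) {
        switch (cp) {
            case 0x2010:  // hyphen
            case 0x2011:  // non-breaking hyphen
            case 0x2012:  // figure dash
            case 0x2013:  // en dash
            case 0x2014:  // em dash (assumption: not in the list above)
            case 0x2212:  // minus sign
            case 0x10191: // roman uncia sign (supplementary code point)
                return true;
            default:
                return false;
        }
    }

    /** Replaces every hyphen-like code point with the ASCII hyphen-minus (0x2d). */
    static String foldHyphens(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(isHyphenLike(cp) ? '-' : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling foldHyphens on the string before getBytes("US-ASCII") means these characters reach the encoder as 0x2d and never hit the substitution path.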

    Cheers,
    Bent D
    --
    Bent Dalager - http://www.pvv.org/~bcd
    powered by emacs
     
    Bent C Dalager, Sep 12, 2011
    #6
  7. On 9/11/2011 6:18 PM, Bent C Dalager wrote:
    > One would tend to think there ought to be a library function somewhere
    > to convert a unicode string to ASCII-supported variants of its various
    > characters where possible, that you should be using instead. I don't
    > know if such a function is easily available.


    This generally falls under the umbrella of Unicode normalization, which
    can resolve, e.g., Å (U+212B), the Angstrom symbol, and Å (U+00C5), the
    Swedish letter, to the same representation (may require compatibility
    normalization). You can do this in Java using the java.text.Normalizer class.
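
A short sketch of java.text.Normalizer in use, with a hypothetical wrapper class. It also shows a limit that matters for this thread: normalization unifies the Angstrom sign with the letter and expands the fi ligature, but as far as I know the em dash has no decomposition, so it is left alone and still needs an explicit replacement.

```java
import java.text.Normalizer;

final class NormalizeDemo {
    /** Canonical composition. */
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compatibility composition; also folds ligatures and similar variants. */
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(nfc("\u212B").equals("\u00C5")); // Angstrom sign vs. A with ring
        System.out.println(nfkc("\uFB01").equals("fi"));    // fi ligature vs. plain "fi"
        System.out.println(nfkc("\u2014").equals("-"));     // em dash vs. hyphen-minus
    }
}
```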

     
    Joshua Cranmer, Sep 12, 2011
    #7
  8. On Sep 11, 7:18 pm, Bent C Dalager <> wrote:
    > On 2011-09-11, bob <> wrote:
    >
    > > Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

    >
    > Unicode has multiple different hyphens and hyphen-like characters.
    >
    > The traditional ASCII hyphen is the Unicode "hyphen-minus" which
    > encodes to 0x2d in utf-8.
    >
    > http://www.fileformat.info/info/unicode/char/2d/index.htm suggests the
    > following additional hyphen-like characters that you may actually be
    > working with in your string, and that will probably be mapped to 26 in
    > your case:
    >
    > hyphen U+2010
    > non-breaking hyphen U+2011
    > figure dash U+2012
    > en dash U+2013
    > minus sign U+2212
    > roman uncia sign U+10191


    Wow, what a mess!

    > One would tend to think there ought to be a library function somewhere
    > to convert a unicode string to ASCII-supported variants of its various
    > characters where possible,


    Indeed.
     
    Retahiv Oopsiscame, Sep 12, 2011
    #8
  9. bob

    bob Guest

    You're right. I messed up, and it was the em dash. It turned into 26
    after going thru 'b = html.getBytes("US-ASCII");'

    Here's the new code:

    public static String convertToAscii(String html) {
        html = html.replaceAll("\u2019", "'");
        html = html.replaceAll("\u201D", "\"");
        html = html.replaceAll("\u201C", "\"");

        // mdash
        html = html.replaceAll("\u2014", "-");

        // NOTE: b is computed but never used; the method returns html as-is.
        byte[] b = null;
        try {
            b = html.getBytes("US-ASCII");
        } catch (UnsupportedEncodingException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        return html;
    }

    Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
    work.



    On Sep 11, 4:52 pm, Joshua Cranmer <> wrote:
    > On 9/11/2011 4:33 PM, bob wrote:
    >
    > > Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

    >
    > The US-ASCII encoder only properly encodes characters in the range of
    > 0-127, i.e., the characters that are present in ASCII. Any other
    > character is replaced with some sort of substitution character; in this
    > case, it looks like the charset has chosen to use ^Z as the "I don't
    > know what this character is" character (I would have guessed '?'
    > instead, but I suppose they decided to go with the less-commonly used
    > variant).
    >
    > My guess is your input is using one of the characters like the minus
    > sign, em dash, or perhaps an en dash instead (there may be others),
    > which are visually close in appearance to a hyphen but do not share the
    > same Unicode codepoint.
    >
    > --
    > Beware of bugs in the above code; I have only proved it correct, not
    > tried it. -- Donald E. Knuth
     
    bob, Sep 12, 2011
    #9
  10. On 9/11/2011 9:12 PM, bob wrote:
    > You're right. I messed up, and it was the em dash. It turned into 26
    > after going thru 'b = html.getBytes("US-ASCII");'
    >
    > Here's the new code:


    Hardcoding a table of replacements is generally not a good thing; in
    particular, I don't think it's going to solve your problems. I have seen
    sites that use the Unicode ff and fi ligatures instead of relying on
    fonts to produce them automatically.

    If I may ask, why do you need to convert the string to US-ASCII as
    opposed to UTF-8? That is going to cause major issues for the ~90% of
    the world that doesn't speak English as their main language.
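
To make the US-ASCII versus UTF-8 point concrete, here is a minimal round-trip sketch. The class name and sample text are made up. Encoding and decoding through UTF-8 preserves the text, while US-ASCII loses every character outside 0 to 127.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    /** Encodes s to bytes in the given charset and decodes it straight back. */
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9 \u2014 na\u00EFve"; // e-acute, em dash, i-diaeresis
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));    // true
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII).equals(text)); // false
    }
}
```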

    > Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
    > work.


    It shouldn't be that hard to find other Java Unicode normalization
    libraries out there.

     
    Joshua Cranmer, Sep 12, 2011
    #10
  11. bob

    bob Guest

    Loading UTF-8 data into a WebView doesn't work right. Please see this
    thread:

    http://groups.google.com/group/android-developers/browse_thread/thread/c056cc101c8676e5?hl=en

    Thanks.



    On Sep 11, 9:25 pm, Joshua Cranmer <> wrote:
    > On 9/11/2011 9:12 PM, bob wrote:
    >
    > > You're right.  I messed up, and it was the em dash.  It turned into 26
    > > after going thru 'b = html.getBytes("US-ASCII");'

    >
    > > Here's the new code:

    >
    > Hardcoding a list of tables is generally not a good thing; in
    > particular, I don't think it's going to solve your problems. I have seen
    > sites that use the Unicode ff and fi ligatures instead of relying on
    > fonts to automatically pick up on that as well.
    >
    > If I may ask, why do you need to convert the string to US-ASCII as
    > opposed to UTF-8? That is going to cause major issues for the ~90% of
    > the world that doesn't speak English as their main language.
    >
    > > Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
    > > work.

    >
    > It shouldn't be that hard to find other Java Unicode normalization
    > libraries out there.
    >
    > --
    > Beware of bugs in the above code; I have only proved it correct, not
    > tried it. -- Donald E. Knuth
     
    bob, Sep 12, 2011
    #11
  12. Roedy Green

    Roedy Green Guest

    On Sun, 11 Sep 2011 16:53:33 -0700 (PDT), Retahiv Oopsiscame
    <> wrote, quoted or indirectly quoted someone who
    said :

    >> hyphen U+2010
    >> non-breaking hyphen U+2011
    >> figure dash U+2012
    >> en dash U+2013
    >> minus sign U+2212
    >> roman uncia sign U+10191

    >
    >Wow, what a mess!


    See http://mindprod.com/jgloss/unicode.html It has a table showing
    all those dashes rendered.
    They don't all look the same. Further Unicode does not specify what
    the glyphs look like, just the code's logical function. A font
    designer is free to make all those different dashes visually distinct.
     
    Roedy Green, Sep 14, 2011
    #12
