ascii char 26

bob · Sep 11, 2011

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

I had to write this function to deal with this:

public static String convertToAscii(String html) {
    html = html.replaceAll("\u2019", "'");
    html = html.replaceAll("\u201D", "\"");
    html = html.replaceAll("\u201C", "\"");

    byte[] b = null;
    try {
        b = html.getBytes("US-ASCII");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }

    // hyphen replace
    for (int ctr = 0; ctr < b.length; ctr++)
        if (b[ctr] == 26)
            b[ctr] = 45;

    html = new String(b);
    return html;
}

Arne Vajhøj · Sep 11, 2011

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

ASCII code 26 is not in general replaced with a hyphen.

If you are asking why some code may do it, then in
some contexts (usually on the Windows platform) ASCII code
26 indicates EOF.

Arne

Joshua Cranmer · Sep 11, 2011

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

The US-ASCII encoder only properly encodes characters in the range of
0-127, i.e., the characters that are present in ASCII. Any other
character is replaced with some sort of substitution character; in this
case, it looks like the charset has chosen to use ^Z as the "I don't
know what this character is" character (I would have guessed '?'
instead, but I suppose they decided to go with the less-commonly used
variant).

My guess is your input is using one of the characters like the minus
sign, em dash, or perhaps an en dash instead (there may be others),
which are visually close in appearance to a hyphen but do not share the
same Unicode codepoint.
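
The substitution described here can be reproduced with an explicit CharsetEncoder. This is only a sketch under stated assumptions: the class name AsciiSubstitutionDemo and the method encodeWithReplacement are made up for illustration, and the replacement byte is passed in explicitly because the default is implementation-dependent (26 is what the thread reports; other runtimes may well use '?'). It also uses StandardCharsets, which needs Java 7 or later.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiSubstitutionDemo {
    // Hypothetical helper: encodes text as US-ASCII and writes the given
    // byte in place of every character that ASCII cannot represent.
    static byte[] encodeWithReplacement(String text, byte replacement) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { replacement });
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Not expected, because both error actions are set to REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An em dash between two ASCII letters, with SUB (26) as replacement.
        byte[] bytes = encodeWithReplacement("a\u2014b", (byte) 26);
        System.out.println(Arrays.toString(bytes)); // [97, 26, 98]
    }
}
```

The em dash comes out as a single byte 26, which is the value the original function was patching to 45 after the fact.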

Roedy Green · Sep 11, 2011

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?
html = html.replaceAll("\u201C", "\"");

\u0026 is replaced by an ampersand at compile time, as if you had
typed one into the source code.

I presume you are talking about

26 0x1a ^Z SUB, substitute

\u001a is not useful. It gets replaced by a ^z character, as if you
had typed it into the source text, possibly creating a syntax error.
If you want this char you probably want (char)0x001a

This is true for ascii, UTF and UTF-8. If you see a -, it might just
be some font's attempt to render a SUB char.

You can use &#x241a; in HTML or \u241a in Java to render a tiny SUB
glyph to represent the char.
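
A small sketch of both suggestions, the cast for the control char and U+241A for a visible stand-in. The names SubCharDemo and showSub are invented for the example and are not from any library.

```java
class SubCharDemo {
    // The control character itself (decimal 26), written as a cast.
    static final char SUB = (char) 0x1a;
    // U+241A, the visible "symbol for substitute" glyph.
    static final char SUB_SYMBOL = '\u241a';

    // Hypothetical helper: makes SUB characters visible for debugging output.
    static String showSub(String text) {
        return text.replace(SUB, SUB_SYMBOL);
    }

    public static void main(String[] args) {
        String raw = "a" + SUB + "b";
        System.out.println(showSub(raw));
    }
}
```

Whether the glyph actually shows up depends on the font and console in use.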

see
http://mindprod.com/jgloss/ascii.html
http://mindprod.com/jgloss/unicode.html
http://mindprod.com/jgloss/utf.html
http://mindprod.com/jgloss/literal.html
--
Roedy Green Canadian Mind Products
http://mindprod.com
The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is,
the search for a superior moral justification for selfishness.
~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)

Eric Sosman · Sep 11, 2011

The US-ASCII encoder only properly encodes characters in the range of
0-127, i.e., the characters that are present in ASCII. Any other
character is replaced with some sort of substitution character; in this
case, it looks like the charset has chosen to use ^Z as the "I don't
know what this character is" character (I would have guessed '?'
instead, but I suppose they decided to go with the less-commonly used
variant).

It makes more sense when you think of 26 not as ^Z, but as SUB.

Bent C Dalager · Sep 12, 2011

Anyone know why ASCII char 26 is used in place of a hyphen in UTF-8?

Unicode has multiple different hyphens and hyphen-like characters.

The traditional ASCII hyphen is the Unicode "hyphen-minus" which
encodes to 0x2d in utf-8.

http://www.fileformat.info/info/unicode/char/2d/index.htm suggests the
following additional hyphen-like characters that you may actually be
working with in your string, and that will probably be mapped to 26 in
your case:

hyphen U+2010
non-breaking hyphen U+2011
figure dash U+2012
en dash U+2013
minus sign U+2212
roman uncia sign U+10191

If hyphens are of particular interest to you it may be a better
approach to replace non-ASCII-supported hyphens from the above list
with "hyphen-minus", before you transcode to ASCII.
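
A minimal sketch of that approach, assuming the list above is the set you care about. DashFolder and foldDashes are hypothetical names, and the em dash (U+2014) is an addition of mine to the list, since it was mentioned earlier in the thread as a likely candidate. The loop works on code points rather than chars because U+10191 lies outside the Basic Multilingual Plane.

```java
class DashFolder {
    // Code points from the list above, plus the em dash U+2014 (an addition).
    private static final int[] DASH_LIKE = {
        0x2010, 0x2011, 0x2012, 0x2013, 0x2014, 0x2212, 0x10191
    };

    // Replaces every dash-like code point with ASCII hyphen-minus (0x2d).
    static String foldDashes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(isDashLike(cp) ? '-' : cp);
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    private static boolean isDashLike(int cp) {
        for (int dash : DASH_LIKE) {
            if (dash == cp) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // En dash and minus sign both become hyphen-minus.
        System.out.println(foldDashes("a\u2013b\u2212c")); // a-b-c
    }
}
```

Calling foldDashes before getBytes("US-ASCII") means the encoder never sees those characters, so there is nothing left for it to substitute.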

One would tend to think there ought to be a library function somewhere
to convert a unicode string to ASCII-supported variants of its various
characters where possible, that you should be using instead. I don't
know if such a function is easily available.

Cheers,
Bent D

Joshua Cranmer · Sep 12, 2011

One would tend to think there ought to be a library function somewhere
to convert a unicode string to ASCII-supported variants of its various
characters where possible, that you should be using instead. I don't
know if such a function is easily available.

This generally falls under the umbrella of Unicode normalization, which
can resolve, e.g., Å (U+212B), the Angstrom symbol, and Å (U+00C5), the
Swedish letter, to the same representation (may require compatibility
normalization). You can do this in Java using the java.text.Normalizer class.
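
For illustration, a short sketch of what java.text.Normalizer does with a few of the characters discussed in this thread. NormalizerDemo and its two helpers are made-up names. Note the last line of main: as far as I can tell the em dash has no decomposition at all, so normalization by itself leaves it alone and does not produce an ASCII hyphen.

```java
import java.text.Normalizer;

class NormalizerDemo {
    // Canonical composition.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Compatibility decomposition followed by canonical composition.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Angstrom sign U+212B maps to the letter U+00C5.
        System.out.println(nfc("\u212B").equals("\u00C5"));  // true
        // The fi ligature U+FB01 becomes the two ASCII letters.
        System.out.println(nfkc("\uFB01"));                   // fi
        // The em dash U+2014 is unchanged.
        System.out.println(nfkc("\u2014").equals("\u2014")); // true
    }
}
```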

Retahiv Oopsiscame · Sep 12, 2011

Unicode has multiple different hyphens and hyphen-like characters.

The traditional ASCII hyphen is the Unicode "hyphen-minus" which
encodes to 0x2d in utf-8.

http://www.fileformat.info/info/unicode/char/2d/index.htm suggests the
following additional hyphen-like characters that you may actually be
working with in your string, and that will probably be mapped to 26 in
your case:

hyphen U+2010
non-breaking hyphen U+2011
figure dash U+2012
en dash U+2013
minus sign U+2212
roman uncia sign U+10191

Wow, what a mess!

One would tend to think there ought to be a library function somewhere
to convert a unicode string to ASCII-supported variants of its various
characters where possible,

Indeed.

bob · Sep 12, 2011

You're right. I messed up, and it was the em dash. It turned into 26
after going thru 'b = html.getBytes("US-ASCII");'

Here's the new code:

public static String convertToAscii(String html) {
    html = html.replaceAll("\u2019", "'");
    html = html.replaceAll("\u201D", "\"");
    html = html.replaceAll("\u201C", "\"");

    // mdash
    html = html.replaceAll("\u2014", "-");

    // Note: b is never used after this point, so the encoding step
    // has no effect on the string that is returned.
    byte[] b = null;
    try {
        b = html.getBytes("US-ASCII");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return html;
}

Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
work.

Joshua Cranmer · Sep 12, 2011

You're right. I messed up, and it was the em dash. It turned into 26
after going thru 'b = html.getBytes("US-ASCII");'

Here's the new code:

Hardcoding a table of replacements is generally not a good idea; in
particular, I don't think it's going to solve your problem. I have also
seen sites that use the Unicode ff and fi ligature characters instead of
relying on fonts to form the ligatures automatically.

If I may ask, why do you need to convert the string to US-ASCII as
opposed to UTF-8? That is going to cause major issues for the ~90% of
the world that doesn't speak English as their main language.
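
To make the comparison concrete, here is a minimal sketch of a UTF-8 round trip (Utf8RoundTrip and its methods are hypothetical names, and StandardCharsets assumes Java 7 or later). It names the charset both when encoding and when decoding, unlike new String(b), which falls back to the platform default.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encodes to UTF-8 and decodes again; nothing is substituted.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("a\u2014b").equals("a\u2014b")); // true
        System.out.println(utf8Length("\u2014")); // 3
        System.out.println(utf8Length("-"));      // 1
    }
}
```

The em dash survives as three bytes, while hyphen-minus stays a single byte 0x2d, the same as in ASCII.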

Also, I'm on Android 2.1, so import java.text.Normalizer; doesn't
work.

It shouldn't be that hard to find other Java Unicode normalization
libraries out there.

bob · Sep 12, 2011

Loading UTF-8 data into a WebView doesn't work right. Please see this
thread:

http://groups.google.com/group/android-developers/browse_thread/thread/c056cc101c8676e5?hl=en

Thanks.

Roedy Green · Sep 14, 2011

Wow, what a mess!

See http://mindprod.com/jgloss/unicode.html. It has a table showing
all those dashes rendered.
They don't all look the same. Further, Unicode does not specify what
the glyphs look like, just the code's logical function. A font
designer is free to make all those different dashes visually distinct.
