Convert UTF-8 to ASCII

Vinay · Feb 8, 2005

Hi!

I'm trying to print a UTF-8 encoded string (called someString) such
that the output contains only US-ASCII characters. I'm doing the
following:

Charset charset = Charset.forName("US-ASCII");
CharsetEncoder encoder = charset.newEncoder();
ByteBuffer bb = null;

encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
try {
	CharBuffer cb = CharBuffer.wrap(someString);
	bb = encoder.encode(cb);
} catch (CharacterCodingException e) {
	e.printStackTrace();
}

CharBuffer cbb = bb.asCharBuffer();
return cbb.toString();

When I give this any string all I get is a bunch of ???? so it's
probably unable to map any of the characters correctly. What is the
obviously wrong thing / missing step here?

TIA

Vinay

John C. Bollinger · Feb 8, 2005

Vinay said:
Hi!

I'm trying to print a UTF-8 encoded string (called someString) such
that the output contains only US-ASCII characters.

Do you realize that UTF-8 and US-ASCII are congruent for the entire
range of US-ASCII? And that Java chars are sixteen bits wide (not 8),
directly expressing Unicode code points for characters in the BMP?

I'm doing the
following:

Charset charset = Charset.forName("US-ASCII");
CharsetEncoder encoder = charset.newEncoder();
ByteBuffer bb = null;

encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
try {
	CharBuffer cb = CharBuffer.wrap(someString);
	bb = encoder.encode(cb);
} catch (CharacterCodingException e) {
	e.printStackTrace();
}

CharBuffer cbb = bb.asCharBuffer();
return cbb.toString();

There is a fundamental problem with your approach. Java chars, and thus
Java Strings, are based on the character model I described above. It is
not meaningful in Java to talk about a String as being encoded in UTF-8
or in US-ASCII -- character encodings / charsets are possible properties
of sequences of _bytes_ not sequences of characters. (That's their
whole point, in fact: to bridge between byte sequences and character
sequences.)
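
To make that distinction concrete, here is a minimal sketch. The class and method names are made up for illustration, and StandardCharsets assumes Java 7 or later. The same String yields different byte sequences depending on the charset chosen, while pure ASCII text encodes to identical bytes in US-ASCII and UTF-8:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringVsBytesDemo {
    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the text encodes to the same bytes in US-ASCII and UTF-8.
    static boolean sameInAsciiAndUtf8(String s) {
        return Arrays.equals(s.getBytes(StandardCharsets.US_ASCII),
                             s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String copyright = "\u00A9"; // the copyright sign, a single char
        System.out.println(copyright.length());              // 1 char in the String
        System.out.println(utf8Length(copyright));           // 2 bytes once encoded
        System.out.println(sameInAsciiAndUtf8("Copyright")); // true
        System.out.println(sameInAsciiAndUtf8(copyright));   // false
    }
}
```

The String itself has no encoding; an encoding only comes into play at the moment getBytes() is called.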

When I give this any string all I get is a bunch of ???? so it's
probably unable to map any of the characters correctly. What is the
obviously wrong thing / missing step here?

There is at least one major technical problem in your code, which occurs
at the second-to-last line quoted above: if the encoding worked
correctly then each character will have been encoded as one byte in the
output ByteBuffer, but you use a CharBuffer view of the output to
construct your String result. This combines the first and second bytes,
third and fourth, etc. (chars are two bytes wide, remember). Unless you
have null bytes in your input, all of the resulting chars will be
outside the ASCII range. (And note that you are doing a *second*
conversion to bytes when you print / save the result, wherever you do that.)
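
The byte-pairing effect described above can be reproduced in a few lines. This is only a sketch: the class and method names are hypothetical, and Charset.encode() is used here as a shortcut for the explicit CharsetEncoder in the quoted code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CharBufferViewDemo {
    // Reinterprets the encoded bytes as 16-bit chars, as the quoted code does.
    static String viaCharBufferView(String s) {
        ByteBuffer bb = StandardCharsets.US_ASCII.encode(s);
        return bb.asCharBuffer().toString();
    }

    // Decodes the bytes with the same charset that produced them.
    static String viaDecode(String s) {
        ByteBuffer bb = StandardCharsets.US_ASCII.encode(s);
        return StandardCharsets.US_ASCII.decode(bb).toString();
    }

    public static void main(String[] args) {
        String wrong = viaCharBufferView("abcd");
        // Four bytes 0x61 0x62 0x63 0x64 become two chars: U+6162 and U+6364.
        System.out.println(wrong.length());                   // 2
        System.out.println(Integer.toHexString(wrong.charAt(0))); // 6162
        System.out.println(viaDecode("abcd"));                // abcd
    }
}
```

Neither resulting char is in the ASCII range, which is why printing them through an ASCII-only output produces question marks.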

It is very important that you learn to distinguish between characters
and character sequences on one hand, and encoded characters and encoded
character sequences on the other. The former, represented in Java by
the char primitive and by the Character, String, StringBuffer, etc.
classes, are what you should always use inside your applications to handle
character data. The latter, represented in Java by byte arrays and
related classes, are what you should use to communicate character data
with outside entities (files, remote hosts, etc.). You must use an
appropriate charset to convert between the two.

You should also recognize that it is rare to need to use CharsetEncoder
directly to encode character data. Most often the appropriate approach
is to use String.getBytes() or an OutputStreamWriter. The latter is
considerably more general.
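
A minimal sketch of both approaches, assuming Java 8 or later (for StandardCharsets, try-with-resources and UncheckedIOException). The class and method names are hypothetical, and a ByteArrayOutputStream stands in for a real file or socket:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodeAtBoundaryDemo {
    // One-shot conversion of a whole String to encoded bytes.
    static byte[] withGetBytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Streaming conversion: chars written to the Writer are encoded
    // as they pass through to the underlying byte stream.
    static byte[] withWriter(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.US_ASCII)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = withGetBytes("a\u00A9b");
        byte[] b = withWriter("a\u00A9b");
        System.out.println(Arrays.toString(a)); // [97, 63, 98]
        System.out.println(Arrays.equals(a, b)); // true
    }
}
```

Note that both routes substitute '?' (byte 63) for characters US-ASCII cannot represent; neither one drops them.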

Daniel Tryba · Feb 8, 2005

Vinay said:
I'm trying to print a UTF-8 encoded string (called someString) such
that the output contains only US-ASCII characters.

Well, John's posting tells it all...

But what are you actually trying to accomplish? If you want to encode
any string in a 7-bit clean way (without losing the actual (e.g. Unicode)
characters) you could use UTF-7, base64 or uuencode.

If you want to strip non-ASCII chars, replacing the chars with values
greater than 127 with some other chars (? might be a good one) is your
only option.
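
For the first option, a minimal base64 sketch, assuming Java 8 or later for java.util.Base64 (which did not exist when this thread was written). The class and method names are hypothetical. The text is first encoded to UTF-8 bytes, and those bytes are then rendered using ASCII characters only:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class SevenBitCleanDemo {
    // Encodes the text as UTF-8, then renders those bytes as base64 text.
    static String toBase64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses toBase64: base64 text back to UTF-8 bytes, then to a String.
    static String fromBase64(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "C in a circle \u00A9";
        String encoded = toBase64(original);
        System.out.println(encoded);                               // ASCII only
        System.out.println(fromBase64(encoded).equals(original));  // true
    }
}
```

Unlike stripping or replacing, this round-trips without losing any characters, at the cost of the output no longer being human-readable.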

Vinay · Feb 9, 2005

So basically I have a string like:

"You can use C in a circle © instead of Copyright"

and I would like to strip out all non-ASCII characters from that
string. I don't want to replace them with ? So the output I'm expecting
is:

"You can use C in a circle instead of Copyright"

Real Gagnon · Feb 9, 2005

"You can use C in a circle © instead of Copyright"

and I would like to strip out all non-ASCII characters from that
string. I don't want to replace them with ? So the output I'm expecting
is:

"You can use C in a circle instead of Copyright"

Try something like this:

public static String formatString(String s) {
	return s.replaceAll("[^\\p{ASCII}]", "");
}
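
A quick check of that regex against the example string. The class name and the stripAndCollapse helper are hypothetical additions. Removing the © leaves the spaces on either side of it in place, so the optional second step collapses runs of spaces to get exactly the output asked for:

```java
public class StripNonAsciiDemo {
    // Removes every character outside the US-ASCII range.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\p{ASCII}]", "");
    }

    // Same, then collapses any run of two or more spaces into one.
    static String stripAndCollapse(String s) {
        return stripNonAscii(s).replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        String input = "You can use C in a circle \u00A9 instead of Copyright";
        // Two spaces remain where the copyright sign used to be.
        System.out.println(stripNonAscii(input));
        // Single space throughout.
        System.out.println(stripAndCollapse(input));
    }
}
```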

Bye.

toxa26 · Feb 9, 2005

Very simple.

String myUtfString = getMyStringSomehow();

byte [] asciiBytes = myUtfString.getBytes("US-ASCII");

Bytes are the ascii bytes

jeanlutrin · Feb 9, 2005

Very simple.

...
Bytes are the ascii bytes

"Bytes are the ascii bytes" doesn't mean very much. You seem to
imply that getBytes("US-ASCII") gives *only* the bytes of the
original String that correspond to US-ASCII characters (i.e. bytes
whose 8th bit is not set).

And if it's what you meant, it doesn't seem correct.

You can try it by yourself: (konitchiwa, "hi" in Japanese)

String test = "ab\u3053\u3093\u306B\u3061\u306F\u4E16cd";
System.out.println(new String(test.getBytes("US-ASCII")));

On my system, it gives "ab??????cd" and I suspect it's the same on
other systems

So I agree that "?" is an ASCII character, but it is not what the OP
asked.

See you soon,

Jean

tcripps · Aug 24, 2006

This works for me:

Code:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

import org.apache.log4j.Logger;

public class EncodingTest {
	
	/** logging support */
	private static final Logger log = Logger.getLogger(EncodingTest.class);
	
	public static void main(String[] argv) {
		
		String testString = "ab\u3053\u3093\u306B\u3061\u306F\u4E16cd";
		
		log.info(testString);
		
		String resultString = filterNonAscii(testString);
		
		log.info(resultString);
	}
	
	public static String filterNonAscii(String inString) {
		// Create the encoder and decoder for the character encoding
		Charset charset = Charset.forName("US-ASCII");
		CharsetDecoder decoder = charset.newDecoder();
		CharsetEncoder encoder = charset.newEncoder();
		// This line is the key to removing "unmappable" characters.
		encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
		String result = inString;

		try {
			// Convert a string to bytes in a ByteBuffer
			ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(inString));

			// Decode the bytes in the ByteBuffer into a CharBuffer, then convert it to a String.
			CharBuffer cbuf = decoder.decode(bbuf);
			result = cbuf.toString();
		} catch (CharacterCodingException cce) {
			String errorMessage = "Exception during character encoding/decoding: " + cce.getMessage();
			log.error(errorMessage, cce);
		}

		return result;	
	}
	
}
