Convert UTF-8 to ASCII

Vinay

Hi!

I'm trying to print a UTF-8 encoded string (called someString) such
that the output contains only US-ASCII characters. I'm doing the
following:

Charset charset = Charset.forName("US-ASCII");
CharsetEncoder encoder = charset.newEncoder();
ByteBuffer bb = null;

encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
try {
    CharBuffer cb = CharBuffer.wrap(someString);
    bb = encoder.encode(cb);
} catch (CharacterCodingException e) {
    e.printStackTrace();
}

CharBuffer cbb = bb.asCharBuffer();
return cbb.toString();

When I give this any string, all I get is a bunch of ????, so it's
probably unable to map any of the characters correctly. What is the
obviously wrong thing / missing step here?

TIA

Vinay
 
John C. Bollinger

Vinay said:
Hi!

I'm trying to print a UTF-8 encoded string (called someString) such
that the output contains only US-ASCII characters.

Do you realize that UTF-8 and US-ASCII are congruent for the entire
range of US-ASCII? And that Java chars are sixteen bits wide (not 8),
directly expressing Unicode code points for characters in the BMP?

I'm doing the
following:

Charset charset = Charset.forName("US-ASCII");
CharsetEncoder encoder = charset.newEncoder();
ByteBuffer bb = null;

encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
try {
    CharBuffer cb = CharBuffer.wrap(someString);
    bb = encoder.encode(cb);
} catch (CharacterCodingException e) {
    e.printStackTrace();
}

CharBuffer cbb = bb.asCharBuffer();
return cbb.toString();

There is a fundamental problem with your approach. Java chars, and thus
Java Strings, are based on the character model I described above. It is
not meaningful in Java to talk about a String as being encoded in UTF-8
or in US-ASCII -- character encodings / charsets are possible properties
of sequences of _bytes_ not sequences of characters. (That's their
whole point, in fact: to bridge between byte sequences and character
sequences.)

When I give this any string all I get is a bunch of ???? so its
probably unable to map any of the characters correctly. What is the
obviously wrong thing / missing step here?

There is at least one major technical problem in your code, which occurs
at the second-to-last line quoted above: if the encoding worked
correctly then each character will have been encoded as one byte in the
output ByteBuffer, but you use a CharBuffer view of the output to
construct your String result. This combines the first and second bytes,
third and fourth, etc. (chars are two bytes wide, remember). Unless you
have null bytes in your input, all of the resulting chars will be
outside the ASCII range. (And note that you are doing a *second*
conversion to bytes when you print / save the result, wherever you do that.)
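
The pairing effect described above can be reproduced with a minimal sketch. It assumes Java 7 or later for StandardCharsets, and the class and method names (CharBufferViewDemo, viaCharBufferView, viaDecode) are made up for illustration; they are not from the original post.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CharBufferViewDemo {

    // Reinterprets the encoded bytes pairwise as 16-bit chars,
    // which is what asCharBuffer() does (the mistake discussed above).
    static String viaCharBufferView(String s) {
        ByteBuffer bb = StandardCharsets.US_ASCII.encode(s);
        return bb.asCharBuffer().toString();
    }

    // Decodes the encoded bytes back into chars using the same charset.
    static String viaDecode(String s) {
        ByteBuffer bb = StandardCharsets.US_ASCII.encode(s);
        return StandardCharsets.US_ASCII.decode(bb).toString();
    }

    public static void main(String[] args) {
        String wrong = viaCharBufferView("abcd");
        // Four one-byte characters collapse into two chars.
        System.out.println(wrong.length());                        // 2
        // The first char is built from the bytes 0x61 and 0x62.
        System.out.println(Integer.toHexString(wrong.charAt(0)));  // 6162
        System.out.println(viaDecode("abcd"));                     // abcd
    }
}
```

The char 0x6162 is far outside the ASCII range, which would explain the ???? when such a string is printed to a console that cannot show it.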

It is very important that you learn to distinguish between characters
and character sequences on one hand, and encoded characters and encoded
character sequences on the other. The former, represented in Java by
the char primitive and the Character, String, StringBuffer, etc. classes,
are what you should always use inside your applications to handle
character data. The latter, represented in Java by byte arrays and
related classes, are what you should use to communicate character data
with outside entities (files, remote hosts, etc.). You must use an
appropriate charset to convert between the two.
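
As a rough illustration of that distinction, the sketch below encodes a single character with different charsets and then decodes the bytes with the wrong one. It assumes Java 7 or later for StandardCharsets; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class EncodingLengthsDemo {

    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Decodes UTF-8 bytes with a different charset, to show that the
    // charset used for decoding has to match the one used for encoding.
    static String misdecodeAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String copyright = "\u00A9"; // the copyright sign, one char in a String
        System.out.println(copyright.length());                                      // 1
        System.out.println(utf8Length(copyright));                                   // 2
        System.out.println(copyright.getBytes(StandardCharsets.ISO_8859_1).length);  // 1
        System.out.println(misdecodeAsLatin1(copyright).length());                   // 2
    }
}
```

The String itself has no encoding; only the byte arrays produced from it do.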

You should also recognize that it is rare to need to use CharsetEncoder
directly to encode character data. Most often the appropriate approach
is to use String.getBytes() or an OutputStreamWriter. The latter is
considerably more general.
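
A minimal sketch of the two approaches just mentioned, assuming Java 8 or later (StandardCharsets, try-with-resources, UncheckedIOException). The class name AsciiWriterDemo and the method writeAscii are made up for this example. Note that both approaches substitute '?' for characters that US-ASCII cannot represent rather than dropping them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class AsciiWriterDemo {

    // Encodes text to US-ASCII bytes by writing it through an OutputStreamWriter.
    // An in-memory stream stands in for a file or socket here.
    static byte[] writeAscii(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.US_ASCII)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "C in a circle \u00A9";
        byte[] viaWriter = writeAscii(text);
        byte[] viaGetBytes = text.getBytes(StandardCharsets.US_ASCII);
        // Both routes produce the same bytes.
        System.out.println(Arrays.equals(viaWriter, viaGetBytes));
        // The unmappable copyright sign has been replaced by '?'.
        System.out.println(new String(viaWriter, StandardCharsets.US_ASCII));
    }
}
```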
 
Daniel Tryba

Vinay said:
I'm trying to print a UTF-8 encoded string (called someString) such
that the output contains only US-ASCII characters.

Well, John's posting tells it all...

But what are you actually trying to accomplish? If you want to encode
any string in a 7-bit-clean way (without losing the actual (e.g. Unicode)
characters) you could use UTF-7, base64 or uuencode.

If you want to get rid of the non-ASCII chars, replacing the chars with
values greater than 127 with some other char (? might be a good one) is
your only option.
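
Both options can be sketched as follows. This assumes Java 8 or later for java.util.Base64, uses base64 over the UTF-8 bytes as one possible 7-bit-clean form, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class SevenBitDemo {

    // Lossless: encode the UTF-8 bytes as base64, which uses only ASCII characters.
    static String toBase64(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses toBase64 and recovers the original characters.
    static String fromBase64(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }

    // Lossy: replace every char with a value greater than 127 by '?'.
    static String replaceNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c > 127 ? '?' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "circle \u00A9";
        String encoded = toBase64(text);
        System.out.println(encoded);
        System.out.println(fromBase64(encoded).equals(text)); // true
        System.out.println(replaceNonAscii(text));            // circle ?
    }
}
```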
 
Vinay

So basically I have a string like:

"You can use C in a circle © instead of Copyright"

and I would like to strip out all non-ASCII characters from that
string. I don't want to replace them with ? So the output I'm expecting
is:

"You can use C in a circle instead of Copyright"
 
Real Gagnon

"You can use C in a circle © instead of Copyright"
and I would like to strip out all non-ASCII characters from that
string. I don't want to replace them with ? So the output I'm expecting
is:

"You can use C in a circle instead of Copyright"

Try something like this:

public static String formatString(String s) {
    return s.replaceAll("[^\\p{ASCII}]", "");
}
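
One detail worth checking with this regex: removing the © from the sample sentence leaves the two spaces that surrounded it, whereas the expected output quoted above has a single space. The sketch below shows the difference; collapsing runs of spaces afterwards is an assumption about what is wanted, and the names StripDemo, stripNonAscii and stripAndCollapse are made up for illustration.

```java
final class StripDemo {

    // Removes every character outside the ASCII range.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\p{ASCII}]", "");
    }

    // Same, then collapses any run of two or more spaces into one
    // (an assumption about the desired whitespace handling).
    static String stripAndCollapse(String s) {
        return stripNonAscii(s).replaceAll(" {2,}", " ");
    }

    public static void main(String[] args) {
        String input = "You can use C in a circle \u00A9 instead of Copyright";
        // Two spaces remain between "circle" and "instead".
        System.out.println(stripNonAscii(input));
        // A single space remains.
        System.out.println(stripAndCollapse(input));
    }
}
```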

Bye.
 
toxa26

Very simple.

String myUtfString = getMyStringSomehow();

byte [] asciiBytes = myUtfString.getBytes("US-ASCII");

Bytes are the ascii bytes
 
jeanlutrin

Very simple.
...
Bytes are the ascii bytes

"Bytes are the ascii bytes" doesn't mean very much. You seem to
imply that getBytes("US-ASCII") gives *only* the bytes of the
original String that correspond to US-ASCII characters (i.e. bytes
whose 8th bit is not set).

If that's what you meant, it isn't correct: unmappable characters are
replaced, not dropped.

You can try it yourself (the escapes spell konnichiwa, "hi" in
Japanese, followed by one more kanji):

String test = "ab\u3053\u3093\u306B\u3061\u306F\u4E16cd";
System.out.println(new String(test.getBytes("US-ASCII")));

On my system, it gives "ab??????cd" and I suspect it's the same on
other systems ;)

So I agree that "?" is an ASCII character, but it is not what the OP
asked.

See you soon,

Jean
 
This works for me:

Code:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

import org.apache.log4j.Logger;

public class EncodingTest {
	
	/** logging support */
	private static final Logger log = Logger.getLogger(EncodingTest.class);
	
	public static void main(String[] argv) {
		
		String testString = "ab\u3053\u3093\u306B\u3061\u306F\u4E16cd";
		
		log.info(testString);
		
		String resultString = filterNonAscii(testString);
		
		log.info(resultString);
	}
	
	public static String filterNonAscii(String inString) {
		// Create the encoder and decoder for the character encoding
		Charset charset = Charset.forName("US-ASCII");
		CharsetDecoder decoder = charset.newDecoder();
		CharsetEncoder encoder = charset.newEncoder();
		// This line is the key to removing "unmappable" characters.
		encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
		String result = inString;

		try {
			// Convert a string to bytes in a ByteBuffer
			ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(inString));

			// Convert bytes in a ByteBuffer to a CharBuffer and then to a string.
			CharBuffer cbuf = decoder.decode(bbuf);
			result = cbuf.toString();
		} catch (CharacterCodingException cce) {
			String errorMessage = "Exception during character encoding/decoding: " + cce.getMessage();
			log.error(errorMessage, cce);
		}

		return result;	
	}
	
}
 
