Authenticating a UTF-8, I18N field in Struts using regular expressions

Roedy Green

I think in practice, they precompute this stuff and just hardcode a
table into the source code (i.e. 0x00 to 0x7F -> 1 byte, 0x80 to 0x7FF -> 2
bytes, etc.)
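
In code, such a table might look something like this (a sketch only;
utf8ByteCount is an invented name, and the boundaries are just the standard
UTF-8 ranges):

// Illustrative only: the kind of hard-coded range table being described.
static int utf8ByteCount( int codePoint )
{
    if ( codePoint <= 0x7F )   return 1;  // 0x00    .. 0x7F
    if ( codePoint <= 0x7FF )  return 2;  // 0x80    .. 0x7FF
    if ( codePoint <= 0xFFFF ) return 3;  // 0x800   .. 0xFFFF
    return 4;                             // 0x10000 .. 0x10FFFF
}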

I had written the code for all this and forgot I had done it.

See http://mindprod.com/jgloss/utf.html

It shows you how to compute UTF-7, UTF-8 and UTF-16.

The code is a lot simpler than the explanation.
 
Chris Uppal

Oliver said:
UTF-16 is variable-length as well, and NOT backwards compatible with
ASCII. I'm not all that familiar with this encoding, so I may be mistaken
here, but apparently it uses 2 bytes to encode all the characters whose
Unicode numbers fall between 0x000000 and 0x00FFFF, and 4 bytes for the
rest (I don't see how this is possible, and couldn't find a reference
implementation).

UTF-16 is described fully in the Unicode standard. Mostly in sections 2.5, 3.8
through 3.10, and parts 2 and 3 of Appendix C. The Unicode standard can be
downloaded (free) from:
http://www.unicode.org/standard/standard.html
Reference implementations of various recommended algorithms (including
transcoding) are also available there. You can also buy the standard as a
(huge!) book -- very cheaply considering that it's almost 1600 large pages of
high-quality printing, is a narrowly technical computer book, /and/ is a
standards document...

Java uses UTF-16 internally. A Java "char" is 16 bits, so it handles
the first case (requiring 2 bytes) just fine. The problem is when you try
to represent a character that requires 4 bytes under UTF-16.

It would be better not to say that Java uses UTF-16 "internally" -- that would
imply that the hidden implementation of String data was as a UTF-16
byte-sequence, which might or might not be the case. In fact the internal
implementation isn't available through any defined interfaces, so it hardly
matters to anyone except JVM implementors. It would be better to say that Java
uses UTF-16, no more than that -- a Java String is a sequence of 16-bit values
that encode a sequence of abstract Unicode characters using the UTF-16 encoding
form (if I've got the terminology right[*]). That is to say that what Java
calls a "char" has little or nothing to do with the Unicode notion of a
character (although the correspondence is close enough to be misleading). And
in particular, it means that there is no constant-time operation for finding
the nth character in a String, since UTF-16 is not a fixed-width encoding.
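
To make that concrete, here is a small sketch (it assumes a Java 5 or later
runtime, where the code-point methods exist; the class name is invented):

public class SurrogateDemo
{
    public static void main( String[] args )
    {
        // U+1D50A lies outside the 16-bit range, so Java stores it as the
        // surrogate pair \uD835\uDD0A -- two chars for one character.
        String s = "A\uD835\uDD0A";
        System.out.println( s.length() );                        // 3: counts 16-bit char values
        System.out.println( s.codePointCount( 0, s.length() ) ); // 2: counts Unicode code points
        System.out.println( Integer.toHexString( s.codePointAt( 1 ) ) ); // 1d50a
    }
}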

-- chris

([*] I probably haven't)
 
Oliver Wong

Roedy Green said:
I think you are using the term "unary" in an unusual way. Could you
please define it.

To encode the number N in the "unary" referred to above is to emit the
bit "1" N times, followed by a "0" which acts as a sort of "end of string"
marker. So 2 in this form of unary is: "110", and 5 would be "111110". The
end of string marker is necessary because more data may follow this "unary
string", and that data might start with a 1.

- Oliver
 
Oliver Wong

Roedy Green said:
I had written the code for all this and forgot I had done it.

See http://mindprod.com/jgloss/utf.html

It shows you how to compute UTF-7, UTF-8 and UTF-16.

The code is a lot simpler than the explanation.

Except that it fails for characters outside the basic multilingual
plane. I think your implementation is essentially the same as Java's
original implementation (i.e. using hard coded tables to determine what the
"unary header" should be) before they added the "hack" involving integers.

- Oliver
 
Oliver Wong

Chris Uppal said:
UTF-16 is described fully in the Unicode standard. Mostly in sections 2.5,
3.8 through 3.10, and parts 2 and 3 of Appendix C. The Unicode standard can
be downloaded (free) from:
http://www.unicode.org/standard/standard.html
Reference implementations of various recommended algorithms (including
transcoding) are also available there. You can also buy the standard as a
(huge!) book -- very cheaply considering that it's almost 1600 large pages
of high-quality printing, is a narrowly technical computer book, /and/ is a
standards document...

I'll take a look at the site later when I have more free time, but
here's my basic argument for the apparent impossibility of UTF-16's encoding
scheme.

There are 2^16 numbers between 0x000000 and 0x00FFFF inclusive, so if
you want to be able to distinguish between all these characters, it's
impossible to use fewer than 16 bits per character.

UTF-16 claims to be able to use only 16 bits to represent all the
Unicode characters whose codepoints lie between 0x000000 and 0x00FFFF
inclusive. So far so good.

But then UTF-16 says it can also represent characters in the 0x010000 to
0x10FFFD range using 4 bytes.

So my question is, when you see a 4 byte sequence, how can you know if
it is a pair of characters in the 0x000000 to 0x00FFFF range, or if it is a
single character in the 0x010000 to 0x10FFFD range?

- Oliver
 
Roedy Green

First encode the number of bytes needed
to represent the character in unary.

so for example, let's say I am trying to encode the number 0xff.

In unary that would be a string of 255 1s (possibly followed by a 0
marker).

That would require 255/8, roughly 32 bytes.

Yet you get the answer 2 for that same calculation.

At any rate, here is my code for doing UTF-8

// encode 16-bit Unicode into UTF-8
void putwchar( char c )
{
    if ( c < 0x80 )
    {
        // 7-bits done in one byte.
        putByte ( c );
    }
    else if ( c < 0x800 )
    {
        // 8-11 bits done in 2 bytes
        putByte ( 0xC0 | c >> 6 );
        putByte ( 0x80 | c & 0x3F );
    }
    else if ( c < 0x10000 )
    {
        // 12-16 bits done in 3 bytes
        putByte ( 0xE0 | c >> 12 );
        putByte ( 0x80 | c >> 6 & 0x3F );
        putByte ( 0x80 | c & 0x3F );
    }
    else if ( c < 0x200000 )
    {
        // 17-21 bits done in 4 bytes (unreachable with a 16-bit char argument)
        putByte ( 0xF0 | c >> 18 );
        putByte ( 0x80 | c >> 12 & 0x3F );
        putByte ( 0x80 | c >> 6 & 0x3F );
        putByte ( 0x80 | c & 0x3F );
    }
}
 
Roedy Green

Except that it fails for characters outside the basic multilingual
plane. I think your implementation is essentially the same as Java's
original implementation (i.e. using hard coded tables to determine what the
"unary header" should be) before they added the "hack" involving integers.


The UTF-16 code I posted handles code points above 0xffff, but the others
do not.
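
For completeness, here is a sketch of the same UTF-8 logic taking an int
code point so that the 4-byte branch is actually reachable. It assumes the
same putByte helper as the code above, and real code would also reject the
surrogate values 0xD800-0xDFFF and anything above 0x10FFFF:

// Sketch only: UTF-8 for a full code point in the range 0x0000..0x10FFFF.
void putUtf8( int c )
{
    if ( c < 0x80 )
    {
        putByte ( c );                    // 1 byte: 0xxxxxxx
    }
    else if ( c < 0x800 )
    {
        putByte ( 0xC0 | c >> 6 );        // 2 bytes: 110xxxxx 10xxxxxx
        putByte ( 0x80 | c & 0x3F );
    }
    else if ( c < 0x10000 )
    {
        putByte ( 0xE0 | c >> 12 );       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        putByte ( 0x80 | c >> 6 & 0x3F );
        putByte ( 0x80 | c & 0x3F );
    }
    else
    {
        putByte ( 0xF0 | c >> 18 );       // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        putByte ( 0x80 | c >> 12 & 0x3F );
        putByte ( 0x80 | c >> 6 & 0x3F );
        putByte ( 0x80 | c & 0x3F );
    }
}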
 
Roedy Green

So my question is, when you see a 4 byte sequence, how can you know if
it is a pair of characters in the 0x000000 to 0x00FFFF range, or if it is a
single character in the 0x010000 to 0x10FFFD range?

Because they RESERVE themselves two banks of characters in the 16-bit
range for encoding high characters.


Here is my code for how UTF-16 encodes the high code points:

// encode 32-bit Unicode into UTF-16
void putwchar( int c )
{
    if ( c < 0x10000 )
    {
        // for 16-bit unicode, use 2-byte format
        int high = c >>> 8;
        int low = c & 0xff;
        putByte( high );
        putByte( low );
    }
    else
    {
        // for unicode above 0xffff use 4-byte format
        int d = c - 0x10000;
        int high10 = d >>> 10;
        int low10 = d & 0x3ff;
        int highSurrogate = 0xd800 | high10;
        int lowSurrogate = 0xdc00 | low10;

        int high = highSurrogate >>> 8;
        int low = highSurrogate & 0xff;
        putByte( high );
        putByte( low );

        high = lowSurrogate >>> 8;
        low = lowSurrogate & 0xff;
        putByte( high );
        putByte( low );
    }
}
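
As a worked example: for U+1D11E, 0x1D11E - 0x10000 = 0xD11E; the high 10
bits are 0x34 and the low 10 bits are 0x11E, so the surrogates are
0xd800 | 0x34 = 0xd834 and 0xdc00 | 0x11E = 0xdd1e, and the bytes written
are D8 34 DD 1E.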

I posted this at http://mindprod.com/jgloss/utf.html in a table.
It will be a bit scrambled here:

unicode                       UTF-16 bytes                           count
00000000 yyyyyyyy xxxxxxxx    yyyyyyyy xxxxxxxx                          2
000zzzzh yyyyyyyy xxxxxxxx    110110zz zzyyyyyy 110111yy xxxxxxxx        4

For numbers in the range 0x0000 to 0xffff, just encode them as they are in
16 bits.

For numbers in the range 0x10000 to 0x10FFFF, you have 21 bits to encode.
This is reduced to 20 bits by subtracting 0x10000. The high-order 10 bits
are encoded as a 16-bit value: the base 0xd800 plus those 10 bits. The
low-order 10 bits are encoded as a 16-bit value: the base 0xdc00 plus those
10 bits. The resulting pair of 16-bit values lie in the so-called high-half
zone or high surrogate area (0xd800-0xdbff) and the low-half zone or low
surrogate area (0xdc00-0xdfff). Characters with values greater than
0x10ffff cannot be encoded in UTF-16. The values 0xd800-0xdbff and
0xdc00-0xdfff are specifically reserved for use with UTF-16 for encoding
high characters, and don't have any characters assigned to them.

16-bit Unicode encoding comes in big-endian and little-endian variants,
with the endianness either marked or implied. UTF-16 is big-endian by
default and, when marked, begins with the byte order mark FE FF. UTF-16BE
is big-endian, unmarked. UTF-16LE is little-endian, unmarked. UTF-16 is
officially defined in Annex Q of ISO/IEC 10646-1. (Copies of ISO standards
are quite expensive.) It is also described in the Unicode Consortium's
Unicode Standard, as well as in the IETF's RFC 2781. The code above shows
how you would encode 32-bit Unicode to UTF-16.
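
For instance, the single character 'A' (U+0041) comes out as FE FF 00 41 in
BOM-marked big-endian UTF-16, as 00 41 in UTF-16BE, and as 41 00 in
UTF-16LE.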
 
Oliver Wong

Roedy Green said:
so for example, let's say I am trying to encode the number 0xff.

In unary that would be a string of 255 1s (possibly followed by a 0
marker).

That would require 255/8, roughly 32 bytes.

Yet you get the answer 2 for that same calculation.

I guess I should phrase it as "Encode, in unary, the number of bytes
needed to represent the character". The character 0xFF requires 2 bytes to
encode, so we output the value "2" in unary.
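
Following that through: "2" in unary is "110", which is exactly the leading
bits of the first byte of UTF-8's 2-byte form, so U+00FF comes out as
11000011 10111111, i.e. the bytes C3 BF.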

- Oliver
 
Chris Uppal

Oliver said:
There are 2^16 numbers between 0x000000 and 0x00FFFF inclusive, so if
you want to be able to distinguish between all these characters, it's
impossible to use fewer than 16 bits per character.

Ah ;-) What you are missing is that the entire range is /not/ used by
Unicode -- there is no use defined for any characters in the range 0xDC00
through 0xDFFF, nor in the range 0xD800 through 0xDBFF. Those ranges are known
as the low and high surrogate ranges -- but those names are misleading (as is
the wording of the Unicode standard itself in versions earlier than 4.0) in
that they suggest that those code points are part of Unicode -- they are not
and they never will be. Any occurrence of those code-points in Unicode text is
/illegal/. And that is /precisely/ to allow unambiguous parsing of UTF-16
data.

(The need for any such oddity is, of course, the result of the historical error
of failing to distinguish between Unicode text and binary representations of
that text -- specifically the assumption that Unicode characters would fall
into the 16-bit range. Fortunately the Unicode people discovered their error
in time to find a reasonably elegant hack to solve the problem, even though
that hack should never have been necessary in the first place. The Java people
failed to notice that Unicode had rejected 16bit-ness in time to fix the
language/JVM definition.)

The two ranges are each 1024 values wide, so using pairs of values from
those ranges an extra million code points (1024 x 1024 = 1,048,576) can be
encoded.

The general algorithm for decoding UTF-16 is (btw this is transcribed from
largely untested code, so it may contain errors):

high := next16BitInteger();

// check for illegal occurrence of "low surrogate"
if (isLowSurrogate(high))
    error("totally bogus code"); // MUST throw error here

// check for "high surrogate"
if (! isHighSurrogate(high))
    return high; // fully decoded already

low := next16BitInteger();

// a high surrogate must be followed by a low surrogate
if (! isLowSurrogate(low))
    error("unpaired high surrogate"); // MUST throw error here

u := ((high & 0x3C0) << 10) + 0x10000; // the four high "plane" bits, plus the implicit +1
y := (high & 0x3F) << 10;
x := low & 0x3FF;
code := u | y | x;

// check for attempts to circumvent security
if (code <= 0xFFFF)
    error("not shortest form"); // MUST throw error here

return code;

The checks (assuming I've coded them correctly) are /not/ optional according
to the Unicode standard.

-- chris
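
For comparison, here is a rough Java equivalent using the standard library's
surrogate helpers (available since Java 5); the method name and the error
handling are illustrative only:

// Sketch only: decode a String's UTF-16 code units into code points,
// with the mandatory validity checks, using the JDK's surrogate helpers.
static void dumpCodePoints( String s )
{
    for ( int i = 0; i < s.length(); i++ )
    {
        char high = s.charAt( i );
        if ( Character.isLowSurrogate( high ) )
            throw new IllegalArgumentException( "unpaired low surrogate" );
        if ( ! Character.isHighSurrogate( high ) )
        {
            // already a complete BMP code point
            System.out.println( Integer.toHexString( high ) );
            continue;
        }
        if ( i + 1 >= s.length() || ! Character.isLowSurrogate( s.charAt( i + 1 ) ) )
            throw new IllegalArgumentException( "unpaired high surrogate" );
        // combine the pair; always yields a value in 0x10000..0x10FFFF
        System.out.println( Integer.toHexString( Character.toCodePoint( high, s.charAt( ++i ) ) ) );
    }
}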
 
