Malformed UTF-8?

Ian Macdonald · Mar 11, 2005

Hello,

We have a commercial calendaring application at work that conveniently
offers a C API. I have wrapped this API in the form of
Ruby/CorporateTime.

Recently, we've started to see ArgumentError exceptions being thrown by
the library, as it discovers calendar events that it believes to contain
malformed UTF-8.

One such allegedly bad string is the following:

irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
ArgumentError: malformed UTF-8 character
        from (irb):1:in `unpack'
        from (irb):1

This is supposed to be Japanese. Can a Japanese reader please confirm
that this is, indeed, malformed UTF-8? I need to be sure that the bug
does not lie with Ruby before I get back to our calendar admin and tell
him to go and pester Oracle.

Thanks,

Ian
--
Ian Macdonald            | He who has the courage to laugh is almost
System Administrator     | as much a master of the world as he who is
(e-mail address removed) | ready to die. -- Giacomo Leopardi
http://www.caliban.org   |
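
When a string like this has to be reported upstream, it can help to attach the raw bytes in hex next to the exception message. A minimal sketch of that idea follows; the helper names hex_dump and try_decode_utf8 are made up for illustration and are not part of Ruby/CorporateTime or of the vendor's C API.

```ruby
# Hypothetical helper: render a string's raw bytes as space-separated hex.
def hex_dump(str)
  str.unpack("C*").map { |b| format("%02X", b) }.join(" ")
end

# Hypothetical helper: return the decoded code points, or log the raw
# bytes and return nil when unpack("U*") rejects the string.
def try_decode_utf8(str)
  str.unpack("U*")
rescue ArgumentError => e
  warn "#{e.message}: #{hex_dump(str)}"
  nil
end

suspect = "\032p\210\004n\306\271\310gY\002"
puts hex_dump(suspect)   # prints: 1A 70 88 04 6E C6 B9 C8 67 59 02
try_decode_utf8(suspect) # warns on stderr and returns nil
```

The hex form is easier to compare against what the calendar server claims to have stored than the octal escapes irb prints.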

Simon Strandgaard · Mar 11, 2005

The substring "\210\004" is invalid UTF-8.
In hex it is [0x88, 0x04].

0x88 has its uppermost bit set, so this is a two-byte sequence.
0x04 is not a valid continuation byte (its upper bit should have been 1).

Simon Strandgaard · Mar 11, 2005

Forget this explanation, it's wrong. (I misread my test case.)

0x88 is not a valid first byte for a sequence.
To be a valid first byte of a multi-byte sequence, the two uppermost bits must both be set.
0x88 (binary 10001000) has only the uppermost of those two bits set, which marks it as a continuation byte.
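
The bit test described above can be sketched in a few lines of Ruby. This is only an illustration: utf8_byte_kind is a made-up name, and it looks at the high bits alone, so it does not catch lead bytes that are structurally fine but never legal in UTF-8 (0xC0, 0xC1, 0xF5 to 0xF7).

```ruby
# Hypothetical helper: classify one byte by its high bits, following the
# UTF-8 bit layout. It checks structure only, not overlongs or surrogates.
def utf8_byte_kind(byte)
  if    byte < 0x80           then :ascii         # 0xxxxxxx
  elsif (byte & 0xC0) == 0x80 then :continuation  # 10xxxxxx
  elsif (byte & 0xE0) == 0xC0 then :lead2         # 110xxxxx
  elsif (byte & 0xF0) == 0xE0 then :lead3         # 1110xxxx
  elsif (byte & 0xF8) == 0xF0 then :lead4         # 11110xxx
  else                             :invalid       # 11111xxx
  end
end

sample = "\032p\210\004n\306\271\310gY\002"
sample.each_byte.with_index do |byte, i|
  printf("%2d  0x%02X  %08b  %s\n", i, byte, byte, utf8_byte_kind(byte))
end
```

For this string, offset 2 (0x88) comes out as a continuation byte with no lead byte in front of it. The table also shows 0xC8 at offset 7, a two-byte lead, followed by 0x67, which is plain ASCII rather than a continuation byte, so that pair is malformed as well.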

Nikolai Weibull · Mar 11, 2005

* Ian Macdonald (Mar 11, 2005 01:30):

irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
ArgumentError: malformed UTF-8 character
        from (irb):1:in `unpack'
        from (irb):1

utf8validate.rb:

--- cut here ---
#! /usr/bin/ruby -w

ARGV[0] =~ /^(
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*/x

if $~.end(0) != ARGV[0].length
  printf("malformed UTF-8 character starting at position %d in the input\n", $~.end(0))
  exit 1
end
--- cut here ---

and from zsh:

% utf8validate.rb $'p\210\004n\306\271\310gY\002'
malformed UTF-8 character starting at position 2 in the input
%

For your input, the \210 is the problem: this regex won't accept it. I
haven't verified the regular expression myself, so I'm not 100% sure it
is correct, but I'm guessing it is. Anyway, now you can tell where in
the data things blow up.
nikolai
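
On Ruby 1.9 or later (newer than the interpreter this thread was written against), strings carry an encoding and the same check can be sketched without a hand-written regex. The function name first_malformed_utf8_offset is made up for illustration. It relies on String#each_char yielding each stray byte as its own one-byte character and on String#valid_encoding? rejecting it.

```ruby
# Hypothetical helper: return the byte offset of the first byte that is not
# part of a well-formed UTF-8 sequence, or nil if the whole string is valid.
def first_malformed_utf8_offset(str)
  offset = 0
  # Work on a copy so the caller's string keeps its original encoding tag.
  str.dup.force_encoding(Encoding::UTF_8).each_char do |ch|
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

input = "\032p\210\004n\306\271\310gY\002"
if (pos = first_malformed_utf8_offset(input))
  printf("malformed UTF-8 character starting at position %d in the input\n", pos)
end
```

With the full string from the original post, including the leading \032, this reports position 2, the \210 byte.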

Ian Macdonald · Mar 11, 2005

My thanks to you and Simon. It's especially nice to see a formal
definition of UTF-8 encapsulated in your regex. I wasn't aware of the
formal definition until someone at work pointed me at this excellent
resource:

http://en.wikipedia.org/wiki/UTF-8

Ian
--
Ian Macdonald            | Arrakis teaches the attitude of the knife -
System Administrator     | chopping off what's incomplete and saying:
(e-mail address removed) | "Now it's complete because it's ended
http://www.caliban.org   | here." -- Muad'dib, "Dune"
