Malformed UTF-8?

Ian Macdonald

Hello,

We have a commercial calendaring application at work that conveniently
offers a C API. I have wrapped this API in the form of
Ruby/CorporateTime.

Recently, we've started to see ArgumentError exceptions being thrown by
the library, as it discovers calendar events that it believes to contain
malformed UTF-8.

One such allegedly bad string is the following:

irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
ArgumentError: malformed UTF-8 character
from (irb):1:in `unpack'
from (irb):1

This is supposed to be Japanese. Can a Japanese reader please confirm
that this is, indeed, malformed UTF-8? I need to be sure that the bug
does not lie with Ruby before I get back to our calendar admin and tell
him to go and pester Oracle.

Thanks,

Ian
--
Ian Macdonald | He who has the courage to laugh is almost
System Administrator | as much a master of the world as he who is
(e-mail address removed) | ready to die. -- Giacomo Leopardi
http://www.caliban.org |
|
 
Simon Strandgaard

The substring "\210\004" is invalid UTF-8.
In hex it's [0x88, 0x04].

0x88 has its uppermost bit set, so this is a dual-byte sequence.
0x04 is not a valid continuation byte (its upper bit should have been 1).
 
Simon Strandgaard

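Forget my previous explanation; it's wrong. (I misread my test case.)

0x88 is not a valid first byte for a sequence.
To be a valid first byte of a multi-byte sequence, a byte must have its
two uppermost bits set (11xxxxxx).
0x88 (binary 10001000) has only the top one of those two set, which is
the pattern of a continuation byte.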
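
The bit patterns described above can be checked directly. What follows is a minimal sketch, not anything from Ruby's standard library: `utf8_byte_role` is a hypothetical helper name introduced here for illustration, and it classifies a byte purely by its leading bits.

```ruby
# Hypothetical helper for illustration: classify one byte by its leading bits.
# It inspects only the bit pattern, so it does not reject overlong lead bytes
# (0xC0, 0xC1) or lead bytes above 0xF4.
def utf8_byte_role(byte)
  if    (byte & 0b1000_0000) == 0b0000_0000 then :ascii         # 0xxxxxxx
  elsif (byte & 0b1100_0000) == 0b1000_0000 then :continuation  # 10xxxxxx
  elsif (byte & 0b1110_0000) == 0b1100_0000 then :lead_2        # 110xxxxx
  elsif (byte & 0b1111_0000) == 0b1110_0000 then :lead_3        # 1110xxxx
  elsif (byte & 0b1111_1000) == 0b1111_0000 then :lead_4        # 11110xxx
  else :invalid                                                 # 11111xxx
  end
end

# The string from the original post, taken byte by byte.
bytes = "\032p\210\004n\306\271\310gY\002".bytes
roles = bytes.map { |b| utf8_byte_role(b) }

bytes.each_with_index do |b, i|
  printf("%2d  0x%02X  %s\n", i, b, roles[i])
end
```

In this listing the byte at index 2 (0x88) comes out as a continuation byte even though the two bytes before it are plain ASCII, so it has no lead byte to continue.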
 
Nikolai Weibull

* Ian Macdonald (Mar 11, 2005 01:30):
irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
ArgumentError: malformed UTF-8 character
from (irb):1:in `unpack'
from (irb):1

utf8validate.rb:

--- cut here ---
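#! /usr/bin/ruby -w

# Work on raw bytes. String#b needs Ruby 2.0 or later; on the Ruby 1.8 of
# this thread, strings were already plain byte sequences.
input = ARGV[0].b

# Match the longest well-formed UTF-8 prefix. \A anchors at the start of the
# string (^ would also match after a newline), and the n flag makes the
# pattern byte-oriented.
input =~ /\A(?:
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*/nx

# The match always succeeds (possibly with an empty prefix), so $~ is set.
if $~.end(0) != input.length
  printf("malformed UTF-8 character starting at position %d in the input\n", $~.end(0))
  exit 1
end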
--- cut here ---

and from zsh:

% utf8validate.rb $'\032p\210\004n\306\271\310gY\002'
malformed UTF-8 character starting at position 2 in the input
%

For your input, the \210 is wrong: the regex won't allow a byte in the
range \x80-\xBF at the start of a character. I'm not 100% sure that the
regular expression itself is correct, as I haven't verified it, but I'm
guessing it is. Anyway, now you can tell where in the data things blow up,
nikolai
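
The script above stops at the first malformed byte. As a rough sketch of how every offending byte could be located, the following assumes a Ruby newer than the one in use when this thread was written (1.9 or later, for `force_encoding` and `valid_encoding?`). The name `invalid_utf8_offsets` is a hypothetical helper introduced here, not an existing method.

```ruby
# Hypothetical helper: return the byte offsets at which a string stops being
# well-formed UTF-8. It relies on String#each_char yielding each malformed
# byte as its own one-byte "character" whose encoding is not valid.
def invalid_utf8_offsets(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  offsets = []
  pos = 0
  text.each_char do |ch|
    offsets << pos unless ch.valid_encoding?
    pos += ch.bytesize  # advance by the byte length, not by one character
  end
  offsets
end

sample = "\032p\210\004n\306\271\310gY\002"
offsets = invalid_utf8_offsets(sample)
p offsets  # => [2, 7]
```

For the string from the original post this flags offset 2 (the stray \210) and also offset 7, where the lead byte \310 is followed by the ASCII letter g instead of a continuation byte.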
 
Ian Macdonald

My thanks to you and Simon. It's especially nice to see a formal
definition of UTF-8 encapsulated in your regex. I wasn't aware of the
formal definition until someone at work pointed me at this excellent
resource:

http://en.wikipedia.org/wiki/UTF-8

Ian
--
Ian Macdonald | Arrakis teaches the attitude of the knife -
System Administrator | chopping off what's incomplete and saying:
(e-mail address removed) | "Now it's complete because it's ended
http://www.caliban.org | here." -- Muad'dib, "Dune"
|
 
