Byte–stream parsing in Ruby


Elliott Cable

So, I’ve a problem. I’m using ncurses (or possibly not, might just
`STDIN.read(1)` or something, we’ll see) to grab byte–level input from
the terminal. Purpose being to catch and handle control characters in a
text mode application, such as “meta–3” or “control–c.”

Currently, I have a really ugly method that manually parses UTF-8 and
ASCII directly in my Ruby source; however, this is extremely slow, and
seems quite a bit like overkill. After all, with 1.9’s wonderfully
robust `Encoding` support, it seems silly to duplicate all that
byte–parsing work that *must* be going on somewhere in Ruby already.

Here’s my current method (forgive the horrendous code, please! I fully
intended to get rid of it right from the start, so…):
http://github.com/elliottcable/nfoi...141912053fe5ae6/lib/nfoiled/window.rb#L80-175

The goal is to devise some method by which I can:

1) Determine whether or not an `Array` of so–far–received bytes is, yet,
a valid `String` of a given `Encoding` (I can get the intended input
`Encoding` by way of a simple `Encoding.find(:locale)`, so we’re always
in–the–know as to which `Encoding` the incoming bytes are intended to
be)
2) Once we know the `Array` instance containing the relevant bytes
pertains to a valid `String`, convert that into a `String` and further
store/cache/process it in some way.

Yes, this means that the `String` will almost always be one character
long; I am uninterested in parsing lengths of characters out of the
input stream, I can deal with that later. At the moment, I very simply
want to ensure that I can retrieve, in real time, the latest character
entered at the terminal, as a `String`, in any `Encoding`.

Any help would be much appreciated; I’ve been banging my head against
this on–and–off for weeks! (-:
 

Brian Candler

Elliott said:
The goal is to devise some method by which I can:

1) Determine whether or not an `Array` of so–far–received bytes is, yet,
a valid `String` of a given `Encoding`
"über".bytes.to_a => [195, 188, 98, 101, 114]
a = "\xc3".force_encoding("UTF-8") => "\xC3"
a.valid_encoding? => false
a << "\xbc" => "ü"
a.valid_encoding? => true
 

Elliott Cable

Brian said:
Elliott said:
The goal is to devise some method by which I can:

1) Determine whether or not an `Array` of so–far–received bytes is, yet,
a valid `String` of a given `Encoding`
"über".bytes.to_a => [195, 188, 98, 101, 114]
a = "\xc3".force_encoding("UTF-8") => "\xC3"
a.valid_encoding? => false
a << "\xbc" => "ü"
a.valid_encoding? => true

Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(`Fixnum`) bytes onto the string?
 

7stud --

Elliott said:
Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(`Fixnum`) bytes onto the string?

hex_str = "\\x%x" % 195
puts hex_str

--output:--
\xc3
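
Note that the format string above yields the four literal characters `\xc3`, not the single byte 0xC3. A minimal sketch of turning Integer bytes into a String, assuming `Array#pack` is acceptable for the job (the helper name `bytes_to_string` is hypothetical):

```ruby
# Hypothetical helper: build a String from an Array of Integer bytes.
# pack("C*") writes each Integer as one unsigned 8-bit byte and returns a
# binary (ASCII-8BIT) String; force_encoding only relabels those bytes.
def bytes_to_string(bytes, encoding = Encoding::UTF_8)
  bytes.pack("C*").force_encoding(encoding)
end

complete = bytes_to_string([195, 188])   # both bytes of "ü"
partial  = bytes_to_string([195])        # lead byte only

p complete.valid_encoding?        # => true
p partial.valid_encoding?         # => false
p(("\\x%x" % 195).bytesize)       # => 4, the literal text \xc3 rather than one byte
```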
 

Brian Candler

Elliott said:
Hrm, #valid_encoding? is very helpful. But how can I stuff numerical
(`Fixnum`) bytes onto the string?

If you're doing STDIN.read(1), you get a String. Just use << to
concatenate, or the 2-argument form of read() where you supply a buffer
to append to.

If you are forced to use Fixnum, then try Integer#chr:
=> "\xFF"

Warning: you have to deal with all the (undocumented) ruby-1.9 encoding
stupidity. Have fun guessing the behaviour of each of the methods. e.g.
=> #<Encoding:ASCII-8BIT>

Surprised that the encoding changed? This means that:
=> true

until you do:
=> false

Now have a guess what happens if you try to append another byte. Go on.
Encoding::CompatibilityError: incompatible character encodings: UTF-8
and ASCII-8BIT
from (irb):24

Haha, fooled you. You thought it was safe to append a non-UTF8 character
to a UTF8 string (after all, you did before quite happily), but this
time you get an exception. So now you have to do:
=> false

This is why I hate ruby 1.9.

Regards,

Brian.

P.S. The above example was with ruby 1.9.2 r23158 under Linux with UTF8
locale. Behaviour may or may not be different with other 1.9.x versions
and/or under different locale settings.
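
The flip-flop described above can be reproduced along the following lines. This is only a sketch, not the original irb session, and it assumes a Ruby where `Integer#chr` returns an ASCII-8BIT String for values 128..255; `try_append` is a hypothetical helper.

```ruby
# Hypothetical helper: append one Integer byte and report whether Ruby
# accepted the concatenation.
def try_append(str, byte)
  str << byte.chr            # Integer#chr gives a binary String for 128..255
  :ok
rescue Encoding::CompatibilityError
  :incompatible
end

str = "hello ".dup                           # UTF-8, ASCII-only so far
[0xE2, 0x98, 0x83].each { |b| try_append(str, b) }
p str.encoding == Encoding::BINARY           # => true, the label changed on append

str.force_encoding(Encoding::UTF_8)          # relabel so validity can be checked
p str.valid_encoding?                        # => true
p try_append(str, 0xFF)                      # => :incompatible
```

The first appends succeed because the receiver is still pure ASCII; once it holds non-ASCII UTF-8 content, a non-ASCII binary String is no longer considered compatible.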
 

Brian Candler

Incidentally, I needed to do something similar in ruby-1.8 recently, and
it was very straightforward.

require 'iconv'

def is_utf8?(str)
  Iconv.iconv('UTF-8', 'UTF-8', str)
  true
rescue Iconv::IllegalSequence
  false
end
 

Eric Hodel

Haha, fooled you. You thought it was safe to append a non-UTF8 character
to a UTF8 string (after all, you did before quite happily), but this
time you get an exception. So now you have to do:

=> false

This is why I hate ruby 1.9.

I don't think that's a valid UTF-8 byte sequence...
Incidentally, I needed to do something similar in ruby-1.8 recently, and
it was very straightforward.

def is_utf8?(str)
  Iconv.iconv('UTF-8','UTF-8',str)
  true
rescue Iconv::IllegalSequence
  false
end

Oh, I see there's another tool; let's try it!

$ cat conv.rb
str = "\xFF\xFA"

require 'iconv'

converted = Iconv.iconv 'UTF-8', 'UTF-8', str

puts converted
$ ruby -v conv.rb
ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0]
conv.rb:6:in `iconv': "\377\372" (Iconv::IllegalSequence)
from conv.rb:6

Ok, so it's not valid. Let's get a valid byte sequence...

$ cat conv.rb
str = "\xE2\x98\x83"

require 'iconv'

converted = Iconv.iconv 'UTF-8', 'UTF-8', str

puts converted
$ ruby conv.rb
☃

Ok, so that works!

Now let's use 1.9's built-in encoding stuff with our valid byte
sequence:

$ cat conv.rb
# encoding: utf-8
str = "hello "
p :encoding => str.encoding
str << 0xE2.chr
str << 0x98.chr
str << 0x83.chr

puts str
$ ruby19 conv.rb
{:encoding=>#<Encoding:UTF-8>}
hello ☃

huh, it worked fine.

So you're mad that Ruby doesn't let you shoot yourself in the foot?
 

Brian Candler

Eric said:
I don't think that's a valid UTF-8 byte sequence...

That's the whole point. The OP wanted to append bytes to a string, and
detect whether the resulting string was a valid set of complete UTF-8
codepoints, or whether it was necessary to wait for more byte(s) for it
to become complete.

Ruby 1.9's valid_encoding? method seems to do that for you - except that
all the automagical and undocumented mutation of Strings gets in the
way. Sometimes, ruby lets you concatenate an arbitrary byte to a UTF-8
string without an exception; sometimes it does not. It appears this is
something to do with the concept of "compatible encodings".
Now let's use 1.9's built-in encoding stuff with our valid byte
sequence:

$ cat conv.rb
# encoding: utf-8
str = "hello "
p :encoding => str.encoding
str << 0xE2.chr
str << 0x98.chr
str << 0x83.chr

puts str
$ ruby19 conv.rb
{:encoding=>#<Encoding:UTF-8>}
hello ☃

huh, it worked fine.

Yes, but you forgot to add another

p :encoding => str.encoding

to the end. This shows that the string's encoding has magically mutated
without a by-your-leave.

So now to test whether the encoding is valid or not, you have to mutate
the string back again:

str.force_encoding("UTF-8")
puts "is valid" if str.valid_encoding?

OK, then what happens if you concatenate another byte?

str << 0xFF.chr # boom

Argh, you need to mutate it back to ASCII-8BIT first.
So you're mad that Ruby doesn't let you shoot yourself in the foot?

I'm mad that Ruby has behaviour which is (a) undocumented, and (b) IMO
just plain stupid, and you have to expend ridiculous effort both to
understand it and to work around it.

I'm actually attempting to document it in my spare time, in the form of
a Test::Unit script. It looks like I'm going to have over 200
assertions. This is time I should probably have spent migrating code to
Erlang - which incidentally has a very sensible proposal for Unicode
handling.

Thank goodness for those people maintaining 1.8.6 and related forks like
Ruby Enterprise Edition.

Regards,

Brian.
 

James Gray

That's the whole point. The OP wanted to append bytes to a string, and
detect whether the resulting string was a valid set of complete UTF-8
codepoints, or whether it was necessary to wait for more byte(s) for it
to become complete.

Ruby 1.9's valid_encoding? method seems to do that for you - except that
all the automagical and undocumented mutation of Strings gets in the
way.

I'm pretty sure I document all the behavior we've seen in this thread
(and much more), in this single article on my blog:

http://blog.grayproductions.net/articles/ruby_19s_string

I'm really not sure why you seem totally unwilling to count my
articles as a valid source of information after all this time. They
continually explain what you say is unexplained. I've asked you in
the past to list what they don't cover, but aside from the C API side
of things (which I admit I don't cover) you're just all out of
excuses. I assume you simply have no desire to read them. Fair
enough, but hopefully others do. I feel that means we should list
them as an available resource.

I'm not sure what "automagical" means in this context either, but I
don't feel it's a good description. I assume "auto" is for
"automatic." Is Ruby automatically changing the Encoding? I don't
think so. The programmer is asking Ruby to add two Strings with
different Encodings. Ruby could just say no, but in this case there
is a way it can be done, so it makes the choice, assuming that's what
you wanted.

I guess "magical" may just mean you don't understand what's happening
here. I do though, so there's certainly a process we can break down
and understand.
Yes, but you forgot to add another

p :encoding => str.encoding

to the end. This shows that the string's encoding has magically mutated
without a by-your-leave.

That's not true. You asked Ruby to combine those Strings of differing
content. You gave your permission.
So now to test whether the encoding is valid or not, you have to mutate
the string back again:

str.force_encoding("UTF-8")
puts "is valid" if str.valid_encoding?

OK, then what happens if you concatenate another byte?

str << 0xFF.chr # boom

Argh, you need to mutate it back to ASCII-8BIT first.

As always, you are just not explaining what these examples show. The
str variable contains some UTF-8 content. There is another String
involved here though and we should examine its Encoding:
=> #<Encoding:ASCII-8BIT>

So what you are really asking Ruby to do is to combine data in two
different Encodings. There is a way to do that here, thanks to Ruby's
concept of compatible Encodings. Given that, the conversion is made.
If you had wanted to keep that data in UTF-8, you should have added
more UTF-8 bytes to it:
0xFF.chr.force_encoding("UTF-8")).encoding
=> #<Encoding:UTF-8>

There's no magic here. It's a process. We can explain it. I have.

James Edward Gray II
 

Brian Candler

I've briefly read sections 8 to 11 again.

Where does it say that String#<< can now raise an exception, and under
what circumstances? Ah, I finally found it, right at the end of the
*comments* at the bottom of section 8, added a month after initial
publication. (+)

Where does it say that the encoding of a String can change when you
concatenate another string onto it?

By "undocumented" I mean: I expect to type "ri String#<<" and see an
accurate description of what String#<< does, including which
combinations of inputs are valid and which are not, and which attributes
of the String may mutate based on the input supplied.

Regards,

Brian.

(+) There is a warning in the string *comparisons* section saying that,
basically, the rules are too complicated to understand, so you should
always ensure that two strings are in the same encoding before comparing
them. Arguably you could say the same applies to any other operation
which takes two strings.

But this to me shows the whole exercise is futile. If, in order to write
a valid program, you need to ensure that all strings are in the same
encoding, then there should be a global flag which sets the encoding. If
I cannot predict what will happen when string A (encoding X) encounters
string B (encoding Y), and I have to keep forcing the encodings to X,
then there's no benefit in having the capability for strings to carry
about their own encodings.

And in many apps, the encoding information is carried "out of band"
anyway: for example: in HTTP or MIME, the encoding info is in a
Content-Type: header.
 

Eric Hodel

If I cannot predict what will happen when string A (encoding X) encounters
string B (encoding Y), and I have to keep forcing the encodings to X,
then there's no benefit in having the capability for strings to carry
about their own encodings.

I think you have a misconception about what #force_encoding does. It
does not do any conversion. Use Encoding::Converter for that.

While #force_encoding does approximately what you want in the examples
you've shown (ASCII, binary data and UTF-8 encodings) it won't work
when you're reading one multibyte encoding (say, Shift-JIS from an IO)
and adding it to another multibyte encoding (say, a UTF-8 String).
You'll only end up with garbage if you don't use a converter.
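
To make the difference concrete, here is a small sketch that relabels and transcodes the same Shift_JIS bytes. It uses `String#encode` for the conversion rather than `Encoding::Converter` directly; the sample text and the helper names `relabel` and `transcode` are illustrative assumptions.

```ruby
# force_encoding relabels the same bytes; encode converts them.
# Both helpers are hypothetical names used only for this comparison.
def relabel(str, encoding)
  str.dup.force_encoding(encoding)
end

def transcode(str, encoding)
  str.encode(encoding)
end

sjis = "\u65E5\u672C".encode("Shift_JIS")   # "日本" as Shift_JIS bytes

wrong = relabel(sjis, "UTF-8")     # same bytes, wrong label
right = transcode(sjis, "UTF-8")   # different bytes, same characters

p wrong.bytes == sjis.bytes   # => true
p wrong.valid_encoding?       # => false
p right == "\u65E5\u672C"     # => true
```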

For 1.9, I don't think io.read(1) is correct. #getc is better since
it'll read what you want:

$ cat file
π
$ irb19
irb(main):001:0> open 'file' do |io| p io.getc end
"π"
=> "π"
irb(main):002:0> open 'file' do |io| io.set_encoding 'binary'; p io.getc end
"\xCF"
=> "\xCF"

Even for control characters:

$ ruby19 -e 'p $stdin.getc'
^I
"\t"
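
The same comparison can be run as a script instead of in irb. This is a sketch that assumes a temporary file stands in for the terminal; `first_char` is a hypothetical helper and the file contents are made up for the demonstration.

```ruby
require "tempfile"

# Hypothetical helper: return what IO#getc yields first when the file is
# opened with the given mode string (which carries the external encoding).
def first_char(path, mode)
  File.open(path, mode) { |io| io.getc }
end

Tempfile.create("getc-demo") do |tmp|
  tmp.binmode
  tmp.write("\xCF\x80\n".b)   # the two UTF-8 bytes of "π", then a newline
  tmp.flush

  utf8_char = first_char(tmp.path, "r:UTF-8")   # getc reads a whole character
  raw_byte  = first_char(tmp.path, "rb")        # binary mode: one byte at a time

  p utf8_char.bytes   # => [207, 128]
  p raw_byte.bytes    # => [207]
end
```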
$
 

James Gray

Where does it say that String#<< can now raise an exception, and under
what circumstances?

Quoting from the page I linked to in my last message:
It's probably worth mentioning that it is possible for a transcoding
operation to fail with an error. For example:

$ cat transcode.rb
# encoding: UTF-8
utf8 = "Résumé…"
latin1 = utf8.encode("ISO-8859-1")
$ ruby transcode.rb
transcode.rb:3:in `encode': "\xE2\x80\xA6" from UTF-8 to ISO-8859-1
Ah, I finally found it, right at the end of the
*comments* at the bottom of section 8, added a month after initial
publication. (+)

What does how long it took me to write the content have to do with
anything? I added that comment to cover some items you had mentioned
I had overlooked. Now it's invalid because it took me a while???
Where does it say that the encoding of a String can change when you
concatenate another string onto it?

Quoting from the same page:

One thing that may help a little in normalizing your data is Ruby's
concept of compatible Encodings. Here's an example of checking and
taking advantage of compatible Encodings:

# data in two different Encodings
p ascii_my # >> "My "
puts ascii_my.encoding.name # >> US-ASCII
p utf8_resume # >> "Résumé"
puts utf8_resume.encoding.name # >> UTF-8
# check compatibility
p Encoding.compatible?(ascii_my, utf8_resume) # >> #<Encoding:UTF-8>
# combine compatible data
my_resume = ascii_my + utf8_resume
p my_resume # >> "My Résumé"
puts my_resume.encoding.name # >> UTF-8
In this example I had data in two different Encodings, US-ASCII and
UTF-8. I asked Ruby if the two pieces of data were compatible?(). Ruby
can respond to that question in one of two ways. If it returns false,
the data is not compatible and you will probably need to transcode at
least one piece of it to work with the other. If an Encoding is
returned, the data is compatible and can be concatenated resulting in
data with the returned Encoding. You can see how that played out when
I combined these Strings.
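
The quoted snippet does not show how its two Strings were built, so here is a self-contained sketch of the same check. The variable contents are assumptions, `concat_if_compatible` is a hypothetical helper, and note that in practice `Encoding.compatible?` signals incompatibility with `nil` rather than `false`.

```ruby
# Hypothetical helper: concatenate only when Ruby reports a compatible
# Encoding for the pair; otherwise return nil instead of raising.
def concat_if_compatible(a, b)
  Encoding.compatible?(a, b) ? a + b : nil
end

ascii_my    = "My ".encode("US-ASCII")
utf8_resume = "R\u00E9sum\u00E9"     # "Résumé", UTF-8
raw_byte    = 0xFF.chr               # one byte, ASCII-8BIT

p concat_if_compatible(ascii_my, utf8_resume).encoding == Encoding::UTF_8   # => true
p concat_if_compatible(utf8_resume, raw_byte)                               # => nil
p concat_if_compatible("hello ", raw_byte).encoding == Encoding::BINARY     # => true
```

The last line is the case argued about earlier in the thread: an ASCII-only UTF-8 String combined with a non-ASCII binary String yields binary data.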
(+) There is a warning in the string *comparisons* section saying that,
basically, the rules are too complicated to understand, so you should
always ensure that two strings are in the same encoding before comparing
them. Arguably you could say the same applies to any other operation
which takes two strings.

But this to me shows the whole exercise is futile.

But you should be doing the exact same thing in Ruby 1.8, which I
understand you believe to be a superior system. If you are going to
have two pieces of data interact, it just makes sense that they will
pretty much always need to be the same kinds of data.
If, in order to write a valid program, you need to ensure that all
strings are in the same encoding, then there should be a global flag
which sets the encoding.

Like -E and -U in Ruby 1.9?
And in many apps, the encoding information is carried "out of band"
anyway: for example: in HTTP or MIME, the encoding info is in a
Content-Type: header.

Yeah, that's why a global switch won't really save you from doing your
job. You need to read that header, and treat the content accordingly.

James Edward Gray II
 

Elliott Cable

Thanks to everybody involved here, I now have a great solution that
works really well. I also ended up using EventMachine to get the
individual bytes from the keyboard, it’s a lot more efficient. Here’s my
final solution, in case anybody’s interested:

require 'eventmachine'

module Handler
  def initialize
    @buffer = ""
  end

  def receive_data byte
    byte.force_encoding Encoding.find('locale')
    @buffer << byte
    check_buffer
  end

  private
  def check_buffer
    if @buffer.valid_encoding?
      p @buffer
      @buffer = ""
    end
  end
end

EM.run{ EM.open_keyboard Handler }
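
The same buffering logic can be exercised without EventMachine or a terminal. A minimal sketch, assuming the input encoding is passed in rather than looked up from the locale; `CharBuffer` and `feed` are hypothetical names and not part of EventMachine.

```ruby
# Hypothetical stand-in for the Handler above, minus EventMachine.
# Bytes accumulate in a binary buffer; a character is returned as soon as
# the buffered bytes form a valid String in the target encoding.
class CharBuffer
  def initialize(encoding = Encoding::UTF_8)
    @encoding = encoding
    @buffer = String.new   # binary, so any byte can be appended
  end

  # Feed one byte (a one-byte String). Returns the completed character,
  # or nil while more bytes are still needed.
  def feed(byte)
    @buffer << byte.b
    candidate = @buffer.dup.force_encoding(@encoding)
    return nil unless candidate.valid_encoding?
    @buffer = String.new
    candidate
  end
end

chars = CharBuffer.new
results = "a\u00FC".bytes.map { |b| chars.feed(b.chr) }
p results.map { |r| r && r.bytes }   # => [[97], nil, [195, 188]]
```

As in the handler above, a stray byte that can never begin a valid sequence leaves the buffer stuck; recovering from that is not handled here.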
 
