How can I parse binary files?

Fabio Vitale

I need to parse a binary file with the following structure.
How can I accomplish this in Ruby?

Header (36 bytes):
- Version (4 byte unsigned integer) currently 1
- UIDValidity (4 byte unsigned integer)
- UIDNext (4 byte unsigned integer)
- Last Write Counter (4 byte unsigned integer)
- the rest unused

Message data (36 bytes per message):
- Filename (23 bytes including terminating NUL character)
- Flags (1 byte bitmask)
- UID (4 byte unsigned integer)
- Message size (4 byte unsigned integer)
- Date (4 byte time_t value)

Flags mask is 1:Recent, 2:Draft, 4:Deleted, 8:Flagged, 16:Answered,
32:Seen.
 
Daniel Martin

Fabio Vitale said:
I've the need to parse a binary file with the following structure:
How can I accomplish this in Ruby?

In addition to parsing this yourself using ruby's String#unpack
method, you should also look at the BitStruct extension available at
http://redshift.sourceforge.net/bit-struct/

(And found via http://raa.ruby-lang.org/ by doing a search on
"binary")
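For the String#unpack route, here is a minimal sketch of decoding the 36-byte header. It fabricates a sample header in memory so it runs without a real file; the variable names are illustrative, and the 'L' directive assumes the file stores its integers in the byte order of the machine running the script.

```ruby
# Build a fake 36-byte header so the example runs without a real imap.mrk.
# 'L4' packs four native-endian 32-bit unsigned integers,
# 'x20' appends the 20 unused NUL bytes.
sample_header = [1, 1106138982, 5825, 9872].pack('L4x20')

# Unpack the four declared fields; the trailing unused bytes are ignored.
version, uid_validity, uid_next, last_write_counter = sample_header.unpack('L4')

puts "version=#{version} uid_validity=#{uid_validity} " \
     "uid_next=#{uid_next} last_write_counter=#{last_write_counter}"
```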

Am I the only one who thinks that ruby-forum.com should include in a
prominent place pointers to standard ruby documentation, and to the
Ruby Application Archive? I don't object to people posting to the
list via the web form at ruby-forum.com, but I think that a prominent
display of common sources of information would help everyone.
 
Fabio Vitale

Daniel said:
In addition to parsing this yourself using ruby's String#unpack
method, you should also look at the BitStruct extension available at
http://redshift.sourceforge.net/bit-struct/

I've found bit-struct very interesting, but I cannot figure out how to
load a binary file into a newly created bit-structure.
Any help appreciated.

Say I've an imap.mrk binary file,
I've defined class MRK as follows:

require 'bit-struct'

class MRK < BitStruct
unsigned :version, 4, "Version"
unsigned :uid_Validity, 4, "UIDValidity"
unsigned :uid_next, 4, "UIDNext"
unsigned :last_write_counter, 4, "LastWriteCounter"
rest :unused, "Unused"
end

mrk = MRK.new

And now: how to populate the mrk instance just created from the imap.mrk
binary file?


Thank you
 
ara.t.howard

I've found bit-struct very intresting, anyway I cannot figure how to
load a binary file in a newly created bit-structure.
Any help appreciated.

Say I've an imap.mrk binary file,
I've defined class MRK as follows:

require 'bit-struct'

class MRK < BitStruct
unsigned :version, 4, "Version"
unsigned :uid_Validity, 4, "UIDValidity"
unsigned :uid_next, 4, "UIDNext"
unsigned :last_write_counter, 4, "LastWriteCounter"
rest :unused, "Unused"
end

mrk = MRK.new

And now: how to populate the mrk instance just created from the imap.mrk
binary file?

without even looking at the docs i'd guess you could do

data = IO.read 'your.data'

mrk = MRK.new data


and, indeed, this seems to work:

harp:~ > cat a.rb
require 'bit-struct'

class C < BitStruct
unsigned :a, 16
unsigned :b, 16
unsigned :c, 16
end

c = C.new 'a' => 42

p c

buf = c.to_s

p buf

c = C.new buf

p c.a


harp:~ > ruby a.rb
#<C a=42, b=0, c=0>
"\000*\000\000\000\000"
42


incidentally, you are probably going to want

class MRK < BitStruct
unsigned :version, 32, "Version"
unsigned :uid_Validity, 32, "UIDValidity"
unsigned :uid_next, 32, "UIDNext"
unsigned :last_write_counter, 32, "LastWriteCounter"
rest :unused, "Unused"
end


the field size declares the number of __bits__ not __bytes__.


http://redshift.sourceforge.net/bit-struct/doc/index.html


regards.


-a
 
Simon Kröger

require 'bit-struct'

class MRK < BitStruct
unsigned :version, 4, "Version"
unsigned :uid_Validity, 4, "UIDValidity"
unsigned :uid_next, 4, "UIDNext"
unsigned :last_write_counter, 4, "LastWriteCounter"
rest :unused, "Unused"
end

mrk = MRK.new

And now: how to populate the mrk instance just created from the imap.mrk
binary file?

without even looking at the docs i'd guess you could do

data = IO.read 'your.data'

mrk = MRK.new data


and, indeed, this seems to work:

[snip]

This looks like a nice way.
I just wanted to show that in such a simple case unpack isn't that ugly, either.

open('file.bin', 'rb') do |f|
  version, uidValid, uidNext, lwCounter = f.read(36).unpack('IIII')
  name, flags, uid, size, date = f.read(36).unpack('Z23CIII')

  # do something
end

This is of course untested because i don't have such a file, but i hope
the idea is clear.

cheers

Simon
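To make the unpack idea testable without the real file, here is a sketch that reads the header and then loops over every 36-byte message record. The sample data is made up, a StringIO stands in for the opened file, and 'L' (always 32 bits) is used instead of 'I' on the assumption that the fields are exactly 4 bytes wide.

```ruby
require 'stringio'

HEADER_SIZE = 36
RECORD_SIZE = 36

# Fake file contents: one header followed by two message records.
data = [1, 1106138982, 5825, 9872].pack('L4x20')
data << ['md50000004286.msg', 48, 4150, 20732, 1135012715].pack('a23CL3')
data << ['md50000004287.msg', 1, 4151, 512, 1135012800].pack('a23CL3')

header = nil
messages = []

StringIO.open(data) do |f| # stand-in for File.open('imap.mrk', 'rb')
  header = f.read(HEADER_SIZE).unpack('L4')
  while (record = f.read(RECORD_SIZE))
    break if record.bytesize < RECORD_SIZE # ignore a truncated trailing record
    # Z23 = NUL-terminated name, C = flags byte, L3 = uid, size, date
    messages << record.unpack('Z23CL3')
  end
end

puts header.inspect
messages.each { |m| puts m.inspect }
```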
 
Daniel Martin

Fabio Vitale said:
And now: how to populate the mrk instance just created from the imap.mrk
binary file?

First off, the other message's advice about your field sizes should be
taken (you want to use "32", not "4"). Also, you almost certainly
want to add :endian => :native to your structure. Finally, you'll
want to adjust the bit_length method of your MRKHeader class since it
won't construct the appropriate length just from the field info.

class MRKHeader < BitStruct
unsigned :version, 32, "Version", :endian => :native
unsigned :uid_Validity, 32, "UIDValidity", :endian => :native
unsigned :uid_next, 32, "UIDNext", :endian => :native
unsigned :last_write_counter, 32, "LastWriteCounter", :endian => :native
rest :unused, "Unused"
def MRKHeader.bit_length
super
36*8
end
end

Okay, now let's assume that you also define the per-message structure
using BitStruct as MRKMessage. (For the message code, you don't need
to redefine bit_length since it can be computed straight from the
fields. Do however use the endianness option on all the integers)

Then:

File.open("imap.mrk") {|f|
head_string = f.read(MRKHeader.round_byte_length)
raise "No header!" unless head_string
mrk_header = MRKHeader.new(head_string)
puts mrk_header.inspect
while msg_string = f.read(MRKMessage.round_byte_length) do
puts MRKMessage.new(msg_string)
end
}
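Why the endianness option matters can be seen with plain pack and unpack, independent of BitStruct. This sketch only uses the standard 'L', 'L<' and 'L>' directives; which one matches the real file depends on the machine that wrote it.

```ruby
native = [1].pack('L')  # byte order of the machine running this script
little = [1].pack('L<') # explicitly little-endian
big    = [1].pack('L>') # explicitly big-endian

native_is_little = (native == little)

# Reading little-endian bytes as if they were big-endian gives a
# completely different number.
misread = little.unpack('L>').first

puts "native is little-endian: #{native_is_little}"
puts "the value 1 misread with the wrong byte order: #{misread}"
```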
 
Daniel Martin

Daniel Martin said:
Then:

File.open("imap.mrk") {|f|
head_string = f.read(MRKHeader.round_byte_length)
raise "No header!" unless head_string
mrk_header = MRKHeader.new(head_string)
puts mrk_header.inspect
while msg_string = f.read(MRKMessage.round_byte_length) do
puts MRKMessage.new(msg_string)
end
}

I forgot to open the file in binary mode, and forgot an inspect call.
I should have said:

File.open("imap.mrk", "rb") {|f|
head_string = f.read(MRKHeader.round_byte_length)
raise "No header!" unless head_string
mrk_header = MRKHeader.new(head_string)
puts mrk_header.inspect
while msg_string = f.read(MRKMessage.round_byte_length) do
puts MRKMessage.new(msg_string).inspect
end
}
 
Fabio Vitale

require 'bit-struct'

class MRKHeader < BitStruct
unsigned :version, 32, "Version", :endian => :native
unsigned :uid_Validity, 32, "UIDValidity", :endian => :native
unsigned :uid_next, 32, "UIDNext", :endian => :native
unsigned :last_write_counter, 32, "LastWriteCounter", :endian =>
:native
rest :unused, "Unused"
def MRKHeader.bit_length
super
36*8
end
end

File.open("imap.mrk", "rb") {|f|
head_string = f.read(MRKHeader.round_byte_length)
raise "No header!" unless head_string
mrk_header = MRKHeader.new(head_string)
puts mrk_header.inspect
while msg_string = f.read(MRKMessage.round_byte_length) do
puts MRKMessage.new(msg_string).inspect
end
}

Now an error is raised:
ruby b.rb
#<MRKHeader version=1, uid_Validity=1106138982, uid_next=5825,
last_write_counter=9872, unused="">
b.rb:19: uninitialized constant MRKMessage (NameError)
from b.rb:14
Exit code: 1

The remaining problem is that I still have to process the Message data
structure: how can I accomplish this?
Thank you all very much for the help!
 
Fabio Vitale

Fabio Vitale wrote:

This is the structure of class MRKMessage:

Message data (36 bytes per message):
- Filename (23 bytes including terminating NUL character)
- Flags (1 byte bitmask)
- UID (4 byte unsigned integer)
- Message size (4 byte unsigned integer)
- Date (4 byte time_t value)

Flags mask is 1:Recent, 2:Draft, 4:Deleted, 8:Flagged, 16:Answered,
32:Seen.

Now 3 major questions:

Q 1: what type must I declare for Filename in the class MRKMessage?

Q 2: what type must I declare for Flags in the class MRKMessage?

Q 3: what type must I declare for Date in the class MRKMessage?

...and 2 minor ones :))

Q 4: How to decode Flags?

Q 5: How to decode Date?

BIG BIG THANKS TO ALL!

------------
require 'bit-struct'
class MRKHeader < BitStruct
unsigned :version, 32, "Version", :endian => :native
unsigned :uid_Validity, 32, "UIDValidity", :endian => :native
unsigned :uid_next, 32, "UIDNext", :endian => :native
unsigned :last_write_counter, 32, "LastWriteCounter", :endian =>
:native
rest :unused, "Unused"
def MRKHeader.bit_length
super
36*8
end
end

class MRKMessage < BitStruct
char :filename, 184, "FileName", :endian => :native
unsigned :flags, 8, "Flags", :endian => :native
unsigned :uid, 32, "UID", :endian => :native
unsigned :msg_size, 32, "MsgSize", :endian => :native
unsigned :date, 32, "Date", :endian => :native
def MRKMessage.bit_length
super
36*8
end
end

File.open("imap.mrk", "rb") {|f|
head_string = f.read(MRKHeader.round_byte_length)
raise "No header!" unless head_string
mrk_header = MRKHeader.new(head_string)
puts mrk_header.inspect
while msg_string = f.read(MRKMessage.round_byte_length) do
puts MRKMessage.new(msg_string).inspect
end
}

This now generates:
ruby b.rb
#<MRKHeader version=1, uid_Validity=1106138982, uid_next=5825,
last_write_counter=9872, unused="">
#<MRKMessage
filename="\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\r\nmd5",
flags=48, uid=808464432, msg_size=942814256, date=1936535094>
#<MRKMessage
filename="g\000\000\000\000\000\00006\020\000\000\374P\000\000k\353\246Cmd5",
flags=48, uid=808464432, msg_size=858993712, date=1936535091>
#<MRKMessage filename="g\000\000\000\000\000\000
e\020\000\000\334\226\003\000X\373\253Cmd5", flags=48, uid=808464432,
msg_size=858993712, date=1936535092>
 
Daniel Martin

Fabio Vitale said:
Now 3 major questions:

Q 1: what type must I declare for Filename in the class MRKMessage?

Okay, first off I apologize, but I led you astray. Apparently it's
not enough to override bit_length in your subclass. When you read the
file, you're not getting the stuff lined up properly. Therefore I've
decided to make up for it by finishing the rest of your code for you.
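Judging by the earlier output (an empty unused field and a filename that starts with a run of NUL bytes), the header read seems to have consumed only the 16 bytes of declared fields. A small sketch with made-up data and plain unpack reproduces that kind of misalignment:

```ruby
header = [1, 1106138982, 5825, 9872].pack('L4x20') # 36 bytes
record = ['md50000004286.msg', 48, 4150, 20732, 0].pack('a23CL3')
data = header + record

# Wrong: treat only the 16 bytes of declared fields as the header,
# so the "filename" begins inside the 20 bytes of padding.
wrong_name = data[16, 36].unpack('a23').first

# Right: skip the full 36-byte header, padding included.
right_name = data[36, 36].unpack('Z23').first

puts wrong_name.inspect
puts right_name.inspect
```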

Note that now I override round_byte_length instead, and we get:

require 'bit-struct'

class MRKHeader < BitStruct
  unsigned :version, 32, "Version", :endian => :native
  unsigned :uid_Validity, 32, "UIDValidity", :endian => :native
  unsigned :uid_next, 32, "UIDNext", :endian => :native
  unsigned :last_write_counter, 32, "LastWriteCounter", :endian => :native
  rest :unused, "Unused"

  # Override so that it gets padded properly
  def MRKHeader.round_byte_length
    super
    36
  end
end

# Ideally, I'd construct some sort of "flags" bit-struct field
# Or define a boolean field type and make this a series of boolean
# fields.

# However, for now we can deal with a series of 0s and 1s

class MRKMessageFlags < BitStruct
  unsigned :flagUnused, 2, "Unused"
  unsigned :flagSeen, 1, "Seen"
  unsigned :flagAnswered, 1, "Answered"
  unsigned :flagFlagged, 1, "Flagged"
  unsigned :flagDeleted, 1, "Deleted"
  unsigned :flagDraft, 1, "Draft"
  unsigned :flagRecent, 1, "Recent"
end

class MRKMessage < BitStruct
  # Note "text" for nul-terminated strings
  text :filename, 23*8, "FileName", :endian => :native
  nest :flags, MRKMessageFlags, "Flags"
  unsigned :uid, 32, "UID", :endian => :native
  unsigned :msg_size, 32, "MsgSize", :endian => :native
  unsigned :date, 32, "Date", :endian => :native

  # Now we futz with the way that date is set and gotten.
  # we rename the existing date field to __date, and
  # then we supply our own meaning for "date" that does
  # translation into and out of seconds-since-1970

  # Again, the ideal solution would be to define a new bit-struct
  # field type that did this stuff itself.

  alias_method :__date=, :date=
  alias_method :__date, :date

  def date=(time)
    self.__date= time.to_i
  end

  def date
    Time.at(self.__date)
  end
  # we don't need to override the length computation here
end

File.open("imap.mrk", "rb") {|f|
  head_string = f.read(MRKHeader.round_byte_length)
  raise "No header!" unless head_string
  mrk_header = MRKHeader.new(head_string)
  puts mrk_header.inspect
  while msg_string = f.read(MRKMessage.round_byte_length) do
    puts MRKMessage.new(msg_string).inspect
  end
}

__END__

This produces (on the first bit from your file):

#<MRKHeader version=1, uid_Validity=1106138982, uid_next=5825,
last_write_counter=9872,
unused="\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\r\n">
#<MRKMessage filename="md50000004286.msg", flags=#<MRKMessageFlags
flagUnused=0, flagSeen=1, flagAnswered=1, flagFlagged=0,
flagDeleted=0, flagDraft=0, flagRecent=0>, uid=4150, msg_size=20732,
date=Mon Dec 19 12:18:35 Eastern Standard Time 2005>

This is more what you expected, right?
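For comparison, the flags byte and the date can also be decoded with ordinary integer operations, without any BitStruct types. This is only a sketch: FLAG_BITS and decode_flags are names invented here, and the sample values 48 and 1135012715 are chosen to match the record shown above.

```ruby
# Bit values taken from the file format description in the first post.
FLAG_BITS = {
  recent: 1, draft: 2, deleted: 4, flagged: 8, answered: 16, seen: 32
}.freeze

# Return the names of all flags whose bit is set in the given byte.
def decode_flags(byte)
  FLAG_BITS.select { |_name, bit| byte & bit != 0 }.keys
end

flags = decode_flags(48)        # 48 = 32 (Seen) + 16 (Answered)
date  = Time.at(1135012715).utc # time_t is seconds since 1970-01-01 UTC

puts "flags=#{flags.inspect} date=#{date}"
```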
 
ChrisH

Fabio Vitale wrote:
....
This now generates:

#<MRKHeader version=1, uid_Validity=1106138982, uid_next=5825,
last_write_counter=9872, unused="">
#<MRKMessage
filename="\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\r\nmd5",
flags=48, uid=808464432, msg_size=942814256, date=1936535094>
#<MRKMessage
filename="g\000\000\000\000\000\00006\020\000\000\374P\000\000k\353\246Cmd5",
flags=48, uid=808464432, msg_size=858993712, date=1936535091>
#<MRKMessage filename="g\000\000\000\000\000\000
e\020\000\000\334\226\003\000X\373\253Cmd5", flags=48, uid=808464432,
msg_size=858993712, date=1936535092>
....

Looks like you need to investigate #unpack.

In any case, Ruby Facets has a BinaryReader mixin
(http://facets.rubyforge.org/api/more/classes/BinaryReader.html) that
does the reading and unpacking for you. Just mix it into File (or a
subclass of File) and you should be good to go...

Cheers
Chris
 
Fabio Vitale

Daniel Martin wrote:
Martin, thank you very much: you solved my problem like a charm!
 
Daniel Martin

ChrisH said:
Looks like you need to investigate #unpack.

In any case, Ruby Facets has a BinaryReader mixin
(http://facets.rubyforge.org/api/more/classes/BinaryReader.html) that
does the reading and unpacking for you. Just mix it into File (or a
subclass of File) and you should be good to go...

That's fine if you want to pull out each field in succession yourself,
but BitStruct provides much more than that, by providing a DSL for
packed-bit structures. Also, if you see my reply, you'll notice that
he was in fact very close to getting what he wanted.

Actually, going through this exercise has pointed out some features
that I would like to add to BitStruct, since it was almost, but not
quite, exactly what the poster wanted. It would
be nice to have an easy, obvious, and supported way to define extra
padding in a structure (as we needed to here). It would be nice to
have an easier, supported syntax for reading a structure from a binary
file. Finally, it would be nice to allow an easy way to define data
wrappers, as was done with the date property.
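As a rough illustration of the "read a structure from a file" idea, here is a plain-Ruby sketch built on Struct and unpack. MrkHeader and read_from are hypothetical names, not BitStruct API; the 'x20' directive skips the 20 bytes of padding.

```ruby
require 'stringio'

# Hypothetical helper, not part of BitStruct: a Struct that can read
# itself from any IO-like object.
MrkHeader = Struct.new(:version, :uid_validity, :uid_next, :last_write_counter) do
  # Returns a populated header, or nil if fewer than 36 bytes are available.
  def self.read_from(io)
    bytes = io.read(36)
    return nil if bytes.nil? || bytes.bytesize < 36
    new(*bytes.unpack('L4x20')) # four 32-bit fields, then skip the padding
  end
end

io = StringIO.new([1, 1106138982, 5825, 9872].pack('L4x20'))
header = MrkHeader.read_from(io)
puts header.inspect
```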
 
ChrisH

Daniel Martin wrote:
....
That's fine if you want to pull out each field in succession yourself,
but BitStruct provides much more than that, by providing a DSL for
packed-bit structures. Also, if you see my reply, you'll notice that
he was in fact very close to getting what he wanted.

Actually, going through this exercise has pointed up some features
that I would like to add to BitStruct, since in many cases it almost
but not quite completely was exactly what the poster wanted. It would
be nice to have an easy, obvious, and supported way to define extra
padding in a structure (as we needed to here). It would be nice to
have an easier, supported syntax for reading a structure from a binary
file. Finally, it would be nice to allow an easy way to define data
wrappers, as was done with the date property.

My reference to BinaryReader wasn't a slight against BitStruct. I haven't
used either, so I can't comment. I just found it and figured I'd throw it
into the mix.

It just occurred to me that combining BitStruct with BinaryReader and
maybe StringIO could produce a nice BinaryIO class/module?

BTW, I noticed that all the fields in the BitStruct had endianness
specified. Is there a way to set the endianness for the whole structure?
Would you ever have a structure with mixed endianness?

Nice work
Chris
 
Fabio Vitale

This time I'm trying to write a binary file.

Q1: why does the MRKMessageFlags structure not get the appropriate
values?

Q2: how do I convert a date to seconds-since-1970?

require 'bit-struct'
class MRKHeader < BitStruct
unsigned :version, 32, "Version", :endian =>
:native
unsigned :uid_Validity, 32, "UIDValidity", :endian =>
:native
unsigned :uid_next, 32, "UIDNext", :endian =>
:native
unsigned :last_write_counter, 32, "LastWriteCounter", :endian =>
:native
rest :unused, "Unused"
# Override so that it gets padded properly
def MRKHeader.round_byte_length
super
36
end
end

# Ideally, I'd construct some sort of "flags" bit-struct field
# Or define a boolean field type and make this a series of boolean
# fields.

# However, for now we can deal with a series of 0s and 1s

class MRKMessageFlags < BitStruct
unsigned :flagUnused, 2, "Unused"
unsigned :flagSeen, 1, "Seen"
unsigned :flagAnswered, 1, "Answered"
unsigned :flagFlagged, 1, "Flagged"
unsigned :flagDeleted, 1, "Deleted"
unsigned :flagDraft, 1, "Draft"
unsigned :flagRecent, 1, "Recent"
end

class MRKMessage < BitStruct
# Note "text" for nul-terminated strings
text :filename, 23*8, "FileName", :endian => :native
nest :flags, MRKMessageFlags, "Flags"
unsigned :uid, 32, "UID", :endian => :native
unsigned :msg_size, 32, "MsgSize", :endian => :native
unsigned :date, 32, "Date", :endian => :native

# Now we futz with the way that date is set and gotten.
# we rename the existing date field to __date, and
# then we supply our own meaning for "date" that does
# translation into and out of seconds-since-1970

# Again, the ideal solution would be to define a new bit-struct
# field type that did this stuff itself.

alias_method :__date=, :date=
alias_method :__date, :date
def date=(time)
self.__date= time.to_i
end
def date
Time.at(self.__date)
end
# we don't need to override the length computation here
end

File.open("imap2.mrk", "wb") {|f|
#<MRKHeader version=1, uid_Validity=1106138982, uid_next=5887,
# last_write_counter=9962,
# unused="\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\r\n">

mrk_header = MRKHeader.new()
mrk_header.version = 1
mrk_header.uid_Validity = 1106138982
mrk_header.uid_next = 5887
mrk_header.last_write_counter = 9962
mrk_header.unused =
"\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\r\n"
puts mrk_header.inspect

msg = MRKMessage.new()
msg.filename = "md50000006021.msg"

msg.flags.flagSeen = 1
msg.flags.flagAnswered = 0
msg.flags.flagFlagged = 1
msg.flags.flagDeleted = 0
msg.flags.flagDraft = 0
msg.flags.flagRecent = 0
puts msg.flags.inspect

msg.uid = 5885
msg.msg_size = 4184
msg.date = "Mon Jul 24 12:34:04 2006"

puts MRKMessage.new(msg).inspect
}
 
Daniel Martin

Fabio Vitale said:
This time I'm trying to write a binary file.

Q1: why does the MRKMessageFlags structure not get the appropriate
values?

I'm not sure. It appears to be a bug in bit-struct. Here's a
workaround; in your code replace the bit that sets the flag bits with:

# work around bit-struct nested types bug
flags = msg.flags
flags.flagSeen = 1
flags.flagFlagged = 1
msg.flags = flags

Q2: how do I convert a date to seconds-since-1970?

Don't assign date a string; assign it a Time object. The easiest way
to get one of those is with Time.parse (which needs require 'time'):

msg.date = Time.parse("Mon Jul 24 12:34:04 2006")
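The conversion itself can be checked in isolation. This sketch parses the string from the question and turns it into seconds since 1970; because the string carries no time zone, it is interpreted in the local zone of the machine running the script, so the exact number varies between machines.

```ruby
require 'time' # Time.parse is defined in the 'time' standard library

t = Time.parse('Mon Jul 24 12:34:04 2006') # local time, no zone given
seconds = t.to_i                           # seconds since 1970-01-01 UTC

# Time.at reverses the conversion.
round_trip = Time.at(seconds)

puts "seconds=#{seconds} round_trip=#{round_trip}"
```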

Your file loop now looks like this: (I also added calls to write out
the data to the file)

require 'time'   # Time.parse is defined in the 'time' library

File.open("imap2.mrk", "wb") {|f|
  #<MRKHeader version=1, uid_Validity=1106138982,
  # uid_next=5887, last_write_counter=9962,
  # unused="\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\r\n">

  mrk_header = MRKHeader.new()
  mrk_header.version = 1
  mrk_header.uid_Validity = 1106138982
  mrk_header.uid_next = 5887
  mrk_header.last_write_counter = 9962
  # I omitted modifying unused since in theory we don't need to
  # change the last two bytes to "\r\n" - it is unused, after all
  puts mrk_header.inspect
  f.write(mrk_header)

  msg = MRKMessage.new()
  msg.filename = "md50000006021.msg"

  # work around bit-struct nested types bug
  flags = msg.flags
  flags.flagSeen = 1
  flags.flagFlagged = 1
  msg.flags = flags

  msg.uid = 5885
  msg.msg_size = 4184
  msg.date = Time.parse("Mon Jul 24 12:34:04 2006")

  puts msg.inspect
  f.write(msg)
}
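The same write can be sketched with Array#pack alone, which is handy for checking the byte layout. A StringIO stands in for the output file here, and the flag value is built from the bit values given in the format description; none of this is BitStruct API.

```ruby
require 'stringio'
require 'time'

out = StringIO.new(String.new) # stand-in for File.open('imap2.mrk', 'wb')

# Header: four 32-bit unsigned integers plus 20 NUL bytes of padding.
out.write([1, 1106138982, 5887, 9962].pack('L4x20'))

flags = 32 | 8 # Seen + Flagged
date  = Time.parse('Mon Jul 24 12:34:04 2006').to_i

# Message: 23-byte NUL-padded name, flags byte, then uid, size, date.
out.write(['md50000006021.msg', flags, 5885, 4184, date].pack('a23CL3'))

bytes = out.string
puts "wrote #{bytes.bytesize} bytes"
```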
 
Jan Svitok

Q2: have a look at Time.parse or ParseDate in the stdlib. Keep in mind that
it is really slow: it does all sorts of computation (gcd, for
example), so if you need it, it helps to cache its results or to
compute offsets from a known date.

J.
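Two ways to avoid repeated parsing, sketched under the assumption that the same date strings recur or that the date components are already known. EPOCH_CACHE is a name made up for this example; whether caching pays off depends on how often strings repeat.

```ruby
require 'time'

# Cache parsed results so each distinct string is parsed only once.
EPOCH_CACHE = Hash.new { |cache, text| cache[text] = Time.parse(text).to_i }

first  = EPOCH_CACHE['Mon Jul 24 12:34:04 2006'] # parsed
second = EPOCH_CACHE['Mon Jul 24 12:34:04 2006'] # served from the cache

# When the components are already known, Time.local skips string
# parsing altogether.
direct = Time.local(2006, 7, 24, 12, 34, 4).to_i

puts [first, second, direct].inspect
```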
 
Fabio Vitale

Thank you all!
I'll look into the bit-struct nested types bug.
Anyway, with Daniel's workaround it seems to work.
 
Joel VanderWerf

Hi, all. Sorry to respond so late to this thread. Thanks to Daniel for
his excellent and thorough responses!

Just one comment below on Daniel's solution...

Daniel said:
require 'bit-struct'
class MRKHeader < BitStruct
unsigned :version, 32, "Version", :endian => :native
unsigned :uid_Validity, 32, "UIDValidity", :endian => :native
unsigned :uid_next, 32, "UIDNext", :endian => :native
unsigned :last_write_counter, 32, "LastWriteCounter", :endian => :native
rest :unused, "Unused"
# Override so that it gets padded properly
def MRKHeader.round_byte_length
super
36
end
end

I'd suggest defining a fixed-length "unused" field instead of a "rest"
field. The rest construct is better for variable length data, such as a
payload at the end of a packet.

So instead of

rest :unused, "Unused"

you can consume the bytes with

char :unused, (36*8-4*32), "Unused"

Then you don't have to worry about whether overriding #round_byte_length
is the right thing to do or not.

The disadvantage to doing it this way is that inspect will print out
stuff you don't care about. Solving this problem gets to the suggestions
Daniel made in another post, so I'll respond separately.
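The size arithmetic behind the fixed-length field can be checked with plain pack and unpack. In this sketch the constant names are invented for illustration, and 'a20' plays the role of the fixed-length unused field.

```ruby
FIELD_BITS  = 32
HEADER_BITS = 36 * 8

unused_bits  = HEADER_BITS - 4 * FIELD_BITS # same expression as 36*8-4*32
unused_bytes = unused_bits / 8

# Four 32-bit integers followed by a fixed-length, NUL-padded byte field.
format = "L4a#{unused_bytes}"
header = [1, 1106138982, 5825, 9872, ''].pack(format)

fields = header.unpack(format)
puts "unused_bits=#{unused_bits} header_bytes=#{header.bytesize}"
```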
 
