read write integer in binary into a file

Vianney Lecroart

Hello,

I have some big files containing a lot of "unsigned int" (4-byte) numbers, and I
want to read from and write to these files.

Currently, I found this to write:

myfile << [mynum].pack("i")

and to read:

mynum = myfile.read(4).unpack("i").first

I wonder whether there is something faster or simpler that avoids
converting the number into an array and then into a string just to
serialize it.

Thank you.
 
Park Heesob

Hi,

----- Original Message -----
From: "Vianney Lecroart"
Newsgroups: comp.lang.ruby
To: "ruby-talk ML"
Sent: Thursday, October 25, 2007 11:36 PM
Subject: read write integer in binary into a file

> Hello,
>
> I have some big files with lot of "unsigned int" (4 bytes) numbers and I
> want to read and write on these files.
>
> Currently, I found this to write:
>
> myfile << [mynum].pack("i")
>
> and to read:
>
> mynum = myfile.read(4).unpack("i").first
>
> I wonder if there's not something faster/simpler to do that without the
> need to convert the number into an array into a string to finally
> serialize it.
>
> Thank you.

How about Marshal?

myfile << Marshal.dump(mynum)

and

mynum = Marshal.load(myfile.read)

Regards,

Park Heesob
 
Vianney Lecroart

Park said:
How about Marshal?

The files are filled by an external C application that does something like:
fwrite(&myint, 4, 1, fp);

So I have to use the same file format.
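
For a file written by C code like this, explicit-width pack directives are one option. The sketch below is only an illustration: it assumes the C program runs on a little-endian machine (typical x86), and the helper names `write_uint32` and `read_uint32` are made up for this example. Note that "i" is a signed native int, while "L" is an unsigned 32-bit value; the "<" suffix forces little-endian byte order.

```ruby
require "stringio"

# Hypothetical helpers for raw 32-bit unsigned integers, as a C program would
# produce with fwrite(&value, 4, 1, fp). Assumes little-endian data; use "L>"
# (or "N") instead if the writer is big-endian.
UINT32_LE = "L<"

def write_uint32(io, value)
  io.write([value].pack(UINT32_LE))
end

def read_uint32(io)
  bytes = io.read(4)
  return nil if bytes.nil? || bytes.bytesize < 4   # end of file or truncated record
  bytes.unpack1(UINT32_LE)
end

# Usage with an in-memory buffer standing in for the real file.
io = StringIO.new("".b)
write_uint32(io, 56515)
write_uint32(io, 4_000_000_000)   # does not fit in a signed 32-bit int
io.rewind
first  = read_uint32(io)
second = read_uint32(io)
```

With a real file, the same helpers would take a File opened in binary mode ("rb" or "wb").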
 
Michael Linfield

Vianney said:
The files are filled by an external C application that does something like:
fwrite(&myint, 4, 1, fp);

So I have to use the same file format.

What file format? I don't see any problem with using Marshal; it doesn't
need a file format specified, it's simply a Marshal dump.
 
Vianney Lecroart

It seems that marshaling a number doesn't produce 4 bytes:

irb(main):036:0> mynum
=> 56515
irb(main):037:0> [mynum].pack("i")
=> "\303\334\000\000"
irb(main):038:0> Marshal.dump(mynum)
=> "\004\bi\002\303\334"
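
The difference can be checked directly by comparing byte counts. This is a small sketch; the method name `encoded_sizes` is invented for the example. Marshal output starts with a two-byte version header and a type tag, followed by a variable-length payload, so its size depends on the value, whereas the packed form is always 4 bytes.

```ruby
# Compare the raw 4-byte encoding with Marshal's self-describing encoding.
def encoded_sizes(n)
  {
    pack:    [n].pack("l<").bytesize,    # fixed 32-bit little-endian value
    marshal: Marshal.dump(n).bytesize    # header + type tag + variable-length payload
  }
end

sizes = encoded_sizes(56515)   # 4 bytes packed, 6 bytes marshaled, as in the irb session
```

A C program reading 4 bytes per number would therefore misread a file produced with Marshal.dump.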
 
yermej

Do you have to deal with each number individually? Maybe you could
build up an array of numbers and then pack them all at once:

arr = []
while work_to_do do
  mynum = generate_next_number
  arr << mynum
end
myfile.write arr.pack('i*')

That way you aren't creating a new array for each number.

Similarly, for reading the file:
data = file.read
num_array = data.unpack('i*')

The '*' in (un)pack means to process the rest of the data in the same
way.
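
As a rough sketch of the batch approach for whole files: the helper names below (`write_ints`, `read_ints`, `each_int`) are hypothetical, and the "l<*" directive assumes 32-bit little-endian values. Because the files in question are big, `each_int` reads fixed-size chunks instead of loading everything at once.

```ruby
# Write all numbers with a single pack call; "wb" keeps the bytes untouched.
def write_ints(path, numbers)
  File.open(path, "wb") { |f| f.write(numbers.pack("l<*")) }
end

# Read the whole file at once and unpack every 4-byte group.
def read_ints(path)
  File.binread(path).unpack("l<*")
end

# Stream a large file in chunks of chunk_ints numbers to limit memory use.
def each_int(path, chunk_ints = 16_384)
  File.open(path, "rb") do |f|
    while (chunk = f.read(4 * chunk_ints))   # read returns nil at end of file
      chunk.unpack("l<*").each { |n| yield n }
    end
  end
end
```

One array and one string are then created per chunk rather than per number.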
 
Adam Preble

I wrote a function to do this which seems slightly faster, but could
perhaps stand some optimization:

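def pack_int32(n)
  # Build the 4 bytes by hand, least significant byte first (little-endian).
  # Ruby 1.8 allowed str[i] = integer; setbyte is the equivalent in 1.9 and later.
  str = "\0\0\0\0".b
  str.setbyte(3, (n >> 24) & 0xff)
  str.setbyte(2, (n >> 16) & 0xff)
  str.setbyte(1, (n >> 8) & 0xff)
  str.setbyte(0, n & 0xff)
  str
end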

Here are the benchmark results vs the other methods mentioned:

                   user     system      total        real
[].pack(i):    6.234000   0.235000   6.469000 (  6.500000)
pack_int32:    5.719000   0.015000   5.734000 (  5.734000)
Marshal.dump:  6.594000   0.219000   6.813000 (  6.813000)

I included Marshal.dump for completeness, but agree that it doesn't
appear to be meant for this sort of thing. Here's the source to run
the benchmark:

require 'benchmark'
number = 2_000_000
n = 1_000_000
Benchmark.bm(12) do |x|
  x.report('[].pack(i):') { n.times do; [number].pack('i'); end }
  x.report('pack_int32:') { n.times do; pack_int32(number); end }
  x.report('Marshal.dump:') { n.times do; Marshal.dump(number); end }
end

Adam
 
Phrogz

Using only the number 2_000_000 seems to skew the results. I see your
results with your test, but if I change it slightly to use a variety
of integers, I get more balanced results:

require 'benchmark'
MAX = 2**30
n = 1_000_000
nums = (0..n).map{ (rand*MAX).to_i }

Benchmark.bmbm do |x|
  x.report('pack(i):') { nums.each{ |num| [num].pack('i') } }
  x.report('pack32:') { nums.each{ |num| pack_int32(num) } }
  x.report('Dump:') { nums.each{ |num| Marshal.dump(num) } }
end

Rehearsal --------------------------------------------
pack(i):   5.813000   0.109000   5.922000 (  5.984000)
pack32:    5.234000   0.000000   5.234000 (  5.281000)
Dump:      5.906000   0.125000   6.031000 (  6.063000)
---------------------------------- total: 17.187000sec

               user     system      total        real
pack(i):   5.687000   0.125000   5.812000 (  5.875000)
pack32:    5.141000   0.016000   5.157000 (  5.188000)
Dump:      6.000000   0.078000   6.078000 (  6.141000)
 
Wu Junchen

Vianney said:
Hello,

I have some big files with lot of "unsigned int" (4 bytes) numbers and I
want to read and write on these files.

Currently, I found this to write:

myfile << [mynum].pack("i")

and to read:

mynum = myfile.read(4).unpack("i").first

I wonder if there's not something faster/simpler to do that without the
need to convert the number into an array into a string to finally
serialize it.

Thank you.


irb(main):001:0> f=open('test','w')
=> #<File:test>
irb(main):002:0> f<<[65535].pack('i')
=> #<File:test>
irb(main):003:0> f.tell
=> 4
irb(main):004:0> f<<[720850].pack('i')
=> #<File:test>
irb(main):005:0> f.tell
=> 9

The integer 720850 takes 5 bytes in my file, but it should take only 4
bytes! How can I fix this? Thanks!
 
Tim Hunter

Wu said:
irb(main):001:0> f=open('test','w')
=> #<File:test>
irb(main):002:0> f<<[65535].pack('i')
=> #<File:test>
irb(main):003:0> f.tell
=> 4
irb(main):004:0> f<<[720850].pack('i')
=> #<File:test>
irb(main):005:0> f.tell
=> 9
The integer 720850 takes 5 bytes in my file, but it should take only 4
bytes! How can I fix this? Thanks!

irb(main):001:0> x = [720850].pack('i')
=> "\322\377\n\000"
irb(main):002:0> x.length
=> 4

So clearly the integer 720850 is packed into 4 bytes as requested. Why
does it occupy 5 bytes in the file? See the "\n" at index 2? The third
byte is a newline character, and on Windows Ruby turns each newline
written to a text-mode file into CRLF: 2 bytes! Since your file holds
binary data, you don't want text mode, so you must open the file with
the "b" flag in addition to "w":

f = open("test", "wb")
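
A small sketch of the fix, with made-up helper names (`packed_file_size`, `read_back`), writes the same two numbers in binary mode and reads them back. It assumes a platform where a C int is 4 bytes, which is what the 'i' directive uses.

```ruby
# Write each number with pack('i') in binary mode and report the file size.
def packed_file_size(path, numbers)
  File.open(path, "wb") do |f|            # "b" disables newline translation on Windows
    numbers.each { |n| f << [n].pack("i") }
  end
  File.size(path)
end

# Reading needs the "b" flag too, otherwise CRLF pairs in the data are
# translated back to a single newline on Windows.
def read_back(path)
  File.open(path, "rb") { |f| f.read.unpack("i*") }
end
```

With 4-byte ints, packed_file_size(path, [65535, 720850]) gives 8 on every platform, because the "\n" byte inside 720850 is written as is.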
 
