How to convert bytearray into integer?


Jacky

Hi there,

Recently I've been facing a problem: converting 4 bytes of a bytearray into
a 32-bit integer. As far as I can see, there are 3 ways: a) using the
struct module, b) using the ctypes module, and c) manual manipulation.

Are there any other ways?

My sample is as follows:

-----
import struct
import ctypes

def test_struct(buf, offset):
    return struct.unpack_from("I", buf, offset)[0]

def test_ctypes(buf, offset):
    return ctypes.c_uint32.from_buffer(buf, offset).value

def test_multi(buf, offset):
    return (buf[offset] + (buf[offset+1] << 8) + (buf[offset+2] << 16) +
            (buf[offset+3] << 24))

buf_w = bytearray(5)
buf_w[1] = 1
buf_r = buffer(buf_w)

if __name__ == '__main__':
    import timeit

    t1 = timeit.Timer("test_struct(buf_r, 1)",
                      "from __main__ import test_struct, buf_r")
    t2 = timeit.Timer("test_ctypes(buf_w, 1)",
                      "from __main__ import test_ctypes, buf_w")
    t3 = timeit.Timer("test_multi(buf_w, 1)",
                      "from __main__ import test_multi, buf_w")
    print t1.timeit(number=1000)
    print t2.timeit(number=1000)
    print t3.timeit(number=1000)
-----

Yet the results are a bit confusing:

-----
number = 10000
0.0081958770752
0.012549161911
0.0112121105194

number = 1000
0.00087308883667
0.00125789642334
0.00110197067261

number = 100
9.17911529541e-05
0.000133991241455
0.00011420249939

number = 10
1.69277191162e-05
2.19345092773e-05
1.69277191162e-05

number = 1
1.00135803223e-05
1.00135803223e-05
5.96046447754e-06
-----

As the number of benchmarking loops decreases, method (c), manual
manipulation, overtakes the other two methods. However, at
number == 10000, the struct method wins.

Why does it happen?

Thanks,
Jacky (jacky.chao.wang#gmail.com)
 

Thomas Jollans

Hi there,

Recently I've been facing a problem: converting 4 bytes of a bytearray into
a 32-bit integer. As far as I can see, there are 3 ways:
a) using the struct module,

Yes, that's what it's for, and that's what you should be using.
b) using the ctypes module, and

Yeeaah, that would work, but that's really not what it's for. from_buffer
wants a writable buffer interface, which is unlikely to be what you want.
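
Roughly, a sketch of the difference (assuming Python 2.x as used in this
thread; from_buffer_copy is the ctypes variant that also accepts read-only
data, at the cost of a copy):

import ctypes

buf_w = bytearray(b'\x01\x02\x03\x04')              # writable buffer
n1 = ctypes.c_uint32.from_buffer(buf_w).value       # shares memory with buf_w
n2 = ctypes.c_uint32.from_buffer_copy(buf_w).value  # copies, so read-only sources work too
# from_buffer(buffer(buf_w)) would raise TypeError, since buffer() is read-only;
# both calls use native byte order, so n1 == n2 == 67305985 only on a little-endian machine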
c) manual manipulation.

Well, yes, you can do that, but it gets messy when you're working with more
complex data structures, or you have to consider byte order.
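
For instance, just to sketch what that looks like by hand (the bytes are
01 02 03 04; the values are in the comments):

data = bytearray(b'\x01\x02\x03\x04')

# little-endian, written out by hand:
le = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24)   # 67305985
# big-endian needs a second, mirror-image expression:
be = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]   # 16909060
# with struct the same switch is just '<I' versus '>I'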
Are there any other ways?

You could write a C extension module tailored to your specific purpose ;-)
number = 1
1.00135803223e-05
1.00135803223e-05
5.96046447754e-06
-----

As the number of benchmarking loops decreases, method (c), manual
manipulation, overtakes the other two methods. However, at
number == 10000, the struct method wins.

Why does it happen?

struct wins because it's built for the job.

As for the small numbers: don't take these numbers seriously. Just don't. This
may be caused by the way your OS's scheduler handles things for all I know. If
there is an explanation for this unscientific observation, I have two guesses
what it might be:
* struct and ctypes still need to do some setup work, or something
* somebody is optimising something, but doesn't know what they should be
optimising in the first place after only a few iterations.
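
If you do want numbers worth comparing, timeit's repeat() helps: run each
benchmark several times with a reasonably large number and keep the minimum,
which filters out most of the noise. A sketch, reusing the functions and
buffers from your post:

import timeit

t = timeit.Timer("test_struct(buf_r, 1)",
                 "from __main__ import test_struct, buf_r")
# run the 100000-call benchmark 5 times and keep the best result
print min(t.repeat(repeat=5, number=100000))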
 

Jacky

Hi Thomas,

Thanks for your comments! Please check mine inline.

Yes, that's what it's for, and that's what you should be using.

My concern is that struct may need to parse the format string,
construct the result tuple, and then index element 0 of that tuple to
get the int out.

Shouldn't there be a more efficient way?
Yeeaah, that would work, but that's really not what it's for. from_buffer
wants a writable buffer interface, which is unlikely to be what you want.

Actually my buffer is writable --- it's a bytearray. Turning it into
a read-only one takes extra effort: wrapping the bytearray in
buffer().

My question is: this operation seems much simpler than the
former one, and it's very straightforward as well. Why is it slow?
Well, yes, you can do that, but it gets messy when you're working with more
complex data structures, or you have to consider byte order.

agree. :)
You could write a C extension module tailored to your specific purpose ;-)

Ha, yes. Actually I've already modified socketmodule.c myself ---
it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

The same goes for socket.send(...).
struct wins because it's built for the job.

As for the small numbers: don't take these numbers seriously. Just don't. This
may be caused by the way your OS's scheduler handles things for all I know. If
there is an explanation for this unscientific observation, I have two guesses
what it might be:
 * struct and ctypes still need to do some setup work, or something
 * somebody is optimising something, but doesn't know what they should be
   optimising in the first place after only a few iterations.

Agree. Thanks.

- Jacky
 

Mark Dickinson

Hi Thomas,

Thanks for your comments!  Please check mine inline.




My concern is that struct may need to parse the format string,
construct the result tuple, and then index element 0 of that tuple to
get the int out.

Shouldn't there be a more efficient way?

Well, you can improve on the struct solution by using the
struct.Struct class to avoid parsing the format string repeatedly:
>>> import struct
>>> S = struct.Struct('<I')
>>> S.unpack_from(buffer(bytearray([1,2,3,4,5])))
(67305985,)

This doesn't make a huge difference on my machine (OS X 10.6.4, 64-bit
build of Python 2.6) though; it's probably more effective for long
format strings. Adding:

def test_struct2(buf, offset, S=struct.Struct('<I')):
    return S.unpack_from(buf, offset)[0]

to your test code, I see a speedup of around 8% over your test_struct.

By the way, you may want to consider using an explicit byte-order/size
marker in your format string; i.e., use '<I' instead of 'I'. This
forces a 4-byte little-endian interpretation, regardless of the
platform you're running Python on.
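
For instance, a sketch (the first result assumes a little-endian machine, so
the bare 'I' only happens to agree with '<I' there):

>>> import struct
>>> data = buffer(bytearray([1, 2, 3, 4]))
>>> struct.unpack_from('I', data)[0]   # native byte order: platform dependent
67305985
>>> struct.unpack_from('<I', data)[0]  # standard size, little-endian, everywhere
67305985
>>> struct.unpack_from('>I', data)[0]  # standard size, big-endian, everywhere
16909060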
 

Thomas Jollans

Hi Thomas,

Thanks for your comments! Please check mine inline.



My concern is that struct may need to parse the format string,
construct the result tuple, and then index element 0 of that tuple to
get the int out.

Shouldn't there be a more efficient way?

The struct module is written in C, not in Python. It does have to parse a
string, yes, so, if you wrote your own, limited, C function to do the job, it
might be marginally faster.
Actually my buffer is writable --- it's a bytearray. Turning it into
a read-only one takes extra effort: wrapping the bytearray in
buffer().

My question is: this operation seems much simpler than the
former one, and it's very straightforward as well. Why is it slow?

Unlike struct, it constructs an object around your int that you're not
actually interested in.
it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

Well, that's what pointer arithmetic (in C) or slices (in Python) are for!
There's an argument to be made for sticking close to the traditional
(originally C) interface here - it's familiar.


- Thomas
 

Mark Dickinson

My concern is that struct may need to parse the format string,
construct the result tuple, and then index element 0 of that tuple to
get the int out.
Shouldn't there be a more efficient way?

Well, you can improve on the struct solution by using the
struct.Struct class to avoid parsing the format string repeatedly:
>>> import struct
>>> S = struct.Struct('<I')
>>> S.unpack_from(buffer(bytearray([1,2,3,4,5])))
(67305985,)

This doesn't make a huge difference on my machine (OS X 10.6.4, 64-bit
build of Python 2.6) though;  it's probably more effective for long
format strings.

Sorry, this was inaccurate: this makes almost *no* significant
difference on my machine for large test runs (10000 and up). For
small ones, though, it's faster. The reason is that the struct module
caches (up to 100, in the current implementation) previously used
format strings, so with your tests you're only ever parsing the format
string once anyway. Internally, the struct module converts that
format string to a Struct object, and squirrels that Struct object
away into its cache, which is implemented as a dict from format
strings to Struct objects. So the next time that the format string is
used it's simply looked up in the cache, and the Struct object
retrieved.
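
A rough way to see this (just a sketch; exact numbers are machine-dependent,
but the two timings come out very close because the module-level call is
served from that cache after the first parse):

import struct, timeit

setup = "import struct; S = struct.Struct('<I'); buf = buffer(bytearray([1, 2, 3, 4, 5]))"

# module-level call: '<I' is parsed once, then found in the internal cache
print timeit.Timer("struct.unpack_from('<I', buf, 1)", setup).timeit(number=100000)
# precompiled Struct object: skips even the cache lookup
print timeit.Timer("S.unpack_from(buf, 1)", setup).timeit(number=100000)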

By the way, in Python 3.2 there's yet another fun way to do this,
using int.from_bytes.
>>> int.from_bytes(bytearray([1,2,3,4]), 'little')
67305985
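
For the original "offset into a larger buffer" case it combines nicely with a
memoryview slice, so you don't copy an intermediate bytes object out of the
buffer first (again Python 3.2+, just a sketch):

>>> buf = bytearray(8)
>>> buf[1] = 1
>>> # 4 bytes starting at offset 1, unsigned, little-endian
>>> int.from_bytes(memoryview(buf)[1:5], 'little')
1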
 

Jacky

Hi Mark,

Thanks for your reply. Agreed, I'll use your suggestions. Thanks!

-Jacky

 

Jacky

it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

Well, that's what pointer arithmetic (in C) or slices (in Python) are for!
There's an argument to be made for sticking close to the traditional
(originally C) interface here - it's familiar.

Hi Thomas - I don't quite follow you. It would be great if you could
show me some code on this part...
 

Jacky


By the way, in Python 3.2 there's yet another fun way to do this,
using int.from_bytes.
int.from_bytes(bytearray([1,2,3,4]), 'little')

Thanks! It looks a lot like the ctypes way. ;)
 

Thomas Jollans

it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

Well, that's what pointer arithmetic (in C) or slices (in Python) are
for! There's an argument to be made for sticking close to the
traditional (originally C) interface here - it's familiar.

Hi Thomas - I don't quite follow you. It would be great if you could
show me some code on this part...

When I originally wrote that, I didn't check the Python docs; I just had a
quick look at the manual page.

This is the signature of the BSD socket recv function (recv(2)):

ssize_t recv(int sockfd, void *buf, size_t len, int flags);

so, to receive data into a buffer, you pass it the buffer pointer.

len = recv(sock, buf, full_len, 0);

To receive more data into the same buffer, you pass it a pointer further on:

len = recv(sock, buf+len, full_len-len, 0);
/* or, this might be clearer, but it's 100% the same: */
len = recv(sock, & buf[len], full_len-len, 0);

Now, in Python. I assume you were referring to socket.recv_into:

socket.recv_into(buffer[, nbytes[, flags]])

It's hard to imagine why this method exists at all. I think the recv method is
perfectly adequate:

buf = bytearray()
buf[:] = sock.recv(full_len)
# then:
lngth = len(buf)
buf[lngth:] = sock.recv(full_len - lngth)

But still, nothing's stopping us from using recv_into:

# create a buffer large enough. Oh this is so C...
buf = bytearray([0]) * full_len
lngth = sock.recv_into(buf, length_of_first_bit)
# okay, now let's fill the rest !
sock.recv_into(memoryview(buf)[lngth:])

In C, you can point your pointers wherever you want. In Python, you can
point your memoryview at buffers in any way you like, but there tend to be
better ways of doing things.
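
For instance, a memoryview slice is essentially a writable window into the
same bytearray, so filling it fills the original buffer in place (a sketch,
Python 2.7+):

buf = bytearray(8)
view = memoryview(buf)[4:]        # roughly "buf + 4" in C terms
view[0:4] = b'\xde\xad\xbe\xef'   # writes straight into buf[4:8], no copy of buf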

Cheers,

Thomas
 
