How to convert bytearray into integer?


Jacky

Hi there,

Recently I've been facing a problem: converting 4 bytes of a bytearray into
a 32-bit integer. As far as I can see, there are 3 ways: a) using the
struct module, b) using the ctypes module, and c) manual manipulation.

Are there any other ways?

My sample is as follows:

-----
import struct
import ctypes

def test_struct(buf, offset):
    return struct.unpack_from("I", buf, offset)[0]

def test_ctypes(buf, offset):
    return ctypes.c_uint32.from_buffer(buf, offset).value

def test_multi(buf, offset):
    return (buf[offset] + (buf[offset+1] << 8) + (buf[offset+2] << 16) +
            (buf[offset+3] << 24))

buf_w = bytearray(5)
buf_w[1] = 1
buf_r = buffer(buf_w)

if __name__ == '__main__':
    import timeit

    t1 = timeit.Timer("test_struct(buf_r, 1)",
                      "from __main__ import test_struct, buf_r")
    t2 = timeit.Timer("test_ctypes(buf_w, 1)",
                      "from __main__ import test_ctypes, buf_w")
    t3 = timeit.Timer("test_multi(buf_w, 1)",
                      "from __main__ import test_multi, buf_w")
    print t1.timeit(number=1000)
    print t2.timeit(number=1000)
    print t3.timeit(number=1000)
-----

Yet the results are a bit confusing:

-----
number = 10000
0.0081958770752
0.012549161911
0.0112121105194

number = 1000
0.00087308883667
0.00125789642334
0.00110197067261

number = 100
9.17911529541e-05
0.000133991241455
0.00011420249939

number = 10
1.69277191162e-05
2.19345092773e-05
1.69277191162e-05

number = 1
1.00135803223e-05
1.00135803223e-05
5.96046447754e-06
-----

As the number of benchmarking loops decreases, method (c), manual
manipulation, overtakes the other two methods. However, at
number == 10000, the struct method wins.

Why does it happen?

Thanks,
Jacky (jacky.chao.wang#gmail.com)
 

Thomas Jollans

Hi there,

Recently I've been facing a problem: converting 4 bytes of a bytearray into
a 32-bit integer. As far as I can see, there are 3 ways:
a) using the struct module,

Yes, that's what it's for, and that's what you should be using.
b) using the ctypes module, and

Yeeaah, that would work, but that's really not what it's for. from_buffer
wants a writable buffer interface, which is unlikely to be what you want.
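
Roughly, a sketch of the difference (assuming Python 2.x as used in this
thread; from_buffer_copy is the ctypes variant that also accepts read-only
data, at the cost of a copy):

import ctypes

buf_w = bytearray(b'\x01\x02\x03\x04')              # writable buffer
n1 = ctypes.c_uint32.from_buffer(buf_w).value       # shares memory with buf_w
n2 = ctypes.c_uint32.from_buffer_copy(buf_w).value  # copies, so read-only sources work too
# from_buffer(buffer(buf_w)) would raise TypeError, since buffer() is read-only;
# both calls use native byte order, so n1 == n2 == 67305985 only on a little-endian machine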
c) manual manipulation.

Well, yes, you can do that, but it gets messy when you're working with more
complex data structures, or you have to consider byte order.
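
For instance, just to sketch what that looks like by hand (the bytes are
01 02 03 04; the values are in the comments):

data = bytearray(b'\x01\x02\x03\x04')

# little-endian, written out by hand:
le = data[0] | (data[1] << 8) | (data[2] << 16) | (data[3] << 24)   # 67305985
# big-endian needs a second, mirror-image expression:
be = (data[0] << 24) | (data[1] << 16) | (data[2] << 8) | data[3]   # 16909060
# with struct the same switch is just '<I' versus '>I'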
Are there any other ways?

You could write a C extension module tailored to your specific purpose ;-)
number = 1
1.00135803223e-05
1.00135803223e-05
5.96046447754e-06
-----

As the number of benchmarking loops decreases, method (c), manual
manipulation, overtakes the other two methods. However, at
number == 10000, the struct method wins.

Why does it happen?

struct wins because it's built for the job.

As for the small numbers: don't take these numbers seriously. Just don't. This
may be caused by the way your OS's scheduler handles things for all I know. If
there is an explanation for this unscientific observation, I have two guesses
what it might be:
* struct and ctypes still need to do some setup work, or something
* somebody is optimising something, but doesn't know what they should be
optimising in the first place after only a few iterations.
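
If you do want numbers worth comparing, timeit's repeat() helps: run each
benchmark several times with a reasonably large number and keep the minimum,
which filters out most of the noise. A sketch, reusing the functions and
buffers from your post:

import timeit

t = timeit.Timer("test_struct(buf_r, 1)",
                 "from __main__ import test_struct, buf_r")
# run the 100000-call benchmark 5 times and keep the best result
print min(t.repeat(repeat=5, number=100000))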
 

Jacky

Hi Thomas,

Thanks for your comments! Please check mine inline.

Yes, that's what it's for, and that's what you should be using.

My concern is that struct may need to parse the format string,
construct the result tuple, and then index element 0 of that tuple to
get the int out.

Shouldn't there be a more efficient way?
Yeeaah, that would work, but that's really not what it's for. from_buffer
wants a writable buffer interface, which is unlikely to be what you want.

Actually my buffer is writable --- it's a bytearray. Turning it into
a read-only one takes extra effort: wrapping the bytearray in
buffer().

My question is: this operation seems much simpler than the
former one, and it's very straightforward as well. Why is it slow?
Well, yes, you can do that, but it gets messy when you're working with more
complex data structures, or you have to consider byte order.

agree. :)
You could write a C extension module tailored to your specific purpose ;-)

Ha, yes. Actually I've already modified socketmodule.c myself ---
it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

The same goes for socket.send(...).
struct wins because it's built for the job.

As for the small numbers: don't take these numbers seriously. Just don't. This
may be caused by the way your OS's scheduler handles things for all I know. If
there is an explanation for this unscientific observation, I have two guesses
what it might be:
 * struct and ctypes still need to do some setup work, or something
 * somebody is optimising something, but doesn't know what they should be
   optimising in the first place after only a few iterations.

Agree. Thanks.

- Jacky
 

Mark Dickinson

Hi Thomas,

Thanks for your comments!  Please check mine inline.




My concern is that struct may need to parse the format string,
construct the result tuple, and then index element 0 of that tuple to
get the int out.

Shouldn't there be a more efficient way?

Well, you can improve on the struct solution by using the
struct.Struct class to avoid parsing the format string repeatedly:
>>> import struct
>>> S = struct.Struct('<I')
>>> S.unpack_from(buffer(bytearray([1,2,3,4,5])))
(67305985,)

This doesn't make a huge difference on my machine (OS X 10.6.4, 64-bit
build of Python 2.6) though; it's probably more effective for long
format strings. Adding:

def test_struct2(buf, offset, S=struct.Struct('<I')):
    return S.unpack_from(buf, offset)[0]

to your test code, I see a speedup of around 8% over your test_struct.

By the way, you may want to consider using an explicit byte-order/size
marker in your format string; i.e., use '<I' instead of 'I'. This
forces a 4-byte little-endian interpretation, regardless of the
platform you're running Python on.
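
For instance, a sketch (the first result assumes a little-endian machine, so
the bare 'I' only happens to agree with '<I' there):

>>> import struct
>>> data = buffer(bytearray([1, 2, 3, 4]))
>>> struct.unpack_from('I', data)[0]   # native byte order: platform dependent
67305985
>>> struct.unpack_from('<I', data)[0]  # standard size, little-endian, everywhere
67305985
>>> struct.unpack_from('>I', data)[0]  # standard size, big-endian, everywhere
16909060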
 

Thomas Jollans

Hi Thomas,

Thanks for your comments! Please check mine inline.



My concern is that struct may need to parse the format string,
construct the result tuple, and then index element 0 of that tuple to
get the int out.

Shouldn't there be a more efficient way?

The struct module is written in C, not in Python. It does have to parse a
string, yes, so, if you wrote your own, limited, C function to do the job, it
might be marginally faster.
Actually my buffer is writable --- it's a bytearray. Turning it into
a read-only one takes extra effort: wrapping the bytearray in
buffer().

My question is: this operation seems much simpler than the
former one, and it's very straightforward as well. Why is it slow?

Unlike struct, it constructs an object around your int that you're not
actually interested in.
it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

Well, that's what pointer arithmetic (in C) or slices (in Python) are for!
There's an argument to be made for sticking close to the traditional
(originally C) interface here - it's familiar.


- Thomas
 

Mark Dickinson

My concern is that struct may need to parse the format string,
construct the result tuple, and then index element 0 of that tuple to
get the int out.
Shouldn't there be a more efficient way?

Well, you can improve on the struct solution by using the
struct.Struct class to avoid parsing the format string repeatedly:
>>> import struct
>>> S = struct.Struct('<I')
>>> S.unpack_from(buffer(bytearray([1,2,3,4,5])))
(67305985,)

This doesn't make a huge difference on my machine (OS X 10.6.4, 64-bit
build of Python 2.6) though;  it's probably more effective for long
format strings.

Sorry, this was inaccurate: this makes almost *no* significant
difference on my machine for large test runs (10000 and up). For
small ones, though, it's faster. The reason is that the struct module
caches (up to 100, in the current implementation) previously used
format strings, so with your tests you're only ever parsing the format
string once anyway. Internally, the struct module converts that
format string to a Struct object, and squirrels that Struct object
away into its cache, which is implemented as a dict from format
strings to Struct objects. So the next time that the format string is
used it's simply looked up in the cache, and the Struct object
retrieved.
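
A rough way to see this (just a sketch; exact numbers are machine-dependent,
but the two timings come out very close because the module-level call is
served from that cache after the first parse):

import struct, timeit

setup = "import struct; S = struct.Struct('<I'); buf = buffer(bytearray([1, 2, 3, 4, 5]))"

# module-level call: '<I' is parsed once, then found in the internal cache
print timeit.Timer("struct.unpack_from('<I', buf, 1)", setup).timeit(number=100000)
# precompiled Struct object: skips even the cache lookup
print timeit.Timer("S.unpack_from(buf, 1)", setup).timeit(number=100000)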

By the way, in Python 3.2 there's yet another fun way to do this,
using int.from_bytes.
>>> int.from_bytes(bytearray([1,2,3,4]), 'little')
67305985
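
For the original "offset into a larger buffer" case it combines nicely with a
memoryview slice, so you don't copy an intermediate bytes object out of the
buffer first (again Python 3.2+, just a sketch):

>>> buf = bytearray(8)
>>> buf[1] = 1
>>> # 4 bytes starting at offset 1, unsigned, little-endian
>>> int.from_bytes(memoryview(buf)[1:5], 'little')
1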
 

Jacky

Hi Mark,

Thanks for your reply. Agreed, I'll use your suggestions. Thanks!

-Jacky

 

Jacky

it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

Well, that's what pointer arithmetic (in C) or slices (in Python) are for!
There's an argument to be made for sticking close to the traditional
(originally C) interface here - it's familiar.

Hi Thomas - I don't quite follow you. It would be great if you could
show me some code on this part...
 

Jacky


By the way, in Python 3.2 there's yet another fun way to do this,
using int.from_bytes.
int.from_bytes(bytearray([1,2,3,4]), 'little')

Thanks! It looks a lot like the ctypes way. ;)
 

Thomas Jollans

it's hard to imagine why the socket object provides the interface
socket.recv_from(buf[, num_bytes[, flags]]) but forgets the more
generic one: socket.recv_from(buf[, offset[, num_bytes[, flags]]])

Well, that's what pointer arithmetic (in C) or slices (in Python) are
for! There's an argument to be made for sticking close to the
traditional (originally C) interface here - it's familiar.

Hi Thomas - I don't quite follow you. It would be great if you could
show me some code on this part...

When I originally wrote that, I didn't check the Python docs; I just had a
quick look at the manual page.

This is the signature of the BSD socket recv function (recv(2)):

ssize_t recv(int sockfd, void *buf, size_t len, int flags);

so, to receive data into a buffer, you pass it the buffer pointer.

len = recv(sock, buf, full_len, 0);

To receive more data into the same buffer, you pass it a pointer further on:

len = recv(sock, buf+len, full_len-len, 0);
/* or, this might be clearer, but it's 100% the same: */
len = recv(sock, & buf[len], full_len-len, 0);

Now, in Python. I assume you were referring to socket.recv_into:

socket.recv_into(buffer[, nbytes[, flags]])

It's hard to imagine why this method exists at all. I think the recv method is
perfectly adequate:

buf = bytearray()
buf[:] = sock.recv(full_len)
# then:
lngth = len(buf)
buf[lngth:] = sock.recv(full_len - lngth)

But still, nothing's stopping us from using recv_into:

# create a buffer large enough. Oh this is so C...
buf = bytearray([0]) * full_len
lngth = sock.recv_into(buf, length_of_first_bit)
# okay, now let's fill the rest !
sock.recv_into(memoryview(buf)[lngth:])

In C, you can point your pointers wherever you want. In Python, you can
point your memoryview at buffers in any way you like, but there tend to be
better ways of doing things.
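
For instance, a memoryview slice is essentially a writable window into the
same bytearray, so filling it fills the original buffer in place (a sketch,
Python 2.7+):

buf = bytearray(8)
view = memoryview(buf)[4:]        # roughly "buf + 4" in C terms
view[0:4] = b'\xde\xad\xbe\xef'   # writes straight into buf[4:8], no copy of buf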

Cheers,

Thomas
 
