How to Read Bytes from a file

gregpinero · Mar 1, 2007

It seems like this would be easy but I'm drawing a blank.

What I want to do is be able to open any file in binary mode, and read
in one byte (8 bits) at a time and then count the number of 1 bits in
that byte.

I got as far as this but it is giving me strings and I'm not sure how
to accurately get to the byte/bit level.

f1=file('somefile','rb')
while 1:
    abyte=f1.read(1)

Thanks in advance for any help.

-Greg

Alex Martelli · Mar 1, 2007

You should probably prepare before the loop a mapping from char to number
of 1 bits in that char:

m = {}
for c in range(256):
    m[c] = countones(c)

and then sum up the values of m[abyte] into a running total (break from
the loop when 'not abyte', i.e. you're reading 0 bytes even though
asking for 1 -- that tells you the file is finished; remember to close
it).

A trivial way to do the countones function:

def countones(x):
    assert x>=0
    c = 0
    while x:
        c += (x&1)
        x >>= 1
    return c

you just don't want to call it too often, whence the previous advice to
call it just 256 times to prep a mapping.

If you download and install gmpy you can use gmpy.popcount as a fast
implementation of countones.
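A minimal sketch of the approach described above, written for Python 3 as an assumption (the code in this thread is Python 2, where file() still exists and read() returns a str). In Python 3, iterating over a bytes object yields ints, so the table can be indexed directly. The names ONES and count_one_bits_in_file, and the choice to read in chunks rather than one byte per call, are illustrative and not from the post.

```python
def countones(x):
    """Count the 1 bits in a non-negative integer by shifting them out."""
    assert x >= 0
    c = 0
    while x:
        c += x & 1
        x >>= 1
    return c

# Precompute the answer once for every possible byte value, 0 through 255.
ONES = [countones(b) for b in range(256)]

def count_one_bits_in_file(path, chunk_size=65536):
    """Return the total number of 1 bits in the file at path."""
    total = 0
    with open(path, "rb") as f:          # the with block closes the file
        while True:
            chunk = f.read(chunk_size)
            if not chunk:                # empty read means end of file
                break
            # Each item of a bytes object is an int in range(256).
            total += sum(ONES[b] for b in chunk)
    return total
```

Calling count_one_bits_in_file('somefile') would then return the running total for the whole file; passing chunk_size=1 reproduces the one-byte-at-a-time loop from the question.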

Alex

Leif K-Brooks · Mar 1, 2007

Alex said:
You should probably prepare before the loop a mapping from char to number
of 1 bits in that char:

m = {}
for c in range(256):
    m[c] = countones(c)

Wouldn't a list be more efficient?

m = [countones(c) for c in xrange(256)]

Bart Ogryczak · Mar 1, 2007

import struct
buf = open('somefile','rb').read()
count1 = lambda x: ((x&1)+(x&2>0)+(x&4>0)+(x&8>0)+(x&16>0)+(x&32>0)+
                    (x&64>0)+(x&128>0))
byteOnes = map(count1,struct.unpack('B'*len(buf),buf))

byteOnes[n] is the number of ones in byte n.

Jussi Salmela · Mar 1, 2007

Bart Ogryczak wrote:

import struct
buf = open('somefile','rb').read()
count1 = lambda x: ((x&1)+(x&2>0)+(x&4>0)+(x&8>0)+(x&16>0)+(x&32>0)+
                    (x&64>0)+(x&128>0))
byteOnes = map(count1,struct.unpack('B'*len(buf),buf))

byteOnes[n] is the number of ones in byte n.

I guess struct.unpack is not necessary, because:

byteOnes2 = map(count1, (ord(ch) for ch in buf))

seems to do the trick also.

Cheers,
Jussi

Alex Martelli · Mar 1, 2007

Leif K-Brooks said:
Alex said:

You should probably prepare before the loop a mapping from char to number
of 1 bits in that char:

m = {}
for c in range(256):
    m[c] = countones(c)

Wouldn't a list be more efficient?

m = [countones(c) for c in xrange(256)]

Yes, or an array.array -- actually I meant to use m[chr(c)] above (so
you could use the character you're reading directly to index m, rather
than calling ord(byte) a bazillion times for each byte you're reading),
but if you're using the numbers (as I did before) a list or array is
better.
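A rough sketch of the character-keyed table mentioned above, assuming Python 3, where read(1) returns a one-byte bytes object rather than a str. The keys are therefore bytes([c]) instead of chr(c). The names ONES_BY_BYTE and total_ones are hypothetical, and bin(c).count("1") is swapped in here as a short stand-in for countones.

```python
import io

# Key the table by the one-byte objects that read(1) returns,
# so no ord() call is needed inside the loop.
ONES_BY_BYTE = {bytes([c]): bin(c).count("1") for c in range(256)}

def total_ones(stream):
    """Sum the 1 bits of a binary stream, reading one byte at a time."""
    total = 0
    while True:
        abyte = stream.read(1)
        if not abyte:            # empty read: end of the stream
            break
        total += ONES_BY_BYTE[abyte]
    return total

# 0x03 has two 1 bits and 0xf0 has four.
print(total_ones(io.BytesIO(b"\x03\xf0")))  # 6
```

An open binary file can be passed in place of the io.BytesIO object, which is used here only to keep the example self-contained.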

Alex

gregpinero · Mar 1, 2007

import struct
buf = open('somefile','rb').read()
count1 = lambda x: ((x&1)+(x&2>0)+(x&4>0)+(x&8>0)+(x&16>0)+(x&32>0)+
                    (x&64>0)+(x&128>0))
byteOnes = map(count1,struct.unpack('B'*len(buf),buf))

byteOnes[n] is the number of ones in byte n.

This solution looks nice, but how does it work? I'm guessing
struct.unpack will provide me with 8 bit bytes (will this work on any
system?)

How does count1 work exactly?

Thanks for the help.

-Greg

John Machin · Mar 1, 2007

import struct
buf = open('somefile','rb').read()
count1 = lambda x: ((x&1)+(x&2>0)+(x&4>0)+(x&8>0)+(x&16>0)+(x&32>0)+
                    (x&64>0)+(x&128>0))
byteOnes = map(count1,struct.unpack('B'*len(buf),buf))

byteOnes = map(count1,struct.unpack('%dB'%len(buf),buf))
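The two format strings can be compared directly. This is a small sketch assuming Python 3, with a literal bytes value standing in for the file contents; a repeat count such as "4B" means the same thing to struct as "BBBB", without building a format string as long as the file.

```python
import struct

buf = b"\x00\x01\xfe\xff"  # stand-in for the bytes read from the file

repeated = struct.unpack("B" * len(buf), buf)    # format is "BBBB"
counted = struct.unpack("%dB" % len(buf), buf)   # format is "4B"

# Both give the same tuple of ints; in Python 3 that is also what
# iterating over the bytes object yields, so struct is optional there.
assert repeated == counted == tuple(buf)
print(counted)  # (0, 1, 254, 255)
```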

Bart Ogryczak · Mar 1, 2007

This solution looks nice, but how does it work? I'm guessing
struct.unpack will provide me with 8 bit bytes

unpack with the 'B' format gives you an int value equivalent to an
unsigned char (1 byte).

(will this work on any system?)

Any system with 8-bit bytes, which would mean any system made after
1965. I'm not aware of any Python implementation for UNIVAC, so I
wouldn't worry ;-)

How does count1 work exactly?

1,2,4,8,16,32,64,128 in binary are
1,10,100,1000,10000,100000,1000000,10000000
x&1 == 1 if x has its first (lowest) bit set to 1
x&2 == 2, so (x&2>0) == True, if x has its second bit set to 1
... and so on.
In an int context, True is interpreted as 1 and False as 0.
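The masking that count1 relies on can be traced for a single byte. The sketch below assumes Python 3 (for the f-string) and uses 0b10110010 as an arbitrary example value; it prints what each mask picks out and then the total.

```python
count1 = lambda x: ((x & 1) + (x & 2 > 0) + (x & 4 > 0) + (x & 8 > 0) +
                    (x & 16 > 0) + (x & 32 > 0) + (x & 64 > 0) + (x & 128 > 0))

x = 0b10110010  # 178, an arbitrary example byte

for mask in (1, 2, 4, 8, 16, 32, 64, 128):
    # x & mask is either 0 or the mask itself; comparing with 0 gives a
    # bool, and bools add up as 1 (True) or 0 (False).
    print(f"mask {mask:08b}  x & mask = {x & mask:08b}  set: {(x & mask) > 0}")

print(count1(x))  # 4
```

Note that in Python the & operator binds more tightly than the comparison, so x & 2 > 0 is read as (x & 2) > 0.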

gregpinero · Mar 1, 2007

Thanks Bart. That's perfect. The other suggestion was to precompute
count1 for all possible bytes, I guess that's 0-256, right?

Thanks again everyone for the help.

-Greg

Dennis Lee Bieber · Mar 2, 2007

Thanks Bart. That's perfect. The other suggestion was to precompute
count1 for all possible bytes, I guess that's 0-256, right?

Better not be: 256 is 1 0000 0000 in binary (0x100), which needs nine
bits.

The byte values are 0..255 (0x00 .. 0xFF).

Hendrik van Rooyen · Mar 2, 2007

Thanks Bart. That's perfect. The other suggestion was to precompute
count1 for all possible bytes, I guess that's 0-256, right?

0 to 255 inclusive, actually - that is 256 numbers...

The largest number representable in a byte is 255

eight bits, of value 128,64,32,16,8,4,2,1

Their sum is 255...

And then there is zero.
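The arithmetic can be checked in a few lines. This is only an illustrative sketch (Python 3 assumed; byte_values is a hypothetical name):

```python
byte_values = range(256)                 # 0 through 255 inclusive

print(len(byte_values))                  # 256
print(max(byte_values))                  # 255
print(sum(1 << i for i in range(8)))     # 255
print((256).bit_length())                # 9
```

So a table built with range(256) covers every byte value, while 256 itself needs a ninth bit.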

- Hendrik

Bart Ogryczak · Mar 2, 2007

Thanks Bart. That's perfect. The other suggestion was to precompute
count1 for all possible bytes, I guess that's 0-256, right?

0-255 actually. It'd be worth it if accessing a dictionary of
precomputed values were significantly faster than calculating the
lambda, which I doubt. I suspect it actually might be slower.

Piet van Oostrum · Mar 5, 2007

Bart Ogryczak said:
BO> Any system with 8-bit bytes, which would mean any system made after
BO> 1965. I'm not aware of any Python implementation for UNIVAC, so I
BO> wouldn't worry ;-)

1965? I worked with machines whose bytes were not 8 bits (CDC) until the
beginning of the 80's. :=( In fact, at that time the institution where Guido
worked also had such a machine, but Python came later.

Bart Ogryczak · Mar 5, 2007

1965? I worked with machines whose bytes were not 8 bits (CDC) until the
beginning of the 80's. :=( In fact, at that time the institution where Guido
worked also had such a machine, but Python came later.

Right, I should have written 'designed', not 'made'. UNIVACs were also
produced until the early 1980s. Anyway, I'd call it
paleoinformatics ;-)

Gabriel Genellina · Mar 6, 2007

0-255 actually. It'd be worth it if accessing a dictionary of
precomputed values were significantly faster than calculating the
lambda, which I doubt. I suspect it actually might be slower.

Dictionary access is highly optimized in Python. In fact, using a
precomputed dictionary is about 12 times faster:

py> import timeit
py> count1 = lambda x: (x&1)+(x&2>0)+(x&4>0)+(x&8>0)+(x&16>0)+(x&32>0)+(x&64>0)+(x&128>0)
py> d256 = dict((i, count1(i)) for i in range(256))
py> timeit.Timer("for x in range(256): w = d256[x]", "from __main__ import d256").repeat(number=10000)
[0.54261253874445003, 0.54763468541393934, 0.54499943428564279]
py> timeit.Timer("for x in range(256): w = count1(x)", "from __main__ import count1").repeat(number=10000)
[6.1867963665773118, 6.1967124313285638, 6.1666287195719178]
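For anyone who wants to repeat the measurement, here is a hedged Python 3 sketch of the same comparison. It adds a list-based table alongside the dictionary; the names l256 and best_time and the smaller run count are assumptions made for this example. No timings are shown because they depend on the machine and interpreter.

```python
import timeit

count1 = lambda x: ((x & 1) + (x & 2 > 0) + (x & 4 > 0) + (x & 8 > 0) +
                    (x & 16 > 0) + (x & 32 > 0) + (x & 64 > 0) + (x & 128 > 0))

d256 = {i: count1(i) for i in range(256)}   # dictionary lookup table
l256 = [count1(i) for i in range(256)]      # list lookup table

def best_time(stmt, number=1000):
    """Return the fastest of three timed runs of stmt, in seconds."""
    return min(timeit.repeat(stmt, globals=globals(), repeat=3, number=number))

results = {
    "dict": best_time("for x in range(256): w = d256[x]"),
    "list": best_time("for x in range(256): w = l256[x]"),
    "lambda": best_time("for x in range(256): w = count1(x)"),
}

for name, seconds in results.items():
    print(f"{name:>6}: {seconds:.4f} s")
```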

Hendrik van Rooyen · Mar 6, 2007

Piet van Oostrum said:
1965? I worked with machines whose bytes were not 8 bits (CDC) until the
beginning of the 80's. :=( In fact, at that time the institution where Guido
worked also had such a machine, but Python came later.

Those behemoths were EXPENSIVE - so it made a lot of sense to keep using
them until the point that it became obvious even to an accountant that the
maintenance cost was no longer worth it...

Would actually not surprise me if there were still a few around, doing
electricity accounts or something.

- Hendrik

Hendrik van Rooyen · Mar 6, 2007

Right, I should have written 'designed', not 'made'. UNIVACs were also
produced until the early 1980s. Anyway, I'd call it
paleoinformatics ;-)

The correct term is: "Data Processing", or DP for short.

- Hendrik

Dennis Lee Bieber · Mar 6, 2007

Would actually not surprise me if there were still a few around, doing
electricity accounts or something.

I would hope the accountants for that electric company have taken
into consideration the cost of the electricity to do that billing <G>

Matthias Julius · Mar 6, 2007

Gabriel Genellina said:
Dictionary access is highly optimized in Python. In fact, using a
precomputed dictionary is about 12 times faster:

Why use a dictionary and not a list?

Matthias
