How to read binary files?

jvdb · Jun 8, 2007

Hi all,

I need some help with the following issue, which I can't seem to solve.

I have a binary (PCL) file.
In this file I want to search for specific codes (like <0C>). I have
tried to solve it by reading the file character by character, but this
is very slow. With large files (>10MB) in particular, it takes quite
some time.
Does anyone have a hint/clue/solution for this?

Thanks in advance!

Jeroen

Diez B. Roggisch · Jun 8, 2007

jvdb said:
Hi all,

I need some help with the following issue, which I can't seem to solve.

I have a binary (PCL) file.
In this file I want to search for specific codes (like <0C>). I have
tried to solve it by reading the file character by character, but this
is very slow. With large files (>10MB) in particular, it takes quite
some time.
Does anyone have a hint/clue/solution for this?

What has the searching to do with the reading? 10MB easily fit into the
main memory of a decent PC, so just do

contents = open("file").read() # yes I know I should close the file...

print contents.find('\x0c')

Diez
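
A minimal sketch of the read-everything-then-search approach in current Python 3 syntax (the snippet above uses the Python 2 print statement). The function name `first_form_feed` is made up for illustration; the file is opened in binary mode and closed by the `with` block.

```python
def first_form_feed(path):
    """Return the offset of the first 0x0C byte in the file, or -1 if there is none."""
    # "rb" gives raw bytes; the with-block closes the file when done.
    with open(path, "rb") as f:
        contents = f.read()
    # bytes.find works like str.find but takes a bytes needle.
    return contents.find(b"\x0c")
```

This reads the whole file into memory at once, which is the point of the suggestion: a single `read()` plus `find()` avoids the per-character overhead of a Python-level loop.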

jvdb · Jun 8, 2007

Diez B. Roggisch said:
What has the searching to do with the reading? 10MB easily fit into the
main memory of a decent PC, so just do

contents = open("file").read() # yes I know I should close the file...

print contents.find('\x0c')

Diez

True. But there is another issue attached to the one I described.
Once I know how often this code occurs, I know the number of pages in the
file. After that I would like to be able to extract a given range of
data:
if file x contains 20 <0C> codes, then for example I would like to extract
instance 5 through instance 12 from the file.
The reason I want to do this: 0C stands for a page break in the PCL
language. This way I would be able to extract a certain range of
pages from the file.

Diez B. Roggisch · Jun 8, 2007

jvdb said:
True. But there is another issue attached to the one I described.
Once I know how often this code occurs, I know the number of pages in the
file. After that I would like to be able to extract a given range of
data:
if file x contains 20 <0C> codes, then for example I would like to extract
instance 5 through instance 12 from the file.
The reason I want to do this: 0C stands for a page break in the PCL
language. This way I would be able to extract a certain range of
pages from the file.

And? Finding the respective indices by using

last_needle_position = 0
positions = []
while last_needle_position != -1:
    last_needle_position = contents.find(needle, last_needle_position+1)
    if last_needle_position != -1:
        positions.append(last_needle_position)

will find all the page breaks. Then just slice contents appropriately.
Did you read the Python tutorial?

diez
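
The "find the positions, then slice" idea could look roughly like this in Python 3. The names `find_all` and `extract_pages` are hypothetical, and the sketch assumes every 0x0C byte really is a page break and that page k ends with the k-th form feed. Unlike the loop above, which starts searching at index 1, `find_all` starts at index 0, so a needle at the very beginning of the data is not skipped.

```python
def find_all(contents, needle=b"\x0c"):
    """Return the offsets of every occurrence of needle in contents."""
    positions = []
    pos = contents.find(needle)
    while pos != -1:
        positions.append(pos)
        pos = contents.find(needle, pos + 1)
    return positions


def extract_pages(contents, first, last):
    """Return pages first..last (1-based, inclusive), each ending with its form feed."""
    breaks = find_all(contents)
    if not 1 <= first <= last <= len(breaks):
        raise ValueError("page range outside 1..%d" % len(breaks))
    # Page 1 starts at offset 0; page k starts right after form feed k-1.
    start = 0 if first == 1 else breaks[first - 2] + 1
    end = breaks[last - 1] + 1
    return contents[start:end]


sample = b"A\x0cBB\x0cCCC\x0cD\x0c"
print(find_all(sample))             # [1, 4, 8, 10]
print(extract_pages(sample, 2, 3))  # b'BB\x0cCCC\x0c'
```

The sketch ignores any job setup commands at the start of the file, which a real extraction may need to carry over into the output.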

Marc 'BlackJack' Rintsch · Jun 8, 2007

Diez B. Roggisch said:
And? Finding the respective indices by using

last_needle_position = 0
positions = []
while last_needle_position != -1:
    last_needle_position = contents.find(needle, last_needle_position+1)
    if last_needle_position != -1:
        positions.append(last_needle_position)

will find all the page breaks. Then just slice contents appropriately.
Did you read the Python tutorial?

Maybe splitting at '\x0c', selecting/slicing the wanted pages and joining
them again is enough, depending on the size of the files and the available
memory, of course.

One problem I see is that '\x0c' may not always mark the end of a page. It may
occur in raster image data too, I guess.

Ciao,
Marc 'BlackJack' Rintsch
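
A rough Python 3 sketch of the split/select/join variant. `select_pages` is an invented name, page numbers are taken to be 1-based, and the sketch does no range checking and does not address the raster data caveat.

```python
def select_pages(contents, first, last):
    """Return pages first..last (1-based, inclusive) via split and join."""
    # split() drops the separators; the final list item is whatever follows
    # the last form feed (often empty).
    pages = contents.split(b"\x0c")
    wanted = pages[first - 1:last]
    # Put a form feed back at the end of each selected page.
    return b"".join(page + b"\x0c" for page in wanted)


sample = b"A\x0cBB\x0cCCC\x0cD\x0c"
print(select_pages(sample, 2, 3))  # b'BB\x0cCCC\x0c'
```

Compared with collecting offsets, this makes a second copy of the data as a list of pages, so it trades memory for brevity.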

jvdb · Jun 8, 2007

Hi,

Your last comment is also something I have noticed. There are a number
of cases where this happens, and I have to deal with them too.
I will dive into this on Monday, after this hot weekend.

cheers,
Jeroen
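
One possible way to handle 0x0C bytes inside raster data, sketched under a strong assumption: that the binary payloads in the file are introduced only by the PCL "transfer raster data" command in the plain form ESC * b <n> W, followed by exactly <n> payload bytes. Other PCL commands also carry binary payloads, and parameterized sequences can be combined into forms this pattern does not match, so treat this as a starting point only. The names `RASTER_ROW` and `page_break_offsets` are made up.

```python
import re

# Assumed header form: ESC * b <decimal byte count> W
RASTER_ROW = re.compile(rb"\x1b\*b(\d+)W")


def page_break_offsets(data):
    """Return offsets of 0x0C bytes that are not inside a raster payload."""
    offsets = []
    pos = 0
    while pos < len(data):
        ff = data.find(b"\x0c", pos)
        if ff == -1:
            break
        # Is there a raster header between pos and the next 0x0C?
        header = RASTER_ROW.search(data, pos, ff)
        if header is None:
            offsets.append(ff)
            pos = ff + 1
        else:
            # Jump over the payload so its bytes are never inspected.
            pos = header.end() + int(header.group(1))
    return offsets


sample = b"page1\x0c" + b"\x1b*b3W" + b"\x0c\x0c\x0c" + b"page2\x0c"
print(page_break_offsets(sample))  # [5, 19]
```

In the sample, the three 0x0C bytes after the raster header are payload and are skipped; only the form feeds at offsets 5 and 19 are reported.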

Grant Edwards · Jun 8, 2007

jvdb said:
I have a binary (PCL) file.
In this file I want to search for specific codes (like <0C>). I have
tried to solve it by reading the file character by character, but this
is very slow. With large files (>10MB) in particular, it takes quite
some time.
Does anyone have a hint/clue/solution for this?

I'd memmap the file.

http://docs.python.org/lib/module-mmap.html

If you prefer it to appear as an array of bytes instead of a
string, the various numeric/array packages can do that.

Numarray: http://stsdas.stsci.edu/numarray/numarray-1.5.html/module-numarray.memmap.html
Vmaps: http://snafu.freedom.org/Vmaps/Vmaps.html
Numpy: <documentation is not free>

Since I can't point you to Numpy docs, here's a link to a
newsgroup thread with an example for numpy:

http://groups.google.com/group/comp.lang.python/browse_frm/thread/c63c3e281df99897/2336baa98386d5e7
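
A small sketch of the memory-mapping idea using the standard library `mmap` module in Python 3. The function name `count_form_feeds` is hypothetical. The mapping is read-only, and the operating system pages the file in on demand instead of copying it into a Python bytes object up front.

```python
import mmap


def count_form_feeds(path):
    """Count 0x0C bytes in a file by searching a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\x0c")
            while pos != -1:
                count += 1
                pos = mm.find(b"\x0c", pos + 1)
            return count
```

The mapped object also supports slicing, so `mm[start:end]` can be used to pull out a byte range once the offsets are known.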

Roger Miller · Jun 8, 2007

Diez B. Roggisch said:
What has the searching to do with the reading? 10MB easily fit into the
main memory of a decent PC, so just do

contents = open("file").read() # yes I know I should close the file...

print contents.find('\x0c')

Diez

Better make that open("file", "rb").
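
To illustrate why the mode matters, here is a small Python 3 demonstration; the temporary file and the sample bytes are invented for the example. In binary mode the bytes come back unchanged, while text mode decodes them and, with default newline handling, turns "\r\n" into "\n", which shifts every later offset.

```python
import os
import tempfile

raw = b"line1\r\nline2\x0c"

# Write the sample bytes to a scratch file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Binary mode: exactly the bytes on disk.
with open(path, "rb") as f:
    binary = f.read()

# Text mode: decoded to str, with "\r\n" translated to "\n".
with open(path, "r", encoding="latin-1") as f:
    text = f.read()

os.remove(path)

print(binary.find(b"\x0c"))  # 12
print(text.find("\x0c"))     # 11
```

For a PCL file, where offsets and byte counts must be exact, the one-byte difference is enough to corrupt any slicing based on the text-mode result.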
