Implementing file reading in C/Python


Johannes Bauer

Marc said:
Yours took ~37 minutes for 2 GiB here. This "just" ~15 minutes:

Ah, ok... when implementing your suggestions in the other post, I did
not get such a drastic performance increase. I really will have a look
at it and try to locate where I'm wasting the time.

Thanks a lot,
Kind regards,
Johannes
 

Johannes Bauer

mk said:
This probably doesn't contribute much, but have you tried using Python
profiler? You might have *something* wrong that eats up a lot of time in
the code.

No - I didn't know there was a profiler, nor have I found anything
meaningful (there seems to be a profiling C interface, but that won't
get me anywhere). Is that a separate tool or something? Could you
provide a link?
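(The profiler referred to here is in the standard library: the profile module and its faster cProfile variant, available since Python 2.5. A minimal sketch of using it, assuming the script has an entry point called main() -- the name is just a placeholder:)

import cProfile
import pstats

# Run the program under the profiler and dump the raw statistics.
cProfile.run("main()", "profile.out")

# Print the 20 most expensive calls, sorted by cumulative time.
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(20)

Alternatively, python -m cProfile yourscript.py profiles a whole script from the command line.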
The factor of 30 indeed does not seem right -- I have done somewhat
similar stuff (calculating Levenshtein distance [edit distance] on words
read from very large files), coded the same algorithm in pure Python and
C++ (using linked lists in C++) and Python version was 2.5 times slower.

Yup, that was about what I had expected (and what I could well live
with, it's a tradeoff).

Thanks,
Kind regards,
Johannes
 

rurpy

mk said:
The factor of 30 indeed does not seem right -- I have done somewhat
similar stuff (calculating Levenshtein distance [edit distance] on words
read from very large files), coded the same algorithm in pure Python and
C++ (using linked lists in C++) and Python version was 2.5 times slower.

Yup, that was about what I had expected (and what I could well live
with, it's a tradeoff).

The rule-of-thumb I use is that Python is generally 5 to 50 times
slower than C. It is considered blasphemy to say it in this group,
but Python is slow. It does of course have many compensating
advantages that make it the better choice when runtime speed is not
of primary importance.
 

MRAB

Marc said:
As this was horribly slow (20 Minutes for a 2GB file) I coded the whole
thing in C also:

Yours took ~37 minutes for 2 GiB here. This "just" ~15 minutes:

#!/usr/bin/env python
from __future__ import division, with_statement
import os
import sys
from collections import defaultdict
from functools import partial
from itertools import imap


def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1

        yield max((count, value)
                  for value, count in histogram.iteritems())[1]
[snip]
Would it be faster if histogram was a list initialised to [0] * 256?
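(For illustration, a minimal sketch of that list-based variant, assuming Python 2, where iterating over a block yields 1-character strings, so ord() is needed to turn each byte into a list index -- which is exactly the trade-off discussed in the follow-up below; the helper name is just illustrative:)

def iter_max_values_list(blocks):
    # Same idea as iter_max_values, but with a flat 256-entry list
    # instead of a defaultdict; ord() maps each byte to its index.
    for block in blocks:
        histogram = [0] * 256
        for byte in block:
            histogram[ord(byte)] += 1
        # The index of the maximum count is the most frequent byte value.
        yield histogram.index(max(histogram))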
 

bearophileHUGS

Johannes Bauer, I was about to start writing a faster version. I think
with some care and Psyco you can get it down to about 5 times slower
than C, or something like that.

To do that you need to use almost the same code as the C version: a
list of 256 ints for the frequencies, a manual loop instead of max(),
no itertools or generators, maybe splitting the code into two functions
to allow Psyco to optimize better, maybe using another array("...") for
the frequencies too. The data can be read into an array.array("B"), and
so on.
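(A rough, untimed sketch along those lines -- blocks read into an array.array("B"), a flat frequency list, a manual maximum loop, and Psyco bound to the hot function if it happens to be installed; the function names and block size are just placeholders:)

import array

def block_mode(block):
    # Frequency count plus a hand-written maximum loop -- deliberately
    # C-like so that Psyco has a chance to compile it well.
    freq = [0] * 256
    for value in block:
        freq[value] += 1
    best_value = 0
    best_count = freq[0]
    for value in xrange(1, 256):
        if freq[value] > best_count:
            best_count = freq[value]
            best_value = value
    return best_value

def iter_block_modes(filename, block_size=1024 * 1024):
    f = open(filename, "rb")
    try:
        while True:
            block = array.array("B")
            try:
                # fromfile() raises EOFError on a short read, but keeps
                # whatever it did manage to read.
                block.fromfile(f, block_size)
            except EOFError:
                pass
            if not block:
                break
            yield block_mode(block)
    finally:
        f.close()

try:
    import psyco
    psyco.bind(block_mode)
except ImportError:
    pass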

But I think all this work is a waste of time. I like Python, but that
C code, after some cleaning and polishing, looks fine for this job. Of
course there are other languages that may give you slightly nicer code
for this program, like D, and there may be ways to use numpy too to
speed up the computation of the mode, but they don't seem that
important this time.

Bye,
bearophile
 

Sion Arrowsmith

Grant Edwards said:
If I were you, I'd try mmap()ing the file instead of reading it
into string objects one chunk at a time.

You've snipped the bit further on in that sentence where the OP
says that the file of interest is 2GB. Do you still want to try
mmap'ing it?
 

Marc 'BlackJack' Rintsch

Marc said:
def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1

        yield max((count, value)
                  for value, count in histogram.iteritems())[1]
[snip]
Would it be faster if histogram was a list initialised to [0] * 256?

Don't know. Then for every byte in the 2 GiB we have to call `ord()`.
Maybe the speedup from the list compensates for this, maybe not.

I think that having to do something with *every* byte of that really
large file *at Python level* is the main problem here. In C those are
just primitive numbers; in Python there's all the object overhead.

Ciao,
Marc 'BlackJack' Rintsch
 

John Machin

The factor of 30 indeed does not seem right -- I have done somewhat
similar stuff (calculating Levenshtein distance [edit distance] on words
read from very large files), coded the same algorithm in pure Python and
C++ (using linked lists in C++) and Python version was 2.5 times slower.

Levenshtein distance using linked lists? That's novel. Care to
divulge?

And if C++ is using linked lists and Python isn't, it's not really the
same algorithm, is it?

Cheers,
John
 

Rhamphoryncus

Marc 'BlackJack' Rintsch wrote:
def iter_max_values(blocks, block_count):
    for i, block in enumerate(blocks):
        histogram = defaultdict(int)
        for byte in block:
            histogram[byte] += 1
        yield max((count, value)
                  for value, count in histogram.iteritems())[1]
[snip]
Would it be faster if histogram was a list initialised to [0] * 256?

Don't know.  Then for every byte in the 2 GiB we have to call `ord()`.
Maybe the speedup from the list compensates for this, maybe not.

I think that having to do something with *every* byte of that really
large file *at Python level* is the main problem here.  In C those are
just primitive numbers; in Python there's all the object overhead.

struct's B format might help here. Also, struct.unpack_from could
probably be combined with mmap to avoid copying the input. Not to
mention that the ints 0..256 are all cached by CPython and won't be
allocated/deallocated per byte.
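(A sketch of that combination, assuming Python 2.5+ for struct.unpack_from: each block is unpacked in place out of the memory-mapped file as unsigned bytes, so no intermediate string slices are created, and the resulting small ints all come out of CPython's cache. The function name and block size are just placeholders:)

import mmap
import os
import struct

def iter_block_modes_mmap(path, block_size=1024 * 1024):
    size = os.path.getsize(path)
    f = open(path, "rb")
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        offset = 0
        while offset < size:
            length = min(block_size, size - offset)
            # Unpack the block directly from the mapping ("B" = unsigned byte).
            values = struct.unpack_from("%dB" % length, mapped, offset)
            freq = [0] * 256
            for value in values:
                freq[value] += 1
            yield freq.index(max(freq))
            offset += length
    finally:
        mapped.close()
        f.close()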
 

Sion Arrowsmith

Grant Edwards said:
Sure. The larger the file, the more you gain from mmap'ing it.
2GB should easily fit within the process's virtual memory
space.

Assuming you're in a 64bit world. Me, I've only got 2GB of address
space available to play in -- mmap'ing all of it is out of the question.

But I suppose that mmap'ing it a chunk at a time instead of reading a
chunk at a time might be worth considering.
 

sturlamolden

You've snipped the bit further on in that sentence where the OP
says that the file of interest is 2GB. Do you still want to try
mmap'ing it?

Python's mmap object does not take an offset parameter. If it did, one
could mmap smaller portions of the file.
 

Sion Arrowsmith

In case the cancel didn't get through:

Sion Arrowsmith said:
Assuming you're in a 64bit world. Me, I've only got 2GB of address
space available to play in -- mmap'ing all of it out of the question.

And today's moral is: try it before posting. Yeah, I can map a 2GB
file no problem, complete with associated 2GB+ allocated VM. The
addressing is clearly not working how I was expecting it to.
 

sturlamolden

And today's moral is: try it before posting. Yeah, I can map a 2GB
file no problem, complete with associated 2GB+ allocated VM. The
addressing is clearly not working how I was expecting it to.

The virtual memory space of a 32 bit process is 4 GB.
 

Hrvoje Niksic

sturlamolden said:
Python's mmap object does not take an offset parameter. If it did, one
could mmap smaller portions of the file.

As of 2.6 it does, but that might not be of much use if you're using
2.5.x or earlier. If you speak Python/C and really need offset, you
could backport the mmap module from 2.6 and compile it under a
different name for 2.5.
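(A minimal sketch of mapping just a window of the file with the 2.6 interface; the offset passed to mmap has to be a multiple of mmap.ALLOCATIONGRANULARITY, so it is rounded down here and the caller compensates with the returned delta -- map_window is just an illustrative name:)

import mmap

def map_window(f, offset, length):
    # Round the requested offset down to the allocation granularity
    # that mmap requires, and report how far into the mapping the
    # caller's data actually starts.
    gran = mmap.ALLOCATIONGRANULARITY
    aligned = (offset // gran) * gran
    delta = offset - aligned
    window = mmap.mmap(f.fileno(), length + delta,
                       access=mmap.ACCESS_READ, offset=aligned)
    return window, delta

For example, window, delta = map_window(f, 3 * 2**20, 2**20) maps one mebibyte starting three mebibytes into the file, with the requested data beginning at window[delta:].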
 

Steve Holden

sturlamolden said:
The virtual memory space of a 32 bit process is 4 GB.
I believe, though, that in some environments 2GB of that is mapped onto
the operating system, to allow system calls to access OS memory
structures without any VM remapping being required - see

http://blogs.technet.com/markrussinovich/archive/2008/11/17/3155406.aspx.

Things have, however, improved if we are to believe what we read in

http://www.tenouk.com/WinVirtualAddressSpace.html

The very idea of mapping part of a process's virtual address space onto
an area in which "low-level system code resides, so writing to this
region may corrupt the system, with potentially catastrophic
consequences" seems to be asking for trouble to me. It's surprising
things used to don't go wrong with Windows all the time, really. Oh,
wait a minute, they did, didn't they? Still do for that matter ...

getting-sicker-of-vista-by-the-minute-ly yr's - steve
 

Marc 'BlackJack' Rintsch

The very idea of mapping part of a process's virtual address space onto
an area in which "low-level system code resides, so writing to this
region may corrupt the system, with potentially catastrophic
consequences" seems to be asking for trouble to me.

That's why those regions are usually write-protected and marked "no
execute" for code running in the user area of the virtual address space.

Ciao,
Marc 'BlackJack' Rintsch
 

David Bolen

Johannes Bauer said:
Yup, I changed the Python code to behave the same way the C code did -
however overall it's not much of an improvement: Takes about 15 minutes
to execute (still factor 23).

Not sure this is completely fair if you're only looking for a pure
Python solution, but to be honest, looping through a gazillion
individual bytes of information sort of begs for trying to offload
that into a library that can execute faster, while maintaining the
convenience of Python outside of the pure number crunching.

I'd assume numeric/numpy might have applicable functions, but I don't
use those libraries much, whereas I've been using OpenCV recently for
a lot of image processing work, and it has matrix/histogram support,
which seems to be a good match for your needs.
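(On the numpy side there is in fact a directly applicable function, numpy.bincount. A sketch of the inner step, where data is one of the chunks read from the file, as in the code below:)

import numpy

# View the chunk as unsigned bytes without copying, count how often
# each value occurs, and take the most frequent one.
values = numpy.frombuffer(data, dtype=numpy.uint8)
counts = numpy.bincount(values)
most = int(counts.argmax())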

For example, assuming the OpenCV library and ctypes-opencv wrapper, add
the following before the file I/O loop:

from opencv import *

# Histogram for each file chunk
hist = cvCreateHist([256], CV_HIST_ARRAY, [(0,256)])

then, replace (using one of your posted methods as a sample):

datamap = {}
for i in data:
    datamap[i] = datamap.get(i, 0) + 1

array = sorted([(b, a) for (a, b) in datamap.items()], reverse=True)
most = ord(array[0][1])

with:

matrix = cvMat(1, len(data), CV_8UC1, data)
cvCalcHist([matrix], hist)
most = cvGetMinMaxHistValue(hist,
                            min_val=False, max_val=False,
                            min_idx=False, max_idx=True)

should give you your results in a fraction of the time. I didn't run
with a full-size data file, but for a smaller one using smaller chunks
the OpenCV variant ran in about 1/10 of the time, and that was while
leaving all the other remaining Python code in place.

Note that the results may not be identical to some of your other
methods in the case of multiple values with the same counts, as the
OpenCV histogram min/max call will always pick the lower value in such
cases, whereas some of your code (such as the above) will pick the
upper value, and your original code depended on the order of
information returned by dict.items.

This sort of small dedicated high performance choke point is probably
also perfect for something like Pyrex/Cython, although that would
require a compiler to build the extension for the histogram code.

-- David
 

mk

John said:
The factor of 30 indeed does not seem right -- I have done somewhat
similar stuff (calculating Levenshtein distance [edit distance] on words
read from very large files), coded the same algorithm in pure Python and
C++ (using linked lists in C++) and Python version was 2.5 times slower.
Levenshtein distance using linked lists? That's novel. Care to
divulge?

I meant: using linked lists to store words that are compared. I found
using vectors was slow.

Regards,
mk
 
