Python(2.5) reads an input file FASTER than pure C(Mingw)

n00m · Apr 26, 2008

Both programs below read the same huge (~35MB) text file.
The file has more than 1000000 lines, and each line is shorter than 99 chars.

Stable result:
Python runs ~0.65s
C : ~0.70s

Any thoughts?

import time
t=time.time()
f=open('D:\\some.txt','r')
z=f.readlines()
f.close()
print len(z)
print time.time()-t
m=input()
print z[m]

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <ctime>

using namespace std;
char vs[1002000][99];
FILE *fp=fopen("D:\\some.txt","r");

int main() {
    int i=0;
    while (true) {
        if (!fgets(vs[i],999,fp)) break;
        ++i;
    }
    fclose(fp);
    cout << i << endl;
    cout << clock()/CLOCKS_PER_SEC << endl;

    int m;
    cin >> m;
    cout << vs[m];
    system("pause");
    return 0;
}

Carl Banks · Apr 26, 2008

n00m said:
Both codes below read the same huge(~35MB) text file.
In the file > 1000000 lines, the length of each line < 99 chars.

Stable result:
Python runs ~0.65s
C : ~0.70s

Any thoughts?

Yes.

Most of the dirty work in the Python example is spent in a tight loop
written in C. This is very likely to be faster in Python on Windows
than in your "C" example, for several reasons:

1. Python is compiled with Microsoft's C compiler, which produces more
optimal code than Mingw.

2. The Python readlines() function has been in the library for a long
time, and many developers have had time to optimize its
performance.

3. Your "pure C" code isn't even C, let alone pure C. It's C++. On
most systems, the C++ iostream libraries have a lot more overhead than
C's stdio.

And, finally, we must not fail to observe that you measured these
times without startup, which is obviously much greater for Python. (Of
course, we only need to point this out so nobody misunderstands you as
claiming that this Python process will terminate faster than the C++
one.)

So, I must regrettably opine that your example isn't very meaningful.
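
The startup cost mentioned above can be measured from outside the process. A minimal sketch, assuming a present-day Python 3 interpreter rather than the 2.5 used in this thread; the helper name `time_child` is made up for illustration. It times a child interpreter that runs an empty program, so the figure is startup plus shutdown only.

```python
import subprocess
import sys
import time


def time_child(args):
    """Run a child process; return (exit_code, wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(args, stdout=subprocess.DEVNULL)
    return completed.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # An interpreter that executes nothing: startup + shutdown only.
    code, seconds = time_child([sys.executable, "-c", "pass"])
    print("exit code %d, %.3f s" % (code, seconds))
```

Passing the command line of the compiled C or C++ program instead would give a comparable whole-process figure.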

n00m · Apr 26, 2008

fgets() from C++ iostream library???

I guess if I'd come up with "Python reads SLOWER than C"
I'd get another (no less) smart explanation of "why it's so".

SL · Apr 26, 2008

First of all, I would rewrite the C loop as:

int main() {
    int i=0;
    while (fgets(vs[i],999,fp))
        ++i;
}

but I think that the difference comes from what you do at the beginning of
the C source:

char vs[1002000][99];

This reserves 99,198,000 bytes, so expect a lot of cache thrashing in the C
code!

Is there an implementation of f.readlines on the internet somewhere? I'd be
interested to see how they implemented it. I'm pretty sure they did it
smarter than just reserving 100 MB of data.
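
The byte count quoted above is just the product of the two array dimensions. A small sketch of the arithmetic; the 35 MB file size is the approximate figure from the first post, so the ratio is only a rough estimate.

```python
ROWS = 1002000                 # first dimension of vs
ROW_BYTES = 99                 # second dimension of vs (one byte per char)
FILE_BYTES = 35 * 1024 * 1024  # "~35MB" from the first post, approximate

reserved = ROWS * ROW_BYTES
print(reserved)                         # 99198000
print(round(reserved / FILE_BYTES, 1))  # 2.7
```

So the static array is roughly 2.7 times the size of the data it ends up holding.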

SL · Apr 26, 2008

n00m said:
char vs[1002000][99];
if (!fgets(vs[i],999,fp)) break;

BTW, why do you declare the array rows as 99 but pass 999 to fgets to read a
line?

hdante · Apr 26, 2008

First, try again with pure C code compiled with a C compiler, not
with C++ code and a C++ compiler.
Then, tweak the code to use more buffering, to make it more similar
to the readlines code, like this (not tested):

#include <stdio.h>
#include <time.h>

char vs[1002000][100];
char buffer[65536];

int main(void) {
    FILE *fp;
    int i, m;
    clock_t begin, end;
    double t;

    begin = clock();
    fp = fopen("cvspython.txt", "r");
    i = 0;
    /* give the stream a 64 KB, fully buffered I/O buffer */
    setvbuf(fp, buffer, _IOFBF, sizeof(buffer));
    while(1) {
        if(!fgets(vs[i], 100, fp)) break;
        ++i;
    }
    fclose(fp);
    printf("%d\n", i);
    end = clock();
    t = (double)(end - begin)/CLOCKS_PER_SEC;
    printf("%g\n", t);

    scanf("%d", &m);
    printf("%s\n", vs[m]);
    getchar();
    return 0;
}

Finally, repeat your statement again, if necessary.
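
For comparison, the same buffering knob exists on the Python side. A minimal sketch in present-day Python 3 (an assumption; the thread uses 2.5): the built-in `open()` takes a `buffering` argument, used here with the same 65536 bytes as the C code above. The function name `count_lines` and the sample record are made up for illustration.

```python
import os
import tempfile


def count_lines(path, buffer_size=65536):
    """Count lines through a binary file object with an explicit buffer size."""
    count = 0
    with open(path, "rb", buffering=buffer_size) as f:
        for _ in f:
            count += 1
    return count


if __name__ == "__main__":
    # Small throwaway file standing in for the 35 MB input.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"555-0100 John Smith\n" * 1000)
    print(count_lines(path))
    os.remove(path)
```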

hdante · Apr 26, 2008

n00m said:
fgets() from C++ iostream library???

fgets is part of the standard C++ library and it lives in the std
namespace.

n00m · Apr 26, 2008

SL said:
char vs[1002000][99];

The file holds 1001622 (or so) records like phone number + first/last names.
So the reservation makes sense, I think. Populating a vector<string>
is a zillion times slower.

I was greatly surprised how fast it is. As a matter of fact, that's the
point of my message here.

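SL said:
BTW, why do you declare the array rows as 99 but pass 999 to fgets to read a
line?

It doesn't matter. Either way, reading of the current line stops at
'\n'.
A big number such as 999 (or 9999, 99999, ...) just ensures that each line
is read up to its end.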
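
The "stop at newline or at the limit" behaviour has a close Python analogue in `readline(size)`, which may make the semantics easier to see. This is only an illustration in present-day Python 3, not the C code from the thread. One difference worth noting: C's fgets stores at most size-1 characters plus a terminating '\0' into the buffer it is given, so a limit larger than the row lets an overlong line spill into the following rows, whereas Python simply returns a shorter string.

```python
import io

f = io.StringIO("short\na much longer line\n")

first = f.readline(10)    # newline comes before the limit
second = f.readline(10)   # limit reached: only part of the line is returned
third = f.readline(10)    # the rest of that same line

print(repr(first), repr(second), repr(third))
```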

hdante said:
fgets is part of the standard C++ library and it lives in the std
namespace.

I thought it was from <stdio.h>. Anyway, it doesn't matter. PS: Thanks
for the code.

2All:
The origin of my "discovery" is http://www.spoj.pl/ranks/SBANK/start=400
I never thought it could be done in Python (there are a couple of Python
solutions even faster than mine), and without stuff like radix sort etc.

n00m · Apr 26, 2008

hdante:

I ran your code quite a few times.
Its time: 0.734s.
Mine: 0.703-0.718s.

PS All I have is an ancient Mingw compiler (~1.9.5v) in Dev-C++.

Carl Banks · Apr 26, 2008

n00m said:
fgets() from C++ iostream library???

Sheesh. That'll teach me to read carefully. (Ok, it probably won't.)

The other two points still apply.

Carl Banks

hdante · Apr 26, 2008

n00m said:
I ran your code quite a few times.
Its time: 0.734s.
Mine: 0.703-0.718s.

Okay, now I believe you.

The next step would be to reimplement readline.
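
One way such a reimplementation could look is sketched below, in present-day Python 3. This is not how CPython implements readlines(); it only illustrates the general idea of reading large chunks and splitting them into lines. The function name `read_lines_chunked` and the 64 KB chunk size are assumptions.

```python
import os
import tempfile


def read_lines_chunked(path, chunk_size=65536):
    """Return the file's lines (newline kept), reading chunk_size bytes at a time."""
    lines = []
    tail = b""  # incomplete last line carried over from the previous chunk
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pieces = (tail + chunk).split(b"\n")
            tail = pieces.pop()  # bytes after the last newline (may be empty)
            lines.extend(piece + b"\n" for piece in pieces)
    if tail:
        lines.append(tail)  # final line without a trailing newline
    return lines


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"first\nsecond\nthird")
    print(read_lines_chunked(path, chunk_size=4))
    os.remove(path)
```

The carried-over `tail` is the part that is easy to get wrong: a line can straddle two chunks, so it must be joined before splitting.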

n00m · Apr 27, 2008

Not so simple, guys.
E.g., I can't solve this one in Python: http://www.spoj.pl/problems/INTEST/
I keep getting TLE (time limit exceeded). Any ideas? After all, it's
the weekend.

450. Enormous Input Test
Problem code: INTEST

The purpose of this problem is to verify whether the method you are
using to read input data is sufficiently fast to handle problems
branded with the enormous Input/Output warning. You are expected to be
able to process at least 2.5MB of input data per second at runtime.

Input
The input begins with two positive integers n k (n, k <= 10^7). The next
n lines of input contain one positive integer t_i, not greater than
10^9, each.

Output
Write a single integer to output, denoting how many integers t_i are
divisible by k.

Example
Input:
7 3
1
51
966369
7
9
999996
11

Output:
4
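
A common approach to input-bound problems like this one is to read everything in one call and split it, instead of calling readline() n times. A minimal sketch in present-day Python 3 (the judge at the time ran older Python, so this is illustrative only, with no claim that it fits the time limit); the names `count_divisible` and `main` are made up.

```python
import sys


def count_divisible(data):
    """data: the whole input as bytes. Return how many t_i are divisible by k."""
    tokens = data.split()
    n, k = int(tokens[0]), int(tokens[1])
    return sum(1 for t in tokens[2:2 + n] if int(t) % k == 0)


def main():
    # One bulk read of standard input instead of n separate readline() calls.
    print(count_divisible(sys.stdin.buffer.read()))
```

For a submission, `main()` would be called at the bottom of the file. On the example input above, `count_divisible` returns 4.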

hdante · Apr 27, 2008

Maybe the problem is not in reading the input.

PS: you can throw a lot of time away on that site.

n00m · Apr 27, 2008

I've been there since summer 2004 (with several breaks).

n00m · Apr 27, 2008

Btw, it seems all accepted Python solutions (for this problem) used Psyco.

Christian Heimes · Apr 27, 2008

SL said:
Is there an implementation of f.readlines on the internet somewhere?
interested to see in how they implemented it. I'm pretty sure they did
it smarter than just reserve 100meg of data

Of course there is. Check out the Python sources.

Christian

n00m · Apr 27, 2008

Dennis said:
(untested for both):
-=-=-=-=-=-=-

Many thanks, but alas both your codes got a "wrong answer" verdict.
I can't understand why; they seem OK (but I'm a bit sleepy).

n00m · Apr 27, 2008

One more brick.
This time I compare list.sort() vs sort(vector<string>).
Incredible. Python does it 8.3s / 2.75s = 3 times faster than C++.

import time
f=open('D:\\v.txt','r')
z=f.readlines()
f.close()
t=time.time()
z.sort()
print time.time()-t
m=int(raw_input())
print z[m]

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <ctime>

using namespace std;

vector<string> vs;

FILE *fp=fopen("D:\\v.txt","r");

int main() {
    int i=0;
    while (true) {
        char line[50];
        if (!fgets(line,50,fp)) break;
        vs.push_back(line);
        ++i;
    }
    fclose(fp);

    double t;
    t=clock()/CLOCKS_PER_SEC;
    sort(vs.begin(),vs.end());
    cout << clock()/CLOCKS_PER_SEC << endl;

    int m;
    cin >> m;
    cout << vs[m];
    getchar();
    return 0;
}

n00m · Apr 27, 2008

Oops... I spotted a slip in my C++ code. I forgot " - t" in

cout << clock()/CLOCKS_PER_SEC << endl;

The correct ratio is 7.5s / 2.75s = 2.7 times.
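
The slip above is the usual one when timing a single step: printing the raw clock instead of the difference between two readings. A minimal sketch of the intended measurement in present-day Python 3; the helper name `timed_sort` and the synthetic data are made up, and no timing result is implied.

```python
import random
import time


def timed_sort(lines):
    """Sort lines in place and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    lines.sort()
    return time.perf_counter() - start  # end minus start, not the raw clock


if __name__ == "__main__":
    # Synthetic stand-in for the lines of v.txt (made-up data).
    rng = random.Random(0)
    lines = ["%09d\n" % rng.randrange(10 ** 9) for _ in range(100000)]
    print("%.3f s" % timed_sort(lines))
```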

SL · Apr 27, 2008

Have you tried this now?

hdante said:
First try again with pure C code and compile with a C compiler, not
with C++ code and C++ compiler.
Then, tweak the code to use more buffering, to make it more similar
to readline code.
