Python(2.5) reads an input file FASTER than pure C(Mingw)

n00m

Both programs below read the same huge (~35MB) text file.
The file has more than 1000000 lines, and each line is shorter than 99 chars.

Stable result:
Python runs ~0.65s
C : ~0.70s

Any thoughts?


import time
t=time.time()
f=open('D:\\some.txt','r')
z=f.readlines()
f.close()
print len(z)
print time.time()-t
m=input()
print z[m]


#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <ctime>

using namespace std;
char vs[1002000][99];                 // one fixed-size row per line
FILE *fp=fopen("D:\\some.txt","r");

int main() {
    int i=0;
    while (true) {
        if (!fgets(vs[i],999,fp)) break;   // read the next line into row i
        ++i;
    }
    fclose(fp);
    cout << i << endl;
    // cast so the division is floating-point even where CLOCKS_PER_SEC is an integer
    cout << (double)clock()/CLOCKS_PER_SEC << endl;

    int m;
    cin >> m;
    cout << vs[m];
    system("pause");
    return 0;
}
 
Carl Banks

n00m said:
Both codes below read the same huge(~35MB) text file.
In the file > 1000000 lines, the length of each line < 99 chars.

Stable result:
Python runs ~0.65s
C : ~0.70s

Any thoughts?

Yes.

Most of the dirty work in the Python example is spent in a tight loop
written in C. On Windows this is very likely to be faster in Python
than in your "C" example, for several reasons:

1. Python is compiled with Microsoft's C compiler, which produces more
optimal code than Mingw.

2. The Python readline() function has been in the library for a long
time, and many developers have had time to optimize its
performance.

3. Your "pure C" code isn't even C, let alone pure C. It's C++. On
most systems, the C++ iostream libraries have a lot more overhead than
C's stdio.

And, finally, we must not fail to observe that you measured these
times without startup, which is obviously much greater for Python. (Of
course, we only need to point this out so it's not misunderstood that
you're claiming this Python process will terminate faster than the C++
one.)


So, I must regrettably opine that your example isn't very meaningful.
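
One way to probe the first point is to time `readlines()` against an explicit per-line loop over the same data. This is only a rough sketch: it assumes Python 3 and CPython (the thread uses 2.5), the helper names `time_call`, `read_with_readlines` and `read_with_loop` are made up for illustration, and it measures only the read, not interpreter startup.

```python
import io
import time


def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def read_with_readlines(stream):
    # A single call; in CPython the per-line work happens in C code.
    return stream.readlines()


def read_with_loop(stream):
    # Same result, but each append runs as interpreted bytecode.
    lines = []
    for line in stream:
        lines.append(line)
    return lines


# Tiny in-memory stand-in for the 35 MB file, just to show the calls.
sample = "".join("line %d\n" % n for n in range(1000))
fast, t_fast = time_call(read_with_readlines, io.StringIO(sample))
slow, t_slow = time_call(read_with_loop, io.StringIO(sample))
print(len(fast), len(slow))
```

To repeat the measurement from the thread, pass an opened file instead of the `io.StringIO` objects and compare `t_fast` with `t_slow`.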

 
n00m

fgets() from C++ iostream library???

I guess if I'd come up with "Python reads SLOWER than C"
I'd get another (no less) smart explanation of "why it's so".
 
SL

n00m said:
import time
t=time.time()
f=open('D:\\some.txt','r')
z=f.readlines()
f.close()
print len(z)
print time.time()-t
m=input()
print z[m]


#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <ctime>

using namespace std;
char vs[1002000][99];
FILE *fp=fopen("D:\\some.txt","r");

int main() {
int i=0;
while (true) {
if (!fgets(vs[i],999,fp)) break;
++i;
}


First of all, I would rewrite the C loop as:

int main() {
int i=0;
while (fgets(vs[i],999,fp))
++i;
}

But I think that the difference comes from what you do at the beginning of
the C source:

char vs[1002000][99];

This reserves 99,198,000 bytes, so expect a lot of cache thrashing in the C
code!
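
The figure can be checked with a couple of lines of arithmetic. The snippet below is only a sketch: the constant names are made up for illustration, and the 35 MB file size is the approximate value quoted earlier in the thread.

```python
ROWS = 1002000        # rows declared in `char vs[1002000][99]`
ROW_WIDTH = 99        # bytes per row
FILE_MB = 35          # approximate file size quoted in the thread

reserved_bytes = ROWS * ROW_WIDTH
file_bytes = FILE_MB * 1024 * 1024

print(reserved_bytes)                 # 99198000
print(reserved_bytes / file_bytes)    # roughly 2.7 times the file size
```

This only confirms how much address space the array declares; whether that actually causes cache thrashing is not something the arithmetic can show.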

Is there an implementation of f.readlines on the internet somewhere?
I'm interested to see how they implemented it. I'm pretty sure they did it
smarter than just reserving 100 MB of data :)

 
hdante

n00m said:
Both codes below read the same huge(~35MB) text file.
In the file > 1000000 lines, the length of each line < 99 chars.

Stable result:
Python runs ~0.65s
C : ~0.70s

Any thoughts?


First, try again with pure C code compiled with a C compiler, not
C++ code compiled with a C++ compiler.
Then tweak the code to use more buffering, to make it more similar
to the readline code, like this (not tested):

#include <stdio.h>
#include <time.h>

char vs[1002000][100];
char buffer[65536];   /* stdio buffer handed to setvbuf */

int main(void) {
    FILE *fp;
    int i, m;
    clock_t begin, end;
    double t;

    begin = clock();
    fp = fopen("cvspython.txt", "r");
    i = 0;
    setvbuf(fp, buffer, _IOFBF, sizeof(buffer));   /* full buffering, 64 KB */
    while(1) {
        if(!fgets(vs[i], 100, fp)) break;   /* read the next line into row i */
        ++i;
    }
    fclose(fp);
    printf("%d\n", i);
    end = clock();
    t = (double)(end - begin)/CLOCKS_PER_SEC;
    printf("%g\n", t);

    scanf("%d", &m);
    printf("%s\n", vs[m]);
    getchar();
    return 0;
}

Finally, repeat your statement again, if necessary.
 
n00m

char vs[1002000][99];

The file holds 1001622 (or so) records, each like a phone number + first/last names.
So reserving the array makes sense, I think. Populating a vector<string>
is a zillion times slower.


I was greatly surprised by how fast it is. As a matter of fact, that's the
point of my message here.

line?

It doesn't matter. Either way, reading of the current line stops at '\n'.
A big number like 999 (9999, 99999, ...) just ensures that each line is read up
to its end.

namespace.

I thought it was from <stdio.h>. Anyway, it does not matter. PS Thanks
for the code.


To all:
The origin of my "discovery" is http://www.spoj.pl/ranks/SBANK/start=400
I never thought it could be done in Python (there are a couple of Python
solutions even faster than mine) and without stuff like radix sort etc.
 
n00m

hdante:

I ran your code quite a few times.
Its time = 0.734s.
Mine = 0.703-0.718s.

PS All I have is an ancient Mingw compiler (~1.9.5v) in Dev-C++.
 
hdante

n00m said:
hdante:

I run your code quite a few times.
Its time = 0.734s.
Of mine = 0.703-0.718s.

PS All I have is an ancient Mingw compiler (~1.9.5v) in Dev-C++.

Okay, now I believe you. :p
The next step would be to reimplement readline.
 
n00m

Not so simple, guys.
E.g., I can't solve (in Python) this: http://www.spoj.pl/problems/INTEST/
I keep getting TLE (time limit exceeded). Any ideas? After all, it's
the weekend.


450. Enormous Input Test
Problem code: INTEST

The purpose of this problem is to verify whether the method you are
using to read input data is sufficiently fast to handle problems
branded with the enormous Input/Output warning. You are expected to be
able to process at least 2.5MB of input data per second at runtime.

Input
The input begins with two positive integers n k (n, k <= 10^7). The next
n lines of input contain one positive integer ti, not greater than
10^9, each.

Output
Write a single integer to output, denoting how many integers ti are
divisible by k.

Example
Input:
7 3
1
51
966369
7
9
999996
11

Output:
4
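
For reference, one common way to attack this in Python is to read all of standard input in a single call and split it, rather than reading line by line. The sketch below assumes Python 3 (the thread uses 2.5); `count_divisible` and `main` are illustrative names, and nothing here claims that it fits within the judge's time limit.

```python
import sys


def count_divisible(data):
    """Count how many of the n integers after the 'n k' header are divisible by k."""
    tokens = data.split()            # one pass over the whole input
    n, k = int(tokens[0]), int(tokens[1])
    return sum(1 for tok in tokens[2:2 + n] if int(tok) % k == 0)


def main():
    # Read raw bytes in one call to avoid per-line overhead.
    print(count_divisible(sys.stdin.buffer.read()))
```

Calling `main()` at the bottom of the script turns it into a submission; on the example input above, `count_divisible` returns 4.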
 
hdante

n00m said:
No so simple, guys.
E.g., I can't solve (in Python) this:http://www.spoj.pl/problems/INTEST/
Keep getting TLE (time limit exceeded). Any ideas? After all, it's
weekend.

Maybe the problem is not in reading the input.

PS: you can throw a lot of time away in that site. :)
 
Christian Heimes

SL said:
Is there an implementation of f.readlines on the internet somewhere?
I'm interested to see how they implemented it. I'm pretty sure they did
it smarter than just reserving 100 MB of data :)

Of course there is. Check out the Python sources :)

Christian
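
To give a rough idea of how lines can be collected without reserving a fixed-size array, here is a sketch that reads fixed-size chunks and splits them on newlines. It is not CPython's actual implementation; the function name `read_lines_chunked` and the 64 KB default chunk size are assumptions chosen for illustration, and Python 3 is assumed.

```python
import io


def read_lines_chunked(stream, chunk_size=65536):
    """Return all lines from a binary stream, keeping the trailing newline bytes."""
    lines = []
    tail = b""                      # partial line carried over between chunks
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split(b"\n")
        tail = parts.pop()          # last piece has no newline yet
        lines.extend(part + b"\n" for part in parts)
    if tail:
        lines.append(tail)          # final line without a trailing newline
    return lines


data = b"first\nsecond\nthird"
print(read_lines_chunked(io.BytesIO(data), chunk_size=4))
# [b'first\n', b'second\n', b'third']
```

Each line gets its own exactly-sized bytes object, so memory grows with the data actually read rather than with a worst-case row width.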
 
n00m

Dennis said:
(untested for both):
-=-=-=-=-=-=-

Many thanks, but alas, both of your programs got a "wrong answer" verdict.
I can't understand why; they seem OK (but I'm a bit sleepy :)).
 
n00m

One more brick.
This time I compare list.sort() vs sort(vector<string>).
Incredible. Python does it 8.3s / 2.75s = 3 times faster than C++.


import time
f=open('D:\\v.txt','r')
z=f.readlines()
f.close()
t=time.time()
z.sort()
print time.time()-t
m=int(raw_input())
print z[m]


#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <algorithm>
#include <vector>
#include <string>
#include <ctime>

using namespace std;

vector<string> vs;

FILE *fp=fopen("D:\\v.txt","r");

int main() {
int i=0;
while (true) {
char line[50];
if (!fgets(line,50,fp)) break;
vs.push_back(line);
++i;
}
fclose(fp);

double t;
t=clock()/CLOCKS_PER_SEC;
sort(vs.begin(),vs.end());
cout << clock()/CLOCKS_PER_SEC << endl;

int m;
cin >> m;
cout << vs[m];
getchar();
return 0;
}
 
n00m

Oops... I spotted a slip in my C++ code. Forgot " - t" in

cout << clock()/CLOCKS_PER_SEC << endl;

The correct proportion is 7.5s / 2.75s = 2.7 times.
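
The slip is easy to make in any language: the elapsed time is the difference between two clock readings, not the second reading alone. A minimal sketch of the pattern, assuming Python 3 and using the made-up helper name `timed_sort`:

```python
import time


def timed_sort(lines):
    """Sort the list in place and return only the time spent sorting."""
    start = time.perf_counter()
    lines.sort()
    return time.perf_counter() - start   # end minus start, not end alone


records = ["b\n", "c\n", "a\n"]
elapsed = timed_sort(records)
print(records)
```

Reading the file before taking `start` keeps the I/O cost out of the measurement, as in the scripts above.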
 
SL

Have you tried this now?
First try again with pure C code and compile with a C compiler, not
with C++ code and C++ compiler.
Then, tweak the code to use more buffering, to make it more similar
to readline code, like this (not tested):
#include <stdio.h>
#include <time.h>
char vs[1002000][100];
char buffer[65536];
int main(void) {
FILE *fp;
int i, m;
clock_t begin, end;
double t;
begin = clock();
fp = fopen("cvspython.txt", "r");
i = 0;
setvbuf(fp, buffer, _IOFBF, sizeof(buffer));
while(1) {
if(!fgets(vs[i], 100, fp)) break;
++i;
}
fclose(fp);
printf("%d\n", i);
end = clock();
t = (double)(end - begin)/CLOCKS_PER_SEC;
printf("%g\n", t);

scanf("%d", &m);
printf("%s\n", vs[m]);
getchar();
return 0;
}
 
