Is there a faster way to do this?

ronald.johnson

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending each
ProductID to an array and then searching through that array every time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only gotten through less than 1/10 of the file.

Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0


input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1
    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1

output.close()
input.close()
print input_count
print output_count
 
Gary Herron

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending each
ProductID to an array and then searching through that array every time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only gotten through less than 1/10 of the file.

Store your IDs in a dictionary or a set, then test for the existence
of a new ID in that set. That test will be *much* more efficient than
searching a list. (It uses a hashing scheme.)


IDs = set()
for row in ...:
    ID = extractIdFromRow(row)
    if ID not in IDs:
        IDs.add(ID)
        ... whatever ...


In fact, if *all* you are doing is trying to identify the product IDs
that occur in the file (no matter how many times they occur), then

IDs = set()
for row in ...:
    ID = extractIdFromRow(row)
    IDs.add(ID)

and your set will contain exactly *one* copy of each ID, no matter how
many times it was added.


Better yet, if you can write your ID extraction as a generator
expression or list comprehension...

IDs = set(extractIdFromRow(row) for row in rowsOfTable)

or some such would be most efficient.
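
Applied to your script, a rough untested sketch (keeping your file names
and the assumption that the product ID is the third comma-separated
field) would be:

seen = set()
input = open("c:\\input.txt", "r")
output = open("c:\\output.txt", "w")

for line in input:
    product_number = line.split(",")[2]
    if product_number not in seen:      # set membership test, not a list scan
        seen.add(product_number)
        output.write(line)              # keep the first line seen for each ID

output.close()
input.close()
print len(seen)

The only change that matters is the data structure holding the IDs you
have already seen; the rest of the loop is yours.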


Gary Herron


 
Avinash Vora

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending each
ProductID to an array and then searching through that array every time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only gotten through less than 1/10 of the file.

Why not split the file into more manageable chunks, especially as it
seems to be just plain text?
Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0


input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1

This seems redundant.
    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1

File writing is extremely expensive, and so is reading. Think
about reading the file in whole chunks, putting those chunks into Python
data structures, and building your output in Python data structures as
well. If you use a dictionary and look the IDs up there, you'll notice
a real speed improvement, since a dictionary lookup is far quicker than
searching a list. Then write your output all at once at the end.
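
Roughly, that idea looks like this (an untested sketch; it assumes the
unique lines fit comfortably in memory):

seen = {}             # used only for its fast membership test
unique_lines = []     # build the output in memory first

for line in open("c:\\input.txt", "r"):
    product_number = line.split(",")[2]
    if product_number not in seen:
        seen[product_number] = True
        unique_lines.append(line)

# write everything in one go at the end
output = open("c:\\output.txt", "w")
output.writelines(unique_lines)
output.close()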
 
Gary Herron

Avinash said:
I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending each
ProductID to an array and then searching through that array every time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only gotten through less than 1/10 of the file.

Why not split the file into more manageable chunks, especially as it
seems to be just plain text?
Here's the code:
import string

def checkForProduct(product_id, product_list):
    for product in product_list:
        if product == product_id:
            return 1
    return 0


input_file = "c:\\input.txt"
output_file = "c:\\output.txt"
product_info = []
input_count = 0

input = open(input_file, "r")
output = open(output_file, "w")

for line in input:
    break_down = line.split(",")
    product_number = break_down[2]
    input_count += 1
    if input_count == 1:
        product_info.append(product_number)
        output.write(line)
        output_count = 1

This seems redundant.
    if not checkForProduct(product_number, product_info):
        product_info.append(product_number)
        output.write(line)
        output_count += 1

File writing is extremely expensive, and so is reading. Think
about reading the file in whole chunks, putting those chunks into Python
data structures, and building your output in Python data structures as
well.

Don't bother yourself with this suggestion about reading in chunks --
Python already does this for you, and does so more efficiently than you
could. The code

    for line in open(input_file, "r"):

reads in large chunks (efficiently) and then serves up the contents
line by line.

Gary Herron
 
RPM1

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending each
ProductID to an array and then searching through that array every time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only gotten through less than 1/10 of the file.

I think you need to learn about Python's dictionary data type.
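
For example, a minimal untested sketch that counts how many times each
product ID occurs (using the file name and field position from your
code):

counts = {}
for line in open("c:\\input.txt", "r"):
    product_number = line.split(",")[2]
    # dictionary lookup/insert is roughly constant time, unlike a list scan
    counts[product_number] = counts.get(product_number, 0) + 1

print "%d unique product IDs" % len(counts)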
 
Tomasz Rola

I have a csv file containing product information that is 700+ MB in
size. I'm trying to go through and pull out unique product IDs only,
as there are a lot of duplicates. My problem is that I am appending each
ProductID to an array and then searching through that array every time
to see if I've seen the product ID before, so each search takes longer
and longer. I let the script run for 2 hours before killing it and had
only gotten through less than 1/10 of the file.

My take:

I assume you have 80 bytes per line, which makes 10 million lines for a
700 MB file. To be safe, let's round that up to 20 million. Next, I
don't want to clutter my disk with a 700 MB+ test file, so I assume
reading a line, breaking it up and extracting the product ID takes
roughly the same time as generating a random ID in my code. So, I:

1. read all records line by line (or, here, just make random IDs) and
append the product ID to a list (actually, I preallocate the list and
fill it in)
2. sort() the list
3. iterate over the list, count the unique IDs, and optionally write
them to a file

The code (most of it is just making random names, which was real fun):

import string

RECORD_NUM = 20*1024*1024  # 700M / 80 bytes per line = ca. 10M+ records
ALPHA = string.digits + string.ascii_letters
RAND = None

# homebrew pseudo-random integer generator
def random_iter ( ) :
    x = 12345678910
    y = 234567891011
    M = 2**61 - 1
    M0 = 2**31 - 1
    pot10 = [ 1, 10, 100, 1000, 10000, 100000 ]
    while 1 :
        p = x * y
        l = pot10[5 - (p % 10)]
        n = (p / l) % M
        d = l * (n % 10)
        p = p % M0
        x1 = y - d + p
        y1 = x + d + p
        x, y = x1, y1
        yield n
        pass
    pass

def next_random ( ) :
    global RAND
    if RAND == None :
        RAND = random_iter()
    return RAND.next()

# encode a number as an alphanumeric string
def num_to_name ( n ) :
    s = []
    len_a = len(ALPHA)
    while n > 0 :
        n1, r = divmod(n, len_a)
        s.append(ALPHA[r])
        n = n1
        pass
    return "".join(s)

def get_product_id ( ) :
    return num_to_name(next_random())

# build a list of n random product IDs
def dry_run ( n ) :
    r = [ 0 ] * n
    while n > 0 :
        n -= 1
        r[n] = get_product_id()
    return r

###

if __name__ == "__main__":
    print "hello"
    for i in range(10) : print get_product_id()
    print
    print "RECORD_NUM: %d" % RECORD_NUM
    products = dry_run(RECORD_NUM)
    print "RECORDS PRODUCED: %d" % len(products)
    products.sort()
    # walk the sorted list and count distinct IDs
    i = 0
    lastp = ""
    count = 0
    while i < len(products) :
        if lastp != products[i] :
            lastp = products[i]
            count += 1
        i += 1
        pass
    print "UNIQUE RECORDS: %d" % count

I may have introduced some bugs, but it works on my computer. Or seems to ;-/.

For 20 million products, on my Athlon XP @1800 / 1.5 GB RAM Debian Linux
box, it takes about 13 minutes to go through list generation, about 3
minutes to sort the list, and a few more seconds to skim it (writing
should not take much longer). All summed up, about 18 minutes of real
time, with some other programs doing a little computing in the
background - so, much less than 2 hours.

Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature. **
** As the answer, master did "rm -rif" on the programmer's home **
** directory. And then the C programmer became enlightened... **
** **
** Tomasz Rola mailto:[email protected] **
 
Boris Borcic

Is your product ID always the third and last item on the line?
If not, your output won't have anything separating the IDs.

And how does

output = open(output_file, 'w')
for x in set(line.split(',')[2] for line in open(input_file)):
    output.write(x)
output.close()

behave?
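
If it isn't the last item, a variant along these lines (untested)
strips the field and writes the newline itself:

output = open(output_file, 'w')
for x in set(line.split(',')[2].strip() for line in open(input_file)):
    output.write(x + '\n')
output.close()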
 
