Generator Expressions and CSV

Zaki

Hey all,

I'm really new to Python and this may seem like a really dumb
question, but basically, I wrote a script to do the following, however
the processing time/memory usage is not what I'd like it to be. Any
suggestions?


Outline:
1. Read tab delim files from a directory, files are of 3 types:
install, update, and q. All 3 types contain ID values that are the
only part of interest.
2. Using set() and set.add(), generate a list of unique IDs from
install and update files.
3. Using the set created in (2), check the q files to see if there are
matches for IDs. Keep all matches, and add any non-matches (which only
occur once in the q file) to a queue of lines to be removed from the q
files.
4. Remove the lines in the q for each file. (I haven't quite written
the code for this, but I was going to implement this using csv.writer
and rewriting all the lines in the file except for the ones in the
removal queue).

Now, I've tried running this and it takes much longer than I'd like. I
was wondering if there might be a better way to do things (I thought
generator expressions might be a good way to attack this problem, as
you could generate the set, and then check to see if there's a match,
and write each line that way).
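A minimal sketch of steps 2-3 using generator expressions (Python 3 syntax; the ID column index, the helper names, and the in-memory "files" are assumptions for illustration only):

```python
import csv
import io

def unique_ids(readers, col=2):
    # Step 2: collect the set of unique IDs from the install/update readers.
    return {row[col] for reader in readers for row in reader}

def matching_rows(reader, ids, col=2):
    # Step 3: a generator expression that keeps only rows whose ID is known.
    return (row for row in reader if row[col] in ids)

# Tiny demo using in-memory tab-delimited "files":
install = csv.reader(io.StringIO("a\tb\tid1\nx\ty\tid2\n"), delimiter='\t')
query = csv.reader(io.StringIO("q\tr\tid1\nq\tr\tid9\n"), delimiter='\t')

ids = unique_ids([install])             # {'id1', 'id2'}
kept = list(matching_rows(query, ids))  # only the id1 row survives
```

The generator version never holds more than one row of the query file in memory; only the ID set itself has to fit.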
 
Emile van Sebille

On 7/17/2009 10:58 AM Zaki said...
Now, I've tried running this and it takes much longer than I'd like. I
was wondering if there might be a better way to do things

Suppose, for the sake of argument, that you've written highly efficient
code. Then the processing time would already be entirely optimized and
no improvements would be possible. It's running as fast as it can. We
can't help.

On the other hand, maybe you didn't. In that case, you'll need to
profile your code to determine where the time is consumed. At a
minimum, you'll need to post the slow parts so we can see the
implementation and suggest improvements.
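A sketch of that profiling step with the standard library's cProfile and pstats (here slow_part is just a stand-in for the real processing loop):

```python
import cProfile
import io
import pstats

def slow_part():
    # Stand-in for the real per-row processing loop.
    return sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
slow_part()
profiler.disable()

out = io.StringIO()
# Print the five entries with the highest cumulative time.
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
print(out.getvalue())
```

The report names each function and the time spent in it, which tells you whether the bottleneck is the csv parsing, the set lookups, or the file I/O.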

Emile
 
MRAB

Zaki said:
Hey all,

I'm really new to Python and this may seem like a really dumb
question, but basically, I wrote a script to do the following, however
the processing time/memory usage is not what I'd like it to be. Any
suggestions?


Outline:
1. Read tab delim files from a directory, files are of 3 types:
install, update, and q. All 3 types contain ID values that are the
only part of interest.
2. Using set() and set.add(), generate a list of unique IDs from
install and update files.
3. Using the set created in (2), check the q files to see if there are
matches for IDs. Keep all matches, and add any non-matches (which only
occur once in the q file) to a queue of lines to be removed from the q
files.
4. Remove the lines in the q for each file. (I haven't quite written
the code for this, but I was going to implement this using csv.writer
and rewriting all the lines in the file except for the ones in the
removal queue).

Now, I've tried running this and it takes much longer than I'd like. I
was wondering if there might be a better way to do things (I thought
generator expressions might be a good way to attack this problem, as
you could generate the set, and then check to see if there's a match,
and write each line that way).
Why are you checking and removing lines in 2 steps? Why not copy the
matching lines to a new q file and then replace the old file with the
new one (or, maybe, delete the new q file if no lines were removed)?
 
Zaki

Why are you checking and removing lines in 2 steps? Why not copy the
matching lines to a new q file and then replace the old file with the
new one (or, maybe, delete the new q file if no lines were removed)?

That's what I've done now.

Here is the final code that I have running. It's very much 'hack' type
code and not at all efficient or optimized and any help in optimizing
it would be greatly appreciated.

import csv
import sys
import os
import time

begin = time.time()

#Check minutes elapsed
def timeElapsed():
    current = time.time()
    elapsed = current-begin
    return round(elapsed/60)


#USAGE: python logcleaner.py <input_dir> <output_dir>

inputdir = sys.argv[1]
outputdir = sys.argv[2]

logfilenames = os.listdir(inputdir)

IDs = set() #IDs from update and install logs
foundOnceInQuery = set()
#foundTwiceInQuery = set()
#IDremovalQ = set() Note: Unnecessary, duplicate of foundOnceInQuery; queue of IDs to remove from query logs (IDs found only once in query logs)

#Generate Filename Queues For Install/Update Logs, Query Logs
iNuQ = []
queryQ = []

for filename in logfilenames:
    if filename.startswith("par1.install") or filename.startswith("par1.update"):
        iNuQ.append(filename)
    elif filename.startswith("par1.query"):
        queryQ.append(filename)

totalfiles = len(iNuQ) + len(queryQ)
print "Total # of Files to be Processed:" , totalfiles
print "Install/Update Logs to be processed:" , len(iNuQ)
print "Query logs to be processed:" , len(queryQ)

#Process install/update queue to generate list of valid IDs
currentfile = 1
for file in iNuQ:
    print "Processing", currentfile, "install/update log out of", len(iNuQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file),delimiter = '\t')
    for row in reader:
        IDs.add(row[2])
    currentfile+=1

print "Finished processing install/update logs"
print "Unique IDs found:" , len(IDs)
print "Total Time Elapsed:", timeElapsed()

currentfile = 1
for file in queryQ:
    print "Processing", currentfile, "query log out of", len(queryQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file), delimiter = '\t')
    outputfile = csv.writer(open(outputdir+file), 'w')
    for row in reader:
        if row[2] in IDs:
            ouputfile.writerow(row)
        else:
            if row[2] in foundOnceInQuery:
                foundOnceInQuery.remove(row[2])
                outputfile.writerow(row)
                #IDremovalQ.remove(row[2])
                #foundTwiceInQuery.add(row[2])
            else:
                foundOnceInQuery.add(row[2])
                #IDremovalQ.add(row[2])
    currentfile+=1

print "Finished processing query logs and writing new files"
print "# of Query log entries removed:" , len(foundOnceInQuery)
print "Total Time Elapsed:", timeElapsed()
 
MRAB

Zaki said:
Why are you checking and removing lines in 2 steps? Why not copy the
matching lines to a new q file and then replace the old file with the
new one (or, maybe, delete the new q file if no lines were removed)?

That's what I've done now.

Here is the final code that I have running. It's very much 'hack' type
code and not at all efficient or optimized and any help in optimizing
it would be greatly appreciated.

import csv
import sys
import os
import time

begin = time.time()

#Check minutes elapsed
def timeElapsed():
    current = time.time()
    elapsed = current-begin
    return round(elapsed/60)


#USAGE: python logcleaner.py <input_dir> <output_dir>

inputdir = sys.argv[1]
outputdir = sys.argv[2]

logfilenames = os.listdir(inputdir)

IDs = set() #IDs from update and install logs
foundOnceInQuery = set()
#foundTwiceInQuery = set()
#IDremovalQ = set() Note: Unnecessary, duplicate of foundOnceInQuery; queue of IDs to remove from query logs (IDs found only once in query logs)

#Generate Filename Queues For Install/Update Logs, Query Logs
iNuQ = []
queryQ = []

for filename in logfilenames:
    if filename.startswith(("par1.install", "par1.update")):
        iNuQ.append(filename)
    elif filename.startswith("par1.query"):
        queryQ.append(filename)

totalfiles = len(iNuQ) + len(queryQ)
print "Total # of Files to be Processed:" , totalfiles
print "Install/Update Logs to be processed:" , len(iNuQ)
print "Query logs to be processed:" , len(queryQ)

#Process install/update queue to generate list of valid IDs
currentfile = 1
for file in iNuQ:
>     print "Processing", currentfile, "install/update log out of", len(iNuQ)
>     print timeElapsed()
>     reader = csv.reader(open(inputdir+file),delimiter = '\t')
>     for row in reader:
>         IDs.add(row[2])
>     currentfile+=1

Best not to call it 'file'; that's a built-in name.

Also you could use 'enumerate', and joining filepaths is safer with
os.path.join().

for currentfile, filename in enumerate(iNuQ, start=1):
    print "Processing", currentfile, "install/update log out of", len(iNuQ)
    print timeElapsed()
    current_path = os.path.join(inputdir, filename)
    reader = csv.reader(open(current_path), delimiter = '\t')
    for row in reader:
        IDs.add(row[2])
print "Finished processing install/update logs"
print "Unique IDs found:" , len(IDs)
print "Total Time Elapsed:", timeElapsed()

currentfile = 1
for file in queryQ:

Similar remarks to above ...

    print "Processing", currentfile, "query log out of", len(queryQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file), delimiter = '\t')
    outputfile = csv.writer(open(outputdir+file), 'w')

... and also here.

    for row in reader:
        if row[2] in IDs:
            ouputfile.writerow(row)

Should be 'outputfile'.

        else:
            if row[2] in foundOnceInQuery:
                foundOnceInQuery.remove(row[2])

You're removing the ID here ...

                outputfile.writerow(row)
                #IDremovalQ.remove(row[2])
                #foundTwiceInQuery.add(row[2])

            else:
                foundOnceInQuery.add(row[2])

... and adding it again here!

                #IDremovalQ.add(row[2])

    currentfile+=1

For safety you should close the files after use.

print "Finished processing query logs and writing new files"
print "# of Query log entries removed:" , len(foundOnceInQuery)
print "Total Time Elapsed:", timeElapsed()
Apart from that, it looks OK.

How big are the q files? If they're not too big and most of the time
you're not removing rows, you could put the output rows into a list and
then create the output file only if rows have been removed, otherwise
just copy the input file, which might be faster.
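That buffer-then-decide idea might be sketched like this (illustrative only; the real script additionally keeps query-only IDs that appear twice, which this sketch omits):

```python
def filter_rows(rows, ids, col=2):
    # Buffer the surviving rows in a list and report whether anything
    # was dropped, so the caller can decide whether to rewrite the file.
    kept = [row for row in rows if row[col] in ids]
    return kept, len(kept) != len(rows)

rows = [['a', 'b', 'id1'], ['c', 'd', 'id9']]
kept, removed = filter_rows(rows, {'id1'})
# removed is True here, so a new q file would be written out;
# when removed is False, just copy the input file (e.g. shutil.copyfile).
```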
 
Emile van Sebille

On 7/17/2009 1:08 PM Zaki said...
Here is the final code that I have running. It's very much 'hack' type
code and not at all efficient or optimized and any help in optimizing
it would be greatly appreciated.

There are some things I'd approach differently, e.g. I might prefer glob
to build iNuQ and queryQ [1], and although glob is generally fast, I'm
not sure it'd be faster. But overall it looks like most of the time is
spent in your 'for row in' loops, and as you're reading each file only
once, and would have to anyway, there's not much that'll improve overall
timing. I don't know what csv.reader is doing under the covers, but if
your files are reasonably sized for your system you might try timing
something that reads in the full file and splits:

for each in filelist:
    for row in open(each).readlines():
        if row.split()[2] in ....

-----[1]-----
import glob

iNuQ = glob.glob(os.path.join(inputdir, "par1.install*"))
queryQ = glob.glob(os.path.join(inputdir, "par1.query*"))



Emile
 
Zaki

That's what I've done now.
Here is the final code that I have running. It's very much 'hack' type
code and not at all efficient or optimized and any help in optimizing
it would be greatly appreciated.
import csv
import sys
import os
import time
begin = time.time()
#Check minutes elapsed
def timeElapsed():
    current = time.time()
    elapsed = current-begin
    return round(elapsed/60)
#USAGE: python logcleaner.py <input_dir> <output_dir>
inputdir = sys.argv[1]
outputdir = sys.argv[2]
logfilenames = os.listdir(inputdir)
IDs = set() #IDs from update and install logs
foundOnceInQuery = set()
#foundTwiceInQuery = set()
#IDremovalQ = set() Note: Unnecessary, duplicate of foundOnceInQuery; queue of IDs to remove from query logs (IDs found only once in query logs)
#Generate Filename Queues For Install/Update Logs, Query Logs
iNuQ = []
queryQ = []
for filename in logfilenames:
    if filename.startswith(("par1.install", "par1.update")):
        iNuQ.append(filename)
    elif filename.startswith("par1.query"):
        queryQ.append(filename)
totalfiles = len(iNuQ) + len(queryQ)
print "Total # of Files to be Processed:" , totalfiles
print "Install/Update Logs to be processed:" , len(iNuQ)
print "Query logs to be processed:" , len(queryQ)
#Process install/update queue to generate list of valid IDs
currentfile = 1
for file in iNuQ:

 >     print "Processing", currentfile, "install/update log out of", len(iNuQ)
 >     print timeElapsed()
 >     reader = csv.reader(open(inputdir+file),delimiter = '\t')
 >     for row in reader:
 >         IDs.add(row[2])
 >     currentfile+=1

Best not to call it 'file'; that's a built-in name.

Also you could use 'enumerate', and joining filepaths is safer with
os.path.join().

for currentfile, filename in enumerate(iNuQ, start=1):
     print "Processing", currentfile, "install/update log out of", len(iNuQ)
     print timeElapsed()
     current_path = os.path.join(inputdir, filename)
     reader = csv.reader(open(current_path), delimiter = '\t')
     for row in reader:
         IDs.add(row[2])


print "Finished processing install/update logs"
print "Unique IDs found:" , len(IDs)
print "Total Time Elapsed:", timeElapsed()
currentfile = 1
for file in queryQ:

Similar remarks to above ...
    print "Processing", currentfile, "query log out of", len(queryQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file), delimiter = '\t')
    outputfile = csv.writer(open(outputdir+file), 'w')

... and also here.
    for row in reader:
        if row[2] in IDs:
            ouputfile.writerow(row)

Should be 'outputfile'.
        else:
            if row[2] in foundOnceInQuery:
                foundOnceInQuery.remove(row[2])

You're removing the ID here ...
                outputfile.writerow(row)
                #IDremovalQ.remove(row[2])
                #foundTwiceInQuery.add(row[2])
            else:
                foundOnceInQuery.add(row[2])

... and adding it again here!
                #IDremovalQ.add(row[2])
    currentfile+=1

For safety you should close the files after use.
print "Finished processing query logs and writing new files"
print "# of Query log entries removed:" , len(foundOnceInQuery)
print "Total Time Elapsed:", timeElapsed()

Apart from that, it looks OK.

How big are the q files? If they're not too big and most of the time
you're not removing rows, you could put the output rows into a list and
then create the output file only if rows have been removed, otherwise
just copy the input file, which might be faster.

MRAB, could you please repost what I sent to you here as I meant to
post it in the main discussion.
 
Jon Clements

Why are you checking and removing lines in 2 steps? Why not copy the
matching lines to a new q file and then replace the old file with the
new one (or, maybe, delete the new q file if no lines were removed)?

That's what I've done now.

Here is the final code that I have running. It's very much 'hack' type
code and not at all efficient or optimized and any help in optimizing
it would be greatly appreciated.

import csv
import sys
import os
import time

begin = time.time()

#Check minutes elapsed
def timeElapsed():
    current = time.time()
    elapsed = current-begin
    return round(elapsed/60)

#USAGE: python logcleaner.py <input_dir> <output_dir>

inputdir = sys.argv[1]
outputdir = sys.argv[2]

logfilenames = os.listdir(inputdir)

IDs = set() #IDs from update and install logs
foundOnceInQuery = set()
#foundTwiceInQuery = set()
#IDremovalQ = set() Note: Unnecessary, duplicate of foundOnceInQuery; queue of IDs to remove from query logs (IDs found only once in query logs)

#Generate Filename Queues For Install/Update Logs, Query Logs
iNuQ = []
queryQ = []

for filename in logfilenames:
    if filename.startswith("par1.install") or filename.startswith("par1.update"):
        iNuQ.append(filename)
    elif filename.startswith("par1.query"):
        queryQ.append(filename)

totalfiles = len(iNuQ) + len(queryQ)
print "Total # of Files to be Processed:" , totalfiles
print "Install/Update Logs to be processed:" , len(iNuQ)
print "Query logs to be processed:" , len(queryQ)

#Process install/update queue to generate list of valid IDs
currentfile = 1
for file in iNuQ:
    print "Processing", currentfile, "install/update log out of", len(iNuQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file),delimiter = '\t')
    for row in reader:
        IDs.add(row[2])
    currentfile+=1

print "Finished processing install/update logs"
print "Unique IDs found:" , len(IDs)
print "Total Time Elapsed:", timeElapsed()

currentfile = 1
for file in queryQ:
    print "Processing", currentfile, "query log out of", len(queryQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file), delimiter = '\t')
    outputfile = csv.writer(open(outputdir+file), 'w')
    for row in reader:
        if row[2] in IDs:
            ouputfile.writerow(row)
        else:
            if row[2] in foundOnceInQuery:
                foundOnceInQuery.remove(row[2])
                outputfile.writerow(row)
                #IDremovalQ.remove(row[2])
                #foundTwiceInQuery.add(row[2])

            else:
                foundOnceInQuery.add(row[2])
                #IDremovalQ.add(row[2])

    currentfile+=1

print "Finished processing query logs and writing new files"
print "# of Query log entries removed:" , len(foundOnceInQuery)
print "Total Time Elapsed:", timeElapsed()

Just a couple of ideas:

1) load the data into a sqlite3 database and use an SQL query to
extract your results (has the potential of doing what you want without
you coding it, plus if your requirements change, maybe somewhat more
flexible)

2) Pre-sort your input files via ID, then match-merge (may add some
time/space required to sort, but then the merge should be fairly
quick, plus you'll have access to the entire row in the process, not
just the ID)
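Idea 1 might look something like this with the standard sqlite3 module (the table and column names are invented, and an in-memory database stands in for the loaded log files):

```python
import sqlite3

# In-memory database standing in for the loaded log files.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE ids (id TEXT PRIMARY KEY)")          # install/update IDs
con.execute("CREATE TABLE query_log (a TEXT, b TEXT, id TEXT)")
con.executemany("INSERT INTO ids VALUES (?)", [('id1',), ('id2',)])
con.executemany("INSERT INTO query_log VALUES (?, ?, ?)",
                [('q', 'r', 'id1'), ('q', 'r', 'id9')])

# One query does the match/keep step: keep rows whose ID is known.
kept = con.execute(
    "SELECT a, b, id FROM query_log WHERE id IN (SELECT id FROM ids)"
).fetchall()
# kept == [('q', 'r', 'id1')]
```

Once the data is in tables, other sub-selections become a one-line change to the SQL rather than a new loop.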

Jon.
 
Zaki

That's what I've done now.
Here is the final code that I have running. It's very much 'hack' type
code and not at all efficient or optimized and any help in optimizing
it would be greatly appreciated.
import csv
import sys
import os
import time
begin = time.time()
#Check minutes elapsed
def timeElapsed():
    current = time.time()
    elapsed = current-begin
    return round(elapsed/60)
#USAGE: python logcleaner.py <input_dir> <output_dir>
inputdir = sys.argv[1]
outputdir = sys.argv[2]
logfilenames = os.listdir(inputdir)
IDs = set() #IDs from update and install logs
foundOnceInQuery = set()
#foundTwiceInQuery = set()
#IDremovalQ = set() Note: Unnecessary, duplicate of foundOnceInQuery; queue of IDs to remove from query logs (IDs found only once in query logs)
#Generate Filename Queues For Install/Update Logs, Query Logs
iNuQ = []
queryQ = []
for filename in logfilenames:
    if filename.startswith("par1.install") or filename.startswith("par1.update"):
        iNuQ.append(filename)
    elif filename.startswith("par1.query"):
        queryQ.append(filename)
totalfiles = len(iNuQ) + len(queryQ)
print "Total # of Files to be Processed:" , totalfiles
print "Install/Update Logs to be processed:" , len(iNuQ)
print "Query logs to be processed:" , len(queryQ)
#Process install/update queue to generate list of valid IDs
currentfile = 1
for file in iNuQ:
    print "Processing", currentfile, "install/update log out of", len(iNuQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file),delimiter = '\t')
    for row in reader:
        IDs.add(row[2])
    currentfile+=1
print "Finished processing install/update logs"
print "Unique IDs found:" , len(IDs)
print "Total Time Elapsed:", timeElapsed()
currentfile = 1
for file in queryQ:
    print "Processing", currentfile, "query log out of", len(queryQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file), delimiter = '\t')
    outputfile = csv.writer(open(outputdir+file), 'w')
    for row in reader:
        if row[2] in IDs:
            ouputfile.writerow(row)
        else:
            if row[2] in foundOnceInQuery:
                foundOnceInQuery.remove(row[2])
                outputfile.writerow(row)
                #IDremovalQ.remove(row[2])
                #foundTwiceInQuery.add(row[2])
            else:
                foundOnceInQuery.add(row[2])
                #IDremovalQ.add(row[2])
    currentfile+=1
print "Finished processing query logs and writing new files"
print "# of Query log entries removed:" , len(foundOnceInQuery)
print "Total Time Elapsed:", timeElapsed()

Just a couple of ideas:

1) load the data into a sqlite3 database and use an SQL query to
extract your results (has the potential of doing what you want without
you coding it, plus if your requirements change, maybe somewhat more
flexible)

2) Pre-sort your input files via ID, then match-merge (may add some
time/space required to sort, but then the merge should be fairly
quick, plus you'll have access to the entire row in the process, not
just the ID)

Jon.

Thanks Jon for the ideas, and yeah I might look into the SQLite
solution especially since I might need to do other sub selections. I
was also considering constructing a set for both install/update logs
and then another set for query logs and then do an intersection/other
set manipulations.

What I was really interested in was seeing if I could use generators
to try and accomplish what I'm doing, especially since I'm using a lot
of for loops with conditionals, and at this point it wouldn't matter if
the data is consumed in the intermediate steps before producing
output. Any help with getting this code into generator form would be
greatly appreciated.
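The set-manipulation idea mentioned above is cheap to sketch (the ID values here are hypothetical):

```python
# Hypothetical ID sets gathered from the two groups of files.
install_update_ids = {'id1', 'id2', 'id3'}
query_ids = {'id2', 'id3', 'id9'}

valid = install_update_ids & query_ids      # IDs present in both groups
to_remove = query_ids - install_update_ids  # query-only IDs, candidates for removal
```

Both operations are linear in the size of the smaller set, so they cost far less than the file reads that build the sets.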
 
Jon Clements

Zaki wrote:
Hey all,
I'm really new to Python and this may seem like a really dumb
question, but basically, I wrote a script to do the following, however
the processing time/memory usage is not what I'd like it to be. Any
suggestions?
Outline:
1. Read tab delim files from a directory, files are of 3 types:
install, update, and q. All 3 types contain ID values that are the
only part of interest.
2. Using set() and set.add(), generate a list of unique IDs from
install and update files.
3. Using the set created in (2), check the q files to see if there are
matches for IDs. Keep all matches, and add any non-matches (which only
occur once in the q file) to a queue of lines to be removed from the q
files.
4. Remove the lines in the q for each file. (I haven't quite written
the code for this, but I was going to implement this using csv.writer
and rewriting all the lines in the file except for the ones in the
removal queue).
Now, I've tried running this and it takes much longer than I'd like. I
was wondering if there might be a better way to do things (I thought
generator expressions might be a good way to attack this problem, as
you could generate the set, and then check to see if there's a match,
and write each line that way).
Why are you checking and removing lines in 2 steps? Why not copy the
matching lines to a new q file and then replace the old file with the
new one (or, maybe, delete the new q file if no lines were removed)?
That's what I've done now.
Here is the final code that I have running. It's very much 'hack' type
code and not at all efficient or optimized and any help in optimizing
it would be greatly appreciated.
import csv
import sys
import os
import time
begin = time.time()
#Check minutes elapsed
def timeElapsed():
    current = time.time()
    elapsed = current-begin
    return round(elapsed/60)
#USAGE: python logcleaner.py <input_dir> <output_dir>
inputdir = sys.argv[1]
outputdir = sys.argv[2]
logfilenames = os.listdir(inputdir)
IDs = set() #IDs from update and install logs
foundOnceInQuery = set()
#foundTwiceInQuery = set()
#IDremovalQ = set() Note: Unnecessary, duplicate of foundOnceInQuery; queue of IDs to remove from query logs (IDs found only once in query logs)
#Generate Filename Queues For Install/Update Logs, Query Logs
iNuQ = []
queryQ = []
for filename in logfilenames:
    if filename.startswith("par1.install") or filename.startswith("par1.update"):
        iNuQ.append(filename)
    elif filename.startswith("par1.query"):
        queryQ.append(filename)
totalfiles = len(iNuQ) + len(queryQ)
print "Total # of Files to be Processed:" , totalfiles
print "Install/Update Logs to be processed:" , len(iNuQ)
print "Query logs to be processed:" , len(queryQ)
#Process install/update queue to generate list of valid IDs
currentfile = 1
for file in iNuQ:
    print "Processing", currentfile, "install/update log out of", len(iNuQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file),delimiter = '\t')
    for row in reader:
        IDs.add(row[2])
    currentfile+=1
print "Finished processing install/update logs"
print "Unique IDs found:" , len(IDs)
print "Total Time Elapsed:", timeElapsed()
currentfile = 1
for file in queryQ:
    print "Processing", currentfile, "query log out of", len(queryQ)
    print timeElapsed()
    reader = csv.reader(open(inputdir+file), delimiter = '\t')
    outputfile = csv.writer(open(outputdir+file), 'w')
    for row in reader:
        if row[2] in IDs:
            ouputfile.writerow(row)
        else:
            if row[2] in foundOnceInQuery:
                foundOnceInQuery.remove(row[2])
                outputfile.writerow(row)
                #IDremovalQ.remove(row[2])
                #foundTwiceInQuery.add(row[2])
            else:
                foundOnceInQuery.add(row[2])
                #IDremovalQ.add(row[2])
    currentfile+=1
print "Finished processing query logs and writing new files"
print "# of Query log entries removed:" , len(foundOnceInQuery)
print "Total Time Elapsed:", timeElapsed()
Just a couple of ideas:
1) load the data into a sqlite3 database and use an SQL query to
extract your results (has the potential of doing what you want without
you coding it, plus if your requirements change, maybe somewhat more
flexible)
2) Pre-sort your input files via ID, then match-merge (may add some
time/space required to sort, but then the merge should be fairly
quick, plus you'll have access to the entire row in the process, not
just the ID)

Thanks Jon for the ideas, and yeah I might look into the SQLite
solution especially since I might need to do other sub selections. I
was also considering constructing a set for both install/update logs
and then another set for query logs and then do an intersection/other
set manipulations.

What I was really interested in was seeing if I could use generators
to try and accomplish what I'm doing, especially since I'm using a lot
of for loops with conditionals, and at this point it wouldn't matter if
the data is consumed in the intermediate steps before producing
output. Any help with getting this code into generator form would be
greatly appreciated.

Well effectively, you are using generators.

A very small optimisation might be to load the smallest set of data
first (if you are already, I apologise!). Read the others and see if
it's a matching ID, where it is, remove it from the smallest set, and
those are the records to be added on the next parse.

Posting when tired, so hope that makes some sense!

Otherwise, I'd be tempted to pre-sort or use sqlite.
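The pre-sort/match-merge idea can be sketched on two small ID-sorted row lists (column 2 as the ID column is an assumption carried over from the script):

```python
# Two ID-sorted row streams, e.g. produced by pre-sorting the log files.
install_update = [['a', 'b', 'id1'], ['c', 'd', 'id3']]
query = [['q', 'r', 'id1'], ['q', 'r', 'id2'], ['q', 'r', 'id3']]

# Classic match-merge: walk the sorted ID list once, never backtracking.
ids = [row[2] for row in install_update]
kept = []
i = 0
for row in query:
    while i < len(ids) and ids[i] < row[2]:
        i += 1
    if i < len(ids) and ids[i] == row[2]:
        kept.append(row)
# kept contains the id1 and id3 rows only
```

After the up-front sort, the merge is a single sequential pass over each input, and the whole row is in hand when a match is found, not just the ID.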
 
