*.csv to *.txt after adding columns


B

Bryan Britten

Hey, gang, I've got a problem here that I'm sure a handful of you will knowhow to solve. I've got about 6 *.csv files that I am trying to open; change the header names (to get rid of spaces); add two new columns, which are just the results of a string.split() command; drop the column I just split; and then finally export to *.txt files. Here's the code I'm using:

import os
import csv


fileHandle = 'Path/To/Data'
varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'

for csvFile in os.listdir(fileHandle):
outFile = open(fileHandle + os.path.splitext(csvFile)[0] + '.txt', 'w')
inFile = open(fileHandle + csvFile, 'rb')
reader = csv.reader(inFile, delimiter=',')
rowNum = 0
for row in reader:
if rowNum == 0:
outFile.write(varNames)
rowNum += 1
else:
date, time = row[2].split()
row.insert(3, date)
row.insert(4, time)
row.remove(row[2])
outFile.write('\t'.join(row) + '\n')
outFile.close()
inFile.close()


The issue I'm having is that the *.txt files I'm generating are empty. I assume some unraised error is being thrown, but I'm new to Python and am selftaught, so I don't know where to look or how to troubleshoot.

When I run the code on just one file and print the output instead of writing it, it looks exactly like what I'd want. So I'm at a loss for where the problem is.

Any help is appreciated!
 
Ad

Advertisements

D

Dave Angel

Hey, gang, I've got a problem here that I'm sure a handful of you will know how to solve. I've got about 6 *.csv files that I am trying to open; change the header names (to get rid of spaces); add two new columns, which are just the results of a string.split() command; drop the column I just split; and then finally export to *.txt files. Here's the code I'm using:

import os
import csv


fileHandle = 'Path/To/Data'
varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'

for csvFile in os.listdir(fileHandle):
outFile = open(fileHandle + os.path.splitext(csvFile)[0] + '.txt', 'w')
inFile = open(fileHandle + csvFile, 'rb')
reader = csv.reader(inFile, delimiter=',')
rowNum = 0
for row in reader:
if rowNum == 0:
outFile.write(varNames)
rowNum += 1
else:
date, time = row[2].split()
row.insert(3, date)
row.insert(4, time)
row.remove(row[2])
outFile.write('\t'.join(row) + '\n')
outFile.close()
inFile.close()


The issue I'm having is that the *.txt files I'm generating are empty. I assume some unraised error is being thrown,

How can you NOT know if an exception is being raised? How are you
running the code, in a terminal window? Can't you just see what gets
printed (as stderr)?

but I'm new to Python and am self taught, so I don't know where to look
or how to troubleshoot.

When I run the code on just one file and print the output instead of writing it, it looks exactly like what I'd want. So I'm at a loss for where the problem is.

You describe two changes, but don't show them. How about you do them
one at a time, or perhaps even better, add prints so it both does the
file(s) AND prints the output?

What I think is happening is that you're missing the path separator
between the "Path/To/Data" and the basename. You should be combining
those with os.path.join(), not with +

For a quick & dirty check, add a trailing slash to the fileHandle. But
that's not the right way to fix it.
 
B

Bryan Britten

Dave -

I can't print the output because there are close to 1,000,000 records. It would be extremely inefficient and resource intensive to look at every row. Like I said, when I take just one file and run the code over the first few records I get what I'd expect to see. Here's an example(non-redacted code):

INPUT:

import csv

fileHandle = 'C:/Users/Bryan/Data Analysis/Crime Analysis/Data/'

varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'

outFile = open(fileHandle + 'ChiCrime01_02.txt', 'w')
inFile = open(fileHandle + 'ChiCrime01_02.csv', 'rb')
reader = csv.reader(inFile, delimiter=',')
rowNum = 0
for row in reader:
if rowNum < 5:
if rowNum >= 1:
date, time = row[2].split()
row.insert(3, date)
row.insert(4, time)
row.remove(row[2])
print '\t'.join(row)
rowNum+=1
else:
print varNames
rowNum+=1


OUTPUT:

ID CaseNum Date Time Block IUCR PrimaryType DescriptionLocDesc Arrest Domestic Beat District Ward CommArea FBICode XCoord YCoord Year UpdatedOn Lat Long Loc

2924745 HJ602602 12/31/2002 23:59 006XX W 117TH ST 841THEFT FINANCIAL ID THEFT:$300 &UNDER RESIDENCE PORCH/HALLWAY FALSE FALSE 524 5 34 53 6 1173831 1827387 2002 3/30/2006 21:10 41.68175482 -87.63931351 (41.681754819160666, -87.63931350564216)

2523290 HJ101091 12/31/2002 23:59 002XX W 112TH PL 1310 CRIMINAL DAMAGE TO PROPERTY APARTMENT FALSE FALSE 522 34 49 14 1176848 1830383 2002 3/30/2006 21:10 41.68990907 -87.62817988 (41.689909072449474, -87.62817987594765)

2527332 HJ105139 12/31/2002 23:55 005XX E 89TH PL 486 BATTERY DOMESTIC BATTERY SIMPLE RESIDENCE FALSE TRUE 633 6 44 08B 1181369 1845794 2002 3/30/2006 21:10 41.73209609 -87.61115533 (41.732096089465905, -87.61115532670617)

2524251 HJ100175 12/31/2002 23:55 012XX S KARLOV AVE 041A BATTERY AGGRAVATED: HANDGUN SIDEWALK FALSE FALSE 1011 24 29 04B 1149196 1894387 2002 3/30/2006 21:10 41.86612296 -87.72776536 (41.86612295941429, -87.72776535755746)


Like I said, the output is exactly what I want, but it doesn't seem to be writing to the file and I don't know why. I said I didn't know if it was raising an exception because I'm new to Python and I didn't know if there weresome methods that included "silent" errors where it would continue the code but produce the wrong results, such as not writing my files.

Lastly, why does everyone seem to push for os.path.join versus the method Ihave used? Is it just a 'standard' that people like to see?

Thanks for your help
 
P

Peter Otten

Bryan said:
Hey, gang, I've got a problem here that I'm sure a handful of you will
know how to solve. I've got about 6 *.csv files that I am trying to open;
change the header names (to get rid of spaces); add two new columns, which
are just the results of a string.split() command; drop the column I just
split; and then finally export to *.txt files. Here's the code I'm using:

import os
import csv


fileHandle = 'Path/To/Data'
varNames =
'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'

You may be processing not just the .csv files, but also the .txt files from
a previous run of your script. Remove them manually to verify my guess.
If I'm right you can (in order of my preference)

(1) write the output into another directory and also do (2)
(2) filter the input files with glob.glob("/path/to/data/*.csv")
(3) or just add the two lines below to your current code.
for csvFile in os.listdir(fileHandle):

if not csvFile.endswith(".csv"):
continue
outFile = open(fileHandle + os.path.splitext(csvFile)[0] + '.txt',
'w')
inFile = open(fileHandle + csvFile, 'rb')
reader = csv.reader(inFile, delimiter=',')
rowNum = 0
for row in reader:
if rowNum == 0:
outFile.write(varNames)
rowNum += 1
else:
date, time = row[2].split()
row.insert(3, date)
row.insert(4, time)
row.remove(row[2])
outFile.write('\t'.join(row) + '\n')
outFile.close()
inFile.close()


The issue I'm having is that the *.txt files I'm generating are empty. I
assume some unraised error is being thrown, but I'm new to Python and am
self taught, so I don't know where to look or how to troubleshoot.

When I run the code on just one file and print the output instead of
writing it, it looks exactly like what I'd want. So I'm at a loss for
where the problem is.

Any help is appreciated!
 
R

rusi

Hey, gang, I've got a problem here that I'm sure a handful of you will know how to solve. I've got about 6 *.csv files that I am trying to open; change the header names (to get rid of spaces); add two new columns, which arejust the results of a string.split() command; drop the column I just split; and then finally export to *.txt files. Here's the code I'm using:

Its generally a good idea in programming to separate separable problems. Since you say:
The issue I'm having is that the *.txt files I'm generating are empty. I
assume some unraised error is being thrown, but I'm new to Python and am self
taught, so I don't know where to look or how to troubleshoot.
When I run the code on just one file and print the output instead of writing
it, it looks exactly like what I'd want. So I'm at a loss for where the
problem is.

you should probably write and have checked the code that handles one file. For example a function that takes as parameters the two: the input/output files (maybe names or handles), and then
- check that that function works for one file
- call it in your loop

[BTW your code seems to mixup the variable-names for file-names and handles]
 
D

Dave Angel

Dave -

I can't print the output because there are close to 1,000,000 records. It would be extremely inefficient and resource intensive to look at every row.

Not if you made a sample directory with about 3 files, each containing
half a dozen lines.
Like I said, when I take just one file and run the code over the first
few records I get what I'd expect to see. Here's an example(non-redacted code):

INPUT:

import csv

fileHandle = 'C:/Users/Bryan/Data Analysis/Crime Analysis/Data/'

Now, that directory specification ends with a slash. So "+" will work
correctly. But your original did not.
varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'

outFile = open(fileHandle + 'ChiCrime01_02.txt', 'w')
inFile = open(fileHandle + 'ChiCrime01_02.csv', 'rb')

Instead of changing this code, you could have switched to a directory
containing only one file.
Like I said, the output is exactly what I want, but it doesn't seem to be writing to the file and I don't know why. I said I didn't know if it was raising an exception because I'm new to Python and I didn't know if there were some methods that included "silent" errors where it would continue the code but produce the wrong results, such as not writing my files.

The only "silent" exception I know of is the one triggered by
sys.exit(), used to terminate the process.
Lastly, why does everyone seem to push for os.path.join versus the method I have used? Is it just a 'standard' that people like to see?

Because os.path.join will be smart enough to add the slashes only if
they are necessary. That can be useful, especially if the directory
you're using as a prefix came from the user.


I think Peter's suggestion is probably right on; you don't limit your
infiles to *.csv, so you will be processing *.txt files the second time.

Another useful debugging aid would have been to add print statements
right after opening the files, perhaps something like:

print "Reading :", infile.name
print "Writing:", outfile.name

If those names had been missing slashes, I would have been vindicated,
while if they were the same, you'd know Peter had nailed it.
 
Ad

Advertisements

B

Bryan Britten

Peter nailed it. Adding in the two lines of code to ensure I was just working with *.csv files fixed the problem. Thanks to everyone for the help and suggestions on best practices.
 
Ad

Advertisements

D

Dennis Lee Bieber

Hey, gang, I've got a problem here that I'm sure a handful of you will know how to solve. I've got about 6 *.csv files that I am trying to open; change the header names (to get rid of spaces); add two new columns, which are just the results of a string.split() command; drop the column I just split; and then finally export to *.txt files. Here's the code I'm using:

import os
import csv


fileHandle = 'Path/To/Data'

That seems a misnomer... It is not a "file handle" of any sort, just a
base path.
varNames = 'ID\tCaseNum\tDate\tTime\tBlock\tIUCR\tPrimaryType\tDescription\tLocDesc\tArrest\tDomestic\tBeat\tDistrict\tWard\tCommArea\tFBICode\tXCoord\tYCoord\tYear\tUpdatedOn\tLat\tLong\tLoc\n'

for csvFile in os.listdir(fileHandle):
outFile = open(fileHandle + os.path.splitext(csvFile)[0] + '.txt', 'w')

Note that your "fileHandle" does not end with a / -- so I have no idea
what types of names you are trying to open...

Recommend you use

os.path.join(fileHande, os.path.splitext(csvFile)[0] + ".txt")

inFile = open(fileHandle + csvFile, 'rb')
ditto

reader = csv.reader(inFile, delimiter=',')
rowNum = 0
for row in reader:
if rowNum == 0:
outFile.write(varNames)
rowNum += 1
else:
date, time = row[2].split()
row.insert(3, date)
row.insert(4, time)
row.remove(row[2])
outFile.write('\t'.join(row) + '\n')

Apparently your "text file" is a tab-separated file...

The CSV module is capable of processing TSV just as easily as CSV.

I'd also drop the whole confusion of rowNum


-=-=-=-=- pseudo-code, not runnable

fin = open(os.path.join(basepath, filename), "rb")
csvin = csv.reader(fin, delimiter=",")
fout = open(os.path.join(basepath, otherfilename), "wb")
csvout = csv.writer(fout, delimiter="\t")

junk = reader.next() #skip header
csvout.write(["ID", "caseNum", ... ])

for row in reader:
out = row[:2]
out.extend(row[2].split())
out.extend(row[3:])
csvout.write(row)

fout.close()
fin.close()
del csvout #just paranoia, free up structures
del csvin #ditto

-=-=-=-=-
 

Top