oyekomova
I would like to know how to convert a csv file with a header row into a
floating point array without the header row.
Robert said:oyekomova said:I would like to know how to convert a csv file with a header row into a
floating point array without the header row.
Use the standard library module csv. Something like the following is a cheap and
cheerful solution:
import csv
import numpy

def float_array_from_csv(filename, skip_header=True):
    f = open(filename)
    try:
        reader = csv.reader(f)
        floats = []
        if skip_header:
            reader.next()
        for row in reader:
            floats.append(map(float, row))
    finally:
        f.close()
    return numpy.array(floats)
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
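The snippet in this post is Python 2 (reader.next(), and map returning a list). For readers on Python 3, here is a minimal sketch of the same idea. It assumes every field after the header parses as a float, and it is an adaptation rather than what Robert posted.

```python
import csv
import numpy as np

def float_array_from_csv(filename, skip_header=True):
    """Read a purely numeric CSV file into a 2-D float array."""
    # newline="" is what the csv module documentation recommends for files
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        if skip_header:
            next(reader, None)  # discard the header row, if there is one
        # Python 3's map() is lazy, so build each row as an explicit list
        rows = [[float(value) for value in row] for row in reader]
    return np.array(rows, dtype=float)
```

Like the original, this builds a full list of lists before creating the array, so it is only comfortable for files that fit in memory a couple of times over.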
oyekomova said:Thanks for your help. I compared the following code in NumPy with the
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI - The csv file was
a simple 6 column file with a header row and more than a million
records.
import csv
from numpy import array
import time
t1=time.clock()
file_to_read = file('somename.csv','r')
read_from = csv.reader(file_to_read)
read_from.next()
datalist = [ map(float, row[:]) for row in read_from ]
# now the real data
data = array(datalist, dtype = float)
elapsed=time.clock()-t1
print elapsed
sturlamolden said:
oyekomova said:
datalist = [ map(float, row[:]) for row in read_from ]
I'm willing to bet that this is your problem. Python lists are arrays
under the hood!
Try something like this instead:
from numpy import empty, arange

# read the whole file in one chunk
lines = file_to_read.readlines()
# count the number of columns
n = 1
for c in lines[1]:
    if c == ',': n += 1
# count the number of rows
m = len(lines[1:])
# allocate
data = empty((m,n),dtype=float)
# create csv reader, skip header
reader = csv.reader(lines[1:])
# read
for i in arange(0,m):
    data[i,:] = map(float,reader.next())
And if this is too slow, you may consider vectorizing the last loop:
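data = empty((m,n),dtype=float)
# strip the line endings first, so the joined string is a single csv record
newstr = ",".join([line.strip() for line in lines[1:]])
flatdata = data.reshape((n*m)) # flatdata is a view of data, not a copy
reader = csv.reader([newstr])
flatdata[:] = map(float,reader.next())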
I hope this helps!
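The flatten-then-reshape idea can also be sketched in Python 3 as follows. This is a hedged illustration, not the code from the post: the helper name parse_flat is made up, it uses plain str.split instead of the csv module, and it assumes no quoted fields and the same number of columns in every row.

```python
import numpy as np

def parse_flat(lines, skip_header=True):
    """Convert CSV text lines to a 2-D float array with one bulk conversion."""
    rows = lines[1:] if skip_header else lines
    # drop line endings and any blank lines
    rows = [line.strip() for line in rows if line.strip()]
    if not rows:
        return np.empty((0, 0), dtype=float)
    n_cols = rows[0].count(",") + 1
    # one long flat list of number strings, converted in a single call
    flat = np.array(",".join(rows).split(","), dtype=float)
    return flat.reshape(len(rows), n_cols)
```

The reshape raises a ValueError if some row has a different number of fields, which acts as a cheap consistency check.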
Travis said:If you use numpy.fromfile, you need to skip past the initial header row
yourself. Something like this:
import numpy
fid = open('somename.csv')
fid.readline()  # skip past the header row
data = numpy.fromfile(fid, sep=',').reshape(-1,6)
# for 6-column data.
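For comparison, numpy.loadtxt takes a skiprows argument, so the header can be skipped without touching the file object. The sketch below is an alternative to the fromfile approach, not what Travis posted; the wrapper name load_numeric_csv is hypothetical, and no claim is made here about its speed on very large files.

```python
import numpy as np

def load_numeric_csv(path, header_rows=1):
    """Load a comma-separated numeric file, skipping the leading header rows."""
    # ndmin=2 keeps the result 2-D even if the file has a single data row
    return np.loadtxt(path, delimiter=",", skiprows=header_rows, ndmin=2)
```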
oyekomova said:Thanks to everyone for their excellent suggestions. I was able to
achieve the following results with all your suggestions. However, I am
unable to get past a file size of 6 million rows. I would appreciate any
helpful suggestions on avoiding memory errors. None of the solutions
posted was able to get past this limit.
oyekomova said:Thanks for your note. I have 1Gig of RAM. Also, Matlab has no problem
in reading the file into memory. I am just running Istvan's code that
was posted earlier.
oyekomova said:Thanks for your note. I have 1Gig of RAM. Also, Matlab has no problem
in reading the file into memory. I am just running Istvan's code that
was posted earlier.
You have a CSV file of about 520 MiB, which is read into memory. Then
you have a list of lists of floats, created by list comprehension, which
is larger than 274 MiB. Additionally you try to allocate a NumPy array
slightly larger than 274 MiB. Now your process is already exceeding 1
GiB, and you are probably running other processes too. That is why you
run out of memory.
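The 274 MiB figure for the array follows directly from the shape, assuming 6 million rows of 6 columns stored as 8-byte floats. A quick check in Python 3:

```python
import sys

rows, cols = 6_000_000, 6
bytes_per_float64 = 8  # NumPy's default float dtype is 64-bit

array_bytes = rows * cols * bytes_per_float64
array_mib = array_bytes / 2**20
print("array: %.1f MiB" % array_mib)  # array: 274.7 MiB

# A Python float object is larger than the raw 8 bytes, and each list also
# stores a pointer per element, which is why the list of lists needs more
# room than the finished array. The exact size depends on the interpreter.
print("one Python float object:", sys.getsizeof(1.0), "bytes")
```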
So you have three options:
1. Buy more RAM.
2. Low-level code a csv-reader in C.
3. Read the data in chunks. That would mean something like this:
import time, csv, random
import numpy

def make_data(rows=6E6, cols=6):
    fp = open('data.txt', 'wt')
    counter = range(cols)
    for row in xrange( int(rows) ):
        vals = map(str, [ random.random() for x in counter ] )
        fp.write( '%s\n' % ','.join( vals ) )
    fp.close()

def read_test():
    start = time.clock()
    arrlist = None
    r = 0
    CHUNK_SIZE_HINT = 4096 * 4 # seems to be good
    fid = file('data.txt')
    while 1:
        chunk = fid.readlines(CHUNK_SIZE_HINT)
        if not chunk: break
        reader = csv.reader(chunk)
        data = [ map(float, row) for row in reader ]
        arrlist = [ numpy.array(data,dtype=float), arrlist ]
        r += arrlist[0].shape[0]
        del data
        del reader
        del chunk
    print 'Created list of chunks, elapsed time so far: ', time.clock() - start
    print 'Joining list...'
    data = numpy.empty((r,arrlist[0].shape[1]),dtype=float)
    r1 = r
    while arrlist:
        r0 = r1 - arrlist[0].shape[0]
        data[r0:r1,:] = arrlist[0]
        r1 = r0
        del arrlist[0]
        arrlist = arrlist[0]
    print 'Elapsed time:', time.clock() - start

make_data()
read_test()
This can process a CSV file of 6 million rows in about 150 seconds on
my laptop. A CSV file of 1 million rows takes about 25 seconds.
Just reading the 6 million row CSV file ( using fid.readlines() ) takes
about 40 seconds on my laptop. Python lists are not particularly
efficient. You can probably reduce the time to ~60 seconds by writing a
new CSV reader for NumPy arrays in a C extension.
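For readers on Python 3, the chunked approach can be sketched in a simpler two-pass form: count the rows first, allocate the array once, then fill it from bounded chunks. This is a hedged variant, not the code above. The names read_csv_in_chunks and chunk_rows are hypothetical, and the sketch assumes a purely numeric file with no blank lines and the same number of columns in every row.

```python
import csv
import itertools
import numpy as np

def read_csv_in_chunks(path, chunk_rows=100000, skip_header=True):
    """Load a numeric CSV into a float array, converting chunk_rows rows at a time."""
    # First pass: count data rows and columns so the array is allocated once.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        if skip_header:
            next(reader, None)
        first = next(reader, None)
        if first is None:
            return np.empty((0, 0), dtype=float)
        n_cols = len(first)
        n_rows = 1 + sum(1 for _ in reader)

    data = np.empty((n_rows, n_cols), dtype=float)

    # Second pass: only one chunk of parsed strings is alive at any time.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        if skip_header:
            next(reader, None)
        start = 0
        while True:
            chunk = list(itertools.islice(reader, chunk_rows))
            if not chunk:
                break
            stop = start + len(chunk)
            data[start:stop, :] = np.array(chunk, dtype=float)
            start = stop
    return data
```

Reading the file twice costs extra I/O, but the peak memory is roughly the final array plus one chunk, which is the property that matters on a machine with 1 GiB of RAM. The timings quoted above do not apply to this sketch.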