Newbie - converting csv files to arrays in NumPy

oyekomova · Jan 9, 2007

I would like to know how to convert a csv file with a header row into a
floating point array without the header row.

Marc 'BlackJack' Rintsch · Jan 9, 2007

oyekomova said:
I would like to know how to convert a csv file with a header row into a
floating point array without the header row.

Take a look at the `csv` module in the standard library.

Ciao,
Marc 'BlackJack' Rintsch

Robert Kern · Jan 9, 2007

oyekomova said:
I would like to know how to convert a csv file with a header row into a
floating point array without the header row.

Use the standard library module csv. Something like the following is a cheap and
cheerful solution:

import csv
import numpy

def float_array_from_csv(filename, skip_header=True):
    f = open(filename)
    try:
        reader = csv.reader(f)
        floats = []
        if skip_header:
            reader.next()
        for row in reader:
            floats.append(map(float, row))
    finally:
        f.close()

    return numpy.array(floats)

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
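
For readers on Python 3, the same function can be sketched roughly as below. This is an adaptation, not Robert's original code: `reader.next()` becomes `next(reader)`, `map` is lazy and has to be materialised, and the file is opened with `newline=''` as the csv module documentation recommends. The sample file and its contents are made up for the demonstration.

```python
import csv
import os
import tempfile

import numpy


def float_array_from_csv(filename, skip_header=True):
    """Read a purely numeric CSV file into a 2-D float array."""
    # newline='' is what the csv docs recommend for files given to csv.reader.
    with open(filename, newline='') as f:
        reader = csv.reader(f)
        if skip_header:
            next(reader)  # Python 3 spelling of reader.next()
        # map() is lazy in Python 3, so build a real list for every row.
        floats = [list(map(float, row)) for row in reader]
    return numpy.array(floats)


# Small demonstration on a throwaway file (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='') as tmp:
    tmp.write('a,b,c\n1,2,3\n4.5,5.5,6.5\n')
demo = float_array_from_csv(tmp.name)
os.remove(tmp.name)
print(demo.shape)
```

The `with` block replaces the try/finally pair; the file is closed either way.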

oyekomova · Jan 10, 2007

Thanks for your help. I compared the following code in NumPy with the
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI - The csv file was
a simple 6 column file with a header row and more than a million
records.

import csv
from numpy import array
import time
t1=time.clock()
file_to_read = file('somename.csv','r')
read_from = csv.reader(file_to_read)
read_from.next()

datalist = [ map(float, row[:]) for row in read_from ]

# now the real data
data = array(datalist, dtype = float)

elapsed=time.clock()-t1
print elapsed


sturlamolden · Jan 10, 2007

oyekomova said:
Thanks for your help. I compared the following code in NumPy with the
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI - The csv file was
a simple 6 column file with a header row and more than a million
records.

import csv
from numpy import array
import time
t1=time.clock()
file_to_read = file('somename.csv','r')
read_from = csv.reader(file_to_read)
read_from.next()

datalist = [ map(float, row[:]) for row in read_from ]

I'm willing to bet that this is your problem. Python lists are arrays
under the hood!

Try something like this instead:

from numpy import empty, arange

# read the whole file in one chunk
lines = file_to_read.readlines()
# count the number of columns
n = 1
for c in lines[1]:
    if c == ',': n += 1
# count the number of rows
m = len(lines[1:])
#allocate
data = empty((m,n),dtype=float)
# create csv reader, skip header
reader = csv.reader(lines[1:])
# read
for i in arange(0,m):
    data[i,:] = map(float,reader.next())

And if this is too slow, you may consider vectorizing the last loop:

data = empty((m,n),dtype=float)
# strip the line endings so the joined text forms a single CSV record
newstr = ",".join([line.strip() for line in lines[1:]])
flatdata = data.reshape((n*m)) # flatdata is a view of data, not a copy
reader = csv.reader([newstr])
flatdata[:] = map(float,reader.next())

I hope this helps!
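
The preallocation idea above could look something like this on Python 3. It is only a sketch: the in-memory `text` is a hypothetical stand-in for the real file, and counting commas to get the number of columns assumes plain numeric fields with no quoted commas.

```python
import csv
import io

import numpy

# Hypothetical in-memory stand-in for the CSV file.
text = 'x,y,z\n1,2,3\n4,5,6\n7,8,9\n'
lines = io.StringIO(text).readlines()

n = lines[1].count(',') + 1      # number of columns, from the first data row
m = len(lines) - 1               # number of data rows, header excluded

data = numpy.empty((m, n), dtype=float)   # allocate the result once
reader = csv.reader(lines)
next(reader)                              # skip the header row
for i, row in enumerate(reader):
    data[i, :] = [float(v) for v in row]  # fill in place, one row at a time

print(data.shape)
```

Only one parsed row exists as Python floats at any moment, although `lines` still holds the whole file as text.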


Gabriel Genellina · Jan 10, 2007

oyekomova said:
Thanks for your help. I compared the following code in NumPy with the
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI - The csv file was
a simple 6 column file with a header row and more than a million
records.

import csv
from numpy import array
import time
t1=time.clock()
file_to_read = file('somename.csv','r')
read_from = csv.reader(file_to_read)
read_from.next()

datalist = [ map(float, row[:]) for row in read_from ]

# now the real data
data = array(datalist, dtype = float)

elapsed=time.clock()-t1
print elapsed

Replace that row[:] by row, it's just a waste of time and memory.
And see http://www.scipy.org/Cookbook/InputOutput

--
Gabriel Genellina
Softlab SRL


John Machin · Jan 11, 2007

sturlamolden said:
oyekomova said:

Thanks for your help. I compared the following code in NumPy with the
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI - The csv file was
a simple 6 column file with a header row and more than a million
records.

import csv
from numpy import array
import time
t1=time.clock()
file_to_read = file('somename.csv','r')
read_from = csv.reader(file_to_read)
read_from.next()

datalist = [ map(float, row[:]) for row in read_from ]

I'm willing to bet that this is your problem. Python lists are arrays
under the hood!

Try something like this instead:

# read the whole file in one chunk
lines = file_to_read.readlines()
# count the number of columns
n = 1
for c in lines[1]:
    if c == ',': n += 1
# count the number of rows
m = len(lines[1:])

Please consider using
m = len(lines) - 1

#allocate
data = empty((m,n),dtype=float)
# create csv reader, skip header
reader = csv.reader(lines[1:])

lines[1:] again?
The OP set you an example:
read_from.next()
so you could use:
reader = csv.reader(lines)
_unused = reader.next()

Istvan Albert · Jan 11, 2007

oyekomova said:
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI

There must be something wrong with your setup/program. I work with
large csv files as well and I never have performance problems of that
magnitude. Make sure you are not doing something else while parsing
your data.

Parsing 1 million lines with six columns with the program below takes
87 seconds on my laptop. Even your original version, with the extra slices
and all, would only take about 50% more time.

import time, csv, random
from numpy import array

def make_data(rows=1E6, cols=6):
    fp = open('data.txt', 'wt')
    counter = range(cols)
    for row in xrange( int(rows) ):
        vals = map(str, [ random.random() for x in counter ] )
        fp.write( '%s\n' % ','.join( vals ) )
    fp.close()

def read_test():
    start = time.clock()
    reader = csv.reader( file('data.txt') )
    data = [ map(float, row) for row in reader ]
    data = array(data, dtype = float)
    print 'Data size', len(data)
    print 'Elapsed', time.clock() - start

#make_data()
read_test()
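
On current Python 3 this benchmark needs a few changes, since `time.clock` was removed in Python 3.8 and `xrange`, `file` and the `print` statement no longer exist. A rough port, assuming `time.perf_counter` as the timer and a temporary directory for the data file (both choices are mine, not Istvan's), might be:

```python
import csv
import os
import random
import tempfile
import time

import numpy


def make_data(path, rows=1000, cols=6):
    """Write `rows` lines of `cols` random floats as comma-separated text."""
    with open(path, 'w', newline='') as fp:
        writer = csv.writer(fp)
        for _ in range(rows):
            writer.writerow([random.random() for _ in range(cols)])


def read_test(path):
    """Parse the file into a float array; return it with the elapsed seconds."""
    start = time.perf_counter()
    with open(path, newline='') as fp:
        rows = [[float(v) for v in row] for row in csv.reader(fp)]
        data = numpy.array(rows, dtype=float)
    return data, time.perf_counter() - start


path = os.path.join(tempfile.mkdtemp(), 'data.txt')
make_data(path)                 # small by default; raise rows= for a real test
data, elapsed = read_test(path)
print('Data size', len(data))
print('Elapsed', elapsed)
```

The default of 1000 rows only keeps the demonstration quick; pass `rows=1000000` to repeat the measurement discussed here.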

Travis E. Oliphant · Jan 12, 2007

oyekomova said:
Thanks for your help. I compared the following code in NumPy with the
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI - The csv file was
a simple 6 column file with a header row and more than a million
records.

There is some facility to read simply-formatted files directly into NumPy.

You might try something like this.

numpy.fromfile('somename.csv', sep=',')

and then reshape the array.

-Travis

Travis E. Oliphant · Jan 12, 2007

oyekomova said:
Thanks for your help. I compared the following code in NumPy with the
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI - The csv file was
a simple 6 column file with a header row and more than a million
records.

import csv
from numpy import array
import time
t1=time.clock()
file_to_read = file('somename.csv','r')
read_from = csv.reader(file_to_read)
read_from.next()

datalist = [ map(float, row[:]) for row in read_from ]

# now the real data
data = array(datalist, dtype = float)

elapsed=time.clock()-t1
print elapsed

If you use numpy.fromfile, you need to skip past the initial header row
yourself. Something like this:

fid = open('somename.csv')
data = numpy.fromfile(fid, sep=',').reshape(-1,6)
# for 6-column data.

-Travis

Robert Kern · Jan 12, 2007

Travis said:
If you use numpy.fromfile, you need to skip past the initial header row
yourself. Something like this:

fid = open('somename.csv')

# I think you also meant to include this line:
header = fid.readline()

data = numpy.fromfile(fid, sep=',').reshape(-1,6)
# for 6-column data.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
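
A higher-level route that does not come up in this thread is `numpy.loadtxt`, which in current NumPy releases accepts a delimiter and a number of leading rows to skip. The sketch below assumes such a release and uses a made-up sample file; it makes no claim about speed relative to `numpy.fromfile`.

```python
import os
import tempfile

import numpy

# Hypothetical sample file standing in for 'somename.csv'.
path = os.path.join(tempfile.mkdtemp(), 'somename.csv')
with open(path, 'w') as f:
    f.write('a,b,c,d,e,f\n')
    f.write('1,2,3,4,5,6\n')
    f.write('7,8,9,10,11,12\n')

# skiprows=1 drops the header line; delimiter=',' splits the columns.
data = numpy.loadtxt(path, delimiter=',', skiprows=1)
print(data.shape)
```

The result is already two-dimensional, so no reshape step is needed.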

oyekomova · Jan 13, 2007

Thanks to everyone for their excellent suggestions. I was able to
achieve the following results with all your suggestions. However, I am
unable to get past a file size of 6 million rows. I would appreciate any
helpful suggestions on avoiding memory errors. None of the solutions
posted was able to cross this limit.

Data size 1999999
Elapsed 63.4050884573
Data size 4999999
Elapsed 177.888915777
Data size 5999999
Traceback (most recent call last):
  File "C:/Documents/some.py", line 27, in <module>
    read_test()
  File "C:/Documents/some.py", line 21, in read_test
    data = array(data, dtype = float)
MemoryError

sturlamolden · Jan 13, 2007

oyekomova said:
Thanks to everyone for their excellent suggestions. I was able to
achieve the following results with all your suggestions. However, I am
unable to get past a file size of 6 million rows. I would appreciate any
helpful suggestions on avoiding memory errors. None of the solutions
posted was able to cross this limit.

The error message means you are running out of RAM.

With 6 million rows and 6 columns, the size of the data array is (only)
274 MiB. I have no problem allocating it on my laptop. How large is the
csv file and how much RAM do you have?
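
The 274 MiB figure follows from 8 bytes per double-precision value. A quick check of the arithmetic, using only the row and column counts given in the thread:

```python
import numpy

rows, cols = 6_000_000, 6
itemsize = numpy.dtype(float).itemsize   # bytes per float64 value
nbytes = rows * cols * itemsize          # total size of the array data
mib = nbytes / 2**20                     # 1 MiB = 2**20 bytes

print(nbytes)
print(mib)
```

That is 288,000,000 bytes, or a little under 275 MiB, for the array data alone.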

Also it helps to post the whole code you are trying to run. I don't
care much for guesswork.

oyekomova · Jan 13, 2007

Thanks for your note. I have 1Gig of RAM. Also, Matlab has no problem
in reading the file into memory. I am just running Istvan's code that
was posted earlier.

import time, csv, random
from numpy import array

def make_data(rows=1E6, cols=6):
    fp = open('data.txt', 'wt')
    counter = range(cols)
    for row in xrange( int(rows) ):
        vals = map(str, [ random.random() for x in counter ] )
        fp.write( '%s\n' % ','.join( vals ) )
    fp.close()

def read_test():
    start = time.clock()
    reader = csv.reader( file('data.txt') )
    data = [ map(float, row) for row in reader ]
    data = array(data, dtype = float)
    print 'Data size', len(data)
    print 'Elapsed', time.clock() - start

#make_data()
read_test()

skip · Jan 13, 2007

oyekomova> def read_test():
oyekomova> start = time.clock()
oyekomova> reader = csv.reader( file('data.txt') )
oyekomova> data = [ map(float, row) for row in reader ]
oyekomova> data = array(data, dtype = float)
oyekomova> print 'Data size', len(data)
oyekomova> print 'Elapsed', time.clock() - start

You have the entire file in memory as well as the entire array. Try
operating line-by-line.

#!/usr/bin/env python

import array
import time
import random
import csv

def make_data(nrows=1000000, cols=6):
    counter = range(cols)
    writer = csv.writer(open('data.txt', 'wt'))
    for row in xrange(nrows):
        writer.writerow([random.random() for x in counter])

def read_test():
    reader = csv.reader( file('data.txt') )
    data = array.array('f')
    for row in reader:
        data.extend(map(float, row))
    print 'Data size', len(data)

start = time.clock()
make_data()
print "generate data:", (time.clock()-start)

start = time.clock()
read_test()
print "read data:", (time.clock()-start)

Skip
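
If the end result still has to be a NumPy array, the line-by-line approach can be combined with `numpy.frombuffer`, which wraps the memory of an `array.array` without copying it. The sketch below is a Python 3 variation, not Skip's code: it uses type code `'d'` (C double) rather than `'f'` so the values match NumPy's default float64, and the in-memory `source` is a hypothetical stand-in for data.txt.

```python
import array
import csv
import io

import numpy

# Hypothetical in-memory stand-in for 'data.txt' (no header row).
source = io.StringIO('0.1,0.2,0.3\n0.4,0.5,0.6\n')

values = array.array('d')      # 'd' = C double, same width as NumPy float64
ncols = None
for row in csv.reader(source):
    if ncols is None:
        ncols = len(row)       # remember the column count from the first row
    values.extend(float(v) for v in row)

# frombuffer shares the array's memory instead of copying it.
data = numpy.frombuffer(values, dtype=numpy.float64).reshape(-1, ncols)
print(data.shape)
```

With `'f'` the buffer would hold 4-byte single-precision values, which halves the memory but loses precision.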

sturlamolden · Jan 14, 2007

oyekomova said:
Thanks for your note. I have 1Gig of RAM. Also, Matlab has no problem
in reading the file into memory. I am just running Istvan's code that
was posted earlier.

You have a CSV file of about 520 MiB, which is read into memory. Then
you have a list of lists of floats, created by list comprehension, which
is larger than 274 MiB. Additionally you try to allocate a NumPy array
slightly larger than 274 MiB. Now your process is already exceeding 1
GiB, and you are probably running other processes too. That is why you
run out of memory.
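
The reason the list of lists is larger than the array can be estimated with `sys.getsizeof`. This is a rough sketch that assumes CPython, where every float is a separate object; the exact numbers depend on the interpreter and platform, so none are quoted here.

```python
import sys

import numpy

cols = 6
row = [float(i) + 0.5 for i in range(cols)]   # one parsed CSV row

# Each row costs the list object itself, one float object per value,
# and one pointer to the row stored in the outer list.
pointer_size = numpy.dtype(numpy.intp).itemsize
list_bytes_per_row = (sys.getsizeof(row)
                      + sum(sys.getsizeof(v) for v in row)
                      + pointer_size)

# The NumPy array stores only the raw 8-byte values.
array_bytes_per_row = cols * numpy.dtype(float).itemsize

print(list_bytes_per_row, array_bytes_per_row)
```

Multiplying both figures by the number of rows gives an estimate of the two allocations that coexist while `array(data, dtype=float)` runs.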

So you have three options:

1. Buy more RAM.

2. Low-level code a csv-reader in C.

3. Read the data in chunks. That would mean something like this:

import time, csv, random
import numpy

def make_data(rows=6E6, cols=6):
    fp = open('data.txt', 'wt')
    counter = range(cols)
    for row in xrange( int(rows) ):
        vals = map(str, [ random.random() for x in counter ] )
        fp.write( '%s\n' % ','.join( vals ) )
    fp.close()

def read_test():
    start = time.clock()
    arrlist = None
    r = 0
    CHUNK_SIZE_HINT = 4096 * 4 # seems to be good
    fid = file('data.txt')
    while 1:
        chunk = fid.readlines(CHUNK_SIZE_HINT)
        if not chunk: break
        reader = csv.reader(chunk)
        data = [ map(float, row) for row in reader ]
        arrlist = [ numpy.array(data,dtype=float), arrlist ]
        r += arrlist[0].shape[0]
        del data
        del reader
        del chunk
    print 'Created list of chunks, elapsed time so far: ', time.clock() - start
    print 'Joining list...'
    data = numpy.empty((r,arrlist[0].shape[1]),dtype=float)
    r1 = r
    while arrlist:
        r0 = r1 - arrlist[0].shape[0]
        data[r0:r1,:] = arrlist[0]
        r1 = r0
        del arrlist[0]
        arrlist = arrlist[0]
    print 'Elapsed time:', time.clock() - start

make_data()
read_test()

This can process a CSV file of 6 million rows in about 150 seconds on
my laptop. A CSV file of 1 million rows takes about 25 seconds.

Just reading the 6 million row CSV file ( using fid.readlines() ) takes
about 40 seconds on my laptop. Python lists are not particularly
efficient. You can probably reduce the time to ~60 seconds by writing a
new CSV reader for NumPy arrays in a C extension.
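
A Python 3 sketch of the same chunking idea follows. It is an adaptation under a few assumptions: chunks are counted in rows via `itertools.islice` instead of a byte hint to `readlines`, a plain list replaces the nested linked list, and the helper name `read_in_chunks` and the sample data are made up for illustration.

```python
import csv
import io
from itertools import islice

import numpy


def read_in_chunks(fileobj, chunk_rows=10000):
    """Parse CSV rows in blocks of chunk_rows, then join the blocks."""
    reader = csv.reader(fileobj)
    chunks = []
    while True:
        block = [[float(v) for v in row] for row in islice(reader, chunk_rows)]
        if not block:
            break
        # Only one block of Python floats is alive at a time; earlier
        # blocks are already stored compactly as NumPy arrays.
        chunks.append(numpy.array(block, dtype=float))
    if not chunks:
        return numpy.empty((0, 0), dtype=float)
    total = sum(c.shape[0] for c in chunks)
    data = numpy.empty((total, chunks[0].shape[1]), dtype=float)
    end = total
    while chunks:
        c = chunks.pop()               # release each block once it is copied
        data[end - c.shape[0]:end, :] = c
        end -= c.shape[0]
    return data


# Hypothetical in-memory file with 5 rows, read 2 rows at a time.
sample = io.StringIO(''.join('%d,%d\n' % (i, i * i) for i in range(5)))
result = read_in_chunks(sample, chunk_rows=2)
print(result.shape)
```

As in the version above, the final array is filled from the end so each chunk can be dropped as soon as it has been copied.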

oyekomova · Jan 14, 2007

Thank you so much. Your solution works! I greatly appreciate your
help.


Travis E. Oliphant · Jan 15, 2007

oyekomova said:
Thanks to everyone for their excellent suggestions. I was able to
achieve the following results with all your suggestions. However, I am
unable to get past a file size of 6 million rows. I would appreciate any
helpful suggestions on avoiding memory errors. None of the solutions
posted was able to cross this limit.

Did you try using numpy.fromfile ?

This will not require you to allocate more memory than needed. If you
specify a count, it will also not have to re-allocate memory in blocks
as the array size grows.

Its limitation is that it is not a very sophisticated CSV reader: it
only understands a single separator (plus line feeds, which are typically
seen as a separator).

-Travis

oyekomova · Jan 16, 2007

Travis-
Yes, I tried your suggestion, but found that it took longer to read a
large file. Thanks for your help.
