Memory error due to the huge input file size

tejsupra

Hello Everyone,

I need to read a .csv file which is 2.26 GB in size, and I wrote a
Python script that reads this file. My computer has 2 GB of RAM.
Please see the code below:

"""
This program has been developed to retrieve all the promoter sequences
for the specified list of genes in the given cluster

So, this program will act as a substitute to the whole EZRetrieve
system

Input arguments:

1) Cluster.txt or DowRatClust161718bwithDummy.txt
2) TransProCrossReferenceAndSequences.csv -> This is the file that has
all the promoter sequences
3) -2000
4) 500
"""

import time
import csv
import sys
import linecache
import re
from sets import Set
import gc

print time.localtime()

fileInputHandler = open(sys.argv[1],"r")
line = fileInputHandler.readline()

refSeqIDsinTransPro = []
promoterSequencesinTransPro = []
reader2 = csv.reader(open(sys.argv[2],"rb"))
reader2_list = []
reader2_list.extend(reader2)

for data2 in reader2_list:
    refSeqIDsinTransPro.append(data2[3])
for data2 in reader2_list:
    promoterSequencesinTransPro.append(data2[4])

while line:
    l = line.rstrip('\n')
    for j in range(1,len(refSeqIDsinTransPro)):
        found = re.search(l,refSeqIDsinTransPro[j])
        if found:
            """promoterSequencesinTransPro[j] """
            print l

    line = fileInputHandler.readline()


fileInputHandler.close()


The error that I got is given as follows:
Traceback (most recent call last):
  File "RefSeqsToPromoterSequences.py", line 31, in <module>
    reader2_list.extend(reader2)
MemoryError

I understand that the issue is a MemoryError and that it is caused by
the line reader2_list.extend(reader2). Is there an alternative way to
read the .csv file line by line?

sincerely,
Suprabhath
 

James Mills

refSeqIDsinTransPro = []
promoterSequencesinTransPro = []
reader2 = csv.reader(open(sys.argv[2],"rb"))
reader2_list = []
reader2_list.extend(reader2)

Without testing, this looks like you're reading the _ENTIRE_
input stream into memory! Try this:

def readCSV(file):
    if type(file) == str:
        fd = open(file, "rU")
    else:
        fd = file

    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(fd.readline())
    fd.seek(0)

    reader = csv.reader(fd, dialect)
    for line in reader:
        yield line

for line in readCSV(open("foo.csv", "r")):
    ...

--JamesMills
 

John Machin

import linecache
Why???

reader2 = csv.reader(open(sys.argv[2],"rb"))
reader2_list = []
reader2_list.extend(reader2)

for data2 in reader2_list:
   refSeqIDsinTransPro.append(data2[3])
for data2 in reader2_list:
   promoterSequencesinTransPro.append(data2[4])


All you need to do is replace the above by:

reader2 = csv.reader(open(sys.argv[2],"rb"))

for data2 in reader2:
    refSeqIDsinTransPro.append(data2[3])
    promoterSequencesinTransPro.append(data2[4])
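
One further note, not from the replies above: even with this change, the two
lists still end up holding columns 3 and 4 of the entire 2.26 GB file, which
may still not fit in 2 GB of RAM. A rough alternative sketch, assuming the
cluster file holds one RefSeq ID per line and that an exact match is acceptable
in place of the original re.search, is to load the small cluster file into a
set and stream the large CSV once, printing matches as they are found:

import csv
import sys

# Read the small cluster file (sys.argv[1]) into a set of RefSeq IDs.
wanted = set(l.strip() for l in open(sys.argv[1]) if l.strip())

# Stream the large CSV (sys.argv[2]) one row at a time; nothing is accumulated.
for row in csv.reader(open(sys.argv[2], "rb")):
    if row[3] in wanted:
        print row[3], row[4]  # RefSeq ID and its promoter sequence

With this approach, memory use is bounded by the size of the cluster file
rather than by the size of the CSV.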
 

tejsupra

Thanks a lot, James Mills. It worked.
 
