Memory error due to the huge/huge input file size

Discussion in 'Python' started by tejsupra@gmail.com, Nov 10, 2008.

  1. Guest

    Hello Everyone,

    I need to read a .csv file that is 2.26 GB in size, and I wrote a
    Python script to read it. My computer has 2 GB of RAM. Please see
    the code below:

    """
    This program has been developed to retrieve all the promoter sequences
    for the specified list of genes in the given cluster.

    So, this program will act as a substitute for the whole EZRetrieve
    system.

    Input arguments:

    1) Cluster.txt or DowRatClust161718bwithDummy.txt
    2) TransProCrossReferenceAndSequences.csv -> This is the file that has
    all the promoter sequences
    3) -2000
    4) 500
    """

    import time
    import csv
    import sys
    import linecache
    import re
    from sets import Set
    import gc

    print time.localtime()

    fileInputHandler = open(sys.argv[1],"r")
    line = fileInputHandler.readline()

    refSeqIDsinTransPro = []
    promoterSequencesinTransPro = []
    reader2 = csv.reader(open(sys.argv[2],"rb"))
    reader2_list = []
    reader2_list.extend(reader2)

    for data2 in reader2_list:
        refSeqIDsinTransPro.append(data2[3])
    for data2 in reader2_list:
        promoterSequencesinTransPro.append(data2[4])

    while line:
        l = line.rstrip('\n')
        for j in range(1,len(refSeqIDsinTransPro)):
            found = re.search(l,refSeqIDsinTransPro[j])
            if found:
                """promoterSequencesinTransPro[j] """
                print l

        line = fileInputHandler.readline()


    fileInputHandler.close()


    The error that I got is given as follows:
    Traceback (most recent call last):
    File "RefSeqsToPromoterSequences.py", line 31, in <module>
    reader2_list.extend(reader2)
    MemoryError

    I understand that the problem is a MemoryError, caused by the line
    reader2_list.extend(reader2). Is there an alternative way to read the
    .csv file line by line?

    sincerely,
    Suprabhath
     
    , Nov 10, 2008
    #1
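    A minimal sketch of the streaming alternative, with illustrative
    names and data: csv.reader is itself an iterator, so its rows can be
    consumed one at a time without ever building a list of the whole file.

    ```python
    import csv
    import io

    # Illustrative stand-in for the 2.26 GB file: RefSeq ID in column 3,
    # promoter sequence in column 4 (0-based, as in the script above).
    sample = u"a,b,c,NM_0001,SEQ1\na,b,c,NM_0002,SEQ2\n"

    ref_ids = []
    sequences = []
    # Iterating the reader directly streams one row at a time; no
    # reader2_list.extend(reader2) call forces the whole file into memory.
    for row in csv.reader(io.StringIO(sample)):
        ref_ids.append(row[3])
        sequences.append(row[4])

    print(ref_ids)    # ['NM_0001', 'NM_0002']
    print(sequences)  # ['SEQ1', 'SEQ2']
    ```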

  2. James Mills Guest

    On Tue, Nov 11, 2008 at 7:47 AM, <> wrote:
    > refSeqIDsinTransPro = []
    > promoterSequencesinTransPro = []
    > reader2 = csv.reader(open(sys.argv[2],"rb"))
    > reader2_list = []
    > reader2_list.extend(reader2)


    Without testing, this looks like you're reading the _ENTIRE_
    input stream into memory! Try this:

    def readCSV(file):
        if type(file) == str:
            fd = open(file, "rU")
        else:
            fd = file

        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(fd.readline())
        fd.seek(0)

        reader = csv.reader(fd, dialect)
        for line in reader:
            yield line

    for line in readCSV(open("foo.csv", "r")):
    ...

    --JamesMills

    --
    --
    -- "Problems are solved by method"
     
    James Mills, Nov 10, 2008
    #2

  3. John Machin Guest

    On Nov 11, 8:47 am, wrote:

    > import linecache


    Why???

    > reader2 = csv.reader(open(sys.argv[2],"rb"))
    > reader2_list = []
    > reader2_list.extend(reader2)
    >
    > for data2 in reader2_list:
    >    refSeqIDsinTransPro.append(data2[3])
    > for data2 in reader2_list:
    >    promoterSequencesinTransPro.append(data2[4])



    All you need to do is replace the above with:

    reader2 = csv.reader(open(sys.argv[2],"rb"))

    for data2 in reader2:
        refSeqIDsinTransPro.append(data2[3])
        promoterSequencesinTransPro.append(data2[4])
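    If exact RefSeq-ID matches are enough (an assumption here; the original
    script uses re.search), memory can be kept proportional to the small
    cluster file by loading its IDs into a set and streaming the big CSV
    once, instead of holding both columns in RAM. A sketch with illustrative
    data:

    ```python
    import csv
    import io

    # Illustrative stand-ins: the small cluster file's lines, and the big
    # CSV with the RefSeq ID in column 3 and the sequence in column 4.
    query_lines = [u"NM_0002\n", u"NM_9999\n"]
    big_csv = io.StringIO(u"a,b,c,NM_0001,SEQ1\na,b,c,NM_0002,SEQ2\n")

    # Only the (small) set of query IDs is held in memory.
    wanted = set(line.rstrip('\n') for line in query_lines)

    # Single streaming pass over the big file; each row is examined and
    # then discarded, so memory use stays flat.
    hits = []
    for row in csv.reader(big_csv):
        if row[3] in wanted:
            hits.append((row[3], row[4]))

    print(hits)  # [('NM_0002', 'SEQ2')]
    ```

    A set membership test is constant time per row, which also removes the
    nested loop over every ID for every input line.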
     
    John Machin, Nov 10, 2008
    #3
  4. Guest

    On Nov 10, 4:47 pm, wrote:
    > [snip: original post quoted in full]


    Thanks a lot, James Mills. It worked.
     
    , Nov 20, 2008
    #4
