regd efficient methods to manipulate *large* files

Discussion in 'Python' started by Madhusudhanan Chandrasekaran, May 1, 2006.

  1. Hi:

    This question is not directed "entirely" at python only. But since
    I want to know how to do it in python, I am posting here.


    I am constructing a huge matrix (m x n), whose columns n are stored in
    smaller files. Once I read m such files, my matrix is complete. I want to
    pass this matrix as an input to another script of mine (I just have the
    binary.) Currently, the script reads a file (which is nothing but the
    matrix) and processes it. Is there any way of doing this in memory,
    without writing the matrix onto the disk?

    Since I have to repeat my experimentation for multiple iterations, it
    becomes expensive to write the matrix onto the disk.

    Thanks in advance. Help appreciated.

    -Madhu
     
    Madhusudhanan Chandrasekaran, May 1, 2006
    #1

  2. Dave Hughes

    Madhusudhanan Chandrasekaran wrote:

    > Hi:
    >
    > This question is not directed "entirely" at python only. But since
    > I want to know how to do it in python, I am posting here.
    >
    >
    > I am constructing a huge matrix (m x n), whose columns n are stored in
    > smaller files. Once I read m such files, my matrix is complete. I
    > want to pass this matrix as an input to another script of mine (I
    > just have the binary.) Currently, the script reads a file (which is
    > nothing but the matrix) and processes it. Is there any way of doing
    > this in memory, without writing the matrix onto the disk?
    >
    > Since I have to repeat my experimentation for multiple iterations, it
    > becomes expensive to write the matrix onto the disk.
    >
    > Thanks in advance. Help appreciated.
    >
    > -Madhu


    Basically, you're asking about inter-process communication (IPC), for
    which Python provides several interfaces to mechanisms provided by the
    operating system (whatever that may be). Here are a couple of commonly
    used methods:

    Redirected I/O

    Have a look at the popen functions in the os module, or better still
    the subprocess module (which is a higher level interface to the same
    functionality). Specifically, the "Replacing the shell pipe line"
    example in the subprocess module's documentation should be interesting:

    output=`dmesg | grep hda`
    ==>
    p1 = Popen(["dmesg"], stdout=PIPE)
    p2 = Popen(["grep", "hda"], stdin=p1.stdout, stdout=PIPE)
    output = p2.communicate()[0]

    Here, the stdout of the "dmesg" process has been redirected to the
    stdin of the "grep" process. You could do something similar with your
    two scripts: e.g., the first script simply writes the content of the
    matrix in some format to stdout (e.g. print, sys.stdout.write), while
    the second script reads the content of the matrix from stdin (e.g.
    raw_input, sys.stdin.read). Here are some brutally simplistic scripts
    that demonstrate the method:

    in.py
    =====
    #!/usr/bin/env python
    #
    # I read integers from stdin until I encounter 0

    import sys

    while True:
        i = int(sys.stdin.readline())
        print("Read %d from stdin" % i)
        if i == 0:
            break


    out.py
    ======
    #!/usr/bin/env python
    #
    # I write some numbers to stdout

    for i in [1, 2, 3, 4, 5, 0]:
        print(i)


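    run.py
    ======
    #!/usr/bin/env python
    #
    # I run out.py and in.py with a pipe between them, capture the
    # output of in.py and print it

    from subprocess import Popen, PIPE

    process1 = Popen(["./out.py"], stdout=PIPE)
    # universal_newlines=True makes communicate() return text, not bytes
    process2 = Popen(["./in.py"], stdin=process1.stdout, stdout=PIPE,
                     universal_newlines=True)
    # Close our copy of the pipe so out.py gets SIGPIPE if in.py exits early
    process1.stdout.close()
    output = process2.communicate()[0]

    print(output)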
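
    Applied to the matrix question, the same idea might look like the
    sketch below. It assumes the external binary can read the matrix from
    stdin, which the thread does not confirm. CONSUMER is a hypothetical
    stand-in for that binary (a tiny Python one-liner that sums each row),
    and send_matrix is an illustrative helper name, not an existing API.

```python
import subprocess
import sys

# Hypothetical stand-in for the external binary: it reads whitespace-separated
# numbers from stdin, one matrix row per line, and prints the sum of each row.
CONSUMER = [
    sys.executable, "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(sum(float(x) for x in line.split()))\n",
]


def send_matrix(rows, command=CONSUMER):
    """Send the rows to the child's stdin and return its stdout as text."""
    text = "".join(" ".join(str(v) for v in row) + "\n" for row in rows)
    proc = subprocess.Popen(command, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, universal_newlines=True)
    # communicate() writes the input, closes stdin and then reads stdout,
    # which avoids the deadlock a hand-written write/read loop can cause.
    output, _ = proc.communicate(text)
    return output


if __name__ == "__main__":
    matrix = [[1, 2, 3], [4, 5, 6]]
    print(send_matrix(matrix))
```

    The matrix never touches the disk here: it is built in memory and
    handed to the child process through the pipe.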


    Sockets

    Another form of IPC uses sockets to communicate between two processes
    (see the socket module or one of the higher level modules like
    SocketServer). Hence, the second process would listen on a port
    (presumably on the localhost interface, although there's no reason it
    couldn't listen on a LAN interface for example), and the first process
    connects to that port and sends the matrix data across it to the second
    process.
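
    A minimal sketch of the socket approach follows. To keep it
    self-contained, the "second process" is played by a thread in the same
    program; in practice it would be a separate process listening on the
    port. The function names receive_all and send_over_socket are
    illustrative only.

```python
import socket
import threading


def receive_all(server_sock, result):
    """Accept one connection and collect everything the peer sends."""
    conn, _ = server_sock.accept()
    chunks = []
    while True:
        data = conn.recv(4096)
        if not data:          # empty read: the sender closed its end
            break
        chunks.append(data)
    conn.close()
    result.append(b"".join(chunks))


def send_over_socket(payload):
    """Send payload bytes to a localhost listener; return what it received."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    received = []
    listener = threading.Thread(target=receive_all, args=(server, received))
    listener.start()

    client = socket.create_connection(("127.0.0.1", port))
    client.sendall(payload)
    client.close()                     # signals end of data to the receiver

    listener.join()
    server.close()
    return received[0]


if __name__ == "__main__":
    matrix_text = b"1 2 3\n4 5 6\n"
    print(send_over_socket(matrix_text) == matrix_text)
```

    Note that this only helps if the receiving program can be made to read
    from a socket, which is not the case for an unmodified binary that
    expects a file name.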


    Summary

    Given that your second script currently reads a file containing the
    complete matrix (if I understand your post correctly), it's probably
    easiest for you to use the Redirected I/O method (as it's very similar
    to reading a file, although there are some differences, and sometimes
    one must be careful about closing pipe ends to avoid deadlocks).
    However, the sockets method has the advantage that you can easily move
    one of the processes onto a different machine.

    There are other methods of IPC (for example, shared memory: see the
    mmap module). However, the two mentioned above are available on most
    platforms, whereas others may be specific to a given platform or have
    platform-specific subtleties (for example, mmap is only available on
    Windows and UNIX, and has a slightly different constructor on each).
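
    As a rough illustration of the mmap route, the sketch below packs a
    small matrix into an anonymous memory map and reads it back. The
    layout (two 32-bit integers for the dimensions, followed by the values
    as doubles) and the helper names are assumptions made for this
    example. Sharing the map with an unrelated process would additionally
    need a file-backed or named mapping, which is where the platform
    differences come in.

```python
import mmap
import struct


def write_matrix(buf, rows):
    """Pack a matrix into buf: two ints (rows, cols), then the doubles."""
    n_rows, n_cols = len(rows), len(rows[0])
    struct.pack_into("<ii", buf, 0, n_rows, n_cols)
    flat = [v for row in rows for v in row]
    struct.pack_into("<%dd" % len(flat), buf, 8, *flat)


def read_matrix(buf):
    """Unpack a matrix previously stored by write_matrix."""
    n_rows, n_cols = struct.unpack_from("<ii", buf, 0)
    flat = struct.unpack_from("<%dd" % (n_rows * n_cols), buf, 8)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


if __name__ == "__main__":
    shared = mmap.mmap(-1, 4096)       # -1: anonymous map, not tied to a file
    write_matrix(shared, [[1.0, 2.0], [3.0, 4.0]])
    print(read_matrix(shared))
    shared.close()
```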


    HTH,

    Dave.

    --
     
    Dave Hughes, May 1, 2006
    #2

  3. Paddy

    I take it that you have a binary that takes a file name and
    processes the file contents.
    Sometimes Unix binaries are written so that a file name of '-' (just a
    dash) causes them to take input from stdin, so that the piping mentioned
    in a previous reply could work.
    On some of our Unix systems /tmp is set up as a 'virtual disk'. It
    quacks like a normal disk filesystem but is actually implemented in
    RAM/virtual memory, and is faster than normal disk access.
    (Unfortunately we are not allowed to save multi-gigabyte files there as
    it affects other aspects of the OS).

    Maybe you can mount a similar filesystem if you have the RAM.
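
    A small sketch of that idea, assuming a Linux-style system where
    /dev/shm is a RAM-backed filesystem (that path is an assumption, not
    something stated in this thread). If it is missing or not writable,
    the code falls back to the ordinary temp directory. The helper names
    are illustrative.

```python
import os
import tempfile


def ram_backed_dir(candidates=("/dev/shm",)):
    """Return the first writable candidate, else the default temp dir."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return tempfile.gettempdir()


def write_matrix_file(rows, directory=None):
    """Write the matrix as text, one row per line, and return the path."""
    directory = directory or ram_backed_dir()
    fd, path = tempfile.mkstemp(suffix=".txt", dir=directory)
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(" ".join(str(v) for v in row) + "\n")
    return path


if __name__ == "__main__":
    path = write_matrix_file([[1, 2, 3], [4, 5, 6]])
    print(path)        # hand this path to the binary as its input file
    os.remove(path)    # clean up once the binary has finished
```

    The binary still sees an ordinary file name, so it needs no changes;
    only the storage behind the file differs.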

    -- Pad.
     
    Paddy, May 1, 2006
    #3
