Python: ASCII read

Discussion in 'Python' started by Sebastian Krause, Sep 16, 2004.

  1. Hello,

    I tried to read in some large ASCII files (200MB-2GB) in Python using
    scipy.io.read_array, but it did not work as I expected. The whole idea
    was to find a fast Python routine to read in arbitrary ASCII files, to
    replace Yorick (which I use right now and which is really fast, but not
    as general as Python). The problem with scipy.io.read_array was that it
    is really slow, returns errors when trying to process large files, and
    it also changes (cuts) the files (after scipy.io.read_array processed a
    2GB file, its size was only 64MB).

    Can someone give me a hint on how to use Python to do this job
    correctly and fast? (Maybe with another read-in routine.)

    Thanks.

    Greetings,
    Sebastian
    Sebastian Krause, Sep 16, 2004
    #1

  2. Sebastian Krause <> wrote:

    > Hello,
    >
    > I tried to read in some large ASCII files (200MB-2GB) in Python using
    > scipy.io.read_array, but it did not work as I expected. The whole idea
    > was to find a fast Python routine to read in arbitrary ASCII files, to
    > replace Yorick (which I use right now and which is really fast, but not
    > as general as Python). The problem with scipy.io.read_array was that it
    > is really slow, returns errors when trying to process large files, and
    > it also changes (cuts) the files (after scipy.io.read_array processed a
    > 2GB file, its size was only 64MB).
    >
    > Can someone give me a hint on how to use Python to do this job
    > correctly and fast? (Maybe with another read-in routine.)


    If all you need is what you say -- read a huge amount of ASCII data into
    memory -- it's hard to beat
    data = open('thefile.txt').read()

    mmap may in fact be preferable for many uses, but it doesn't actually
    read (it _maps_ the file into memory instead).
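
    For the record, mapping instead of reading looks roughly like this (a
    minimal sketch; access=mmap.ACCESS_READ keeps the mapping read-only and
    works on both Unix and Windows, and a length of 0 means "map the whole
    file"):

    import mmap

    f = open('thefile.txt', 'rb')
    data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # 'data' can be sliced and searched much like a huge read-only string,
    # but pages are only read from disk as they are actually touched.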


    Alex
    Alex Martelli, Sep 16, 2004
    #2

  3. Sebastian Krause wrote:

    > Hello,
    >
    > I tried to read in some large ASCII files (200MB-2GB) in Python using
    > scipy.io.read_array, but it did not work as I expected. The whole idea
    > was to find a fast Python routine to read in arbitrary ASCII files, to
    > replace Yorick (which I use right now and which is really fast, but not
    > as general as Python). The problem with scipy.io.read_array was that it
    > is really slow, returns errors when trying to process large files, and
    > it also changes (cuts) the files (after scipy.io.read_array processed a
    > 2GB file, its size was only 64MB).
    >
    > Can someone give me a hint on how to use Python to do this job
    > correctly and fast? (Maybe with another read-in routine.)


    What kind of data is it? What operations do you want to perform on the
    data? What platform are you on?

    Some of the scipy.io.read_array behavior that you see looks like bugs. We
    would greatly appreciate it if you were to send a complete bug report to
    the scipy-dev mailing list. Thank you.

    --
    Robert Kern


    "In the fields of hell where the grass grows high
    Are the graves of dreams allowed to die."
    -- Richard Harter
    Robert Kern, Sep 16, 2004
    #3
  4. I did not explicitly mention that the ASCII file should be read in as an
    array of numbers (either integer or float).
    Using open() and read() is very fast, but it only reads in the data as a
    string, and it also does not work with large files.

    Sebastian

    Alex Martelli wrote:
    > Sebastian Krause <> wrote:
    >
    >
    >>Hello,
    >>
    >>I tried to read in some large ASCII files (200MB-2GB) in Python using
    >>scipy.io.read_array, but it did not work as I expected. The whole idea
    >>was to find a fast Python routine to read in arbitrary ASCII files, to
    >>replace Yorick (which I use right now and which is really fast, but not
    >>as general as Python). The problem with scipy.io.read_array was that it
    >>is really slow, returns errors when trying to process large files, and
    >>it also changes (cuts) the files (after scipy.io.read_array processed a
    >>2GB file, its size was only 64MB).
    >>
    >>Can someone give me a hint on how to use Python to do this job correctly
    >>and fast? (Maybe with another read-in routine.)

    >
    >
    > If all you need is what you say -- read a huge amount of ASCII data into
    > memory -- it's hard to beat
    > data = open('thefile.txt').read()
    >
    > mmap may in fact be preferable for many uses, but it doesn't actually
    > read (it _maps_ the file into memory instead).
    >
    >
    > Alex
    Sebastian Krause, Sep 16, 2004
    #4
  5. The input data is a large ASCII file of astrophysical parameters
    (integer and float) from gas dynamics calculations. They should be read
    in as an array of integer and float numbers, not as a string (as open()
    and read() do). Then the array is used to make different plots from the
    data and to do some (simple) operations: subtraction and division of
    columns. I am using SciPy with Python 2.3.x under Linux (SuSE 9.1).

    Sebastian

    Robert Kern wrote:
    > Sebastian Krause wrote:
    >
    >> Hello,
    >>
    >> I tried to read in some large ASCII files (200MB-2GB) in Python using
    >> scipy.io.read_array, but it did not work as I expected. The whole idea
    >> was to find a fast Python routine to read in arbitrary ASCII files, to
    >> replace Yorick (which I use right now and which is really fast, but
    >> not as general as Python). The problem with scipy.io.read_array was
    >> that it is really slow, returns errors when trying to process large
    >> files, and it also changes (cuts) the files (after scipy.io.read_array
    >> processed a 2GB file, its size was only 64MB).
    >>
    >> Can someone give me a hint on how to use Python to do this job
    >> correctly and fast? (Maybe with another read-in routine.)

    >
    >
    > What kind of data is it? What operations do you want to perform on the
    > data? What platform are you on?
    >
    > Some of the scipy.io.read_array behavior that you see looks like bugs. We
    > would greatly appreciate it if you were to send a complete bug report to
    > the scipy-dev mailing list. Thank you.
    >
    Sebastian Krause, Sep 16, 2004
    #5
  6. Sebastian Krause <> wrote:

    > I did not explicitly mention that the ASCII file should be read in as an
    > array of numbers (either integer or float).


    Ah, right, you didn't. So I was answering the literal question you
    asked rather than the one you had in mind.

    > Using open() and read() is very fast, but it only reads in the data as a
    > string, and it also does not work with large files.


    It works just fine with files as large as you have memory for (and mmap
    works for files as large as you have _spare address space_ for, if your
    OS is decently good at its job). But if what you want is not the job
    that .read() and mmap do, the fact that they _do_ perform that job quite
    well on large files is of course of no use to you.

    Back to why scipy.io.read_array works so badly for you -- I don't know;
    it's rather complicated code, as well as maybe old-ish (it wraps files
    into class instances to be able to iterate on their lines) and very
    general (lots of options regarding what the separators are, etc.). If your
    needs are very specific (you know a lot about the format of those huge
    files -- e.g. they're column-oriented, or only use whitespace separators
    and \n line termination, or other such specifics) you might well be able
    to do better -- likely even in Python, worst case in C. I assume you
    need Numeric arrays, 2-d, specifically, as the result of reading your
    files? Would you know in advance whether you're reading int or float
    (it might be faster to have two separate functions)? Could you
    pre-dimension the Numeric array and pass it in, or do you need it to
    dimension itself dynamically based on file contents? The less
    flexibility you need, the simpler and faster the reading can be...
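
    For instance, if the files turn out to be plain whitespace-separated
    floats, one row per line, a minimal pure-Python reader might look like
    this (an untested sketch; it assumes every row has the same number of
    columns and that the result fits in memory):

    import Numeric

    def read_float_table(filename):
        # Assumes whitespace-separated columns, '\n' line ends and no
        # comment lines; builds a list of rows, converts once at the end.
        rows = []
        append = rows.append
        for line in open(filename):
            append(map(float, line.split()))
        return Numeric.array(rows, Numeric.Float)

    Once the data is a 2-d Numeric array, simple column arithmetic is just
    slicing (the column indices here are made up for illustration):

    data = read_float_table('thefile.txt')
    diff = data[:,0] - data[:,1]
    ratio = data[:,2] / data[:,3]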


    Alex
    Alex Martelli, Sep 16, 2004
    #6
  7. Sebastian Krause wrote:

    > The input data is a large ASCII file of astrophysical parameters
    > (integer and float) from gas dynamics calculations. They should be read
    > in as an array of integer and float numbers, not as a string (as open()
    > and read() do). Then the array is used to make different plots from the
    > data and to do some (simple) operations: subtraction and division of
    > columns. I am using SciPy with Python 2.3.x under Linux (SuSE 9.1).


    Well, one option is to use the "lines" argument to scipy.io.read_array
    to read in only a chunk at a time. It probably won't help speed any, but
    hopefully it will be correct.
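
    For example, something like the following (just a sketch; the
    (start, stop) tuple is only a guess at the form the "lines" argument
    takes, so check the read_array docstring):

    import scipy.io

    # Hypothetical chunked read: pull in only the first 100000 lines, then
    # repeat with lines=(100000, 200000) and so on, processing each chunk
    # as it arrives instead of loading the whole 2GB file at once.
    chunk = scipy.io.read_array('thefile.txt', lines=(0, 100000))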

    > Sebastian


    --
    Robert Kern


    "In the fields of hell where the grass grows high
    Are the graves of dreams allowed to die."
    -- Richard Harter
    Robert Kern, Sep 16, 2004
    #7
  8. (Alex Martelli) writes:

    > If your needs are very specific (you know a lot about the format of
    > those huge files -- e.g. they're column-oriented, or only use
    > whitespace separators and \n line termination, or other such
    > specifics) you might well be able to do better -- likely even in
    > Python, worst case in C. I assume you need Numeric arrays, 2-d,
    > specifically, as the result of reading your files? Would you know
    > in advance whether you're reading int or float (it might be faster
    > to have two separate functions)? Could you pre-dimension the
    > Numeric array and pass it in, or do you need it to dimension itself
    > dynamically based on file contents? The less flexibility you need,
    > the simpler and faster the reading can be...


    The last time I wanted to be able to read large lumps of numerical
    data from an ASCII file, I ended up using (f)lex, for performance
    reasons. (Pure C _might_ have been faster still, of course, but it
    would _quite certainly_ also have been pure C.)

    This has caused minor irritation - the code has been in use through
    several upgrades of Python, and it is considered polite to recompile
    to match the current C API - but I'd probably do it the same way again
    in the same situation.

    Des
    --
    "[T]he structural trend in linguistics which took root with the
    International Congresses of the twenties and early thirties [...] had
    close and effective connections with phenomenology in its Husserlian
    and Hegelian versions." -- Roman Jakobson
    Des Small, Sep 16, 2004
    #8
  9. Alex Martelli said unto the world upon 2004-09-16 07:22:
    > Sebastian Krause <> wrote:
    >
    >
    >>Hello,
    >>
    >>I tried to read in some large ASCII files (200MB-2GB) in Python using
    >>scipy.io.read_array, but it did not work as I expected. The whole idea
    >>was to find a fast Python routine to read in arbitrary ASCII files, to
    >>replace Yorick (which I use right now and which is really fast, but not
    >>as general as Python). The problem with scipy.io.read_array was that it
    >>is really slow, returns errors when trying to process large files, and
    >>it also changes (cuts) the files (after scipy.io.read_array processed a
    >>2GB file, its size was only 64MB).
    >>
    >>Can someone give me a hint on how to use Python to do this job correctly
    >>and fast? (Maybe with another read-in routine.)

    >
    >
    > If all you need is what you say -- read a huge amount of ASCII data into
    > memory -- it's hard to beat
    > data = open('thefile.txt').read()
    >
    > mmap may in fact be preferable for many uses, but it doesn't actually
    > read (it _maps_ the file into memory instead).
    >
    >
    > Alex


    Hi all,

    [neophyte question warning]

    I'd not been aware of mmap until this post. Looking at the Library
    Reference and my trusty copy of Python in a Nutshell, I've gotten some
    idea of the differences between using mmap and the .read() method on a
    file object -- such as that it returns a mutable object vs. an immutable
    string, the constraint on slice assignment that len(oldslice) must equal
    len(newslice), etc.

    But I don't really feel I've a handle on the significance of saying it
    maps the file into memory versus reading the file. The naive thought is
    that since the data gets into memory, the file must be read. But this
    makes me sure I'm missing a distinction in the terminology. Explanations
    and pointers for what to read gratefully received.

    And, since mmap behaves differently on different platforms: I'm mostly a
    win32 user looking to transition to Linux.

    Best to all,

    Brian vdB
    Brian van den Broek, Sep 16, 2004
    #9
  10. On Thursday, 16 September 2004 at 17:56, Brian van den Broek wrote:
    > But I don't really feel I've a handle on the significance of saying it
    > maps the file into memory versus reading the file. The naive thought is
    > that since the data gets into memory, the file must be read. But this
    > makes me sure I'm missing a distinction in the terminology. Explanations
    > and pointers for what to read gratefully received.


    read()ing a file into memory does what it says; it reads the binary data from
    the disk all at once, and allocates main memory (as needed) to fit all the
    data there. Memory mapping a file (or device or whatever) means that the
    virtual memory architecture is involved. What happens here:

    mmapping a file creates virtual memory pages (just like virtual memory which
    is put into your paging file), which are registered with the MMU of the
    processor as being absent initially.

    Now, when the program tries to access the memory page (pages have some
    fixed small size, like 4 kB on most Pentium-style computers), a (page)
    fault is generated by the MMU, which invokes the operating system's
    handler for page faults. Now that the operating system sees that a certain
    page is accessed (from the page address it can deduce the offset in the
    file that you're trying to access), it loads the corresponding page from
    disk, puts it into memory at some position, and marks the page table entry
    as present.

    Future accesses to the page then happen immediately, without another page
    fault.

    Changes in memory are written to disk once the page is flushed (meaning that
    it gets removed from main memory because there are too few pages of real
    main memory available). Now, when a page is forcefully flushed (not due to
    closing the mmap), the operating system marks the page table entry as absent
    again, and the next time the program tries to access this location, a page
    fault again takes place, and the OS can load the page from disk.

    For speed, the operating system allows you to mmap read-only, which means that
    once a page is discarded, it does not need to be written back to disk (which
    of course is faster). Some MMUs (IIRC not the Pentium-class MMU) set a dirty
    bit on the page-table entry once the page has been altered; this can also be
    used to control whether the page needs to be written back to disk after
    access.

    So, basically what you get is load-on-demand file handling, which is similar
    to what the paging file (virtual memory file) on win32 does for ordinary
    memory. Actually, internally, the architecture to handle mmapped files and
    virtual memory is the same, and you could think of the swap file as an
    operating system mmapped file, from which programs can allocate slices
    through some OS calls (well, actually through the normal malloc/calloc
    calls).

    HTH!

    Heiko.
    Heiko Wundram, Sep 16, 2004
    #10
  11. Heiko Wundram said unto the world upon 2004-09-16 12:56:
    > On Thursday, 16 September 2004 at 17:56, Brian van den Broek wrote:
    >
    >>But I don't really feel I've a handle on the significance of saying it
    >>maps the file into memory versus reading the file. The naive thought is
    >>that since the data gets into memory, the file must be read. But this
    >>makes me sure I'm missing a distinction in the terminology. Explanations
    >>and pointers for what to read gratefully received.

    >
    >
    > read()ing a file into memory does what it says; it reads the binary data from
    > the disk all at once, and allocates main memory (as needed) to fit all the
    > data there. Memory mapping a file (or device or whatever) means that the
    > virtual memory architecture is involved. What happens here:
    >


    <Much helpful detail SNIPed>


    >
    > HTH!
    >
    > Heiko.


    Thanks a lot for the detailed account, Heiko.

    Best,

    Brian vdB
    Brian van den Broek, Sep 16, 2004
    #11
  12. Brian van den Broek wrote:

    > But I don't really feel I've a handle on the significance of saying it
    > maps the file into memory versus reading the file. The naive thought is
    > that since the data gets into memory, the file must be read. But this
    > makes me sure I'm missing a distinction in the terminology. Explanations
    > and pointers for what to read gratefully received.


    Eventually the file is read, of course (or at least parts thereof). Mmap
    is a feature of the virtual memory system in modern operating systems,
    so you need a basic understanding of virtual memory in order to
    understand mmap. All details can be found e.g. in Modern Operating
    Systems by Andrew Tanenbaum.
    http://mirrors.kernel.org/LDP/LDP/tlk/tlk.html does a good job of
    explaining how Linux handles it, but I'll try to explain the general
    basics briefly here.

    With virtual memory systems, the addresses that are used by application
    programs don't refer directly to memory locations. Instead the addresses
    are split into two parts: the first part is a page number, the second is
    the offset of the memory location in the page. The system keeps a list
    of all pages. When an address is referenced, the page is looked up in
    that list (Pages are blocks of memory, typically 4-8 kB). There are two
    possibilities:
    - The page is already in memory. In that case, the list contains the
    real physical address of the page in memory. That address is combined
    with the offset to form the physical address of the memory location.
    - The page is not in memory. The virtual memory system loads it in
    memory and stores the physical address in the list. Processing then
    continues as in the other case. Note that it may be necessary to remove
    another page from memory in order to load a new one; in that case, the
    other page is paged to disk if it is still needed so that it can be read
    again later.

    This behind-the-scenes translation and paging to and from disk is what
    allows modern operating systems to use much more memory than what's
    physically available in the system.

    mmap creates an entry in the list that says the page is not in memory,
    but tells the system what file to load it from: a range of addresses is
    'mapped' to the data in the file. It also returns the logical address of
    the data. When an address in the range is referenced, the virtual memory
    system loads the appropriate page from disk (or possibly more than one
    page at a time, for efficiency reasons) into memory and stores its
    (their) location in the list. An application program can access it exactly
    the same way as any other part of memory.

    > And, since mmap behaves differently on different platforms: I'm mostly a
    > win32 user looking to transition to Linux.


    I think Python hides much of the differences between the Windows and
    Unix implementations of mmap (Windows doesn't really have mmap; instead
    you use CreateFileMapping and MapViewOfFile).

    --
    "Codito ergo sum"
    Roel Schroeven
    Roel Schroeven, Sep 16, 2004
    #12
