Sorting Large File (Code/Performance)

Discussion in 'Python' started by Ira.Kovac@gmail.com, Jan 24, 2008.

  1. Guest

    Hello all,

    I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
    to sort based on first two characters.

    I'd greatly appreciate if someone can post sample code that can help
    me do this.

    Also, any ideas on approximately how long is the sort process going to
    take (XP, Dual Core 2.0GHz w/2GB RAM).

    Cheers,

    Ira
     
    , Jan 24, 2008
    #1
    1. Advertising

  2. Paul Rubin Guest

    writes:
    > I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
    > to sort based on first two characters.
    >
    > I'd greatly appreciate if someone can post sample code that can help
    > me do this.


    Use the unix sort command:

    sort inputfile -o outputfile

    I think there is a cygwin port.

    > Also, any ideas on approximately how long is the sort process going to
    > take (XP, Dual Core 2.0GHz w/2GB RAM).


    Eh, unix sort would probably take a while, somewhere between 15
    minutes and an hour. If you only have to do it once it's not worth
    writing special purpose code. If you have to do it a lot, get some
    more ram for that box, suck the file into memory and do a radix sort.
     
    Paul Rubin, Jan 24, 2008
    #2
    1. Advertising

  3. John Nagle Guest

    wrote:
    > Hello all,
    >
    > I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
    > to sort based on first two characters.


    Given those numbers, the average number of characters per line is
    less than 2. Please check.

    John Nagle
     
    John Nagle, Jan 24, 2008
    #3
  4. John Machin Guest

    On Jan 25, 6:18 am, wrote:
    > Hello all,
    >
    > I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
    > to sort based on first two characters.


    If you mean 1.6 American billion i.e. 1.6 * 1000 ** 3 lines, and 2 *
    1024 ** 3 bytes of data, that's 1.34 bytes per line. If you mean other
    definitions of "billion" and/or "GB", the result is even fewer bytes
    per line.

    What is a "Unicode text file"? How is it encoded: utf8, utf16,
    utf16le, utf16be, ??? If you don't know, do this:

    print repr(open('the_file', 'rb').read(100))

    and show us the results.

    What does "based on [the] first two characters" mean? Do you mean raw
    order based on the ordinal of each character i.e. no fancy language-
    specific collating sequence? Do the first two characters always belong
    to the ASCII subset?

    You'd like to sort a large file? Why? Sorting a file is just a means
    to an end, and often another means is more appropriate. What are you
    going to do with it after it's sorted?

    > I'd greatly appreciate if someone can post sample code that can help
    > me do this.


    I'm sure you would. However it would benefit you even more if instead
    of sitting on the beach next to the big arrow pointing to the drop
    zone, you were to read the manual and work out how to do it yourself.
    Here's a start: http://docs.python.org/lib/typesseq-mutable.html

    > Also, any ideas on approximately how long is the sort process going to
    > take (XP, Dual Core 2.0GHz w/2GB RAM).


    If you really have a 2GB file and only 2GB of RAM, I suggest that you
    don't hold your breath.

    Instead of writing Python code, you are probably better off doing an
    external sort. You might consider looking for a Windows port of a
    Unicode-capable Unix sort utility. Google "GnuWin32" and see if their
    sort does what you want.
     
    John Machin, Jan 24, 2008
    #4
  5. Guest

    Thanks to all who replied. It's very appreciated.

    Yes, I had to doublecheck line counts and the number of lines is ~16
    million (insetead of stated 1.6B).

    Also:

    >What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:

    The file is UTF-8

    > Do the first two characters always belong to the ASCII subset?

    Yes, first two always belong to ASCII subset

    > What are you going to do with it after it's sorted?

    I need to isolate all lines that start with two characters (zz to be
    particular)

    > Here's a start: http://docs.python.org/lib/typesseq-mutable.html
    > Google "GnuWin32" and see if their sort does what you want.

    Will do, thanks for the tip.

    > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.

    I am limited with resources. Unfortunately.

    Cheers,

    Ira
     
    , Jan 24, 2008
    #5
  6. wrote:
    >> What are you going to do with it after it's sorted?

    > I need to isolate all lines that start with two characters (zz to be
    > particular)


    "Isolate" as in "extract"? Remove the rest?

    Then why don't you extract the lines first, without sorting the file? (or sort
    it afterwards if you still need to). That would heavily cut down your memory
    footprint.

    Stefan
     
    Stefan Behnel, Jan 24, 2008
    #6
  7. Stefan Behnel wrote:
    > wrote:
    >>> What are you going to do with it after it's sorted?

    >> I need to isolate all lines that start with two characters (zz to be
    >> particular)

    >
    > "Isolate" as in "extract"? Remove the rest?
    >
    > Then why don't you extract the lines first, without sorting the file? (or sort
    > it afterwards if you still need to). That would heavily cut down your memory
    > footprint.


    Just for fun, this is what I meant:

    for utf8_line in open(filename, 'rb'):
    if utf8_line.startswith('zz'):
    print utf8_line

    Stefan
     
    Stefan Behnel, Jan 24, 2008
    #7
  8. John Machin Guest

    On Jan 25, 8:26 am, wrote:

    > I need to isolate all lines that start with two characters (zz to be
    > particular)


    What does "isolate" mean to you? What does this have to do with
    sorting? What do you actually want to do with (a) the lines starting
    with "zz" (b) the other lines? What percentage of the lines start with
    "zz"?

    When looking at the GnuWin32 collection:
    (a) "Unicode" is not relevant to your sort problem.
    (b) grab yourself a wc and a grep while you're there -- they will help
    with "how many lines" and "what percentage of lines" questions.
     
    John Machin, Jan 24, 2008
    #8
  9. On Thursday 24 January 2008 20:56 John Nagle wrote:

    > wrote:
    >> Hello all,
    >>
    >> I have an Unicode text file with 1.6 billon lines (~2GB) that I'd like
    >> to sort based on first two characters.

    >
    > Given those numbers, the average number of characters per line is
    > less than 2. Please check.


    which would be true if 1.599.999.999 had 2 chars and the rest of the lines
    just one :)

    (but yes that would be an interesting question how to sort a 1 character
    line based on the first 2 of that line)

    martin





    --
    http://noneisyours.marcher.name
    http://feeds.feedburner.com/NoneIsYours

    You are not free to read this message,
    by doing so, you have violated my licence
    and are required to urinate publicly. Thank you.
     
    Martin Marcher, Jan 24, 2008
    #9
  10. John Nagle Guest

    wrote:
    > Thanks to all who replied. It's very appreciated.
    >
    > Yes, I had to double check line counts and the number of lines is ~16
    > million (instead of stated 1.6B).


    OK, that's not bad at all.

    You have a few options:

    - Get enough memory to do the sort with an in-memory sort, like UNIX "sort"
    or Python's "sort" function.
    - Thrash; in-memory sorts do very badly with virtual memory, but eventually
    they finish. Might take many hours.
    - Get a serious disk-to-disk sort program. (See "http://www.ordinal.com/".
    There's a free 30-day trial. It can probably sort your data
    in about a minute.)
    - Load the data into a database like MySQL and let it do the work.
    This is slow if done wrong, but OK if done right.
    - Write a distribution sort yourself. Fan out the incoming file into
    one file for each first letter, sort each subfile, merge the
    results.

    With DRAM at $64 for 4GB, I'd suggest just getting more memory and using
    a standard in-memory sort.

    John Nagle
     
    John Nagle, Jan 25, 2008
    #10
  11. Paul Rubin Guest

    John Nagle <> writes:
    > - Get enough memory to do the sort with an in-memory sort, like
    > UNIX "sort" or Python's "sort" function.


    Unix sort does external sorting when needed.
     
    Paul Rubin, Jan 25, 2008
    #11
  12. John Nagle Guest

    Paul Rubin wrote:
    > John Nagle <> writes:
    >> - Get enough memory to do the sort with an in-memory sort, like
    >> UNIX "sort" or Python's "sort" function.

    >
    > Unix sort does external sorting when needed.


    Ah, someone finally put that in. Good. I hadn't looked at "sort"'s manual
    page in many years.

    John Nagle
     
    John Nagle, Jan 25, 2008
    #12
  13. Paul Rubin Guest

    John Nagle <> writes:
    > > Unix sort does external sorting when needed.

    >
    > Ah, someone finally put that in. Good. I hadn't looked at
    > "sort"'s manual page in many years.


    Huh? It has been like that from the beginning. It HAD to be. Unix
    was originally written on a PDP-11. The GNU version has also always
    been external.
     
    Paul Rubin, Jan 25, 2008
    #13
  14. Asim Guest

    On Jan 24, 4:26 pm, wrote:
    > Thanks to all who replied. It's very appreciated.
    >
    > Yes, I had to doublecheck line counts and the number of lines is ~16
    > million (insetead of stated 1.6B).
    >
    > Also:
    >
    > >What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:

    >
    > The file is UTF-8
    >
    > > Do the first two characters always belong to the ASCII subset?

    >
    > Yes, first two always belong to ASCII subset
    >
    > > What are you going to do with it after it's sorted?

    >
    > I need to isolate all lines that start with two characters (zz to be
    > particular)
    >
    > > Here's a start:http://docs.python.org/lib/typesseq-mutable.html
    > > Google "GnuWin32" and see if their sort does what you want.

    >
    > Will do, thanks for the tip.
    >
    > > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.

    >
    > I am limited with resources. Unfortunately.
    >


    Since the OP has stated that they are running Windows XP, and more
    than one poster has suggested installing more RAM in the box, I
    thought people should know that WinXP has certain limitations on the
    amount of memory that may be used:

    http://msdn2.microsoft.com/en-us/library/aa366778.aspx

    Firstly, the maximum amount of physical memory that may be installed
    is 4GB. Secondly, with the "4 gigabyte tuning" and
    "IMAGE_FILE_LARGE_ADDRESS_AWARE" patches, the maximum amount of
    virtual memory (phyical memory + swapfile size) that may be assigned
    to user processes is 2GB.

    Hence, even if you made a 100GB swap file with 4GB RAM installed, by
    default only a maximum of 2GB would ever be assigned to a user-
    process. With the two flags enabled, the maximum becomes 3GB.

    If the OP finds performance to be limited and thinks more RAM would
    help trying a later version of Windows would be a start, but better
    would be to try Linux or Mac OSX out.

    Cheers,
    Asim


    > Cheers,
    >
    > Ira
     
    Asim, Jan 25, 2008
    #14
  15. Asim Guest

    On Jan 25, 9:23 am, Asim <> wrote:
    > On Jan 24, 4:26 pm, wrote:
    >
    >
    >
    > > Thanks to all who replied. It's very appreciated.

    >
    > > Yes, I had to doublecheck line counts and the number of lines is ~16
    > > million (insetead of stated 1.6B).

    >
    > > Also:

    >
    > > >What is a "Unicode text file"? How is it encoded: utf8, utf16, utf16le, utf16be, ??? If you don't know, do this:

    >
    > > The file is UTF-8

    >
    > > > Do the first two characters always belong to the ASCII subset?

    >
    > > Yes, first two always belong to ASCII subset

    >
    > > > What are you going to do with it after it's sorted?

    >
    > > I need to isolate all lines that start with two characters (zz to be
    > > particular)

    >
    > > > Here's a start:http://docs.python.org/lib/typesseq-mutable.html
    > > > Google "GnuWin32" and see if their sort does what you want.

    >
    > > Will do, thanks for the tip.

    >
    > > > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.

    >
    > > I am limited with resources. Unfortunately.

    >
    > Since the OP has stated that they are running Windows XP, and more
    > than one poster has suggested installing more RAM in the box, I
    > thought people should know that WinXP has certain limitations on the
    > amount of memory that may be used:
    >
    > http://msdn2.microsoft.com/en-us/library/aa366778.aspx
    >
    > Firstly, the maximum amount of physical memory that may be installed
    > is 4GB. Secondly, with the "4 gigabyte tuning" and
    > "IMAGE_FILE_LARGE_ADDRESS_AWARE" patches, the maximum amount of
    > virtual memory (phyical memory + swapfile size) that may be assigned
    > to user processes is 2GB.
    >
    > Hence, even if you made a 100GB swap file with 4GB RAM installed, by
    > default only a maximum of 2GB would ever be assigned to a user-
    > process. With the two flags enabled, the maximum becomes 3GB.
    >
    > If the OP finds performance to be limited and thinks more RAM would
    > help trying a later version of Windows would be a start, but better
    > would be to try Linux or Mac OSX out.
    >
    > Cheers,
    > Asim
    >
    > > Cheers,

    >
    > > Ira


    Sorry, just to clarify my response. Any 32-bit OS will only be able
    to assign 4GB of virtual memory to a single processes, the argument
    being that since processes can only issue 32-bit instructions the
    process can only address a maximum of 2^32 bytes of addresses
    (assuming the architecture is using byte-addressed memory).

    Another link that's easier to grok:

    http://www.codinghorror.com/blog/archives/000811.html

    However, a 32-bit OS may support more than 4GB of virtual memory
    (using "Physical Address Extension", or PAE) and split it more
    intelligently between processes than Windows XP or Vista does:

    http://www.ibm.com/developerworks/linux/library/l-memmod/

    So allocating more than 4GB of virtual memory to your sort application
    could be achieved through splitting your task into more than one
    process on an appropriate OS. AFAIK, such memory limitations are
    dependent on the particular Linux distro you're using, and I'm not
    sure about Mac OSX limitations. This applies doubly for 64-bit
    architectures and OS's.

    Please correct me, with references, if my conclusions are wrong.

    Cheers,
    Asim
     
    Asim, Jan 25, 2008
    #15
  16. Nicko Guest

    On Jan 24, 9:26 pm, wrote:
    > > If you really have a 2GB file and only 2GB of RAM, I suggest that you don't hold your breath.

    >
    > I am limited with resources. Unfortunately.


    As long as you have at least as much disc space spare as you need to
    hold a copy of the file then this is not too hard. Split the file
    into chunks that are small enough to fit in memory, sort each chunk
    and write it to a file and then interleave the chunks. Below is a
    cheap and cheesy outline of code to do this, from which you can start.

    For files which are hugely larger than your available memory you can
    do this recursively but for files which are 10 to 100 times too big
    the single-pass code below will probably work just fine. The
    complexity is technically O(n.(log(c)+(n/c))) where n is the size of
    input and c is the chunk size; once n/c (the number of chunks) exceeds
    log(c) the cost of merging the chunks will start to dominate, though a
    recursive version would be slowed by needing a lot more disc access.

    #!/usr/bin/env python
    from itertools import islice
    from tempfile import TemporaryFile
    import sys

    # Tweak this number to fill your memory
    lines_per_chunk = 100000

    chunkfiles = []
    mergechunks = []

    while True:
    chunk = list(islice(sys.stdin, lines_per_chunk))
    if not chunk:
    break
    chunk.sort()
    f = TemporaryFile()
    f.writelines(chunk)
    f.seek(0)
    mergechunks.append((chunk[0], len(chunkfiles)))
    chunkfiles.append(f)

    while mergechunks:
    # The next line is order O(n) in the number of chunks
    (line, fileindex) = min(mergechunks)
    mergechunks.remove((line, fileindex))
    sys.stdout.write(line)
    nextline = chunkfiles[fileindex].readline()
    if nextline == "":
    chunkfiles[fileindex].close()
    else:
    mergechunks.append((nextline, fileindex))
     
    Nicko, Jan 25, 2008
    #16
  17. Paul Rubin Guest

    Nicko <> writes:
    > # The next line is order O(n) in the number of chunks
    > (line, fileindex) = min(mergechunks)


    You should use the heapq module to make this operation O(log n) instead.
     
    Paul Rubin, Jan 25, 2008
    #17
  18. En Fri, 25 Jan 2008 17:50:17 -0200, Paul Rubin
    <"http://phr.cx"@NOSPAM.invalid> escribi�:

    > Nicko <> writes:
    >> # The next line is order O(n) in the number of chunks
    >> (line, fileindex) = min(mergechunks)

    >
    > You should use the heapq module to make this operation O(log n) instead.


    Or forget about Python and use the Windows sort command. It has been there
    since MS-DOS ages, there is no need to download and install other
    packages, and the documentation at
    http://technet.microsoft.com/en-us/library/bb491004.aspx says:

    Limits on file size:
    The sort command has no limit on file size.

    Better, since the OP only intents to extract lines starting with "zz", use
    the findstr command:
    findstr /l /b "zz" filename.exe
    would do the job.

    Why doing things more complicated than that?

    --
    Gabriel Genellina
     
    Gabriel Genellina, Jan 26, 2008
    #18
  19. Gabriel Genellina wrote:
    > use the Windows sort command. It has been
    > there since MS-DOS ages, there is no need to download and install other
    > packages, and the documentation at
    > http://technet.microsoft.com/en-us/library/bb491004.aspx says:
    >
    > Limits on file size:
    > The sort command has no limit on file size.


    Sure, since no-one can ever try it with more than 640k, it's easy to state
    that there is no limit. :)

    Stefan
     
    Stefan Behnel, Jan 27, 2008
    #19
  20. Gabriel Genellina wrote:
    > use the Windows sort command. It has been
    > there since MS-DOS ages, there is no need to download and install other
    > packages, and the documentation at
    > http://technet.microsoft.com/en-us/library/bb491004.aspx says:
    >
    > Limits on file size:
    > The sort command has no limit on file size.


    Sure, since no-one can ever try it with more than 640k, it's easy to state
    that there is no limit. :)

    Stefan
     
    Stefan Behnel, Jan 27, 2008
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Maxim
    Replies:
    0
    Views:
    420
    Maxim
    Jul 7, 2003
  2. bisuvious

    sorting large file

    bisuvious, Mar 25, 2007, in forum: C++
    Replies:
    12
    Views:
    843
    Gianni Mariani
    Apr 6, 2007
  3. JJ
    Replies:
    13
    Views:
    533
  4. Rishi  Dhupar

    Sorting a large XML file

    Rishi Dhupar, Apr 19, 2005, in forum: Perl Misc
    Replies:
    5
    Views:
    162
    John Bokma
    Apr 20, 2005
  5. Replies:
    5
    Views:
    992
    Xho Jingleheimerschmidt
    Apr 2, 2009
Loading...

Share This Page