Working with Huge Text Files

Discussion in 'Python' started by Lorn Davies, Mar 19, 2005.

  1. Lorn Davies

    Hi there, I'm a Python newbie hoping for some direction in working with
    text files that range from 100MB to 1G in size. Basically certain rows,
    sorted by the first (primary) field and maybe the second (date), need
    to be copied and written to their own file, and some string
    manipulations need to happen as well. An example of the current format:

    XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
    XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
    |
    | followed by like a million rows similar to the above, with
    | incrementing date and time, and then on to next primary field
    |
    ABC,04JAN1993,9:30:27,28.875,7600,40,0,Z,N
    |
    | etc., there are usually 10-20 of the first field per file
    | so there's a lot of repetition going on
    |

    The export would ideally look like this where the first field would be
    written as the name of the file (XYZ.txt):

    19930104, 93027, 2887, 7600, 40, 0, Z, N

    Pretty ambitious for a newbie? I really hope not. I've been looking at
    simpleParse, but it's a bit intense at first glance... not sure where
    to start, or even if I need to go that route. Any help from you guys in
    what direction to go or how to approach this would be hugely
    appreciated.

    Best regards,
    Lorn
    Lorn Davies, Mar 19, 2005
    #1

  2. Guest

    Lorn Davies wrote:
    > Hi there, I'm a Python newbie hoping for some direction in working
    > with text files that range from 100MB to 1G in size.
    > [snip]


    You could use the csv module.

    Here's the example from the manual with your sample data in a file
    named simple.csv:

    import csv
    reader = csv.reader(file("some.csv"))
    for row in reader:
        print row

    """
    ['XYZ', '04JAN1993', '9:30:27', '28.87', '7600', '40', '0', 'Z', 'N']
    ['XYZ', '04JAN1993', '9:30:28', '28.87', '1600', '40', '0', 'Z', 'N']
    ['ABC', '04JAN1993', '9:30:27', '28.875', '7600', '40', '0', 'Z', 'N']
    """

    The csv module will bring each line in as a list of strings.
    Of course, you want to process each line before printing it.
    And you don't just want to print it, you want to write it to a file.

    So after reading the first line, open a file for writing with the
    first field (row[0]) as the file name. Then you want to process
    fields row[1], row[2] and row[3] to get them in the right format
    and then write all the row fields except row[0] to the file that's
    open for writing.

    On every subsequent line you must check to see if row[0] has changed,
    so you'll have to store row[0] in a variable. If it's changed, close
    the file you've been writing to and open a new file with the new
    row[0]. Then continue processing lines as before.

    It will only be this simple if you can guarantee that the original
    file is actually sorted by the first field.
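
    A rough sketch of that loop, in the same style as the csv example
    above. This is only an outline: it assumes sorted input, and the
    04JAN1993 date still needs a month-name table to become 19930104,
    so that part is omitted here:

    import csv

    reader = csv.reader(file("some.csv"))
    current_key = None
    out = None
    for row in reader:
        if row[0] != current_key:
            # first field changed: switch to XYZ.txt, ABC.txt, ...
            if out is not None:
                out.close()
            current_key = row[0]
            out = file(current_key + ".txt", "w")
        row[2] = row[2].replace(":", "")   # 9:30:27 -> 93027
        row[3] = row[3].replace(".", "")   # 28.87 -> 2887
        out.write(", ".join(row[1:]) + "\n")
    if out is not None:
        out.close()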
    , Mar 19, 2005
    #2

  3. Michael Hoffman

    wrote:

    > It will only be this simple if you can guarantee that the original
    > file is actually sorted by the first field.


    And if not you can either sort the file ahead of time, or just keep
    reopening the files in append mode when necessary. You could sort them
    in memory in your Python program but given the size of these files I
    think one of the other alternatives would be simpler.
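
    A minimal sketch of the reopen-in-append-mode idea (my illustration,
    untested; it skips the field reformatting and reuses the some.csv name
    from the earlier post):

    last_key = None
    out = None
    for line in file("some.csv"):
        fields = line.strip().split(",")
        if fields[0] != last_key:
            if out is not None:
                out.close()
            # append mode, so it's safe if this key shows up again later
            out = file(fields[0] + ".txt", "a")
            last_key = fields[0]
        out.write(", ".join(fields[1:]) + "\n")
    if out is not None:
        out.close()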
    --
    Michael Hoffman
    Michael Hoffman, Mar 19, 2005
    #3
  4. Guest

    wrote:
    > [snip]
    > Here's the example from the manual with your sample data in a file
    > named simple.csv:

    Obviously, I meant "some.csv". Make sure the name in the program
    matches the file you want to process, or pass the input file name
    to the program as an argument.

    , Mar 19, 2005
    #4
  5. Chirag Wazir

    Hi,

    Lorn Davies wrote:

    > ..... working with text files that range from 100MB to 1G in size.
    > .....
    > XYZ,04JAN1993,9:30:27,28.87,7600,40,0,Z,N
    > XYZ,04JAN1993,9:30:28,28.87,1600,40,0,Z,N
    > .....


    I've found that for working with simple large text files like this,
    nothing beats the plain old built-in string operations. Using a parsing
    library is convenient if the data format is complex, but otherwise it's
    overkill.
    In this particular case, even the csv module isn't much of an
    advantage. I'd just use split.

    The following code should do the job:

    data_file = open('data.txt', 'r')
    months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
              'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10',
              'NOV':'11', 'DEC':'12'}
    output_files = {}
    for line in data_file:
        fields = line.strip().split(',')
        filename = fields[0]
        if filename not in output_files:
            output_files[filename] = open(filename+'.txt', 'w')
        # 04JAN1993 -> 19930104
        fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
        fields[2] = fields[2].replace(':', '')   # 9:30:27 -> 93027
        fields[3] = fields[3].replace('.', '')   # 28.87 -> 2887
        print >>output_files[filename], ', '.join(fields[1:])
    for filename in output_files:
        output_files[filename].close()
    data_file.close()

    Note that it does work with unsorted data - at the minor cost of
    keeping all output files open till the end of the entire process.

    Chirag Wazir
    http://chirag.freeshell.org
    , Mar 19, 2005
    #5
  6. Al Christians

    I did some similar stuff way back about 12-15 years ago -- in 640k
    MS-DOS with gigabyte files on 33 MHz machines. I got good performance,
    able to bring up any record out of 10 million or so on the screen in a
    couple of seconds (not using Python, but that should not make much
    difference, maybe even some things in Python would make it work better.)

    Even though my files were text, I read them as random-access binary
    files. You need to be able to dive in at an arbitrary point in the
    file, read a chunk of data, split it up into lines, discarding any
    partial lines at the beginning and end, pull out the keys and see where
    you are. Even with a gigabyte of file, if you are reading a decent size
    chunk, you can binary search down to the spot you want in 15-20 tries or
    so. That's the first time, but after that you've got a better idea
    where to look. Use a dictionary to save the information from each chunk
    to give you an index to get a headstart on the next search. If you can
    keep 10k to 100k entries in your index, you can do 1000's of searches or
    so before you even have to worry about having too many index entries.

    I did learn that on 32-bit hardware, doing a binary search of a file
    over a gigabyte will fail if you calculate the next place to look as
    (a+b)/2, because a+b can be more than 2GB and overflow. You gotta do
    (a + (b-a)/2)
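
    Roughly, the bisect-by-offset part could look like this in Python - a
    simplified sketch of the idea, where readline() stands in for the
    chunk-read-and-split described above, the names are invented, and the
    file is assumed sorted on its first comma-separated field:

    import os

    def key_after(f, pos):
        """Return the first comma-separated key on a complete line at or
        after byte offset pos, plus that line ((None, '') past the end)."""
        f.seek(pos)
        if pos:
            f.readline()            # discard the partial line we landed in
        line = f.readline()
        if not line:
            return None, ''
        return line.split(',', 1)[0], line

    def find_first(f, size, target):
        """Bisect byte offsets for the first line whose key == target."""
        lo, hi = 0, size
        while lo < hi:
            mid = lo + (hi - lo) // 2   # avoids the a+b overflow above
            key, _ = key_after(f, mid)
            if key is None or key >= target:
                hi = mid
            else:
                lo = mid + 1
        key, line = key_after(f, lo)
        return line if key == target else None

    f = open('data.txt', 'rb')          # binary mode, so seek() is byte-exact
    print find_first(f, os.path.getsize('data.txt'), 'XYZ')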


    Al
    Al Christians, Mar 19, 2005
    #6
  7. Al Christians

    Michael Hoffman wrote:

    > wrote:
    >
    >> It will only be this simple if you can guarantee that the original
    >> file is actually sorted by the first field.

    >
    > And if not you can either sort the file ahead of time, or just keep
    > reopening the files in append mode when necessary. You could sort them
    > in memory in your Python program but given the size of these files I
    > think one of the other alternatives would be simpler.


    There used to be a very nice sort program for PCs that came from
    someplace in Nevada. It cost less than $100 and could sort files
    faster than most programming languages could read or write them. For
    Linux, you've gotta figure out the POSIX sort. If you do, please
    splain it to me.

    Al
    Al Christians, Mar 19, 2005
    #7
  8. Lorn Davies

    Thank you all very much for your suggestions and input... they've been
    very helpful. I found the easiest approach, as a beginner to this, was
    working with Chirag's code. Thanks Chirag, I was actually able to read
    and make some edits to the code and then use it... woohooo!

    My changes are annotated with ##:

    data_file = open('G:\pythonRead.txt', 'r')
    data_file.readline() ## this was to skip the first line
    months = {'JAN':'01', 'FEB':'02', 'MAR':'03', 'APR':'04', 'MAY':'05',
              'JUN':'06', 'JUL':'07', 'AUG':'08', 'SEP':'09', 'OCT':'10',
              'NOV':'11', 'DEC':'12'}
    output_files = {}
    for line in data_file:
        fields = line.strip().split(',')
        length = len(fields[3]) ## check how long the field is
        N = 'P','N'
        filename = fields[0]
        if filename not in output_files:
            output_files[filename] = open(filename+'.txt', 'w')
        if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
            ## This line above doesn't work, can't figure out how to struct?
            fields[1] = fields[1][5:] + months[fields[1][2:5]] + fields[1][:2]
            fields[2] = fields[2].replace(':', '')
            if length == 6: ## check for 6 if not add a 0
                fields[3] = fields[3].replace('.', '')
            else:
                fields[3] = fields[3].replace('.', '') + '0'
            print >>output_files[filename], ', '.join(fields[1:5])
    for filename in output_files:
        output_files[filename].close()
    data_file.close()

    The main changes were to create a check for the length of fields[3]; I
    wanted to normalize it at 6 digits... the problem I can see with it
    potentially is if I come across lengths < 5, but I have some ideas to
    fix that. The other change I attempted was a criterion for what to
    print based on the values of fields[8] and fields[6]. It didn't work so
    well. I'm a little confused about how to structure booleans like
    that... I come from a little experience in a Pascal-type scripting
    language where "x and y" would entail both having to be true before
    continuing and "x or y" would mean either could be true before
    continuing. Python, unless I'm misunderstanding (very possible),
    doesn't organize it as such. I thought of perhaps using a set of if,
    elif, else statements for processing the fields, but didn't think that
    would be the most elegant/efficient solution.

    Anyway, any critiques/ideas are welcome... they'll most definitely help
    me understand this language a bit better. Thank you all again for your
    great replies and thank you Chirag for getting me up and going.

    Lorn
    Lorn Davies, Mar 19, 2005
    #8
  9. Chirag Wazir

    Lorn Davies wrote:

    > if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
    > ## This line above doesn't work, can't figure out how to struct?


    In Python, "or" joins two complete expressions, and a bare non-empty
    string like 'P' always counts as true - so your version always passes.
    You would need to phrase it as follows:

    if (fields[8] == 'N' or fields[8] == 'P') and \
       (fields[6] == '0' or fields[6] == '1'):

    or alternatively:

    if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):

    > The main changes were to create a check for the length of fields[3],
    > I wanted to normalize it at 6 digits...


    Well, you needn't really check the length - you could directly do this:

    fields[3] = (fields[3].replace('.', '') + '000000')[:6]

    (Of course if there are more than 6 digits originally, they'd get
    truncated in this case.)
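
    For example, with the prices from your sample data:

    >>> ('28.87'.replace('.', '') + '000000')[:6]
    '288700'
    >>> ('28.875'.replace('.', '') + '000000')[:6]
    '288750'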

    Chirag Wazir
    http://chirag.freeshell.org
    , Mar 20, 2005
    #9
  10. John Machin

    wrote:
    > Lorn Davies wrote:
    >
    > > if (fields[8] == 'N' or 'P') and (fields[6] == '0' or '1'):
    > > ## This line above doesn't work, can't figure out how to struct?
    >
    > In Python you would need to phrase that as follows:
    > if (fields[8] == 'N' or fields[8] == 'P') and \
    >    (fields[6] == '0' or fields[6] == '1'):
    > or alternatively:
    > if (fields[8] in ['N', 'P']) and (fields[6] in ['0', '1']):


    and given that the files are huge, a little bit of preprocessing
    wouldn't go astray:

    initially:

    valid_8 = set(['N', 'P'])
    valid_6 = set(['0', '1'])

    then for each record:

    if fields[8] in valid_8 and fields[6] in valid_6:

    More meaningful names wouldn't go astray either :)
    John Machin, Mar 20, 2005
    #10
  11. Chirag Wazir

    John Machin wrote:

    > More meaningful names wouldn't go astray either :)


    I heartily concur!

    Instead of starting with:
    fields = line.strip().split(',')
    you could use something like:
    (f_name, f_date, f_time, ...) = line.strip().split(',')

    Of course then you won't be able to use ', '.join(fields[1:])
    for the output, but the rest of the program will be
    MUCH more readable/maintainable.

    Chirag Wazir
    http://chirag.freeshell.org
    , Mar 20, 2005
    #11
