Splitting a CSV file into 40,000 line chunks

Discussion in 'Ruby' started by Drew Olson, Nov 29, 2006.

  1. Drew Olson

    Drew Olson Guest

    All -

    I've written a script to split a .csv file into smaller .csv files of
    40,000 lines each. The intent here is to break the file down enough so
    that excel does not have issues reading each chunk. My code takes a
    filename from the command line and breaks it down like so:

    infile -> xyz.csv

    output -> xyz_part_1.csv
    xyz_part_2.csv
    etc...

    My code is working but I don't find it very "rubyish". In particular, I
    hate having my index and counter variables, and I don't like that I had to
    declare my header variable outside of the loop. Bear in mind here that I
    cannot do something like "rows = CSV.open(infile)" because Ruby will
    throw an error, as the input file is too big (250 MB). Any advice on
    making the code nicer is appreciated. The current code is as follows:

    require 'csv'

    infile = ARGV[0] if ARGV[0] != nil

    counter = 1
    index = 0
    header = ""
    writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

    CSV.open(infile, 'r') do |row|
      if index != 0 && index % 40000 == 0
        writer.close
        counter += 1
        writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
        writer << header
      end
      if index == 0
        header = row
      end
      writer << row
      index += 1
    end

    writer.close()

    --
    Posted via http://www.ruby-forum.com/.
     
    Drew Olson, Nov 29, 2006
    #1

  2. On 11/29/06, Drew Olson <> wrote:
    > My code is working but I don't find it very "rubyish". In particular, I
    > hate having my index and counter variables, and I don't like that I had to
    > declare my header variable outside of the loop. Bear in mind here that I
    > cannot do something like "rows = CSV.open(infile)" because Ruby will
    > throw an error, as the input file is too big (250 MB). Any advice on
    > making the code nicer is appreciated. The current code is as follows:



    File.open("test", "w+") do |f|
      20.times do |i|
        f.write "abc|def|ghi#{i}\n"
      end
    end

    File.open("test", "r") do |fin|
      part = 0
      while row = fin.gets

        File.open("test_part#{part}", "w+") do |f|
          5.times do
            f.write(row)
            break unless row = fin.gets
          end
        end

        part += 1
      end
    end


    --
    Simon Strandgaard
    http://opcoders.com/
     
    Simon Strandgaard, Nov 29, 2006
    #2

  3. Drew Olson

    Will Jessop Guest

    Paul Lutus wrote:
    >> CSV.open(infile, 'r') do |row|

    >
    > Why are you using CSV for this? You aren't parsing the lines into fields, so
    > the fact that they contain CSV content has no bearing on the present task.
    > Your goal is to split the input file into groups of lines delimited by
    > linefeeds, not fields delimited by commas.
    >
    > Why not simply read lines from the input file and write them to a series of
    > output files, until the input file is exhausted?


    Because CSV understands csv data with embedded newlines:

    [will@localhost ~]$ cat csvfile.csv
    some, field,"new
    line"
    other,field,here
    [will@localhost ~]$ cat test.rb
    require 'csv'

    CSV.open('csvfile.csv', 'r') do |row|
      p row
    end
    [will@localhost ~]$ ruby test.rb
    ["some", " field", "new\nline"]
    ["other", "field", "here"]

    will.
     
    Will Jessop, Nov 29, 2006
    #3
  4. On Nov 29, 2006, at 10:53 AM, Nick Sieger wrote:

    > On 11/29/06, Drew Olson <> wrote:
    >>
    >> I've written a script to split a .csv file into smaller .csv files of
    >> 40,000 lines each. The intent here is to break the file down
    >> enough so
    >> that excel does not have issues reading each chunk. My code takes a
    >> filename from the command line and breaks it down as so:

    >
    >
    > Not ruby, but why not use "split -l 40000"?


    Because the sample code is also moving the headers to the new files.

    James Edward Gray II
     
    James Edward Gray II, Nov 29, 2006
    #4
  5. Drew Olson

    ChrisH Guest

    Nice Simon

    But the 'while row = fin.gets' is skipping a row each time around and
    you don't handle the header. My "fix":
    Simon Strandgaard wrote:...
    > File.open("test", "w+") do |f|
    > 20.times do |i|
    > f.write "abc|def|ghi#{i}\n"
    > end
    > end
    >
    > File.open("test", "r") do |fin|
    > part = 0

    header = fin.gets
    row = fin.gets
    while row
    >
    > File.open("test_part#{part}", "w+") do |f|

    f.write(header)
    > 5.times do
    > f.write(row)
    > break unless row = fin.gets
    > end
    > end
    >
    > part += 1
    > end
    > end

    ....
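    Assembled into one place, ChrisH's corrected loop runs like this (a
    sketch using small hypothetical test files, a header row added to the
    sample data, and 5-line chunks to keep the demo readable):

    ```ruby
    # Generate a small sample file with a header row (hypothetical data).
    File.open("test", "w+") do |f|
      f.write "col1|col2|col3\n"
      20.times { |i| f.write "abc|def|ghi#{i}\n" }
    end

    # Split it, repeating the header at the top of every part.
    File.open("test", "r") do |fin|
      part = 0
      header = fin.gets          # read the header once, up front
      row = fin.gets             # prime the loop so no row is skipped
      while row
        File.open("test_part#{part}", "w+") do |f|
          f.write(header)
          5.times do
            f.write(row)
            break unless row = fin.gets
          end
        end
        part += 1
      end
    end
    ```

    With 20 data rows and 5-line chunks this produces test_part0 through
    test_part3, each starting with the header.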


    Cheers
    Chris
     
    ChrisH, Nov 29, 2006
    #5
  6. Drew Olson

    ChrisH Guest

    James Edward Gray II wrote:
    > On Nov 29, 2006, at 10:53 AM, Nick Sieger wrote:
    >
    > > Not ruby, but why not use "split -l 40000"?

    >
    > Because the sample code is also moving the headers to the new files.
    >
    > James Edward Gray II


    Also, since he's feeding Excel, this is most likely on Windows rather
    than *NIX.

    Cheers
    Chris
     
    ChrisH, Nov 29, 2006
    #6
  7. On 11/29/06, ChrisH <> wrote:
    > Nice Simon
    >
    > But the 'while row = fin.gets' is skipping a row each time around and
    > you don't handle the header. My "fix":


    I usually mess up in my loops like this :-(
    Thanks :)

    --
    Simon Strandgaard
     
    Simon Strandgaard, Nov 29, 2006
    #7
  8. On Nov 29, 2006, at 9:32 AM, Drew Olson wrote:

    > Any advice on making the code nicer is appreciated.


    I'll let you decide if this is any better:

    Firefly:~/Desktop$ ls
    gen_csv.rb split_csv.rb
    Firefly:~/Desktop$ ruby gen_csv.rb
    Firefly:~/Desktop$ ls
    data.csv gen_csv.rb split_csv.rb
    Firefly:~/Desktop$ cat data.csv
    rec_no,whatever
    0,"blah, blah, blah, ..."
    1,"blah, blah, blah, ..."
    2,"blah, blah, blah, ..."
    3,"blah, blah, blah, ..."
    4,"blah, blah, blah, ..."
    5,"blah, blah, blah, ..."
    6,"blah, blah, blah, ..."
    7,"blah, blah, blah, ..."
    8,"blah, blah, blah, ..."
    9,"blah, blah, blah, ..."
    10,"blah, blah, blah, ..."
    11,"blah, blah, blah, ..."
    12,"blah, blah, blah, ..."
    13,"blah, blah, blah, ..."
    14,"blah, blah, blah, ..."
    15,"blah, blah, blah, ..."
    16,"blah, blah, blah, ..."
    17,"blah, blah, blah, ..."
    18,"blah, blah, blah, ..."
    19,"blah, blah, blah, ..."
    Firefly:~/Desktop$ ruby split_csv.rb data.csv
    Firefly:~/Desktop$ ls
    data.csv data_part_2.csv data_part_4.csv gen_csv.rb
    data_part_1.csv data_part_3.csv data_part_5.csv split_csv.rb
    Firefly:~/Desktop$ cat data_part_1.csv
    rec_no,whatever
    0,"blah, blah, blah, ..."
    1,"blah, blah, blah, ..."
    2,"blah, blah, blah, ..."
    3,"blah, blah, blah, ..."
    Firefly:~/Desktop$ cat data_part_3.csv
    rec_no,whatever
    8,"blah, blah, blah, ..."
    9,"blah, blah, blah, ..."
    10,"blah, blah, blah, ..."
    11,"blah, blah, blah, ..."
    Firefly:~/Desktop$ cat split_csv.rb
    #!/usr/bin/env ruby -w

    require "rubygems"
    require "faster_csv"

    original_file = ARGV.shift
    counter = 1

    FCSV.open(original_file.sub(".", "_part_#{counter}."), "w") do |out|
      FCSV.open(original_file, :headers => true).each_with_index do |row, index|
        if (index % 4).zero?
          if index.nonzero?
            counter += 1
            out.reopen(original_file.sub(".", "_part_#{counter}."), "w")
          end
          out << row.headers
        end
        out << row
      end
    end

    __END__

    James Edward Gray II
     
    James Edward Gray II, Nov 29, 2006
    #8
  9. Drew Olson

    Drew Olson Guest

    Thanks for all the responses. As noted in a post above, I am trying to
    place the headers at the beginning of each split file. Should have
    mentioned that earlier. Also, regarding the above responses, I am
    working on a Windows system, hence not being able to use nice shell
    scripts.

    Anyway, it seems that my logic was on the right track; the use of ugly
    counters just made my inner rubyist upset. Thanks for all the feedback.
    I will take a look at reading/writing the files using the File object
    as opposed to the CSV object. In fact, I didn't even think about that.
    In previous scripts I had been accessing the CSV fields individually
    but, as that is unnecessary here, I will toy around with the performance
    gains of using the File object.

    Speaking of performance, using the code I've described seemed to be
    quite slow. This is understandable based on the fact that the input file
    has 1 million+ records, however I was wondering if I would see
    performance gain by using File objects as opposed to CSV objects. Also,
    any other tips on increasing the performance of this code would be
    helpful.

    Again, thanks for all the responses. I am continually impressed by the
    attitude and intelligence of this community.

     
    Drew Olson, Nov 29, 2006
    #9
  10. On Nov 29, 2006, at 12:10 PM, Drew Olson wrote:

    > Speaking of performance, using the code I've described seemed to be
    > quite slow. This is understandable based on the fact that the input
    > file
    > has 1 million+ records, however I was wondering if I would see
    > performance gain by using File objects as opposed to CSV objects.
    > Also,
    > any other tips on increasing the performance of this code would be
    > helpful.


    CSV is known to be pretty slow, thus FasterCSV. ;)

    James Edward Gray II
     
    James Edward Gray II, Nov 29, 2006
    #10
  11. On Nov 29, 2006, at 11:29 AM, Paul Lutus wrote:

    > Will Jessop wrote:
    >
    >> Paul Lutus wrote:
    >>>> CSV.open(infile, 'r') do |row|
    >>>
    >>> Why are you using CSV for this? You aren't parsing the lines into
    >>> fields,
    >>> so the fact that they contain CSV content has no bearing on the
    >>> present
    >>> task. Your goal is to split the input file into groups of lines
    >>> delimited
    >>> by linefeeds, not fields delimited by commas.
    >>>
    >>> Why not simply read lines from the input file and write them to a
    >>> series
    >>> of output files, until the input file is exhausted?

    >>
    >> Because CSV understands csv data with embedded newlines:

    >
    > A plain-text CSV file uses linefeeds as record delimiters. A
    > program that
    > uses "readline" or "gets" splits the records just as a sane CSV parser
    > would. And IMHO a CSV file should never, ever have linefeeds
    > embedded in
    > fields.


    Your opinion doesn't make you right on this one. The CSV RFC clearly
    defines handling for carriage-returns and linefeeds. They certainly
    are allowed in fields. Here is a link to the document, in case you
    want to read up:

    http://www.ietf.org/rfc/rfc4180.txt

    Not to use a CSV parser on this task would be shooting yourself in
    the foot. The result using a simple File object would be broken and,
    much worse, it might look OK for a while. You just can't be sure you
    are never going to split a CSV file that has an embedded linefeed in
    it (especially since that's perfectly legal), and when you do you
    will be responsible for destroying data. There's just no reason for
    that.

    I know you're a don't-use-a-library guy and you know I disagree.
    This is the reason why. The edge cases will get you every time.

    James Edward Gray II
     
    James Edward Gray II, Nov 29, 2006
    #11
  12. Drew Olson

    Drew Olson Guest

    James -

    I'll give FasterCSV a try when I get home from work and out from behind
    this proxy. Here's another question: in some cases I need to sort the
    file before splitting it (in this case sorting by the 4th cell in each
    row). However, the current file I'm trying to sort and split is around
    76 MB and ruby fails when trying to store the CSV as an array. The code
    and output are below. How else can I go about this?

    Code:

    require 'csv'

    infile = ""

    if ARGV[0] != nil
      infile = ARGV[0]
    else
      puts "Please enter a file to split"
      exit
    end

    puts "loading file"

    rows = CSV.read(infile)

    puts "file loaded"

    rows.sort!{|x,y| x[3] <=> y[3]}

    counter = 1
    header = ""
    writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')

    rows.each_index do |i|
      if i != 0 && i%40000 == 0
        writer.close
        counter += 1
        writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
        writer << header
      end
      if i == 0
        header = rows[i]
      end
      writer << rows[i]
    end

    writer.close()

    Output:

    loading file
    The exception unknown software exception (0xc00000fd) occurred in the
    application at location.....etc

     
    Drew Olson, Nov 29, 2006
    #12
  13. On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:

    > Here's another question: in some cases I need to sort the
    > file before splitting it (in this case sorting by the 4th cell in each
    > row). However, the current file I'm trying to sort and split is around
    > 76 MB and ruby fails when trying to store the CSV as an array. The
    > code
    > and output are below. How else can I go about this?


    Hmm, that's a good question.

    Perhaps you can collect just the key values in an Array and then use
    those to reorder the lines bit by bit. That's not going to be fast
    with any library helping you, but I don't have a better idea.
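    One way to sketch that idea without holding the whole file in memory:
    record a (key, byte offset) pair per line in a first pass, sort that
    small index, then seek back to each offset and write the rows out in
    key order. (A sketch over made-up data; it splits on plain commas, so
    it would mis-handle quoted fields containing embedded commas or
    newlines.)

    ```ruby
    # First pass: build a (key, offset) index; only the index lives in memory.
    File.open("data.csv", "w") do |f|     # hypothetical sample input
      f.puts "a,b,c,key"
      [3, 1, 2].each { |k| f.puts "x,y,z,#{k}" }
    end

    index = []
    File.open("data.csv", "r") do |f|
      f.gets                              # skip the header
      until f.eof?
        offset = f.pos                    # byte position of this row
        key = f.gets.chomp.split(",")[3]  # 4th cell is the sort key
        index << [key, offset]
      end
    end

    index.sort_by! { |key, _| key }

    # Second pass: seek to each remembered offset and copy the row out.
    File.open("sorted.csv", "w") do |out|
      File.open("data.csv", "r") do |f|
        out.write f.gets                  # header first
        index.each do |_, offset|
          f.seek(offset)
          out.write f.gets
        end
      end
    end
    ```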

    James Edward Gray II
     
    James Edward Gray II, Nov 29, 2006
    #13
  14. Reid Thompson, Nov 29, 2006
    #14
  15. Drew Olson

    Drew Olson Guest

    Drew Olson, Nov 29, 2006
    #15
  16. On Thu, 2006-11-30 at 04:44 +0900, Drew Olson wrote:
    > Reid Thompson wrote:
    > > On Thu, 2006-11-30 at 02:25 +0900, ChrisH wrote:
    > >> Also since he's feeding Excel this is most likely on Windows rather
    > >> than *NIX

    > > not a show stopper..
    > > http://www.research.att.com/sw/tools/uwin/
    > >
    > > http://unxutils.sourceforge.net/

    >
    > Reid -
    >
    > I also have cygwin installed on my box, so if there is a simple way to
    > do this in unix I'd love to know. My unix is rusty (that's being
    > generous).
    >

    rthompso@jhereg:~$ cat splitandhead
    #!/bin/bash

    fname=$1
    split -l 4000 -a 6 -d $fname $fname

    for file in `ls ${fname}0*`
    do
        sed -e '1i\Col1 Col2 Col3 Col4\' $file > ${file}.tmp
        mv ${file}.tmp $file
    done



    rthompso@jhereg:~$ head testsed
    this is a test
    this is a test
    this is a test
    this is a test
    this is a test
    this is a test
    this is a test
    this is a test
    this is a test
    this is a test
    rthompso@jhereg:~$ wc testsed
    40140 160560 602100 testsed
    rthompso@jhereg:~$ time ./splitandhead testsed

    real 0m0.499s
    user 0m0.140s
    sys 0m0.092s
    rthompso@jhereg:~$ head -2 testsed0000*
    ==> testsed000000 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000001 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000002 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000003 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000004 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000005 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000006 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000007 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000008 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000009 <==
    Col1 Col2 Col3 Col4
    this is a test

    ==> testsed000010 <==
    Col1 Col2 Col3 Col4
    this is a test
    rthompso@jhereg:~$
     
    Reid Thompson, Nov 29, 2006
    #16
  17. On Thu, 2006-11-30 at 04:04 +0900, James Edward Gray II wrote:
    > On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:
    >
    > > Here's another question: in some cases I need to sort the
    > > file before splitting it (in this case sorting by the 4th cell in each
    > > row). However, the current file I'm trying to sort and split is around
    > > 76 MB and ruby fails when trying to store the CSV as an array. The
    > > code
    > > and output are below. How else can I go about this?

    >
    > Hmm, that's a good question.
    >
    > Perhaps you can collect just the key values in an Array and then use
    > those to reorder the lines bit by bit. That's not going to be fast
    > with any library helping you, but I don't have a better idea.
    >
    > James Edward Gray II

    sort -n +4 out
    or
    sort +4 out


    rthompso@jhereg:~$ head -50 out
    this is a test 10202
    this is a test 23990
    this is a test 11056
    this is a test 9606
    this is a test 28590
    this is a test 18264
    this is a test 12902
    this is a test 12856
    this is a test 27571
    this is a test 5495
    this is a test 15965
    this is a test 22229
    this is a test 18865
    this is a test 31339
    this is a test 21913
    this is a test 406
    this is a test 8602
    this is a test 5329
    this is a test 10048
    this is a test 6458
    this is a test 20069
    this is a test 19771
    this is a test 21844
    this is a test 24719
    this is a test 30894
    this is a test 25239
    this is a test 9900
    this is a test 1727
    this is a test 12042
    this is a test 20832
    this is a test 23735
    this is a test 28768
    this is a test 10283
    this is a test 390
    this is a test 10480
    this is a test 1337
    this is a test 2745
    this is a test 26398
    this is a test 32288
    this is a test 3797
    this is a test 22251
    this is a test 458
    this is a test 14679
    this is a test 29642
    this is a test 19943
    this is a test 26342
    this is a test 24232
    rthompso@jhereg:~$ sort +4 out
    this is a test 0
    this is a test 10
    this is a test 10
    this is a test 100
    this is a test 100
    this is a test 100
    this is a test 1000
    this is a test 1000
    this is a test 10001
    this is a test 10001
    this is a test 10005
    this is a test 10006
    this is a test 10006
    this is a test 10007
    this is a test 10007
    this is a test 10008
    this is a test 10008
    this is a test 10009
    this is a test 10009
    this is a test 1001
    this is a test 10010
    this is a test 10011
    this is a test 10012
    this is a test 10012
    this is a test 10012
    this is a test 10012
    this is a test 10012
    this is a test 10013
    this is a test 10014
    this is a test 10014
    this is a test 10015
    this is a test 10017
    this is a test 10017
    this is a test 10017
    this is a test 10018
    this is a test 10019
    this is a test 1002
    this is a test 1002
    this is a test 10021
    this is a test 10026
    this is a test 10026
    -------------------- OR
    rthompso@jhereg:~$ sort -n +4 out
    this is a test 0
    this is a test 2
    this is a test 3
    this is a test 3
    this is a test 4
    this is a test 7
    this is a test 10
    this is a test 10
    this is a test 12
    this is a test 13
    this is a test 14
    this is a test 14
    this is a test 15
    this is a test 16
    this is a test 16
    this is a test 20
    this is a test 21
    this is a test 24
    this is a test 25
    this is a test 28
    this is a test 29
    this is a test 29
    this is a test 29
    this is a test 30
    this is a test 30
    this is a test 31
    this is a test 32
    this is a test 32
    this is a test 34
    this is a test 35
    this is a test 35
    this is a test 35
    this is a test 36
    this is a test 36
    this is a test 37
    this is a test 37
    this is a test 38
    this is a test 38
    this is a test 41
    this is a test 41
    this is a test 41
    this is a test 41
    this is a test 41
    this is a test 41
    this is a test 42
    this is a test 42
    this is a test 42
    this is a test 43
    this is a test 45
    this is a test 46
    this is a test 46
    this is a test 47
    this is a test 47
     
    Reid Thompson, Nov 29, 2006
    #17
  18. Drew Olson

    Mike Harris Guest

    Drew Olson wrote:

    >All -
    >
    >I've written a script to split a .csv file into smaller .csv files of
    >40,000 lines each. The intent here is to break the file down enough so
    >that excel does not have issues reading each chunk. My code takes a
    >filename from the command line and breaks it down as so:
    >
    >infile -> xyz.csv
    >
    >output -> xyz_part_1.csv
    > xyz_part_2.csv
    > etc...
    >
    >My code is working but I don't find it very "rubyish". In particular, I
    >hate having my index and counter counters and I don't like that I had to
    >declare my header variable outside of the loop. Bear in mind here that I
    >can not do something like "rows = CSV.open(infile)" because ruby will
    >yell and error as the input file is too big (250 mb). Any advice on
    >making the code nicer is appreciated. The current code is as follows:
    >
    >require 'csv'
    >
    >infile = ARGV[0] if ARGV[0] != nil
    >
    >counter = 1
    >index = 0
    >header = ""
    >writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
    >
    >CSV.open(infile, 'r') do |row|
    > if(index != 0 && index%40000 == 0)
    > writer.close
    > counter+=1
    > writer = CSV.open(infile.gsub(/\./,"_part_"+counter.to_s+"."),'w')
    > writer << header
    > end
    > if (index == 0)
    > header = row
    > end
    > writer << row
    > index += 1
    >end
    >
    >writer.close()
    >
    >
    >

    I will ignore the CSV issue, not because it isn't important, but simply
    because I'm not familiar with the CSV parser, and this example
    sufficiently represents the concept.

    For maximum elegance, I would write the code this way. It uses the
    helper methods (i.e. not in the stdlib) File#write_fresh, File#to_a and
    Enumerable#chunks, all of which I've written at one time or another.
    Mentally sub in the appropriate code as desired.

    File.to_a('xyz.csv').chunks(40000).each_with_index do |chunk, i|
      File.write_fresh("xyz_part_#{i+1}", chunk.join("\n"))
    end

    File.to_a returns an array of lines
    Enumerable#chunks divides an Enumerable into groups of 40k. A 100k
    array would yield 2 40k chunks and a 20k chunk.
    File#write_fresh creates the file if it doesn't exist, truncates any
    existing file, and writes the 2nd argument to the file.

    Not only is this version much prettier than the corresponding version
    without the helper methods, it is also clearer: it is obvious at a glance
    what it does. The same can't be said for the version without helper methods.
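    The helpers Mike names are not in the stdlib; one possible sketch of
    them (Enumerable#each_slice does most of the work for #chunks, and
    chunk.join is used instead of chunk.join("\n") since lines read from a
    file already keep their newlines):

    ```ruby
    class File
      # Return the file's lines as an array.
      def self.to_a(name)
        readlines(name)
      end

      # Create the file if needed, truncate it, and write data to it.
      def self.write_fresh(name, data)
        open(name, "w") { |f| f.write(data) }
      end
    end

    module Enumerable
      # Divide the collection into groups of at most `size` elements.
      def chunks(size)
        each_slice(size).to_a
      end
    end

    # Demo with hypothetical 10-line data and 4-line chunks.
    File.write_fresh("xyz.csv", (1..10).map { |i| "row#{i}\n" }.join)
    File.to_a("xyz.csv").chunks(4).each_with_index do |chunk, i|
      File.write_fresh("xyz_part_#{i + 1}", chunk.join)
    end
    ```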
     
    Mike Harris, Nov 29, 2006
    #18
  20. Drew Olson

    Edwin Fine Guest

    I am perplexed by CSV.open. In IO and File, open returns something that
    quacks like an IO object. You can then call gets, puts, read, write and
    so on. The CSV open seems to return an array (or gives you a row at a
    time). This makes it very hard to write code that can take advantage of
    duck typing, e.g.

    def get_batch(f, count)
      lines = nil
      count.times do
        break unless line = f.gets
        (lines ||= []) << line
      end
      lines
    end

    get_batch(File.open("foo"), 40000) # No problem
    get_batch(CSV.open("foo", "r"), 40000) # Error

    I wonder why it was not patterned more closely on IO? Any thoughts?
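    One workaround, sketched: hide any row source behind a tiny adapter
    that exposes an IO-style #gets returning nil at end of input, so code
    like get_batch works against files and CSV readers alike (RowReader is
    a made-up name, not part of any library):

    ```ruby
    require "csv"

    # Adapter: wraps any Enumerable of rows and quacks like IO for #gets.
    class RowReader
      def initialize(enum)
        @enum = enum.each        # an Enumerator over the rows
      end

      def gets
        @enum.next
      rescue StopIteration
        nil                      # IO#gets returns nil at end of input
      end
    end

    def get_batch(f, count)
      lines = nil
      count.times do
        break unless line = f.gets
        (lines ||= []) << line
      end
      lines
    end

    File.write("foo.csv", "a,b\nc,d\ne,f\n")
    batch = get_batch(RowReader.new(CSV.foreach("foo.csv")), 2)
    ```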

     
    Edwin Fine, Nov 29, 2006
    #20
