grep a CSV?


Michael Linfield

If I had a huge CSV file, and I wanted to pull out, say, all the lines
that contained the word "Blah1" and throw them in a temporary file, what
would be the best approach? My thoughts were to use:

require 'rubygems'
require 'ruport'
require 'ruport/util'

t = Ruport::Data::Table.load("filename.csv")
t.grep(/Blah1/)

### this sadly only returned an output of => []

any ideas?

Thanks!
 

Michael Glaesemann

If I had a huge CSV file, and I wanted to pull out, say, all the lines
that contained the word "Blah1" and throw them in a temporary file, what
would be the best approach?

any ideas?

My first thought was to use FasterCSV and go from there. Any reason not
to use a dedicated CSV lib?

Michael Glaesemann
grzm seespotcode net
 

Michael Linfield

My first thought was to use FasterCSV and go from there. Any reason not
to use a dedicated CSV lib?

Michael Glaesemann
grzm seespotcode net

I'm planning to use Ruport to graph the data, but if you can integrate the
output from FasterCSV into a Ruport graph, I'm all ears :)
 

Chris Carter

If I had a huge CSV file, and I wanted to pull out, say, all the lines
that contained the word "Blah1" and throw them in a temporary file, what
would be the best approach? My thoughts were to use:

require 'rubygems'
require 'ruport'
require 'ruport/util'

t = Ruport::Data::Table.load("filename.csv")
t.grep(/Blah1/)

### this sadly only returned an output of => []

any ideas?

Thanks!

File.readlines('filename.csv').grep(/Blah1/)
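
(Since the file is described as huge, a streaming variant may be safer: File.foreach reads one line at a time instead of slurping everything into memory. A minimal sketch; the output filename is just a placeholder.)

# Stream the CSV line by line and copy matches to a scratch file,
# so the whole file never has to fit in memory.
File.open('blah1_tmp.csv', 'w') do |out|
  File.foreach('filename.csv') do |line|
    out.puts line if line =~ /Blah1/
  end
end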
 

M. Edward (Ed) Borasky

Michael said:
If I had a huge CSV file, and I wanted to pull out, say, all the lines
that contained the word "Blah1" and throw them in a temporary file, what
would be the best approach? My thoughts were to use:

require 'rubygems'
require 'ruport'
require 'ruport/util'

t = Ruport::Data::Table.load("filename.csv")
t.grep(/Blah1/)

### this sadly only returned an output of => []

any ideas?

Thanks!

OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limitation larger than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)
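
(For what it's worth, the Ruby spelling of that one-liner is nearly as short; a sketch, with the filenames as placeholders:)

ruby -ne 'print if /Blah1/' huge-file.csv > temp-file.csv

(-n wraps the script in a while-gets loop, and a bare regexp in a conditional matches against the current line, $_.)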
 

Chris Carter

I'm planning to use Ruport to graph the data, but if you can integrate the
output from FasterCSV into a Ruport graph, I'm all ears :)

Ruport uses FasterCSV for its CSV parsing.
 

William James

M. Edward (Ed) Borasky wrote
Michael said:
If I had a huge CSV file, and I wanted to pull out, say, all the lines
that contained the word "Blah1" and throw them in a temporary file, what
would be the best approach? My thoughts were to use:
require 'rubygems'
require 'ruport'
require 'ruport/util'
t = Ruport::Data::Table.load("filename.csv")
t.grep(/Blah1/)

### this sadly only returned an output of => []
any ideas?

OK ... first of all, define "huge" and what are your
restrictions? Let me assume the worst case just to get
started -- more than 256 columns and more than 65536 rows
and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather
than a one-shot "somebody gave me this *$&%^# file and wants
an answer by 5 PM tonight!" use case, I'd load it into a
database (assuming your database doesn't have a column count
limitation larger than the column count in your file, that
is) and then hook up to it with DBI. But if it's a one-shot
deal and you've got a command line handy (Linux, MacOS, BSD
or Cygwin)

Windoze has a command-line.
just do "grep blah1 huge-file.csv >
temp-file.csv". Bonus points for being able to write that in
Ruby and get it debugged before someone who's been doing
command-line for years types that one-liner in. :)

Chris Carter has already done it. Have you ever posted
Ruby code here?
 

Michael Linfield

M. Edward (Ed) Borasky said:
Michael said:
### this sadly only returned an output of => []

any ideas?

Thanks!

OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limitation larger than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)

lol, alright, let's say the scenario will be in the range of 20k - 70k
lines of data, no more than 20 columns.
And I want to avoid using the command line to do this, because yes, in
fact, this will be used to process more than one datafile, which I hope
to set up with optparse so a command-line arg directs the program to
the file. Also, for the meantime, I wanted to avoid throwing it on any
database... avoiding DBI for the meanwhile. But an idea flew through my
head a few minutes ago... what if I did this --

res = []
res << File.readlines('filename.csv').grep(/Blah1/) #thanks chris

Ran into a small demeaning problem: this shoves all that grep'd data
into one element, lol... res[1] => nil ... it's all shoved into res[0].
I'd hoped to fix this with a simple do statement, but a little confusion
hit me while doing that with a readline command. And by shoving this into
an array, will I still be able to single out columns of data? If not, then
how would I shove the grep data into a second CSV file, doing this all
inside Ruby of course, no command-line program > output.csv :)
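
(On the columns question: once each matching line is its own array element, a plain split will expose the columns. A minimal sketch, with the caveat, discussed further down-thread, that a naive split breaks on quoted fields:)

res = File.readlines('filename.csv').grep(/Blah1/)
# One matching line per element; split each into its columns.
rows = res.map { |line| line.chomp.split(',') }
puts rows.first[0]  # first column of the first matching row

(And writing the matches back out to a second CSV file is just File.open('out.csv', 'w') { |f| f.puts res }.)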
 

Alex Gutteridge

M. Edward (Ed) Borasky said:
Michael said:
### this sadly only returned an output of => []

any ideas?

Thanks!

OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limitation larger than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)

lol, alright, let's say the scenario will be in the range of 20k - 70k
lines of data, no more than 20 columns.
And I want to avoid using the command line to do this, because yes, in
fact, this will be used to process more than one datafile, which I hope
to set up with optparse so a command-line arg directs the program to
the file. Also, for the meantime, I wanted to avoid throwing it on any
database... avoiding DBI for the meanwhile. But an idea flew through my
head a few minutes ago... what if I did this --

res = []
res << File.readlines('filename.csv').grep(/Blah1/) #thanks chris

Array#<< appends the object onto your Array; you want to combine the
two arrays using Array#+:

irb(main):001:0> a = []
=> []
irb(main):002:0> a << [1,2,3]
=> [[1, 2, 3]]
irb(main):003:0> a = []
=> []
irb(main):004:0> a += [1,2,3]
=> [1, 2, 3]
irb(main):005:0>

Though why don't you just use:

res = File.readlines('filename.csv').grep(/Blah1/)

Alex Gutteridge

Bioinformatics Center
Kyoto University
 

Michael Linfield

Alex said:
On 16 Aug 2007, at 13:08, Michael Linfield wrote:

Though why don't you just use:

res = File.readlines('filename.csv').grep(/Blah1/)

Alex Gutteridge

Can I push that into a file to temporarily use to pull all the Blah1
data from, then at the end of the program delete Blah1.csv?
 

Alex Gutteridge

Can I push that into a file to temporarily use to pull all the Blah1
data from, then at the end of the program delete Blah1.csv?

Sure, use tempfile, but I think botp has shown why you don't really
need the temporary file (unless there's part of this problem I'm not
understanding):

irb(main):001:0> puts File.readlines('filename.csv')
this, is , a , test, foo
this, is , a , test, bar
this, is , a , test, Blah1
this, is , a , test, bar
this, Blah, is , a , test
this, is , a , Blah, test
=> nil
irb(main):002:0> puts File.readlines('filename.csv').grep(/Blah1/)
this, is , a , test, Blah1
=> nil
irb(main):003:0> require 'tempfile'
=> true
irb(main):004:0> tf = Tempfile.new('csv')
=> #<File:/tmp/csv.1339.0>
irb(main):005:0> tf.puts File.readlines('filename.csv').grep(/Blah1/)
=> nil
irb(main):006:0> tf.close
=> nil
irb(main):007:0> tf.open
=> #<File:/tmp/csv.1339.0>
irb(main):008:0> puts tf.gets
this, is , a , test, Blah1
=> nil

Alex Gutteridge

Bioinformatics Center
Kyoto University
 

Michael Linfield

Alex said:
Sure, use tempfile, but I think botp has shown why you don't really
need the temporary file (unless there's part of this problem I'm not
understanding):

The reason I'm going to use a tempfile is because there's going to be,
per se, a Blah2, Blah3, etc. So I'm storing the grep from Blah1, Blah2, and
Blah3 in different files so I can later pull the data from those files
to graph each line with Ruport. Each line will represent Blah1, Blah2,
Blah3, etc. So unless I can shove the grep output into an array, with each
line being an element, I've got to use a tempfile. Maybe there's a better way?

If there's a better approach, I'm all ears :D

Thanks a ton btw, you all have been a tremendous help.
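
(One way to skip the temp files entirely, assuming the Blah tags are what distinguish the groups: collect the matching lines into a Hash of arrays keyed by tag, then hand each array to Ruport later. A minimal sketch:)

groups = Hash.new { |h, k| h[k] = [] }
File.foreach('filename.csv') do |line|
  groups[$1] << line if line =~ /(Blah\d+)/
end
# groups['Blah1'] is now an array with one matching line per element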
 

Alex Young

Michael said:
M. Edward (Ed) Borasky said:
Michael said:
### this sadly only returned an output of => []

any ideas?

Thanks!
OK ... first of all, define "huge" and what are your restrictions? Let
me assume the worst case just to get started -- more than 256 columns
and more than 65536 rows and you're on Windows. :)

Seriously, though, if this is a *recurring* use case rather than a
one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
tonight!" use case, I'd load it into a database (assuming your database
doesn't have a column count limitation larger than the column count in
your file, that is) and then hook up to it with DBI. But if it's a
one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
for being able to write that in Ruby and get it debugged before someone
who's been doing command-line for years types that one-liner in. :)

lol, alright, let's say the scenario will be in the range of 20k - 70k
lines of data, no more than 20 columns.
And I want to avoid using the command line to do this, because yes, in
fact, this will be used to process more than one datafile, which I hope
to set up with optparse so a command-line arg directs the program to
the file. Also, for the meantime, I wanted to avoid throwing it on any
database... avoiding DBI for the meanwhile. But an idea flew through my
head a few minutes ago... what if I did this --

res = []
res << File.readlines('filename.csv').grep(/Blah1/) #thanks chris

There's a problem with using File.readlines that I don't think anyone's
mentioned yet. I don't know if it's relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

The only way around that is to actually parse the file, so unless you
*know* that's not going to be a problem, I'd still suggest FasterCSV...
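
(A sketch of the FasterCSV version, which parses properly and so survives quoted commas and newlines; the filenames are placeholders:)

require 'rubygems'
require 'fastercsv'

FasterCSV.open('Blah1.csv', 'w') do |out|
  FasterCSV.foreach('filename.csv') do |row|
    # row is an array of parsed fields (nil for empty fields)
    out << row if row.any? { |field| field.to_s =~ /Blah1/ }
  end
end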
 

rio4ruby

The reason I'm going to use a tempfile is because there's going to be,
per se, a Blah2, Blah3, etc. So I'm storing the grep from Blah1, Blah2, and
Blah3 in different files so I can later pull the data from those files
to graph each line with Ruport. Each line will represent Blah1, Blah2,
Blah3, etc. So unless I can shove the grep output into an array, with each
line being an element, I've got to use a tempfile. Maybe there's a better way?

If there's a better approach, I'm all ears :D

Thanks a ton btw, you all have been a tremendous help.

Not sure if this is what you are looking for, but if you are trying
to search a file, placing the lines that contain "Blah1" into a file
called "Blah1.csv", the lines containing "Blah2" into a file called
"Blah2.csv", etc., Rio might help:

require 'rio'
rio('filename.csv').chomp.lines(/Blah[^,]*/) do |line, m|
  rio(m) + '.csv' << line + $/
end

Then again it might not.
 

Peña, Botp

From: rio4ruby [mailto:[email protected]]
# require 'rio'
# rio('filename.csv').chomp.lines(/Blah[^,]*/) do |line, m|
#   rio(m) + '.csv' << line + $/
# end

Simply amazing. BTW, how does Rio handle big files? Does it load them
whole in memory?

Thanks for Rio.
Kind regards -botp
 

rio4ruby

From: rio4ruby [mailto:[email protected]]
# require 'rio'
# rio('filename.csv').chomp.lines(/Blah[^,]*/) do |line, m|
#   rio(m) + '.csv' << line + $/
# end

Simply amazing. BTW, how does Rio handle big files? Does it load them
whole in memory?

Thanks for Rio.
Kind regards -botp

Never. Examples that assume I have a file small enough to load into
memory irritate me. :)
 

Kaldrenon

There's a problem with using File.readlines that I don't think anyone's
mentioned yet. I don't know if it's relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

The only way around that is to actually parse the file, so unless you
*know* that's not going to be a problem, I'd still suggest FasterCSV...

I guess the following is slightly OT, since the OP is talking about
grabbing whole lines, but it's a CSV-relevant question:

Another danger with not using CSV packages (as I've learned from my
present non-use of said packages) is that quoted elements in a row can
contain commas. I'm a fairly inexperienced programmer, and not just in
Ruby, and I haven't yet figured out an elegant way to break this down.

For example:

foo,bar,"foo,bar" is a three-column row in a CSV file, but using
split(/,/) on it will, of course, return ["foo","bar","\"foo","bar
\""], an array of size four. What's an efficient, elegant way of
gathering quoted columns? I'm sure FasterCSV is capable of dealing
with such things, but it's something I'd like to actually learn,
rather than finding a way around it.
 

Alex Young

Kaldrenon said:
There's a problem with using File.readlines that I don't think anyone's
mentioned yet. I don't know if it's relevant to your dataset, but CSV
fields are allowed to contain newlines if the field is quoted. For
example, this single CSV row will break your process:

1,2,"foo
Blah1",bar

The only way around that is to actually parse the file, so unless you
*know* that's not going to be a problem, I'd still suggest FasterCSV...

I guess the following is slightly OT, since the OP is talking about
grabbing whole lines, but it's a CSV-relevant question:

Another danger with not using CSV packages (as I've learned from my
present non-use of said packages) is that quoted elements in a row can
contain commas. I'm a fairly inexperienced programmer, and not just in
Ruby, and I haven't yet figured out an elegant way to break this down.

For example:

foo,bar,"foo,bar" is a three-column row in a CSV file, but using
split(/,/) on it will, of course, return ["foo","bar","\"foo","bar
\""], an array of size four. What's an efficient, elegant way of
gathering quoted columns?
If you ignore that the quote character can also appear inside column
data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That breaks at least for empty fields, fields with newlines, and fields
with '"' in them.
I'm sure FasterCSV is capable of dealing
with such things, but it's something I'd like to actually learn,
rather than finding a way around it.
That's just it - the cleanest way to handle this problem is to just take
advantage of someone else's work. CSV looks just trivial enough to the
human eye that you can convince yourself "Oh, it can't be *that*
hard"... but it's still complicated enough that you get tired of coding
around the fiddly little details very, very quickly. Besides, when a
library is as easy to use as either CSV *or* FasterCSV (my preference is
for the latter, but it's not in stdlib), what's the point in stressing?
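
(For comparison, the library call really is a one-liner; both the stdlib CSV and FasterCSV expose parse_line:)

require 'rubygems'
require 'fastercsv'

FasterCSV.parse_line('foo,bar,"foo,bar"')
# => ["foo", "bar", "foo,bar"]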
 

Kaldrenon

If you ignore that the quote character can also appear inside column
data, then this will work, ishkinda.

'foo,bar,"foo,bar"'.scan(/("[^"]+")|([^,]+)/).flatten.compact
=> ["foo", "bar", "\"foo,bar\""]

That's very cool, granting that there are still a number of conditions
for which it'll go bust. But I'm a little confused about how a pattern
like [^,]+ gets an element, given that (unless I'm mistaken) in a
standard regexp, it would only match a string that contained a
series of commas at the beginning of a line, like ",,," or "abc\n,,".
What's my mistake/confusion here?

Besides, when a
library is as easy to use as either CSV *or* FasterCSV (my preference is
for the latter, but it's not in stdlib), what's the point in stressing?

The only point is in learning - I like to be able to use packages
without the effort that went into making them, but I don't really like
jumping from DIY to "Here's a way, and don't worry about how it
works...it's magic!"

Thanks for the info,
Andrew
 

Phrogz

[...] I'm a little confused about how a pattern
like [^,]+ gets an element, given that (unless I'm mistaken) in a
standard regexp, it would only match a string that contained a
series of commas at the beginning of a line, like ",,," or "abc\n,,".
What's my mistake/confusion here?

/,+/ # matches one or more comma characters
/[z,]+/ # matches one or more z or comma characters intermixed
/[^,]+/ # matches one or more characters that aren't a comma

Read up on character classes in regexps.
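
(A quick irb session makes the difference concrete:)

irb(main):001:0> "abc,def,ghi".scan(/,+/)
=> [",", ","]
irb(main):002:0> "abc,def,ghi".scan(/[^,]+/)
=> ["abc", "def", "ghi"]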
 
