Parse csv similar file

Rebhan, Gilbert · Feb 6, 2007

Hi,

<newbie>

i have a txtfile with a format like that =

AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
...

i want to get a collection for every E followed by digits,
so with the example above, i want to get =

collections:
E023889
E052337
E050441
...

each collection should contain datasets with the rest of the line, so
f.e.
E023889 would have =

[AP850KP;INCLIB;AP013;240107;0730,AP850SDI;INCLIB;AP013;240107;0730]

questions=
what kind of collection is the best ? is an array sufficient ?

right now i have =

efas=Array.new
File.open("mycsvfile", "r").each do |line|
  if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/

    efas<<$3.to_s<<',' unless efas.include?($3.to_s)

  end
end
puts efas.to_s.chop

So i have all E\d+, but how to get further ?

Are there better ways as regular expressions ?
Any ideas ?

<newbie/>

Regards, Gilbert

Brian Candler · Feb 6, 2007

questions=
what kind of collection is the best ? is an array sufficient ?

Depends what you want to do with it. If you want to be able to find an entry
E123456 quickly, then you'd use a hash. If you want to keep only the
first/last entry for a particular key (as it seems you do), using a hash
speeds things up here too.

right now i have =

efas=Array.new
File.open("mycsvfile", "r").each do |line|
if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/

efas<<$3.to_s<<',' unless efas.include?($3.to_s)

end
end
puts efas.to_s.chop

Try:

efas = Hash.new
...
efas[$3] = [$1,$2,$4,$5,$6] unless efas.has_key?($3)
...
puts efas.inspect

Are there better ways as regular expressions ?

You could look at String#split instead
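
A minimal sketch of the String#split idea, assuming every line has exactly six semicolon-separated fields with the E number in third position (the variable names are only illustrative):

```ruby
# Sample line taken from the original post.
line = "AP850SD$;INCLIB;E052337;AP013;240107;0730\n"

# chomp drops the trailing newline; split(';') returns the six fields.
fields = line.chomp.split(';')

# Take the third field (the E number) out and keep the rest.
key  = fields.delete_at(2)
rest = fields

p key   #=> "E052337"
p rest  #=> ["AP850SD$", "INCLIB", "AP013", "240107", "0730"]
```

Unlike the \w+ regular expression, split does not care that the first field contains a $ character.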

HTH,

Brian.

Rebhan, Gilbert · Feb 6, 2007

Hi,

-----Original Message-----
From: Brian Candler
Sent: Tuesday, February 06, 2007 3:37 PM
To: ruby-talk ML
Subject: Re: Parse csv similar file

what kind of collection is the best ? is an array sufficient ?

/*
Depends what you want to do with it. If you want to be able to find an
entry
E123456 quickly, then you'd use a hash. If you want to keep only the
first/last entry for a particular key (as it seems you do), using a hash
speeds things up here too.
*/

i don't need to find all entries E..... , but collect all datas
that belong to the different E.....

i want a collection for every E... that occurs, with all the lines
(except the E... itself) that contain that E in it

/*
Try:

efas = Hash.new
...
efas[$3] = [$1,$2,$4,$5,$6] unless efas.has_key?($3)
...
puts efas.inspect
*/

that gives me only one dataset in the hash, but there are more
entries that have E123456 in it.

Regards, Gilbert

Brian Candler · Feb 6, 2007

what kind of collection is the best ? is an array sufficient ?

/*
Depends what you want to do with it. If you want to be able to find an
entry
E123456 quickly, then you'd use a hash. If you want to keep only the
first/last entry for a particular key (as it seems you do), using a hash
speeds things up here too.
*/

i don't need to find all entries E..... , but collect all datas
that belong to the different E.....

i want a collection for every E... that occurs, with all the lines
(except the E... itself) that contain that E in it

/*
Try:

efas = Hash.new
...
efas[$3] = [$1,$2,$4,$5,$6] unless efas.has_key?($3)
...
puts efas.inspect
*/

that gives me only one dataset in the hash, but there are more
entries that have E123456 in it.

I was just following your original example, which only kept the first line
for a particular E key.

If you want to keep them all, then I'd use a hash with each element being an
array.

efas[$3] ||= [] # create empty array if necessary
efas[$3] << [$1,$2,$4,$5,$6] # add a new line

So, given the following input

aaa;bbb;E123;ddd;eee;fff
ggg;hhh;E123;iii;jjj;kkk

you should get

efas = {
  "E123" => [
    ["aaa","bbb","ddd","eee","fff"],
    ["ggg","hhh","iii","jjj","kkk"],
  ],
}

puts efas["E123"].size # 2
puts efas["E123"][0][3] # "eee"
puts efas["E123"][1][3] # "jjj"

In practice, to make it easier to manipulate this data, you'd probably want
to create a class to represent each object, rather than using a 5-element
array.

You would give each attribute a sensible name. I don't know what these
values mean, so I've just called them a to e here.

class Myclass
  attr_accessor :a, :b, :c, :d, :e
  def initialize(a, b, c, d, e)
    @a = a
    @b = b
    @c = c
    @d = d
    @e = e
  end
end

...
efas[$3] ||= []
efas[$3] << Myclass.new($1,$2,$4,$5,$6)
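
As a shorter alternative to a hand-written class, here is a sketch using Struct. The field names (name, library, user, date, time) are guesses made up for illustration, and String#split stands in for the regular expression:

```ruby
# Field names are assumptions for illustration; rename them to match the data.
Entry = Struct.new(:name, :library, :user, :date, :time)

efas = {}

sample = <<~DATA
  AP850KP;INCLIB;E023889;AP013;240107;0730
  AP850SDI;INCLIB;E023889;AP013;240107;0730
DATA

sample.each_line do |line|
  name, library, ticket, user, date, time = line.chomp.split(';')
  efas[ticket] ||= []                                  # create empty array if necessary
  efas[ticket] << Entry.new(name, library, user, date, time)
end

puts efas["E023889"].size      # 2
puts efas["E023889"][1].name   # AP850SDI
```

Struct generates the accessors and the constructor, so each entry can be read by name instead of by array index.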

HTH,

Brian.

Phrogz · Feb 6, 2007

i have a txtfile with a format like that =

AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
..

i want to get a collection for every E followed by digits,
so with the example above, i want to get =

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] = data
}
p lookup[ "E050441" ]
#=> ["AP850SDS", "INCLIB", "E050441", "AP013", "240107", "0730"]
__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730

Drew Olson · Feb 6, 2007

Gavin said:
i want to get a collection for every E followed by digits,
so with the example above, i want to get =

lines = DATA.readlines.map{ |line|
line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
key = data.find{ |value| /^E/ =~ value }
lookup[ key ] = data
}
p lookup[ "E050441" ]
#=> ["AP850SDS", "INCLIB", "E050441", "AP013", "240107", "0730"]
__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730

I think he wants to append this array with information each time he sees
the same key, so modify your code like so:

lines = DATA.readlines.map{ |line|
  line.chomp.split( ';' )
}
lookup = {}
lines.each{ |data|
  key = data.find{ |value| /^E/ =~ value }
  lookup[ key ] ||= []
  lookup[ key ] << data
}

Gregory Brown · Feb 6, 2007

Hi,

<newbie>

i have a txtfile with a format like that =

AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
...

i want to get a collection for every E followed by digits,
so with the example above, i want to get =

collections:
E023889
E052337
E050441
...

each collection should contain datasets with the rest of the line, so
f.e.
E023889 would have =

[AP850KP;INCLIB;AP013;240107;0730,AP850SDI;AP013;240107;0730]

questions=
what kind of collection is the best ? is an array sufficient ?

Just for fun, here's a Ruport example:

require "rubygems"
require "ruport"
DATA = <<-EOS
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
EOS

table = Ruport::Data::Table.parse(DATA, :has_names => false,
                                  :csv_options=>{:col_sep=>";"})

table.column_names = %w[c1 c2 c3 c4 c5 c6] # BUG! you shouldn't need colnames

e = table.column(2).uniq
e.each { |x| table.create_group(x) { |r| r[2].eql?(x) } }

groups = table.groups

groups.attributes
#=> ["E023889", "E052337", "E050441"]
groups["E023889"].map { |r| r[0] }
#=> ["AP850KP", "AP850SDI"]
groups.each { |t| p t[0].c1 }
# prints:
# "AP850KP"
# "AP850SD$"
# "AP850SDA"

===============

note that in making this example, I found a small bug in Ruport's
grouping support which I will fix

Phrogz · Feb 6, 2007

I think he wants to append this array with information each time he sees
the same key, so modify your code like so:

lines = DATA.readlines.map{ |line|
line.chomp.split( ';' )}

lookup = {}
lines.each{ |data|
key = data.find{ |value| /^E/ =~ value }
lookup[ key ] ||= []
lookup[ key ] << data

}

Curses, I didn't read carefully enough. Right you are. (And, though
it's not clear from his example, he might not even need to split the
original line into arrays of pieces, but just keep the lines.)

Phrogz · Feb 6, 2007

I think he wants to append this array with information each time he sees
the same key [...]

So here's another version:

lookup = Hash.new{ |h,k| h[k]=[] }

DATA.each_line{ |line|
  line.chomp!
  warn "No key in '#{line}'" unless key = line[ /\bE\w+/ ]
  lookup[ key ] << line
}

p lookup[ "E050441" ]
#=> ["AP850SDA;INCLIB;E050441;AP013;240107;0730",
#=>  "AP850SDS;INCLIB;E050441;AP013;240107;0730"]

require 'pp'
pp lookup
#=> {"E050441"=>
#=> ["AP850SDA;INCLIB;E050441;AP013;240107;0730",
#=> "AP850SDS;INCLIB;E050441;AP013;240107;0730"],
#=> "E052337"=>
#=> ["AP850SD$;INCLIB;E052337;AP013;240107;0730",
#=> "AP850SDO;INCLIB;E052337;AP013;240107;0730"],
#=> "E023889"=>
#=> ["AP850KP;INCLIB;E023889;AP013;240107;0730",
#=> "AP850SDI;INCLIB;E023889;AP013;240107;0730"]}

__END__
AP850KP;INCLIB;E023889;AP013;240107;0730
AP850SD$;INCLIB;E052337;AP013;240107;0730
AP850SDA;INCLIB;E050441;AP013;240107;0730
AP850SDI;INCLIB;E023889;AP013;240107;0730
AP850SDO;INCLIB;E052337;AP013;240107;0730
AP850SDS;INCLIB;E050441;AP013;240107;0730
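
For readers new to the block form of Hash.new used above, a small sketch comparing it with the explicit ||= step from the earlier posts (the sample values are made up):

```ruby
# With a default block, a missing key gets an empty array on first access.
auto = Hash.new { |hash, key| hash[key] = [] }
auto["E023889"] << "first line"
auto["E023889"] << "second line"

# A plain hash needs the explicit ||= step before appending.
plain = {}
plain["E023889"] ||= []
plain["E023889"] << "first line"
plain["E023889"] << "second line"

p auto == plain           #=> true
p auto["E999999"]         #=> []
p auto.key?("E999999")    #=> true
```

The last two lines show the one surprise: merely reading a missing key stores an empty array under it, so use key? or fetch when you only want to test for presence.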

Rebhan, Gilbert · Feb 7, 2007

Hi,

-----Original Message-----
From: Phrogz
Sent: Tuesday, February 06, 2007 6:00 PM
To: ruby-talk ML
Subject: Re: Parse csv similar file

I think he wants to append this array with information each time he sees
the same key [...]

i still don't know how to go, so here some more notes ...

i get a folder

/timestamp
metafile.txt
/INCLIB
/PLI

metafile looks like that =
APLVZDT;INCLIB;E050441;AP013;240107;0730
AP400ER;INCLIB;E023889;AP013;240107;0730
AP540RBP;INCLIB;E052337;AP013;240107;0730
AP700PA;INCLIB;E050441;AP013;240107;0730
... more lines

field 1 is a filename
field 2 is a foldername, shows whether path is /INCLIB/file or /PLI/file
field 3 is a ticketnr
field 4 is a username
field 5 is a date
field 6 is a timestamp

i need to parse the metafile and =

1. create a folderstructure for every ticketnr that occurs, f.e.

/E050441
/INCLIB
/PLI

and put all the files that belong to that ticket
(means the line with the filename contains that ticketnr)
in the subfolder which is field 2

2. create a file in the root of the /ticketnr folder
which contains the rest of a dataset (line), means =

field 4
field 5
field 6

which are the same for every file with the same ticketnr

the format might look like

user=...
date=...
time=...

have to decide it later.

I thought with =

File.open("mycsvfile", "r").each do |line|
  if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/

    efas<<$3.to_s<<',' unless efas.include?($3.to_s)

i get an array with all ticketnr
then i create a folderstructure for every index in that array
and put the files in it, but i don't get it.

Any ideas ?

Regards, Gilbert

Brian Candler · Feb 7, 2007

i get a folder

/timestamp
metafile.txt
/INCLIB
/PLI

metafile looks like that =
APLVZDT;INCLIB;E050441;AP013;240107;0730
AP400ER;INCLIB;E023889;AP013;240107;0730
AP540RBP;INCLIB;E052337;AP013;240107;0730
AP700PA;INCLIB;E050441;AP013;240107;0730
... more lines

field 1 is a filename
field 2 is a foldername, shows whether path is /INCLIB/file or /PLI/file
field 3 is a ticketnr
field 4 is a username
field 5 is a date
field 6 is a timestamp

i need to parse the metafile and =

1. create a folderstructure for every ticketnr that occurs, f.e.

/E050441
/INCLIB
/PLI

and put all the files that belong to that ticket
(means the line with the filename contains that ticketnr)
in the subfolder which is field 2

2. create a file in the root of the /ticketnr folder
which contains the rest of a dataset (line), means =

field 4
field 5
field 6

which are the same for every file with the same ticketnr

the format might look like

user=...
date=...
time=...

have to decide it later.

I thought with =

File.open("mycsvfile", "r").each do |line|
if line =~ /(\w+.?);(\w+);(\w+);(\w+);(\w+);(\w+)/

efas<<$3.to_s<<',' unless efas.include?($3.to_s)

i get an array with all ticketnr
then i create a folderstructure for every index in that array
and put the files in it, but i don't get it.

Any ideas ?

I'd do all the work on-the-fly. Untested code:

require 'fileutils'
SRCDIR="/path_to_src"
DSTDIR="/path_to_dst"

def copy_ticket(filename, folder, ticket, user, date, time)
  srcdir = SRCDIR + File::SEPARATOR + folder
  dstdir = DSTDIR + File::SEPARATOR + ticket + File::SEPARATOR + folder
  FileUtils.mkdir_p(dstdir)
  FileUtils.cp(srcdir + File::SEPARATOR + filename,
               dstdir + File::SEPARATOR + filename)

  # write out status file (only if it does not exist yet)
  statusfile = dstdir + File::SEPARATOR + "status.txt"
  unless File.exist?(statusfile)
    File.open(statusfile, "w") do |sf|
      sf.puts "user=#{user}"
      sf.puts "date=#{date}"
      sf.puts "time=#{time}"
    end
  end
end

def process_meta(f)
  f.each_line do |line|
    # skip lines that do not consist of six ;-separated word fields
    next unless line =~ /^(\w+);(\w+);(\w+);(\w+);(\w+);(\w+)$/
    copy_ticket($1,$2,$3,$4,$5,$6)
  end
end

# Main program
File.open("mycsvfile") do |f|
  process_meta(f)
end

If you want to build up a hash of ticket IDs seen, you can do that in
process_meta as well. I'd pass in an empty hash, and update it in the
each_line loop.
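
A sketch of that variation, not a drop-in replacement: it passes the source and destination directories as parameters, uses File.join and String#split, collects the filenames per ticket in a hash called seen (a name chosen here for illustration), and writes status.txt into the root of the ticket folder as described in point 2 of the question. The demo runs in a temporary directory with dummy files.

```ruby
require 'fileutils'
require 'tmpdir'

# Copies one file into <dst>/<ticket>/<folder>/ and writes status.txt
# once per ticket, in the root of the ticket folder.
def copy_ticket(src, dst, filename, folder, ticket, user, date, time)
  ticketdir = File.join(dst, ticket)
  dstdir    = File.join(ticketdir, folder)
  FileUtils.mkdir_p(dstdir)
  FileUtils.cp(File.join(src, folder, filename), File.join(dstdir, filename))

  statusfile = File.join(ticketdir, "status.txt")
  return if File.exist?(statusfile)
  File.open(statusfile, "w") do |sf|
    sf.puts "user=#{user}", "date=#{date}", "time=#{time}"
  end
end

# Returns a hash mapping each ticket number to the filenames seen for it.
def process_meta(io, src, dst, seen = {})
  io.each_line do |line|
    fields = line.chomp.split(';')
    next unless fields.size == 6          # skip malformed lines
    filename, _folder, ticket = fields
    copy_ticket(src, dst, *fields)
    (seen[ticket] ||= []) << filename
  end
  seen
end

# Demo in a throwaway directory so nothing real is touched.
seen = nil
status = nil
Dir.mktmpdir do |root|
  src = File.join(root, "src")
  dst = File.join(root, "dst")
  FileUtils.mkdir_p(File.join(src, "INCLIB"))
  %w[APLVZDT AP400ER AP700PA].each do |name|
    File.write(File.join(src, "INCLIB", name), "dummy")
  end

  meta = "APLVZDT;INCLIB;E050441;AP013;240107;0730\n" \
         "AP400ER;INCLIB;E023889;AP013;240107;0730\n" \
         "AP700PA;INCLIB;E050441;AP013;240107;0730\n"
  seen   = process_meta(meta, src, dst)
  status = File.read(File.join(dst, "E050441", "status.txt"))
end

p seen.keys         #=> ["E050441", "E023889"]
p seen["E050441"]   #=> ["APLVZDT", "AP700PA"]
puts status
```

process_meta accepts anything that responds to each_line, so the same method works with an open File or with a String as in the demo.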

HTH,

Brian.

Rebhan, Gilbert · Feb 7, 2007

Hi,

-----Original Message-----
From: Brian Candler
Sent: Wednesday, February 07, 2007 10:41 AM
To: ruby-talk ML
Subject: Re: Parse csv similar file

thanks Brian, works like a charm

i had to add the Extension .txt (this may be altered)
to the filename and did it like that =

require 'fileutils'
SRCDIR="/path_to_src"
DSTDIR="/path_to_dst"
#EXT=".extension"
EXT=".txt"

def copy_ticket(filename, folder, ticket, user, date, time)
  srcdir = SRCDIR + File::SEPARATOR + folder
  dstdir = DSTDIR + File::SEPARATOR + ticket + File::SEPARATOR + folder
  filename=filename<<EXT
...

is there a better way ?

what a pity it doesn't work with jruby 0.9.2

Have to go with jruby as using it in an ant script
with the <script> task

jruby gives no error, it just doesn't work, nothing happens ?!

Possible workaround =

i can create an executable via rubyscript2exe.rb and call
that .exe in my antscript.

But therefore the .exe has to accept the parameters

SRCDIR, DSTDIR,EXT when calling it

<exec ...>
<arg line="SRCDIR DSTDIR EXT"/>
</exec>

How to alter your class to achieve that ?

Regards, Gilbert

P.S. :
i hope you are open for stupid questions here on the list ;-),
as i'm quite new to ruby (used it for some months, but only for small
purposes in ant scripts), coming from java.

Rebhan, Gilbert · Feb 7, 2007

Hi,

-----Original Message-----
From: Rebhan, Gilbert
Sent: Wednesday, February 07, 2007 11:28 AM
To: ruby-talk ML
Subject: Re: Parse csv similar file

/*
But therefore the .exe has to accept the parameters

SRCDIR, DSTDIR,EXT when calling it

<exec ...>
<arg line="SRCDIR DSTDIR EXT"/>
</exec>

How to alter your class to achieve that ?
*/

OK, it works like that =

require 'fileutils'
SRCDIR=ARGV[0]
DSTDIR=ARGV[1]
EXT=ARGV[2]

converting *.rb to *.exe and call it
*.exe "/path_to_src" "/path_to_dst" ".extension"
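
A small sketch of how the three arguments could be checked before use. The helper parse_args and its usage message are hypothetical, not part of the script above; in the real script it would be called with ARGV.

```ruby
# Hypothetical helper: turns the argument list into named settings and
# fails early with a usage message when an argument is missing.
def parse_args(argv)
  unless argv.size == 3
    raise ArgumentError, "usage: script SRCDIR DSTDIR EXT"
  end
  srcdir, dstdir, ext = argv
  ext = ".#{ext}" unless ext.start_with?(".")   # accept "txt" as well as ".txt"
  { :srcdir => srcdir, :dstdir => dstdir, :ext => ext }
end

settings = parse_args(["/path_to_src", "/path_to_dst", "txt"])
puts settings[:srcdir]   # /path_to_src
puts settings[:ext]      # .txt
```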

thanks a lot for your help !!

Regards, Gilbert

Brian Candler · Feb 7, 2007

i had to add the Extension .txt (this may be altered)
to the filename and did it like that =

require 'fileutils'
SRCDIR="/path_to_src"
DSTDIR="/path_to_dst"
#EXT=".extension"
EXT=".txt"

def copy_ticket(filename, folder, ticket, user, date, time)
srcdir = SRCDIR + File::SEPARATOR + folder
dstdir = DSTDIR + File::SEPARATOR + ticket + File::SEPARATOR + folder
filename=filename<<EXT
...

is there a better way ?

That's OK, just beware that the way you've done it you've modified the
string which was passed in. e.g.

a="foobar"
copy_ticket(a, "/tmp", "E123", "x", "y", "z")
puts a

will print "foobar.txt"

To avoid that:

filename = filename + EXT

(which creates a new String object, and then updates the local variable
'filename' to point to this new object)
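
The difference can be seen in a self-contained sketch (the method names with_append and with_plus are made up for the comparison):

```ruby
EXT = ".txt"

def with_append(name)
  name << EXT    # mutates the caller's string in place
end

def with_plus(name)
  name + EXT     # builds a new string; the caller's string is untouched
end

a = "foobar".dup # dup so this also works when string literals are frozen
b = "foobar".dup

with_append(a)
result = with_plus(b)

puts a       # foobar.txt
puts b       # foobar
puts result  # foobar.txt
```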

This is an interesting "small" file-chomping task. I wonder what the
equivalent Java program would look like

B.

Rebhan, Gilbert · Feb 7, 2007

Hi,

filename=filename<<EXT
...

is there a better way ?

/*
That's OK, just beware that the way you've done it you've modified the
string which was passed in. e.g.
...
*/

yup, i know, but somewhere i read that
string concatenation via << would be better/quicker as +
because no new String object gets created.

Regards, Gilbert

Erik Veenstra · Feb 7, 2007

Just an idea...

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

hash =
File.open("input.txt") do |f|
  f.readlines.collect do |line|
    k = line.scan(/;(E\d+);/).flatten.shift
    v = line.scan(/;E\d+;(.*)/).flatten.shift

    [k, v]
  end.select do |k, v|
    k and v
  end.inject({}) do |h, (k, v)|
    (h[k] ||= []) << v ; h
  end.inject({}) do |h, (k, v)|
    h[k] = v.join(",") ; h
  end
end

p hash

----------------------------------------------------------------

Erik Veenstra · Feb 7, 2007

Nice abstraction... ;]

(By heart: This group_by is part of one of the Rails packages.)

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

module Enumerable
  def hash_by(&block)
    inject({}){|h, o| (h[block[o]] ||= []) << o ; h}
  end

  def group_by(&block)
    #hash_by(&block).values
    hash_by(&block).sort.transpose.pop
  end
end

hash =
File.open("input.txt") do |f|
  f.readlines.group_by do |line|
    line.scan(/;(E\d+);/)
  end.collect do |group|
    group.collect do |string|
      string.scan(/;E\d+;(.*)/).flatten.shift
    end.join(",")
  end
end

p hash

----------------------------------------------------------------
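
Current Ruby versions ship Enumerable#group_by in the core library, so the same grouping can be sketched without reopening Enumerable. Note that the built-in returns a hash of key => members rather than an array of groups, so this is a variation on the code above and not an exact equivalent. The sample lines are taken from the original post; transform_values needs Ruby 2.4 or later.

```ruby
lines = [
  "AP850KP;INCLIB;E023889;AP013;240107;0730",
  "AP850SD$;INCLIB;E052337;AP013;240107;0730",
  "AP850SDI;INCLIB;E023889;AP013;240107;0730",
]

# Group whole lines by their E number (first capture of the regexp).
groups = lines.group_by { |line| line[/;(E\d+);/, 1] }

# Keep only the text after the E number and join each group with commas.
rest = groups.transform_values do |members|
  members.map { |line| line[/;E\d+;(.*)/, 1] }.join(",")
end

puts groups["E023889"].size   # 2
puts rest["E023889"]          # AP013;240107;0730,AP013;240107;0730
```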
