Sorting a logfile, how would you write it?

Frank Meyer

I've written a little ruby program which can sort logfiles with the
following format:

4.text text text
1.text text text
2.text text text
10.text text text
2.text2 text2 text2

The file is given as a command line parameter and after sorting the
entries it writes them back into this file.

The program is in the attachment.


What I want to know is: how would you write such a tool in Ruby? I'm
asking this because I'm still learning Ruby and I want to learn how to
do it in Ruby (and its design principles).


Thank you!



Turing

Attachments:
http://www.ruby-forum.com/attachment/86/test.rb
 
William James


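File.open( ARGV.first, "r+" ){|file|
  array = file.readlines   # slurp every line into memory
  file.rewind
  file.truncate(0)         # empty the file before writing it back
  file.puts array.sort_by{|s| s[/^\d+/].to_i }   # order by the leading number
}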
 
Ryan Davis

I've written a little ruby program which can sort logfiles with the
following format:

4.text text text
1.text text text
2.text text text
10.text text text
2.text2 text2 text2
...
File.open( ARGV.first, "r+" ){|file|
array = file.readlines
file.rewind
file.truncate(0)
file.puts array.sort_by{|s| s[/^\d+/].to_i }
}

Your version takes a lot of memory, is slow, and doesn't properly
sort the content of the line, just the number. Swap the two "2."
lines and you'll see what I mean. Using the right tool for the job
(`sort`) does wonders:

% ruby -e 'n = 1_000_000; File.open("blah.txt", "w") { |f| n.times { m = rand 5; f.puts "#{rand n}. file#{m} file#{m} file#{m}" } }'
% cp blah.txt blah2.txt
% time ruby -e 'File.open( ARGV.first, "r+" ) { |file| array = file.readlines; file.rewind; file.truncate(0); file.puts array.sort_by {|s| s[/^\d+/].to_i } }' blah.txt
real 0m8.182s ...
% time ruby -e 'path = ARGV.shift; system %(sort -n "#{path}" > "#{path}.tmp"); File.rename "#{path}.tmp", path' blah2.txt
real 0m3.175s ...
% cmp blah.txt blah2.txt
blah.txt blah2.txt differ: char 50, line 3
% head blah.txt blah2.txt
==> blah.txt <==
3. file4 file4 file4
4. file4 file4 file4
6. file3 file3 file3
6. file1 file1 file1
6. file0 file0 file0
7. file0 file0 file0
7. file4 file4 file4
8. file1 file1 file1
8. file3 file3 file3
8. file3 file3 file3

==> blah2.txt <==
3. file4 file4 file4
4. file4 file4 file4
6. file0 file0 file0
6. file1 file1 file1
6. file3 file3 file3
7. file0 file0 file0
7. file4 file4 file4
8. file1 file1 file1
8. file3 file3 file3
8. file3 file3 file3
532 %
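
The tie-breaking visible in the `sort -n` output above can be approximated in pure Ruby by sorting on a compound key. This is only a minimal sketch, assuming every line starts with a decimal number; the helper name `sort_log_lines` is hypothetical and not from the thread, and Ruby compares the strings bytewise, which may differ from the locale-aware collation `sort` uses.

```ruby
# Hypothetical helper: order lines by their leading number first, then by
# the full line text, so lines with equal numbers come out in a fixed order.
def sort_log_lines(lines)
  lines.sort_by { |line| [line[/^\d+/].to_i, line] }
end

lines = [
  "4.text text text\n",
  "2.text2 text2 text2\n",
  "10.text text text\n",
  "2.text text text\n",
  "1.text text text\n"
]
puts sort_log_lines(lines)
```

With this key, swapping the two "2." lines in the input no longer changes the result.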
 
Robert Klemme

I've written a little ruby program which can sort logfiles with the
following format:

4.text text text
1.text text text
2.text text text
10.text text text
2.text2 text2 text2
...
File.open( ARGV.first, "r+" ){|file|
array = file.readlines
file.rewind
file.truncate(0)
file.puts array.sort_by{|s| s[/^\d+/].to_i }
}


It's a one liner:

ruby -i.bak -e 'puts ARGF.readlines.sort_by {|l| l[/^\d+/].to_i}' file

Less memory usage:

ruby -i.bak -e 'puts ARGF.readlines.sort! {|a,b| a[/^\d+/].to_i <=> b[/^\d+/].to_i}' file

Kind regards

robert
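
The memory difference between the two one-liners comes from how `sort_by` works: it computes each key once and keeps the keys in a temporary array next to the lines, whereas `sort!` with a block sorts in place and recomputes the key on every comparison. The following is a rough illustration only, with made-up sample data; the hand-written "decorate, sort, undecorate" version approximates what `sort_by` does and is not Ruby's actual implementation.

```ruby
lines = ["4.text\n", "1.text\n", "10.text\n", "2.text\n"]
key = ->(line) { line[/^\d+/].to_i }

# sort_by: each key is computed once, at the cost of a temporary array.
by_key = lines.sort_by(&key)

# Approximately the same thing written out by hand:
# pair each line with its key, sort the pairs, then drop the keys again.
decorated = lines.map { |line| [key.call(line), line] }
by_hand = decorated.sort { |a, b| a.first <=> b.first }.map(&:last)

# sort! with a block: no extra array of keys, but the regexp runs
# twice per comparison instead of once per line.
in_place = lines.dup
in_place.sort! { |a, b| key.call(a) <=> key.call(b) }

puts by_key
```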
 
William James

I've written a little ruby program which can sort logfiles with the
following format:
4.text text text
1.text text text
2.text text text
10.text text text
2.text2 text2 text2
...
File.open( ARGV.first, "r+" ){|file|
array = file.readlines
file.rewind
file.truncate(0)
file.puts array.sort_by{|s| s[/^\d+/].to_i }
}

your version takes a lot of memory,

Wrong.

When the number of lines to sort is small,
it uses a small amount of memory.
When the number of lines to sort is medium,
it uses a medium amount of memory.
When the number of lines to sort is large,
it uses a large amount of memory.

Everything is relative. If its speed is compared to the
speed of other versions written in scripting languages, it
is not slow.

and doesn't properly
sort the content of the line,

Wrong.

Looking at the source code of the original poster immediately
reveals that he wants to sort only on the number at the
beginning of the line.

just the number. swap the two "2."
lines and you'll see what I mean. Using the right tool for the job
(`sort`) does wonders:

% ruby -e 'n = 1_000_000; File.open("blah.txt", "w") { |f| n.times { m = rand 5; f.puts "#{rand n}. file#{m} file#{m} file#{m}" } }'
% cp blah.txt blah2.txt
% time ruby -e 'File.open( ARGV.first, "r+" ) { |file| array = file.readlines; file.rewind; file.truncate(0); file.puts array.sort_by {|s| s[/^\d+/].to_i } }' blah.txt
real 0m8.182s ...
% time ruby -e 'path = ARGV.shift; system %(sort -n "#{path}" > "#{path}.tmp"); File.rename "#{path}.tmp", path' blah2.txt

Wrong.

The original poster stated:
The file is given as a command line parameter and after sorting the
entries it writes them back into this file.

Your code makes no attempt to write to the original file; it uses
a temporary file.

Furthermore, your solution won't even run:

E:\Ruby>ruby -e 'path = ARGV.shift; system %(sort -n "#{path}"
"#{path}.tmp"); File.rename "#{path}.tmp", path' data
-e:1: unterminated string meets end of file

If your code is put in a file ...

E:\Ruby>ruby try.rb data
Input file specified two times.

.... it still won't work.

Perhaps your attempt at a solution requires Unix, and you,
in your ignorance, or your thoughtlessness, or your
ignorance and your thoughtlessness, assumed that every
user of Ruby is a user of Unix.
 
William James

On Aug 10, 2007, at 13:54 , William James wrote:
I've written a little ruby program which can sort logfiles with the
following format:
4.text text text
1.text text text
2.text text text
10.text text text
2.text2 text2 text2
...
File.open( ARGV.first, "r+" ){|file|
array = file.readlines
file.rewind
file.truncate(0)
file.puts array.sort_by{|s| s[/^\d+/].to_i }
}

It's a one liner:

ruby -i.bak -e 'puts ARGF.readlines.sort_by {|l| l[/^\d+/].to_i}' file

It's my understanding that when you use -i, a temporary file
is created, the original file is deleted, and the temporary
file is renamed. Doesn't this cause unnecessary disk
fragmentation?

Less memory usage:

ruby -i.bak -e 'puts ARGF.readlines.sort! {|a,b| a[/^\d+/].to_i <=> b[/^\d+/].to_i}' file

Of course, you're trading speed for memory.
 
Eric Hodel

File.open( ARGV.first, "r+" ){|file|
array = file.readlines
file.rewind
file.truncate(0)
file.puts array.sort_by{|s| s[/^\d+/].to_i }
}

your version takes a lot of memory,

Wrong.

This method uses at least 2x the file size worth of memory. That's a
lot.

E:\Ruby>ruby -e 'path = ARGV.shift; system %(sort -n "#{path}"
-e:1: unterminated string meets end of file

I just ran it, it worked fine.

You'll probably have to redo the quoting for a non-Bourne-compatible
shell.

Perhaps your attempt at a solution requires Unix, and you,
in your ignorance, or your thoughtlessness, or your
ignorance and your thoughtlessness, assumed that every
user of Ruby is a user of Unix.

Please try to flame harder. This one just made me chuckle.
 
Eric Hodel

It's a one liner:

ruby -i.bak -e 'puts ARGF.readlines.sort_by {|l| l[/^\d+/].to_i}' file

It's my understanding that when you use -i, a temporary file
is created, the original file is deleted, and the temporary
file is renamed. Doesn't this cause unnecessary disk
fragmentation?

If I had a filesystem where I had to worry about fragmentation I
wouldn't care. The amount of time spent figuring out some best way
to "fix" it is going to be less than the time running a defragmenter
will take.
 
William James

To do this safely you'll need a temporary file.
Slurping a file into memory, sorting it, then writing it back to the same
file is an unsound practice, i.e. not "rerunnable-safe". Suppose, for
example, you suffer a power failure half-way through writing back the file,
or the write fails due to "disk full" or "user disk quota exceeded" or for
any other reason. Oops, you've just corrupted your input file.

Of course. But I'm willing to take that minuscule chance when
I'm doing a write to a small file that takes a fraction of a
second.

The question remains: doesn't using a temp file cause more
disk fragmentation than writing directly to the original file?
 
Robert Klemme

On Aug 10, 2007, at 13:54 , William James wrote:
I've written a little ruby program which can sort logfiles with the
following format:
4.text text text
1.text text text
2.text text text
10.text text text
2.text2 text2 text2
...
File.open( ARGV.first, "r+" ){|file|
array = file.readlines
file.rewind
file.truncate(0)
file.puts array.sort_by{|s| s[/^\d+/].to_i }
}
It's a one liner:

ruby -i.bak -e 'puts ARGF.readlines.sort_by {|l| l[/^\d+/].to_i}' file

It's my understanding that when you use -i, a temporary file
is created, the original file is deleted, and the temporary
file is renamed.

Correct.

Doesn't this cause unnecessary disk
fragmentation?

Huh? Are you still on MS DOS? I haven't heard someone worry about disk
fragmentation in ages. I don't think that this is an issue for any
modern file system.

Less memory usage:

ruby -i.bak -e 'puts ARGF.readlines.sort! {|a,b| a[/^\d+/].to_i <=> b[/^\d+/].to_i}' file

Of course, you're trading speed for memory.

Where exactly do you see that trade-off? I was trading elegance for
memory. Sure, there are effects that could make one or the other
solution faster, but if I were really worried about speed I'd
use "sort" anyway.

Kind regards

robert
 
Frank Meyer

Thanks for all your suggestions; they helped me a lot to learn more about
Ruby's library. I didn't know that there are so many handy functions :)


And about the temporary file, I'm using it only for private purposes and
I didn't want to bother with creating a temporary file in my first
attempt to write a ruby program which can sort these log files.


Thank you all!



Turing
 
Gregory Brown

--- William James said:
Of course. But I'm willing to take that minuscule chance when
I'm doing a write to a small file that takes a fraction of a
second.

That may be an acceptable risk for a program written for private use.
Not so for a production program. After all, impatient users often
press [CTRL-C] in my experience, and that could cause corruption
if it occurred while the file was being rewritten.

You can of course capture that, but you're right that it's creating
additional unnecessary work.
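
For readers who do want the safer variant discussed above, the idea is to write the sorted lines to a temporary file and rename it over the original only once the write has finished. A minimal sketch under stated assumptions: the helper name `sort_file_safely` is hypothetical, and the `.tmp` suffix simply follows the `sort -n` example earlier in the thread. The whole file is still read into memory, so this addresses the corruption risk but not the memory usage.

```ruby
# Hypothetical helper: sort the file at `path` by the leading number of each
# line without ever truncating the original in place.
def sort_file_safely(path)
  tmp_path = "#{path}.tmp"  # assumed temp name, next to the original file
  sorted = File.readlines(path).sort_by { |line| line[/^\d+/].to_i }

  # Write the complete result to the temporary file first ...
  File.write(tmp_path, sorted.join)

  # ... and only then replace the original. If the program is interrupted
  # before this point, the original file is still untouched.
  File.rename(tmp_path, path)
end
```

It could be called as `sort_file_safely(ARGV.first)` from a small script. An interrupted run may leave the `.tmp` file behind, but the input file keeps its old contents.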
 