Extract date from filenames using regex

Clement Ow

I have my code which looks like this:
delete = 5 + 2 # add 2 so the weekend (Sat, Sun) is not counted
folders = $del_path
puts delete_date = DateTime.now - delete
regexp = Regexp.compile(/(\d{4}\d{2}\d{2})/)
fileData = Struct.new(:name, :size)
deleted_files = []
folders.each do |folder|
  Dir.glob(folder+"/*") do |file|
    match = regexp.match(File.basename(file))
    if match
      file_date = DateTime.parse(match[1])
When my file name is in the format 20080331, for example, the script
runs successfully. However, if the filename has additional
characters added to it, say risk20080331, it raises an error, and I
reckon the line above is the cause.

So is there any way I can extract the date using a regex, or some
simpler way, so I can compare it with the deletion date and execute the rm_r command?
Thanks!
 
Daniel Finnie

Hi,

You can use File.mtime(file_name) which will return a Time object.

You can also match with /\d+/ (one or more digits):
("sdf555sadfsdfg")[/\d+/]
=> "555"

But watch out!:
("sdf555sadfs5867dfg")[/\d+/]
=> "555"
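
To illustrate the caveat, here is a minimal sketch that assumes the date is always a run of eight digits. The file names are made up for the example; String#[] and String#scan are standard Ruby.

```ruby
# Hypothetical file names, used only for illustration.
names = ["risk20080331.log", "sdf555sadfs5867dfg"]

names.each do |name|
  # /\d+/ returns the first run of digits, whatever its length.
  first_run = name[/\d+/]
  # /\d{8}/ needs eight consecutive digits and returns nil otherwise.
  eight = name[/\d{8}/]
  # scan returns every run of digits, useful when a name contains several.
  all_runs = name.scan(/\d+/)
  puts "#{name}: first=#{first_run.inspect} eight=#{eight.inspect} all=#{all_runs.inspect}"
end
```

Asking for a fixed number of digits avoids picking up short counters or IDs that happen to come first in the name.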

For a nice, object-oriented approach to file manipulation in Ruby, you
might want to check out Pathname in the standard library:
http://www.ruby-doc.org/stdlib/libdoc/pathname/rdoc/index.html
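
A rough sketch of the File.mtime idea using Pathname, assuming the modification time is a good enough stand-in for the date in the file name. The helper name stale_by_mtime and the temporary files are invented for the example.

```ruby
require 'pathname'
require 'tmpdir'

# Return the entries of +folder+ whose modification time is older than +cutoff+.
# stale_by_mtime is a hypothetical helper name, not part of Pathname.
def stale_by_mtime(folder, cutoff)
  Pathname.new(folder).children.select { |path| path.mtime < cutoff }
end

Dir.mktmpdir do |dir|
  old_file = File.join(dir, "old.dat")
  new_file = File.join(dir, "new.dat")
  File.write(old_file, "x")
  File.write(new_file, "x")

  # Backdate one file by ten days so the example has something to find.
  ten_days_ago = Time.now - 10 * 24 * 60 * 60
  File.utime(ten_days_ago, ten_days_ago, old_file)

  cutoff = Time.now - 7 * 24 * 60 * 60
  stale = stale_by_mtime(dir, cutoff)
  puts stale.map { |path| path.basename.to_s }.inspect
end
```

This only lists candidates; deleting them is left out on purpose so the selection can be checked first.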

Dan

 
Clement Ow

Thanks Daniel for your input. I tried using /\d+/ but it'll match
files that have even 2 digits, so I decided to use
/(\d\d)(\d\d)(\d\d\d\d)/ instead. It enabled me to run the command on
certain files but not all files, and the following error occurred:

Files/Folders deleted:
//sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk
report_09042008.dat size: 74 KB
Files/Folders deleted:
//sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk
report_10042008.dat size: 81 KB
Files/Folders deleted:
//sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk
report_11042008.dat size: 80 KB
Files/Folders deleted:
//sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk
report_14042008.dat size: 79 KB
Files/Folders deleted:
//sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk
report_15042008.dat size: 77 KB
Files/Folders deleted:
//sins00114178/mad/Singapore/CubeMorningLDN/HKD_CUBE_risk
report_16042008.dat size: 77 KB
c:/ruby/lib/ruby/1.8/date.rb:1536:in `new_by_frags': invalid date (ArgumentError)
from c:/ruby/lib/ruby/1.8/date.rb:1583:in `parse'
from testing.conf.rb:166:in `delFiles'
from testing.conf.rb:163:in `glob'
from testing.conf.rb:163:in `delFiles'
from testing.conf.rb:162:in `each'
from testing.conf.rb:162:in `delFiles'
from testing.conf.rb:204

Is there anything wrong with my code that prevents deleting all the files
that I want?
 
Jesús Gabriel y Galán

Sorry, what is the error? Cause this works for me:

irb(main):001:0> regexp = Regexp.compile(/(\d{4}\d{2}\d{2})/)
=> /(\d{4}\d{2}\d{2})/
irb(main):002:0> match = regexp.match("risk20080331.log")
=> #<MatchData:0xb7ce20f4>
irb(main):003:0> match[1]
=> "20080331"
irb(main):005:0> require 'date'
=> true
irb(main):006:0> DateTime.parse(match[1])
=> #<DateTime: 4909113/2,0,2299161>

So any string that contains 4 digits followed by 2 digits followed by
2 digits will match that regexp,
independently of what it has around the numbers:

irb(main):007:0> regexp.match("12345678")[1]
=> "12345678"
irb(main):008:0> regexp.match("12345678asdfasdf")[1]
=> "12345678"
irb(main):009:0> regexp.match("asdfasdf12345678asdfasdf")[1]
=> "12345678"
irb(main):010:0> regexp.match("asdfasdf12345678")[1]
=> "12345678"
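
Since the pattern matches eight digits anywhere, a longer run of digits (an ID, say) matches too. A small sketch, assuming you want exactly eight digits with no digit on either side. The file names are made up, and lookbehind needs Ruby 1.9 or later, which is newer than the 1.8 used in this thread.

```ruby
LOOSE  = /(\d{8})/
STRICT = /(?<!\d)(\d{8})(?!\d)/ # exactly eight digits, no digit on either side

# Made-up names: one with a date, one with a ten-digit ID.
["risk20080331.log", "id1234567890.log"].each do |name|
  loose  = name[LOOSE, 1]
  strict = name[STRICT, 1]
  puts "#{name}: loose=#{loose.inspect} strict=#{strict.inspect}"
end
```

The strict pattern returns nil for the ten-digit ID, while the loose one returns its first eight digits.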

Jesus.
 
Jesús Gabriel y Galán

OK, now I see the problem. The file that is failing has a number like this:
16042008. The DateTime.parse method is trying to parse the date as:

1604-20-08 which is obviously an invalid date (month > 12).
There are two solutions to this problem:

1.- Change DateTime.parse to DateTime.strptime, passing a format
that describes where in the string you have the two digits of the day,
the two digits of the month and the four digits of the year. I haven't
been able to put together a quick example, because I can't find a
reference for the format string (any help here appreciated). The doc
refers me to date/format.rb for details and I don't see anything clear
there.

2.- Change the regexp a little bit so you capture the year, the month
and the day in separate groups and create the DateTime using the three
values:

irb(main):011:0> regexp = Regexp.compile(/(\d{4})(\d{2})(\d{2})/)
=> /(\d{4})(\d{2})(\d{2})/
irb(main):012:0> m = regexp.match("20080103asdfasdf")
=> #<MatchData:0xb7c11a6c>
irb(main):014:0> d = DateTime.civil m[1].to_i, m[2].to_i, m[3].to_i
=> #<DateTime: 4908937/2,0,2299161>
irb(main):015:0> d.to_s
=> "2008-01-03T00:00:00+00:00"

I think you can apply the above changes to the script and it will work.
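
For file names in ddmmyyyy order the groups come out as day, month, year, so they have to be handed to civil in reverse. A minimal sketch of that, assuming a plain Date is enough (the thread uses DateTime); date_from_ddmmyyyy is a hypothetical helper name.

```ruby
require 'date'

# Build a Date from a ddmmyyyy string embedded in +name+, or return nil.
# date_from_ddmmyyyy is a hypothetical helper name.
def date_from_ddmmyyyy(name)
  m = /(\d{2})(\d{2})(\d{4})/.match(name)
  return nil unless m

  day, month, year = m[1].to_i, m[2].to_i, m[3].to_i
  # Date.civil takes year, month, day, so the captures are passed in reverse.
  return nil unless Date.valid_civil?(year, month, day)

  Date.civil(year, month, day)
end

puts date_from_ddmmyyyy("report_16042008.dat")         # 2008-04-16
puts date_from_ddmmyyyy("report_99992008.dat").inspect # nil
```

Checking with Date.valid_civil? first avoids the ArgumentError for digits that are not a real date.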
Let me know,

Jesus.
 
7stud --

Clement said:
I decided to use /(\d\d)(\d\d)(\d\d\d\d)/ instead. It enabled me to run the command on
certain files but not all files and the following error occurred:

c:/ruby/lib/ruby/1.8/date.rb:1536:in `new_by_frags': invalid date (ArgumentError)
from c:/ruby/lib/ruby/1.8/date.rb:1583:in `parse'

require 'date'

str = 'sins00114178'
pattern = /(\d\d)(\d\d)(\d\d\d\d)/

match_obj = pattern.match(str)
puts match_obj[1]

file_date = DateTime.parse(match_obj[1])

--output:--
00
/usr/lib/ruby/1.8/date.rb:1214:in `new_with_hash': invalid date
(ArgumentError)
from /usr/lib/ruby/1.8/date.rb:1258:in `parse'
from r1test.rb:9
 
Jesús Gabriel y Galán

On Wed, Apr 23, 2008 at 9:14 AM, Clement Ow wrote:
1.- Change DateTime.parse to DateTime.strptime passing a format
that describes where in the string you have the two digits of the day, the month
and the four digits of the date. I haven't been able to gather a quick example,
cause I don't find a reference for the format string (any help here
appreciated).
The doc refers me to the date/format.rb for details and I don't see
anything clear there.

After a couple of trial/error tests this seems to work:

DateTime.strptime "16042008", "%d%M%Y"

So either of the two solutions will work for you.

Jesus.
 
Jesús Gabriel y Galán

You are right, I overlooked the fact that he had added more parens
in the regexp, so he was passing only two digits to DateTime.parse.
Anyway the changes I proposed should work for him.
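
To make the difference concrete, a short sketch of what each index of the MatchData holds once the pattern has three groups (the file name is taken from the log earlier in the thread):

```ruby
m = /(\d\d)(\d\d)(\d\d\d\d)/.match("report_16042008.dat")

puts m[0]               # whole match: 16042008
puts m[1]               # first group only: 16
puts m.captures.inspect # all groups: ["16", "04", "2008"]
```

So with this pattern, match[1] is just the two-digit day; the full eight digits are in match[0].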

Jesus.
 
Clement Ow

Hi Jesus,
First of all, thanks for your help!
However, despite using
d = DateTime.civil(match[1].to_i, match[2].to_i, match[3].to_i)
file_date = d.to_s
OR

file_date = DateTime.strptime(match[1], "%d%M%Y")

it still gives me invalid date as the error message. But when I run it in
fxri it seems to work fine. This only seems to happen when the date
format is ddmmyyyy; for yyyymmdd it has no problems. Any
ideas, anyone? I have cracked my head but to no avail.
 
Clement Ow

Jesús Gabriel y Galán said:
Can you post the smallest example that fails? Do you have files with
different date formats?

Jesus.

delete_date = DateTime.now - delete

regexp = Regexp.compile(/(\d\d)(\d\d)(\d\d\d\d)/)

fileData = Struct.new(:name, :size)
deleted_files = []

folders.each do |folder|
  Dir.glob(folder+"/*") do |file|
    puts match = regexp.match(File.basename(file))
    if match
      file_date = DateTime.strptime(match[1], fmt='%d%M%Y')
      size = (File.size(file))/1024
      if delete_date > file_date
        deleted_files << fileData.new(file,size)
        FileUtils.rm_r file
        if File.exist?(file)==false
          puts "Files/Folders deleted: #{file} size: #{size} KB"
        end #if
      end #if
    end #if
  end #do
end #each
end #if
end #delFiles
c:/ruby/lib/ruby/1.8/date.rb:1536:in `new_by_frags': invalid date (ArgumentError)
from c:/ruby/lib/ruby/1.8/date.rb:1563:in `strptime'
from testing.conf.rb:166:in `delFiles'
 
Jesús Gabriel y Galán

Clement said:
because DateTime accepts this format?

Yes, that's exactly the issue.

Clement said:
however if I use strptime it also can't work. Any help will be greatly appreciated =)

If you have files with different formats, you will have to know which
format each file is, because DateTime.parse is expecting yyyymmdd,
while strptime is expecting whatever format you pass it, but only one
format. If the dates are current dates, and are only these two formats
(yyyymmdd or ddmmyyyy) I think this is safe:

regexp = /(\d{8})/
match = regexp.match(file_name)
file_date = nil
begin
file_date = DateTime.parse(match[1])
rescue ArgumentError
file_date = DateTime.strptime(match[1], "%d%M%Y")
end

However, if you have arbitrary dates, this can lead to unexpected
results. For example:

12101112

will be parsed as 1210-11-12, while maybe you meant 12-10-1112.
Also, I think the above is safe because the century (20xx for the
year) is not a valid month, but there might be some corner case I
haven't realized.
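
One way to make the fallback less dependent on what DateTime.parse guesses is to try explicit formats in order and accept only dates inside a plausible range. This is only a sketch: parse_file_date and the default 2000..2099 range are assumptions of the example, and it uses %m (month) in the format string, since %M is the minute directive.

```ruby
require 'date'

# Try each format in order and return the first date that falls inside
# +plausible+. parse_file_date and the default range are assumptions made
# for this sketch, not something from the thread.
def parse_file_date(digits, plausible = (Date.new(2000, 1, 1)..Date.new(2099, 12, 31)))
  ["%Y%m%d", "%d%m%Y"].each do |format|
    begin
      date = Date.strptime(digits, format)
      return date if plausible.cover?(date)
    rescue ArgumentError
      # Not a valid date in this format, try the next one.
    end
  end
  nil
end

puts parse_file_date("20080331")         # yyyymmdd
puts parse_file_date("16042008")         # ddmmyyyy
puts parse_file_date("12101112").inspect # valid digits, but outside the range
```

Returning nil for anything outside the range means an ambiguous name is skipped rather than deleted by mistake.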

Jesus.
 
Clement Ow

Hey Jesus,

file_date = DateTime.strptime(match[1], "%d%M%Y")
When the above is used, it will parse the date as e.g. 28012008 even
though the date is 28032008. So I tried some trial and error and
used this:
file_date = DateTime.strptime(match[1], "%d%m%Y")
and bingo, it parses the date correctly, so I am able to run the
command to delete. Thanks a lot for your time and help! =)

Cheers!
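
For anyone finding this thread later, a small sketch of why the two format strings behave differently: in strptime, %m is the month and %M is the minute, so with %M the middle two digits are not read as a month at all. The date string is the one from the post above.

```ruby
require 'date'

right = DateTime.strptime("28032008", "%d%m%Y") # %m is the month (01..12)
wrong = DateTime.strptime("28032008", "%d%M%Y") # %M is the minute (00..59)

puts right.strftime("%Y-%m-%d")
puts "month=#{wrong.month} minute=#{wrong.minute}"
```

With %M the "03" ends up in the minute field and the month is filled in by default, which matches the wrong month seen above.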
 
