newbie: how to find & extract a string from a file

Esmail Bonakdarian

Hi,

Just starting out to explore Ruby (I like it) and I have
a question.

I have an HTML file that contains several references to jpg files.

I would like to extract the filename with the .jpg extension.
What is the best approach for this?

It appears each .jpg reference is on its own line.

Thanks!

Esmail
 
x1

I'm sure someone will have a better way of doing this.. but..
Assuming it has <img src="/something.jpg">

##
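imgs = []
IO.readlines("c:/somefile.html").each do |line|
  # Only handle lines that contain a double-quoted <img src="..."> attribute.
  next unless line.include?('<img src="')
  # Keep the text between <img src=" and the next double quote.
  imgs << line.split('<img src="')[1].to_s.split('"')[0]
end
puts imgs.join("\n")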
##
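
For comparison, here is a rough sketch of the same idea using a regexp and String#scan, which also copes with several references on one line and returns just the file name. The helper name jpg_names and the sample HTML are made up for illustration, and a regexp like this will not handle every kind of HTML.

```ruby
# Sketch only: pull .jpg/.jpeg file names out of HTML text with a regexp.
# `jpg_names` is a hypothetical helper name, not something from this thread.
def jpg_names(html)
  # Match src="..." or href="..." values (single or double quoted)
  # whose value ends in .jpg or .jpeg, ignoring case.
  pattern = /(?:src|href)\s*=\s*["']([^"']+\.jpe?g)["']/i
  # scan returns one single-element array per match; keep only the file name.
  html.scan(pattern).map { |(path)| File.basename(path) }
end

# Made-up sample input.
sample = <<~HTML
  <p><img src="/images/cat.jpg"></p>
  <p><a href="maps/ireland.JPG">map</a></p>
  <p><img src="/images/logo.png"></p>
HTML

puts jpg_names(sample)
```

To run it against a real file, something like jpg_names(File.read("somefile.html")) should work, assuming the file fits comfortably in memory.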
 
MonkeeSage

Esmail said:
I would like to extract the filename with the .jpg extension.
What is the best approach for this?

Hi there,

You could use raw regexps and do it yourself, but you should probably
use an HTML parser to extract HTML data. ;)

A nice HTML parser is Hpricot [1], but it requires an extension (you
can get it very easily via gems, see the link below). It is very easy
to use, and it's fast.

Using Hpricot, you can do something like this:

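require 'hpricot'
require 'open-uri'

# Fetch the page and parse it with Hpricot.
# (On Ruby 3.0 and later, open-uri needs URI.open instead of open.)
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Hpricot(soc)
soc.close

# Print every link target whose extension is .jpg or .jpeg.
doc.search('//a').each { |elem|
  href = elem.attributes['href']
  if not href.nil? and
     ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}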

Note that you can also use the built-in REXML parser [2], and do
something like:

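require 'rexml/document'
require 'open-uri'
include REXML

# REXML is an XML parser, so the page has to be well-formed XML (XHTML).
# (On Ruby 3.0 and later, open-uri needs URI.open instead of open.)
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Document.new(soc)
soc.close

# Print every link target whose extension is .jpg or .jpeg.
doc.elements.each('//a') { |elem|
  href = elem.attributes['href']
  if not href.nil? and
     ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}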
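
Since the original question was about a local file that may reference images through img tags as well as links, a variation on the REXML approach could look like the sketch below. The helper name jpeg_refs and the sample markup are invented for illustration, the extension check is made case-insensitive as an extra assumption, and the input has to be well-formed XML for REXML to parse it.

```ruby
require 'rexml/document'

# Sketch only: collect JPEG references from both <a href> and <img src>.
# `jpeg_refs` is a hypothetical helper name, not part of REXML.
def jpeg_refs(xml_text)
  doc = REXML::Document.new(xml_text)
  refs = []
  # Map each XPath to the attribute that holds the reference.
  { '//a' => 'href', '//img' => 'src' }.each do |xpath, attr|
    doc.elements.each(xpath) do |elem|
      value = elem.attributes[attr]
      next if value.nil?
      # Compare extensions case-insensitively (.JPG, .Jpeg, ...).
      refs << value if ['.jpg', '.jpeg'].include?(File.extname(value).downcase)
    end
  end
  refs
end

# Made-up, well-formed sample input.
sample = <<~XHTML
  <html><body>
    <a href="maps/ireland.jpg">Ireland</a>
    <a href="index.html">Home</a>
    <img src="/pics/coast.JPEG" alt="coast" />
    <a name="top">no href here</a>
  </body></html>
XHTML

puts jpeg_refs(sample)
```

For a file on disk, jpeg_refs(File.read("somefile.html")) would be the equivalent call, again assuming the file is well-formed.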

[1] http://code.whytheluckystiff.net/hpricot/
[2] http://www.germane-software.com/software/rexml/docs/tutorial.html

Regards,
Jordan
 
Esmail Bonakdarian

Hi,

thanks .. this will get me started. I feel like I could do this
using various unix tools (grep/awk), but I'm trying to learn
Ruby ...

Esmail
 
Esmail Bonakdarian

Hi,

Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

I appreciate you taking the time to post this and the references.
If you have any other ideas/approaches, I'm game.

Thanks again,

Esmail
 
MonkeeSage

Esmail said:
Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

Glad to help. And yes, REXML is a pure-Ruby parser (it uses regexps under
the hood) and has been included in the Ruby stdlib since 1.8.

Esmail said:
I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

REXML is more portable, albeit not as fast as Hpricot, which is
implemented as a compiled C extension for ruby.

Have fun learning ruby! It's a nice language. :)

Regards,
Jordan
 