newbie: how to find & extract a string from a file

Esmail Bonakdarian · Sep 29, 2006

Hi,

Just starting out to explore Ruby (I like it) and I have
a question.

I have an HTML file that contains several references to jpg files.

I would like to extract the filename with the .jpg extension.
What is the best approach for this?

It appears each .jpg reference is on its own line.

Thanks!

Esmail

x1 · Sep 29, 2006

I'm sure someone will have a better way of doing this.. but..
Assuming it has <img src="/something.jpg">

##
imgs = []
IO.readlines("c:/somefile.html").each do |line|
  # Keep whatever sits between <img src=" and the next double quote.
  imgs << line.split('<img src="')[1].to_s.split('"')[0] if line.include?('<img src=')
end
puts imgs.join("\n")
##
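
A regexp-based variant of the same idea, as a minimal sketch rather than a definitive answer. It assumes each reference looks like a path ending in .jpg or .jpeg with no spaces in it. The names JPG_REF and jpg_filenames and the sample lines are made up for illustration.

```ruby
# Matches a run of non-delimiter characters ending in .jpg or .jpeg,
# case-insensitively. The excluded characters are an assumption about
# what can surround a file reference in the HTML.
JPG_REF = /[^"'\s<>=]+\.jpe?g\b/i

# Returns the bare filename of every .jpg reference found in the lines.
def jpg_filenames(lines)
  lines.flat_map { |line| line.scan(JPG_REF) }
       .map { |ref| File.basename(ref) }
end

# Hypothetical sample input standing in for the lines of an HTML file.
sample = [
  %(<p>Maps</p>\n),
  %(<img src="/images/ireland_1808.jpg">\n),
  %(<a href="maps/Dublin.JPG">Dublin</a>\n)
]

puts jpg_filenames(sample)
```

To run it against a real file, something like jpg_filenames(File.foreach("c:/somefile.html")) should work, since the method only needs an enumerable of lines.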

MonkeeSage · Sep 29, 2006

Esmail said:
I would like to extract the filename with the .jpg extension.
What is the best approach for this?

Hi there,

You could use raw regexps and do it yourself, but you should probably
use an HTML parser to extract HTML data.

A nice HTML parser is Hpricot [1], but it requires an extension (you
can get it very easily via gems, see the link below). It is very easy
to use, and it's fast.

Using Hpricot, you can do something like this:

require 'hpricot'
require 'open-uri'
# On Ruby 3.0 and later, call URI.open instead of open.
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Hpricot(soc)
soc.close
# Print the target of every link that points at a JPEG file.
doc.search('//a').each { |elem|
  href = elem.attributes['href']
  if not href.nil? and ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}

Note that you can also use the built-in REXML parser [2], and do
something like:

require 'rexml/document'
require 'open-uri'
include REXML
# On Ruby 3.0 and later, call URI.open instead of open.
soc = open('http://utopia.utexas.edu/maps/ireland.html')
doc = Document.new(soc)
soc.close
# Print the target of every link that points at a JPEG file.
doc.elements.each('//a') { |elem|
  href = elem.attributes['href']
  if not href.nil? and ['.jpg', '.jpeg'].include?(File.extname(href))
    puts href
  end
}
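
For the original question (image references in a local file rather than links on a remote page), the same REXML approach might look roughly like the sketch below. The markup is a made-up, well-formed sample held in a string. REXML is an XML parser, so this assumes the HTML is well-formed; markup with unclosed tags or unquoted attributes raises REXML::ParseException.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample markup standing in for the file contents.
xhtml = <<~XHTML
  <html>
    <body>
      <img src="/images/ireland_1808.jpg" alt="map"/>
      <img src="/images/logo.png" alt="logo"/>
      <a href="maps/dublin.jpeg">Dublin</a>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# Collect the src of every <img> and the href of every <a>.
refs = []
doc.elements.each('//img') { |elem| refs << elem.attributes['src'] }
doc.elements.each('//a')   { |elem| refs << elem.attributes['href'] }

# Drop missing attributes, then keep only JPEG references.
jpgs = refs.compact.select do |ref|
  ['.jpg', '.jpeg'].include?(File.extname(ref).downcase)
end

puts jpgs
```

Reading from disk would mean passing File.read of the HTML file to REXML::Document.new in place of the sample string.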

[1] http://code.whytheluckystiff.net/hpricot/
[2] http://www.germane-software.com/software/rexml/docs/tutorial.html

Regards,
Jordan

Esmail Bonakdarian · Sep 29, 2006

x1 said:
I'm sure someone will have a better way of doing this.. but..
Assuming it has <img src="/something.jpg">

##
imgs = []
IO.readlines("c:/somefile.html").each {|line| imgs << line.split("<img src=\"")[1].to_s.split("\"")[0] if line.match("<img src=") }
puts imgs.join("\n")
##

Hi,

thanks .. this will get me started. I feel like I could do this
using various unix tools (grep/awk), but I'm trying to learn
Ruby ...

Esmail

Esmail Bonakdarian · Sep 29, 2006

Hi,

Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

I appreciate you taking the time to post this and the references.
If you have any other ideas/approaches, I'm game.

Thanks again,

Esmail

MonkeeSage · Sep 30, 2006

Esmail said:
Thank you so much for these pointers. Am I correct in assuming
that REXML comes as part of standard Ruby? If so I think I will
go that route first.

Glad to help. And yes, REXML is a pure-Ruby parser (it uses regexps under
the hood) and has been included in the Ruby standard library since 1.8.

I could cobble something together using various Linux tools
(grep and awk come to mind), but I want something in Ruby
(because I want to learn it) and also because it will be
more portable, for instance to the XP platform.

REXML is more portable, albeit not as fast as Hpricot, which is
implemented as a compiled C extension for Ruby.

Have fun learning Ruby! It's a nice language.

Regards,
Jordan
