super-newbee Ruby regex help?

Discussion in 'Ruby' started by Aaron Reimann, Aug 1, 2006.

  1. This is pretty complex considering that I am just now reading "Learn to
    Program" by Chris Pine (it is a book teaching you how to program in
    Ruby). It is very basic. I am somewhat good with PHP but am wanting
    to move into RoR, and I want to learn Ruby before I learn Rails.

    Anyway, I found a real life situation where I think Ruby could do this
    very quickly (and if I need to do it again, I can just run the script).
    I need to remove some stuff from a text file. Simple huh? Here is
    the site that I need the list from:

    http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-school

    In that page there is one line of "code" that has all of the
    links...here is part of it:
    <a href="http://www.3proxy.com">3 Proxy</a> || <a
    href="http://www.3proxy.net">3 Proxy</a> || <a
    href="http://www.3proxy.org">3 Proxy</a>

    I have taken just that line and saved that as a text file.

    I need to strip everything so that I wind up with this:
    3proxy\.com
    3proxy\.net
    3proxy\.org
    4proxy\.com

    I will be taking that list (all 300 of them) and adding them to my
    content filtering box. That way, all of these sites will be blocked.

    Do you guys know of any sites that might have a similar situation where
    I can see the code? Or have any of you done something similar? I can
    probably modify stuff to make it fit my needs, but stuff like
    http://www.regular-expressions.info/ruby.html doesn't give me enough
    info to start.

    What I have right now is: file = File.open("list.txt","w")

    lol

    Sorry I'm a nubee... :)

    thanks,
    aaron
     
    Aaron Reimann, Aug 1, 2006
    #1

  2. Hello !
    OK, what you need is to extract the part 3proxy.com from the String
    <a href="http://www.3proxy.com">3 Proxy</a>

    For that, a RE like the following should do

    /http:\/\/www\.([^"]+)/

    You can read it this way: find substrings that start with "http://www."
    (don't forget to escape each / in the RE, or Ruby will think the RE
    ends there; you also need to escape the dot, since an unescaped dot
    matches any character, although in this case it shouldn't matter much)
    and are followed by some text that doesn't contain a ". The parentheses
    around that part say you're interested in it; you can get what it
    matched from the $1 variable. Note that this part matches as much as
    possible, so you'll get everything up to the closing quote.
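As a quick check of the pattern described above, here is a minimal sketch that runs it against the sample tag from the thread. The constant PROXY_PATTERN and the helper first_host are names made up for this illustration; they are not part of the original post.

```ruby
# Pattern from the post above: capture everything after "http://www."
# up to (but not including) the next double quote.
PROXY_PATTERN = /http:\/\/www\.([^"]+)/

# Hypothetical helper: returns the captured text, or nil if the line
# contains no match. After a successful =~, $1 holds the first group.
def first_host(line)
  line =~ PROXY_PATTERN ? $1 : nil
end

puts first_host('<a href="http://www.3proxy.com">3 Proxy</a>')
p first_host('no link here')
```

The same regex can also be written as %r{http://www\.([^"]+)}, which avoids having to escape each slash.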

    Then a possible way to do what you want would be

    proxies = [] # array where the proxies will be
    # the block form of File.open closes the file when the block ends
    File.open('your_file_with_the_list_youre_reading') do |f|
      f.each_line do |l| # iterate on each line
        l.scan(/http:\/\/www\.([^"]+)/) do # scan the line for the pattern
          proxies << $1 # add the content of $1 to your list
        end
      end
    end
    p proxies

    This should work...

    Have a good time with Ruby !

    Vince
     
    Vincent Fourmond, Aug 1, 2006
    #2

  3. Opening list.txt with the "w" option truncates it, so if the file
    already exists you'll destroy its contents.
    Since some of the anchor tags span more than one line,
    let's read the whole file at once:

    p IO.read( 'list.txt' ).
      scan( %r{<a \s+ href="http://www\.([^"]*)"}x ).flatten
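To see why reading the whole file at once matters for this pattern, here is a small sketch that compares scanning the full text with scanning line by line. It uses the three sample tags from the first post as inline text instead of list.txt; SAMPLE, ANCHOR and the two helper methods are hypothetical names chosen for the example.

```ruby
# Sample text from the question: two of the tags break between
# "<a" and "href", as they do in the saved file.
SAMPLE = <<~TEXT
  <a href="http://www.3proxy.com">3 Proxy</a> || <a
  href="http://www.3proxy.net">3 Proxy</a> || <a
  href="http://www.3proxy.org">3 Proxy</a>
TEXT

# With the x flag, literal spaces in the pattern are ignored, so the
# only whitespace that must match is \s+, which also matches a newline.
ANCHOR = %r{<a \s+ href="http://www\.([^"]*)"}x

# Hypothetical helper: scan the whole text in one go.
def hosts_whole_text(text)
  text.scan(ANCHOR).flatten
end

# Hypothetical helper: scan each line separately, for comparison.
def hosts_line_by_line(text)
  text.each_line.flat_map { |line| line.scan(ANCHOR).flatten }
end

p hosts_whole_text(SAMPLE)
p hosts_line_by_line(SAMPLE)
```

The whole-text scan finds all three hosts, while the line-by-line version only finds the first one, because the other two tags are split across a line break.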
     
    William James, Aug 2, 2006
    #3
  4. Thank you guys. I have not tried all that has been suggested, but I
    got this code emailed to me:

    ###
    require 'rubygems'
    require 'mechanize'

    url="http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-school"
    agent = WWW::Mechanize.new # newer Mechanize releases spell this Mechanize.new
    page = agent.get(url)

    page.body.scan(/http:\/\/www\.([^"]+)/) do
      p $1
    end
    ###

    I had to install the 'mechanize' gem, but it works...overall. I still
    have to figure out how to "write" the output into a text file, but this
    is pretty cool.

    I will be trying the one below too.

    thanks!
    aaron

     
    Aaron Reimann, Aug 2, 2006
    #4
  5. Update filename and you are set.

    require 'rubygems'
    require 'mechanize'

    url="http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-school"
    filename="/tmp/tmp2.txt"
    agent = WWW::Mechanize.new
    page = agent.get(url)

    session_fd = File.open(filename, "w")
    page.body.scan(/http:\/\/www\.([^"]+)/) do
      session_fd.puts $1 # one captured host per line
    end
    session_fd.close
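The first post asked for entries in the form 3proxy\.com, with the dots escaped for the content filter. A minimal sketch of that last step, assuming the hosts have already been extracted, might look like the following. The helpers escape_dots and write_filter_list and the file name blocked.txt are assumptions made for this example, and the sketch writes to a temporary directory rather than /tmp/tmp2.txt.

```ruby
require 'tmpdir'

# Hypothetical helper: put a backslash in front of every dot, giving
# the 3proxy\.com form asked for in the first post. The block form of
# gsub is used so the backslash is not read as a replacement escape.
def escape_dots(host)
  host.gsub('.') { '\.' }
end

# Hypothetical helper: write one escaped host per line. The block form
# of File.open closes the file even if an error is raised.
def write_filter_list(hosts, filename)
  File.open(filename, 'w') do |f|
    hosts.each { |host| f.puts escape_dots(host) }
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'blocked.txt') # assumed file name
  write_filter_list(%w[3proxy.com 3proxy.net], path)
  print File.read(path)
end
```

Ruby's built-in Regexp.escape would also escape the dots, but it escapes other characters such as hyphens too, which may or may not be what the filtering box expects.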
     
    Cliff Cyphers, Aug 2, 2006
    #5
  6. Mechanize has a method to get all the links for a Page:

    require "rubygems"
    require "mechanize"

    url="http://edge.i-hacked.com/250-working-proxies-for-safe-web-access-from-work-or-school"
    # in case a.uri raises on a malformed href, map that link to nil and drop it
    links = WWW::Mechanize.new.get(url).links.map { |a| a.uri rescue nil }.compact
    File.open('links.txt', 'w') { |f| f.puts(links) }

    Note that this saves every link on the page, relative ones included.
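One possible way to drop the relative links is to keep only the entries that parse as absolute http(s) URIs. The sketch below works on a plain array of href strings rather than on Mechanize link objects; the helper absolute_hosts and the two relative sample hrefs are invented for illustration.

```ruby
require 'uri'

# Hypothetical helper: keep only absolute http(s) links and reduce
# each one to its host name without a leading "www.".
def absolute_hosts(hrefs)
  hrefs.map do |href|
    uri = begin
      URI.parse(href)
    rescue URI::InvalidURIError
      nil # skip anything that does not parse as a URI
    end
    # Relative links parse as URI::Generic, so they are skipped here.
    next unless uri.is_a?(URI::HTTP) && uri.host
    uri.host.sub(/\Awww\./, '')
  end.compact
end

# The relative entries below are made-up examples.
links = ['http://www.3proxy.com', '/about', 'index.php?page=2', 'http://www.4proxy.com']
p absolute_hosts(links)
```

URI::HTTPS is a subclass of URI::HTTP, so https links pass the same check.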

    -- Daniel
     
    Daniel Harple, Aug 2, 2006
    #6
