Using regular expressions in internet searches

mike.ceravolo

What is the best way to use regular expressions to extract information
from the internet if one wants to search multiple pages? Let's say I
want to search all of www.cnn.com and get a list of all the words that
follow "Michael."

(1) Is Python the best language for this? (Plus is it time-efficient?)
Is there already a search engine that can do this?

(2) How can I search multiple web pages within a single location or
path?

TIA,

Mike
 
Diez B. Roggisch

> What is the best way to use regular expressions to extract information
> from the internet if one wants to search multiple pages? Let's say I
> want to search all of www.cnn.com and get a list of all the words that
> follow "Michael."
>
> (1) Is Python the best language for this? (Plus is it time-efficient?)
> Is there already a search engine that can do this?
>
> (2) How can I search multiple web pages within a single location or
> path?

You'd probably be better off using htdig.

Diez
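For the extraction half of the question, a word-boundary regex in Python is enough; a minimal sketch (the sample text is made up for illustration):

```python
import re

# Capture the word immediately following "Michael" (case-insensitive).
# \b keeps e.g. "Carmichael" from matching; (\w+) grabs the next word.
pattern = re.compile(r"\bMichael\s+(\w+)", re.IGNORECASE)

sample = "Michael Moore spoke, and later michael Jordan appeared."
print(pattern.findall(sample))  # ['Moore', 'Jordan']
```

findall returns every captured group in document order, which is exactly the "list of all the words that follow Michael" asked for.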
 
MyHaz

Python would be good for this, but if you just want a quick-and-dirty
solution, something like this might do:


bash$ wget -r -e robots=off -l 0 -c -t 3 http://www.cnn.com/
bash$ grep -rho "Michael \w*" ./www.cnn.com/

Or you could do a wget/Python mix, like:

import os
import re

# Mirror the site first (wget has to be on the PATH)
os.system("wget -r -e robots=off -l 0 -c -t 3 http://www.cnn.com/")

re_iraq = re.compile(r"\biraq \w+", re.IGNORECASE)

# Walk every file wget saved under ./www.cnn.com/
for dirpath, dirnames, filenames in os.walk("./www.cnn.com/"):
    for name in filenames:
        with open(os.path.join(dirpath, name)) as f:
            print(re_iraq.findall(f.read()))
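Question (2) can also be answered without shelling out to wget at all, by following links in pure Python. A breadth-first sketch under loose assumptions: the start URL and page limit are illustrative, the naive href regex stands in for a real HTML parser, and nothing here was tested against the live site.

```python
import re
import urllib.request
from urllib.parse import urljoin, urlparse

def crawl(start, max_pages=10):
    """Fetch up to max_pages pages on start's host; return words after 'Michael'."""
    word_re = re.compile(r"\bMichael\s+(\w+)", re.IGNORECASE)
    link_re = re.compile(r'href="([^"#]+)"', re.IGNORECASE)  # crude link extraction
    host = urlparse(start).netloc
    seen, queue, hits = set(), [start], []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        hits.extend(word_re.findall(html))
        for link in link_re.findall(html):
            full = urljoin(url, link)
            if urlparse(full).netloc == host:  # stay on the same site
                queue.append(full)
    return hits
```

The same-host check keeps the crawl "within a single location or path"; widening or narrowing that test is the knob for question (2).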
 
