Parsing/Crawler Questions..

bruce · Mar 4, 2009

Hi...

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..

I'm working on a crawling project: crawling a college website to extract
course/class information. I've built a quick test app in Python to crawl the
site. I start at the top level and work my way down to the required
course/class schedule. The app works. I can consistently run it and extract
the information. The extraction is based on an XPath analysis of the DOM for
the pages that I'm parsing.

My issue is that, now that I have a "basic" app that works, I need to figure
out how to guarantee that I'm crawling the site correctly. How do I know when
I've hit an error at a given node/branch, so that the app knows that it's
not going to fetch the underlying branches/nodes of the tree?
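One way to make that kind of failure visible is to record the outcome of every node visited, so a failed branch is reported instead of silently dropped. The following is only a minimal sketch under assumptions: `CrawlReport`, `crawl`, and the `fetch_children` callback are hypothetical names, and `fetch_children` stands in for whatever fetch-plus-XPath step the real app uses.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlReport:
    """Outcome of one crawl run (hypothetical helper, not from the app)."""
    ok: list = field(default_factory=list)       # nodes fetched and parsed
    failed: dict = field(default_factory=dict)   # node -> error description

    @property
    def complete(self):
        # A run counts as complete only if no node failed.
        return not self.failed


def crawl(root, fetch_children):
    """Walk the site depth-first from `root`.

    `fetch_children(url)` is an assumed callback that returns the child
    URLs of a page, or raises if the page could not be fetched or parsed.
    """
    report = CrawlReport()
    seen = set()
    stack = [root]
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            children = fetch_children(url)
        except Exception as exc:
            # Everything below this node is unknown for this run.
            report.failed[url] = repr(exc)
            continue
        report.ok.append(url)
        stack.extend(children)
    return report
```

After a run, `report.failed` lists exactly the nodes whose subtrees were not explored, so those can be retried, and `report.complete` says whether the run can be trusted as a full pass. It does not catch a page that loads fine but returns fewer rows than it should.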

When running the app, I can get 5000 classes on one run, 4700 on another,
etc. So I need some method of determining when I've got a "complete" tree.

How do I know when I have a complete "tree"?
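As a rough illustration of the 5000-versus-4700 problem, the class identifiers from several runs can be compared against their union to see what each run missed. This is a sketch only: `compare_runs` and the example IDs are made up, and it assumes each extracted class has some stable identifier.

```python
def compare_runs(runs):
    """Report which class IDs each run is missing.

    `runs` maps a run name to the set of class IDs that run extracted.
    The union of all runs is used as the best available reference.
    """
    union = set().union(*runs.values())
    return {name: union - ids for name, ids in runs.items()}


# Hypothetical data standing in for two crawl runs.
runs = {
    "run1": {"CS101", "CS102", "MATH200"},
    "run2": {"CS101", "MATH200"},
}
missing = compare_runs(runs)
```

The union is only a lower bound on the real tree: a class that every run missed will not show up here, so this detects inconsistency between runs rather than proving completeness.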

I'm looking for someone, or some group/prof, that I can talk to about these
issues. I've been searching Google, LinkedIn, etc. for someone to bounce
thoughts off.

Any pointers, people, papers, etc. would be helpful.

Thanks

