WWW CMS: filtering actual(ly relevant) content

lbrtchx

~
I was actually wondering how one would filter and keep track of the
actual content of pages out there on the net, and how much help
current protocols and web servers would be for such things.
~
Only on pages designed back in 1994-95 could you use the Last-Modified
response header as a way to get an idea of whether something on the
page might have changed. Current pages on almost all sites are
googled, syndicated, or just filled up with an incredible amount of
clutter and nonsense. This makes searching the net a time-consuming
and not very reliable endeavor, among many other things, because
search engines use page contextualization: if you search for, say,
CSS, you may find lots of pages that merely had the acronym "CSS" in a
left frame as a jump-off link to another page, or that had it included
as a credit ("css designed by ...") in the page's footer.
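~
For the old-style case, here is a minimal sketch of that kind of
change detection, in Python with only the standard library. The URL is
just a placeholder, and only servers that actually honor
If-Modified-Since will answer with 304:

import urllib.error
import urllib.request

URL = "http://example.com/page.html"  # placeholder URL

def fetch_if_modified(url, last_seen=None):
    """Return (body, last_modified) when the page changed,
    or (None, last_seen) when the server says 304 Not Modified."""
    req = urllib.request.Request(url)
    if last_seen:
        # Ask the server to skip the body if nothing changed
        req.add_header("If-Modified-Since", last_seen)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified since last_seen
            return None, last_seen
        raise

body, stamp = fetch_if_modified(URL)
# on a later run, pass the saved stamp: fetch_if_modified(URL, stamp)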
~
I really don't know if and how the actual content of pages is
indexed. I was thinking of basically:
~
* keeping local copies of certain pages,
* running tidy on them to make them well-formed XML,
* keeping and managing XPath indexes of the pages, and
* writing parsers to get the meat out of the pages (a rough sketch
  follows this list)
~
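Something along these lines is what I have in mind; a minimal sketch
in Python, assuming the tidy command-line tool and the third-party
lxml package are installed. The file name and the XPath expression are
made-up placeholders that would have to be adjusted per site:

import subprocess
from lxml import etree

LOCAL_COPY = "page.html"  # placeholder: a locally saved page
XHTML_NS = {"h": "http://www.w3.org/1999/xhtml"}
CONTENT_XPATH = "//h:div[@id='content']//h:p"  # placeholder, site-specific

# -asxml makes tidy emit well-formed XHTML; --numeric-entities yes
# avoids named entities (&nbsp; etc.) that a plain XML parser would
# choke on; -q keeps the summary out of stdout. tidy exits nonzero for
# mere warnings, so the return code is deliberately not checked.
xhtml = subprocess.run(
    ["tidy", "-asxml", "-q", "--numeric-entities", "yes", LOCAL_COPY],
    capture_output=True,
).stdout

# tidy puts its output in the XHTML namespace, hence the h: prefix.
root = etree.fromstring(xhtml)
for p in root.xpath(CONTENT_XPATH, namespaces=XHTML_NS):
    print("".join(p.itertext()).strip())
~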
Any libraries or solid/comprehensive studies out there?
~
Thanks
lbrtchx
 
