onetitfemme
Say, people would like to log into their hotmail, yahoo and gmail
accounts and "keep an eye" on some text/part of a site
..
I think something like that should be out there, since not all sites
provide RSS feeds nor are they really interested in providing
consistent and informative content (what we (almost) all are looking
for).
..
I have been programming mostly in Java lately. This is how I think such an
API could (very basically indeed) be implemented:
..
1. Get the HTML text.
2. Run it through an HTML to XML/XHTML cleanser (tidy nicely fits the
bill, but I truly hate how it changes character entities whichever way
it thinks without giving you an option to let them be as you coded
them. I haven't thoroughly checked JTidy, though)
3. Parse the output of step 2 using a SAX parser and handle the callbacks
it produces, based on
4. some XPath-like metadata that is kept from the page, plus some more
metadata on how it should be processed ...
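..
To make steps 3 and 4 a bit more concrete, here is a minimal sketch, assuming the page has already been cleaned into well-formed XML in step 2 (no DOCTYPE, no namespaces). The class name PathWatcher, the extract method and the sample page are all made up for illustration; the "XPath-like metadata" is reduced to one absolute element path, and only text sitting directly under that path is collected.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text found directly under one absolute
// element path (e.g. "/html/body/div/p") while SAX streams through the page.
class PathWatcher extends DefaultHandler {
    private final String watchedPath;
    private final StringBuilder path = new StringBuilder(); // path of currently open elements
    private final StringBuilder text = new StringBuilder(); // text collected so far

    PathWatcher(String watchedPath) {
        this.watchedPath = watchedPath;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.append('/').append(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Drop the "/name" segment that startElement appended for this element.
        path.setLength(path.length() - qName.length() - 1);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Keep text only while the open-element path equals the watched path.
        if (path.toString().equals(watchedPath)) {
            text.append(ch, start, length);
        }
    }

    // Assumes xhtml is already well-formed (the output of step 2).
    static String extract(String xhtml, String watchedPath) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PathWatcher handler = new PathWatcher(watchedPath);
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse page", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div><p>Price: 42</p></div><p>footer</p></body></html>";
        System.out.println(extract(page, "/html/body/div/p")); // prints "Price: 42"
    }
}
```

Storing the extracted string (or a hash of it) between runs and comparing would be one way to "keep an eye" on that part of the page. Nothing is held in memory beyond the current path and the collected text.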
..
I know XPath might not be the right technology, since it is normally
evaluated against a DOM tree of the whole page held in memory, and that
might get a little taxing when you are processing many pages ...
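..
For comparison, this is roughly what the plain XPath route looks like with the javax.xml.xpath API that ships with the JDK. It is only a sketch under the same assumptions as before (input already well-formed, no DOCTYPE, no namespaces); XPathWatcher and the sample page are hypothetical names. The call builds a tree of the whole document before evaluating the expression, which is the memory cost in question.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one XPath expression against a cleaned page
// and returns the string value of the first matching node.
class XPathWatcher {
    static String extract(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // The whole document is parsed into an in-memory tree first.
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><div><p>Price: <b>42</b></p></div></body></html>";
        System.out.println(extract(page, "/html/body/div/p")); // prints "Price: 42"
    }
}
```

The upside is that the full XPath language is available (predicates, attributes, text inside nested elements); the downside is one complete tree per page.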
..
I recall there was some Java project called HTMLCLient, but I wonder
what happened to it.
..
I think search engines use similar algorithms, and I was wondering
how the masters do it.
..
Thanks
onetitfemme