Phht! on screenscraping


Roedy Green

On my website I have links to 20 different bookstores. The problem
is, there is no guarantee that all the bookstores actually carry any given
book. I wanted to grey out links to bookstores that don't currently
carry that particular book.

This means probing every bookstore with every ISBN to see if they have
it. I discovered I needed an average of 8 marker strings to analyse
the response. There are about 4 different ways they say they have the
book and 4 ways to say they do not. I found this by trial and error,
adding more and more strings and seeing if there were responses that
could not be categorised, then translating and examining the responses
for likely markers, then looking at the original. This was complicated
somewhat since some of the bookstores are in German, French, Italian
and Spanish.
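
Roughly, the categorisation code amounts to something like this; the
class names and the idea of per-store marker lists are only an
illustrative sketch, not my actual program:

// Sketch only: classify a bookstore's response page by scanning for
// known marker strings. Categories and markers here are illustrative.
import java.util.List;

enum Stock { IN_STOCK, OUT_OF_STOCK, UNKNOWN }

final class StockSniffer {
    // Each store gets its own "have it" / "don't have it" markers,
    // in whatever language its pages happen to use.
    private final List<String> haveMarkers;
    private final List<String> lackMarkers;

    StockSniffer(List<String> haveMarkers, List<String> lackMarkers) {
        this.haveMarkers = haveMarkers;
        this.lackMarkers = lackMarkers;
    }

    Stock classify(String page) {
        for (String m : haveMarkers) {
            if (page.contains(m)) return Stock.IN_STOCK;
        }
        for (String m : lackMarkers) {
            if (page.contains(m)) return Stock.OUT_OF_STOCK;
        }
        return Stock.UNKNOWN; // time to hunt for another marker string
    }
}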

As the bookstores change their wordings, I will have to keep adjusting
my program to keep up.

All this would be so much easier if the bookstores would offer an
alternate computer-friendly API. You could give them an ISBN, and
they could give you back some XML, JSON, CSV etc., with a single Yes/No
in-stock field. It would take them all of an hour to cook something
up. Sometimes they do it, but make it so complicated and so volatile
you might as well screenscrape.
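
For example, the entire response to a probe could be as small as this
(the field names are made up here; they would be whatever got agreed on):

   { "isbn": "...", "instock": "Y" }

or the XML equivalent:

   <stock isbn="..." instock="N"/>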

Ditto companies that sell posters, or that sell anything else via
affiliates; they need that sort of API.
--
Roedy Green Canadian Mind Products
http://mindprod.com
It should not be considered an error when the user starts something
already started or stops something already stopped. This applies
to browsers, services, editors... It is inexcusable to
punish the user by requiring some elaborate sequence to atone,
e.g. open the task editor, find and kill some processes.
 

Daniel Pitts

> All this would be so much easier if the bookstores would offer an
> alternate computer-friendly API. You could give them an ISBN, and
> they could give you back some XML, JSON, CSV etc., with a single Yes/No
> in-stock field. It would take them all of an hour to cook something
> up. Sometimes they do it, but make it so complicated and so volatile
> you might as well screenscrape.
>
> Ditto companies that sell posters, or that sell anything else via
> affiliates; they need that sort of API.

As an employee of a company that has introduced an XML API (which is
used both internally and externally), I can speak from experience that
it takes far more than an hour to cook up. Not only that, but it
requires constant maintenance and operational support. It *is* worth it
for the company because it provides benefits (easier to support a
front-end webapp which doesn't connect to databases, easy to provide
data to partners, etc.), however for an average bookstore, providing
that data through an API may actually cost them money rather than save
them money. The data itself probably has value, and the maintenance of
the system to provide that data has a cost.

I would love it if all data was available freely (as in free speech and
free beer). I would also love it if all data could be standardized and
normalized appropriately. I'd also like a unicorn and world peace. I
think all of those things go together, but I'd expect to see a unicorn
before any of the others. Just genetically engineer a narwhal crossed
with a pony.

The rest is much more complicated.
 

Arne Vajhøj

> On my website I have links to 20 different bookstores. The problem
> is, there is no guarantee that all the bookstores actually carry any given
> book. I wanted to grey out links to bookstores that don't currently
> carry that particular book.
>
> This means probing every bookstore with every ISBN to see if they have
> it. I discovered I needed an average of 8 marker strings to analyse
> the response. There are about 4 different ways they say they have the
> book and 4 ways to say they do not. I found this by trial and error,
> adding more and more strings and seeing if there were responses that
> could not be categorised, then translating and examining the responses
> for likely markers, then looking at the original. This was complicated
> somewhat since some of the bookstores are in German, French, Italian
> and Spanish.
>
> As the bookstores change their wordings, I will have to keep adjusting
> my program to keep up.

Lot of work.

And potentially not legal.

> All this would be so much easier if the bookstores would offer an
> alternate computer-friendly API. You could give them an ISBN, and
> they could give you back some XML, JSON, CSV etc., with a single Yes/No
> in-stock field. It would take them all of an hour to cook something
> up.

If you have worked in professional software development, then
you would know that there is no such thing as adding a new
feature with one hour of work.

Arne
 

Arved Sandstrom

> On my website I have links to 20 different bookstores. The problem
> is, there is no guarantee that all the bookstores actually carry any given
> book. I wanted to grey out links to bookstores that don't currently
> carry that particular book.
>
> This means probing every bookstore with every ISBN to see if they have
> it. I discovered I needed an average of 8 marker strings to analyse
> the response. There are about 4 different ways they say they have the
> book and 4 ways to say they do not. I found this by trial and error,
> adding more and more strings and seeing if there were responses that
> could not be categorised, then translating and examining the responses
> for likely markers, then looking at the original. This was complicated
> somewhat since some of the bookstores are in German, French, Italian
> and Spanish.
>
> As the bookstores change their wordings, I will have to keep adjusting
> my program to keep up.
>
> All this would be so much easier if the bookstores would offer an
> alternate computer-friendly API. You could give them an ISBN, and
> they could give you back some XML, JSON, CSV etc., with a single Yes/No
> in-stock field. It would take them all of an hour to cook something
> up. Sometimes they do it, but make it so complicated and so volatile
> you might as well screenscrape.
>
> Ditto companies that sell posters, or that sell anything else via
> affiliates; they need that sort of API.

An hour? Even if one single bookstore decided to do that with their own
proprietary API, and they owned their own server and had a dedicated
developer on staff, it still wouldn't happen quite that quickly. And how
would you, the consumer, then find out about this API? You don't really
believe in things like UDDI still, right? And assuming you did have some
way of discovering the API, you'd still have to adapt your own client
code for it.

Way over an hour.

And _whose_ API is that? Individual bookstore API? Not practical. So
does a chain decide to do that instead? Committees, approvals. Months of
work. Industry-wide consortium, conflicting with existing proprietary
APIs? Years or never.

You're actually better off screenscraping. I definitely don't see how
this would be more work than dealing with thousands of different APIs.

AHS
 

Arne Vajhøj

> You're actually better off screenscraping. I definitely don't see how
> this would be more work than dealing with thousands of different APIs.

I can see two advantages of an API over screen scraping for the consuming
side of the service:
* more robust in regard to handling unusual data
* easier to see what to change when a new version comes out (they
may even announce changes to an API in advance)

Arne
 

Arved Sandstrom

> I can see two advantages of an API over screen scraping for the consuming
> side of the service:
> * more robust in regard to handling unusual data
> * easier to see what to change when a new version comes out (they
> may even announce changes to an API in advance)
>
> Arne

Well, according to Roedy he's got a not overly-complicated-sounding
screenscraping algorithm that works for roughly 20 bookstore websites,
and there's no reason to believe that the algorithm would change
substantially if he added another 20 sites to the list. Unless all of
the bookstores that he is interested in offered the same useful API,
he'd still have to have the screenscraping code handy.

Besides, assuming it was legal, *Roedy* could offer the API as a
service. He's the aggregating screenscraper, does all the heavy lifting,
and other people can query *his* web service.

AHS
 

Arne Vajhøj

> Well, according to Roedy he's got a not overly-complicated-sounding
> screenscraping algorithm that works for roughly 20 bookstore websites,
> and there's no reason to believe that the algorithm would change
> substantially if he added another 20 sites to the list. Unless all of
> the bookstores that he is interested in offered the same useful API,
> he'd still have to have the screenscraping code handy.

Until everybody does it right, he will still need the hack.

But the number of cases and changes should decrease with a smaller number
of non-API sites.

Arne
 

Roedy Green

> Way over an hour.

The whole process, yes. But writing the code to generate the file, given
that the info is already in a database, I doubt would take me more than
an hour.
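
Something on the order of this, assuming the store already has a web
front end and an inventory table to query. Everything here is a made-up
sketch, not anybody's real schema:

// Sketch only: a hypothetical servlet answering an in-stock probe from
// an existing inventory table. Table, column and field names are made up.
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

public class InStockServlet extends HttpServlet {
    private DataSource pool; // wired up elsewhere, e.g. via JNDI

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String isbn = req.getParameter("isbn");
        boolean inStock = false;
        try (Connection c = pool.getConnection();
             PreparedStatement ps = c.prepareStatement(
                 "SELECT quantity FROM inventory WHERE isbn = ?")) {
            ps.setString(1, isbn);
            try (ResultSet rs = ps.executeQuery()) {
                inStock = rs.next() && rs.getInt(1) > 0;
            }
        } catch (Exception e) {
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
            return;
        }
        resp.setContentType("application/json");
        // Naive JSON; a real version would escape the ISBN properly.
        resp.getWriter().print("{\"isbn\":\"" + isbn + "\",\"instock\":\""
                + (inStock ? "Y" : "N") + "\"}");
    }
}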

On the other hand, when you total up the hours of client-side and
server-side code, it is orders of magnitude easier to create and
maintain a programmer API. The big advantage is not so much the
format, but that it sits still (or is XML-style upward compatible) and
has no undocumented variants.
--
Roedy Green Canadian Mind Products
http://mindprod.com
It should not be considered an error when the user starts something
already started or stops something already stopped. This applies
to browsers, services, editors... It is inexcusable to
punish the user by requiring some elaborate sequence to atone,
e.g. open the task editor, find and kill some processes.
 

Roedy Green

> Besides, assuming it was legal, *Roedy* could offer the API as a
> service. He's the aggregating screenscraper, does all the heavy lifting,
> and other people can query *his* web service.

There are all kinds of companies doing just that, though they don't
think of themselves that way.

See http://mindprod.com/jgloss/bookstores.html

There are many services that let you find out which bookstores carry a
given book. If you order through them, they get a finder's fee.

They have my problem magnified many times, since they may be polling
200+ bookstores. I poll only 20.

If we had a common API to get info about a book from a store, this
programming task would be trivial and would not require constant
maintenance. Further, it would not fail in production. No bookstore
gives any warning that it is changing the format of its pages, or is
adding or changing wordings.

Further, you would not need to deal with many languages in your code.

It would not be that hard to come up with a format of the file and an
API to fetch it, and even write a sample client and server app. The
hard part is political, selling it. Perhaps Google might ask its
customers to implement it, or the ISBN people.
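
The client side could be about this small. The URL pattern and the field
name here are hypothetical, just to show the shape of it:

// Sketch only: a client for a common "is this ISBN in stock" API.
// The URL pattern and the in-stock field are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class StockClient {
    /** Returns true if the store reports the given ISBN as in stock. */
    public static boolean inStock(String storeBaseUrl, String isbn) throws Exception {
        URL url = new URL(storeBaseUrl + "/instock?isbn=" + isbn);
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            for (String line; (line = in.readLine()) != null; ) {
                response.append(line);
            }
        }
        // With a one-field response, even this crude test would do.
        return response.toString().contains("\"instock\":\"Y\"");
    }
}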

Perhaps somebody has already done that. It would just take inquiries
to bookstores asking them for the URL to access the XXX API.

--
Roedy Green Canadian Mind Products
http://mindprod.com
It should not be considered an error when the user starts something
already started or stops something already stopped. This applies
to browsers, services, editors... It is inexcusable to
punish the user by requiring some elaborate sequence to atone,
e.g. open the task editor, find and kill some processes.
 

Arved Sandstrom

> There are all kinds of companies doing just that, though they don't
> think of themselves that way.
>
> See http://mindprod.com/jgloss/bookstores.html
>
> There are many services that let you find out which bookstores carry a
> given book. If you order through them, they get a finder's fee.
>
> They have my problem magnified many times, since they may be polling
> 200+ bookstores. I poll only 20.
>
> If we had a common API to get info about a book from a store, this
> programming task would be trivial and would not require constant
> maintenance. Further, it would not fail in production. No bookstore
> gives any warning that it is changing the format of its pages, or is
> adding or changing wordings.
>
> Further, you would not need to deal with many languages in your code.
>
> It would not be that hard to come up with a format of the file and an
> API to fetch it, and even write a sample client and server app. The
> hard part is political, selling it. Perhaps Google might ask its
> customers to implement it, or the ISBN people.
>
> Perhaps somebody has already done that. It would just take inquiries
> to bookstores asking them for the URL to access the XXX API.

You have described the problem well. I am no expert in this domain, but
two existing APIs that stand out in this discussion are the Google Books
API (http://code.google.com/apis/books/docs/v1/getting_started.html) and
the Amazon Product Advertising API
(https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html).

For example, in the Amazon API the ItemSearch and SimilarityLookup web
service operations are just your ticket. Google Books API has 'list' and
'get' as REST actions.
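
For instance, a Google Books 'list' request by ISBN looks roughly like
this; I'm quoting the response shape from memory, so check the docs, but
saleInfo is the part that comes closest to an availability answer:

   GET https://www.googleapis.com/books/v1/volumes?q=isbn:<13-digit ISBN>

   {
     "totalItems": 1,
     "items": [ {
       "volumeInfo": { "title": "...", "industryIdentifiers": [ ... ] },
       "saleInfo": { "country": "CA", "saleability": "FOR_SALE" }
     } ]
   }

Note that this tells you whether Google will sell it to you, not whether
a particular bookstore has it on the shelf.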

Neither of these actually helps our problem; they are just examples of
what we would like to have. You're right that the problem is primarily
political; it's getting myriad bookstores to adopt a Simple Bookstore API.

It's not completely trivial technologically, though: your WSDL will be
uniform, but you'd need to write and provide implementations for PHP and
Java and all your other target languages. And _those_ implementations
would probably need to be written as SPIs, so that appropriate code in
each bookstore's backend logic (for their existing website) can be
identified and plugged in (likely with adapters).
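
In Java terms the SPI could be as thin as this; the names are invented,
and each store would hide its existing inventory code behind an adapter
like the second class:

// Sketch only: a service provider interface each bookstore's backend
// would implement behind the uniform web service. Names are made up.
public interface StockProvider {
    /** @return true if the store can supply the given ISBN right now. */
    boolean isInStock(String isbn);
}

// Hypothetical adapter wrapping one store's existing inventory lookup.
class AcmeStockAdapter implements StockProvider {
    interface AcmeInventory { int onHand(String isbn); } // stand-in for their code

    private final AcmeInventory inventory;

    AcmeStockAdapter(AcmeInventory inventory) {
        this.inventory = inventory;
    }

    @Override
    public boolean isInStock(String isbn) {
        return inventory.onHand(isbn) > 0;
    }
}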

AHS
 

Lew

Arved said:
> You have described the problem well. I am no expert in this domain, but
> two existing APIs that stand out in this discussion are the Google Books
> API (http://code.google.com/apis/books/docs/v1/getting_started.html) and
> the Amazon Product Advertising API
> (https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html).
>
> For example, in the Amazon API the ItemSearch and SimilarityLookup web
> service operations are just your ticket. Google Books API has 'list' and
> 'get' as REST actions.
>
> Neither of these actually helps our problem; they are just examples of
> what we would like to have. You're right that the problem is primarily
> political; it's getting myriad bookstores to adopt a Simple Bookstore API.
>
> It's not completely trivial technologically, though: your WSDL will be
> uniform, but you'd need to write and provide implementations for PHP and
> Java and all your other target languages. And _those_ implementations
> would probably need to be written as SPIs, so that appropriate code in
> each bookstore's backend logic (for their existing website) can be
> identified and plugged in (likely with adapters).

http://xkcd.com/927/
 

Arne Vajhøj

> The whole process, yes. But writing the code to generate the file, given
> that the info is already in a database, I doubt would take me more than
> an hour.

Well - those companies have to pay for the whole process, not just
for the code typing.

But if you offer them money for providing the API, so that it will not be
a cost to them, then they may consider doing it.

Arne
 
