Trolling a site for data

Discussion in 'ASP .Net' started by Steffan A. Cline, Nov 9, 2009.

  1. I was trying to find a way to troll/poll/scrape a site for data.
    Unfortunately the site uses AJAX with ASP.NET (.aspx), which I have never
    worked with before. If this site used PHP, Lasso or the like, it would be
    easy to grab the URL and query the data directly. I have done similar things
    before where the pages use a plain form and then paginate through the
    results (1-100, 101-200, etc.).

    The example would be this site :
    http://www.tblaw.com/FsSales/PendingSales.aspx

    I can simply fetch the URL with curl or something similar, but it only
    gets the first 60 records. No matter what I do, I can't find out how to get
    more than the 60.

    On another site, for example, I hit the page first to get the cookie and
    event and action so that I can keep posting them to the next page with the
    page parameters and then parse the results.

    Sorry if I am not explaining this very well.

    Any suggestions?

    Thanks,
    Steffan
     
    Steffan A. Cline, Nov 9, 2009
    #1

  2. On Nov 9, 4:59 am, "Steffan A. Cline" <> wrote:
    > I was trying to find a way to troll/poll/scrape a site for data.
    > Unfortunately the site uses AJAX with .asp which I have never worked with
    > before. If this site used php, lasso or the like it would be easy to grab
    > the url and query the data directly. I have done similar things before where
    > the pages use a plain form and then paginate through the results. (1-100,
    > 101-200 etc).
    >
    > The example would be this site: http://www.tblaw.com/FsSales/PendingSales.aspx
    >
    > I can simply include the url to the site via the likes of curl or something
    > but it only gets the first 60 records. No matter what I do, I can't find out
    > how to get more than the 60.
    >
    > On another site, for example, I hit the page first to get the cookie and
    > event and action so that I can keep posting them to the next page with the
    > page parameters and then parse the results.
    >
    > Sorry if I am not explaining this very well.
    >
    > Any suggestions?
    >
    > Thanks,
    > Steffan


    You have to learn how the AJAX on that page works. Usually it's a script
    (JavaScript or VBScript) that requests some data from the server and renders
    the returned data into the page. That means you need to find out how it is
    implemented in each particular case and read the output of the script/page
    that returns the data.
     
    Alexey Smirnov, Nov 9, 2009
    #2

  3. Alexey Smirnov wrote on 11/9/09 1:03 AM:

    > You have to learn how ajax is working. Usually it's a java/vb script
    > that requests some data from the server, and takes and renders the
    > resulting data back to the page. It means that you need to find how it
    > is implemented in every particular case and read the output from the
    > script/page that returns the resulting data.


    Right. I get that. The problem is that ASP.NET does an outstanding job of
    obfuscating. With a normal JS-based AJAX query, you can easily see the URL
    and parameters being sent. The deal is that ASP.NET sends waaaay more data.

    I was hoping someone could help figure out exactly how ASP.NET is doing it.
    I tried parsing the headers, with no luck.

    Thanks,
    Steffan
     
    Steffan A. Cline, Nov 9, 2009
    #3
  4. One area where ASP.NET is different is its postback model. There are
    hidden fields __EVENTTARGET and __EVENTARGUMENT that contain info on the
    postback control. __VIEWSTATE contains state information. Before you
    can do a form post to an ASP.NET server, you must do a GET to obtain a
    valid viewstate.

    In your case you need to do a GET to fetch the page one data and a
    viewstate, then a form post (filling in __EVENTTARGET) to get page two and
    the viewstate for page three.
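
    The GET-then-POST step above could be sketched roughly as follows, in
    Python. This is only an illustration, not code from anyone in this thread:
    the names HiddenFieldParser, extract_hidden_fields and
    build_postback_fields are made up, and it assumes the hidden fields appear
    as ordinary <input type="hidden"> tags in the HTML returned by the GET.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of hidden form fields found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


def build_postback_fields(html, event_target, event_argument=""):
    """Hidden fields from the previous response plus the postback target."""
    fields = extract_hidden_fields(html)
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return fields
```

    The returned dict would then be sent as the form body of the POST, and the
    same extraction repeated on each response to pick up the next viewstate.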

    If the site uses an UpdatePanel, then it's just a little trickier. The
    UpdatePanel posts all the form data (there will be hidden fields to
    identify it as an async postback) via XMLHttpRequest, and gets back just
    the HTML (in a pretty simple format) for a subsection of the page. You
    will need to parse this for your data, the new viewstate, and any form
    field updates (keep track of all the form fields from before the post and
    merge in the results).
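
    A rough sketch of parsing such a partial response, in Python. The record
    layout used here (length|type|id|content| repeated, with the length
    counted in characters) is an assumption about the async postback format
    and is not spelled out in this thread, so check it against a real response
    captured from the site. The function names parse_delta and apply_delta are
    made up for the example.

```python
def parse_delta(text):
    """Split an async-postback response into (type, id, content) tuples.

    Assumed layout: length|type|id|content| repeated until the end.
    """
    records = []
    pos = 0
    while pos < len(text):
        # Three pipe-terminated header values: length, type, id.
        bar = text.index("|", pos)
        length = int(text[pos:bar])
        pos = bar + 1
        bar = text.index("|", pos)
        kind = text[pos:bar]
        pos = bar + 1
        bar = text.index("|", pos)
        ident = text[pos:bar]
        pos = bar + 1
        # The content may itself contain pipes, so rely on the length.
        content = text[pos:pos + length]
        pos += length + 1  # skip the content and its trailing pipe
        records.append((kind, ident, content))
    return records


def apply_delta(fields, records):
    """Merge hiddenField records into a copy of the tracked form fields."""
    updated = dict(fields)
    for kind, ident, content in records:
        if kind == "hiddenField":
            updated[ident] = content
    return updated
```

    The updatePanel records would carry the HTML to scrape, and the merged
    fields would be posted back to request the following page.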

    -- bruce (sqlwork.com)

     
    bruce barker, Nov 9, 2009
    #4
  5. On Nov 9, 2:29 pm, "Steffan A. Cline" <> wrote:
    > Right. I get that. The problem is that asp.net does an outstanding way of
    > obfuscating. On a normal JS based AJAX query, you can easily see the URL and
    > parameters being sent. The deal is that asp.net sends waaaay more data.
    >
    > I was hoping someone could help figure out the way exactly that asp.net is
    > doing it. I tried parsing the headers and no luck.
    >
    > Thanks,
    > Steffan


    As Bruce correctly noted, look into the postback data; in most cases all
    the information is there. For instance, if we take your URL as an example,
    we will see that the gridview has paging links 1, 2, 3, etc. These links
    initiate asynchronous postbacks and cause a partial-page update. Each
    link has an id like 'ListView1$PagerTop$ctl01$ctlXX', where XX is 00 for
    page #1, 01 for page #2, etc., and a URL like
    javascript:__doPostBack('ListView1$PagerTop$ctl01$ctlXX',''). What does
    that mean? It means that the number of the new page is sent via the
    postback as the id of the link control. Sounds simple, right? Send a
    request to the remote server in which your __EVENTTARGET is
    ListView1%24PagerTop%24ctl01%24ctl01 when you want to get page #2. If the
    page controls are based on viewstate, you need to copy the viewstate into
    the request as well. This is probably where you were confused by so much
    data. ViewState is used to retain the state of controls between
    postbacks. Again, if we take your example, we don't change any control
    state, which means that you can copy the original viewstate from the very
    first page. If it's necessary to know what the ViewState contains, you
    can decode it. There are some tools to do it, for example:

    http://lachlankeown.blogspot.com/2008/05/online-viewstate-viewer-decoder.html
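
    For a quick look without an online tool, something like the Python sketch
    below may be enough. It only base64-decodes the field and lists the
    readable ASCII runs; it does not parse the serialized structure, and it
    assumes the viewstate is not encrypted. The function name
    viewstate_strings is made up for the example.

```python
import base64
import re


def viewstate_strings(viewstate, min_len=4):
    """Base64-decode a __VIEWSTATE value and return printable ASCII runs."""
    raw = base64.b64decode(viewstate)
    # Runs of at least min_len printable ASCII bytes (space through tilde).
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, raw)]
```

    Control ids and cell text tend to show up as readable runs, which is
    usually enough to tell whether the viewstate carries the data you want.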

    To debug HTTP requests, use Fiddler Web Debugger.
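
    The paging request described above might be built like this in Python.
    It is a sketch under the assumptions stated in this post (pager ids of
    the form ListView1$PagerTop$ctl01$ctlXX, with 00 for page #1); the
    function names pager_target and paging_body are made up, and the real
    page may require further fields from the form (for example
    __EVENTVALIDATION, if the page has it), which you can confirm by
    comparing against a request captured from the browser.

```python
from urllib.parse import urlencode


def pager_target(page, prefix="ListView1$PagerTop$ctl01$ctl"):
    """Control id of the pager link for a 1-based page number."""
    return "%s%02d" % (prefix, page - 1)


def paging_body(page, viewstate):
    """URL-encoded form body that asks the server for the given page."""
    fields = {
        "__EVENTTARGET": pager_target(page),
        "__EVENTARGUMENT": "",
        "__VIEWSTATE": viewstate,  # copied from the first GET of the page
    }
    # urlencode escapes '$' as %24, as in the __EVENTTARGET value above.
    return urlencode(fields)
```

    The resulting string would be sent as the body of a POST to the page URL
    with Content-Type application/x-www-form-urlencoded.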
     
    Alexey Smirnov, Nov 11, 2009
    #5