screen scrape + login

n8

Hi,

I have to do the following and have been racking my brain with
various solutions that have had not-so-great results.

I want to use the System.Net.WebClient to submit data to a form (log a
user in) and then redirect to the correct article.

Here is the scenario.
If you are not logged in to the site, certain articles redirect you to
an shtml login page. The login.shtml page posts to another URL for
authentication and then lets you in. If you have clicked on an article
that requires a login, you are sent to the login page with the article
URL appended:
www.domainname.com?orq:http://domainname.com/stories/112404/nca_2653091.shtml.
I have tried sending a WebClient request to the URL that the above
login form posts to, but I keep getting Method Not Allowed.

Any ideas?
 
bruce barker

More info is required, but here is a typical login flow:

1) You request a page with WebClient.
2) You are returned a redirect header to the login page.
3) Your code detects the login redirect, then posts the required form data to
the login page (manually view the login page to get the required form fields
and method).

Note: an ASP.NET login site requires that you actually do a GET of the
login page to get a valid viewstate to post back. Other systems may also
require scraping data from the GET response before doing the actual POST.

4) a successful post to the login will return a cookie value you must send
on subsequent requests, and a redirect header to the originally requested
page.
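
The four steps above can be sketched roughly as follows. This is only an illustration under assumptions, not a tested client for any particular site: the class and method names (`LoginFlowSketch`, `FetchWithLogin`, `BuildFormBody`) are made up for the example, `formActionUrl` stands for whatever URL the login form's action attribute points at, and the field names come from viewing the login page by hand.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class LoginFlowSketch
{
    // Encode name/value pairs as application/x-www-form-urlencoded.
    // A list of pairs (not a dictionary) allows repeated field names.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var sb = new StringBuilder();
        foreach (var field in fields)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(Uri.EscapeDataString(field.Key))
              .Append('=')
              .Append(Uri.EscapeDataString(field.Value));
        }
        return sb.ToString();
    }

    // True for the common redirect status codes that carry a Location header.
    public static bool IsRedirect(HttpStatusCode status)
    {
        int code = (int)status;
        return code == 301 || code == 302 || code == 303 || code == 307;
    }

    private static HttpWebRequest NewRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;   // same cookie jar on every request
        request.AllowAutoRedirect = false;   // keep redirects visible to our code
        return request;
    }

    public static string FetchWithLogin(string pageUrl, string formActionUrl,
        IEnumerable<KeyValuePair<string, string>> loginFields)
    {
        var cookies = new CookieContainer();

        // Steps 1-2: request the page; a redirect is taken to mean "login needed".
        using (var first = (HttpWebResponse)NewRequest(pageUrl, cookies).GetResponse())
        {
            if (!IsRedirect(first.StatusCode))
            {
                using (var reader = new StreamReader(first.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            }
        }

        // Step 3: post the form fields to the form's action URL.
        HttpWebRequest post = NewRequest(formActionUrl, cookies);
        post.Method = "POST";
        post.ContentType = "application/x-www-form-urlencoded";
        byte[] body = Encoding.ASCII.GetBytes(BuildFormBody(loginFields));
        post.ContentLength = body.Length;
        using (Stream stream = post.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        post.GetResponse().Close();          // any cookie issued lands in 'cookies'

        // Step 4: repeat the original request, now carrying the login cookie.
        using (var page = (HttpWebResponse)NewRequest(pageUrl, cookies).GetResponse())
        using (var reader = new StreamReader(page.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Sharing one CookieContainer across all three requests is what carries the login cookie from the POST response to the final GET.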


-- bruce (sqlwork.com)

 
Joe Fallon

Scott,
FYI - that was one of the best articles on the subject I ever read.
I was completely stuck on this issue about 6 months ago and I implemented it
straight away using the concepts you presented here.

Excellent work and explanation.
 
n8

Thanks for the example. I had seen your example earlier and tried it,
but I always reach one particular point that I cannot seem to get
beyond. There are two hidden fields, both called web.fixed_values, that
appear to be something like a viewstate, but the page is shtml. I
have been able to pull down the site, etc., but every time I try to
post my data (with or without the web.fixed_values) I always get the
response Method Not Allowed. Below is the code I am using, along with
the site I am trying to access with my account. Any further help on
this would be greatly appreciated.

private void Page_Load(object sender, System.EventArgs e)
{
    string LOGIN_URL = "http://augustachronicle.com/login.shtml";
    string cookieAge = "31536000";

    try
    {
        // fetch the login page first so the hidden fields can be scraped
        HttpWebRequest webRequest = WebRequest.Create(LOGIN_URL) as HttpWebRequest;

        StreamReader responseReader = new StreamReader(
            webRequest.GetResponse().GetResponseStream());

        string responseData = responseReader.ReadToEnd();
        responseReader.Close();

        // get the web fixed values
        string fixedvalue1 = ExtractFixedValues1(responseData);
        string fixedvalue2 = ExtractFixedValues2(responseData);

        // userName and password are not declared in this snippet; they are
        // assumed to be defined elsewhere in the page class
        string postData = String.Format(
            "web.fixed_values={0}&web.fixed_values={1}&ACTION=Login&USER={2}&PASS={3}&cookie_age={4}",
            fixedvalue1, fixedvalue2, userName, password, cookieAge);

        // have a cookie container ready to receive the forms auth cookie
        CookieContainer cookies = new CookieContainer();

        // now post to the login form
        webRequest = WebRequest.Create(LOGIN_URL) as HttpWebRequest;
        webRequest.Method = "POST";
        webRequest.ContentType = "application/x-www-form-urlencoded";
        webRequest.CookieContainer = cookies;

        // write the form values into the request message
        StreamWriter requestWriter = new StreamWriter(webRequest.GetRequestStream());
        requestWriter.Write(postData);
        requestWriter.Close();

        // we don't need the contents of the response, just the cookie it issues
        webRequest.GetResponse().Close();

        // now we can send our cookie along with a request for the protected page
        webRequest = WebRequest.Create(
            "http://augustachronicle.com/stories/112404/usc_FBC--SpurrierProfile.shtml")
            as HttpWebRequest;
        webRequest.CookieContainer = cookies;
        responseReader = new StreamReader(
            webRequest.GetResponse().GetResponseStream());

        // and read the response
        responseData = responseReader.ReadToEnd();
        responseReader.Close();

        Response.Write(responseData);
    }
    catch (Exception ex)
    {
        Response.Write(ex.ToString());
    }
}

// returns the URL-encoded value of the first web.fixed_values field
private string ExtractFixedValues1(string s)
{
    string viewStateNameDelimiter = "web.fixed_values";
    string valueDelimiter = "value=\"";

    int viewStateNamePosition = s.IndexOf(viewStateNameDelimiter);
    int viewStateValuePosition = s.IndexOf(valueDelimiter, viewStateNamePosition);

    int viewStateStartPosition = viewStateValuePosition + valueDelimiter.Length;
    int viewStateEndPosition = s.IndexOf("\"", viewStateStartPosition);

    return HttpUtility.UrlEncodeUnicode(
        s.Substring(viewStateStartPosition,
                    viewStateEndPosition - viewStateStartPosition));
}


// returns the URL-encoded value of the second web.fixed_values field
private string ExtractFixedValues2(string s)
{
    string viewStateNameDelimiter = "web.fixed_values";
    string valueDelimiter = "value=\"";

    // locate the end of the first field's value
    int viewStateNamePosition = s.IndexOf(viewStateNameDelimiter);
    int viewStateValuePosition = s.IndexOf(valueDelimiter, viewStateNamePosition);

    int viewStateStartPosition = viewStateValuePosition + valueDelimiter.Length;
    int viewStateEndPosition = s.IndexOf("\"", viewStateStartPosition);

    // drop everything up to that point, then repeat the search
    string sTemp = s.Remove(0, viewStateEndPosition);

    viewStateNamePosition = sTemp.IndexOf(viewStateNameDelimiter);
    viewStateValuePosition = sTemp.IndexOf(valueDelimiter, viewStateNamePosition);

    viewStateStartPosition = viewStateValuePosition + valueDelimiter.Length;
    viewStateEndPosition = sTemp.IndexOf("\"", viewStateStartPosition);

    return HttpUtility.UrlEncodeUnicode(
        sTemp.Substring(viewStateStartPosition,
                        viewStateEndPosition - viewStateStartPosition));
}
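
Since the two extraction methods above differ only in which occurrence they return, one possible consolidation is a single helper that collects every value for a given field name. This is a sketch, not a drop-in replacement: `HiddenFieldSketch` and `ExtractValues` are made-up names, it assumes (as the code above does) that the name attribute comes before a double-quoted value attribute in each input tag, and it returns the raw values without URL-encoding them.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HiddenFieldSketch
{
    // Returns the value attribute of every <input> whose name equals fieldName,
    // in document order. Assumes name="..." appears before value="...".
    public static List<string> ExtractValues(string html, string fieldName)
    {
        string pattern = "<input\\b[^>]*\\bname\\s*=\\s*\"" + Regex.Escape(fieldName)
                       + "\"[^>]*\\bvalue\\s*=\\s*\"([^\"]*)\"";

        var values = new List<string>();
        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            values.Add(match.Groups[1].Value);
        }
        return values;
    }
}
```

With this, ExtractValues(responseData, "web.fixed_values") would yield both hidden values in one pass, and each can be URL-encoded when the post data is built.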
 
Scott Allen

Everything looks like it is in order, Nathan. I'd examine the HTTP
traffic between your program and the server to make sure it all
matches exactly, even little things like the User-Agent header. I had one
financial site reject HttpWebRequests until I set the UserAgent
property to look just like IE. I guess it was a weak attempt at
preventing screen scraping programs.
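
As a rough illustration of the UserAgent suggestion, the sketch below sets a few browser-like headers on an HttpWebRequest. The helper name `BrowserLikeRequestSketch.Create`, the particular User-Agent string (an IE 6 style value), and the Accept and Accept-Language values are only examples; the values to copy are whatever a trace of the real browser shows.

```csharp
using System.Net;

public static class BrowserLikeRequestSketch
{
    // Builds a request whose headers resemble a browser's.
    // Nothing is sent until GetResponse() or GetRequestStream() is called.
    public static HttpWebRequest Create(string url, string referer)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
        request.Accept = "text/html, */*";
        request.Referer = referer;                       // page the form was loaded from
        request.Headers.Add("Accept-Language", "en-us");
        return request;
    }
}
```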
 
n8

Scott,

Thanks for the information. I added a user agent to make it look like
IE, but I still get the 405 Method Not Allowed error message. What is
the best way to monitor the HTTP traffic between my application and
the remote site? Are there any tools I can download to show me what
is going back and forth?

Thanks in advance,

n8
 
Wayne Brantley

Also, if you get a fix - please let us know.

 
n8

Scott,

I loaded the Fiddler tool and traced the HTTP traffic. Everything
being sent looks no different than when I go to the site directly. Am I
to assume that they have a way of blocking screen scrapes and, if so,
how would I explain this?

Thanks,

n8
 
Scott Allen

Hmm - I'm running out of ideas n8.

I know there are sites out there blocking scrapers, but they usually
either block an IP or use client side script and DHTML to try to screw
up programs. If your app is sending the same traffic as the browser
that wouldn't be an issue.

So, my last idea is this:

Last year I had a site that would occasionally reject my web request
from a screen scraping program. It was in a loop moving through a
paged result set, and I couldn't figure out the random failures. On a
whim I put in a few Thread.Sleep calls to slow the scraper down
between requests and it never failed. I'm not sure if they monitored
requests by IP to only allow so many per second or minute or what,
though it was definitely timing related.
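
A minimal sketch of that kind of pacing, assuming a fixed minimum gap between requests is enough. `RequestPacer` and its members are made-up names, and the right gap for any given site is a guess that has to be tuned.

```csharp
using System;
using System.Threading;

public sealed class RequestPacer
{
    private readonly TimeSpan minGap;
    private DateTime lastRequestUtc = DateTime.MinValue;

    public RequestPacer(TimeSpan minGap)
    {
        this.minGap = minGap;
    }

    // How long to wait at 'nowUtc' so requests are at least minGap apart.
    public TimeSpan DelayNeeded(DateTime nowUtc)
    {
        if (lastRequestUtc == DateTime.MinValue) return TimeSpan.Zero;
        TimeSpan elapsed = nowUtc - lastRequestUtc;
        return elapsed >= minGap ? TimeSpan.Zero : minGap - elapsed;
    }

    // Record that a request is being sent at 'nowUtc'.
    public void MarkRequest(DateTime nowUtc)
    {
        lastRequestUtc = nowUtc;
    }

    // Call immediately before each request: sleeps if the last one was too recent.
    public void WaitTurn()
    {
        TimeSpan delay = DelayNeeded(DateTime.UtcNow);
        if (delay > TimeSpan.Zero) Thread.Sleep(delay);
        MarkRequest(DateTime.UtcNow);
    }
}
```

In a scraping loop, one pacer would be created up front (for example with TimeSpan.FromSeconds(2)) and WaitTurn() called before every GetResponse().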

I guess the only other thing I'd do is really double check those HTTP
payloads and make sure everything matches - the headers, the POST data
is properly encoded, the cookie is sent, etc. etc.
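
For double-checking from inside the program, a small helper along these lines can print the status line and headers the server actually returned, including for error responses. It is a sketch with made-up names (`HttpTraceSketch`, `SendAndDescribe`, `FormatHeaders`). One header worth looking for: the HTTP specification says a 405 response should carry an Allow header listing the methods the URL accepts, though not every server sends it.

```csharp
using System.Net;
using System.Text;

public static class HttpTraceSketch
{
    // One "Name: value" line per header.
    public static string FormatHeaders(WebHeaderCollection headers)
    {
        var sb = new StringBuilder();
        foreach (string name in headers.AllKeys)
        {
            sb.Append(name).Append(": ").Append(headers[name]).Append('\n');
        }
        return sb.ToString();
    }

    // Sends the request and returns the status line plus response headers.
    // HttpWebRequest throws WebException for 4xx/5xx, so the response is
    // recovered from the exception in that case.
    public static string SendAndDescribe(HttpWebRequest request)
    {
        HttpWebResponse response;
        try
        {
            response = (HttpWebResponse)request.GetResponse();
        }
        catch (WebException ex)
        {
            response = ex.Response as HttpWebResponse;
            if (response == null) throw;   // no HTTP response at all (DNS failure, timeout, ...)
        }

        using (response)
        {
            return (int)response.StatusCode + " " + response.StatusDescription + "\n"
                 + FormatHeaders(response.Headers);
        }
    }
}
```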

HTH!
 
n8

A different approach: since I have been banging my head against the
wall with this approach, I thought I would try another. I thought I
would create the cookies that the site requires for the user account
on the fly, and everything would be great. I can create the cookies
exactly, BUT if I set the Domain property the cookie does not get
written; if I leave the property alone (do not use it), the cookie
gets written as localhost. How do I get around this so I can set the
Domain property?

thanks again

n8
 
Scott Allen

I remember trying a similar approach once, but I believe it is a
security feature that doesn't let us create a cookie for another
domain. The IE ActiveX control wouldn't let me pass cookies in at all
programmatically. Argh.
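
For what it is worth, that restriction concerns cookies handed to a browser. If the protected page is fetched server-side with HttpWebRequest, as in the code earlier in the thread, a CookieContainer can be pre-loaded with a cookie for any site. The sketch below shows only that mechanism: `CookieJarSketch` and the cookie name and value are hypothetical, and it does not put anything into the visitor's own browser.

```csharp
using System;
using System.Net;

public static class CookieJarSketch
{
    // Returns a cookie container holding one cookie scoped to 'site'.
    // Assign it to HttpWebRequest.CookieContainer before sending the request.
    public static CookieContainer WithCookie(Uri site, string name, string value)
    {
        var jar = new CookieContainer();
        jar.Add(site, new Cookie(name, value, "/"));   // path "/" covers the whole site
        return jar;
    }
}
```

Whether a hand-built cookie is accepted depends on how the site validates its sessions, so this may well not be enough on its own.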
 
