Extract data using Curl Unix Command & Perl Script from Webpage

Fiaz Idris

I have used curl and Perl scripts to extract data from a sequence
of web pages before.

But in the following case I couldn't find a way to do it.

So if someone can show me a better way, or comment on my own
approach, it would be appreciated.

HOW I EXPECT IT TO BE DONE
--------------------------

The webpage is the following:

http://www.chennaionline.com/msuniversity/submit.asp?code=BA

and I have to extract the results for registration numbers 2225683 to
2225867.

You might want to try a single number, e.g. 2225683, to see
the results it returns.

I normally collect the page source for all of the registration
numbers in a single file, using something like

$results = qx{curl -s http://www.chennaionline.com/msuniversity/result.asp?RegistraitonNumber=$regno};

redirected to a file, and then use regular expressions to extract the
Registration No., Name, College and the marks and results of each
subject for each student.

WHAT I EXPECT FROM YOU
----------------------

I can't find the URL that returns the results for each Registration
Number, as the page seems to be using JavaScript or something.

How can I do it in this case?

If there is a completely different way to do it, please guide me.

I have used the same technique on some other pages and it works like a
wonder.
 
Bob Walton

Fiaz said:
I have used curl and perl script to extract data from sequence
of webpages before.

But, in the following case I couldn't find a way to do it.

So, if someone can guide me a better way or add any comments
on top of my own to do it would be appreciated.

HOW I EXPECT IT TO BE DONE
--------------------------

The webpage is the following:

http://www.chennaionline.com/msuniversity/submit.asp?code=BA

and I have to extract the Registration numbers from 2225683 to
2225867.

You might want to try out a single number for e.g. 2225683 to see
the results it returns.

I normally will group all the webpage source of each of the
registration numbers in a single file using something like

$results = qx{curl -s http://www.chennaionline.com/msuniversity/result.asp?RegistraitonNumber=$regno};

Accuracy counts--------------------------------------------------------------------^^

redirected to a file and then use regular expressions to extract the
Registration No., Name, College and the marks & results of each
subject for each student.

WHAT I EXPECT FROM YOU
----------------------

I can't find a correct way to locate the URL which will return the
results of each Registration Number as it seems to be using JavaScript
or something.

How can I do it in this case?


The HTML page generating the request indicates it is using the POST
method. Perhaps the CGI script which accepts the request checks to
verify that the POST method was used? In the case of the POST method,
the arguments are not supplied as part of the URL.
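
A minimal sketch of that difference, building one GET and one POST request without sending either. The field name "regno" is only a placeholder for illustration, not the real name used by the form.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET POST);
use URI;

my $url = 'http://www.chennaionline.com/msuniversity/result.asp';

# 'regno' is a placeholder field name, not the real one from the form.
my %form = ( regno => '2225683' );

# GET: the form data is appended to the URL as a query string.
my $get_uri = URI->new($url);
$get_uri->query_form(%form);
my $get = GET $get_uri;

# POST: the URL stays bare and the form data travels in the request body.
my $post = POST $url, [ %form ];

print "GET  URL:  ", $get->uri, "\n";
print "POST URL:  ", $post->uri, "\n";
print "POST body: ", $post->content, "\n";
```

Because nothing is sent, this is safe to run while working out what the request should look like.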

If there is a complete alternative to do it. Please guide me.


use LWP::UserAgent;

would be the Perlish way of doing it. See:

perldoc lwpcook

for a tutorial.

I have used the same technique in some other pages and it works like a
wonder.


Did their forms use the POST method?
 
gnari

[snip]

what we in turn can expect from you is that you do a modicum of preparation
work, like making sure that the url you claim does not work is actually the
correct one

a cursory look at the html shows that the input field is actually not
RegistraitonNumber, but rather Exam_Registration_Number.
in addition to that there is a hidden field Codeid set to 'BA',
and to be sure, maybe you should also include the button field,
btn_display=Results

try that, preferably with a POST.
if it still fails, try setting the Referer HTTP header
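
A rough sketch of such a request, using the three field names above. The Referer value (the URL of the page holding the form) is an assumption, and build_result_request is a made-up helper name. The request is only built and printed, not sent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

my $base = 'http://www.chennaionline.com/msuniversity';

# Build (but do not send) the POST request for one registration number.
# build_result_request is a hypothetical helper name.
sub build_result_request {
    my ($regno) = @_;
    return POST "$base/result.asp",
        Referer => "$base/submit.asp?code=BA",   # assumed: the form page
        Content => [
            Exam_Registration_Number => $regno,
            Codeid                   => 'BA',
            btn_display              => 'Results',
        ];
}

my $request = build_result_request('2225683');
print $request->as_string;

# To actually send it:
#   use LWP::UserAgent;
#   my $response = LWP::UserAgent->new->request($request);
```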

gnari
 
Tad McClellan

Fiaz Idris said:
I have used curl and perl script to extract data from sequence
of webpages before.


Why not just ditch curl and do it with Perl alone?

See this Perl FAQ:

How do I automate an HTML form submission?

But, in the following case I couldn't find a way to do it.

So, if someone can guide me a better way


I like to use the Web Scraping Proxy (wsp.pl) for developing
my many web-scraping programs:

http://www.research.att.com/~hpk/wsp/

It is a huge timesaver in reverse-engineering how to get to what you want.

The webpage is the following:

http://www.chennaionline.com/msuniversity/submit.asp?code=BA

and I have to extract the Registration numbers from 2225683 to
2225867.

redirected to a file and then use regular expressions to extract the


Using regular expressions to parse HTML can be a bad idea.

Especially since the data you want is in a table.

Use the HTML::TableExtract module instead of fragile regexes.
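
As a small illustration of the idea, here is a sketch that picks a table by its column headers. The HTML is made-up sample markup, not the real results page, so treat the values as placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample markup; the real results page will differ.
my $html = <<'HTML';
<table>
<tr><th>Subject Code</th><th>Marks</th><th>Result</th></tr>
<tr><td>S101</td><td>62</td><td>PASS</td></tr>
<tr><td>S102</td><td>35</td><td>FAIL</td></tr>
</table>
HTML

# Select the table by its column headers instead of a regex over the markup.
my $te = HTML::TableExtract->new(
    headers => [ 'Subject Code', 'Marks', 'Result' ],
);
$te->parse($html);

my @grades;
foreach my $table ( $te->tables ) {
    foreach my $row ( $table->rows ) {
        push @grades,
            { subject => $row->[0], marks => $row->[1], result => $row->[2] };
    }
}

printf "%s: %s (%s)\n", @{$_}{qw/subject marks result/} for @grades;
```

The header row itself is used only for matching, so the loop sees just the data rows.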

I can't find a correct way to locate the URL which will return the
results


See where it says

<form name="examresult" action="result.asp" method="post">

??

You take the "submit.asp..." stuff off of the URL that you got
the <form> page from, and put "result.asp" in its place.

http://www.chennaionline.com/msuniversity/result.asp
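
The same substitution can be left to the URI module, which resolves a relative form action against the page it came from. A short sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# The page the <form> was found on, and the form's action attribute.
my $form_page = 'http://www.chennaionline.com/msuniversity/submit.asp?code=BA';
my $action    = 'result.asp';

# Resolve the relative action against the form page's URL.
my $target = URI->new_abs( $action, $form_page );

print "$target\n";    # http://www.chennaionline.com/msuniversity/result.asp
```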

How can I do it in this case?


Let wsp.pl write a request for you (you'll probably need to edit it a bit),
and use the LWP::UserAgent module to submit the request.

If there is a complete alternative to do it. Please guide me.


Here you go:

----------------------------------------
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TableExtract;
use Data::Dumper;

my($num, $name, $college, @lines) = get_grades( '2225684' );

print "num: $num\n";
print "name: $name\n";
print "college: $college\n";

print Dumper \@lines;



sub get_grades {
    my($id) = @_;

    my $request = POST "http://www.chennaionline.com/msuniversity/result.asp",
        [
            'Codeid' => "BA",
            'Exam_Registration_Number' => $id,
        ] ;

    my $agent = LWP::UserAgent->new();
    my $response = $agent->request( $request );
    return() unless $response->is_success;
    my $content = $response->content();


    ### Registration No., Name, College (by table position)
    my $te = HTML::TableExtract->new( count => 2, depth => 1 );
    $te->parse($content);

    my($table) = $te->tables();
    my @rows = $te->rows($table);

    my $regnum = $rows[0][1];
    my $name = $rows[1][1];
    my $college = $rows[2][1];


    ### grades (by table headers)
    $content =~ s/Subject\s*Code/Subject Code/; # patch silly web page
    $te = HTML::TableExtract->new( headers => ['Subject Code',
                                               'Marks',
                                               'Result'
                                              ]
                                 );
    $te->parse($content);

    @rows = (); # re-used from above
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            next if $row->[0] =~ /CONTROLLER OF EXAMINATION/;
            my %course;
            @course{ qw/subject marks result/ } = @$row; # a "hash slice"
            push @rows, \%course;
        }
    }

    return $regnum, $name, $college, @rows;
}
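
For anyone unfamiliar with the hash slice used in get_grades, a standalone sketch with a made-up row (the values are placeholders, not real results):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up row, in the same column order as the headers in get_grades.
my $row = [ 'S101', '62', 'PASS' ];

my %course;
@course{ qw/subject marks result/ } = @$row;   # assigns three keys at once

# The slice above is shorthand for:
#   $course{subject} = $row->[0];
#   $course{marks}   = $row->[1];
#   $course{result}  = $row->[2];

print "$_ => $course{$_}\n" for sort keys %course;
```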
 
Fiaz Idris

a cursory look at the html show that the input field is actually not
RegistraitonNumber , but rather Exam_Registration_Number
in addition to that there is a hidden field Codeid set to 'BA'.
and to be sure, maybe you should also include the button field,
btn_display=Results

try that, preferably with a POST
if it still fails try to set the Referer HTTP header

gnari

I have tried various combinations of the following URL-encoded
query.

(1)
http://www.chennaionline.com/msuniv...gistration_Number=2225765&btn_display=Results

You may try it on this page:
"http://www.chennaionline.com/msuniversity/result.asp"

For example, I have been successful in getting the arrival
flights of the airport from this page.

(2)
http://www.hongkongairport.com/eng/...ion=All&SearchAirline=All&SearchFrom=2004-4-8

So could someone please guide me and show me the URL I should use to
get the results returned for (1) above? Thanks.
 
Fiaz Idris

I happened to solve my original problem by using the following
Perl script. There are two problems with this script:

1) After about 90-100 iterations, the loop stops progressing
and just waits. So I have to Ctrl+C the script, pick a new
starting count and start again. And the same happens
again and again...

2) Occasionally the behaviour is unpredictable.

Could someone tell me what I should change in the script, or give
any other valuable advice? Thanks.

I am using Cygwin on a Windows machine with Perl 5.8.2.

Script
-------

#!/usr/bin/perl
use strict;
use warnings;

use HTML::TableExtract;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

for (my $regno = 2225700; $regno <= 2230000; $regno = $regno + 50) {

    sleep 5;
    print STDERR "$regno\n";
    print "\n";
    my $response = $browser->post(
        'http://www.chennaionline.com/msuniversity/result.asp',
        [
            'Codeid' => 'BA',
            'Exam_Registration_Number' => $regno
        ],
    );

    # Use the public accessor rather than reading $response->{_content}.
    my $curcontent = $response->content;

    my $all_te  = HTML::TableExtract->new( depth => 1, count => 2 );
    my $all_tem = HTML::TableExtract->new( depth => 1, count => 3 );

    #$all_te->parse_file("flt.txt");
    $all_te->parse($curcontent);
    $all_tem->parse($curcontent);

    # Registration No., Name, College
    foreach my $ts ($all_te->table_states) {
        foreach my $row ($ts->rows) {
            for (my $i = 0; $i < @$row; $i++) {
                my $temprow = $row->[$i];
                #print "***<$temprow>***\n";
                $temprow =~ s/^[\s\W\n]+(.*)\s+$/$1/g;
                #$temprow =~ s/$unknownchar//g;

                if ($temprow =~ /Registration/) { next; }
                if ($temprow =~ /Name/) { next; }
                if ($temprow =~ /College/) { next; }

                print "$temprow, ";
            }
            #print "\n"
        }
    }

    # Subject, Marks, Result
    foreach my $ts ($all_tem->table_states) {
        foreach my $row ($ts->rows) {
            for (my $i = 0; $i < @$row; $i++) {
                my $temprow = $row->[$i];
                #print "***<$temprow>***\n";
                $temprow =~ s/^[\s\W\n]+(.*)\s+$/$1/g;
                #$temprow =~ s/$unknownchar//g;

                if ($temprow =~ /Subject/) { next; }
                if ($temprow =~ /Marks/) { next; }
                if ($temprow =~ /Result/) { next; }
                if ($temprow =~ /CONTROLLER/) { next; }

                print "$temprow, ";
            }
            #print "\n";
        }
    }
}

__END__
 
ifiaz

Thanks Tad,

I tried using your script for the latest results, and it works like a
wonder:

I used a for loop like this on the main part of the script.

### For Loop for the script below
my $regno;
for ($regno=2225683; $regno<=2226000; $regno=$regno+1) {
my($num, $name, $college, @lines) = get_grades( $regno );
### For Loop for the script above


### Script change as follows in get_grades function for the latest results ###
my $request = POST "http://www.chennaionline.com/msuniversity/result1.asp",
[
'Codeid' => "BA1TO4",
'Exam_Registration_Number' => $id,
] ;
### Change the above in your code ###


But both your version of the script and my version stop after
processing approx. the 90th student number, although the
loop extends beyond that.

Could you or someone explain why, and how I can correct this?

I know it has been a long time.


 
Tad McClellan

Thanks Tad,


You are welcome, you can show your gratitude by composing followups properly:

Please do not top-post.

Please do not full-quote.

Please do not quote .sigs.

I tried using your script for the latest results, and it works like a
wonder:


That is how ALL of _my_ code works!

heh.


[snip code fragments]

But, both your version of the script and my version stops after
processing approx. the 90th student number unconditionally although the
loop extends beyond that.


It gets all 318 of them when I try it.

Could you or someone explain why?


Nope, since I cannot duplicate the problem.

(but I do see that 64 of the regno's return no results,
invalid registration numbers I assume...
)



[snip 150 lines of TOFU]
 
ifiaz

Thanks Tad,

You are welcome, you can show your gratitude by composing followups properly:

Please do not top-post.

What does this mean?

Please do not full-quote.

Does this mean I should delete unnecessary parts when I reply?

Please do not quote .sigs.

What does this mean?

Could you explain a bit more clearly, as I do not get your meaning? I will
follow accordingly, as I am relatively new to newsgroups.

I tried using your script for the latest results, and it works like a
wonder:


That is how ALL of _my_ code works!

heh.


[snip code fragments]

But, both your version of the script and my version stops after
processing approx. the 90th student number unconditionally although the
loop extends beyond that.


It gets all 318 of them when I try it.

Is it without any change in the code?
Nope, since I cannot duplicate the problem.

(but I do see that 64 of the regno's return no results,
invalid registration numbers I assume...
)

I assure you that it is not because some regnos return no results.

Even so, after the 90th student number the program hangs indefinitely
and I have to press Ctrl+C to break out.

I am using Perl 5.8.5, Windows 98 SE, Cygwin. Any comment on this is
appreciated.
 
Tad McClellan

ifiaz said:
What does this mean?

http://www.catb.org/~esr/jargon/html/T/top-post.html


Does this mean I should delete unnecessary parts when I reply?


Exactly right.

What does this mean?


A ".sig" is the "signature" at the end of a post, after the
line with 2 hyphens and a space char on it.

You should snip those when replying, unless the .sig itself
is what you are commenting on.

I am relatively new to newsgroups.


Please see the Posting Guidelines for this newsgroup, and follow
the links it contains:

http://mail.augustmail.com/~tadmc/clpmisc.shtml
 
ifiaz

I tried using your script for the latest results, and it works like a
wonder:


That is how ALL of _my_ code works!

heh.


[snip code fragments]

But, both your version of the script and my version stops after
processing approx. the 90th student number unconditionally although the
loop extends beyond that.


It gets all 318 of them when I try it.

Could you or someone explain why?


Nope, since I cannot duplicate the problem.

(but I do see that 64 of the regno's return no results,
invalid registration numbers I assume...
)

I could see this too.

Did you make any code changes to your script?

Does it have anything to do with network overloading, etc.?

I am using Perl 5.8.5, Windows 98 SE, Cygwin.

Any pointers are much appreciated. Thanks.
 
Tad McClellan

[ Please provide a proper attribution when you quote someone. ]

Did you make any code changes on your script?


Yes, the ones you described.

Is it to do anything with network overloading, etc. etc.?


Could be that, we can't see your network, so we cannot help with that.

It could be that the website is throttling you too.

Or it might be something else, since there may be tiny differences
in the code we are running since there have been a few edits
on each end since then...
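
Nobody in this thread has established why the loop stalls, so the following is only a sketch of one way to make a stalled request fail visibly instead of waiting forever: give the agent a timeout and retry a few times. fetch_with_retry is a made-up helper, and the 30-second timeout, retry count and pause are arbitrary assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

# fetch_with_retry() is a hypothetical helper, not part of LWP.
# It returns the response on success, or nothing after $max_tries failures.
sub fetch_with_retry {
    my ($agent, $request, $max_tries, $pause) = @_;
    foreach my $try (1 .. $max_tries) {
        my $response = $agent->request($request);
        return $response if $response->is_success;

        # A timed-out request shows up here as a failed response.
        warn sprintf "try %d of %d failed: %s\n",
            $try, $max_tries, $response->status_line;
        sleep $pause if $pause && $try < $max_tries;
    }
    return;
}

# Give up on a silent server after 30 seconds (an arbitrary value).
my $agent = LWP::UserAgent->new( timeout => 30 );

# Build the request for one registration number without sending it.
my $request = POST 'http://www.chennaionline.com/msuniversity/result.asp',
    [ Codeid => 'BA', Exam_Registration_Number => '2225684' ];

# Inside the loop one would then write something like:
#   my $response = fetch_with_retry($agent, $request, 3, 10)
#       or next;    # log and move on to the next number
```

If the warnings show repeated timeouts around the 90th request, that would point at the network or the server rather than at the extraction code.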
 
ifiaz

It gets all 318 of them when I try it.
Yes, the ones you described.




Could be that, we can't see your network, so we cannot help with that.

It could be that the website is throttling you too.

Or it might be something else, since there may be tiny differences
in the code we are running since there have been a few edits
on each end since then...

This simple code downloads the same URL 150 times without any
breaks.

CODE FOLLOWS:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $regno;

for ($regno=1; $regno<=150; $regno=$regno+1) {

    my $content =
        get("http://www.chennaionline.com/msuniversity/submit1.asp?code=BA1TO4");

    die "Couldn't get it!" unless defined $content;

    print "$content\n";

}

CODE ENDS:

But the earlier results extraction code, and only that code, stalls
after the 90th student.

I don't think the server is trying to cut me off due to network
overload.

Maybe it has to do with how the extraction code is written.

Please bear with me and show me how I can accomplish what I wanted
earlier.
 
