Extract data using Curl Unix Command & Perl Script from Webpage

Discussion in 'Perl Misc' started by Fiaz Idris, Mar 7, 2004.

  1. Fiaz Idris

    Fiaz Idris Guest

    I have used curl and perl script to extract data from sequence
    of webpages before.

    But, in the following case I couldn't find a way to do it.

    So, if someone can guide me to a better way, or add any comments
    on top of my own, it would be appreciated.


    The webpage is the following:


    and I have to extract the Registration numbers from 2225683 to

    You might want to try out a single number for e.g. 2225683 to see
    the results it returns.

    I normally will group all the webpage source of each of the
    numbers in a single file using something like

    $results = qx{curl -s

    redirected to a file and then use regular expressions to extract the
    Registration No., Name, College and the marks & results of each
    for each student.


    I can't find a correct way to locate the URL which will return the
    of each Registration Number as it seems to be using JavaScript or

    How can I do it in this case?

    If there is a complete alternative to do it. Please guide me.

    I have used the same technique in some other pages and it works like a
    Fiaz Idris, Mar 7, 2004

  2. Fiaz Idris

    Bob Walton Guest

    Accuracy counts------------------------------------------------^^

    The HTML page generating the request indicates it is using the POST
    method. Perhaps the CGI script which accepts the request checks to
    verify that the POST method was used? In the case of the POST method,
    the arguments are not supplied as part of the URL.

    use LWP::UserAgent;

    would be the Perlish way of doing it. See:

    perldoc lwpcook

    for a tutorial.
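
    To illustrate the point about POST arguments, here is a minimal sketch
    that builds a request without sending it. The URL is a placeholder and
    the field names are the ones that come up later in this thread; treat
    both as assumptions rather than the site's confirmed interface.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder URL; the field names are taken from later posts in this thread.
my $request = POST 'http://www.example.com/result.asp',
    [ Codeid => 'BA', Exam_Registration_Number => '2225683' ];

print $request->method,  "\n";    # the HTTP method of the request
print $request->uri,     "\n";    # the URL carries no query string
print $request->content, "\n";    # the form fields travel in the body
```

    Printing the pieces separately makes it easy to see that nothing from
    the form ends up in the URL.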

    Did their forms use the POST method?
    Bob Walton, Mar 7, 2004

  3. Fiaz Idris

    gnari Guest

    what we in turn can expect from you, is that you do a modicum of preparation
    work, like making sure the url you claim does not work, is actually the
    correct one

    a cursory look at the html shows that the input field is actually not
    RegistraitonNumber , but rather Exam_Registration_Number
    in addition to that there is a hidden field Codeid set to 'BA'.
    and to be sure, maybe you should also include the button field,

    try that, preferably with a POST
    if it still fails try to set the Referer HTTP header
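
    A rough sketch of such a request, built but not sent. The URLs are
    placeholders, and the button field's name and value are guesses, since
    they are not quoted here; only Codeid and Exam_Registration_Number come
    from the page itself.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder URLs; the 'Submit' button field is a hypothetical name/value.
my $request = POST 'http://www.example.com/result.asp',
    Referer => 'http://www.example.com/submit.asp',
    Content => [
        Codeid                   => 'BA',
        Exam_Registration_Number => '2225683',
        Submit                   => 'Submit',
    ];

# Dump the headers and body that would be sent.
print $request->as_string;
```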

    gnari, Mar 7, 2004

  4. Why not just ditch curl and do it with Perl alone?

    See this Perl FAQ:

    How do I automate an HTML form submission?

    I like to use the Web Scraping Proxy (wsp.pl) for developing
    my many web-scraping programs:


    It is a huge timesaver in reverse-engineering how to get to what you want.

    Using regular expressions to parse HTML can be a bad idea.

    Especially since the data you want is in a table.

    Use the HTML::TableExtract module instead of fragile regexes.
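
    As a small illustration of the module, the sketch below pulls rows out
    of a table selected by its column headers. The HTML is made up for the
    example and is not the real page's markup.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up HTML standing in for a results page.
my $html = <<'HTML';
<table>
  <tr><td>Subject Code</td><td>Marks</td><td>Result</td></tr>
  <tr><td>ENG101</td><td>62</td><td>PASS</td></tr>
  <tr><td>HIS102</td><td>35</td><td>FAIL</td></tr>
</table>
HTML

# Select the table by its headers rather than by its position in the page.
my $te = HTML::TableExtract->new( headers => [ 'Subject Code', 'Marks', 'Result' ] );
$te->parse($html);

my @courses;
foreach my $ts ( $te->tables ) {
    foreach my $row ( $ts->rows ) {
        my %course;
        @course{qw(subject marks result)} = @$row;    # a hash slice
        push @courses, \%course;
    }
}

printf "%s: %s (%s)\n", @{$_}{qw(subject marks result)} for @courses;
```

    Matching on headers keeps working if the page gains or loses a layout
    table, which is where position-based regexes tend to break.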

    See where it says

    <form name="examresult" action="result.asp" method="post">


    You take the "submit.asp..." stuff off of the URL that you got
    the <form> page from, and put "result.asp" in its place.
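
    The same substitution can be done with the URI module instead of by
    hand. The form page address below is a placeholder, not the real one.

```perl
use strict;
use warnings;
use URI;

# Placeholder address of the page that contains the <form>.
my $form_page = 'http://www.example.com/results/submit.asp?session=1';

# The form's action attribute, as quoted above.
my $action = 'result.asp';

# Resolve the relative action against the page it came from.
my $target = URI->new_abs( $action, $form_page );
print "$target\n";    # http://www.example.com/results/result.asp
```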


    Let wsp.pl write a request for you (you'll probably need to edit it a bit),
    and use the LWP::UserAgent module to submit the request.

    Here you go:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request::Common;
    use HTML::TableExtract;
    use Data::Dumper;

    my($num, $name, $college, @lines) = get_grades( '2225684' );

    print "num: $num\n";
    print "name: $name\n";
    print "college: $college\n";

    print Dumper \@lines;

    sub get_grades {
    my($id) = @_;

    my $request = POST "http://www.chennaionline.com/msuniversity/result.asp",
    [
    'Codeid' => "BA",
    'Exam_Registration_Number' => $id,
    ] ;

    my $agent = new LWP::UserAgent();
    my $response = $agent->request( $request );
    return() unless $response->is_success;
    my $content = $response->content();

    ### Registration No., Name, College (by table position)
    my $te = new HTML::TableExtract( count => 2, depth => 1 );
    $te->parse($content);
    my($table) = $te->tables();
    my @rows = $te->rows($table);

    my $regnum = $rows[0][1];
    my $name = $rows[1][1];
    my $college = $rows[2][1];

    ### grades (by table headers)
    $content =~ s/Subject\s*Code/Subject Code/; # patch silly web page
    $te = new HTML::TableExtract( headers => ['Subject Code',

    @rows = (); # re-used from above
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            next if $row->[0] =~ /CONTROLLER OF EXAMINATION/;
            my %course;
            @course{ qw/subject marks result/ } = @$row; # a "hash slice"
            push @rows, \%course;
        }
    }

    return $regnum, $name, $college, @rows;
    }
    Tad McClellan, Mar 7, 2004
  5. Fiaz Idris

    Fiaz Idris Guest

    a cursory look at the html show that the input field is actually not
    I have tried various different combinations of the following URL
    encoded query.


    You may try on this page

    I have been successful for example on this page in getting the arrival
    flights of airport.


    So, could someone please guide me and show what is the expected URL to
    get the results returned for (1) above. Thanks.
    Fiaz Idris, Apr 7, 2004
  6. Fiaz Idris

    Fiaz Idris Guest

    Fiaz Idris, Apr 7, 2004
  7. Fiaz Idris

    Fiaz Idris Guest

    I happened to solve my original problem by using the following
    Perl script. There are two problems with this script:

    1) After about 90-100 times inside the loop, the loop doesn't
    progress anymore but just waits. So I have to Ctrl+C the script
    and use a new starting count and start again. And the same happens
    again and again...

    2) Occasionally the behaviour is uncertain.

    Could someone guide me where I should change in the script or give
    any other valuable advice. Thanks.

    I am using cygwin on a windows machine with perl 5.8.2


    #!/usr/bin/perl -w

    use LWP::Simple;
    use HTML::TableExtract;
    use LWP::UserAgent;

    my $browser = LWP::UserAgent->new;

    for ($regno=2225700; $regno<=2230000; $regno=$regno+50) {

    sleep 5;
    print STDERR "$regno\n";
    print "\n";
    my $response = $browser->post(
    'Codeid' => 'BA',
    'Exam_Registration_Number' => $regno

    $curcontent = $response->content; # public accessor instead of the private {_content} slot

    my $all_te = new HTML::TableExtract( depth=>1, count=> 2 );
    my $all_tem = new HTML::TableExtract( depth=>1, count=> 3);


    foreach $ts ($all_te->table_states) {
        foreach $row ($ts->rows) {
            for ($i=0; $i<@$row; $i++) {
                my $temprow = $row->[$i];
                #print "***<$temprow>***\n";
                $temprow =~ s/^[\s\W\n]+(.*)\s+$/$1/g;
                #$temprow =~ s/$unknownchar//g;

                if ($temprow =~ /Registration/) { next; }
                if ($temprow =~ /Name/) { next; }
                if ($temprow =~ /College/) { next; }

                print "$temprow, ";
                #print "\n"
            }
        }
    }

    foreach $ts ($all_tem->table_states) {
        foreach $row ($ts->rows) {
            for ($i=0; $i<@$row; $i++) {
                my $temprow = $row->[$i];
                #print "***<$temprow>***\n";
                $temprow =~ s/^[\s\W\n]+(.*)\s+$/$1/g;
                #$temprow =~ s/$unknownchar//g;

                if ($temprow =~ /Subject/) { next; }
                if ($temprow =~ /Marks/) { next; }
                if ($temprow =~ /Result/) { next; }
                if ($temprow =~ /CONTROLLER/) { next; }

                print "$temprow, ";
                #print "\n";
            }
        }
    }
    }

    Fiaz Idris, Apr 21, 2004
  8. Fiaz Idris

    ifiaz Guest

    Thanks Tad,

    I tried using your script for the latest results, and it works like a

    I used a for loop like this on the main part of the script.

    ### For Loop for the script below
    my $regno;
    for ($regno=2225683; $regno<=2226000; $regno=$regno+1) {
    my($num, $name, $college, @lines) = get_grades( $regno );
    ### For Loop for the script above

    ### Script change as follows in get_grades function for the latest
    my $request = POST
    'Codeid' => "BA1TO4",
    'Exam_Registration_Number' => $id,
    ] ;
    ### Change the above in your code ###

    But both your version of the script and mine stop after processing
    approximately the 90th student number, although the loop extends
    beyond that.

    Could you or someone explain why? and how I can correct this?

    I know it has been a long time.

    ifiaz, Mar 15, 2005

  9. You are welcome, you can show your gratitude by composing followups properly:

    Please do not top-post.

    Please do not full-quote.

    Please do not quote .sigs.

    That is how ALL of _my_ code works!


    [snip code fragments]

    It gets all 318 of them when I try it.

    Nope, since I cannot duplicate the problem.

    (but I do see that 64 of the regno's return no results,
    invalid registration numbers I assume...)

    [snip 150 lines of TOFU]
    Tad McClellan, Mar 15, 2005
  10. Fiaz Idris

    ifiaz Guest

    Thanks Tad,
    What does this mean?
    Does this mean I should delete unnecessary parts when I reply?
    What does this mean?

    Could you explain a bit clearer as I do not get your meaning. I will
    follow accordingly as I am relatively new to newsgroups.
    Is it without any change in the code?
    I assure you that it is not because of no results for some regnos.

    But, yet after the 90th student number, the program stops indefinitely
    and I have to click ctrl+c to break.

    I am using Perl 5.8.5, Windows 98 SE, Cygwin. Any comment on this is
    ifiaz, Mar 16, 2005

  11. Exactly right.

    A ".sig" is the "signature" at the end of a post, after the
    line with 2 hyphens and a space char on it.

    You should snip those when replying, unless the .sig itself
    is what you are commenting on.

    Please see the Posting Guidelines for this newsgroup, and follow
    the links it contains:

    Tad McClellan, Mar 16, 2005
  12. Fiaz Idris

    ifiaz Guest

    I tried using your script for the latest results, and it works like
    I could see this too.

    Did you make any code changes on your script?

    Is it to do anything with network overloading, etc. etc.?

    I am using perl 5.8.5, Windows 98 SE, cygwin

    Any pointers are much appreciated. Thanks.
    ifiaz, Mar 17, 2005
  13. [ Please provide a proper attribution when you quote someone. ]

    Yes, the ones you described.

    Could be that, we can't see your network, so we cannot help with that.

    It could be that the website is throttling you too.

    Or it might be something else, since there may be tiny differences
    in the code we are running since there have been a few edits
    on each end since then...
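
    One way to tell a stalled connection from a refused or failed request
    is to give the user agent a short timeout and print the status line on
    failure. This is only a sketch: fetch_result is a hypothetical helper,
    the 20 second value is arbitrary, and the URL is whatever you pass in.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# 20 seconds is an arbitrary choice; LWP's default timeout is 180 seconds.
my $ua = LWP::UserAgent->new( timeout => 20 );

# Hypothetical helper: returns the page body, or nothing after
# reporting why the request failed.
sub fetch_result {
    my ( $url, $regno ) = @_;
    my $response = $ua->post( $url,
        [ Codeid => 'BA', Exam_Registration_Number => $regno ] );
    unless ( $response->is_success ) {
        warn "$regno: ", $response->status_line, "\n";
        return;
    }
    return $response->content;
}
```

    Called from inside the loop, this would let the script report the
    failure and move on to the next number rather than wait indefinitely.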
    Tad McClellan, Mar 17, 2005
  14. Fiaz Idris

    ifiaz Guest

    It gets all 318 of them when I try it.
    This simple code downloads the same URL content 150 times without
    any breaks.


    use strict;
    use warnings;
    use LWP::Simple;

    my $regno;

    for ($regno=1; $regno<=150; $regno=$regno+1) {

    my $content =

    die "Couldn't get it!" unless defined $content;

    print "$content\n";
    }



    But, only with the earlier results extraction code it breaks after the
    90th student.

    I don't think any of the server is trying to cut you off due to network

    Maybe it has to do with how the extraction code is written.

    Please bear with me and show me how I can accomplish what I wanted
    ifiaz, Mar 17, 2005