Extract data using Curl Unix Command & Perl Script from Webpage

Discussion in 'Perl Misc' started by Fiaz Idris, Mar 7, 2004.

  1. Fiaz Idris

    Fiaz Idris Guest

    I have used curl and a Perl script to extract data from a sequence
    of webpages before.

    But in the following case I couldn't find a way to do it.

    So if someone could guide me to a better way, or add comments
    on top of my own approach, it would be appreciated.

    HOW I EXPECT IT TO BE DONE
    --------------------------

    The webpage is the following:

    http://www.chennaionline.com/msuniversity/submit.asp?code=BA

    and I have to extract the results for Registration numbers 2225683 to
    2225867.

    You might want to try out a single number, e.g. 2225683, to see
    the results it returns.

    I normally group the webpage source for all of the registration
    numbers in a single file using something like

    $results = qx{curl -s http://www.chennaionline.com/msuniversity/result.asp?RegistraitonNumber=$regno};

    redirected to a file, and then use regular expressions to extract the
    Registration No., Name, College and the marks & results of each
    subject for each student.

    WHAT I EXPECT FROM YOU
    ----------------------

    I can't find the correct URL which will return the results
    for each Registration Number, as it seems to be using JavaScript or
    something.

    How can I do it in this case?

    If there is a completely different way to do it, please guide me.

    I have used the same technique on some other pages and it works like a
    wonder.
    Fiaz Idris, Mar 7, 2004
    #1

  2. Fiaz Idris

    Bob Walton Guest

    Fiaz Idris wrote:

    > I have used curl and perl script to extract data from sequence
    > of webpages before.
    >
    > But, in the following case I couldn't find a way to do it.
    >
    > So, if someone can guide me a better way or add any comments
    > on top of my own to do it would be appreciated.
    >
    > HOW I EXPECT IT TO BE DONE
    > --------------------------
    >
    > The webpage is the following:
    >
    > http://www.chennaionline.com/msuniversity/submit.asp?code=BA
    >
    > and I have to extract the Registration numbers from 2225683 to
    > 2225867.
    >
    > You might want to try out a single number for e.g. 2225683 to see
    > the results it returns.
    >
    > I normally will group all the webpage source of each of the
    > registration
    > numbers in a single file using something like
    >
    > $results = qx{curl -s
    > http://www.chennaionline.com/msuniversity/result.asp?RegistraitonNumber=$regno};


    Accuracy counts------------------------------------------------^^


    >
    > redirected to a file and then use regular expressions to extract the
    > Registration No., Name, College and the marks & results of each
    > subject
    > for each student.
    >
    > WHAT I EXPECT FROM YOU
    > ----------------------
    >
    > I can't find a correct way to locate the URL which will return the
    > results
    > of each Registration Number as it seems to be using JavaScript or
    > something.
    >
    > How can I do it in this case?



    The HTML page generating the request indicates it is using the POST
    method. Perhaps the CGI script which accepts the request checks to
    verify that the POST method was used? In the case of the POST method,
    the arguments are not supplied as part of the URL.
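
    The difference between the two methods can be seen without touching
    the network. The following is a minimal sketch, assuming the
    HTTP::Request::Common and URI modules are installed; the field name
    Exam_Registration_Number is the one identified later in this thread,
    and the registration number is just a sample value.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;
use HTTP::Request::Common qw(GET POST);

# Field name as reported later in the thread; the value is only a sample.
my $url  = 'http://www.chennaionline.com/msuniversity/result.asp';
my @form = ( Exam_Registration_Number => 2225683 );

# GET: the arguments become part of the URL itself.
my $uri = URI->new($url);
$uri->query_form(@form);
my $get = GET $uri;

# POST: the URL stays bare and the arguments travel in the request body.
my $post = POST $url, \@form;

# Nothing is sent; we only inspect the request objects.
print "GET  url : ", $get->uri,      "\n";
print "POST url : ", $post->uri,     "\n";
print "POST body: ", $post->content, "\n";
```

    The POST request keeps the bare result.asp URL and carries the form
    data in its body, which is why a server that insists on POST may
    ignore arguments appended to the URL.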


    >
    > If there is a complete alternative to do it. Please guide me.



    use LWP::UserAgent;

    would be the Perlish way of doing it. See:

    perldoc lwpcook

    for a tutorial.


    >
    > I have used the same technique in some other pages and it works like a
    > wonder.



    Did their forms use the POST method?


    --
    Bob Walton
    Email: http://bwalton.com/cgi-bin/emailbob.pl
    Bob Walton, Mar 7, 2004
    #2

  3. Fiaz Idris

    gnari Guest

    "Fiaz Idris" <> wrote in message
    news:...
    [snip]
    > $results = qx{curl -s
    > http://www.chennaionline.com/msuniversity/result.asp?RegistraitonNumber=$regno};
    >
    > ...
    >
    > WHAT I EXPECT FROM YOU


    what we in turn can expect from you is that you do a modicum of preparation
    work, like making sure that the url you claim does not work is actually the
    correct one

    a cursory look at the html shows that the input field is actually not
    RegistraitonNumber, but rather Exam_Registration_Number.
    in addition to that, there is a hidden field Codeid set to 'BA'.
    and to be sure, maybe you should also include the button field,
    btn_display=Results

    try that, preferably with a POST.
    if it still fails, try to set the Referer HTTP header
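
    For anyone who wants to stay with curl, the advice above might
    translate into something like the following sketch. It only builds
    and prints the command rather than running it. The curl options used
    are -s (silent), -d (send a form field, which makes curl use POST)
    and -e (set the Referer header); the Referer value, the form page's
    own URL, is an assumption, and curl_post_args is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for a curl POST. Field names are the ones read
# off the form above; the Referer value (the form page) is an assumption.
sub curl_post_args {
    my ($regno) = @_;
    return (
        'curl', '-s',
        '-e', 'http://www.chennaionline.com/msuniversity/submit.asp?code=BA',
        '-d', 'Codeid=BA',
        '-d', "Exam_Registration_Number=$regno",
        '-d', 'btn_display=Results',
        'http://www.chennaionline.com/msuniversity/result.asp',
    );
}

my @cmd = curl_post_args(2225683);
print join(' ', @cmd), "\n";    # show the command instead of running it
```

    To actually run it, passing @cmd as a list (for example to system,
    or to a list-form pipe open where the platform supports it) avoids
    having the shell interpret the ? and & characters in the URLs.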

    gnari
    gnari, Mar 7, 2004
    #3
  4. Fiaz Idris <> wrote:

    > I have used curl and perl script to extract data from sequence
    > of webpages before.



    Why not just ditch curl and do it with Perl alone?

    See this Perl FAQ:

    How do I automate an HTML form submission?


    > But, in the following case I couldn't find a way to do it.
    >
    > So, if someone can guide me a better way



    I like to use the Web Scraping Proxy (wsp.pl) for developing
    my many web-scraping programs:

    http://www.research.att.com/~hpk/wsp/

    It is a huge timesaver in reverse-engineering how to get to what you want.


    > The webpage is the following:
    >
    > http://www.chennaionline.com/msuniversity/submit.asp?code=BA
    >
    > and I have to extract the Registration numbers from 2225683 to
    > 2225867.



    > redirected to a file and then use regular expressions to extract the



    Using regular expressions to parse HTML can be a bad idea.

    Especially since the data you want is in a table.

    Use the HTML::TableExtract module instead of fragile regexes.
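
    As a rough illustration of header-based extraction, here is a small
    sketch that runs against an inline HTML string. The sample table and
    its values are made up for the example; only the three column
    headings are taken from the real page. It assumes a 2.x version of
    HTML::TableExtract, where tables() returns table objects.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample markup; the real result page is not reproduced here.
my $html = join '',
    '<table>',
    '<tr><th>Subject Code</th><th>Marks</th><th>Result</th></tr>',
    '<tr><td>ENG101</td><td>62</td><td>PASS</td></tr>',
    '<tr><td>HIS102</td><td>35</td><td>FAIL</td></tr>',
    '</table>';

# Select the table by its column headings rather than by a pattern
# that depends on the exact markup.
my $te = HTML::TableExtract->new(
    headers => [ 'Subject Code', 'Marks', 'Result' ],
);
$te->parse($html);

my @records;
foreach my $table ( $te->tables ) {
    foreach my $row ( $table->rows ) {
        my %rec;
        @rec{qw(subject marks result)} = @$row;    # hash slice
        push @records, \%rec;
    }
}

printf "%s: %s (%s)\n", @{$_}{qw(subject marks result)} for @records;
```

    Because the table is located by its headings, changes to attributes,
    whitespace or nesting elsewhere in the page are less likely to break
    the extraction than they would with a hand-written regex.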


    > I can't find a correct way to locate the URL which will return the
    > results



    See where it says

    <form name="examresult" action="result.asp" method="post">

    ??

    You take the "submit.asp..." stuff off of the URL that you got
    the <form> page from, and put "result.asp" in its place.

    http://www.chennaionline.com/msuniversity/result.asp
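
    The same substitution can be done mechanically. A minimal sketch,
    assuming the URI module is installed: resolve the form's relative
    action against the URL of the page that contained the form.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# URL of the page that holds the <form>, and the form's action attribute.
my $page   = 'http://www.chennaionline.com/msuniversity/submit.asp?code=BA';
my $action = 'result.asp';

# Resolve the relative action the way a browser would.
my $target = URI->new_abs( $action, $page );

print "$target\n";
```

    This also copes with actions such as "../other/result.asp" or
    "/result.asp", where editing the URL by hand is easier to get wrong.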


    > How can I do it in this case?



    Let wsp.pl write a request for you (you'll probably need to edit it a bit),
    and use the LWP::UserAgent module to submit the request.


    > If there is a complete alternative to do it. Please guide me.



    Here you go:

    ----------------------------------------
    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request::Common;
    use HTML::TableExtract;
    use Data::Dumper;

    my($num, $name, $college, @lines) = get_grades( '2225684' );

    print "num: $num\n";
    print "name: $name\n";
    print "college: $college\n";

    print Dumper \@lines;



    sub get_grades {
        my($id) = @_;

        my $request = POST "http://www.chennaionline.com/msuniversity/result.asp",
            [
                'Codeid' => "BA",
                'Exam_Registration_Number' => $id,
            ] ;

        my $agent = LWP::UserAgent->new();
        my $response = $agent->request( $request );
        return() unless $response->is_success;
        my $content = $response->content();


        ### Registration No., Name, College (by table position)
        my $te = HTML::TableExtract->new( count => 2, depth => 1 );
        $te->parse($content);

        my($table) = $te->tables();
        my @rows = $te->rows($table);

        my $regnum = $rows[0][1];
        my $name = $rows[1][1];
        my $college = $rows[2][1];


        ### grades (by table headers)
        $content =~ s/Subject\s*Code/Subject Code/; # patch silly web page
        $te = HTML::TableExtract->new( headers => ['Subject Code',
                                                   'Marks',
                                                   'Result'
                                                  ]
                                     );
        $te->parse($content);

        @rows = (); # re-used from above
        foreach my $ts ($te->table_states) {
            foreach my $row ($ts->rows) {
                next if $row->[0] =~ /CONTROLLER OF EXAMINATION/;
                my %course;
                @course{ qw/subject marks result/ } = @$row; # a "hash slice"
                push @rows, \%course;
            }
        }

        return $regnum, $name, $college, @rows;
    }

    ----------------------------------------


    Web scraping is a blast!


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Mar 7, 2004
    #4
  5. Fiaz Idris

    Fiaz Idris Guest

    > a cursory look at the html show that the input field is actually not
    > RegistraitonNumber , but rather Exam_Registration_Number
    > in addition to that there is a hidden field Codeid set to 'BA'.
    > and to be sure, maybe you should also include the button field,
    > btn_display=Results
    >
    > try that, preferably with a POST
    > if it still fails try to set the Referer HTTP header
    >
    > gnari


    I have tried various different combinations of the following URL
    encoded query.

    (1)
    http://www.chennaionline.com/msuniv...gistration_Number=2225765&btn_display=Results

    You may try on this page
    "http://www.chennaionline.com/msuniversity/result.asp"

    I have been successful with this approach elsewhere, for example in
    getting the arrival flights from this airport page.

    (2)
    http://www.hongkongairport.com/eng/...ion=All&SearchAirline=All&SearchFrom=2004-4-8

    So, could someone please guide me and show what is the expected URL to
    get the results returned for (1) above. Thanks.
    Fiaz Idris, Apr 7, 2004
    #5
  6. Fiaz Idris

    Fiaz Idris Guest

    Fiaz Idris, Apr 7, 2004
    #6
  7. Fiaz Idris

    Fiaz Idris Guest

    I happened to solve my original problem by using the following
    Perl script. There are two problems with this script:

    1) After about 90-100 iterations, the loop doesn't
    progress anymore but just waits. So I have to Ctrl+C the script,
    pick a new starting count and start again. And the same happens
    again and again...

    2) Occasionally the behaviour is uncertain.

    Could someone tell me what I should change in the script, or give
    any other advice? Thanks.

    I am using cygwin on a windows machine with perl 5.8.2

    Script
    -------

    #!/usr/bin/perl -w

    use LWP::Simple;
    use HTML::TableExtract;
    use LWP::UserAgent;

    my $browser = LWP::UserAgent->new;

    for ($regno=2225700; $regno<=2230000; $regno=$regno+50) {

        sleep 5;
        print STDERR "$regno\n";
        print "\n";
        my $response = $browser->post(
            'http://www.chennaionline.com/msuniversity/result.asp',
            [
                'Codeid' => 'BA',
                'Exam_Registration_Number' => $regno
            ],
        );

        # use the public accessor rather than reaching into $response->{_content}
        $curcontent = $response->content;

        my $all_te = HTML::TableExtract->new( depth=>1, count=> 2 );
        my $all_tem = HTML::TableExtract->new( depth=>1, count=> 3);

        #$all_te->parse_file("flt.txt");
        $all_te->parse($curcontent);
        $all_tem->parse($curcontent);

        # Registration No., Name, College table
        foreach $ts ($all_te->table_states) {
            foreach $row ($ts->rows) {
                for ($i=0; $i<@$row; $i++) {
                    my $temprow = $row->[$i];
                    #print "***<$temprow>***\n";
                    $temprow =~ s/^[\s\W\n]+(.*)\s+$/$1/g;
                    #$temprow =~ s/$unknownchar//g;

                    if ($temprow =~ /Registration/) { next; }
                    if ($temprow =~ /Name/) { next; }
                    if ($temprow =~ /College/) { next; }

                    print "$temprow, ";
                }
                #print "\n"
            }
        }

        # Subject, Marks, Result table
        foreach $ts ($all_tem->table_states) {
            foreach $row ($ts->rows) {
                for ($i=0; $i<@$row; $i++) {
                    my $temprow = $row->[$i];
                    #print "***<$temprow>***\n";
                    $temprow =~ s/^[\s\W\n]+(.*)\s+$/$1/g;
                    #$temprow =~ s/$unknownchar//g;

                    if ($temprow =~ /Subject/) { next; }
                    if ($temprow =~ /Marks/) { next; }
                    if ($temprow =~ /Result/) { next; }
                    if ($temprow =~ /CONTROLLER/) { next; }

                    print "$temprow, ";
                }
                #print "\n";
            }
        }
    }

    __END__
    Fiaz Idris, Apr 21, 2004
    #7
  8. Fiaz Idris

    ifiaz Guest

    Thanks Tad,

    I tried using your script for the latest results, and it works like a
    wonder:

    I used a for loop like this on the main part of the script.

    ### For Loop for the script below
    my $regno;
    for ($regno=2225683; $regno<=2226000; $regno=$regno+1) {
        my($num, $name, $college, @lines) = get_grades( $regno );
    ### For Loop for the script above


    ### Script change as follows in get_grades function for the latest results ###
    my $request = POST "http://www.chennaionline.com/msuniversity/result1.asp",
        [
            'Codeid' => "BA1TO4",
            'Exam_Registration_Number' => $id,
        ] ;
    ### Change the above in your code ###


    But, both your version of the script and my version stops after
    processing approx. the 90th student number unconditionally although the
    loop extends beyond that.

    Could you or someone explain why? and how I can correct this?

    I know it has been a long time.


    Tad McClellan wrote:

    [snip full quote of post #4, including the script and .sig]
    ifiaz, Mar 15, 2005
    #8
  9. ifiaz <> wrote:


    > Thanks Tad,



    You are welcome, you can show your gratitude by composing followups properly:

    Please do not top-post.

    Please do not full-quote.

    Please do not quote .sigs.


    > I tried using your script for the latest results, and it works like a
    > wonder:



    That is how ALL of _my_ code works!

    heh.


    [snip code fragments]


    > But, both your version of the script and my version stops after
    > processing approx. the 90th student number unconditionally although the
    > loop extends beyond that.



    It gets all 318 of them when I try it.


    > Could you or someone explain why?



    Nope, since I cannot duplicate the problem.

    (but I do see that 64 of the regno's return no results,
    invalid registration numbers I assume...
    )



    [snip 150 lines of TOFU]

    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Mar 15, 2005
    #9
  10. Fiaz Idris

    ifiaz Guest

    > > Thanks Tad,
    >
    >
    > You are welcome, you can show your gratitude by composing followups
    > properly:
    >
    > Please do not top-post.

    What does this mean?

    >
    > Please do not full-quote.

    Does this mean I should delete unnecessary parts when I reply?

    >
    > Please do not quote .sigs.

    What does this mean?

    Could you explain a bit clearer as I do not get your meaning. I will
    follow accordingly as I am relatively new to newsgroups.

    > > I tried using your script for the latest results, and it works like a
    > > wonder:

    >
    >
    > That is how ALL of _my_ code works!
    >
    > heh.
    >
    >
    > [snip code fragments]
    >
    >
    > > But, both your version of the script and my version stops after
    > > processing approx. the 90th student number unconditionally although the
    > > loop extends beyond that.

    >
    >
    > It gets all 318 of them when I try it.


    Is it without any change in the code?

    > > Could you or someone explain why?

    >
    > Nope, since I cannot duplicate the problem.
    >
    > (but I do see that 64 of the regno's return no results,
    > invalid registration numbers I assume...
    > )


    I assure you that it is not because some regnos return no results.

    Yet after the 90th student number, the program stalls indefinitely
    and I have to press Ctrl+C to break.

    I am using Perl 5.8.5, Windows 98 SE, Cygwin. Any comment on this is
    appreciated.
    ifiaz, Mar 16, 2005
    #10
  11. ifiaz <> wrote:
    >> > Thanks Tad,

    >>
    >>
    >> You are welcome, you can show your gratitude by composing followups
    >> properly:
    >>
    >> Please do not top-post.

    > What does this mean?



    http://www.catb.org/~esr/jargon/html/T/top-post.html


    >> Please do not full-quote.

    > Does this mean I should delete unnecessary parts when I reply?



    Exactly right.


    >> Please do not quote .sigs.

    > What does this mean?



    A ".sig" is the "signature" at the end of a post, after the
    line with 2 hyphens and a space char on it.

    You should snip those when replying, unless the .sig itself
    is what you are commenting on.


    > I am relatively new to newsgroups.



    Please see the Posting Guidelines for this newsgroup, and follow
    the links it contains:

    http://mail.augustmail.com/~tadmc/clpmisc.shtml


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Mar 16, 2005
    #11
  12. Fiaz Idris

    ifiaz Guest

    > > I tried using your script for the latest results, and it works like a
    > > wonder:

    >
    >
    > That is how ALL of _my_ code works!
    >
    > heh.
    >
    >
    > [snip code fragments]
    >
    >
    > > But, both your version of the script and my version stops after
    > > processing approx. the 90th student number unconditionally although the
    > > loop extends beyond that.

    >
    >
    > It gets all 318 of them when I try it.
    >
    >
    > > Could you or someone explain why?

    >
    >
    > Nope, since I cannot duplicate the problem.
    >
    > (but I do see that 64 of the regno's return no results,
    > invalid registration numbers I assume...
    > )


    I could see this too.

    >


    Did you make any code changes on your script?

    Is it to do anything with network overloading, etc. etc.?

    I am using perl 5.8.5, Windows 98 SE, cygwin

    Any pointers are much appreciated. Thanks.
    ifiaz, Mar 17, 2005
    #12
  13. ifiaz <> wrote:


    [ Please provide a proper attribution when you quote someone. ]


    >> It gets all 318 of them when I try it.
    >>
    >> > Could you or someone explain why?

    >>
    >> Nope, since I cannot duplicate the problem.



    > Did you make any code changes on your script?



    Yes, the ones you described.


    > Is it to do anything with network overloading, etc. etc.?



    Could be that, we can't see your network, so we cannot help with that.

    It could be that the website is throttling you too.

    Or it might be something else, since there may be tiny differences
    in the code we are running since there have been a few edits
    on each end since then...
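
    If the stall is on the network side, one way to narrow it down might
    be to give the agent a shorter timeout, report the status line of
    every failed request, and retry a few times before giving up. What
    follows is only a sketch under those assumptions, not a diagnosis:
    with_retries and fetch_result are made-up helper names, and the 30
    second timeout is an arbitrary choice (LWP::UserAgent defaults to
    180). The demonstration at the end uses a stand-in fetcher so that it
    runs without contacting the site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Call $fetch up to $tries times; return its first defined result,
# or undef if every attempt fails. $pause is seconds between attempts.
sub with_retries {
    my ($fetch, $tries, $pause) = @_;
    for my $attempt (1 .. $tries) {
        my $result = $fetch->($attempt);
        return $result if defined $result;
        warn "attempt $attempt failed\n";
        sleep $pause if $pause && $attempt < $tries;
    }
    return;
}

# 30 seconds is an arbitrary choice for this sketch.
my $agent = LWP::UserAgent->new( timeout => 30 );

# One POST for one registration number; undef on any failure.
sub fetch_result {
    my ($regno) = @_;
    my $response = $agent->post(
        'http://www.chennaionline.com/msuniversity/result1.asp',
        [ Codeid => 'BA1TO4', Exam_Registration_Number => $regno ],
    );
    warn "$regno: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->is_success ? $response->content : undef;
}

# Offline demonstration: a stand-in fetcher that fails twice, then succeeds.
my $calls = 0;
my $got = with_retries(
    sub { ++$calls < 3 ? undef : "page for attempt $calls" }, 5, 0 );
print "$got after $calls calls\n";
```

    In the real loop the call would look something like
    with_retries( sub { fetch_result($regno) }, 3, 5 ). The timeout
    covers inactivity on the connection, so a request that fails this way
    shows up as an error status line instead of a silent wait.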


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Mar 17, 2005
    #13
  14. Fiaz Idris

    ifiaz Guest

    > >> It gets all 318 of them when I try it.
    > >>
    > >> > Could you or someone explain why?
    > >>
    > >> Nope, since I cannot duplicate the problem.

    >
    >
    > > Did you make any code changes on your script?

    >
    >
    > Yes, the ones you described.
    >
    >
    > > Is it to do anything with network overloading, etc. etc.?

    >
    >
    > Could be that, we can't see your network, so we cannot help with
    > that.
    >
    > It could be that the website is throttling you too.
    >
    > Or it might be something else, since there may be tiny differences
    > in the code we are running since there have been a few edits
    > on each end since then...


    This simple code downloads the same URL content 150 times
    without any breaks.

    CODE FOLLOWS:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple;

    my $regno;

    for ($regno=1; $regno<=150; $regno=$regno+1) {

        # the URL must be one unbroken string
        my $content =
            get("http://www.chennaionline.com/msuniversity/submit1.asp?code=BA1TO4");

        die "Couldn't get it!" unless defined $content;

        print "$content\n";

    }

    CODE ENDS:

    But only the earlier results extraction code breaks after the
    90th student.

    I don't think the server is trying to cut me off due to network
    overload.

    Maybe it has to do with how the extraction code is written.

    Please bear with me and show me how I can accomplish what I wanted
    earlier.
    ifiaz, Mar 17, 2005
    #14
