The huge amount of response data problem

Discussion in 'Perl Misc' started by falconzyx@gmail.com, Mar 25, 2008.

  1. Guest

    I have an issue:
    1. I want to open a file and use the data from the file to construct a
    URL.
    2. After I construct the URL and send it, I get the HTML response data,
    and some parts of it are what I want to store into files.

    It seems like a very easy thing; however, the problem is that the file I
    have to open is very large, and I have to construct almost 200000 URLs to
    send and then parse the response data. The speed is very, very slow.

    I have no experience with threads or DB caching, so I would like some
    help.

    Please give me some advice on what I should do to improve the speed.

    Thanks very much.
     
    , Mar 25, 2008
    #1

  2. Guest

    On Mar 25, 10:44 am, "" <> wrote:
    > [...]


    This is my code:

    use threads;
    use LWP::UserAgent;
    use LWP::Simple;
    use Data::Dumper;
    use strict;
    use threads::shared;

    my $wordsList = &get_request;
    #print Dumper( $wordsList );

    my @words = split( "\n", $wordsList );
    #print Dumper(@words);

    my @url = &get_url(@words);
    #print Dumper(@url);

    my @thr;
    foreach my $i ( 1 .. 100000 ) {
        push @thr, threads->new( \&get_html, $url[$i] );
    }
    foreach (@thr) {
        $_->detach;    # it doesn't work!!!!!!!!!!!!!!!!
    }

    sub get_html {
        my (@url) = @_;

    }

    sub get_request {
        ..........
        return $wordsList;
    }

    sub get_url {
        my (@words) = @_;
        ................
        return @url;
    }
     
    , Mar 25, 2008
    #2

  3. Ben Bullock Guest

    Your code is hopelessly inefficient. 100,000 strings of even twenty
    characters is at least two megabytes of memory. Then you've doubled
    that number with the creation of the URL, and then you are creating
    arrays of all these things, so you've used several megabytes of
    memory.

    Instead of first creating a huge array of names, then a huge array of
    URLs, why don't you just read in one line of the file at a time, then
    try to get data from each URL? Read in one line of the first file,
    create its URL, get the response data, store it, then go back and get
    the next line of the file, etc. A 100,000 line file actually isn't
    that big.
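
    A minimal sketch of that line-at-a-time approach, assuming a placeholder
    word file, URL pattern, and output directory (none of these names come
    from the thread):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( timeout => 10 );

    open my $fh, '<', 'words.txt' or die "Cannot open words.txt: $!";
    while ( my $word = <$fh> ) {
        chomp $word;
        next unless length $word;

        my $url      = "http://example.com/lookup/$word";    # placeholder pattern
        my $response = $ua->get($url);

        if ( $response->is_success ) {
            # store only the parts you need, then move on; nothing accumulates
            save_data( $word, $response->decoded_content );
        }
        else {
            warn "Failed $url: ", $response->status_line, "\n";
        }
    }
    close $fh;

    sub save_data {
        my ( $word, $content ) = @_;
        open my $out, '>', "data/$word.html"
            or die "Cannot open data/$word.html: $!";
        print {$out} $content;
        close $out;
    }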

    But if you are getting all these files from the internet, the biggest
    bottleneck is probably the time the code spends waiting for responses
    from the web servers it has queried. You'd have to think about making
    parallel requests somehow to solve that.
     
    Ben Bullock, Mar 25, 2008
    #3
  4. Guest

    On Mar 25, 3:06 pm, Ben Bullock <> wrote:
    > [...]


    Thanks Ben,

    However, is there any good solution that uses threads? I tried that,
    and I run out of memory from time to time after I refactored the code
    as you suggested.
    I tried Thread::Pool and some other thread modules that I found.
    Isn't Perl really suited for multi-threaded programming?

    Thanks again to everyone.
     
    , Mar 25, 2008
    #4
  5. Guest

    On Mar 25, 4:25 pm, "" <> wrote:
    > [...]

    Here is my refactored code:
    use threads;
    use LWP::UserAgent;
    use HTTP::Request;
    use LWP::Simple;      # save_sound() below uses getstore()
    use Data::Dumper;
    use strict;

    &get_request();

    sub get_request {
        open( FH, "..." ) or die "can not open file $!";
        while ( my $i = <FH> ) {    # read one line at a time (the original
                                    # called <FH> twice and skipped lines)
            chomp $i;
            my $url = ".../$i";
            my $t = threads->new( \&get_html, $url );
            $t->join();
        }
        close(FH);
    }

    sub get_html {
        my ($url) = @_;
        my $user_agent = LWP::UserAgent->new();
        my $response = $user_agent->request( HTTP::Request->new( 'GET', $url ) );
        my $content = $response->content;
        format_html($content);
    }

    sub format_html {
        my ($content) = shift;
        my $html_data = $content;
        my $word;
        my $data;
        while ( $html_data =~ m{...}igs ) {
            $word = $1;
        }
        while ( $html_data =~ m{...}igs ) {
            $data = $1;
            save_data( $word, $data );
        }
        while ( $data =~ m{...}igs ) {
            my $title = $1;
            my $sound = $1 . $2;
            if ( defined($sound) ) {
                save_sound( $word, $title, $sound );
            }
        }
    }

    sub save_data {
        my ( $word, $data ) = @_;
        open( FH, " > ..." ) or die "Can not open $!";
        print FH $data;
        close(FH);
    }

    sub save_sound {
        my ( $word, $title, $sound ) = @_;
        getstore( "....", "..." ) or warn $!;
    }
     
    , Mar 25, 2008
    #5
  6. wrote:
    > On Mar 25, 3:06 pm, Ben Bullock <> wrote:
    >> [...]
    >
    > Thanks Ben,
    >
    > However, is there any good solution that uses threads? I tried that,
    > and I run out of memory from time to time after I refactored the code
    > as you suggested.


    That's because, if your file contains 100,000 lines, your program tries
    to create 100,000 simultaneous threads, doesn't it?

    I would create a pool with a fixed number of threads (say 10). I'd read
    the file, adding tasks to a queue of the same size; after filling the
    queue I'd pause reading the file until the queue has spare space.
    Maybe this could be achieved by sleeping a while (say 100ms) and
    re-checking whether the queue is still full. When a thread is created or
    has finished a task, it should remove a task from the queue and process
    it. If the queue is empty, the thread should sleep for a while (say
    200ms) and try again. You'd need some mechanism to signal threads that
    all tasks have been queued (maybe a flag, a special marker task, a
    signal, or a certain number of consecutive failed attempts to find work).

    I've never tried to program something like this in Perl so I'd imagine
    someone (probably several people) has already solved this and added
    modules to CPAN to assist in this sort of task.

    There are probably some OO design patterns that apply too.
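
    A rough sketch of that fixed-size pool using Thread::Queue; the word
    file, URL pattern, worker count, and queue limit are assumptions, and
    the fetch/parse step is only stubbed in:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;
    use LWP::UserAgent;

    my $WORKERS   = 10;
    my $MAX_QUEUE = 100;    # back-pressure limit on queued URLs
    my $queue     = Thread::Queue->new;

    # Worker: pull URLs until the undef end-marker arrives.
    sub worker {
        my $ua = LWP::UserAgent->new( timeout => 10 );
        while ( defined( my $url = $queue->dequeue ) ) {
            my $response = $ua->get($url);
            next unless $response->is_success;
            # parse/store $response->decoded_content here
        }
    }

    my @pool = map { threads->create( \&worker ) } 1 .. $WORKERS;

    open my $fh, '<', 'words.txt' or die "Cannot open words.txt: $!";
    while ( my $word = <$fh> ) {
        chomp $word;
        # Pause while the queue is full, as described above.
        select( undef, undef, undef, 0.1 ) while $queue->pending >= $MAX_QUEUE;
        $queue->enqueue("http://example.com/lookup/$word");    # placeholder URL
    }
    close $fh;

    # One end-marker per worker signals that all tasks have been queued.
    $queue->enqueue(undef) for 1 .. $WORKERS;
    $_->join for @pool;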

    > I tried Thread::Pool and some other thread modules that I found.
    > Isn't Perl really suited for multi-threaded programming?


    I find it hard to understand what you are saying but I think the answer
    is: Yes, Perl is well suited to programming with multiple threads (or
    processes).

    --
    RGB
     
    RedGrittyBrick, Mar 25, 2008
    #6
  7. "" <> wrote:
    > construct almost 200000 URLs to send and then parse the response data.
    > The speed is very, very slow.
    >
    > Please give me some advice on what I should do to improve the speed.


    Get a T1 line.

    jue
     
    Jürgen Exner, Mar 25, 2008
    #7
  8. Guest

    "" <> wrote:
    > I have a issue:
    > 1. I want to open a file and use the data from the file to construct
    > the url.
    > 2. After I constructed the url and sent it, I got the response html
    > data and some parts are what I want store inot the files.
    >
    > It seems like a very easy thing, however, the issue is that the data
    > from the file that I have to open are too huge, which I have to
    > consturct almost 200000 url address to send and parse response data.
    > And the speed is very very slow.


    What part is slow, waiting for the response or parsing it?

    Do those URLs point to *your* servers? If so, then you should be able
    to bypass HTTP and go directly to the source. If not, then do you have
    permission from the owner of the servers to launch what could very well
    be a denial-of-service attack against them?

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Mar 25, 2008
    #8
  9. Guest

    RedGrittyBrick <> wrote:
    >
    > I find it hard to understand what you are saying but I think the answer
    > is: Yes, Perl is well suited to programming with multiple threads (or
    > processes).


    I agree with the "(or processes)" part, provided you are running on a Unix
    like platform. But in my experience/opinion Perl threads mostly suck.
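
    For the process route on a Unix-like platform, a rough sketch using
    Parallel::ForkManager from CPAN; the word file, URL pattern, and output
    path are placeholders, not from the thread:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(10);    # at most 10 children at once

    open my $fh, '<', 'words.txt' or die "Cannot open words.txt: $!";
    while ( my $word = <$fh> ) {
        chomp $word;
        $pm->start and next;                    # parent: move on to the next word

        # Child process: fetch and store one URL, then exit.
        my $ua       = LWP::UserAgent->new( timeout => 10 );
        my $response = $ua->get("http://example.com/lookup/$word");
        if ( $response->is_success ) {
            open my $out, '>', "data/$word.html" or die "Cannot open: $!";
            print {$out} $response->decoded_content;
            close $out;
        }
        $pm->finish;
    }
    close $fh;
    $pm->wait_all_children;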

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Mar 25, 2008
    #9
  10. Guest

    On Mar 26, 1:50 am, wrote:
    > RedGrittyBrick <> wrote:
    > > [...]
    >
    > I agree with the "(or processes)" part, provided you are running on a Unix
    > like platform. But in my experience/opinion Perl threads mostly suck.


    Here is my refactored code, which still runs at a very slow speed. Please
    advise me how to improve it, thanks very much:

    use LWP::Parallel::UserAgent;    # note the capitalisation of the module name
    use HTTP::Request;
    use LWP::Simple;                 # save_sound() below uses getstore()
    use threads;

    # display tons of debugging messages. See 'perldoc LWP::Debug'
    #use LWP::Debug qw(+);

    my $reqs = [
        HTTP::Request->new( 'GET', "http://www...." ),
        HTTP::Request->new( 'GET', "......" ),
        ..............    # about nearly 200000 URLs here
    ];

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->in_order(1);      # handle requests in order of registration
    $pua->duplicates(0);    # ignore duplicates
    $pua->timeout(1);       # in seconds
    $pua->redirect(1);      # follow redirects

    foreach my $req (@$reqs) {
        print "Registering '" . $req->url . "'\n";
        if ( my $res = $pua->register($req) ) {
            print STDERR $res->error_as_HTML;
        }
    }
    my $entries = $pua->wait();

    foreach ( keys %$entries ) {
        my $res = $entries->{$_}->response;
        threads->new( \&format_html, $res->content );
    }
    foreach my $thr ( threads->list() ) {
        $thr->join();    # I think it does not work......
    }

    sub format_html {
        my ($html_data) = shift;
        my $word;
        my $data;
        while ( $html_data =~ m{...}igs ) {
            $word = $1;
        }
        while ( $html_data =~ m{...}igs ) {
            $data = $1;
            save_data( $word, $data );
        }
        while ( $data =~ m{...}igs ) {
            my $title = $1;
            my $sound = $1 . $2;
            if ( defined($sound) ) {
                save_sound( $word, $title, $sound );
            }
        }
    }

    sub save_data {
        my ( $word, $data ) = @_;
        open( FH, " > ..." ) or die "Can not open $!";
        print FH $data;
        close(FH);
    }

    sub save_sound {
        my ( $word, $title, $sound ) = @_;
        getstore( "...", "..." ) or warn $!;
    }
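
    One way to keep the memory bounded with LWP::Parallel::UserAgent would be
    to register the requests in small batches instead of building all of the
    roughly 200000 at once. This is only a sketch; the word file, URL
    pattern, and batch size are assumptions:

    use strict;
    use warnings;
    use HTTP::Request;
    use LWP::Parallel::UserAgent;

    my $BATCH = 50;

    open my $fh, '<', 'words.txt' or die "Cannot open words.txt: $!";
    my @batch;
    while ( my $word = <$fh> ) {
        chomp $word;
        push @batch, "http://example.com/lookup/$word";    # placeholder pattern
        process_batch( \@batch ) if @batch >= $BATCH;
    }
    process_batch( \@batch ) if @batch;
    close $fh;

    sub process_batch {
        my ($urls) = @_;
        my $pua = LWP::Parallel::UserAgent->new;
        $pua->redirect(1);
        $pua->timeout(10);

        # Drain the current batch and register each request.
        my @chunk = splice @$urls;
        for my $url (@chunk) {
            if ( my $err = $pua->register( HTTP::Request->new( 'GET', $url ) ) ) {
                print STDERR $err->error_as_HTML;
            }
        }

        my $entries = $pua->wait();
        for my $key ( keys %$entries ) {
            my $res = $entries->{$key}->response;
            # format_html( $res->content ) as in the code above
        }
    }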
     
    , Mar 27, 2008
    #10
