website catcher

Discussion in 'Python' started by jwaixs, Jul 3, 2005.

  1. jwaixs Guest

    Hello,

    I'm busy building some kind of webpage framework written in Python.
    But there's a small problem in this framework. This framework should
    open a page, parse it, take some other information out of it, and
    store it in some kind of fast storage. This storage needs to be very
    fast, so everyone who asks for this page will get a parsed page
    returned from this storage (cache?).

    But how could I design a good web cache? Is this possible in Python,
    given that it should always be running? That won't work with cgi-bin
    pages, because they quit after they execute something. Or should it
    be built in C and imported as a module or something?

    Thank you,

    Noud Aldenhoven
     
    jwaixs, Jul 3, 2005
    #1

  3. jwaixs Guest

    Thank you, but that's not what I mean. I don't want some kind of
    client-side parser. I mean the page has already been parsed and is
    ready to be read, but I want to store it for further use. I need some
    kind of database that won't exit when the cgi-bin script has
    finished. This database needs to be open all the time and communicate
    very easily with the cgi-bin framework's main class.
     
    jwaixs, Jul 3, 2005
    #3
  4. jwaixs wrote:
    > Thank you, but that's not what I mean. I don't want some kind of
    > client-side parser. I mean the page has already been parsed and is
    > ready to be read, but I want to store it for further use. I need
    > some kind of database that won't exit when the cgi-bin script has
    > finished. This database needs to be open all the time and
    > communicate very easily with the cgi-bin framework's main class.


    Why does it need to be "open"? Store it in a pickled file, and load
    that pickle when you need it. Or not even as a pickle, just as a file
    in the FS. Basically what you are talking about is a webserver - so
    just use that.
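
    For example, a minimal sketch (the cache directory and the shape of
    the parsed page are placeholders here, not anything your framework
    prescribes):

        import os
        import pickle

        CACHE_DIR = "page_cache"  # hypothetical location for pickled pages

        def store_page(key, parsed_page):
            """Pickle a parsed page to disk so later requests can reuse it."""
            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(os.path.join(CACHE_DIR, key + ".pickle"), "wb") as f:
                pickle.dump(parsed_page, f)

        def load_page(key):
            """Return the pickled page, or None if it was never stored."""
            try:
                with open(os.path.join(CACHE_DIR, key + ".pickle"), "rb") as f:
                    return pickle.load(f)
            except FileNotFoundError:
                return None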

    Diez
     
    Diez B. Roggisch, Jul 3, 2005
    #4
  5. jwaixs Guest

    If I put the parsed websites in, for example, a hash table, that will
    be at least 5 times faster than putting them in a file that has to be
    stored on a slow harddrive. Memory is a lot faster than harddisk
    space. And if a lot of people ask for a page, all of them have to
    open that file. If that's 10 requests in 5 minutes, there's no real
    worry. If it's more than 10 requests per second, you really have a
    big problem, and the framework would probably crash or run uber slow.
    That's why I want to open the file only once and keep it in the
    server's memory, where it doesn't need to be opened each time someone
    asks for it.

    Noud Aldenhoven
     
    jwaixs, Jul 3, 2005
    #5
  6. jwaixs wrote:
    > If I put the parsed websites in, for example, a hash table, that
    > will be at least 5 times faster than putting them in a file that
    > has to be stored on a slow harddrive. Memory is a lot faster than
    > harddisk space. And if a lot of people ask for a page, all of them
    > have to open that file. If that's 10 requests in 5 minutes, there's
    > no real worry. If it's more than 10 requests per second, you really
    > have a big problem, and the framework would probably crash or run
    > uber slow. That's why I want to open the file only once and keep it
    > in the server's memory, where it doesn't need to be opened each
    > time someone asks for it.


    I don't think that's correct. An Apache serves static pages at high
    speed - and "slow harddrives" means about 32 MByte/s nowadays, which
    equals 256 MBit/s - is your machine connected to a GBit connection?
    And if it's for internet usage, do you have a GBit connection - if
    so, I envy you...

    And if your speed really has to be that high, I wonder if Python can
    be used at all. BTW, 10 requests per second of maybe 100 KB pages is
    next to nothing - just about 10 MBit/s. That's not really fast. And
    images and the like are also usually served from HD.

    You are of course right that memory is faster than harddrives. But
    HDs are (usually) faster than network IO - so that's your limiting
    factor, if at all. And starting CGI subprocesses also introduces lots
    of overhead - better use FastCGI then.


    I think that we're talking about two things here:

    - premature optimization on your side. Worry about speed later, if it
    _is_ an issue. Not now.

    - what you seem to want is a convenient way of having data served to
    you in a pythonesque way. I personally don't see anything wrong with
    storing and retrieving pages from HD - after all, that's where they
    end up eventually anyway. So if you write yourself a HTMLRetrieval
    class that abstracts that for you and

    1) takes a piece of HTML and stores it, maybe associated with some
    metadata
    2) can retrieve these chunks based on some key

    you are pretty much done. If you want, you can back it up with an
    RDBMS and hope that it will do the in-memory caching for you. But
    remember that there will be no connection pooling with CGIs, so that
    introduces overhead.
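
    A rough sketch of such a class (the names and the on-disk layout here
    are made up, just to show the shape of the idea):

        import json
        import os

        class HTMLRetrieval:
            """Store parsed HTML chunks on disk, keyed by name, with
            optional metadata saved alongside as JSON."""

            def __init__(self, directory="html_store"):
                self.directory = directory
                os.makedirs(directory, exist_ok=True)

            def store(self, key, html, metadata=None):
                # one file per chunk; metadata goes into a sibling file
                with open(os.path.join(self.directory, key + ".html"), "w") as f:
                    f.write(html)
                if metadata is not None:
                    with open(os.path.join(self.directory, key + ".meta"), "w") as f:
                        json.dump(metadata, f)

            def retrieve(self, key):
                with open(os.path.join(self.directory, key + ".html")) as f:
                    return f.read()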

    Or you go for your own standalone process that serves the pages
    through some RPC mechanism.

    Or you ditch CGI altogether and use some webframework that serves
    from a permanently running Python process with several worker threads
    - then you can use in-process memory via global variables to store
    those pages. For that, I recommend Twisted.
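
    A minimal Twisted sketch of that last idea - the module-level dict is
    the in-process cache, and the rendering step is just a placeholder
    for your real parsing code:

        from twisted.internet import reactor
        from twisted.web import resource, server

        PAGE_CACHE = {}  # lives as long as the server process does

        class CachedPages(resource.Resource):
            isLeaf = True

            def render_GET(self, request):
                key = request.path  # bytes under Python 3
                if key not in PAGE_CACHE:
                    # placeholder for the real parse/render step
                    PAGE_CACHE[key] = b"<html>page for " + key + b"</html>"
                return PAGE_CACHE[key]

        reactor.listenTCP(8080, server.Site(CachedPages()))
        reactor.run()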

    Diez
     
    Diez B. Roggisch, Jul 3, 2005
    #6
  7. jwaixs Guest

    Well, thank you for your advice. So I have a couple of solutions, but
    it can't become a server of its own, which means I will deal with
    files.

    Thank you for your advice, I'll first make it work... then the server.

    Noud Aldenhoven
     
    jwaixs, Jul 3, 2005
    #7
  8. gene tani Guest

    gene tani, Jul 3, 2005
    #8
  9. Mike Meyer Guest

    "jwaixs" <> writes:

    > If I put the parsed websites in, for example, a hash table, that
    > will be at least 5 times faster than putting them in a file that
    > has to be stored on a slow harddrive. Memory is a lot faster than
    > harddisk space. And if a lot of people ask for a page, all of them
    > have to open that file. If that's 10 requests in 5 minutes, there's
    > no real worry. If it's more than 10 requests per second, you really
    > have a big problem, and the framework would probably crash or run
    > uber slow. That's why I want to open the file only once and keep it
    > in the server's memory, where it doesn't need to be opened each
    > time someone asks for it.


    While Diez gave you some good reasons not to worry about this, and had
    some great advice, he missed one important reason you shouldn't worry
    about this:

    Your OS almost certainly has a disk cache.

    This means that if you get 10 requests for a page in a second, the
    first one will come off the disk and wind up in the OS disk cache. The
    next nine requests will get the pages from the OS disk cache, and not
    go to the disk at all.

    When you keep these pages in memory yourself, you're basically
    declaring that they are so important that you don't trust the OS to
    cache them properly. The exact details of how your use of extra
    memory interacts with the disk cache vary with the OS, but there's a
    fair chance that you're cutting down on the amount of disk cache the
    system will have available.

    In the end, if the OS disagrees with you about how important your
    pages are, it will win. Your pages will get paged out to disk, and
    will have to be read back from disk even though you have them stored
    in memory - with the extra overhead of a page fault when your process
    tries to access the swapped-out page, at that.

    A bunch of very smart people have spent a lot of time making modern
    operating systems perform well. Worrying about things that it is
    already worrying about is generally a waste of time - a clear case of
    premature optimization.

    <mike
    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Jul 3, 2005
    #9
  10. jwaixs wrote:
    > I need some kind of database that won't exit when the cgi-bin
    > script has finished. This database needs to be open all the time
    > and communicate very easily with the cgi-bin framework's main
    > class.


    Maybe long-running multi-threaded processes for FastCGI, SCGI or
    similar are what you're looking for, instead of short-lived CGI-BIN
    programs forked by the web server.
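
    A minimal sketch of that idea as a plain WSGI app (FastCGI and SCGI
    front ends speak the same interface; the render step here is a
    placeholder for the framework's real parsing code):

        from wsgiref.simple_server import make_server

        PAGE_CACHE = {}  # survives across requests: the process never exits

        def render_page(path):
            # placeholder for the framework's real parse/render step
            return ("<html>parsed page for %s</html>" % path).encode("utf-8")

        def app(environ, start_response):
            path = environ.get("PATH_INFO", "/")
            if path not in PAGE_CACHE:
                PAGE_CACHE[path] = render_page(path)
            start_response("200 OK", [("Content-Type", "text/html")])
            return [PAGE_CACHE[path]]

        make_server("", 8000, app).serve_forever()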

    Ciao, Michael.
     
    Michael Ströder, Jul 6, 2005
    #10
