creating garbage collectable objects (caching objects)

Discussion in 'Python' started by News123, Jun 28, 2009.

  1. News123

    News123 Guest

    Hi.

    I started playing with PIL.

    I'm performing operations on multiple images and would like to find a
    compromise between speed and memory requirements.

    The fast approach would load all images upfront and then create the
    multiple result files. The problem is that I do not have enough memory
    to load all the files.

    The slow approach is to load each potential source file only when it is
    needed and to release it immediately afterwards (leaving it up to the gc
    to free memory when needed).



    The question I have is whether there is any way to tell Python that
    certain objects could be garbage collected if needed, and to ask Python
    at a later time whether the object has been collected so far (the image
    has to be reloaded) or not (the image would not have to be reloaded).


    # Fastest approach:
    imgs = {}
    for fname in all_image_files:
        imgs[fname] = Image.open(fname)
    for creation_rule in all_creation_rules():
        img = Image.new(...)
        for img_file in creation_rule.input_files():
            img = do_somethingwith(img, imgs[img_file])
        img.save()


    # Slowest approach:
    for creation_rule in all_creation_rules():
        img = Image.new(...)
        for img_file in creation_rule.input_files():
            src_img = Image.open(img_file)
            img = do_somethingwith(img, src_img)
        img.save()



    # What I'd like to do is something like:
    imgs = GarbageCollectable_dict()
    for creation_rule in all_creation_rules():
        img = Image.new(...)
        for img_file in creation_rule.input_files():
            if img_file in imgs:  # if I'm lucky the object is still there
                src_img = imgs[img_file]
            else:
                src_img = Image.open(img_file)
            img = do_somethingwith(img, src_img)
        img.save()



    Is this possible?

    Thanks in advance for an answer, or any other ideas for how I could do
    smart caching without hogging all the system's memory.
    News123, Jun 28, 2009
    #1

  2. Terry Reedy

    Terry Reedy Guest

    News123 wrote:
    > Hi.
    >
    > I started playing with PIL.
    >
    > I'm performing operations on multiple images and would like to find a
    > compromise between speed and memory requirements.
    >
    > The fast approach would load all images upfront and then create the
    > multiple result files. The problem is that I do not have enough memory
    > to load all the files.
    >
    > The slow approach is to load each potential source file only when it is
    > needed and to release it immediately afterwards (leaving it up to the gc
    > to free memory when needed).
    >
    > The question I have is whether there is any way to tell Python that
    > certain objects could be garbage collected if needed, and to ask Python
    > at a later time whether the object has been collected so far (the image
    > has to be reloaded) or not (the image would not have to be reloaded).


    See the weakref module. But note that in CPython, objects are collected
    as soon as there are no normal references, not when 'needed'.
    >
    > . . .
    >
    > Is this possible?
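
    For illustration, a minimal sketch of the weakref idea (Img is a
    hypothetical stand-in for a loaded PIL image; in CPython the cache
    entry disappears the moment the last normal reference goes away):

    import weakref

    class Img(object):
        """Stand-in for a loaded image object."""
        pass

    cache = weakref.WeakValueDictionary()
    img = Img()
    cache['a.png'] = img
    print('a.png' in cache)   # True: a normal reference (img) still exists
    del img                   # drop the last normal reference...
    print('a.png' in cache)   # False: the entry died with the object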
    Terry Reedy, Jun 28, 2009
    #2

  3. Simon Forman

    Simon Forman Guest

    On Jun 28, 11:03 am, News123 <> wrote:
    > Hi.
    >
    > I started playing with PIL.
    >
    > I'm performing operations on multiple images and would like to find a
    > compromise between speed and memory requirements.
    >
    > The fast approach would load all images upfront and then create the
    > multiple result files. The problem is that I do not have enough memory
    > to load all the files.
    >
    > The slow approach is to load each potential source file only when it is
    > needed and to release it immediately afterwards (leaving it up to the gc
    > to free memory when needed).
    >
    > The question I have is whether there is any way to tell Python that
    > certain objects could be garbage collected if needed, and to ask Python
    > at a later time whether the object has been collected so far (the image
    > has to be reloaded) or not (the image would not have to be reloaded).
    >
    > # Fastest approach:
    > imgs = {}
    > for fname in all_image_files:
    >     imgs[fname] = Image.open(fname)
    > for creation_rule in all_creation_rules():
    >     img = Image.new(...)
    >     for img_file in creation_rule.input_files():
    >         img = do_somethingwith(img,imgs[img_file])
    >     img.save()
    >
    > # Slowest approach:
    > for creation_rule in all_creation_rules():
    >     img = Image.new(...)
    >     for img_file in creation_rule.input_files():
    >         src_img = Image.open(img_file)
    >         img = do_somethingwith(img,src_img)
    >     img.save()
    >
    > # What I'd like to do is something like:
    > imgs = GarbageCollectable_dict()
    > for creation_rule in all_creation_rules():
    >     img = Image.new(...)
    >     for img_file in creation_rule.input_files():
    >         if img_file in imgs: # if I'm lucky the object is still there
    >                 src_img = imgs[img_file]
    >         else:
    >                 src_img = Image.open(img_file)
    >         img = do_somethingwith(img,src_img)
    >     img.save()
    >
    > Is this possible?
    >
    > Thanks in advance for an answer, or any other ideas for how I could do
    > smart caching without hogging all the system's memory.


    Maybe I'm just being thick today, but why would the "slow" approach be
    slow? The same amount of I/O and processing would be done either way,
    no?
    Have you timed both methods?

    That said, take a look at the weakref module Terry Reedy already
    mentioned, and maybe the gc (garbage collector) module too (although
    that might just lead to wasting a lot of time fiddling with stuff that
    the gc is supposed to handle transparently for you in the first place).
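
    (If you do time them, a crude wall-clock harness is enough for the
    comparison; fastest_approach and slowest_approach are hypothetical
    wrappers around the two loops quoted above:)

    import time

    def timed(fn, *args):
        # crude wall-clock timing; fine for comparing whole runs
        t0 = time.time()
        fn(*args)
        return time.time() - t0

    # print(timed(fastest_approach)); print(timed(slowest_approach))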
    Simon Forman, Jun 28, 2009
    #3
  4. Dave Angel

    Dave Angel Guest

    News123 wrote:
    > Hi.
    >
    > I started playing with PIL.
    >
    > I'm performing operations on multiple images and would like to find a
    > compromise between speed and memory requirements.
    >
    > . . .
    >
    > The question I have is whether there is any way to tell Python that
    > certain objects could be garbage collected if needed, and to ask Python
    > at a later time whether the object has been collected so far (the image
    > has to be reloaded) or not (the image would not have to be reloaded).
    >
    > . . .
    >
    > Is this possible?

    You don't say what implementation of Python, nor on what OS platform.
    Yet you're asking how to influence that implementation.

    In CPython, version 2.6 (and probably most other versions, but somebody
    else would have to chime in) an object is freed as soon as its reference
    count goes to zero. So the garbage collector is only there to catch
    cycles, and it runs relatively infrequently.

    So, if you keep a reference to an object, it'll not be freed.
    Theoretically, you can use the weakref module to keep a reference
    without inhibiting the garbage collection, but I don't have any
    experience with the module. You could start by studying its
    documentation. But probably you want a weakref.WeakValueDictionary.
    Use that in your third approach to store the cache.

    If you're using Cython or Jython, or one of many other implementations,
    the rules will be different.

    The real key to efficiency is usually managing locality of reference.
    If a given image is going to be used for many output files, you might
    try to do all the work with it before going on to the next image. In
    that case, it might mean searching all_creation_rules for rules which
    reference the file you've currently loaded. As always, measurement is key.
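
    A sketch of the OP's third loop using weakref.WeakValueDictionary as
    suggested above (Image.new(...) and do_somethingwith are the OP's
    placeholders; the strong reference held in src_img is what keeps an
    entry alive while it is in use):

    import weakref
    from PIL import Image

    imgs = weakref.WeakValueDictionary()
    for creation_rule in all_creation_rules():
        img = Image.new(...)
        for img_file in creation_rule.input_files():
            src_img = imgs.get(img_file)    # None once the image was collected
            if src_img is None:
                src_img = Image.open(img_file)
                imgs[img_file] = src_img
            img = do_somethingwith(img, src_img)
        img.save()

    (As the later replies point out, in CPython an entry only survives as
    long as some other strong reference to the image exists.)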
    Dave Angel, Jun 28, 2009
    #4
  5. News123

    News123 Guest

    Dave Angel wrote:
    > News123 wrote:
    >> Hi.
    >>
    >> I started playing with PIL.
    >>
    >> I'm performing operations on multiple images and would like to find a
    >> compromise between speed and memory requirements.
    >> . . .
    >>
    >> The question I have is whether there is any way to tell Python that
    >> certain objects could be garbage collected if needed, and to ask Python
    >> at a later time whether the object has been collected so far (the image
    >> has to be reloaded) or not (the image would not have to be reloaded).

    > You don't say what implementation of Python, nor on what OS platform.
    > Yet you're asking how to influence that implementation.


    Sorry, my fault. I'm using CPython under Windows and under Linux.
    >
    > In CPython, version 2.6 (and probably most other versions, but somebody
    > else would have to chime in) an object is freed as soon as its reference
    > count goes to zero. So the garbage collector is only there to catch
    > cycles, and it runs relatively infrequently.


    If CPython frees objects as early as possible (as soon as the refcount is
    0), then weakref will not really help me.
    In this case I'd have to elaborate a cache-like structure.
    >
    > So, if you keep a reference to an object, it'll not be freed.
    > Theoretically, you can use the weakref module to keep a reference
    > without inhibiting the garbage collection, but I don't have any
    > experience with the module. You could start by studying its
    > documentation. But probably you want a weakref.WeakValueDictionary.
    > Use that in your third approach to store the cache.
    >
    > If you're using Cython or Jython, or one of many other implementations,
    > the rules will be different.
    >
    > The real key to efficiency is usually managing locality of reference.
    > If a given image is going to be used for many output files, you might
    > try to do all the work with it before going on to the next image. In
    > that case, it might mean searching all_creation_rules for rules which
    > reference the file you've currently loaded. As always, measurement is key.


    Changing the order of the images to be calculated is key, and I'm working
    on that.

    As a first step I can reorder the image creation such that all output
    images that depend on only one input image are calculated one after
    the other.

    So for this case I can transform:

    # Slowest approach:
    for creation_rule in all_creation_rules():
        img = Image.new(...)
        for img_file in creation_rule.input_files():
            src_img = Image.open(img_file)
            img = do_somethingwith(img, src_img) # wrong indentation in OP
        img.save()


    into

    src_img = Image.open(img_file)
    for creation_rule in all_creation_rules_with_on_src_img():
        img = Image.new(...)
        img = do_somethingwith(img, src_img)
        img.save()


    What I was more concerned about is a group of output images depending on
    TWO or more input images.

    Depending on the platform (and the images) I might not be able to
    preload all two (or more) images.

    So, as CPython's garbage collection always takes place immediately,
    I'd like to pursue something else: I can create a cache which caches
    input files as long as Python leaves at least n MB available for the
    rest of the system.

    For this I have to know how much RAM is still available on a system.

    I'll start looking into this.
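
    (A sketch of such a cache, keeping entries only while the system reports
    enough free RAM. psutil is a third-party library used here as one way to
    query available memory; the 200 MB threshold and the FIFO eviction order
    are arbitrary choices:)

    import collections
    import psutil   # third-party; any platform-specific memory query would do

    def mem_available_mb():
        return psutil.virtual_memory().available / (1024.0 * 1024.0)

    class BoundedImageCache(object):
        def __init__(self, loader, min_free_mb=200):
            self.loader = loader              # e.g. Image.open
            self.min_free_mb = min_free_mb    # keep at least this much RAM free
            self.items = collections.OrderedDict()

        def get(self, key):
            if key in self.items:
                return self.items[key]
            img = self.loader(key)
            # evict oldest entries until the system has enough headroom again
            while self.items and mem_available_mb() < self.min_free_mb:
                self.items.popitem(last=False)
            self.items[key] = img
            return img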

    thanks again



    N
    News123, Jun 29, 2009
    #5
  6. Dave Angel

    Dave Angel Guest

    News123 wrote:
    > Dave Angel wrote:
    >
    >> In CPython, version 2.6 (and probably most other versions, but somebody
    >> else would have to chime in) an object is freed as soon as its reference
    >> count goes to zero. So the garbage collector is only there to catch
    >> cycles, and it runs relatively infrequently.
    >
    > If CPython frees objects as early as possible (as soon as the refcount is
    > 0), then weakref will not really help me.
    > In this case I'd have to elaborate a cache-like structure.
    >
    > . . .
    >
    > So, as CPython's garbage collection always takes place immediately,
    > I'd like to pursue something else: I can create a cache which caches
    > input files as long as Python leaves at least n MB available for the
    > rest of the system.
    >
    > For this I have to know how much RAM is still available on a system.
    >
    > I'll start looking into this.

    As I said earlier, I think weakref is probably what you need. A weakref
    is still a reference from the point of view of the ref-counting, but not
    from the point of view of the garbage collector. Have you read the help
    on weakref module? In particular, did you read Pep 0205?
    http://www.python.org/dev/peps/pep-0205/

    Object cache is one of the two reasons for the weakref module.
    Dave Angel, Jun 29, 2009
    #6
  7. Gabriel Genellina

    Gabriel Genellina Guest

    On Mon, 29 Jun 2009 08:01:20 -0300, Dave Angel <> wrote:
    > News123 wrote:


    >> What I was more concerned about is a group of output images depending
    >> on TWO or more input images.
    >>
    >> Depending on the platform (and the images) I might not be able to
    >> preload all two (or more) images.
    >>
    >> So, as CPython's garbage collection always takes place immediately,
    >> I'd like to pursue something else: I can create a cache which caches
    >> input files as long as Python leaves at least n MB available for the
    >> rest of the system.


    > As I said earlier, I think weakref is probably what you need. A weakref
    > is still a reference from the point of view of the ref-counting, but not
    > from the point of view of the garbage collector. Have you read the help
    > on weakref module? In particular, did you read Pep 0205?
    > http://www.python.org/dev/peps/pep-0205/


    You've misunderstood something. A weakref is NOT "a reference from the
    point of view of the ref-counting", it adds zero to the reference count.
    When the last "real" reference to some object is lost, the object is
    destroyed, even if there exist weak references to it. That's the whole
    point of a weak reference. The garbage collector isn't directly related.

    py> from sys import getrefcount as rc
    py> class X(object): pass
    ...
    py> x=X()
    py> rc(x)
    2
    py> y=x
    py> rc(x)
    3
    py> import weakref
    py> r=weakref.ref(x)
    py> r
    <weakref at 00BE56C0; to 'X' at 00BE4F30>
    py> rc(x)
    3
    py> del y
    py> rc(x)
    2
    py> del x
    py> r
    <weakref at 00BE56C0; dead>

    (remember that getrefcount, like any function, holds a temporary reference
    to its argument, so the number it returns is one more than the expected
    value)

    > Object cache is one of the two reasons for the weakref module.


    ...when you don't want the object to stay artificially alive just because
    it's referenced in the cache. But the OP wants a different behavior, it
    seems: a standard dictionary where images are removed when they're no
    longer needed (or a memory limit is hit).
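
    (One way to know "no longer needed": count the planned uses of each input
    file up front and drop an image after its last use. A sketch with the
    OP's placeholder names, assuming the rules can be iterated twice:)

    from collections import defaultdict

    uses_left = defaultdict(int)
    for creation_rule in all_creation_rules():
        for img_file in creation_rule.input_files():
            uses_left[img_file] += 1

    cache = {}
    for creation_rule in all_creation_rules():
        img = Image.new(...)
        for img_file in creation_rule.input_files():
            if img_file not in cache:
                cache[img_file] = Image.open(img_file)
            img = do_somethingwith(img, cache[img_file])
            uses_left[img_file] -= 1
            if uses_left[img_file] == 0:
                del cache[img_file]   # last planned use: release it now
        img.save()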

    --
    Gabriel Genellina
    Gabriel Genellina, Jun 29, 2009
    #7
  8. Dave Angel

    Dave Angel Guest

    Gabriel Genellina wrote:
    > On Mon, 29 Jun 2009 08:01:20 -0300, Dave Angel <> wrote:
    >> News123 wrote:
    >
    > . . .
    >
    >> As I said earlier, I think weakref is probably what you need. A
    >> weakref is still a reference from the point of view of the
    >> ref-counting, but not from the point of view of the garbage
    >> collector. Have you read the help on weakref module? In particular,
    >> did you read Pep 0205? http://www.python.org/dev/peps/pep-0205/
    >
    > You've misunderstood something. A weakref is NOT "a reference from the
    > point of view of the ref-counting", it adds zero to the reference
    > count. When the last "real" reference to some object is lost, the
    > object is destroyed, even if there exist weak references to it. That's
    > the whole point of a weak reference. The garbage collector isn't
    > directly related.
    >
    > . . .
    >
    > ...when you don't want the object to stay artificially alive just
    > because it's referenced in the cache. But the OP wants a different
    > behavior, it seems: a standard dictionary where images are removed
    > when they're no longer needed (or a memory limit is hit).

    Thanks for correcting me. As I said earlier, I have no experience with
    weakref. The help and the PEP did sound to me like it would work for
    his needs.

    So how about adding an attribute in the large object that refers to the
    object itself? Then the ref count will never go to zero, but it can be
    freed by the gc. Also store the ref in a WeakValueDictionary, and you
    can find the object without blocking its gc.

    And no, I haven't tried it, and wouldn't unless a machine had nothing
    important running on it. Clearly, the gc might not be able to keep up
    with this kind of abuse. But if gc is triggered by any attempt to make
    too-large an object, it might work.
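
    (Untested, as the post says. A sketch of that idea: the deliberate
    self-reference keeps the refcount from ever reaching zero, so the holder
    survives until the cycle collector runs, while the WeakValueDictionary
    lets you find it without keeping it alive yourself:)

    import weakref
    from PIL import Image

    class CachedImage(object):
        def __init__(self, img):
            self.img = img
            self._self = self   # deliberate cycle: refcount never hits zero

    cache = weakref.WeakValueDictionary()

    def load(fname):
        holder = cache.get(fname)
        if holder is None:
            holder = CachedImage(Image.open(fname))
            cache[fname] = holder
        return holder.img

    # Whenever the cycle collector runs (or gc.collect() is called), every
    # holder kept alive only by its own cycle is freed and its cache entry
    # vanishes, so entries live only between collections.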

    DaveA
    Dave Angel, Jun 29, 2009
    #8
