Re: An iteration idiom (Was: Re: [Guppy-pe-list] loading files containing multiple dumps)

Discussion in 'Python' started by Boris Borcic, Sep 2, 2009.

  1. Boris Borcic

    Boris Borcic Guest

    Sverker Nilsson wrote:
    >> Sverker Nilsson wrote:
    >>> It reads one Stat object at a time and wants to report something
    >>> when there is no more to be read from the file.

    >> Hmm, am I right in thinking the above can more nicely be written as:
    >>
    >> >>> from guppy import hpy
    >> >>> h = hpy()
    >> >>> f = open(r'your.hpy')
    >> >>> sets = []
    >> >>> for s in iter(h.load(f)): sets.append(s)

    >> ...
    >>

    > The above iterates over one Stat object returned by h.load(f). I assume
    > you want to iterate over all the objects loaded.


    I don't know guppy,
    but if h.load(f) raises StopIteration upon eof, as seems implied by your
    proposal, then something like the following would work.

    sets.extend(h.load(f) for _ in xrange(10**9))
     
    Boris Borcic, Sep 2, 2009
    #1


  2. > I dont know guppy,
    > but if h.load(f) raises StopIteration upon eof, as seems implied by your
    > proposal, then something like the following would work.
    >
    > sets.extend(h.load(f) for _ in xrange(1e9))


    Sounds like hpy has a weird API. Either it should be an
    iterator supporting __iter__() and next() and raising
    StopIteration when it's done, or it should simply return
    None to indicate an empty load.

    In the first case, you would write:
    sets.extend(h.load(f))

    And in the second case:
    sets.extend(iter(partial(h.load, f), None))

    The first way just uses the iterator protocol in a way that
    is consistent with the rest of the language.

    The second way, using the two argument form of iter(),
    is the standard way of creating an iterator from a
    function that has a sentinel return value.

    IOW, it is not normal to use StopIteration in a function
    that isn't an iterator.
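For readers who want to try the sentinel form without guppy installed, here is a minimal sketch in present-day Python. `FakeLoader` and its `load` method are hypothetical stand-ins for `h.load(f)`, and the sketch assumes the loader returns None once the input is exhausted, which is the second case described above rather than what heapy actually does.

```python
from functools import partial


class FakeLoader:
    """Hypothetical stand-in for the hpy() object; not part of guppy."""

    def load(self, records):
        # Return the next record, or None (the sentinel) when nothing is left.
        return records.pop(0) if records else None


h = FakeLoader()
f = ["dump-1", "dump-2", "dump-3"]  # stands in for an open file

sets = []
# iter(callable, sentinel) calls h.load(f) repeatedly until it returns None.
sets.extend(iter(partial(h.load, f), None))
```

`iter(lambda: h.load(f), None)` would behave the same way; `partial` just avoids the lambda.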


    Raymond
     
    Raymond Hettinger, Sep 2, 2009
    #2

  3. Raymond Hettinger wrote:
    > In the first case, you would write:
    > sets.extend(h.load(f))


    yes, what I had was:

    for s in iter(h.load(f)): sets.append(s)

    ...which I mistakenly thought was working, but in fact boils down to
    Raymond's code.

    The problem is that each item that h.load(f) returns *is* actually an
    iterable, so either of the above just ends up with the contents of each
    set being extended onto `sets` rather than the sets themselves.
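The flattening effect described here can be reproduced with plain lists. In this small sketch the lists `stat_a` and `stat_b` are hypothetical stand-ins for the iterable objects that `h.load(f)` returns; no guppy code is involved.

```python
# Each "stat" is itself iterable, like the objects h.load(f) returns.
stat_a = ["row0", "row1"]
stat_b = ["row2"]

flattened = []
for stat in (stat_a, stat_b):
    flattened.extend(stat)   # adds the rows inside each stat

collected = []
for stat in (stat_a, stat_b):
    collected.append(stat)   # adds each stat as a single item
```

Because the loaded object is iterable, `extend` raises no exception; it silently produces the flattened result, which is why the mistake is easy to miss.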

    It's all really rather confusing, apologies if there's interspersed rant
    in here:

    >>> from guppy import hpy
    >>> h = hpy()


    Minor rant, why do I have to instantiate a
    <class 'guppy.heapy.Use._GLUECLAMP_'>
    to do anything with heapy?
    Why doesn't heapy just expose load, dump, etc?

    (oh, and reading the code for guppy.heapy.Use and its ilk made me go
    temporarily blind!) ;-)

    >>> f = open('copy.hpy')
    >>> s = h.load(f)


    Less minor rant: this applies to most things to do with heapy... Having
    __repr__ return the same as __str__ and having that be a long lump of
    text is rather annoying. If you really must, make __str__ return the big
    lump of text but have __repr__ return a simple, short, item containing
    the class, the id, and maybe the number of contained objects...
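As an illustration of the split being asked for, here is a minimal sketch. The `Stat` class below is a hypothetical toy written for this example, not heapy's class of the same name; it only shows a short `__repr__` next to a long `__str__`.

```python
class Stat:
    """Hypothetical container, used only to illustrate __repr__ vs __str__."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __repr__(self):
        # Short form: class name, id and number of contained rows.
        return "<%s at 0x%x, %d rows>" % (
            type(self).__name__, id(self), len(self.rows))

    def __str__(self):
        # Long form: the full report, which is what print() shows.
        return "\n".join("%5d %s" % (i, row) for i, row in enumerate(self.rows))


s = Stat(["str", "tuple", "dict"])
short = repr(s)   # what the interactive prompt displays for a bare `s`
full = str(s)     # what print(s) displays
```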

    Anyway...

    >>> id(s)
    13905272
    >>> len(s)
    192
    >>> s.__class__
    <class guppy.heapy.Part.Stat at 0x00CD6A20>
    >>> i = s[0]
    >>> id(i)
    13904112
    >>> len(i)
    1
    >>> i.__class__
    <class guppy.heapy.Part.Stat at 0x00CD6A20>

    Hmmm, I'm sure there's a good reason why an item in a set has the exact
    same class and interface as a whole set?

    It feels like some kind of filtering, where are the docs that explain
    all this?

    cheers,

    Chris

    --
    Simplistix - Content Management, Batch Processing & Python Consulting
    - http://www.simplistix.co.uk
     
    Chris Withers, Sep 3, 2009
    #3
  4. Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

    On Thu, 2009-09-03 at 10:05 +0100, Chris Withers wrote:
    > Raymond Hettinger wrote:
    > > In the first case, you would write:
    > > sets.extend(h.load(f))

    >
    > yes, what I had was:
    >
    > for s in iter(h.load(f)): sets.append(s)
    >
    > ...which I mistakenly thought was working, but in in fact boils down to
    > Raymond's code.
    >
    > The problem is that each item that h.load(f) returns *is* actually an
    > iterable, so either of the above just ends up the contents of each set
    > being extended onto `sets` rather than the sets themselved.


    Yes that is what makes it confusing, otherwise you would get an
    exception.

    I hope the new loadall method as I wrote about before will resolve this.

    def loadall(self,f):
        ''' Generates all objects from an open file f or a file named f'''
        if isinstance(f,basestring):
            f=open(f)
        while True:
            yield self.load(f)
    Should we call it loadall? It is a generator so it doesn't really load
    everything immediately, just lazily. Maybe call it iload? Or redefine
    load, but that might break existing code so would not be good.
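A note for anyone trying this on a current Python: since PEP 479 (the default behaviour from Python 3.7), a StopIteration that escapes inside a generator body is turned into a RuntimeError instead of quietly ending the generator, so the loop would need to catch it explicitly. The following is a minimal sketch under that assumption. `FakeSession` is a hypothetical stand-in that mimics the load() behaviour discussed in this thread (one object per call, StopIteration at end of file); it is not guppy code.

```python
import io


class FakeSession:
    """Hypothetical stand-in for the hpy() session object."""

    def load(self, f):
        # One object per call; StopIteration once the file is exhausted.
        line = f.readline()
        if not line:
            raise StopIteration
        return line.strip()

    def loadall(self, f):
        """Lazily yield every object from an open file f or a file named f."""
        if isinstance(f, str):
            f = open(f)
        while True:
            try:
                obj = self.load(f)
            except StopIteration:
                return  # end the generator cleanly
            yield obj


h = FakeSession()
sets = list(h.loadall(io.StringIO("a\nb\nc\n")))
```

As in the version above, a file opened by name is left for the caller or the garbage collector to close.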

    > It's all really rather confusing, apologies if there's interspersed rant
    > in here:
    >
    > >>> from guppy import hpy
    > >>> h = hpy()

    >
    > Minor rant, why do I have to instantiate a
    > <class 'guppy.heapy.Use._GLUECLAMP_'>
    > to do anything with heapy?
    > Why doesn't heapy just expose load, dump, etc?


    Basically, the need for the h=hpy() idiom is to avoid any global
    variables. Heapy uses some rather big internal data structures, to cache
    such things as dict ownership. I didn't want to have all those things in
    global variables. Now they are all contained in the hpy() session
    context. So you can get rid of them by just deleting h if h=hpy(), and
    the other objects you created. Also, it allows for several parallel
    invocations of Heapy.

    However, I am aware of the extra initial overhead to do h=hpy(). I
    discussed this in my thesis. "Section 4.7.8 Why not importing Use
    directly?" page 36,

    http://guppy-pe.sourceforge.net/heapy-thesis.pdf

    Maybe a module should be added that does this, especially if someone
    provides a patch and/or others agree :)


    > (oh, and reading the code for guppy.heapy.Use and its ilk made me go
    > temporarily blind!) ;-)


    Try sunglasses:) (Well, I am aware of this, it was a
    research/experimental system and could have some refactoring :)

    > >>> f = open('copy.hpy')
    > >>> s = h.load(f)

    >
    > Less minor rant: this applies to most things to do with heapy... Having
    > __repr__ return the same as __str__ and having that be a long lump of
    > text is rather annoying. If you really must, make __str__ return the big
    > lump of text but have __repr__ return a simple, short, item containing
    > the class, the id, and maybe the number of contained objects...


    I thought it was cool to not have to use print but get the result
    directly at the prompt.

    But if this is a problem and especially if others also complain, we
    could add an option for shorter __repr__.

    h=hpy(short_repr=True)

    Or something else/shorter if you wish.

    BTW, I think a cool thing with having everything based on a context
    session, h=hpy(args), is that you could add any options there. That
    would be harder/less clean if you just imported all methods from a
    module. Patch some module-level variable.... ilk ...

    > Anyway...
    >
    > >>> id(s)
    > 13905272
    > >>> len(s)
    > 192
    > >>> s.__class__
    > <class guppy.heapy.Part.Stat at 0x00CD6A20>
    > >>> i = s[0]
    > >>> id(i)
    > 13904112
    > >>> len(i)
    > 1
    > >>> i.__class__
    > <class guppy.heapy.Part.Stat at 0x00CD6A20>
    >
    > Hmmm, I'm sure there's a good reason why an item in a set has the exact
    > same class and iterface as a whole set?


    Um, perhaps no very good reason but... a subset of a set is still a set,
    isn't it? This is the same structure that is used in IdentitySet
    objects. Each row is still an IdentitySet, and has the same attributes.
    This is also how Python strings work: there is no special character
    type, a character is just a string of length 1. I thought this was
    pretty cool when I first saw it in Python, compared to other languages
    such as C or Pascal. If we don't need a new type, we had better avoid it.
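The string analogy can be checked directly at the prompt; this short sketch uses only built-in behaviour, and the variable names are arbitrary.

```python
word = "heapy"
first = word[0]    # indexing a str gives another str, not a separate char type
again = first[0]   # a length-1 str indexes to an equal length-1 str
```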

    So what's the problem? :)

    > It feels like some kind of filtering, where are the docs that explain
    > all this?


    Unfortunately, the docs for the Stat object have been lagging behind.
    Sorry. But as people gain more interest for Heapy and send comments or
    even patches, I get more motivated to look into it. :)

    Thanks and Cheers,

    Sverker

    --
    Expertise in Linux, embedded systems, image processing, C, Python...
    http://sncs.se
     
    Sverker Nilsson, Sep 4, 2009
    #4
  5. Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

    On Fri, 2009-09-04 at 15:25 +0200, Sverker Nilsson wrote:

    >
    > However, I am aware of the extra initial overhead to do h=hpy(). I
    > discussed this in my thesis. "Section 4.7.8 Why not importing Use
    > directly?" page 36,
    >
    > http://guppy-pe.sourceforge.net/heapy-thesis.pdf


    Actually it is described in "4.7.4 Why session context - why not global
    variables?" p. 33.

    Sorry for the double post. I never seem to get things right the first
    time (or so on) :-/

    Sverker
     
    Sverker Nilsson, Sep 4, 2009
    #5
  6. Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

    Sverker Nilsson wrote:
    > I hope the new loadall method as I wrote about before will resolve this.
    >
    > def loadall(self,f):
    >     ''' Generates all objects from an open file f or a file named f'''
    >     if isinstance(f,basestring):
    >         f=open(f)
    >     while True:
    >         yield self.load(f)


    It would be great if load either returned just one result ever, or
    properly implemented the iterator protocol, rather than half
    implementing it...

    > Should we call it loadall? It is a generator so it doesn't really load
    > all immedietally, just lazily. Maybe call it iload? Or redefine load,
    > but that might break existing code so would not be good.


    loadall works for me, iload doesn't.

    >> Minor rant, why do I have to instantiate a
    >> <class 'guppy.heapy.Use._GLUECLAMP_'>
    >> to do anything with heapy?
    >> Why doesn't heapy just expose load, dump, etc?

    >
    > Basically, the need for the h=hpy() idiom is to avoid any global
    > variables.


    Eh? What's h then? (And h will reference whatever globals you were
    worried about, surely?)

    > Heapy uses some rather big internal data structures, to cache
    > such things as dict ownership. I didn't want to have all those things in
    > global variables.


    What about attributes of a class instance of some sort then?

    > the other objects you created. Also, it allows for several parallel
    > invocations of Heapy.


    When is that helpful?

    > However, I am aware of the extra initial overhead to do h=hpy(). I
    > discussed this in my thesis. "Section 4.7.8 Why not importing Use
    > directly?" page 36,
    >
    > http://guppy-pe.sourceforge.net/heapy-thesis.pdf


    I'm afraid, while I'd love to, I don't have the time to read a thesis...

    > Try sunglasses:) (Well, I am aware of this, it was a
    > research/experimental system and could have some refactoring :)


    I would suggest creating a minimal system that allows you to do heap()
    and then let other people build what they need from there. Simple is
    *always* better...

    >> Less minor rant: this applies to most things to do with heapy... Having
    >> __repr__ return the same as __str__ and having that be a long lump of
    >> text is rather annoying. If you really must, make __str__ return the big
    >> lump of text but have __repr__ return a simple, short, item containing
    >> the class, the id, and maybe the number of contained objects...

    >
    > I thought it was cool to not have to use print but get the result
    > directly at the prompt.


    That's fine, that's what __str__ is for. __repr__ should be short.

    >> Hmmm, I'm sure there's a good reason why an item in a set has the exact
    >> same class and iterface as a whole set?

    >
    > Um, perhaps no very good reason but... a subset of a set is still a set,
    > isn't it?


    Yeah, but an item in a set is not a set. __getitem__ should return an
    item, not a subset...

    I really think that, by the sounds of it, what is currently implemented
    as __getitem__ should be a `filter` or `subset` method on IdentitySets
    instead...

    > objects. Each row is still an IdentitySet, and has the same attributes.


    Why? It's semantically different. .load() returns a set of measurements,
    each measurement contains a set of something else, but I don't know what...

    > This is also like Python strings work, there is no special character
    > type, a character is just a string of length 1.


    Strings are *way* more simple in terms of what they are though...

    cheers,

    Chris

    --
    Simplistix - Content Management, Batch Processing & Python Consulting
    - http://www.simplistix.co.uk
     
    Chris Withers, Sep 7, 2009
    #6
  7. Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

    On Mon, 2009-09-07 at 16:53 +0100, Chris Withers wrote:
    > Sverker Nilsson wrote:
    > > I hope the new loadall method as I wrote about before will resolve this.
    > >
    > > def loadall(self,f):
    > >     ''' Generates all objects from an open file f or a file named f'''
    > >     if isinstance(f,basestring):
    > >         f=open(f)
    > >     while True:
    > >         yield self.load(f)

    >
    > It would be great if load either returned just one result ever, or
    > properly implemented the iterator protocol, rather than half
    > implementing it...
    >

    Agreed, this is arguably a bug or at least a misfeature. As Raymond
    Hettinger also remarked, it is not normal for an ordinary function to
    raise StopIteration.

    But I don't think I would want to risk breaking someone's code just for
    this when we could just add a new method.

    > > Should we call it loadall? It is a generator so it doesn't really load
    > > all immedietally, just lazily. Maybe call it iload? Or redefine load,
    > > but that might break existing code so would not be good.

    >
    > loadall works for me, iload doesn't.
    >


    Or we could have an option to hpy() to redefine load() as loadall(), but
    I think it is cleaner (and easier) to just define a new method...

    Settled then? :)

    > >> Minor rant, why do I have to instantiate a
    > >> <class 'guppy.heapy.Use._GLUECLAMP_'>
    > >> to do anything with heapy?
    > >> Why doesn't heapy just expose load, dump, etc?

    > >
    > > Basically, the need for the h=hpy() idiom is to avoid any global
    > > variables.

    >
    > Eh? What's h then? (And h will reference whatever globals you were
    > worried about, surely?)


    h is what you make it to be in the context you create it; you can make
    it either a global variable, a local variable, or an object attribute.

    Interactively, I guess one tends to have it as a global variable, yes.
    But it is a global variable you created and are responsible for yourself,
    and there are no other global variables behind the scenes but the ones
    you create yourself (also possibly the results of heap() etc. as you
    store them in your environment).

    If making test programs, I would not use global variables but instead
    would tend to have h as a class attribute in a test class, eg as in
    UnitTest. It could also be a local variable in a test function.

    As the enclosing class or frame is deallocated, so is its attribute h
    itself. There should be nothing that stays allocated in other modules
    after one test (class) is done (other than some loaded modules
    themselves, but I am talking about more severe data that can be hundreds
    of megabytes or more).
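The lifetime argument can be illustrated with a toy session object. Everything in this sketch is hypothetical (`Session`, `Cache` and `owner_map` are made-up names, not guppy's internals); it only shows that data owned by a session instance becomes collectable once the last reference to the session is dropped, which would not happen if the same data lived in a module-level global.

```python
import gc
import weakref


class Cache:
    """Hypothetical large internal structure."""


class Session:
    """Hypothetical session object owning its caches (not guppy code)."""

    def __init__(self):
        self.owner_map = Cache()


h = Session()
probe = weakref.ref(h.owner_map)    # lets us observe the cache's lifetime
alive_before = probe() is not None

del h           # drop the only reference to the session...
gc.collect()    # ...and make sure a collection has run
alive_after = probe() is not None
```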

    > > Heapy uses some rather big internal data structures, to cache
    > > such things as dict ownership. I didn't want to have all those things in
    > > global variables.

    >
    > What about attributes of a class instance of some sort then?


    They are already attributes of an instance: hpy() is a convenience
    factory method that creates a top level instance for this purpose.

    > > the other objects you created. Also, it allows for several parallel
    > > invocations of Heapy.

    >
    > When is that helpful?


    For example, the setref() method sets a reference point somewhere in h.
    Further calls to heap() would report only objects allocated after that
    call. But you could use a new hpy() instance to see all objects again.

    Multiple threads come to mind, where each thread would have its own
    hpy() object. (Thread safety may still be a problem but at least it
    should be improved by not sharing the hpy() structures.)

    Even in the absence of multiple threads, you might have an outer
    invocation of hpy() that is used for global analysis, with its specific
    options, setref()'s etc, and inner invocations that make some local
    analysis perhaps in a single method.

    > > However, I am aware of the extra initial overhead to do h=hpy(). I
    > > discussed this in my thesis. "Section 4.7.8 Why not importing Use
    > > directly?" page 36,
    > >
    > > http://guppy-pe.sourceforge.net/heapy-thesis.pdf

    >
    > I'm afraid, while I'd love to, I don't have the time to read a thesis...


    But it is (an important) part of the documentation. For example it
    contains the rationale and an introduction to the main categories such
    as Sets, Kinds and EquivalenceRelations, and some use cases, for example
    how to seal a memory leak in a windowing program.

    I'm afraid, while I'd love to, I don't have the time to duplicate the
    thesis here...;-)

    > > Try sunglasses:) (Well, I am aware of this, it was a
    > > research/experimental system and could have some refactoring :)

    >
    > I would suggest creating a minimal system that allows you to do heap()
    > and then let other people build what they need from there. Simple is
    > *always* better...


    Do you mean we should actually _remove_ features to create a new
    standalone system?

    I don't think that'd be meaningful.
    You don't need to use anything else than heap() if you don't want to.

    You are free to wrap functions as you find suitable; a minimal wrapper
    module could be just like this:

    # Module heapyheap
    from guppy import hpy
    h=hpy()
    heap=heap()

    Should we add some such module? In the thesis I discussed this already
    and argued it was not worth the trouble. And I think it may be
    confusing; as in Python, I think it is good that 'there is only one way
    to do it'.

    > >> Less minor rant: this applies to most things to do with heapy... Having
    > >> __repr__ return the same as __str__ and having that be a long lump of
    > >> text is rather annoying. If you really must, make __str__ return the big
    > >> lump of text but have __repr__ return a simple, short, item containing
    > >> the class, the id, and maybe the number of contained objects...

    > >
    > > I thought it was cool to not have to use print but get the result
    > > directly at the prompt.

    >
    > That's fine, that's what __str__ is for. __repr__ should be short.


    No, it's the other way around: __repr__ is used when evaluating directly
    at the prompt.

    > >> Hmmm, I'm sure there's a good reason why an item in a set has the exact
    > >> same class and iterface as a whole set?

    > >
    > > Um, perhaps no very good reason but... a subset of a set is still a set,
    > > isn't it?

    >
    > Yeah, but an item in a set is not a set. __getitem__ should return an
    > item, not a subset...


    Usually I think it is called an 'element' of a set rather than an
    'item'. Python builtin sets can't even do indexing at all. I think it
    was perceived that since the result of indexing to get at individual
    elements would be ill-defined (depending on hashing and implementation)
    it should not be supported at all.

    Likewise, Heapy IdentitySet objects don't support indexing to get at the
    elements directly. The index (__getitem__) method was available so I
    used it to take the subset of the i'th row in the partition defined by
    its equivalence order.

    To get at a specific element, you either have to somehow arrive at a
    subset of length 1 (via eg the .byid equivalence relation ) and then
    use .theone, or make a list of the .nodes attribute, both of which
    methods would give somewhat ill-defined results.

    The subset indexing, being the more well-defined operation, and also
    IMHO more generally useful, thus got the honor to have the [] syntax.
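A toy model may make this design easier to picture. The `Partition` class below is purely hypothetical and is not how Heapy implements IdentitySet: it groups elements into rows with a classifier function, and indexing returns the i-th row as another `Partition` of the same type, along the lines described above.

```python
from collections import Counter


class Partition:
    """Hypothetical toy: elements grouped into rows by a classifier."""

    def __init__(self, elements, classify=type):
        self.elements = list(elements)
        self.classify = classify

    def rows(self):
        # Row keys ordered by descending count (ties keep first-seen order).
        counts = Counter(self.classify(e) for e in self.elements)
        return [kind for kind, _ in counts.most_common()]

    def __len__(self):
        # Number of rows, not number of elements.
        return len(self.rows())

    def __getitem__(self, i):
        # Return the subset making up row i, as the same type.
        kind = self.rows()[i]
        subset = [e for e in self.elements if self.classify(e) == kind]
        return Partition(subset, self.classify)

    def by(self, classify):
        # Same elements, re-partitioned by another classifier.
        return Partition(self.elements, classify)


heap = Partition(["a", "bb", "cc", (1,), (2, 3), 4.0])
strings = heap[0]           # the row of str objects, itself a Partition
by_len = strings.by(len)    # the same strings, now classified by length
```

Note that a single-row partition indexes to an equal copy of itself, and that defining `__getitem__` this way also makes the object iterable over its rows, which is the behaviour that caused the surprise with `extend` earlier in the thread.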

    > I really think that, by the sounds of it, what is currently implemented
    > as __getitem__ should be a `filter` or `subset` method on IdentitySets
    > instead...


    It would just be another syntax. I don't see the conceptual problem
    since e.g. indexing works just fine like this with strings.

    >
    > > objects. Each row is still an IdentitySet, and has the same attributes.

    >
    > Why? It's semantically different.


    No, it's semantically identical. :)

    Each row is an IdentitySet just like the top level set, but one which
    happens to contain elements of one particular kind as defined by the
    equivalence relation in use. So it has only 1 row. The equivalence
    relation can be changed by creating a new set using one of the .byxxx
    attributes: then the set could be made to contain many kinds of objects
    again, getting more rows, although the objects themselves don't change.

    >>> from guppy import hpy
    >>> h=hpy()
    >>> h.heap()
    Partition of a set of 51045 objects. Total size = 3740412 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0  25732  50  1694156  45    1694156  45 str
         1  11709  23   450980  12    2145136  57 tuple
    ...
    >>> _[0]
    Partition of a set of 25732 objects. Total size = 1694156 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0  25732 100  1694156 100    1694156 100 str
    >>> _.bysize
    Partition of a set of 25732 objects. Total size = 1694156 bytes.
     Index  Count   %     Size   % Cumulative  % Individual Size
         0   4704  18   150528   9     150528   9 32
         1   3633  14   130788   8     281316  17 36
    ...

    > .load() returns a set of measurements,
    > each measurement contains a set of something else, but I don't know what...
    >

    For Stat objects, in analogy with IdentitySet, each row represents (a
    statistical summary of) a subset, a block in the partition defined by
    the classifying equivalence relation. The only special thing with this
    sub - Stat object is that it happens to represent objects of only one
    kind, as defined by the equivalence relation used when dump() ing it. So
    it has only one subset in its own partition, one row in its
    representation. Indexing (with [0]) returns itself.

    Why would this warrant a new type?

    > > This is also like Python strings work, there is no special character
    > > type, a character is just a string of length 1.

    >
    > Strings are *way* more simple in terms of what they are though...


    I don't see why this matters.

    Cheers,

    Sverker

    --
    Expertise in Linux, embedded systems, image processing, C, Python...
    http://sncs.se
     
    Sverker Nilsson, Sep 8, 2009
    #7
  8. Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

    Sverker Nilsson wrote:
    > But I don't think I would want to risk breaking someone's code just for
    > this when we could just add a new method.


    I don't think anyone will be relying on StopIteration being raised.
    If you're worried, do the next release as a 0.10.0 release and explain
    the backwards incompatible change in the release announcement.

    > Or we could have an option to hpy() to redefine load() as loadall(), but
    > I think it is cleaner (and easier) to just define a new method...


    -1 to options to hpy, +1 to loadall but also -1 to leaving load() as
    broken as it is...

    > As the enclosing class or frame is deallocated, so is its attribute h
    > itself.


    Right, but as long as the h hangs around, it hangs on to all the memory
    it's used to build its stats, right? This caused me problems in my most
    recent use of guppy...

    > themselves, but I am talking about more severe data that can be hundreds
    > of megabytes or more).


    Me too ;-) I've been profiling situations where the memory usage was
    over 1GB for processing a 30MB file when I started ;-)

    > For example, the setref() method sets a reference point somewhere in h.
    > Further calls to heap() would report only objects allocated after that
    > call. But you could use a new hpy() instance to see all objects again.
    >
    > Multiple threads come to mind, where each thread would have its own
    > hpy() object. (Thread safety may still be a problem but at least it
    > should be improved by not sharing the hpy() structures.)
    >
    > Even in the absence of multiple threads, you might have an outer
    > invocation of hpy() that is used for global analysis, with its specific
    > options, setref()'s etc, and inner invocations that make some local
    > analysis perhaps in a single method.


    Fair points :)

    >>> http://guppy-pe.sourceforge.net/heapy-thesis.pdf

    >> I'm afraid, while I'd love to, I don't have the time to read a thesis...

    >
    > But it is (an important) part of the documentation.


    That may be, but I'd wager a fair amount of beer that by far the most
    common uses for heapy are:

    - finding out what's using the memory consumed by a python process

    - logging what the memory consumption is made up of while running a
    large python process

    - finding out how much memory is being used

    ...in that order. Usually on a very tight deadline and with unhappy
    users breathing down their necks. At times like that, reading a thesis
    doesn't really figure into it ;-)

    > I'm afraid, while I'd love to, I don't have the time to duplicate the
    > thesis here...;-)


    I don't think that would help. Succinct help and easy to use functions
    to get those 3 cases above solved is all that's needed ;-)

    > Do you mean we should actually _remove_ features to create a new
    > standalone system?


    Absolutely, why provide more than is used or needed?

    > You are free to wrap functions as you find suitable; a minimal wrapper
    > module could be just like this:
    >
    > # Module heapyheap
    > from guppy import hpy
    > h=hpy()
    > heap=heap()


    I don't follow this.. did you mean heap = h.heap()? If so, isn't that
    using all the gubbinz in Use, etc, anyway?

    >>>> Less minor rant: this applies to most things to do with heapy... Having
    >>>> __repr__ return the same as __str__ and having that be a long lump of
    >>>> text is rather annoying. If you really must, make __str__ return the big
    >>>> lump of text but have __repr__ return a simple, short, item containing
    >>>> the class, the id, and maybe the number of contained objects...
    >>> I thought it was cool to not have to use print but get the result
    >>> directly at the prompt.

    >> That's fine, that's what __str__ is for. __repr__ should be short.

    >
    > No, it's the other way around: __repr__ is used when evaluating directly
    > at the prompt.


    The docs give the idea:

    http://docs.python.org/reference/datamodel.html?highlight=__repr__#object.__repr__

    I believe your "big strings" would be classed as "informal" and so would
    be computed by __str__.


    >> Yeah, but an item in a set is not a set. __getitem__ should return an
    >> item, not a subset...

    >
    > Usually I think it is called an 'element' of a set rather than an
    > 'item'. Python builtin sets can't even do indexing at all.


    ...'cos it doesn't make sense ;-)

    > Likewise, Heapy IdentitySet objects don't support indexing to get at the
    > elements directly.


    ...then they shouldn't have a __getitem__ method!

    > The index (__getitem__) method was available so I
    > used it to take the subset of the i'ths row in the partition defined by
    > its equivalence order.


    That should have another name... I don't know what a partition or
    equivalence order are in the contexts you're using them, but I do know
    that hijacking __getitem__ for this is wrong.

    > The subset indexing, being the more well-defined operation, and also
    > IMHO more generally useful, thus got the honor to have the [] syntax.


    Except it misleads anyone who's programmed in Python for a significant
    period of time and causes problems when combined with the bug in .load :-(

    > It would just be another syntax. I don't see the conceptual problem
    > since e.g. indexing works just fine like this with strings.


    Strings are a bad example...

    >>> objects. Each row is still an IdentitySet, and has the same attributes.

    >> Why? It's semantically different.

    >
    > No, it's semantically identical. :)
    >
    > Each row is an IdentitySet just like the top level set, but one which
    > happens to contain elements being of one particular kind as defined by
    > the equivalence relation in use. So it has only 1 row. The equivalence
    > relation can be changed by creating a new set by using some of
    > the .byxxx attribute: then the set could be made to contain many kinds
    > of objects again, getting more rows albeit the objects themselves don't
    > change.


    Fine, I'll stop arguing, but just be aware that this is confusing and
    you're likely the only person who understands what's really going on or
    how it's supposed to work...

    Chris

    --
    Simplistix - Content Management, Batch Processing & Python Consulting
    - http://www.simplistix.co.uk
     
    Chris Withers, Sep 9, 2009
    #8
  9. Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

    On Wed, 2009-09-09 at 13:47 +0100, Chris Withers wrote:
    > Sverker Nilsson wrote:
    > > As the enclosing class or frame is deallocated, so is its attribute h
    > > itself.

    >
    > Right, but as long as the h hangs around, it hangs on to all the memory
    > it's used to build its stats, right? This caused me problems in my most
    > recent use of guppy...


    If you just use heap(), and only want total memory not relative to a
    reference point, you can just use hpy() directly. So rather than:

    CASE 1:

    h=hpy()
    h.heap().dump(...)
    #other code, the data internal to h is still around
    h.heap().dump(...)

    you'd do:

    CASE 2:

    hpy().heap().dump(...)
    #other code. No data from Heapy is hanging around
    hpy().heap().dump(...)

    The difference is that in case 1, the second call to heap() could reuse
    the internal data in h, whereas in case 2, it would have to be recreated
    which would take longer time. (The data would be such things as the
    dictionary owner map.)

    However, if you measure memory relative to a reference point, you would
    have to keep h around, as in case 1.
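The trade-off between the two cases can be illustrated with a small stand-in, assuming a session object that lazily builds and caches some expensive internal data. `Session` and `OwnerMap` are hypothetical names for illustration only; this is not how guppy is implemented, and the real internal data is more involved.

```python
import gc
import weakref


class OwnerMap(dict):
    """Hypothetical expensive internal data (a dict subclass can be weakly referenced)."""


class Session:
    """Hypothetical stand-in for a session object; not guppy's hpy()."""

    def __init__(self):
        self._owner_map = None
        self.builds = 0

    def heap(self):
        # Build the internal data on first use, then reuse it.
        if self._owner_map is None:
            self._owner_map = OwnerMap(built=True)
            self.builds += 1
        return len(self._owner_map)


# Case 1: keep the session; the second call reuses the cached data,
# which therefore stays in memory as long as h does.
h = Session()
h.heap()
h.heap()
builds_case1 = h.builds
cache_alive_case1 = weakref.ref(h._owner_map)() is not None

# Case 2: a short-lived session; its data is released with it and
# would have to be rebuilt by the next session.
temp = Session()
temp.heap()
temp_ref = weakref.ref(temp._owner_map)
del temp
gc.collect()
cache_alive_case2 = temp_ref() is not None
```

In this sketch case 1 builds the data once and keeps it alive, while in case 2 nothing survives the session, at the cost of a rebuild on every use.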

    [snip]

    > > Do you mean we should actually _remove_ features to create a new
    > > standalone system?

    >
    > Absolutely, why provide more than is used or needed?


    How should we understand this? Should we have to support 2 or more
    systems depending on what functionality you happen to need? Or do
    you mean most functionality is actually _never_ used by
    _anybody_ (and will not be in the future)? That would be quite gross
    wouldn't it.

    I'd be hard pressed to support several versions just so that some
    of them would have only the methods most commonly used in
    certain situations.

    That would be like creating an additional Python dialect that
    contained, say, only the 10 % of functionality that is used 90 % of the time.
    Quite naturally, this will not be done anytime soon. Even though one could
    perhaps argue it would be easier to use for children etc., the extra
    work to support this has not been deemed meaningful.

    >
    > > You are free to wrap functions as you find suitable; a minimal wrapper
    > > module could be just like this:
    > >
    > > # Module heapyheap
    > > from guppy import hpy
    > > h=hpy()
    > > heap=heap()

    >
    > I don't follow this.. did you mean heap = h.heap()?


    Actually I meant heap=h.heap

    > If so, isn't that using all the gubbinz in Use, etc, anyway?


    Depends on what you mean with 'using', but I would say no.
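The difference between `heap = h.heap()` and `heap = h.heap` is the difference between storing one result and binding the method for later calls. A minimal sketch, using a hypothetical `Session` class in place of the object returned by hpy():

```python
class Session:
    """Hypothetical stand-in for the object returned by hpy()."""

    def __init__(self):
        self.calls = 0

    def heap(self):
        self.calls += 1
        return "snapshot %d" % self.calls


h = Session()

snapshot = h.heap()   # calls the method once; 'snapshot' is a fixed result
heap = h.heap         # binds the method; nothing is computed at this point

first = heap()        # each call through the bound method takes a new snapshot
second = heap()
```

With the bound method, a wrapper module exposes a callable `heap` that still goes through `h` each time it is invoked.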

    > >>>> Less minor rant: this applies to most things to do with heapy... Having
    > >>>> __repr__ return the same as __str__ and having that be a long lump of
    > >>>> text is rather annoying. If you really must, make __str__ return the big
    > >>>> lump of text but have __repr__ return a simple, short, item containing
    > >>>> the class, the id, and maybe the number of contained objects...
    > >>> I thought it was cool to not have to use print but get the result
    > >>> directly at the prompt.
    > >> That's fine, that's what __str__ is for. __repr__ should be short.

    > >
    > > No, it's the other way around: __repr__ is used when evaluating directly
    > > at the prompt.

    >
    > The docs give the idea:
    >
    > http://docs.python.org/reference/datamodel.html?highlight=__repr__#object.__repr__
    >
    > I believe your "big strings" would be classed as "informal" and so would
    > be computed by __str__.


    Informal or not, they contain the information I thought was most useful
    and are created by __str__, but also with __repr__ because that is used
    when evaluated at the prompt.

    According to the doc you linked to above, __repr__ should preferably be
    a Python expression that could be used to recreate it. I think this has
    been discussed and criticized before and in general there is no way to
    create such an expression. For example, for the result of h.heap(),
    there is no expression that can recreate it later (since the heap
    changes) and the object returned is just an IdentitySet, which doesn't
    know how it was created.

    It also gives as an alternative, "If this is not possible, a string of
    the form <...some useful description...> should be returned"

    The __repr__ I use doesn't have the enclosing <>, granted; maybe I missed
    this, or it wasn't in the docs in 2005, or I didn't think it was important
    (still don't), but was that really what the complaint was about?

    The docs also say that "it is important that the representation is
    information-rich and unambiguous."

    I thought it was more useful to actually get information of what was
    contained in the object directly at the prompt, than try to show how to
    recreate it which wasn't possible anyway.
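Which method the prompt uses can be checked directly. The sketch below is written for Python 3 and uses a hypothetical `Report` class. It relies on `sys.__displayhook__`, the default hook the interactive interpreter calls to display the value of an expression, which writes `repr(value)` to standard output.

```python
import contextlib
import io
import sys


class Report:
    """Hypothetical class with distinct formal and informal representations."""

    def __str__(self):
        return "a long, human-readable table"

    def __repr__(self):
        return "<Report with 3 rows>"


r = Report()

# Capture what the default display hook would show for 'r' at the prompt.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    sys.__displayhook__(r)
shown_at_prompt = buffer.getvalue().strip()

printed = str(r)  # the text that print(r) would write
```

So evaluating an object at the prompt shows its `__repr__`, while `print` uses `__str__`; a class that defines only `__repr__` gets the same text in both places.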

    [snip]

    > > The index (__getitem__) method was available so I
    > > used it to take the subset of the i'th row in the partition defined by
    > > its equivalence order.

    >
    > That should have another name... I don't know what a partition or
    > equivalence order are in the contexts you're using them, but I do know
    > that hijacking __getitem__ for this is wrong.


    Opinions may differ, I'd say one can in principle never 'know' if such a
    thing is 'right' or 'wrong', but that gets us into philosophical territory. Anyway...

    For a tutorial by someone who did not seem to share your conviction
    about indexing, but seemed to regard the way Heapy does it as natural
    (although the author has other valid complaints, and the tutorial is
    somewhat outdated, e.g. with respect to 64-bit), see:

    http://www.pkgcore.org/trac/pkgcore/doc/dev-notes/heapy.rst

    which is also available from the Documentation section of the guppy-pe
    home page.

    Cheers,

    Sverker


    --
    Expertise in Linux, embedded systems, image processing, C, Python...
    http://sncs.se
     
    Sverker Nilsson, Sep 10, 2009
    #9
  10. Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading filescontaining multiple dumps)

    Sverker Nilsson wrote:
    > If you just use heap(), and only want total memory not relative to a
    > reference point, you can just use hpy() directly. So rather than:
    >
    > CASE 1:
    >
    > h=hpy()
    > h.heap().dump(...)
    > #other code, the data internal to h is still around
    > h.heap().dump(...)
    >
    > you'd do:
    >
    > CASE 2:
    >
    > hpy().heap().dump(...)
    > #other code. No data from Heapy is hanging around
    > hpy().heap().dump(...)
    >
    > The difference is that in case 1, the second call to heap() could reuse
    > the internal data in h,


    But that internal data would have to hang around, right? (which might,
    in itself, cause memory problems?)

    > whereas in case 2, it would have to be recreated,
    > which would take longer. (The data would be such things as the
    > dictionary owner map.)


    How long is longer? Do you have any metrics that would help make good
    decisions about when to keep a hpy() instance around and when it's best
    to save memory?

    >>> Do you mean we should actually _remove_ features to create a new
    >>> standalone system?

    >> Absolutely, why provide more than is used or needed?

    >
    > How should we understand this? Should we have to support 2 or more
    > systems depending on what functionality you happen to need? Or do
    > you mean most functionality is actually _never_ used by
    > _anybody_ (and will not be in the future)? That would be quite gross
    > wouldn't it.


    I'm saying have one project and dump all the excess stuff that no-one
    but you uses ;-)

    Or, maybe easier, have a core, separate package that just has the
    essentials in a simple, clean fashion and then another package that
    builds on this to add all the other stuff...

    > It also gives as an alternative, "If this is not possible, a string of
    > the form <...some useful description...> should be returned"
    >
    > The __repr__ I use doesn't have the enclosing <>, granted; maybe I missed
    > this, or it wasn't in the docs in 2005, or I didn't think it was important
    > (still don't), but was that really what the complaint was about?


    No, it was about the fact that when I do repr(something_from_heapy) I
    get a shedload of text.

    > I thought it was more useful to actually get information of what was
    > contained in the object directly at the prompt, than try to show how to
    > recreate it which wasn't possible anyway.


    Agreed, but I think the stuff you currently have in __repr__ would be
    better placed in its own method:

    >>> heap()

    <IdentitySet object at 0x0000 containing 10 items>
    >>> _.show()

    .... all the current __repr__ output
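The suggestion above could look roughly like this. `ItemSet` and its `show` method are hypothetical and only illustrate the split; they are not Heapy's IdentitySet.

```python
class ItemSet:
    """Hypothetical container; not Heapy's IdentitySet."""

    def __init__(self, items):
        self.items = list(items)

    def __repr__(self):
        # Short and unambiguous: class name, identity and size.
        return "<%s object at %#x containing %d items>" % (
            type(self).__name__, id(self), len(self.items))

    def show(self):
        # The detailed listing is produced only on request.
        return "\n".join(
            "%4d  %r" % (i, item) for i, item in enumerate(self.items))


s = ItemSet(["a", "b", "c"])
summary = repr(s)     # one line, whatever the size of the container
details = s.show()    # one line per item
```

The cost of this design is one extra call at the prompt to see the details, which is the convenience the original __repr__ was meant to provide.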

    >> That should have another name... I don't know what a partition or
    >> equivalence order are in the contexts you're using them, but I do know
    >> that hijacking __getitem__ for this is wrong.

    >
    > Opinions may differ, I'd say one can in principle never 'know' if such a
    > thing is 'right' or 'wrong', but that gets us into philosophical territory. Anyway...


    I would bet that if you asked 100 experienced python programmers, most
    of them would tell you that what you're doing with __getitem__ is wrong,
    some might even say evil ;-)

    > For a tutorial by someone who did not seem to share your conviction
    > about indexing, but seemed to regard the way Heapy does it as natural
    > (although the author has other valid complaints, and the tutorial is
    > somewhat outdated, e.g. with respect to 64-bit), see:
    >
    > http://www.pkgcore.org/trac/pkgcore/doc/dev-notes/heapy.rst


    This link has become broken recently, but I don't remember reading the
    author's comments as liking the indexing stuff...

    Chris

    --
    Simplistix - Content Management, Batch Processing & Python Consulting
    - http://www.simplistix.co.uk
     
    Chris Withers, Sep 11, 2009
    #10
  11. Ethan Furman

    Ethan Furman Guest

    Re: [Guppy-pe-list] An iteration idiom (Was: Re: loadingfiles containing multiple dumps)

    Chris Withers wrote:
    > Sverker Nilsson wrote:
    >>
    >> The __repr__ I use doesn't have the enclosing <>, granted; maybe I missed
    >> this, or it wasn't in the docs in 2005, or I didn't think it was important
    >> (still don't), but was that really what the complaint was about?

    >
    >
    > No, it was about the fact that when I do repr(something_from_heapy) I
    > get a shedload of text.
    >
    >> I thought it was more useful to actually get information of what was
    >> contained in the object directly at the prompt, than try to show how to
    >> recreate it which wasn't possible anyway.

    >
    >
    > Agreed, but I think the stuff you currently have in __repr__ would be
    > better placed in its own method:
    >
    > >>> heap()

    > <IdentitySet object at 0x0000 containing 10 items>


    For what it's worth, the container class I wrote recently to hold dbf
    rows is along the lines of Chris' suggestion; output is similar to this:

    DbfList(97 records)

    or, if a description was provided at list creation time:

    DbfList(State of Oregon - 97 records)

    basically, a short description of what's in the container, instead of 97
    screens of gibberish (even useful information is gibberish after 97
    screenfuls of it!-)

    ~Ethan~
     
    Ethan Furman, Sep 11, 2009
    #11