dynamic allocation file buffer

Discussion in 'Python' started by castironpi, Sep 9, 2008.

  1. castironpi

    castironpi Guest

    I will try my idea again. I want to talk to people about a module I
    want to write and I will take the time to explain it. I think it's a
    "cool idea" that a lot of people, forgiving the slang, could benefit
    from. What are its flaws?

    A user has a file he is using either 1/ to persist binary data after
    the run of a single program (persistence) or 2/ share binary data
    between concurrently running programs (IPC / shared memory). The data
    are records of variable types and lengths that can change over time.
    He wants to change a record that's already present in the file. Here
    are two examples.

    Use Case 1: Hierarchical ElementTree-style data

    A user has an XML file like the one shown here.

    <a>
    <b>
    <c>Foo</c>
    </b>
    ...

    He wants to change "Foo" to "Foobar".

    <a>
    <b>
    <c>Foobar</c>
    </b>
    ...

    The change he wants to make is at the beginning of a 4GB file, and
    recopying the remainder is an unacceptable resource drain.

    Use Case 2: Web session logger

    A tutor application has written a plugin to a webbrowser that records
    the order of a user's mouse and keyboard activity during a browsing
    session, and makes them concurrently available to other applications
    in a suite, which are written in varying lanugages. The user takes
    some action, such as surfing to a site or clicking on a link. The
    browser plugin records that sequence into shared memory, where it is
    marked as acknowledged by the listener programs, and recycled back
    into an unused block. URLs, user inputs, and link text can be of any
    length, so truncating them to fit a fixed length is not an option.

    Existing Solutions

    - Shelve - A Python Standard Library shelf object can store a random
    access dictionary mapping strings to pickled objects. It does not
    provide for hierarchical data stores, and objects must be unpickled
    before they can be examined.
    - Relational Database - Separate tables of nodes, attributes, and
    text, and the relations between them are slow and unwieldy to
    reproduce the contents of a dynamic structure. The VARCHAR data type
    still carries a maximum size, no more flexible than fixed-length
    records.
    - POSH - Python Object Sharing - A module currently in its alpha stage
    promises to make it possible to store Python objects directly in
    shared memory. In its current form, its only entry point is 'fork'
    and does not offer persistence, only sharing. See:
    http://poshmodule.sourceforge.net/

    Dynamic Allocation

    The traditional solution, dynamic memory allocation, is to maintain a
    metadata list of "free blocks" that are available to write to. See:
    http://en.wikipedia.org/wiki/Dynamic_memory_allocation
    http://en.wikipedia.org/wiki/Malloc
    http://en.wikipedia.org/wiki/Mmap
    http://en.wikipedia.org/wiki/Memory_leak
    The catch, and the crux of the proposal, is that the metadata must be
    stored in shared memory along with the data themselves. Assuming they
    are, a program can acquire the offset of an unused block of a
    sufficient size for its data, then write it to the file at that
    offset. The metadata can maintain the offset of one root member, to
    serve as a 'table of contents' or header for the remainder of the
    file. It can be grown and reassigned as needed.

    An acquaintence writes: It could be quite useful for highly concurrent
    systems: the overhead involved with interprocess communication can be
    overwhelming, and something more flexible than normal object
    persistence to disk might be worth having.

    Python Applicability

    The usual problems with data persistence and sharing apply. The
    format of the external data is only established conventionally, and
    conversions between Python objects and raw memory bytes take the usual
    overhead. 'struct.Struct', 'ctypes.Structure', and 'pickle.Pickler'
    currently offer this functionality, and the buffer offset obtained
    from 'alloc' can be used with all three.

    Ex 1.
    s= struct.Struct( 'III' )
    x= alloc( s.size )
    s.pack_into( mem, x, 2, 4, 6 )
    Struct in its current form does not permit random access into
    structure contents; a user must read or write the entire converted
    strucutre in order to update one field. Alternative:
    s= struct.Struct( 'I' )
    x1, x2, x3= alloc( s.size ), alloc( s.size ), alloc( s.size )
    s.pack_into( mem, x1, 2 )
    s.pack_into( mem, x2, 4 )
    s.pack_into( mem, x3, 6 )

    Ex 2.
    class Items( ctypes.Structure ):
    _fields_= [
    ( 'x1', ctypes.c_float ),
    ( 'y1', ctypes.c_float ) ]
    x= alloc( ctypes.sizeof( Items ) )
    c= ctypes.cast( mem+ x, ctypes.POINTER( Items ) ).contents
    c.x1, c.y1= 2, 4
    The 'mem' variable is obtained from a call to PyObject_AsWriteBuffer.

    Ex 3.
    s= pickle.dumps( ( 2, 4, 6 ) )
    x= alloc( len( s ) )
    mem[ x: x+ len( s ) ]= s
    'dumps' is still slow and nor does permit random access into contents.

    Use Cases Revisited

    Use Case 1: Hierarchical ElementTree-style data
    Solution: Dynamically allocate the tree and its elements.

    Node: tag: a
    Node: tag: b
    Node: tag: c
    Node: text: Foo

    The user wants to change "Foo" to "Foobar".

    Node: tag: a
    Node: tag: b
    Node: tag: c
    Node: text: Foobar

    Deallocate 'Node: text: Foo', allocate 'Node: text: Foobar', and store
    the new offset into 'Node: tag: c'. Total writes 6 bytes 'foobar', a
    one-word offset, and approximatly 5- 10-word metadata update.

    Use Case 2: Web session logger
    Dynamically allocate a linked list of data points.

    Data: 'friendster.com'
    Data: 'My Account'

    Allocate one block for each string, adding it to a linked list. As
    listeners acknowledge each data point, remove it from the linked
    list. Keep the head node in the 'root offset' metadata field.

    Restrictions

    It is not possible for persistent memory to refer to live memory. Any
    objects it refers to must also be located in file. Their mapped
    addresses must not be stored, only their offsets into it. However,
    live references to persistent memory are eminently possible.

    Current Status

    A pure Python alloc-free implementation based on the GNU PAVL tree
    library is on Google Code. It is only in proof-of-concept form and
    not commented, but does contain a first-pass test suite. See:
    http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk
    The ctypes solution for access is advised.
    castironpi, Sep 9, 2008
    #1
    1. Advertising

  2. castironpi

    Larry Bates Guest

    castironpi wrote:
    > I will try my idea again. I want to talk to people about a module I
    > want to write and I will take the time to explain it. I think it's a
    > "cool idea" that a lot of people, forgiving the slang, could benefit
    > from. What are its flaws?
    >
    > A user has a file he is using either 1/ to persist binary data after
    > the run of a single program (persistence) or 2/ share binary data
    > between concurrently running programs (IPC / shared memory). The data
    > are records of variable types and lengths that can change over time.
    > He wants to change a record that's already present in the file. Here
    > are two examples.
    >
    > Use Case 1: Hierarchical ElementTree-style data
    >
    > A user has an XML file like the one shown here.
    >
    > <a>
    > <b>
    > <c>Foo</c>
    > </b>
    > ...
    >
    > He wants to change "Foo" to "Foobar".
    >
    > <a>
    > <b>
    > <c>Foobar</c>
    > </b>
    > ...
    >
    > The change he wants to make is at the beginning of a 4GB file, and
    > recopying the remainder is an unacceptable resource drain.
    >
    > Use Case 2: Web session logger
    >
    > A tutor application has written a plugin to a webbrowser that records
    > the order of a user's mouse and keyboard activity during a browsing
    > session, and makes them concurrently available to other applications
    > in a suite, which are written in varying lanugages. The user takes
    > some action, such as surfing to a site or clicking on a link. The
    > browser plugin records that sequence into shared memory, where it is
    > marked as acknowledged by the listener programs, and recycled back
    > into an unused block. URLs, user inputs, and link text can be of any
    > length, so truncating them to fit a fixed length is not an option.
    >
    > Existing Solutions
    >
    > - Shelve - A Python Standard Library shelf object can store a random
    > access dictionary mapping strings to pickled objects. It does not
    > provide for hierarchical data stores, and objects must be unpickled
    > before they can be examined.
    > - Relational Database - Separate tables of nodes, attributes, and
    > text, and the relations between them are slow and unwieldy to
    > reproduce the contents of a dynamic structure. The VARCHAR data type
    > still carries a maximum size, no more flexible than fixed-length
    > records.
    > - POSH - Python Object Sharing - A module currently in its alpha stage
    > promises to make it possible to store Python objects directly in
    > shared memory. In its current form, its only entry point is 'fork'
    > and does not offer persistence, only sharing. See:
    > http://poshmodule.sourceforge.net/
    >
    > Dynamic Allocation
    >
    > The traditional solution, dynamic memory allocation, is to maintain a
    > metadata list of "free blocks" that are available to write to. See:
    > http://en.wikipedia.org/wiki/Dynamic_memory_allocation
    > http://en.wikipedia.org/wiki/Malloc
    > http://en.wikipedia.org/wiki/Mmap
    > http://en.wikipedia.org/wiki/Memory_leak
    > The catch, and the crux of the proposal, is that the metadata must be
    > stored in shared memory along with the data themselves. Assuming they
    > are, a program can acquire the offset of an unused block of a
    > sufficient size for its data, then write it to the file at that
    > offset. The metadata can maintain the offset of one root member, to
    > serve as a 'table of contents' or header for the remainder of the
    > file. It can be grown and reassigned as needed.
    >
    > An acquaintence writes: It could be quite useful for highly concurrent
    > systems: the overhead involved with interprocess communication can be
    > overwhelming, and something more flexible than normal object
    > persistence to disk might be worth having.
    >
    > Python Applicability
    >
    > The usual problems with data persistence and sharing apply. The
    > format of the external data is only established conventionally, and
    > conversions between Python objects and raw memory bytes take the usual
    > overhead. 'struct.Struct', 'ctypes.Structure', and 'pickle.Pickler'
    > currently offer this functionality, and the buffer offset obtained
    > from 'alloc' can be used with all three.
    >
    > Ex 1.
    > s= struct.Struct( 'III' )
    > x= alloc( s.size )
    > s.pack_into( mem, x, 2, 4, 6 )
    > Struct in its current form does not permit random access into
    > structure contents; a user must read or write the entire converted
    > strucutre in order to update one field. Alternative:
    > s= struct.Struct( 'I' )
    > x1, x2, x3= alloc( s.size ), alloc( s.size ), alloc( s.size )
    > s.pack_into( mem, x1, 2 )
    > s.pack_into( mem, x2, 4 )
    > s.pack_into( mem, x3, 6 )
    >
    > Ex 2.
    > class Items( ctypes.Structure ):
    > _fields_= [
    > ( 'x1', ctypes.c_float ),
    > ( 'y1', ctypes.c_float ) ]
    > x= alloc( ctypes.sizeof( Items ) )
    > c= ctypes.cast( mem+ x, ctypes.POINTER( Items ) ).contents
    > c.x1, c.y1= 2, 4
    > The 'mem' variable is obtained from a call to PyObject_AsWriteBuffer.
    >
    > Ex 3.
    > s= pickle.dumps( ( 2, 4, 6 ) )
    > x= alloc( len( s ) )
    > mem[ x: x+ len( s ) ]= s
    > 'dumps' is still slow and nor does permit random access into contents.
    >
    > Use Cases Revisited
    >
    > Use Case 1: Hierarchical ElementTree-style data
    > Solution: Dynamically allocate the tree and its elements.
    >
    > Node: tag: a
    > Node: tag: b
    > Node: tag: c
    > Node: text: Foo
    >
    > The user wants to change "Foo" to "Foobar".
    >
    > Node: tag: a
    > Node: tag: b
    > Node: tag: c
    > Node: text: Foobar
    >
    > Deallocate 'Node: text: Foo', allocate 'Node: text: Foobar', and store
    > the new offset into 'Node: tag: c'. Total writes 6 bytes 'foobar', a
    > one-word offset, and approximatly 5- 10-word metadata update.
    >
    > Use Case 2: Web session logger
    > Dynamically allocate a linked list of data points.
    >
    > Data: 'friendster.com'
    > Data: 'My Account'
    >
    > Allocate one block for each string, adding it to a linked list. As
    > listeners acknowledge each data point, remove it from the linked
    > list. Keep the head node in the 'root offset' metadata field.
    >
    > Restrictions
    >
    > It is not possible for persistent memory to refer to live memory. Any
    > objects it refers to must also be located in file. Their mapped
    > addresses must not be stored, only their offsets into it. However,
    > live references to persistent memory are eminently possible.
    >
    > Current Status
    >
    > A pure Python alloc-free implementation based on the GNU PAVL tree
    > library is on Google Code. It is only in proof-of-concept form and
    > not commented, but does contain a first-pass test suite. See:
    > http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk
    > The ctypes solution for access is advised.


    You should review Zope's ZODB and/or memcached before putting in too much effort.

    -Larry
    Larry Bates, Sep 9, 2008
    #2
    1. Advertising

  3. On Tue, 09 Sep 2008 14:59:19 -0700, castironpi wrote:

    > I will try my idea again. I want to talk to people about a module I
    > want to write and I will take the time to explain it. I think it's a
    > "cool idea" that a lot of people, forgiving the slang, could benefit
    > from. What are its flaws?


    [snip long description with not-very-credible use-cases]

    You've created a solution to a problem which (probably) only affects a
    very small number of people, at least judging by your use-cases. Who has
    a 4GB XML file, and how much crack did they smoke?

    Castironpi, what do *you* use this proof-of-concept module for? Don't
    bother tell us what you think *we* should use it for. Tell us what you're
    using it for, or at least what somebody else is using it for. If this is
    just a module that you think will be cool, I don't like your chances of
    people caring. There is no shortage of "cool" software that isn't useful
    for anything, and unlike eye-candy, nobody is going to use your module
    just because they like the algorithm.

    If you don't have an existing application for the software, then explain
    what it does (not how) and give some idea of the performance ("it's alpha
    and written in Python and really slow, but I will re-write it in C and
    expect it to make a billion random accesses in a 10GB file per
    millisecond", or whatever). You might be lucky and have somebody say
    "Hey, that's just the tool I need to solve my problem!".


    --
    Steven
    Steven D'Aprano, Sep 9, 2008
    #3
  4. castironpi

    castironpi Guest

    On Sep 9, 5:44 pm, Larry Bates <> wrote:
    > castironpi wrote:
    > > I will try my idea again.  I want to talk to people about a module I
    > > want to write and I will take the time to explain it.  I think it's a
    > > "cool idea" that a lot of people, forgiving the slang, could benefit
    > > from.  What are its flaws?

    >
    > > A user has a file he is using either 1/ to persist binary data after
    > > the run of a single program (persistence) or 2/ share binary data
    > > between concurrently running programs (IPC / shared memory).  The data
    > > are records of variable types and lengths that can change over time.
    > > He wants to change a record that's already present in the file.  Here
    > > are two examples.

    >
    > > Use Case 1: Hierarchical ElementTree-style data

    >
    > > A user has an XML file like the one shown here.

    >
    > > <a>
    > >   <b>
    > >     <c>Foo</c>
    > >   </b>
    > >   ...

    >
    > > He wants to change "Foo" to "Foobar".

    >
    > > <a>
    > >   <b>
    > >     <c>Foobar</c>
    > >   </b>
    > >   ...

    >
    > > The change he wants to make is at the beginning of a 4GB file, and
    > > recopying the remainder is an unacceptable resource drain.

    >
    > > Use Case 2: Web session logger

    >
    > > A tutor application has written a plugin to a webbrowser that records
    > > the order of a user's mouse and keyboard activity during a browsing
    > > session, and makes them concurrently available to other applications
    > > in a suite, which are written in varying lanugages.  The user takes
    > > some action, such as surfing to a site or clicking on a link.  The
    > > browser plugin records that sequence into shared memory, where it is
    > > marked as acknowledged by the listener programs, and recycled back
    > > into an unused block.  URLs, user inputs, and link text can be of any
    > > length, so truncating them to fit a fixed length is not an option.

    >
    > > Existing Solutions

    >
    > > - Shelve - A Python Standard Library shelf object can store a random
    > > access dictionary mapping strings to pickled objects.  It does not
    > > provide for hierarchical data stores, and objects must be unpickled
    > > before they can be examined.
    > > - Relational Database - Separate tables of nodes, attributes, and
    > > text, and the relations between them are slow and unwieldy to
    > > reproduce the contents of a dynamic structure.  The VARCHAR data type
    > > still carries a maximum size, no more flexible than fixed-length
    > > records.
    > > - POSH - Python Object Sharing - A module currently in its alpha stage
    > > promises to make it possible to store Python objects directly in
    > > shared memory.  In its current form, its only entry point is 'fork'
    > > and does not offer persistence, only sharing.  See:
    > >    http://poshmodule.sourceforge.net/

    >
    > > Dynamic Allocation

    >
    > > The traditional solution, dynamic memory allocation, is to maintain a
    > > metadata list of "free blocks" that are available to write to.  See:
    > >    http://en.wikipedia.org/wiki/Dynamic_memory_allocation
    > >    http://en.wikipedia.org/wiki/Malloc
    > >    http://en.wikipedia.org/wiki/Mmap
    > >    http://en.wikipedia.org/wiki/Memory_leak
    > > The catch, and the crux of the proposal, is that the metadata must be
    > > stored in shared memory along with the data themselves.  Assuming they
    > > are, a program can acquire the offset of an unused block of a
    > > sufficient size for its data, then write it to the file at that
    > > offset.  The metadata can maintain the offset of one root member, to
    > > serve as a 'table of contents' or header for the remainder of the
    > > file.  It can be grown and reassigned as needed.

    >
    > > An acquaintence writes: It could be quite useful for highly concurrent
    > > systems: the overhead involved with interprocess communication can be
    > > overwhelming, and something more flexible than normal object
    > > persistence to disk might be worth having.

    >
    > > Python Applicability

    >
    > > The usual problems with data persistence and sharing apply.  The
    > > format of the external data is only established conventionally, and
    > > conversions between Python objects and raw memory bytes take the usual
    > > overhead.  'struct.Struct', 'ctypes.Structure', and 'pickle.Pickler'
    > > currently offer this functionality, and the buffer offset obtained
    > > from 'alloc' can be used with all three.

    >
    > > Ex 1.
    > >     s= struct.Struct( 'III' )
    > >     x= alloc( s.size )
    > >     s.pack_into( mem, x, 2, 4, 6 )
    > > Struct in its current form does not permit random access into
    > > structure contents; a user must read or write the entire converted
    > > strucutre in order to update one field.  Alternative:
    > >     s= struct.Struct( 'I' )
    > >     x1, x2, x3= alloc( s.size ), alloc( s.size ), alloc( s.size )
    > >     s.pack_into( mem, x1, 2 )
    > >     s.pack_into( mem, x2, 4 )
    > >     s.pack_into( mem, x3, 6 )

    >
    > > Ex 2.
    > >     class Items( ctypes.Structure ):
    > >         _fields_= [
    > >             ( 'x1', ctypes.c_float ),
    > >             ( 'y1', ctypes.c_float ) ]
    > >     x= alloc( ctypes.sizeof( Items ) )
    > >     c= ctypes.cast( mem+ x, ctypes.POINTER( Items ) ).contents
    > >     c.x1, c.y1= 2, 4
    > > The 'mem' variable is obtained from a call to PyObject_AsWriteBuffer.

    >
    > > Ex 3.
    > >     s= pickle.dumps( ( 2, 4, 6 ) )
    > >     x= alloc( len( s ) )
    > >     mem[ x: x+ len( s ) ]= s
    > > 'dumps' is still slow and nor does permit random access into contents.

    >
    > > Use Cases Revisited

    >
    > > Use Case 1: Hierarchical ElementTree-style data
    > > Solution: Dynamically allocate the tree and its elements.

    >
    > > Node: tag: a
    > > Node: tag: b
    > > Node: tag: c
    > > Node: text: Foo

    >
    > > The user wants to change "Foo" to "Foobar".

    >
    > > Node: tag: a
    > > Node: tag: b
    > > Node: tag: c
    > > Node: text: Foobar

    >
    > > Deallocate 'Node: text: Foo', allocate 'Node: text: Foobar', and store
    > > the new offset into 'Node: tag: c'.  Total writes 6 bytes 'foobar', a
    > > one-word offset, and approximatly 5- 10-word metadata update.

    >
    > > Use Case 2: Web session logger
    > > Dynamically allocate a linked list of data points.

    >
    > > Data: 'friendster.com'
    > > Data: 'My Account'

    >
    > > Allocate one block for each string, adding it to a linked list.  As
    > > listeners acknowledge each data point, remove it from the linked
    > > list.  Keep the head node in the 'root offset' metadata field.

    >
    > > Restrictions

    >
    > > It is not possible for persistent memory to refer to live memory.  Any
    > > objects it refers to must also be located in file.  Their mapped
    > > addresses must not be stored, only their offsets into it.  However,
    > > live references to persistent memory are eminently possible.

    >
    > > Current Status

    >
    > > A pure Python alloc-free implementation based on the GNU PAVL tree
    > > library is on Google Code.  It is only in proof-of-concept form and
    > > not commented, but does contain a first-pass test suite.  See:
    > >    http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk
    > > The ctypes solution for access is advised.

    >
    > You should review Zope's ZODB and/or memcached before putting in too much effort.
    >
    > -Larry


    Larry,

    I'd love to say they were exactly what I was looking for. They're
    not. I confess, I stopped reading ZODB when I got to the "uses
    pickles" part, and 'memcached' when I got to the awkward and unwieldy
    "SELECT FROM" part. I'm aware of both of those and my solution does
    something neither other does.
    castironpi, Sep 10, 2008
    #4
  5. castironpi

    castironpi Guest

    On Sep 9, 5:58 pm, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Tue, 09 Sep 2008 14:59:19 -0700, castironpi wrote:
    > > I will try my idea again.  I want to talk to people about a module I
    > > want to write and I will take the time to explain it.  I think it's a
    > > "cool idea" that a lot of people, forgiving the slang, could benefit
    > > from.  What are its flaws?

    >
    > [snip long description with not-very-credible use-cases]


    Steven,

    > You've created a solution to a problem which (probably) only affects a
    > very small number of people, at least judging by your use-cases. Who has
    > a 4GB XML file, and how much crack did they smoke?


    I judge from the existence of 'shelve' and 'pickle' modules, and
    relational database packages, that the problem I am addressing is not
    rare. It could be the millionaire investor across the street, the
    venture capitalist down the hall, or the guy with a huge CD catalog.

    > Castironpi, what do *you* use this proof-of-concept module for?


    Honestly, nothing yet. I just wrote it. My user community and
    customer base are very small. Originally, I wanted to store variable-
    length strings in a file, where shelves and databases were overkill.
    I created it for its beauty, sorry to disappoint.

    > Don't
    > bother tell us what you think *we* should use it for. Tell us what you're
    > using it for, or at least what somebody else is using it for. If this is
    > just a module that you think will be cool, I don't like your chances of
    > people caring. There is no shortage of "cool" software that isn't useful
    > for anything, and unlike eye-candy, nobody is going to use your module
    > just because they like the algorithm.


    Unfortunately, nobody is going to care about most of the uses I have
    for it 'til I have a job. I'm goofing around with a laptop,
    remembering when my databases professor kept dropping the ball on
    VARCHARs. If you want a sound byte, think, "imagine programming
    without 'new' and 'malloc'."

    > If you don't have an existing application for the software, then explain
    > what it does (not how) and give some idea of the performance ("it's alpha
    > and written in Python and really slow, but I will re-write it in C and
    > expect it to make a billion random accesses in a 10GB file per
    > millisecond", or whatever). You might be lucky and have somebody say
    > "Hey, that's just the tool I need to solve my problem!".


    I wrote a Rope implementation just to test drive it. It exceeded the
    native immutable string type at 2 megs. It used 'struct' instead of
    'ctypes', so that number could conceivably come down. I am intending
    to leave it in pure Python, so there.

    > --
    > Steven


    Pleasure chatting as always sir.
    castironpi, Sep 10, 2008
    #5
  6. On Sep 9, 5:59 pm, castironpi <> wrote:

    > I will try my idea again.  I want to talk to people about a
    > module I want to write and I will take the time to explain it.
    > I think it's a "cool idea" that a lot of people, forgiving the
    > slang, could benefit from.  
    >
    > (snipped)
    >
    > A pure Python alloc-free implementation based on the GNU PAVL
    > tree library is on Google Code.  It is only in proof-of-concept
    > form and not commented, but does contain a first-pass test
    > suite.  See:
    >    http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk


    So at best (i.e. if it actually makes any sense; I didn't read it),
    this is an ANNouncement of a pre-alpha piece of code. ANN posts rarely
    attract replies, even when they are about production/stable software.
    Thankfully, most people don't expect (let alone "require") readers to
    share their interest or enthusiasm by replying to the ANN. Given your
    past semi-coherent and incoherent posts, expecting people to jump on
    such a thread is a rather tall order.

    George
    George Sakkis, Sep 10, 2008
    #6
  7. Steven D'Aprano wrote:

    > You've created a solution to a problem which (probably) only affects a
    > very small number of people, at least judging by your use-cases. Who has
    > a 4GB XML file


    Getting 4GB XML files from, say, logging processes or databases that can
    render their output as XML is not that uncommon. They're usually
    record-oriented, and are intended to be processed as streams. And given
    the right tools, doing that is no harder than doing the same to a 4GB
    text file.

    </F>
    Fredrik Lundh, Sep 10, 2008
    #7
  8. On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:

    > Steven D'Aprano wrote:
    >
    >> You've created a solution to a problem which (probably) only affects a
    >> very small number of people, at least judging by your use-cases. Who
    >> has a 4GB XML file

    >
    > Getting 4GB XML files from, say, logging processes or databases that can
    > render their output as XML is not that uncommon. They're usually
    > record-oriented, and are intended to be processed as streams. And given
    > the right tools, doing that is no harder than doing the same to a 4GB
    > text file.



    Fair enough, that's a good point.

    But would you expect random access to a 4GB XML file? If I've understood
    what Castironpi is trying for, his primary use case was for people
    wanting exactly that.


    --
    Steven
    Steven D'Aprano, Sep 10, 2008
    #8
  9. On Sep 10, 5:24 am, Steven D'Aprano
    <> wrote:
    > On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
    > > Steven D'Aprano wrote:

    >
    > >> You've created a solution to a problem which (probably) only affects a
    > >> very small number of people, at least judging by your use-cases. Who
    > >> has a 4GB XML file

    >
    > > Getting 4GB XML files from, say, logging processes or databases that can
    > > render their output as XML is not that uncommon.  They're usually
    > > record-oriented, and are intended to be processed as streams.  And given
    > > the right tools, doing that is no harder than doing the same to a 4GB
    > > text file.

    >
    > Fair enough, that's a good point.
    >
    > But would you expect random access to a 4GB XML file? If I've understood
    > what Castironpi is trying for, his primary use case was for people
    > wanting exactly that.
    >
    > --
    > Steven


    Steven,

    Are you claiming that sequential storage is sufficient for small
    amounts of data, and relational db.s are necessary for large amounts?
    It's possible that there is only the fringe exception, in which case
    'alloc/free' aren't useful in the majority of cases, and will never
    win customers away from the more mature competition.

    Regardless, it is an elegant solution to the problem of storing
    variable-length strings, with hardly any practical value. Perfect for
    grad school.
    Aaron \Castironpi\ Brady, Sep 10, 2008
    #9
  10. On Sep 9, 10:03 pm, George Sakkis <> wrote:
    > On Sep 9, 5:59 pm, castironpi <> wrote:
    >
    > > I will try my idea again.  I want to talk to people about a
    > > module I want to write and I will take the time to explain it.
    > > I think it's a "cool idea" that a lot of people, forgiving the
    > > slang, could benefit from.  

    >
    > > (snipped)

    >
    > > A pure Python alloc-free implementation based on the GNU PAVL
    > > tree library is on Google Code.  It is only in proof-of-concept
    > > form and not commented, but does contain a first-pass test
    > > suite.  See:
    > >    http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk

    >
    > So at best (i.e. if it actually makes any sense; I didn't read it),
    > this is an ANNouncement of a pre-alpha piece of code. ANN posts rarely
    > attract replies, even when they are about production/stable software.
    > Thankfully, most people don't expect (let alone "require") readers to
    > share their interest or enthusiasm by replying to the ANN. Given your
    > past semi-coherent and incoherent posts, expecting people to jump on
    > such a thread is a rather tall order.
    >
    > George


    No, I'm just excited about it and want to share. I definitely think
    that discouragement of new ideas, successes, and personal expressions
    is a weakness that society has, and something that c-l-py is missing
    as well. I want to encourage them in other people so I will do it
    myself.

    As for the practicality of this module, I am definitely receiving
    skepticism from the group. Further, its feasibility is in question.

    For instance, no one has pointed out, and I only came across last
    night, that IPC synchronization is non-trivial and possibly platform-
    dependent. Of course it's a prerequisite for the advance of any IPC
    mod, so I'm glad I did not release an ANN.
    Aaron \Castironpi\ Brady, Sep 10, 2008
    #10
  11. castironpi

    Paul Boddie Guest

    On Sep 10, 5:03 am, George Sakkis <> wrote:
    >
    > So at best (i.e. if it actually makes any sense; I didn't read it),
    > this is an ANNouncement of a pre-alpha piece of code. ANN posts rarely
    > attract replies, even when they are about production/stable software.


    To be fair, at least some code has been provided here, unlike another
    recently announced piece of software whose author left a trail of
    promotion, referencing a Web site which doesn't even run the
    advertised software but promises to do so after the "first release":
    in other words, vapourware, as far as anyone following up on such
    promotion would be concerned. We aren't seeing anything like that
    here.

    > Thankfully, most people don't expect (let alone "require") readers to
    > share their interest or enthusiasm by replying to the ANN. Given your
    > past semi-coherent and incoherent posts, expecting people to jump on
    > such a thread is a rather tall order.


    I think you should give the benefit of the doubt here, particularly
    since we've been given a short review of related technologies (so it
    isn't as if the ideas are being picked out of thin air), and I think
    that such discussion isn't completely unproductive. Things like shared
    memory and memory-mapped files are often proposed as optimisations
    when building multiprocessing solutions, with solutions like POSH
    previously suggested on a frequent basis despite the lack of their
    further development, so maybe there is a need for something like this.

    Although systems like those developed for Tim Bray's "Wide Finder 2"
    exercise [1] probably work best by dumping raw data to disk - in other
    words, not channelling data between processes at all - there may well
    be a need for better data sharing between processes for some kinds of
    concurrent systems. Whether the ideas described here could help, or
    whether things like distributed filesystems already cover the same
    ground, I think it's still worth having the conversation.

    Of course, people shouldn't have to feel obliged to respond with
    encouragement to every announcement - there are plenty of projects I
    have no direct interest in - but I don't think people should respond
    with discouragement if the only basis for it is how they may have
    perceived the author's previous contributions to the mailing list or
    newsgroup.

    Paul

    P.S. Maybe those of a more judgemental disposition should peruse the
    activities in comp.lang.lisp to preserve a sense of perspective. Upon
    my last chance perusal of that newsgroup, I saw frequent mentions of
    at least one name familiar to comp.lang.python readers.

    [1] http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2
    Paul Boddie, Sep 11, 2008
    #11
  12. On Wed, 10 Sep 2008 11:59:35 -0700, Aaron \"Castironpi\" Brady wrote:

    > On Sep 10, 5:24 am, Steven D'Aprano
    > <> wrote:
    >> On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
    >> > Steven D'Aprano wrote:

    >>
    >> >> You've created a solution to a problem which (probably) only affects
    >> >> a very small number of people, at least judging by your use-cases.
    >> >> Who has a 4GB XML file

    >>
    >> > Getting 4GB XML files from, say, logging processes or databases that
    >> > can render their output as XML is not that uncommon.  They're usually
    >> > record-oriented, and are intended to be processed as streams.  And
    >> > given the right tools, doing that is no harder than doing the same to
    >> > a 4GB text file.

    >>
    >> Fair enough, that's a good point.
    >>
    >> But would you expect random access to a 4GB XML file? If I've
    >> understood what Castironpi is trying for, his primary use case was for
    >> people wanting exactly that.
    >>
    >> --
    >> Steven

    >
    > Steven,
    >
    > Are you claiming that sequential storage is sufficient for small amounts
    > of data, and relational db.s are necessary for large amounts?


    I'm no longer *claiming* anything, I'm *asking* whether random access to
    a 4GB XML file is something that is credible or useful. It is my
    understanding that XML is particularly ill-suited to random access once
    the amount of data is too large to fit in RAM.

    I'm interested in what Fredrik has to say about this, as he's the author
    of ElementTree.



    --
    Steven
    Steven D'Aprano, Sep 11, 2008
    #12
  13. Steven D'Aprano wrote:

    > I'm no longer *claiming* anything, I'm *asking* whether random access to
    > a 4GB XML file is something that is credible or useful. It is my
    > understanding that XML is particularly ill-suited to random access once
    > the amount of data is too large to fit in RAM.


    An XML file doesn't contain any indexing information, so random access
    to a large XML file is very inefficient. You can build (or precompute)
    index information and store in a separate file, of course, but that's
    hardly something that's useful in the general case.

    And as I said before, the only use case for *huge* XML files I've ever
    seen used in practice is to store large streams of record-style data;
    data that's intended to be consumed by sequential processes (and you can
    do a lot with sequential processing these days; for those interested in
    this, digging up a few review papers on "data stream processing" might
    be a good way to waste some time).

    Document-style XML usually fits into memory on modern machines;
    structures larger than that are usually split into different parts (e.g.
    using XInclude) and stored in a container file.

    Random *modifications* to an arbitrary XML file cannot be done, as long
    as you store the file in a standard file system. And if you invent your
    own format, it's no longer an XML file.

    </F>
    Fredrik Lundh, Sep 11, 2008
    #13
  14. castironpi

    Paul Boddie Guest

    On 11 Sep, 10:34, Fredrik Lundh <> wrote:
    >
    > And as I said before, the only use case for *huge* XML files I've ever
    > seen used in practice is to store large streams of record-style data;


    I can imagine that the manipulation of the persistent form of large
    graph structures might be another use case, although for efficient
    navigation of such a structure, which is what you'd need to start
    applying various graph algorithms, one would need some kind of index.
    Certainly, we're straying into database territory.

    Paul
    Paul Boddie, Sep 11, 2008
    #14
  15. On Sep 11, 2:40 am, Steven D'Aprano
    <> wrote:
    > On Wed, 10 Sep 2008 11:59:35 -0700, Aaron \"Castironpi\" Brady wrote:
    > > On Sep 10, 5:24 am, Steven D'Aprano
    > > <> wrote:
    > >> On Wed, 10 Sep 2008 09:26:20 +0200, Fredrik Lundh wrote:
    > >> > Steven D'Aprano wrote:

    >
    > >> >> You've created a solution to a problem which (probably) only affects
    > >> >> a very small number of people, at least judging by your use-cases.
    > >> >> Who has a 4GB XML file

    >
    > >> > Getting 4GB XML files from, say, logging processes or databases that
    > >> > can render their output as XML is not that uncommon.  They're usually
    > >> > record-oriented, and are intended to be processed as streams.  And
    > >> > given the right tools, doing that is no harder than doing the same to
    > >> > a 4GB text file.

    >
    > >> Fair enough, that's a good point.

    >
    > >> But would you expect random access to a 4GB XML file? If I've
    > >> understood what Castironpi is trying for, his primary use case was for
    > >> people wanting exactly that.

    >
    > >> --
    > >> Steven

    >
    > > Steven,

    >
    > > Are you claiming that sequential storage is sufficient for small amounts
    > > of data, and relational db.s are necessary for large amounts?

    >
    > I'm no longer *claiming* anything, I'm *asking* whether random access to
    > a 4GB XML file is something that is credible or useful. It is my
    > understanding that XML is particularly ill-suited to random access once
    > the amount of data is too large to fit in RAM.
    >
    > I'm interested in what Fredrik has to say about this, as he's the author
    > of ElementTree.
    >
    > --
    > Steven


    XML is the wrong word for the example I was thinking of (as was
    already pointed out in another thread). XML is by definition
    sequential. The use case pertained to a generic element hierarchy;
    think of 4GB of hierarchical data.
    Aaron \Castironpi\ Brady, Sep 11, 2008
    #15
  16. On Sep 11, 5:35 am, Paul Boddie <> wrote:
    > On 11 Sep, 10:34, Fredrik Lundh <> wrote:
    >
    >
    >
    > > And as I said before, the only use case for *huge* XML files I've ever
    > > seen used in practice is to store large streams of record-style data;

    >
    > I can imagine that the manipulation of the persistent form of large
    > graph structures might be another use case, although for efficient
    > navigation of such a structure, which is what you'd need to start
    > applying various graph algorithms, one would need some kind of index.
    > Certainly, we're straying into database territory.
    >
    > Paul


    An acquaintance suggests that defragmentation would be a useful
    service to provide along with memory management too, which also
    requires an index.

    I encourage overlap between a bare-bones alloc/free module and
    established database territory and I'm very aware of it.

    Databases already support both concurrency and persistence, but don't
    tell me you'd use a database for IPC. And don't tell me you've never
    wished you had a reference to a record in a table so that you could
    make an update just by changing one word of memory at the right
    place. Sometimes databases are overkill where all you want is dynamic
    allocation.
    Aaron \Castironpi\ Brady, Sep 11, 2008
    #16
  17. castironpi

    Paul Boddie Guest

    On 11 Sep, 19:31, "Aaron \"Castironpi\" Brady" <>
    wrote:
    >
    > An acquaintance suggests that defragmentation would be a useful
    > service to provide along with memory management too, which also
    > requires an index.


    I presume that you mean efficient access to large amounts of data in
    the sense that if all the data you want happens to be in the same page
    or segment, then retrieving it is much more efficient than having to
    seek around for all the different pieces. So the defragmentation would
    be what they call clustering in a relational database context:

    http://www.postgresql.org/docs/8.3/static/sql-cluster.html

    I've seen similar phenomena outside the relational database world,
    notably with big Lucene indexes which wouldn't fit in memory in their
    entirety.

    > I encourage overlap between a bare-bones alloc/free module and
    > established database territory and I'm very aware of it.
    >
    > Databases already support both concurrency and persistence, but don't
    > tell me you'd use a database for IPC.


    Of course, databases are widely used in scalable systems to hold
    central state, which is why there's a lot of effort put into to not
    only scaling up database installations, but also into things like
    caching which are supposed to save the database systems behind popular
    Web applications from excessive load.

    > And don't tell me you've never
    > wished you had a reference to a record in a table so that you could
    > make an update just by changing one word of memory at the right
    > place. Sometimes databases are overkill where all you want is dynamic
    > allocation.


    I think that the challenge is to reduce an abstract operation (for
    example, wanting to update a particular column in a particular record)
    to its measurable effects (this word of memory/disk will change as a
    consequence). It's easy for a human with a reasonable knowledge of,
    say, a relational database system to anticipate such things, but to
    actually collapse a number of layers through some kind of generic
    optimisation process is a lot more difficult.

    Paul
    Paul Boddie, Sep 11, 2008
    #17
  18. On Thu, 11 Sep 2008 10:20:41 -0700, Aaron \"Castironpi\" Brady wrote:

    > XML is the wrong word for the example I was thinking of (as was already
    > pointed out in another thread). XML is by definition sequential.


    I'm pretty sure you're wrong. XML can be used for serialization, but that
    doesn't mean it is only sequential data. XML is suitable for hierarchical
    data too. To quote Wikipedia:

    "As long as only well-formedness is required, XML is a generic framework
    for storing any amount of text or any data whose structure can be
    represented as a tree. The only indispensable syntactical requirement is
    that the document has exactly one root element (alternatively called the
    document element)."

    http://en.wikipedia.org/wiki/Xml




    --
    Steven
    Steven D'Aprano, Sep 12, 2008
    #18
  19. On Sep 11, 10:37 pm, Steven D'Aprano
    <> wrote:
    > On Thu, 11 Sep 2008 10:20:41 -0700, Aaron \"Castironpi\" Brady wrote:
    > > XML is the wrong word for the example I was thinking of (as was already
    > > pointed out in another thread).  XML is by definition sequential.

    >
    > I'm pretty sure you're wrong. XML can be used for serialization, but that
    > doesn't mean it is only sequential data. XML is suitable for hierarchical
    > data too. To quote Wikipedia:
    >
    > "As long as only well-formedness is required, XML is a generic framework
    > for storing any amount of text or any data whose structure can be
    > represented as a tree. The only indispensable syntactical requirement is
    > that the document has exactly one root element (alternatively called the
    > document element)."
    >
    > http://en.wikipedia.org/wiki/Xml
    >
    > --
    > Steven


    That's my choice of words at work again, I'm afraid. What I mean is,
    there is no possibility that you can correctly interpret a segment of
    XML text without knowing certain facts about everything that precedes
    it. Compare to the case of a fixed-length record file, of record size
    say 20, where you know the meaning of the characters in offset ranges
    20-40, 80-100, 500020-500040, etc.

    To clarify the point of the use case in question, because data would
    be allocated and located dynamically, its possible that you could read
    the first several words, then not need anything until say, the 1KB
    mark. (Unless you're somehow storing an offset in to an XML string as
    a value in the string, which would require composing it, leaving room
    for that value, and then writing it with random access anyway.) There
    can be gaps in a dynamically managed buffer--- say the unused/free
    bytes from offsets 200 to 220, but every byte that follows another in
    an XML file follows it in the file's meaning too. Is this any
    clearer?

    Aaron
    Aaron \Castironpi\ Brady, Sep 12, 2008
    #19
  20. On Thu, 11 Sep 2008 22:40:01 -0700, Dennis Lee Bieber wrote:

    > On 12 Sep 2008 03:37:51 GMT, Steven D'Aprano
    > <> declaimed the following in
    > comp.lang.python:
    >
    >
    >> I'm pretty sure you're wrong. XML can be used for serialization, but
    >> that doesn't mean it is only sequential data. XML is suitable for
    >> hierarchical data too. To quote Wikipedia:
    >>

    > There is a difference between the format of the data content, and
    > the processing of that data... Regardless of the content, one
    > essentially has to process the XML /file/ sequentially, and translate
    > into an in-memory model that allows for accessing said data. To reach
    > the nth subelement of the mth element requires reading all 1..m-1
    > elements, followed by all 1..n-1 subelements in m. Modifying any element
    > requires rewriting the entire file.


    Which is why I previously said that XML was not well suited for random
    access.

    I think we're starting to be sucked into a vortex of obtuse and opaque
    communication. We agree that XML can store hierarchical data, and that it
    has to be read and written sequentially, and that whatever the merits of
    castironpi's software, his original use-case of random access to a 4GB
    XML file isn't workable. Yes?



    --
    Steven
    Steven D'Aprano, Sep 12, 2008
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Ken
    Replies:
    24
    Views:
    3,832
    Ben Bacarisse
    Nov 30, 2006
  2. chris
    Replies:
    6
    Views:
    971
    chris
    Oct 28, 2005
  3. He Shiming
    Replies:
    0
    Views:
    282
    He Shiming
    Aug 1, 2006
  4. Kok How Teh

    dynamic buffer allocation at char buf[1]

    Kok How Teh, Apr 10, 2010, in forum: C Programming
    Replies:
    22
    Views:
    1,452
    Phil Carmody
    Apr 13, 2010
  5. Bjarke Hammersholt Roune
    Replies:
    14
    Views:
    1,167
    Bjarke Hammersholt Roune
    Mar 6, 2011
Loading...

Share This Page