dynamic allocation file buffer

castironpi

I will try my idea again. I want to talk to people about a module I
want to write and I will take the time to explain it. I think it's a
"cool idea" that a lot of people, forgiving the slang, could benefit
from. What are its flaws?

A user has a file he is using either 1/ to persist binary data after
the run of a single program (persistence) or 2/ to share binary data
between concurrently running programs (IPC / shared memory). The data
are records of variable types and lengths that can change over time.
He wants to change a record that's already present in the file. Here
are two examples.

Use Case 1: Hierarchical ElementTree-style data

A user has an XML file like the one shown here.

<a>
  <b>
    <c>Foo</c>
  </b>
  ...

He wants to change "Foo" to "Foobar".

<a>
  <b>
    <c>Foobar</c>
  </b>
  ...

The change he wants to make is at the beginning of a 4GB file, and
recopying the remainder is an unacceptable resource drain.

Use Case 2: Web session logger

A tutor application has written a plugin to a webbrowser that records
the order of a user's mouse and keyboard activity during a browsing
session, and makes them concurrently available to other applications
in a suite, which are written in varying languages. The user takes
some action, such as surfing to a site or clicking on a link. The
browser plugin records that sequence into shared memory, where it is
marked as acknowledged by the listener programs, and recycled back
into an unused block. URLs, user inputs, and link text can be of any
length, so truncating them to fit a fixed length is not an option.

Existing Solutions

- Shelve - A Python Standard Library shelf object can store a random
access dictionary mapping strings to pickled objects. It does not
provide for hierarchical data stores, and objects must be unpickled
before they can be examined.
- Relational Database - Separate tables of nodes, attributes, and
text, plus the relations between them, are slow and unwieldy for
reproducing the contents of a dynamic structure. The VARCHAR data type
still carries a maximum size, making it no more flexible than
fixed-length records.
- POSH - Python Object Sharing - A module currently in its alpha stage
promises to make it possible to store Python objects directly in
shared memory. In its current form, its only entry point is 'fork',
and it does not offer persistence, only sharing. See:
http://poshmodule.sourceforge.net/

Dynamic Allocation

The traditional solution, dynamic memory allocation, is to maintain a
metadata list of "free blocks" that are available to write to. See:
http://en.wikipedia.org/wiki/Dynamic_memory_allocation
http://en.wikipedia.org/wiki/Malloc
http://en.wikipedia.org/wiki/Mmap
http://en.wikipedia.org/wiki/Memory_leak
The catch, and the crux of the proposal, is that the metadata must be
stored in shared memory along with the data themselves. Assuming they
are, a program can acquire the offset of an unused block of a
sufficient size for its data, then write it to the file at that
offset. The metadata can maintain the offset of one root member, to
serve as a 'table of contents' or header for the remainder of the
file. It can be grown and reassigned as needed.
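For concreteness, here is a minimal pure-Python sketch of that interface, with
the free-block metadata kept in the file itself. The 'alloc', 'free', and 'mem'
names follow the proposal; everything else (a first-fit singly linked free
list, a one-word size header per block, and the file name 'store.bin', assumed
to already exist at its final size) is an illustrative assumption, not the
PAVL-based design mentioned under Current Status below.

    import mmap, struct

    WORD = struct.Struct('<I')          # one 32-bit word
    FREE = struct.Struct('<II')         # free-block header: (block size, next free offset)
    FREE_HEAD, ROOT, HEADER = 0, 4, 8   # file header: free-list head word, root-offset word

    f = open('store.bin', 'r+b')
    mem = mmap.mmap(f.fileno(), 0)      # shared, writable view of the whole file

    def format_store():
        """Initialise an empty store: one free block spanning the rest of the file."""
        WORD.pack_into(mem, FREE_HEAD, HEADER)
        WORD.pack_into(mem, ROOT, 0)
        FREE.pack_into(mem, HEADER, len(mem) - HEADER, 0)

    def alloc(size):
        """First fit: return the offset of an unused block of at least 'size' bytes."""
        need = max(size, FREE.size - WORD.size) + WORD.size   # leave room for a size word
        prev, cur = FREE_HEAD, WORD.unpack_from(mem, FREE_HEAD)[0]
        while cur:
            blksize, nxt = FREE.unpack_from(mem, cur)
            if blksize >= need:
                if blksize - need >= FREE.size:     # split off the unused tail
                    FREE.pack_into(mem, cur + need, blksize - need, nxt)
                    nxt = cur + need
                else:
                    need = blksize                  # too small to split; take it all
                WORD.pack_into(mem, prev, nxt)      # unlink the block from the free list
                WORD.pack_into(mem, cur, need)      # record the block size for free()
                return cur + WORD.size
            prev, cur = cur + WORD.size, nxt        # the 'next' field sits at cur + 4
        raise MemoryError('no free block of %d bytes' % size)

    def free(offset):
        """Return the block at 'offset' to the head of the free list."""
        blksize = WORD.unpack_from(mem, offset - WORD.size)[0]
        head = WORD.unpack_from(mem, FREE_HEAD)[0]
        FREE.pack_into(mem, offset - WORD.size, blksize, head)
        WORD.pack_into(mem, FREE_HEAD, offset - WORD.size)

    # format_store()   # run once on a brand-new file

The examples below are written against this same 'alloc'/'mem' interface.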

An acquaintance writes: It could be quite useful for highly concurrent
systems: the overhead involved with interprocess communication can be
overwhelming, and something more flexible than normal object
persistence to disk might be worth having.

Python Applicability

The usual problems with data persistence and sharing apply. The
format of the external data is only established conventionally, and
conversions between Python objects and raw memory bytes take the usual
overhead. 'struct.Struct', 'ctypes.Structure', and 'pickle.Pickler'
currently offer this functionality, and the buffer offset obtained
from 'alloc' can be used with all three.

Ex 1.
    s = struct.Struct('III')
    x = alloc(s.size)                  # one block holding all three fields
    s.pack_into(mem, x, 2, 4, 6)
Struct in its current form does not permit random access into
structure contents; a user must read or write the entire converted
structure in order to update one field. Alternative:
    s = struct.Struct('I')
    x1, x2, x3 = alloc(s.size), alloc(s.size), alloc(s.size)
    s.pack_into(mem, x1, 2)            # each field gets its own block...
    s.pack_into(mem, x2, 4)
    s.pack_into(mem, x3, 6)            # ...so any one can be rewritten alone
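Reading a single field back works the same way; a minimal follow-on sketch,
assuming the 'mem' buffer and the per-field offsets from the alternative above:

    value, = s.unpack_from(mem, x2)    # read one field without touching the others (-> 4)
    s.pack_into(mem, x2, 40)           # rewrite just that field in place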

Ex 2.
    import ctypes

    class Items(ctypes.Structure):
        _fields_ = [
            ('x1', ctypes.c_float),
            ('y1', ctypes.c_float)]
    x = alloc(ctypes.sizeof(Items))
    c = ctypes.cast(mem + x, ctypes.POINTER(Items)).contents
    c.x1, c.y1 = 2, 4
Here 'mem' is the buffer's base address, obtained at the C level from a
call to PyObject_AsWriteBuffer.
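With an mmap-based 'mem' such as the one in the allocator sketch above, the
same effect is available from pure Python: 'pack_into' can take the mmap
object directly, while ctypes wants the buffer's base address. A sketch,
reusing the 'Items' class and offset 'x' from Ex 2:

    base = ctypes.addressof(ctypes.c_char.from_buffer(mem))  # base address of the writable buffer
    c = ctypes.cast(base + x, ctypes.POINTER(Items)).contents
    c.x1, c.y1 = 2, 4                  # the writes land directly in the mapped file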

Ex 3.
    import pickle

    s = pickle.dumps((2, 4, 6))
    x = alloc(len(s))
    mem[x : x + len(s)] = s
'dumps' is still slow, and it does not permit random access into the
pickled contents.

Use Cases Revisited

Use Case 1: Hierarchical ElementTree-style data
Solution: Dynamically allocate the tree and its elements.

Node: tag: a
Node: tag: b
Node: tag: c
Node: text: Foo

The user wants to change "Foo" to "Foobar".

Node: tag: a
Node: tag: b
Node: tag: c
Node: text: Foobar

Deallocate 'Node: text: Foo', allocate 'Node: text: Foobar', and store
the new offset into 'Node: tag: c'. Total writes: six bytes ('Foobar'), a
one-word offset, and an approximately 5- to 10-word metadata update.
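A rough illustration of that update, reusing the 'alloc'/'free'/'WORD' sketch
above. 'c_text_slot', the word inside node c that holds the text block's
offset, is a hypothetical name, and each text block is assumed to be stored as
a length word followed by its bytes:

    new = b'Foobar'
    old = WORD.unpack_from(mem, c_text_slot)[0]   # offset of the 'Foo' block
    free(old)                                     # recycle it
    x = alloc(WORD.size + len(new))               # room for a length word plus the text
    WORD.pack_into(mem, x, len(new))
    mem[x + WORD.size : x + WORD.size + len(new)] = new
    WORD.pack_into(mem, c_text_slot, x)           # point node c at the new block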

Use Case 2: Web session logger
Solution: Dynamically allocate a linked list of data points.

Data: 'friendster.com'
Data: 'My Account'

Allocate one block for each string, adding it to a linked list. As
listeners acknowledge each data point, remove it from the linked
list. Keep the head node in the 'root offset' metadata field.
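A sketch of that scheme, again reusing the allocator sketch above; each node is
a (next offset, length) header followed by the string bytes, with the list head
kept in the ROOT word. It prepends rather than appends for brevity, and the
interprocess locking that real shared-memory use would need is not shown:

    NODE = struct.Struct('<II')        # (offset of next node, length of the data)

    def push(data):
        """Prepend a data point to the in-file linked list."""
        x = alloc(NODE.size + len(data))
        head = WORD.unpack_from(mem, ROOT)[0]
        NODE.pack_into(mem, x, head, len(data))
        mem[x + NODE.size : x + NODE.size + len(data)] = data
        WORD.pack_into(mem, ROOT, x)               # new head in the 'root offset' field

    def pop_acknowledged():
        """Remove the head node once listeners have acknowledged it; return its data."""
        x = WORD.unpack_from(mem, ROOT)[0]
        if not x:
            return None
        nxt, length = NODE.unpack_from(mem, x)
        data = bytes(mem[x + NODE.size : x + NODE.size + length])
        WORD.pack_into(mem, ROOT, nxt)
        free(x)                                    # recycle the block
        return data

    push(b'friendster.com')
    push(b'My Account')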

Restrictions

It is not possible for persistent memory to refer to live memory. Any
objects it refers to must also be located in the file. Their mapped
addresses must not be stored, only their offsets into the file. However,
live references to persistent memory are eminently possible.

Current Status

A pure Python alloc-free implementation based on the GNU PAVL tree
library is on Google Code. It is only in proof-of-concept form and
not commented, but does contain a first-pass test suite. See:
http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk
The ctypes solution for access is advised.
 

Larry Bates

castironpi said:
[snip original proposal]

You should review Zope's ZODB and/or memcached before putting in too much effort.

-Larry
 

Steven D'Aprano

I will try my idea again. I want to talk to people about a module I
want to write and I will take the time to explain it. I think it's a
"cool idea" that a lot of people, forgiving the slang, could benefit
from. What are its flaws?

[snip long description with not-very-credible use-cases]

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who has
a 4GB XML file, and how much crack did they smoke?

Castironpi, what do *you* use this proof-of-concept module for? Don't
bother telling us what you think *we* should use it for. Tell us what you're
using it for, or at least what somebody else is using it for. If this is
just a module that you think will be cool, I don't like your chances of
people caring. There is no shortage of "cool" software that isn't useful
for anything, and unlike eye-candy, nobody is going to use your module
just because they like the algorithm.

If you don't have an existing application for the software, then explain
what it does (not how) and give some idea of the performance ("it's alpha
and written in Python and really slow, but I will re-write it in C and
expect it to make a billion random accesses in a 10GB file per
millisecond", or whatever). You might be lucky and have somebody say
"Hey, that's just the tool I need to solve my problem!".
 

castironpi

castironpi said:
[snip original proposal]

You should review Zope's ZODB and/or memcached before putting in too much effort.

-Larry

Larry,

I'd love to say they were exactly what I was looking for. They're
not. I confess, I stopped reading ZODB when I got to the "uses
pickles" part, and 'memcached' when I got to the awkward and unwieldy
"SELECT FROM" part. I'm aware of both of those and my solution does
something neither other does.
 

castironpi

I will try my idea again.  I want to talk to people about a module I
want to write and I will take the time to explain it.  I think it's a
"cool idea" that a lot of people, forgiving the slang, could benefit
from.  What are its flaws?

[snip long description with not-very-credible use-cases]
Steven,

You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who has
a 4GB XML file, and how much crack did they smoke?

I judge from the existence of 'shelve' and 'pickle' modules, and
relational database packages, that the problem I am addressing is not
rare. It could be the millionaire investor across the street, the
venture capitalist down the hall, or the guy with a huge CD catalog.
Castironpi, what do *you* use this proof-of-concept module for?

Honestly, nothing yet. I just wrote it. My user community and
customer base are very small. Originally, I wanted to store variable-
length strings in a file, where shelves and databases were overkill.
I created it for its beauty, sorry to disappoint.
Don't
bother telling us what you think *we* should use it for. Tell us what you're
using it for, or at least what somebody else is using it for. If this is
just a module that you think will be cool, I don't like your chances of
people caring. There is no shortage of "cool" software that isn't useful
for anything, and unlike eye-candy, nobody is going to use your module
just because they like the algorithm.

Unfortunately, nobody is going to care about most of the uses I have
for it 'til I have a job. I'm goofing around with a laptop,
remembering when my databases professor kept dropping the ball on
VARCHARs. If you want a sound byte, think, "imagine programming
without 'new' and 'malloc'."
If you don't have an existing application for the software, then explain
what it does (not how) and give some idea of the performance ("it's alpha
and written in Python and really slow, but I will re-write it in C and
expect it to make a billion random accesses in a 10GB file per
millisecond", or whatever). You might be lucky and have somebody say
"Hey, that's just the tool I need to solve my problem!".

I wrote a Rope implementation just to test-drive it. It outperformed the
native immutable string type at around 2 megs. It used 'struct' instead of
'ctypes', so that number could conceivably come down. I am intending
to leave it in pure Python, so there.

Pleasure chatting as always sir.
 

George Sakkis

I will try my idea again.  I want to talk to people about a
module I want to write and I will take the time to explain it.
 I think it's a "cool idea" that a lot of people, forgiving the
slang, could benefit from.  

(snipped)

A pure Python alloc-free implementation based on the GNU PAVL
tree library is on Google Code.  It is only in proof-of-concept
form and not commented, but does contain a first-pass test
suite.  See:
   http://code.google.com/p/pymmapstruct/source/browse/#svn/trunk

So at best (i.e. if it actually makes any sense; I didn't read it),
this is an ANNouncement of a pre-alpha piece of code. ANN posts rarely
attract replies, even when they are about production/stable software.
Thankfully, most people don't expect (let alone "require") readers to
share their interest or enthusiasm by replying to the ANN. Given your
past semi-coherent and incoherent posts, expecting people to jump on
such a thread is a rather tall order.

George
 

Fredrik Lundh

Steven said:
You've created a solution to a problem which (probably) only affects a
very small number of people, at least judging by your use-cases. Who has
a 4GB XML file

Getting 4GB XML files from, say, logging processes or databases that can
render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And given
the right tools, doing that is no harder than doing the same to a 4GB
text file.

</F>
 

Steven D'Aprano

Getting 4GB XML files from, say, logging processes or databases that can
render their output as XML is not that uncommon. They're usually
record-oriented, and are intended to be processed as streams. And given
the right tools, doing that is no harder than doing the same to a 4GB
text file.


Fair enough, that's a good point.

But would you expect random access to a 4GB XML file? If I've understood
what Castironpi is trying for, his primary use case was for people
wanting exactly that.
 

Aaron \Castironpi\ Brady

Fair enough, that's a good point.

But would you expect random access to a 4GB XML file? If I've understood
what Castironpi is trying for, his primary use case was for people
wanting exactly that.

Steven,

Are you claiming that sequential storage is sufficient for small
amounts of data, and relational db.s are necessary for large amounts?
It's possible that there is only the fringe exception, in which case
'alloc/free' aren't useful in the majority of cases, and will never
win customers away from the more mature competition.

Regardless, it is an elegant solution to the problem of storing
variable-length strings, with hardly any practical value. Perfect for
grad school.
 

Aaron \Castironpi\ Brady

So at best (i.e. if it actually makes any sense; I didn't read it),
this is an ANNouncement of a pre-alpha piece of code. ANN posts rarely
attract replies, even when they are about production/stable software.
Thankfully, most people don't expect (let alone "require") readers to
share their interest or enthusiasm by replying to the ANN. Given your
past semi-coherent and incoherent posts, expecting people to jump on
such a thread is a rather tall order.

George

No, I'm just excited about it and want to share. I definitely think
that discouragement of new ideas, successes, and personal expressions
is a weakness that society has; encouragement of them is something
c-l-py is missing as well. I want to encourage them in other people,
so I will do it myself.

As for the practicality of this module, I am definitely receiving
skepticism from the group. Further, its feasibility is in question.

For instance, no one has pointed out, and I only came across last
night, that IPC synchronization is non-trivial and possibly platform-
dependent. Of course it's a prerequisite for the advance of any IPC
mod, so I'm glad I did not release an ANN.
 

Paul Boddie

So at best (i.e. if it actually makes any sense; I didn't read it),
this is an ANNouncement of a pre-alpha piece of code. ANN posts rarely
attract replies, even when they are about production/stable software.

To be fair, at least some code has been provided here, unlike another
recently announced piece of software whose author left a trail of
promotion, referencing a Web site which doesn't even run the
advertised software but promises to do so after the "first release":
in other words, vapourware, as far as anyone following up on such
promotion would be concerned. We aren't seeing anything like that
here.
Thankfully, most people don't expect (let alone "require") readers to
share their interest or enthusiasm by replying to the ANN. Given your
past semi-coherent and incoherent posts, expecting people to jump on
such a thread is a rather tall order.

I think you should give the benefit of the doubt here, particularly
since we've been given a short review of related technologies (so it
isn't as if the ideas are being picked out of thin air), and I think
that such discussion isn't completely unproductive. Things like shared
memory and memory-mapped files are often proposed as optimisations
when building multiprocessing solutions, with solutions like POSH
previously suggested on a frequent basis despite the lack of their
further development, so maybe there is a need for something like this.

Although systems like those developed for Tim Bray's "Wide Finder 2"
exercise [1] probably work best by dumping raw data to disk - in other
words, not channelling data between processes at all - there may well
be a need for better data sharing between processes for some kinds of
concurrent systems. Whether the ideas described here could help, or
whether things like distributed filesystems already cover the same
ground, I think it's still worth having the conversation.

Of course, people shouldn't have to feel obliged to respond with
encouragement to every announcement - there are plenty of projects I
have no direct interest in - but I don't think people should respond
with discouragement if the only basis for it is how they may have
perceived the author's previous contributions to the mailing list or
newsgroup.

Paul

P.S. Maybe those of a more judgemental disposition should peruse the
activities in comp.lang.lisp to preserve a sense of perspective. Upon
my last chance perusal of that newsgroup, I saw frequent mentions of
at least one name familiar to comp.lang.python readers.

[1] http://www.tbray.org/ongoing/When/200x/2008/05/01/Wide-Finder-2
 

Steven D'Aprano

Steven,

Are you claiming that sequential storage is sufficient for small amounts
of data, and relational db.s are necessary for large amounts?

I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.

I'm interested in what Fredrik has to say about this, as he's the author
of ElementTree.
 

Fredrik Lundh

Steven said:
I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.

An XML file doesn't contain any indexing information, so random access
to a large XML file is very inefficient. You can build (or precompute)
index information and store in a separate file, of course, but that's
hardly something that's useful in the general case.

And as I said before, the only use case for *huge* XML files I've ever
seen used in practice is to store large streams of record-style data;
data that's intended to be consumed by sequential processes (and you can
do a lot with sequential processing these days; for those interested in
this, digging up a few review papers on "data stream processing" might
be a good way to waste some time).

Document-style XML usually fits into memory on modern machines;
structures larger than that are usually split into different parts (e.g.
using XInclude) and stored in a container file.

Random *modifications* to an arbitrary XML file cannot be done, as long
as you store the file in a standard file system. And if you invent your
own format, it's no longer an XML file.

</F>
 

Paul Boddie

And as I said before, the only use case for *huge* XML files I've ever
seen used in practice is to store large streams of record-style data;

I can imagine that the manipulation of the persistent form of large
graph structures might be another use case, although for efficient
navigation of such a structure, which is what you'd need to start
applying various graph algorithms, one would need some kind of index.
Certainly, we're straying into database territory.

Paul
 

Aaron \Castironpi\ Brady

I'm no longer *claiming* anything, I'm *asking* whether random access to
a 4GB XML file is something that is credible or useful. It is my
understanding that XML is particularly ill-suited to random access once
the amount of data is too large to fit in RAM.

I'm interested in what Fredrik has to say about this, as he's the author
of ElementTree.

XML is the wrong word for the example I was thinking of (as was
already pointed out in another thread). XML is by definition
sequential. The use case pertained to a generic element hierarchy;
think of 4GB of hierarchical data.
 

Aaron \Castironpi\ Brady

I can imagine that the manipulation of the persistent form of large
graph structures might be another use case, although for efficient
navigation of such a structure, which is what you'd need to start
applying various graph algorithms, one would need some kind of index.
Certainly, we're straying into database territory.

Paul

An acquaintance suggests that defragmentation would be a useful
service to provide along with memory management too, which also
requires an index.

I encourage overlap between a bare-bones alloc/free module and
established database territory and I'm very aware of it.

Databases already support both concurrency and persistence, but don't
tell me you'd use a database for IPC. And don't tell me you've never
wished you had a reference to a record in a table so that you could
make an update just by changing one word of memory at the right
place. Sometimes databases are overkill where all you want is dynamic
allocation.
 

Paul Boddie

An acquaintance suggests that defragmentation would be a useful
service to provide along with memory management too, which also
requires an index.

I presume that you mean efficient access to large amounts of data in
the sense that if all the data you want happens to be in the same page
or segment, then retrieving it is much more efficient than having to
seek around for all the different pieces. So the defragmentation would
be what they call clustering in a relational database context:

http://www.postgresql.org/docs/8.3/static/sql-cluster.html

I've seen similar phenomena outside the relational database world,
notably with big Lucene indexes which wouldn't fit in memory in their
entirety.
I encourage overlap between a bare-bones alloc/free module and
established database territory and I'm very aware of it.

Databases already support both concurrency and persistence, but don't
tell me you'd use a database for IPC.

Of course, databases are widely used in scalable systems to hold
central state, which is why there's a lot of effort put into to not
only scaling up database installations, but also into things like
caching which are supposed to save the database systems behind popular
Web applications from excessive load.
And don't tell me you've never
wished you had a reference to a record in a table so that you could
make an update just by changing one word of memory at the right
place. Sometimes databases are overkill where all you want is dynamic
allocation.

I think that the challenge is to reduce an abstract operation (for
example, wanting to update a particular column in a particular record)
to its measurable effects (this word of memory/disk will change as a
consequence). It's easy for a human with a reasonable knowledge of,
say, a relational database system to anticipate such things, but to
actually collapse a number of layers through some kind of generic
optimisation process is a lot more difficult.

Paul
 

Steven D'Aprano

XML is the wrong word for the example I was thinking of (as was already
pointed out in another thread). XML is by definition sequential.

I'm pretty sure you're wrong. XML can be used for serialization, but that
doesn't mean it is only sequential data. XML is suitable for hierarchical
data too. To quote Wikipedia:

"As long as only well-formedness is required, XML is a generic framework
for storing any amount of text or any data whose structure can be
represented as a tree. The only indispensable syntactical requirement is
that the document has exactly one root element (alternatively called the
document element)."

http://en.wikipedia.org/wiki/Xml
 

Aaron \Castironpi\ Brady

I'm pretty sure you're wrong. XML can be used for serialization, but that
doesn't mean it is only sequential data. XML is suitable for hierarchical
data too. To quote Wikipedia:

"As long as only well-formedness is required, XML is a generic framework
for storing any amount of text or any data whose structure can be
represented as a tree. The only indispensable syntactical requirement is
that the document has exactly one root element (alternatively called the
document element)."

http://en.wikipedia.org/wiki/Xml

That's my choice of words at work again, I'm afraid. What I mean is,
there is no possibility that you can correctly interpret a segment of
XML text without knowing certain facts about everything that precedes
it. Compare to the case of a fixed-length record file, of record size
say 20, where you know the meaning of the characters in offset ranges
20-40, 80-100, 500020-500040, etc.

To clarify the point of the use case in question: because data would
be allocated and located dynamically, it's possible that you could read
the first several words, then not need anything until, say, the 1KB
mark. (Unless you're somehow storing an offset into an XML string as
a value in the string, which would require composing it, leaving room
for that value, and then writing it with random access anyway.) There
can be gaps in a dynamically managed buffer (say, the unused/free
bytes from offsets 200 to 220), but every byte that follows another in
an XML file also follows it in the file's meaning. Is this any
clearer?
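A tiny sketch of the fixed-length-record contrast: with a known record size,
the nth record can be read or rewritten with plain offset arithmetic, nothing
else in the file moving. 'records.dat' is a hypothetical pre-existing file of
20-byte records:

    RECSIZE = 20
    with open('records.dat', 'r+b') as f:
        f.seek(4 * RECSIZE)                        # record 4 lives at bytes 80-100
        rec = f.read(RECSIZE)
        f.seek(4 * RECSIZE)
        f.write(rec[:10] + b'0123456789')          # rewrite it in place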

Aaron
 

Steven D'Aprano

There is a difference between the format of the data content, and
the processing of that data... Regardless of the content, one
essentially has to process the XML /file/ sequentially, and translate
into an in-memory model that allows for accessing said data. To reach
the nth subelement of the mth element requires reading all 1..m-1
elements, followed by all 1..n-1 subelements in m. Modifying any element
requires rewriting the entire file.

Which is why I previously said that XML was not well suited for random
access.

I think we're starting to be sucked into a vortex of obtuse and opaque
communication. We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?
 
