dynamic allocation file buffer

Aaron "Castironpi" Brady

Which is why I previously said that XML was not well suited for random
access.

I think we're starting to be sucked into a vortex of obtuse and opaque
communication. We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?

By 'isn't workable' do you mean, "no one ever uses 4GB of XML", or "no
one ever uses 4GB of hierarchical data, period"?
 

Paul Boddie

Which is why I previously said that XML was not well suited for random
access.

Maybe not. A consideration of other storage formats such as HDF5 might
be appropriate:

http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html

There are, of course, HDF5 tools available for Python.
I think we're starting to be sucked into a vortex of obtuse and opaque
communication.

I don't know about that. I'm managing to keep up with the discussion.
We agree that XML can store hierarchical data, and that it
has to be read and written sequentially, and that whatever the merits of
castironpi's software, his original use-case of random access to a 4GB
XML file isn't workable. Yes?

Again, XML specifically might not be workable for random access in a
serialised form, despite people's best efforts at processing it in
various unconventional ways, but that doesn't mean that random access
to a 4GB file containing hierarchical data isn't possible, so I
suppose it depends on whether he is wedded to the idea of using
vanilla XML or not. It's always worth exploring the available
alternatives before embarking on a challenging project, unless one
wants to pursue the exercise as a learning experience, and I therefore
suggest investigating whether HDF5 doesn't already solve at least some
of the problems or use-cases stated in this discussion.
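
Paul's point, that random access into serialised hierarchical data is possible even where XML isn't, can be illustrated with a small sketch. This is purely illustrative and not any particular package's format: nodes are fixed-size records holding a value plus the file offsets of their first child and next sibling, so any node is reachable with a single seek regardless of how large the file grows.

```python
import struct
import io

# Each node is a fixed-size record: a 32-bit value, the 32-bit offset of the
# first child, and the 32-bit offset of the next sibling (0 means "none").
REC = struct.Struct("<iII")

def write_node(buf, value, first_child=0, next_sibling=0):
    """Append one record and return its offset."""
    offset = buf.tell()
    buf.write(REC.pack(value, first_child, next_sibling))
    return offset

def read_node(buf, offset):
    """Seek directly to a record; O(1) regardless of file size."""
    buf.seek(offset)
    value, child, sibling = REC.unpack(buf.read(REC.size))
    return value, child, sibling

buf = io.BytesIO()
buf.write(b"HIER")  # 4-byte header, so offset 0 stays free as the "none" sentinel

# Build a tiny tree, root -> (leaf_a, leaf_b). Children are written first so
# their offsets are known when the parent record is packed.
leaf_b = write_node(buf, 30)
leaf_a = write_node(buf, 20, next_sibling=leaf_b)
root = write_node(buf, 10, first_child=leaf_a)

value, child, _ = read_node(buf, root)  # jump straight to the root
```

In this layout a 4GB file is no harder to navigate than a 4KB one; only the records actually visited are ever read.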

Paul
 

Aaron "Castironpi" Brady

Maybe not.

No, it's not. Element trees are, which is what I should have just said
originally...
A consideration of other storage formats such as HDF5 might
be appropriate:

http://hdf.ncsa.uiuc.edu/HDF5/whatishdf5.html

There are, of course, HDF5 tools available for Python.

PyTables came up within the past few weeks on the list.

"When the file is created, the metadata in the object tree is updated
in memory while the actual data is saved to disk. When you close the
file the object tree is no longer available. However, when you reopen
this file the object tree will be reconstructed in memory from the
metadata on disk...."

This is different from what I had in mind, but the extremity depends
on how slow the 'reconstructed in memory' step is. (From
http://www.pytables.org/docs/manual/ch01.html#id2506782 ). The
counterexample would be needing random access into multiple data
files, which don't all fit in memory at once, but the maturity of the
package might outweigh that. Reconstruction will form a bottleneck
anyway.
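
The lazy-reconstruction idea can be sketched independently of PyTables (every name below is invented for illustration, not PyTables' actual API): a small index of offsets is parsed eagerly when the file is opened, and each node's payload is only read, with a seek, when it is first accessed.

```python
import io
import json
import struct

# Hypothetical layout: a 4-byte length prefix, a JSON index mapping
# name -> (offset, length), then the node payloads appended in order.

def save(nodes):
    """Serialise a flat dict of nodes into the hypothetical layout."""
    payloads = {k: json.dumps(v).encode() for k, v in nodes.items()}
    index, pos = {}, 0
    for name, blob in payloads.items():
        index[name] = (pos, len(blob))
        pos += len(blob)
    header = json.dumps(index).encode()
    return struct.pack("<I", len(header)) + header + b"".join(payloads.values())

class LazyTree:
    """Reopening parses only the small index; payloads load on demand."""
    def __init__(self, raw):
        self._buf = io.BytesIO(raw)
        (hlen,) = struct.unpack("<I", self._buf.read(4))
        self._index = json.loads(self._buf.read(hlen))  # eager, but tiny
        self._base = 4 + hlen
        self._cache = {}

    def __getitem__(self, name):  # lazy, one seek per first access
        if name not in self._cache:
            off, length = self._index[name]
            self._buf.seek(self._base + off)
            self._cache[name] = json.loads(self._buf.read(length))
        return self._cache[name]

tree = LazyTree(save({"a": [1, 2], "b": {"x": 3}}))
```

With this scheme the "reconstruction" cost on reopen is proportional to the index, not to the data, which is the point of doing it lazily.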
I don't know about that. I'm managing to keep up with the discussion.

I could renege that bid and talk about a 4MB file, where recopying is
prohibitively expensive and so random access is needed, thereby
requiring an alternative to XML.
Again, XML specifically might not be workable for random access in a
serialised form, despite people's best efforts at processing it in
various unconventional ways, but that doesn't mean that random access
to a 4GB file containing hierarchical data isn't possible, so I
suppose it depends on whether he is wedded to the idea of using
vanilla XML or not.

No. It is always nice to be able to scroll through your data, but
it's much less common to be able to scroll through a data -structure-.
(Which is part of the reason data structures are hard to design.)
It's always worth exploring the available
alternatives before embarking on a challenging project, unless one
wants to pursue the exercise as a learning experience, and I therefore
suggest investigating whether HDF5 doesn't already solve at least some
of the problems or use-cases stated in this discussion.

The potential for concurrency is definitely one benefit of raw
alloc/free management, and a requirement I was setting out to program
directly for. There is a multi-threaded version of HDF5 but
interprocess communication is unsupported.

"This version serializes the API suitable for use in a multi-threaded
application but does not provide any level of concurrency."

From: http://www.hdfgroup.uiuc.edu/papers/features/mthdf/

(It is always appreciated to find a statement of what a product does
not do.)

There is an updated statement of the problem on the project website:

http://code.google.com/p/pymmapstruct/source/browse/trunk/pymmapstruct.txt

I don't have numbers for my claim that the abstraction layers in SQL,
including string construction and parsing, are ever a bottleneck or
limiting factor, even though it seems intuitive that they would be.
Until I get those numbers, maybe I should leave that allegation out.

Compared to the complexity of all these other packages (Zope,
memcached, HDF5/PyTables), alloc and free are almost looking like they
should become methods on a subclass of the builtin buffer type. Ha!
(Ducks.) They're beyond dangerous compared to the snuggly feeling of
Python, though, so maybe they could belong in ctypes.
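
A toy version of that alloc/free-on-a-buffer idea can be written with nothing but the stdlib `mmap` module. The class and method names here are invented for this sketch, not pymmapstruct's actual API, and a real allocator would coalesce adjacent holes:

```python
import mmap

# A toy first-fit allocator over an anonymous mmap region, in the spirit of
# giving Python code raw alloc/free over a flat buffer.
class MmapArena:
    def __init__(self, size=1 << 16):
        self.mem = mmap.mmap(-1, size)    # anonymous mapping
        self.free_list = [(0, size)]      # (offset, length) holes

    def alloc(self, n):
        """Return the offset of the first hole big enough for n bytes."""
        for i, (off, length) in enumerate(self.free_list):
            if length >= n:
                if length == n:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (off + n, length - n)
                return off
        raise MemoryError("arena exhausted")

    def free(self, off, n):
        # Real allocators coalesce neighbouring holes; this sketch
        # just records the hole for reuse.
        self.free_list.append((off, n))

    def write(self, off, data):
        self.mem[off:off + len(data)] = data

    def read(self, off, n):
        return bytes(self.mem[off:off + n])

arena = MmapArena()
p = arena.alloc(5)
arena.write(p, b"hello")
```

Because `mmap` regions can be backed by a file and shared between processes, a layout like this is one route to the interprocess story that HDF5 doesn't provide, though all the hard problems (locking, coalescing, corruption on crash) start exactly here.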

Aaron
 

Francesc

PyTables came up within the past few weeks on the list.

"When the file is created, the metadata in the object tree is updated
in memory while the actual data is saved to disk. When you close the
file the object tree is no longer available. However, when you reopen
this file the object tree will be reconstructed in memory from the
metadata on disk...."

This is different from what I had in mind, but the extremity depends
on how slow the 'reconstructed in memory' step is. (From
http://www.pytables.org/docs/manual/ch01.html#id2506782 ). The
counterexample would be needing random access into multiple data
files, which don't all fit in memory at once, but the maturity of the
package might outweigh that. Reconstruction will form a bottleneck
anyway.

Hmm, this was part of the documentation that needed to be updated.
Now, the object tree is reconstructed in a lazy way (i.e. on-demand),
in order to avoid the bottleneck that you mentioned. I have corrected
the docs in:

http://www.pytables.org/trac/changeset/3714/trunk

Thanks for (indirectly ;-) bringing this to my attention,

Francesc
 

Aaron "Castironpi" Brady

Hmm, this was part of the documentation that needed to be updated.
Now, the object tree is reconstructed in a lazy way (i.e. on-demand),
in order to avoid the bottleneck that you mentioned.  I have corrected
the docs in:

http://www.pytables.org/trac/changeset/3714/trunk

Thanks for (indirectly ;-) bringing this to my attention,

Francesc

Depending on how lazy the reconstruction is, would it be possible to
modify separate tables from separate processes concurrently?
 

Francesc

Depending on how lazy the reconstruction is, would it be possible to
modify separate tables from separate processes concurrently?

No, modification of different tables in the same file simultaneously
is not supported yet. This is a limitation of the HDF5 library
itself. The HDF Group said that they have plans to address this, but
this is probably a long-term task.

Francesc
 
