xml + mmap cross

C

castironpi

Any interest in pursuing/developing/working together on a mmaped-xml
class? Faster, not readable in text editor.
 
S

Stefan Behnel

castironpi said:
Any interest in pursuing/developing/working together on a mmaped-xml
class? Faster, not readable in text editor.

Any hints on what you are talking about?

Stefan
 
C

castironpi

Any hints on what you are talking about?

Stefan

Nice to hear from you. I assumed you were familiar with the problem;
you're not.

In an XML file, entries are stored in serial, sort of like this.

AAA BBB CCC DDD

Or more recognizably,

<A><B><C>something</C><D>something</D></B></A>

Point is, to change <C>something</C> to <C>something else</C>, you
have to recopy everything after that.

AAA BBB CCC DDD
AAA BBBb CCC DDD

requires 7 writes, 'b CCC DDD', not 1.

I want to use a simple tree structure to store:

0 A-> None, 1
1 B-> None, 2
2 C-> 3, None
3 D-> None, None

Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
First Child'.

You get constant time updates to contents, and log-time searches.

There was a similar problem today in:

From: Gerhard Häring <[email protected]>
Date: Thu, 04 Sep 2008 13:08:59 +0200
Subject: Re: cPickle

The OP wanted to update the third element in a pickled tuple, but not
the first two.

I propose to write a tree structure to a memory-mapped file. A
heavyweight string class, Rope, I wrote, exceeded native string speeds
at a file size of two megs. You could use that, or store the tree
directly.

The obstacle is probably mmap 'alloc' and 'free' routines, which I
posted on Google Code.
 
A

alex23

Any interest in pursuing/developing/working together on a mmaped-xml
class?  Faster, not readable in text editor.

XML is text-based, so it should -always- be readable in a text editor.
It's part of the definition, I believe.

However, an implementation of one of the alternative binary XML
formats would probably be very welcome.

Fast Infoset: http://www.itu.int/rec/T-REC-X.891-200505-I/en
EXI: http://www.w3.org/TR/2007/WD-exi-20070716/

I don't know enough about either format to say if it would be
possible, but an implementation that conformed to the ElementTree API
could be a big win.
 
C

castironpi

XML is text-based, so it should -always- be readable in a text editor.
It's part of the definition, I believe.

However, an implementation of one of the alternative binary XML
formats would probably be very welcome.

Fast Infoset:http://www.itu.int/rec/T-REC-X.891-200505-I/en
EXI:http://www.w3.org/TR/2007/WD-exi-20070716/

I don't know enough about either format to say if it would be
possible, but an implementation that conformed to the ElementTree API
could be a big win.

I was thinking something much less restrictive than the two links.
Since it's not text, I'm not sure it event counts as structured
markup. More generic, something like hierarchical 'tag-content-child'
pairs.

Here's what the xml.etree.ElementTree API says:

Each element has a number of properties associated with it:

- a tag which is a string identifying what kind of data this element
represents (the element type, in other words).
- a number of attributes, stored in a Python dictionary.
- a text string.
- an optional tail string.
- a number of child elements, stored in a Python sequence

Since all of these would be buffer-based representations, the
attribute list would merely implement the mapping-object protocol, not
be in a true dictionary. The strings would be stored as offsets to
length-prefixed buffer segments.

Each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

These are all integers representing offsets elsewhere into the map.

A short observation:
a= e.XML( '<a><b>abc</b></a>' )
a.getchildren()[0].text 'abc'
a.getchildren()[0].text= 'ab<'
e.tostring(a)
' said:
_.getchildren()[0].text
'ab<'

The current implementation supports round trips between special
characters '<' and markup '&lt;', which I propose to support as well.

Of course, you'd have to garbage collect removed nodes by hand, on any
deletions.

Also, poss. change subject to: ElementTree + mmap cross.
 
S

Stefan Behnel

Hi,

this discussion seems pretty much off-topic for a Python list.
In an XML file, entries are stored in serial, sort of like this.

AAA BBB CCC DDD

Or more recognizably,

<A><B><C>something</C><D>something</D></B></A>

Point is, to change <C>something</C> to <C>something else</C>, you
have to recopy everything after that.

AAA BBB CCC DDD
AAA BBBb CCC DDD

requires 7 writes, 'b CCC DDD', not 1.

I want to use a simple tree structure to store:

0 A-> None, 1
1 B-> None, 2
2 C-> 3, None
3 D-> None, None

Each node maps to 'Next, Child', or more accurately, 'Next Sibling,
First Child'.

Do I understand you right: you want to access serialised XML data as a memory
mapped file and operate directly on the byte data? You would still have to
decode the complete byte sequence to parse it and build your index structure
in that case.

How do you plan to store the pointers to a node's next sibling/child? And how
do you keep them updated over insertions/deletions? That, plus the copy
overhead in a sequential file, will be very costly on each change.

You get constant time updates to contents, and log-time searches.

Every XML tree structure gives you log-time searches. But how do you achieve
constant time updates in a sequential file?

Stefan
 
C

castironpi

Hi,

this discussion seems pretty much off-topic for a Python list.











Do I understand you right: you want to access serialised XML data

alex23 correctly pointed out last night that XML is always
serialized. What I mean and possibly what you mean is, a serialized
tree structure.
as a memory
mapped file and operate directly on the byte data?

Yes. Byte data containing both strings, and the structure of the
tree.
You would still have to
decode the complete byte sequence to parse it and build your index structure
in that case.

No, it's saved in an index structure; it's already in one. To find a/
b/c, read a, read its children to b, read b, read its children to c,
read c.
How do you plan to store the pointers to a node's next sibling/child? And how
do you keep them updated over insertions/deletions?

You update them when you insert them. To add 'c' to a/b, allocate c,
initialize c, read a, read its children to b, read b, read its
children to the end, add c.

When you delete c from a/b/c, however, any references to c that you
have in any programs become invalid. Don't delete it if you have
them. The bytes are marked in the file to be no longer in use, which
marking takes up some of the space in the file.
That, plus the copy
overhead in a sequential file, will be very costly on each change.

No. That's the point of a dynamic structure. <abc><def> are not
stored in memory as 10 consecutive characters. The file is not
strictly speaking XML. See below for what they're stored as.
Every XML tree structure gives you log-time searches. But how do you achieve
constant time updates in a sequential file?

You don't use a sequential file.

I stated earlier that each node would look roughly like:
tag_offset, first_attr, text_offset, tail_offset, first_child,
prev_sibling, next_sibling, parent

Attributes would look like:
key_offset, value_offset, prev_attr, next_attr, node

All these fields are integers that contain an offset into the file.

Simplified:

0 Reserved
1 A.Tag 7 (points to 7 below)
2 A.FirstChild 4 (etc.)
3 A.Contents 0 (None)
4 B.Tag 11
5 B.FirstChild 0
6 B.Contents 15
7 3abc
11 3def
15 5ghijk

This translates to:

<abc><def>ghijk</def></abc>

But isn't stored that way.

The records are all the same size. The 'Tag' field of a record that
starts at offset J is at offset J. The 'FirstChild' field of a record
that starts at offset K is at offset K+ 1.

'tag_offset', 'text_offset', 'key_offset', and 'value_offset' contain
offsets of variable-length strings into the file ( 7, 11, 15 ). The
rest contain offsets of further structures.

There's extra space usage not only in the structure of the tree, but
in the record-keeping of what bytes are available for use, and which
are in use. You have to grow the file size yourself (like growing an
array) when you need to; the file system won't for you in mmap. (This
means the alloc-free module I'm looking at will need modifications.)

I'll reemphasize the value of constant-time insertions to a file
though.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,431
Messages
2,571,677
Members
48,796
Latest member
Greg L.

Latest Threads

Top