"archive" data formats


Ivan Shmakov

[Cross-posting to for the reasons below.
Feel free to drop if inappropriate.]
I want to use a tar file like an IBM partitioned dataset, i. e., a
file with multiple members, from a C program.
There are plenty of data formats allowing for such a use. Did you
consider SQLite [1] or HDF5 [2]? Or even GDBM [3]?
[...]
If it's octet sequences instead, SQLite BLOBs [4] could be the
way to go.
[1] http://sqlite.org/
[2] http://www.hdfgroup.org/HDF5/
[3] http://www.gnu.org.ua/software/gdbm/
[4] http://sqlite.org/c3ref/blob.html
Thanks, Ivan, that'll all work, too. The data's more like TLOBs
(text large objects), with each "record" a small program, or more
often plain English text, in its own file.

SQLite seems to fit such a description nicely. Consider, e. g.:

CREATE TABLE "file" (
  id    INTEGER PRIMARY KEY,
  name  TEXT NOT NULL,
  text  TEXT NOT NULL);

-- ensure that names are unique
CREATE UNIQUE INDEX "file-unique"
  ON "file" ("name");

-- @file-get name
SELECT "text" FROM "file"
  WHERE "name" = ?1;

-- @file-put name text
INSERT INTO "file" ("name", "text")
  VALUES (?1, ?2);

-- @file-replace name text
UPDATE "file"
  SET "text" = ?2
  WHERE "name" = ?1;

AIUI, SQLite has strong support for static linking, even at the
"source level" (the whole library ships as a single-file sqlite3.c
"amalgamation" that can be compiled straight into the application),
which could be important for one wishing to keep the number of
dependencies low.
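
To make the intended use a bit more concrete, here is a minimal sketch
of driving the @file-get query above from C through SQLite's public API
(sqlite3_open(), sqlite3_prepare_v2(), sqlite3_bind_text(),
sqlite3_step(), sqlite3_column_text()). Most error handling is
trimmed, and the database file name "files.db" and member name
"hello.txt" are just placeholders for the example:

/* Look up the "text" column for a given "name" in the schema above.
 * Minimal sketch; most error handling omitted. */
#include <stdio.h>
#include <sqlite3.h>

static int
file_get(sqlite3 *db, const char *name)
{
    sqlite3_stmt *stmt;
    int rc = sqlite3_prepare_v2(db,
        "SELECT \"text\" FROM \"file\" WHERE \"name\" = ?1;",
        -1, &stmt, NULL);
    if (rc != SQLITE_OK)
        return rc;
    sqlite3_bind_text(stmt, 1, name, -1, SQLITE_STATIC);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        fputs((const char *) sqlite3_column_text(stmt, 0), stdout);
    return sqlite3_finalize(stmt);
}

int
main(void)
{
    sqlite3 *db;
    if (sqlite3_open("files.db", &db) != SQLITE_OK)
        return 1;
    file_get(db, "hello.txt");
    sqlite3_close(db);
    return 0;
}

Building against the amalgamation is then just a matter of compiling
sqlite3.c alongside this file (typically plus -ldl and -lpthread on
Linux, depending on the compile options).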
Then some set of those is collected and assembled. A first small
test application (just to exercise the code) is listings, e. g.,
www.forkosh.com, then click "Alps" under Sample Code (I'd give you a
direct deep link, but you'll see it's a really long constructed link,
passing lots of query_string attributes on to the CGI program, still
under construction, that I've been talking about).
The more important application will be algorithmically collecting
snippets of boilerplate text and constructing complete documents
"according to spec".

The above somehow reminds me of XML (the model, if not the
representation), and the associated "tools": XInclude, XPath,
XSLT and XQuery. And there's Fast Infoset for a space- and
time-efficient XML representation, BTW.

The use of XML to encode the structure of the data (and code)
being stored could bring a level of consistency, but depending
on the task, it may be too much pain for too little gain.
 

JohnF

Ivan Shmakov said:
JohnF said:
I want to use a tar file like an IBM partitioned dataset, i. e., a
file with multiple members, from a C program.
There are plenty of data formats allowing for such a use. Did you
consider SQLite [1] or HDF5 [2]? Or even GDBM [3]?
[...]
If it's octet sequences instead, SQLite BLOBs [4] could be the
way to go.
[1] http://sqlite.org/
[2] http://www.hdfgroup.org/HDF5/
[3] http://www.gnu.org.ua/software/gdbm/
[4] http://sqlite.org/c3ref/blob.html
Thanks, Ivan, that'll all work, too. The data's more like TLOBs
(text large objects), with each "record" a small program, or more
often plain English text, in its own file.

SQLite seems to fit such a description nicely. Consider, e. g.:

CREATE TABLE "file" (
  id    INTEGER PRIMARY KEY,
  name  TEXT NOT NULL,
  text  TEXT NOT NULL);

-- ensure that names are unique
CREATE UNIQUE INDEX "file-unique"
  ON "file" ("name");

-- @file-get name
SELECT "text" FROM "file"
  WHERE "name" = ?1;

-- @file-put name text
INSERT INTO "file" ("name", "text")
  VALUES (?1, ?2);

-- @file-replace name text
UPDATE "file"
  SET "text" = ?2
  WHERE "name" = ?1;

AIUI, SQLite has strong support for static linking, even at the
"source level" (the whole library ships as a single-file sqlite3.c
"amalgamation" that can be compiled straight into the application),
which could be important for one wishing to keep the number of
dependencies low.

The problem with all non-popen()-type solutions is redesigning/rewriting
existing code that fgets() (for the "r" side) and fputs() (for "w").
I've already replaced fopen() with myfopen() and fclose() with myfclose()
(not their real names) to transparently access files across the net --
if the requested file starts with http://, myfopen() popen()'s wget to read
the file; otherwise it just fopen()'s it as usual (and there's no "w" side
for that yet, either).
So all the hooks already exist to transparently popen() tar if the
requested filename indicates a tar file. Very easy change; no logic
affected whatsoever. Not so with anything else. Zip, etc., instead of
tar is zero extra effort. MySQL, etc., instead of tar becomes a job.
Moreover, the volume of requests is insignificant, so
efficiency/overhead/whatever is completely irrelevant. And finally,
I see no real functional advantage to MySQL, etc., visible to
the end user (unless maybe, at some future time, e.g., text snippets
are described by keys); it just complicates the internals.
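
For what it's worth, here is a rough sketch of that dispatch. The
names are hypothetical (as JohnF says his real ones are), and the
"archive.tar:member" naming convention is likewise made up for the
example. It assumes GNU tar's -O (--to-stdout) to extract one member
to stdout, and wget -q -O - to write a fetched file to stdout:

#define _POSIX_C_SOURCE 200809L   /* for popen() */
#include <stdio.h>
#include <string.h>

/* Hypothetical myfopen(): read-only ("r") dispatch.
 *   http://...          -> popen() wget and read its stdout
 *   archive.tar:member  -> popen() GNU tar --to-stdout
 *   anything else       -> plain fopen()
 * Streams opened via popen() must be closed with pclose(), so a
 * real myfclose() would have to remember which case applied. */
FILE *
myfopen(const char *name, const char *mode)
{
    char cmd[4096];
    const char *sep;

    if (strncmp(name, "http://", 7) == 0) {
        snprintf(cmd, sizeof cmd, "wget -q -O - '%s'", name);
        return popen(cmd, "r");
    }
    sep = strstr(name, ".tar:");
    if (sep != NULL) {
        snprintf(cmd, sizeof cmd, "tar -xOf '%.*s' '%s'",
                 (int) (sep + 4 - name), name, sep + 5);
        return popen(cmd, "r");
    }
    return fopen(name, mode);
}

Swapping in something like unzip -p for .zip archives would indeed be
the kind of one-line change described above.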
The above somehow reminds me of XML (the model, if not the
representation), and the associated "tools": XInclude, XPath,
XSLT and XQuery. And there's Fast Infoset for a space- and
time-efficient XML representation, BTW.

The use of XML to encode the structure of the data (and code)
being stored could bring a level of consistency, but depending
on the task, it may be too much pain for too little gain.

That's what it sounds like (too much/too little), though I'm not
familiar enough with XML and friends to be entirely positive.
Modulo the original tar question, a satisfactory solution already
exists. And it's not clear what future functionality might or might
not be required/desired. Maybe none. I'm quite happy treating the
current code as a prototype to discover whether some additional
functionality turns out to be required. But immediately pushing the
design/internals too far ahead of the curve vis-a-vis the currently
necessary (and reasonably anticipated) functionality doesn't seem
wise to me. Look how far this thread has diverged from the original
tar question. That's (one way) how projects fail.
 
