Interesting design question involving ZIPs and servers


gfrommer

Hello Everyone,

I'm writing a report server that uses servlets and JSPs running
on Apache Tomcat. The reports are XML files that users can submit, and
they are stored on the server's file system. Since there will be many, many
reports entered into this server, to save on space I load each
completed XML report into a byte[] buffer (running the XML text
through a ZipOutputStream). So all completed reports are stored in
memory as compressed zip buffers until a report is needed. Each
byte[] buffer is contained in a ReportMetaData class along with other
metadata describing the report.

When the report is requested from the ReportMetaData object, the XML
text is unzipped, then parsed in my ReportXMLParser class to give me
the final ReportObject. Any thoughts on my memory management technique?
Is there a more efficient way of doing this?
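
For the curious, the compression step is basically this (simplified; the
entry name is just a placeholder):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Compress one report's XML text into an in-memory zip buffer.
// One entry per report; "report.xml" is an arbitrary name.
public static byte[] compressReport(String xmlText) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    ZipOutputStream zip = new ZipOutputStream(buffer);
    zip.putNextEntry(new ZipEntry("report.xml"));
    zip.write(xmlText.getBytes("UTF-8"));
    zip.closeEntry();
    zip.close(); // finishes the zip format and flushes into the buffer
    return buffer.toByteArray();
}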

OK... So, the problem is: I want to be able to do FREE-FORM TEXT
searches on all of my completed reports. How can I search a zipped
buffer for a specific text string without having to uncompress the
buffer (ruining my memory savings)? I could store text keywords with my
metadata object, but that's not really free-form... and that's what the
users are really hoping for.
Thanks,

Greg Frommer
(e-mail address removed)
 

Gregory Toomey

[snipped: original post quoted in full]

In simple terms ... you're nuts.

gtoomey
 

joeking

Hello Everyone,
[snipped description of zipped XMLs...]
When the report is requested from the ReportMetaData object, the XML
text is unzipped, then parsed in my ReportXMLParser class to give me
the final ReportObject. Any thoughts on my memory management technique?
Is there a more efficient way of doing this?

Almost certainly.

Consider that XML files are usually full of repetition.
Forget about the text, think about the DOM structure. It
contains numerous elements which are of the same type. It
will often contain numerous attributes of the same type,
and often with recurring values. This type of data is
ripe for tokenization. Consider a compression algorithm
which would actually take an XML DOM and serialize it to a
binary stream in which elements, common attributes and
common values are all lookups in a table of text values.

The interesting thing about such a scheme is that, because it is
compressing the DOM representation (as opposed to a plain
text XML version), you can store the hierarchy as part of
the output, making it a lot easier (and more efficient) to
restore it to its original DOM object form.
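
A rough sketch of the writer half, just to show the shape of it
(untested; DomTokenizer is a made-up name, and text nodes plus the
string-table output are left out for brevity):

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DomTokenizer {
    // shared string table: each distinct name/value gets a small integer
    private final Map<String, Integer> table = new HashMap<>();

    private int tokenFor(String s) {
        Integer t = table.get(s);
        if (t == null) {
            t = table.size();
            table.put(s, t);
        }
        return t;
    }

    // An element is written as: tagToken, attrCount, (nameToken, valueToken)*,
    // childElementCount, then the child elements themselves. Because the
    // hierarchy is part of the stream, rebuilding the DOM is a single pass.
    public void writeElement(Element e, DataOutputStream out) throws IOException {
        out.writeInt(tokenFor(e.getTagName()));
        NamedNodeMap attrs = e.getAttributes();
        out.writeInt(attrs.getLength());
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            out.writeInt(tokenFor(a.getNodeName()));
            out.writeInt(tokenFor(a.getNodeValue())); // recurring values tokenize well
        }
        NodeList kids = e.getChildNodes();
        int elementKids = 0;
        for (int i = 0; i < kids.getLength(); i++)
            if (kids.item(i).getNodeType() == Node.ELEMENT_NODE) elementKids++;
        out.writeInt(elementKids);
        for (int i = 0; i < kids.getLength(); i++)
            if (kids.item(i).getNodeType() == Node.ELEMENT_NODE)
                writeElement((Element) kids.item(i), out);
    }
}

(Using full ints for tokens is wasteful; a real version would use a
variable-length encoding, and would write the string table out alongside
the token stream.)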

OK... So, the problem is: I want to be able to do FREE-FORM TEXT
searches on all of my completed reports. How can I search a zipped
buffer for a specific text string without having to uncompress the
buffer (ruining my memory savings)? I could store text keywords with my
metadata object, but that's not really free-form... and that's what the
users are really hoping for.
Thanks,

If you mean free-form as in the text data nodes in a DOM,
then you can write something which scans over your compressed
DOM looking for these sections, uncompresses them (assuming
you bothered to compress the text data nodes) and searches
them. Remember, the 'compressed' version is just a heavily
optimised version of the DOM object structure.

If you mean free-form as in treating the XML as a plain text
document, so search terms may include bits of markup, then
you can't - not without the original XML document.


-FISH- ><>
 

Chris Uppal

When the report is requested from the ReportMetaData object, the XML
text is unzipped, then parsed in my ReportXMLParser class to give me
the final ReportObject. Any thoughts on my memory management technique?

It doesn't scale. Unless you /know/ that you can only ever have a (fairly
tiny) number of (fairly small) reports, sooner or later you will run out
of memory. What you want is a /cache/ of the reports, so that you can access
(some of) them efficiently, but in a limited amount of space. As it happens,
operating-systems (and disk sub-systems) are good at caching frequently-used
data from files, so I would recommend that you give up on the idea of doing
your own caching in addition, and just read the data from file as you need it.
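
I.e. loading a report on demand becomes something as simple as this
(just a sketch, using the standard JAXP parser):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Read-on-demand: no byte[] buffers held in the servlet at all.
// The OS file cache keeps recently-used reports in RAM for you.
public static Document loadReport(File reportFile) throws Exception {
    return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(reportFile);
}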

It /may/ be that you need greater performance than the OS can provide without
assistance, but doing your own caching is a non-trivial design exercise. It's
difficult even to measure the effects.

OK... So, the problem is, I want to be able to do FREE-FORM TEXT
searches on all of my completed reports. How can I search a zipped
buffer for a specific text string without having to uncompress the
buffer (ruining my memory savings)?

Assuming that you have decided that you really /need/ the cache, and that
you've implemented it, and it really /does/ give you the performance gain you
needed, this should be easy enough. But first you should consider whether the
overall performance gain from being able to keep more reports in the cache (by
compressing them) is greater or less than the performance loss of having to
compress/decompress them all the time. If it /is/ quicker to compress them
than not (which is perfectly possible), then you don't need to decompress all
of every report just to search. The compressing/decompressing streams do
incremental (de)compression, so you would only have /part/ of just /one/ report
decompressed at any one time as you were searching.
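
Untested, but the incremental search would look roughly like this (it
assumes one zip entry per report, as you described, and that a search
term never straddles a line break):

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.ZipInputStream;

// Decompresses line by line, so only a small window of one report
// is ever held in decompressed form.
public static boolean reportContains(byte[] zipped, String term) throws IOException {
    ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipped));
    zip.getNextEntry(); // position the stream at the single XML entry
    BufferedReader reader = new BufferedReader(new InputStreamReader(zip, "UTF-8"));
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.indexOf(term) >= 0) {
                return true;
            }
        }
        return false;
    } finally {
        reader.close();
    }
}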

-- chris
 

gfrommer

It doesn't scale. Unless you /know/ that you can only ever have a (fairly
tiny) number of (fairly small) reports, sooner or later you will run out
of memory. What you want is a /cache/ of the reports, so that you can access
(some of) them efficiently, but in a limited amount of space. As it happens,
operating-systems (and disk sub-systems) are good at caching frequently-used
data from files, so I would recommend that you give up on the idea of doing
your own caching in addition, and just read the data from file as you
need it.

Thanks for the comments. Well, so there are two options that I have
with my program. 1.) Load all of the completed reports into memory at
startup; either just the XML Text, XML Text compressed with ZIP, or my
final ReportObject (the object version of the parsed XML Text).

or 2.) Keep all the completed reports offline on disk and read/parse
the XML report when it's needed.

The only thing is, when a user does a keyword search of all completed
reports, the program will have to scan/load EVERY report file on disk
looking for those keywords. So there is no pre-loading in this scheme,
but every report file still has to be opened/scanned each time a search
is requested. This seems very inefficient, but I suppose it's as much
work as decompressing a zipped byte buffer each time.

Thanks
 

Chris Uppal

or 2.) Keep all the completed reports offline on disk and read/parse
the XML report when it's needed.

The only thing is, when a user does a keyword search of all completed
reports, the program will have to scan/load EVERY report file on disk
looking for those keywords. So there is no pre-loading in this scheme,
but every report file still has to be opened/scanned each time a search
is requested.

Well, either all the data fits in memory or it doesn't. If it does (and you
/know/ that will still be true in <N> months/years time), then there's really
nothing to worry about. If you use a cache in your program then the
data will be held in memory. If you read the data off-disk on demand then the
data will /still/ be held in memory (by the OS). It's only if there is too
much data to fit in RAM that disk access (obviously much slower) becomes an
issue.

Using compression only allows you to fit a modest factor (x2 or x4, say) more
data into RAM before you /have/ to put stuff on-disk. So it's almost certain
that compression won't be enough to avoid the crunch (though it's always worth
investigating whether compressing data on-disk will save you more in I/O
bandwidth than it loses in extra load on the CPU). So what you need is
probably some sort of on-disk index of the data that is rich enough to allow
you to perform free-text searches without scanning every byte of every report.

You might find that a free-text search engine such as Jakarta Lucene:

http://jakarta.apache.org/lucene/

would be useful here (I should have thought to mention that before).

Lucene is only one of many alternatives, some commercial, some free, which (in
one way or another) build an index of the searchable data. They differ in what
kind of index they build, what kind of data they "understand", and how clever
they are about how they locate the data the user is looking for.
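
To give a feel for it, indexing and searching with Lucene is only a few
lines. An untested sketch against the 1.x API (ReportSearch and the
field names are arbitrary):

import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ReportSearch {
    // Build the index once; 'true' means create a fresh index directory.
    public static void buildIndex(String indexDir, Map<String, String> reports)
            throws Exception {
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        for (Map.Entry<String, String> r : reports.entrySet()) {
            Document doc = new Document();
            doc.add(Field.Keyword("id", r.getKey()));      // stored, not analysed
            doc.add(Field.Text("contents", r.getValue())); // analysed for search
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }

    // Free-text query over everything indexed.
    public static void search(String indexDir, String userQuery) throws Exception {
        IndexSearcher searcher = new IndexSearcher(indexDir);
        Query q = QueryParser.parse(userQuery, "contents", new StandardAnalyzer());
        Hits hits = searcher.search(q);
        for (int i = 0; i < hits.length(); i++)
            System.out.println(hits.doc(i).get("id") + "  score=" + hits.score(i));
        searcher.close();
    }
}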

-- chris
 

Ian T

OK... So, the problem is: I want to be able to do FREE-FORM TEXT
searches on all of my completed reports. How can I search a zipped
buffer for a specific text string without having to uncompress the
buffer (ruining my memory savings)? I could store text keywords with my
metadata object, but that's not really free-form... and that's what the
users are really hoping for.
Thanks,

Build an index and compress that too. Parse each XML document for words,
and build some kind of multi-entry hash table, hmmm ... does Java have a
multimap? (See the sketch below.)
Each index entry would consist of the word, followed by the documentID.
If you will have fewer than 4 billion documents, then the documentID can
be a 32-bit unsigned int (4 bytes). That means that for words of more
than 4 letters, you achieve a first-order level of compression on
the index, equivalent to (average_word_length - 4) bytes per entry. Not to
mention that you can discard common words like "the, it, a, an, but,
with" from the index.

Once you have built your index, compress it with ZipOutputStream. When
the user wants to search, load the index and build your hash map, do the
search on the map, then retrieve the XML documents via their documentID.
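
To answer my own question: the standard library has no multimap, but a
HashMap from word to a Set of documentIDs does the same job. A rough,
untested sketch (ReportIndex is a made-up name; serialising and zipping
the map is left out):

import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReportIndex {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "it", "a", "an", "but", "with"));

    // word -> IDs of documents containing it (the "multimap")
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addDocument(int documentId, String text) {
        // crude tokenisation; a real version would strip the XML markup first
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.length() == 0 || STOP_WORDS.contains(word)) continue;
            Set<Integer> ids = index.get(word);
            if (ids == null) {
                ids = new HashSet<>();
                index.put(word, ids);
            }
            ids.add(documentId);
        }
    }

    public Set<Integer> lookup(String word) {
        Set<Integer> ids = index.get(word.toLowerCase());
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }
}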

Ian
 
