.NET and Oracle BLOB

Robert Vabo

I have a database that is going to contain a lot of documents in
.DOC, .TXT, .PPT, .PDF, etc. formats. I want to index the documents so that
I can run a free-text search on the database table. I also want to insert
and retrieve the documents using .NET (C# or VB.NET).

Can anyone out there give me some tips, links, or other helpful hints?
 
Mark Kamoski

Robert--

Hmmm. Well, you've asked a lot, actually.

Here are some thoughts, because I find the question interesting.

While storing documents in a database has often "seemed" like a good idea,
the truth is that it is not. In short, a database is for storing data. A
file system is for storing files. Sure, one can store binary data in a
database and maybe (just maybe) this is OK in a case or two, but the best
place for documents seems to be the file system. That's the OS's job and it
does it VERY well. One can get a DB to do it, but it is clunky at best.

A good way to manage such documents, if you must have a database "handle" on
them, is to store the filename and perhaps the location in a database, as a
"pointer" to the file itself. However, if you do this then there is an
argument that says there are plenty of built-in DotNet classes for getting
to and from the file system (which is a good argument), so the database is
redundant anyway. Still, having the filenames collected neatly may be a
good idea now and again.
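
To make the "pointer" idea concrete, here is a minimal sketch. It is written in Python with SQLite standing in for Oracle and .NET, purely as an assumption to keep it short; the table name `documents` and the helper names are hypothetical. The database holds only the filename and location, and the bytes stay in the file system.

```python
import sqlite3
from pathlib import Path


def make_catalog():
    """Create an in-memory catalog table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, "
        "filename TEXT NOT NULL, "
        "location TEXT NOT NULL)"
    )
    return conn


def register_document(conn, path):
    """Record a file's name and directory; the file itself is not copied."""
    p = Path(path)
    conn.execute(
        "INSERT INTO documents (filename, location) VALUES (?, ?)",
        (p.name, str(p.parent)),
    )


def open_document(conn, filename):
    """Follow the stored pointer and read the bytes from the file system."""
    row = conn.execute(
        "SELECT location FROM documents WHERE filename = ?", (filename,)
    ).fetchone()
    if row is None:
        raise KeyError(filename)
    return (Path(row[0]) / filename).read_bytes()
```

The same shape should carry over to a real Oracle table; only the connection and parameter syntax would differ.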

With files in a file system, one can hook to the file system's
functionality for searching, or use some kind of indexing system, and so
on. Usually, to build a searchable index, one gets a product or uses the
OS's functionality. It is an involved task to write this sort of code; but,
of course, it CAN be done.
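
For a sense of what "writing this sort of code" involves, here is a toy inverted index over plain-text files. This is a sketch only, in Python, and it assumes the files are already plain text: formats like DOC, PPT, or PDF would first need a text-extraction step, which is the hard part and is not shown. The function names are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path


def build_index(root):
    """Map each lowercase word to the set of .txt files under root that contain it."""
    index = defaultdict(set)
    for path in Path(root).rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(path)
    return index


def search(index, query):
    """Return the files that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A real product adds stemming, ranking, phrase queries, and on-disk storage, which is why buying or reusing one is usually the easier route.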

Now, if one simply MUST store files in a database, then it is going to be
tricky building a dynamic index on documents of type PPT and the like. I
expect it can be done, but I should want to avoid it. But, I am a shirker
looking for the easiest way. Furthermore, building and keeping this "search
index" fresh is going to take time, especially if there are "a lot of
documents", as you have mentioned. Then again, some data analysis is
required here: for example, if the system is not in use 24 hours a day,
and if one does not need an up-to-the-minute index, then building a day-old
index would be an option. And so on.
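
If a day-old index is acceptable, a nightly job does not have to rebuild everything; it can pick out only the files that changed since they were last indexed. A minimal sketch in Python, assuming the job keeps a dictionary of path to last-indexed timestamp (the name `indexed_at` and the function are hypothetical):

```python
import os


def stale_files(paths, indexed_at):
    """Return the paths modified after the time they were last indexed.

    indexed_at maps a path to a POSIX timestamp; a path that is absent
    from the mapping has never been indexed and is always returned.
    """
    stale = []
    for path in paths:
        if os.path.getmtime(path) > indexed_at.get(path, 0.0):
            stale.append(path)
    return stale
```

This relies on file modification times being trustworthy, which is itself an assumption worth checking on the target system.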

Now, another way that I have addressed this issue is to truly separate
content from format. I have designed a newsgroup system that stores each
post's text in the database, as plain text. The formatting is handled by
CSS and/or XSLT. This way, the database just handles plain text and it is
easy to search. Furthermore, this is a relatively low-traffic newsgroup.
Finally, there is a limit to the size of each post (which I control), so
the database is not storing large pieces of text. All of this, however,
makes for a much different problem set when compared to the one you
describe; but, it may help you to think about the issues involved.
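
As a rough illustration of that plain-text approach (not the actual newsgroup code), here is a sketch in Python with SQLite standing in for the real database. The table name, the size limit, and the function names are all assumptions made for the example.

```python
import sqlite3

MAX_POST_CHARS = 10_000  # hypothetical cap on post size


def make_db():
    """Create an in-memory table that stores each post as plain text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
    return conn


def add_post(conn, body):
    """Insert a post, rejecting anything over the size limit."""
    if len(body) > MAX_POST_CHARS:
        raise ValueError("post exceeds size limit")
    cur = conn.execute("INSERT INTO posts (body) VALUES (?)", (body,))
    return cur.lastrowid


def search_posts(conn, term):
    """Return ids of posts whose text contains term.

    Uses LIKE, so % and _ in term act as wildcards, and matching is
    case-insensitive for ASCII letters under SQLite's defaults.
    """
    rows = conn.execute(
        "SELECT id FROM posts WHERE body LIKE ? ORDER BY id",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]
```

Because the stored value is already plain text, a simple substring match is enough here; no format-specific parsing is needed before searching.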

As I mentioned, this is a BIG topic, so I'll stop here while I'm behind.
There will be many arguments for and against what I have said, some good on
both sides. Please just take this as food for thought. I doubt that I have
clarified anything at all here; but, I hope that I have at least muddied
the waters.

HTH.

--Mark.

