Indexing a Text File

Foodbank

Hi,

I'm trying to index a text file by creating the index and data clusters
(basically ISAM). Can anyone help with this? I'm finding very few
resources online for this application.

Thanks,
James
 
Foodbank

I've gotten some code started below, if anyone can give me some
pointers (and I'm not talking about pointers in C :) ).

Thanks,
James

Code:
#define ELEMPERCLUST 42            //42 elements per cluster
#define CLUSTERSIZE  1024          //cluster size obviously :)
#define MAXWORDLEN   22            //maximum word length
#define NUMLEVELS    3             //number of index levels
#define WORDLEV (NUMLEVELS-1)      //words are on level 2 (3-1 = 2)

struct element {
     int index;                                     // cluster index
     char     word[MAXWORDLEN];
};

struct cluster {
     int indexclust;         //Set to 1 for index clust or 0 for words
     struct element elem[ELEMPERCLUST];
} clust[NUMLEVELS];  // 2 index levels 0, 1 plus the word level 2

int fd, clust_pos, nincluster[NUMLEVELS], location[NUMLEVELS];

void isam_output(int lev, char *word, int index) {
     //Time to write the index and data clusters for words
}
 
kleuske

Foodbank wrote:
I've gotten some code started below, if anyone can give me some
pointers (and I'm not talking about pointers in C :) ).

First off, all data structures are useless without code. It's hard to
see what exactly you are trying to achieve and how you intend to
do it.

And I hate trying to give advice from what I _think_ you mean.
[QUOTE]
Code:
#define ELEMPERCLUST 42            //42 elements per cluster
#define CLUSTERSIZE  1024          //cluster size obviously :)[/QUOTE]

So far, so good. Dump the smileys, though. If this is professional
code, don't try to be cute.
[QUOTE]
#define MAXWORDLEN   22            //maximum word length
#define NUMLEVELS    3             //number of index levels[/QUOTE]

A more useful comment would explain _why_ there are only three levels.
Why not simply use qsort and bsearch? Why require a homegrown indexing
method?
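For illustration, here is a minimal sketch of the qsort/bsearch route, assuming the words fit in memory as an array of `const char *`. The helper names `sort_words` and `find_word` are made up for this example; only `qsort`, `bsearch` and `strcmp` come from the standard library.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback shared by qsort and bsearch: both pass pointers
   to array elements, and each element here is a `const char *`. */
static int cmp_words(const void *a, const void *b)
{
    const char *const *wa = a;
    const char *const *wb = b;
    return strcmp(*wa, *wb);
}

/* Hypothetical helper: sorts the word array in place. */
static void sort_words(const char **words, size_t n)
{
    qsort(words, n, sizeof *words, cmp_words);
}

/* Hypothetical helper: returns the position of `key` in the sorted
   array, or -1 if it is absent. The array must be sorted first. */
static long find_word(const char **words, size_t n, const char *key)
{
    const char **hit = bsearch(&key, words, n, sizeof *words, cmp_words);
    return hit ? (long)(hit - words) : -1L;
}
```

Calling `sort_words(words, n)` once and then `find_word(words, n, "fig")` for each lookup gives O(log n) searches without any on-disk index. Whether that is enough depends on whether the data really has to live in fixed-size clusters on disk, which the original post does not say.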
[QUOTE]
#define WORDLEV (NUMLEVELS-1)      //words are on level 2 (3-1 = 2)

struct element {
int index;                                     // cluster index
char     word[MAXWORDLEN];
};[/QUOTE]

This will give you a lot of overhead, since MAXWORDLEN must be large
enough to hold the longest word and most words will be much shorter.
Isn't there a better alternative? Does the content change frequently?

If not, why not record all the words in one big buffer, separated by
'\0' and simply use a char*? Also why do you store the 'cluster index'
in the element? It's not clear from what you write here, so at least
there should be a comment, explaining that.
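A rough sketch of the single-buffer idea, under the assumption that the content rarely changes. The names `struct wordpool`, `pool_add` and `pool_get` are invented for this example. One deliberate variation: it hands out offsets into the buffer rather than raw `char *` values, because `realloc` may move the buffer while it grows; once the buffer is final, `pool_get` yields an ordinary pointer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical word pool: every word is copied once into `buf`,
   terminated by '\0', so short words cost only their own length + 1. */
struct wordpool {
    char  *buf;   /* all words, stored back to back */
    size_t used;  /* bytes currently in use */
    size_t cap;   /* bytes allocated */
};

/* Appends `word` and returns its offset in the pool, or (size_t)-1 if
   memory could not be allocated. Offsets remain valid when buf moves. */
static size_t pool_add(struct wordpool *p, const char *word)
{
    size_t len = strlen(word) + 1;          /* include the '\0' */

    if (p->used + len > p->cap) {
        size_t ncap = p->cap ? p->cap * 2 : 256;
        while (ncap < p->used + len)
            ncap *= 2;
        char *nbuf = realloc(p->buf, ncap);
        if (nbuf == NULL)
            return (size_t)-1;
        p->buf = nbuf;
        p->cap = ncap;
    }
    memcpy(p->buf + p->used, word, len);
    p->used += len;
    return p->used - len;
}

/* Returns the word stored at offset `off`. */
static const char *pool_get(const struct wordpool *p, size_t off)
{
    return p->buf + off;
}
```

Start with a zero-initialised `struct wordpool`, keep the offsets returned by `pool_add` in the index elements in place of `char word[MAXWORDLEN]`, and release the storage with `free(p.buf)` when done.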
[QUOTE]
struct cluster {
int indexclust;         //Set to 1 for index clust or 0 for words
struct element elem[ELEMPERCLUST];
} clust[NUMLEVELS];  // 2 index levels 0, 1 plus the word level 2[/QUOTE]

Obviously you know (from the index of 'clust') whether you are dealing
with an 'index clust' or a word, so the first field seems superfluous.
Unless of course you are planning to do something incredibly clever that
I do not see.
[QUOTE]
int fd, clust_pos, nincluster[NUMLEVELS], location[NUMLEVELS];[/QUOTE]

It's generally a good idea to define only one variable per line. That
will make your code easier to read. After all, you write in C for the
benefit of humans, not computers.
[QUOTE]
void isam_output(int lev, char *word, int index) {
//Time to write the index and data clusters for words
}[/QUOTE]

How do you signal an error? If you write to a stream, any number of
things can go wrong. Failing silently virtually guarantees BIG
problems. Also, your interface requires you to know the index before
you've written anything. Now you could be planning something incredibly
clever, but from what you post here, I don't think it's a very usable
interface. Usually the index is a _result_ of writing a record.
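To make both points concrete, here is one possible shape for such an interface. It is only a sketch: `isam_append` is a hypothetical name, it uses a stdio `FILE *` opened in binary mode rather than the `fd` from the posted code, and it assumes fixed-size records of `MAXWORDLEN` bytes. The record number comes back as the return value, and -1 reports any failure.

```c
#include <stdio.h>
#include <string.h>

#define MAXWORDLEN 22   /* same value as in the posted code */

/* Hypothetical replacement for isam_output: appends one fixed-size word
   record to `fp` and returns its record number, or -1 on any failure. */
static long isam_append(FILE *fp, const char *word)
{
    char rec[MAXWORDLEN] = {0};          /* zero-padded record */
    size_t len = strlen(word);

    if (len >= MAXWORDLEN)
        return -1;                       /* word does not fit in a record */
    memcpy(rec, word, len);

    if (fseek(fp, 0L, SEEK_END) != 0)    /* always write at the end */
        return -1;
    long pos = ftell(fp);
    if (pos < 0)
        return -1;
    if (fwrite(rec, sizeof rec, 1, fp) != 1)
        return -1;                       /* short write: report it */

    return pos / (long)sizeof rec;       /* index is a result of the write */
}
```

The caller then checks the result, for example `long idx = isam_append(fp, word); if (idx < 0) { /* handle the error */ }`, and stores `idx` in the next index level up.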
 
Foodbank

I appreciate the effort, but all you did was basically criticize my
code instead of pointing me in the correct direction to go. Anyone
else?

Thanks,
James

PS I'll use all the smileys I want :)
 
Barry Schwarz

I've gotten some code started below, if anyone can give me some
pointers (and I'm not talking about pointers in C :) ).

There is nothing in your code that tells us what you are trying to
accomplish.

I recommend you leave off trivial comments. They actually decrease
readability.
 
Walter Roberson

I appreciate the effort, but all you did was basically criticize my
code instead of pointing me in the correct direction to go. Anyone
else?

Your posting asked for "pointers", and the respondent gave you a
number of pointers as to how your code could be improved and as
to why your existing interface does not appear to suit the stated
purpose.

If that wasn't the kind of pointer that you wanted, then you
could have been more specific.

What is it that you are looking for? Are you looking for research
papers comparing the efficiency of ISAM to other databases? Are
you looking for information on how to optimize ISAM lookups?
Are you looking for a solid description of what ISAM is, but
without code, for the purposes of a "clean-room implementation"
for a commercial product? Are you looking for a public domain
ISAM for use in a commercial product? Are you looking for an ISAM
implementation with a freeware license that could be used in
a commercial product? Are you looking for an ISAM with a freeware
license that would allow you to use it in a non-commercial product?

Or, are you looking for hints on how to code a school assignment?
 
