looking for help with a counting algorithm

A

Andy Baxter

I'm working on a web-based database project written in perl, which is a
virtual library for people in the town where I live to share details of
books and videos they have - the idea is if you find a book you like you
click on 'borrow', and you are given details of how to contact the person
who owns it.

I'm not sure exactly where to ask for this, as I'm looking for help with
finding a good algorithm, rather than details of coding, but I've chosen
this group as it's written in perl, and more of a coding problem than a
cgi problem.

I have decided on a categorisation system for the items, where each item
can belong to many categories, and each category can also be grouped as a
subcategory under one or more higher level categories. Everything is
stored in a mysql database, and the links between books and categories,
and between categories and higher level categories are done through two
linking tables, which have entries like:

CategoryMembership
------------------
CatID ItemId
1 1
2 1
1 2
1 3
3 3
1 4
4 4
7 5

CategoryGrouping
----------------
CatID SubCatId SortOrder Stop
1 2 1
1 3 2
1 4 3
3 4 1 1 (yes.)
4 5 1
4 6 2
4 7 3


I.e. item 1 is in cats 1 & 2, item 2 is in cat 1, item 3 is in cat 3, and
item 4 is in cats 1 & 4. Categories 2 and 3 are in Cat 1, and Category 4
is in cat 3 and cat 1 etc.

I want any set of category memberships or groupings of categories that can
be represented like this without leading to loops to be allowed. (I.e. the
only thing that isn't allowed is things like category 1 is grouped under
cat 2 which is grouped under cat 3 which is grouped under cat 1, and more
complex variations on this.)

I have written some code which reads these tables and generates a
tree-like data structure, where each category is represented by a hash
like this:

{Name=>"category name",
Id=>"Id on database",
DownList=>ref of array with list of links to subcategories,
UpList=>ref of array with list of links to higher categories
}

The links between categories are hashes like this:
{Up=>ref to category hash,
Down=>ref to (sub) category hash,
Stop=>stop listing subcategories at this point?,
Sort=>sort order when displaying this subcategory}

This structure is then used by the scripts which render the pages
displaying the categories. This bit is working fine.

The bit I'm finding difficult is how to count how many items there are
under a given category, so that a count can be shown in the category index
page.

The code I have at the moment is here: (This code is in a module called
LibMod.pm)

sub CountCatItems($$) {
# Count how many items there are under this category, either directly or indirectly.
my ($mod,$pCat, # pointer to the category hash.
$pStack # pointer to stack which
)= @_;
my ($query, # query handle.
$catId, # category Id no.
$pLink # ref to link between cats.
);

$catId=${$pCat}{Id};
DPrint "counting ${$pCat}{Name} - $catId = ";
$query=$db->prepare("select count(ItemId) from CategoryMembership where " .
"CategoryId=$catId group By CategoryId");
$query->execute;
my ($dCount) = $query->fetchrow_array();
$query->finish;
${$pCat}{CountDirect}=$dCount;
DPrint "$dCount<br>"; # debug print.
foreach $pLink ( @{${$pCat}{DownList}} ) {
LibMod->CountCatItems(${$pLink}{Down}); # call self recursively
};
};

This works, but only counts the number of items directly included in a
category, not the ones which belong to a subcategory.

The thing that makes it hard is the multiple category membership plus
multiple category grouping, but I want to keep this if I possibly can,
because one of the problems with traditional non-computerised indexes is
that you can't do this even though books can be relevant to several
different topics.

My thoughts at the moment are either:

- give each category a hash where the keys are all the items included in
the category. Read these in from the database, then each time a
subcategory is counted, the code goes back up the tree to the root, adding
the items that belong to it to all the higher level categories. When the
recursive calls to subcategories return, (i.e. end of this code fragment),
then count the keys in the hash. This would work I think, but it would
involve retrieving all the category memberships from the database, rather
than just the counts, and it might cause speed and memory problems if a
lot of items were added to the system.

- or when you put an item in a category, category memberships are
automatically created in the database for all the higher level categories.
This would be quicker to count, but I'm thinking that if someone has
chosen to put an item in a given subcategory, and then later on I decide
to move that subcategory to a different group, the category memberships
should shift to follow this, whereas with this system the item would stay
in the old top-level categories.

Finally - my questions are:
- does anyone know of an algorithm for working out the counts just from
the counts in each category - i.e. without having to get details of
exactly which items belong to each category? (I have a feeling this may be
impossible)
- or generally, and maybe more useful in the long run, a good place to
start looking on the web for answers to problems like this when they crop
up.
- also if you have any thoughts on either of the solutions I've
described, or can think of a better way of doing it, I'd appreciate it.

andy.
 
A

Andy Baxter

At earth time Mon, 29 Dec 2003 11:35:17 +0100, the following transmission
was received from the entity known as Gregory Toomey:
You have to write a recursive algorithm, which you can precompute.

SQL has a "start with" and "connect by" syntax for this sort of thing,
but mysql does to implement it. Oracle implement this recursion for
you. eg See http://www.adp-gmbh.ch/ora/sql/connect_by.html
It would be a bit of a nightmare to do it yourself.

Thanks for this - I didn't know you could do this sort of thing in sql.
Its called "graph theory", and has been well studied for decades.

OK, I'll have a look about.

I've tried doing the first method I thought of, and it's not working out
as slow as I thought - I added 10000 items to the dataset and it slowed
the page loading by just over a second, which isn't as bad as I thought.
This was on a 1.3 GHz machine doing nothing else though.

Precomputing the algorithm might help - thanks.

I think I needn't have posted about this - I just felt like I'd hit a
brick wall thinking about how to do it, then writing the posting helped me
put my thoughts together, but having written all that I wanted to post it
anyway and see if anyone had any suggestions.

For what it's worth, the code I've written now looks like this: (in module
LibMod.pm) Any comments welcome, as I'm still quite new to perl, and
probably not making best use of the language.

andy.

sub ReadCategories($$$) {
# Read in list of categories from database into Id index and tree structure.
my ($mod,$pRoot, # a pointer to an empty hash which will be filled with the
# category tree structure.
$pIdIndex, # a pointer to an empty hash which will be filled with the
# category Id index.
$pFlags # ref to a hash with flags saying what info to include in the
# tree.
# CountItems=>defined - count items belonging to that category.
# Contains=>defined - mark whether this category includes an
# item.
# ItemId=>num - Id of item to check for.
)=@_;
my ($query, # query handle.
$pCats, # ref to results from category query.
$pCatRow, # ref to single row of results.
$pCat, # ref to individual category entry in the data structure.
$pGroups, # ref to results from category grouping query.
$pGroupRow, # ref to row from cat group query.
$pGroupCat, # ref to category that this one is grouped under.
$pLink # ref to a grouping link between two categories.
);
DPrint ("sub: ReadCategories<br>");
$query=$db->prepare("select Id,Name from Category");
$query->execute;
$pCats=$query->fetchall_arrayref;
$query->finish;

# fill the category Id index with entries, each of which is a pointer to a hash
# containing a category name and Id.
foreach $pCatRow ( @$pCats ) {
$pCat={Name=>$$pCatRow[1], # Category name.
Id=>$$pCatRow[0], # Category Id.
UpList=>[], # List of links to cats this is grouped under.
DownList=>[], # List of links to cats grouped under this one.
};
${$pIdIndex}{$$pCatRow[0]}=$pCat;
if ( exists ${$pFlags}{CountItems} ) {
$$pCat{Items}={};
};
};

# Now read in the category groupings to generate a tree structure.

# first generate a root category to build the tree from. There is no
# category in the database with Id 0, but all the top level categories
# have a CategoryGrouping entry with Id 0.
${$pRoot}{Name}="root";
${$pRoot}{Id}=0;
${$pRoot}{UpList}=undef;
${$pRoot}{DownList}=[];
${$pIdIndex}{0}=$pRoot;

# Read all the category groupings into a hash in memory:
$query=$db->prepare("select GroupId,MemberId,SortOrder,Stop ".
"from CategoryGrouping");
$query->execute;
$pGroups=$query->fetchall_arrayref;
$query->finish;

# Now go through all the groupings, and add links to the categories to make
# a tree structure.
foreach $pGroupRow (@$pGroups) {
$pCat=${$pIdIndex}{$$pGroupRow[1]};
$pGroupCat=${$pIdIndex}{$$pGroupRow[0]};
$pLink={Down=>$pCat, # ref to sub-category.
Up=>$pGroupCat, # ref to group category.
Sort=>$$pGroupRow[2], # sort order of this link.
};
if ($$pGroupRow[3] ) {
# if Stop is defined, don't display these subcategories.
${$pLink}{Stop}=${$pGroupRow}[3];
};
# add link to up and down lists for the two categories.
push @{${$pCat}{UpList}}, $pLink;
push @{${$pGroupCat}{DownList}}, $pLink;
};

if ( exists ${$pFlags}{CountItems} ) {
# count how many items belong to each category.
# This means adding three keys to the category hash:
# CountDirect - How many belong to this category directly.
# CountUnique - How many belong to this category and none of its subcategories.
# CountTotal - How many belong to this category and any of its subcategories.
DPrint "Counting<br>";
LibMod->CountCatItems($pRoot);
}

# LibMod->ReadCategoryLevel($pRoot,$pIdIndex,$pGroups);
};

sub CountCatItems($$) {
# Count how many items there are under this category, either directly or indirectly.
my ($mod,$pCat, # pointer to the category hash.
$pStack # pointer to stack which
)= @_;
my ($query, # query handle.
$catId, # category Id no.
$pLink # ref to link between cats.
);
$catId=${$pCat}{Id};
DPrint "sub: CountCatItems. Category= ${$pCat}{Name} - Id: $catId.<br>";
#DPrint "iteration: $i <br>";
$query=$db->prepare("select ItemId from CategoryMembership where " .
"CategoryId=$catId ");
$query->execute;
my $pItems = $query->fetchall_arrayref();
$query->finish;
my $row;
foreach $row (@$pItems) {
my $itemId=$$row[0];
FollowCatLinksUp($pCat,\&MarkCatContainsItem,$itemId);
};
#${$pCat}{CountDirect}=$dCount;
#DPrint "$dCount<br>";
foreach $pLink ( @{${$pCat}{DownList}} ) {
LibMod->CountCatItems(${$pLink}{Down});
};
DPrint "CountCatItems: dumping items from $$pCat{Name} - ";
DumpStruct($$pCat{Items});
DPrint "<br>";
$$pCat{Count}=scalar (keys (%{$$pCat{Items}}));
};

sub MarkCatContainsItem {
# mark this category as contaning a given item.
my ($pCat,$itemId
)=@_;
${$$pCat{Items}}{$itemId}=1;
};

sub FollowCatLinksUp {
# follow the upward pointing links from this category to all higher level cats,
# and do something with them.
my ($pCat, # ref to category.
$pSub, # sub to call on this category.
$parm
)=@_;
DPrint "sub: FollowCatLinksUp. Category=$$pCat{Name}<br>";
# go through each of the up-links:
if ( not defined ($$pCat{UpList}) ) {
DPrint "Reached category root. <br>";
return;
};
&$pSub($pCat,$parm); # call the sub.
my $pLink;
foreach $pLink ( @{$$pCat{UpList}} ) {
DPrint "Checking up-link: <br>";
FollowCatLinksUp($$pLink{Up}, $pSub, $parm);
};
};

ReadCategories is called with two empty hash pointers by any of the
scripts that need to work with the category tree, and then this hash, with
the {Count} hash keys filled in for each category, is then passed on to
the function which renders the category listing page.
 
A

Andy Baxter

At earth time Mon, 29 Dec 2003 02:14:34 +0000, the following transmission
was received from the entity known as Andy Baxter:
At earth time Mon, 29 Dec 2003 11:35:17 +0100, the following transmission
was received from the entity known as Gregory Toomey:



I've tried doing the first method I thought of, and it's not working out
as slow as I thought - I added 10000 items to the dataset and it slowed
the page loading by just over a second, which isn't as bad as I thought.
This was on a 1.3 GHz machine doing nothing else though.

P.S. I just changed the code so that instead of going back up the tree
adding each item to the higher level categories' item lists individually,
it passes a hash with all the items in just once to the sub that goes back
up the tree. This has cut the increase in page loading time for 10000
items from 1064 ms to 383 ms.

andy.
 
G

Gregory Toomey

Andy said:
I'm working on a web-based database project written in perl, which is a
virtual library for people in the town where I live to share details of
books and videos they have - the idea is if you find a book you like you
click on 'borrow', and you are given details of how to contact the person
who owns it.

I'm not sure exactly where to ask for this, as I'm looking for help with
finding a good algorithm, rather than details of coding, but I've chosen
this group as it's written in perl, and more of a coding problem than a
cgi problem.

I have decided on a categorisation system for the items, where each item
can belong to many categories, and each category can also be grouped as a
subcategory under one or more higher level categories. Everything is
stored in a mysql database, and the links between books and categories,
and between categories and higher level categories are done through two
linking tables, which have entries like:

CategoryMembership
------------------
CatID ItemId
1 1
2 1
1 2
1 3
3 3
1 4
4 4
7 5

CategoryGrouping
----------------
CatID SubCatId SortOrder Stop
1 2 1
1 3 2
1 4 3
3 4 1 1 (yes.)
4 5 1
4 6 2
4 7 3


I.e. item 1 is in cats 1 & 2, item 2 is in cat 1, item 3 is in cat 3, and
item 4 is in cats 1 & 4. Categories 2 and 3 are in Cat 1, and Category 4
is in cat 3 and cat 1 etc.

I want any set of category memberships or groupings of categories that can
be represented like this without leading to loops to be allowed. (I.e. the
only thing that isn't allowed is things like category 1 is grouped under
cat 2 which is grouped under cat 3 which is grouped under cat 1, and more
complex variations on this.)

That's called a directed acyclic graph, or DAG.
I have written some code which reads these tables and generates a
tree-like data structure, where each category is represented by a hash
like this:

{Name=>"category name",
Id=>"Id on database",
DownList=>ref of array with list of links to subcategories,
UpList=>ref of array with list of links to higher categories
}

The links between categories are hashes like this:
{Up=>ref to category hash,
Down=>ref to (sub) category hash,
Stop=>stop listing subcategories at this point?,
Sort=>sort order when displaying this subcategory}

This structure is then used by the scripts which render the pages
displaying the categories. This bit is working fine.

The bit I'm finding difficult is how to count how many items there are
under a given category, so that a count can be shown in the category index
page.

The code I have at the moment is here: (This code is in a module called
LibMod.pm)

sub CountCatItems($$) {
# Count how many items there are under this category, either directly or indirectly.
my ($mod,$pCat, # pointer to the category hash.
$pStack # pointer to stack which
)= @_;
my ($query, # query handle.
$catId, # category Id no.
$pLink # ref to link between cats.
);

$catId=${$pCat}{Id};
DPrint "counting ${$pCat}{Name} - $catId = ";
$query=$db->prepare("select count(ItemId) from CategoryMembership where " .
"CategoryId=$catId group By CategoryId");
$query->execute;
my ($dCount) = $query->fetchrow_array();
$query->finish;
${$pCat}{CountDirect}=$dCount;
DPrint "$dCount<br>"; # debug print.
foreach $pLink ( @{${$pCat}{DownList}} ) {
LibMod->CountCatItems(${$pLink}{Down}); # call self recursively
};
};

This works, but only counts the number of items directly included in a
category, not the ones which belong to a subcategory.

The thing that makes it hard is the multiple category membership plus
multiple category grouping, but I want to keep this if I possibly can,
because one of the problems with traditional non-computerised indexes is
that you can't do this even though books can be relevant to several
different topics.

My thoughts at the moment are either:

- give each category a hash where the keys are all the items included in
the category. Read these in from the database, then each time a
subcategory is counted, the code goes back up the tree to the root, adding
the items that belong to it to all the higher level categories. When the
recursive calls to subcategories return, (i.e. end of this code fragment),
then count the keys in the hash. This would work I think, but it would
involve retrieving all the category memberships from the database, rather
than just the counts, and it might cause speed and memory problems if a
lot of items were added to the system.

- or when you put an item in a category, category memberships are
automatically created in the database for all the higher level categories.
This would be quicker to count, but I'm thinking that if someone has
chosen to put an item in a given subcategory, and then later on I decide
to move that subcategory to a different group, the category memberships
should shift to follow this, whereas with this system the item would stay
in the old top-level categories.

Finally - my questions are:
- does anyone know of an algorithm for working out the counts just from
the counts in each category - i.e. without having to get details of
exactly which items belong to each category? (I have a feeling this may be
impossible)

You have to write a recursive algorithm, which you can precompute.

SQL has a "start with" and "connect by" syntax for this sort of thing,
but mysql does to implement it. Oracle implement this recursion for
you. eg See http://www.adp-gmbh.ch/ora/sql/connect_by.html
It would be a bit of a nightmare to do it yourself.

- or generally, and maybe more useful in the long run, a good place to
start looking on the web for answers to problems like this when they crop
up.

Its called "graph theory", and has been well studied for decades.
 
C

ctcgag

Andy Baxter said:
CategoryMembership
------------------
CatID ItemId
1 1
2 1
1 2
1 3
3 3
1 4
4 4
7 5

CategoryGrouping
----------------
CatID SubCatId SortOrder Stop
1 2 1
1 3 2
1 4 3
3 4 1 1 (yes.)
4 5 1
4 6 2
4 7 3

I.e. item 1 is in cats 1 & 2, item 2 is in cat 1, item 3 is in cat 3, and
item 4 is in cats 1 & 4. Categories 2 and 3 are in Cat 1, and Category 4
is in cat 3 and cat 1 etc.

So you have two different kinds of membership, implicit (I'm a member of
one category by virtue of my membership in a subcategory) and explicit. As
long as you retain this, and it seems you do want to retain this so that
implicit membership is automatically revoked when a subcategory is moved,
it is a fairly tough problem.
I want any set of category memberships or groupings of categories that
can be represented like this without leading to loops to be allowed.
(I.e. the only thing that isn't allowed is things like category 1 is
grouped under cat 2 which is grouped under cat 3 which is grouped under
cat 1, and more complex variations on this.)

How are you going to tackle inforcing this provision?

....
The bit I'm finding difficult is how to count how many items there are
under a given category, so that a count can be shown in the category
index page.

How important is this, really? If the count is somewhat inaccurate, will
anyone notice or care?

....
- give each category a hash where the keys are all the items included in
the category. Read these in from the database, then each time a
subcategory is counted, the code goes back up the tree to the root,
adding the items that belong to it to all the higher level categories.
When the recursive calls to subcategories return, (i.e. end of this code
fragment), then count the keys in the hash. This would work I think, but
it would involve retrieving all the category memberships from the
database, rather than just the counts, and it might cause speed and
memory problems if a lot of items were added to the system.

If the number of entries was small and quiet dynamic, this is the way I
would do it (at least to the extent I understand your description. I
wouldn't say it goes up to the root, I'd say it goes down to the leaves.
Recursive code doesn't "go" up, it just "returns" up.). If it does get
slow, and the data is read-often/write-seldom, I'd just cache the counts
and recalculate them periodically.
- or when you put an item in a category, category memberships are
automatically created in the database for all the higher level
categories. This would be quicker to count, but I'm thinking that if
someone has chosen to put an item in a given subcategory, and then later
on I decide to move that subcategory to a different group, the category
memberships should shift to follow this, whereas with this system the
item would stay in the old top-level categories.

Add a status field to the membership table. When a subcategory is moved,
all memberships marked as implicit would get removed and recalculated.
This would be computationally expensive, but moving categories around would
probably be rare.

Finally - my questions are:
- does anyone know of an algorithm for working out the counts just from
the counts in each category - i.e. without having to get details of
exactly which items belong to each category? (I have a feeling this may
be impossible)

I agree that it is probably not possible.

Xho
 
A

Andy Baxter

At earth time Fri, 02 Jan 2004 20:09:36 +0000, the following transmission
was received from the entity known as ctcgag:
How are you going to tackle inforcing this provision?

...

Initially just by taking care writing the data file that the category
structure is read from. If I get round to letting users add to the
category list, I might have to do some more research to find a good
algorithm for finding loops.
How important is this, really? If the count is somewhat inaccurate,
will anyone notice or care?

Maybe not that important, but it would be good to get it right if
possible. I have found a solution that works, and it's not horribly slow.
(10000 extra items adds about a second to the page rendering.)
If the number of entries was small and quiet dynamic, this is the way I
would do it (at least to the extent I understand your description. I
wouldn't say it goes up to the root, I'd say it goes down to the leaves.
Recursive code doesn't "go" up, it just "returns" up.). If it does get
slow, and the data is read-often/write-seldom, I'd just cache the counts
and recalculate them periodically.

The code I wrote goes down from the top, and then at every level another
recursive function is called which runs back up to the root, adding all
the items belonging to that subcategory to the list for all the higher
ones. Not sure if this is the best way but it does work. The tricky bit
is there can be more than one path back to the root from any leaf,
because it's not a straight tree. The item data will change fairly
regularly (I hope), but the category list will probably stay much the same
once I've worked it out.
Add a status field to the membership table. When a subcategory is
moved, all memberships marked as implicit would get removed and
recalculated. This would be computationally expensive, but moving
categories around would probably be rare.

Thanks - might try that if I need to.
I agree that it is probably not possible.

I can't find the words to say exactly why not, but that was what I came to
thinking having written it all out in my original post.

andy.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top