# a question on google's algorithm

Discussion in 'C++' started by %NAME%, Feb 21, 2007.

1. ### %NAME%Guest

The title is not exactly a disnomer, so pleaes let me explain...

Say I have a large number of 0-1 vectors containing only
bits corresponding to 0 or 1. Now given an arbitrary 0-1 vector,
I wish to find all the 0-1 vectors that are subsumed by that query bit
vector.

For example, given bit vector query as 01100, i wish to find all bit
vectors that whose 1st, 4th, 5th bits are also zero, leaving
the 2nd, 3rd bits as either 0 or 1.

What I wish to do is to build an index on the bit vectors, so that I
could avoid checking each 0-1 vector in sequence. In another word, I
hope this index could help me quickly filter away those vectors that
guarunteed not to be subsumed by the given 0-1 vector.

I am not sure this is the right place to ask, but the index could be
used to solve a search engine problem: the 0-1 vector
corresponds to my keywords search (a 0 means that keyword NOT in my
query, while a 1 means the keyword is in), and all the webpages
corresponds to 0-1 vectors. I wish to find all the webpages that
subsumes my query.

If anyone happen to know how Google solves this problem, and it is not
confidential, pls let me know, thanks a bunch!!!!

%NAME%, Feb 21, 2007

2. ### Alf P. SteinbachGuest

* %NAME%:
> [interesting question, cross-posted to numerous groups]

No C++ content means OFF-TOPIC in [comp.lang.c++].

Thank you.

FUT: [comp.programming]

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Alf P. Steinbach, Feb 21, 2007

3. ### Bill ReidGuest

%NAME% <> wrote in message
news:...

> The title is not exactly a disnomer, so pleaes let me explain...
>
> Say I have a large number of 0-1 vectors containing only
> bits corresponding to 0 or 1. Now given an arbitrary 0-1 vector,
> I wish to find all the 0-1 vectors that are subsumed by that query bit
> vector.
>
> For example, given bit vector query as 01100, i wish to find all bit
> vectors that whose 1st, 4th, 5th bits are also zero, leaving
> the 2nd, 3rd bits as either 0 or 1.
>
> What I wish to do is to build an index on the bit vectors, so that I
> could avoid checking each 0-1 vector in sequence. In another word, I
> hope this index could help me quickly filter away those vectors that
> guarunteed not to be subsumed by the given 0-1 vector.
>
> I am not sure this is the right place to ask, but the index could be
> used to solve a search engine problem: the 0-1 vector
> corresponds to my keywords search (a 0 means that keyword NOT in my
> query, while a 1 means the keyword is in), and all the webpages
> corresponds to 0-1 vectors. I wish to find all the webpages that
> subsumes my query.
>
> If anyone happen to know how Google solves this problem, and it is not
> confidential, pls let me know, thanks a bunch!!!!
>

Have you tried Googling(TM) your question?

---
William Ernest Reid

Bill Reid, Feb 21, 2007
4. ### Guest

On Feb 20, 6:56 pm, "%NAME%" <> wrote:
> The title is not exactly a disnomer, so pleaes let me explain...
>
> Say I have a large number of 0-1 vectors containing only
> bits corresponding to 0 or 1. Now given an arbitrary 0-1 vector,
> I wish to find all the 0-1 vectors that are subsumed by that query bit
> vector.
>
> For example, given bit vector query as 01100, i wish to find all bit
> vectors that whose 1st, 4th, 5th bits are also zero, leaving
> the 2nd, 3rd bits as either 0 or 1.
>
> What I wish to do is to build an index on the bit vectors, so that I
> could avoid checking each 0-1 vector in sequence. In another word, I
> hope this index could help me quickly filter away those vectors that
> guarunteed not to be subsumed by the given 0-1 vector.
>
> I am not sure this is the right place to ask, but the index could be
> used to solve a search engine problem: the 0-1 vector
> corresponds to my keywords search (a 0 means that keyword NOT in my
> query, while a 1 means the keyword is in), and all the webpages
> corresponds to 0-1 vectors. I wish to find all the webpages that
> subsumes my query.
>
> If anyone happen to know how Google solves this problem, and it is not
> confidential, pls let me know, thanks a bunch!!!!

Suppose I have a database of all the words from the
Official Scrabble Players Dictionary (OSPD). I can
create a consonant/vowel vector for all 5-letter
words using the SQL statement:

SELECT OSPD_words.OSPD,
IIf(Mid\$([OSPD],1,1) Like "[a,e,i,o,u]",1,0) AS bit1,
IIf(Mid\$([OSPD],2,1) Like "[a,e,i,o,u]",1,0) AS bit2,
IIf(Mid\$([OSPD],3,1) Like "[a,e,i,o,u]",1,0) AS bit3,
IIf(Mid\$([OSPD],4,1) Like "[a,e,i,o,u]",1,0) AS bit4,
IIf(Mid\$([OSPD],5,1) Like "[a,e,i,o,u]",1,0) AS bit5,
[bit1] & [bit2] & [bit3] & [bit4] & [bit5] AS vector
FROM OSPD_words
WHERE (((Len([OSPD]))=5));

which produces:

OSPD bit1 bit2 bit3 bit4 bit5 vector
added 1 0 0 1 0 10010
adder 1 0 0 1 0 10010
addle 1 0 0 0 1 10001
adeem 1 0 1 1 0 10110
adept 1 0 1 0 0 10100
adieu 1 0 1 1 1 10111
adios 1 0 1 1 0 10110
adits 1 0 1 0 0 10100
adman 1 0 0 1 0 10010
admen 1 0 0 1 0 10010
admit 1 0 0 1 0 10010
admix 1 0 0 1 0 10010
adobe 1 0 1 0 1 10101
adobo 1 0 1 0 1 10101
..
..
..

Once I have those vectors, I can then query them to
find anything that has the 1st, 4th & 5th letters
as consonants and the 2nd & 3rd letters can be either
consonants or vowels. This is done using the
like "0??00" pattern matching criteria. The question
marks are single character wild cards:

SELECT vector_test1.OSPD, vector_test1.vector
FROM vector_test1
WHERE (((vector_test1.vector) Like "0??00"));

which filters the above output to show only the
requested vectors:

OSPD vector
baals 01100
backs 01000
baffs 01000
baffy 01000
baggy 01000
bahts 01000
bails 01100
bairn 01100
baith 01100
baits 01100
..
..
..
brach 00100
bract 00100
brags 00100
braky 00100
brand 00100
..
..
..

And because we didn'y define "y" as a vowel, there
are even some 5-letter words with no vowels:

OSPD vector
ghyll 00000
lymph 00000
hymns 00000
gypsy 00000
crypt 00000
dryly 00000
lynch 00000
synth 00000
glyph 00000
synch 00000

, Feb 21, 2007
5. ### Jerry CoffinGuest

In article <>,
says...
> The title is not exactly a disnomer, so pleaes let me explain...
>
> Say I have a large number of 0-1 vectors containing only
> bits corresponding to 0 or 1. Now given an arbitrary 0-1 vector,
> I wish to find all the 0-1 vectors that are subsumed by that query bit
> vector.
>
> For example, given bit vector query as 01100, i wish to find all bit
> vectors that whose 1st, 4th, 5th bits are also zero, leaving
> the 2nd, 3rd bits as either 0 or 1.

Assuming your bit-vectors were stored in an integer type, you could use
a comparison like this:

if ((bit_vector | pattern) == pattern)
// found
else

If you need to store larger bit-vectors than will fit in any integer
type (only 32 are guaranteed), you can look at std::bit_vector and/or
std::vector<bool>. IIRC, bit_vector supports boolean operations
directly, but for vector<bool> you'll probably need to write your own.

--
Later,
Jerry.

The universe is a figment of its own imagination.

Jerry Coffin, Feb 21, 2007
6. ### CBFalconerGuest

Jerry Coffin wrote:
> says...
>
>> The title is not exactly a disnomer, so pleaes let me explain...
>>
>> Say I have a large number of 0-1 vectors containing only bits
>> corresponding to 0 or 1. Now given an arbitrary 0-1 vector,
>> I wish to find all the 0-1 vectors that are subsumed by that
>> query bit vector.
>>
>> For example, given bit vector query as 01100, i wish to find
>> all bit vectors that whose 1st, 4th, 5th bits are also zero,
>> leaving the 2nd, 3rd bits as either 0 or 1.

>
> Assuming your bit-vectors were stored in an integer type, you
> could use a comparison like this:
>
> if ((bit_vector | pattern) == pattern)
> // found
> else
>
> If you need to store larger bit-vectors than will fit in any
> integer type (only 32 are guaranteed), you can look at
> std::bit_vector and/or std::vector<bool>. IIRC, bit_vector
> supports boolean operations directly, but for vector<bool>
> you'll probably need to write your own.

The fault is basically that of the OP, but please do not post
off-topic information on c.l.c. It is easy to overlook the foolish
cross-posting, but you should not answer in any group where the
answer (and question) are off topic. F'ups set.

--
<http://www.cs.auckland.ac.nz/~pgut001/pubs/vista_cost.txt>
<http://www.securityfocus.com/columnists/423>

"A man who is right every time is not likely to do very much."
-- Francis Crick, co-discover of DNA
"There is nothing more amazing than stupidity in action."
-- Thomas Matthews

CBFalconer, Feb 21, 2007
7. ### Randy HowardGuest

On Tue, 20 Feb 2007 18:56:19 -0600, NAME% wrote
(in article <>):

> The title is not exactly a disnomer, so pleaes let me explain...

What in the world is a disnomer? Is this another one of those "we made
up a word because the mood struck us" deals?

--