Expected Token Density in Random Stream

Discussion in 'C++' started by Andrew Tomazos, Dec 7, 2011.

  1. Summary: We want to find out how often a given token appears in a
    random stream formed by concatenating randomly chosen strings from a
    given set of strings.

    Details:

    Given S, an array of n (non-empty) strings; and T, a string of length
    m;

    We create a random stream of characters by the following process:

    1. assign i = a (uniformly) random integer between 1 and n
    inclusive
    2. write the characters of the ith element of S to the stream
    3. goto 1

    (The elements of S are not necessarily the same length)

    For example:

    if S = {"a", "bug", "ug"}

    then the stream may start as follows:

    augbugugugaabug...

    Each time T appears in the stream we call that a hit. More precisely:

    1. Initialize a queue Q with m null elements

    2. If Q equals T record a hit
    3. Pop front of Q
    4. Push to back of Q next char from stream
    5. Goto 2

    For example:

    If T = "ugu"...

    then:

    augbugugugaabug...
    1 2

    We get two hits so far.

    (Note hits can overlap each other)

    What is the expected frequency of hits in terms of S and T?

    More precisely:

    let y be the number of chars read from the stream so far
    let x be the number of hits

    What is the expected limit of x/y as y approaches infinity?

    We can approximate this empirically as follows in C++:

    $ cat > HitFinder.cpp
    <paste below code>
    $ g++ -o HitFinder HitFinder.cpp
    $ ./HitFinder ugu a bug ug
    T = ugu
    A = {a, bug, ug}

    (x, y, x/y) = (0, 1, 0)
    (x, y, x/y) = (0, 2, 0)
    (x, y, x/y) = (0, 4, 0)
    (x, y, x/y) = (0, 8, 0)
    (x, y, x/y) = (1, 16, 1/16)
    (x, y, x/y) = (3, 32, 1/10.6667)
    (x, y, x/y) = (8, 64, 1/8)
    (x, y, x/y) = (16, 128, 1/8)
    ...
    (x, y, x/y) = (29821708, 268435456, 1/9.00134)
    (x, y, x/y) = (59648899, 536870912, 1/9.00052)
    (x, y, x/y) = (119304725, 1073741824, 1/8.99999)
    (x, y, x/y) = (238593314, 2147483648, 1/9.0006)
    (x, y, x/y) = (477198099, 4294967296, 1/9.00039)

    As you can see the answer for this converges on 1/9

    How can we calculate this figure by direct computation?

    ie How would you implement a function with signature...

    double ExpectedHitLimit(string T, vector<string> A)

    ....that quickly calculates this limit for any given T and A?

    Thanks and have fun,
    Andrew.

    --
    Andrew Tomazos <> <http://www.tomazos.com>


    // ======================= HitFinder.cpp Cut Here
    ========================
    // (C) 2011, Andrew Tomazos <>. All Rights
    Reserved.

    #include <queue>
    #include <string>
    #include <iostream>
    #include <limits>
    #include <cstdlib>

    using namespace std;

    typedef long long int64;
    bool IsPow2(int64 i)
    {
    return i != 0 && !(i & (i - 1));
    }

    int64 x = 0;

    void Hit()
    {
    x++;
    }

    int Rand(int n)
    {
    return rand() % n;
    }

    class RandCharStream
    {
    public:
    vector<string> S;
    size_t n, istr, ichar;

    RandCharStream(vector<string> S_) :
    S(S_),
    n(S.size()),
    istr(Rand(n)),
    ichar(0)
    {
    }

    char nextChar()
    {
    if (ichar == S[istr].size())
    {
    istr = Rand(n);
    ichar = 0;
    }

    return S[istr][ichar++];
    }
    };

    class HitFinder
    {
    public:
    RandCharStream& stream;
    queue<char> T;
    int m;
    queue<char> Q;

    HitFinder(RandCharStream& stream_, string T_) :
    stream(stream_),
    m(T_.size())
    {
    for (int i = 0; i < m; i++)
    {
    T.push(T_);
    Q.push('\0');
    }
    }

    void nextChar()
    {
    Q.pop();
    Q.push(stream.nextChar());

    if (Q == T)
    Hit();
    }
    };

    int main(int argc, char** argv)
    {
    if (argc < 3)
    {
    cerr << "Usage: " << argv[0] << " <T> <S1> <S2> ... <Sn>" << endl;
    return -1;
    }

    string T = argv[1];
    cout << "T = " << T << endl;

    cout << "A = {";
    vector<string> A;
    for (int i = 2; i < argc; i++)
    {
    string s = argv;
    cout << s << (i < argc - 1 ? ", " : "");
    A.push_back(s);
    if (s.empty())
    {
    cerr << "Element S" << i - 1 << " is empty" << endl;
    return -1;
    }
    }
    cout << "}" << endl;
    cout << endl;

    RandCharStream stream(A);
    HitFinder finder(stream, T);

    for (int64 y = 1; y < __LONG_LONG_MAX__; y++)
    {
    finder.nextChar();

    if (IsPow2(y))
    {
    cout << "(x, y, x/y) = (" << x << ", " << y << ", ";
    if (x == 0)
    cout << "0";
    else
    cout << "1/" << double(y) / double(x);
    cout << ")" << endl;
    }
    }

    return 0;
    }
    //======================= HitFinder.cpp Cut Here
    ========================
    Andrew Tomazos, Dec 7, 2011
    #1
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Cronus
    Replies:
    1
    Views:
    658
    Paul Mensonides
    Jul 15, 2004
  2. G Fernandes
    Replies:
    1
    Views:
    523
  3. Wessi
    Replies:
    3
    Views:
    846
    Lawrence Kirby
    Aug 11, 2005
  4. =?Utf-8?B?Y2FzaGRlc2ttYWM=?=

    This is an unexpected token. The expected token is 'NAME'

    =?Utf-8?B?Y2FzaGRlc2ttYWM=?=, Jul 13, 2007, in forum: ASP .Net
    Replies:
    2
    Views:
    775
    =?Utf-8?B?Y2FzaGRlc2ttYWM=?=
    Jul 13, 2007
  5. globalrev
    Replies:
    4
    Views:
    753
    Gabriel Genellina
    Apr 20, 2008
Loading...

Share This Page