2D - bivariate random number generatin

J

John

Hi All, I have created the following class to uniformly generate
numbers according to a zipf distribution. How can I extend this program
so that I can generate two data points with the requirement that the
first data point is correlated with the first? ie. the secondpoint will
vary with the first? any ideas? thanks in advance.

import java.util.Random;

/*
Basically a simple program to generate random
variables which approximate a Zipf distribution
*/


public class basic_zipf
{
/* Calculate the generalized harmonic number of order N.
Basically, the sum of the reciprocals. */

public final static void harmonic(double exponent, double[]
sum_reciprocals)
{
double sum = 0.0;

for (int i = 1; i <= sum_reciprocals.length; i++)
{
sum += 1.0 / Math.pow(i,exponent);
sum_reciprocals[i-1]=sum;
}
}

public static void main(String[] args)
{
int numer_of_files=1000;

/* random number generator */
Random random=new Random();
double exponent=1.0;

/* get the harmonic number of order 30 */
//double sum = harmonic(numer_of_files,exponent);

/* lookup table for each reciprocal value 1 ... 30 */
double reciprocals[]=new double[numer_of_files];

/* calculate harmonic numbers */
harmonic(exponent,reciprocals);

double sum=reciprocals[numer_of_files-1];

int[] variates=new int[numer_of_files];

/* generate 10,000 random variables */
for(int i=0;i<500000;i++)
{
// scale the random variable to the range [0,sum]
double tmp=random.nextDouble()*sum;
double range=0;

// bin the scaled random number and output the bin number
for(int j=0;j<numer_of_files;j++)
{
// find out which range the value is in
double upper=reciprocals[j];

if((tmp>=range) && (tmp<=upper))
{
//System.out.println(j+1);
variates[j]++;
break;
}
else
{
range=upper;
}
}
}

for(int i=0;i<variates.length;i++)
{
System.out.println((i+1)+","+variates);

}
}
}
 
P

Patricia Shanahan

John said:
Hi All, I have created the following class to uniformly generate
numbers according to a zipf distribution. How can I extend this program
so that I can generate two data points with the requirement that the
first data point is correlated with the first? ie. the secondpoint will
vary with the first? any ideas? thanks in advance.

Can you describe the conditional distribution of the second variable,
given the first?

Patricia
 
J

John

Patricia said:
Can you describe the conditional distribution of the second variable,
given the first?

Yes I can (somewhat informaly), what I want to do is model web browsing
behaviour. So given a variable X say the popularity of a particular
file there will be a covariant random number specifying the size of the
file. The number will also vary according to a zipf distribution. I was
thinking of using a 2D array with the required densities but am unsure
as to how to proceed. The second variable the file size will be smaller
for a popular file and larger for an unpopular file. Any ideas ?
 
O

Oliver Wong

John said:
Yes I can (somewhat informaly), what I want to do is model web browsing
behaviour. So given a variable X say the popularity of a particular
file there will be a covariant random number specifying the size of the
file. The number will also vary according to a zipf distribution. I was
thinking of using a 2D array with the required densities but am unsure
as to how to proceed. The second variable the file size will be smaller
for a popular file and larger for an unpopular file. Any ideas ?

You're saying that if a file is popular, it's likely that it's small,
and vice versa? Is it exactly the case that the nth most popular file is the
nth smallest file?

- Oliver
 
P

Patricia Shanahan

John said:
Yes I can (somewhat informaly), what I want to do is model web browsing
behaviour. So given a variable X say the popularity of a particular
file there will be a covariant random number specifying the size of the
file. The number will also vary according to a zipf distribution. I was
thinking of using a 2D array with the required densities but am unsure
as to how to proceed. The second variable the file size will be smaller
for a popular file and larger for an unpopular file. Any ideas ?

I think you need to move beyond "somewhat informally" to nail down a
conditional distribution.

The 2D array idea amounts to storing the joint distribution for
popularity and size. It can be calculated from:

P(popularity==p && size==s)
= P(size==s | popularity==p)*P(popularity==p).

You have a good choice of distribution for popularity. Whether it is a
good idea to pre-evaluate the joint distribution and store the
probabilities in an array depends on the nature of the conditional
distribution. If it is easy to evaluate on the fly, there is no need for
what could turn out to be an inconveniently large array. There are a lot
of web pages.

Patricia
 
J

John

Patricia said:
I think you need to move beyond "somewhat informally" to nail down a
conditional distribution.

The 2D array idea amounts to storing the joint distribution for
popularity and size. It can be calculated from:

P(popularity==p && size==s)
= P(size==s | popularity==p)*P(popularity==p).

You have a good choice of distribution for popularity. Whether it is a
good idea to pre-evaluate the joint distribution and store the
probabilities in an array depends on the nature of the conditional
distribution. If it is easy to evaluate on the fly, there is no need for
what could turn out to be an inconveniently large array. There are a lot
of web pages.
I've been reading up on non-uniform random variate generation
techniques in the book "Non-Uniform Random Variate Generation" by Luc
Devroye. Anyway it seems that there is quite a steep learning curve.
Are you aware of any code examples that do something similar to what I
want ? I have found a number of examples that deal with bivariate
generation for normal distributed variables using a covariance matrix.
I just wonder if these are also applicable to what I want. Ideally I
want something that can be tuned in terms of correlation. thanks for
all your help.
 
C

Chris Uppal

John said:
I've been reading up on non-uniform random variate generation
techniques in the book "Non-Uniform Random Variate Generation" [...]

Wow!

(Sorry, couln't resist. I'll leave it to someone else to provide a sensible
answer...)

-- chris
 
O

Oliver Wong

John said:
I've been reading up on non-uniform random variate generation
techniques in the book "Non-Uniform Random Variate Generation" by Luc
Devroye. Anyway it seems that there is quite a steep learning curve.
Are you aware of any code examples that do something similar to what I
want ?

The problem is, it's not clear what you want. What does your application
do?

- Oliver
 
J

John

Oliver said:
The problem is, it's not clear what you want. What does your application
do?

OK. My application is a P2P caching web proxy and what I want to do is
load test the cache and get statistics such as file hit count, byte hit
count and the amount of data that has to be sent/recieved to/from an
origin server. I think the zipf distribution is a good choice of
distribution for web page popularity. But I would also like to
correlate web page popularity with file size. This correlation should
be adjustable between negative correlation, independence and positive
correlation. As such I can generate web page file requests with a given
popularity and file size which may or may not be dependent on the
popularity. I am still not 100 % convinced that there is a dependency
between file size and popularity but if there is I would like to cover
that base as well.
 
P

Patricia Shanahan

John said:
OK. My application is a P2P caching web proxy and what I want to do is
load test the cache and get statistics such as file hit count, byte hit
count and the amount of data that has to be sent/recieved to/from an
origin server. I think the zipf distribution is a good choice of
distribution for web page popularity. But I would also like to
correlate web page popularity with file size. This correlation should
be adjustable between negative correlation, independence and positive
correlation. As such I can generate web page file requests with a given
popularity and file size which may or may not be dependent on the
popularity. I am still not 100 % convinced that there is a dependency
between file size and popularity but if there is I would like to cover
that base as well.

I do have a truncated Zipf generator in Java, but it may not be suitable
for your purposes. I have a relatively small number of distinct items,
so I just calculate out the probability for each item, and then use a
discrete distribution generator.

For the page size, you still need to pick a distribution, and then
relate the distribution parameters to the page popularity.

Have you read
http://www.nslij-genetics.org/wli/zipf/breslau99.pdf? It looks relevant,
and although it is a few years old it may be a useful starting point for
finding papers that reference it. [Do you know how to use Citeseer?]

Patricia
 
C

Chris Uppal

John said:
OK. My application is a P2P caching web proxy and what I want to do is
load test the cache and get statistics such as file hit count, byte hit
count and the amount of data that has to be sent/recieved to/from an
origin server. I think the zipf distribution is a good choice of
distribution for web page popularity.

OK, a fair hypothesis. I think you've already said that you know how to do
that bit.

But I would also like to
correlate web page popularity with file size. This correlation should
be adjustable between negative correlation, independence and positive
correlation. As such I can generate web page file requests with a given
popularity and file size which may or may not be dependent on the
popularity. I am still not 100 % convinced that there is a dependency
between file size and popularity but if there is I would like to cover
that base as well.

But this seems iffy to me -- the statistical equivalent of over-engineering.
You don't have a good reason to suspect a correlation. You don't know what the
correlation is, nor how it is parameterised. But you still want to model it
"just in case"...

So start simple. Have two file sizes, and either (a) choose randomly between
them, (b) choose the larger one for popular requests (all and only), (c) choose
the smaller one for popular requests (all and only). If you want to elaborate
a little more, choose a file size with a randomly varying value around one or
other mean. If you want to get more elaborate still;;, choose a file size with
a mean varying randomly around a mean derived from the popularity by some
simple formula.

No that's not statistically sound (or so I assume), but you are not in a
position to /be/ statistically sound -- since you have no data to model, nor a
theory to yield an analytic model. So all you are doing -- all you /can/ do --
is a simple test to see if performance is obviously sensitive to such
correlations. So, to borrow a phrase from XP, use the Simplest Correlation
That Could Possibly Work.

-- chris
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,901
Latest member
Noble71S45

Latest Threads

Top