getting a submatrix of all true

Discussion in 'Python' started by John Hunter, Jul 2, 2003.

  1. John Hunter

    John Hunter Guest

    I have a largish data set (1000 observations x 100 floating point
    variables), and some of the data are missing. I want to try a
    variety of clustering, neural network, etc, algorithms on the data,
    and to keep life simple I want to reduce the dimensions of the matrix
    so that I have no missing values, since not all the algorithms are
    able to handle them and there is sufficient redundancy in the
    variables that I can afford to lose some.

    I am currently using a hack that works, but it makes me wonder if
    there is an optimal solution. I define optimal as the removal of rows
    and columns such that there are no missing values and
    max(numRows*numCols).

    My current approach is to drop rows (observations) that have more than
    some prespecified number of missing variables, and then drop the
    columns (variables) of the reduced data set that have any missing
    values. I chose the threshold for dropping a row by eyeballing the
    distribution of the number of missing variables per observation,
    picking a number on the low end of the distribution, and dropping the
    rows that exceed the threshold.

    Another way of formulating the question: for a sparse boolean matrix
    (sparse on True), what is the optimal way to remove rows and columns
    so that the total number of elements in the matrix is maximal and
    there are no True values left.


    Example:

    The full matrix:

    1 0 0 0 1
    0 0 0 0 0
    0 0 0 0 0
    0 0 0 0 0
    0 0 1 0 0

    Some candidate all-zero submatrices:

    0 0 0
    0 0 0      candidate submatrix has 12 elements
    0 0 0
    0 0 0

    0 0 0 0 0
    0 0 0 0 0      candidate submatrix has 15 elements
    0 0 0 0 0

    0 0
    0 0        candidate submatrix has 8 elements
    0 0
    0 0

    I want to programmatically extract the 15-element submatrix.

    Following the approach described above, I get the desired answer in
    the example below, though this is a hack solution and I have the
    feeling there is a better one.

    from Numeric import nonzero, array, take, sum

    X = array([[1, 0, 0, 0, 1],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 1, 0, 0]])

    goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
    X = take(X, goodObsInd) # drop the bad

    goodVarInd = nonzero(sum(X)==0) # variables with no missing data
    X = take(X, goodVarInd, 1 ) # drop the bad variables

    print X


    John Hunter
    John Hunter, Jul 2, 2003
    #1

  2. On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <> wrote:

    >
    >I have a largish data set (1000 observations x 100 floating point
    >variables), and some of the of the data are missing. I want to try a
    >variety of clustering, neural network, etc, algorithms on the data,
    >and to keep life simple I want to reduce the dimensions of the matrix
    >so that I have no missing values, since not all the algorithms are
    >able to handle them and there is sufficient redundancy in the
    >variables that I can afford to lose some.
    >
    >I am currently using a hack that works, but it makes me wonder if
    >there is an optimal solution. I define optimal as the removal of rows
    >and columns such that there are no missing values and
    >max(numRows*numCols).
    >
    >My current approach is to drop rows (observations) that have more than
    >some prespecified number of missing variables, and then drop the
    >columns (variables) of the reduced data set that have any missing
    >values. I chose the threshold for dropping a row by eyeballing the
    >distribution of number of missing variables per observation, pick a
    >number on the low end of the distribution, and dropping the rows that
    >exceed the threshold.
    >
    >Another way of formulating the question: for a sparse boolean matrix
    >(sparse on True), what is the optimal way to remove rows and columns
    >so that the total number of elements in the matrix is maximal and
    >there are no True values left.
    >
    >
    >Example:
    >
    > 0 0 0
    > 0 0 0 candidate sub matrix has 12 elements
    > 0 0 0
    > 0 0 0
    >
    >1 0 0 0 1
    >0 0 0 0 0
    >0 0 0 0 0      (middle three rows: candidate submatrix has 15 elements)
    >0 0 0 0 0
    >0 0 1 0 0
    >
    > 0 0
    > 0 0 candidate submatrix has 8 elements
    > 0 0
    > 0 0
    >
    >I want to programatically extract the 15 element matrix


    If I understand your original optimality definition, that would be suboptimal.
    I.e., how about the 16-element matrix? (x's mark the corresponding zeroes)

    1 0 0 0 1
    x x 0 x x
    x x 0 x x
    x x 0 x x
    x x 1 x x

    Or do they have to be adjacent?

    >
    >Following the approach described above, I get the desired answer in
    >the example below, though this is a hack solution and I have the
    >feeling there is a better one.
    >
    > from Numeric import nonzero, array, take, sum
    >
    > X = array([[1, 0, 0, 0, 1],
    > [0, 0, 0, 0, 0],
    > [0, 0, 0, 0, 0],
    > [0, 0, 0, 0, 0],
    > [0, 0, 1, 0, 0]])
    >
    > goodObsInd = nonzero(sum(X,1)<2) # observations with < 2 missing variables
    > X = take(X, goodObsInd) # drop the bad
    >
    > goodVarInd = nonzero(sum(X)==0) # variables with no missing data
    > X = take(X, goodVarInd, 1 ) # drop the bad variables
    >
    > print X
    >


    Brute force seems to work on this little example. Maybe it can
    be memoized and optimized and/or whatever to handle your larger matrix fast enough?

    ====< submatrix.py >================================
    X = [
        [1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0]
    ]

    def optimrcs(mx, rows=[], cols=[]):
        if not rows or not cols: return 0, [], []
        maxscore = 0
        for r in rows:
            if not mx[r].count(1): continue
            for c in cols:
                if not mx[r][c]: continue
                # eval column submatrix vs row submatrix
                colsxthis = cols[:]; colsxthis.remove(c)
                colscore, rset, cset = optimrcs(mx, rows, colsxthis)
                if colscore > maxscore:
                    maxscore, maxrows, maxcols = colscore, rset, cset
                rowsxthis = rows[:]; rowsxthis.remove(r)
                rowscore, rset, cset = optimrcs(mx, rowsxthis, cols)
                if rowscore >= maxscore:
                    maxscore, maxrows, maxcols = rowscore, rset, cset
        if not maxscore:
            return len(rows)*len(cols), rows, cols
        return maxscore, maxrows, maxcols

    if __name__ == '__main__':
        score, rowsel, colsel = optimrcs(X, range(5), range(5))
        print 'Score = %s, rows = %s, cols = %s' % (score, rowsel, colsel)
        print
        for r in range(5):
            row = X[r]
            for c in range(5):
                if r in rowsel and c in colsel:
                    print '<%s>' % row[c],
                else:
                    print ' %s ' % row[c],
            print
    ====================================================
    Running it results thus:

    [21:24] C:\pywk\clp>submatrix.py
    Score = 16, rows = [1, 2, 3, 4], cols = [0, 1, 3, 4]

    1 0 0 0 1
    <0> <0> 0 <0> <0>
    <0> <0> 0 <0> <0>
    <0> <0> 0 <0> <0>
    <0> <0> 1 <0> <0>

    Regards,
    Bengt Richter
    Bengt Richter, Jul 3, 2003
    #2

  3. John Hunter

    Terry Reedy Guest

    "John Hunter" <> wrote in message
    news:...
    >
    > I have a largish data set (1000 observations x 100 floating point
    > variables), and some of the of the data are missing.


    All too typical -- missing data are the bane of statistics.

    > I want to try a
    > variety of clustering, neural network, etc, algorithms on the data,
    > and to keep life simple I want to reduce the dimensions of the

    matrix
    > so that I have no missing values, since not all the algorithms are
    > able to handle them and there is sufficient redundancy in the
    > variables that I can afford to lose some.


    Statisticians have tried a variety of approaches. Googling for
    ' statistics "missing data" ' will give you some leads if you want.

    Terry J. Reedy
    Terry Reedy, Jul 3, 2003
    #3
  4. John Hunter

    John Hunter Guest

    >>>>> "Bengt" == Bengt Richter <> writes:


    Bengt> If I understand your original optimality definition, that
    Bengt> would be suboptimal. I.e., how about the 16-element
    Bengt> matrix? (x's mark the corresponding zeroes)

    Bengt> Or do they have to be adjacent?

    No, in general they won't be. I missed that one. Just goes to show
    that my "solution" is not one, which I knew. But it does efficiently
    deliver good solutions, where good means large but not optimal.

    Bengt> Brute force seems to work on this little example. Maybe it
    Bengt> can be memoized and optimized and/or whatever to handle
    Bengt> your larger matrix fast enough?

    Thanks for the example. Unfortunately, it is too slow even for
    moderate size matrices (30,10). I've been running it for over two
    hours for a 30x10 matrix and it hasn't finished. And my data are
    1000x100!

    Here is a little code to generate larger matrices for testing....

    from Numeric import zeros, array, Int, put, reshape
    from RandomArray import uniform

    numRows, numCols = 30,10
    numMissing = int(0.05*numRows*numCols) # 5% missing

    X = zeros( (numRows*numCols,) )
    ind = uniform(0, numRows*numCols, (numMissing,)).astype(Int)

    put(X,ind,1)
    X = reshape(X, (numRows,numCols))

    # turn it into a list for your code....
    X = [ [val for val in row] for row in X]
    for row in X: print row

    Last night, I began to formulate the problem as a logic statement,
    hoping this would give me an idea of how to proceed. But no progress
    yet. But I have come to the conclusion that with P ones, brute force
    requires 2^P combinations. With 1000x100 with 5% missing that gives
    me 2^5000. Not good.

    Thanks for your suggestion, though. If you have any more thoughts,
    let me know.

    John Hunter
    John Hunter, Jul 3, 2003
    #4
  5. John Hunter

    John Hunter Guest

    >>>>> "Terry" == Terry Reedy <> writes:


    Terry> Statisticians have tried a variety of approaches. Googling
    Terry> for ' statistics "missing data" 'will give you some leads
    Terry> if you want.

    I have done some searching. I'm familiar with the common methods
    (delete every row that contains any missing, replace missing via mean
    or regression or something clever) but haven't seen any discussion of
    dropping variables and observations together to yield data sets with
    no missing values. Have you seen something like this?

    John Hunter
    John Hunter, Jul 3, 2003
    #5
  6. John Hunter

    Terry Reedy Guest

    "John Hunter" <> wrote in message
    news:...
    > >>>>> "Terry" == Terry Reedy <> writes:

    >
    >
    > Terry> Statisticians have tried a variety of approaches.

    Googling
    > Terry> for ' statistics "missing data" 'will give you some leads
    > Terry> if you want.
    >
    > I have done some searching. I'm familiar with the common methods
    > (delete every row that contains any missing, replace missing via

    mean
    > or regression or something clever) but haven't seen any discussion

    of
    > dropping variables and observations together to yield data sets with
    > no missing values. Have you seen something like this?


    There are also calculation methods for at least some analyses that
    allow for missing data. One of the Google hits is for the book
    Statistical Analysis with Missing Data. I have not seen it, but it is
    not unique.

    As I hinted, there are no really nice solutions to missing data. I
    have done both row and column deletion. Sometimes I have done
    multiple analyses with different deletion strategies: once with enough
    vars deleted so all or most cases are complete, and again with enough
    cases deleted so that all or most vars are complete.

    I would start by counting (with a program) the number of missing
    values for each row and then constructing the frequency distribution
    thereof. Then the same for the columns, with the addition of a
    correlation table or tables.
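
    A rough sketch of that counting step (a sketch, not from the post; it
    assumes X is a 0/1 missing-indicator matrix like the ones used
    elsewhere in this thread):

    def missing_frequencies(X):
        # number of missing values per observation (row) and per variable (column)
        rowcounts = [sum(row) for row in X]
        colcounts = [sum(col) for col in zip(*X)]
        rowfreq, colfreq = {}, {}
        for n in rowcounts: rowfreq[n] = rowfreq.get(n, 0) + 1
        for n in colcounts: colfreq[n] = colfreq.get(n, 0) + 1
        return rowfreq, colfreq   # maps "count of missing" -> "how many rows/cols"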

    One thing one can do with vars is to combine some to make a composite
    measure. For instance, if three variables more-or-less measure the
    same thing, one can combine (perhaps by the mean of those present) to
    make one variable that is present if any of the three are, so it is
    only missing for cases (rows) that are missing all three. This type
    of work requires that you look at the variables and consider their
    meaning, rather than just inputting them into a blind procedure that
    considers all vars to be the same.
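
    A minimal sketch of that composite idea (not from the post; it assumes
    None marks a missing value and that the three related variables sit at
    hypothetical column indices 3, 4 and 5):

    def composite(row, idx=(3, 4, 5)):
        # mean of whichever of the three related variables are present;
        # the composite is missing only if all three are missing
        present = [row[i] for i in idx if row[i] is not None]
        if present:
            return sum(present) / float(len(present))
        return None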

    Terry J. Reedy
    Terry Reedy, Jul 3, 2003
    #6
  7. John Hunter

    Jon Guest

    (Bengt Richter) wrote in message news:<be0b84$89a$0@216.39.172.122>...
    > On Wed, 02 Jul 2003 14:16:57 -0500, John Hunter <> wrote:
    > >I am currently using a hack that works, but it makes me wonder if
    > >there is an optimal solution. I define optimal as the removal of rows
    > >and columns such that there are no missing values and
    > >max(numRows*numCols).


    There must be! For each missing entry you can choose to drop either
    the row or the column - giving a maximum of 2^N possible ways for N
    missing elements. Given that the order in which the rows/columns are
    deleted is unimportant, the number of possibilities to test is less
    than that, but the difficulty is that your decision about row or
    column for one element affects the decision for other elements if they
    share rows or columns. Graph theoretic approaches probably help - it's
    similar to reordering matrix elements to avoid fill-in when
    decomposing sparse matrices. Depending on N, you might be able to test
    all possibilities in a reasonable time, but I have a nasty feeling you
    need to check all 2^M to be sure of an optimal solution, where M is
    the number of the N missing elements that share a row or column with
    another problematic element.

    > >Another way of formulating the question: for a sparse boolean matrix
    > >(sparse on True), what is the optimal way to remove rows and columns
    > >so that the total number of elements in the matrix is maximal and
    > >there are no True values left.


    Sadly I can only give you a method which is certainly not optimal, but
    at least finds something which is probably not too bad, and fairly
    quickly. Uses the Numeric module to speed up, so that a few thousand
    by a few hundred is quickly computable. I'm curious to know if it does
    much better or worse than other algorithms.

    Interesting problem!

    Cheers,

    Jon

    =====================================================================
    from Numeric import *

    def pickrcs(a, debug=0):
        b = array(a, copy=1)          # local copy of array is modified
        dr = zeros(b.shape[0])        # Arrays to note deleted rows...
        dc = zeros(b.shape[1])        # ... and columns
        sizeleft = [b.shape[0], b.shape[1]]   # Remaining dimensions
        while 1:                      # Keep deleting till none left
            sr = sum(b,1)             # row scores = ones in each row
            sc = sum(b,0)             # column scores = ones in each column
            mr = argmax(sr)           # index of highest row score
            mc = argmax(sc)           # index of highest column score
            if sr[mr]==0 and sc[mc]==0:
                break                 # stop when nothing to delete
            # pick the row/column whose deletion loses the fewest useful elements
            if sizeleft[1]-sr[mr] > sizeleft[0]-sc[mc]:
                b[:,mc] = 0           # Zero out column
                dc[mc] = 1            # tag column as deleted
                sizeleft[1] -= 1
            else:
                b[mr,:] = 0           # Zero out row
                dr[mr] = 1
                sizeleft[0] -= 1
        # end of deletions loop - should be no missing elements now
        #
        # paranoia!, check if anything was left after deletions
        if sum(dr)<b.shape[0] and sum(dc)<b.shape[1]:
            dr = compress(dr==0, range(b.shape[0]))   # gives return format of
            dc = compress(dc==0, range(b.shape[1]))   # optimrcs posted by Bengt
            score = dr.shape[0]*dc.shape[0]           # Richter
            return score, dr, dc
        else:
            print "Sorry, didn't manage to let anything survive"
            return 0, 0, 0

    # test code

    X = [
        [1, 0, 0, 0, 1],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0]
    ]

    if __name__ == '__main__':
        a = array(X)
        score, rowsel, colsel = pickrcs(a)
        print "score=", score
        for r in range(5):
            row = X[r]
            for c in range(5):
                if r in rowsel and c in colsel:
                    print '<%s>' % row[c],
                else:
                    print ' %s ' % row[c],
            print
        # now try for a larger problem - use a random pattern
        import RandomArray, time
        start = time.time()
        x0 = 1000                     # problem size
        x1 = 600
        a = RandomArray.random((x0,x1))
        a = where(a>0.999, 1, 0)      # ~ 0.1% of elements are 1
        print "Using a large random matrix"
        print "Number of non-zeros=", sum(ravel(a)), "in", x0, "x", x1, "matrix"
        score, rowsel, colsel = pickrcs(a)
        print 'Score = %s' % (score)
        print 'Number of rows deleted=', a.shape[0]-rowsel.shape[0]
        print 'Number of columns deleted=', a.shape[1]-colsel.shape[0]
        print time.time()-start, " seconds"



    ==sample output (of course second part is random!)===============

    score= 16
    1 0 0 0 1
    <0> <0> 0 <0> <0>
    <0> <0> 0 <0> <0>
    <0> <0> 0 <0> <0>
    <0> <0> 1 <0> <0>
    Using a large random matrix
    Number of non-zeros= 607 in 1000 x 600 matrix
    Score = 331776
    Number of rows deleted= 424
    Number of columns deleted= 24
    3.38999998569 seconds
    Jon, Jul 4, 2003
    #7
  8. John Hunter <> wrote:

    >Another way of formulating the question: for a sparse boolean matrix
    >(sparse on True), what is the optimal way to remove rows and columns
    >so that the total number of elements in the matrix is maximal and
    >there are no True values left.


    After having read Bengt's code (and scraping the parts of my exploded
    head from the floor) and after later getting the hint about the size
    of the problem being 2**N, it finally dawned on me that the problem
    would be related to getting all possible combinations of a range.

    The following code has the advantage that it generates solutions by
    index: for example "M[i]" returns the score and the row and column
    information for index i. (Note that this is not the same as finding
    the optimal solution, but it still could be handy to be able to
    generate a specific one by index.)

    However for small matrices it is possible to do exhaustive searching
    with "max(M)".

    Maybe if this were combined with some heuristic algorithm it would
    do better (a genetic algorithm?).

    Code below (needs Python 2.3, very lightly tested), I hope this helps,

    Anton

    class Combinations:

        def __init__(self, n, k):
            self.n, self.k, self.count = n, k, self.noverk(n,k)

        def __getitem__(self, index):
            # combination number index
            if index > self.count - 1: raise IndexError
            res, x, rest = [], self.n, index
            for i in range(self.k, 0, -1):
                while self.noverk(x,i) > rest: x -= 1
                rest -= self.noverk(x,i)
                res.append(x)
            return res

        def noverk(self, n, k):
            return reduce(lambda a,b: a*(n-b)/(b+1), range(k), 1)

    class AllCombinations:

        def __init__(self, n):
            self.n, self.count = n, 2**n

        def __getitem__(self, index):
            # combination number index for all k in combinations(n,k)
            if index > self.count - 1: raise IndexError
            n, rest = self.n, index
            for k in range(n+1):
                c = Combinations(n,k)
                if rest - c.count < 0:
                    return c[rest]
                rest -= c.count

    class ScoreMatrix:

        def __init__(self, X):
            self.AC = AllCombinations(len(X))
            self.X = X

        def __getitem__(self, i):
            # score selected rows and columns by index
            rows = self.AC[i][::-1]
            if rows:
                Y = [self.X[i] for i in rows]
                Z = [(i,z) for i,z in enumerate(zip(*Y)) if 1 not in z]
                cols = [i for i,z in Z]
                score = sum(map(len, [z for i,z in Z]))
                return score, rows, cols
            else: return 0, [], []

    def test():
        X = [ [1, 0, 0, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0], ]

        M = ScoreMatrix(X)
        sc, rows, cols = max(M)
        print sc
        for i, r in enumerate(X):
            for j, c in enumerate(r):
                if i in rows and j in cols: print c,
                else: print 'x',
            print

    if __name__=='__main__':
        test()

    output:

    16
    x x x x x
    0 0 x 0 0
    x x x x x
    0 0 x 0 0
    0 0 x 0 0
    0 0 x 0 0
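
    The by-index access also allows a cheap random search instead of the
    exhaustive max(M) (a sketch, not from the post, along the lines of the
    heuristic idea mentioned above; it assumes the ScoreMatrix class as
    defined here):

    import random

    def random_search(M, tries=10000):
        # sample combination indices at random and keep the best score seen
        best = (0, [], [])
        for _ in range(tries):
            i = random.randrange(M.AC.count)
            best = max(best, M[i])
        return best
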
    Anton Vredegoor, Jul 4, 2003
    #8
  9. John Hunter <> wrote:

    >>>>>> "Bengt" == Bengt Richter <> writes:

    > Bengt> Brute force seems to work on this little example. Maybe it
    > Bengt> can be memoized and optimized and/or whatever to handle
    > Bengt> your larger matrix fast enough?
    >
    >Thanks for the example. Unfortunately, it is too slow even for
    >moderate size matrices (30,10). I've been running it for over two
    >hours for a 30x10 matrix and it hasn't finished. And my data are
    >1000x100!


    I posted a somewhat long version before (inspired by Bengt's idea) but
    I remembered an easier way to generate all combinations, and I also
    noticed that there is freedom in choosing whether to analyze rows or
    columns, which can be computationally advantageous. Below is a version
    that depends mostly on 2**N where N is the smaller of those two values.
    Unfortunately 2**100 is still way too big a number, but my code can now
    routinely solve 100x15 matrices ...

    >Last night, I began to formulate the problem as a logic statement,
    >hoping this would give me an idea of how to proceed. But no progress
    >yet. But I have come to the conclusion that with P ones, brute force
    >requires 2^P combinations. With 1000x100 with 5% missing that gives
    >me 2^5000. Not good.


    IMO it depends (except for the amount of time spent in copying data,
    of course) on the smaller of the number of rows or columns, so that's
    2**100 in this case. Maybe for some matrices, your number is
    preferable?

    >Thanks for your suggestion, though. If you have any more thoughts,
    >let me know.


    I hope you don't mind me following this too :) Nice problem, and I'm
    not as sure about my math as I sound above; if anyone can prove me
    right or wrong I'd be happy anyway ...

    Anton

    ---

    class ScoreMatrix:

        def __init__(self, X):
            n1, n2 = len(X), len(X[0])
            if n2 < n1:
                self.X = zip(*X)
                self.rotated = True
                n = n2
            else:
                self.X = X
                n = n1
                self.rotated = False
            self.R = range(n)
            self.count = 2**n

        def __getitem__(self, i):
            # score selected rows and columns by index
            if not (-1 < i < self.count): raise IndexError
            rows = [x for j,x in enumerate(self.R) if 1<<j & i]
            if rows:
                Y = [self.X[i] for i in rows]
                Z = [(i,z) for i,z in enumerate(zip(*Y)) if 1 not in z]
                cols = [i for i,z in Z]
                score = sum(map(len, [z for i,z in Z]))
                if self.rotated: return score, cols, rows
                else: return score, rows, cols
            else: return 0, [], []

    def test():
        from random import random, randint
        f = .05
        r, c = 100, 15
        X = []
        for i in range(r):
            X.append([])
            for j in range(c):
                X[i].append(int(random()<f))
        M = ScoreMatrix(X)
        highscore = 0, [], []
        for i, sc in enumerate(M):
            if sc > highscore:
                mi, highscore = i, sc
        sc, rows, cols = highscore
        print "maximum score: %s" % (sc)
        print "index of this score: %s" % (mi)
        print "tabledata:"
        for i, r in enumerate(X):
            for j, c in enumerate(r):
                if i in rows and j in cols: print c,
                else: print 'x',
            print

    if __name__=='__main__':
        test()
    Anton Vredegoor, Jul 5, 2003
    #9
  10. Following Bengt's and Anton Vredegoor's leads, the following code can be
    fast (at times). It is quite sensitive to the probability of
    non-zeroness (.01 works well; .05 is nowhere near so nice).

    I get 2.25 min to do 25 x 25 at .05
          2.75 min to do 30 x 30 at .05
    It gets slow _very_ fast, but gets good numbers if the probability is low:
          .25 min to do 45 x 45 at .01
          1.5 min to do 50 x 50 at .01


    -Scott David Daniels
    (not fully back up yet, though).


    Here's the code:
    ################################
    # myopt.py
    __version__ = '0.3'

    try:
        assert list(enumerate('ab')) == [(0, 'a'), (1, 'b')]
    except NameError:
        def enumerate(iterable):
            # Not exact, but close enough
            lst = list(iterable)
            return zip(range(len(lst)), lst)


    def bitcount(row):
        '''Return the number of on bits in the integer argument'''
        assert row >= 0
        result = 0
        while row:
            result += 1
            lsbit = row & -row
            row ^= lsbit
        return result


    def rowencode(vector):
        '''convert from a buncha numbers to a bit-representation'''
        bit = 1L
        result = 0
        for element in vector:
            if element:
                result |= bit
            bit <<= 1
        return result


    def rowdecode(value, mask=-1):
        '''convert from a bit-rep to a buncha numbers (at least as long as mask)'''
        if mask < 0: mask = value
        assert value >= 0
        result = []
        while mask or value:
            result.append(value & 1)
            mask >>= 1
            value >>= 1
        return result


    class Answer(object):
        '''An answer represents a result calculation.'''
        __slots__ = 'rows', 'colmask', '_score'
        totalrequests = 0
        totalcalcs = 0

        def __init__(self, colmask, rows):
            '''The columns in colmask are the ones we keep.'''
            self.colmask = colmask
            self.rows = rows
            self._score = None

        def score(self):
            '''Calculate the score lazily'''
            self.__class__.totalrequests += 1
            if self._score is None:
                self.__class__.totalcalcs += 1
                self._score = bitcount(self.colmask) * len(self.rows)
            return self._score

        def __repr__(self):
            return '%s(%d:%s, %x):%s' % (self.__class__.__name__,
                                         len(self.rows), self.rows,
                                         self.colmask, self._score)


    totalcalls = 0

    def choose(rows, keepcols, N=0, keeprows=None):
        '''Choose rows and columns to keep. Return an Answer for the choice'''
        global totalcalls
        totalcalls += 1
        if keeprows is None:
            keeprows = []
        try:
            while 0 == rows[N] & keepcols:
                keeprows.append(N)
                N += 1
        except IndexError:
            return Answer(keepcols, keeprows)

        # a difference: some kept columns in this row are non-0
        # must drop either those columns or this row

        # Calculate result if we keep this row (drop problem columns)
        withrow = choose(rows, keepcols & ~rows[N], N+1, keeprows + [N])

        # Calculate result if we drop this row
        skiprow = choose(rows, keepcols, N+1, keeprows)

        # Choose the better answer (keep the row if everything is equal).
        if (withrow.colmask == skiprow.colmask
                or withrow.score() >= skiprow.score()):
            return withrow
        else:
            return skiprow


    # The data used from the example
    X = [ [1, 0, 0, 0, 1],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 1, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [0, 0, 1, 0, 0] ]


    def printrep(row, symbols, mask=0):
        '''A row representing a single row (column-masked by mask)'''
        assert mask >= 0
        result = []
        for element in row:
            result.append(symbols[(1 & mask) * 2 + (element != 0)])
            mask >>= 1
        assert mask == 0    # mask doesn't extend beyond data.
        return ''.join(result)


    def printanswer(data, rows, keepcols):
        '''Print the represented rows'''
        toohuge = len(data)
        rowqueue = rows + [toohuge]
        rowqueue.reverse()
        nextrow = rowqueue.pop()
        for rownumber, row in enumerate(data):
            if nextrow > rownumber:
                # This row was zapped
                print '#', printrep(row, '01OI', keepcols)
            else:
                assert rownumber == nextrow    # This row was kept
                nextrow = rowqueue.pop()
                print '_', printrep(row, '01~@', keepcols)
        assert nextrow == toohuge and not rowqueue


    def getanswer(data):
        '''Calculate the best-cut for a particular matrix'''
        columns = max([len(row) for row in data])
        rowdata = [rowencode(row) for row in data]
        return choose(rowdata, (1L << columns) - 1)


    def main(data=X):
        global totalcalls

        totalcalls = 0
        answer = getanswer(data)
        print 'Requested: %s, Calculated: %s,' % (
            Answer.totalrequests, Answer.totalcalcs),
        print 'answer: %r,' % answer,
        print 'Score: %s' % answer.score()
        print 'Total choose calls required: %s' % totalcalls

        printanswer(data, answer.rows, answer.colmask)


    def rangen(rows, columns, probability=0.05):
        '''create a rows by columns data table with 1s at the given probability'''
        import random
        result = []
        for row in range(rows):
            result.append([int(random.random() < probability)
                           for column in range(columns)])
        return result


    if __name__ == '__main__':
        import sys
        assert getanswer([[0]]).score() == 1
        assert getanswer([[0,1], [1,0]]).score() == 1
        assert getanswer([[0,1,0], [1,0,0]]).score() == 2
        if len(sys.argv) < 2:
            main(X)
        else:
            args = sys.argv[1:]
            if '.' in args[-1]:
                assert len(args) > 1
                probability = float(args.pop())
            else:
                probability = .2

            rows = int(args[0])
            if len(args) == 1:
                cols = rows
            else:
                assert len(args) == 2
                cols = int(args[1])
            main(rangen(rows, cols, probability))
    Scott David Daniels, Jul 6, 2003
    #10
  11. In article <3f07bd72$>,
    (Scott David Daniels) writes:
    > Folllowing Bengt and (Anton Vredegoor)'s leads,
    > the following code can be fast (at times). It is quite sensitive to the
    > probability of non-zeroness, (.01 works well, the .o5 is nowhere near so
    > nice).
    >
    > I get 2.25 min to do 25 x 25 at .05
    > 2.75 min to do 30 x 30 at .05
    > It gets slow _very_ fast, but gets good numbers if the probability is low
    > .25 min to do 45 x 45 at .01
    > 1.5 min to do 50 x 50 at .01


    OK, a bit faster: (bitcount faster, allow choose to sometimes quit early).
    I get 0.25 min to do 25 x 25 at .05
    5.5 min to do 30 x 30 at .05 (?! maybe above was wrong)
    and:
    1 sec to do 45 x 45 at .01
    9 sec to do 50 x 50 at .01

    -Scott David Daniels


    ###################################

    # myopt.py

    __version__ = '0.4'
    try:
        assert list(enumerate('ab')) == [(0, 'a'), (1, 'b')]
    except NameError:
        def enumerate(iterable):
            lst = list(iterable)
            return zip(range(len(lst)), lst)


    def bitcount(row):
        '''Return the number of on bits in the integer'''
        assert row >= 0
        result = 0
        while row:
            result += 1
            lsbit = row & -row
            row ^= lsbit
        return result

    bytebits = [bitcount(n) for n in range(256)]    # precalculate byte counts

    def bitcount(row):    # replace previous bitcount
        '''Return the number of on bits in the integer (byte at a time)'''
        assert row >= 0
        result = bytebits[row & 255]
        while row >= 256:
            row >>= 8
            result += bytebits[row & 255]
        return result


    def rowencode(vector):
        '''convert from a buncha numbers to a bit-representation'''
        bit = 1L
        result = 0
        for element in vector:
            if element:
                result |= bit
            bit <<= 1
        return result


    class Answer(object):
        '''An answer represents a result calculation.'''
        __slots__ = 'rows', 'colmask', '_score'
        totalrequests = 0
        totalcalcs = 0

        def __init__(self, colmask, rows):
            '''The columns in colmask are the ones we keep.'''
            self.colmask = colmask
            self.rows = rows
            self._score = None

        def score(self):
            '''Calculate the score lazily'''
            self.__class__.totalrequests += 1
            if self._score is None:
                self.__class__.totalcalcs += 1
                self._score = bitcount(self.colmask) * len(self.rows)
            return self._score

        def __repr__(self):
            return '%s(%d:%s, %x):%s' % (self.__class__.__name__,
                                         len(self.rows), self.rows,
                                         self.colmask, self._score)


    totalcalls = 0

    def choose(rows, keepcols, N=0, keeprows=None, best=None):
        '''Choose rows and columns to keep. Return an Answer for the choice'''
        global totalcalls
        totalcalls += 1
        if keeprows is None:
            keeprows = []
        try:
            while 0 == rows[N] & keepcols:
                keeprows.append(N)
                N += 1
        except IndexError:
            return Answer(keepcols, keeprows)

        # a difference: some kept columns in this row are non-0
        # must drop either those columns or this row

        # Calculate result if we keep this row (drop problem columns)
        newmask = keepcols & ~rows[N]
        if best and newmask == best.colmask:
            withrow = best    # Already have a calculation with this mask; use it.
        else:
            withrow = choose(rows, keepcols & ~rows[N], N+1, keeprows + [N], best)

        # Calculate result if we drop this row
        skiprow = choose(rows, keepcols, N+1, keeprows, best or withrow)

        # Choose the better answer (keep the row if everything is equal).
        if (withrow.colmask == skiprow.colmask
                or withrow.score() >= skiprow.score()):
            return withrow
        else:
            return skiprow


    # The data used from the example
    X = [ [1, 0, 0, 0, 1],
          [0, 0, 0, 0, 0],
          [0, 0, 0, 1, 0],
          [0, 0, 0, 0, 0],
          [0, 0, 1, 0, 0],
          [0, 0, 1, 0, 0] ]


    def printrep(row, symbols, mask=0):
        '''A row representing a single row (column-masked by mask)'''
        assert mask >= 0
        result = []
        for element in row:
            result.append(symbols[(1 & mask) * 2 + (element != 0)])
            mask >>= 1
        assert mask == 0    # mask doesn't extend beyond data.
        return ''.join(result)


    def printanswer(data, rows, keepcols):
        '''Print the represented rows'''
        toohuge = len(data)
        rowqueue = rows + [toohuge]
        rowqueue.reverse()
        nextrow = rowqueue.pop()
        for rownumber, row in enumerate(data):
            if nextrow > rownumber:
                # This row was zapped
                print '#', printrep(row, '01OI', keepcols)
            else:
                assert rownumber == nextrow    # This row was kept
                nextrow = rowqueue.pop()
                print '_', printrep(row, '01~@', keepcols)
        assert nextrow == toohuge and not rowqueue


    def getanswer(data):
        '''Calculate the best-cut for a particular matrix'''
        columns = max([len(row) for row in data])
        rowdata = [rowencode(row) for row in data]
        return choose(rowdata, (1L << columns) - 1)


    def main(data=X):
        global totalcalls

        totalcalls = 0
        answer = getanswer(data)
        print 'Requested: %s, Calculated: %s,' % (
            Answer.totalrequests, Answer.totalcalcs),
        print 'answer: %r,' % answer,
        print 'Score: %s' % answer.score()
        print 'Total choose calls required: %s' % totalcalls

        printanswer(data, answer.rows, answer.colmask)


    def rangen(rows, columns, probability=0.05):
        '''create a rows by columns data table with 1s at the given probability'''
        import random
        result = []
        for row in range(rows):
            result.append([int(random.random() < probability)
                           for column in range(columns)])
        return result


    if __name__ == '__main__':
        import sys
        assert getanswer([[0]]).score() == 1
        assert getanswer([[0,1], [1,0]]).score() == 1
        assert getanswer([[0,1,0], [1,0,0]]).score() == 2
        if len(sys.argv) < 2:
            main(X)
        else:
            args = sys.argv[1:]
            if '.' in args[-1]:
                assert len(args) > 1
                probability = float(args.pop())
            else:
                probability = .2

            rows = int(args[0])
            if len(args) == 1:
                cols = rows
            else:
                assert len(args) == 2
                cols = int(args[1])
            main(rangen(rows, cols, probability))
    Scott David Daniels, Jul 6, 2003
    #11
  12. John Hunter

    John Hunter Guest

    >>>>> "Jon" == Jon <> writes:

    Jon> Sadly I can only give you a method which is certainly not
    Jon> optimal, but at least finds something which is probably not
    Jon> too bad, and fairly quickly. Uses the Numeric module to speed
    Jon> up, so that a few thousand by a few hundred is quickly
    Jon> computable. I'm curious to know if it does much better or
    Jon> worse than other algorithms.

    Hi Jon, thanks for the suggestion. For my data set, your solution
    performed a little better size-wise than my poor man's approach, which
    I gave in the original post. And it is certainly fast enough. I
    liked your idea of zero-ing out the elements when you made a decision
    to delete a row or column. When I was thinking about solutions using
    Numeric, I kept running up against the problem of deleting a row or
    column (which entails a whole array copy so is expensive).
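
    (A small illustration of that trade-off, not from the post, in Numeric
    terms:)

    from Numeric import array, take, arange

    X = array([[1, 0], [0, 0], [0, 1]])
    keep = arange(1, X.shape[0])
    X_smaller = take(X, keep)   # "deleting" row 0 allocates a whole new array
    X[0,:] = 0                  # zeroing it out instead modifies X in place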

    If anyone wants to try their algorithm on "real world data" which,
    unlike the ones discussed here, have lots of correlation between
    missing rows and columns that might be exploited, I have uploaded the
    missing data matrix to a web server. It can be loaded with:

    from urllib import urlopen
    url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
    L = [ [ int(val) for val in line.split()]
    for line in urlopen(url).readlines()]
    # L is 1072x83

    I think I may adapt your algorithm to terminate when the percent
    missing falls below some threshold, removing the requirement that
    *all* missing values be removed. This will drop the worst offenders
    observation or variable wise. Then I can use a regression approach to
    fill in the remaining ones.
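
    A simple stand-in for that last fill-in step (a sketch, not John's
    code: it uses the column mean of the observed values rather than a
    regression):

    def fill_with_column_means(data, missing):
        # data: rows of floats; missing: parallel 0/1 indicator matrix (1 = missing)
        nrows, ncols = len(data), len(data[0])
        for j in range(ncols):
            present = [data[i][j] for i in range(nrows) if not missing[i][j]]
            if not present:
                continue              # variable entirely missing; leave it alone
            colmean = sum(present) / float(len(present))
            for i in range(nrows):
                if missing[i][j]:
                    data[i][j] = colmean
        return data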

    Jon> Interesting problem!

    I agree. There are lots of extensions to the original formulation
    that are relevant to the missing value problem. As we have seen in
    the posts here, there are good exact solutions for smallish matrices,
    but for larger ones a statistical approach is required. The right
    statistical solution surely depends on the correlation structure of
    the matrix.

    Also, it would be nice to generalize to higher dimensions (eg 3D
    matrices). For my problem, I have N observations of M variables over
    D days. So this is the 3D version of the 2D problem above. Your
    solution is readily adaptable to this case. More generally, you might
    want to impose extra requirements, like, no days can be dropped, or at
    most 10% of the days can be dropped, or if you want to compute changes
    in variables over days, that no days can be dropped.

    The more general formulation is something like

    Given an N-D (N0, M0, P0, ...) boolean matrix X with a covariance
    matrix C, what is the largest submatrix of X of dimensions (N1, M1,
    P1, ...) such that the fraction of remaining dimensions satisfies

    (N1/N0, M1/M0, P1/P0, ...) >= (alpha, beta, gamma, ..)

    and the total percent true <= p.

    The original problem is a special case of this where alpha=0, beta=0,
    p=0 and C unspecified.

    Anyone looking for a dissertation project :) ?

    John Hunter
    John Hunter, Jul 7, 2003
    #12
  13. John Hunter

    John Hunter Guest

    >>>>> "John" == John Hunter <> writes:

    John> I think I may adapt your algorithm to terminate when the
    John> percent missing falls below some threshold, removing the
    John> requirement that *all* missing values be removed. This will
    John> drop the worst offenders observation or variable wise. Then
    John> I can use a regression approach to fill in the remaining
    John> ones.


    I think I have found a major improvement on your algorithm Jon, by
    changing the rule which determines whether to drop a row or a column
    after you have found the worst offenders for each. The original
    approach was to keep the one with the most valid (zero) elements,
    which makes sense since that is the quantity you are trying to
    maximize. But consider the case where there is correlation so that
    true values on a row or column tend to be correlated with other true
    values on that row or column, as in

    X = 1 0
        1 0
        1 0
        1 0
        1 0
        0 0
        0 0

    Your algorithm would keep the column, since keeping a row gives you
    only one zero whereas keeping a column gives you 2. By looking at the
    *fraction* of ones in the worst offending row and column, you can do
    much better. By deleting the row or column with the greatest fraction
    of ones, you reduce the likelihood that you'll encounter a one in
    future iterations. Using this approach, I was able to find a 893x67
    submatrix with a score of 59831 versus the 462x81 submatrix with a
    score of 37422.

    I have also added thresholds to the code, so that you can terminate
    with a fraction of missing values remaining. By setting the
    thresholds to zero, you get the special case of no missing values.

    The example below also includes a no-copy optimization that Jon
    emailed me off list. The algorithm run time for a 1073x82 matrix is
    around 30ms on my 700MHz machine.

    John Hunter

    from __future__ import division
    import sys, time
    from urllib import urlopen
    from Numeric import *

    def trim_missing(a, thresh=(0.05,0.05) ):
        """
        a is a numObservations by numVariables boolean matrix

        thresh = (othresh, vthresh) are the observation and variable
        thresholds that trigger deletion

        remove any rows where percent missing vars > vthresh
        remove any cols where percent missing obs > othresh

        This is done iteratively, rechecking the percent missing on each
        iteration of row or col removal, with the largest offender removed
        on each iteration

        return value is score, rowInd, colInd where the two ind arrays
        are the indices of the rows and cols that were kept
        """
        othresh, vthresh = thresh      # thresholds for deletion
        dr = zeros(a.shape[0])         # Arrays to note deleted rows...
        dc = zeros(a.shape[1])         # ... and columns
        sizeleft = list(a.shape)       # Remaining dimensions
        sr = sum(a,1)                  # row scores = ones in each row
        sc = sum(a,0)                  # column scores = ones in each column
        while 1:                       # Keep deleting till none left
            mr = argmax(sr)            # index of highest row score
            mc = argmax(sc)            # index of highest column score
            fracRow = sr[mr]/sizeleft[1]
            fracCol = sc[mc]/sizeleft[0]

            if fracRow<=vthresh and fracCol<=othresh:
                break                  # stop when missing criteria are satisfied
            if fracRow-vthresh<fracCol-othresh:
                # deleting the column
                sr = sr - a[:,mc]      # row scores reduced by contents of column
                sc[mc] = 0             # sum of column now zero
                dc[mc] = 1             # tags column as deleted
                sizeleft[1] -= 1
            else:
                # deleting the row
                sc = sc - a[mr,:]      # column scores reduced by contents of row
                sr[mr] = 0             # sum of row is now zero
                dr[mr] = 1             # tags row as deleted
                sizeleft[0] -= 1

        score = (a.shape[0]-sum(dr))*(a.shape[1]-sum(dc))
        dr = compress(dr==0, range(a.shape[0]))   # gives return format of
        dc = compress(dc==0, range(a.shape[1]))   # optimrcs posted by Bengt

        return score, dr, dc


    url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
    L = [ [ int(val) for val in line.split()]
          for line in urlopen(url).readlines()]
    a = array(L)
    start = time.time()
    score, goodRows, goodCols = trim_missing(a, thresh=(0.0, 0.0))
    end = time.time()
    print end-start
    print 'Original size', a.shape
    print 'Number of non-zeros in', sum(ravel(a))
    print 'Reduced size', len(goodRows), len(goodCols)
    print 'Score', score
    John Hunter, Jul 7, 2003
    #13
  14. On Mon, 07 Jul 2003 11:00:11 -0500, John Hunter <> wrote:

    >>>>>> "Jon" == Jon <> writes:

    >
    > Jon> Sadly I can only give you a method which is certainly not
    > Jon> optimal, but at least finds something which is probably not
    > Jon> too bad, and fairly quickly. Uses the Numeric module to speed
    > Jon> up, so that a few thousand by a few hundred is quickly
    > Jon> computable. I'm curious to know if it does much better or
    > Jon> worse than other algorithms.
    >
    >Hi Jon, thanks for the suggestion. For my data set, your solution
    >performed a little better size-wise than my poor man's approach, which
    >I gave in the original post. And it is certainly fast enough. I
    >liked your idea of zero-ing out the elements when you made a decision
    >to delete a row or column. When I was thinking about solutions using
    >Numeric, I kept running up against the problem of deleting a row or
    >column (which entails a whole array copy so is expensive).
    >
    >If anyone wants to try their algorithm on "real world data" which,
    >unlike the ones discussed here, have lots of correlation between
    >missing rows and columns that might be exploited, I have uploaded the
    >missing data matrix to a web server. It can be loaded with:
    >
    > from urllib import urlopen
    > url = 'http://nitace.bsd.uchicago.edu:8080/summer/jdh/missing.dat'
    > L = [ [ int(val) for val in line.split()]
    > for line in urlopen(url).readlines()]
    > # L is 1072x83
    >

    The statistics on this data are interesting, referring to counts of 1's:
    (not very tested)

    [15:31] C:\pywk\clp\submx>python showmx.py -f missing.dat -counts
    nr=1072, nc=83
    totcc=6239, totrc=6239 => *100./(83*1072) => 7.01
    count:freq -- cols with count ones occurred freq times
    0:16 1:4 8:3 14:2 16:4 20:4 21:3 23:1
    25:6 27:1 28:7 32:4 41:1 59:1 60:3 62:3
    65:1 95:2 96:1 108:3 125:1 134:1 158:1 177:1
    199:1 206:2 237:1 241:1 252:1 254:1 1061:2
    count:freq -- rows with count ones occurred freq times
    0:3 1:2 2:460 3:67 4:140 5:66 6:42 7:49
    8:61 9:17 10:46 11:23 12:8 13:8 14:12 15:10
    16:3 17:5 18:1 19:7 20:4 21:2 22:5 23:1
    27:2 28:2 30:2 31:1 33:1 34:1 36:2 38:1
    40:1 41:1 44:4 45:7 48:1 55:2 56:2

    And sorted per descending frequency of number of cols and rows with particular counts:

    [15:31] C:\pywk\clp\submx>python showmx.py -f missing.dat -counts -freq
    nr=1072, nc=83
    totcc=6239, totrc=6239 => *100./(83*1072) => 7.01
    freq:count -- cols with count ones occurred freq times
    16:0 7:28 6:25 4:32 4:20 4:16 4:1 3:108
    3:62 3:60 3:21 3:8 2:1061 2:206 2:95 2:14
    1:254 1:252 1:241 1:237 1:199 1:177 1:158 1:134
    1:125 1:96 1:65 1:59 1:41 1:27 1:23
    freq:count -- rows with count ones occurred freq times
    460:2 140:4 67:3 66:5 61:8 49:7 46:10 42:6
    23:11 17:9 12:14 10:15 8:13 8:12 7:45 7:19
    5:22 5:17 4:44 4:20 3:16 3:0 2:56 2:55
    2:36 2:30 2:28 2:27 2:21 2:1 1:48 1:41
    1:40 1:38 1:34 1:33 1:31 1:23 1:18

    There are two columns with 1061 ones out of a possible 1072, so they account for 2
    in most of the row counts. Hence the most popular row count (2) mostly goes to 0.
    Interestingly, 4(2) is 'way more popular than 3(1). This doesn't look like what
    I'd expect from uniform overall probability of some percentage. Discounting the
    stuck columns, no ones shouldn't be more likely than one one, since there's only
    one way to get all zeroes, but there's then 81 equally probable ways to get one one.
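
    A quick sanity check of that argument (a sketch, not from the post,
    using the counts quoted above: 6239 ones in total, two stuck columns
    with 1061 each, 81 free columns):

    def binom_pmf(k, n, p):
        c = reduce(lambda a, b: a*(n-b)/(b+1.), range(k), 1.)   # n choose k
        return c * p**k * (1-p)**(n-k)

    n = 81
    p = (6239 - 2*1061) / (81.*1072)     # ~4.7% ones outside the stuck columns
    for k in range(5):
        print k, round(binom_pmf(k, n, p), 4)
    # under a uniform p the mode is around 3-4 ones per row, not the
    # observed spike at 2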

    Look at a uniform 5% on 1072 x 83 (twice to show variation).
    There's a distribution with tails, centered on 5% of 1072 and 83 it seems:

    [17:48] C:\pywk\clp\submx>showmx.py 1072 83 5 -counts
    nr=1072, nc=83
    totcc=4338, totrc=4338 => *100./(83*1072) => 4.88
    count:freq -- cols with count ones occurred freq times
    35:1 39:1 40:2 41:1 42:5 43:1 45:5 46:3
    47:2 48:6 49:3 50:5 51:5 52:5 53:5 54:5
    55:5 56:5 58:3 59:4 61:1 62:2 63:1 64:1
    65:3 70:1 78:2
    count:freq -- rows with count ones occurred freq times
    0:21 1:74 2:134 3:219 4:218 5:176 6:113 7:64
    8:29 9:15 10:6 11:2 12:1

    [17:48] C:\pywk\clp\submx>showmx.py 1072 83 5 -counts
    nr=1072, nc=83
    totcc=4394, totrc=4394 => *100./(83*1072) => 4.94
    count:freq -- cols with count ones occurred freq times
    37:1 40:1 41:3 43:1 44:2 45:2 46:5 47:5
    48:1 49:3 50:6 51:8 52:7 53:7 54:3 55:3
    56:4 57:3 58:3 59:1 60:1 61:1 62:1 63:1
    64:3 65:2 68:2 69:1 70:1 72:1
    count:freq -- rows with count ones occurred freq times
    0:19 1:58 2:150 3:218 4:218 5:173 6:111 7:63
    8:38 9:13 10:5 11:5 12:1


    >I think I may adapt your algorithm to terminate when the percent
    >missing falls below some threshold, removing the requirement that
    >*all* missing values be removed. This will drop the worst offenders
    >observation or variable wise. Then I can use a regression approach to
    >fill in the remaining ones.
    >
    > Jon> Interesting problem!

    Yes. So many, so little time ... ;-/

    >
    >I agree. There are lots of extensions to the original formulation
    >that are relevant to the missing value problem. As we have seen in
    >the posts here, there are good exact solutions for smallish matrices,
    >but for larger ones a statistical approach is required. The right
    >statistical solution surely depends on the correlation structure of
    >the matrix.

    I'm pretty sure there are better exact algorithms than the brute force
    one I first posted, but I haven't done a new one since, just some little
    things to show things about the data, as above.
    >
    >Also, it would be nice to generalize to higher dimensions (eg 3D
    >matrices). For my problem, I have N observations of M variables over
    >D days. So this is the 3D version of the 2D problem above. Your
    >solution is readily adaptable to this case. More generally, you might
    >want to impose extra requirements, like, no days can be dropped, or at
    >most 10% of the days can be dropped, or if you want to compute changes
    >in variables over days, that no days can be dropped.
    >

    I'm wondering if you might not have other considerations about which
    data points to drop too. E.g., if an observation makes the difference
    between being able to invert a matrix or not, you might opt to drop
    several other things instead, even though in terms of count the set
    would be less optimum. (Curiousness: what do your observations represent?)

    >The more general formulation is something like
    >
    > Given an N-D (N0, M0, P0, ...) boolean matrix X with a covariance
    > matrix C, what is the largest submatrix of X of dimensions (N1, M1,
    > P1, ...) such that the fraction of remaining dimensions satisfies
    >
    > (N1/N0, M1/M0, P1/P0, ...) >= (alpha, beta, gamma, ..)
    >
    > and the total percent true <= p.
    >
    >The original problem is a special case of this where alpha=0, beta=0,
    >p=0 and C unspecified.
    >
    >Anyone looking for a dissertation project :) ?
    >
    >John Hunter
    >


    Regards,
    Bengt Richter
    Bengt Richter, Jul 8, 2003
    #14
  15. John Hunter

    John Hunter Guest

    >>>>> "Bengt" == Bengt Richter <> writes:

    Bengt> Curiousness: what do your observations represent?

    By day, I do research in neurology and neurobiology. By night, stock
    market analysis: academia doesn't pay! I have written several modules
    to parse the Yahoo Finance pages, eg,
    http://biz.yahoo.com/z/a/x/xoma.html and
    http://biz.yahoo.com/p/x/xoma.html. The 2 fields which you identified
    as missing on all but 2 observations are Revenue Estimates: "Year Ago
    Sales" and "Sales Growth" for This Year, both of which are N/A for
    almost all tickers.

    Jon> Interesting problem!
    Bengt> Yes. So many, so little time ... ;-/

    Exactly.

    JDH
    John Hunter, Jul 8, 2003
    #15
  16. John Hunter

    John Hunter Guest

    >>>>> "John" == John Hunter <> writes:
    John>Using this approach, I was able to find a 893x67 submatrix
    John>with a score of 59831 versus the 462x81 submatrix with a
    John>score of 37422.

    Onwards and upwards.... Since this is beginning to look like a
    gradient descent algorithm, it is natural to worry about local minima.
    So I followed the lead of simulated annealing algorithms and added a
    little noise to the step where the decision to delete the row or
    column is made, eg,

    from MLab import randn

    def trim_missing(a, thresh=(0.05,0.05), sigma=0.005 ):
        ...snip...
        if fracRow-vthresh+sigma*randn()<fracCol-othresh:
            # delete the column
        else:
            # delete the row

    where sigma is like the temperature parameter of a simulated annealing
    algorithm.

    Repeatedly stepping over a variety of sigmas in search of a minimum
    has yielded minor improvement for the dataset of interest. Best
    results:

    Deterministic: 59831 893 67
    Random: 59928 908 66 for sigma=0.0017

    For random arrays the results are more striking. For random matrices
    500x500, comparing the scores of the deterministic (sigma=0) versus
    random (best sigma in [0,0.01]) searches:

    % ones   mean determ score   mean random score   mean(ratio)   std(ratio)
    1%       41753.2             42864.6             1.03          0.0093
    5%        5525.8              6174.2             1.12          0.06
    10%       1968.3              2193.9             1.12          0.05

    For a matrix of the same number of total elements, but unbalanced in
    the dimensions, eg, 1000x250, the results are more pronounced, in
    favor of the random search algorithm

    % ones   mean determ score   mean random score   mean(ratio)   std(ratio)
    1%       48724.0             51451.8             1.06          0.018
    5%        6340.5              7542.3             1.19          0.038
    10%      2092.5               2458.6             1.180         0.091

    Below is a script that performs a random search of a random matrix for
    a variety of sigmas.

    JDH

    from __future__ import division
    import sys, time
    from urllib import urlopen
    from Numeric import *
    from MLab import randn, rand, mean, std

    def trim_missing(a, thresh=(0.05,0.05), sigma=0.005 ):
        """
        a is a numObservations by numVariables boolean matrix

        thresh = (othresh, vthresh) are the observation and variable
        thresholds that trigger deletion

        remove any rows where percent missing vars > vthresh
        remove any cols where percent missing obs > othresh

        This is done iteratively, rechecking the percent missing on each
        iteration of row or col removal, with the largest offender removed
        on each iteration

        return value is score, rowInd, colInd where the two ind arrays
        are the indices of the rows and cols that were kept
        """
        othresh, vthresh = thresh      # thresholds for deletion
        dr = zeros(a.shape[0])         # Arrays to note deleted rows...
        dc = zeros(a.shape[1])         # ... and columns
        sizeleft = list(a.shape)       # Remaining dimensions
        sr = sum(a,1)                  # row scores = ones in each row
        sc = sum(a,0)                  # column scores = ones in each column
        while 1:                       # Keep deleting till none left
            mr = argmax(sr)            # index of highest row score
            mc = argmax(sc)            # index of highest column score
            fracRow = sr[mr]/sizeleft[1]
            fracCol = sc[mc]/sizeleft[0]

            if fracRow<=vthresh and fracCol<=othresh:
                break                  # stop when missing criteria are satisfied
            if fracRow-vthresh+sigma*randn()<fracCol-othresh:
                # deleting the column
                sr = sr - a[:,mc]      # row scores reduced by contents of column
                sc[mc] = 0             # sum of column now zero
                dc[mc] = 1             # tags column as deleted
                sizeleft[1] -= 1
            else:
                # deleting the row
                sc = sc - a[mr,:]      # column scores reduced by contents of row
                sr[mr] = 0             # sum of row is now zero
                dr[mr] = 1             # tags row as deleted
                sizeleft[0] -= 1

        score = (a.shape[0]-sum(dr))*(a.shape[1]-sum(dc))
        dr = compress(dr==0, range(a.shape[0]))   # gives return format of
        dc = compress(dc==0, range(a.shape[1]))   # optimrcs posted by Bengt

        return score, dr, dc


    results = []
    for trial in range(10):
        a = rand(1000,250)
        a = where(a>0.90, 1, 0)

        score0, goodRows, goodCols = trim_missing(a, thresh=(0,0), sigma=0)
        maxScore = 0, score0, len(goodRows), len(goodCols)
        print 'Deterministic:', score0, len(goodRows), len(goodCols)
        for sigma in arange(0.0001, 0.01, 0.001):
            for i in range(10):
                score, goodRows, goodCols = trim_missing(
                    a, thresh=(0,0), sigma=sigma)
                if score>maxScore[1]:
                    #print '\t', sigma, score, len(goodRows), len(goodCols)
                    maxScore = sigma, score, len(goodRows), len(goodCols)
        print '\t', maxScore

        results.append((score0, maxScore[1], maxScore[1]/score0))
    results = array(results)
    print mean(results)
    print std(results)
    John Hunter, Jul 8, 2003
    #16
