Gathering ngrams with the highest probability

Minkoo Seo

Hi group.

I'm writing some scientific applications in Ruby, and I keep running
into a problem that I'd like to solve cleanly in Ruby.

I have tons of NGram instances, where NGram is defined as follows:

NGram = Struct.new :seq, :prob

I have a list of instances of NGram like:

....
#<struct NGram seq=["AO", "S"], prob=-139918.174804688>
#<struct NGram seq=["AY", "T"], prob=-46389.6875>
#<struct NGram seq=["HH", "IH"], prob=18983.1796875>
#<struct NGram seq=["OW", "Z", "AH"], prob=-326323.640625>
#<struct NGram seq=["OW", "Z", "AH"], prob=-35945.25>
#<struct NGram seq=["T", "AH", "L"], prob=20778.7421875>
#<struct NGram seq=["HH", "IH", "S"], prob=37747.3046875>
#<struct NGram seq=["IH", "S", "T"], prob=-17305.6640625>
#<struct NGram seq=["IH", "S", "T"], prob=-17477.390625>
#<struct NGram seq=["IH", "S", "T"], prob=34243.34375>
#<struct NGram seq=["IH", "S", "T"], prob=-2125.265625>
#<struct NGram seq=["IH", "S", "T"], prob=-9046.7890625>
#<struct NGram seq=["IH", "S", "T"], prob=-18200.265625>
#<struct NGram seq=["K", "L", "AH"], prob=-110206.140625>
#<struct NGram seq=["K", "L", "AH"], prob=-92664.984375>
....

What I want to derive from this data is a list of NGram instances in
which each seq appears only once. For each seq, the instance kept must
be the one with the highest prob among its duplicates.

For example, from the ngram list I've shown above, I want to derive a
list like the following:

....
#<struct NGram seq=["AO", "S"], prob=-139918.174804688>
#<struct NGram seq=["AY", "T"], prob=-46389.6875>
#<struct NGram seq=["HH", "IH"], prob=18983.1796875>
#<struct NGram seq=["OW", "Z", "AH"], prob=-35945.25>
#<struct NGram seq=["T", "AH", "L"], prob=20778.7421875>
#<struct NGram seq=["HH", "IH", "S"], prob=37747.3046875>
#<struct NGram seq=["IH", "S", "T"], prob=34243.34375>
#<struct NGram seq=["K", "L", "AH"], prob=-92664.984375>
....

What I've written so far is

# Sort by prob in descending order
ngrams.sort_by { |ngram|
  # Compare seq
  # Then, compare prob
}

result = []

# Collect unique ngrams with the highest prob.
ngrams.inject(nil) { |prev, cur|
  if prev.nil?
    result << cur
    prev = cur
  elsif prev.seq != cur.seq
    result << cur
    prev = cur
  end
}

return result

Even to me, this does not look good. Besides the sort_by block that is
still unwritten, I use a result = [] statement that could probably be
eliminated.

Any idea for better code?

Sincerely,
Minkoo Seo
 
Robert Feldt

ngrams.inject({}) do |highest, ngram|
  seq = ngram.seq
  best_now = highest[seq]
  highest[seq] = ngram unless (best_now && best_now.prob > ngram.prob)
  highest
end.values

/RF
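
For anyone who wants to run the hash-based approach above, here is a minimal self-contained sketch. It assumes Ruby 1.9 or later, where a Hash keeps insertion order. The sample rows are copied from the listing in the question, and the variable name best is only for illustration.

```ruby
NGram = Struct.new(:seq, :prob)

# A few rows copied from the listing in the question.
ngrams = [
  NGram.new(["OW", "Z", "AH"], -326323.640625),
  NGram.new(["OW", "Z", "AH"], -35945.25),
  NGram.new(["IH", "S", "T"], -17305.6640625),
  NGram.new(["IH", "S", "T"], 34243.34375),
  NGram.new(["IH", "S", "T"], -2125.265625),
  NGram.new(["K", "L", "AH"], -110206.140625),
  NGram.new(["K", "L", "AH"], -92664.984375)
]

# Keep, per seq, the ngram with the highest prob seen so far.
best = ngrams.inject({}) do |highest, ngram|
  best_now = highest[ngram.seq]
  highest[ngram.seq] = ngram unless best_now && best_now.prob > ngram.prob
  highest
end.values

best.each { |ngram| puts "#{ngram.seq.join(' ')}: #{ngram.prob}" }
```

Arrays hash by content in Ruby, so two separate ["OW", "Z", "AH"] arrays land on the same key. Because the comparison is > rather than >=, a later ngram with an equal prob replaces the earlier one.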
 
Sylvain Joyeux

ngrams.inject({}) do |table, ngram|
  if old = table[ngram.seq]
    table[ngram.seq] = ngram if ngram.prob > old.prob
  else
    table[ngram.seq] = ngram
  end
  table
end.values  # .values turns the seq => ngram table into a list
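
Another way to express the same idea, not taken from any post in this thread, is to group the ngrams by seq and take the maximum of each group. This is only a sketch. It assumes a Ruby version that provides Enumerable#group_by and Enumerable#max_by and an insertion-ordered Hash (1.9 or later). The names best_per_seq and sample are hypothetical, chosen for the example.

```ruby
NGram = Struct.new(:seq, :prob)

# Hypothetical helper: one ngram per distinct seq, the one with the highest prob.
def best_per_seq(ngrams)
  ngrams.group_by { |ngram| ngram.seq }
        .map { |_seq, group| group.max_by { |ngram| ngram.prob } }
end

# A few rows copied from the listing in the question.
sample = [
  NGram.new(["AO", "S"], -139918.174804688),
  NGram.new(["OW", "Z", "AH"], -326323.640625),
  NGram.new(["OW", "Z", "AH"], -35945.25),
  NGram.new(["IH", "S", "T"], -17477.390625),
  NGram.new(["IH", "S", "T"], 34243.34375),
  NGram.new(["IH", "S", "T"], -9046.7890625)
]

unique = best_per_seq(sample)
unique.each { |ngram| p ngram }
```

This builds an intermediate array for every seq, so it may use more memory than the single-pass inject versions on very large lists. In exchange, the intent (group, then take the max) is spelled out directly.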
 
