String 'close-to' comparison.

Kyle Hunter

Hello, I have an array. It contains approximately twenty elements which
are strings. I also have one string - this string was obtained using an
OCR system. One of the strings in the array should 'match' the string
gotten using the OCR system - unfortunately OCRs aren't perfect!

I want to take this string, and compare it to every string in the array,
and attempt to return the closest match.

E.g.,
array = ['Hello there, how are you?', 'What did you do over your break?',
         'I like my coffee brown.', 'I just bought a new car.']
string = "What did you d0 over your brcak?"


And then have my comparison function return array[1]. As you can see,
string has some 'OCR errors' - it's usually 80-95% accurate, if not
dead-on.
 
Brendan Stennett

If you know all the possibilities that your OCR system *could* pick up
then you could always do something like this...

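knownStrings = ['Hello', 'Goodbye']
out = []
OCR_strings = [] # fill in with the strings returned by the OCR system

OCR_strings.each do |ocr|
  knownStrings.each do |known|
    matches = 0 # reset the count for every known string
    # count the positions at which both strings have the same character
    [known.length, ocr.length].min.times do |i|
      matches += 1 if ocr[i] == known[i]
    end
    # to_f avoids integer division, which would only ever yield 0 or 1
    if matches.to_f / known.length > 0.85
      out << known
    else
      out << "!#{known}"
    end
  end
end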


...completely untested, but I think you know what I'm getting at.
 
Brendan Stennett

if matches.to_f / known.length > 0.85
  out << known
else
  out << "!#{known}"
end

should be more like

if matches.to_f / known.length > 0.85
  out << known
end
 
Heesob Park

Hi,

Here is some simple score-matching code:

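array = ['Hello there, how are you?', 'What did you do over your break?',
         'I like my coffee brown.', 'I just bought a new car.']
string = "What did you d0 over your brcak?"

# Ratio of the distinct characters found in either string to the sum of the
# distinct characters in each: 0.5 for identical sets, 1.0 for disjoint ones.
def comp(str1,str2)
  a=str1.split('').uniq
  b=str2.split('').uniq
  (a+b).uniq.length*1.0/(a.length+b.length)
end

# The lowest score is the closest match.
puts array.sort_by{|x|comp(string,x)}.first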

Regards,
Park Heesob
 
Chris Shea

It sounds like what you want is something like the Levenshtein
distance (http://en.wikipedia.org/wiki/Levenshtein_distance).
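
As a rough illustration of that idea, here is a minimal pure-Ruby sketch of the Levenshtein distance applied to the example from the original question. It is only one straightforward way to write it, not a tuned implementation, and the helper name closest_match is made up for this example rather than taken from any library.

```ruby
# Levenshtein (edit) distance: the minimum number of single-character
# insertions, deletions and substitutions needed to turn a into b.
# Only two rows of the dynamic-programming table are kept at a time.
def levenshtein(a, b)
  a = a.chars
  b = b.chars
  prev = (0..b.length).to_a # distances from the empty prefix of a
  a.each_with_index do |ca, i|
    curr = [i + 1] # distance from a[0..i] to the empty prefix of b
    b.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1,   # delete ca
               curr[j] + 1,       # insert cb
               prev[j] + cost].min # substitute (or keep) the character
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: returns the candidate with the smallest distance.
def closest_match(candidates, target)
  candidates.min_by { |c| levenshtein(c, target) }
end

array = ['Hello there, how are you?', 'What did you do over your break?',
         'I like my coffee brown.', 'I just bought a new car.']
string = "What did you d0 over your brcak?"

puts closest_match(array, string) # prints "What did you do over your break?"
```

With about twenty candidates this brute-force comparison should be fast enough; the two OCR errors in the example give a distance of 2 to the intended string.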

HTH,
Chris
 
ara.t.howard

http://amatch.rubyforge.org/


a @ http://codeforpeople.com/
 
