ImSim: Image Similarity

n00m · Mar 5, 2011

Let me present my newborn project (in Python) ImSim:

http://sourceforge.net/projects/imsim/

Its README.txt:
---------------------------------------------------------------------
ImSim is a python script for finding the most similar pic(s) to
a given one among a set/list/db of your pics.
The script is very short and very easy to follow and understand.
Its sample output looks like this:

bears2.jpg
--------------------
bears2.jpg 0.00
bears3.jpg 55.33
bears1.jpg 68.87
sky1.jpg 83.84
sky2.jpg 84.41
ff1.jpg 91.35
lake1.jpg 95.14
water1.jpg 96.94
ff2.jpg 102.36
roses1.jpg 115.02
roses2.jpg 130.02

Done!

The *lower* the numeric value, the *more similar* the pic is to the
tested pic. If this value is > 70, the pictures are almost certainly
completely different (from totally different domains, so to speak).

What "similarity" is and how it can/could/should be estimated: this
point I'm leaving for your consideration/contemplation/arguing etc.

Several sample pics (*.jpg) are included in the .zip.
And of course the stuff requires PIL (Python Imaging Library), see:
Home-page: http://www.pythonware.com/products/pil
Download-URL: http://effbot.org/zone/pil-changes-116.htm

Grigory Javadyan · Mar 5, 2011

At least you could've tried to make the script more usable by adding
the possibility to supply command line arguments, instead of editing
the source every time you want to compare a couple of images.

n00m · Mar 5, 2011

I uploaded a new version of the subject with a
VERY MINOR correction in it. Namely, in line #55:

print '%12s %7.2f' % (db[k][1], db[k][0] / 3600.0,)

instead of

print '%12s %7.2f' % (db[k][1], db[k][0] * 0.001,)

I.e. I normalized it to base = 100.
Now the values of similarity can't be greater than 100
and can be treated as some "regular" percents (%%).

Also, due to this change, the *empirical* threshold of
"system alarmity" moved down from "number 70" to "20%".

bears2.jpg
--------------------
bears2.jpg 0.00
bears3.jpg 15.37
bears1.jpg 19.13
sky1.jpg 23.29
sky2.jpg 23.45
ff1.jpg 25.37
lake1.jpg 26.43
water1.jpg 26.93
ff2.jpg 28.43
roses1.jpg 31.95
roses2.jpg 36.12

Done!

Mel · Mar 5, 2011

n00m said:
I uploaded a new version of the subject with a
VERY MINOR correction in it. Namely, in line #55:

print '%12s %7.2f' % (db[k][1], db[k][0] / 3600.0,)

instead of

print '%12s %7.2f' % (db[k][1], db[k][0] * 0.001,)

I.e. I normalized it to base = 100.
Now the values of similarity can't be greater than 100
and can be treated as some "regular" percents (%%).

Also, due to this change, the *empirical* threshold of
"system alarmity" moved down from "number 70" to "20%".

bears2.jpg
--------------------
bears2.jpg 0.00
bears3.jpg 15.37
bears1.jpg 19.13
sky1.jpg 23.29
sky2.jpg 23.45
ff1.jpg 25.37
lake1.jpg 26.43
water1.jpg 26.93
ff2.jpg 28.43
roses1.jpg 31.95
roses2.jpg 36.12

I'd like to see a *lot* more structure in there, with modularization, so the
internal functions could be used from another program. Once I'd figured out
what it was doing, I had this:

from PIL import Image
from PIL import ImageStat

def row_column_histograms (file_name):
    '''Resize the image to 300x300 and reduce it to b/w brightness levels 0..3.
    Return brightness histograms for 5 bands across Y and 5 bands across X,
    packed into a 10-item list of 4-item histograms.'''
    im = Image.open (file_name)
    im = im.convert ('L') # convert to 8-bit b/w
    w, h = 300, 300
    im = im.resize ((w, h))
    imst = ImageStat.Stat (im)
    sr = imst.mean[0] # average pixel level in layer 0
    sr_low, sr_mid, sr_high = (sr*2)/3, sr, (sr*4)/3
    def foo (t):
        if t < sr_low: return 0
        if t < sr_mid: return 1
        if t < sr_high: return 2
        return 3
    im = im.point (foo) # reduce to brightness levels 0..3
    yhist = [[0]*4 for i in xrange(5)]
    xhist = [[0]*4 for i in xrange(5)]
    for y in xrange (h):
        for x in xrange (w):
            k = im.getpixel ((x, y))
            yhist[y // 60][k] += 1 # 60-pixel-high row band
            xhist[x // 60][k] += 1 # 60-pixel-wide column band
    return yhist + xhist

def difference_ranks (test_histogram, sample_histograms):
    '''Return a list of difference ranks between the test histograms and
    each of the samples.'''
    result = [0]*len (sample_histograms)
    for k, s in enumerate (sample_histograms): # for each image
        for i in xrange(10): # for each histogram slot
            for j in xrange(4): # for each brightness level
                result[k] += abs (s[i][j] - test_histogram[i][j])
    return result

if __name__ == '__main__':
    import getopt, sys
    opts, args = getopt.getopt (sys.argv[1:], '', [])
    if not args:
        # no arguments: fall back to the sample pics from the archive
        args = [
            'bears1.jpg',
            'bears2.jpg',
            'bears3.jpg',
            'roses1.jpg',
            'roses2.jpg',
            'ff1.jpg',
            'ff2.jpg',
            'sky1.jpg',
            'sky2.jpg',
            'water1.jpg',
            'lake1.jpg',
            ]
        test_pic = 'bears2.jpg'
    else:
        # first argument is the test pic, the rest are the samples
        test_pic, args = args[0], args[1:]

    z = [row_column_histograms (a) for a in args]
    test_z = row_column_histograms (test_pic)

    file_ranks = zip (difference_ranks (test_z, z), args)
    file_ranks.sort()

    print '%12s' % (test_pic,)
    print '--------------------'
    for r in file_ranks:
        print '%12s %7.2f' % (r[1], r[0] / 3600.0,)

(omitting a few comments that wrapped around.) The test-case still agrees
with your archived version:

mwilson@tecumseth:~/sandbox/im_sim$ python image_rank.py bears2.jpg *.jpg
bears2.jpg
--------------------
bears2.jpg 0.00
bears3.jpg 15.37
bears1.jpg 19.20
sky1.jpg 23.20
sky2.jpg 23.37
ff1.jpg 25.30
lake1.jpg 26.38
water1.jpg 26.98
ff2.jpg 28.43
roses1.jpg 32.01
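
For anyone reading along on Python 3 without PIL at hand, the quantize-and-histogram step can be roughly sketched on plain lists of gray values. This is only an illustration under stated assumptions, not the ImSim source: the names `band_histograms` and `difference` are made up here, the input is assumed to be already resized to a 300x300 grid of 0..255 values, and the two test images are synthetic.

```python
SIZE, BANDS, LEVELS = 300, 5, 4
BAND = SIZE // BANDS  # 60 rows (or columns) per band


def band_histograms(pixels):
    """pixels: SIZE x SIZE rows of gray values 0..255 (assumed already resized)."""
    mean = sum(map(sum, pixels)) / float(SIZE * SIZE)
    low, mid, high = mean * 2 / 3, mean, mean * 4 / 3

    def level(t):
        # same thresholds as foo() in the script above
        if t < low:
            return 0
        if t < mid:
            return 1
        if t < high:
            return 2
        return 3

    yhist = [[0] * LEVELS for _ in range(BANDS)]
    xhist = [[0] * LEVELS for _ in range(BANDS)]
    for y, row in enumerate(pixels):
        for x, t in enumerate(row):
            k = level(t)
            yhist[y // BAND][k] += 1
            xhist[x // BAND][k] += 1
    return yhist + xhist


def difference(a, b):
    """Sum of absolute bin differences over all 10 histograms, scaled by 3600."""
    total = sum(abs(p - q) for ha, hb in zip(a, b) for p, q in zip(ha, hb))
    return total / 3600.0


# Two synthetic images whose pixels all land in different levels:
black = [[0] * SIZE for _ in range(SIZE)]   # mean is 0, so every pixel is level 3
gray = [[128] * SIZE for _ in range(SIZE)]  # every pixel equals the mean: level 2

worst = difference(band_histograms(black), band_histograms(gray))
same = difference(band_histograms(gray), band_histograms(gray))
print(worst, same)  # 100.0 0.0
```

This also shows where the 3600 comes from: each band holds 60 * 300 = 18000 pixels, two histograms of one band can differ by at most 2 * 18000 = 36000, and ten bands give 360000, which divided by 3600 is 100.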

I'd vaguely wanted to do something like this for a while, but I never dug
far enough into PIL to even get started. An additional kind of ranking that
takes colour into account would also be good -- that's the first one I never
did.

Cheers, Mel.

n00m · Mar 5, 2011

Very nice, Mel.

As for using color info...
my current strong opinion is: the colors must be forgot for good.
Paradoxically but "profound" elaboration and detailization can/will
spoil/undermine the whole thing. Just my current imo.

===========================
Vitali

Jorgen Grahn · Mar 5, 2011

At least you could've tried to make the script more usable by adding
the possibility to supply command line arguments, instead of editing
the source every time you want to compare a couple of images.

So basically you're saying you won't tell the users what the program
*does*. I don't get that.

Is it better than this?
- scale each image to 100x100
- go black&white in such a way that half the pixels are black
- XOR the images and count the mismatches

That takes care of JPEG quality, scaling and possibly gamma
correction, but not cropping or rotation. I'm sure there are better,
well-known algorithms.

/Jorgen
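
The three steps listed above can be sketched roughly as follows. This is a hedged illustration only: it assumes both images have already been scaled to 100x100 and converted to lists of gray values, the names `binarize` and `mismatches` are invented for the example, and the test images are synthetic gradients. Thresholding at the median gives "about half black"; ties at the median value can push it off an exact half.

```python
def binarize(pixels):
    """Map gray values to 0/1 so that roughly half the pixels are black (0)."""
    flat = sorted(t for row in pixels for t in row)
    median = flat[len(flat) // 2]
    return [[1 if t >= median else 0 for t in row] for row in pixels]


def mismatches(a, b):
    """XOR the two binarized images and count the differing pixels."""
    ba, bb = binarize(a), binarize(b)
    return sum(p ^ q for ra, rb in zip(ba, bb) for p, q in zip(ra, rb))


N = 100
left_to_right = [[x for x in range(N)] for y in range(N)]
brighter = [[x + 50 for x in range(N)] for y in range(N)]   # same picture, brighter
top_to_bottom = [[y for x in range(N)] for y in range(N)]   # different picture

print(mismatches(left_to_right, brighter))       # 0
print(mismatches(left_to_right, top_to_bottom))  # 5000
```

The brightened copy scores 0 because the median moves along with the brightness, which is the robustness to gamma-like changes mentioned above; a crop or rotation would not survive this way.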

n00m · Mar 5, 2011

Is it better than this?
- scale each image to 100x100
- go black&white in such a way that half the pixels are black
- XOR the images and count the mismatches

It's *much* better but I'm not *much* about to prove it.

I'm sure there are better,
well-known algorithms.

The best well-known algorithm is to hire a man with good eyesight
to do the job of comparing, ranking and selecting the pictures.

n00m · Mar 5, 2011

PS

For some reason they don't update the link to the last version.

It's _20110306, here: http://sourceforge.net/projects/imsim/files/

I use Python 2.5 & PIL for Python 2.5

Mel · Mar 6, 2011

n00m said:
As for using color info...
my current strong opinion is: the colors must be forgot for good.
Paradoxically but "profound" elaboration and detailization can/will
spoil/undermine the whole thing. Just my current imo.

Yeah. I guess including color info cubes the complexity of the answer.
Might be too complicated to know what to do with an answer like that.

Mel.

n00m · Mar 6, 2011

Yeah. I guess including color info cubes the complexity of the answer.
Might be too complicated to know what to do with an answer like that.

Mel.

Uhmm, Mel. Totally agree with you.
+
I included "roses1.jpg" & "roses2.jpg" on purpose:
the 1st one is a painting by Abbott Handerson Thayer,
the 2nd is its copy by some obscure Russian painter.
But it's of course a creative & revamped copy.

In a strict sense they are 2 different images (look at their colors etc.);
on the other hand they are closely related to each other.
Plus, we can't tell *in principle* which is the original and which is the
copy, or which colors are "right/good" and which are "wrong/bad".

n00m · Mar 6, 2011

http://www.nga.gov/search/index.shtm
http://deyoung.famsf.org/search-collections
etc
Seems they all offer search only by keywords and the like.
What about submitting e.g. roses2.jpg (the copy) to find its
original? Assume we know neither its author nor its title.

n00m · Mar 6, 2011

Obviously if we were to use it in practice (in a web-museum?)
all the pics' matrices should be precalculated only once and
stored in a table with forty fields v00 ... v93 like:

-----------------------------------------------
pic_title v00 v01 v02 ... v93
-----------------------------------------------
bears2.jpg 1234 4534 8922 ... 333
....
....
-----------------------------------------------

Then SQL query will look like this:

select top 3 pic_title from table
order by
abs(v00 - w[0][0]) +
abs(v01 - w[0][1]) +
.... +
abs(v93 - w[9][3])

here w[][] is the matrix of a newly-entering picture.
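
A minimal sketch of that lookup using Python's bundled sqlite3 module, under several assumptions: the table is called `pics` here (since `table` is a reserved word), SQLite spells `top 3` as `LIMIT 3`, the helper names are invented, and the stored matrices are made-up numbers rather than real ImSim output.

```python
import sqlite3

# Column names v00 ... v93: 10 histograms x 4 brightness levels = 40 fields.
COLS = ["v%d%d" % (i, j) for i in range(10) for j in range(4)]


def make_db():
    con = sqlite3.connect(":memory:")
    fields = ", ".join(c + " INTEGER" for c in COLS)
    con.execute("CREATE TABLE pics (pic_title TEXT, %s)" % fields)
    return con


def add_pic(con, title, w):
    """Store the precalculated 10x4 matrix w of one picture."""
    flat = [v for hist in w for v in hist]
    marks = ", ".join("?" * (len(COLS) + 1))
    con.execute("INSERT INTO pics VALUES (%s)" % marks, [title] + flat)


def most_similar(con, w, n=3):
    """Titles of the n stored pics with the smallest sum of absolute differences."""
    flat = [v for hist in w for v in hist]
    order = " + ".join("abs(%s - ?)" % c for c in COLS)
    sql = "SELECT pic_title FROM pics ORDER BY %s LIMIT ?" % order
    return [row[0] for row in con.execute(sql, flat + [n])]


con = make_db()
# Made-up matrices: every band has the same 4-bin histogram.
add_pic(con, "a.jpg", [[18000, 0, 0, 0]] * 10)
add_pic(con, "b.jpg", [[0, 18000, 0, 0]] * 10)
add_pic(con, "c.jpg", [[9000, 9000, 0, 0]] * 10)
add_pic(con, "d.jpg", [[0, 0, 0, 18000]] * 10)

query = [[17000, 1000, 0, 0]] * 10  # matrix of the newly-entering picture
print(most_similar(con, query))  # ['a.jpg', 'c.jpg', 'b.jpg']
```

Note that this still scans every row on each query; it only saves recomputing the matrices, not the comparison itself.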

P.S.
If someone encounters 2 apparently unrelated pics
for which ImSim gives a value of their mutual diff.
*** less than 20% ***, please email them to me.

John Bokma · Mar 6, 2011

n00m said:
http://www.nga.gov/search/index.shtm
http://deyoung.famsf.org/search-collections
etc
Seems they all offer search only by keywords and the like.
What about submitting e.g. roses2.jpg (the copy) to find its
original? Assume we know neither its author nor its title.

Title: TinEye, author: http://ideeinc.com/
Search: http://www.tineye.com/

Example:
http://www.tineye.com/search/2b3305135fa4c59311ed58b41da5d07f213e4d47/

Notice how it finds modified images.

n00m · Mar 6, 2011

Title: TinEye, author: http://ideeinc.com/
Search: http://www.tineye.com/

Example:
http://www.tineye.com/search/2b3305135fa4c59311ed58b41da5d07f213e4d47/

Notice how it finds modified images.

--
John Bokma j3b

Blog: http://johnbokma.com/ Facebook: http://www.facebook.com/j.j.j.bokma
Freelance Perl & Python Development: http://castleamber.com/

It's for kids.
Such trifles can easily be cracked by e.g. Jorgen Grahn's algo (see
his message)

n00m · Mar 6, 2011

It's for kids.
Such trifles can easily be cracked by e.g. Jorgen Grahn's algo (see
his message)

Even his algo will be an overhead.
Comparing meta-data/EXIF of image files will be enough in 99% cases.

John Bokma · Mar 6, 2011

n00m said:
Even his algo will be an overhead.
Comparing meta-data/EXIF of image files will be enough in 99% cases.

Yes, yes, we get it. You're so much smarter (but not smart enough to not
quote a signature...). Anyway, I guess that's the reason big names use
tineye and not your algorithm...

n00m · Mar 6, 2011

As for "proper" quoting: I read/post to this group via my web-browser.
And for me everything looks OK. I don't even quite understand what exactly
do you mean by your remark. I'm not a facebookie/forumish/twitterish thing.
Btw I don't know what is the twitter. I don't need it, neither to know nor
to use it. Oh... Pres. Medvedev knows what is the twitter and uses it.

John Bokma · Mar 6, 2011

n00m said:
As for "proper" quoting: I read/post to this group via my web-browser.
And for me everything looks OK. I don't even quite understand what exactly
do you mean by your remark. I'm not a facebookie/forumish/twitterish thing.

Exactly. It's Usenet, something I've been using for, oh, just over 20
years now, and even then it was not new. You know, before the web thing
you're talking about...

n00m · Mar 7, 2011

If someone encounters 2 apparently unrelated pics
for which ImSim gives a value of their mutual diff.
*** less than 20% ***, please email them to me.

Never mind, people.
I've found such a pair of images in my .zipped project.
It's "sky1.jpg" and "lake1.jpg", with sim. value < 15%.

sky1.jpg
--------------------
sky1.jpg 0.00
sky2.jpg 0.77
lake1.jpg 14.28 <-----
bears2.jpg 23.29
bears3.jpg 26.60
roses2.jpg 29.41
roses1.jpg 31.36
ff1.jpg 33.47
bears1.jpg 36.60
ff2.jpg 39.52
water1.jpg 40.11

But a funny thing happens.
At first glance it's a false positive: some modern South East
Asian town and a lake somewhere in Russia, more than 100 years
ago. Nothing similar in them?

In both pics we see:
-- a lot of water in the foreground;
-- a lot of blue sky at sunny mid-day;
-- a bit of light white cloud in the sky.

In short,
the notion of similarity can be speculated about just endlessly.

Grigory Javadyan · Mar 7, 2011

Just admit already that your algorithm doesn't work that well.

Or give a solid formal definition of "similarity" and prove that your
algo works with that definition.
