Pattern Matching Given # of Characters and no String Input; use RegularExpressions?

Synonymous · Apr 17, 2005

Hello,

Can regular expressions compare file names to one another? It seems RE
can only compare against input I give it, while I want it to compare
the names amongst themselves and give me matches if the first x
characters are the same.

For example:

cccat
cccap
cccan
dddfa
dddfg
dddfz

Would result in the 'ddd' and the 'ccc' being grouped together if I
specified it to look for a match of the first 3 characters.

What I am trying to do is build a script that will automatically
create directories based on shared prefixes like this, starting with,
say, 10 characters and going down to 1. This way "Vacation1.jpg" and
"Vacation2.jpg" would be sent to their own directory (if I specify that
the first 8 characters must match), and "Cat1.jpg" and "Cat2.jpg" would
as well (with 3).

Thanks for your help and interest!

S M

tiissa · Apr 17, 2005

Synonymous said:
Can regular expressions compare file names to one another? It seems RE
can only compare against input I give it, while I want it to compare
the names amongst themselves and give me matches if the first x
characters are the same.

Do you have to use regular expressions?

If you know the number of characters to match can't you just compare slices?

In [1]: f1,f2='cccat','cccap'

In [2]: f1[:3]
Out[2]: 'ccc'

In [3]: f1[:3]==f2[:3]
Out[3]: True

It seems to me you just have to compare each file to the next one (after
having sorted your list).
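
The sort-then-compare-neighbours idea can be sketched as a small function. This is only an illustration: the name `group_adjacent` and the shuffled sample list are assumptions, not something posted in the thread.

```python
def group_adjacent(names, n):
    """Group names whose first n characters match, by comparing sorted neighbours."""
    groups = []
    for name in sorted(names):
        # Start a new group unless this name shares its first n characters
        # with the previous name (the last member of the current group).
        if groups and groups[-1][-1][:n] == name[:n]:
            groups[-1].append(name)
        else:
            groups.append([name])
    return groups


files = ['cccat', 'dddfa', 'cccap', 'dddfz', 'cccan', 'dddfg']
print(group_adjacent(files, 3))
# [['cccan', 'cccap', 'cccat'], ['dddfa', 'dddfg', 'dddfz']]
```

Because the list is sorted first, names with the same prefix always end up next to each other, so one pass over the list is enough.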

tiissa · Apr 17, 2005

tiissa said:
If you know the number of characters to match can't you just compare
slices?

If you don't, you can still do it by hand:

In [7]: def cmp(s1,s2):
   ....:     # length of the common prefix of s1 and s2
   ....:     diff_map=[chr(s1[i]!=s2[i]) for i in range(min(len(s1), len(s2)))]
   ....:     diff_index=''.join(diff_map).find(chr(True))
   ....:     if -1==diff_index:
   ....:         return min(len(s1), len(s2))
   ....:     else:
   ....:         return diff_index
   ....:

In [8]: cmp('cccat','cccap')
Out[8]: 4

In [9]: cmp('ccc','cccap')
Out[9]: 3

In [10]: cmp('cccat','dddfa')
Out[10]: 0

Kent Johnson · Apr 17, 2005

tiissa said:
Do you have to use regular expressions?

If you know the number of characters to match can't you just compare
slices?

It seems to me you just have to compare each file to the next one (after
having sorted your list).

itertools.groupby() can do the comparing and grouping:

>>> import itertools
>>> def groupbyPrefix(lst, n):
...     lst.sort()
...     def key(item):
...         return item[:n]
...     return [ list(items) for k, items in itertools.groupby(lst, key=key) ]
...
>>> names = ['cccat', 'cccap', 'cccan', 'cccbt', 'ccddd', 'dddfa', 'dddfg', 'dddfz']
>>> groupbyPrefix(names, 3)
[['cccan', 'cccap', 'cccat', 'cccbt'], ['ccddd'], ['dddfa', 'dddfg', 'dddfz']]
>>> groupbyPrefix(names, 2)
[['cccan', 'cccap', 'cccat', 'cccbt', 'ccddd'], ['dddfa', 'dddfg', 'dddfz']]
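
groupby() only merges items that sit next to each other, which is why the list is sorted before grouping. If sorting is not wanted, a dictionary keyed on the prefix is a common alternative. The sketch below is an assumption about how that might look (the name `group_by_prefix` is made up for illustration); it also keeps the prefix itself, which is handy as a directory name.

```python
from collections import defaultdict


def group_by_prefix(names, n):
    """Map each n-character prefix to the names that start with it."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:n]].append(name)
    return dict(groups)


names = ['cccat', 'cccap', 'cccan', 'cccbt', 'ccddd', 'dddfa', 'dddfg', 'dddfz']
for prefix, members in group_by_prefix(names, 3).items():
    print(prefix, members)
# ccc ['cccat', 'cccap', 'cccan', 'cccbt']
# ccd ['ccddd']
# ddd ['dddfa', 'dddfg', 'dddfz']
```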

Kent

Synonymous · Apr 18, 2005

tiissa said:
If you don't, you can still do it by hand:

I will look at that, although if I have 300 images I don't want to type
all the comparisons (In [9]: cmp('ccc','cccap')) by hand; it would
just be easier to sort them then.

I got it somewhat close to working in Visual Basic:

If Left$(Cells(iRow, 1).Value, Count) = Left$(Cells(iRow - 1, 1).Value, Count) Then

What it says is: when walking down a list, take the leftmost 'Count'
characters of the current cell and compare them to the leftmost 'Count'
characters of the cell in the row above, then do the task (i.e.
make a directory, move the files) if they are equal.

I will look for a Left$(str) function that looks at the first X
characters for Python :).

Thank you for your help!

Synonymous

John Machin · Apr 18, 2005

Synonymous said:
I will look for a Left$(str) function that looks at the first X
characters for Python :).

Wild goose chase alert! AFAIK there isn't one. Python uses slice
notation instead of left/mid/right/substr/whatever functions. I do
suggest that instead of looking for such a beastie, you read this
section of the Python Tutorial: 3.1.2 Strings.

Then, if you think that that was a good use of your time, you might
like to read the *whole* tutorial :)

HTH,

John

Dennis Lee Bieber · Apr 18, 2005

Synonymous said:
I will look for a Left$(str) function that looks at the first X
characters for Python :).

BASIC's
Left$(str, x)

is essentially Python's
str[:x]

and a comparison of two would be
somestring[:X] == anotherstring[:X]
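
As a small illustration of that equivalence, here is a sketch with a hypothetical `left` helper (Python has no such built-in; the name is made up here) next to the plain slice and `str.startswith`:

```python
def left(s, n):
    """Rough stand-in for BASIC's Left$(s, n): the first n characters of s."""
    return s[:n]


print(left('Vacation1.jpg', 8))   # Vacation
print(left('Cat', 10))            # Cat (slicing past the end is not an error)
print(left('Vacation1.jpg', 8) == left('Vacation2.jpg', 8))   # True
print('Vacation2.jpg'.startswith(left('Vacation1.jpg', 8)))   # True
```

In practice the helper is unnecessary; writing the slice directly is the usual Python style.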

tiissa · Apr 18, 2005

Synonymous said:
I will look at that, although if I have 300 images I don't want to type
all the comparisons (In [9]: cmp('ccc','cccap')) by hand; it would
just be easier to sort them then.

I didn't mean you had to type it by hand. I was thinking of writing a
small script (as opposed to using something from the standard tools). It
might look like:

In [22]: def make_group(L):
   ....:     # L must be sorted; uses the cmp() defined earlier
   ....:     root,res='',[]
   ....:     for i in range(1,len(L)):
   ....:         if ''==root:
   ....:             root=L[i][:cmp(L[i-1],L[i])]
   ....:             if ''==root:
   ....:                 res.append((L[i-1],[L[i-1]]))
   ....:             else:
   ....:                 res.append((root,[L[i-1],L[i]]))
   ....:         elif len(root)==cmp(root,L[i]):
   ....:             res[-1][1].append(L[i])
   ....:         else:
   ....:             root=''
   ....:     if ''==root:
   ....:         res.append((L[-1],[L[-1]]))
   ....:     return res
   ....:

In [23]: L=['cccat','cccap','cccan','dddfa','dddfg','dddfz']

In [24]: L.sort()

In [25]: make_group(L)
Out[25]: [('ccca', ['cccan', 'cccap', 'cccat']), ('dddf', ['dddfa',
'dddfg', 'dddfz'])]

However I guarantee no optimality in the number of classes (but, hey,
that's when you don't specify the size of the prefix).
(Actually, I guarantee nothing at all ;p)
But in particular, you can have some file singled out:

In [26]: make_group(['cccan','cccap','cccat','cccb'])
Out[26]: [('ccca', ['cccan', 'cccap', 'cccat']), ('cccb', ['cccb'])]

It is a matter of choice: either you want to specify by hand the size of
the prefix and you'd rather look at itertools as pointed out by Kent, or
you don't and a variation with the above code might do the job.
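
For the case where the prefix size is not given, the standard library already has a character-wise prefix finder, `os.path.commonprefix`. The short sketch below only shows what it returns for a few of the names used in this thread; how to split a whole directory listing into groups is still up to the surrounding code.

```python
import os.path

# commonprefix compares character by character, not path component by component.
shared = os.path.commonprefix(['cccan', 'cccap', 'cccat'])
nothing = os.path.commonprefix(['cccat', 'dddfa'])
photos = os.path.commonprefix(['Vacation1.jpg', 'Vacation2.jpg'])

print(repr(shared))    # 'ccca'
print(repr(nothing))   # ''
print(repr(photos))    # 'Vacation'
```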

Synonymous · Apr 21, 2005

John Machin said:
Wild goose chase alert! AFAIK there isn't one. Python uses slice
notation instead of left/mid/right/substr/whatever functions. I do
suggest that instead of looking for such a beastie, you read this
section of the Python Tutorial: 3.1.2 Strings.

Then, if you think that that was a good use of your time, you might
like to read the *whole* tutorial :)

Haha, it always comes down to RTFM I guess, which is always the best
advice :).

Thank you for your help. Now that I think about it, I guess string
handling is exactly what I am looking for, because even though I am
using file names, I am treating them like strings when comparing them.

Byebye :)

S M

Synonymous · Apr 21, 2005

tiissa said:
It is a matter of choice: either you want to specify by hand the size of
the prefix and you'd rather look at itertools as pointed out by Kent, or
you don't and a variation with the above code might do the job.

Thank you, that is very cool. I finally found out how to copy files
with shutil too, so I'm getting close to doing something. I'm going to be
working on an old computer, since playing with files = dangerous, lol.

Thanks for your help and taking the time to post!

Bye :)

S M

Synonymous · Apr 22, 2005

Hello!

I was trying to create a program that searches for the longest common
leading substring among filenames in a directory, then moves the files
into a directory named after that substring. I have succeeded, with
help, in doing so and here is the code.

Thanks for your help!

--- Code ---

# This program was created with feed back from: smeghead and sirup plus
# aum of I2P; and also tiissa and John Machin of comp.lang.python
# Thank you very much.
# I still get the odd error in this, but it was 1 out of 2500 files
# successfully sorted. Make sure you have a directory under c:/test/
# called 'aa' and have your files in c:/test/
# I release this code into the public domain :), send feed back to
# (e-mail address removed)
import pickle
import os
import shutil

os.chdir('/test')
aaaa = 2
aa = 'aa'    # destination directory; must already exist under /test
x = 0
y = 20       # prefix length to try, counting down from 20 to 3
while y != 2:
    print(y)
    # Re-read the directory on every pass, since files get moved away.
    List = []
    for fileName in os.listdir('/test/'):
        Directory = fileName
        List.append(Directory)
    # Sentinel entries so List[x - 1] and List[x + 1] always exist.
    List.append("A111111111111")
    List.sort()
    List.append("Z111111111111")
    ListLength = len(List) - 1
    x = 0
    while x < ListLength:
        ListLength = len(List) - 1
        b = List[x]
        c = List[x + 1]
        backward1 = List[x - 1]
        d = b[:y]
        e = c[:y]
        backward2 = backward1[:y]
        f = str(d)
        g = str(e)
        backward3 = str(backward2)
        if f == g:
            # Same prefix as the next name: move into aa/<prefix>.
            if os.path.isdir(aa + "/" + f) == True:
                shutil.move(b, aa + "/" + f)
            else:
                os.mkdir(aa + "/" + f)
                #os.mkdir(f)
                shutil.move(b, aa + "/" + f)
        else:
            if f == backward3:
                # Same prefix as the previous name: last file of a group.
                if os.path.isdir(aa + "/" + f) == True:
                    shutil.move(b, aa + "/" + f)
                else:
                    os.mkdir(aa + "/" + f)
                    #os.mkdir(f)
                    shutil.move(b, aa + "/" + f)
            else:
                aaaa = 3
        x = x + 1
    y = y - 1

--- End Code ---
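
The same longest-prefix-first idea can be tried without touching the disk. The sketch below is not the script above, only a dry-run variant under stated assumptions: the name `plan_moves`, its parameters and the sample file names are invented for illustration, and it returns a plan instead of moving anything.

```python
def plan_moves(names, longest=20, shortest=3):
    """Return {filename: prefix}, preferring the longest shared prefix."""
    plan = {}
    remaining = sorted(names)
    for n in range(longest, shortest - 1, -1):
        # Count how many still-unassigned names share each n-character prefix.
        counts = {}
        for name in remaining:
            counts[name[:n]] = counts.get(name[:n], 0) + 1
        # Names whose prefix occurs at least twice are assigned to that prefix.
        for name in remaining:
            if counts[name[:n]] > 1:
                plan[name] = name[:n]
        remaining = [name for name in remaining if name not in plan]
    return plan


files = ['Vacation1.jpg', 'Vacation2.jpg', 'Cat1.jpg', 'Cat2.jpg', 'notes.txt']
for name, prefix in sorted(plan_moves(files).items()):
    print(name, '->', prefix)
# Cat1.jpg -> Cat
# Cat2.jpg -> Cat
# Vacation1.jpg -> Vacation
# Vacation2.jpg -> Vacation
```

Files that share no prefix with any other file (here `notes.txt`) are simply left out of the plan. Applying the plan would then be a matter of calling `os.makedirs(target, exist_ok=True)` followed by `shutil.move(name, target)` for each entry, after checking the printed plan first.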
