RegExp poser: matching two substrings

Stuart Moore · Apr 26, 2004

I've been working on a uni project where you have to work out how
similar two strings are, and I was wondering if it could be done somehow
in a regular expression (or two...). This isn't part of my project,
there are far more sensible ways of doing it, this is just for fun.

Suppose you have two strings, $s and $t. You want to find the substrings
of $s and $t that have the highest score attached, where you have +1 for
each pair of letters that match, and -1 for each that don't.

So, $s='MEAT', $t='ATE' has a score of 2 because you match the 2 'AT'
substrings.
$s='ABCDE' $t='ABFDE' has a score of 3 - it's still worthwhile taking
the -1 on the middle letter, because there's +2 on either side.

Any thoughts? Feel free to assume $s and $t are alphabetic/alphanumeric.

Finding matching substrings of length $n is easy, using
"$s:$t" =~ /([^:]{$n}).*:.*$1/;
(untested) but I got no further...

Stuart

Tore Aursand · Apr 27, 2004

I've been working on a uni project where you have to work out how
similar two strings are [...]

Have you had a look at the String::* modules, specifically String::Approx?

Stuart Moore · Apr 27, 2004

Tore said:
I've been working on a uni project where you have to work out how
similar two strings are [...]

Click to expand...

Have you had a look at the String::* modules, specifically String::Approx?

Ooh, I'd forgotten that one. I have to implement the matching routines
myself though, so can't cheat (anyway it's done!). Just wondered how
much could be done with regexps, it's surprising what you can pick up
from silly questions like this and actually use later...

Malcolm Dew-Jones · Apr 27, 2004

Stuart Moore ([email protected]) wrote:
: I've been working on a uni project where you have to work out how
: similar two strings are, and I was wondering if it could be done somehow
: in a regular expression (or two...). This isn't part of my project,
: there are far more sensible ways of doing it, this is just for fun.

: Suppose you have two strings, $s and $t. You want to find the substrings
: of $s and $t that have the highest score attached, where you have +1 for
: each pair of letters that match, and -1 for each that don't.

: So, $s='MEAT', $t='ATE' has a score of 2 because you match the 2 'AT'
: substrings.
: $s='ABCDE' $t='ABFDE' has a score of 3 - it's still worthwhile taking
: the -1 on the middle letter, because there's +2 on either side.

: Any thoughts? Feel free to assume $s and $t are alphabetic/alphanumeric.

: Finding matching substrings of length $n is easy, using
: "$s:$t" =~ /([^:]{$n}).*:.*$1/;
: (untested) but I got no further...

: Stuart

I would not use regexp.

I would split the strings into character arrays, and then use the diff
module with each character treated as a "line".

Anno Siegel · Apr 27, 2004

Malcolm Dew-Jones said:
Stuart Moore ([email protected]) wrote:
: I've been working on a uni project where you have to work out how
: similar two strings are, and I was wondering if it could be done somehow
: in a regular expression (or two...). This isn't part of my project,
: there are far more sensible ways of doing it, this is just for fun.

: Suppose you have two strings, $s and $t. You want to find the substrings
: of $s and $t that have the highest score attached, where you have +1 for
: each pair of letters that match, and -1 for each that don't.

: So, $s='MEAT', $t='ATE' has a score of 2 because you match the 2 'AT'
: substrings.
: $s='ABCDE' $t='ABFDE' has a score of 3 - it's still worthwhile taking
: the -1 on the middle letter, because there's +2 on either side.

: Any thoughts? Feel free to assume $s and $t are alphabetic/alphanumeric.

: Finding matching substrings of length $n is easy, using
: "$s:$t" =~ /([^:]{$n}).*:.*$1/;
: (untested) but I got no further...

: Stuart

I would not use regexp.

I would split the strings into character arrays, and then use the diff
module with each character treated as a "line".

A different, and presumably faster, way to find the score of matching
and non-matching characters in two strings $a, $b is

my $score;
$score = tr/\0// - tr/\0//c for $a ^ $b;

No guarantees for multi-byte characters, though.

Anno

Counting Palindrome Substrings ?	1	Feb 5, 2011
SENTINEL CONTROL LOOP WHEN DEALING WITH TWO ARRAYS	1	Oct 26, 2023
Python point location of intersect between two lines	0	Feb 28, 2018
Processing in Python help	0	Aug 31, 2022
regexp(ing) Backus-Naurish expressions ...	23	Mar 10, 2013
Regex: deleting non-matching words	3	Aug 22, 2010
regexp	7	Jun 3, 2004
FAQ 4.23 How do I find matching/nesting anything?	0	Apr 2, 2011

RegExp poser: matching two substrings

Stuart Moore

Tore Aursand

Stuart Moore

Malcolm Dew-Jones

Anno Siegel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads