RegExp poser: matching two substrings

S

Stuart Moore

I've been working on a uni project where you have to work out how
similar two strings are, and I was wondering if it could be done somehow
in a regular expression (or two...). This isn't part of my project,
there are far more sensible ways of doing it, this is just for fun.

Suppose you have two strings, $s and $t. You want to find the substrings
of $s and $t that have the highest score attached, where you have +1 for
each pair of letters that match, and -1 for each that don't.

So, $s='MEAT', $t='ATE' has a score of 2 because you match the 2 'AT'
substrings.
$s='ABCDE' $t='ABFDE' has a score of 3 - it's still worthwhile taking
the -1 on the middle letter, because there's +2 on either side.

Any thoughts? Feel free to assume $s and $t are alphabetic/alphanumeric.

Finding matching substrings of length $n is easy, using
"$s:$t" =~ /([^:]{$n}).*:.*$1/;
(untested) but I got no further...

Stuart
 
T

Tore Aursand

I've been working on a uni project where you have to work out how
similar two strings are [...]

Have you had a look at the String::* modules, specifically String::Approx?
 
S

Stuart Moore

Tore said:
I've been working on a uni project where you have to work out how
similar two strings are [...]


Have you had a look at the String::* modules, specifically String::Approx?

Ooh, I'd forgotten that one. I have to implement the matching routines
myself though, so can't cheat (anyway it's done!). Just wondered how
much could be done with regexps, it's surprising what you can pick up
from silly questions like this and actually use later...
 
M

Malcolm Dew-Jones

Stuart Moore ([email protected]) wrote:
: I've been working on a uni project where you have to work out how
: similar two strings are, and I was wondering if it could be done somehow
: in a regular expression (or two...). This isn't part of my project,
: there are far more sensible ways of doing it, this is just for fun.

: Suppose you have two strings, $s and $t. You want to find the substrings
: of $s and $t that have the highest score attached, where you have +1 for
: each pair of letters that match, and -1 for each that don't.

: So, $s='MEAT', $t='ATE' has a score of 2 because you match the 2 'AT'
: substrings.
: $s='ABCDE' $t='ABFDE' has a score of 3 - it's still worthwhile taking
: the -1 on the middle letter, because there's +2 on either side.

: Any thoughts? Feel free to assume $s and $t are alphabetic/alphanumeric.

: Finding matching substrings of length $n is easy, using
: "$s:$t" =~ /([^:]{$n}).*:.*$1/;
: (untested) but I got no further...

: Stuart

I would not use regexp.

I would split the strings into character arrays, and then use the diff
module with each character treated as a "line".
 
A

Anno Siegel

Malcolm Dew-Jones said:
Stuart Moore ([email protected]) wrote:
: I've been working on a uni project where you have to work out how
: similar two strings are, and I was wondering if it could be done somehow
: in a regular expression (or two...). This isn't part of my project,
: there are far more sensible ways of doing it, this is just for fun.

: Suppose you have two strings, $s and $t. You want to find the substrings
: of $s and $t that have the highest score attached, where you have +1 for
: each pair of letters that match, and -1 for each that don't.

: So, $s='MEAT', $t='ATE' has a score of 2 because you match the 2 'AT'
: substrings.
: $s='ABCDE' $t='ABFDE' has a score of 3 - it's still worthwhile taking
: the -1 on the middle letter, because there's +2 on either side.

: Any thoughts? Feel free to assume $s and $t are alphabetic/alphanumeric.

: Finding matching substrings of length $n is easy, using
: "$s:$t" =~ /([^:]{$n}).*:.*$1/;
: (untested) but I got no further...

: Stuart

I would not use regexp.

I would split the strings into character arrays, and then use the diff
module with each character treated as a "line".

A different, and presumably faster, way to find the score of matching
and non-matching characters in two strings $a, $b is

my $score;
$score = tr/\0// - tr/\0//c for $a ^ $b;

No guarantees for multi-byte characters, though.

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,776
Messages
2,569,603
Members
45,189
Latest member
CryptoTaxSoftware

Latest Threads

Top