extracting substrings based on 'fuzzy' match

P

Pilcrow

This problem was raised in comp.lang.perl.misc, and the poster was
concerned, among other things, by the speed of execution.
Since C is faster than perl, I wonder how a C coder would solve it?

Given a *very* long string, A, and a shorter string, B, extract all
substrings of A that differ from B in at most N positions.

For example:
N = 2
A = 'abcdefaacdefxbcxfaaabcdefaaaacdxf'
B = 'abcdef'
the substrings are 'abcdef' at offsets 0 & 19, 'aacdef' at offset 6, and
'aacdxf' at offset 27.
 
B

Bartc

Pilcrow said:
This problem was raised in comp.lang.perl.misc, and the poster was
concerned, among other things, by the speed of execution.
Since C is faster than perl, I wonder how a C coder would solve it?

Given a *very* long string, A, and a shorter string, B, extract all
substrings of A that differ from B in at most N positions.

For example:
N = 2
A = 'abcdefaacdefxbcxfaaabcdefaaaacdxf'
B = 'abcdef'
the substrings are 'abcdef' at offsets 0 & 19, 'aacdef' at offset 6, and
'aacdxf' at offset 27.

Here's how /I/ did, as an occasional C coder; this assumes substrings can
overlap:

#include <stdio.h>
#include <string.h>

int main(void) {
char *a="abcdefaacdefxbcxfaaabcdefaaaacdxf";
char *b="abcdef";
int n=2;
int blen;
int i,j,k,c;

blen=strlen(b);

k=strlen(a)-blen+1;

for (i=0; i<k; ++i) {

/* Compare substring a..a[i+blen-1] with b*/
c=0;
for (j=0; j<blen; ++j)
if (a[i+j]!=b[j]){
++c;
if (c>n) break;
}
if (c<=n) {
printf("Offset: %d Substring: ",i);
for (j=0; j<blen; ++j) printf("%c",a[i+j]);
puts("");
}
}
}
 
E

Eric Sosman

Pilcrow said:
This problem was raised in comp.lang.perl.misc, and the poster was
concerned, among other things, by the speed of execution.
Since C is faster than perl,

What a nonsensical assertion!
[...] I wonder how a C coder would solve it?

Given a *very* long string, A, and a shorter string, B, extract all
substrings of A that differ from B in at most N positions.

For example:
N = 2
A = 'abcdefaacdefxbcxfaaabcdefaaaacdxf'
B = 'abcdef'
the substrings are 'abcdef' at offsets 0 & 19, 'aacdef' at offset 6, and
'aacdxf' at offset 27.

I don't see anything C-specific about the question.
A few ideas come to mind, but they'd be better suited to
a general programming or algorithms group than to this one.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,767
Messages
2,569,570
Members
45,045
Latest member
DRCM

Latest Threads

Top