Find repeating substring

Mike · Jun 20, 2006

Hi All,

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string). For example, in the case below, the
substring of interest would be "HEART (CONDUCTION DEFECT)".

Thanks much for any insights,

Mike

HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2

Mike · Jun 20, 2006

bugbear said:
That sounds related to this:

http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml

and may be beyond a regexp.

BugBear

Thank you for the reference. This problem, while simple at first glance,
has so far proved quite challenging for me.

Mike

thundergnat · Jun 20, 2006

Mike said:
Hi All,

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string). For example, in the case below, the
substring of interest would be "HEART (CONDUCTION DEFECT)".

Thanks much for any insights,

Mike

HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2

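my $string = 'HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
WITH CATHETER 37.34/2';

# Capture the longest leading substring that occurs again later on.
# The (?=\s*) lookahead is zero-width and always succeeds, so it does
# not change what is matched.
if ($string =~ m/^(.+)(?=\s*).*\1/) {
    print $1;
}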

This isn't foolproof (what is?), but it may be good enough. Newlines in the
string may be problematic if they fall inside the term you are searching for.
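
One way to sidestep the newline caveat is to collapse whitespace before matching. The following is only a sketch: leading_repeat is a hypothetical helper name, and it assumes that whitespace differences between the two occurrences do not matter for your data.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the longest leading substring that occurs
# again later in the string, or nothing when there is no repeat.
sub leading_repeat {
    my ($string) = @_;

    # Collapse every whitespace run (including newlines) to one space,
    # so a term wrapped across lines still equals its unwrapped copy.
    ( my $flat = $string ) =~ s/\s+/ /g;

    return unless $flat =~ /^(.+).*\1/;

    # The capture usually ends with the space that follows both copies.
    ( my $found = $1 ) =~ s/\s+\z//;
    return $found;
}

my $wrapped = "HEART (CONDUCTION\nDEFECT) 37.33/2 "
            . "HEART (CONDUCTION DEFECT) WITH\nCATHETER 37.34/2";
print leading_repeat($wrapped), "\n";    # HEART (CONDUCTION DEFECT)
```

Without the normalization step, the backreference would have to match the newline literally, and the capture would stop at "HEART (CONDUCTION".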

Mike · Jun 20, 2006

thundergnat said:
my $string = 'HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
WITH CATHETER 37.34/2';

if ($string =~ m/^(.+)(?=\s*).*\1/) {
print $1;
}

This isn't foolproof, (what is?) but it may be good enough. Newlines in the
string may be problematic if they fall inside the term you are searching on.

Hmm. This seems very close. It still picks up the code (37.33/2) that
is not repeated, but let me play with this a bit to see if I can
exclude that.

Mirco Wahab · Jun 20, 2006

Thus spoke Mike (on 2006-06-20 16:19):

Hmm. This seems very close. It still picks up the code (37.33/2) that
is not repeated, but let me play with this a bit to see if I can
exclude that.

It doesn't here, consider:

my $text = <<END_OF_TEXT;
HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2
END_OF_TEXT

my $rep = qr{ ^(.+)(?=\s*).*?\1 }msx;
my ($repeated) = $text=~/$rep/;

print $repeated;

prints: HEART (CONDUCTION DEFECT)

(BTW: I made the final .* non-greedy)
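
To illustrate what the non-greedy .*? changes, here is a small sketch on a made-up string (the string and the repeat_info helper are assumptions for illustration, not from the posts above). The captured prefix is the same either way; only the copy that the backreference lands on differs.

```perl
use strict;
use warnings;

my $text = 'abXabYab';

# Hypothetical helper: returns the capture and the offset at which the
# overall match ends.
sub repeat_info {
    my ( $string, $re ) = @_;
    return unless $string =~ $re;
    return ( $1, $+[0] );
}

my ( $g_cap, $g_end ) = repeat_info( $text, qr/^(.+).*\1/s );     # greedy
my ( $l_cap, $l_end ) = repeat_info( $text, qr/^(.+).*?\1/s );    # non-greedy

print "greedy: '$g_cap' ends at $g_end\n";    # greedy: 'ab' ends at 8
print "lazy:   '$l_cap' ends at $l_end\n";    # lazy:   'ab' ends at 5
```

The greedy form backtracks from the end of the string and finds the last copy; the non-greedy form stops at the first copy after the prefix, which typically means less backtracking on long strings.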

Regards

Mirco

Ted Zlatanov · Jun 20, 2006

Thank you for the reference. This problem while simple at first glance
has so far proved quite challenging for me.

You asked a very open-ended question that, while theoretically
interesting, may not be the real problem in your case. Can you try to
restate your request in a different way, for example:

"I want to extract data where the format is

[LABEL] [NUMBER] [LABEL][EXTRA_INFORMATION1] [NUMBER] [LABEL][EXTRA_INFORMATION2] [NUMBER]"

And then explain the syntax of [LABEL] and [EXTRA_INFORMATION*] if
possible?

Ted

xhoster · Jun 20, 2006

Mike said:
Hi All,

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string). For example, in the case below, the
substring of interest would be "HEART (CONDUCTION DEFECT)".

Thanks much for any insights,

Mike

HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2

This seems pretty simple. What am I missing?

/^(.*).*\1/s

Xho
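
A side note on the pattern above, as a hedged sketch: because (.*) may capture the empty string, this form matches every string, even one with no repeat at all, so the caller has to check whether the capture is empty. With (.+) the match simply fails instead. The first_capture helper name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first capture, or undef when the
# pattern does not match.
sub first_capture {
    my ( $string, $re ) = @_;
    my ($capture) = $string =~ $re;
    return $capture;
}

for my $s ( 'abcabc', 'abcdef' ) {
    my $star = first_capture( $s, qr/^(.*).*\1/s );
    my $plus = first_capture( $s, qr/^(.+).*\1/s );
    printf "%s: star=%s plus=%s\n", $s,
      defined $star ? "'$star'" : 'no match',
      defined $plus ? "'$plus'" : 'no match';
}
# abcabc: star='abc' plus='abc'
# abcdef: star='' plus=no match
```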

Mike · Jun 20, 2006

Your solution still leaves the code 37.33/2 in the result.

Mike · Jun 20, 2006

What I have is 24,000 lines of ICD9 index entries with appended codes
that used to be stored in a format like below and processed one time
per year by an OS390 (then printed out). These are index entries for
ICD9 codes that now have been moved to a web service. The web services
are all "code-centric" and the index entries are just properties of the
codes now. The gist of the problem is that I need to format each entry
(for example that of ablation) into a tree-view that users can peruse
for codes. I believe I can do this recursively by reading each string
until the first word changes then pulling the repeating first and
subsequent words (which are then the root). Then, remove the root and
process what is left of each substring in turn until the base case is
hit (no repeating substrings).

The owner of the webservices is not willing to change the format of the
data. I may be able to do a one-time process (Perl) or on-demand
depending on performance.

A tree view would look like

Ablation
    Endometrial (Hysteroscopic) 68.23
    Heart (Conduction Defect) 37.33/2
        With Catheter 37.34/2
    Inner Ear (Cryosurgery) (Ultrasound) 20.79/4
        By Injection 20.72
    Lesion Heart
        By Peripherally Inserted Catheter 37.34

etc etc.....

The raw index entries would look like below.

ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR
APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC)
APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC
APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION
(TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91
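
The recursive idea described in this post (group entries by first word, pull out the shared leading words as the root, then repeat on what is left) could be sketched roughly as follows. This is only an illustration under assumptions: parse_entry and tree_lines are hypothetical names, every entry is assumed to be on one line already, and the last whitespace-separated token is assumed to be the code.

```perl
use strict;
use warnings;

# Split one index entry into its words and its trailing code.
sub parse_entry {
    my @words = split ' ', $_[0];
    my $code  = pop @words;
    return [ \@words, $code ];
}

# Each entry is [ \@words, $code ]. Returns the indented tree lines.
sub tree_lines {
    my ( $entries, $depth ) = @_;

    # Group entries by first word, keeping first-seen order.
    my ( @order, %group );
    for my $e (@$entries) {
        my $key = $e->[0][0];
        push @order, $key unless $group{$key};
        push @{ $group{$key} }, $e;
    }

    my @out;
    for my $key (@order) {
        my @members = @{ $group{$key} };

        # The root is the longest word prefix shared by the whole group.
        my @root = @{ $members[0][0] };
        for my $m (@members) {
            my $n = 0;
            $n++ while $n < @root
              && $n < @{ $m->[0] }
              && $root[$n] eq $m->[0][$n];
            $#root = $n - 1;
        }

        # Strip the root; an entry with nothing left owns the root's code.
        my ( $own, @rest );
        for my $m (@members) {
            my @tail = @{ $m->[0] }[ scalar(@root) .. $#{ $m->[0] } ];
            if (@tail) { push @rest, [ \@tail, $m->[1] ] }
            else       { $own = $m->[1] }
        }

        push @out,
          ( '  ' x $depth ) . join( ' ', @root, defined $own ? $own : () );
        push @out, tree_lines( \@rest, $depth + 1 ) if @rest;    # recurse
    }
    return @out;
}

my @raw = (
    'ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23',
    'ABLATION HEART (CONDUCTION DEFECT) 37.33/2',
    'ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2',
    'ABLATION PITUITARY 7.69',
    'ABLATION PITUITARY BY COBALT-60 92.32',
);
print "$_\n" for tree_lines( [ map { parse_entry($_) } @raw ], 0 );
```

The base case is a group with a single entry, which prints as one line. Two entries with identical wording but different codes would overwrite each other in this sketch, so real data would need a check for that.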

Ted said:
Thank you for the reference. This problem while simple at first glance
has so far proved quite challenging for me.

You asked a very open-ended question that, while theoretically
interesting, may not be the real problem in your case. Can you try to
restate your request in a different way, for example:

"I want to extract data where the format is

[LABEL] [NUMBER] [LABEL][EXTRA_INFORMATION1] [NUMBER] [LABEL][EXTRA_INFORMATION2] [NUMBER]"

And then explain the syntax of [LABEL] and [EXTRA_INFORMATION*] if
possible?

Ted

xhoster · Jun 20, 2006

Please don't top post. Top posting fixed.

Your solution still leaves the code 37.33/2 in the result.

No, it does not.

$ perl -l
$_ = 'HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
WITH CATHETER 37.34/2';
if (/^(.*).*\1/) {
print "Result: $1";
}
__END__
Result: HEART (CONDUCTION DEFECT)

See, no 37.33/2 in the result!

Xho

Mike · Jun 20, 2006

Please don't top post. Top posting fixed.

No, it does not.

$ perl -l
$_ = 'HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
WITH CATHETER 37.34/2';
if (/^(.*).*\1/) {
print "Result: $1";
}
__END__
Result: HEART (CONDUCTION DEFECT)

See, no 37.33/2 in the result!

Xho

--

Sorry. My mistake. I ran the Perl script with your suggestion and got no
37.33/2 (as you noted). Since I am also trying to see if this will work
in C# code (on demand), I ran it in Expresso as ^(.*).*\1 against the
same string and it left the code. Different implementation? I only
use regular expressions irregularly and so am merely dangerous with
them.

David Squire · Jun 20, 2006

Mike wrote:

[big snip]

Sorry. My mistake. I ran the Perl script with your suggestion and no
37.33/2 (as you noted). As I am trying to also see if this will work
in C# code (on demand) I ran it on Expresso as ^(.*).*\1 against the
same string and it left the code. Different implementation? I only
use regular expressions irregularly and so am merely dangerous with
them.

What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberately provocative.

Mike · Jun 21, 2006

David said:
Mike wrote:

[big snip]

Sorry. My mistake. I ran the Perl script with your suggestion and no
37.33/2 (as you noted). As I am trying to also see if this will work
in C# code (on demand) I ran it on Expresso as ^(.*).*\1 against the
same string and it left the code. Different implementation? I only
use regular expressions irregularly and so am merely dangerous with
them.

What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.

I don't expect them to function 'exactly' alike, but I was kind of
expecting similar.... Regular expressions are pretty much like
everything else in my life -- I need to get better at them.

Peter Scott · Jun 21, 2006

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string).

http://search.cpan.org/~gray/Tree-Suffix-0.14/lib/Tree/Suffix.pm

Peter Scott · Jun 21, 2006

Sorry, hit send too early.

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string). For example, in the case below, the substring
of interest would be "HEART (CONDUCTION DEFECT)".

http://search.cpan.org/~gray/Tree-Suffix-0.14/lib/Tree/Suffix.pm

may not be a regular expression, but it is a Perl binding to a library
that you may be able to link to from another language (as you indicated
you are using). The link above has a link to that library.

On large amounts of data the regular expression solution can have
unacceptable performance. This is much faster.
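
For the narrower problem in the original question (the longest leading substring that occurs again later), a plain index() search is another option that avoids regex backtracking. The sketch below does not use Tree::Suffix or its API; leading_repeat_index is a hypothetical name. It relies on one observation: if a prefix of length N recurs at or after offset N, every shorter prefix recurs too, so the length can be found by binary search.

```perl
use strict;
use warnings;

# Hypothetical helper: longest prefix of $s that occurs again without
# overlapping the prefix itself. Returns '' when nothing repeats.
sub leading_repeat_index {
    my ($s) = @_;

    # A non-overlapping repeat can be at most half the string long.
    my ( $lo, $hi ) = ( 0, int( length($s) / 2 ) );

    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi + 1 ) / 2 );

        # Does the prefix of length $mid occur at offset $mid or later?
        if ( index( $s, substr( $s, 0, $mid ), $mid ) >= 0 ) {
            $lo = $mid;        # yes: try a longer prefix
        }
        else {
            $hi = $mid - 1;    # no: the answer is shorter
        }
    }
    return substr( $s, 0, $lo );
}

my $entry = 'HEART (CONDUCTION DEFECT) 37.33/2 '
          . 'HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2';
print "[", leading_repeat_index($entry), "]\n";
# [HEART (CONDUCTION DEFECT) ]
```

Note that the result keeps the trailing space shared by both copies, just as the regex capture does on a single-line string. Whether this beats the regex on the real data is something to measure rather than assume.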

Xicheng Jia · Jun 21, 2006

Mike said:
What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.

I don't expect them to function 'exactly' alike, but I was kind of
expecting similar.... Regular expressions are pretty much like
everything else in my life -- I need to get better at them.

In fact, if you don't use embedded code in Perl, named captures,
balancing groups in C# (C# may also have variable-length lookbehind),
and some other minor differences, the regex patterns for these two
tools are about the same (more importantly, they use the same kind of
traditional NFA regex engine).

Xicheng

David Squire · Jun 21, 2006

Xicheng said:
Mike said:

What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.

I don't expect them to function 'exactly' alike, but I was kind of
expecting similar.... Regular expressions are pretty much like
everything else in my life -- I need to get better at them.

In fact, if you don't use embedded code in Perl, named captures,
balanced groupings in C#(C# may also have variable-length look behind),
and some other minor differences, the regex patterns for these two
tools are about the same.. (more importantly, they are using the same
Traditional NFA regex engin).

... and yet regexes in sed, vi, etc. are subtly different from those in
Perl. It's never a good idea to assume equivalence. Those little
differences can cause much puzzlement otherwise.

DS

Xicheng Jia · Jun 21, 2006

David said:
Xicheng said:

Mike said:

What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.
I don't expect them to function 'exactly' alike, but I was kind of
expecting similar.... Regular expressions are pretty much like
everything else in my life -- I need to get better at them.

In fact, if you don't use embedded code in Perl, named captures,
balanced groupings in C#(C# may also have variable-length look behind),
and some other minor differences, the regex patterns for these two
tools are about the same.. (more importantly, they are using the same
Traditional NFA regex engin).

... and yet regexes in sed, vi, etc. are subtly different from those in
Perl. It's never a good idea to assume equivalence. Those little
differences can cause much puzzlement otherwise.

I didn't say that the same kind of engine must have exactly the same
implementation; the engine type just tells you how it works inside, right?
If you've read references about traditional NFA regex flavors like Perl,
C#/.NET, Java, Python, and JavaScript, I do think switching patterns from
one flavor to another is not a huge deal.

BTW, vim (?) is using a DFA engine.

Xicheng

thundergnat · Jun 23, 2006

Ho hum. I was bored so I farted around with this for a bit.

Not particularly elegant or fast but...

use warnings;
use strict;

my @file;
my $lastline = '';
my $partline = '';

# Read the entries, rejoining wrapped lines: an entry is treated as
# complete only when it ends in a digit (the trailing code).
while ( my $line = <DATA> ) {
    chomp $line;
    $line = "$partline $line" if length $partline;
    if ( $line =~ /\D$/ ) {
        $partline = $line;
        next;
    }
    else {
        $partline = '';
    }
    $line =~ s/(\w+('\w+)?)/\u\L$1/g;    # title-case each word
    push @file, $line;
}

my $level  = 0;
my $prefix = '';
my @step;
my $prev = shift @file;
my $tab  = ' ';                          # indent unit per tree level

# Compare each entry with its successor; buildtree prints the earlier one.
while (@file) {
    ( $level, $prefix, $prev ) =
      buildtree( $level, $prefix, shift @file, $prev );
}
buildtree( $level, $prefix, $prefix, $prev );    # flush the final entry

sub buildtree {
    my ( $level, $prefix, $next, $prev ) = @_;
    my $common = join ' ', greatest_common_prefix( $prev, $next );
    if ( $common eq $prefix ) {

        # Same shared prefix as before: print at the current level.
        $prev =~ s/^\Q$prefix\E\s*//;
        print $tab x $level, $prev, "\n";
    }
    elsif ( length $common > length $prefix ) {

        # Longer shared prefix: open a new sub-level under it.
        $prev =~ s/^\Q$common\E\s*//;
        my $trim = $common;
        $common =~ s/^\Q$prefix\E\s*//;
        push @step, $common;
        print $tab x $level, $common;
        if ( $prev !~ /[ \p{Alpha}]/ ) {
            print "\t$prev\n";
            $level = @step;
        }
        else {
            $level = @step;
            print "\n", $tab x $level, $prev, "\n";
        }
        $prefix = $trim;
    }
    elsif ( length $common < length $prefix ) {

        # Shorter shared prefix: print, then drop the levels that the
        # next entry no longer shares.
        $prev =~ s/^\Q$prefix\E\s*//;
        print $tab x $level, $prev, "\n";
        my @newstep;
        my $test = $next;
        for (@step) {
            last unless $test =~ s/\Q$_\E\s*//;
            push @newstep, $_;
        }
        @step   = @newstep;
        $level  = @step;
        $prefix = $common;
    }
    return ( $level, $prefix, $next );
}

# Returns the leading words shared by both strings.
sub greatest_common_prefix {
    no warnings 'uninitialized';
    my ( $first, $second ) = @_;
    my @first  = split ' ', $first;
    my @second = split ' ', $second;
    my @gcp;
    for (@first) {
        if ( $_ eq shift @second ) {
            push @gcp, $_;
        }
        else {
            last;
        }
    }
    return @gcp;
}

__DATA__
ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR
APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC)
APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC
APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION
(TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91

Mike · Jun 23, 2006

Ho hum. I was bored so I farted around with this for a bit.

Not particularly elegant or fast but...

Bored? I am relatively easily impressed. Performance will continue to
be a problem if my users want to be able to pull these index trees
real-time.

Mike
