Find repeating substring

M

Mike

Hi All,

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string). For example, in the case below, the
substring of interest would be "HEART (CONDUCTION DEFECT)".

Thanks much for any insights,

Mike



HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2
 
T

thundergnat

Mike said:
Hi All,

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string). For example, in the case below, the
substring of interest would be "HEART (CONDUCTION DEFECT)".

Thanks much for any insights,

Mike



HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2

my $string = 'HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
WITH CATHETER 37.34/2';

if ($string =~ m/^(.+)(?=\s*).*\1/) {
print $1;
}


This isn't foolproof, (what is?) but it may be good enough. Newlines in the
string may be problematic if they fall inside the term you are searching on.
 
M

Mike

thundergnat said:
my $string = 'HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
WITH CATHETER 37.34/2';

if ($string =~ m/^(.+)(?=\s*).*\1/) {
print $1;
}


This isn't foolproof, (what is?) but it may be good enough. Newlines in the
string may be problematic if they fall inside the term you are searching on.

Hmm. This seems very close. It still picks up the code (37.33/2) that
is not repeated, but let me play with this a bit to see if I can
exclude that.
 
M

Mirco Wahab

Thus spoke Mike (on 2006-06-20 16:19):
Hmm. This seems very close. It still picks up the code (37.33/2) that
is not repeated, but let me play with this a bit to see if I can
exclude that.

It doesn't here, consider:


my $text = <<END_OF_TEXT;
HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2
END_OF_TEXT

my $rep = qr{ ^(.+)(?=\s*).*?\1 }msx;
my ($repeated) = $text=~/$rep/;

print $repeated;



prints: HEART (CONDUCTION DEFECT)


(BTW: I made the final .* non-greedy)

Regards

Mirco
 
T

Ted Zlatanov

Thank you for the reference. This problem while simple at first glance
has so far proved quite challenging for me.

You asked a very open-ended question that, while theoretically
interesting, may not be the real problem in your case. Can you try to
restate your request in a different way, for example:

"I want to extract data where the format is

[LABEL] [NUMBER] [LABEL][EXTRA_INFORMATION1] [NUMBER] [LABEL][EXTRA_INFORMATION2] [NUMBER]"

And then explain the syntax of [LABEL] and [EXTRA_INFORMATION*] if
possible?

Ted
 
X

xhoster

Mike said:
Hi All,

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string). For example, in the case below, the
substring of interest would be "HEART (CONDUCTION DEFECT)".

Thanks much for any insights,

Mike

HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT) WITH
CATHETER 37.34/2

This seems pretty simple. What am I missing?

/^(.*).*\1/s

Xho
 
M

Mike

What I have is 24,000 lines of ICD9 index entries with appended codes
that used to be stored in a format like below and processed one time
per year by an OS390 (then printed out). These are index entries for
ICD9 codes that now have been moved to a web service. The web services
are all "code-centric" and the index entries are just properties of the
codes now. The gist of the problem is that I need to format each entry
(for example that of ablation) into a tree-view that users can peruse
for codes. I believe I can do this recursively be reading each string
until the first word changes then pulling the repeating first and
subsequent words (which are then the root). Then, remove the root and
process what is left of each substring in turn until the base case is
hit (no repeating substrings).

The owner of the webservices is not willing to change the format of the
data. I may be able to do a one-time process (Perl) or on-demand
depending on performance.

A tree view would look like

Ablation
Endometrial (Hysteroscopic) 68.23
Heart (Conduction Defect) 27.33/2
With Catheter 37.34/2
Inner Ear (Cryosurgery) (Ultrasound) 20.79/4
By Injection 20.72
Lesion Heart
By Peripherally Inserted Catheter 37.34

etc etc.....

The raw index entries would look like below.

ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR
APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC)
APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC
APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION
(TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91





Ted said:
Thank you for the reference. This problem while simple at first glance
has so far proved quite challenging for me.

You asked a very open-ended question that, while theoretically
interesting, may not be the real problem in your case. Can you try to
restate your request in a different way, for example:

"I want to extract data where the format is

[LABEL] [NUMBER] [LABEL][EXTRA_INFORMATION1] [NUMBER] [LABEL][EXTRA_INFORMATION2] [NUMBER]"

And then explain the syntax of [LABEL] and [EXTRA_INFORMATION*] if
possible?

Ted
 
X

xhoster

Please don't top post. Top posting fixed.
Your solution still leaves the code 37.33/2 in the result.

No, it does not.


$ perl -l
$_ = 'HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
WITH CATHETER 37.34/2';
if (/^(.*).*\1/) {
print "Result: $1";
}
__END__
Result: HEART (CONDUCTION DEFECT)

See, no 37.33/2 in the result!

Xho
 
M

Mike

Please don't top post. Top posting fixed.



No, it does not.


$ perl -l
$_ = 'HEART (CONDUCTION DEFECT) 37.33/2 HEART (CONDUCTION DEFECT)
WITH CATHETER 37.34/2';
if (/^(.*).*\1/) {
print "Result: $1";
}
__END__
Result: HEART (CONDUCTION DEFECT)

See, no 37.33/2 in the result!

Xho

--

Sorry. My mistake. I ran the Perl script with your suggestion and no
37.33/2 (as you noted). As I am trying to also see if this will work
in C# code (on demand) I ran it on Expresso as ^(.*).*\1 against the
same string and it left the code. Different implementation? I only
use regular expressions irregularly and so am merely dangerous with
them.
 
D

David Squire

Mike wrote:

[big snip]
Sorry. My mistake. I ran the Perl script with your suggestion and no
37.33/2 (as you noted). As I am trying to also see if this will work
in C# code (on demand) I ran it on Expresso as ^(.*).*\1 against the
same string and it left the code. Different implementation? I only
use regular expressions irregularly and so am merely dangerous with
them.

What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.
 
M

Mike

David said:
Mike wrote:

[big snip]
Sorry. My mistake. I ran the Perl script with your suggestion and no
37.33/2 (as you noted). As I am trying to also see if this will work
in C# code (on demand) I ran it on Expresso as ^(.*).*\1 against the
same string and it left the code. Different implementation? I only
use regular expressions irregularly and so am merely dangerous with
them.

What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.

I don't expect them to function 'exactly' alike, but I was kind of
expecting similar.... Regular expressions are pretty much like
everything else in my life -- I need to get better at them.
 
P

Peter Scott

Sorry, hit send too early.

I need a regular expression to find repeating substrings (in particular
the substring that starts in position 1 of the string and is repeated
elsewhere in the string). For example, in the case below, the substring
of interest would be "HEART (CONDUCTION DEFECT)".

http://search.cpan.org/~gray/Tree-Suffix-0.14/lib/Tree/Suffix.pm

may not be a regular expression, but it is a Perl binding to a library
that you may be able to link to from another language (as you indicated
you are using). The link above has a link to that library.

On large amounts of data the regular expression solution can have
unacceptable performance. This is much faster.
 
X

Xicheng Jia

Mike said:
What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.

I don't expect them to function 'exactly' alike, but I was kind of
expecting similar.... Regular expressions are pretty much like
everything else in my life -- I need to get better at them.

In fact, if you don't use embedded code in Perl, named captures,
balanced groupings in C#(C# may also have variable-length look behind),
and some other minor differences, the regex patterns for these two
tools are about the same.. (more importantly, they are using the same
Traditional NFA regex engin).

Xicheng
 
D

David Squire

Xicheng said:
Mike said:
What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.
I don't expect them to function 'exactly' alike, but I was kind of
expecting similar.... Regular expressions are pretty much like
everything else in my life -- I need to get better at them.

In fact, if you don't use embedded code in Perl, named captures,
balanced groupings in C#(C# may also have variable-length look behind),
and some other minor differences, the regex patterns for these two
tools are about the same.. (more importantly, they are using the same
Traditional NFA regex engin).

.... and yet regexes in sed, vi, etc. are subtly different from those in
Perl. It's never a good idea to assume equivalence. Those little
differences can cause much puzzlement otherwise.

DS
 
X

Xicheng Jia

David said:
Xicheng said:
Mike said:
What makes you think that REs in one language will function like those
in another? [1]

DS

[1] Deliberatively provocative.
I don't expect them to function 'exactly' alike, but I was kind of
expecting similar.... Regular expressions are pretty much like
everything else in my life -- I need to get better at them.

In fact, if you don't use embedded code in Perl, named captures,
balanced groupings in C#(C# may also have variable-length look behind),
and some other minor differences, the regex patterns for these two
tools are about the same.. (more importantly, they are using the same
Traditional NFA regex engin).

... and yet regexes in sed, vi, etc. are subtly different from those in
Perl. It's never a good idea to assume equivalence. Those little
differences can cause much puzzlement otherwise.

While I didnt say that the same engine must have exactly the same
implementations. the engine just tell how it works inside, right?.. If
you've read references about traditional NFA regexes like Perl, C#
..NET, Java, Python, Javascript, and I do think switching patterns from
one flavor to another is not a huge deal.. :)

BTW. vim? is using DFA engine.

Xicheng
 
T

thundergnat

Mike said:
What I have is 24,000 lines of ICD9 index entries with appended codes
that used to be stored in a format like below and processed one time
per year by an OS390 (then printed out). These are index entries for
ICD9 codes that now have been moved to a web service. The web services
are all "code-centric" and the index entries are just properties of the
codes now. The gist of the problem is that I need to format each entry
(for example that of ablation) into a tree-view that users can peruse
for codes. I believe I can do this recursively be reading each string
until the first word changes then pulling the repeating first and
subsequent words (which are then the root). Then, remove the root and
process what is left of each substring in turn until the base case is
hit (no repeating substrings).

The owner of the webservices is not willing to change the format of the
data. I may be able to do a one-time process (Perl) or on-demand
depending on performance.

A tree view would look like

Ablation
Endometrial (Hysteroscopic) 68.23
Heart (Conduction Defect) 27.33/2
With Catheter 37.34/2
Inner Ear (Cryosurgery) (Ultrasound) 20.79/4
By Injection 20.72
Lesion Heart
By Peripherally Inserted Catheter 37.34

etc etc.....

The raw index entries would look like below.

ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR
APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC)
APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC
APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION
(TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91

Ho hum. I was bored so I farted around with this for a bit.

Not particularly elegant or fast but...


use warnings;
use strict;

my @file;
my $lastline = '';
my $partline = '';

while ( my $line = <DATA> ) {
chomp $line;
$line = "$partline $line" if length $partline;
if ( $line =~ /\D$/ ) {
$partline = $line;
next;
}
else {
$partline = '';
}
$line =~ s/(\w+('\w+)?)/\u\L$1/g;
push @file, $line;
}

my $level = 0;
my $prefix = '';
my @step;
my $prev = shift @file;
my $tab = ' ';

while (@file) {
( $level, $prefix, $prev ) =
buildtree( $level, $prefix, shift @file, $prev );
}
buildtree( $level, $prefix, $prefix, $prev );


sub buildtree {
my ( $level, $prefix, $next, $prev ) = @_;
my $common = join ' ', greatest_common_prefix( $prev, $next );
if ( $common eq $prefix ) {
$prev =~ s/^\Q$prefix\E\s*//;
print $tab x $level, $prev, "\n";
}
elsif ( length $common > length $prefix ) {
$prev =~ s/^\Q$common\E\s*//;
my $trim = $common;
$common =~ s/^\Q$prefix\E\s*//;
push @step, $common;
print $tab x $level, $common;
if ( $prev !~ /[ \p{Alpha}]/ ) {
print "\t$prev\n";
$level = @step;
}
else {
$level = @step;
print "\n", $tab x $level, $prev, "\n";
}
$prefix = $trim;
}
elsif ( length $common < length $prefix ) {
$prev =~ s/^\Q$prefix\E\s*//;
print $tab x $level, $prev, "\n";
my @newstep;
my $test = $next;
for (@step) {
last unless $test =~ s/\Q$_\E\s*//;
push @newstep, $_;
}
@step = @newstep;
$level = @step;
$prefix = $common;
}
return ( $level, $prefix, $next );
}

sub greatest_common_prefix {
no warnings 'uninitialized';
my ( $first, $second ) = @_;
my @first = split ' ', $first;
my @second = split ' ', $second;
my @gcp;
for (@first) {
if ( $_ eq shift @second ) {
push @gcp, $_;
}
else {
last;
}
}
return @gcp;
}

__DATA__
ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR
APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC)
APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC
APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION
(TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91
 
M

Mike

Ho hum. I was bored so I farted around with this for a bit.

Not particularly elegant or fast but...

Bored? I am relatively easily impressed. Performance will continue to
be a problem if my users want to be able to pull these index trees
real-time.

Mike
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,022
Latest member
MaybelleMa

Latest Threads

Top