Find repeating substring

M

Mirco Wahab

Thus spoke Mike (on 2006-06-23 16:46):
Bored? I am relatively easily impressed. Performance will continue to
be a problem if my users want to be able to pull these index trees
real-time.

Your (full) data set seems to be not larger than 2 or 3MB?
What would you consider 'realtime'? On a decent machine,
tree-building-during-line-read should be manageable
in (very) few seconds (am I wrong here?).

At what point are you with that? Can you push (per
personal e-mail or via http-link to the group) a
complete data set? What are the occassional line
breaks in your example lines? Are they intended?


Regards

Mirco
 
T

thundergnat

Fixed a few errors.

Reformatted to reduce memory usage by not slurping in the
whole data file at once.



use warnings;
use strict;

my $partline = '';
my $prefix = '';
my $prev = '';
my $tab = ' ';
my $level = 0;
my @step;

while ( my $line = <DATA> ) {
chomp $line;
$line = "$partline $line" if length $partline;
if ( $line =~ /\D$/ ) {
$partline = $line;
next;
}
else {
$partline = '';
}
$line =~ s/(\w+('\w+)?)/\u\L$1/g;
( $level, $prefix, $prev ) =
buildtree( $level, $prefix, $line, $prev );
}
buildtree( $level, $prefix, $prefix, $prev );


sub buildtree {
my ( $level, $prefix, $next, $prev ) = @_;
my $common = greatest_common_prefix( $prev, $next );
if ( $common eq $prefix ) {
$prev =~ s/^\Q$prefix\E\s*//;
print $tab x $level, $prev, "\n";
}
elsif ( length $common > length $prefix ) {
$prev =~ s/^\Q$common\E\s*//;
my $trim = $common;
$trim =~ s/^\Q$prefix\E\s*//;
push @step, $trim;
print $tab x $level, $trim;
$level = @step;
if ( $prev !~ /[ \p{Alpha}]/ ) {
print "\t$prev\n";
}
else {
print "\n", $tab x $level, $prev, "\n";
}
$prefix = $common;
}
elsif ( length $common < length $prefix ) {
$prev =~ s/^\Q$prefix\E\s*//;
print $tab x $level, $prev, "\n";
my @newstep;
my $test = $next;
for (@step) {
last unless $test =~ s/^\Q$_\E\s*//;
push @newstep, $_;
}
$level = @step = @newstep;
$prefix = $common;
}
return ( $level, $prefix, $next );
}

sub greatest_common_prefix {
no warnings 'uninitialized';
my ( $first, $second ) = @_;
my @first = split ' ', $first;
my @second = split ' ', $second;
my @gcp;
for (@first) {
$_ eq shift @second ? push @gcp, $_ : last;
}
return join ' ', @gcp;
}

__DATA__
ABLATION ENDOMETRIAL (HYSTEROSCOPIC) 68.23
ABLATION HEART (CONDUCTION DEFECT) 37.33/2
ABLATION HEART (CONDUCTION DEFECT) WITH CATHETER 37.34/2
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) 20.79/4
ABLATION INNER EAR (CRYOSURGERY) (ULTRASOUND) BY INJECTION 20.72
ABLATION LESION HEART BY PERIPHERALLY INSERTED CATHETER 37.34
ABLATION LESION HEART ENDOVASCULAR APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) ENDOVASCULAR
APPROACH 37.34
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) OPEN (TRANS-THORACIC)
APPROACH 37.33
ABLATION LESION HEART MAZE PROCEDURE (COX-MAZE) TRANS-THORACIC
APPROACH 37.33
ABLATION PITUITARY 7.69
ABLATION PITUITARY BY COBALT-60 92.32
ABLATION PITUITARY BY IMPLANTATION (STRONTIUM-YTTRIUM) (Y) NEC 92.39
ABLATION PITUITARY BY PROTON BEAM (BRAGG PEAK) 92.33
ABLATION PROSTATE (ANAT = 59.02) BY LASER, TRANSURETHRAL 60.21
ABLATION PROSTATE (ANAT = 59.02) BY RADIOFREQUENCY THERMOTHERAPY 60.97
ABLATION PROSTATE (ANAT = 59.02) BY TRANSURETHRAL NEEDLE ABLATION
(TUNA) 60.97
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY CRYOABLATION 60.62
ABLATION PROSTATE (ANAT = 59.02) PERINEAL BY RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.62
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL BY LASER 60.21
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL CRYOABLATION 60.29
ABLATION PROSTATE (ANAT = 59.02) TRANSURETHRAL RADICAL CRYOSURGICAL
ABLATION (RCSA) 60.29
ABLATION TISSUE HEART - SEE ABLATION, LESION, HEART 0
ABLATION VESICLE NECK (ANAT = 60.02) 57.91
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,808
Messages
2,569,686
Members
45,455
Latest member
GwendolynG

Latest Threads

Top