Truncating text from a string with beginning text from another string

Mark

From a line of arbitrary text, possibly followed by some amount of
text from the beginning of the string ' Reference #\d+', where \d+
represents one or more digit characters, I want to output the line
without the ending ' Reference...' string. For example, the input line
'some arbitrary text Refer' would become 'some arbitrary text'.

Here are two programs that seem to do what I want, but they seem
overly complicated for this task. I'm looking for a simpler solution,
possibly by using a better regular expression than I have chosen in my
first sample code.

First sample:
use strict ;
use warnings ;

my $re = qr'^(.*)\ ( (R$)|
    (Re$)|
    (Ref$)|
    (Refe$)|
    (Refer$)|
    (Refere$)|
    (Referen$)|
    (Referenc$)|
    (Reference\ {0,1}$)|
    (Reference\ \#\d{0,}$)
)'x ;

while(<DATA>) {
    chomp ;
    print "in : >$_<\n" ;
    if (my($result) = /$re/g) {
        print "out: >$result<\n" ;
    }
    else {
        print "out: >$_<\n" ;
    }
}

__DATA__
Refer
One Referenc
two three Reference
xx yy Reference Reference
def Refere Reference #xx
abc the def Refere Reference #
abc the def Refere Reference #12


Second sample:
use strict ;
use warnings ;

my $PATTERN = 'Reference #000000' ;

my $pos ;
while (<DATA>) {
    chomp ;
    $pos = -1 ;
    while ((my $ind = index($_,' R',$pos)) != -1) {
        $pos = $ind + 1 ;
    }
    print "in : >$_<\n" ;
    my $result = $_ ;

    if ($pos > 0) {
        my $re = substr($_,$pos) ;
        $re =~ s/\d+$/\\d+/ ;
        $re = qr/^$re/ ;
        if ($PATTERN =~ /$re/) {
            $result = substr($_,0,$pos-1) ;
        }
    }
    print "out: >$result<\n" ;
}

__DATA__
Refer
One Referenc
two three Reference
xx yy Reference Reference
def Refere Reference #xx
abc the def Refere Reference #
abc the def Refere Reference #12
 
Mirco Wahab

Mark said:
Here are two programs that seem to do what I want, but they seem
overly complicated for this task. I'm looking for a simpler solution,
possibly by using a better regular expression than I have chosen in my
first sample code.
First sample:
[...]
Second sample:
[...]

I don't really know what all this
should produce, but why wouldn't
a simple:

while(<DATA>) {
    chomp && print "$1 ==> from [$_]\n" if /(.+?)Refer/
}


do all you want? In your explanation you
mentioned you'd truncate all subsequent
occurrences of 'refer', 'reference' and
everything following.

Regards

M.
 
Brian McCauley

[ An interesting problem ]
I'm looking for a simpler solution,
possibly by using a better regular expression than I have chosen in my
first sample code.

Wow! What a brilliant post. Clear, well thought out, interesting.

Just wish I had an answer. I'll think about that one tonight. I'll
probably be up all night thinking about it!
 
Brian McCauley

use strict ;
use warnings ;

my $re = qr'^(.*)\ ( (R$)|
    (Re$)|
    (Ref$)|
    (Refe$)|
    (Refer$)|
    (Refere$)|
    (Referen$)|
    (Referenc$)|
    (Reference\ {0,1}$)|
    (Reference\ \#\d{0,}$)
)'x ;

while(<DATA>) {
    chomp ;
    print "in : >$_<\n" ;
    if (my($result) = /$re/g) {
        print "out: >$result<\n" ;
    }
    else {
        print "out: >$_<\n" ;
    }
}

Just being picky but...

As far as I can see the /g in the match does nothing useful.

Nor do most of the (...) in the regex.

{0,1} and {0,} in regex are so commonly used that they have
one-character shorthands: ? and * respectively.
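
Applied to the first sample, those three points might look like the sketch below. It drops the /g, keeps a single capture group, and uses ? and * for the quantifiers. It lists every prefix from R through Reference, including Referen. The helper name strip_reference is made up for illustration, and this is only one possible tidy-up, not a fundamentally simpler approach.

```perl
use strict;
use warnings;

# One capture group for the kept text; the prefix list is non-capturing.
# With /x, a literal space is written [ ] and '#' must be escaped.
my $re = qr{
    ^ (.*) [ ]
    (?: R | Re | Ref | Refe | Refer | Refere | Referen | Referenc
      | Reference [ ]?
      | Reference [ ] \# \d*
    )
    $
}x;

# Hypothetical helper: return the line with a trailing partial
# ' Reference #NNN' removed, or the line unchanged if there is none.
sub strip_reference {
    my ($line) = @_;
    return $line =~ $re ? $1 : $line;
}

print strip_reference($_), "\n"
    for 'One Referenc', 'two three Reference', 'def Refere Reference #xx';
```

Every alternative is followed by $, so the order of the alternatives does not affect which lines match.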

BTW are you perhaps trying to implement something like File::Stream?
 
usenet

my $re = qr'^(.*)\ ( (R$)|
    (Re$)|
    (Ref$)|
    (Refe$)|
    (Refer$)|
    (Refere$)|
    (Referen$)|
    (Referenc$)|
    (Reference\ {0,1}$)|
    (Reference\ \#\d{0,}$)
)'x ;

Try this instead; results are identical to your regex except what
happens to $2, which you don't use anyway (and you could avoid setting
$2, but extra complexity for no real gain):

$re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;
 
usenet

$re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;

Then again, it would be possible to "fool" this regex where your
original would not be fooled (for example, by dropping a middle
character). Needs more thought....
 
Brian McCauley

Try this instead; results are identical to your regex except what
happens to $2, which you don't use anyway (and you could avoid setting
$2, but extra complexity for no real gain):

$re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;

No, that matches "Rernc 10" etc too.
 
anno4000

Brian McCauley said:
[ An interesting problem ]
I'm looking for a simpler solution,
possibly by using a better regular expression than I have chosen in my
first sample code.

Wow! What a brilliant post. Clear, well thought out, interesting.

....plus runnable code, including a convincing set of test data.
I quite agree.
Just wish I had an answer. I'll think about that one tonight. I'll
probably be up all night thinking about it!

Ah, it won't take all night. Here is my take:

{
    my $fix = ' Reference #';
    my $pat = "$fix\\d+";
    my @parts = map substr( $fix, 0, $_), 1 .. length $fix;

    sub rem_ref {
        my $str = shift;
        $str =~ s/$pat$// and return $str;
        $str =~ s/$_$// and return $str for @parts;
        return $str;
    }
}

while ( <DATA> ) {
    chomp;
    print "in : >$_<\n";
    print "out: >", rem_ref( $_), "<\n";
}

Anno
 
anno4000

Try this instead; results are identical to your regex except what
happens to $2, which you don't use anyway (and you could avoid setting
$2, but extra complexity for no real gain):

$re = qr{^(.*) Re?f?e?r?e?n?c?e? ?(\#\d*)$}x;

No, that would also match things like "gaga Refe #12".

Anno
 
Mirco Wahab

Mark said:
From a line of arbitrary text, possibly followed by some amount of
text from the beginning of the string ' Reference #\d+', where \d+
represents one or more digit characters, I want to output the line
without the ending ' Reference...' string. For example, the input line
'some arbitrary text Refer' would become 'some arbitrary text'.

Here are two programs that seem to do what I want, but they seem
overly complicated for this task. I'm looking for a simpler solution,
possibly by using a better regular expression than I have chosen in my
first sample code.

After taking a wrong turn at first,
I think this can't be solved very
differently from your solution.

The Regex can be an incremental one
(as was shown already by others) or a
sequence of alternations (as you tried).

One could rewrite it somehow 'different',
as a "split", like:

use strict;
use warnings;
no warnings 'qw';

my @end = qw{R e f e r e n c e \\s # \\d+};
my $reg = '('.(join '|',map join('',@$_),map[@end[0..$_]],0..$#end).')$';

while( <DATA> ) {
    chomp;
    print "[$_->[0]]\n\t[$_->[1]]\n" for
        map [$_->[0]||'undef', $_->[1]||'undef'],
        [split /$reg/]
}

__DATA__
....

Aside from the regex construction (which can be commented
properly ;-), this should be quite readable.
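
As a rough illustration of what a commented construction could look like, here is a sketch that builds the same kind of alternation step by step. The intermediate names (@prefixes, $alternation, strip_tail) are invented for this example, and the leading \s+ is an assumption so that the separating whitespace is removed together with the tail.

```perl
use strict;
use warnings;
no warnings 'qw';    # the qw{} list below contains '#', which would warn

# Regex fragments in order; each alternative is a run of leading fragments.
my @end = qw{R e f e r e n c e \\s # \\d+};

# 'R', 'Re', ..., 'Reference', 'Reference\s', 'Reference\s#', 'Reference\s#\d+'
my @prefixes = map { join '', @end[ 0 .. $_ ] } 0 .. $#end;

# Whitespace, then any one of the prefixes, anchored at the end of the line.
my $alternation = join '|', @prefixes;
my $reg         = qr/\s+(?:$alternation)$/;

# Hypothetical helper: return a copy of the line with the tail removed.
sub strip_tail {
    my ($line) = @_;
    ( my $copy = $line ) =~ s/$reg//;
    return $copy;
}

print strip_tail($_), "\n" for 'One Referenc', 'abc the def Refere Reference #12';
```

Naming the intermediate list makes it possible to print @prefixes and check the alternatives before they are joined into a pattern.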


Regards

M.
 
Mirco Wahab

Mark said:
Here are two programs that seem to do what I want, but they seem
overly complicated for this task. I'm looking for a simpler solution,
possibly by using a better regular expression than I have chosen in my
first sample code.

After taking a wrong turn at first,
I think this can't be solved very
differently from your solution.

Of course, one can write it somewhat differently, like:

...
my @end = split //, 'Reference #000000';
my $key = '('.(join '|', map join('',,@$_), map[@end[0..$_]], 0..$#end).')';
...

while(<DATA>) {
    print "$1\t\t$2\n"
        if /^(.+?)($key)$/
}

__DATA__
....

Regards

M.
 
Mirco Wahab

Mirco said:
One could rewrite it somehow 'different',
as a "split", like:

use strict;
use warnings;
...
[split /$reg/]
...

....
Regex and output slightly modified to match yours:


...
no warnings 'qw';

my @end = qw{R e f e r e n c e \\s # \\d+};
my $reg = '\s+('.(join '|',map join('',@$_),map[@end[0..$_]],0..$#end).')$';

while( <DATA> ) {
    chomp;
    print "in : >$_<\n";
    print "out: >", (split /$reg/)[0], "<\n"
}
...

Regards

M.
 
Gary E. Ansok

No, that would also match things like "gaga Refe #12".

You could write something like this

$re = qr{^(.*)\ (R(?:e(?:f(?:e(?:r(?:e(?:n(?:c(?:e(?:\ (?:\#\d*)
?)?)?)?)?)?)?)?)?)?)$}x;

but that's not clear at all to the human reader, and I don't think
adding more whitespace would help much in this case.

Depending on your needs, it might be more clear to use a simpler regex like
$re = qr{^(.*) ((R[a-z #]+) \d*)$};

and then test ($3 eq substr('Reference #', 0, length $3))
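
A minimal sketch of that capture-then-verify idea follows. It is not the exact regex above: it assumes the digits follow '#' directly, allows a bare 'R', and captures the digits separately so they can be checked. The names strip_if_prefix and $fix are hypothetical.

```perl
use strict;
use warnings;

my $fix = 'Reference #';

# Hypothetical helper, written only to illustrate the idea.
sub strip_if_prefix {
    my ($line) = @_;

    # Loose capture: text, a space, a tail starting with 'R' made of
    # lowercase letters, spaces and '#', then optional trailing digits.
    my ( $head, $tail, $digits ) = $line =~ /^(.*) (R[a-z #]*)(\d*)$/
        or return $line;

    # The tail must be a leading substring of 'Reference #' ...
    return $line unless $tail eq substr( $fix, 0, length $tail );

    # ... and digits are only allowed after the complete 'Reference #'.
    return $line if length $digits and $tail ne $fix;

    return $head;
}

print strip_if_prefix($_), "\n" for 'One Referenc', 'gaga Refe #12';
```

The regex only decides where the candidate tail starts. The substr comparison does the real validation, which is why something like 'gaga Refe #12' is left alone.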

Gary Ansok
 
Mirco Wahab

Mirco said:
...
[split /$reg/]
...


regex/output simplified and slightly modified
to match yours:

...
no warnings 'qw';
my @end = qw{R e f e r e n c e \\s # \\d+};

my $reg = '\s+('.(join'|',map join('',@end[0..$_]),0..$#end).')$';

while( <DATA> ) {
    chomp;
    print "in : >$_<\n";
    print "out: >", (split /$reg/)[0], "<\n"
}
...

Regards

M.
 
Mark

text from the beginning of the string ' Reference #\d+', where \d+
represents one or more digit characters, I want to output the line
without the ending ' Reference...' string. For example, the input line
'some arbitrary text Refer' would become 'some arbitrary text'.

Thanks to all who responded and offered ideas. Anno's post was
especially interesting.

- M
 
anno4000

Mark said:
Thanks to all who responded and offered ideas. Anno's post was
especially interesting.

Thanks. Since you mention it, the sub definition can be slightly
simplified:

{
    my $fix = ' Reference #';
    my @parts = map substr( $fix, 0, $_), 1 .. length $fix;

    sub rem_ref {
        my $str = shift;
        $str =~ s/$_$// and return $str for @parts, "$fix\\d+";
        return $str;
    }
}
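
One edge case may be worth checking: @parts runs from the shortest prefix to the longest, so a line ending in ' Reference ' (with the trailing space) loses only that final space. If that matters, a variant that tries the longest candidate first could look like the sketch below. The name rem_ref_longest and the use of quotemeta are additions for illustration, not part of the code above.

```perl
use strict;
use warnings;

{
    my $fix = ' Reference #';

    # Longest candidate first: ' Reference #' plus digits, then
    # ' Reference #', ' Reference ', ..., down to a single space.
    my @pats = ( qr/\Q$fix\E\d+$/ );
    for my $len ( reverse 1 .. length $fix ) {
        my $prefix = quotemeta substr( $fix, 0, $len );
        push @pats, qr/$prefix$/;
    }

    # Hypothetical variant of rem_ref.
    sub rem_ref_longest {
        my $str = shift;
        for my $pat (@pats) {
            return $str if $str =~ s/$pat//;
        }
        return $str;
    }
}

print ">", rem_ref_longest('abc Reference '), "<\n";
```

Like the version above, this still treats a lone trailing space as a removable prefix, since a single space is the first character of ' Reference #'.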

Anno
 
Mirco Wahab

Michele said:
my @end = split //, 'Reference #000000';
my $key = '('.(join '|', map join('',,@$_), map[@end[0..$_]], 0..$#end).')';

Isn't that an awkward way to reimplement substr()?

Not only that: the whole approach shown above will not
work either (it does not solve the stated problem). After
thinking again I tried to cancel the message (and post a
working solution), but your news server didn't honor my
cancel attempts. So it all came to light ...

Regards

Mirco
 
