Remove short words from a string

Leif Wessman

Hi all!

How can I remove all words that have a length that is 3 or less?

"a lot of words in this text";

should become

"words this text"

Is it possible?

Leif
 
Paul Lalli

Leif said:
How can I remove all words that have a length that is 3 or less?

"a lot of words in this text";

should become

"words this text"

Is it possible?

Yes, it's possible.

Two approaches jump out at me. You could use a join/grep/split
combination, or you could use a regexp solution, being sure to
include word boundaries.
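
For readers following along, here is a minimal sketch of both approaches, assuming a "word" is either a whitespace-separated token (first approach) or a run of \w characters (second approach). The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "a lot of words in this text";

# Approach 1: split into tokens, keep those longer than 3 characters, rejoin.
my $kept = join ' ', grep { length($_) > 3 } split ' ', $text;

# Approach 2: delete each run of 1 to 3 word characters bounded by \b,
# together with any whitespace that follows it.
(my $stripped = $text) =~ s/\b\w{1,3}\b\s*//g;

print "$kept\n";
print "$stripped\n";
```

Both lines print "words this text" for this input. The two approaches can differ once punctuation or repeated spaces are involved.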

What have you tried so far? How did it not work as you expected?

Paul Lalli
 
Ted Zlatanov

How can I remove all words that have a length that is 3 or less?

"a lot of words in this text";

should become

"words this text"

Solution below. Note that your requirement ("remove all words...")
does not match the expected result, since you are also removing
whitespace around the words. That's why I added the second regex.
Still, the leading space is preserved. You can either add a third
regex to eliminate leading spaces, or you can split on ' '.

Keep in mind that if you split on ' ' you still won't have "words"
because punctuation will be included, for example. This is why I
would recommend against a split()/grep()/join() approach for this,
unless you are absolutely sure you don't need to worry about
punctuation or preserving spaces.
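
A small sketch of the punctuation problem described above. The input sentence is made up for illustration; it is not from the original question.

```perl
use strict;
use warnings;

# Hypothetical input with punctuation attached to a short word.
my $text = "Yes, the cat sat quietly.";

# split ' ' yields the token "Yes," which is 4 characters long,
# so grep keeps it even though the word itself has only 3 letters.
my $by_split = join ' ', grep { length($_) > 3 } split ' ', $text;

# Matching \w+ measures only the word characters, so "Yes" is dropped
# while the comma and the original spacing are left in place.
(my $by_regex = $text) =~ s/(\w+)/length($1) > 3 ? $1 : ''/eg;

print "[$by_split]\n";
print "[$by_regex]\n";
```

The first result keeps "Yes," as a word; the second drops it but leaves the comma and the gaps behind, which is why a whitespace cleanup step follows in the solution.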

Ted

#!/usr/bin/perl

use warnings;
use strict;

my $text = "a lot of words in this text";
# note: \w may not match your definition of a word; adjust accordingly
$text =~ s/(\w+)/length $1 > 3 ? $1 : ''/eg;
# collapse runs of whitespace to a single space
$text =~ s/\s+/ /g;
print "$text\n";
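
A sketch of the "third regex" mentioned in the post, assuming it is acceptable to trim whitespace at both ends of the string. The trim pattern is one common choice, not something the post specifies.

```perl
use strict;
use warnings;

my $text = "a lot of words in this text";
$text =~ s/(\w+)/length($1) > 3 ? $1 : ''/eg;  # drop words of 3 characters or fewer
$text =~ s/\s+/ /g;                            # collapse runs of whitespace
$text =~ s/^\s+|\s+$//g;                       # third regex: trim both ends
print "[$text]\n";
```

With the trim in place, the output matches the result requested in the question.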
 
Dr.Ruud

Mirco Wahab wrote:
Leif:

I'll try to give an easy example, and you'll
try to explain it line by line in your reply, OK?

use strict;
use warnings;

my $shortlen = 3;
my $fulltext = 'a lot of words in this text';
my $no_shorts = $fulltext;

$no_shorts =~ s/ \b \w{1,$shortlen} \b \s+//gmx;

1. This won't work well with a short last word, because "\s+" requires
whitespace after it. Maybe use "\s*" or "(?:\s+|$)".

2. Maybe "\w" is too limited. It is just [[:alnum:]_], so it doesn't
contain "-", which could lead to unwanted changes to words like
"non-essential".
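
A sketch of how both points might be addressed. It assumes a "word" may contain hyphens as well as \w characters, and it uses lookarounds in place of \b. The character class and the sample text are illustrative choices, not part of the original posts.

```perl
use strict;
use warnings;

my $shortlen = 3;
my $fulltext = 'a non-essential bit of text to do';

# With \b and \s*, the "non" in "non-essential" counts as a short word.
(my $naive = $fulltext) =~ s/\b\w{1,$shortlen}\b\s*//g;

# Treat hyphens as part of a word (an assumption; adjust the class as
# needed) and accept either trailing whitespace or the end of the string.
(my $no_shorts = $fulltext) =~
    s/ (?<![\w-]) [\w-]{1,$shortlen} (?![\w-]) (?:\s+|$) //gx;

print "[$naive]\n";
print "[$no_shorts]\n";
```

The first result starts with "-essential", while the second keeps "non-essential" intact and also removes the short last word. A trailing space is left behind in both, so a final trim may still be wanted.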
 
Ted Zlatanov

Ted Zlatanov said:
Hi all!

How can I remove all words that have a length that is 3 or less?
[...]
Here is your hint.
grep { length > 3 } @words;

That's not a good hint.

What's wrong with it?

As I explained in my other post, the split/grep/join approach is not
aware of punctuation and whitespace. Two spaces may become one, a
period may count as a letter... It's just not a good solution unless
we know for sure it's OK to use it.
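
A quick sketch of the whitespace point, using a made-up input in which every word is longer than 3 characters, so nothing should be removed at all.

```perl
use strict;
use warnings;

# Hypothetical input: no short words, but irregular spacing.
my $text = "some  words   with    wide gaps";

# split ' ' discards the runs of spaces; join ' ' puts back single spaces.
my $rejoined = join ' ', grep { length($_) > 3 } split ' ', $text;

print "[$text]\n";
print "[$rejoined]\n";
```

No word is filtered out, yet the string changes because the original spacing is lost in the round trip.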

Also the hint doesn't say anything about split() and join(). It's not
very useful. At least say "split() before, join() after" in a
comment. Takes 4 words, and may save the OP hours of work. If I
didn't know Perl well and got this hint, I'd be puzzled for many
reasons.

Finally, the OP's requirements (as I mentioned in my other post too)
contradict each other. He's removing words of length <= 3, but the
example he gives also eliminates whitespace.

Ted
 
