Best way to search for a string which has N% in a character class?

Peng Yu

Hi,

Suppose that I want to search for a substring in which, say, 50% of
the letters are in a letter class, say [A-D]. Note that there is some
ambiguity at the two ends of the substring. But other than that, this
problem is well defined.

It seems that this problem cannot (or cannot easily; please let me
know if there is a way) be formulated as a regex. Since Perl is strong
at string processing, I think that there might be a good way to search
for such strings in Perl. Does anybody have a good way to search for
this type of substring?

Regards,
Peng
 
J. Gleixner

What have you tried?

Using 'tr' and 'length' would probably help you.

From perldoc perlop:

y/SEARCHLIST/REPLACEMENTLIST/cds
[...]Transliterates all occurrences of the characters found in the
search list with the corresponding character in the replacement list.
It returns the number of characters replaced or deleted.

Using that you can get the number of characters in the class.
e.g. $cnt = tr/[A-D]/[A-D]/;

Using 'length' you can find how many characters are in the string.

perldoc -f length

Divide one by the other, multiply by 100 and you have the percent.
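
As a rough sketch of that arithmetic (the helper name percent_in_class is made up for illustration and is not from perlop), the count and the length can be combined like this. The search list is written as A-D without brackets, on the assumption that only the four letters should be counted, and the replacement list is left empty so the string is not modified:

```perl
use strict;
use warnings;

# Percentage of characters in $str that fall in the range A-D.
# percent_in_class is a hypothetical helper name used only for this sketch.
sub percent_in_class {
    my ($str) = @_;
    return 0 if length($str) == 0;     # avoid division by zero
    my $count = ($str =~ tr/A-D//);    # counts A, B, C and D; $str is unchanged
    return 100 * $count / length($str);
}

print percent_in_class("ABxyCDzz"), "\n";   # 4 of 8 characters, prints 50
```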
 
Tim McDaniel

The "tr" entry in "man perlop" continues:

Note that "tr" does not do regular expression character classes
such as "\d" or "[:lower:]". The <tr> operator is not equivalent
to the tr(1) utility. If you want to map strings between
lower/upper cases, see "lc" in perlfunc and "uc" in perlfunc, and
in general consider using the "s" operator if you need regular
expressions.

The expression
tr/[A-D]/[A-D]/;
will translate [ to [ and ] to ], so they will be included in the
count. A-D works because that's a special case in tr. Also,

If the "/d" modifier is used, the REPLACEMENTLIST is always
interpreted exactly as specified. Otherwise, if the
REPLACEMENTLIST is shorter than the SEARCHLIST, the final
character is replicated till it is long enough. If the
REPLACEMENTLIST is empty, the SEARCHLIST is replicated. This
latter is useful for counting characters in a class or for
squashing character sequences in a class.

So if you really want a range of characters like A through D,
tr/A-D//
works. If you want a regex character class such as \d or [:alpha:],
you need to use s/// instead.
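
A small sketch of the difference, using a made-up sample string. For the regex classes it counts with a match in list context rather than s///, which is an alternative that leaves the string untouched, not what the post above prescribes:

```perl
use strict;
use warnings;

my $sample = "[AB]12cd";    # made-up test string

# Brackets in a tr search list are literal characters, so they get counted.
my $with_brackets = ($sample =~ tr/[A-D]//);    # counts [, A, B, ]
my $range_only    = ($sample =~ tr/A-D//);      # counts A, B

# Regex classes need a regex; "() =" forces list context to count matches.
my $digits = () = $sample =~ /\d/g;             # counts 1, 2
my $lower  = () = $sample =~ /[[:lower:]]/g;    # counts c, d

print "$with_brackets $range_only $digits $lower\n";   # prints "4 2 2 2"
```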
 
J. Gleixner

Thanks for the correction.
 
Peng Yu

I don't think that you understand my question.

Suppose that I have a string $str which is the concatenation of $str1,
$str2 and $str3, where both $str1 and $str3 have less than 50% of
[A-D] and $str2 has more than 50% of [A-D].

I need to discover from $str where $str2 starts and ends. I don't
see how tr and length alone can address this question.
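
One possible reading of this, offered only as a sketch: score each character +1 if it is in [A-D] and -1 otherwise, and look for the contiguous run with the highest total score (Kadane's maximum-subarray algorithm). Any run with a positive score has more than 50% of its characters in the class. This is an assumed formalization, not something stated in the thread; the ends of the run remain ambiguous in the way described above, and the name densest_segment is invented for the example.

```perl
use strict;
use warnings;

# Returns (start, length) of the highest-scoring run, where characters in
# A-D score +1 and all others score -1. Returns an empty list if no
# character is in A-D. densest_segment is a hypothetical name.
sub densest_segment {
    my ($str) = @_;
    my ($best, $best_start, $best_len) = (0, 0, 0);
    my ($sum, $start) = (0, 0);
    for my $i (0 .. length($str) - 1) {
        $sum += (substr($str, $i, 1) =~ /[A-D]/) ? 1 : -1;
        if ($sum <= 0) {
            # A run whose score has dropped to zero cannot help; restart.
            ($sum, $start) = (0, $i + 1);
        }
        elsif ($sum > $best) {
            ($best, $best_start, $best_len) = ($sum, $start, $i - $start + 1);
        }
    }
    return $best_len ? ($best_start, $best_len) : ();
}

my $text = "xxyzABxCDAyDzzxyq";    # made-up input
if (my ($pos, $len) = densest_segment($text)) {
    print substr($text, $pos, $len), " at offset $pos\n";   # ABxCDA at offset 4
}
```

On ties the sketch keeps the first, shortest run it finds; "ABxCDAyD" scores the same here, which is one face of the boundary ambiguity.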
 
sln

On 03/02/12 10:29, Peng Yu wrote:

I don't think that you understand my question.

Suppose that I have a string $str which is the concatenation of $str1,
$str2 and $str3, where both $str1 and $str3 have less than 50% of
[A-D] and $str2 has more than 50% of [A-D].

I need to discover from $str where $str2 starts and ends. I don't
see how tr and length alone can address this question.

50% of what? Without boundary conditions, the type of regex solution
you're thinking of is impossible.

The way you state your problem is that [A-D] can exist randomly
in sequence or between [^A-D] characters.

The only things you state as known are the total length of the
random-length strings after concatenation and the over/under 50%
content of each.

You can slide a regex frame over the final string, but there is not
enough information about boundary conditions to get real information.
There are simply more unknowns than there are equations.
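
For what it's worth, sliding a frame could be sketched as below. The window width and the threshold are parameters that have to be supplied from outside (they are assumptions of the example, not something the problem statement provides), and the name dense_windows is made up. It uses substr and tr rather than a regex:

```perl
use strict;
use warnings;

# Slide a fixed-width window over $str and return every offset at which
# more than $min_pct percent of the window's characters are in A-D.
# $width and $min_pct are illustrative parameters, not from the thread.
sub dense_windows {
    my ($str, $width, $min_pct) = @_;
    return () if $width < 1;                   # guard against division by zero
    my @hits;
    for my $pos (0 .. length($str) - $width) { # empty range if $str is too short
        my $window = substr($str, $pos, $width);
        my $count  = ($window =~ tr/A-D//);    # characters of the window in A-D
        push @hits, $pos if 100 * $count / $width > $min_pct;
    }
    return @hits;
}

my @offsets = dense_windows("xxyzABxCDAyDzzxyq", 4, 50);
print "@offsets\n";    # prints "4 5 6 7 8"
```

The hits only say which windows are dense; turning them into a single begin/end pair still depends on the chosen width.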

For instance,
- if the length of each substring were the same it could be
solved, but this way would not need a regex.
- if the [A-D] were adjacent, the start/end still could not be
determined. You would know only that this match of > 50% is in
the substring that needs to be found, with no begin/end information
about it.

I think it was a nice try though, futile, but nice.

-sln
 
