Simple string search

Jack

Hi guys,
A little problem here. I am very new to Perl and I am having a problem
searching for a substring in a file. So here is a sample:

(this is my id for id="wksOI*84sk_")
(this is my id for id="@s3dSSos_")
(this is my id for id="dksWDkps_")

So I have a page with 20 of these lines. All I am interested in is the id
part of each line, i.e. wksOl*84sk_ . As you may be able to tell, the id
part of each line is 12 chars and it always ends with _"). I think the
regex must be for an expression that starts with id=" and ends with
") with 12 letters in the middle. Once this id has been found, I
need to write it to a file.

I know with m/regex/ I can find stuff, but I don't know how to return the
cryptic id.

Any solutions?

Thanks
 
J. Gleixner

Jack said:
[...]
I know with m/regex/ I can find stuff, but I don't know how to return the
cryptic id.

Any solutions?

Many solutions.

perldoc perlretut

See "Extracting matches"
 
Josef Moellers

Jack said:
[...]
So I have a page with 20 of these lines. All I am interested in is the id
part of each line, i.e. wksOl*84sk_ . As you may be able to tell, the id
part of each line is 12 chars and it always ends with _"). I think the
regex must be for an expression that starts with id=" and ends with
") with 12 letters in the middle. Once this id has been found, I
need to write it to a file.

This is somewhat inconsistent. The example you gave (wksOl*84sk_) is
only 11 characters long.

I know with m/regex/ I can find stuff, but I don't know how to return the
cryptic id.

Any solutions?

if ($string =~ m/id="(.{12})"/) {
    $desired_id = $1;
}
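
For reference, the capture-and-write step the original post asks about could be sketched roughly as follows. This is only an illustration under assumptions: the sub names extract_id and write_ids and any file paths passed to them are hypothetical, and the pattern uses [^"]+ instead of .{12} because the sample ids shown in the post are shorter than 12 characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between id=" and the next
# double quote, or undef if the line has no id="..." part.
sub extract_id {
    my ($line) = @_;
    return $line =~ /id="([^"]+)"/ ? $1 : undef;
}

# Hypothetical helper: copy every id found in $in_path to $out_path,
# one id per line, and return how many ids were written.
sub write_ids {
    my ($in_path, $out_path) = @_;
    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    my $count = 0;
    while (my $line = <$in>) {
        my $id = extract_id($line);
        next unless defined $id;
        print {$out} "$id\n";
        $count++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $count;
}
```

Something like write_ids('page.txt', 'ids.txt') would then do the whole job, with both file names being placeholders.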
 
jordilin

Josef Moellers said:
This is somewhat inconsistent. The example you gave (wksOl*84sk_) is
only 11 characters long.

if ($string =~ m/id="(.{12})"/) {
    $desired_id = $1;
}

I have quickly written the following and tested it successfully:

while (<>) {
    if (/^\(.* id="(.*)"\)/) {
        print "$1\n";
    }
}

This works.
Best regards,
jordi
 
Josef Moellers

jordilin said:
I have quickly written the following and tested it successfully:

while (<>) {
    if (/^\(.* id="(.*)"\)/) {
        print "$1\n";
    }
}

This works.

Fine. TMTOWTDI.

This regex also works:
\(this is my id for id="(.*_)"\)

It's a question of the requirement: how the input is structured and how
much of it has to be matched in order to avoid false positives.
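
To illustrate the false-positive point, here is a small sketch. The input line is hypothetical (the original post never shows a line with a second quoted field); it is only meant to show how a greedy .* can capture more than the id, while a negated character class stops at the first closing quote.

```perl
use strict;
use warnings;

# Hypothetical input: a line carrying a second quoted field after the id.
my $line = '(this is my id for id="dksWDkps_") (name="x")';

# Greedy .* runs to the LAST '")' on the line.
my ($greedy) = $line =~ /id="(.*)"\)/;

# [^"]* cannot cross a double quote, so it stops at the first closing quote.
my ($bounded) = $line =~ /id="([^"]*)"\)/;

print "greedy:  $greedy\n";
print "bounded: $bounded\n";
```

Whether the tighter pattern is needed depends, as said above, on how the real input is structured.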
 
jordilin

Josef Moellers said:
This is somewhat inconsistent. The example you gave (wksOl*84sk_) is
only 11 characters long.

if ($string =~ m/id="(.{12})"/) {
    $desired_id = $1;
}

Yes, there is more than one way to do it. This is Perl, isn't it? By
the way, can you explain the following regex?
m/id="(.{12})"/
I am not sure about this 12 between brackets.
Thanks in advance,
Jordi
 
J. Gleixner

jordilin wrote:
[...]
By the way, can you explain the following regex?
m/id="(.{12})"/
I am not sure about this 12 between brackets.

When in doubt, read the documentation, or a book.

perldoc perlretut

Look for 'Matching repetitions'.
 
jordilin

J. Gleixner said:
jordilin wrote:
[...]
By the way, can you explain the following regex?
m/id="(.{12})"/
I am not sure about this 12 between brackets.

When in doubt, read the documentation, or a book.

perldoc perlretut

Look for 'Matching repetitions'.

The reason I am asking is that I have tried this particular regex
and it does not work on this particular example, so I would like the
author's explanation.
It is very easy to say "look at the docs". I recommend Mastering
Regular Expressions from O'Reilly, by the way.
Best regards,
Jordi
 
jordilin

Doug Miller said:
. means "any character"
{12} means "whatever came just before this, we're looking for 12 of it".

So .{12} means "any sequence of exactly 12 characters", and (.{12}) means
"open paren, followed by any sequence of exactly 12 characters, followed by
close paren".

--
Regards,
Doug Miller (alphageek at milmac dot com)

It's time to throw all their damned tea in the harbor again.

Well, I understand. The problem is that in this example the ids
differ in length, so it does not work here. We should write something like

m/id="(.{7,})"/

to match at least 7 times, given that there are no ids with fewer
than 7 chars.
Thanks
jordi
 
Jürgen Exner

jordilin said:
Well, I understand. The problem is that in this example the ids
differ in length, so it does not work here. We should write something like

m/id="(.{7,})"/

to match at least 7 times, given that there are no ids with fewer
than 7 chars.

Taking into account that HTML is not a regular language, only a fool would
try to parse HTML using regular expressions. Even with the non-regular
enhancements in Perl, REs are the wrong tool to parse HTML. This has been
discussed in this NG gazillions of times.
Or do you also use a hammer to fasten a screw? It works, ... sort of.

Use a tool that is meant to parse HTML if you want to parse HTML, e.g.
HTML::Parser.

jue
 
jordilin

Jürgen Exner said:
[...]
Use a tool that is meant to parse HTML if you want to parse HTML, e.g.
HTML::Parser.

I think you have posted in the wrong thread, mate. This is not about
HTML.
Best regards,
Jordi
 
Jürgen Exner

jordilin said:
I think you have posted in the wrong thread, mate. This is not about
HTML.

Oooops, indeed.
Sorry, I got two threads confused. You are right.

jue
 
Josef Moellers

jordilin said:
Well, I understand. The problem is that in this example the ids
differ in length, so it does not work here. We should write something like

m/id="(.{7,})"/

to match at least 7 times, given that there are no ids with fewer
than 7 chars.

But "Jack" writes in the original post "all I am interested in the id
part of each line ie, wksOl*84sk_ . As you maybe able to tell the id
part of each line is 12 char and it always ends with _")."
So I thought that whatever is between the "" is the id and it's supposed
to be 12 characters long.
If you now state that it should have been 7 or more, please re-read the
original post.

If the requirement is "at least 7", then, indeed, ".{7,}" is correct, as
can be found in "perldoc perlre". If the requirement is "12", then ".{12}"
is correct. If the requirement were "anything between the quote signs,
no matter how much", then ".*" would be correct.
I was under the assumption that the OP wanted to filter out illegal ids
which are not 12 characters long.

YMMV,

Josef
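
The three quantifiers discussed above can be compared directly on the sample lines from the original post. This is just a sketch: the helper count_matches and the variable names are made up for the illustration.

```perl
use strict;
use warnings;

# The three sample lines from the original post.
my @lines = (
    '(this is my id for id="wksOI*84sk_")',
    '(this is my id for id="@s3dSSos_")',
    '(this is my id for id="dksWDkps_")',
);

# Hypothetical helper: how many of the given lines does a pattern match?
sub count_matches {
    my ($re, @input) = @_;
    return scalar grep { $_ =~ $re } @input;
}

my $exactly_12 = count_matches(qr/id="(.{12})"/, @lines);
my $at_least_7 = count_matches(qr/id="(.{7,})"/, @lines);
my $any_length = count_matches(qr/id="(.*)"/,    @lines);

print "exactly 12: $exactly_12\n";   # 0, the sample ids are 11 and 9 chars
print "at least 7: $at_least_7\n";   # 3
print "any length: $any_length\n";   # 3
```

So on the posted samples .{12} rejects every line, which fits either reading: the samples are malformed, or the stated length is wrong.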
 
Doug Miller

jordilin said:
Yes, there is more than one way to do it. This is Perl, isn't it? By
the way, can you explain the following regex?
m/id="(.{12})"/
I am not sure about this 12 between brackets.

. means "any character"
{12} means "whatever came just before this, we're looking for 12 of it".

So .{12} means "any sequence of exactly 12 characters", and (.{12}) means
"open paren, followed by any sequence of exactly 12 characters, followed by
close paren".
 
Doug Miller

So far so good


Aehmmm, no.
My fault -- you're right. It *would* mean that if the parens were escaped,
i.e. \( and \). As is, it just means a sequence of 12 characters.
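
A short sketch of that difference, using one of the sample lines (the variable names are only for illustration, and .{9} is chosen because this particular sample id happens to be 9 characters long):

```perl
use strict;
use warnings;

my $line = '(this is my id for id="dksWDkps_")';

# Unescaped parentheses group and capture; they match no characters themselves.
my ($captured) = $line =~ /id="(.{9})"/;

# Escaped parentheses match a literal "(" and ")" in the input.
my $has_literal_parens = $line =~ /^\(.*\)$/ ? 1 : 0;

print "captured: $captured\n";
print "literal parens matched: $has_literal_parens\n";
```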
 
jordilin

Josef Moellers said:
[...]
I was under the assumption that the OP wanted to filter out illegal ids
which are not 12 characters long.

Well, you are absolutely right. The original poster should state
clearly what he wants, but he doesn't.
In any case, I think we have already offered several options that the
original poster can take to solve his problem.
Regards,
jordi
 
