reg expression example...

john

Hi All,
I need to process a large text file, and I'm using this expression.
All I know is that these words appear in this order.

"word1.*word2(.*)word3.*word4.*word5"

How can I optimize it?

thanks.
 
Arne Vajhøj

> I need to process a large text file, and I'm using this expression.
> All I know is that these words appear in this order.
>
> "word1.*word2(.*)word3.*word4.*word5"
>
> How can I optimize it?

Either just use standard Pattern.compile or ask your boss
to approve a lot of hours to do a custom solution, test
it, fix bugs and maintain it for a decade or two.

Arne

PS: Are you sure you want greedy?
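
To illustrate the PS, here is a minimal sketch of how greedy and reluctant quantifiers differ on this pattern. The input string and the names GreedyVsReluctant and firstGroup are made up for the example; they are not from the original post. With greedy `.*` the group comes from the last word2...word3 stretch that still lets the rest of the pattern match; with reluctant `.*?` it comes from the first one.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical input with two word2...word3 stretches inside one block.
        String input = "word1 word2 A word3 x word2 B word3 word4 word5";

        // Greedy: the group is " B " (taken from the last stretch).
        System.out.println("[" + firstGroup("word1.*word2(.*)word3.*word4.*word5", input) + "]");

        // Reluctant: the group is " A " (taken from the first stretch).
        System.out.println("[" + firstGroup("word1.*?word2(.*?)word3.*?word4.*?word5", input) + "]");
    }
}
```

Which of the two is "right" depends on what the real data looks like, so treat this only as a way to see the difference.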
 
Eric Sosman

> Hi All,
> I need to process a large text file, and I'm using this expression.
> All I know is that these words appear in this order.
>
> "word1.*word2(.*)word3.*word4.*word5"
>
> How can I optimize it?

What separates the words from each other, and how do you know
you've reached the end of the interstitial space and reached the
start of the next word?

Or if you're actually looking for lines like

word1word2buzzword3lightyearword4mumbleword5

... then you have my sympathies.
 
john

Eric said:
> What separates the words from each other, and how do you know
> you've reached the end of the interstitial space and reached the
> start of the next word?
>
> Or if you're actually looking for lines like
>
> word1word2buzzword3lightyearword4mumbleword5
>
> ... then you have my sympathies.

Yes. All I know is the order, and I need to get the piece between word2 and
word3. There may be multiple word1...word5 patterns, and not
all of them include the other words.

Pattern pattern = Pattern.compile(
"word1.*word2(.*)word3.*word4.*word5" , Pattern.MULTILINE|Pattern.DOTALL);

This works, but I guess it's not the most efficient way...
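
For reference, a sketch of pulling out every piece between word2 and word3 when the text holds several word1...word5 blocks. It swaps in reluctant quantifiers, because with the greedy form a single match can stretch from the first word1 to the last word5 in the file. Pattern.MULTILINE is left out here since it only changes how ^ and $ behave and this pattern has neither. The class name, method name, and sample text are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractBetween {
    // DOTALL lets '.' match line terminators, so a block may span lines.
    private static final Pattern P = Pattern.compile(
        "word1.*?word2(.*?)word3.*?word4.*?word5", Pattern.DOTALL);

    // Collects group 1 of every non-overlapping match, in order.
    static List<String> extract(CharSequence text) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample: two complete blocks, the second split over two lines.
        String text = "word1 a word2 FIRST word3 b word4 c word5\n"
                    + "noise\n"
                    + "word1 word2 SECOND\nword3 word4 word5";
        for (String piece : extract(text)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

Compiling the Pattern once and reusing it, as above, is usually the cheapest win; whether anything more is needed is a question for a profiler.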
 
Eric Sosman

> Yes. All I know is the order, and I need to get the piece between word2 and
> word3. There may be multiple word1...word5 patterns, and not
> all of them include the other words.
>
> Pattern pattern = Pattern.compile(
> "word1.*word2(.*)word3.*word4.*word5" , Pattern.MULTILINE|Pattern.DOTALL);

So, from "word1word2buzzword3lightyearword4mumbleword5", literally,
you want to extract "buzz" as the group between "word2" (those exact
five characters) and "word3" (those five)? And you want to reject (not
match) "word9word8buzzword7lightyearword6word5"?

> This works, but I guess it's not the most efficient way...

It's the straightforward approach for the problem you've described.
Straightforward very often equals best, for many definitions of "best."
Have you made measurements that indicate it's not "good enough?"
 
john

Eric said:
> So, from "word1word2buzzword3lightyearword4mumbleword5", literally,
> you want to extract "buzz" as the group between "word2" (those exact
> five characters) and "word3" (those five)?

Yes.

I need to use word1 and word5 as the start and end of this pattern, but
there may be other word1...word5 patterns that don't include
word3/word4. I don't need those.

Actually, I used "word1.*word2(.*?)word3.*word4.*word5"

> And you want to reject (not
> match) "word9word8buzzword7lightyearword6word5"?

Yes.

> It's the straightforward approach for the problem you've described.
> Straightforward very often equals best, for many definitions of "best."
> Have you made measurements that indicate it's not "good enough?"

No.
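
One possible way to skip the word1...word5 blocks that lack word3/word4 is sketched below. It assumes that word1 and word5 never occur inside a block, only at its start and end; that assumption is mine, not something established in this thread. Each gap is written so it cannot step across the start of another word1 or word5, which keeps a match from spilling out of an incomplete block into the next one. The names SkipIncompleteBlocks, GAP and extract, and the sample text, are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipIncompleteBlocks {
    // A run of characters that never steps onto the start of "word1" or "word5",
    // so one match cannot cross a block boundary (assumes those two words
    // appear only as block delimiters).
    private static final String GAP = "(?:(?!word1|word5).)*?";

    private static final Pattern P = Pattern.compile(
        "word1" + GAP + "word2(" + GAP + ")word3" + GAP + "word4" + GAP + "word5",
        Pattern.DOTALL);

    // Collects group 1 from every complete block; incomplete blocks do not match.
    static List<String> extract(CharSequence text) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample: the first block has no word3/word4, the second is complete.
        String text = "word1 word2 skipme word5 word1 word2 keep word3 word4 word5";
        // Only " keep " is collected.
        System.out.println(extract(text));
    }
}
```

For comparison, the plain reluctant pattern "word1.*?word2(.*?)word3.*?word4.*?word5" on the same sample would start at the first word1 and capture " skipme word5 word1 word2 keep ". The lookahead does extra work at every character, so this trades some speed for correctness; measure before deciding it matters.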
 
Eric Sosman

> Yes.
>
> I need to use word1 and word5 as the start and end of this pattern, but
> there may be other word1...word5 patterns that don't include
> word3/word4. I don't need those.
>
> Actually, I used "word1.*word2(.*?)word3.*word4.*word5"
>
> Yes.

Okay: As I wrote earlier, "You have my sympathies."

> > Have you made measurements that indicate it's not "good enough?"
> No.

Then you should, before haring off after efficiencies that may
turn out to be meaningless. People have made studies of how good
programmers are at predicting which pieces of their programs will
take the most time, and study after study has shown that even the
Great Grand Gurus are dismal failures at it. Measure, *then* worry
about efficiency -- because if you don't, chances are better than
even that you'll be worrying about something irrelevant.
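
A crude way to get a first measurement is sketched below. It times the match loop with System.nanoTime on synthetic data; the record layout, the record count, and the names RegexTiming and countMatches are all invented for the example, and the real file should be substituted before drawing any conclusion. For anything beyond a rough number, a proper harness such as JMH is the safer tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTiming {
    // Counts the matches of p in text; this is the work being timed.
    static int countMatches(Pattern p, CharSequence text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Synthetic stand-in for the real file: 20,000 made-up records.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("word1 x word2 payload").append(i).append(" word3 y word4 z word5\n");
        }
        Pattern p = Pattern.compile(
            "word1.*?word2(.*?)word3.*?word4.*?word5", Pattern.DOTALL);

        // A few untimed passes so the JIT has a chance to compile the hot code.
        for (int i = 0; i < 5; i++) {
            countMatches(p, sb);
        }

        long start = System.nanoTime();
        int n = countMatches(p, sb);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " matches in " + elapsedMs + " ms");
    }
}
```

If the number that comes out is small next to the time spent reading the file, the regex is not the bottleneck.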

Idle Question #1: How much time have you spent writing these
messages to Usenet and contemplating the answers?

Idle Question #2: How much faster must you make the pattern-
matching merely to break even on the time already devoted to #1?
 
Roedy Green

> "word1.*word2(.*)word3.*word4.*word5"

Give us half a dozen examples of the exact strings you are trying to
find/parse. Anything with any sort of BNF notation in it is
ambiguous.
 
