Regex problem

Hendrik Maryns

(This is in Java, but the regex is general, therefore x-post to
c.l.p.m., f-up to c.l.j.h.)

Hi all,

I want to discard the header of some file. The header is everything
before a line beginning with "#BOS". However, I do not want #BOS to be
part of the match, since I need it later on.

I thought of using a regex to do that. I came up with

.*(?s)(?=#BOS)

However, this gave me nothing.
(To be precise, I have

Scanner corpus = new Scanner(inFile);
Pattern header = Pattern.compile(".*(?s)(?=#BOS)", Pattern.MULTILINE);
corpus.skip(header);

and it gives me

java.util.NoSuchElementException
at java.util.Scanner.skip(Scanner.java:1706)
at
de.uni_tuebingen.sfb.lichtenstein.binarytrees.Converter2.main(Converter2.java:61)

so if any of the Java people see a problem there, please point it out.)

So to pinpoint my problem: I want a regex which matches any number of
lines until it finds a line beginning with #BOS, but does not include
#BOS in the match.

Other tries looked like this:

.*?(?s)(?=#BOS)
(.|\n)*?(?=#BOS) (this freezes the program)
.*(?=#BOS) with the MULTILINE option to Pattern.compile
.*(?s)^(?=#BOS)

and several others, but I find no solution. So my last resort is asking
here.

TIA, H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
 
Tad McClellan

[ f-up set to a newsgroup that I participate in. ]


Lew said:

But the OP /does/ read clj.help, and pointed out that his problem is in Java,


And he will see Abigail's helpful answer there.

So what's the problem?

so redirecting the answers away from clj.help is pure arrogance.


He did not redirect answers away!

His post containing an answer was posted to the newsgroup that
the OP asked for.

Abigail does not participate in clj.help, and so won't
be able to see any followups to his post.

Dumping stuff into a newsgroup you do not read is arrogance.
 
Hendrik Maryns

Abigail wrote:
Hendrik Maryns ([email protected]) wrote on VCLI September
MCMXCIII in <URL:
** (This is in Java, but the regex is general, therefore x-post to
** c.l.p.m., f-up to c.l.j.h.)

I don't read the latter, so I won't post just there. Followups set to
clpm though.

Due to the quibbling, now cross-posting.

It seems it was a good idea to ask in c.l.p.m., though, since two of
three helpful answers came from there!
** I want to discard the header of some file. The header is everything
** before a line beginning with "#BOS". However, I do not want #BOS to be
** part of the match, since I need it later on.
**
** I thought of using a regex to do that. I came up with
**
** .*(?s)(?=#BOS)

That changes the meaning of . *after* matching .*

Ah, I thought that was a global thing.

/(?s).*(?=#BOS)/

would do, although I would write it as:

/^.*(?=#BOS)/s
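To make the point about flag placement concrete on the Java side, here is a minimal sketch. The class name InlineFlagDemo and the helper headerEnd are made up for illustration. It uses Matcher.lookingAt() because, as far as I know, Scanner.skip also only accepts a match anchored at the current position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineFlagDemo {
    /** End offset of a match anchored at offset 0, or -1 if there is none. */
    static int headerEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String input = "header 1\nheader 2\n#BOS first\nbody\n#BOS second\n";

        // (?s) comes too late: the earlier .* still stops at the first newline,
        // so no match can start at offset 0 and this prints -1.
        System.out.println(headerEnd(".*(?s)(?=#BOS)", input));

        // (?s) first: .* now crosses newlines. Being greedy, it runs up to
        // the offset of the LAST #BOS in the input.
        System.out.println(headerEnd("(?s).*(?=#BOS)", input));
    }
}
```

The first call finding nothing at the start of the input would be consistent with the NoSuchElementException reported for the original pattern.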

Note that due to the .*, it will match everything up to the *last* occurrence
of #BOS. You might want to write that differently if you want to remove things
up to the first #BOS, for instance (untested):

/^[^#]*(?:#(?!BOS)[^#]*)*#BOS/

which does some loop unrolling, avoids the usage of .*? (which can be
costly), and doesn't need (?s) because there's no . in the pattern.
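For the Java readers, a hedged sketch of how the unrolled pattern could be tried out. One deliberate change, which is my assumption and not part of the pattern as posted: the trailing #BOS is wrapped in a lookahead so the marker stays out of the match. The names UnrolledHeaderDemo and stripHeader are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledHeaderDemo {
    // [^#]*              run of non-# characters (newlines included)
    // (?:#(?!BOS)[^#]*)* a # that does not start "#BOS", then more non-# characters
    // (?=#BOS)           assumption: lookahead instead of consuming the marker
    static final Pattern HEADER =
            Pattern.compile("^[^#]*(?:#(?!BOS)[^#]*)*(?=#BOS)");

    /** Returns the input without its header, or unchanged if no #BOS is found. */
    static String stripHeader(String input) {
        Matcher m = HEADER.matcher(input);
        return m.lookingAt() ? input.substring(m.end()) : input;
    }

    public static void main(String[] args) {
        String input = "h1 # comment\nh2\n#BOS a\nbody\n#BOS b\n";
        // The header contains a '#', which the unrolled loop steps over.
        System.out.print(stripHeader(input));
    }
}
```

Note that this pattern stops at the first #BOS anywhere in the text, not only at one that begins a line, so a header that mentions #BOS mid-line would be cut short.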

The version with .*? works fine. Why would that be costly?
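For completeness, a minimal sketch of the .*? variant wired into Scanner.skip. The original snippet read from inFile; here an in-memory string stands in for it, and the ^ inside the lookahead (with MULTILINE) is my own addition so that only a #BOS at the start of a line counts. The names SkipHeaderDemo and firstLineAfterHeader are made up for illustration.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class SkipHeaderDemo {
    // (?s) comes first so that '.' matches newlines in the whole pattern;
    // .*? is reluctant, so it stops at the FIRST line that starts with #BOS.
    static final Pattern HEADER =
            Pattern.compile("(?s).*?(?=^#BOS)", Pattern.MULTILINE);

    /** Skips the header and returns the first line after it. */
    static String firstLineAfterHeader(String text) {
        try (Scanner corpus = new Scanner(text)) {
            corpus.skip(HEADER);   // throws NoSuchElementException if no #BOS line exists
            return corpus.nextLine();
        }
    }

    public static void main(String[] args) {
        String text = "header 1\nsee #BOS below\n#BOS first\nbody\n#BOS second\n";
        System.out.println(firstLineAfterHeader(text));
    }
}
```

Because the lookahead consumes nothing, the scanner is left positioned directly in front of the #BOS line, which is what the original question asked for.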

Would you mind explaining a bit what the above does? My hunch:
-look for anything except # (this matches \n as well, I suppose), as
often as possible
-if you see a #, check that it is not followed by BOS, and is then again
followed by anything except #; and this whole thing as often as
possible, until #BOS is effectively seen

What I do not understand, is why the first non-capturing group is
necessary, and did you forget to make the last #BOS a positive
lookahead, or is that on purpose?

Note that I anchored the pattern to the beginning of the string. This
should speed up the case where no #BOS is present in the string matched
against.

Hm, seems like there is still a lot to regular expressions to be explored…

Thanks, H.
--
Hendrik Maryns
http://tcl.sfs.uni-tuebingen.de/~hendrik/
==================
http://aouw.org
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html


 
