Regex problem

Discussion in 'Perl Misc' started by Hendrik Maryns, Oct 8, 2007.

  1. (This is in Java, but the regex is general, therefore x-post to
    c.l.p.m., f-up to c.l.j.h.)

    Hi all,

    I want to discard the header of some file. The header is everything
    before a line beginning with "#BOS". However, I do not want #BOS to be
    part of the match, since I need it later on.

    I thought of using a regex to do that. I came up with

    .*(?s)(?=#BOS)

    However, this gave me nothing.
    (To be precise, I have

    Scanner corpus = new Scanner(inFile);
    Pattern header = Pattern.compile(".*(?s)(?=#BOS)", Pattern.MULTILINE);
    corpus.skip(header);

    and it gives me

    java.util.NoSuchElementException
    at java.util.Scanner.skip(Scanner.java:1706)
    at de.uni_tuebingen.sfb.lichtenstein.binarytrees.Converter2.main(Converter2.java:61)

    so if any of the Java people see a problem there, please point it out.)

    So to pinpoint my problem: I want a regex which matches any number of
    lines until it finds a line beginning with #BOS, but does not include
    #BOS in the match.

    Other tries looked like this:

    .*?(?s)(?=#BOS)
    (.|\n)*?(?=#BOS) (this freezes the program)
    .*(?=#BOS) with the MULTILINE option to Pattern.compile
    .*(?s)^(?=#BOS)

    and several others, but I find no solution. So my last resort is asking
    here.

    TIA, H.
    --
    Hendrik Maryns
    http://tcl.sfs.uni-tuebingen.de/~hendrik/
    ==================
    http://aouw.org
    Ask smart questions, get good answers:
    http://www.catb.org/~esr/faqs/smart-questions.html
     
    Hendrik Maryns, Oct 8, 2007
    #1

  2. [ f-up set to a newsgroup that I participate in. ]


    Lew <> wrote:
    >> Hendrik Maryns () wrote on VCLI September
    >> MCMXCIII in <URL:news:fed8ua$v7n$-tuebingen.de>:
    >> ** (This is in Java, but the regex is general, therefore x-post to
    >> ** c.l.p.m., f-up to c.l.j.h.)

    ^^^^^^^^^^^^^^^^
    > Abigail wrote:
    >> I don't read the latter, so I won't post just there. Followups set to

    ^^^^^^^^^^
    >> clpm though.

    >
    > But the OP /does/ read clj.help, and pointed out that his problem is in Java,



    And he will see Abigail's helpful answer there.

    So what's the problem?


    > so redirecting the answers away from clj.help is pure arrogance.



    He did not redirect answers away!

    His post containing an answer was posted to the newsgroup that
    the OP asked for.

    Abigail does not participate in clj.help, and so won't
    be able to see any followups to his post.

    Dumping stuff into a newsgroup you do not read is arrogance.


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
     
    Tad McClellan, Oct 8, 2007
    #2

  3. Abigail schreef:
    > _
    > Hendrik Maryns () wrote on VCLI September
    > MCMXCIII in <URL:news:fed8ua$v7n$-tuebingen.de>:
    > ** (This is in Java, but the regex is general, therefore x-post to
    > ** c.l.p.m., f-up to c.l.j.h.)
    >
    > I don't read the latter, so I won't post just there. Followups set to
    > clpm though.


    Due to the quibbling, now cross-posting.

    It seems it was a good idea to ask in c.l.p.m., though, since two of
    three helpful answers came from there!

    > ** I want to discard the header of some file. The header is everything
    > ** before a line beginning with "#BOS". However, I do not want #BOS to be
    > ** part of the match, since I need it later on.
    > **
    > ** I thought of using a regex to do that. I came up with
    > **
    > ** .*(?s)(?=#BOS)
    >
    > That changes the meaning of . *after* matching .*


    Ah, I thought that was a global thing.
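
    The point about flag placement can be checked with a small sketch. The
    class name, helper method and sample input below are made up for
    illustration; lookingAt() is used because Scanner.skip() likewise
    requires the match to start at the current position.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineFlagPosition {
    // Returns the text matched at the very start of input, or null if the
    // pattern does not match there (the same anchoring Scanner.skip() applies).
    static String matchAtStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "header 1\nheader 2\n#BOS 1\nbody\n"; // made-up sample

        // (?s) placed after .* : the dot still cannot cross the first newline,
        // so nothing matches at offset 0.
        System.out.println(matchAtStart(".*(?s)(?=#BOS)", input));

        // (?s) placed before .* : the dot now matches newlines as well.
        System.out.println(matchAtStart("(?s).*(?=#BOS)", input));
    }
}
```

    With the flag in front, the match covers both header lines and stops
    right before #BOS; with the flag behind .* there is no match at the
    start of the input, which is consistent with skip() throwing
    NoSuchElementException.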

    > /(?s).*(?=#BOS)/
    >
    > would do, although I would write it as:
    >
    > /^.*(?=#BOS)/s
    >
    > Note that due to the .*, it will match everything up to the *last* occurrence
    > of #BOS. You might want to write that differently if you want to remove things
    > up to the first #BOS, for instance (untested):
    >
    > /^[^#]*(?:#(?!BOS)[^#]*)*#BOS/
    >
    > which does some loop unrolling, avoids the usage of .*? (which can be
    > costly), and doesn't need (?s) because there's no . in the pattern.


    The version with .*? works fine. Why would that be costly?
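
    A minimal sketch of the difference between the greedy and the reluctant
    form, using a made-up input with two #BOS markers (class and method
    names are hypothetical). The reluctant .*? has to retry the lookahead
    after every single character, which is presumably the cost meant above;
    on a short header that should hardly be noticeable.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Matches regex at the start of input with DOTALL enabled, so that
    // '.' also matches line terminators; returns null if there is no match.
    static String headerOf(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "header\n#BOS 1\nfirst\n#BOS 2\nsecond\n"; // made-up sample

        // Greedy: runs to the end, then backtracks to the LAST #BOS.
        System.out.println(headerOf(".*(?=#BOS)", input));

        // Reluctant: grows one character at a time and stops at the FIRST #BOS.
        System.out.println(headerOf(".*?(?=#BOS)", input));
    }
}
```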

    Would you mind explaining a bit what the above does? My hunch:
    -look for anything except # (this matches \n as well, I suppose), as
    often as possible
    -if you see a #, check that it is not followed by BOS, and is then again
    followed by anything except #; and this whole thing as often as
    possible, until #BOS is effectively seen

    What I do not understand, is why the first non-capturing group is
    necessary, and did you forget to make the last #BOS a positive
    lookahead, or is that on purpose?
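
    For reference, a sketch of how the unrolled pattern could be used with
    Scanner.skip(). Two things here are assumptions of the sketch and not
    part of the quoted pattern: the leading ^ is dropped, since skip() only
    matches at the current position anyway, and the trailing #BOS is wrapped
    in a lookahead so that it stays in the input. The class name, method
    name and sample text are made up.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class SkipHeader {
    // [^#]*            text up to the first '#'
    // (?:#(?!BOS)[^#]*)*  any '#' that does not start "#BOS", plus the text after it
    // (?=#BOS)         stop right before "#BOS" without consuming it (assumption)
    static final Pattern HEADER =
            Pattern.compile("[^#]*(?:#(?!BOS)[^#]*)*(?=#BOS)");

    // Skips the header and returns the first line that follows it.
    static String firstLineAfterHeader(String text) {
        try (Scanner corpus = new Scanner(text)) {
            corpus.skip(HEADER); // throws NoSuchElementException if no #BOS exists
            return corpus.nextLine();
        }
    }

    public static void main(String[] args) {
        String text = "# a comment\nversion 2 #draft\n#BOS 1\nbody\n#BOS 2\n";
        System.out.println(firstLineAfterHeader(text));
    }
}
```

    Note that this stops at the first #BOS anywhere in the text, not only
    at one that begins a line; if a header line may contain #BOS in the
    middle, the pattern would need an extra check for a preceding newline.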

    > Note that I anchored the pattern to the beginning of the string. This
    > should speed up the case where no #BOS is present in the string matched
    > against.


    Hm, it seems there is still a lot about regular expressions left to explore…

    Thanks, H.


     
    Hendrik Maryns, Oct 10, 2007
    #3
