Is this efficient way to validate phrase against collection of regular expressions

Discussion in 'Java' started by Ossie, Feb 14, 2004.

  1. Ossie

    Ossie Guest

    Note the output below. Is this a reasonable way to determine if a
    phrase is
    found within a collection of patterns? Is there a better way to manage
    the
    validation of the phrase against a collection of patterns? My fear is
    this
    is pretty intense way of validating the pattern of the phrase.


    [omoore@skipper tmp]$ java MyExample "STK PIPE" "*PART" "this should
    not qualify for exlcusion" "AA1234567890"
    ============================================
    phrase: STK PIPE
    exlclude? true
    ============================================
    phrase: *PART
    exlclude? true
    ============================================
    phrase: this should not qualify for exlcusion
    exlclude? false
    ============================================
    phrase: AA1234567890
    exlclude? true
    ============================================
    [omoore@skipper tmp]$ cat MyExample.java
    import java.util.*;
    import java.util.regex.*;

    public class MyExample
    {
    public static void main(String [] args)
    {
    System.out.println("============================================");
    for ( int i = 0 ; i < args.length ; i++ )
    {
    System.out.println(" phrase: " + args );
    System.out.println("exlclude? " + isPhraseInExcludeList(args) );
    System.out.println("============================================");
    }
    }

    public static void initPatternVector()
    {
    m_ExcludedPhraseList = new Vector(0);

    //starts with 'STK '
    m_ExcludedPhraseList.add("STK .*");
    //starts with '*'
    m_ExcludedPhraseList.add("[*].*");
    //containing 'FRT'
    m_ExcludedPhraseList.add(".*FRT.*");
    //starts 2 upper & 6 nbrs
    m_ExcludedPhraseList.add("[A-Z]{2}[0-9]{6}.*");
    }

    public static boolean isPhraseInExcludeList(String phrase)
    {
    if ( m_ExcludedPhraseList == null)
    {
    initPatternVector();
    }

    for ( int i = 0 ; i < m_ExcludedPhraseList.capacity() ; i++ )
    {
    String pattern = (String) m_ExcludedPhraseList.get(i);
    if( Pattern.matches( pattern , phrase ) )
    {
    return true;
    }
    }

    return false;
    }

    public static Vector m_ExcludedPhraseList;
    }
    [omoore@skipper tmp]$
    Ossie, Feb 14, 2004
    #1
    1. Advertising

  2. Ossie

    Eric Bodden Guest

    Once upon a time <>,
    Ossie <> enriched the world with the following:

    > Is there a better way to manage the
    > validation of the phrase against a collection of patterns? My fear is
    > this is pretty intense way of validating the pattern of the phrase.

    No, I think that's pretty much optimal. You could use an arraylist instead
    of a vector though. Vector is pretty slow. The matching itself however is
    the most expensive part and that overhead cannot be overcome anyway.

    Eric

    --
    -----------------------------------------------------------------
    Eric Bodden
    ICQ UIN: 12656220
    Website: http://www.bodden.de
    PGP key available
    Eric Bodden, Feb 14, 2004
    #2
    1. Advertising

  3. Ossie

    Chris Uppal Guest

    Ossie wrote:

    > Note the output below. Is this a reasonable way to determine if a
    > phrase is
    > found within a collection of patterns?


    I'm afraid not. Two reasons:


    First, this line:

    if( Pattern.matches( pattern , phrase ) )


    creates a new pattern matcher object for every string that you want to check.
    Creating a pattern matcher is a relatively expensive operation (much more
    expensive than using it to check a short string), so you should create
    instances of java.util.Matcher as part of your settup, and then re-use them for
    each input string you want to check.

    Secondly, in theory (i.e. I have not intention of testing it), the time taken
    to check whether a string matches a regexp is proportional to the length of the
    string, independently of the complexity of the regexp[*]. So I'd expect that
    it would be quicker to generate single regexp that was an alternation of all
    your exclusion cases. That *should* execute as fast as any one of the separate
    tests in your code.

    I've never looked at Java's regexp implementation, but if the syntax is similar
    to what I'm used to in other fields, then a single exclusion pattern like:

    "(STK .*)|([*].*)|(.*FRT.*)|([A-Z]{2}[0-9]{6}.*)"

    would do the trick.

    -- chris

    [*] That's to say that the theory of how to do it is well-known. However there
    are tradeoffs of time vs. space and any particular regexp implementation may,
    in order to save space, use an implementation that isn't constant-time in the
    complexity of the regexp. I don't know what -- say -- Sun's implementation
    does, but I'd *hope* that it is constant-time.
    Chris Uppal, Feb 14, 2004
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. bruce
    Replies:
    4
    Views:
    736
    Cameron Laird
    Sep 22, 2006
  2. Ben Finney
    Replies:
    0
    Views:
    429
    Ben Finney
    Sep 22, 2006
  3. Øyvind Isaksen
    Replies:
    1
    Views:
    939
    Øyvind Isaksen
    May 18, 2007
  4. Ted
    Replies:
    5
    Views:
    171
    Eric Bohlman
    May 30, 2006
  5. Noman Shapiro
    Replies:
    0
    Views:
    219
    Noman Shapiro
    Jul 17, 2013
Loading...

Share This Page