Regular expression question.

Discussion in 'Perl Misc' started by MENTAT, Apr 5, 2005.

  1. MENTAT

    MENTAT Guest

    Hi,

    I have a log file that looks something like this

    2005-03-29 17:17:11.293|DEBUG|Line 1|
    >>>>>>>

    Actual Log output line 1
    Actual Log output line 2
    Actual Log output line 3
    Actual Log output line ...
    <<<<<<<
    2005-03-29 17:17:11.293|DEBUG|Line 9|
    >>>>>>>

    Actual Log output line 1
    Actual Log output line 2
    Actual Log output line 3
    Actual Log output line ...
    <<<<<<<
    2005-03-29 17:17:11.309|INFO|Line 4|
    >>>>>>>

    Actual Log output line 1
    Actual Log output line 2
    Actual Log output line 3
    Actual Log output line ...
    <<<<<<<
    2005-03-29 17:17:11.319|DEBUG|Line 9|
    >>>>>>>

    Actual Log output line 1
    Actual Log output line 2
    Actual Log output line 3
    Actual Log output line ...
    <<<<<<<

    I am trying to write a regular expression that extracts all the log
    entries for a given value of "Line".

    So if I wanted to look at the entries for Line 4 I want the output to
    look something like

    2005-03-29 17:17:11.309|INFO|Line 4|
    >>>>>>>

    Actual Log output line 1
    Actual Log output line 2
    Actual Log output line 3
    Actual Log output line ...
    <<<<<<<

    Note that I want everything in the line before the "Line 4" text as
    well, such as time.

    I tried using "<<<<<<<(.*?\|Line 4.*?)>>>>>>>(.*?)<<<<<<<" as the
    regular expression (with the ms global modifiers), but the problem is
    it matches everything from the beginning of the file (basically the
    first <<<<<<<) to the "Line 4". Making .* non-greedy doesn't help
    because after it finds the first <<<<<<< the non-greedy match goes all
    the way up to the first "Line 4".

    Replacing <<<<<<< with the beginning of line (^) doesn't make any
    difference either because after the start of the first line, the .*?
    still matches everything until "Line 4".

    I tried using lookahead and lookbehind assertions as well, but to no
    avail. This "<<<<<<<$\(.(?!$)*\|Line 4.*?)>>>>>>>(.*?)<<<<<<<" doesn't
    match anything.

    Of course, if I remove the s global modifier, I can easily match it
    using "(^.*\|Line 4.*)", but then I can't get all the (variable) lines
    between <<<<<<< and >>>>>>>. The .* won't match across a newline.

    Any idea how this problem could be solved? Any help is much
    appreciated.

    Thanks in advance.
     
    MENTAT, Apr 5, 2005
    #1

  2. MENTAT wrote:
    > I have a log file that looks something like this
    >
    > 2005-03-29 17:17:11.293|DEBUG|Line 1|
    > >>>>>>>

    > Actual Log output line 1
    > Actual Log output line 2
    > Actual Log output line 3
    > Actual Log output line ...
    > <<<<<<<
    > 2005-03-29 17:17:11.293|DEBUG|Line 9|
    > >>>>>>>

    > Actual Log output line 1
    > Actual Log output line 2
    > Actual Log output line 3
    > Actual Log output line ...
    > <<<<<<<
    >
    > I am trying to write a regular expression that extracts all the log
    > entries for a given value of "Line".


    Set the input record separator:

    my $line = 4;
    local $/ = "<<<\n";
    /\|Line $line\|/ and print while <LOG>;

    See "perldoc perlvar".
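
    A minimal, self-contained sketch of the record-separator approach,
    assuming the log format shown in the question. The in-memory filehandle
    and the $log sample text are stand-ins for a real LOG handle, and the
    \Q...\E quoting is an addition here so that the line number is matched
    literally.

```perl
use strict;
use warnings;

# Sample data in the format from the question (shortened for illustration).
my $log = <<'END';
2005-03-29 17:17:11.293|DEBUG|Line 1|
>>>>>>>

Actual Log output line 1
<<<<<<<
2005-03-29 17:17:11.309|INFO|Line 4|
>>>>>>>

Actual Log output line 1
Actual Log output line 2
<<<<<<<
2005-03-29 17:17:11.319|DEBUG|Line 9|
>>>>>>>

Actual Log output line 1
<<<<<<<
END

# Read from the string through an in-memory filehandle.
open my $fh, '<', \$log or die "Cannot open in-memory handle: $!";

my $line = 4;
my @matches;
{
    # Each read now returns one whole entry, up to and including "<<<<<<<\n".
    local $/ = "<<<<<<<\n";
    while (my $record = <$fh>) {
        push @matches, $record if $record =~ /\|Line \Q$line\E\|/;
    }
}
close $fh;

print @matches;
```

    Because each record is a complete entry, the pattern only has to
    identify the header; no multi-line matching is needed.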

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Apr 5, 2005
    #2

  3. MENTAT

    Guest

    MENTAT <> wrote:
    > I have a log file that looks something like this
    >
    > 2005-03-29 17:17:11.293|DEBUG|Line 1|
    >>>>>>>>

    > Actual Log output line 1
    > Actual Log output line 2
    > Actual Log output line 3
    > Actual Log output line ...
    > <<<<<<<
    > 2005-03-29 17:17:11.293|DEBUG|Line 9|


    [...]

    > So if I wanted to look at the entries for Line 4 I want the output to
    > look something like


    > 2005-03-29 17:17:11.309|INFO|Line 4|
    >>>>>>>>

    > Actual Log output line 1
    > Actual Log output line 2
    > Actual Log output line 3
    > Actual Log output line ...
    > <<<<<<<


    > Note that I want everything in the line before the "Line 4" text as
    > well, such as time.


    > I tried using "<<<<<<<(.*?\|Line 4.*?)>>>>>>>(.*?)<<<<<<<" as the
    > regular expression (with the ms global modifiers), but the problem is
    > it matches everything from the beginning of the file (basically the
    > first <<<<<<<) to the "Line 4". Making .* non-greedy doesn't help
    > because after it finds the first <<<<<<< the non-greedy match goes all
    > the way upto the first "Line 4".


    > Replacing <<<<<<< with the beginning of line (^) doesn't make any
    > difference either because after the start of the first line, the .*?
    > still matches everything until "Line 4".


    Instead of using .*?, you could spell out which characters are allowed
    on the header line, so that the match cannot run across a line end,
    along the lines of:

    '<<<<<<<\n([\w |.:-]*?\|Line 4.*?)>>>>>>>(.*?)<<<<<<<'

    Although, judging from the sample data you have provided, this would not
    work if the data sought is at the beginning of the file. So perhaps:

    '^([\w |.:-]*?\|Line 1.*?)>>>>>>>(.*?)<<<<<<<'

    would be better. Of course, you would have to check that the [\w |.:-]
    part covers everything that can occur on that line and that there are
    no clashes with the 'Actual Log output' lines.
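
    As a rough check of the anchored pattern, here is a small sketch run
    against sample data in the question's format. The $log text, the
    @found array and the trailing \| after "Line 4" (so that "Line 40"
    would not match) are assumptions added for illustration.

```perl
use strict;
use warnings;

my $log = <<'END';
2005-03-29 17:17:11.293|DEBUG|Line 1|
>>>>>>>

Actual Log output line 1
<<<<<<<
2005-03-29 17:17:11.309|INFO|Line 4|
>>>>>>>

Actual Log output line 1
Actual Log output line 2
<<<<<<<
2005-03-29 17:17:11.319|DEBUG|Line 9|
>>>>>>>

Actual Log output line 1
<<<<<<<
END

# /m lets ^ match at the start of any line; /s lets . cross newlines.
# The character class cannot match "\n", so the header part of the
# match is forced to stay on a single line.
my $entry = qr/^([\w |.:-]*?\|Line 4\|.*?)>>>>>>>(.*?)<<<<<<</ms;

my @found;
while ($log =~ /$entry/g) {
    push @found, { header => $1, body => $2 };
}

for my $e (@found) {
    print "header: $e->{header}";
    print "body:$e->{body}";
}
```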

    Axel
     
    , Apr 5, 2005
    #3
  4. MENTAT <> wrote:

    > I have a log file that looks something like this
    >
    > 2005-03-29 17:17:11.293|DEBUG|Line 1|
    >>>>>>>>

    > Actual Log output line 1
    > Actual Log output line 2
    > Actual Log output line 3
    > Actual Log output line ...
    ><<<<<<<
    > 2005-03-29 17:17:11.293|DEBUG|Line 9|
    >>>>>>>>

    > Actual Log output line 1
    > Actual Log output line 2
    > Actual Log output line 3
    > Actual Log output line ...
    ><<<<<<<



    > I am trying to write a regular expression that extracts all the log
    > entries for a given value of "Line".



    Would a much easier way that makes no use of regular expressions be OK?


    > Ofcourse, if i remove the s global modifier, i can easily match it
    > using "(^.*\|Line 4.*)", but then I can't get all the (variable) lines
    > between <<<<<<< and >>>>>>>. The .* won't match across new line.



    There are several ways to write "any character" (which includes
    newline) that remain unaffected by the m//s modifier.

    [\000-\377]
    [\d\D]
    [\w\W]
    [\s\S]
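
    A short sketch of the difference, using a made-up two-line string:
    the dot needs /s to cross a newline, while a class such as [\s\S]
    matches the newline either way.

```perl
use strict;
use warnings;

my $text = "first\nsecond";

# Without /s, "." does not match the newline, so this fails.
my $dot_plain = $text =~ /first.second/      ? 1 : 0;

# With /s, "." also matches the newline.
my $dot_s     = $text =~ /first.second/s     ? 1 : 0;

# [\s\S] matches any character, newline included, with or without /s.
my $class     = $text =~ /first[\s\S]second/ ? 1 : 0;

print "dot without /s: $dot_plain\n";
print "dot with /s:    $dot_s\n";
print "[\\s\\S]:         $class\n";
```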


    > Any idea how this problem could be solved?



    Setting

    $/ = "<<<<<<<\n";

    before reading the input would help a lot.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 5, 2005
    #4
  5. MENTAT

    MENTAT Guest

    Thanks Guys. It works with the input record separator. That was the
    missing key. The following code works.

    $required_pattern = "(\\|Line 4)";
    if (-e $file_name)
    {
        open (THEFILE, $file_name) or die "Unable to open file $file_name";
        $/ = "<<<<<<<\n"; # set the input record separator to this string

        while (<THEFILE>)
        {
            if ($_ =~ m/$required_pattern/ms)
            {
                print $_;
            }
        }

        close (THEFILE);
    }


    Thanks again ...

     
    MENTAT, Apr 6, 2005
    #5
  6. MENTAT

    MENTAT Guest

    PS: Tad, what was the other approach that doesn't use regular expressions?

     
    MENTAT, Apr 6, 2005
    #6
  7. (MENTAT) wrote in
    news::

    [ top-posting fixed. please don't do that. ]

    > Tad McClellan <> wrote in message
    > news:<>...
    >> MENTAT <> wrote:

    ....

    >> > I am trying to write a regular expression that extracts all the log
    >> > entries for a given value of "Line".


    ....

    >> Setting
    >>
    >> $/ = "<<<<<<<\n";
    >>
    >> before reading the input would help a lot.



    > Thanks Guys. It works with the input record seperator. That was the
    > missing key. The following code works.


    use strict;
    use warnings;

    > $required_pattern = "(\\|Line 4)";


    my $required_pattern = '(\|Line 4)';

    Why are you capturing?

    > if (-e $file_name)


    This is a useless test.

    > {
    > open (THEFILE, $file_name) or die "Unable to open file $file_name";


    Because open will fail if the file does not exist. BTW, you should
    include the reason open failed in the error message:

    open my $file, '<', $file_name
    or die "Unable to open file $file_name: $!";

    > if ($_ =~ m/$required_pattern/ms)


    By default, m// matches against $_, so no need to explicitly specify it.

    What do you think using both the m and s options for the match above
    achieves?

    From perldoc perlop:

    m Treat string as multiple lines.
    s Treat string as single line.

    Which one is it?

    > {
    > print $_;
    > }


    The whole thing can be written as

    print if /$required_pattern/ose;

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 6, 2005
    #7
  8. MENTAT

    John Bokma Guest

    John Bokma, Apr 6, 2005
    #8
  9. A. Sinan Unur wrote:
    >
    > use strict;
    > use warnings;
    >
    >>$required_pattern = "(\\|Line 4)";

    >
    > my $required_pattern = '(\|Line 4)';


    Or even better:

    my $required_pattern = qr'(?:\|Line 4)';


    > Why are you capturing?


    Indeed.


    > [snip]
    >
    >
    >> if ($_ =~ m/$required_pattern/ms)

    >
    > By default, m// matches against $_, so no need to explicitly specify it.
    >
    > What do you think using both the m and s options for the match above
    > achieves?
    >
    > From perldoc perlop:
    >
    > m Treat string as multiple lines.
    > s Treat string as single line.
    >
    > Which one is it?


    According to the OP's pattern he doesn't need either.


    >> {
    >> print $_;
    >> }

    >
    > The whole thing can be written as
    >
    > print if /$required_pattern/ose;


    /s ??? /e ???

    There are no periods in the pattern for /s and there are no expressions for /e
    to evaluate. (And if he uses qr// to compile the regexp there is no need for /o.)
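
    Pulling the suggestions so far together, one possible tidied-up
    version might look like the sketch below. The temporary file and its
    sample contents are stand-ins for the real log, and the trailing \|
    in the pattern is an addition here so that "Line 40" is not picked up
    as well.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample log to a temporary file (stand-in for the real log).
my ($tmp, $file_name) = tempfile(UNLINK => 1);
print {$tmp} <<'END';
2005-03-29 17:17:11.293|DEBUG|Line 1|
>>>>>>>

Actual Log output line 1
<<<<<<<
2005-03-29 17:17:11.309|INFO|Line 4|
>>>>>>>

Actual Log output line 1
<<<<<<<
END
close $tmp or die "Unable to close $file_name: $!";

# Precompiled pattern: no capture group, no /m, /s, /o or /e needed.
my $required_pattern = qr/\|Line 4\|/;

open my $file, '<', $file_name
    or die "Unable to open file $file_name: $!";

my @entries;
{
    local $/ = "<<<<<<<\n";    # one log entry per read
    while (<$file>) {
        push @entries, $_ if /$required_pattern/;
    }
}
close $file;

print @entries;
```

    Using local inside a block restores the previous value of $/ once
    the loop is done, so later reads elsewhere in the program are not
    affected.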



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Apr 6, 2005
    #9
  10. "John W. Krahn" <> wrote in
    news:jnM4e.2233$7Q4.2067@clgrps13:

    > A. Sinan Unur wrote:


    >> print if /$required_pattern/ose;

    >
    > /s ??? /e ???
    >
    > There are no periods in the pattern for /s and there are no
    > expressions for /e to evaluate. (And if he uses qr// to compile the
    > regexp there is no need for /o.)


    Indeed :)

    Dunno what I was thinking.

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 6, 2005
    #10
  11. Tad McClellan <> wrote in
    news::

    > A. Sinan Unur <> wrote:
    >> (MENTAT) wrote in
    >> news::

    >
    >
    >>> if (-e $file_name)

    >>
    >> This is a useless test.
    >>
    >>> {
    >>> open (THEFILE, $file_name) or die "Unable to open file
    >>> $file_name";

    >>
    >> Because open will fail if the file does not exist.

    >
    > It is not useless.


    ....

    > Remove the test and those semantics change.


    I see your point. I would prefer to handle the case where the file did
    not exist, if that is an important special case, as part of handling the
    failure from open.

    On the other hand, in the OP's code, if the file did not exist, the
    program did not convey this information to the user. Given that this
    might be one of the most common ways an open might fail, it would have
    been better to 'tell' the user why open failed and be done with it.
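
    One way to treat a missing file as a special case of a failed open,
    rather than testing with -e first, is to inspect the error after the
    fact. This is only a sketch: open_log is a hypothetical helper name
    and the file name is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a filehandle on success, nothing if the
# file does not exist, and dies for any other kind of failure.
sub open_log {
    my ($file_name) = @_;
    if (open my $fh, '<', $file_name) {
        return $fh;
    }
    # %! has a true value for the errno name that matches the current $!.
    return if $!{ENOENT};    # file does not exist: do nothing
    die "Unable to open file $file_name: $!";
}

my $fh = open_log('no-such-file.log');
print defined $fh ? "opened\n" : "file does not exist, nothing to do\n";
```

    This keeps the three outcomes distinct (missing file, other failure,
    success) without a separate existence check before the open.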

    >> What do you think using both the m and s options for the match above
    >> achieves?

    >
    > (the OP needs neither of course)
    >
    >> From perldoc perlop:
    >>
    >> m Treat string as multiple lines.
    >> s Treat string as single line.
    >>
    >> Which one is it?

    >
    > This illustrates precisely why I don't like the doc's treatment
    > of these two modifiers. I'm sure the docs do it that way for
    > mnemonic reasons.
    >
    > But it falsely implies that they are mutually exclusive.


    Yeah, as I said, I don't know what I was thinking when I wrote that
    part. Thanks for the correction.

    Sinan


    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Apr 6, 2005
    #11
  12. John W. Krahn <> wrote:
    > A. Sinan Unur wrote:



    >> print if /$required_pattern/ose;



    > there are no expressions for /e
    > to evaluate.



    It is worse than that. It won't even compile, since /e is only
    valid for s/// not for m//. :)


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 6, 2005
    #12
  13. A. Sinan Unur <> wrote:
    > (MENTAT) wrote in
    > news::



    >> if (-e $file_name)

    >
    > This is a useless test.
    >
    >> {
    >> open (THEFILE, $file_name) or die "Unable to open file $file_name";

    >
    > Because open will fail if the file does not exist.



    It is not useless.

    If the file does not exist: do nothing.

    If the file exists but cannot be opened: complain and exit.

    If the file exists and can be opened: normal processing.

    Remove the test and those semantics change.


    > What do you think using both the m and s options for the match above
    > achieves?



    (the OP needs neither of course)


    > From perldoc perlop:
    >
    > m Treat string as multiple lines.
    > s Treat string as single line.
    >
    > Which one is it?



    This illustrates precisely why I don't like the doc's treatment
    of these two modifiers. I'm sure the docs do it that way for
    mnemonic reasons.

    But it falsely implies that they are mutually exclusive.

    There are times when you might use both modifiers.

    So I'd prefer to give up on the mnemonicness:


    m Makes ^ and $ match begin/end of line (rather than of string)
    s Makes . match a newline
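
    A small sketch of the two modifiers acting independently and
    together, using a made-up three-line string:

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\ngamma\n";

# No modifiers: ^ matches only at the start of the string.
my @plain = $text =~ /^(\w+)/g;

# /m: ^ matches at the start of every line.
my @multi = $text =~ /^(\w+)/mg;

# /s alone: . can cross newlines, but ^ is still start-of-string only,
# so a pattern anchored on "beta" cannot match.
my ($single) = $text =~ /^beta(.*)/s;

# /m and /s together: anchor on a line start, then span lines with ".".
my ($both) = $text =~ /^beta(.*)/ms;

print "no modifiers: @plain\n";
print "/m:           @multi\n";
print "/s alone:     ", (defined $single ? "matched" : "no match"), "\n";
print "/m and /s:    ", (defined $both ? "matched across lines" : "no match"), "\n";
```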


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 6, 2005
    #13
  14. John W. Krahn wrote:

    > A. Sinan Unur wrote:
    >
    >>
    >> use strict;
    >> use warnings;
    >>
    >>> $required_pattern = "(\\|Line 4)";

    >>
    >> my $required_pattern = '(\|Line 4)';

    >
    > Or even better:
    >
    > my $required_pattern = qr'(?:\|Line 4)';


    Or even better:

    my $required_pattern = qr/\|Line 4/;

    When there's no benefit gained from using non-standard delimiters on
    qr// it's best not to do so (IMNSHO).

    Precompiling a regex with qr// implicitly has the same effect as
    wrapping it in (?:...).
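
    The implicit grouping can be seen with an alternation. In this
    sketch (the strings and variable names are made up), the precompiled
    pattern is interpolated into a larger one and still behaves as a
    single unit:

```perl
use strict;
use warnings;

my $alt = qr/cat|dog/;

# Interpolated into a larger pattern, the qr// object acts as one group,
# so both anchors apply to both alternatives: "hotdog" is rejected.
my $grouped = 'hotdog' =~ /\A${alt}\z/ ? 1 : 0;

# The same alternation typed in directly is not grouped: it reads as
# "\Acat" or "dog\z", and "hotdog" does end in "dog".
my $ungrouped = 'hotdog' =~ /\Acat|dog\z/ ? 1 : 0;

print "with qr//:    $grouped\n";
print "typed inline: $ungrouped\n";
```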
     
    Brian McCauley, Apr 6, 2005
    #14
  15. Brian McCauley wrote:
    > John W. Krahn wrote:
    >
    >> A. Sinan Unur wrote:
    >>>
    >>> use strict;
    >>> use warnings;
    >>>
    >>>> $required_pattern = "(\\|Line 4)";

    >
    >>> my $required_pattern = '(\|Line 4)';

    >>
    >> Or even better:
    >>
    >> my $required_pattern = qr'(?:\|Line 4)';

    >
    > Or even better:
    >
    > my $required_pattern = qr/\|Line 4/;


    Then you would have to backslash the backslash character because qr//
    interpolates its contents.

    my $required_pattern = qr/\\|Line 4/;


    > When there's no beniefit gained from using non-standard delimiters on
    > qr// it's best not to do so (IMNSHO).


    I used the single quotes to avoid interpolation. :)


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Apr 7, 2005
    #15
  16. MENTAT

    Guest

    John W. Krahn wrote:
    > Brian McCauley wrote:
    > > John W. Krahn wrote:
    > >
    > >> my $required_pattern = qr'(?:\|Line 4)';

    > >
    > > When there's no beniefit gained from using non-standard delimiters
    > > it's best not to do so (IMNSHO).

    >
    > I used the single quotes to avoid interpolation. :)


    There were no characters in that pattern that would be interpreted as
    interpolation.

    If you use m'' rather than // throughout your code whenever you want a
    regex but don't need interpolation then that would be perfectly
    reasonable.

    If you don't habitually use m'' then using qr'' seems inconsistent.
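
    For anyone wanting to check the spellings discussed in the last few
    posts, here is a small sketch with made-up test strings. With /
    delimiters, interpolation only concerns variables, so \| reaches the
    regex engine unchanged and matches a literal pipe; doubling the
    backslash instead produces an alternation between a literal
    backslash and the text "Line 4".

```perl
use strict;
use warnings;

my $header = '2005-03-29 17:17:11.309|INFO|Line 4|';
my $other  = 'see Line 40';    # "Line 4" with no pipe in front of it

my $single_backslash = qr/\|Line 4/;     # literal "|" then "Line 4"
my $double_backslash = qr/\\|Line 4/;    # literal "\" OR the text "Line 4"
my $single_quotes    = qr'\|Line 4';     # same pattern as the first one

for my $case (
    [ 'single backslash', $single_backslash ],
    [ 'double backslash', $double_backslash ],
    [ 'single quotes',    $single_quotes ],
) {
    my ($label, $re) = @$case;
    printf "%-17s header: %-8s other: %s\n", $label,
        ($header =~ $re ? 'match' : 'no match'),
        ($other  =~ $re ? 'match' : 'no match');
}
```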
     
    , Apr 7, 2005
    #16
