Regular Expression assistance

Discussion in 'Perl Misc' started by Steve Dunn, Dec 29, 2003.

  1. Steve Dunn

    Steve Dunn Guest

    I'm wondering if anyone can help with the following problem:

    I have the following text:

    <DOCUMENT>

    <TYPE>EX-5

    <SEQUENCE>3

    <DESCRIPTION>OPINION OF

    BRADLEY ARANT, ET AL.

    <TEXT>

    ..

    And I have the following (multi-line) regular expression:

    ^<([^/].+?[^/])>([\S ]+)



    This correctly matches any line that contains "<tag>any characters" but not
    "</tag>" or "<tag>". The following captures are returned from the
    expression:

    1 => TYPE

    2 => EX-5



    1 => SEQUENCE

    2 => 3



    1 => DESCRIPTION

    2 => OPINION OF



    I now need to modify the expression to take into account multi-line content.
    To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
    but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
    AL."



    Many thanks in advance,



    Steve.
     
    Steve Dunn, Dec 29, 2003
    #1

  2. "Steve Dunn" <> wrote in message
    news:UnSHb.12489$...
    > I'm wondering if anyone can help with the following problem:
    >
    > I have the following text:
    >
    > <DOCUMENT>

    snipped vaguely xml-like text ...
    > And I have the following (multi-line) regular expression:
    >


    > ^<([^/].+?[^/])>([\S ]+)
    >


    first, a warning:
    regular expressions will only work for simple xml-like stuff.
    i hope you do not have tag nesting or attributes.
    >
    >
    > I now need to modify the expression to take into account multi-line
    > content.
    > To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
    > but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
    > AL."


    a few methods come to mind:

    1) if the file is small (not huge), you can slurp it in, and use something
    like
    m!^<([^/].+?[^/])>([^<]+)!s

    2) set the input record separator to '<' and work with that

    3) when you read a line not starting with '<', add it to previous item
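
    A minimal sketch of method 2, written in Python rather than Perl purely for
    illustration: splitting the text on '<' stands in for setting Perl's input
    record separator ($/) to '<'. The SAMPLE string and the split_records helper
    are assumptions made for this sketch, not code from the thread. Like any
    approach keyed on '<', it would start a new record at a literal '<' inside a
    value.

```python
# Sample text modeled on the snippet in the question (assumed layout: one tag per line).
SAMPLE = """<DOCUMENT>
<TYPE>EX-5
<SEQUENCE>3
<DESCRIPTION>OPINION OF
BRADLEY ARANT, ET AL.
<TEXT>
"""


def split_records(text):
    """Return (tag, value) pairs, treating '<' as the record separator."""
    pairs = []
    for record in text.split("<"):
        # Each record looks like "TAG>value", possibly spanning several lines.
        tag, sep, value = record.partition(">")
        # Collapse line breaks and runs of whitespace inside the value.
        value = " ".join(value.split())
        # Skip records without '>', closing or self-closing tags, and empty tags.
        if sep and value and not tag.startswith("/") and not tag.endswith("/"):
            pairs.append((tag, value))
    return pairs


if __name__ == "__main__":
    for tag, value in split_records(SAMPLE):
        print(tag, "=>", value)
```

    For the sample this keeps TYPE, SEQUENCE and DESCRIPTION, folds the second
    line into the DESCRIPTION value, and drops the empty <DOCUMENT> and <TEXT>
    tags.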


    what have you tried?

    gnari
     
    Ragnar Hafstað, Dec 29, 2003
    #2

  3. Steve Dunn

    Steve Dunn Guest

    Hi Ragnar,
    Thanks. I'm not using Perl, just the regular expression (in .NET). It's
    not XML (nor HTML), but some half-baked attempt at mark-up that was thought
    of shortly after the dinosaurs became extinct! There are no nested tags
    within the text, but empty tags must be ignored (in the example below,
    <DOCUMENT> is an empty tag). The files are very small, and 'slurping' (like
    the expression!) is one possibility if I can't get the regex to work.

    Thanks again,

    Steve.

    "Ragnar Hafstað" <> wrote in message
    news:bsotm8$vjt$...
    > "Steve Dunn" <> wrote in message
    > news:UnSHb.12489$...
    > > I'm wondering if anyone can help with the following problem:
    > >
    > > I have the following text:
    > >
    > > <DOCUMENT>

    > snipped vaguely xml-like text ...
    > > And I have the following (multi-line) regular expression:
    > >

    >
    > > ^<([^/].+?[^/])>([\S ]+)
    > >

    >
    > first, a warning:
    > regular expressions will only work for simple xml-like stuff.
    > i hope you do not have tag nesting or attributes.
    > >
    > >
    > > I now need to modify the expression to take into account multi-line
    > > content.
    > > To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
    > > but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
    > > AL."

    >
    > a few methods come to mind:
    >
    > 1) if the file is small (not huge), you can slurp it in, and use something
    > like
    > m!^<([^/].+?[^/])>([^<]+)!s
    >
    > 2) set the input record separator to '<' and work with that
    >
    > 3) when you read a line not starting with '<', add it to previous item
    >
    >
    > what have you tried?
    >
    > gnari
    >
    >
    >
     
    Steve Dunn, Dec 29, 2003
    #3
  4. "Steve Dunn" <> wrote in message
    news:sjTHb.12510$...
    > Hi Ragnar,
    > Thanks. I'm not using perl just the regular expression (in .NET). It's

    well, i do not know if many here are familiar with it.
    are you processing the file line by line?

    > not XML (nor HTML), but some half-baked attempt at mark-up that was thought
    > of shortly after the dinosaurs became extinct! There are no nested tags
    > within the text, but empty tags must be ignored (in the example below,
    > <DOCUMENT> is an empty tag).


    in your example there were no end tags (</xxx>), so I am not sure of the file
    format.

    if you can collect the file into one string without linebreaks, you probably
    can do a match with
    <([^/].+?[^/])>([^<]+)

    gnari

    P.S.:
    in this newsgroup, it is considered bad form to top-post, i.e. to
    put a reply/followup at the top of the message and quote the whole thread
    below. it is better to quote relevant parts along with replies and comments
    (a bit like I am doing in this message).
    if the conversation develops into a thread, the top-posting becomes more and
    more irritating.
     
    Ragnar Hafstað, Dec 29, 2003
    #4
  5. Steve Dunn

    Steve Dunn Guest

    Hi Gnari,

    "Ragnar Hafstað" <> wrote in message
    news:bsp41o$vrp$...
    > "Steve Dunn" <> wrote in message
    > news:sjTHb.12510$...
    > > Hi Ragnar,
    > > Thanks. I'm not using perl just the regular expression (in .NET). It's
    >
    > well, i do not know if many here are familiar with it.
    > are you processing the file line by line?

    I am processing the text as one whole string. I've implemented a
    work-around that 'slurps' line by line, although I'm not happy with it.

    >
    > > not XML (nor HTML), but some half-baked attempt at mark-up that was thought
    > > of shortly after the dinosaurs became extinct! There are no nested tags
    > > within the text, but empty tags must be ignored (in the example below,
    > > <DOCUMENT> is an empty tag).

    >
    > in your example there was no end tags (</xxx>), so I am not sure of the file
    > format.

    End tags for these elements do not exist in this mark-up (I haven't got a
    clue as to why not, but as I said, it was designed before the wheel!)
    >
    > if you can collect the file into one string without linebreaks, you probably
    > can do a
    > match with
    > <([^/].+?[^/])>([^<]+)

    Thanks for this. It works great, although it doesn't take into account the
    '<' being on a new line. It is returning the desired results, but will break
    if there are any '<' characters in the text (and this 'mark-up' has no
    escaping!)
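
    One possible way around the unescaped '<' problem, sketched in Python under
    the assumption that a real tag always begins at the start of a line: the
    value is allowed to contain '<' and only stops at a newline that is followed
    by '<'. The sample text (including the '<' inside the DESCRIPTION value), the
    PATTERN name and the parse helper are hypothetical. A value line that itself
    begins with '<' would still be taken for a tag.

```python
import re

# Hypothetical sample: the DESCRIPTION value contains a literal '<'.
SAMPLE = (
    "<DOCUMENT>\n"
    "<TYPE>EX-5\n"
    "<DESCRIPTION>NET INCOME < 5%\n"
    "OF REVENUE\n"
    "<TEXT>\n"
)

# An opening tag at the start of a line, followed by a value that runs until
# a newline immediately followed by '<', or until the end of the string.
PATTERN = re.compile(
    r"^<([^/<>][^<>]*)>((?:(?!\n<).)+)",
    re.MULTILINE | re.DOTALL,
)


def parse(text):
    """Return (tag, value) pairs; empty tags are ignored."""
    pairs = []
    for match in PATTERN.finditer(text):
        # Fold embedded line breaks into single spaces.
        value = " ".join(match.group(2).split())
        if value:
            pairs.append((match.group(1), value))
    return pairs


if __name__ == "__main__":
    for tag, value in parse(SAMPLE):
        print(tag, "=>", value)
```

    In .NET the same idea would presumably rely on the Multiline and Singleline
    regex options, but that translation is untested here.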
    >
    > gnari

    Steve.
    >
    > P.S.:
    > in this newsgroup, it is considered bad form to top-post, i.e. to
    > put a reply/followup at the top of the message, and quote the whole thread
    > below, it is better to quote relevant parts along with replies and comments
    > (a bit like I am doing in this message)
    > if the conversation develops into a thread, the top-posting becomes more and
    > more irritating.

    Message understood. Many thanks for pointing this out and many many thanks
    for your help!
    >
    >
    >
     
    Steve Dunn, Dec 29, 2003
    #5
  6. "Steve Dunn" <> wrote in message
    news:YKVHb.12565$...
    > Hi Gnari,
    >
    > "Ragnar Hafstað" <> wrote in message
    > news:bsp41o$vrp$...
    > > if you can collect the file into one string without linebreaks, you probably
    > > can do a
    > > match with
    > > <([^/].+?[^/])>([^<]+)

    > Thanks for this. It works great although doesn't take into account the '<'
    > being on a new-line. It is returning the desired results, but will break if
    > there's any '<' characters in the text (and this 'mark-up' has no
    > escaping(!))


    ok. if you collect the string *with* linefeeds, you should be able to match
    with
    \n<([^/].+?[^/])>([^<]+)
    then you will have to deal with linefeeds in the capture
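
    A rough Python sketch of this step, including one way to deal with the
    linefeeds in the capture. It keeps the tag and value parts of the pattern as
    posted but swaps the literal leading \n for a zero-width ^ under
    re.MULTILINE: because [^<]+ also swallows the newline in front of the next
    tag, a literal \n would find nothing left to match there when the search is
    repeated. SAMPLE and tagged_values are names assumed for this sketch only.

```python
import re

# Sample text modeled on the snippet in the question, kept *with* linefeeds.
SAMPLE = (
    "<DOCUMENT>\n"
    "<TYPE>EX-5\n"
    "<SEQUENCE>3\n"
    "<DESCRIPTION>OPINION OF\n"
    "BRADLEY ARANT, ET AL.\n"
    "<TEXT>\n"
)

# '^' with MULTILINE anchors each tag to the start of a line without
# consuming the newline that precedes it.
PATTERN = re.compile(r"^<([^/].+?[^/])>([^<]+)", re.MULTILINE)


def tagged_values(text):
    """Return (tag, value) pairs with linefeeds in the value normalized."""
    result = []
    for tag, raw in PATTERN.findall(text):
        # The raw capture still holds the embedded and trailing linefeeds;
        # split/join collapses them into single spaces.
        value = " ".join(raw.split())
        # An empty tag such as <DOCUMENT> captures only a newline: skip it.
        if value:
            result.append((tag, value))
    return result


if __name__ == "__main__":
    for tag, value in tagged_values(SAMPLE):
        print(tag, "=>", value)
```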

    by the way, why are you testing for </xxx> and <xxx/> tags?
    i thought you said there were none.

    > Message understood. Many thanks for pointing this out and many many thanks
    > for your help!


    you are welcome

    gnari
     
    Ragnar Hafstað, Dec 29, 2003
    #6
  7. Steve Dunn

    Steve Dunn Guest

    "Ragnar Hafstað" <> wrote in message
    news:bspo66$3nl$...
    > "Steve Dunn" <> wrote in message
    > news:YKVHb.12565$...
    > > Hi Gnari,
    > >
    > > "Ragnar Hafstað" <> wrote in message
    > > news:bsp41o$vrp$...
    > > > if you can collect the file into one string without linebreaks, you
    > > > probably can do a
    > > > match with
    > > > <([^/].+?[^/])>([^<]+)

    > > Thanks for this. It works great although doesn't take into account the '<'
    > > being on a new-line. It is returning the desired results, but will break if
    > > there's any '<' characters in the text (and this 'mark-up' has no
    > > escaping(!))

    >
    > ok. if you collect the string *with* linefeeds, you should be able to match
    > with
    > \n<([^/].+?[^/])>([^<]+)
    > then you will have to deal with linefeeds in the capture


    Many thanks Gnari. I think we're almost there.

    >
    > by the way, why are you testing for </xxx> and <xxx/> tags?
    > i thought you said there were none.
    >

    There aren't any in the snippet that I'm parsing, but the regex is also
    used on larger pieces of text that might contain closing tags.

    > > Message understood. Many thanks for pointing this out and many many thanks
    > > for your help!

    >
    > you are welcome
    >
    > gnari
    >

    Steve.
    p.s. Happy New Year!
    >
    >
     
    Steve Dunn, Dec 30, 2003
    #7
  8. Matt Garrish

    Matt Garrish Guest

    "Steve Dunn" <> wrote in message
    news:UnSHb.12489$...
    >
    > I now need to modify the expression to take into account multi-line
    > content.
    > To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
    > but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
    > AL."
    >
    >


    You're probably better off "unbust"ing the file first (never checked if
    that's actually a technical term, but it is the name of a script we have
    where I work). Essentially, you'd just have to write a script to remove
    newlines from the file unless the line begins with a top-level tag. You
    could then read the file line-by-line with a simple expression like:

    m#^<([^>]*)>(.*?)(</\1>)?$#i

    to grab all the data you need. The usefulness, however, will vary depending
    on what you are trying to capture and how it is formatted.
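
    The preprocessing idea can be sketched as follows, in Python for
    illustration only. It assumes that any line starting with '<' begins a new
    tag and that every other non-blank line continues the previous one; the
    unbust and parse helpers and the SAMPLE string are made up for this sketch.
    The per-line pattern uses a lazy (.*?) and an end anchor so that an optional
    closing tag stays out of the captured value.

```python
import re

# Sample text modeled on the snippet in the question.
SAMPLE = """<DOCUMENT>
<TYPE>EX-5
<SEQUENCE>3
<DESCRIPTION>OPINION OF
BRADLEY ARANT, ET AL.
<TEXT>
"""


def unbust(text):
    """Join continuation lines onto the preceding line that starts with '<'."""
    lines = []
    for line in text.splitlines():
        if line.startswith("<") or not lines:
            lines.append(line.rstrip())
        elif line.strip():
            # Not a tag line: remove the newline by appending to the previous line.
            lines[-1] += " " + line.strip()
    return lines


# One tag per line: tag name, value, and an optional matching closing tag.
TAG_LINE = re.compile(r"^<([^>]*)>(.*?)(?:</\1>)?$", re.IGNORECASE)


def parse(lines):
    """Return (tag, value) pairs for lines that carry a non-empty value."""
    pairs = []
    for line in lines:
        match = TAG_LINE.match(line)
        if match and match.group(2):
            pairs.append((match.group(1), match.group(2)))
    return pairs


if __name__ == "__main__":
    for tag, value in parse(unbust(SAMPLE)):
        print(tag, "=>", value)
```

    Splitting the work this way keeps the per-line expression simple, at the
    cost of a separate pass over the file.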

    Matt
     
    Matt Garrish, Dec 30, 2003
    #8
  9. Matt Garrish <> wrote:

    > You're probably better off "unbust"ing the file first (never checked if
    > that's actually a technical term, but it is the name of a script we have
    > where I work).



    I call my unbusters "preprocessor"s when in polite company,
    otherwise they're "defoo"s. :)


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Dec 30, 2003
    #9
