Regular Expression assistance

Steve Dunn

I'm wondering if anyone can help with the following problem:

I have the following text:

<DOCUMENT>
<TYPE>EX-5
<SEQUENCE>3
<DESCRIPTION>OPINION OF
BRADLEY ARANT, ET AL.
<TEXT>
..

And I have the following (multi-line) regular expression:

^<([^/].+?[^/])>([\S ]+)



This correctly matches any line that contains "<tag>any characters" but not
"</tag>" or "<tag>". The following captures are returned from the
expression:

1 => TYPE
2 => EX-5

1 => SEQUENCE
2 => 3

1 => DESCRIPTION
2 => OPINION OF



I now need to modify the expression to take into account multi-line content.
To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
AL."



Many thanks in advance,



Steve.
 
Ragnar Hafstað

Steve Dunn said:
I'm wondering if anyone can help with the following problem:

I have the following text:

<DOCUMENT>
snipped vaguely xml-like text ...
And I have the following (multi-line) regular expression:
^<([^/].+?[^/])>([\S ]+)

first, a warning:
regular expressions will only work for simple xml-like stuff.
i hope you do not have tag nesting or attributes.

I now need to modify the expression to take into account multi-line content.
To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
AL."

a few methods come to mind:

1) if the file is small (not huge) , you can slurp it in, and use something
like
m!^<([^/].+?[^/])>([^<]+)!s

2) set the input record separator to '<' and work with that

3) when you read a line not starting with '<', add it to previous item
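
Option 2 is not tied to Perl. Here is a minimal Python sketch of the same idea, assuming the whole file is already in one string and that '<' never appears inside the content (the mark-up has no escaping, so this is a real limitation). Splitting on '<' stands in for the input record separator; `SAMPLE` and `split_records` are illustrative names, not anything from the posts above.

```python
SAMPLE = (
    "<DOCUMENT>\n"
    "<TYPE>EX-5\n"
    "<SEQUENCE>3\n"
    "<DESCRIPTION>OPINION OF\n"
    "BRADLEY ARANT, ET AL.\n"
    "<TEXT>\n"
)


def split_records(text):
    """Return (tag, content) pairs, treating '<' as the record separator."""
    items = []
    for record in text.split("<"):
        # Each record looks like "TAG>content...", possibly spanning lines.
        tag, sep, content = record.partition(">")
        if not sep or not tag or "/" in tag:
            continue  # text before the first tag, or a </tag> or <tag/> form
        content = " ".join(content.split())  # fold line breaks into spaces
        if content:  # empty tags such as <DOCUMENT> are ignored
            items.append((tag, content))
    return items


for tag, content in split_records(SAMPLE):
    print(tag, "=>", content)
```

The multi-line DESCRIPTION value comes back as a single string because everything up to the next '<' belongs to the same record.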


what have you tried?

gnari
 
Steve Dunn

Hi Ragnar,
Thanks. I'm not using Perl, just the regular expression (in .NET). It's
not XML (nor HTML), but some half-baked attempt at mark-up that was thought
of shortly after the dinosaurs became extinct! There are no nested tags
within the text, but empty tags must be ignored (in my example,
<DOCUMENT> is an empty tag). The files are very small, and 'slurping' (like
the expression!) is one possibility if I can't get the regex to work.

Thanks again,

Steve.

 
Ragnar Hafstað

Steve Dunn said:
Hi Ragnar,
Thanks. I'm not using perl just the regular expression (in .NET).

well, i do not know if many here are familiar with it.
are you processing the file line by line?

It's not XML (nor HTML), but some half-baked attempt at mark-up that was thought
of shortly after the dinosaurs became extinct! There are no nested tags
within the text, but empty tags must be ignored (in the example below,
<DOCUMENT> is an empty tag).

in your example there were no end tags (</xxx>), so I am not sure of the file
format.

if you can collect the file into one string without linebreaks, you probably
can do a
match with
<([^/].+?[^/])>([^<]+)

gnari

P.S.:
in this newsgroup, it is considered bad form to top-post, i.e. to
put a reply/followup at the top of the message and quote the whole thread
below. it is better to quote relevant parts along with replies and comments
(a bit like I am doing in this message).
if the conversation develops into a thread, the top-posting becomes more and
more irritating.
 
Steve Dunn

Hi Gnari,

Ragnar Hafstað said:
well, i do not know if many here are familiar with it.
are you processing the file line by line?

I am processing the text as one whole string. I've implemented a
work-around that 'slurps' line by line, although I'm not happy with it.

in your example there were no end tags (</xxx>), so I am not sure of the file
format.

End tags for these elements do not exist in this mark-up (I haven't got a
clue as to why not, but as I said, it was designed before the wheel!)

if you can collect the file into one string without linebreaks, you probably
can do a
match with
<([^/].+?[^/])>([^<]+)

Thanks for this. It works great although doesn't take into account the '<'
being on a new-line. It is returning the desired results, but will break if
there's any '<' characters in the text (and this 'mark-up' has no
escaping(!))

Steve.

P.S.:
in this newsgroup, it is considered bad form to top-post, i.e. to
put a reply/followup at the top of the message, and quote the whole thread
below, it is better to quote relevant parts along with replies and comments
(a bit like I am doing in this message)
if the conversation develops into a thread, the top-posting becomes more and
more irritating.

Message understood. Many thanks for pointing this out and many many thanks
for your help!
 
Ragnar Hafstað

Steve Dunn said:
Hi Gnari,

Ragnar Hafstað said:
if you can collect the file into one string without linebreaks, you probably
can do a
match with
<([^/].+?[^/])>([^<]+)
Thanks for this. It works great although doesn't take into account the '<'
being on a new-line. It is returning the desired results, but will break if
there's any '<' characters in the text (and this 'mark-up' has no
escaping(!))

ok. if you collect the string *with* linefeeds, you should be able to match
with
\n<([^/].+?[^/])>([^<]+)
then you will have to deal with linefeeds in the capture
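
Here is a minimal Python sketch of the whole-string approach. It rests on assumptions that are not from the thread: line endings are plain `\n`, and a tag's content starts on the tag's own line. One caveat about a leading `\n` in the pattern: `[^<]+` also consumes the newline in front of the next tag, so when matches are taken one after another it can skip every other tag. The sketch therefore anchors with a multi-line `^` and ends the content at a newline followed by '<', which also tolerates a stray '<' in the middle of a line. `SAMPLE`, `TAG_RE` and `parse` are illustrative names.

```python
import re

SAMPLE = (
    "<DOCUMENT>\n"
    "<TYPE>EX-5\n"
    "<SEQUENCE>3\n"
    "<DESCRIPTION>OPINION OF\n"
    "BRADLEY ARANT, ET AL.\n"
    "<TEXT>\n"
)

# Group 1: tag name without any '/', so </tag> and <tag/> are skipped.
# Group 2: content that starts on the tag's own line and runs lazily
#          until a newline followed by '<', or the end of the string.
TAG_RE = re.compile(
    r"^<([^/>]+)>([^\n].*?)(?=\n<|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse(text):
    """Return (tag, content) pairs with line breaks folded into spaces."""
    items = []
    for match in TAG_RE.finditer(text):
        content = " ".join(match.group(2).split())
        if content:  # skip tags whose content is only whitespace
            items.append((match.group(1), content))
    return items


for tag, content in parse(SAMPLE):
    print(tag, "=>", content)
```

Folding the whitespace after matching is one way to deal with the linefeeds in the capture; keeping them is equally possible if the line structure matters.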

Message understood. Many thanks for pointing this out and many many thanks
for your help!

you are welcome

gnari
 
Steve Dunn

Ragnar Hafstað said:
Steve Dunn said:
Hi Gnari,

Ragnar Hafstað said:
if you can collect the file into one string without linebreaks, you probably
can do a
match with
<([^/].+?[^/])>([^<]+)
Thanks for this. It works great although doesn't take into account the '<'
being on a new-line. It is returning the desired results, but will break if
there's any '<' characters in the text (and this 'mark-up' has no
escaping(!))

ok. if you collect the string *with* linefeeds, you should be able to match
with
\n<([^/].+?[^/])>([^<]+)
then you will have to deal with linefeeds in the capture

Many thanks Gnari. I think we're almost there.

by the way, why are you testing for </xxx> and <xxx/> tags?
i thought you said there were none.

There aren't any in the snippet that I'm parsing, but the regex is also
used on larger pieces of text that might contain closing tags.

you are welcome

gnari

Steve.
p.s. Happy New Year!
 
Matt Garrish

Steve Dunn said:
I now need to modify the expression to take into account multi-line content.
To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
AL."

You're probably better off "unbust"ing the file first (never checked if
that's actually a technical term, but it is the name of a script we have
where I work). Essentially, you'd just have to write a script to remove
newlines from the file unless the line begins with a top-level tag. You
could then read the file line-by-line with a simple expression like:

m#^<([^>]*)>(.*?)(</\1>)?$#i

to grab all the data you need. The usefulness, however, will vary depending
on what you are trying to capture and how it is formatted.
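
Here is a minimal Python sketch of the "unbust, then match line by line" idea. It assumes that any line starting with '<' begins a new tag (the thread does not define "top-level tag"), and that a tag with nothing after it on its own line counts as empty. The content group is lazy and the pattern is anchored at the end of the line, so an optional closing tag lands in its own group instead of being swallowed by the content. `SAMPLE`, `unbust`, `parse` and `LINE_RE` are illustrative names.

```python
import re

SAMPLE = (
    "<DOCUMENT>\n"
    "<TYPE>EX-5\n"
    "<SEQUENCE>3\n"
    "<DESCRIPTION>OPINION OF\n"
    "BRADLEY ARANT, ET AL.\n"
    "<TEXT>\n"
)

# Group 1: tag name, group 2: content, group 3: optional matching close tag.
LINE_RE = re.compile(r"^<([^>]*)>(.*?)(</\1>)?$", re.IGNORECASE)


def unbust(text):
    """Join each continuation line onto the preceding tag line."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith("<") or not lines:
            lines.append(line)  # a new tag line starts a new record
        else:
            lines[-1] += " " + line  # continuation of the previous record
    return lines


def parse(text):
    """Return (tag, content) pairs for tag lines that carry content."""
    items = []
    for line in unbust(text):
        match = LINE_RE.match(line)
        if match and match.group(2):
            items.append((match.group(1), match.group(2)))
    return items


for tag, content in parse(SAMPLE):
    print(tag, "=>", content)
```

Keeping the preprocessing separate from the matching means the per-line expression stays simple, at the cost of a second pass over the text.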

Matt
 
Tad McClellan

Matt Garrish said:
You're probably better off "unbust"ing the file first (never checked if
that's actually a technical term, but it is the name of a script we have
where I work).


I call my unbusters "preprocessor"s when in polite company,
otherwise they're "defoo"s. :)
 