Regular Expression assistance

Steve Dunn · Dec 29, 2003

I'm wondering if anyone can help with the following problem:

I have the following text:

<DOCUMENT>
<TYPE>EX-5
<SEQUENCE>3
<DESCRIPTION>OPINION OF
BRADLEY ARANT, ET AL.
<TEXT>
..

And I have the following (multi-line) regular expression:

^<([^/].+?[^/])>([\S ]+)

This correctly matches any line that contains "<tag>any characters" but not
"</tag>" or "<tag>". The following captures are returned from the
expression:

1 => TYPE
2 => EX-5
1 => SEQUENCE
2 => 3
1 => DESCRIPTION
2 => OPINION OF

I now need to modify the expression to take into account multi-line content.
To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
AL."

Many thanks in advance,

Steve.

Ragnar Hafstað · Dec 29, 2003

Steve Dunn said:
I'm wondering if anyone can help with the following problem:

I have the following text:

<DOCUMENT>

snipped vaguely xml-like text ...

And I have the following (multi-line) regular expression:

^<([^/].+?[^/])>([\S ]+)

first, a warning:
regular expressions will only work for simple xml-like stuff.
i hope you do not have tag nesting or attributes.

I now need to modify the expression to take into account multi-line content.
To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
AL."

a few methods come to mind:

1) if the file is small (not huge), you can slurp it in, and use something
like
m!^<([^/].+?[^/])>([^<]+)!s

2) set the input record separator to '<' and work with that

3) when you read a line not starting with '<', add it to previous item
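
As a rough illustration of method 2, here is a minimal sketch in Python (used only for illustration; the thread itself is about Perl and .NET). Perl's input record separator is approximated with `str.split("<")`, so each record starts with a tag name. The `SAMPLE` string and the `parse_records` name are assumptions made for this example, and the sample assumes the tags sit on consecutive lines.

```python
# Sample input modelled on the snippet in the question (assumed layout).
SAMPLE = """<DOCUMENT>
<TYPE>EX-5
<SEQUENCE>3
<DESCRIPTION>OPINION OF
BRADLEY ARANT, ET AL.
<TEXT>
"""

def parse_records(text):
    """Split on '<' and return (tag, content) pairs for non-empty tags."""
    items = []
    for record in text.split("<"):
        # Each record looks like 'TAG>content'; partition at the first '>'.
        tag, sep, content = record.partition(">")
        if not sep or tag.startswith("/") or tag.endswith("/"):
            continue  # no tag at all, or a </tag> / <tag/> form
        # Collapse line breaks and runs of spaces inside the content.
        content = " ".join(content.split())
        if content:  # skip empty tags such as <DOCUMENT>
            items.append((tag, content))
    return items

for tag, content in parse_records(SAMPLE):
    print(tag, "=>", content)
```

Like the regex in method 1, this splits a record wherever a '<' appears in the text itself, so it only suits input where that cannot happen.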

what have you tried?

gnari

Steve Dunn · Dec 29, 2003

Hi Ragnar,
Thanks. I'm not using Perl, just the regular expression (in .NET). It's
not XML (nor HTML), but some half-baked attempt at mark-up that was thought
of shortly after the dinosaurs became extinct! There are no nested tags
within the text, but empty tags must be ignored (in my example,
<DOCUMENT> is an empty tag). The files are very small, and 'slurping' (like
the expression!) is one possibility if I can't get the regex to work.

Thanks again,

Steve.

Ragnar Hafstað · Dec 29, 2003

Steve Dunn said:
Hi Ragnar,
Thanks. I'm not using perl just the regular expression (in .NET).

well, i do not know if many here are familiar with it.
are you processing the file line by line?

It's not XML (nor HTML), but some half-baked attempt at mark-up that was thought
of shortly after the dinosaurs became extinct! There are no nested tags
within the text, but empty tags must be ignored (in the example below,
<DOCUMENT> is an empty tag).

in your example there were no end tags (</xxx>), so I am not sure of the file
format.

if you can collect the file into one string without linebreaks, you probably
can do a
match with
<([^/].+?[^/])>([^<]+)

gnari

P.S.:
in this newsgroup, it is considered bad form to top-post, i.e. to
put a reply/followup at the top of the message and quote the whole thread
below. it is better to quote relevant parts along with replies and comments
(a bit like I am doing in this message).
if the conversation develops into a thread, the top-posting becomes more and
more irritating.

Steve Dunn · Dec 29, 2003

Hi Gnari,

Ragnar Hafstað said:
well, i do not know if many here are familiar with it.
are you processing the file line by line?

I am processing the text as one whole string. I've implemented a
work-around that 'slurps' line by line, although I'm not happy with it.

in your example there were no end tags (</xxx>), so I am not sure of the file
format.

End tags for these elements do not exist in this mark-up (I haven't got a
clue as to why not, but as I said, it was designed before the wheel !)

if you can collect the file into one string without linebreaks, you probably
can do a
match with
<([^/].+?[^/])>([^<]+)

Thanks for this. It works great although doesn't take into account the '<'
being on a new-line. It is returning the desired results, but will break if
there's any '<' characters in the text (and this 'mark-up' has no
escaping(!))

gnari

Steve.

P.S.:
in this newsgroup, it is considered bad form to top-post, i.e. to
put a reply/followup at the top of the message and quote the whole thread
below. it is better to quote relevant parts along with replies and comments
(a bit like I am doing in this message).
if the conversation develops into a thread, the top-posting becomes more and
more irritating.

Message understood. Many thanks for pointing this out and many many thanks
for your help!

Ragnar Hafstað · Dec 29, 2003

Steve Dunn said:
Hi Gnari,

Ragnar Hafstað said:

if you can collect the file into one string without linebreaks, you probably
can do a
match with
<([^/].+?[^/])>([^<]+)

Thanks for this. It works great although doesn't take into account the '<'
being on a new-line. It is returning the desired results, but will break if
there's any '<' characters in the text (and this 'mark-up' has no
escaping(!))

ok. if you collect the string *with* linefeeds, you should be able to match
with
\n<([^/].+?[^/])>([^<]+)
then you will have to deal with linefeeds in the capture

Message understood. Many thanks for pointing this out and many many thanks
for your help!

you are welcome

gnari

Steve Dunn · Dec 30, 2003

Ragnar Hafstað said:
Steve Dunn said:

Hi Gnari,

Ragnar Hafstað said:

if you can collect the file into one string without linebreaks, you probably
can do a
match with
<([^/].+?[^/])>([^<]+)

Thanks for this. It works great although doesn't take into account the '<'
being on a new-line. It is returning the desired results, but will break if
there's any '<' characters in the text (and this 'mark-up' has no
escaping(!))

ok. if you collect the string *with* linefeeds, you should be able to match
with
\n<([^/].+?[^/])>([^<]+)
then you will have to deal with linefeeds in the capture

Many thanks Gnari. I think we're almost there.

by the way, why are you testing for </xxx> and <xxx/> tags?
i thought you said there were none.

There aren't any in the snippet that I'm parsing, but the regex is also
used on larger pieces of text that might contain closing tags

you are welcome

gnari

Steve.
p.s. Happy New Year!

Matt Garrish · Dec 30, 2003

Steve Dunn said:
I now need to modify the expression to take into account multi-line content.
To give an example, the current expression matches "<DESCRIPTION>OPINION OF"
but it needs to match "<DESCRIPTION>OPINION OF 'new line' BRADLEY ARANT, ET
AL."

You're probably better off "unbust"ing the file first (never checked if
that's actually a technical term, but it is the name of a script we have
where I work). Essentially, you'd just have to write a script to remove
newlines from the file unless the line begins with a top-level tag. You
could then read the file line-by-line with a simple expression like:

m#^<([^>]*)>(.*)(</\1>)?#i

to grab all the data you need. The usefulness, however, will vary depending
on what you are trying to capture and how it is formatted.
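
A rough sketch of the "unbust" step followed by line-by-line matching, in Python for illustration only. It assumes every line starting with '<' counts as a top-level tag, which may not hold for real files. The line pattern is a variant of the one above, not the same expression: it uses a lazy `(.*?)` and a `$` anchor so that an optional closing tag is left out of the content, and it rejects lines that start with a closing tag. `SAMPLE`, `unbust` and `parse_lines` are names assumed for this example.

```python
import re

# Sample input modelled on the snippet in the question (assumed layout).
SAMPLE = """<DOCUMENT>
<TYPE>EX-5
<SEQUENCE>3
<DESCRIPTION>OPINION OF
BRADLEY ARANT, ET AL.
<TEXT>
"""

def unbust(text):
    """Join each continuation line onto the preceding line that starts with '<'."""
    out = []
    for line in text.splitlines():
        if line.startswith("<") or not out:
            out.append(line.rstrip())
        elif line.strip():
            # Not a tag line: fold it into the previous line.
            out[-1] = out[-1] + " " + line.strip()
    return "\n".join(out)

# Opening tag, content, and an optional matching closing tag at end of line.
LINE = re.compile(r"^<([^>/][^>]*)>(.*?)(?:</\1>)?$", re.IGNORECASE)

def parse_lines(text):
    """Return (tag, content) pairs for lines that carry content."""
    items = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match and match.group(2).strip():
            items.append((match.group(1), match.group(2).strip()))
    return items

for tag, content in parse_lines(unbust(SAMPLE)):
    print(tag, "=>", content)
```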

Matt

Tad McClellan · Dec 30, 2003

Matt Garrish said:
You're probably better off "unbust"ing the file first (never checked if
that's actually a technical term, but it is the name of a script we have
where I work).

I call my unbusters "preprocessor"s when in polite company,
otherwise they're "defoo"s.
