Regex Black Magic... how to stop matching at a given char?

Jon

I'm trying to translate a strange derivative of xml into valid xml. Here
is an example line:

<SUBEVENTSTATUS 1:2><OPERATIONNAME></OPERATIONNAME>gofast<OPERATIONSTATUS>stopped</OPERATIONSTATUS><TARGETOBJECTNAME>name</TARGETOBJECTNAME><TARGETOBJECTVALUE>val</TARGETOBJECTVALUE></SUBEVENTSTATUS 1:1><SUBEVENTSTATUS 2:2><......and on

REXML pukes on the <SUBEVENTSTATUS 1:2> tag... which it should. There
should be some kind of attribute declaration instead. I want to
translate it to something like this: <SUBEVENTSTATUS no="1" of="2">

I'm trying to make a regex to detect the funny tags. Here is what I have
so far:

xml_fix=/<(\S+)\s+(\d+):(\d+)>/

This is great, but it will match this:

<Request><code_set_list 1:2>

instead of just this:

<code_set_list 1:2>

...because there is no guaranteed whitespace between tags. Basically, I
need to stop matching if a ">" is found. I've never had to deal with
anything quite like this in my regex experience. Any help, or thoughts on
a better way to do things, is much appreciated!
 
Robert Klemme

I can think of several solutions:

/<([^>\s]+)\s+(\d+):(\d+)>/

Or even a two-phase approach: first grab each tag with

/<[^>]+>/

and then check each match with

/(\d+):(\d+)>\z/
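To see the difference in practice, here is a minimal Ruby sketch. The input string is just the two-tag example from the question, and the variable names are made up for illustration. It compares the original pattern with the single-pattern fix, then tries the two-phase idea using scan and select, which is only one possible way to wire the two patterns together.

```ruby
# Sample input taken from the question: two tags with no whitespace between them.
line = '<Request><code_set_list 1:2>'

greedy = /<(\S+)\s+(\d+):(\d+)>/       # original pattern: \S+ happily runs across ">"
fixed  = /<([^>\s]+)\s+(\d+):(\d+)>/   # tag name may contain neither ">" nor whitespace

greedy_match = line[greedy]
fixed_match  = line[fixed]

# Two-phase variant: pull out every tag, then keep the ones ending in N:M>.
funny_tags = line.scan(/<[^>]+>/).select { |tag| tag =~ /(\d+):(\d+)>\z/ }

puts greedy_match   # <Request><code_set_list 1:2>
puts fixed_match    # <code_set_list 1:2>
p funny_tags        # ["<code_set_list 1:2>"]
```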

HTH

robert
 
F. Senault

On 30 March at 17:34, Jon wrote:
...because there is no guaranteed whitespace between tags. Basically, I
need to stop matching if a ">" is found. I've never had to deal with
anything quite like this in my regex experience. Any help, or thoughts on
a better way to do things, is much appreciated!

I'd simply use /<[^>]+\s+(\d+):(\d+)>/ (untested, but you get my
drift)...
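A quick check of that pattern in Ruby, as a sketch only: the input is the two-tag example from the question and the variable names are illustrative. It does stop at the ">" boundary, but note that it captures only the two numbers, so the tag name would need its own group before it could be reused in a substitution.

```ruby
line = '<Request><code_set_list 1:2>'

m = line.match(/<[^>]+\s+(\d+):(\d+)>/)

puts m[0]      # <code_set_list 1:2>
p m.captures   # ["1", "2"]  (the tag name is matched but not captured)
```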

Fred
 
Jon Fi

Awesome, and thank you! But for my benefit, could you explain why that
works? I thought ^ meant start of line?
 
Rob Biedenharn

Within a character class, a leading ^ inverts the selection, so [^>]
matches any character that's NOT a '>'.
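A small Ruby sketch of the two meanings, using a made-up two-line string purely for illustration. Outside brackets ^ is the start-of-line anchor; as the first character inside brackets it negates the class; anywhere else inside brackets it is just a literal caret.

```ruby
text = "first line\nsecond>line"

# Outside brackets: ^ anchors at the start of each line.
line_starts = text.scan(/^\w+/)

# First character inside brackets: ^ negates the class.
# [^>]+ runs up to, but not including, the first ">" (newlines included).
not_gt = text[/[^>]+/]

# Not first inside brackets: ^ is an ordinary character.
literal_caret = 'a^b'[/[>^]/]

p line_starts     # ["first", "second"]
p not_gt          # "first line\nsecond"
p literal_caret   # "^"
```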

My solution is: .gsub(/<([^>]*?\b\s+)(\d+):(\d+)>/, '<\1no="\2" of="\3">')
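Here is that substitution run end to end, as a sketch. The input line is a shortened, made-up variant of the sample from the question, and the constant name XML_FIX is only illustrative.

```ruby
# The pattern and replacement from the post above, applied to sample input.
XML_FIX = /<([^>]*?\b\s+)(\d+):(\d+)>/

line  = '<Request><code_set_list 1:2><SUBEVENTSTATUS 2:2>'
fixed = line.gsub(XML_FIX, '<\1no="\2" of="\3">')
puts fixed
# <Request><code_set_list no="1" of="2"><SUBEVENTSTATUS no="2" of="2">

# The sample line in the question also has the suffix on a closing tag.
closing = '</SUBEVENTSTATUS 1:1>'.gsub(XML_FIX, '<\1no="\2" of="\3">')
puts closing
# </SUBEVENTSTATUS no="1" of="1">
```

One thing to watch: closing tags cannot carry attributes in XML, so if the real data does contain things like </SUBEVENTSTATUS 1:1>, those probably need a separate rule that simply drops the suffix.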

-Rob

Rob Biedenharn http://agileconsultingllc.com
 
Brian Candler

Try (\w+) instead of (\S+)
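A short Ruby sketch of that swap, with illustrative variable names. Since \w only covers letters, digits and underscore, it cannot run across ">" either. The hyphenated tag below is a hypothetical example, not something from the original data, included to show where \w+ is stricter than the [^>\s]+ pattern suggested earlier in the thread.

```ruby
word_pattern  = /<(\w+)\s+(\d+):(\d+)>/
class_pattern = /<([^>\s]+)\s+(\d+):(\d+)>/   # pattern suggested earlier in the thread

m = word_pattern.match('<Request><code_set_list 1:2>')
p m.captures   # ["code_set_list", "1", "2"]

# Hypothetical tag name containing hyphens: \w+ stops at the first "-".
hyphenated = '<code-set-list 1:2>'
p word_pattern.match(hyphenated)       # nil
p class_pattern.match(hyphenated)[1]   # "code-set-list"
```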
 
