Negative lookahead in Regexp question

Axel Etzold

Dear all,

I have strings like this one:

"<NC> In North Carolina </NC>"

and I'd like to match the part in between the brackets with
a regexp with negative lookahead excluding a substring,
(</NC> in this case, rather than just single characters),
but I can't get it right...


Thanks for your help,

Axel
 
byronsalty

Can you provide a few examples of what you're looking for? I'm not
sure what you're asking but it doesn't sound too bad.

- Byron
 
Axel Etzold

Dear Byron,

thank you for responding.
I am working on an analysis of so-called chunked text,
i.e., an analysis of words in a sentence, that
classifies words as nouns / verbs / adjectives etc.

A typical sentence with chunking tags thus looks like this:

"<NC> The physical descriptions </NC> <PC> of <NC> places </NC> <PC> in <NC> North Carolina </NC> </PC> , <PC> in </PC> <ADV> so far </ADV> as <NC> they </NC> <VC> are </VC> <NC> specific </NC> <PC> at <NC> all </NC> </PC> , <VC> owe </VC> <NC> a little </NC> <PC> to <NC> memories </NC> </PC> <PC> of <NC> my childhood </NC>, although <NC> I </NC> <VC> 've also borrowed </VC> <ADV> indiscriminately </ADV> <PC> from <NC> other people 's childhood memories </NC> </PC> <PC> as </PC> <ADV> well </ADV> ."

Originally, I wanted to use Regexps to split the original sentence
into groups using negative lookahead, which I've now skipped in favor
of repeated calls to split, but I think I could still use knowing how to
search for a substring using negative lookahead, i.e., as in my example:

regexp=/.../ <= searched for, such that:
string="<NC> In North Carolina </NC>"
ref=regexp.match(string)
p ref[1] => "In North Carolina"

Thank you for any help!

Best regards,

Axel
 
Axel Etzold

Aureliano,

no - since the tags are not XML tags, and since
I wanted to know about negative lookahead
for regexps ...

Best regards,

Axel
 
byronsalty

regexp=/.../ <= searched for, such that:
string="<NC> In North Carolina </NC>"
ref=regexp.match(string)
p ref[1] => "In North Carolina"

This will work pretty well (works for the above):
/<\w+>(.*?)<\/\w+>/

The only thing fancy there is making the .* non-greedy by writing it
as .*? instead. This means it will take the shortest possible match
instead of the longest.
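
To make the difference concrete, here is a small sketch comparing the greedy and non-greedy forms on a fragment of the chunked sentence from earlier in the thread. The variable names are only for illustration.

```ruby
# A fragment of the chunked sentence quoted earlier in the thread.
str = "<NC> places </NC> <PC> in <NC> North Carolina </NC> </PC>"

greedy = /<\w+>(.*)<\/\w+>/    # .* grabs as much as it can
lazy   = /<\w+>(.*?)<\/\w+>/   # .*? stops at the first closing tag

greedy_match = greedy.match(str)[1]
lazy_match   = lazy.match(str)[1]

p greedy_match  # runs all the way to the last closing tag in the string
p lazy_match    # => " places "
```

The greedy version swallows every intermediate tag, because it only gives characters back until the last closing tag fits.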

But it will not work as I think you would want with a string of nested
clauses. If you want to include internal clauses then you would need
to make sure that the close tag matches the open tag. The side effect
is that you'll need another capture group within the regex for the tag
name, so the content moves from group 1 to group 2.

So consider:
/<(\w+)>(.*?)<\/\1>/

Example:
irb(main):033:0> str = "<NC>In North Carolina <FOO>adsf</FOO> </NC>"
=> "<NC>In North Carolina <FOO>adsf</FOO> </NC>"
irb(main):034:0> re = /<(\w+)>(.*?)<\/\1>/
=> /<(\w+)>(.*?)<\/\1>/
irb(main):035:0> re.match(str)[1]
=> "NC"
irb(main):036:0> re.match(str)[2]
=> "In North Carolina <FOO>adsf</FOO> "
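
Since the original question was specifically about negative lookahead, here is a minimal sketch of that approach as well. It assumes the tag name (NC) is known in advance and that NC chunks do not contain other NC chunks. The idea is to consume one character at a time, but only at positions where the closing tag does not start.

```ruby
# (?!<\/NC>) is a negative lookahead: it succeeds only at positions where
# the upcoming text is NOT "</NC>". Wrapping it with "." in a non-capturing
# group and repeating it consumes characters up to the closing tag.
re = /<NC>((?:(?!<\/NC>).)*)<\/NC>/

str = "<NC> In North Carolina </NC>"
inner = re.match(str)[1].strip
p inner   # => "In North Carolina"

# With scan, the same pattern pulls out every NC chunk in a longer string.
sentence = "<NC> places </NC> <PC> in <NC> North Carolina </NC> </PC>"
chunks = sentence.scan(re).flatten.map(&:strip)
p chunks  # => ["places", "North Carolina"]
```

For this input the lookahead version and the non-greedy version find the same text. The lookahead form is mainly useful when the excluded substring has to be ruled out inside the repeated part itself.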


Does that help?
 
