reg exp

Hai anh Le

I have a problem with a regexp. I have a document like this:

A method for detecting a post-translationally modified protein with a
glycosyl group comprising contacting the protein with a glycosyl
transferase enzyme and a labeling agent, wherein the labeling agent
comprises a chemical handle and a transferable glycosyl group.

I want to split it into substrings, each one starting with "a", "an" or
"the", like this:

"A method for detecting "
"a post-translationally modified protein with "
"a glycosyl group comprising contacting "
"the protein with "
"a glycosyl transferase enzyme and "
"a labeling agent, wherein "
"the labeling agent comprises "
"a chemical handle and "
"a transferable glycosyl group."


I use this code:

while element.size do
  if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
  then
    temp_string = $1 +$2
    temp_array << temp_string
  else
    break
  end
  element.slice!(temp_string)
end


But it does not work. The result is:

A method
a post-translationally
a glycosyl
the protein
a glycosyl
a labeling
the labeling
a chemical
a transferable

Can anyone help me with this code?

Hai Anh
 
Peña, Botp

From: Hai anh Le
# while element.size do
#     if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
# then
#       temp_string = $1 +$2
#       temp_array << temp_string
#     else
#       break
#     end
#     element.slice!(temp_string)
# end

if you like to walk thru the string manually, then stringscanner works best.

otoh, you can also try string#split

but my initial reaction was just to find those articles then mark them w
newlines (maybe because i'm fond of programming w paper and pencil :)

eg,

>puts s.gsub(/([Aa]|[Aa]n|[Tt]he)\W+/){"\n"+$1+"\s"}

A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.
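
One way to walk the string manually is StringScanner, from Ruby's standard strscan library. What follows is only a rough sketch: the variable names (text, article, parts) and the exact patterns are assumptions made for illustration, not code from this thread. Each pass takes the article at the current position, if there is one, and then everything up to the next article or the end of the string.

```ruby
require 'strscan'

text = "A method for detecting a post-translationally modified protein with a " \
       "glycosyl group comprising contacting the protein with a glycosyl " \
       "transferase enzyme and a labeling agent, wherein the labeling agent " \
       "comprises a chemical handle and a transferable glycosyl group."

# Hypothetical pattern: a whole-word "a", "an" or "the", in any case.
article = /\b(?:an?|the)\b/i

scanner = StringScanner.new(text)
parts = []

until scanner.eos?
  # The article sitting at the scan pointer, or "" if the text starts without one.
  head = scanner.scan(article).to_s
  # Lazily take everything up to (but not including) the next article or the end.
  body = scanner.scan(/.*?(?=\b(?:an?|the)\b|\z)/mi)
  parts << head + body
end

puts parts
```

The trailing space of each piece is kept, so parts.join gives back the original text.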
 
Hai anh Le

puts s.gsub(/([Aa]|[Aa]n|[Tt]he)\W+/){"\n"+$1+"\s"}

A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.

Thanks for your help, this is another way to do it. I think I can use
"\n" as a marker and then split the string on it.
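
A minimal sketch of that idea, assuming the text is in a variable s as in the gsub line above. The \b anchors and the reject step are additions made here (they are not part of the original one-liner): \b keeps the pattern from matching inside longer words, and reject drops the empty entry caused by the newline placed before the leading "A".

```ruby
s = "A method for detecting a post-translationally modified protein with a " \
    "glycosyl group comprising contacting the protein with a glycosyl " \
    "transferase enzyme and a labeling agent, wherein the labeling agent " \
    "comprises a chemical handle and a transferable glycosyl group."

# Put a newline marker in front of every whole-word article.
marked = s.gsub(/\b(an?|the)\b/i) { "\n" + $1 }

# Split on the marker and drop the empty first entry.
parts = marked.split("\n").reject(&:empty?)

puts parts
```

This only works if the text itself contains no newlines; otherwise some other marker character would be needed.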
 
Jesús Gabriel y Galán

I thought of split, but then you get one array entry for the "separator"
and another for the part that follows it, so you need to join them back
together yourself:

irb(main):008:0> a = "A method for detecting a post-translationally
modified protein with a glycosyl group comprising contacting the
protein with a glycosyl transferase enzyme and a labeling agent,
wherein the labeling agent comprises a chemical handle and a
transferable glycosyl group."
=> "A method for detecting a post-translationally modified protein
with a glycosyl group comprising contacting the protein with a
glycosyl transferase enzyme and a labeling agent, wherein the labeling
agent comprises a chemical handle and a transferable glycosyl group."
irb(main):011:0> require 'enumerator'
=> true
irb(main):012:0> result = []
=> []
irb(main):015:0> a.split(/\b(a|an|the)\b/i)[1..-1].each_slice(2) {|a,
b| result << (a+b)}
=> nil
irb(main):016:0> result
=> ["A method for detecting ", "a post-translationally modified
protein with ", "a glycosyl group comprising contacting ", "the
protein with ", "a glycosyl transferase enzyme and ", "a labeling
agent, wherein ", "the labeling agent comprises ", "a chemical handle
and ", "a transferable glycosyl group."]

Maybe someone can come up with a split solution that is easier to join?

Hope this helps,

Jesus.
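
On a newer Ruby, each_slice without a block returns an enumerator and require 'enumerator' is no longer needed, so the pairs can be joined in one chain. A sketch of the same split-and-pair idea, assuming the text is in a variable called text (a name chosen here so that the block parameters do not shadow a):

```ruby
text = "A method for detecting a post-translationally modified protein with a " \
       "glycosyl group comprising contacting the protein with a glycosyl " \
       "transferase enzyme and a labeling agent, wherein the labeling agent " \
       "comprises a chemical handle and a transferable glycosyl group."

# The capture group makes split keep the separators in the result:
# ["", "A", " method for detecting ", "a", " post-translationally ...", ...]
pieces = text.split(/\b(a|an|the)\b/i)

# Skip the empty string before the first article, then glue each
# article to the text that follows it.
result = pieces.drop(1).each_slice(2).map(&:join)

puts result
```

drop(1) plays the role of [1..-1] in the irb session, and map(&:join) replaces the explicit result << (a+b).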
 
Jesús Gabriel y Galán

Try this:

a.split(/(?=an?\b|the\b)/i)

Neat, thanks. I understand how lookaheads work when it comes to
matching a regex, but it's still not clear to me why split works with
the lookahead. Isn't the match of the lookahead a 0-width string?

irb(main):001:0> a = "abcxxxabcxxxabcxxx"
=> "abcxxxabcxxxabcxxx"
irb(main):006:0> a.match(/(?=abc)/)[0]
=> ""
irb(main):007:0> a =~ /(?=abc)/
=> 0

Then, how does split know where to start the next search?

Thanks,

Jesus.
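
As far as I can tell, split does not need the match to consume anything: after an empty match it resumes the search one character further on, and it cuts the string at each position where the pattern matched. Since the match is empty, nothing is removed at the cut. The loop below is only an illustration of that behaviour written with String#match and a start offset; it is not how split is implemented internally, and the positions variable is a name made up for this sketch.

```ruby
a = "abcxxxabcxxxabcxxx"

# Find every position where the zero-width lookahead matches.
positions = []
pos = 0
while (m = a.match(/(?=abc)/, pos))
  positions << m.begin(0)
  pos = m.begin(0) + 1 # the match is empty, so step forward by hand
end

p positions            # => [0, 6, 12]

# split cuts at those positions; the empty match at 0 gives no leading "".
p a.split(/(?=abc)/)   # => ["abcxxx", "abcxxx", "abcxxx"]

# Without a leading \b the suggested pattern also matches inside words:
p "banana split".split(/(?=an?\b|the\b)/i)   # => ["banan", "a split"]
```

The last line suggests that /(?=\b(?:an?|the)\b)/i may be a safer variant of the pattern for general text, although the original works on the sample sentence.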
 
