reg exp

Hai anh Le · Dec 4, 2008

I have a problem with a regexp. I have a document like this:

A method for detecting a post-translationally modified protein with a
glycosyl group comprising contacting the protein with a glycosyl
transferase enzyme and a labeling agent, wherein the labeling agent
comprises a chemical handle and a transferable glycosyl group.

I want to divide it into strings, following the rule that each string
starts with "a", "an", or "the", like this:

"A method for detecting "
"a post-translationally modified protein with "
"a glycosyl group comprising contacting "
"the protein with "
"a glycosyl transferase enzyme and "
"a labeling agent, wherein "
"the labeling agent comprises "
"a chemical handle and "
"a transferable glycosyl group."

I use the code

while element.size do
  if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/ then
    temp_string = $1 + $2
    temp_array << temp_string
  else
    break
  end
  element.slice!(temp_string)
end

But it doesn't work. The result is:

A method
a post-translationally
a glycosyl
the protein
a glycosyl
a labeling
the labeling
a chemical
a transferable

Can anyone help me with the code?

Hai Anh

Peña, Botp · Dec 4, 2008

From: Hai anh Le
# while element.size do
#     if element =~ /([Aa]|[Aa]n|[Tt]he)( [^ ]+)(?:[Aa]|[Aa]n|[Tt]he)?/
# then
#       temp_string = $1 +$2
#       temp_array << temp_string
#     else
#       break
#     end
#     element.slice!(temp_string)
# end

if you like to walk thru the string manually, then stringscanner works best.

otoh, you can also try string#split

but my initial reaction was just to find those articles then mark them w newlines (maybe because i'm fond of programming w paper and pencil :)

eg,

>puts s.gsub(/([Aa]|[Aa]n|[Tt]he)\W+/){"\n"+$1+"\s"}

A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.
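For walking through the string manually, a StringScanner-based loop might look like the sketch below. It is only an illustration, not code from this thread: the shortened sample text, the names `article`, `next_article` and `chunks`, and the added `\b` word boundaries are all assumptions.

```ruby
require 'strscan'

# Shortened sample text (an assumption, to keep the example small).
text = "A method for detecting a post-translationally modified protein with a glycosyl group."

article      = /\b(?:an?|the)\b/i        # "a", "an" or "the" as a whole word
next_article = /(?=\b(?:an?|the)\b)/i    # zero-width: position just before an article

scanner = StringScanner.new(text)
chunks  = []

until scanner.eos?
  # Take the article at the current position, if there is one.
  head = scanner.scan(article).to_s
  # Then take everything up to the next article, or the rest of the text.
  body = scanner.scan_until(next_article) || scanner.scan(/.*/m)
  chunks << head + body
end

puts chunks.inspect
```

Each pass through the loop consumes one article plus the text that follows it, so the scanner always moves forward.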

Hai anh Le · Dec 4, 2008

puts s.gsub(/([Aa]|[Aa]n|[Tt]he)\W+/){"\n"+$1+"\s"}

A method for detecting
a post-translationally modified protein with
a glycosyl group comprising contacting
the protein with
a glycosyl transferase enzyme and
a labeling agent, wherein
the labeling agent comprises
a chemical handle and
a transferable glycosyl group.

Thanks for your help, this is another way. I think I can use "\n" as a
marker and then split on it.
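The marker idea could be sketched roughly as follows. This assumes the text itself contains no newlines (otherwise another marker character would be needed), and the sample text, the `\b` boundaries and the variable names are illustrative additions rather than code from the thread.

```ruby
# Shortened sample text (an assumption); it must not contain "\n" itself.
text = "A method for detecting a post-translationally modified protein with a glycosyl group."

# Put a newline marker in front of every article ...
marked = text.gsub(/\b(?:an?|the)\b/i) { |article| "\n" + article }

# ... then cut at the markers and drop the empty piece before the first one.
pieces = marked.split("\n").reject { |piece| piece.empty? }

puts pieces.inspect
```

Unlike the `\W+` version quoted above, this keeps the trailing space of each piece, because only the article itself is matched.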

Jesús Gabriel y Galán · Dec 4, 2008

I thought of split, but then you get separate array entries for the
"separator" and for the part that follows it, so you would need to paste
them back together yourself:

irb(main):008:0> a = "A method for detecting a post-translationally
modified protein with a glycosyl group comprising contacting the
protein with a glycosyl transferase enzyme and a labeling agent,
wherein the labeling agent comprises a chemical handle and a
transferable glycosyl group."
=> "A method for detecting a post-translationally modified protein
with a glycosyl group comprising contacting the protein with a
glycosyl transferase enzyme and a labeling agent, wherein the labeling
agent comprises a chemical handle and a transferable glycosyl group."
irb(main):011:0> require 'enumerator'
=> true
irb(main):012:0> result = []
=> []
irb(main):015:0> a.split(/\b(a|an|the)\b/i)[1..-1].each_slice(2) {|article, rest| result << (article + rest)}
=> nil
irb(main):016:0> result
=> ["A method for detecting ", "a post-translationally modified
protein with ", "a glycosyl group comprising contacting ", "the
protein with ", "a glycosyl transferase enzyme and ", "a labeling
agent, wherein ", "the labeling agent comprises ", "a chemical handle
and ", "a transferable glycosyl group."]

Maybe someone can come up with a split solution that is easier to join?

Hope this helps,

Jesus.
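One way to make the joining step a little shorter is to let `each_slice` return the pairs and join each pair with `map`. This is a sketch only: it needs a Ruby where `each_slice` without a block returns an enumerator (1.8.7 or later), and the shortened sample text and the names `parts`, `leading` and `result` are assumptions.

```ruby
# Shortened sample text (an assumption).
text = "A method for detecting a post-translationally modified protein with a glycosyl group."

# The capture group makes split keep the articles as separate entries:
# ["", "A", " method for detecting ", "a", " post-...", ...]
parts = text.split(/\b(an?|the)\b/i)

# Whatever comes before the first article (here an empty string).
leading = parts.shift.to_s

# Glue every article back onto the text that follows it.
result = parts.each_slice(2).map { |pair| pair.join }
result.unshift(leading) unless leading.empty?

puts result.inspect
```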

Jesús Gabriel y Galán · Dec 4, 2008

Try this:

a.split(/(?=an?\b|the\b)/i)

Neat, thanks. I understand how lookaheads work when it comes to
matching a regex,
but it's still not clear to me why split works with the lookahead.
Isn't the match of the lookahead a 0-width string?

irb(main):001:0> a = "abcxxxabcxxxabcxxx"
=> "abcxxxabcxxxabcxxx"
irb(main):006:0> a.match(/(?=abc)/)[0]
=> ""
irb(main):007:0> a =~ /(?=abc)/
=> 0

Then how does split know where to start the next search?

Thanks,

Jesus.
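A small experiment may help with the zero-width question. As far as I can tell, split cuts the string at each match position, and after an empty match it resumes the search one character further along, so it neither loops forever nor yields an empty leading piece. The sketch below collects the lookahead positions by hand and cuts there; `positions`, `cuts` and `manual` are names made up for this illustration.

```ruby
s = "abcxxxabcxxxabcxxx"
pattern = /(?=abc)/

# Every index where the zero-width lookahead matches.
positions = []
s.scan(pattern) { positions << Regexp.last_match.begin(0) }

# Cut the string at those indexes; a cut at index 0 would only
# produce an empty first piece, so it is skipped.
cuts = positions.reject { |pos| pos.zero? } + [s.length]
manual = []
start = 0
cuts.each do |cut|
  manual << s[start...cut]
  start = cut
end

puts positions.inspect         # [0, 6, 12]
puts manual.inspect            # ["abcxxx", "abcxxx", "abcxxx"]
puts s.split(pattern).inspect  # ["abcxxx", "abcxxx", "abcxxx"]
```

The manual cuts and the result of `split` agree, which is consistent with split using the match positions as cut points even though each match is an empty string.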
