Cutting a piece of text

Zdebel

Hello!
I've started to learn Ruby and I'm amazed by it. Now I have a problem
that I can't solve. If I have a string like this:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>" how can I
cut the " artist=XXX album=XXX title=XXX" part, so it would look like:
"<lyrics> Lalalalala </lyrics>"? Could you please help me?
 
James Edward Gray II

You can do it with a regular expression like the following, but I
must stress that this isn't very robust:

"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"

Hope that helps.

James Edward Gray II
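
To make the "not very robust" caveat concrete, here is a small sketch of where the pattern from the reply above holds up and where it does not. The strip_attrs helper name and the second and third sample strings are made up for illustration; only the regular expression comes from the post.

```ruby
# Hypothetical helper wrapping the substitution from the reply above.
def strip_attrs(text)
  text.gsub(/<(\w+)[^>]+>/, "<\\1>")
end

# The posted example: the attributes are dropped, and the closing tag is
# left alone because "/" is not a word character.
puts strip_attrs("<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>")
# prints: <lyrics> Lalalalala </lyrics>

# A tag with no attributes: [^>]+ needs at least one character, so the
# engine backtracks and takes the last letter of the tag name.
puts strip_attrs("<lyrics> Lalalalala </lyrics>")
# prints: <lyric> Lalalalala </lyrics>

# A ">" inside an attribute value ends the match too early.
puts strip_attrs('<lyrics title="a > b"> Lalalalala </lyrics>')
# prints: <lyrics> b"> Lalalalala </lyrics>
```

Changing [^>]+ to [^>]* would avoid the second problem. The third one is the kind of case where a real parser is the safer choice.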
 
David Vallner

On Sunday, 12 February 2006 at 17:18, Zdebel wrote:
Hello!
I've started to learn Ruby and I'm amazed by it. Now I have a problem
that I can't solve. If I have a string like this:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>" how can I
cut the " artist=XXX album=XXX title=XXX" part, so it would look like:
"<lyrics> Lalalalala </lyrics>"? Could you please help me?

The very geeky, and most probably least error-prone way would be whacking the
string with a DOM parser, clearing the attributes, and then printing it out
again. Unfortunately, I haven't been doing any DOM manipulation in Ruby, so I
can't provide code.

David Vallner
 
James Edward Gray II

The following is how you do it for valid XML, but the posted example
wasn't quite valid:

#!/usr/local/bin/ruby -w

require "rexml/document"

doc = "<lyrics artist='XXX' album='XXX' title='XXX'> Lalalalala </lyrics>"
xml = REXML::Document.new(doc)
xml.root.attributes.clear
xml.write
puts

__END__
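
As a rough illustration of the "valid XML" caveat, the sketch below wraps the same REXML calls in a helper and rescues REXML::ParseException, which is what I would expect REXML to raise for the unquoted attributes in the original post. The strip_attributes name is made up, and the sketch assumes the rexml library is installed (in recent Ruby versions it ships as a bundled gem).

```ruby
require "rexml/document"

# Hypothetical helper: returns the document with the root element's
# attributes removed, or nil if REXML cannot parse the input.
def strip_attributes(source)
  xml = REXML::Document.new(source)
  xml.root.attributes.clear
  xml.to_s
rescue REXML::ParseException
  nil
end

valid   = "<lyrics artist='XXX' album='XXX' title='XXX'> Lalalalala </lyrics>"
invalid = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

p strip_attributes(valid)    # quoted attribute values: parses and is cleaned
p strip_attributes(invalid)  # unquoted attribute values: expected to be rejected
```

Falling back to nil keeps the caller in charge of what to do with malformed input, for example trying a regular expression instead.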

James Edward Gray II
 
samuel.murphy

Learn regular expressions. Here's a not-so-great example:

a = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"
b = a.gsub(/\w*=\w*/ , "")
c = b.gsub(/\s/, "")
print c, "\n"

<lyrics>Lalalalala</lyrics>


A slightly (yes very slightly) more realistic example:

a = '<lyrics artist="Prince" album="purplerain" title="computerblue"> Lalalalala </lyrics>'
b = a.gsub(/\w*="\w*"/ , "")
c = b.gsub(/\s/, "")
print c, "\n"


<lyrics>Lalalalala</lyrics>


And what if there are spaces in an attribute value:

a = '<lyrics artist="Prince" album="purplerain" title="Computer Blue"> Lalalalala </lyrics>'
b = a.gsub(/\w*=".*"/ , "")
c = b.gsub(/\s/, "")
print c, "\n"
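
One possible refinement of the examples above, offered only as a sketch: match each quoted value with [^"]* rather than .*, and remove the whitespace in front of each attribute instead of stripping all whitespace (which would also delete spaces inside the lyrics). The second sample string, with two tags on one line, is made up to show the difference.

```ruby
a = '<lyrics artist="Prince" album="purplerain" title="Computer Blue"> Lalalalala </lyrics>'

# [^"]* stops at the closing quote of each value, so every attribute is
# matched on its own, and the leading \s+ removes the space before it.
puts a.gsub(/\s+\w+="[^"]*"/, "")
# prints: <lyrics> Lalalalala </lyrics>

# With two tags on one line, the greedy .* runs from the first quote to
# the last one and swallows everything in between.
two = '<a x="1">one</a> <b y="2">two</b>'
puts two.gsub(/\w*=".*"/, "")
# prints: <a >two</b>
puts two.gsub(/\s+\w+="[^"]*"/, "")
# prints: <a>one</a> <b>two</b>
```

This still treats the markup as plain text, so anything in the body that happens to look like name="value" would be removed as well.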
 
James Edward Gray II

I wish I knew how this (/<(\w+)[^>]+>/, "<\\1>")
regular expression works :).

It reads:

/ <        # find a < character
  (        # capture this next part into $1 (\\1 in the replacement string)
    \w+    # followed by one or more word characters
  )        # end capture
  [^>]+    # followed by one or more non > characters
  >        # and finally a > character
/x


The replacement just restores the <\w+> and leaves out the [^>]+ part
(the space and attributes).
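
The breakdown above can be run as-is, since Ruby ignores whitespace and comments inside a regular expression written with the /x flag. The variable names below are just for illustration.

```ruby
# The same pattern in extended form, held in a local variable.
pattern = /
  <        # a literal <
  (        # start of capture group 1
    \w+    # one or more word characters, the tag name
  )        # end of capture group 1
  [^>]+    # one or more characters that are not >
  >        # the closing >
/x

line = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

match = pattern.match(line)
puts match[0]   # prints: <lyrics artist=XXX album=XXX title=XXX>
puts match[1]   # prints: lyrics

# "<\\1>" puts the captured tag name back between < and >.
puts line.sub(pattern, "<\\1>")
# prints: <lyrics> Lalalalala </lyrics>
```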

Hope that helps.

James Edward Gray II
 
Zdebel

A big thank you to all of you for such a response. This helped me
a lot and my script is working, but I will practice more using your
advice :)
 
Marcin Mielżyński

James said:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"

A reluctant quantifier would be a bit faster:

p "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".gsub(/<(\w+).*?>/, "<\\1>")


lopex
 
David Vallner

On Sunday, 12 February 2006 at 19:30, James Edward Gray II wrote:
James said:
"<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".sub(/<(\w+)[^>]+>/, "<\\1>")
=> "<lyrics> Lalalalala </lyrics>"

A reluctant quantifier would be a bit faster:

p "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".gsub(/<(\w+).*?>/, "<\\1>")

Are you sure?

$ ruby regexp_time.rb
Rehearsal -------------------------------------------------
/<(w+)[^>]+>/   7.210000   0.030000   7.240000 (  7.266166)
/<(w+).*?>/     7.710000   0.020000   7.730000 (  7.757304)
--------------------------------------- total: 14.970000sec

                    user     system      total        real
/<(w+)[^>]+>/   7.170000   0.030000   7.200000 (  7.227075)
/<(w+).*?>/     7.730000   0.020000   7.750000 (  7.777196)
$ cat regexp_time.rb
#!/usr/local/bin/ruby -w

require "benchmark"

tests = 1000000
data = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

Benchmark.bmbm do |x|
  x.report("/<(\w+)[^>]+>/") do
    tests.times { data.sub(/<(\w+)[^>]+>/, "<\\1>") }
  end
  x.report("/<(\w+).*?>/") do
    tests.times { data.sub(/<(\w+).*?>/, "<\\1>") }
  end
end

__END__

;)

James Edward Gray II

The nongreedy match has to "back up" and retry on every character after the
tag name, whereas James' [^>] never has to back up. In fact, even a
greedy .* would probably be faster than a nongreedy one in this case.
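
For anyone who wants to try the greedy variant as well, here is a minimal sketch that extends the benchmark above with a third pattern. The labels and the smaller iteration count are my own choices, and timings will vary by machine and Ruby version. Note that the greedy .* is not a drop-in replacement on this string: it runs through to the last > and so removes the text and closing tag too.

```ruby
require "benchmark"

data = "<lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>"

patterns = {
  "negated class [^>]+" => /<(\w+)[^>]+>/,
  "reluctant .*?"       => /<(\w+).*?>/,
  "greedy .*"           => /<(\w+).*>/,
}

# Show what each pattern produces before timing it.
patterns.each do |label, regexp|
  puts "#{label}: #{data.sub(regexp, '<\1>').inspect}"
end

# Iteration count kept small so the script finishes quickly.
iterations = 100_000
Benchmark.bm(22) do |x|
  patterns.each do |label, regexp|
    x.report(label) { iterations.times { data.sub(regexp, '<\1>') } }
  end
end
```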

Gotta love the black art that is optimizing regexps.

David Vallner
 
Marcin Mielżyński

Oops, you are right!

But as I read, greedy quantifiers do backtrack as well (though not in the
case above).

/a+aa/ =~ "aaaaa"
will backtrack two characters.

Only a possessive quantifier (in Oniguruma, for example) consumes in the truly
greedy way, so

/a++aa/ =~ "aaaaa"
won't match.
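
The two cases above can be checked directly. This is a small sketch that assumes Ruby 1.9 or later, where the built-in regexp engine treats ++ as a possessive quantifier; the atomic group in the last line is my addition for comparison.

```ruby
text = "aaaaa"

# Greedy: a+ first takes all five characters, then gives two back so
# that the trailing "aa" can match.
p(/a+aa/ =~ text)        # => 0

# Possessive: a++ keeps all five characters, leaving nothing for "aa".
p(/a++aa/ =~ text)       # => nil

# An atomic group behaves the same way as the possessive quantifier here.
p(/(?>a+)aa/ =~ text)    # => nil
```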

lopex
 
William James

p " <lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".
sub(/\s+[^<>]*(?=>)/, '' )

p " <lyrics artist=XXX album=XXX title=XXX> Lalalalala </lyrics>".
scan( /\G ( [^<]+ ) | \G ( < \S* ) [^>]* ( > ) /x ).
flatten.compact.join
 
