Regular expression

Arun Kumar · Mar 23, 2009

Hi,
I know that what I'm going to ask is a simple problem, but as I'm new to
Ruby I have not learnt a lot about regular expressions in Ruby.

Can anybody tell me how to extract all the content between the '<html>'
and '</html>' tags, and also how to extract the text between the '<a>'
and '</a>' tags, using regular expressions? I know it can be extracted
using the 'scan' method, but I don't know what the matching patterns
should be. Can anybody please help me?

Regards
Arun

Ryan Davis · Mar 23, 2009

Arun said:
Hi,
I know that what I'm going to ask is a simple problem, but as I'm new to
Ruby I have not learnt a lot about regular expressions in Ruby.

Can anybody tell me how to extract all the content between the '<html>'
and '</html>' tags, and also how to extract the text between the '<a>'
and '</a>' tags, using regular expressions? I know it can be extracted
using the 'scan' method, but I don't know what the matching patterns
should be. Can anybody please help me?

Regexps are about the worst thing to use in this case. Look at this
instead:

http://mechanize.rubyforge.org/files/GUIDE_txt.html

Arun Kumar · Mar 23, 2009

Ryan said:
Regexps are about the worst thing to use in this case. Look at this
instead:

http://mechanize.rubyforge.org/files/GUIDE_txt.html

I know that using mechanize or hpricot is a far better option in this
case. But I'm just asking as a matter of curiosity, to learn about regexps.

Regards
ArunKumar

7stud -- · Mar 23, 2009

Arun said:
Hi,
I know that what I'm going to ask is a simple problem, but as I'm new to
Ruby I have not learnt a lot about regular expressions in Ruby.

Can anybody tell me how to extract all the content between the '<html>'
and '</html>' tags, and also how to extract the text between the '<a>'
and '</a>' tags, using regular expressions? I know it can be extracted
using the 'scan' method, but I don't know what the matching patterns
should be. Can anybody please help me?

Regards
Arun

s = "<a>hello world</a>"
new_s = s.gsub(/<.*?>/, "")
puts new_s

--output:--
hello world

html = DATA.read()
regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)
puts html[regex, 1]

__END__
<html>
<head>
<title>html page</title>
</head>
<body>
<div>hello</div>
<div>world</div>
<div>goodbye</div>
</body>
</html>

--output:--
<head>
<title>html page</title>
</head>
<body>
<div>hello</div>
<div>world</div>
<div>goodbye</div>
</body>

In the expression:

html[regex, 1]

The 1 says to return the first parenthesized group in the regex.

7stud -- · Mar 23, 2009

7stud said:
regex = Regexp.new("<html>(.*)</html>", Regexp::MULTILINE)

...oh, yeah. Normally, a . matches any character except a newline. The
regex .* matches any character 0 or more times, but to get it to match
newlines as well, you have to specify Regexp::MULTILINE.
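
To see the difference the flag makes, here is a minimal sketch. The helper name between_html_tags and the sample string are made up for illustration; only Regexp.new, Regexp::MULTILINE and String#[] come from the posts above.

```ruby
# Hypothetical helper: returns whatever the first group captured, or nil
# when the pattern does not match at all.
def between_html_tags(text, options = 0)
  regex = Regexp.new("<html>(.*)</html>", options)
  text[regex, 1]
end

page = "<html>\n<body>hi</body>\n</html>"

# Without MULTILINE, "." stops at the first newline, so nothing matches.
p between_html_tags(page)                     # => nil

# With MULTILINE, "." also matches "\n", so the group spans all the lines.
p between_html_tags(page, Regexp::MULTILINE)  # => "\n<body>hi</body>\n"
```

Writing the pattern as a literal with the m flag, /<html>(.*)<\/html>/m, should behave the same way as passing Regexp::MULTILINE.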

7stud -- · Mar 23, 2009

7stud said:
In the expression:

html[regex, 1]

The 1 says to return the first parenthesized group in the regex.

To be a little clearer, the 1 says to return whatever matched the first
parenthesized group in the regex.
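
A small sketch of how the index argument behaves, in case it helps. The constant TITLE and the two helper methods are hypothetical names chosen for this example; the sample string is cut down from the page above.

```ruby
# Hypothetical pattern with one parenthesized group.
TITLE = /<title>(.*)<\/title>/

# Index 0 returns the entire match, tags included.
def title_element(html)
  html[TITLE, 0]
end

# Index 1 returns only what the first group matched.
def title_text(html)
  html[TITLE, 1]
end

html = "<head><title>html page</title></head>"

p title_element(html)            # => "<title>html page</title>"
p title_text(html)               # => "html page"
p title_text("<p>no title</p>")  # => nil, because the pattern did not match

# The same group is available on the MatchData object from Regexp#match.
m = TITLE.match(html)
p m[1]                           # => "html page"
```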

Yaser Sulaiman · Mar 23, 2009


Arun said:
Can anybody tell me how to extract all the content between the '<html>'
and '</html>' tags, and also how to extract the text between the '<a>'
and '</a>' tags, using regular expressions? I know it can be extracted
using the 'scan' method, but I don't know what the matching patterns
should be. Can anybody please help me?

Let's assume we have the following content:

<html>
<body>
<p>
Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.
</p>
</body>
</html>

Here are two quick and dirty regexps:

/<html>(.*)<\/html>/m
This regexp will capture anything between an opening html tag and a closing
one. The /m option turns on multiline mode, in which "." matches any
character, including a newline.
For our content, it will capture:
<body>
<p>
Want a Ruby regular expression editor? Check out <a href="http://www.rubular.com/">Rubular</a>.
</p>
</body>

/<a.*>(.*)<\/a>/
This regexp will capture the text between an opening anchor element and a
closing one. The first ".*" is there to deal with href and any other
attribute. You might wanna throw the /m option in there too.
For our content, it will capture:
Rubular
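
One thing to watch with the anchor regexp: both ".*" are greedy, so when a line holds more than one link it only captures the text of the last one. Here is a sketch of one possible refinement that uses 'scan', as the original question asked. The method name anchor_texts and the second link in the sample string are made up for illustration, and this is still a quick and dirty pattern, not an HTML parser.

```ruby
# Hypothetical helper: collects the text of every <a>...</a> pair.
#   <a\b     matches "<a" only as a whole tag name (so not "<abbr>")
#   [^>]*    skips attributes such as href, even across a line break
#   (.*?)    is non-greedy, so it stops at the nearest closing tag
def anchor_texts(html)
  html.scan(/<a\b[^>]*>(.*?)<\/a>/m).flatten
end

html = 'See <a href="http://www.rubular.com/">Rubular</a> and <a>this</a>.'

p anchor_texts(html)            # => ["Rubular", "this"]

# For comparison, the greedy pattern only finds the last link's text here.
p html[/<a.*>(.*)<\/a>/, 1]     # => "this"
```

scan returns one array per match when the pattern has groups, which is why the result is flattened. Nested tags inside a link, or a ">" inside an attribute value, would still trip this up.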

Arun said:
I know that using mechanize or hpricot is a far better option in this
case. But I'm just asking as a matter of curiosity, to learn about regexps.

Dare I say, a man should use regexps if only to satisfy his curiosity. ;-)

Regards,
Yaser

arjun ghosh · Mar 23, 2009


Check out the site http://www.rubular.com/
It is very helpful for solving regex problems.

ciao,
Arjun
http://arjunghosh.wordpress.com
twitter.com/arjunghosh
