How can I count number of elements in an HTML page

Paul

Hi there, I'm using net/http to retrieve some html pages and now I
want to count the number of items in a list on the page. The
response.body is stored as a string.

The HTML looks something like this:
----
<div class="first section">
<h1>
Section Heading I'm interested in:
</h1>
<ul>
<li>
<form action="foo" method="post" name="">
<button type="submit">foo</button>
</form>
</li>
<li>
<form action="bar" method="post" name="">
<button type="submit">bar</button>
</form>
</li>
</ul>
</div>

<div class="next section">
----

So what I want to do is count the number of li's in a particular div
section. In this case the answer is 2. It might be more, it might be
0.

I can find the section I want with a regex but I don't know how to
iterate through the string looking for particular elements. I was
thinking about taking the section I'm interested in and saving it as
an array and then iterating through each array element (html line)
that way, but I thought there might be a quicker way to do it.

suggestions?

TIA
 
Jesús Gabriel y Galán

I'd use Nokogiri. Off the top of my head, it would be something like (untested):

require 'nokogiri'

html_string = <<END
#[your html]
END

doc = Nokogiri::HTML(html_string)
puts doc.search("/div/ul/li").size

Maybe you will need to adjust the xpath search, but I think it should
be something like that.

Hope this helps,

Jesus.
 
Alex Stahl

Nokogiri allows direct access to HTML elements as data. I use it a lot
in my work. Try something like:

require 'nokogiri'
page = Nokogiri::HTML(response.body)
count = 0
# Visit each matching div and add up the list items it contains.
page.xpath("//div[@class='first section']").each do |element|
  count += element.xpath("./ul/li").size
end

Or something along those lines... (I didn't test this first).
 
Steel Steel


$html.scan(%r{<div.*first section.*</div>}m).to_s.scan(/<li>/).size
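One thing to watch with the one-liner above: `.*` is greedy, so on a page with several divs it can run on to the last `</div>` and count list items from later sections too. Here is a minimal sketch of a lazy variant using only core Ruby. The helper name `count_list_items` and the `SAMPLE_HTML` constant are made up for illustration, and it assumes the section you want contains no nested divs.

```ruby
# Hypothetical helper: counts <li> tags inside the first div whose opening
# tag mentions section_name. Core Ruby only; this is not a real HTML parser.
def count_list_items(html, section_name)
  # [^>]* keeps the name inside the opening tag; .*? is lazy, so the match
  # stops at the first </div> instead of the last one on the page.
  pattern = %r{<div[^>]*#{Regexp.escape(section_name)}.*?</div>}m
  section = html[pattern]
  section ? section.scan(/<li\b/i).size : 0
end

# Sample input modelled on the HTML in the question (assumed, shortened).
SAMPLE_HTML = <<~HTML
  <div class="first section">
    <h1>Section Heading I'm interested in:</h1>
    <ul>
      <li><form action="foo" method="post" name=""><button type="submit">foo</button></form></li>
      <li><form action="bar" method="post" name=""><button type="submit">bar</button></form></li>
    </ul>
  </div>

  <div class="next section">
    <ul>
      <li>something else</li>
    </ul>
  </div>
HTML

puts count_list_items(SAMPLE_HTML, "first section")   # prints 2
```

A section name that does not appear on the page gives 0, because `html[pattern]` returns nil when nothing matches.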
 
Jesús Gabriel y Galán

2010/10/5 Jesús Gabriel y Galán said:
I'd use Nokogiri. Off the top of my head, it would be something like (untested):

require 'nokogiri'

html_string = <<END
#[your html]
END

doc = Nokogiri::HTML(html_string)
puts doc.search("/div/ul/li").size

Maybe you will need to adjust the xpath search, but I think it should
be something like that.

I am now at my computer so I can test this. It seems that
Nokogiri::HTML yields a complete HTML document, adding <html> and
<body> tags around the fragment, so this works:

irb(main):002:0> require 'nokogiri'
=> true
irb(main):003:0> html_string = <<END
irb(main):004:0" <div class="first section">
irb(main):005:0" <h1>
irb(main):006:0" Section Heading I'm interested in:
irb(main):007:0" </h1>
irb(main):008:0" <ul>
irb(main):009:0" <li>
irb(main):010:0" <form action="foo" method="post" name="">
irb(main):011:0" <button type="submit">foo</button>
irb(main):012:0" </form>
irb(main):013:0" </li>
irb(main):014:0" <li>
irb(main):015:0" <form action="bar" method="post" name="">
irb(main):016:0" <button type="submit">bar</button>
irb(main):017:0" </form>
irb(main):018:0" </li>
irb(main):019:0" </ul>
irb(main):020:0" </div>
irb(main):021:0" END
[...snip...]
irb(main):034:0> doc.search("/html/body/div/ul/li").size
=> 2

Hope this helps,

Jesus.
 
Raveendran .P

For more details :

3. Get the Html tags

Ex.

require 'jazzez'
output = Jazzez.new
output.tagdetails("google.com")

Output:

1<html tag(s)
1</html> tag(s)
1<head tag(s)
1</head> tag(s)
1<body tag(s)
1</body> tag(s)
2<table tag(s)
2</table> tag(s)
3<tr tag(s)
3</tr> tag(s)
9<td tag(s)
9</td> tag(s)
0<th tag(s)
0</th> tag(s)
0<l tag(s)
0</l> tag(s)
0<link tag(s)
1<p tag(s)
1</p> tag(s)
4<div tag(s)
4</div> tag(s)
0<span tag(s)
0</span> tag(s)
4<script tag(s)
4</script> tag(s)
0<ul tag(s)
0</ul> tag(s)
0<ol tag(s)
0</ol> tag(s)
16<a tag(s)
15</a> tag(s)
0<h1 tag(s)
0</h1> tag(s)
0<h2 tag(s)
0</h2> tag(s)
0<h3 tag(s)
0</h3> tag(s)
0<h4 tag(s)
0</h4> tag(s)
0<h5 tag(s)
0</h5> tag(s)
0<h6 tag(s)
0</h6> tag(s)
4<font tag(s)
4</font> tag(s)
0<select tag(s)
0</select> tag(s)
0<option tag(s)
0</option> tag(s)

Thanks
Raveendran
http://raveendran.wordpress.com
 
Josh Cheek

Thanks Steel. This worked fine. I just needed to make it a lazy
search with .*?

I've got nothing against Nokogiri or the other solutions but I was
hoping for a solution like this that just uses the core libraries for
portability.

Cheers! Paul.

I would try REXML, then. It's an XML parser in the standard library.
http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/index.html

I'd reserve regex parsing of XML for very informal situations where
I just need a quick, non-rigorous solution (i.e. a one-time solution that I
plan to verify personally). I am pretty sure that it is not possible to
correctly parse XML with regex.
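
For what it's worth, a rough sketch of the REXML route could look like the following. The helper name `count_items_rexml` and the `FRAGMENT` constant are invented for the example. It assumes the markup is well-formed XML with a single root element, which the fragment in the question happens to be; real-world HTML often is not, and REXML raises a parse error in that case. On Ruby 3.0 and later REXML ships as a bundled gem rather than as part of the default standard library, so it may need to be listed in a Gemfile.

```ruby
require 'rexml/document'

# Hypothetical helper: counts <li> elements directly under a <ul> inside
# the div whose class attribute is exactly div_class.
# Assumes well-formed XML input; malformed markup raises a parse error.
def count_items_rexml(markup, div_class)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//div[@class='#{div_class}']/ul/li").size
end

# Sample input modelled on the fragment in the question (assumed).
FRAGMENT = <<~XML
  <div class="first section">
    <h1>Section Heading I'm interested in:</h1>
    <ul>
      <li><form action="foo" method="post" name=""><button type="submit">foo</button></form></li>
      <li><form action="bar" method="post" name=""><button type="submit">bar</button></form></li>
    </ul>
  </div>
XML

puts count_items_rexml(FRAGMENT, "first section")   # prints 2
```

Unlike the regex approach, the XPath query looks at element structure, so nested divs or attributes containing the text "li" do not throw the count off.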
 
