The best way to parse an html file?

Guest

Hi,

I have an HTML file that I want to parse with ASP.NET to retrieve the
value of a custom tag. Let's say the average HTML file is about 30 KB.
Once the file is loaded and converted into a single string, I currently
use two string.IndexOf calls to find the beginning and the end of the
desired tag, then string.Substring to extract the data. I'm not using
regular expressions since I know exactly which tags to find.
My function goes like this:

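private string ParseHtml(string html)
{
    // Drop Windows line breaks so the result is a single line
    html = html.Replace("\r\n", "");
    string str = null;
    int begin = html.IndexOf("%%StartGetHtml%%");
    // IndexOf returns -1 when nothing is found; searching from -1 would throw
    int end = begin >= 0 ? html.IndexOf("%%EndGetHtml%%", begin) : -1;
    int begin2, end2;
    // A marker at index 0 is a valid match, so test begin with >= 0
    if (begin >= 0 && end > 0)
    {
        // Gets the beginning of the tag
        begin2 = html.IndexOf("<", begin);
        // Gets the end of the tag
        end2 = html.IndexOf(">", end - 3);
        if (begin2 >= 0 && begin2 < end2 && end2 < end)
        {
            // Gets the tag
            str = html.Substring(begin2, end - begin2);
        }
    }
    return str;
}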

Is this the fastest way, or is there a better way to do this?

Thanks

Stephane
 
Martin Honnen

Stephane wrote:

Is this the fastest way, or is there a better way to do this?

If those string processing attempts suffice for you, then use them. In
general, though, if you want to parse HTML you might want to check
SGMLReader, see
http://www.gotdotnet.com/community/usersamples/Default.aspx?query=sgmlreader
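
If the string approach is enough, here is a minimal sketch of it, not a definitive implementation. The class name MarkerExtractor and the method ExtractBetween are hypothetical names made up for this example. It also assumes you want everything between the two markers (trimmed), rather than searching for the first "<" the way the function above does. It uses StringComparison.Ordinal, which compares raw character values and avoids culture-sensitive matching.

```csharp
using System;

public static class MarkerExtractor
{
    // Hypothetical helper: returns the trimmed text between the two markers,
    // or null when the input is null or either marker is missing.
    public static string ExtractBetween(string html, string startMarker, string endMarker)
    {
        if (html == null) return null;

        // Ordinal comparison matches raw characters, with no culture rules.
        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return null; // -1 means "not found"; 0 is a valid hit

        // Begin looking for the end marker right after the start marker.
        int contentStart = start + startMarker.Length;
        int end = html.IndexOf(endMarker, contentStart, StringComparison.Ordinal);
        if (end < 0) return null;

        return html.Substring(contentStart, end - contentStart).Trim();
    }
}
```

You would call it as MarkerExtractor.ExtractBetween(html, "%%StartGetHtml%%", "%%EndGetHtml%%"). Only a negative index is treated as "not found", so a marker at the very start of the string still counts.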
 
