The best way to parse an html file?

G

Guest

Hi,

I have a html file file that I want to parse with ASP.NET to retreive the
value of a custom tag. Let's say that the average html file is about 30 ko.
Once the html file is loaded and converted into a single string, I'm using
for now is two string.indexOf to find the begin and the end of the desired
tag and then a string.substring to extract the data. I'm not using regular
expressions since I know exactly what are the tags to find.
My function goes like this:

private string ParseHtml(string html)
{
html = html.Replace("\r\n","");
int begin = html.IndexOf("%%StartGetHtml%%");
int end = html.IndexOf("%%EndGetHtml%%",begin);
int begin2, end2;
string str = null;
if (begin > 0 && end > 0)
{
// Gets the beginning of the tag
begin2 = html.IndexOf("<",begin);
// Gets the end of the tag
end2 = html.IndexOf(">",end-3);
if (begin2 < end2 && end2 < end)
{
// Gets the tag
str = html.Substring(begin2,end-begin2);
}
}
return str;
}

Is this the fastest way or there could be a better way to do this?

Thanks

Stephane
 
M

Martin Honnen

Stephane wrote:

I have a html file file that I want to parse with ASP.NET to retreive the
value of a custom tag. Let's say that the average html file is about 30 ko.
Once the html file is loaded and converted into a single string, I'm using
for now is two string.indexOf to find the begin and the end of the desired
tag and then a string.substring to extract the data. I'm not using regular
expressions since I know exactly what are the tags to find.
My function goes like this:

private string ParseHtml(string html)
{
html = html.Replace("\r\n","");
int begin = html.IndexOf("%%StartGetHtml%%");
int end = html.IndexOf("%%EndGetHtml%%",begin);
int begin2, end2;
string str = null;
if (begin > 0 && end > 0)
{
// Gets the beginning of the tag
begin2 = html.IndexOf("<",begin);
// Gets the end of the tag
end2 = html.IndexOf(">",end-3);
if (begin2 < end2 && end2 < end)
{
// Gets the tag
str = html.Substring(begin2,end-begin2);
}
}
return str;
}

Is this the fastest way or there could be a better way to do this?

If those string processing attempts suffice for you then use them but in
general if you want to parse HTML you might want to check SGMLReader, see
http://www.gotdotnet.com/community/usersamples/Default.aspx?query=sgmlreader
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,772
Messages
2,569,593
Members
45,111
Latest member
KetoBurn
Top