Parsing HTML

Discussion in 'ASP .Net Web Services' started by Mohammad-Reza, Feb 23, 2007.

  1. Hi
    I want to parse a web page (in a web service) and retrieve some of its
    information. I searched MSDN and found a walkthrough (How to: Create Web
    Services That Parse the Contents of a Web Page), but the walkthrough is a
    little complex and the writer did not completely describe all aspects of
    the solution.
    Could anyone elaborate on this walkthrough, or direct me to another (or
    better) way to deal with such a problem?

    Thanks in advance.
    Mohammad-Reza, Feb 23, 2007
    #1

  2. How about using the W3C Document Object Model, which was designed to do just
    what you are trying to do?
    Scott M., Feb 23, 2007
    #2

  3. I want to write a web service that extracts some information from a web page
    and use that web service in a Windows application. I think the usual approach
    to parsing (getting the HTML code and finding the keys using loops) is a
    little slow and costly. I want to know if there is any way in .NET to simply
    extract that information (for example, a method that returns every HTML tag
    of the web page with its value).
    The processing time of the web service is very important to me.

    Thanks in advance.
    Mohammad-Reza, Feb 24, 2007
    #3
  4. I don't know where you have gotten your information, but this is exactly
    what the DOM is for.
    Scott M., Feb 24, 2007
    #4
  5. Scott,

    I used this approach with a Windows Forms application back in 2001, with
    .NET 1.0. It worked, but was a bit clumsy, and it was time-consuming. I used
    the ActiveX Internet Browser control to load the page I was interested in,
    and once the page was loaded, I could access the DOM from C# code. Did you
    have a different technique in mind when you talk about the DOM?

    Perhaps a faster technique would be to use regular expressions to parse the
    HTML and find what you're looking for.

    John
    John Saunders, Feb 25, 2007
    #5
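
    The regular-expression idea suggested above can be sketched roughly as
    follows. This is only an illustration, not code from the thread: the sample
    markup, the module name and the two patterns are assumptions, and patterns
    like these tend to break as soon as the page layout changes.

    ```vbnet
    Imports System
    Imports System.Text.RegularExpressions

    Module RegexSketch
        Sub Main()
            ' Hypothetical sample markup; in practice this would be the downloaded page.
            Dim html As String = "<html><head><title>Stock Quote</title></head>" & _
                                 "<body><span id=""price"">42.50</span></body></html>"

            Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

            ' Capture the text between the opening and closing title tags.
            Dim titleMatch As Match = Regex.Match(html, "<title>(.*?)</title>", options)
            If titleMatch.Success Then
                Console.WriteLine(titleMatch.Groups(1).Value)
            End If

            ' Capture the text inside the span whose id attribute is "price".
            Dim priceMatch As Match = _
                Regex.Match(html, "<span id=""price"">(.*?)</span>", options)
            If priceMatch.Success Then
                Console.WriteLine(priceMatch.Groups(1).Value)
            End If
        End Sub
    End Module
    ```

    The lazy quantifier (.*?) stops at the first closing tag, which keeps each
    match from running on to a later element of the same kind.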
  6. What I had in mind was, if the HTML in question was well-formed (XHTML), you
    could just load it (from a string) into an XmlDocument object and use the
    XML DOM to parse from there.
    Scott M., Feb 25, 2007
    #6
  7. Can you give some sample code for loading XHTML into an XmlDocument?
    Mohammad-Reza, Feb 26, 2007
    #7
  8. Well, XHTML is XML, so you'd really be loading XML into an XmlDocument, but
    once it's loaded, you can parse out whatever you like using the DOM.

    Dim xmlDoc As New System.Xml.XmlDocument()
    'You can load the XML in one of two ways...

    'docPath represents a path to a file containing the XML
    xmlDoc.Load(docPath)

    'or
    'Here you can load a string directly (xmlString holds the XHTML markup)
    xmlDoc.LoadXml(xmlString)

    'Example of getting all the paragraph tags and then the text of the first
    'one using the DOM. XML is case-sensitive, so use the lowercase XHTML name.
    Dim theParagraphs As System.Xml.XmlNodeList = xmlDoc.GetElementsByTagName("p")
    Dim firstParagraphText As String = theParagraphs(0).InnerText


    -Scott
    Scott M., Feb 26, 2007
    #8
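
    Earlier in the thread the original poster asked for a way to get every tag
    of the page with its value. Assuming the page really is well-formed, the
    same XmlDocument approach might be extended along these lines. The sample
    markup and the module name are made up for illustration.

    ```vbnet
    Imports System
    Imports System.Xml

    Module ListElements
        Sub Main()
            ' Hypothetical well-formed sample; a real page would be downloaded first.
            Dim xhtml As String = _
                "<html><body><h1>Title</h1><p>First <b>bold</b> text</p></body></html>"

            Dim doc As New XmlDocument()
            doc.LoadXml(xhtml)

            ' The special name "*" matches every element, in document order.
            For Each node As XmlNode In doc.GetElementsByTagName("*")
                ' InnerText concatenates the text of the element and its descendants.
                Console.WriteLine("{0}: {1}", node.Name, node.InnerText)
            Next
        End Sub
    End Module
    ```

    Because InnerText includes the text of child elements, a container such as
    body reports the combined text of everything inside it.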
  9. That works well for XHTML. The problem is that most web sites are still
    using HTML, which is not well-formed XML.

    John
    John Saunders, Feb 26, 2007
    #9
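
    The difference described above is easy to see in a small test. The sketch
    below is an illustration only; TryLoadAsXml is a hypothetical helper name,
    and the two markup strings are made-up samples.

    ```vbnet
    Imports System
    Imports System.Xml

    Module WellFormedCheck
        ' Returns True when the markup parses as XML, False otherwise.
        Function TryLoadAsXml(ByVal markup As String, ByVal doc As XmlDocument) As Boolean
            Try
                doc.LoadXml(markup)
                Return True
            Catch ex As XmlException
                ' LoadXml throws XmlException for markup that is not well-formed.
                Return False
            End Try
        End Function

        Sub Main()
            Dim doc As New XmlDocument()

            ' XHTML style: every element is closed, so this prints True.
            Console.WriteLine(TryLoadAsXml("<p>one<br/>two</p>", doc))

            ' HTML style: <br> is never closed, so this prints False.
            Console.WriteLine(TryLoadAsXml("<p>one<br>two</p>", doc))
        End Sub
    End Module
    ```

    A check like this lets a service fall back to another parsing strategy when
    the page turns out not to be valid XHTML.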
  10. But we're not talking about most web pages. We are talking about a
    particular page that is being used with a web service. In other words, it's
    part of the OP's application, which he should have some control over.
    Scott M., Feb 26, 2007
    #10
  11. Sorry, I didn't recall that he said it was his application. I assumed he was
    scraping from somebody else's application.

    Even though it's his, there may be reasons why he can't guarantee that the
    page he needs is XHTML and will remain XHTML.

    John
    John Saunders, Feb 26, 2007
    #11
  12. One answer is HtmlAgilityPack (www.codeplex.com/htmlagilitypack).
    Stane Bozic, Nov 3, 2007
    #12
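
    For readers who have not used it, a minimal sketch of the HtmlAgilityPack
    approach might look like this. It assumes the third-party HtmlAgilityPack
    assembly is referenced by the project. The sample markup and module name
    are hypothetical, and this is not code from the thread.

    ```vbnet
    Imports System
    Imports HtmlAgilityPack

    Module AgilitySketch
        Sub Main()
            ' Hypothetical sample with an unclosed <br>, which XmlDocument would reject.
            Dim html As String = "<html><body><p>First<br>line</p><p>Second</p></body></html>"

            ' Fully qualified to avoid clashing with other HtmlDocument classes.
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.LoadHtml(html)

            ' SelectNodes takes an XPath expression and returns Nothing when nothing matches.
            Dim paragraphs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//p")
            If paragraphs IsNot Nothing Then
                For Each p As HtmlNode In paragraphs
                    Console.WriteLine(p.InnerText)
                Next
            End If
        End Sub
    End Module
    ```

    The appeal of this route is that the page does not have to be well-formed
    XHTML, which addresses the concern raised earlier in the thread.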
