ASP Question: Parse HTML file?

Discussion in 'ASP General' started by Rob Meade, May 18, 2006.

  1. Rob Meade

    Rob Meade Guest

    Hi all,

    I'm working on a project where there are just under 1300 course files; these
    are HTML files. My problem is that I need to do more with the content of
    these pages - and the thought of writing 1300 ASP pages to deal with this
    doesn't thrill me.

    The HTML pages are provided by a training company. They seem to be
    "structured" to some degree, but I'm not sure how easy it's going to be to
    parse the pages.

    Typically there are the following "sections" of each page:

    Title
    Summary
    Topics
    Technical Requirements
    Copyright Information
    Terms Of Use

    I need to get the content for the Title, Summary, Topics and Technical
    Requirements, and lose the Copyright and Terms of Use... in addition I need
    to squeeze in a new section which will display pricing information and a
    link to "Add to cart" etc....

    My "plan" (if you can call it that) was to have 1 asp page which can parse
    the appropriate HTML file based on the asp page being passed a code in the
    querystring - the code will match the filename of the HTML page (the first
    part prior to the dot).
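    A minimal sketch of that dispatch step, in JScript/JavaScript syntax
    (classic ASP supports JScript as well as VBScript). The validation rule is
    an assumption based on the sample product code "560c04"; the point is that
    the querystring value should be checked before it becomes part of a file
    path:

```javascript
// Map a querystring course code to the HTML filename it refers to.
// Rejecting anything but letters and digits stops a crafted request
// (e.g. "../../web.config") from escaping the courses directory.
function codeToFilename(code) {
  if (!/^[a-z0-9]+$/i.test(code)) {
    return null; // not a plain course code - refuse it
  }
  return code + ".html";
}

var fname = codeToFilename("560c04"); // "560c04.html"
```

    In the ASP page the returned name would then be resolved against the
    /courses folder (e.g. via Server.MapPath) before reading the file.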

    What I then need to do is go through the content of the HTML....this is
    where I am currently stuck....

    I have pasted an example of one of these pages below - if anyone can suggest
    to me how I might achieve this I would be most grateful - in addition - if
    anyone can explain the XML Name Space stuff in there that would be handy
    too - I figure this is just a normal HTML page, as there is no declaration
    or anything at the top?

    Any information/suggestions would be most appreciated.

    Thanks in advance for your help,

    Regards

    Rob


    Example file:

    <html>
    <head>
    <title>Novell 560 CNE Series: File System</title>
    <meta name="Description" content="">
    <link rel="stylesheet" href="../resource/mlcatstyle.css"
    type="text/css">
    </head>
    <body class="MlCatPage">
    <table class="Header" xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">
    <tr>
    <td class="Logo" colspan="2">
    <img class="Logo" src="../images/logo.gif">
    </td>
    </tr>
    <tr>
    <td class="Title">
    <div class="ProductTitle">
    <span class="CoCat">Novell 560 CNE Series: File System</span>
    </div>
    <div class="ProductDetails">
    <span class="SmallText">
    <span class="BoldText"> Product Code: </span>
    560c04<span class="BoldText"> Time: </span>
    4.0 hour(s)<span class="BoldText"> CEUs: </span>
    Available</span>
    </div>
    </td>
    <td class="Back">
    <div class="BackButton">
    <a href="javascript:history.back()">
    <img src="../images/back.gif" align="right" border="0">
    </a>
    </div>
    </td>
    </tr>
    </table>
    <br xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">
    <table class="HighLevel" xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">
    <tr>
    <td class="BlockHeader">
    <h3 class="sectiontext">Summary:</h3>
    </td>
    </tr>
    <tr>
    <td class="Overview">
    <div class="ProductSummary">This course provides an introduction
    to NetWare 5 file system concepts and management procedures.</div>
    <br>
    <h3 class="Sectiontext">Objectives:</h3>
    <div class="FreeText">After completing this course, students will
    be able to: </div>
    <div class="ObjectiveList">
    <ul class="listing">
    <li class="ObjectiveItem">Explain the relationship of the file
    system and login scripts</li>
    <li class="ObjectiveItem">Create login scripts</li>
    <li class="ObjectiveItem">Manage file system directories and
    files</li>
    <li class="ObjectiveItem">Map network drives</li>
    </ul>
    </div>
    <br></br>
    <h3 class="Sectiontext">Topics:</h3>
    <div class="OutlineList">
    <ul class="listing">
    <li class="OutlineItem">Managing the File System</li>
    <li class="OutlineItem">Volume Space</li>
    <li class="OutlineItem">Examining Login Scripts</li>
    <li class="OutlineItem">Creating and Executing Login
    Scripts</li>
    <li class="OutlineItem">Drive Mappings</li>
    <li class="OutlineItem">Login Scripts and Resources</li>
    </ul>
    </div>
    </td>
    </tr>
    </table>
    <br xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">
    <table class="Details" xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">
    <tr>
    <td class="BlockHeader">
    <h3 class="Sectiontext">Technical Requirements:</h3>
    </td>
    </tr>
    <tr>
    <td class="Details">
    <div class="ProductRequirements">200MHz Pentium with 32MB Ram. 800
    x 600 minimum screen resolution. Windows 98, NT, 2000, or XP. 56K minimum
    connection speed, broadband (256 kbps or greater) connection recommended.
    Internet Explorer 5.0 or higher required. Flash Player 7.0 or higher
    required. JavaScript must be enabled. Netscape, Firefox and AOL browsers not
    supported.</div>
    </td>
    </tr>
    </table>
    <br xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">
    <table class="Legal" xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">
    <tr>
    <td class="BlockHeader">
    <h3 class="Sectiontext">Copyright Information:</h3>
    </td>
    </tr>
    <tr>
    <td class="Copyright">
    <div class="ProductRequirements">Product names mentioned in this
    catalog may be trademarks/servicemarks or registered trademarks/servicemarks
    of their respective companies and are hereby acknowledged. All product
    names that are known to be trademarks or service marks have been
    appropriately capitalized. Use of a name in this catalog is for
    identification purposes only, and should not be regarded as affecting the
    validity of any trademark or service mark, or as suggesting any affiliation
    between MindLeaders.com, Inc. and the trademark/servicemark
    proprietor.</div>
    <br>
    <h3 class="Sectiontext">Terms of Use:</h3>
    <div class="ProductUsenote"></div>
    </td>
    </tr>
    </table>
    <p align="center">
    <span class="SmallText">Copyright &copy; 2006 MindLeaders. All rights
    reserved.</span>
    </p>
    </body>
    </html>
    Rob Meade, May 18, 2006
    #1

  2. Mike Brind

    Mike Brind Guest

    Rob Meade wrote:
    > Hi all,
    >
    > I'm working on a project where there are just under 1300 course files, these
    > are HTML files - my problem is that I need to do more with the content of
    > these pages - and the thought of writing 1300 asp pages to deal with this
    > doesn't thrill me.
    >
    > The HTML pages are provided by a training company. They seem to be
    > "structured" to some degree, but I'm not sure how easy its going to be to
    > parse the page.
    >
    > Typically there are the following "sections" of each page:
    >
    > Title
    > Summary
    > Topics
    > Technical Requirements
    > Copyright Information
    > Terms Of Use


    If you can identify the specific divs that hold this information (and
    they are consistent across pages), you could use regex to parse the
    files and pop the relevant bits into a database.
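    A sketch of that idea in JScript/JavaScript syntax (VBScript's RegExp
    object uses the same pattern language). The class name comes from the
    sample page below; the pattern assumes the target div contains no nested
    div, which holds for the sections in the sample:

```javascript
// Return the inner HTML of the first <div class="..."> with the given class,
// or null if the page doesn't contain one.
function extractDiv(html, className) {
  var re = new RegExp('<div class="' + className + '">([\\s\\S]*?)</div>', 'i');
  var m = re.exec(html);
  return m ? m[1] : null; // m[1] is the captured inner content
}

// Cut-down sample from the page below:
var sample = '<div class="ProductSummary">This course provides an introduction ' +
             'to NetWare 5 file system concepts and management procedures.</div>';
var summary = extractDiv(sample, 'ProductSummary'); // the summary sentence
```

    The captured text for each section could then go into one database field
    per section, as suggested above.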

    --
    Mike Brind
    Mike Brind, May 18, 2006
    #2


  3. Anthony Jones

    Anthony Jones Guest

    > I have pasted an example of one of these pages below - if anyone can suggest
    > to me how I might achieve this I would be most grateful - in addition - if
    > anyone can explain the XML Name Space stuff in there that would be handy
    > too - I figure this is just a normal HTML page, as there is no declaration
    > or anything at the top?


    These pages will have been generated via an XSLT transform, and the
    transform will have made use of these namespaces. However, unless told
    otherwise, XSLT outputs the xmlns declarations for these namespaces even
    though no element belonging to them is output - which is the case here.

    That's a long-winded way of saying they don't do anything; ignore them.

    It's a pity they didn't go the whole hog and output the whole page as XML -
    it would be a lot easier to do what you need. Still, it's a good sign that
    the content of the other 1299 pages is likely to be consistent, so Mike's
    idea of scanning with RegExp should work.
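    For reference, a transform can suppress those unused declarations with the
    exclude-result-prefixes attribute. A hypothetical fragment of the vendor's
    stylesheet - everything here except that attribute's behaviour is an
    assumption about their transform:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="fo fn">
  <!-- templates here; with the attribute above, output elements no longer
       carry the unused xmlns:fo / xmlns:fn declarations -->
</xsl:stylesheet>
```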

    Anthony.
    Anthony Jones, May 18, 2006
    #3
  4. Rob Meade

    Rob Meade Guest

    "McKirahan" wrote ...

    > Consider displaying their page inside of an <iframe>
    > inside of a page that has your content.


    Hi McKirahan,

    Thanks for your reply - alas I need "bits" of their pages, with "bits" of my
    stuff inserted in between, so including their whole page as-is unfortunately
    is no good for me.

    Regards

    Rob
    Rob Meade, May 19, 2006
    #4
  5. Rob Meade

    Rob Meade Guest

    "Mike Brind" wrote ...

    > If you can identify the specific divs that hold this information (and
    > they are consistent across pages), you could use regex to parse the
    > files and pop the relevant bits into a database.


    Hi Mike,

    Thanks for your reply.

    I don't suppose by any chance you might have an example that would get me
    started with that approach, would you? It sounds like it could well work.

    Regards

    Rob
    Rob Meade, May 19, 2006
    #5
  6. Rob Meade

    Rob Meade Guest

    "Anthony Jones" wrote ...

    > These pages will have been generated via an XSLT transform. The transform
    > will have made use of these namespaces. However unless informed otherwise
    > XSLT will output the xmlns tags for these namespaces even though no
    > element
    > is output belonging to them which is the case here.
    >
    > That's a long winded way of saying they don't do anything, ignore them.
    >
    > It's a pity they didn't go the whole hog and output the whole page as XML
    > it
    > would be a lot easier to do what you need. Still it's a good sign that
    > the
    > content of the other 1299 pages are likely to be consistent so Mike's idea
    > of scanning with RegExp should work.


    Hi Anthony,

    Thanks for the reply.

    I especially appreciate the explanation for why they are there - I tried
    googling it last night and found some stuff about XSLT 2.0, but it didn't
    really get me anywhere. I would agree that it's a shame they are not XML -
    that would have been nice!

    Cheers

    Rob
    Rob Meade, May 19, 2006
    #6
  7. McKirahan

    McKirahan Guest

    "Mike Brind" <> wrote in message
    news:...
    > Rob Meade wrote:
    > [snip]

    > If you can identify the specific divs that hold this information (and
    > they are consistent across pages), you could use regex to parse the
    > files and pop the relevant bits into a database.
    >
    > --
    > Mike Brind
    >


    It would have been nice if each div class were unique.
    This one is repeated:
    <div class="ProductRequirements">
    It's not wrong, just (potentially) inconvenient.

    <td class="Details">
    <div class="ProductRequirements">200MHz Pentium ...

    <td class="Copyright">
    <div class="ProductRequirements">Product names ...

    Which divs are you interested in?


    Here's a script that will extract all the divs into a new file:

    Option Explicit
    '*
    Const cVBS = "Novell.vbs"
    Const cOT1 = "Novell.htm" '= Input filename
    Const cOT2 = "Novell.txt" '= Output filename
    Const cDIV = "</div>"
    '*
    '* Declare Variables
    '*
    Dim intBEG
    intBEG = 1
    Dim arrDIV(9)
    arrDIV(0) = "<div class=" & Chr(34) & "?" & Chr(34) & ">"
    arrDIV(1) = "ProductTitle"
    arrDIV(2) = "ProductDetails"
    arrDIV(3) = "ProductSummary"
    arrDIV(4) = "FreeText"
    arrDIV(5) = "ObjectiveList"
    arrDIV(6) = "OutlineList"
    arrDIV(7) = "ProductRequirements"
    arrDIV(8) = "ProductRequirements"
    arrDIV(9) = "ProductUsenote"
    Dim intDIV
    Dim strDIV
    Dim strOT1
    Dim strOT2
    Dim intPOS
    '*
    '* Declare Objects
    '*
    Dim objFSO
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    Dim objOT1
    Set objOT1 = objFSO.OpenTextFile(cOT1,1)
    Dim objOT2
    Set objOT2 = objFSO.OpenTextFile(cOT2,2,True)
    '*
    '* Read File, Extract "div", Write Line
    '*
    strOT1 = objOT1.ReadAll()
    For intDIV = 1 To UBound(arrDIV)
        strOT2 = Mid(strOT1,intBEG)                  'search from just past the last match
        strDIV = Replace(arrDIV(0),"?",arrDIV(intDIV))
        intPOS = InStr(strOT2,strDIV)
        If intPOS > 0 Then
            strOT2 = Mid(strOT2,intPOS)              'trim to the opening <div>
            intBEG = intBEG + intPOS - 1             'absolute position of that <div>
            intPOS = InStr(strOT2,cDIV)
            strOT2 = Left(strOT2,intPOS+Len(cDIV)-1) 'keep up to and including </div>
            objOT2.WriteLine(strOT2 & vbCrLf)
            intBEG = intBEG + intPOS + Len(cDIV) - 1 'resume after this </div>
        End If
    Next
    '*
    '* Close and Destroy Objects
    '*
    objOT1.Close
    objOT2.Close
    Set objOT1 = Nothing
    Set objOT2 = Nothing
    Set objFSO = Nothing
    '*
    '* Done!
    '*
    MsgBox "Done!",vbInformation,cVBS

    You could modify it to loop through a list or folder of files.

    Note that each "class=" is in the stylesheet:
    <link rel="stylesheet" href="../resource/mlcatstyle.css"
    type="text/css">
    which you should refer to when using their div's.
    McKirahan, May 19, 2006
    #7
  8. Rob Meade

    Rob Meade Guest

    "McKirahan" wrote ...

    Hi McKirahan, thank you again for your reply and example.

    I should add that I won't be writing these out to another file; instead
    it'll need to do it on the fly, i.e. take the original source page by the
    code passed in the URL, read in the appropriate parts, and then spit out my
    own layout and extra parts.

    With the example you posted (below) - does it extract what's between the
    DIV tags, i.e. the <tr>s and <td>s as well, or just the actual "text"?

    Thanks again

    Rob
    PS: The copyright one can be excluded.
    PPS: When I say it's going to happen on the fly, this would obviously
    depend on how quick and efficient it is - if it turns out that because of
    the number of hits they get on the site in question it's a bit too slow,
    then I might have to have some kind of "import" process, which obviously
    would make more sense anyway; this could then create new pages, or perhaps
    store the information in the database.

    > It would have been nice if each div class were unique.
    > [snip]
    >
    > Here's a script that will extract all the divs into a new file:
    >
    > [snip - script]
    >
    > You could modify it to loop through a list or folder of files.
    Rob Meade, May 19, 2006
    #8
  9. McKirahan

    McKirahan Guest

    "Rob Meade" <> wrote in message
    news:e3WJh$...
    > "McKirahan" wrote ...
    >
    > Hi McKirahan, thank you again for your reply and example.
    >
    > I should add that I wont be writing these out to another file, instead

    it'll
    > need to do it on the fly, ie, take the original source page by the code
    > passed in the URL, read in the appropriate parts, and then spit out my own
    > layout and extra parts.
    >
    > With the example you posted (below) - does it extract whats between the

    DIV
    > tags, ie the <tr>'s and <td's> as well, or just the actually "text"?
    >
    > Thanks again
    >
    > Rob
    > PS: The copyright one can be excluded..
    > PPS: When I say its going to happen on the fly, this would obviously

    depend
    > on how quick and efficient it is - if it turns out that because of the
    > number of hits they get on the site in question its a bit too slow, then I
    > might have to have some kind of "import" process which obviously would

    make
    > more sense anyway, this could then create new pages, or perhaps store the
    > information in the database.
    >


    Did you try it as-is to see what you get?

    I would probably put all 1300 files (pages) in a single folder.
    Then run a process against each to generate 1300 new files in
    a different folder. These would be posted for quick access.

    Prior to posting, they could be reviewed for accuracy.

    Also, instead of extracting the divs, you could just identify
    where you want your stuff inserted.
    McKirahan, May 19, 2006
    #9
  10. Rob Meade

    Rob Meade Guest

    "McKirahan" wrote ...

    > Did you try it as-is to see what you get?


    Hi McKirahan, thanks for your reply.

    Not as of yet no - but I'm home this weekend so will be giving it a go :o)

    > I would probably put all 1300 files (pages) in a single folder.


    They come in a /courses directory

    > Then run a process against each to generate 1300 new files in
    > a different folder. These would be posted for quick access.


    I think I might have to change the process a bit but the idea is the same -
    the content provider has other bits that link to these files, so they'd
    still need to be in a /courses directory, but I could put them somewhere
    else first, "mangle" them and then spit them out to the /courses directory
    :o)

    > Prior to posting the could be reviewed for accuracy.


    I might check a couple - but not all 1300 - I don't wanna go mental... :oD

    > Also, instead of extracting out the div's you could just identify
    > where you want your stuff inserted.


    Yeah, but there were bits I needed to lose, ie the copyright section etc..

    I seem to remember a long time back a discussion about transforming pages -
    I think it might have been done in an ISAPI filter or something though, not
    sure. From what I remember, the requested page would get grabbed, actions
    happen, and then it can be spat out as a different page. I wonder if this
    is what the previous company that did this adopted, because I find it hard
    to believe they would have created 1300 asp files, and yet all of the links
    on the original site were <course-code>.asp as opposed to the real file
    <course-code>.html - if you see what I mean...

    Regards

    Rob
    Rob Meade, May 20, 2006
    #10
  11. McKirahan

    McKirahan Guest

    "Rob Meade" <> wrote in message
    news:cTBbg.72279$...

    [snip]

    > I seem to remember a long time back a discussion about transforming pages, I
    > think it might have been done in an ISAPI filter or something though - not
    > sure - from what I remember the requested page would get grabbed, actions
    > happen and then it can be spat out as a different page - I wonder if this is
    > what the previous company that did this adopted, because I find it hard to
    > believe they would have created 1300 asp files, but yet all of the links on
    > the original site were <course-code>.asp as opposed to the real file
    > <course-code>.html - if you see what I mean...


    An approach they could have taken was to store the "sections" in a database
    table -- one memo field per section -- then generate static pages from it.

    Thus, the header, navigation, and footer could be modified independently.
    McKirahan, May 20, 2006
    #11
  12. Rob Meade

    Rob Meade Guest

    "McKirahan" wrote ...

    > An approach they could have taken was to store the "sections" in a
    > database
    > table -- one memo field per section -- then generate static pages from it.
    >
    > Thus, the header, navigation, and footer could be modified independently.


    I suspect the company does have this, but they most likely use it for the
    generation of these files which they then sell on etc...

    The one thing I am missing at the moment is a nice file that ties the
    <course_code>.html file names (or just the codes) to the titles of the
    courses!

    They give you a "contents.html" file which has all of the courses listed
    and the codes / files as hyperlinks - but again it would mean parsing the
    entire file to get at the goodies. I'm going to ask them if they have the
    same thing in XML or database form to hopefully make that a bit easier..
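    Until then, parsing contents.html for that mapping could be sketched like
    this (JScript/JavaScript syntax; the hyperlink format is an assumption,
    since contents.html itself isn't shown in the thread):

```javascript
// Build a { code: title } lookup from links of the assumed form
// <a href="560c04.html">Novell 560 CNE Series: File System</a>
function buildCourseMap(html) {
  var map = {};
  var re = /<a href="([a-z0-9]+)\.html">([^<]+)<\/a>/gi;
  var m;
  while ((m = re.exec(html)) !== null) {
    map[m[1]] = m[2]; // code -> course title
  }
  return map;
}

var contents = '<a href="560c04.html">Novell 560 CNE Series: File System</a>';
var courseMap = buildCourseMap(contents); // { "560c04": "Novell 560 ..." }
```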

    Thanks again for your help - alas, due to my 9-month-old son, I have yet to
    get around to trying your example! But I will :o)

    Rob
    Rob Meade, May 21, 2006
    #12
  13. Mike Brind

    Mike Brind Guest

    When you do get to try Rob's code, you will see that it opens a number
    of possibilities - one of which is to insert the contents of the divs
    into a database instead of writing them to 1300 text files. I really
    can't understand why this is not at the top of your list of options -
    manage 1300 files...? or manage 1? Hmmmm.... But then you obviously
    know a lot more about your project than I do :)

    If you were using Rob's code, you can insert this into it:

    If intDIV = 2 Then
        Dim re, myMatches, pcode
        Set re = New RegExp
        With re
            'the whitespace between the </span> and the code varies,
            'so match any run of whitespace with \s*
            .Pattern = "Product Code: </span>\s*([a-z0-9]+)"
            .IgnoreCase = True
            .Global = True
        End With
        Set myMatches = re.Execute(strOT2)
        If myMatches.Count > 0 Then
            'SubMatches(0) is just the captured group, e.g. 560c04,
            'so no Replace() clean-up is needed
            pcode = myMatches(0).SubMatches(0)
            Response.Write pcode 'or write to db
        End If
        Set re = Nothing
    End If

    And that will return the Product Code on its own. Change the pattern
    to "<title>([^<]*)</title>" and the same approach strips out the title too.

    --
    Mike Brind


    Rob Meade wrote:
    > [snip]
    Mike Brind, May 21, 2006
    #13
  14. Rob Meade

    Rob Meade Guest

    "Mike Brind" wrote ...

    > When you do get to try Rob's code, you will see that it opens a number
    > of possibilities - one of which is to insert the contents of the divs
    > into a database instead of writing them to 1300 text files.
    > [snip]


    Hi Mike,

    Thanks for your reply - something else to try with it - very much
    appreciated, thank you.

    Regards

    Rob
    PS: It's McKirahan's code ;o)
    Rob Meade, May 22, 2006
    #14
