Parsing out text from in between HTML tags

Discussion in 'Perl Misc' started by tgwaltz@googlemail.com, Jan 18, 2009.

  1. Guest

    Hello -

    I'm new to perl and am having a tough time trying to complete a
    theoretically simple statement. What I'm trying to do is write a very
    simple search engine that searches an html file for a given
    searchQuery. The way it's set up now is that if the searchQuery is
    something like "java," every single page is a hit because the word
    "javascript" is in the code in the form of the "<script
    language="javascript">" etc. I want to specify that $searchQuery
    should be surrounded like so:

    ">(anything)searchQuery(anything)<"

    In other words, the searchQuery has to be in between two HTML tags.
    Here's what I have at this point (the wrong way):

    return unless ($fileName =~ /\Q$searchQuery\E/i);

    Any help would be greatly appreciated!

    Thanks,
    TW
    , Jan 18, 2009
    #1

  2. <tgwaltz@googlemail.com> wrote:


    > I'm new to perl and am having a tough time trying to complete a
    > theoretically simple statement.



    What you want to do (parse a context-free language) is not
    as simple as it seems. It is, in fact, pretty darn complex.


    > What I'm trying to do is write a very
    > simple search engine that searches an html file for a given
    > searchQuery. The way it's set up now is that if the searchQuery is
    > something like "java," every single page is a hit because the word
    > "javascript" is in the code in the form of the "<script
    > language="javascript">" etc.



    Should it match the below, or should it not match the below?

    <p>You can use <strong>javascript</strong> for client-side programming</p>

    If it should not match, then you probably want word-boundaries (\b) in
    your pattern.
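
    As a minimal sketch of the word-boundary idea (the helper name
    matches_word is made up for illustration, it is not from the
    original post):

```perl
use strict;
use warnings;

# Hypothetical helper: true if $query occurs in $text as a whole word,
# ignoring case. \Q...\E quotes any regex metacharacters in the query.
sub matches_word {
    my ($text, $query) = @_;
    return $text =~ /\b\Q$query\E\b/i ? 1 : 0;
}

my @samples = (
    'You can use javascript for client-side programming',
    'You can use Java for server-side programming',
);

for my $text (@samples) {
    my $hit = matches_word($text, 'java') ? 'match' : 'no match';
    print "$hit: $text\n";
}
```

    The first sample should report "no match" and the second "match".
    Note that \b only behaves this way when the query starts and ends
    with a word character.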


    > I want to specify that $searchQuery
    > should be surrounded like so:
    >
    > ">(anything)searchQuery(anything)<"



    If $searchQuery = 'HTML tags' then should it match or not match the below?

    <p><acronym title="HyperText Markup Language">HTML</acronym>
    tags have angle-brackets</p>

    If it should match, then "anything" above does not really mean anything...

    "HTML tags", "HTML&nbsp;tags" and "HTML\ntags" should probably all match...

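    One way to make those three variants compare equal is to decode
    entities and fold whitespace before matching. This is only a sketch:
    normalize_text is a made-up name, and it assumes the HTML::Entities
    module (part of the HTML-Parser distribution) is installed.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: decode entities such as &nbsp;, then collapse
# every run of whitespace (including the no-break space U+00A0) into
# one ordinary space and trim both ends.
sub normalize_text {
    my ($text) = @_;
    $text = decode_entities($text);
    $text =~ s/[\s\xA0]+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

for my $raw ("HTML tags", "HTML&nbsp;tags", "HTML\ntags") {
    my $clean = normalize_text($raw);
    print $clean =~ /\bHTML tags\b/i ? "match\n" : "no match\n";
}
```

    The no-break space is listed explicitly in the character class
    because \s does not match it in every Perl configuration.
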

    > In other words, the searchQuery has to be in between two HTML tags.

                                                   ^^^^^^^^^^^^^^^^^^^^^

    That too is over-simplified.

    <p>It is spelled ja<strong>v</strong>a, not "jabba"</p>


    > Here's what I have at this point (the wrong way):
    >
    > return unless ($fileName =~ /\Q$searchQuery\E/i);

                          ^^^^

    Do you want to search the name or search the content?

    If you want to search the content, then you have chosen an extremely
    poor name for your variable...

    Once you have culled the data to only its content (i.e. removed all markup)
    and normalized it (e.g. folded whitespace), then you probably want something like:

    ... $file_content =~ /\b\Q$searchQuery\E\b/i ...


    > Any help would be greatly appreciated!



    Use a module that understands HTML for processing HTML data.

    perldoc -q "remove HTML"

    suggests a couple of modules that can help you (and there are many others as well).
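
    Putting the pieces together, here is a rough sketch using
    HTML::Parser, one of several modules that could do the job. The sub
    names html_to_text and page_matches are invented for this example,
    and the approach (collect decoded text, skip script and style, fold
    whitespace, then match on word boundaries) is just one reasonable
    reading of the advice above, not a complete search engine.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return only the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded.
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    # Drop everything inside <script> and <style> elements.
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;
    # Fold whitespace, including the no-break space from &nbsp;.
    $text =~ s/[\s\xA0]+/ /g;
    return $text;
}

# Hypothetical helper: whole-word, case-insensitive search of the text.
sub page_matches {
    my ($html, $query) = @_;
    return html_to_text($html) =~ /\b\Q$query\E\b/i ? 1 : 0;
}

my $page = '<script language="javascript">var x = 1;</script>'
         . '<p>It is spelled ja<strong>v</strong>a, not "jabba"</p>';

print page_matches($page, 'java')  ? "java: hit\n"  : "java: miss\n";
print page_matches($page, 'jabba') ? "jabba: hit\n" : "jabba: miss\n";
```

    Because text is simply concatenated, words in adjacent block
    elements with no whitespace between them (such as
    <p>one</p><p>two</p>) end up glued together. A fuller version
    would probably insert a space at block-level tags.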


    --
    Tad McClellan
    email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
    Tad J McClellan, Jan 18, 2009
    #2
