Help with Regexps wanted

Discussion in 'HTML' started by Spartanicus, Oct 21, 2004.

  1. Spartanicus

    Spartanicus Guest

    I could use some examples of how to use regexps to filter HTML; I
    haven't been able to grasp it from the tutorials on the net.

    Functions I'm after:

    1) Remove all tags except img, object, a and embed.
    2) Remove all blank lines (they may contain spaces and/or tabs).
    3) Remove HTML comments.

    My regexp parser (Homesite) is a bit limited; it doesn't support
    functions and shortcuts.

    --
    Spartanicus
    Spartanicus, Oct 21, 2004
    #1

  2. *Spartanicus* wrote:
    > I could use some examples of how to use regexps to filter html, I
    > haven't been able to grasp it using the tutorials on the net.
    >
    > Functions I'm after:
    >
    > 1) Remove all tags except img, object, a and embed.


    I might be getting close with:

    <(?!\/|(?:img|object|a|embed)\b)[^>]*?>|<\/(?!(?:img|object|a|embed)\b)[^>]*?>

    For example (JavaScript/ECMAScript/JScript):

    str = str.replace(/\s*<(?!\/|(?:img|object|a|embed)\b)[^>]*?>\s*|\s*<\/(?!(?:img|object|a|embed)\b)[^>]*?>\s*/igm, " ");

    It's not terribly nice and I gave up trying to remove the inefficient OR
    in the middle :/
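
    One possible way to fold the two alternatives into a single pattern is
    to move the optional slash inside the negative lookahead. The sketch
    below is only an illustration: stripTagsExcept is a hypothetical helper
    name, the tag names are assumed to contain no regexp metacharacters,
    and no ">" is assumed to occur inside attribute values. The \b keeps
    a tag such as <abbr> from being mistaken for <a>.

```javascript
// Hypothetical helper: strips every tag whose name is not listed in `keep`.
// Regex-based, so it assumes no ">" appears inside attribute values.
function stripTagsExcept(html, keep) {
  // The lookahead rejects "<name" and "</name" for whole kept names only.
  const pattern = new RegExp("<(?!/?(?:" + keep.join("|") + ")\\b)[^>]*>", "gi");
  return html.replace(pattern, "");
}

const sample = '<p>See <a href="x.html">this</a> <abbr title="t">abbr</abbr> <img src="i.png"></p>';
console.log(stripTagsExcept(sample, ["img", "object", "a", "embed"]));
// → See <a href="x.html">this</a> abbr <img src="i.png">
```

    Unlike the replace call above, this sketch substitutes an empty string
    and leaves the surrounding whitespace alone.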

    > 2) Remove all blank lines (they may contain spaces and/or tabs).


    To match a blank line:

    /^\s*$/

    E.g.:

    str = str.replace(/^\s*$/gm, "");
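
    A caveat worth checking: $ is zero-width, so replacing a match of
    /^\s*$/ with "" does not remove the line break that ends the blank
    line, and \s itself can run across line breaks. A minimal sketch that
    sidesteps both issues by splitting into lines and filtering follows.
    removeBlankLines is a hypothetical name, and the sketch assumes it is
    acceptable to normalise line endings to "\n" and to lose a trailing
    newline.

```javascript
// Hypothetical helper: drops lines that are empty or hold only spaces/tabs.
function removeBlankLines(text) {
  return text
    .split(/\r?\n/)                         // accept LF and CRLF line endings
    .filter((line) => /[^ \t]/.test(line))  // keep lines with any other character
    .join("\n");
}

const messy = "one\n\n \t \n\ntwo\n";
console.log(JSON.stringify(removeBlankLines(messy)));
// → "one\ntwo"
```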


    > 3) Remove html comments


    http://groups.google.co.uk/groups?th=4b9c59a6279b9620
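
    The linked thread is not reproduced here. As a rough sketch of one
    common approach (not necessarily the one discussed there), a lazy
    match between the comment delimiters can be removed. removeComments is
    a hypothetical name, and the sketch assumes plain comments only; it
    does not try to protect "<!--" appearing inside script or style
    content.

```javascript
// Hypothetical helper: removes <!-- ... --> comments, including multi-line ones.
function removeComments(html) {
  // [\s\S] matches any character, newlines included; *? stops at the first "-->".
  return html.replace(/<!--[\s\S]*?-->/g, "");
}

const page = "<p>keep</p><!-- drop\n this --><p>also keep</p>";
console.log(removeComments(page));
// → <p>keep</p><p>also keep</p>
```
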
    --
    Andrew Urquhart
    - FAQ: http://www.html-faq.com/
    - Archive: http://groups.google.com/groups?group=alt.html
    - Reply: http://andrewu.co.uk/contact/
    Andrew Urquhart, Oct 21, 2004
    #2

  3. William Park

    William Park Guest

    Spartanicus wrote:
    > I could use some examples of how to use regexps to filter html, I
    > haven't been able to grasp it using the tutorials on the net.
    >
    > Functions I'm after:
    >
    > 1) Remove all tags except img, object, a and embed.


    If you don't care about the relative order of those tags, then run
    these extractions separately using Python, Perl, or a (patched) Bash
    shell:
    - extract all text between '<a ' and '</a>',
    - extract all text between '<img ' and '>',
    - extract all text between '<object>' and '</object>',
    - extract all text between '<embed>' and '</embed>'.
    Essentially, read the whole file into a string, and then cut/slice.
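
    The cut/slice idea can be sketched in JavaScript as well, under the
    assumption that the whole file is already in a string. extractBetween
    is a hypothetical helper name; the search is case-sensitive and knows
    nothing about quoting or nesting.

```javascript
// Hypothetical helper: collects every substring that starts with `open`
// and ends with the next `close`, delimiters included.
function extractBetween(text, open, close) {
  const found = [];
  let from = 0;
  while (true) {
    const start = text.indexOf(open, from);
    if (start === -1) break;
    const end = text.indexOf(close, start + open.length);
    if (end === -1) break; // unterminated element: stop rather than guess
    found.push(text.slice(start, end + close.length));
    from = end + close.length;
  }
  return found;
}

const doc = '<p>Go <a href="a.html">here</a> or <a href="b.html">there</a>. <img src="x.png"></p>';
console.log(extractBetween(doc, "<a ", "</a>"));
console.log(extractBetween(doc, "<img ", ">"));
```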

    If you like a shell solution, you can use
    http://freshmeat.net/projects/bashdiff/
    which has "string" cut/splicing.

    > 2) Remove all blank lines (they may contain spaces and/or tabs).
    > 3) Remove html comments
    >
    > My regexp parser (Homesite) is a bit limited, it doesn't support
    > functions and shortcuts.



    --
    William Park
    Open Geometry Consulting, Toronto, Canada
    William Park, Oct 25, 2004
    #3
  4. William Park writes:

    > Spartanicus wrote:


    >> 1) Remove all tags except img, object, a and embed.


    I'll just pick a simple one:

    > - extract all text between '<img ' and '>',


    <img src="tagc.png" alt=">">


    Doing such things is usually as trivial as writing your own SGML parser
    from scratch. The upshot: there's a difference between parsing a
    private subset of *applied* syntax (your own, or that of the
    applications currently available) and conforming to a generic set of
    defined syntactical rules. The former is only fairly easy as long as
    you don't forget about your policies, the involved applications don't
    unexpectedly change, and you are the only user to start with.
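
    The img example above can be tried directly. The sketch below compares
    a naive pattern with one that consumes quoted attribute values whole;
    both patterns are illustrative assumptions, and the second still falls
    well short of a real parser (comments, script content and malformed
    quoting are not handled).

```javascript
const tag = '<img src="tagc.png" alt=">">';

// Naive: stops at the first ">", which here sits inside the alt value.
const naive = /<img\b[^>]*>/i;
// Quote-aware sketch: a quoted string is consumed whole before ">" is looked for.
const quoteAware = /<img\b(?:"[^"]*"|'[^']*'|[^'">])*>/i;

console.log(tag.match(naive)[0]);      // → <img src="tagc.png" alt=">
console.log(tag.match(quoteAware)[0]); // → <img src="tagc.png" alt=">">
```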


    --
    | ) Più Cabernet,
    -( meno Internet.
    | ) http://bednarz.nl/
    Eric B. Bednarz, Oct 25, 2004
    #4
  5. William Park

    William Park Guest

    Eric B. Bednarz wrote:
    > William Park writes:
    >
    > > Spartanicus wrote:

    >
    > >> 1) Remove all tags except img, object, a and embed.

    >
    > I'll just pick a simple one:
    >
    > > - extract all text between '<img ' and '>',

    >
    > <img src="tagc.png" alt=">">


    Good one. I guess the OP can turn the HTML into XML syntax and use an
    XML parser.

    --
    William Park
    Open Geometry Consulting, Toronto, Canada
    William Park, Oct 25, 2004
    #5