Help with Regexps wanted

Spartanicus

I could use some examples of how to use regexps to filter html, I
haven't been able to grasp it using the tutorials on the net.

Functions I'm after:

1) Remove all tags except img, object, a and embed.
2) Remove all blank lines (they may contain spaces and/or tabs).
3) Remove html comments

My regexp parser (Homesite) is a bit limited; it doesn't support
functions and shortcuts.
 
Andrew Urquhart

*Spartanicus* said:
I could use some examples of how to use regexps to filter html, I
haven't been able to grasp it using the tutorials on the net.

Functions I'm after:

1) Remove all tags except img, object, a and embed.

I might be getting close with:

<(?!\/|img|object|a|embed)[^>]*?>|<\/(?!img|object|a|embed)[^>]*?>

For example (javascript/ecmascript/jscript):

str = str.replace(/\s*<(?!\/|img|object|a|embed)[^>]*?>\s*|\s*<\/(?!img|object|a|embed)[^>]*?>\s*/igm, " ");

It's not terribly nice and I gave up trying to remove the inefficient OR
in the middle :/
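
One possible way to drop the OR, offered as a sketch rather than a tested drop-in: make the slash optional inside a single negative lookahead. The `\b` is an addition not in the pattern above; without it, `a` in the lookahead also spares every tag that merely starts with "a" (abbr, address, area). The names `keepTags` and `stripTags` are made up for this example, and it replaces with an empty string instead of a space.

```javascript
// Sketch: one negative lookahead covers both opening and closing tags.
// \/? makes the leading slash optional; \b requires the tag name to end there,
// so <abbr> is not mistaken for <a>.
const keepTags = /<(?!\/?(?:img|object|a|embed)\b)[^>]*>/gi;

// Hypothetical helper: removes every tag that is not img, object, a or embed.
function stripTags(str) {
  return str.replace(keepTags, "");
}

const sample = '<p>See <a href="x.html">this</a> <abbr>HTML</abbr> <img src="y.png"></p>';
const out = stripTags(sample);
console.log(out);
```

Like the original pattern, this still stops at the first ">" it sees, so an attribute value containing ">" will confuse it.
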
2) Remove all blank lines (they may contain spaces and/or tabs).

To match a blank line:

/^\s*$/

E.g.:

str = str.replace(/^\s*$/gm, "");
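
A caveat on the replace above: in JavaScript, `$` matches just before the line break without consuming it, so the whitespace goes but an empty line stays behind. A minimal sketch of a variant that also consumes the line break follows. It assumes the text uses plain "\n" line endings (CRLF input would need separate handling), and `blankLine` and `removeBlankLines` are names invented for the example.

```javascript
// Assumption: lines end in "\n" only.
// Match a whole line holding nothing but spaces/tabs, including its line break.
const blankLine = /^[ \t]*\n/gm;

// Hypothetical helper: drops blank lines entirely instead of just emptying them.
function removeBlankLines(str) {
  return str.replace(blankLine, "");
}

const text = "first\n\n  \t\nsecond\n\nthird\n";
const cleaned = removeBlankLines(text);
console.log(JSON.stringify(cleaned));
```

A whitespace-only last line with no trailing line break is not matched by this pattern.
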

3) Remove html comments

http://groups.google.co.uk/groups?th=4b9c59a6279b9620
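
For readers who cannot follow the link, a commonly used pattern is sketched here. It is not taken from the linked thread, and it assumes comments are the simple `<!-- ... -->` form, ignoring SGML comment edge cases and comment-like text inside scripts. `htmlComment` and `removeComments` are illustrative names.

```javascript
// Lazy match from "<!--" to the first "-->".
// [\s\S] is used instead of "." so the match can span line breaks.
const htmlComment = /<!--[\s\S]*?-->/g;

// Hypothetical helper: removes simple HTML comments from a string.
function removeComments(str) {
  return str.replace(htmlComment, "");
}

const page = "<p>one</p><!-- note\nover two lines --><p>two</p><!--x-->";
const noComments = removeComments(page);
console.log(noComments);
```
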
 
William Park

Spartanicus said:
I could use some examples of how to use regexps to filter html, I
haven't been able to grasp it using the tutorials on the net.

Functions I'm after:

1) Remove all tags except img, object, a and embed.

If you don't care about the relative order of those tags, then run the
following extractions separately:
- extract all text between '<a ' and '</a>',
- extract all text between '<img ' and '>',
- extract all text between '<object>' and '</object>',
- extract all text between '<embed>' and '</embed>',
using Python, Perl, or a (patched) Bash shell. Essentially, read the
whole file into a string, and then cut/slice.

If you like a shell solution, you can use
http://freshmeat.net/projects/bashdiff/
which has "string" cut/splicing.
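
The cut/slice idea can be sketched without regexps at all. The version below is in JavaScript rather than the Python, Perl or Bash mentioned above, to match the earlier examples in this thread. `extractBetween` is a made-up helper, and the search is case-sensitive and knows nothing about quoting or nesting.

```javascript
// Hypothetical helper: collect every substring that starts with `open`
// and runs through the next occurrence of `close`.
function extractBetween(str, open, close) {
  const found = [];
  let from = 0;
  while (true) {
    const start = str.indexOf(open, from);
    if (start === -1) break;            // no further opening delimiter
    const end = str.indexOf(close, start + open.length);
    if (end === -1) break;              // unterminated: stop rather than guess
    found.push(str.slice(start, end + close.length));
    from = end + close.length;          // continue after this match
  }
  return found;
}

const html = '<p><a href="a.html">A</a> and <img src="b.png"> then <a href="c.html">C</a></p>';
const links = extractBetween(html, "<a ", "</a>");
const images = extractBetween(html, "<img ", ">");
console.log(links, images);
```

Each call returns its own list, which is why the relative order between different tag types is lost.
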
 
Eric B. Bednarz

I'll just pick a simple one:
- extract all text between '<img ' and '>',

<img src="tagc.png" alt=">">


Doing such things is usually as trivial as writing your own SGML parser
from scratch. The upshot: there's a difference between parsing a private
subset of *applied* syntax (your own, or simply that of currently
available applications) and conforming to a generic set of defined
syntactical rules. The former is only fairly easy as long as you don't
forget about your policies, the applications involved don't unexpectedly
change, and you are the only user to start with.
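
To make the img example above concrete, here is a small sketch of what a "between '<img ' and '>'" match does with it, next to a quote-aware pattern that copes with this one case. The quote-aware pattern is an illustration only: it assumes attribute values are properly quoted and is still nowhere near a conforming parser.

```javascript
const tag = '<img src="tagc.png" alt=">">';

// Naive: everything from "<img " up to the first ">".
const naive = tag.match(/<img [^>]*>/)[0];

// Quote-aware sketch: skip over "..." and '...' runs as units, so a ">"
// inside a quoted attribute value does not end the tag.
const quoteAware = /<img\b(?:"[^"]*"|'[^']*'|[^'">])*>/;
const whole = tag.match(quoteAware)[0];

console.log(naive); // cut short at the ">" inside alt
console.log(whole); // the complete tag
```

Every further case (unquoted values, SGML short forms, markup inside scripts) needs yet another special rule, which is the point being made here.
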
 
William Park

Eric B. Bednarz said:
I'll just pick a simple one:


<img src="tagc.png" alt=">">

Good one. I guess the OP can turn the HTML into XML syntax and use an XML parser.
 
