Regular expression to exclude lines?

Mark Szlazak

Shannon Jacobs said:
My goal now is to do something similar, but excluding the lines that do not
contain some string. I'm most interested in an elegant solution, though the
discussion so far seems to suggest that there may be no better approach than
parsing the input one line at a time...

An additional wrinkle is that I'd like to generalize a bit by treating the
decision string as a parameter returned in another field of the form.

I've tried twice to post a simple solution through Developers Dex but
they haven't appeared in about two days. I'm assuming they're lost.
Anyway, my previous post that did appear starts to point you to a
solution. It doesn't require a separate process to break up lines and
it works. To remove lines without the substring "something" in them,
here's that solution again.

rx = /^(?:(?!\bsomething\b).)*$/gm;
outText = inText.replace(rx,'');

To make this regular expression dynamic, use the RegExp object
constructor.

skip = 'something';
pattern = '^(?:(?!\\b' + skip + '\\b).)*$';
rx = new RegExp(pattern, 'gm');
outText = inText.replace(rx,'');
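For anyone who wants to try the static version, here is a minimal sketch. The sample text is made up for illustration; only the regex comes from the post above.

```javascript
// Made-up sample input; "something" is the key string from the post.
var inText = 'keep something here\ndrop this line\nsomething else\nnothing to see';

// Same regex as above: match a whole line only if the word "something"
// cannot be found at any position in it.
var rx = /^(?:(?!\bsomething\b).)*$/gm;
var outText = inText.replace(rx, '');

// Lines without the key are emptied, but their line breaks stay behind.
console.log(JSON.stringify(outText));
// → "keep something here\n\nsomething else\n"
```

Note that "nothing" does not count as a hit because of the \b word boundaries.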

Also, one of your posts talks about linefeeds and the \r\n pattern.
This is OS dependent and linefeeds could also be just \r or \n.
 
Shannon Jacobs

Mark Szlazak wrote:
<snip of lengthy text describing goal of deleting lines that do not include
a key string>
rx = /^(?:(?!\bsomething\b).)*$/gm;
outText = inText.replace(rx,'');

To make this regular expression dynamic, use the RegExp object
constructor.

skip = 'something';
pattern = '^(?:(?!\\b' + skip + '\\b).)*$';
rx = new RegExp(pattern, 'gm');
outText = inText.replace(rx,'');

Also, one of your posts talks about linefeeds and the \r\n pattern.
This is OS dependent and linefeeds could also be just \r or \n.

Below is the working code. I'm extremely obliged and I hope the embedded
acknowledgment is sufficient, even though I don't expect to actively
broadcast the code. You're certainly a guru in my JavaScript book. The only
real change I had to make was the thing at the end to include the ends of
the lines. Your original version left a blank line, while I wanted to remove
those lines completely. By the way, I tested an earlier non-dynamic version
with Opera and it worked fine. I'll test the dynamic version tomorrow.

My main regret is that I still don't fully understand how it works... Rather
embarrassing, but looks like I'll have to break out the Perl manual
tomorrow.

function keepSelectedLines(keepString, blockOfText) {
// based on tips from Mark Szlazak
pattern = '^(?:(?!\\b' + keepString + '\\b).)*$[\r\n]*';
rx = new RegExp(pattern, 'gm');
blockOfText = blockOfText.replace(rx,'');
return blockOfText;
}
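
A quick usage sketch, in case it helps anyone reading along. The function is copied from above with `var` added so that `pattern` and `rx` do not leak into the global scope; the sample text is an assumption for illustration.

```javascript
// Copy of keepSelectedLines from above, with "var" added to keep helpers local.
function keepSelectedLines(keepString, blockOfText) {
  var pattern = '^(?:(?!\\b' + keepString + '\\b).)*$[\r\n]*';
  var rx = new RegExp(pattern, 'gm');
  return blockOfText.replace(rx, '');
}

// Made-up sample text; only lines containing the word "something" should survive.
var sample = 'alpha\nkeep something\nbeta\ngamma\nsomething more\n';
var kept = keepSelectedLines('something', sample);

console.log(JSON.stringify(kept));
// → "keep something\nsomething more\n"
```

Because the trailing `[\r\n]*` swallows the line break too, no blank lines are left behind.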
 
Thomas 'PointedEars' Lahn

Shannon said:
Below is the working code. [...]
My main regret is that I still don't fully understand how it works... Rather
embarrassing, but looks like I'll have to break out the Perl manual
tomorrow.

Why, see the Reference:

http://devedge.netscape.com/library/manuals/2000/javascript/1.5/guide/regexp.html#1010689
function keepSelectedLines(keepString, blockOfText) {
// based on tips from Mark Szlazak
pattern = '^(?:(?!\\b' + keepString + '\\b).)*$[\r\n]*';

This string literal contains a notation later to be used to create
a Regular Expression (RegExp object) that matches the beginning of
a line (^; the beginning of the text if the 'm' flag is not set)
followed by zero or more occurrences (*) of the following:

Match the following but don't remember the match (/(?:...)/):
Match only if the following does _not_ match at the current position
(/(?!...)/, negative lookahead): Word boundary ("\\b" becoming /\b/)
followed by the value of `keepString' followed by a word boundary.
Then match any single character except a line terminator (/./).

The above should match only if it is followed by the end of the
line ($) followed by zero or more occurrences (*) of any of
the characters ([...]) \r (carriage return) and \n (linefeed).
rx = new RegExp(pattern, 'gm');

This creates a RegExp object from the above string literal, matching
it on every single line instead of on the whole text ('m'; consider
multiline input), having /^/ and /$/ match the beginning and the end
of line instead of the beginning and the end of text, and matches all
occurrences, not only the first one ('g'; global).

However, it should be noted that it fails if the above string literal,
especially the value of the `keepString' argument, contains
single-escaped or certain double-escaped sequences, e.g. "C:\blurb",
where "\b" in the string literal is already the backspace character, so
the RegExp would look for a backspace followed by "lurb", or "C:\\blurb"
which would result in /C:\blurb/mg, meaning /\b/ as word boundary. For
this function, an input of "C:\\\\blurb" would have to be used to get
/C:\\blurb/ in the RegExp, having /\\/ to match the literal backslash
character (`\'), as it was intended.

(AFAIS there is no general method with JavaScript to convert a string so
that it can be used as argument for the RegExp constructor function with
the resulting RegExp to match the string; simply inserting backslashes
will obviously not work as supposed in all cases.)
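
That said, backslash-escaping every regex metacharacter before handing the string to the RegExp constructor does cover the common cases. The following is only a sketch: `escapeForRegExp` is a hypothetical helper name, not a built-in, and the character list is the usual set of pattern metacharacters.

```javascript
// Hypothetical helper (not part of JavaScript itself): put a backslash in
// front of every character that has a special meaning in a pattern.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// The string value here is C:\blurb (one backslash).
var path = 'C:\\blurb';

// Unescaped, the backslash-b is read as a word boundary, so the path
// does not match itself.
var rawMatches = new RegExp(path).test(path);

// Escaped, the backslash is matched literally.
var escapedMatches = new RegExp(escapeForRegExp(path)).test(path);

console.log(rawMatches, escapedMatches);
// → false true
```

Escaping like this also means a caller can no longer pass a deliberate pattern such as an alternation as the key.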
blockOfText = blockOfText.replace(rx,'');

Replaces matches of `rx' with the empty string (i.e. deletes the
matching substrings).
return blockOfText;

Returns the changed text.

HTH

PointedEars
 
Shannon Jacobs

Mark Szlazak wrote:
<snip of lengthy text describing goal of deleting lines that do not
include
a key string>
rx = /^(?:(?!\bsomething\b).)*$/gm;
outText = inText.replace(rx,'');

To make this regular expression dynamic, use the RegExp object
constructor.

skip = 'something';
pattern = '^(?:(?!\\b' + skip + '\\b).)*$';
rx = new RegExp(pattern, 'gm');
outText = inText.replace(rx,'');

Also, one of your posts talks about linefeeds and the \r\n pattern.
This is OS dependent and linefeeds could also be just \r or \n.

Below is the working code. I'm extremely obliged and I hope the
embedded acknowledgment is sufficient, even though I don't expect to
actively broadcast the code. You're certainly a guru in my JavaScript
book. The only real change I had to make was the thing at the end to
include the ends of the lines. Your original version left a blank
line, while I wanted to remove those lines completely. By the way, I
tested an earlier non-dynamic version with Opera and it worked fine.
I'll test the dynamic version tomorrow.

My main regret is that I still don't fully understand how it works...
Rather embarrassing, but looks like I'll have to break out the Perl
manual tomorrow. [Actually, I did look at the manual, and still don't
understand all of it, though I feel like the pair of \b is not really
required?]

function keepSelectedLines(keepString, blockOfText) {
// based on tips from Mark Szlazak
pattern = '^(?:(?!\\b' + keepString + '\\b).)*$[\r\n]*';
rx = new RegExp(pattern, 'gm');
blockOfText = blockOfText.replace(rx,'');
return blockOfText;
}

(Apologies if this post appears twice, but something strange is going
on here... My newsreader definitely thinks I posted this reply
yesterday, but it seems to have disappeared, just as Mr. Szlazak
reported some of his posts had disappeared. I rather suspect that the
spammers' efforts are resulting in so much newsgroup pollution that
non-spam posts are getting caught in the crossfire. Hopefully the
Google routing will work better.)
 
Mark Szlazak

Shannon Jacobs said:
Mark Szlazak wrote:
<snip of lengthy text describing goal of deleting lines that do not include
a key string>
rx = /^(?:(?!\bsomething\b).)*$/gm;
outText = inText.replace(rx,'');

To make this regular expression dynamic, use the RegExp object
constructor.

skip = 'something';
pattern = '^(?:(?!\\b' + skip + '\\b).)*$';
rx = new RegExp(pattern, 'gm');
outText = inText.replace(rx,'');

Also, one of your posts talks about linefeeds and the \r\n pattern.
This is OS dependent and linefeeds could also be just \r or \n.

Below is the working code. I'm extremely obliged and I hope the embedded
acknowledgment is sufficient, even though I don't expect to actively
broadcast the code. You're certainly a guru in my JavaScript book. The only
real change I had to make was the thing at the end to include the ends of
the lines. Your original version left a blank line, while I wanted to remove
those lines completely. By the way, I tested an earlier non-dynamic version
with Opera and it worked fine. I'll test the dynamic version tomorrow.

My main regret is that I still don't fully understand how it works... Rather
embarrassing, but looks like I'll have to break out the Perl manual
tomorrow.

function keepSelectedLines(keepString, blockOfText) {
// based on tips from Mark Szlazak
pattern = '^(?:(?!\\b' + keepString + '\\b).)*$[\r\n]*';
rx = new RegExp(pattern, 'gm');
blockOfText = blockOfText.replace(rx,'');
return blockOfText;
}

NOTE: I've tried posting this in two previous replies which again seem
to be lost.

Thanks Shannon! This regex isn't original and it's probably more
commonly known among Perl programmers.

I have a suggestion. If you want to consume the linefeeds then you
don't need $ in the regex. Change $[\r\n]* to [\r\n]+
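
One thing to watch with the two endings, shown as a rough sketch. `removeLinesWithout` is a hypothetical helper used only for this comparison, and the sample text is made up. The two variants differ when the last line has no trailing line break.

```javascript
// Hypothetical helper for comparing line-ending suffixes; not part of the
// original function.
function removeLinesWithout(key, text, ending) {
  var rx = new RegExp('^(?:(?!\\b' + key + '\\b).)*' + ending, 'gm');
  return text.replace(rx, '');
}

// Made-up input whose last line is NOT terminated by a line break.
var noTrailingNewline = 'a\nkey b\nc';

// $[\r\n]* can match at the very end of the text, so "c" is removed.
var withDollar = removeLinesWithout('key', noTrailingNewline, '$[\\r\\n]*');

// [\r\n]+ insists on at least one line break, so the final "c" survives.
var withPlus = removeLinesWithout('key', noTrailingNewline, '[\\r\\n]+');

console.log(JSON.stringify(withDollar), JSON.stringify(withPlus));
// → "key b\n" "key b\nc"
```

If the input always ends with a line break, both endings should give the same result.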

Here's how I think about this regex. Starting at a position before the
first character of the line, the negative lookahead checks that its
subpattern isn't present at that position. If it isn't, the "dot"
matches any character except linefeeds and moves us to a new position
just after that character. This is repeated until the end of the line,
unless the negative lookahead's subpattern is found, in which case
there is no match. Now, the caret ^ at the beginning of the regex
eliminates "bump-alongs" when the negative lookahead's subpattern is
found. Without it, the regex engine would do our scanning all over
again, except from the next position in the line. Again, if a regex
match isn't found (e.g., the lookahead's subpattern is found) then it
bumps along to start at the next position, re-does the scan, and this
bumping-along could continue to the end of the line.

You want to suppress this because it's not needed, it will not match
the entire line, and it will cause false matches when the engine moves
past, say, "s" in "something" to start scanning from "omething..." in a
negative lookahead that has "something" as its subpattern.

At least I think that's how this works ;-)
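
To see the bump-along effect described above, here is a small sketch comparing the regex with and without the caret. The sample text is made up for illustration.

```javascript
// Made-up sample: the first line contains the key, the second does not.
var line = 'say something now\nplain line';

var anchored = /^(?:(?!\bsomething\b).)*$/gm;   // with the caret
var unanchored = /(?:(?!\bsomething\b).)*$/gm;  // caret removed

// With the caret, the first line is left alone and the second is emptied.
var anchoredResult = line.replace(anchored, '');

// Without it, the engine bumps along past the "s" of "something" and then
// matches "omething now", chopping up a line that should have been kept.
var unanchoredResult = line.replace(unanchored, '');

console.log(JSON.stringify(anchoredResult), JSON.stringify(unanchoredResult));
// → "say something now\n" "say s\n"
```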
 
Shannon Jacobs

Not sure what to make of it, but my original post showed up again after a
couple of days. Maybe server problems at my end?

Mark Szlazak wrote:
function keepSelectedLines(keepString, blockOfText) {
// based on tips from Mark Szlazak
pattern = '^(?:(?!\\b' + keepString + '\\b).)*$[\r\n]*';
rx = new RegExp(pattern, 'gm');
blockOfText = blockOfText.replace(rx,'');
return blockOfText;
}
I have a suggestion. If you want to consume the linefeeds then you
don't need $ in the regex. Change $[\r\n]* to [\r\n]+

I'll probably try that suggestion, but I already went in and removed the \b
pair. I'm not sure why you recommended those. Actually, the first person I
showed it to also wanted to be able to do two keys at a time. That turned
out to be easy by entering the keepString as:

(key1|key2)
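
Roughly, the two-key case looks like this. This is a sketch that assumes the variant with the \b pair removed, as described above; `keepLinesMatching` is a hypothetical name for that modified function and the sample text is made up.

```javascript
// Assumed reconstruction of the modified function (no \b pair); the name is
// hypothetical.
function keepLinesMatching(keepString, blockOfText) {
  var rx = new RegExp('^(?:(?!' + keepString + ').)*$[\r\n]*', 'gm');
  return blockOfText.replace(rx, '');
}

// Made-up sample text; lines with either key should be kept.
var twoKeyInput = 'has key1\nnothing\nhas key2\nother\n';
var twoKeyOutput = keepLinesMatching('(key1|key2)', twoKeyInput);

console.log(JSON.stringify(twoKeyOutput));
// → "has key1\nhas key2\n"
```

This works because keepString is pasted into the pattern unescaped, so the parentheses and the bar act as regex alternation.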

However, I did run into one problem already... The operation is inconsistent
with Japanese, which uses a DBCS (part of the time). I suspected it might be
one of those byte-alignment problems, but that doesn't seem to make sense if
the regexp is trying to match from every byte position...

And thanks for the explanation of how it works. Already seen a couple, but
that seems to be another aspect of regexp newsgroups?
 
Mark Szlazak

Shannon Jacobs said:
Not sure what to make of it, but my original post showed up again after a
couple of days. Maybe server problems at my end?

Mark Szlazak wrote:
Also, one of your posts talks about linefeeds and the \r\n pattern.
This is OS dependent and linefeeds could also be just \r or \n.
function keepSelectedLines(keepString, blockOfText) {
// based on tips from Mark Szlazak
pattern = '^(?:(?!\\b' + keepString + '\\b).)*$[\r\n]*';
rx = new RegExp(pattern, 'gm');
blockOfText = blockOfText.replace(rx,'');
return blockOfText;
}
I have a suggestion. If you want to consume the linefeeds then you
don't need $ in the regex. Change $[\r\n]* to [\r\n]+

I'll probably try that suggestion, but I already went in and removed the \b
pair. I'm not sure why you recommended those. Actually, the first person I
showed it to also wanted to be able to do two keys at a time. That turned
out to be easy by entering the keepString as:

(key1|key2)

However, I did run into one problem already... The operation is inconsistent
with Japanese, which uses a DBCS (part of the time). I suspected it might be
one of those byte-alignment problems, but that doesn't seem to make sense if
the regexp is trying to match from every byte position...

And thanks for the explanation of how it works. Already seen a couple, but
that seems to be another aspect of regexp newsgroups?

The \b's are for word boundaries. See what happens when one line
has "Java" but not "JavaScript" and another line has "JavaScript"
but not "Java" with this negative lookahead (?!Java)
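
A small sketch of that experiment, with made-up sample lines. Both regexes use the line-removing form from earlier in the thread and differ only in the \b pair.

```javascript
// Made-up sample: one line with "Java", one with "JavaScript", one with neither.
var text = 'Java only\nJavaScript only\nneither';

var loose = /^(?:(?!Java).)*$[\r\n]*/gm;       // no word boundaries
var strict = /^(?:(?!\bJava\b).)*$[\r\n]*/gm;  // with word boundaries

// Without \b, "JavaScript" also counts as containing "Java", so both are kept.
var looseResult = text.replace(loose, '');

// With \b, only the line with the whole word "Java" is kept.
var strictResult = text.replace(strict, '');

console.log(JSON.stringify(looseResult), JSON.stringify(strictResult));
// → "Java only\nJavaScript only\n" "Java only\n"
```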

JavaScript 1.5 regular expressions are undefined for many Unicode
characters and Japanese characters. However, you can specify Unicode
character ranges by hex. The following regex would match halfwidth
Katakana letters when using the Japanese encoding of this table,
http://www.microsoft.com/globaldev/reference/dbcs/932.htm

katakana = /[\uff65-\uff9f]/;
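
A quick check of what that range covers, as a sketch. The test strings are written with \u escapes so the example does not depend on the file encoding; the variable names are assumptions for illustration.

```javascript
var katakana = /[\uff65-\uff9f]/;

// Halfwidth katakana "ka-ta-ka-na" (U+FF76, U+FF80, U+FF76, U+FF85).
var halfwidth = '\uff76\uff80\uff76\uff85';

// The same word in ordinary fullwidth katakana (U+30AB, U+30BF, U+30AB, U+30CA).
var fullwidth = '\u30ab\u30bf\u30ab\u30ca';

var halfwidthHit = katakana.test(halfwidth);
var fullwidthHit = katakana.test(fullwidth);

console.log(halfwidthHit, fullwidthHit);
// → true false
```

Ordinary fullwidth katakana sits in a different range, so it would need its own character class.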
 
