Splitting string into word array - regular expression

Anat

Hi,
What regex do I need to split a string into an array of words using
JavaScript's split method?
Splitting on whitespace alone is not enough; I need to split on
whitespace, commas, hyphens, etc.
Is there a regex that does the trick?
Thanks, Anat.
 
RobG

Anat said:
Hi,
What regex do I need to split a string into an array of words using
JavaScript's split method?

Of course, that depends on how you define a word.

Splitting on whitespace alone is not enough; I need to split on
whitespace, commas, hyphens, etc.
Is there a regex that does the trick?

To split at one or more non-word characters (basically any character
other than a letter or number):

var words = string.split(/\W+/);
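A minimal sketch of what that one-liner returns, assuming a sample sentence of my own (it is not from the thread). Note that a string ending in punctuation leaves an empty string at the end of the array, which you may want to filter out:

```javascript
// Sample input is an assumption for illustration only.
var text = "Hello world, this is a wonderful day!";

// Split at runs of non-word characters (anything outside [A-Za-z0-9_]).
var rawWords = text.split(/\W+/);

// The trailing "!" produces an empty last element, so drop empty entries.
var cleanWords = rawWords.filter(function (w) {
  return w.length > 0;
});

console.log(rawWords);   // 8 elements, the last one is ""
console.log(cleanWords); // 7 words
```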
 
Zifud

RobG said:
Of course, that depends on how you define a word.



To split at one or more non-word characters (basically any character
other than a letter or number):

var words = string.split(/\W+/);

Not all browsers will tolerate regular expressions in split(); it may be
safer to replace all non-word characters with a space, then split on that:

var newString = string.replace(/\W+/g,' ');
var words = newString.split(' ');


For the OP to consider...
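For comparison, here is a small sketch (the sample string is an assumption of mine) that runs both variants on the same input. As far as I can tell they yield the same array, so the two-step form changes only which method receives the regular expression, not the result:

```javascript
// Sample input is an assumption for illustration only.
var sample = "Hello world, this is a wonderful day!";

// Variant 1: regular expression passed directly to split().
var viaSplit = sample.split(/\W+/);

// Variant 2: normalise separators to a single space, then split on a string.
var viaReplace = sample.replace(/\W+/g, " ").split(" ");

// Both arrays end with an empty string because the input ends in "!".
console.log(viaSplit.join("|") === viaReplace.join("|"));
```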
 
Anat

Thanks guys,
But actually, when I come to think of it, it's not a good solution for
what I'm trying to do.
I want to take a given string, and make certain words hyperlinks.
For example:
"Hello world, this is a wonderful day!"
I'd like the words world, wonderful and day to be hyperlinks, therefore
after my manipulation it should be:
"Hello <a href=...>world</a>, this is a <a href=...>wonderful</a> <a
href=...>day</a>!"
Using the split method is no good, because the whitespace, commas and
other punctuation marks are lost.
Instead of displaying
"Hello <a href=...>world</a>, this is a <a href=...>wonderful</a> <a
href=...>day</a>!"
I will display
"Hello <a href=...>world</a> this is a <a href=...>wonderful</a> <a
href=...>day</a>"
(note that the comma and exclamation mark are gone).
Any ideas on how I can locate words and replace them without losing
punctuation marks on the way?
Thanks again!!!
 
Lasse Reichstein Nielsen

Zifud said:
Not all browsers will tolerate regular expressions in split(),

Can you mention one that doesn't and is more recent than Netscape 3?
I can see that both IE 4 and Netscape 4.80 do support it.

/L
 
Thomas 'PointedEars' Lahn

RobG said:
Of course, that depends on how you define a word.

Exactly.


To split at one or more non-word characters (basically any character
other than a letter or number):

var words = string.split(/\W+/);

Which is why one seldom wants that: non-ASCII letters such as "ü" or "ß"
also match \W. The OP probably does not want it either and is looking for
a character class instead:

var s = [
"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do",
"eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim",
"ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut",
"aliquip ex ea commodo consequat. Duis aute irure dolor in",
"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla",
"pariatur. Excepteur sint occaecat cupidatat non proident, sunt in",
"culpa qui officia deserunt mollit anim id est laborum."
].join(" ");

window.alert(s);

// "etc." not included
var words = s.split(/[\s,-]+/);

window.alert(words.join(" | "));
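A shorter sketch of the same character-class split, using a sample sentence of my own (an assumption, not from the thread). It shows that non-ASCII letters survive, and that any delimiter not listed in the class, here the period, stays attached to its word:

```javascript
// Sample input is an assumption for illustration only.
var sentence = "Überlandstraße is a well-known, long word.";

// Split at runs of whitespace, commas and hyphens only.
// The hyphen is last in the class, so it is taken literally.
var parts = sentence.split(/[\s,-]+/);

console.log(parts.join(" | "));
// "Überlandstraße" stays whole; "word." keeps its period
// because "." is not part of the character class.
```

If periods or other marks should also separate words, they would have to be added to the class explicitly.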


PointedEars
 
Thomas 'PointedEars' Lahn

Zifud said:
Not all browsers will tolerate regular expressions in split(),

The RegExp object and Regular Expression literals were introduced with
JavaScript 1.2 (NN 4.0, June 1997), and JScript 3.0 (IE 4.0, October 1997).

Since then, the ECMA WG has produced two more editions of ECMAScript, where
Edition 3 (December 1999, March 2000) (finally) formally specified that
feature. No scriptable user agent can survive in the mid-term without
supporting it nowadays.

I'd say your information is /slightly/ outdated.

it may be safer to replace all non-word characters with a space then split
on that:

Unlikely.

var newString = string.replace(/\W+/g,' ');

That does not recognize "Überlandstraße" as one word ...

var words = newString.split(' ');

... and makes ["", "berlandstra", "e"] out of it.

For the OP to consider...

... and to reject.
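The effect is easy to reproduce. The sketch below also shows one possible alternative that postdates this thread: Unicode property escapes (\p{L}, \p{N} with the u flag), which I am assuming are available in your engine (they were standardised in ECMAScript 2018). Treat it as an illustration, not as what anyone here proposed:

```javascript
var word = "Überlandstraße";

// \W treats "Ü" and "ß" as non-word characters, so the word falls apart.
var asciiSplit = word.split(/\W+/);

// Assumption: the engine supports Unicode property escapes (ES2018+).
// Split at runs of anything that is not a letter, a number or "_".
var unicodeSplit = word.split(/[^\p{L}\p{N}_]+/u);

console.log(asciiSplit);   // leading "", then "berlandstra", then "e"
console.log(unicodeSplit); // the word in one piece
```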


PointedEars
 
Thomas 'PointedEars' Lahn

Anat said:
I want to take a given string, and make certain words hyperlinks.
For example:
"Hello world, this is a wonderful day!"
I'd like the words world, wonderful and day to be hyperlinks, therefore
after my manipulation it should be:
"Hello <a href=...>world</a>, this is a <a href=...>wonderful</a> <a
href=...>day</a>!"
Using the split method is no good, because the whitespace, commas and
other punctuation marks are lost.
[...]
Any ideas on how I can locate words and replace them without losing
punctuation marks on the way?

From your use of the `a' element, I assume this is for `innerHTML'.
Please note that this property is proprietary, and its behavior is
both implementation-dependent and context-dependent.

You could use \b of course, but that will get you in trouble with
words containing non-ASCII characters. Therefore:

var s = ...innerHTML;
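// Caution: the trailing delimiter is consumed by the match, so a listed
// word that directly follows another one ("wonderful day") is skipped.
s = s.replace(
/(^|[\s-])(world|wonderful|day)([\s,;.?!-]|$)/g,
'$1<a href="http://en.wikipedia.org/wiki/$2">$2<\/a>$3');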
...innerHTML = s;

Or with positive lookahead (requires JavaScript 1.5, JScript 5.5,
ECMAScript Ed. 3 [1]):

...
s = s.replace(
/([\s-]|^)(world|wonderful|day)(?=([\s,;.?!-]|$))/g,
'$1<a href="http://en.wikipedia.org/wiki/$2">$2<\/a>');
...

(Use those character classes, unless you want to code all UCS
[non-]word characters as compactly defined in the XML grammar.)
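To see how the two replacement patterns differ, here is a runnable sketch on the OP's example sentence. It works on a plain string rather than on innerHTML (the variable names are mine). The first pattern consumes the delimiter after each match, so a listed word that immediately follows another listed word finds no leading delimiter left to match; the lookahead version leaves that delimiter in place:

```javascript
var input = "Hello world, this is a wonderful day!";

// Variant 1: the trailing delimiter is captured and therefore consumed.
var consuming = input.replace(
  /(^|[\s-])(world|wonderful|day)([\s,;.?!-]|$)/g,
  '$1<a href="http://en.wikipedia.org/wiki/$2">$2</a>$3');

// Variant 2: the trailing delimiter is only checked, not consumed.
var lookahead = input.replace(
  /([\s-]|^)(world|wonderful|day)(?=([\s,;.?!-]|$))/g,
  '$1<a href="http://en.wikipedia.org/wiki/$2">$2</a>');

console.log(consuming); // "day" stays unlinked here
console.log(lookahead); // all three words are linked, "," and "!" are kept
```

For adjacent target words, the lookahead form appears to be the one that gives the output the OP asked for.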

I remember having suggested a probably more sophisticated replacement
approach a few months ago, one that also points out the difficulties
with general replacing. Search the (Google Groups) archives for "IBM
replace author:pointedEars" or so.

When implementing this, you should additionally take into account that too
many hyperlinks in continuous text can make that text hard to read.

Thanks again!!!

You are welcome. But please get your Exclamation Mark key repaired.


PointedEars
___________
[1] <URL:http://pointedears.de/es-matrix>
 
