Capitalize regex

Marek Mand · May 2, 2004

Janwillem Borleffs · May 2, 2004

Marek said:
How would I get the "intended output" from it, that is:
"Marek Mänd-Österreich A"
Mozilla works as intended and desired; MSIE fails totally.
So what is the "word boundary" regex for languages other than English?
I think JavaScript regular expressions are weak and unusable.

var name = 'marek mänd-österreich a';

name = name.replace(/(^|\s|\-)(.)/g, function (c) { return c.toUpperCase(); });
alert(name);

JW

Marek Mand · May 2, 2004

Marek said:
// http://www.faqts.com/knowledge_base/view.phtml/aid/15940
correctedname = name.replace(/\b\w+b/g, function(word) {

Just a correction to my original post:

Of course, what I had in my file was this (the post was missing a \):
correctedname = name.replace(/\b\w+\b/g, function(word) {
The typo crept in only when I wrote it out here, but correcting it
still shows how MSIE differs from Mozilla.

Marek Mand · May 2, 2004

var name = 'marek mänd-österreich a';
name = name.replace(/(^|\s|\-)(.)/g, function (c) { return c.toUpperCase(); });
alert(name);

So I understand that it is important to define explicitly, on my own,
what makes up a separator between words.
I understand I have to add the comma, semicolon, colon and lots of
other things to that for it to work more reliably.

However, I have no idea what all those symbols should be, taking into
account that there are lots of dots, full stops and weird symbols in
Unicode. Is there a ready-made list somewhere that says how word
boundaries should be treated, in the sense of what splits words?

Thanks for the answer! =D
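
For illustration, one way to make the separator list explicit and easy to extend is to build the regex from a string of separator characters. This is only a sketch: the helper name capitalizeAfter and the particular separator list are assumptions made up for this example, not a complete list of Unicode punctuation.

```javascript
// Hypothetical helper: upper-cases the first character of the string and
// every character that directly follows one of the given separators.
function capitalizeAfter(text, separators) {
  // Escape characters that are special inside a character class.
  var escaped = separators.replace(/[\\\]\[^-]/g, '\\$&');
  var re = new RegExp('(^|[' + escaped + '])([^' + escaped + '])', 'g');
  return text.replace(re, function (match, sep, ch) {
    return sep + ch.toUpperCase();
  });
}

// Assumed separator list: whitespace, hyphen and some common punctuation.
var result = capitalizeAfter('marek mänd-österreich a', ' \t\n-,;:.');
// result: "Marek Mänd-Österreich A"
```

More separators can be added by extending the string passed as the second argument.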

Lasse Reichstein Nielsen · May 2, 2004

Marek Mand said:
So I understand that it is important to define explicitly, on my own,
what makes up a separator between words.

Yes. Javascript, or more precisely: ECMAScript v3, defines the \b
regexp as a boundary between a word character and a non-word
character. It also defines word character as [0-9a-zA-Z_]
(section 15.10.2.6).

Apparently Mozilla isn't following the ECMAScript standard.
I like their approach better, but it's not something you can
rely on.
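
A small sketch of what that definition implies for the example string, assuming an engine that follows the standard's ASCII-only \w (presumably this is the behaviour reported for MSIE):

```javascript
var name = 'marek mänd-österreich a';

// \w is [0-9a-zA-Z_] here, so "ä" and "ö" are non-word characters and
// \b reports a boundary on both sides of each of them.
var words = name.match(/\b\w+\b/g);
// words: ["marek", "m", "nd", "sterreich", "a"]

var capitalized = name.replace(/\b\w+\b/g, function (word) {
  return word.charAt(0).toUpperCase() + word.substring(1);
});
// capitalized: "Marek MäNd-öSterreich A"
```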

/L

Marek Mand · May 2, 2004

Yes. Javascript, or more precisely: ECMAScript v3, defines the \b
regexp as a boundary between a word character and a non-word
character. It also defines word character as [0-9a-zA-Z_]
(section 15.10.2.6).

It would be a blast if in a later JavaScript version I could define
character classes of my own at runtime, for each regex object individually.

[[:estoniancharacters:]]
[[:germancharacters:]]
[[:danishcharacters:]]
[[:finnishcharacters:]]

This would definitely save the programmer time and make the code more
readable: \w complemented with the characters specific to a 'foreign'
(non-English) language would not have to be written out every time one
needs it, but could be replaced with a 'virtual', self-defined
character class.

I don't know whether it would be death to the regex engine's performance,
but the English-centricity and the lack of options for easy
customisation are what make the JavaScript regexes weak.
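
Something close to this can be approximated today by assembling the pattern from strings with the RegExp constructor. The sketch below is only an illustration: the letterClasses table, its deliberately incomplete letter sets, and the helper names are all assumptions invented for this example, and it assumes the accented letters are stored as single precomposed characters.

```javascript
// Hypothetical, hand-picked letter sets; illustrative, not complete alphabets.
var letterClasses = {
  estonian: 'a-zA-ZäöüõšžÄÖÜÕŠŽ',
  german: 'a-zA-ZäöüßÄÖÜ'
};

// Hypothetical helper: builds a regex that matches runs of the named class.
function wordPattern(className) {
  return new RegExp('[' + letterClasses[className] + ']+', 'g');
}

// Upper-cases the first letter of every run matched by the chosen class.
function capitalizeWords(text, className) {
  return text.replace(wordPattern(className), function (word) {
    return word.charAt(0).toUpperCase() + word.substring(1);
  });
}

var capitalized = capitalizeWords('marek mänd-österreich a', 'estonian');
// capitalized: "Marek Mänd-Österreich A"
```

The class is written out once, in the table, and every regex that needs it is built from there.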

Apparently Mozilla isn't following the ECMAScript standard.

Just FYI for others: Opera does it like MSIE.
On a loosely related note, there is a fun argument about CSS
text-transform: capitalize and what counts as a word:
http://forums.mozillazine.org/viewtopic.php?t=46482&highlight

I like their approach better, but it's not something you can
rely on.

Me too ;D

Fox · May 3, 2004

Marek said:
<script>
var newval = '';
var name = 'marek mänd-österreich a';

// http://www.faqts.com/knowledge_base/view.phtml/aid/15940
correctedname = name.replace(/\b\w+b/g, function(word) {
    return word.substring(0,1).toUpperCase()
        + word.substring(1);
});
alert(correctedname);
</script>

How would I get the "intended output" from it, that is:
"Marek Mänd-Österreich A"
Mozilla works as intended and desired; MSIE fails totally.
So what is the "word boundary" regex for languages other than English?
I think JavaScript regular expressions are weak and unusable.

\S is better than \w at capturing "foreign" characters [charcodes > 127
(they're not foreign to you)] -- however, it *includes* the hyphen
character in its set.
\S is roughly the same as [^ \f\n\r\t\v] (exactly [^\s])

if you add the hyphen to the list:

String.prototype.initialCaps = function()
{
    return this.replace(/[^\s-]+/g,
        function(str)
        {
            return str.charAt(0).toUpperCase() + str.substring(1);
        });
};

it should work as expected (at least for your example):

var name = 'marek mänd-österreich a';

alert( name.initialCaps() );

Dr John Stockton · May 5, 2004

JRS: In article <[email protected]>, seen in news:comp.lang.javascript,
Lasse Reichstein Nielsen said:
Marek Mand said:

So I understand that it is important to define explicitly, on my own,
what makes up a separator between words.

Yes. Javascript, or more precisely: ECMAScript v3, defines the \b
regexp as a boundary between a word character and a non-word
character. It also defines word character as [0-9a-zA-Z_]
(section 15.10.2.6).

That ought to be changed in future ECMA.

The underline character, and the digits, cannot normally be part of a
word; but, at least since the days of the fabulous Æsop, certain other
characters (joined and accented letters, for example) can.

The simple fix is to change the word "word" to "identifier" or other
suitable computer-jargon term (not "name").

After that, the term "word separator" becomes available to mean what it
means to the ordinary literate Briton, Dane, Estonian, etc.
