Capitalize regex

Marek Mand

<script>
var newval = '';
var name = 'marek mänd-österreich a';

// http://www.faqts.com/knowledge_base/view.phtml/aid/15940
var correctedname = name.replace(/\b\w+\b/g, function(word) {
    return word.substring(0,1).toUpperCase()
        + word.substring(1);
});
alert(correctedname);
</script>


How would I get the intended output from it, that is:
"Marek Mänd-Österreich A"
Mozilla works as intended and desired; MSIE fails totally.
So what is the "word boundary" regex for languages other than English?
I think JavaScript regular expressions are weak and unusable.
 
Janwillem Borleffs

Marek said:
How would I get the intended output from it, that is:
"Marek Mänd-Österreich A"
Mozilla works as intended and desired; MSIE fails totally.
So what is the "word boundary" regex for languages other than English?
I think JavaScript regular expressions are weak and unusable.

var name = 'marek mänd-österreich a';

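// Uppercase the first character after the start of the string, whitespace or a hyphen.
// The return statement must stay on one line: a line break after "return" makes
// the function return undefined.
name = name.replace(/(^|\s|\-)(.)/g, function (c) { return c.toUpperCase(); });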
alert(name);


JW
 
Marek Mand

var name = 'marek mänd-österreich a';
name = name.replace(/(^|\s|\-)(.)/g, function (c) { return c.toUpperCase(); });
alert(name);

So I understand that I have to define explicitly, on my own, what counts
as a separator between words.
To make this work more reliably I would have to add the comma, semicolon,
colon and many more characters to that list.

However, I have no idea what all those symbols should be, given that
Unicode contains lots of dots, full stops and other unusual symbols.
Is there a ready-made list somewhere that says which characters split
words, that is, how word boundaries should be treated?


Thanks for the answer! =D
 
Lasse Reichstein Nielsen

Marek Mand said:
So I understand that I have to define explicitly, on my own, what counts
as a separator between words.

Yes. Javascript, or more precisely: ECMAScript v3, defines the \b
regexp as a boundary between a word character and a non-word
character. It also defines word character as [0-9a-zA-Z_]
(section 15.10.2.6).
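
A small sketch of what that definition implies for the example string, assuming an engine that follows the specification on this point. The variable names are made up for illustration; the regex is the one from the first post.

```javascript
// With only [0-9a-zA-Z_] counted as word characters, an accented letter
// is a non-word character, so \b also matches on both sides of it.
var sample = 'marek mänd-österreich a';

// What \w+ treats as "words" in the sample.
var pieces = sample.match(/\w+/g);
// pieces: ['marek', 'm', 'nd', 'sterreich', 'a']

// The capitalizer from the first post, applied under that definition.
var capitalized = sample.replace(/\b\w+\b/g, function (word) {
    return word.substring(0, 1).toUpperCase() + word.substring(1);
});
// capitalized: 'Marek MäNd-öSterreich A'
```

The letters "ä" and "ö" split the words, which is why the first letter after each of them gets capitalized.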

Apparently Mozilla isn't following the ECMAScript standard.
I like their approach better, but it's not something you can
rely on.

/L
 
Marek Mand

Yes. Javascript, or more precisely: ECMAScript v3, defines the \b
regexp as a boundary between a word character and a non-word
character. It also defines word character as [0-9a-zA-Z_]
(section 15.10.2.6).

It would be a blast if, in a later JavaScript version, I could define
character classes of my own at runtime, for each regex object individually:

[[:estoniancharacters:]]
[[:germancharacters:]]
[[:danishcharacters:]]
[[:finnishcharacters:]]

This would definitely save the programmer time and make the code more
readable: the language-specific characters that have to complement \w
for any language other than English would not need to be written out
at every occurrence, but could be replaced with a self-defined
'virtual' character class.

I don't know whether it would kill regex engine performance,
but the English-centric design and the lack of easy customisation
options are what make JavaScript regexes weak.
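
Something close to this can be approximated today by keeping the letter lists in strings and building the pattern with the RegExp constructor. A minimal sketch, not a built-in feature: LETTERS, wordPattern and capitalizeWords are hypothetical names, and the letter lists are illustrative rather than complete alphabets.

```javascript
// Hypothetical per-language letter lists, written as the inside of a
// character class. Illustrative only, not complete alphabets.
var LETTERS = {
    estonian: 'a-zA-ZäöüõšžÄÖÜÕŠŽ',
    german: 'a-zA-ZäöüßÄÖÜ'
};

// Build a "run of letters" pattern for one language at runtime.
function wordPattern(language) {
    return new RegExp('[' + LETTERS[language] + ']+', 'g');
}

// Uppercase the first letter of every run of letters.
function capitalizeWords(text, language) {
    return text.replace(wordPattern(language), function (word) {
        return word.charAt(0).toUpperCase() + word.substring(1);
    });
}

var result = capitalizeWords('marek mänd-österreich a', 'german');
// result: 'Marek Mänd-Österreich A'
```

The hyphen is not in either list, so it separates words without being named as a separator.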

Apparently Mozilla isn't following the ECMAScript standard.

Just an FYI for others: Opera does it like MSIE.
On a loosely related note, there is a fun argument about what counts as
a word for CSS text-transform: capitalize:
http://forums.mozillazine.org/viewtopic.php?t=46482&highlight
I like their approach better, but it's not something you can
rely on.

Me too ;D
 
Fox

Marek said:
<script>
var newval = '';
var name = 'marek mänd-österreich a';

// http://www.faqts.com/knowledge_base/view.phtml/aid/15940
var correctedname = name.replace(/\b\w+\b/g, function(word) {
    return word.substring(0,1).toUpperCase()
        + word.substring(1);
});
alert(correctedname);
</script>

How would I get the intended output from it, that is:
"Marek Mänd-Österreich A"
Mozilla works as intended and desired; MSIE fails totally.
So what is the "word boundary" regex for languages other than English?
I think JavaScript regular expressions are weak and unusable.

\S is better than \w at capturing "foreign" characters [char codes > 127
(they're not foreign to you)]; however, it *includes* the hyphen
character in its set.
\S is the same as [^ \f\n\r\t\v] (or [^\s])

if you add the hyphen to the list:

String.prototype.initialCaps = function()
{
    // Treat every run of characters that are neither whitespace nor a hyphen as a word.
    return this.replace(/[^\s-]+/g,
        function(str)
        {
            return str.charAt(0).toUpperCase() + str.substring(1);
        });
};

it should work as expected (at least for your example):

var name = 'marek mänd-österreich a';

alert( name.initialCaps() );
 
Dr John Stockton

JRS: In article <[email protected]>, seen in
news:comp.lang.javascript said:
Marek Mand said:
So I understand that I have to define explicitly, on my own, what counts
as a separator between words.

Yes. Javascript, or more precisely: ECMAScript v3, defines the \b
regexp as a boundary between a word character and a non-word
character. It also defines word character as [0-9a-zA-Z_]
(section 15.10.2.6).

That ought to be changed in future ECMA.

The underline character, and the digits, cannot normally be part of a
word; but, at least since the days of the fabulous Æsop, certain other
characters (joined and accented letters, for example) can.

The simple fix is to change the word "word" to "identifier" or other
suitable computer-jargon term (not "name").

After that, the term "word separator" becomes available to mean what it
means to the ordinary literate Briton, Dane, Estonian, etc.
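
For what it is worth, later ECMAScript editions (ES2018 onward) added Unicode property escapes, which allow a "word" to be described as a run of letters in any script without relying on \b or \w. A hedged sketch, assuming an engine that supports the `u` flag and \p{...}; capitalizeUnicodeWords is a hypothetical name, and treating letters plus combining marks as a word is a simplification rather than a full word-boundary algorithm.

```javascript
// Assumes ES2018 Unicode property escapes (the `u` flag with \p{...}).
// \p{L} matches any letter, \p{M} any combining mark; a "word" here is
// simply a run of those. This is a simplification, not a full
// word-segmentation algorithm.
function capitalizeUnicodeWords(text) {
    return text.replace(/[\p{L}\p{M}]+/gu, function (word) {
        // Take the first code point, so letters outside the BMP stay intact.
        var first = String.fromCodePoint(word.codePointAt(0));
        return first.toUpperCase() + word.slice(first.length);
    });
}

var unicodeResult = capitalizeUnicodeWords('marek mänd-österreich a');
// unicodeResult: 'Marek Mänd-Österreich A'
```

Digits and the underscore are not letters, so they act as separators here, which is closer to the everyday meaning of "word" than the identifier-style definition.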
 
