[newbie] upper to lower first letter of a word

Yvon Thoraval · Sep 23, 2003

I recently got a list of vintages (more than 500 items) with poor
typography. For example, I have:

Côte de beaune-villages

instead of :

Côte de Beaune-Villages

Crémant d'alsace

instead of :

Crémant d'Alsace

I'm wondering how to change lower case to upper case, and whether
a regex could do the trick.

Something like:

every letter following a " ", "-" or "'" should be upper case, unless the
word belongs to a black list:

black_list = %w{d de du la le sec sur entre etc...}

Yvon Thoraval · Sep 23, 2003

Mark J. Reed said:
string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Thanks a lot.

Yvon Thoraval · Sep 23, 2003

Yvon Thoraval said:
string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Thanks a lot.

It seems to be a little trickier, because accented characters are
treated as word boundaries (\b). For example:

Vosne-romanée
becomes:
Vosne-RomanéE

So instead of \b, I would have to exclude a list of characters:
[àäâéèêîöôüù]

Yvon Thoraval · Sep 23, 2003

Mark J. Reed said:
Really? That's arguably a bug. What character encoding are you using?

I'm more or less sure about that, because even if I use:
l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I get:
MâCon SupéRieur
when the input was:
Mâcon supérieur

Accented letters should be in \w, not \W, and therefore the
space between one and an adjacent letter should not match \b.
But Ruby regexes may be ASCII-only, and even if not, they're probably
Latin-1-only. So, for instance, they wouldn't work on UTF-8 strings.

Precisely, I'm using UTF-8.

However, I can try ISO-8859-1: my text editor (Pepper
on Mac OS X) can transcode from UTF-8 to ISO-8859-1 in two clicks plus
one cut and paste.
That sounds strange to me, because Ruby comes from Japan, where "special"
characters are everyday characters.
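
The "RomanéE" and "SupéRieur" results are consistent with a regex engine that works on bytes rather than on characters. Here is a minimal sketch of that idea, assuming a current Ruby (2.4 or later) and using `String#b` to imitate byte-oriented matching; it does not reproduce the 2003 interpreter itself. The two bytes of a UTF-8 "é" are not ASCII word characters, so `\b` sees a boundary on both sides of them:

```ruby
# "é" is two bytes in UTF-8 but a single byte in ISO-8859-1.
utf8_bytes   = "é".bytes                       # [195, 169]
latin1_bytes = "é".encode("ISO-8859-1").bytes  # [233]

# Imitate a byte-oriented engine: treat the UTF-8 string as raw bytes,
# run the substitution from the thread, then view the result as UTF-8 again.
raw = "Vosne-romanée".b
mangled = raw.gsub(/\b[a-z]+/) { |w| w.capitalize }.force_encoding("UTF-8")

puts mangled
```

In the byte view, "roman" and the final "e" are separate runs of ASCII letters, which is why the trailing "e" gets capitalized on its own.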

[snip]

The block has to compensate for that. Something like this:

string.gsub!(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }

Except that [a-z] won't match accented characters, so it's more like this:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

And if the names aren't limited to French, then even more special characters
creep in . . .

Yes, right, I know. For the time being, it's only about French and German
accented characters...

However, because the vintages are classified by area, I might have to change
the regex depending on the region...
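
On a current Ruby, one way to avoid listing accented characters per region is the Unicode property class `\p{L}`, which matches any letter. The following is only a sketch under that assumption (Ruby 1.9 or later with UTF-8 strings); `fix_case` and `BLACK_LIST` are names made up for the illustration, and the black list is the one from the original question without its "etc...":

```ruby
# Words that stay lower case (taken from the list in the original question).
BLACK_LIST = %w[d de du la le sec sur entre].freeze

# Capitalize every run of letters unless it is on the black list.
# \p{L} matches any Unicode letter, so accented characters need no
# hand-written character class and no \b anchor.
def fix_case(name)
  name.gsub(/\p{L}+/) { |word| BLACK_LIST.include?(word) ? word : word.capitalize }
end

["Côte de beaune-villages", "Crémant d'alsace", "Vosne-romanée"].each do |name|
  puts fix_case(name)
end
```

Note that a black-listed word at the very start of a name would also stay lower case with this sketch, so names beginning with "la" or "le" would need an extra rule.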

Robert Klemme · Sep 24, 2003

Mark J. Reed said:
Really? That's arguably a bug. What character encoding are you using?

I'm more or less sure about that, because even if I use:
l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I'd omit the "\b" at the beginning, since "é" then still matches a word
boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

Regards

robert

Yvon Thoraval · Sep 24, 2003

Robert Klemme said:
I'd omit the "\b" at the beginning, since "é" then still matches a word
boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Yes, fine. I also discovered that capitalization doesn't work on
accented characters (such as é).

So I've added another step for those "special" characters when they are
the first letter of a word.
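
The extra step itself is not shown in the thread. One possible way to do it, sketched here with a hand-written mapping and `String#tr` (the method name `capitalize_first` and both tables are assumptions made for the example, and multibyte `tr` needs Ruby 1.9 or later):

```ruby
# Hypothetical mapping tables; extend them for other languages as needed.
LOWER_ACCENTED = "àâäçéèêëîïôöùûü"
UPPER_ACCENTED = "ÀÂÄÇÉÈÊËÎÏÔÖÙÛÜ"

# Upper-case only the first character, covering ASCII letters and the
# accented letters listed above; the rest of the word is left untouched.
def capitalize_first(word)
  return word if word.empty?
  word[0].tr("a-z" + LOWER_ACCENTED, "A-Z" + UPPER_ACCENTED) + word[1..-1]
end

puts capitalize_first("échezeaux")
```

Unlike `String#capitalize`, this sketch does not lower-case the remaining letters of the word.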

Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

OK; however, my list contains no punctuation such as ?!;: and so on, only " " and "-".

Carlos · Sep 26, 2003

Yes, fine. I also discovered that capitalization doesn't work on
accented characters (such as é).

You can use an old library named unicode:

irb(main):001:0> $KCODE="u"
=> "u"
irb(main):002:0> require "unicode"
=> true
irb(main):003:0> Unicode.capitalize("àëíôů")
=> "Àëíôů"

http://raa.ruby-lang.org/list.rhtml?name=unicode
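
On a current Ruby the extra library may not be needed. As a small sketch, assuming Ruby 2.4 or later (where the built-in case methods apply Unicode case mapping) and UTF-8 source text:

```ruby
# Assumes Ruby 2.4 or later, where String#capitalize and String#upcase
# apply Unicode case mapping instead of ASCII-only mapping.
sample = "àëíôů"

puts sample.capitalize
puts sample.upcase
puts "élevé".capitalize
```

On older interpreters these calls leave the non-ASCII letters unchanged, which matches what was observed in this thread.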

Yvon Thoraval · Sep 26, 2003

Carlos said:
You can use an old library named unicode:

irb(main):001:0> $KCODE="u"
=> "u"
irb(main):002:0> require "unicode"
=> true
irb(main):003:0> Unicode.capitalize("àëíôů")
=> "Àëíôů"

http://raa.ruby-lang.org/list.rhtml?name=unicode

Thanks, everyone!

Thomas A. Reilly · Sep 27, 2003

I would appreciate it if someone could give me a regexp that would
split the following,
for example -
"clonidine300 mg" into "clonidine 300 mg"

I have a bunch of drug data where the dose had been typed together.

Thanks

Jim Freeze · Sep 27, 2003

I would appreciate it if someone could give me a regexp that would
split the following,
for example -
"clonidine300 mg" into "clonidine 300 mg"

I have a bunch of drug data where the dose had been typed together.

There's probably more than one way to do this. Here's one way:

irb(main):001:0> s="clonidine300 mg"
=> "clonidine300 mg"
irb(main):005:0> s.scan(/[a-zA-Z]+|\d+/) { |i| p i }
"clonidine"
"300"
"mg"

Jim Freeze · Sep 27, 2003

There's probably more than one way to do this. Here's one way:

And yet another way:

irb(main):018:0> m = /(\w+?)(\d+)\s+(\w+)/.match(s)
=> #<MatchData:0x81c2200>
irb(main):019:0> m[1]
=> "clonidine"
irb(main):020:0> m[2]
=> "300"
irb(main):021:0> m[3]
=> "mg"
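
Both approaches give the pieces; to get the requested string back, the pieces still have to be joined. A minimal sketch, with `space_out_scan` and `space_out_match` as made-up helper names:

```ruby
# Variant 1: collect runs of letters or digits and rejoin with single spaces.
def space_out_scan(text)
  text.scan(/[a-zA-Z]+|\d+/).join(" ")
end

# Variant 2: use the three captures of the match; return the input
# unchanged when the pattern does not match at all.
def space_out_match(text)
  m = /(\w+?)(\d+)\s+(\w+)/.match(text)
  m ? m.captures.join(" ") : text
end

puts space_out_scan("clonidine300 mg")
puts space_out_match("clonidine300 mg")
```

The scan variant throws away everything that is not a letter or digit, so a dose such as "0.5 mg" would lose its decimal point; that is worth checking against the real data.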

Jonathan Lim · Sep 27, 2003

I would appreciate it if someone could give me a regexp that would
split the following,
for example -
"clonidine300 mg" into "clonidine 300 mg"

/([^\d]+)(\d+)\s*mg/
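
The pattern alone only finds the parts. One way it might be used in a substitution is sketched below; `normalize_dose` is a made-up name, and the `strip` call is an addition that handles input where the space is already present:

```ruby
# Rebuild the string from the captured name and dose.
# strip removes a space that may already separate the two parts.
def normalize_dose(text)
  text.sub(/([^\d]+)(\d+)\s*mg/) { "#{$1.strip} #{$2} mg" }
end

puts normalize_dose("clonidine300 mg")
```

Because the pattern uses `\s*`, this also normalizes input where the unit is glued to the number, such as "300mg".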

Martin DeMello · Sep 27, 2003

Thomas A. Reilly said:
I would appreciate it if someone could give me a regexp that would
split the following,
for example -
"clonidine300 mg" into "clonidine 300 mg"

I have a bunch of drug data where the dose had been typed together.

Are you likely to have numbers in the drug's name? Don't forget to
include that as a test case if so. The following puts a space before the
string of digits immediately preceding " mg" if it doesn't already have
one:

s.sub(/([^\d\s])(\d+ mg)/, '\1 \2')

martin
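
One thing to watch with substitutions of this kind: a double-quoted replacement string containing `#{$1}` is built before `sub` runs, so it sees the captures of an earlier match (or nil), not of this one. The sketch below shows the two forms that do see the current match. The helper names are made up, and the `[^\d\s]` class in front of the digits is one possible choice that leaves already-spaced input alone:

```ruby
# '\1 \2' (single quotes) is expanded by sub itself, after the match.
def add_space(text)
  text.sub(/([^\d\s])(\d+ mg)/, '\1 \2')
end

# The block form is an equivalent alternative; $1 and $2 are set
# by the time the block is called.
def add_space_block(text)
  text.sub(/([^\d\s])(\d+ mg)/) { "#{$1} #{$2}" }
end

puts add_space("clonidine300 mg")
puts add_space_block("clonidine300 mg")
```

As suggested above, a name containing digits makes a useful test case: with this pattern only the digits directly in front of " mg" are split off.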

Joel VanderWerf · Sep 27, 2003

Martin said:
Are you likely to have numbers in the drug's name? Don't forget to
include that as a test case if so. The following puts a space before the
string of digits immediately preceding " mg" if it doesn't already have
one:

s.sub(/([^\d\s])(\d+ mg)/, '\1 \2')

martin

Another way to do it:

irb(main):001:0> s = "clonidine300 mg"
=> "clonidine300 mg"
irb(main):002:0> s[/(?=\d+ mg)/] = " "
=> " "
irb(main):003:0> s
=> "clonidine 300 mg"
irb(main):004:0>

Rodrigo B. de Oliveira · Sep 27, 2003

irb(main):001:0> s = "clonidine300 mg"
=> "clonidine300 mg"
irb(main):002:0> s[/(?=\d+ mg)/] = " "
=> " "

What's exactly happening here?

thanks,
Rodrigo

gabriele renzi · Sep 27, 2003

On Sun, 28 Sep 2003 03:01:27 +0900, "Rodrigo B. de Oliveira" wrote:

irb(main):001:0> s = "clonidine300 mg"
=> "clonidine300 mg"
irb(main):002:0> s[/(?=\d+ mg)/] = " "
=> " "

What's exactly happening here?

I think:
s[/pattern/] returns the matching part of the string.
s[/pattern/]= value assigns 'value' to that piece of the string.
(?=regex) is known as a "zero-width positive lookahead" and means that
the part of the string it matches is not consumed.

Joel VanderWerf · Sep 27, 2003

Rodrigo said:
irb(main):001:0> s = "clonidine300 mg"
=> "clonidine300 mg"
irb(main):002:0> s[/(?=\d+ mg)/] = " "
=> " "

What's exactly happening here?

String#[]=, as in "s1[pat] = s2", is a destructive slice operator. It
replaces the first match of pat with the r.h.s. (raising IndexError if
no match).

In this case, pat is /(?=\d+ mg)/, which has a lookahead pattern
(?=...). This lookahead expression matches the point in the string *just
before* the match of /\d+ mg/, but it doesn't consume the "300 mg". So
slicing out the match (which is empty) and substituting " " has the
effect of inserting a space before the match of /\d+ mg/.
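
The same lookahead idea can be sketched in a non-destructive form, together with the IndexError case mentioned above. `insert_space` is a made-up name, and the lookbehind `(?<=[^\d\s])` is an addition (it needs Ruby 1.9 or later) so that a dose already separated by a space is left alone:

```ruby
# Non-destructive variant: sub returns a new string, and returns the
# text unchanged when nothing matches.
# (?<=[^\d\s]) is an added lookbehind (an assumption, not from the thread):
# the insertion point must follow a character that is neither digit nor space.
def insert_space(text)
  text.sub(/(?<=[^\d\s])(?=\d+ mg)/, " ")
end

puts insert_space("clonidine300 mg")

# The destructive form raises IndexError when the pattern does not match.
s = "aspirin"
begin
  s[/(?=\d+ mg)/] = " "
rescue IndexError => e
  puts "no match: #{e.message}"
end
```

For a batch job over many records, the `sub` form avoids having to rescue the exception for every entry that has no dose.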

Rodrigo B. de Oliveira · Sep 27, 2003

Thanks! Really beautiful.

Rodrigo

Thomas A. Reilly · Sep 29, 2003

Thanks a lot everyone.
The suggestions worked fine.

Tom
