[newbie] upper to lower first letter of a word

Yvon Thoraval

Recently I got a vintage list (more than 500 items) with poor typography; for
example, I have:

Côte de beaune-villages

instead of :

Côte de Beaune-Villages

Crémant d'alsace

instead of :

Crémant d'Alsace

I wonder how to change lower case to upper case, and whether a regex
could do the trick.

Something like:

every letter following a " ", "-" or "'" should be upper case, unless the
word belongs to a black list:

black_list = %w{d de du la le sec sur entre etc...}
 
Yvon Thoraval

Yvon Thoraval said:
string.gsub!(/\b[a-z]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Thanks a lot °;)

It seems it's a little trickier, because accented characters are
treated as word boundaries (\b). For example:

Vosne-romanée
becomes :
Vosne-RomanéE

So instead of \b I would have to exclude a list of chars:
[àäâéèêîöôüù]
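
For readers on a current Ruby, here is a minimal sketch of the same idea with a Unicode-aware pattern. It assumes Ruby 2.4 or later (where String#capitalize handles non-ASCII letters) and UTF-8 strings. BLACK_LIST and fix_case are hypothetical names, and the list holds only the words quoted in the first post.

```ruby
# Hypothetical black list: only the words quoted in the first post.
BLACK_LIST = %w[d de du la le sec sur entre].freeze

# \p{L}+ matches a run of Unicode letters, so an accented letter such as
# "é" stays inside the word instead of ending the match.
def fix_case(name)
  name.gsub(/\p{L}+/) { |w| BLACK_LIST.include?(w) ? w : w.capitalize }
end

puts fix_case("Côte de beaune-villages")  # Côte de Beaune-Villages
puts fix_case("Vosne-romanée")            # Vosne-Romanée
```

Note that capitalize also lowercases the rest of each word, so an all-caps token would lose its capitals.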
 
Yvon Thoraval

Mark J. Reed said:
Really? That's arguably a bug. What character encoding are you using?

I'm (more or less) sure about that, because even if I put:
l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

i get :
MâCon SupéRieur
when input was :
Mâcon supérieur
Accented letters should be in \w, not \W, and therefore the
space between one and an adjacent letter should not match \b.
But Ruby regexes may be ASCII-only, and even if not, they're probably
Latin-1-only. So, for instance, they wouldn't work on UTF-8 strings.

Precisely, I'm using UTF-8 °;)
However, I can try ISO-8859-1: my editor (Pepper on Mac OS X) can
transcode in two clicks plus one cut-and-paste from UTF-8 to ISO...
That sounds strange to me, because Ruby comes from Japan, where "special"
chars are everyday chars???
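
As a quick check of how the character classes discussed here behave in current Ruby, the sketch below assumes Ruby 2.4 or later and a UTF-8 source file. It says nothing about the 2003-era interpreter used in this thread. The helper name letter_classes is hypothetical.

```ruby
# Reports which classes match a single character.
# In current Ruby, \w is ASCII-only ([a-zA-Z0-9_]), while \p{L} and the
# POSIX bracket [[:alpha:]] are Unicode-aware.
def letter_classes(char)
  {
    word:  char.match?(/\w/),
    p_l:   char.match?(/\p{L}/),
    alpha: char.match?(/[[:alpha:]]/)
  }
end

p letter_classes("e")
p letter_classes("é")
```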

[snip]
The block has to compensate for that. Something like this:

string.gsub!(/(^|[- '])([a-z]+)/) { $1 + $2.capitalize }

Except that [a-z] won't match accented characters, so it's more like this:

string.gsub!(/(^|[- '])([a-záàâçéèêíìîóòôúùû]+)/) { $1 + $2.capitalize }

And if the names aren't limited to French, then even more special characters
creep in . . .

Yes, right, I know. For the time being I only deal with French and German
accented chars...

However, because vintages are classified by area, I might have to change
the regex depending on the region...
 
Robert Klemme

Mark J. Reed said:
Really? That's arguably a bug. What character encoding are you
using?

I'm (more or less) sure about that, because even if I put:
l.gsub!(/\b[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

I'd omit the "\b" at the beginning, since "é" then still matches a word
boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

Regards

robert
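
One detail worth checking with the negated-class variant: the apostrophe is not among the excluded characters, so "d'alsace" is treated as a single token. The sketch below compares the pattern as posted with a version that also excludes "'". SKIP and capitalize_tokens are hypothetical names, and SKIP is only a subset of the black list from the first post.

```ruby
SKIP = %w[d de du la le].freeze  # hypothetical subset of the black list

# Capitalizes every token matched by token_pattern unless it is in SKIP.
def capitalize_tokens(line, token_pattern)
  line.gsub(token_pattern) { |w| SKIP.include?(w) ? w : w.capitalize }
end

# Pattern as posted: "d'alsace" is one token.
p capitalize_tokens("Crémant d'alsace", /[^\s!?.;:-]+/)   # => "Crémant D'alsace"

# With "'" added to the class, "d" and "alsace" are separate tokens.
p capitalize_tokens("Crémant d'alsace", /[^\s!?.;:'-]+/)  # => "Crémant d'Alsace"
```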
 
Yvon Thoraval

Robert Klemme said:
I'd omit the "\b" at the beginning, since "é" then still matches a word
boundary:

l.gsub!(/[a-záàâçéèêíìîóòöôúùüû]+/) { |w| black_list.include?(w) ? w : w.capitalize }

Yes, fine. I also discovered that capitalization doesn't work on
accented chars (such as é),

so I added another step for those "special" chars when they are the first
letter of a word.
Alternatively:

l.gsub!(/[^\s!?.;:-]+/) {|w| black_list.include?(w) ? w : w.capitalize }

OK, however my list has no punctuation such as ?!;: ... only " " and "-".
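
The extra step for accented first letters is not shown in the thread. On a current Ruby it may no longer be needed: the sketch below assumes Ruby 2.4 or later, where String#upcase is Unicode-aware. upcase_first is a hypothetical helper, and "échezeaux" is just an illustrative input.

```ruby
# Upcases only the first character and leaves the rest of the word as is
# (unlike String#capitalize, which also lowercases the remainder).
def upcase_first(word)
  return word if word.empty?
  word[0].upcase + word[1..-1]
end

puts upcase_first("échezeaux")  # Échezeaux
```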
 
Thomas A. Reilly

I would appreciate it if someone could give me a regexp that would
split the following:
for example -
"clonidine300 mg" into "clonidine 300 mg"

I have a bunch of drug data where the dose had been typed together.

Thanks
 
Jim Freeze

I would appreciate it if someone could give me a regexp that would
split the following:
for example -
"clonidine300 mg" into "clonidine 300 mg"

I have a bunch of drug data where the dose had been typed together.

There's probably more than one way to do this. Here's one way:

irb(main):001:0> s="clonidine300 mg"
=> "clonidine300 mg"
irb(main):005:0> s.scan(/[a-zA-Z]+|\d+/) { |i| p i }
"clonidine"
"300"
"mg"
 
Jim Freeze

There's probably more than one way to do this. Here's one way:

And yet another way:


irb(main):018:0> m = /(\w+?)(\d+)\s+(\w+)/.match(s)
=> #<MatchData:0x81c2200>
irb(main):019:0> m[1]
=> "clonidine"
irb(main):020:0> m[2]
=> "300"
irb(main):021:0> m[3]
=> "mg"
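
Both approaches above leave the pieces to be joined back together, and match returns nil when the pattern does not fit. A minimal sketch of a helper that handles both follows. It uses a stricter, anchored pattern than the one posted, on the assumption that the name and the unit consist of ASCII letters only. split_dose is a hypothetical name.

```ruby
# Returns "name dose unit" with single spaces, or the input unchanged
# when it does not look like letters + digits + letters.
def split_dose(text)
  m = /\A([a-zA-Z]+)\s*(\d+)\s*([a-zA-Z]+)\z/.match(text)
  m ? m.captures.join(" ") : text
end

puts split_dose("clonidine300 mg")   # clonidine 300 mg
puts split_dose("clonidine 300 mg")  # clonidine 300 mg
puts split_dose("aspirin")           # aspirin
```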
 
Jonathan Lim

I would appreciate it if someone could give me a regexp that would
split the following:
for example -
"clonidine300 mg" into "clonidine 300 mg"

/([^\d]+)(\d+)\s*mg/
 
Martin DeMello

Thomas A. Reilly said:
I would appreciate it if someone could give me a regexp that would
split the following:
for example -
"clonidine300 mg" into "clonidine 300 mg"

I have a bunch of drug data where the dose had been typed together.

Are you likely to have numbers in the drug's name? Don't forget to
include that as a test case if so. The following puts a space before the
string of digits immediately preceding " mg" if it doesn't already have
one:

sub(/([a-zA-Z])(\d+ mg)/, '\1 \2')

martin
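
One thing to watch with calls of this shape: in a double-quoted replacement string, "#{$1}" is interpolated before sub runs, so it sees whatever $1 held from an earlier match (or nil), not the current one. The block form and single-quoted backreferences both avoid that. The sketch below assumes the drug name ends in an ASCII letter. Both method names are hypothetical.

```ruby
# Block form: $1 and $2 are read after the match has been made.
def space_with_block(s)
  s.sub(/([a-zA-Z])(\d+ mg)/) { "#{$1} #{$2}" }
end

# Single-quoted backreferences are filled in by sub itself.
def space_with_backrefs(s)
  s.sub(/([a-zA-Z])(\d+ mg)/, '\1 \2')
end

p space_with_block("clonidine300 mg")     # => "clonidine 300 mg"
p space_with_backrefs("clonidine300 mg")  # => "clonidine 300 mg"
```

Requiring a letter rather than \w before the digits keeps an input that is already spaced, such as "clonidine 300 mg", unchanged.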
 
Joel VanderWerf

Martin said:
Are you likely to have numbers in the drug's name? Don't forget to
include that as a test case if so. The following puts a space before the
string of digits immediately preceding " mg" if it doesn't already have
one:

sub(/([a-zA-Z])(\d+ mg)/, '\1 \2')

martin

Another way to do it:

irb(main):001:0> s = "clonidine300 mg"
=> "clonidine300 mg"
irb(main):002:0> s[/(?=\d+ mg)/] = " "
=> " "
irb(main):003:0> s
=> "clonidine 300 mg"
irb(main):004:0>
 
Rodrigo B. de Oliveira

irb(main):001:0> s = "clonidine300 mg"
=> "clonidine300 mg"
irb(main):002:0> s[/(?=\d+ mg)/] = " "
=> " "

What's exactly happening here?

thanks,
Rodrigo
 
gabriele renzi

On Sun, 28 Sep 2003 03:01:27 +0900, "Rodrigo B. de Oliveira" wrote:
irb(main):001:0> s = "clonidine300 mg"
=> "clonidine300 mg"
irb(main):002:0> s[/(?=\d+ mg)/] = " "
=> " "

What's exactly happening here?

I think:
s[/pattern/] returns the matching part of the string.
s[/pattern/]= value assigns 'value' to that piece of the string.
(?=regex) is known as "zero-width positive lookahead" and means that
the part of the string it matches is not consumed.
 
Joel VanderWerf

Rodrigo said:
irb(main):001:0> s = "clonidine300 mg"
=> "clonidine300 mg"
irb(main):002:0> s[/(?=\d+ mg)/] = " "
=> " "


What's exactly happening here?

String#[]=, as in "s1[pat] = s2", is a destructive slice operator. It
replaces the first match of pat with the r.h.s. (raising IndexError if
no match).

In this case, pat is /(?=\d+ mg)/, which has a lookahead pattern
(?=...). This lookahead expression matches the point in the string *just
before* the match of /\d+ mg/, but it doesn't consume the "300 mg". So
slicing out the match (which is empty) and substituting " " has the
effect of inserting a space before the match of /\d+ mg/.
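
Since the slice assignment raises IndexError when nothing matches, a batch run over mixed data may prefer sub!, which returns nil instead. A minimal sketch follows. The lookbehind (?<=[a-zA-Z]) is an added assumption, so that a dose already preceded by a space is left alone. space_before_dose! is a hypothetical name.

```ruby
# Inserts a space between a letter and a following "<digits> mg".
# Mutates and returns the string; leaves it untouched when nothing matches.
def space_before_dose!(s)
  s.sub!(/(?<=[a-zA-Z])(?=\d+ mg)/, " ")
  s
end

puts space_before_dose!("clonidine300 mg".dup)   # clonidine 300 mg
puts space_before_dose!("clonidine 300 mg".dup)  # clonidine 300 mg
puts space_before_dose!("aspirin".dup)           # aspirin

# For comparison, the slice assignment raises when there is no match.
begin
  plain = "aspirin".dup
  plain[/(?=\d+ mg)/] = " "
rescue IndexError
  puts "IndexError raised"
end
```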
 
