capitalizing words

Discussion in 'Ruby' started by Peter Bailey, Apr 8, 2008.

  1. Peter Bailey

    Peter Bailey Guest

    Hi,
    I need to capitalize the words in a string I find in XML files.

    The string that's in (.*) below is what I need to change. I just want to
    capitalize the first letter of each word in the string.

    I'm trying this, in a test:

    Dir.chdir("C:/users/pb4072/documents")
    file = File.read("test1.txt")
    file.gsub(/^<row><entry><text><emph face="b">(.*)<\/emph>/) do |match|
    array = $1.split
    array.each do |word|
    word.capitalize!
    end
    newfile = File.open("c:/users/pb4072/documents/test1.txt", "w") { |f|
    f.print array }
    end

    And, I'm getting this:

    #(.*)<\/emph>theQuickBrownFoxJumpedOverTheLazyDog.

    I want this:

    <row><entry><text><emph face="b">The Quick Brown Fox Jumped Over The
    Lazy Dog.<\/emph>/


    Thanks,
    Peter
    --
    Posted via http://www.ruby-forum.com/.
    Peter Bailey, Apr 8, 2008
    #1

  2. Todd Benson

    Todd Benson Guest

    On Tue, Apr 8, 2008 at 1:53 PM, Peter Bailey <> wrote:
    > Hi,
    > I need to capitalize the words in a string I find in XML files.
    >
    > The string that's in (.*) below is what I need to change. I just want to
    > capitalize the first letter of each word in the string.
    >
    > I'm trying this, in a test:
    >
    > Dir.chdir("C:/users/pb4072/documents")
    > file = File.read("test1.txt")
    > file.gsub(/^<row><entry><text><emph face="b">(.*)<\/emph>/) do |match|
    > array = $1.split
    > array.each do |word|
    > word.capitalize!
    > end
    > newfile = File.open("c:/users/pb4072/documents/test1.txt", "w") { |f|
    > f.print array }
    > end
    >
    > And, I'm getting this:
    >
    > #(.*)<\/emph>theQuickBrownFoxJumpedOverTheLazyDog.
    >
    > I want this:
    >
    > <row><entry><text><emph face="b">The Quick Brown Fox Jumped Over The
    > Lazy Dog.<\/emph>/
    >
    >
    > Thanks,
    > Peter


    I don't know what the original text looks like in test1.txt, but this
    might point you in the right direction...

    irb(main):001:0> s = "the quick brown fox"
    => "the quick brown fox"
    irb(main):002:0> s.split.map {|w| w.capitalize}.join ' '
    => "The Quick Brown Fox"

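    Applied to text like the all-caps, two-line sample that comes up later in the thread, the same idiom might look like the sketch below. The sample string is a stand-in, not the real file. Two side effects are worth knowing: capitalize lowercases the rest of each word, and split/join collapses every run of whitespace, line breaks included, into a single space.

```ruby
# Hypothetical sample mirroring the two-line, all-caps text from the thread.
text = "THE QUICK BROWN FOX JUMPED OVER THE\nLAZY DOG."

# split with no arguments breaks on any whitespace, newlines included;
# capitalize upcases the first character and downcases the rest.
title_cased = text.split.map { |word| word.capitalize }.join(' ')

puts title_cased
```

    If the original line breaks have to survive, a gsub over word matches avoids the split/join round trip.
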
    Todd
    Todd Benson, Apr 8, 2008
    #2

  3. Rob Biedenharn

    On Apr 8, 2008, at 2:53 PM, Peter Bailey wrote:
    > Hi,
    > I need to capitalize the words in a string I find in XML files.
    >
    > The string that's in (.*) below is what I need to change. I just
    > want to
    > capitalize the first letter of each word in the string.
    >
    > I'm trying this, in a test:
    >
    > Dir.chdir("C:/users/pb4072/documents")
    > file = File.read("test1.txt")
    > file.gsub(/^<row><entry><text><emph face="b">(.*)<\/emph>/) do |match|
    > array = $1.split
    > array.each do |word|
    > word.capitalize!
    > end
    > newfile = File.open("c:/users/pb4072/documents/test1.txt", "w") { |f|
    > f.print array }
    > end
    >
    > And, I'm getting this:
    >
    > #(.*)<\/emph>theQuickBrownFoxJumpedOverTheLazyDog.
    >
    > I want this:
    >
    > <row><entry><text><emph face="b">The Quick Brown Fox Jumped Over The
    > Lazy Dog.<\/emph>/
    >
    >
    > Thanks,
    > Peter


    Dir.chdir("C:/users/pb4072/documents") do |d|
      file = File.read("test1.txt")
      output = file.gsub(%r{^(<row><entry><text><emph face="b">)(.*)(</emph>)}m) do |match|
        "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
      end
      File.open("test1.txt", "w") { |f| f.write output }
    end

    Note the use of three capture groups to get the unchanged initial and
    final parts as well as the middle part that is altered. %r{\b\w+\b}
    is a Regexp that matches words: \b is a word boundary and \w is a
    word character (short for [a-zA-Z0-9_]). Also note that
    String#capitalize! returns nil if no change is made, so its return
    value shouldn't be relied on.
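
    To make the word regexp and the capitalize! caveat concrete, here is a small sketch on a made-up sample string (the string and variable names are only for illustration):

```ruby
# Made-up sample: a hyphen separates words, an underscore and a digit do not.
sample = "the quick-brown fox_1"

# \b\w+\b picks out runs of word characters ([a-zA-Z0-9_]).
words = sample.scan(/\b\w+\b/)

# gsub with a block replaces each word and leaves everything else untouched.
capitalized = sample.gsub(/\b\w+\b/) { |w| w.capitalize }

# capitalize! mutates the receiver and returns nil when nothing changed.
changed   = "hello".dup.capitalize!
unchanged = "Hello".dup.capitalize!

p words, capitalized, changed, unchanged
```

    The non-bang capitalize always returns a string, which is why it is the safer choice inside a gsub block.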

    -Rob

    Rob Biedenharn http://agileconsultingllc.com
    Rob Biedenharn, Apr 9, 2008
    #3
  4. Peter Bailey

    Peter Bailey Guest

    Todd Benson wrote:
    > On Tue, Apr 8, 2008 at 1:53 PM, Peter Bailey <> wrote:
    >> file.gsub(/^<row><entry><text><emph face="b">(.*)<\/emph>/) do |match|
    >> #(.*)<\/emph>theQuickBrownFoxJumpedOverTheLazyDog.
    >>
    >> I want this:
    >>
    >> <row><entry><text><emph face="b">The Quick Brown Fox Jumped Over The
    >> Lazy Dog.<\/emph>/
    >>
    >>
    >> Thanks,
    >> Peter

    >
    > I don't know what the original text looks like in test1.txt, but this
    > might point you in the right direction...
    >
    > irb(main):001:0> s = "the quick brown fox"
    > => "the quick brown fox"
    > irb(main):002:0> s.split.map {|w| w.capitalize}.join ' '
    > => "The Quick Brown Fox"
    >
    > Todd


    Thanks, Todd.
    The original text is just:
    <row><entry><text><emph face="b">THE QUICK BROWN FOX JUMPED OVER THE
    LAZY DOG.<\/emph>/

    Should I just make your "s" equal to $1 from my original gsub?

    -Peter
    --
    Posted via http://www.ruby-forum.com/.
    Peter Bailey, Apr 9, 2008
    #4
  5. Peter Bailey

    Peter Bailey Guest

    Rob Biedenharn wrote:
    > On Apr 8, 2008, at 2:53 PM, Peter Bailey wrote:
    >> file = File.read("test1.txt")
    >> And, I'm getting this:
    >> Peter

    > Dir.chdir("C:/users/pb4072/documents") do |d|
    > file = File.read("test1.txt")
    > output = file.gsub(%r{^(<row><entry><text><emph face="b">)(.*)(</emph>)}m) do |match|
    > "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
    > end
    > File.open("test1.txt", "w") { |f| f.write output }
    > end
    >
    > Note the use of three capture groups to get the unchanged initial and
    > final parts as well as the middle part that is altered. The
    > %r{\b\w+\b} is a Regexp that matches words, \b is a word-boundary and \w is a
    > word-character (short for [a-zA-Z0-9_]). Your use of
    > String#capitalize! returns nil if no change is made.
    >
    > -Rob
    >
    > Rob Biedenharn http://agileconsultingllc.com
    >


    Thanks, Rob. This works beautifully, except that I need that last
    </emph> in my output. It's being stripped with your code. I don't see
    why, because it's just your $3, isn't it?
    --
    Posted via http://www.ruby-forum.com/.
    Peter Bailey, Apr 9, 2008
    #5
  6. Rob Biedenharn

    On Apr 9, 2008, at 8:24 AM, Peter Bailey wrote:
    > Rob Biedenharn wrote:
    >> On Apr 8, 2008, at 2:53 PM, Peter Bailey wrote:
    >>> file = File.read("test1.txt")
    >>> And, I'm getting this:
    >>> Peter

    >> Dir.chdir("C:/users/pb4072/documents") do |d|
    >> file = File.read("test1.txt")
    >> output = file.gsub(%r{^(<row><entry><text><emph face="b">)(.*)(</emph>)}m) do |match|
    >> "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
    >> end
    >> File.open("test1.txt", "w") { |f| f.write output }
    >> end
    >>
    >> Note the use of three capture groups to get the unchanged initial and
    >> final parts as well as the middle part that is altered. The
    >> %r{\b\w+\b} is a Regexp that matches words, \b is a word-boundary and \w is a
    >> word-character (short for [a-zA-Z0-9_]). Your use of
    >> String#capitalize! returns nil if no change is made.
    >>
    >> -Rob
    >>
    >> Rob Biedenharn http://agileconsultingllc.com
    >>

    >
    > Thanks, Rob. This works beautifully, except that I need that last
    > </emph> in my output. It's being stripped with your code. I don't see
    > why, because it's just your $3, isn't it?
    > --
    > Posted via http://www.ruby-forum.com/.


    You said to Todd:
    The original text is just:
    <row><entry><text><emph face="b">THE QUICK BROWN FOX JUMPED OVER THE
    LAZY DOG.<\/emph>/

    I assumed that the "<\/emph>/" part was a cut-n-paste of a regexp for
    the email (which is one reason that I changed from // to the %r{}
    construction of the Regexp, so the / wouldn't have to be escaped). You
    may have to change the second group to (.*?) [reluctant match rather
    than greedy match] or adjust the third group to exactly match your
    input.

    -Rob

    Rob Biedenharn http://agileconsultingllc.com
    Rob Biedenharn, Apr 9, 2008
    #6
  7. Peter Bailey

    Peter Bailey Guest

    Rob Biedenharn wrote:
    > On Apr 9, 2008, at 8:24 AM, Peter Bailey wrote:
    >>> end
    >>>
    >>> Rob Biedenharn http://agileconsultingllc.com
    >>>

    >>
    >> Thanks, Rob. This works beautifully, except that I need that last
    >> </emph> in my output. It's being stripped with your code. I don't see
    >> why, because it's just your $3, isn't it?
    >> --
    >> Posted via http://www.ruby-forum.com/.

    >
    > You said to Todd:
    > The original text is just:
    > <row><entry><text><emph face="b">THE QUICK BROWN FOX JUMPED OVER THE
    > LAZY DOG.<\/emph>/
    >
    > I assumed that the "<\/emph>/" part was a cut-n-paste of a regexp for
    > the email (which is one reason that I change from // to %r{}
    > construction of the Regexp so the / wouldn't have to be escaped. You
    > may have to change the second group to (.*?) [reluctant match rather
    > than greedy match] or adjust the third group to exactly match your
    > input.
    >
    > -Rob
    >
    > Rob Biedenharn http://agileconsultingllc.com
    >


    Rob,
    So, here's my original file:
    <row><entry><text><emph face="b">THE QUICK BROWN FOX JUMPED OVER THE
    LAZY DOG.</emph>

    Here's my code, from you:
    Dir.chdir("C:/users/pb4072/documents") do |d|
      file = File.read("test1.txt")
      output = file.gsub(%r{^(<row><entry><text><emph face="b">)(.*)(<\/emph>)}m) do |match|
        "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
      end
      File.open("test1.txt", "w") { |f| f.write output }
    end

    Here's what I get. It works great, but, I don't understand why the $3
    text is simply blown away.
    <row><entry><text><emph face="b">The Quick Brown Fox Jumped Over The
    Lazy Dog.

    Thanks,
    Peter
    --
    Posted via http://www.ruby-forum.com/.
    Peter Bailey, Apr 9, 2008
    #7
  8. Jens Wille

    Jens Wille Guest

    hi peter!

    Peter Bailey [2008-04-09 20:04]:
    > Dir.chdir("C:/users/pb4072/documents") do |d|
    > file = File.read("test1.txt")
    > output = file.gsub(%r{^(<row><entry><text><emph face="b">)(.*)(<\/emph>)}m) do |match|
    > "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
    > end
    > File.open("test1.txt", "w") { |f| f.write output }
    > end
    >
    > Here's what I get. It works great, but, I don't understand why
    > the $3 text is simply blown away.

    because it's reset when you're doing that gsub on $2. the capture
    variables only refer to the *last* match. so you have to capture
    them into local variables first (can't think of a better way right now).
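
    A small sketch of that reset, using a made-up one-line string rather than the real file. It records $3 before and after the nested gsub inside the block:

```ruby
# Made-up sample; the <b> tags stand in for the longer tags in the thread.
line = '<b>hello world</b>'

seen = []
result = line.gsub(%r{(<b>)(.*)(</b>)}) do
  seen << $3                                       # still "</b>" here
  inner = $2.gsub(/\b\w+\b/) { |w| w.capitalize }  # nested gsub replaces the match data
  seen << $3                                       # now nil: the inner regexp has no groups
  "#{$1}#{inner}#{$3}"                             # $1 is gone as well at this point
end

p seen, result
```

    In the code from the thread $1 survives only because it is interpolated before the nested gsub runs; $3 is read afterwards and comes back empty.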

    cheers
    jens

    --
    Jens Wille, Dipl.-Bibl. (FH)
    prometheus - Das verteilte digitale Bildarchiv für Forschung & Lehre
    Kunsthistorisches Institut der Universität zu Köln
    Albertus-Magnus-Platz, D-50923 Köln
    Tel.: +49 (0)221 470-6668, E-Mail:
    http://www.prometheus-bildarchiv.de/
    Jens Wille, Apr 9, 2008
    #8
  9. Rob Biedenharn

    On Apr 9, 2008, at 2:04 PM, Peter Bailey wrote:
    > Rob Biedenharn wrote:
    >> On Apr 9, 2008, at 8:24 AM, Peter Bailey wrote:
    >>>> end
    >>>>
    >>>> Rob Biedenharn http://agileconsultingllc.com
    >>>>
    >>>
    >>> Thanks, Rob. This works beautifully, except that I need that last
    >>> </emph> in my output. It's being stripped with your code. I don't
    >>> see
    >>> why, because it's just your $3, isn't it?
    >>> --
    >>> Posted via http://www.ruby-forum.com/.

    >>
    >> You said to Todd:
    >> The original text is just:
    >> <row><entry><text><emph face="b">THE QUICK BROWN FOX JUMPED OVER THE
    >> LAZY DOG.<\/emph>/
    >>
    >> I assumed that the "<\/emph>/" part was a cut-n-paste of a regexp for
    >> the email (which is one reason that I change from // to %r{}
    >> construction of the Regexp so the / wouldn't have to be escaped. You
    >> may have to change the second group to (.*?) [reluctant match rather
    >> than greedy match] or adjust the third group to exactly match your
    >> input.
    >>
    >> -Rob
    >>
    >> Rob Biedenharn http://agileconsultingllc.com
    >>

    >
    > Rob,
    > So, here's my original file:
    > <row><entry><text><emph face="b">THE QUICK BROWN FOX JUMPED OVER THE
    > LAZY DOG.</emph>

    OK, change this to a regexp:
    1. surround with the regexp literal bits
    %r{<row><entry><text><emph face="b">THE QUICK BROWN FOX JUMPED OVER THE
    LAZY DOG.</emph>}m

    2. add the grouping ()'s
    %r{(<row><entry><text><emph face="b">)(THE QUICK BROWN FOX JUMPED OVER THE
    LAZY DOG.)(</emph>)}m

    3. replace text with wildcards .* or .*?
    %r{(<row><entry><text><emph face="b">)(.*?)(</emph>)}m

    4. (optional?) add anchor ^
    %r{^(<row><entry><text><emph face="b">)(.*?)(</emph>)}m

    I'm assuming that is not the WHOLE file since the <row><entry><text>
    tags are not closed. It is quite likely that .* is slurping a lot
    more than you think, so that's why I've changed this to .*?, which
    matches as little as possible while continuing to succeed.
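
    The difference between .* and .*? shows up as soon as the input holds more than one element. A minimal sketch on a made-up two-element string (not the real file):

```ruby
# Made-up input with two <emph> elements on one line.
xml = %(<emph face="b">ONE</emph> filler <emph face="b">TWO</emph>)

# Greedy: .* runs to the LAST </emph>, swallowing everything in between.
greedy = xml.scan(%r{<emph face="b">(.*)</emph>}m).flatten

# Reluctant: .*? stops at the FIRST </emph> after each opening tag.
reluctant = xml.scan(%r{<emph face="b">(.*?)</emph>}m).flatten

p greedy
p reluctant
```

    With the m flag the dot also matches newlines, so in a multi-line file a greedy group can run all the way to the final </emph>.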

    -Rob

    Rob Biedenharn http://agileconsultingllc.com


    > Here's my code, from you:
    > Dir.chdir("C:/users/pb4072/documents") do |d|
    > file = File.read("test1.txt")
    > output = file.gsub(%r{^(<row><entry><text><emph
    > face="b">)(.*)(<\/emph>)}m) do |match|
    > "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
    > end
    > File.open("test1.txt", "w") { |f| f.write output }
    > end
    >
    > Here's what I get. It works great, but, I don't understand why the $3
    > text is simply blown away.
    > <row><entry><text><emph face="b">The Quick Brown Fox Jumped Over The
    > Lazy Dog.
    >
    > Thanks,
    > Peter
    Rob Biedenharn, Apr 9, 2008
    #9
  10. Rob Biedenharn

    On Apr 9, 2008, at 2:13 PM, Jens Wille wrote:
    > hi peter!
    >
    > Peter Bailey [2008-04-09 20:04]:
    >> Dir.chdir("C:/users/pb4072/documents") do |d|
    >> file = File.read("test1.txt")
    >> output = file.gsub(%r{^(<row><entry><text><emph face="b">)(.*)(<\/emph>)}m) do |match|
    >> "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
    >> end
    >> File.open("test1.txt", "w") { |f| f.write output }
    >> end
    >>
    >> Here's what I get. It works great, but, I don't understand why
    >> the $3 text is simply blown away.

    > because it's reset when you're doing that gsub on $2. the capture
    > variables only refer to the *last* match. so you have to capture
    > them into local variables first (can't think of a better way right
    > now).
    >
    > cheers
    > jens
    >
    > --
    > Jens Wille, Dipl.-Bibl. (FH)
    > prometheus - Das verteilte digitale Bildarchiv für Forschung & Lehre
    > Kunsthistorisches Institut der Universität zu Köln
    > Albertus-Magnus-Platz, D-50923 Köln
    > Tel.: +49 (0)221 470-6668, E-Mail:
    > http://www.prometheus-bildarchiv.de/



    Ah yes! Good catch, Jens.

    Peter, you only *need* to capture $3, but it would make sense to get
    them all:

    head, content, tail = $1, $2, $3
    "#{head}#{content.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{tail}"

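    Put together, the corrected block might look like the sketch below. It works on an in-memory string standing in for test1.txt (the sample content is assumed from the thread, and the file read/write is left out), and it uses the reluctant (.*?) group suggested earlier.

```ruby
# Hypothetical stand-in for the contents of test1.txt.
file = %(<row><entry><text><emph face="b">THE QUICK BROWN FOX JUMPED OVER THE\nLAZY DOG.</emph>\n)

pattern = %r{^(<row><entry><text><emph face="b">)(.*?)(</emph>)}m

output = file.gsub(pattern) do
  # Copy the captures into locals BEFORE the nested gsub overwrites $1..$3.
  head, content, tail = $1, $2, $3
  "#{head}#{content.gsub(/\b\w+\b/) { |w| w.capitalize }}#{tail}"
end

puts output
```
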
    -Rob

    Rob Biedenharn http://agileconsultingllc.com
    Rob Biedenharn, Apr 9, 2008
    #10
  11. Jens Wille

    Jens Wille Guest

    Rob Biedenharn [2008-04-09 20:46]:
    > head, content, tail = $1, $2, $3
    > "#{head}#{content.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{tail}"

    now here's a quick implementation that passes the MatchData object
    into the block:

    <http://prometheus.khi.uni-koeln.de/svn/scratch/ruby-nuggets/lib/nuggets/string/sub_with_md.rb>

    so that code effectively becomes:

    str.gsub_with_md(re) { |md|
    "#{md[1]}#{md[2].gsub(%r{\b\w+\b}){|w|w.capitalize}}#{md[3]}"
    }

    ;-)
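
    Without that extension, a similar effect can be had in plain Ruby by grabbing the MatchData object at the top of the block via Regexp.last_match. This is only a sketch of that alternative, not the gsub_with_md implementation, and the sample string is made up:

```ruby
# Made-up sample text.
text = 'say <i>hello there</i> now'

result = text.gsub(%r{(<i>)(.*?)(</i>)}) do
  md = Regexp.last_match   # MatchData for the outer match, same as $~ here
  # md keeps pointing at the outer match even after the nested gsub runs.
  "#{md[1]}#{md[2].gsub(/\b\w+\b/) { |w| w.capitalize }}#{md[3]}"
end

puts result
```
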

    cheers
    jens
    Jens Wille, Apr 9, 2008
    #11
  12. Jens Wille

    Jens Wille Guest

    Peter Bailey [2008-04-09 20:04]:
    > output = file.gsub(%r{^(<row><entry><text><emph
    > face="b">)(.*)(<\/emph>)}m) do |match|
    > "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
    > end

    oh, and for the fun of it, here's what you can do with oniguruma:

    Oniguruma::ORegexp.new(
    '(?<=^<row><entry><text><emph face="b">).+(?=</emph>)', 'm'
    ).gsub(file) { |md|
    md[0].gsub(%r{\b\w+\b}) { |w| w.capitalize }
    }

    (note that i needed to change '.*' to '.+')

    cheers
    jens
    Jens Wille, Apr 9, 2008
    #12
  13. Peter Bailey

    Peter Bailey Guest

    Jens Wille wrote:
    > Peter Bailey [2008-04-09 20:04]:
    >> output = file.gsub(%r{^(<row><entry><text><emph
    >> face="b">)(.*)(<\/emph>)}m) do |match|
    >> "#{$1}#{$2.gsub(%r{\b\w+\b}){|w|w.capitalize}}#{$3}"
    >> end

    > oh, and for the fun of it, here's what you can do with oniguruma:
    >
    > Oniguruma::ORegexp.new(
    > '(?<=^<row><entry><text><emph face="b">).+(?=</emph>)', 'm'
    > ).gsub(file) { |md|
    > md[0].gsub(%r{\b\w+\b}) { |w| w.capitalize }
    > }
    >
    > (note that i needed to change '.*' to '.+')
    >
    > cheers
    > jens


    Sorry, Jens, but, I have no idea what you're referring to here. I
    googled oniguruma. I see what it is. I installed it, but, it didn't seem
    to install successfully. Do I do a "require oniguruma" at the top of my
    script?
    --
    Posted via http://www.ruby-forum.com/.
    Peter Bailey, Apr 10, 2008
    #13
  14. Jens Wille

    Jens Wille Guest

    Peter Bailey [2008-04-10 14:26]:
    > Do I do a "require oniguruma" at the top of my script?

    sure. but you really don't need it to solve your task at hand.

    it's just the new regexp engine for ruby 1.9 and sometimes i like to
    do some stuff with it that the default engine of 1.8 can't do
    (zero-width look-behind in this case).
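
    On Ruby 1.9 or later, where that engine is built in, look-behind should work in an ordinary Regexp with nothing to require. A minimal sketch, assuming a shortened sample string and a reluctant .+? in place of the greedy .+:

```ruby
# Shortened, made-up sample.
file = %(<emph face="b">THE LAZY DOG.</emph>)

# (?<=...) and (?=...) assert the surrounding tags without consuming them,
# so the match itself is only the text between the tags.
result = file.gsub(%r{(?<=<emph face="b">).+?(?=</emph>)}m) do |inner|
  inner.gsub(/\b\w+\b/) { |w| w.capitalize }
end

puts result
```

    Because the tags are never part of the match, no capture groups are needed and the nested gsub has nothing to clobber.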

    you can still simplify your substitution by using the look-ahead
    (which 1.8 *does* understand), so you get rid of the third capture:

    file.gsub(%r{^(<row><entry><text><emph face="b">)(.*)(?=<\/emph>)}m) {
      "#{$1}#{$2.gsub(%r{\b\w+\b}) { |w| w.capitalize }}"
    }
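
    A runnable sketch of the look-ahead variant on a shortened, made-up sample (the reluctant .*? is an assumption carried over from the earlier advice). Reading $1 here is safe only because it is interpolated before the nested gsub runs:

```ruby
# Shortened, made-up sample.
file = %(<row><entry><text><emph face="b">THE LAZY DOG.</emph>)

result = file.gsub(%r{^(<row><entry><text><emph face="b">)(.*?)(?=</emph>)}m) do
  # Interpolation runs left to right: $1 is read first, then $2 is read
  # and only after that does the nested gsub replace the match data.
  "#{$1}#{$2.gsub(/\b\w+\b/) { |w| w.capitalize }}"
end

puts result
```

    The closing </emph> is never consumed by the match, so it stays in the output without a third capture.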

    cheers
    jens
    Jens Wille, Apr 10, 2008
    #14
  15. Peter Bailey

    Peter Bailey Guest

    Jens Wille wrote:
    > Peter Bailey [2008-04-10 14:26]:
    >> Do I do a "require oniguruma" at the top of my script?

    > sure. but you really don't need it to solve your task at hand.
    >
    > it's just the new regexp engine for ruby 1.9 and sometimes i like to
    > do some stuff with it that the default engine of 1.8 can't do
    > (zero-width look-behind in this case).
    >
    > you can still simplify your substitution by using the look-ahead
    > (which 1.8 *does* understand), so you get rid of the third capture:
    >
    > file.gsub(%r{^(<row><entry><text><emph face="b">)(.*)(?=<\/emph>)}m) {
    > "#{$1}#{$2.gsub(%r{\b\w+\b}) { |w| w.capitalize }}"
    > }
    >
    > cheers
    > jens


    Thanks. But, again, do I need to do a "require" for oniguruma at the
    top?
    Cheers,
    Peter
    --
    Posted via http://www.ruby-forum.com/.
    Peter Bailey, Apr 10, 2008
    #15
  16. Jens Wille

    Jens Wille Guest

    Peter Bailey [2008-04-10 16:20]:
    > Jens Wille wrote:
    >> Peter Bailey [2008-04-10 14:26]:
    >>> Do I do a "require oniguruma" at the top of my script?

    >> sure. but you really don't need it to solve your task at hand.

    > Thanks. But, again, do I need to do a "require" for oniguruma at
    > the top?

    if you want to use oniguruma, then yes, you have to require it first.
    Jens Wille, Apr 10, 2008
    #16
  17. Peter Bailey

    Peter Bailey Guest

    Jens Wille wrote:
    > Peter Bailey [2008-04-10 16:20]:
    >> Jens Wille wrote:
    >>> Peter Bailey [2008-04-10 14:26]:
    >>>> Do I do a "require oniguruma" at the top of my script?
    >>> sure. but you really don't need it to solve your task at hand.

    >> Thanks. But, again, do I need to do a "require" for oniguruma at
    >> the top?

    > if you want to use oniguruma, then yes, you have to require it first.


    OK. Thanks!
    --
    Posted via http://www.ruby-forum.com/.
    Peter Bailey, Apr 10, 2008
    #17
