trying to use regex

merrittr · Jun 20, 2007

Hi, I am trying to strip out the text between body tags, but when I run it I
get:

rob@rob-laptop:~/ruby$ ./html2.rb
./html2.rb:14: unknown regexp options - bdy
./html2.rb:14: unterminated string meets end of file
../html2.rb:14: parse error, unexpected tSTRING_END, expecting
tSTRING_CONTENT or tREGEXP_END or tSTRING_DBEG or tSTRING_DVAR

#! /usr/bin/ruby

@h = File.open "test.html"
@response = @h.gets

text = @response.scan(/<body[^>]*>(.+?)</body>/)[0]
puts text

Alex Gutteridge · Jun 20, 2007

You need to escape the '/' in your regexp, and unless your html file
is one line you may need to also add the multiline option:

text = @response.scan(/<body[^>]*>(.+?)<\/body>/m)[0]
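
One more thing worth checking, shown as a rough sketch below: gets returns only the first line of the file, so the /m option cannot help unless the whole document is in one string (for example via File.read). The sample HTML and the variable names are made up for illustration; they are not from the original script.

```ruby
# Sketch: the HTML is inlined so the example runs on its own.
# In the original script it would come from File.read("test.html").
html = "<html>\n<body class=\"main\">\nHello,\nworld\n</body>\n</html>\n"

# gets hands back a single line, so for this input the first line
# never contains both the opening and the closing body tag.
first_line = html.lines.first
no_match = first_line.scan(/<body[^>]*>(.+?)<\/body>/m)

# With the whole document in one string, the escaped slash plus /m works.
matches = html.scan(/<body[^>]*>(.+?)<\/body>/m)
body_text = matches[0][0]   # scan returns an array of capture-group arrays
puts body_text
```

Note that scan(...)[0] is itself an array holding the captures, which is why the sketch indexes twice to get a plain string.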

Alex Gutteridge

Bioinformatics Center
Kyoto University

Rob Biedenharn · Jun 20, 2007

Or you can use the %r{} form of a Regexp literal:

text = @response.scan(%r{<body\b.*?>(.*?)</body>}mi)[0]

\b matches a "word boundary"
m is the multi-line option that causes . to match newlines, too
i is the case insensitive option (so BODY would also be matched)
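
A small self-contained sketch of how those three pieces behave, using an invented sample string (the names html, body_re, inner and other are only for illustration):

```ruby
# Upper-case tags and a multi-line body, to exercise the i and m options.
html = "<HTML>\n<BODY bgcolor=\"white\">\nline one\nline two\n</BODY>\n</HTML>\n"

# %r{...} uses braces as delimiters, so the / in </body> needs no backslash.
body_re = %r{<body\b.*?>(.*?)</body>}mi

# String#[] with a regexp and a capture index returns that group (or nil).
inner = html[body_re, 1]
puts inner

# \b keeps the pattern from matching a longer tag name such as <bodyguard>.
other = "<bodyguard>x</bodyguard>"[body_re, 1]
puts other.inspect
```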

-Rob

Rob Biedenharn http://agileconsultingllc.com

Drew Olson · Jun 20, 2007

HTML parsing can get quite complicated. Why not use a library? I've
heard great things about Hpricot: http://code.whytheluckystiff.net/hpricot/
