Removing whitespace using regexp

Discussion in 'Ruby' started by Arun Kumar, May 6, 2009.

  1. Arun Kumar

    Hi,
    Previously I posted a topic on how to strip all HTML tags and get
    the remaining text using a regexp. Luckily I got one. This is the regexp:

    /([^>]*)(?=<[^>]*?>)/im

    With this regexp I'm able to get all the data between the HTML tags, but
    there is one small problem. I'm getting output like this:


    Example Web Page


    You have reached this web page by typing &quot;example.com&quot;,
    &quot;example.net&quot;,
    or &quot;example.org&quot; into your web browser.
    These domain names are reserved for use in documentation and are not
    available
    for registration. See RFC
    2606, Section 3.

    This is the output I get when I parse the HTML content of
    example.com using the above regexp. Here you can see some whitespace
    between the data (i.e. between 'Example Web Page' and 'You have
    reached...'). This whitespace is generated in place of the HTML tags
    which I removed using the above regexp. I want to remove that
    whitespace. I think that modifying the above regexp will give me the
    right output without whitespace. Can somebody please help me?

    Thanks
    Arun
     
    Arun Kumar, May 6, 2009
    #1

  2. * Arun Kumar <> (07:54) wrote:

    > Hi,
    > Previously I posted a topic on how to strip all html tags and getting
    > the remaining text using regexp. Luckily I got one. This is the regexp:
    >
    > /([^>]*)(?=<[^>]*?>)/im


    And what do you do with this regexp?

    > In this case I'm able to get all the data between the html tags. But one
    > small problem.


    Hasn't everybody told you that there are problems with parsing HTML with regexps?

    > This is the output which I get when I parse the html content of
    > example.com using the above regexp. Here you can see some white space
    > between the data(ie. between 'Example web page' and 'You have
    > reached...'. These whitespaces are generated in place of the html tags
    > which I avoided using the above regexp.


    Really? Aren't they just from all the meaningless whitespace that's in
    a typical HTML document?
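
    A minimal sketch of that point, not part of the original post. It assumes the regexp is applied with String#scan (the thread never says how it is used) and runs on a made-up HTML snippet rather than the real example.com markup:

```ruby
# Hypothetical input; the actual example.com markup is not shown in the thread.
html = "<html>\n<head>\n<title>Example Web Page</title>\n</head>\n" \
       "<body>\n<p>You have reached this web page.</p>\n</body>\n</html>\n"

# Assumption: the regexp from the first post is used with String#scan.
# With one capture group, scan returns an array of one-element arrays.
pieces = html.scan(/([^>]*)(?=<[^>]*?>)/im).flatten

# Pieces that are empty or consist only of whitespace. They come from the
# newlines that already sit between tags in the source document.
blank = pieces.select { |piece| piece.strip.empty? }

p pieces
p blank
```

    In this sketch the text pieces appear alongside empty and newline-only strings, which suggests the gaps are copied from the document rather than generated by the regexp.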

    > I want to remove those
    > whitespaces. I think that modifying the above regexp will give me the
    > right output without white spaces. Can somebody please help me.


    There are easy ways to strip all the whitespace, which is certainly not
    what you want, and there is a simple way to reduce every run of whitespace
    to just one space (gsub(/\s+/, ' ')), which is probably also not what you
    want.

    Selectively removing some of the whitespace isn't easy at all, but it is
    probably a lot easier with a real HTML parser.
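
    The two easy options above, plus one possible selective variant, can be sketched as follows. The sample text is invented, and the third variant (dropping blank lines) is only an assumption about what "selectively" might mean here:

```ruby
# Invented sample resembling the extracted text from the first post.
text = "Example Web Page\n\n\n  You have reached this web page\n  by typing example.com.\n"

# Option 1: remove every whitespace character. Words run together.
no_whitespace = text.gsub(/\s+/, '')

# Option 2: collapse each run of whitespace to one space. Line breaks are lost.
single_spaced = text.gsub(/\s+/, ' ').strip

# Assumed selective variant: trim each line, then drop lines that are empty.
non_blank_lines = text.lines.map(&:strip).reject(&:empty?)

puts no_whitespace
puts single_spaced
puts non_blank_lines
```

    The line-based variant keeps the line structure of the text, which neither gsub call does.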

    Kind regards, simon
     
    Simon Krahnke, May 6, 2009
    #2

  3. Hey Arun,


    How about doing a gsub on the output to remove the whitespace?

    For example:

    "Example Web Page".gsub(" ","")

    This would remove the white spaces.


    Hope this helps.

    Regards
    Sriram.


     
    Sriram Varahan, May 6, 2009
    #3
  4. I know your boss and whoever it is who is dangling your carrots won't let
    you use Hpricot, but tell him you will use Hpricot to get properly formatted
    HTML and then write a parser to parse the properly formatted HTML. Even he
    can't be opposed to that (seeing as how he wants you to reinvent wheels).
    That way you can get rid of your whitespace problem and deal with the cosmos
    at large.

    Jayanth

     
    Srijayanth Sridhar, May 6, 2009
    #4
  5. 2009/5/6 Sriram Varahan <>:

    > How about doing a gsub on the output to remove white spaces.
    >
    > For example:
    >
    > "Example Web Page".gsub(" ","")
    >
    > This would remove the white spaces.


    I would rather do

    s.gsub /\s+/, ' '

    Because your statement removes *all* whitespace:

    irb(main):002:0> "Example Web Page".gsub(" ","")
    => "ExampleWebPage"

    This is usually not what you want.
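
    A short sketch of the difference on text that also contains newlines and tabs. The sample string is made up for illustration:

```ruby
# Made-up sample with spaces, blank lines and a tab.
s = "Example Web Page\n\n\tYou have reached this web page"

# Deleting the space character removes word boundaries but leaves
# newlines and tabs in place.
spaces_removed = s.gsub(" ", "")

# Replacing each run of whitespace with one space keeps the words apart.
squeezed = s.gsub(/\s+/, ' ')

p spaces_removed  # => "ExampleWebPage\n\n\tYouhavereachedthiswebpage"
p squeezed        # => "Example Web Page You have reached this web page"
```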

    > Hope this helps.


    Ditto.

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, May 6, 2009
    #5
  6. Mark Thomas


    > I know your boss and whoever it is who is dangling your carrots won't let
    > you use Hpricot, but tell him you will use Hpricot to get properly formatted
    > html and then write a parser to parse the properly formatted html. Even he
    > can't be opposed to that(seeing as how he wants you to reinvent wheels).
    > That way you can get rid of your whitespace problem and deal with the cosmos
    > at large.


    Here's what we know about Arun from his previous posts:
    * he is a "trainee" doing assignments.
    * he is learning Ruby.
    * "nobody else" around him knows Ruby.
    * his "boss"/teacher is giving him specific assignments that seem to
    be purely academic exercises, because the constraints (e.g. don't use
    gsub, parse "example.com") would otherwise be completely ridiculous.
    * he doesn't have the authority to re-scope the assignment or offer
    alternate solutions.

    I believe he is asking us to do his homework.
     
    Mark Thomas, May 7, 2009
    #6
  7. How many places do you know that have extensive Ruby training programs that
    expect you to write HTML parsers armed with nothing but regular expressions?
    ;)

    I live in Bangalore, and I don't know one. Either he truly has a sadistic
    boss, or his truth is stranger than fiction. I don't doubt that it is
    homework of some sort.

    Jayanth

     
    Srijayanth Sridhar, May 7, 2009
    #7