Adventures in html decoding.

Discussion in 'Ruby' started by Morgan, Sep 12, 2005.

  1. Morgan

    Morgan Guest

    From the "If you want it done right, do it yourself... maybe"
    department.

    Today I was looking at a webpage that used html encoding
    (i.e., "&#97;" in place of "a") to obfuscate much of its contents.
    This displeased me for several reasons. (Not the least of
    which was a standing order from General Principles.)

    So I looked around online for a web-based tool that I could
    paste the text into and get back a more useful form. But
    everything I found either didn't work, or just didn't convert
    ordinary letters.

    So I said to heck with it, I can write something to do this
    myself.

    I didn't use CGI for two reasons. 1) I remember the last time I
    tried experimenting with CGI, and had to severely hack the
    library to get it to let me use html generation methods in a
    non-server environment. 2) The description of unescapeHTML
    sounded as though it would only unescape the special characters
    that have to be escaped.

    So, I ended up with this:


    ===
    outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
    IO.readlines(ARGV[0]).each{ |line|
      begin
        outfile.puts line.gsub(/&#(\d+);/) { |x|
          if $1.to_i < 256
            $1.to_i.chr
          else
            x
          end
        }
      rescue
        outfile.puts line
        puts line
      end
    }
    outfile.close
    ===

    And it worked.
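For readers following along, the core of the script above can be sketched on an in-memory string. This is only an illustration, assuming Ruby 1.9 or later, where Integer#chr accepts an encoding, so references above 255 can be decoded too. The name decode_numeric_entities is made up for this example and is not part of the original script or of any library.

```ruby
# Hypothetical helper: decode decimal character references like "&#97;".
def decode_numeric_entities(text)
  text.gsub(/&#(\d+);/) do |match|
    begin
      # Convert the captured digits to the corresponding UTF-8 character.
      $1.to_i.chr(Encoding::UTF_8)
    rescue RangeError
      # Not a valid code point: leave the reference untouched.
      match
    end
  end
end

puts decode_numeric_entities("&#72;&#101;&#108;&#108;&#111;, &#119;&#111;&#114;&#108;&#100;")
```

The rescue plays the same role as the one in the script: anything that cannot be converted is passed through unchanged rather than aborting the run.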

    Then I thought of looking at the source of unescapeHTML, and
    found that the description or my interpretation of it was wrong.
    Not only would it handle all the escaped ascii characters, it was
    a class method, so I didn't need to deal with the environment
    issues.

    Which led to...

    ===
    require 'cgi'
    outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
    IO.readlines(ARGV[0]).each{ |line|
      outfile.puts CGI::unescapeHTML(line)
    }
    outfile.close
    ===

    Which is much simpler; just some file handling stuff around
    the unescapeHTML function. Maybe later I'll try something with
    rubywebdialogs that'll let me paste into a web browser window
    and get back results the way I'd like to be able to do...
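A quick way to see what unescapeHTML does, without any file handling, is to call it on a literal string. The sample text below is invented for illustration.

```ruby
require 'cgi'

# Mix of the "must be escaped" characters and numeric references
# standing in for ordinary letters.
sample = "&lt;p&gt;&#72;&#105; &amp; &#98;&#121;&#101;&lt;/p&gt;"

puts CGI.unescapeHTML(sample)
```

As far as I can tell, it only knows the handful of basic named entities plus numeric references, so something like "&eacute;" comes back unchanged.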

    The moral of this story is, html obfuscation sucks.

    (What? That's *not* the moral? Oh well...)

    -Morgan


    Morgan, Sep 12, 2005
    #1

  2. Morgan

    Jim Freeze Guest

    On 9/12/05, Morgan <> wrote:
    > ===
    > require 'cgi'
    > outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
    > IO.readlines(ARGV[0]).each{ |line|
    > outfile.puts CGI::unescapeHTML(line)
    > }
    > outfile.close
    > ===


    This may be off topic, but I always wonder why all the flags to File.
    Could what you are doing be written as:

    File.open(ARGV[1], "w") { |outfile|
      File.foreach(ARGV[0]) { |line|
        outfile.puts CGI::unescapeHTML(line)
      }
    }

    or am I missing something big here?
    --
    Jim Freeze
    Jim Freeze, Sep 12, 2005
    #2


  3. > [...]
    > Which lead to...
    >
    > ===
    > require 'cgi'
    > outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
    > IO.readlines(ARGV[0]).each{ |line|
    > outfile.puts CGI::unescapeHTML(line)
    > }
    > outfile.close
    > ===
    >
    > Which is much simpler; just some file handling stuff around
    > the unescapeHTML function. Maybe later I'll try something with
    > rubywebdialogs that'll let me paste into a web browser window
    > and get back results the way I'd like to be able to do...
    >
    > The moral of this story is, html obfuscation sucks.
    >
    > (What? That's *not* the moral? Oh well...)
    >
    > -Morgan


    the moral is, there is always a simpler way :)

    require 'cgi'
    open(ARGV[1], 'w') do |f|
      f.write(CGI::unescapeHTML(IO.read(ARGV[0])))
    end

    cheers

    Simon
    Simon Kröger, Sep 12, 2005
    #3
  4. Morgan

    Morgan Guest

    Jim Freeze wrote:
    >This may be off topic, but I always wonder why all the flags to File.
    >Could what you are doing be written as:
    >
    > File.open(ARGV[1], "w") { |outfile|
    > File.foreach(ARGV[0]) { |line|
    > outfile.puts CGI::unescapeHTML(line)
    > }
    > }
    >
    >or am I missing something big here?


    Well, in this case, I don't believe it's possible to
    get the effect of File::EXCL (which basically amounts to
    "don't overwrite an existing file") with a string as the open
    mode. There are some other combinations of parameters
    that are also difficult (impossible) to achieve that way.
    (I don't remember exactly what it was, but I think it had to
    do with a file that was being opened for reading and writing.
    All the strings I tried either wouldn't let me access parts of
    an existing file, or otherwise failed to perform as I required.)
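To make the "don't overwrite an existing file" behaviour concrete, here is a small sketch. The helper name write_new_file and the temporary directory are assumptions made for this example. It relies on the open failing with Errno::EEXIST when the file is already there.

```ruby
require 'tmpdir'

# Hypothetical helper: create path and write data to it, but refuse to
# touch a file that already exists. Returns true if the file was written.
def write_new_file(path, data)
  File.open(path, File::CREAT | File::WRONLY | File::EXCL) do |f|
    f.write(data)
  end
  true
rescue Errno::EEXIST
  false
end

Dir.mktmpdir do |dir|
  target = File.join(dir, "out.txt")
  p write_new_file(target, "first")   # file does not exist yet
  p write_new_file(target, "second")  # file exists, so it is left alone
  p File.read(target)
end
```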

    -Morgan


    Morgan, Sep 13, 2005
    #4
  5. Morgan

    Ara.T.Howard Guest

    On Tue, 13 Sep 2005, Morgan wrote:

    > Jim Freeze wrote:
    >> This may be off topic, but I always wonder why all the flags to File.
    >> Could what you are doing be written as:
    >>
    >> File.open(ARGV[1], "w") { |outfile|
    >> File.foreach(ARGV[0]) { |line|
    >> outfile.puts CGI::unescapeHTML(line)
    >> }
    >> }
    >>
    >> or am I missing something big here?

    >
    > Well, in this case, I don't believe it's possible to
    > get the effect of File::EXCL (which basically amounts to
    > "don't overwrite an existing file") with a string as the open
    > mode. There are some other combinations of parameters
    > that are also difficult (impossible) to achieve that way.
    > (I don't remember exactly what it was, but I think it had to
    > do with a file that was being opened for reading and writing.
    > All the strings I tried either wouldn't let me access parts of
    > an existing file, or otherwise failed to perform as I required.)


    O_EXCL is broken on nfs:

    O_EXCL When used with O_CREAT, if the file already exists it is an error
    and the open will fail. In this context, a symbolic link exists, regardless of
    where it points to. O_EXCL is broken on NFS file systems; programs which
    rely on it for performing locking tasks will contain a race condition. The
    solution for performing atomic file locking using a lockfile is to
    create a unique file on the same fs (e.g., incorporating hostname and pid),
    use link(2) to make a link to the lockfile. If link() returns 0, the lock
    is successful. Otherwise, use stat(2) on the unique file to check if its
    link count has increased to 2, in which case the lock is also successful.
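The lockfile recipe in that man page excerpt can be sketched in Ruby roughly as follows. This is an illustration only, not code from the thread: acquire_lock and the way the unique file is named are assumptions, and the sketch has not been exercised on a real NFS mount.

```ruby
require 'socket'
require 'tmpdir'

# Hypothetical helper: try to take a lock by hard-linking a unique file
# to the lockfile name. Returns true if this process now holds the lock.
def acquire_lock(lockfile)
  # Unique name on the same filesystem, built from hostname and pid.
  unique = "#{lockfile}.#{Socket.gethostname}.#{Process.pid}"
  File.open(unique, File::CREAT | File::WRONLY) { }
  begin
    File.link(unique, lockfile)      # the link(2) step
    true
  rescue SystemCallError
    # link reported failure; a link count of 2 means it worked anyway.
    File.stat(unique).nlink == 2
  ensure
    File.unlink(unique) if File.exist?(unique)
  end
end

Dir.mktmpdir do |dir|
  lock = File.join(dir, "demo.lock")
  p acquire_lock(lock)  # first attempt takes the lock
  p acquire_lock(lock)  # second attempt finds it already taken
end
```

Releasing the lock would just be a matter of deleting the lockfile when the work is done.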


    fyi.

    -a
    --
    ===============================================================================
    | email :: ara [dot] t [dot] howard [at] noaa [dot] gov
    | phone :: 303.497.6469
    | Your life dwells amoung the causes of death
    | Like a lamp standing in a strong breeze. --Nagarjuna
    ===============================================================================
    Ara.T.Howard, Sep 13, 2005
    #5
  6. Morgan

    Morgan Guest

    "Ara.T.Howard" wrote:
    >O_EXCL is broken on nfs:
    >
    > O_EXCL When used with O_CREAT, if the file already exists it is an
    > error and the open will fail. In this context, a symbolic link exists,
    > regardless of where it points to. O_EXCL is broken on NFS file
    > systems; programs which rely on it for performing locking tasks will
    > contain a race condition. The solution for performing atomic file
    > locking using a lockfile is to create a unique file on the same fs
    > (e.g., incorporating hostname and pid), use link(2) to make a link
    > to the lockfile. If link() returns 0, the lock is successful. Otherwise,
    > use stat(2) on the unique file to check if its link count has
    > increased to 2, in which case the lock is also successful.


    ... And I barely understood a word of that. `.`

    Does that mean it won't properly perform the "don't clobber an existing file"
    purpose I'm using it for?

    -Morgan


    Morgan, Sep 13, 2005
    #6
  7. Morgan

    Ara.T.Howard Guest

    On Tue, 13 Sep 2005, Morgan wrote:

    > "Ara.T.Howard" wrote:
    >> O_EXCL is broken on nfs:
    >>
    >> O_EXCL When used with O_CREAT, if the file already exists it is an
    >> error and the open will fail. In this context, a symbolic link exists,
    >> regardless of where it points to. O_EXCL is broken on NFS file
    >> systems; programs which rely on it for performing locking tasks will
    >> contain a race condition. The solution for performing atomic file
    >> locking using a lockfile is to create a unique file on the same fs
    >> (e.g., incorporating hostname and pid), use link(2) to make a link
    >> to the lockfile. If link() returns 0, the lock is successful. Otherwise,
    >> use stat(2) on the unique file to check if its link count has
    >> increased to 2, in which case the lock is also successful.

    >
    > ... And I barely understood a word of that. `.`
    >
    > Does that mean it won't properly perform the "don't clobber an existing file"
    > purpose I'm using it for?


    it means that O_EXCL fails silently on some kinds of filesystems, including
    nfs. this is not likely to affect you and is beyond the control of ruby (it's
    the c library/fs fault) but, if it does affect you, it means that two
    instances of the code, when run at the same time, would __both__ be writing to
    the file at the same time - neither would have an exclusive lock on the file
    as it would not be created atomically. basically you can ignore this if you
    are working on local disk - but if you are on some sort of shared setup like nfs
    or the windows equivalent, be wary.

    cheers.

    -a
    Ara.T.Howard, Sep 13, 2005
    #7
  8. Simon Kröger wrote:
    > > [...]
    > > Which lead to...
    > >
    > > ===
    > > require 'cgi'
    > > outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
    > > IO.readlines(ARGV[0]).each{ |line|
    > > outfile.puts CGI::unescapeHTML(line)
    > > }
    > > outfile.close
    > > ===
    > >
    > > Which is much simpler; just some file handling stuff around
    > > the unescapeHTML function. Maybe later I'll try something with
    > > rubywebdialogs that'll let me paste into a web browser window
    > > and get back results the way I'd like to be able to do...
    > >
    > > The moral of this story is, html obfuscation sucks.
    > >
    > > (What? That's *not* the moral? Oh well...)
    > >
    > > -Morgan

    >
    > the moral is, there is always a simpler way :)
    >
    > require 'cgi'
    > open(ARGV[1], 'w') do |f|
    > f.write(CGI::unescapeHTML(IO.read(ARGV[0])))
    > end
    >
    > cheers
    >
    > Simon


    Simpler still:

    require 'cgi'
    open(ARGV.pop, 'w') { |f|
      f.write(CGI.unescapeHTML(ARGF.read))
    }
    William James, Sep 13, 2005
    #8
  9. Morgan

    Morgan Guest

    "William James" wrote:
    >Simpler still:
    >
    >require 'cgi'
    >open(ARGV.pop, 'w') { |f|
    > f.write(CGI.unescapeHTML(ARGF.read))
    >}


    I think you might have reached the point where
    simpler is more complex... I'm not sure I'd know
    what that code was supposed to do if it wasn't
    something I wrote being reduced.

    *never even -seen- ARGF before*

    -Morgan


    Morgan, Sep 13, 2005
    #9
  10. Morgan

    Jim Freeze Guest

    On 9/12/05, Morgan <> wrote:
    > *never even -seen- ARGF before*


    ARGF is a reference to $stdin.

    --
    Jim Freeze
    Jim Freeze, Sep 13, 2005
    #10
  11. Jim Freeze wrote:
    > On 9/12/05, Morgan <> wrote:
    > > *never even -seen- ARGF before*

    >
    > ARGF is a reference to $stdin.
    >
    > --
    > Jim Freeze


    An object providing access to virtual concatenation of files
    passed as command-line arguments or standard input if there
    are no command-line arguments. -- Ruby in a Nutshell

    ARGF is no more esoteric than ARGV, and it's quite handy.
    Let's say you want to process every line of every file
    on the command-line:

    ruby -e 'ARGF.each_line{|x| p x}' file1 file2 file3
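The same thing can be tried from inside a script. The sketch below fakes the command line by replacing ARGV before the first read. The helper name lines_from and the temporary files are invented for this example.

```ruby
require 'tmpdir'

# Hypothetical helper: point ARGV at the given paths and let ARGF read
# them as if they were one long file.
def lines_from(paths)
  ARGV.replace(paths)
  collected = []
  ARGF.each_line { |line| collected << line.chomp }
  collected
end

Dir.mktmpdir do |dir|
  first  = File.join(dir, "file1")
  second = File.join(dir, "file2")
  File.write(first,  "one\ntwo\n")
  File.write(second, "three\n")
  p lines_from([first, second])
end
```

With an empty ARGV, the same each_line call would read from standard input instead.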
    William James, Sep 13, 2005
    #11
  12. On Sep 13, 2005, at 1:16 AM, William James wrote:
    > ARGF is no more esoteric than ARGV, and it's quite handy.
    > Let's say you want to process every line of every file
    > on the command-line:
    >
    > ruby -e 'ARGF.each_line{|x| p x}' file1 file2 file3


    Damn, that *is* handy! I love this list.
    Gavin Kistner, Sep 13, 2005
    #12
  13. mathew wrote:
    > William James wrote:
    > > ARGF is no more esoteric than ARGV, [...]

    >
    > I disagree. ARGV is familiar to anyone who's ever written C, C++,
    > Objective-C, Java, Perl, AWK, Python, Scheme, ...
    > ARGF is not.


    These are not familiar to everyone who's ever written in C or Awk:

    class, map, join, __END__, DATA, <<HERE, grep, flatten

    But that doesn't prove they are esoteric to those who use Ruby.

    > I'd never heard of it until this thread.


    Major premise:
    I know everything about Ruby except that which is esoteric.

    Minor premise:
    I don't know about ARGF.

    Conclusion:
    ARGF is esoteric.

    >
    > Compare the number of references to ARGV and ARGF in the pickaxe book
    > too: ARGF is only mentioned three times in the entire book according to
    > the index.



    Pickaxe (1st edition), page 16:

    The "Ruby way" to write this would be to use an iterator:

    ARGF.each { |line| print line if line =~ /Ruby/ }

    on page 219 under the heading "Standard Objects" these are listed:
    ARGF, ARGV, ENV, false, nil, self, true

    page 217 explains ARGF's synonym, $<.


    "Teach Yourself Ruby in 21 Days" explains ARGF in Day 8 on
    page 173 and uses it in the final two solutions to a problem.
    The penultimate one is

    has_a_long_word = /\w{5,}/
    ARGF.each{|line| print line unless has_a_long_word =~ line}

    Matz himself in "Ruby in a Nutshell" explains it on page 38
    and lists it as one of 14 predefined global constants.


    > One of those is in a grey "you can skip this" section talking
    > about Perlisms,


    For that the authors should be afflicted with the Spell of
    Forlorn Encystment.


    ------------
    ------------


    Usage tip: the name of the file currently being read is available as
    $FILENAME or as shown in this example:

    ruby -e 'ARGF.each{|x| print ARGF.filename + ", " + x }' file1 file2
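A script-sized version of that tip, for anyone who wants to experiment: the helper name tagged_lines and the temporary files are assumptions for this sketch, and ARGV is replaced by hand to stand in for real command-line arguments.

```ruby
require 'tmpdir'

# Hypothetical helper: read every line of every path through ARGF and
# prefix each line with the name of the file it came from.
def tagged_lines(paths)
  ARGV.replace(paths)
  tagged = []
  ARGF.each_line do |line|
    tagged << "#{File.basename(ARGF.filename)}: #{line.chomp}"
  end
  tagged
end

Dir.mktmpdir do |dir|
  first  = File.join(dir, "file1")
  second = File.join(dir, "file2")
  File.write(first,  "alpha\n")
  File.write(second, "beta\ngamma\n")
  puts tagged_lines([first, second])
end
```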
    William James, Sep 14, 2005
    #13
  14. Hi --

    On Wed, 14 Sep 2005, mathew wrote:

    > William James wrote:
    >> ARGF is no more esoteric than ARGV, [...]

    >
    > I disagree. ARGV is familiar to anyone who's ever written C, C++,
    > Objective-C, Java, Perl, AWK, Python, Scheme, ...
    >
    > ARGF is not. I'd never heard of it until this thread.


    You make it sound like learning something from a ruby-talk thread is
    bad :)

    > Compare the number of references to ARGV and ARGF in the pickaxe book too:
    > ARGF is only mentioned three times in the entire book according to the index.
    > One of those is in a grey "you can skip this" section talking about Perlisms,
    > the second is under a big "ARGC" heading where it's mentioned in passing, and
    > the real discussion isn't until page 336.


    That doesn't mean it's esoteric. It just means it's discussed on page
    336. Something has to be :)


    David

    --
    David A. Black
    David A. Black, Sep 14, 2005
    #14
  15. David A. Black <> wrote:
    >
    > That doesn't mean it's esoteric. It just means it's discussed on page
    > 336. Something has to be :)


    Nicely put :)

    martin
    Martin DeMello, Sep 17, 2005
    #15