Adventures in html decoding.

Morgan

From the "If you want it done right, do it yourself... maybe"
department.

Today I was looking at a webpage that used HTML encoding
(i.e., "&#97;" in place of "a") to obfuscate much of its contents.
This displeased me for several reasons. (Not the least of
which was a standing order from General Principles.)

So I looked around online for a web-based tool that I could
paste the text into and get back a more useful form. But
everything I found either didn't work, or just didn't convert
ordinary letters.

So I said to heck with it, I can write something to do this
myself.

I didn't use CGI for two reasons. 1) I remember the last time I
tried experimenting with CGI, and had to severely hack the
library to get it to let me use html generation methods in a
non-server environment. 2) The description of unescapeHTML
sounded as though it would only unescape the special characters
that have to be escaped.

So, I ended up with this:


===
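# Refuse to overwrite an existing output file (File::EXCL).
outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
IO.readlines(ARGV[0]).each{ |line|
  begin
    # Decode decimal entities below 256; leave anything else untouched.
    outfile.puts line.gsub(/&#(\d+);/) { |x|
      if $1.to_i < 256
        $1.to_i.chr
      else
        x
      end
    }
  rescue
    # On any error, pass the line through unchanged and echo it.
    outfile.puts line
    puts line
  end
}
outfile.close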
===

And it worked.
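
For anyone who wants to try the substitution on a plain string rather than on files, here is a minimal sketch of the same idea. The method name decode_numeric_entities is made up for illustration, and passing Encoding::UTF_8 to Integer#chr assumes Ruby 1.9 or later, which the script above does not require. Like the script, it leaves any code of 256 or more untouched.

```ruby
# Sketch only: decode decimal character references such as "&#97;".
# decode_numeric_entities is a hypothetical helper name, not a library method.
def decode_numeric_entities(text)
  text.gsub(/&#(\d+);/) do |match|
    code = Regexp.last_match(1).to_i
    # Same cutoff as the script above: anything >= 256 is left as-is.
    code < 256 ? code.chr(Encoding::UTF_8) : match
  end
end

puts decode_numeric_entities("&#72;&#101;&#108;&#108;&#111;, &#8364;5")
```

The euro sign reference (8364) is passed through unchanged here, which is exactly the limitation of the cutoff at 256.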

Then I thought of looking at the source of unescapeHTML, and
found that the description, or my interpretation of it, was wrong.
Not only would it handle all the escaped ASCII characters, it was
a class method, so I didn't need to deal with the environment
issues.

Which led to...

===
require 'cgi'
outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
IO.readlines(ARGV[0]).each{ |line|
  outfile.puts CGI::unescapeHTML(line)
}
outfile.close
===

Which is much simpler; just some file handling stuff around
the unescapeHTML function. Maybe later I'll try something with
rubywebdialogs that'll let me paste into a web browser window
and get back results the way I'd like to be able to do...
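
To see what the class method does without any file handling, a small sketch on a plain string may help. The wrapper name decode_html is hypothetical and exists only so the call is easy to try. The sample string mixes decimal references with the usual named entities.

```ruby
require 'cgi'

# Thin wrapper so the call can be tried on a plain string.
# decode_html is a hypothetical name used only for this sketch.
def decode_html(text)
  CGI.unescapeHTML(text)
end

puts decode_html("&#72;&#105; &lt;b&gt;there&lt;/b&gt; &amp; bye")
```

Because unescapeHTML is called on the CGI class itself, no CGI object and no server environment is needed.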

The moral of this story is, html obfuscation sucks.

(What? That's *not* the moral? Oh well...)

-Morgan
 
Jim Freeze

===
require 'cgi'
outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
IO.readlines(ARGV[0]).each{ |line|
  outfile.puts CGI::unescapeHTML(line)
}
outfile.close
===

This may be off topic, but I always wonder why all the flags to File.
Could what you are doing be written as:

File.open(ARGV[1], "w") { |outfile|
  File.foreach(ARGV[0]) { |line|
    outfile.puts CGI::unescapeHTML(line)
  }
}

or am I missing something big here?
--
Jim Freeze
 
Simon Kröger

Morgan said:
[...]
Which lead to...

===
require 'cgi'
outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
IO.readlines(ARGV[0]).each{ |line|
outfile.puts CGI::unescapeHTML(line)
}
outfile.close
===

Which is much simpler; just some file handling stuff around
the unescapeHTML function. Maybe later I'll try something with
rubywebdialogs that'll let me paste into a web browser window
and get back results the way I'd like to be able to do...

The moral of this story is, html obfuscation sucks.

(What? That's *not* the moral? Oh well...)

-Morgan

the moral is, there is always a simpler way :)

require 'cgi'
open(ARGV[1], 'w') do |f|
  f.write(CGI::unescapeHTML(IO.read(ARGV[0])))
end

cheers

Simon
 
Morgan

Jim said:
This may be off topic, but I always wonder why all the flags to File.
Could what you are doing be written as:

File.open(ARGV[1], "w") { |outfile|
File.foreach(ARGV[0]) { |line|
outfile.puts CGI::unescapeHTML(line)
}
}

or am I missing something big here?

Well, in this case, I don't believe it's possible to
get the effect of File::EXCL (which basically amounts to
"don't overwrite an existing file") with a string as the open
mode. There are some other combinations of parameters
that are also difficult (impossible) to achieve that way.
(I don't remember exactly what it was, but I think it had to
do with a file that was being opened for reading and writing.
All the strings I tried either wouldn't let me access parts of
an existing file, or otherwise failed to perform as I required.)
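
A minimal sketch of the "don't overwrite" behaviour described above, assuming a local filesystem. The helper name create_without_clobbering is hypothetical. It relies on File::EXCL combined with File::CREAT raising Errno::EEXIST when the file is already there.

```ruby
require 'tmpdir'

# Hypothetical helper: create path only if it does not exist yet.
# Returns true when the file was created, false when it was already there.
def create_without_clobbering(path, text)
  flags = File::CREAT | File::WRONLY | File::EXCL
  File.open(path, flags) { |f| f.puts text }
  true
rescue Errno::EEXIST
  false
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "out.txt")
  p create_without_clobbering(path, "first")   # file is new
  p create_without_clobbering(path, "second")  # file already exists
  puts File.read(path)                         # still holds the first write
end
```

With a plain "w" mode string the second call would instead truncate the file and replace its contents.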

-Morgan
 
Ara.T.Howard

Jim said:
This may be off topic, but I always wonder why all the flags to File.
Could what you are doing be written as:

File.open(ARGV[1], "w") { |outfile|
File.foreach(ARGV[0]) { |line|
outfile.puts CGI::unescapeHTML(line)
}
}

or am I missing something big here?

Well, in this case, I don't believe it's possible to
get the effect of File::EXCL (which basically amounts to
"don't overwrite an existing file") with a string as the open
mode. There are some other combinations of parameters
that are also difficult (impossible) to achieve that way.
(I don't remember exactly what it was, but I think it had to
do with a file that was being opened for reading and writing.
All the strings I tried either wouldn't let me access parts of
an existing file, or otherwise failed to perform as I required.)

O_EXCL is broken on nfs:

O_EXCL When used with O_CREAT, if the file already exists it is an error
and the open will fail. In this context, a symbolic link exists, regardless of
where it points to. O_EXCL is broken on NFS file systems, programs which
rely on it for performing locking tasks will contain a race condition. The
solution for performing atomic file locking using a lockfile is to
create a unique file on the same fs (e.g., incorporating hostname and pid),
use link(2) to make a link to the lockfile. If link() returns 0, the lock
is successful. Otherwise, use stat(2) on the unique file to check if its
link count has increased to 2, in which case the lock is also successful.
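
The recipe in that man page excerpt could be sketched in Ruby roughly as follows. This is an illustration under assumptions, not a vetted locking library: try_lock is a hypothetical name, the unique file name format is invented, and it assumes a filesystem that supports hard links.

```ruby
require 'socket'
require 'tmpdir'

# Sketch of the link(2) recipe quoted above; try_lock is a hypothetical name.
# Returns true if this process now holds lockfile, false otherwise.
def try_lock(lockfile)
  # Unique file on the same filesystem, named after host and pid.
  unique = "#{lockfile}.#{Socket.gethostname}.#{Process.pid}"
  File.open(unique, File::CREAT | File::WRONLY | File::EXCL) { |f| f.puts Process.pid }
  begin
    File.link(unique, lockfile)      # hard link; raises if lockfile exists
    true
  rescue SystemCallError
    # link() reported failure; the man page says to trust the link count.
    File.stat(unique).nlink == 2
  ensure
    File.unlink(unique)              # the lockfile itself stays behind
  end
end

Dir.mktmpdir do |dir|
  lock = File.join(dir, "job.lock")
  p try_lock(lock)  # first attempt
  p try_lock(lock)  # lock is already held
end
```

Releasing the lock would simply be a matter of deleting the lockfile once the work is done.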


fyi.

-a
--
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells amoung the causes of death
| Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================
 
Morgan

Ara.T.Howard said:
O_EXCL is broken on nfs:

O_EXCL When used with O_CREAT, if the file already exists it is an error
and the open will fail. In this context, a symbolic link exists, regardless of
where it points to. O_EXCL is broken on NFS file systems, programs which
rely on it for performing locking tasks will contain a race condition. The
solution for performing atomic file locking using a lockfile is to
create a unique file on the same fs (e.g., incorporating hostname and pid),
use link(2) to make a link to the lockfile. If link() returns 0, the lock
is successful. Otherwise, use stat(2) on the unique file to check if its
link count has increased to 2, in which case the lock is also successful.

... And I barely understood a word of that. `.`

Does that mean it won't properly perform the "don't clobber an existing file"
purpose I'm using it for?

-Morgan
 
Ara.T.Howard

... And I barely understood a word of that. `.`

Does that mean it won't properly perform the "don't clobber an existing file"
purpose I'm using it for?

it means that O_EXCL fails silently on some kinds of filesystems, including
nfs. this is not likely to affect you and is beyond the control of ruby (it's
the c library/fs fault) but, if it does affect you, it means that two
instances of the code, when run at the same time, would __both__ be writing to
the file at the same time - neither would have an exclusive lock on the file
as it would not be created atomically. basically you can ignore this if you
are working on local disk - but if you are on some sort of shared setup like nfs
or the windows equivalent, be wary.

cheers.

-a
 
William James

Simon said:
[...]
Which lead to...

===
require 'cgi'
outfile = File.new(ARGV[1], File::CREAT|File::WRONLY|File::EXCL)
IO.readlines(ARGV[0]).each{ |line|
outfile.puts CGI::unescapeHTML(line)
}
outfile.close
===

Which is much simpler; just some file handling stuff around
the unescapeHTML function. Maybe later I'll try something with
rubywebdialogs that'll let me paste into a web browser window
and get back results the way I'd like to be able to do...

The moral of this story is, html obfuscation sucks.

(What? That's *not* the moral? Oh well...)

-Morgan

the moral is, there is always a simpler way :)

require 'cgi'
open(ARGV[1], 'w') do |f|
f.write(CGI::unescapeHTML(IO.read(ARGV[0])))
end

cheers

Simon

Simpler still:

require 'cgi'
open(ARGV.pop, 'w') { |f|
  f.write(CGI.unescapeHTML(ARGF.read))
}
 
Morgan

William James said:
Simpler still:

require 'cgi'
open(ARGV.pop, 'w') { |f|
f.write(CGI.unescapeHTML(ARGF.read))
}

I think you might have reached the point where
simpler is more complex... I'm not sure I'd know
what that code was supposed to do if it wasn't
something I wrote being reduced.

*never even -seen- ARGF before*

-Morgan
 
William James

Jim said:
ARGF is a reference to $stdin.

An object providing access to virtual concatenation of files
passed as command-line arguments or standard input if there
are no command-line arguments. -- Ruby in a Nutshell

ARGF is no more esoteric than ARGV, and it's quite handy.
Let's say you want to process every line of every file
on the command-line:

ruby -e 'ARGF.each_line{|x| p x}' file1 file2 file3
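
The same behaviour can be tried from inside a script. The sketch below is only a demonstration: read_concatenated is a hypothetical helper, and overwriting ARGV by hand stands in for passing the file names on the command line.

```ruby
require 'tmpdir'

# Hypothetical helper for demonstration: point ARGV at the given paths and
# let ARGF read them as one stream, as it would for command-line arguments.
def read_concatenated(paths)
  ARGV.replace(paths)
  ARGF.read
end

Dir.mktmpdir do |dir|
  first  = File.join(dir, "file1")
  second = File.join(dir, "file2")
  File.write(first, "one\n")
  File.write(second, "two\n")
  print read_concatenated([first, second])
end
```

This is also why ARGV.pop works in the shortened script earlier in the thread: once the output name is removed from ARGV, ARGF only sees the remaining input files.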
 
Gavin Kistner

ARGF is no more esoteric than ARGV, and it's quite handy.
Let's say you want to process every line of every file
on the command-line:

ruby -e 'ARGF.each_line{|x| p x}' file1 file2 file3

Damn, that *is* handy! I love this list.
 
William James

mathew said:
William said:
ARGF is no more esoteric than ARGV, [...]

I disagree. ARGV is familiar to anyone who's ever written C, C++,
Objective-C, Java, Perl, AWK, Python, Scheme, ...
ARGF is not.

These are not familiar to everyone who's ever written in C or Awk:

class, map, join, __END__, DATA, <<HERE, grep, flatten

But that doesn't prove they are esoteric to those who use Ruby.

mathew said:
I'd never heard of it until this thread.

Major premise:
I know everything about Ruby except that which is esoteric.

Minor premise:
I don't know about ARGF.

Conclusion:
ARGF is esoteric.

mathew said:
Compare the number of references to ARGV and ARGF in the pickaxe book
too: ARGF is only mentioned three times in the entire book according to
the index.


Pickaxe (1st edition), page 16:

The "Ruby way" to write this would be to use an iterator:

ARGF.each { |line| print line if line =~ /Ruby/ }

on page 219 under the heading "Standard Objects" these are listed:
ARGF, ARGV, ENV, false, nil, self, true

page 217 explains ARGF's synonym, $<.


"Teach Yourself Ruby in 21 Days" explains ARGF in Day 8 on
page 173 and uses it in the final two solutions to a problem.
The penultimate one is

has_a_long_word = /\w{5,}/
ARGF.each{|line| print line unless has_a_long_word =~ line}

Matz himself in "Ruby in a Nutshell" explains it on page 38
and lists it as one of 14 predefined global constants.

mathew said:
One of those is in a grey "you can skip this" section talking
about Perlisms,

For that the authors should be afflicted with the Spell of
Forlorn Encystment.


------------
------------


Usage tip: the name of the file currently being read is available as
$FILENAME or as shown in this example:

ruby -e 'ARGF.each{|x| print ARGF.filename + ", " + x }' file1 file2
 
David A. Black

Hi --

mathew said:
William said:
ARGF is no more esoteric than ARGV, [...]

I disagree. ARGV is familiar to anyone who's ever written C, C++,
Objective-C, Java, Perl, AWK, Python, Scheme, ...

ARGF is not. I'd never heard of it until this thread.

You make it sound like learning something from a ruby-talk thread is
bad :)

mathew said:
Compare the number of references to ARGV and ARGF in the pickaxe book too:
ARGF is only mentioned three times in the entire book according to the index.
One of those is in a grey "you can skip this" section talking about Perlisms,
the second is under a big "ARGC" heading where it's mentioned in passing, and
the real discussion isn't until page 336.

That doesn't mean it's esoteric. It just means it's discussed on page
336. Something has to be :)


David
 
