Decompressing LZW compression from PDF file

Ahmad Azizan

Hello,

I'm trying to find a Ruby module or code that can decompress
LZW-compressed streams from a PDF file. However, as far as I know, no
such code or module exists publicly.

PDF usually encodes its stream data using FlateDecode,
ASCIIHexDecode, ASCII85Decode, and LZWDecode. In Ruby, FlateDecode and
ASCII85Decode can be decoded with existing Ruby modules, namely
zlib and Ascii85. For ASCIIHexDecode, I just need to convert hex
characters to chars. My problem arises from LZWDecode, since there is
no module or code to decompress it.
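
For reference, the two filters that need nothing beyond the standard library can be sketched as below. This is only an illustration: flate_decode and ascii_hex_decode are hypothetical helper names, not part of any library, and the hex decoder assumes the stream follows the usual ASCIIHexDecode conventions (whitespace ignored, ">" marks end of data, a missing final digit counts as 0).

```ruby
require "zlib"

# Hypothetical helper: inflate a FlateDecode stream with the bundled zlib.
def flate_decode(data)
  Zlib::Inflate.inflate(data)
end

# Hypothetical helper: decode an ASCIIHexDecode stream.
def ascii_hex_decode(data)
  hex = data[/\A[^>]*/]             # keep everything before the ">" marker
  hex = hex.delete("^0-9A-Fa-f")    # drop whitespace and other non-hex bytes
  [hex].pack("H*")                  # an odd trailing digit is padded with 0
end

puts ascii_hex_decode("48 65 6C 6C 6F>")   # prints Hello
```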

Since there is no code example implementing LZW decompression in
Ruby, I found an implementation in Python. However,
translating Python into Ruby has turned out to be a painful process.

A working example of LZW decompression in Python is here:
http://pastebin.ca/1849009
My translated code in Ruby is here: http://pastebin.ca/1849012

With a small input, I can decompress it and get the same output
as the Python code.
e.g.:
Python
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
tmp = LZWDecode(data)
print tmp

Ruby
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
lzw = LZWDecoder.new(data)
puts lzw.run()

However, with a real stream from a PDF file, I cannot get the decompressed
output. I guess it might be an error in the code or improper handling
of special characters in Ruby.
I've spent many hours and days working out how to decompress an LZW
stream and trying to translate the code from Python to Ruby, but my
efforts so far have not paid off. I really hope someone can
help by pointing me to a hint or a solution for this problem.

Thank you
 
Ryan Davis

With a small input, I can decompress the it to get the equivalent output
like the python code.
e.g:
Python
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
tmp = LZWDecode(data)
print tmp

data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
lzw = LZWDecoder.new(data)
puts lzw.run()

However, with a real stream from PDF file, I cannot get the decompressed
output. I guess it might be some error in the code or improper handling
of special character in ruby.

Can you get the Python code to decode the real stream? That'd be one way
to determine if the original data is corrupt or not.
 
Brian Candler

Example of working LZW decompression in python is here:

Which version of Ruby are you using? If it's 1.9 then your @fp[@inc] may
fall foul of the character encoding rules, because indexing a String
returns a character rather than a byte. Try this in your initialize:

puts @fp.encoding
@fp.force_encoding("ASCII-8BIT")

However if you pass in a StringIO rather than a String then you can just
copy what python is doing:

x = @fp.read(1)               # nil at end of input
@buff = x.unpack("C").first   # byte value as an Integer (0..255)

and read(1) always reads single bytes. This has the advantage of being
able to decompress directly from files, without reading them into RAM
first.
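
To make the encoding point concrete, here is a small sketch of the difference between character indexing and byte-wise reading in Ruby 1.9 and later. The variable names are made up for the illustration, and the two-character sample string is an assumption chosen only because its first character takes two bytes in UTF-8.

```ruby
require "stringio"

text = "\u00e9A"                               # "éA": 2 characters, 3 bytes
first_char = text[0]                           # a whole character (2 bytes)

raw = text.dup.force_encoding("ASCII-8BIT")    # same bytes, binary encoding
first_byte = raw[0]                            # now exactly one byte

# Reading through StringIO yields single bytes regardless of encoding.
io = StringIO.new(raw)
values = []
while (chunk = io.read(1))                     # read(1) returns nil at EOF
  values << chunk.unpack("C").first
end

puts first_char.bytesize   # prints 2
puts first_byte.bytesize   # prints 1
p values                   # prints [195, 169, 65]
```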

Minor suggestion: it might be more rubyish to return nil rather than
raise EOFError, which would simplify your run loop to

result = ""
while code = readbits(@nbits)
  result << feed(code)
end
return result

Regards,

Brian.
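
For anyone landing on this thread later, a self-contained decoder can be sketched as follows. This is not the code from either pastebin link; it is a minimal illustration that assumes the PDF defaults for LZWDecode (codes packed most significant bit first, starting at 9 bits, 256 = clear table, 257 = end of data, EarlyChange = 1). The name lzw_decode and the constants are hypothetical. It reproduces the nine-byte sample from the original post, but the code-width switching has not been exercised against a large real-world stream here.

```ruby
CLEAR_TABLE = 256
END_OF_DATA = 257
MAX_ENTRIES = 4096   # codes never grow beyond 12 bits

# Hypothetical helper: decode a PDF LZWDecode stream (EarlyChange = 1 assumed).
def lzw_decode(data)
  out   = String.new   # empty binary (ASCII-8BIT) string
  table = nil          # code => byte string, built on the first clear code
  prev  = nil          # string emitted for the previous code
  nbits = 9            # current code width in bits
  acc   = 0            # unread bits, most significant first
  nacc  = 0            # number of unread bits held in acc

  data.each_byte do |byte|
    acc = (acc << 8) | byte
    nacc += 8
    while nacc >= nbits
      nacc -= nbits
      code = acc >> nacc
      acc &= (1 << nacc) - 1

      if code == CLEAR_TABLE
        # Codes 0..255 are literal bytes; 256 and 257 are reserved.
        table = (0..255).map { |i| [i].pack("C") } + [nil, nil]
        prev  = nil
        nbits = 9
        next
      end
      return out if code == END_OF_DATA
      raise ArgumentError, "stream does not start with a clear code" if table.nil?

      entry =
        if prev.nil? || code < table.size
          table[code]
        elsif code == table.size
          prev + prev[0, 1]   # code is the entry about to be created
        end
      raise ArgumentError, "invalid LZW code #{code}" if entry.nil?

      table << prev + entry[0, 1] if prev && table.size < MAX_ENTRIES
      out << entry
      prev = entry
      # EarlyChange = 1: widen one code before the table reaches 2**nbits.
      nbits += 1 if nbits < 12 && table.size + 1 >= (1 << nbits)
    end
  end
  out   # tolerate a stream that lacks the end-of-data code
end

sample = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
puts lzw_decode(sample)   # prints -----A---B
```

Because it iterates with each_byte, the sketch never indexes the string by character, so the Ruby 1.9 encoding issue mentioned above does not come into play.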
 
