Decompressing LZW compression from PDF file

Ahmad Azizan · Mar 22, 2010

Hello,

I'm trying to find a Ruby module or code that can decompress
LZW-compressed streams from a PDF file. However, as far as I know, no
such code or module is publicly available.

PDF usually encodes its stream data using FlateDecode,
ASCIIHexDecode, ASCII85Decode, and LZWDecode. In Ruby, FlateDecode and
ASCII85Decode can be decoded with existing modules, namely zlib and
Ascii85. For ASCIIHexDecode, I just need to convert hex digits to
characters. My problem arises with LZWDecode, since there is no module
or code to decompress it.
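
For the filters that are already covered, a minimal sketch might look like the following. The helper names `decode_flate` and `decode_ascii_hex` are made up for illustration; `Zlib::Inflate.inflate` comes from Ruby's standard library, and the hex handling assumes the ASCIIHexDecode convention that whitespace is ignored and `>` marks end of data. ASCII85Decode is left to the Ascii85 gem mentioned above.

```ruby
require "zlib"

# Hypothetical helper: inflate a FlateDecode stream (zlib/deflate format).
def decode_flate(data)
  Zlib::Inflate.inflate(data)
end

# Hypothetical helper: turn ASCIIHexDecode data into raw bytes.
# Whitespace is ignored, ">" marks end of data, and an odd number of
# hex digits is padded with a trailing "0".
def decode_ascii_hex(data)
  hex = data[/\A[^>]*/].gsub(/\s+/, "")
  hex << "0" if hex.length.odd?
  [hex].pack("H*")
end

puts decode_flate(Zlib::Deflate.deflate("hello PDF"))  # round trip
puts decode_ascii_hex("48 65 6C6C 6F>")
```

Neither helper needs to know anything about the stream contents, which is why LZWDecode is the only filter here that requires real decoding logic.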

Since there is no example of LZW decompression in Ruby, I found an
implementation in Python. However, translating the Python into Ruby has
turned out to be a painful process.

A working example of LZW decompression in Python is here:
http://pastebin.ca/1849009
My translated code in Ruby is here: http://pastebin.ca/1849012

With a small input, I can decompress it and get the same output as the
Python code, e.g.:

Python:
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
tmp = LZWDecode(data)
print tmp

Ruby:
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
lzw = LZWDecoder.new(data)
puts lzw.run()
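
To see what a decoder has to do with this sample, the nine bytes can be split into codes by hand. The sketch below assumes the PDF LZWDecode conventions (codes packed most significant bit first, starting at 9 bits wide, with 256 as clear-table and 257 as end-of-data). It only works for streams short enough that the code width never grows.

```ruby
# The sample stream from the post, as a binary (ASCII-8BIT) string.
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01".b

# Unpack to a string of "0"/"1" characters (most significant bit first),
# then cut it into 9-bit groups and convert each group to an integer.
bits  = data.unpack("B*").first
codes = bits.scan(/.{9}/).map { |group| group.to_i(2) }

p codes  # => [256, 45, 258, 258, 65, 259, 66, 257]
```

Read as LZW, 256 resets the table, 45 is "-", 258 and 259 are table entries built along the way, and 257 ends the data, which gives "-----A---B".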

However, with a real stream from a PDF file, I cannot get the
decompressed output. I guess there is an error in the code, or special
characters are being handled improperly in Ruby.
I've spent many hours and days working out how to decompress an LZW
stream and trying to translate the Python into Ruby, but my effort so
far has not led anywhere. I really hope someone can point me to a hint
or a solution for this problem.

Thank you

Ryan Davis · Mar 22, 2010

With a small input, I can decompress the it to get the equivalent output
like the python code.
e.g:
Python
data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
tmp = LZWDecode(data)
print tmp

data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01"
lzw = LZWDecoder.new(data)
puts lzw.run()

However, with a real stream from PDF file, I cannot get the decompressed
output. I guess it might be some error in the code or improper handling
of special character in ruby.

Can you get the python code to decode the real stream? That'd be one way
to determine if the original data is corrupt or not.

Brian Candler · Mar 22, 2010

Example of working LZW decompression in python is here:

http://pastebin.ca/1849009
My translated code in ruby is here: http://pastebin.ca/1849012

Which version of ruby are you using? If it's 1.9 then your @fp[@inc] may
fall foul of the character encoding rules. Try this in your initialize:

puts @fp.encoding
@fp.force_encoding("ASCII-8BIT")
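
The difference being pointed at here can be seen in a few lines. This is an illustrative sketch, not part of the pastebin code, and the variable name `raw` is made up.

```ruby
raw = "\x80\x0b\x60".dup

# On Ruby 1.9 and later, String#[] returns a one-character String,
# not an Integer as it did on 1.8, so bit arithmetic on it fails.
p raw[0].class

# Tagging the data as binary makes every index address exactly one byte.
raw.force_encoding("ASCII-8BIT")
p raw.encoding

# getbyte (or bytes) yields plain integers regardless of encoding.
p raw.getbyte(0)  # 128
p raw.bytes       # [128, 11, 96]
```

If the decoder only ever touches the input through `getbyte`, the encoding of the string stops mattering.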

However if you pass in a StringIO rather than a String then you can just
copy what python is doing:

x = @fp.read(1)
@buff = x.unpack("C").first

and read(1) always reads single bytes. This has the advantage of being
able to decompress directly from files, without reading them into RAM
first.
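
A small sketch of that byte-at-a-time approach, assuming the input is wrapped in a StringIO (a File opened with mode "rb" should behave the same way). The helper name `next_byte` is hypothetical.

```ruby
require "stringio"

# Hypothetical helper: return the next byte as an Integer, or nil at EOF.
def next_byte(io)
  chunk = io.read(1)         # read(1) returns nil once the stream is exhausted
  chunk && chunk.getbyte(0)
end

io = StringIO.new("\x80\x0b".b)
p next_byte(io)  # 128
p next_byte(io)  # 11
p next_byte(io)  # nil
```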

Minor suggestion: it might be more rubyish to return nil rather than
raise EOFError, which would simplify your run loop to

result = ""
while code = readbits(@nbits)
  result << feed(code)
end
return result
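
Putting the pieces together, a decoder built around a nil-returning `readbits` might look like the sketch below. It is not the pastebin code: the class name `LZWSketch` and its internals are made up, and it assumes the PDF LZWDecode conventions (codes packed most significant bit first, 9 to 12 bits wide, 256 = clear table, 257 = end of data, default EarlyChange of 1).

```ruby
require "stringio"

# Sketch of a PDF-style LZW decoder under the assumptions stated above.
class LZWSketch
  CLEAR = 256
  EOD   = 257

  def initialize(io)
    @io = io
    @bitbuf = 0   # bits read from the stream but not yet consumed
    @bitcnt = 0   # number of valid bits in @bitbuf
    reset_table
  end

  def run
    result = "".b
    while (code = readbits(@nbits))
      break if code == EOD
      result << feed(code)
    end
    result
  end

  private

  def reset_table
    @table = (0..255).map { |i| [i].pack("C") }  # single-byte entries
    @table << nil << nil                          # slots for 256 and 257
    @nbits = 9
    @prev  = nil
  end

  # Returns the next n-bit code, or nil when the input runs out.
  def readbits(n)
    while @bitcnt < n
      chunk = @io.read(1)
      return nil if chunk.nil?
      @bitbuf = (@bitbuf << 8) | chunk.getbyte(0)
      @bitcnt += 8
    end
    @bitcnt -= n
    code = @bitbuf >> @bitcnt
    @bitbuf &= (1 << @bitcnt) - 1
    code
  end

  # Translates one code into output bytes and grows the table.
  def feed(code)
    if code == CLEAR
      reset_table
      return ""
    end
    entry = @table[code]
    # A code equal to the next free slot means "previous string plus
    # its own first byte" (the entry is not in the table yet).
    entry = @prev + @prev[0] if entry.nil? && @prev && code == @table.size
    raise ArgumentError, "invalid LZW code #{code}" if entry.nil?
    @table << (@prev + entry[0]) if @prev && @table.size < 4096
    # EarlyChange = 1: widen the code one table entry early.
    case @table.size
    when 511  then @nbits = 10
    when 1023 then @nbits = 11
    when 2047 then @nbits = 12
    end
    @prev = entry
    entry
  end
end

data = "\x80\x0b\x60\x50\x22\x0c\x0c\x85\x01".b
puts LZWSketch.new(StringIO.new(data)).run  # prints -----A---B
```

Because the reader pulls bytes with `read(1)` and `getbyte`, the same class should accept a File opened in binary mode, and truncated input simply ends the loop instead of raising.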

Regards,

Brian.
