reformatting a text file that has some binary in it

Adam Akhtar · Apr 15, 2009

I have never worked with binary before, and after trying to solve this
problem for 3 hours I'm turning to the community for help.

I have a text file whose entries consist of a key written in binary and
its values written as strings (you can see an excerpt below).

I need to parse the binary and transform it into human-readable hex and
parse its associated info. My regexps don't seem to be behaving, and I'm
wondering if it's me or if it's this binary text that is causing mischief
somehow. Here's a sample item:

20: ç®ç¥ºï½±ãƒ»ãƒ»ï½·ãƒ»GèŠã¾d8:completei7e10:downloadedi2046e10:incompletei1ee

binary parts are always enclosed between "20:" and "d8:complete" where
the 8 can be any integer(s) e.g. 5 or 23.

str = File.open('textfile.txt' , 'r').readlines.join
str.gsub!(/(20:)(.*?)(d\d+:)/m) do |x|
  $1 + $2.unpack('H*').join + $3
end

The above works for some but not all of the text. It seems to go beyond
the "d8:complete" marker.

Here's a bigger sample set if need be. Any tips or pointers would be
greatly appreciated.

d8:completei2e10:downloadedi770e10:incompletei1ee20:
ç®ç¥ºï½±ãƒ»ãƒ»ï½·ãƒ»GèŠã¾d8:completei7e10:downloadedi2046e10:incompletei1ee20:
6çš€ï¾ˆãƒ»ï£³ æ¶‡jï½ï£²ãƒ»wãƒ»d8:completei0e10:downloadedi72602e10:incompletei1ee20:
}tï½ªï½½h>ï½¥ï¾“é€æ¦Žï¾—î‚¹d8:completei3e10:downloadedi7718e10:incompletei2ee20: æž¶C
ï¾‹ï½³J<ï½¨Fï¾—0ï¾‰çŠ’Wd8:completei2e10:downloadedi617e10:incompletei0ee20:
incï¾‰ãƒ»U]~é¼¡éåƒ˜< d8:completei3e10:downloadedi533e10:incompletei0ee20:
ï½©<Zè¿š<0ï½¹_!Y/î»ï¾ž3d8:completei1e10:downloadedi281e10:incompletei0ee20:
6ï¾•î•’iï£²å‰ï£²ãƒ»ç…¤ï½¯ãƒ»ï½µGd8:completei0e10:downloadedi216e10:incompletei1ee20:
Iï¾ˆlè«‡ï½«ï½¦7Z&ï¾šãƒ»Kï¾žå»ºï¾‰Ed8:completei4e10:downloadedi262e10:incompletei3ee20:
Smï£²ï¾…ï½¿ï¾Žæ€¯ï½¿ï¾‹ï½¨h7rãƒ»é†‹d8:completei3e10:downloadedi787e10:incompletei0ee20:
Yjï½½ï½©ï¾Žd ï¾‹ï£° îŽï¾–]è±†ud8:completei0e10:downloadedi154e10:incompletei1ee20:
bjãƒ»]VFãƒ»wä»é±§åï¾‰d8:completei10e10:downloadedi505e10:incompletei16ee20:
hãƒ»ï¾æ£\MDaî‹ž3ï¾ï½ºd8:completei2e10:downloadedi1050e10:incompletei2ee20:
hï¾ãƒ»ï¾”î–=ï½©îŠ¢ï½¿æ•é™ï½¢å·Œd8:completei1e10:downloadedi57e10:incompletei2ee20:
mbâ€˜<GSï½·bï½µï½©qTãƒ»?d8:completei1e10:downloadedi3860e10:incompletei1ee20:
uâ‰¡ï£²axBï¾†z<ç¸Š3d8:completei3e10:downloadedi700e10:incompletei7ee20:
u[ç´„ãƒ»ï¾‘æ³±@2 ï½³4ï½§ï¾„d8:completei0e10:downloadedi658e10:incompletei3ee20:
ãƒ»ãƒ»aãƒ»$ï£±ï¾#ãƒ»3!COd8:completei3e10:downloadedi304e10:incompletei0ee20:
æ§ãƒ»Gï¾“_å»°å¶Zï½¤7}d8:completei3e10:downloadedi2285e10:incompletei2ee20:
å…¥ï½»ãƒ»î¼ï½¢ï£²ï¾ŽGaï½¦`qîš¬vBd8:completei6e10:downloadedi1061e10:incompletei5ee20:
ç‰Ÿï½´?eî‚†ï½©ï½¹ï½¿æ¯ãƒ»ï½©ï¾„jãƒ»d8:completei3e10:downloadedi2902e10:incompletei1ee20:
ï£°ï½¾ï½µï½¯ãƒ»Mï½´_Lå®‹ï¾…ãƒ»ï½°ï¾ld8:completei1e10:downloadedi147e10:incompletei1ee20:
ï½¨ï¾šY#å„„Uï¾Ž "ï½°Fçµ–bï¾‡awd8:completei6e10:downloadedi39010e10:incompletei2ee20:
ï½©ãƒ»ï½«ï¾‹2ï½¤îŠ²1ãƒ»tï½½ç¨˜8:completei7e10:downloadedi1835e10:incompletei0ee20:
ï½®3ãƒ»ãƒ»Lï½» ï½´)ï½»ï¾“.ï£±Gå–®8:completei2e10:downloadedi474e10:incompletei0ee20:
ï½´)POéœ™
ï½µï½¸ï½¦&ãƒ»%Lç‹Žè‹§d8:completei4e10:downloadedi3674e10:incompletei0ee20:
ï¾ï¾žï½ï¾E\GTçž¹ï¾Ÿãƒ»ç¿¡ï½ªï½²ï½®d8:completei2e10:downloadedi328e10:incompletei0ee20:
ï¾„è£”ãƒ»(ãƒ»éš‹Kç ¡ï½¾ï¾ˆï¾Šï½¨ï¾‚d8:completei43e10:downloadedi9665e10:incompletei31ee20:
ï¾…ãƒ»ç¯ ï¾šï£°æ®±î’µ 0hãƒ»qgd8:completei3e10:downloadedi17686e10:incompletei0ee20:
ï¾‡Bï½ºå¶‚gãƒ»ï½ªãƒ»ãƒ»-/Tãƒ»d8:completei5e10:downloadedi801e10:incompletei2ee20:
ï¾ï¾ãƒ»4ï¾”Iãƒ»{-u)ï¾ï½±ãƒ»bd8:completei3e10:downloadedi4878e10:incompletei2ee20:
ï¾‘î„ï¾…ï¾ˆï¾…&ï½¤Obï½¼ï¾‹$ãƒ»d8:completei4e10:downloadedi1499e10:incompletei0ee20:
ï¾’@iæŽƒï½©tãƒ»aKï¾˜fç®š.ï¾†ï¾œd8:completei7e10:downloadedi1745e10:incompletei3ee20:
ï¾—ï¾ˆ<ï½¼ï½´ï¾ãƒ»Bãƒ»iï½£\,ï½ªEd8:completei0e10:downloadedi745e10:incompletei1ee20:
ãƒ»æ©¡æ…‹Pæš¦Zï½·ï¾”îŽï½´å¹2ï¾ˆd8:completei1e10:downloadedi11865e10:incompletei7ee20:
ãƒ»ï½¡B<ï¾ƒï½¹ï¾ƒï¾œï¾ˆï¾ŸD
=ï½²ãƒ»ï½µéœ™7èˆŠî€Œd8:completei2e10:downloadedi9246e10:incompletei2ee20:G[fJãƒ»Y*d8:completei15e10:downloadedi3649e10:incompletei11ee20:ãƒ»/Xrï¾šï¾™J
ãƒ»XA
å®€ãƒ»d8:completei3e10:downloadedi323e10:incompletei0ee20:åˆ•ï½®ãƒ»ï¾ˆï½¯ï½±|
jetei4e10:downloadedi12601e10:incompletei0ee20:aï½¿ãƒ»ï¾˜åˆ¶ï¾•\ï¾Ÿ?@î€»ï½¡ãŒ§8:completei3e10:downloadedi1005e10:incompletei0ee20:ctï½µ@è¨£+@ãƒ»
ï¾‚ï½§e5lï¾d8:completei0e10:downloadedi166e10:incompletei2ee20:cï½¶Glusï½´Bnï½¤ãƒ»]ç³¾ï½¨é¡•d8:completei1e10:downloadedi110e10:incompletei0ee20:j+sï½¢î€¨xï½£ãƒ»iï½¼4!î”½mG~d8:completei5e10:downloadedi6427e10:incompletei0ee20:|1Sãƒ»Mï¾…iè´ï¾‰ãƒ»
æ–/~d8:completei5e10:downloadedi865e10:incompletei2ee20:}lï½¶îˆ›ï½§/ãƒ»2k+ï½·Bãƒ»å†8:completei4e10:downloadedi1032e10:incompletei0ee20:æµï½¼ï½¾uç¢£PîŸï¾•ãƒ»knå†8:completei1e10:downloadedi95e10:incompletei1ee20:å¤ï¾‡gHï¾ãƒ»ï½°Ï€ï½¢Kï¾6ç¶œdd8:completei6e10:downloadedi14810e10:incompletei3ee20:è¢‹ï¾Ÿkï¾žï½²Lpãƒ»ï¾‘î”«Tï½«%î†œ8:completei3e10:downloadedi430e10:incompletei1ee20:æ„ˆW
ï½ï½³Lç½‡çžï½¯|ï¾”ãƒ»8$d8:completei0e10:downloadedi69e10:incompletei1ee20:ãƒ»ï¾˜b"mï½¨ï½³é‡¤ï£²ãƒ»ï½³î…ï½¹îƒ 8:completei8e10:downloadedi9526e10:incompletei0ee20:ï½ï¾šï½·4|ï½ºï½ªï£±èŠãƒ»å±®
lï¾‹Ead8:completei15e10:downloadedi1775e10:incompletei9ee20:ï½²ï¾”Aï¾Œqæ•¢kFæ¯…{Dãƒ»ï¾
d8:completei5e10:downloadedi4154e10:incompletei1ee20:ï½³03ãƒ»ï£²ï½©8e10:incompletei0ee20:ï½©
æ¼¸î˜„ç©†ï¾”+î…½ï¿¤ï¾‰hãƒ»!d8:completei3e10:downloadedi2874e10:incompletei1ee20:ï½©ï¾Ÿl,8Hçšºï¾ˆæºæ¤¿kï½¬{é¹½8:completei55e10:downloadedi10735e10:incompletei82ee20:ï½«ï¾Ž~pï¾ãƒ»(QãŽž8?uL^4d8:completei2e10:downloadedi140e10:incompletei0ee20:ï½¯î“ï½¨ç—‡ï½¢æ…™ï½¡ï£²Uî–—ï½¤fãƒ»8:completei0e10:downloadedi368e10:incompletei3ee20:ï½µï¾Žï£°Eî‰—ãƒ»d1ï½¶ï½½ãƒ»qQBWd8:completei6e10:downloadedi7221e10:incompletei9ee20:ï½»ï½ºï¾–,ï¾’tQï£±_ï¾œï¾`ï¾ƒ(ï¾Šd8:completei3e10:downloadedi1536e10:incompletei21ee20:ï¾‹Sãƒ»ï¾el%~ï½¼ï½,yï¾›î•—bd8:completei1e10:downloadedi111e10:incompletei1ee20:ï¾ŽBd
ãƒ»ï½¹î²|e]ï½¥"ï½¤vTd8:completei2e10:downloadedi1096e10:incompletei1ee20:ï¾•ãƒ»ï½¾ãƒ»$îŽ®î‹•d8:completei1e10:downloadedi701e10:incompletei0ee20:ï¾œzï¾ï½¨/ï£³ï¾“@g.3å—‡=ãƒ»ï¾Ÿd8:completei0e10:downloadedi512e10:incompletei1ee20:ï¾žéå‘·i15e10:downloadedi2161e10:incompletei8ee20:~è™œBï½¦|ï¾žîŽµï¾Œ0ç¯·î€™`hd8:completei0e10:downloadedi86e10:incompletei1ee20:è©±ãƒ»!ï£²ï¾›\éš†?ãƒ»Â€ï¾ï½°rd8:completei1e10:downloadedi36732e10:incompletei0ee20:ä¿‘>(ãƒ»lå·ž=å¡˜Â€ãƒ»é¶šç§Ÿ8:completei5e10:downloadedi8917e10:incompletei1ee20:ãƒ»ï½»Jï¾Žç£¨gî˜»è®F2é “@kd8:completei7e10:downloadedi1644e10:incompletei22ee20:ï½¤"&cB:TRï¾š}taç¦°ï½¼0+å½8:completei3e10:downloadedi418e10:incompletei0ee20:ï½¹ãƒ»æ‘Â€ï¾‡$Nï½«cï½¯Pçœ·{ï½½d8:completei8e10:downloadedi10297e10:incompletei10ee20:ï¾…ï£°æ–ˆ>Tãƒ»ï½©
ï¾ƒï¾ï¾‰>ï¾‘îƒµæ’8:completei0e10:downloadedi323e10:incompletei1ee20:ï¾†èª¨vï½½.sé¬â– ï½ªe1+|<ï½¼Gd8:completei0e10:downloadedi1412e10:incompletei1ee20:ï¾‘ï½¬ç…®è‚„nï£±ï½¤c
ãƒ»ï£³ï½«_è »d8:completei2e10:downloadedi477e10:incompletei3ee20:ç¦³""ï½»Rï½µç´†îŽ¼ãƒ»ãƒ»@îŒŸd8:completei0e10:downloadedi175e10:incompletei4ee20:ãƒ»uå€’æ“Zï¾8wæ“¡>{ï½¦åš‡8:completei1e10:downloadedi5929e10:incompletei0ee20:î„‚éš•î™°ï½¯i7MYãƒ»vÂ€Yd8:completei1e10:downloadedi212e10:incompletei1ee20:î“Œï¾™ï½·fï£²è‹ºãƒ»QfCæ¸‹+ï¾›ï½®d8:completei1e10:downloadedi159e10:incompletei2ee20:ãƒ»ï½¨7ï½©ãƒ»èƒ™ï£°ï¾žï¾„ï£²ï¾Žï½§ç««

can anyone help

James Gray · Apr 15, 2009

i have a text file which has entries comprised of a key written in
binary and its values written in strings (you can see an exerpt below).

I need to parse the binary and transform it into human readable hex and
parse its associated info. My reg exps dont seem to be behaving and im
wondering if its me or if its this binary text that is causing mischief
somehow. Heres a sample item

20: =0C=E7=90=AE=E7=A5=BA=EF=BD=B1=E3=83=BB=E3=83=BB=EF=BD=B7=E3=83=BBG=1B= =E8=81=8A=E3=81=BE=20
d8:completei7e10:downloadedi2046e10:incompletei1ee

binary parts are always enclosed between "20:" and "d8:complete" where
the 8 can be any integer(s) e.g. 5 or 23.

str = File.open('textfile.txt' , 'r').readlines.join
str.gsub!(/(20:)(.*?)(d\d+:)/m) do |x|
  $1 + $2.unpack('H*').join + $3
end

The above works for some but not all of the text. It seeems to go beyond
the "d8:complete" marker

I suspect this is an encoding issue. If your data is UTF-8, this code
may work for you:

data = File.read('textfile.txt')
data.scan(/(20:)(.*?)(d\d+:)/um) do |start, bin, finish|
  p start + bin.unpack('H*').join + finish
end

I'm guessing though.

If you want to read more about what I believe is causing you problems,
you may find my m17n series of blog posts helpful:

http://blog.grayproductions.net/articles/understanding_m17n

James Edward Gray II

Adam Akhtar · Apr 15, 2009

Ahh i didnt know you could use scan like that with blocks and
variables...thats going to come in very handy indeed.

Ill give that a go - many thanks James!

Adam Akhtar · Apr 15, 2009

Oh and your blog post looks good too, just started reading it.

James Gray · Apr 15, 2009

Oh and your blog post looks good too, just started reading it.

Great. I hope it helps.

James Edward Gray II

Adam Akhtar · Apr 21, 2009

I'm back again and pretty confused as to why my regexp is still
overshooting the mark.

I want my regexp /(20:)(.*?)(d\d+:complete.+?incomplete.+?ee)/ium

to get everything between and including 20: and ee, i.e. from the first
line of the sample at the bottom of this message I'd want this:

20: Â€0ï¾ƒ ãƒ»aï½ï¾˜:$ ï½©DÂ€ãƒ»d8:completei0e10:downloadedi772e10:incompletei1ee

but sometimes it overshoots and does something like this
20: Â€0ï¾ƒ ãƒ»aï½ï¾˜:$
ï½©DÂ€ãƒ»d8:completei0e10:downloadedi772e10:incompletei1ee20:
ç®ç¥ºï½±ãƒ»ãƒ»ï½·ãƒ»GèŠã¾d8:completei9e10:downloadedi2064e10:incompletei2ee

and I can't figure out why. In my Notepad++ editor I have it set to
display line feeds and carriage returns. Sometimes in the binary parts
it displays an LF symbol. In binary, does LF represent a new line, or is
it just data (bytes etc.)? Could it be that that's tripping up Ruby's
regexp engine?

I load the data text file like so:
data = File.open("text.txt", "rb").readlines

Is there something I'm doing wrong?

Sample from the data text file:

20: Â€0ï¾ƒ ãƒ»aï½ï¾˜:$
ï½©DÂ€ãƒ»d8:completei0e10:downloadedi772e10:incompletei1ee20:
ç®ç¥ºï½±ãƒ»ãƒ»ï½·ãƒ»GèŠã¾d8:completei9e10:downloadedi2064e10:incompletei2ee20:
}tï½ªï½½h>ï½¥ï¾“é€æ¦Žï¾—î‚¹d8:completei4e10:downloadedi7724e10:incompletei5ee20: æž¶C
ï¾‹ï½³J<ï½¨Fï¾—0ï¾‰çŠ’Wd8:completei4e10:downloadedi632e10:incompletei2ee20:
incï¾‰ãƒ»U]~é¼¡ãƒ»`åƒ˜< d8:completei5e10:downloadedi536e10:incompletei0ee20:
ï½¼ï£³ï¾™qï£³ï¾…!pî†ï½«ï½½-ï¾˜ï¾€58Td8:completei1e10:downloadedi520e10:incompletei0ee20:
G*ï¨¨î‰’ï½ª
ãƒ»4Tæ¾€ï½µï½¿kæ¾†d8:completei0e10:downloadedi1061e10:incompletei2ee20:
Iï¾ˆlè«‡ï½«ï½¦7Z&ï¾šãƒ»Kï¾žå»ºï¾‰Ed8:completei5e10:downloadedi268e10:incompletei0ee20:
Smï£²ï¾…ï½¿ï¾Žæ€¯ï½¿ï¾‹ï½¨h7rãƒ»é†‹d8:completei5e10:downloadedi798e10:incompletei0ee20:
bjãƒ»]VFãƒ»wä»é±§åï¾‰d8:completei8e10:downloadedi523e10:incompletei11ee20:
hï¾ãƒ»ï¾”î–=ï½©îŠ¢ï½¿æ•é™ï½¢å·Œd8:completei0e10:downloadedi57e10:incompletei3ee20:
mbâ€˜<GSï½·bï½µï½©qTãƒ»?d8:completei2e10:downloadedi3864e10:incompletei0ee20:
uâ‰¡ï£²axBï¾†z<ç¸Š3d8:completei4e10:downloadedi713e10:incompletei7ee20:
u[ç´„ãƒ»ï¾‘æ³±@2 ï½³4ï½§ï¾„d8:completei2e10:downloadedi659e10:incompletei5ee20:
å…„ï£³ãƒ»-|ï¾‚ï¾…-ï¾•ï¾Ÿ6â…¶ï¾™ï¾™ï½ªãƒ»8:completei0e10:downloadedi108e10:incompletei2ee20:
ãƒ»ãƒ»aãƒ»$ï£±ï¾#ãƒ»3!COd8:completei3e10:downloadedi306e10:incompletei0ee20:
æ§ãƒ»Gï¾“_å»°å¶Zï½¤7}d8:completei1e10:downloadedi2293e10:incompletei1ee

Adam Akhtar · Apr 22, 2009

I'm thoroughly confused and have spent a good 10 hours getting nowhere
fast. I'm going to throw my monitor against the wall!

I have a file with text like the stuff in the posts above. I don't create
the file; it's given to me as a standard text file. I don't know how it is
encoded. I'm assuming UTF-8. There is your standard readable English
lower-128 ASCII, and then there are bits of garbled crap that are
supposed to be binary.

I do the following

$KCODE = "UTF8"

then i do

data_a = File.read('mn-scrape.txt')
data_b = File.open("mn-scrape.txt", "rb").readlines.join("")
data_a.scan(/./m).length ( ==> 170799 )
data_b.scan(/./m).length ( ==> 767702 )

Why are they different?
When I look in Notepad++, viewing the file under the UTF-8 encoding, it
says the number of characters is 767702, which is about 4.5 times bigger
than the .read version.

Why is this happening?

What is the correct way to open this type of file? Any help whatsoever
will be a great, great, great help!

Adam Akhtar · Apr 23, 2009

anyone, im begging ;-)

if im not being clear please say and ill answer any questions you have

t3ch.dude · Apr 23, 2009

anyone, im begging ;-)

if im not being clear please say and ill answer any questions you have

Adam,

Forum and e-mail cut & paste is iffy... is there somewhere you could
post all or part of one of these source files? Is it possible that
these inline binary blobs are actually all the same number of bytes?

-t3ch.dude

Adam Akhtar · Apr 23, 2009

Ahh, I should have thought about that. Here is a source file:

Attachments:
http://www.ruby-forum.com/attachment/3615/mini-scrape.txt

Martin DeMello · Apr 24, 2009

Ahh i didnt know you could use scan like that with blocks and
variables...thats going to come in very handy indeed.

You probably realise this, but for the benefit of newbies, there are
three different things going on there. Firstly, if the regexp passed
to scan has groups, the returned values are arrays with one element
per group (corresponding to $1, $2, ...). Secondly, if you pass a
block to scan, it yields its return values one by one, rather than
just accumulating them into an array. Thirdly, if you yield multiple
values to a block, the block can capture them either as an array, or
in multiple parameters. The beauty of ruby is how well all these
different features fit together to give the elegant scan syntax.
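
A minimal sketch of those three behaviours, assuming a made-up sample string (it is not taken from the data file in this thread):

```ruby
# Hypothetical sample text, used only to illustrate String#scan.
text = "a=1, b=22, c=333"

# 1. With groups and no block, scan returns one array per match,
#    holding one element per group.
pairs = text.scan(/(\w)=(\d+)/)
p pairs

# 2. With a block, scan yields each match instead of collecting them.
# 3. The block can take the match as a single array...
as_array = []
text.scan(/(\w)=(\d+)/) { |match| as_array << match }

#    ...or destructure it into one parameter per group.
totals = {}
text.scan(/(\w)=(\d+)/) { |key, value| totals[key] = value.to_i }
p totals
```

The block form is the one used earlier in the thread with |start, bin, finish|.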

martin

Eric Hodel · Apr 24, 2009

ahh should have thought about that. here is a souce file

Attachments:
http://www.ruby-forum.com/attachment/3615/mini-scrape.txt

I think regexp is the wrong way to do this. Since this is a binary
file format a regexp is unlikely to give you real data. Scanning
seems to work out better. Where did you get this data?

It seems to have the following format in pseudo EBNF:

record: digit+ ":" <N bytes of data> stuff
stuff: "d" | "i" N+ "e" "e"?

Instead of using Regexp, use StringScanner or just read by hand like I
do below.

Here's what I tried:

irb(main):001:0> io = open 'mini-scrape.txt'
=> #<File:mini-scrape.txt>
irb(main):002:0> io.read 1
=> "2"
irb(main):003:0> io.read 1
=> "0"
irb(main):004:0> io.read 1
=> ":"

# I'm guessing "20:" says read 20 bytes, let's see where that puts us:

irb(main):005:0> io.read 20
=> " \f\373j\342Q\261\201E\201E\267\201EG\e\343\326\202\334"

# ok...

irb(main):006:0> io.read 1
=> "d"

# I don't know what "d" means, but carrying on:

irb(main):007:0> io.read 1
=> "8"
irb(main):008:0> io.read 1
=> ":"

# "8:", let's read 8 bytes:

irb(main):009:0> io.read 8
=> "complete"

# ok, looking good

irb(main):010:0> io.read 1
=> "i"
irb(main):011:0> io.read 1
=> "9"
irb(main):012:0> io.read 1
=> "e"

# dunno what "i9e" could be

irb(main):013:0> io.read 1
=> "1"
irb(main):014:0> io.read 1
=> "0"
irb(main):015:0> io.read 1
=> ":"

# "10:", read 10 bytes:

irb(main):016:0> io.read 10
=> "downloaded"

# ok...

irb(main):017:0> io.read 1
=> "i"
irb(main):018:0> io.read 1
=> "2"
irb(main):019:0> io.read 1
=> "0"
irb(main):020:0> io.read 1
=> "6"
irb(main):021:0> io.read 1
=> "4"
irb(main):022:0> io.read 1
=> "e"

# dunno what "i2064e" is, but maybe it downloaded 2064 bytes and the
# previous one was complete in 9 somethings

irb(main):023:0> io.read 1
=> "1"
irb(main):024:0> io.read 1
=> "0"
irb(main):025:0> io.read 1
=> ":"

# read 10 bytes, another string:

irb(main):026:0> io.read 10
=> "incomplete"
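
The manual reads in that session could be rolled into a loop along these lines. This is only a sketch: read_tokens is a hypothetical helper name, the sample record is made up, and treating "d" and "e" as plain markers is an assumption, since the session does not establish what they mean.

```ruby
require 'stringio'

# Reads tokens by hand from an IO, the way the irb session does:
# "N:" is followed by N raw bytes, "i...e" wraps an integer, and
# "d" / "e" are passed through as markers (an assumption).
def read_tokens(io)
  tokens = []
  until io.eof?
    ch = io.read(1)
    case ch
    when 'd', 'e'
      tokens << ch.to_sym
    when 'i'
      digits = +''
      digits << ch while (ch = io.read(1)) && ch != 'e'
      tokens << digits.to_i
    when /\d/
      len = ch.dup
      len << ch while (ch = io.read(1)) && ch != ':'
      tokens << io.read(len.to_i)   # exactly len bytes, whatever they contain
    else
      raise "unexpected byte #{ch.inspect}"
    end
  end
  tokens
end

# Hypothetical record: a 20-byte key followed by a few fields.
key = (1..20).to_a.pack('C*')
sample = "20:".b + key + "d8:completei9e10:downloadedi2064ee".b

tokens = read_tokens(StringIO.new(sample))
puts tokens.first.unpack('H*').first   # the 20-byte key as hex
p tokens[1..-1]
```

Because the binary chunk is consumed by length, nothing inside it is ever inspected for markers.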

Heesob Park · Apr 24, 2009

2009/4/24 Adam Akhtar said:
ahh should have thought about that. here is a souce file

Attachments:
http://www.ruby-forum.com/attachment/3615/mini-scrape.txt

I guess the following code will work for you.

str = File.open('mini-scrape.txt' , 'rb').read
str = str.split(/(20:)/).map{|x|x.gsub(/(.+?)(d\d+:)/){$1.unpack('H*').join+$2}}.join

Regards,

Park Heesob

Adam Akhtar · Apr 24, 2009

Thanks for all your responses.

I think regexp is the wrong way to do this. Since this is a binary
file format a regexp is unlikely to give you real data. Scanning
seems to work out better. Where did you get this data?

I'm confused about binary file formats. Are UTF-8 and a binary file format
two separate things? I thought binary was just represented by Unicode?

Why would the regexp trip up at the binary part if I tell it the
encoding is UTF-8?

Also, isn't read() dangerous with Unicode text? Can I assume that all
characters are only 1 byte wide?

The file is bencoded (I think it's like YAML in some respects).

Adam Akhtar · Apr 24, 2009

Well, I've found some stuff out re: binary format.

I was getting confused about the "b" switch in File.open("file", "rb") (as
in "r**b**").

I thought this was needed to tell Ruby we were dealing with some funky
"binary" file, but it's a lot simpler than that. There is no special
binary file format (that I'm aware of). Binary is just written to a file
as text is, but in Unicode (I'm assuming).

So why then do we have to set the "b" for binary mode flag in
File.open?
Sometimes binary can have the ^Z character in it. As binary it's doing
nothing more than any other character, representing some information, but
in Windows that character represents end of file.

File.open expects text files, so if it comes across ^Z it will stop
reading even if the text is actually representing binary. To stop Ruby
doing that, you use "b" in your call to .open.

This is a Windows-only issue, apparently.

This will explain why i was getting different lengths with

data_a = File.read('mn-scrape.txt')
data_b = File.open("mn-scrape.txt", "rb").readlines.join("")
data_a.scan(/./m).length ( ==> 170799 )
data_b.scan(/./m).length ( ==> 767702 )

Adam Akhtar · Apr 24, 2009

I think regexp is the wrong way to do this. Since this is a binary
file format a regexp is unlikely to give you real data. Scanning
seems to work out better. Where did you get this data?

Can you tell me why regular expressions are bad for this? Although the
text represents binary, its just text at the end of the day. And if i
know in advance that the binary starts after a :20 and ends before a
d\d+ is there any reason why
/:20.+?d\d+/ wouldnt work?

I looked at StringScanner, but that seems to use regular expressions to
scan too.

What confuses me re: regular expressions is if I do something like

text = File.open("some-file", "rb") do |data|
  data.read
end

text =~ /(.{20})/um
$1
=> "d5:filesd20:\000\006îŒ¨å‘ªãƒ»

Notice that the result doesn't show 20 characters, and it doesn't end with
the expected " that irb uses to enclose results... why's that?

Eric Hodel · Apr 24, 2009

Thanks for all your responses.

Im confused about binary file format. Is UTF-8 and binary file format
two seperate things? I thought binary was just represented by unicode?

They are separate things. A UTF-8 character that spans multiple bytes
has a special bit pattern across its multiple bytes. A binary file
can have any format.
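
A small sketch of the difference, assuming a Ruby recent enough to have String#b and the encoding API (the thread itself predates both). The bytes are the start of the 20-byte chunk shown in the irb session earlier in the thread.

```ruby
# First bytes of the 20-byte chunk from the irb session in this thread.
chunk = " \f\373j\342Q\261\201E".b

# Tagging arbitrary bytes as UTF-8 does not make them UTF-8:
as_utf8 = chunk.dup.force_encoding(Encoding::UTF_8)
puts as_utf8.valid_encoding?

# Real UTF-8 text follows the multi-byte bit pattern and validates.
text = "\u00e9"   # "e" with an acute accent, two bytes in UTF-8
puts text.valid_encoding?
puts text.bytes.map { |b| format('%08b', b) }.join(' ')
```

The lead byte of a multi-byte UTF-8 character starts with 110, 1110 or 11110, and each continuation byte starts with 10. A byte such as 0xFB (octal 373) fits none of those patterns, which is why the chunk fails validation.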

Why would the regexp trip up at the binary part if i tell it the
encoding is UTF-8?

It doesn't matter what the encoding is, in a binary file you don't
have any guarantees that one of your markers won't show up in the
middle of a binary chunk. There's no reason "20:" or "8:" or anything
couldn't show up inside the chunk of random data.

Also with read() isnt that dangerous with Unicode text? Can I assume
that all characters are only 1 byte wide?

Correct, but I don't think this file is in any Unicode encoding. The
individual chunks of binary data may be, but overall the file appears
not to be.

The file is bencoded (i think its like yaml in some respects).

Yes, a binary file format is like yaml, in this case you have the
"20:", "8:", etc that tell you how far to read (I'm guessing).

Eric Hodel · Apr 24, 2009

well ive found some stuff out re: binary format.

I was getting confused re: the "b" switch in File.open("file", "rb") (as
in "r**b**")

I thought this was needed to tell ruby we were dealing with some funky
"binary" file but its a lot simpler than that. There is no special
binary file format (that im aware of). Binary is just written to a file
as text is but in unicode (im assuming).

In windows and on ruby 1.9 the 'b' flag says not to perform any
conversions of bytes to characters on the text, that's all. Just
leave it as a stream of bytes.

So why then do we have to set the "b" for binary mode flag in the
File.open ?
Sometimes binary can have the ^Z character in it. As binary its doing
nothing more than any other character- representing some information but
in windows that character represents end of file.

Yes, ^Z is the byte "\x1A" (Ctrl-Z), which text mode on Windows treats as
end of file.

File.open expects text files so if it comes accross ^Z it will stop
reading even if the text is actually representing binary. To stop ruby
doing that you use "b" in your call to .open.

It'll also convert line endings, losing data that should be in a
binary file.

This is a windows only issue apparently.

It is also an issue on ruby 1.9 for any platform, but for different
reasons. Ruby will perform other character conversions.
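
A short sketch of reading in binary mode, assuming a reasonably recent Ruby (Tempfile.create and String#b are newer than this thread). The byte string is made up to contain the troublemakers mentioned above.

```ruby
require 'tempfile'

# Hypothetical bytes: a ^Z (0x1A), a CR LF pair, and a high byte.
bytes = "a\x1Ab\r\nc\xFB".b

data = nil
via_binread = nil
Tempfile.create('binary-demo') do |f|
  f.binmode
  f.write(bytes)
  f.flush

  # 'rb' hands the bytes back untouched, tagged as binary.
  data = File.open(f.path, 'rb') { |io| io.read }
  # File.binread is a shortcut for the same thing.
  via_binread = File.binread(f.path)
end

puts data.encoding == Encoding::BINARY
puts data.bytesize == bytes.bytesize
puts data == bytes && via_binread == bytes
```

Reading the same file without "b" is where platform-specific behaviour (end-of-file handling, line-ending conversion, encoding tagging) can come in, so for data like this binary mode is the safer default.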

This will explain why i was getting different lengths with

data_a = File.read('mn-scrape.txt')
data_b = File.open("mn-scrape.txt", "rb").readlines.join("")
data_a.scan(/./m).length ( ==> 170799 )
data_b.scan(/./m).length ( ==> 767702 )

Yup.

Eric Hodel · Apr 24, 2009

Can you tell me why regular expressions are bad for this? Although the
text represents binary, its just text at the end of the day. And if i
know in advance that the binary starts after a :20 and ends before a
d\d+ is there any reason why
/:20.+?d\d+/ wouldnt work?

(I think you mean "20:")

It will incorrectly match this stream of text, losing data:

"20:d20:d20:d20:d20:d20:d20:"

A /d\d/ could happen in the middle of that binary chunk. You're just
lucky that it hasn't shown up.
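
To see the data loss on that exact stream, here is a small sketch comparing the non-greedy regexp from earlier in the thread with a read by length (the variable names are just for illustration):

```ruby
require 'strscan'

stream = "20:d20:d20:d20:d20:d20:d20:"

# The non-greedy regexp stops at the first "d<digits>:" it sees,
# which here sits inside the 20-byte chunk.
stream =~ /(20:)(.*?)(d\d+:)/m
regexp_chunk = $2

# Reading by length takes exactly 20 bytes after the prefix.
s = StringScanner.new(stream)
len = s.scan(/\d+:/).to_i
length_chunk = s.peek(len)

p regexp_chunk   # ""
p length_chunk   # "d20:d20:d20:d20:d20:"
```

The regexp captures an empty chunk, while the length prefix recovers all 20 bytes.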

I looked at StringScanner but that seems to use regular experssion to
scan though.

Yes, but they are all anchored at the front so you can choose what to
do:

require 'strscan'

open 'mini-scrape.txt', 'rb' do |io|
  s = StringScanner.new io.read

  # look for any number of digits followed by a ":" at the scan pointer
  len = s.scan(/\d+:/).to_i # #to_i ignores the ":"

  # now the scan pointer has moved to the start of the binary data
  # so we can read the length of bytes out
  # (m flag makes . match newlines, don't use the u flag)
  data = s.scan(/.{#{len}}/m)

  p :data => data

  p :next => s.string[s.pos, 20]

  # what's next in the stream is a "d" followed by another length specifier,
  # so let's read in the "d" even though I don't know what to do with it
  case s.peek 1
  when 'd' then
    s.get_byte

  # add your own cases here for other thingys that show up.
  else
    raise "unknown thingy #{s.peek 1}"
  end

  # you'll probably want to put a loop around this, which will start over
  # reading another length specifier and a chunk of data
end

If you wrap this in a loop you can easily continue extending it until
it handles your entire file.
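
One way the looped version might look, as a sketch only. It assumes the file really is bencoded, as mentioned earlier in the thread: integers as "i<n>e", strings as "<length>:<bytes>", lists as "l...e" and dictionaries as "d...e". The decode helper and the sample record are made up for illustration.

```ruby
require 'strscan'

# Decodes one bencoded value at the scan pointer (hypothetical helper).
def decode(s)
  if s.scan(/i(-?\d+)e/)
    s[1].to_i
  elsif s.scan(/(\d+):/)
    len = s[1].to_i
    data = s.peek(len)              # raw bytes, never searched for markers
    raise 'truncated string' if data.bytesize < len
    s.pos += len
    data
  elsif s.scan(/l/)
    list = []
    list << decode(s) until s.scan(/e/)
    list
  elsif s.scan(/d/)
    dict = {}
    until s.scan(/e/)
      key = decode(s)
      dict[key] = decode(s)
    end
    dict
  else
    raise "unexpected byte at position #{s.pos}"
  end
end

# Hypothetical record shaped like the samples in this thread.
key = (0...20).to_a.pack('C*')
sample = "d5:filesd20:".b + key +
         "d8:completei0e10:downloadedi772e10:incompletei1eeee".b

result = decode(StringScanner.new(sample))
result.fetch('files').each do |hash, stats|
  puts "#{hash.unpack('H*').first} #{stats.inspect}"
end
```

The "d" that was skipped in the code above becomes the start of a dictionary here, and the trailing "e" characters close each nesting level.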

What confuses me re: reg expressions is if I do something like

text = File.open("some-file", "rb") do |data|
  data.read
end

text =~ /(.{20})/um
$1
=> "d5:filesd20:\000\006=EE=8C=A8=13=E5=91=AA=E3=83=BB

Notice that the result doesnt show 20 characters and it doesnt end with
the expected " that irb uses to enclose results...whys that?

This is probably the fault of your terminal. Remember you're working
in bytes (8 bits wide), not UTF-8 characters (which may be up to 4
bytes long). One of the characters is probably overwriting the
closing ".
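
The byte versus character distinction can be checked directly. A sketch, assuming a Ruby with String#b, on a made-up three-character string:

```ruby
# Hypothetical string: only the middle character is non-ASCII.
s = "a\u3042b"   # "a", HIRAGANA LETTER A, "b"

puts s.length     # counts characters
puts s.bytesize   # counts bytes
p s.b             # the same bytes viewed as binary

# /./ steps by character on a UTF-8 string and by byte on a binary one.
puts s.scan(/./m).length
puts s.b.scan(/./m).length
```

So a pattern like /.{20}/ means 20 characters on UTF-8 text but 20 bytes on binary data, and the length prefixes in this file count bytes.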

Adam Akhtar · Apr 28, 2009

Many Thanks everybody for your help on the matter, especially Eric who
has replied so many times.

I took a break from the pc over the weekend and i came back to the
problem with a fresh head and managed to achieve what i wanted using the
information posted by yourselves.

Thank you all so much again.
