reformatting a text file that has some binary in it

Discussion in 'Ruby' started by Adam Akhtar, Apr 15, 2009.

  1. Adam Akhtar

    Adam Akhtar Guest

    I have never worked with binary data before, and after trying to solve
    this problem for 3 hours I'm turning to the community for help.

    I have a text file whose entries consist of a key written in binary and
    its values written as strings (you can see an excerpt below).

    I need to parse the binary and transform it into human-readable hex and
    parse its associated info. My regexps don't seem to be behaving and I'm
    wondering if it's me or if it's this binary text that is causing mischief
    somehow. Here's a sample item:

    20: ç®ç¥ºï½±ãƒ»ãƒ»ï½·ãƒ»GèŠã¾d8:completei7e10:downloadedi2046e10:incompletei1ee


    binary parts are always enclosed between "20:" and "d8:complete" where
    the 8 can be any integer(s) e.g. 5 or 23.

    str = File.open('textfile.txt' , 'r').readlines.join
    str.gsub!(/(20:)(.*?)(d\d+:)/m) do |x|
    $1 + $2.unpack('H*').join + $3
    end

    The above works for some but not all of the text. It seems to go beyond
    the "d8:complete" marker.

    Here's a bigger sample set if need be. Any tips or pointers would be
    greatly appreciated.


    d8:completei2e10:downloadedi770e10:incompletei1ee20:
    ç®ç¥ºï½±ãƒ»ãƒ»ï½·ãƒ»GèŠã¾d8:completei7e10:downloadedi2046e10:incompletei1ee20:
    6皀ネ・ 涇jュ・w・d8:completei0e10:downloadedi72602e10:incompletei1ee20:
    }tェスh>ï½¥モé€æ¦Žï¾—d8:completei3e10:downloadedi7718e10:incompletei2ee20: 架C
    ヒウJ<ィFラ0ノ犒Wd8:completei2e10:downloadedi617e10:incompletei0ee20:
    incノ・U]~鼡é僘< d8:completei3e10:downloadedi533e10:incompletei0ee20:
    ゥ<Zè¿š<0ï½¹_!Y/î»ï¾ž3d8:completei1e10:downloadedi281e10:incompletei0ee20:
    6ユiå‰ï£²ãƒ»ç…¤ï½¯ãƒ»ï½µGd8:completei0e10:downloadedi216e10:incompletei1ee20:
    Iネl談ォヲ7Z&レ・K゙建ノEd8:completei4e10:downloadedi262e10:incompletei3ee20:
    Smナソホ怯ソヒィh7r・醋d8:completei3e10:downloadedi787e10:incompletei0ee20:
    Yjスゥホd ヒ îŽï¾–]豆ud8:completei0e10:downloadedi154e10:incompletei1ee20:
    bj・]VF・w仭鱧åノd8:completei10e10:downloadedi505e10:incompletei16ee20:
    h・ï¾æ£\MDaî‹ž3ï¾ï½ºd8:completei2e10:downloadedi1050e10:incompletei2ee20:
    hï¾ãƒ»ï¾”î–­=ゥソæ•é™ï½¢å·Œd8:completei1e10:downloadedi57e10:incompletei2ee20:
    mb‘<GSキbオゥqT・?d8:completei1e10:downloadedi3860e10:incompletei1ee20:
    u ≡axBニz<縊3d8:completei3e10:downloadedi700e10:incompletei7ee20:
    u[約・ム泱@2 ウ4ァトd8:completei0e10:downloadedi658e10:incompletei3ee20:
    ・・a・$ï¾#・3!COd8:completei3e10:downloadedi304e10:incompletei0ee20:
    æ§ãƒ»Gモ_å»°å¶Z、7}d8:completei3e10:downloadedi2285e10:incompletei2ee20:
    入サ・î¼ï½¢ï£²ï¾ŽGaヲ`qvBd8:completei6e10:downloadedi1061e10:incompletei5ee20:
    牟エ?eゥケソæ¯ãƒ»ï½©ï¾„j・d8:completei3e10:downloadedi2902e10:incompletei1ee20:
    セオッ・Mï½´_L宋ナ・ーï¾ld8:completei1e10:downloadedi147e10:incompletei1ee20:
    ィレY#億Uホ "ーF絖bヌawd8:completei6e10:downloadedi39010e10:incompletei2ee20:
    ゥ・ォヒ2、1・tス稘8:completei7e10:downloadedi1835e10:incompletei0ee20:
    ョ3・・Lサ エ)サモ.G單8:completei2e10:downloadedi474e10:incompletei0ee20:
    エ)PO霙
    オクヲ&・%L狎苧d8:completei4e10:downloadedi3674e10:incompletei0ee20:
    ï¾ï¾žï½­ï¾E\GT瞹゚・翡ェイョd8:completei2e10:downloadedi328e10:incompletei0ee20:
    ト裔・(・隋K砡セネハィツd8:completei43e10:downloadedi9665e10:incompletei31ee20:
    ナ・篠レ殱  0h・qgd8:completei3e10:downloadedi17686e10:incompletei0ee20:
    ヌBコ嶂g・ェ・・-/T・d8:completei5e10:downloadedi801e10:incompletei2ee20:
    ï¾ï¾ãƒ» 4ï¾”I・{-u)ï¾ï½±ãƒ»bd8:completei3e10:downloadedi4878e10:incompletei2ee20:
    ムî„ナネナ&、Obシヒ$・d8:completei4e10:downloadedi1499e10:incompletei0ee20:
    メ@i掃ゥt・aKリf箚.ニワd8:completei7e10:downloadedi1745e10:incompletei3ee20:
    ラネ<シエï¾ãƒ»B・iï½£\,ェEd8:completei0e10:downloadedi745e10:incompletei1ee20:
    ・橡態P暦Zï½·ï¾”îŽï½´å¹2ネd8:completei1e10:downloadedi11865e10:incompletei7ee20:
    ・。B<テケテワネ゚D
    =イ・オ霙7舊d8:completei2e10:downloadedi9246e10:incompletei2ee20:G[fJ・Y*d8:completei15e10:downloadedi3649e10:incompletei11ee20:・/XrレルJ
    ・XA
    宀・d8:completei3e10:downloadedi323e10:incompletei0ee20:刕ョ・ネッア|
    jetei4e10:downloadedi12601e10:incompletei0ee20:aソ・リ制ユ \゚?@。㌧8:completei3e10:downloadedi1005e10:incompletei0ee20:ctオ@訣+@・
    ツァe5lï¾d8:completei0e10:downloadedi166e10:incompletei2ee20:cカGlusï½´Bn、・]糾ィ顕d8:completei1e10:downloadedi110e10:incompletei0ee20:j+s「x」・iï½¼4!mG~d8:completei5e10:downloadedi6427e10:incompletei0ee20:|1S・Mï¾…iè´ï¾‰ãƒ»
    æ–­/~d8:completei5e10:downloadedi865e10:incompletei2ee20:}lカァ/・2k+ï½·B・å†8:completei4e10:downloadedi1032e10:incompletei0ee20:æµï½¼ï½¾u碣PîŸï¾•・knå†8:completei1e10:downloadedi95e10:incompletei1ee20:孤ヌgHï¾・ーπ「Kï¾6綜dd8:completei6e10:downloadedi14810e10:incompletei3ee20:袋゚k゙イLp・ムTォ%8:completei3e10:downloadedi430e10:incompletei1ee20:愈W
    ï½­ï½³L罇瞭ッ|ヤ・8$d8:completei0e10:downloadedi69e10:incompletei1ee20:・リb"mィウ釤・ウî…ï½¹8:completei8e10:downloadedi9526e10:incompletei0ee20:ュレキ4|コェèŠãƒ»å±®
    lヒEad8:completei15e10:downloadedi1775e10:incompletei9ee20:イヤAフqæ•¢kF毅{D・ï¾
    d8:completei5e10:downloadedi4154e10:incompletei1ee20:ウ03・ゥ8e10:incompletei0ee20:ゥ
    漸穆ヤ+¦ノh・!d8:completei3e10:downloadedi2874e10:incompletei1ee20:ゥ゚l,8H皺ネæºæ¤¿kャ{é¹½8:completei55e10:downloadedi10735e10:incompletei82ee20:ォホ~ pï¾ãƒ»(Q㎞8?uL^4d8:completei2e10:downloadedi140e10:incompletei0ee20:ッî“ィ症「慙。U、f・8:completei0e10:downloadedi368e10:incompletei3ee20:オホE・d1カス・qQBWd8:completei6e10:downloadedi7221e10:incompletei9ee20:ï½»コï¾–,ï¾’tQ_ワï¾`テ(ハd8:completei3e10:downloadedi1536e10:incompletei21ee20:ヒS・ï¾el%~シュ,yロbd8:completei1e10:downloadedi111e10:incompletei1ee20:ホBd
    ・ケî²|e]ï½¥"、vTd8:completei2e10:downloadedi1096e10:incompletei1ee20:ユ・セ・$d8:completei1e10:downloadedi701e10:incompletei0ee20:ワzï¾ï½¨/モ@g.3å—‡=・゚d8:completei0e10:downloadedi512e10:incompletei1ee20:゙éå‘·i15e10:downloadedi2161e10:incompletei8ee20:~虜Bヲ|゙フ0篷`hd8:completei0e10:downloadedi86e10:incompletei1ee20:話・!ロ\隆?・€ï¾ï½°rd8:completei1e10:downloadedi36732e10:incompletei0ee20:ä¿‘>(・lå·ž=塘€・鶚租8:completei5e10:downloadedi8917e10:incompletei1ee20:・サJホ磨gè®F2é “@kd8:completei7e10:downloadedi1644e10:incompletei22ee20:、"&cB:TRレ}ta禰シ0+å½­8:completei3e10:downloadedi418e10:incompletei0ee20:ï½¹・æ‘€ヌ$NォcッP眷{ï½½d8:completei8e10:downloadedi10297e10:incompletei10ee20:ナ斈>T・ゥ
    テï¾ï¾‰>ム播8:completei0e10:downloadedi323e10:incompletei1ee20:ニ誨vï½½.sé¬â– ï½ªe1+|<ï½¼Gd8:completei0e10:downloadedi1412e10:incompletei1ee20:ムャ煮肄n、c
    ・ォ_è »d8:completei2e10:downloadedi477e10:incompletei3ee20:禳""ï½»Rオ紆・・@d8:completei0e10:downloadedi175e10:incompletei4ee20:・u倒æ“Zï¾8wæ“¡>{ヲ嚇8:completei1e10:downloadedi5929e10:incompletei0ee20:î„‚隕ッi7MY・v€Yd8:completei1e10:downloadedi212e10:incompletei1ee20:ルキf苺・QfC渋+ロョd8:completei1e10:downloadedi159e10:incompletei2ee20:・ィ7ゥ・胙゙トホァ竫

    can anyone help
    --
    Posted via http://www.ruby-forum.com/.
     
    Adam Akhtar, Apr 15, 2009
    #1

  2. Adam Akhtar

    James Gray Guest

    On Apr 15, 2009, at 8:19 AM, Adam Akhtar wrote:

    > I have a text file whose entries consist of a key written in binary and
    > its values written as strings (you can see an excerpt below).
    >
    > I need to parse the binary and transform it into human-readable hex and
    > parse its associated info. My regexps don't seem to be behaving and I'm
    > wondering if it's me or if it's this binary text that is causing mischief
    > somehow. Here's a sample item:
    >
    > 20: 琮祺ｱ・・ｷ・G聊ま d8:completei7e10:downloadedi2046e10:incompletei1ee
    >
    >
    > binary parts are always enclosed between "20:" and "d8:complete" where
    > the 8 can be any integer(s) e.g. 5 or 23.
    >
    > str = File.open('textfile.txt' , 'r').readlines.join
    > str.gsub!(/(20:)(.*?)(d\d+:)/m) do |x|
    > $1 + $2.unpack('H*').join + $3
    > end
    >
    > The above works for some but not all of the text. It seems to go beyond
    > the "d8:complete" marker.


    I suspect this is an encoding issue. If your data is UTF-8, this code
    may work for you:

    data = File.read('textfile.txt')
    data.scan(/(20:)(.*?)(d\d+:)/um) do |start, bin, finish|
      p start + bin.unpack('H*').join + finish
    end

    I'm guessing though.

    If you want to read more about what I believe is causing you problems,
    you may find my m17n series of blog posts helpful:

    http://blog.grayproductions.net/articles/understanding_m17n

    James Edward Gray II
     
    James Gray, Apr 15, 2009
    #2

  3. Adam Akhtar

    Adam Akhtar Guest

    Ahh, I didn't know you could use scan like that with blocks and
    variables... that's going to come in very handy indeed.

    I'll give that a go. Many thanks, James!
     
    Adam Akhtar, Apr 15, 2009
    #3
  4. Adam Akhtar

    Adam Akhtar Guest

    Adam Akhtar, Apr 15, 2009
    #4
  5. Adam Akhtar

    James Gray Guest

    On Apr 15, 2009, at 5:30 PM, Adam Akhtar wrote:

    > Oh and your blog post looks good too, just started reading it.


    Great. I hope it helps.

    James Edward Gray II
     
    James Gray, Apr 15, 2009
    #5
  6. Adam Akhtar

    Adam Akhtar Guest

    I'm back again and pretty confused as to why my regexp is still
    overshooting the mark.

    I want my regexp /(20:)(.*?)(d\d+:complete.+?incomplete.+?ee)/ium

    to get everything between and including 20: and ee, i.e. from the first
    line of the sample at the bottom of this message I'd want this:

    20: €0テ ・aュリ:$ ゥD€・d8:completei0e10:downloadedi772e10:incompletei1ee

    but sometimes it overshoots and does something like this
    20: €0テ ・aュリ:$
    ゥD€・d8:completei0e10:downloadedi772e10:incompletei1ee20:
    ç®ç¥ºï½±ãƒ»ãƒ»ï½·ãƒ»GèŠã¾d8:completei9e10:downloadedi2064e10:incompletei2ee

    and I can't figure out why. In my Notepad++ editor I have it set to
    display line feeds and carriage returns. Sometimes in the binary parts
    it displays an LF symbol. In binary, does LF represent a new line or is
    it just used to represent data (bytes etc.)? Could it be that that's
    tripping up Ruby's regexp engine?

    I load the data text file like so:
    data = File.open("text.txt", "rb").readlines

    Is there something I'm doing wrong?


    sample from the data text file

    20: €0テ ・aュリ:$
    ゥD€・d8:completei0e10:downloadedi772e10:incompletei1ee20:
    ç®ç¥ºï½±ãƒ»ãƒ»ï½·ãƒ»GèŠã¾d8:completei9e10:downloadedi2064e10:incompletei2ee20:
    }tェスh>ï½¥モé€æ¦Žï¾—d8:completei4e10:downloadedi7724e10:incompletei5ee20: 架C
    ヒウJ<ィFラ0ノ犒Wd8:completei4e10:downloadedi632e10:incompletei2ee20:
    incノ・U]~鼡・`僘< d8:completei5e10:downloadedi536e10:incompletei0ee20:
    シルqナ!pî†ï½«ï½½-リタ58Td8:completei1e10:downloadedi520e10:incompletei0ee20:
    G*﨨ェ
    ・4T澀オソk澆d8:completei0e10:downloadedi1061e10:incompletei2ee20:
    Iネl談ォヲ7Z&レ・K゙建ノEd8:completei5e10:downloadedi268e10:incompletei0ee20:
    Smナソホ怯ソヒィh7r・醋d8:completei5e10:downloadedi798e10:incompletei0ee20:
    bj・]VF・w仭鱧åノd8:completei8e10:downloadedi523e10:incompletei11ee20:
    hï¾ãƒ»ï¾”î–­=ゥソæ•é™ï½¢å·Œd8:completei0e10:downloadedi57e10:incompletei3ee20:
    mb‘<GSキbオゥqT・?d8:completei2e10:downloadedi3864e10:incompletei0ee20:
    u ≡axBニz<縊3d8:completei4e10:downloadedi713e10:incompletei7ee20:
    u[約・ム泱@2 ウ4ァトd8:completei2e10:downloadedi659e10:incompletei5ee20:
    兄・-|ツナ-ユ゚6ⅶルルェ・8:completei0e10:downloadedi108e10:incompletei2ee20:
    ・・a・$ï¾#・3!COd8:completei3e10:downloadedi306e10:incompletei0ee20:
    æ§ãƒ»Gモ_å»°å¶Z、7}d8:completei1e10:downloadedi2293e10:incompletei1ee



     
    Adam Akhtar, Apr 21, 2009
    #6
  7. Adam Akhtar

    Adam Akhtar Guest

    I'm thoroughly confused and have spent a good 10 hours getting nowhere
    fast. I'm going to throw my monitor against the wall!

    I have a file with text like the stuff in the posts above. I don't create
    the file; it's given to me as a standard text file. I don't know how it is
    encoded. I'm assuming UTF-8. There is your standard readable English
    (lower 128 ASCII) and then there are bits of garbled crap that are
    supposed to be binary.

    I do the following

    $KCODE = "UTF8"

    then i do

    data_a = File.read('mn-scrape.txt')
    data_b = File.open("mn-scrape.txt", "rb").readlines.join("")
    data_a.scan(/./m).length ( ==> 170799 )
    data_b.scan(/./m).length ( ==> 767702 )

    Why are they different?
    When I look in Notepad++, viewing the file under the UTF-8 encoding, it
    says the number of characters is 767702, which is more than 4 times bigger
    than the .read version.

    Why is this happening?

    What is the correct way to open this type of file? Any help whatsoever
    will be a great, great, great help!

     
    Adam Akhtar, Apr 22, 2009
    #7
  8. Adam Akhtar

    Adam Akhtar Guest

    Anyone? I'm begging ;-)

    If I'm not being clear, please say so and I'll answer any questions you have.
     
    Adam Akhtar, Apr 23, 2009
    #8
  9. Adam Akhtar

    t3ch.dude Guest

    On Apr 23, 5:05 am, Adam Akhtar wrote:
    > Anyone? I'm begging ;-)
    >
    > If I'm not being clear, please say so and I'll answer any questions you have.


    Adam,

    Forum and e-mail cut & paste is iffy... is there somewhere you could
    post all or part of one of these source files? Is it possible that
    these inline binary blobs are actually all the same number of bytes?

    -t3ch.dude
     
    t3ch.dude, Apr 23, 2009
    #9
  10. Adam Akhtar

    Adam Akhtar Guest

    Adam Akhtar, Apr 23, 2009
    #10
  11. Adam Akhtar

    Martin DeMello Guest

    On Thu, Apr 16, 2009 at 3:59 AM, Adam Akhtar wrote:
    > Ahh i didnt know you could use scan like that with blocks and
    > variables...thats going to come in very handy indeed.


    You probably realise this, but for the benefit of newbies, there are
    three different things going on there. Firstly, if the regexp passed
    to scan has groups, the returned values are arrays with one element
    per group (corresponding to $1, $2, ...). Secondly, if you pass a
    block to scan, it yields its return values one by one, rather than
    just accumulating them into an array. Thirdly, if you yield multiple
    values to a block, the block can capture them either as an array, or
    in multiple parameters. The beauty of ruby is how well all these
    different features fit together to give the elegant scan syntax.
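
A small sketch of those three behaviours, using a made-up ASCII string rather than the real file. The sample text and variable names are illustrative assumptions, and `unpack1` needs Ruby 2.4 or later:

```ruby
# Made-up sample: two 3-byte keys, each followed by a small stats block.
text = "3:abcd8:completei7ee3:xyzd8:completei2ee"

# 1. Groups in the pattern: scan returns one array per match,
#    with one element per group.
pairs = text.scan(/(\d+):(\w{3})d/)

# 2. With a block, each match is yielded instead of being collected.
# 3. The block can take the groups as separate parameters.
hex_keys = []
text.scan(/(\d+):(\w{3})d/) do |len, key|
  hex_keys << "#{len}:#{key.unpack1('H*')}"
end

p pairs
p hex_keys
```

Writing the block as `do |match|` instead would hand over each match as a single array.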

    martin
     
    Martin DeMello, Apr 24, 2009
    #11
  12. Adam Akhtar

    Eric Hodel Guest

    On Apr 23, 2009, at 15:51, Adam Akhtar wrote:

    > ahh should have thought about that. here is a source file
    >
    > Attachments:
    > http://www.ruby-forum.com/attachment/3615/mini-scrape.txt


    I think regexp is the wrong way to do this. Since this is a binary
    file format a regexp is unlikely to give you real data. Scanning
    seems to work out better. Where did you get this data?

    It seems to have the following format in pseudo EBNF:

    record: digit+ ":" <N bytes of data> stuff
    stuff: "d" | "i" N+ "e" "e"?

    Instead of using Regexp, use StringScanner or just read by hand like I
    do below.

    Here's what I tried:

    irb(main):001:0> io = open 'mini-scrape.txt'
    => #<File:mini-scrape.txt>
    irb(main):002:0> io.read 1
    => "2"
    irb(main):003:0> io.read 1
    => "0"
    irb(main):004:0> io.read 1
    => ":"

    # I'm guessing "20:" says read 20 bytes, let's see where that puts us:

    irb(main):005:0> io.read 20
    => " \f\373j\342Q\261\201E\201E\267\201EG\e\343\326\202\334"

    # ok...

    irb(main):006:0> io.read 1
    => "d"

    # I don't know what "d" means, but carrying on:

    irb(main):007:0> io.read 1
    => "8"
    irb(main):008:0> io.read 1
    => ":"

    # "8:", let's read 8 bytes:

    irb(main):009:0> io.read 8
    => "complete"

    # ok, looking good

    irb(main):010:0> io.read 1
    => "i"
    irb(main):011:0> io.read 1
    => "9"
    irb(main):012:0> io.read 1
    => "e"

    # dunno what "i9e" could be

    irb(main):013:0> io.read 1
    => "1"
    irb(main):014:0> io.read 1
    => "0"
    irb(main):015:0> io.read 1
    => ":"

    # "10:", read 10 bytes:

    irb(main):016:0> io.read 10
    => "downloaded"

    # ok...

    irb(main):017:0> io.read 1
    => "i"
    irb(main):018:0> io.read 1
    => "2"
    irb(main):019:0> io.read 1
    => "0"
    irb(main):020:0> io.read 1
    => "6"
    irb(main):021:0> io.read 1
    => "4"
    irb(main):022:0> io.read 1
    => "e"

    # dunno what "i2064e" is, but maybe it downloaded 2064 bytes and the
    # previous one was complete in 9 somethings

    irb(main):023:0> io.read 1
    => "1"
    irb(main):024:0> io.read 1
    => "0"
    irb(main):025:0> io.read 1
    => ":"

    # read 10 bytes, another string:

    irb(main):026:0> io.read 10
    => "incomplete"
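
Later in the thread the file is identified as bencoded. Assuming that format (strings as `<length>:<bytes>`, integers as `i<digits>e`, lists as `l...e`, dictionaries as `d...e`), the byte-by-byte reading above can be sketched as a small recursive reader. The method name `bdecode` and the sample bytes are hypothetical, not taken from the thread, and truncated input is not handled gracefully:

```ruby
require 'stringio'

# Decode one bencoded value from io (a File opened with 'rb', or a StringIO).
# c is the first byte of the value; by default it is read from io.
def bdecode(io, c = io.read(1))
  case c
  when 'i'                          # integer: i<digits>e
    io.gets('e').chomp('e').to_i
  when 'l'                          # list: l<value>...e
    list = []
    while (c = io.read(1)) != 'e'
      list << bdecode(io, c)
    end
    list
  when 'd'                          # dictionary: d<key><value>...e
    dict = {}
    while (c = io.read(1)) != 'e'
      key = bdecode(io, c)
      dict[key] = bdecode(io)
    end
    dict
  when /\A\d\z/                     # string: <length>:<that many raw bytes>
    len = (c + io.gets(':').chomp(':')).to_i
    io.read(len)
  else
    raise ArgumentError, "unexpected byte #{c.inspect}"
  end
end

# Made-up 20-byte key containing a newline, Ctrl-Z and a high byte.
key    = ([0x0a, 0x1a, 0xfb] + (1..17).to_a).pack('C*')
sample = "d5:filesd20:".b + key +
         "d8:completei9e10:downloadedi2064e10:incompletei2eeee".b

scrape = bdecode(StringIO.new(sample))
scrape["files"].each { |k, stats| puts "#{k.unpack1('H*')} #{stats.inspect}" }
```

Because each string is read by its declared length, nothing inside the key is ever inspected, so stray newlines or marker-like bytes cannot confuse the reader.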
     
    Eric Hodel, Apr 24, 2009
    #12
  13. Adam Akhtar

    Heesob Park Guest

    2009/4/24 Adam Akhtar:
    > ahh should have thought about that. here is a source file
    >
    > Attachments:
    > http://www.ruby-forum.com/attachment/3615/mini-scrape.txt
    >

    I guess the following code will work for you.

    str = File.open('mini-scrape.txt' , 'rb').read
    str = str.split(/(20:)/).map{|x|x.gsub(/(.+?)(d\d+:)/){$1.unpack('H*').join+$2}}.join

    Regards,

    Park Heesob
     
    Heesob Park, Apr 24, 2009
    #13
  14. Adam Akhtar

    Adam Akhtar Guest

    Thanks for all your responses.

    >
    > I think regexp is the wrong way to do this. Since this is a binary
    > file format a regexp is unlikely to give you real data. Scanning
    > seems to work out better. Where did you get this data?
    >


    I'm confused about binary file format. Are UTF-8 and binary file format
    two separate things? I thought binary was just represented by Unicode?

    Why would the regexp trip up at the binary part if I tell it the
    encoding is UTF-8?

    Also, isn't read() dangerous with Unicode text? Can I assume
    that all characters are only 1 byte wide?

    The file is bencoded (I think it's like YAML in some respects).

     
    Adam Akhtar, Apr 24, 2009
    #14
  15. Adam Akhtar

    Adam Akhtar Guest

    Well, I've found some stuff out re: binary format.

    I was getting confused re: the "b" switch in File.open("file", "rb") (as
    in "r**b**").

    I thought this was needed to tell Ruby we were dealing with some funky
    "binary" file but it's a lot simpler than that. There is no special
    binary file format (that I'm aware of). Binary is just written to a file
    as text is but in Unicode (I'm assuming).

    So why then do we have to set the "b" for binary mode flag in
    File.open?
    Sometimes binary can have the ^Z character in it. As binary it's doing
    nothing more than any other character (representing some information), but
    in Windows that character represents end of file.

    File.open expects text files, so if it comes across ^Z it will stop
    reading even if the text is actually representing binary. To stop Ruby
    doing that you use "b" in your call to .open.

    This is a Windows-only issue apparently.

    This will explain why i was getting different lengths with

    data_a = File.read('mn-scrape.txt')
    data_b = File.open("mn-scrape.txt", "rb").readlines.join("")
    data_a.scan(/./m).length ( ==> 170799 )
    data_b.scan(/./m).length ( ==> 767702 )
     
    Adam Akhtar, Apr 24, 2009
    #15
  16. Adam Akhtar

    Adam Akhtar Guest

    > I think regexp is the wrong way to do this. Since this is a binary
    > file format a regexp is unlikely to give you real data. Scanning
    > seems to work out better. Where did you get this data?


    Can you tell me why regular expressions are bad for this? Although the
    text represents binary, it's just text at the end of the day. And if I
    know in advance that the binary starts after a :20 and ends before a
    d\d+, is there any reason why
    /:20.+?d\d+/ wouldn't work?

    I looked at StringScanner but that seems to use regular expressions to
    scan though.

    What confuses me re: regular expressions is if I do something like

    File.open("some-file", "rb") do |data|
    text = data.read
    end

    text =~ /(.{20})/um
    $1
    => "d5:filesd20:\000\006呪・

    Notice that the result doesn't show 20 characters and it doesn't end with
    the expected " that irb uses to enclose results... why's that?
     
    Adam Akhtar, Apr 24, 2009
    #16
  17. Adam Akhtar

    Eric Hodel Guest

    On Apr 24, 2009, at 00:54, Adam Akhtar wrote:
    > Thanks for all your responses.
    >> I think regexp is the wrong way to do this. Since this is a binary
    >> file format a regexp is unlikely to give you real data. Scanning
    >> seems to work out better. Where did you get this data?
    >>

    >
    > Im confused about binary file format. Is UTF-8 and binary file format
    > two separate things? I thought binary was just represented by unicode?


    They are separate things. A UTF-8 character that spans multiple bytes
    has a special bit pattern across its multiple bytes. A binary file
    can have any format.
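
To make the distinction concrete, here is a short sketch (modern Ruby, 1.9 or later; the sample bytes are made up) showing the bit pattern of a two-byte UTF-8 character next to arbitrary bytes that do not form valid UTF-8:

```ruby
# A two-byte UTF-8 character: lead byte 110xxxxx, continuation byte 10xxxxxx.
e_acute = "\u00e9"
bits = e_acute.bytes.map { |b| b.to_s(2).rjust(8, "0") }
p bits                      # one character, two bytes
p [e_acute.length, e_acute.bytesize]

# Arbitrary binary data is under no obligation to follow that pattern.
blob    = "\xFB\x6A\xE2\x51".b
as_utf8 = blob.dup.force_encoding("UTF-8")
valid   = as_utf8.valid_encoding?
p valid                     # these bytes are not well-formed UTF-8
p blob.unpack1("H*")        # hex works on the raw bytes regardless
```

Treating such a blob as UTF-8 text is what lets "characters" and bytes drift apart; keeping it as a binary string and counting bytes avoids the question entirely.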

    > Why would the regexp trip up at the binary part if i tell it the
    > encoding is UTF-8?


    It doesn't matter what the encoding is, in a binary file you don't
    have any guarantees that one of your markers won't show up in the
    middle of a binary chunk. There's no reason "20:" or "8:" or anything
    couldn't show up inside the chunk of random data.

    > Also with read() isnt that dangerous with Unicode text? Can I assume
    > that all characters are only 1 byte wide?


    Correct, but I don't think this file is in any Unicode encoding. The
    individual chunks of binary data may be, but overall the file appears
    not to be.

    > The file is bencoded (i think its like yaml in some respects).


    Yes, a binary file format is like yaml, in this case you have the
    "20:", "8:", etc that tell you how far to read (I'm guessing).
     
    Eric Hodel, Apr 24, 2009
    #17
  18. Adam Akhtar

    Eric Hodel Guest

    On Apr 24, 2009, at 03:49, Adam Akhtar wrote:
    > well ive found some stuff out re: binary format.
    >
    > I was getting confused re: the "b" switch in File.open("file", "rb")
    > (as
    > in "r**b**")
    >
    > I thought this was needed to tell ruby we were dealing with some funky
    > "binary" file but its a lot simpler than that. There is no special
    > binary file format (that im aware of). Binary is just written to a
    > file
    > as text is but in unicode (im assuming).


    In windows and on ruby 1.9 the 'b' flag says not to perform any
    conversions of bytes to characters on the text, that's all. Just
    leave it as a stream of bytes.

    > So why then do we have to set the "b" for binary mode flag in the
    > File.open ?
    > Sometimes binary can have the ^Z character in it. As binary its doing
    > nothing more than any other character- representing some information
    > but
    > in windows that character represents end of file.


    Yes, ^Z is the byte "\x1A" (Ctrl-Z), which text mode on Windows treats as end-of-file.

    > File.open expects text files so if it comes across ^Z it will stop
    > reading even if the text is actually representing binary. To stop ruby
    > doing that you use "b" in your call to .open.


    It'll also convert line endings, losing data that should be in a
    binary file.

    > This is a windows only issue apparently.


    It is also an issue on ruby 1.9 for any platform, but for different
    reasons. Ruby will perform other character conversions.

    > This will explain why i was getting different lengths with
    >
    > data_a = File.read('mn-scrape.txt')
    > data_b = File.open("mn-scrape.txt", "rb").readlines.join("")
    > data_a.scan(/./m).length ( ==> 170799 )
    > data_b.scan(/./m).length ( ==> 767702 )


    Yup.
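
A minimal sketch of reading such a file as raw bytes in modern Ruby. The temporary file and its contents are made up for illustration; `File.binread` is equivalent to opening with 'rb' and calling `read`:

```ruby
require 'tempfile'

# Made-up bytes: a form feed, a high byte, CRLF, Ctrl-Z (0x1A) and another
# high byte, followed by ordinary ASCII.
raw = "20:\x0C\xFB\r\n\x1A\xE2d8:complete".b

data = Tempfile.create('scrape') do |f|
  f.binmode
  f.write(raw)
  f.flush
  File.binread(f.path)      # no newline translation, no EOF-on-^Z, no transcoding
end

puts data.encoding           # the binary (ASCII-8BIT) encoding
puts data.bytesize           # every byte survives, including \r\n and \x1A
puts data.unpack1('H*')
```

On non-Windows platforms a text-mode read would return the same bytes here, only tagged with a text encoding; the truncation at ^Z described above is specific to Windows text mode.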
     
    Eric Hodel, Apr 24, 2009
    #18
  19. Adam Akhtar

    Eric Hodel Guest

    On Apr 24, 2009, at 04:39, Adam Akhtar wrote:
    >> I think regexp is the wrong way to do this. Since this is a binary
    >> file format a regexp is unlikely to give you real data. Scanning
    >> seems to work out better. Where did you get this data?

    >
    > Can you tell me why regular expressions are bad for this? Although the
    > text represents binary, its just text at the end of the day. And if i
    > know in advance that the binary starts after a :20 and ends before a
    > d\d+ is there any reason why
    > /:20.+?d\d+/ wouldnt work?


    (I think you mean "20:")

    It will incorrectly match this stream of text, losing data:

    "20:d20:d20:d20:d20:d20:d20:"

    A /d\d/ could happen in the middle of that binary chunk. You're just
    lucky that it hasn't shown up.
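
A sketch of that failure with a made-up 20-byte key that happens to contain the bytes "d8:" (the key and variable names are hypothetical). The lazy regexp stops at the first thing that looks like a marker, while reading by the declared length returns the whole key:

```ruby
# Hypothetical key: 0x01 0x02, then the ASCII bytes "d8:", then 0x03,
# padded with fourteen 0xAA bytes to make 20 bytes in total.
key    = "\x01\x02d8:\x03".b + ("\xAA".b * 14)
record = "20:".b + key +
         "d8:completei7e10:downloadedi2046e10:incompletei1ee".b

by_regex  = record[/20:(.*?)d\d+:/mn, 1]   # stops at the "d8:" inside the key
by_length = record.byteslice(3, 20)        # skips "20:" and takes 20 bytes

puts by_regex.unpack1('H*')
puts by_length.unpack1('H*')
```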

    > I looked at StringScanner but that seems to use regular expression to
    > scan though.


    Yes, but they are all anchored at the front so you can choose what to
    do:

    require 'strscan'

    open 'mini-scrape.txt', 'rb' do |io|
      s = StringScanner.new io.read

      # look for any number of digits followed by a ":" at the scan pointer
      len = s.scan(/\d+:/).to_i # #to_i ignores the ":"

      # now the scan pointer has moved to the start of the binary data
      # so we can read the length of bytes out
      # (the m flag makes . match newlines; don't use the u flag)
      data = s.scan(/.{#{len}}/m)

      p :data => data

      p :next => s.string[s.pos, 20]

      # what's next in the stream is a "d" followed by another length
      # specifier, so let's read in the "d" even though I don't know what
      # to do with it
      case s.peek 1
      when 'd' then
        s.get_byte

        # add your own cases here for other thingys that show up.
      else
        raise "unknown thingy #{s.peek 1}"
      end

      # you'll probably want to put a loop around this, which will start
      # over reading another length specifier and a chunk of data
    end

    If you wrap this in a loop you can easily continue extending it until
    it handles your entire file.
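
One way such a loop might look, as a sketch only. It assumes the stream is a flat run of records shaped like the samples in this thread (a length-prefixed key, then "d", then name/integer pairs, then "e"); the method names `take_bytes` and `read_records` and the sample bytes are hypothetical:

```ruby
require 'strscan'

# Take exactly n bytes at the scan pointer, whatever they contain.
def take_bytes(s, n)
  chunk = s.peek(n)
  raise "wanted #{n} bytes at #{s.pos}, got #{chunk.bytesize}" if chunk.bytesize < n
  s.pos += n
  chunk
end

# Returns [[hex_key, {"complete" => 7, ...}], ...]
def read_records(bytes)
  s = StringScanner.new(bytes.b)
  records = []
  until s.eos?
    len = s.scan(/\d+:/) or raise "expected a length at byte #{s.pos}"
    key = take_bytes(s, len.to_i)          # #to_i ignores the trailing ":"
    s.scan(/d/) or raise "expected 'd' at byte #{s.pos}"
    stats = {}
    until s.scan(/e/)                      # "e" closes the record
      n = s.scan(/\d+:/) or raise "expected a field name at byte #{s.pos}"
      name = take_bytes(s, n.to_i)
      s.scan(/i(-?\d+)e/) or raise "expected an integer at byte #{s.pos}"
      stats[name] = s[1].to_i
    end
    records << [key.unpack1('H*'), stats]
  end
  records
end

# Made-up stream with two records; the first key contains a newline byte.
tail    = "d8:completei7e10:downloadedi2046e10:incompletei1ee".b
k1      = (1..20).to_a.pack('C*')
k2      = "\xFF".b * 20
stream  = "20:".b + k1 + tail + "20:".b + k2 + tail
records = read_records(stream)
records.each { |hex, stats| puts "#{hex} #{stats.inspect}" }
```

The key bytes are taken with `peek` and a pointer move rather than a regexp, so their contents never take part in matching.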

    > What confuses me re: reg expressions is if I do something like
    >
    > File.open("some-file", "rb") do |data|
    > text = data.read
    > end
    >
    > text =~ /(.{20})/um
    > $1
    > => "d5:filesd20:\000\006呪・
    >
    > Notice that the result doesnt show 20 characters and it doesnt end
    > with the expected " that irb uses to enclose results...whys that?


    This is probably the fault of your terminal. Remember you're working
    in bytes (8 bits wide), not UTF-8 characters (which may be up to 4
    bytes long). One of the characters is probably overwriting the
    closing ".
     
    Eric Hodel, Apr 24, 2009
    #19
  20. Adam Akhtar

    Adam Akhtar Guest

    Many thanks, everybody, for your help on the matter, especially Eric, who
    has replied so many times.

    I took a break from the PC over the weekend, came back to the
    problem with a fresh head, and managed to achieve what I wanted using the
    information you all posted.

    Thank you all so much again.

     
    Adam Akhtar, Apr 28, 2009
    #20