File data extraction

Discussion in 'Ruby' started by Rolf Pedersen, Jul 5, 2010.

  1. [Note: parts of this message were removed to make it a legal post.]

    Hi

    I have a file with the following format (example):

    Save Format v3.0(19990112)
    @begin Libraries
    "felles.pbl" "";
    @end;
    @begin Objects
    "n_cst_xml_utils.sru" "felles.pbl";
    "n_melding.sru" "felles.pbl";
    @end;

    The data in the two begin/end blocks are lists, which may be longer than
    shown.

    I'd like to extract an array of the filenames (first quote) in the @begin
    Objects ... @end; block.
    For the example above this should return ["n_cst_xml_utils.sru",
    "n_melding.sru"]

    My initial idea was to treat the whole thing as one long string, extract
    the part within the begin/end block with a regexp, convert the result
    back to individual lines (split '\n'), and then use array.map and a regexp to
    single out the name in the first quote on each line.
    But I keep hitting the wall, especially with the first step in this
    approach... :o(

    I know this should be easily done in a couple of lines of code, but I can't
    get it right.

    Appreciate any help!

    Best regards,
    Rolf
     
    Rolf Pedersen, Jul 5, 2010
    #1
  2. Rolf Pedersen wrote:
    > My initial idea was to treat the whole thing as one long string, extract
    > the part within the begin/end block with a regexp, convert the result
    > back to individual lines (split '\n'), and then use array.map and a regexp to
    > single out the name in the first quote on each line.
    > But I keep hitting the wall, especially with the first step in this
    > approach... :o(


    How about this for starters:

    p src.scan(/^@begin(.*?)^@end;/m)
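
    To see what that call returns, here is a minimal sketch run against the
    sample from the first post. It assumes `src` holds the whole file as one
    string; the variable `blocks` is only an illustrative name.

```ruby
# Sample input from the first post, held in a single string.
src = <<~EOS
  Save Format v3.0(19990112)
  @begin Libraries
  "felles.pbl" "";
  @end;
  @begin Objects
  "n_cst_xml_utils.sru" "felles.pbl";
  "n_melding.sru" "felles.pbl";
  @end;
EOS

# With one capture group, scan returns one single-element array per match.
# The /m flag lets "." cross newlines; "^" matches at every line start.
blocks = src.scan(/^@begin(.*?)^@end;/m)

# Each capture starts with the rest of the "@begin" line (the block name),
# followed by the lines of that block.
blocks.each { |captures| p captures.first }
```

    Both blocks come back, so the block name still has to be checked (or put
    into the pattern) before pulling out the filenames.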
    --
    Posted via http://www.ruby-forum.com/.
     
    Brian Candler, Jul 5, 2010
    #2
  3. Thanks Brian, that helped me a lot! :o)

    The code now looks like this:

    filenames = File.open(filename).readlines.join.scan(/^@begin Objects\n(.*?)^@end;/m)[0][0].
      split("\n").map{|l| l.scan(/"(.*?)"/)[0][0]}
    Probably far from optimal, but it seems to do the trick.

    Best regards,
    Rolf

     
    Rolf Pedersen, Jul 5, 2010
    #3
  4. Rolf Pedersen wrote:
    > The code now looks like this:
    >
    > filenames = File.open(filename).readlines.join.scan(/^@begin Objects\n(.*?)^@end;/m)[0][0].
    >   split("\n").map{|l| l.scan(/"(.*?)"/)[0][0]}
    > Probably far from optimal, but it seems to do the trick.


    That's the most important thing :)

    I actually misread your example. If there's only one @begin Objects
    section, then 'scan' is overkill; a simple regexp match will do.

    res = if File.read(filename) =~ /^@begin Objects$(.*?)^@end;$/m
      $1.scan(/^\s*"(.*?)"/).map { |r| r.first }
    end
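
    The same match can be tried without a file on disk. The sketch below wraps
    it in a helper that takes the file contents as a string; the method name
    `object_names` and the variable `sample` are assumptions made for the
    example, not part of the original code.

```ruby
# Hypothetical helper: takes the file contents as a string, not a path.
def object_names(text)
  # The non-greedy (.*?) stops at the first "@end;" after "@begin Objects".
  m = text.match(/^@begin Objects$(.*?)^@end;$/m)
  return [] unless m

  # Take the first quoted string on each line of the captured block.
  m[1].scan(/^\s*"(.*?)"/).map(&:first)
end

sample = <<~EOS
  Save Format v3.0(19990112)
  @begin Libraries
  "felles.pbl" "";
  @end;
  @begin Objects
  "n_cst_xml_utils.sru" "felles.pbl";
  "n_melding.sru" "felles.pbl";
  @end;
EOS

p object_names(sample)
```

    For the sample this prints ["n_cst_xml_utils.sru", "n_melding.sru"], and
    it returns an empty array when there is no Objects block at all.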
     
    Brian Candler, Jul 5, 2010
    #4
  5. 2010/7/5 Brian Candler <>:
    > Rolf Pedersen wrote:
    >> The code now looks like this:
    >>
    >> filenames = File.open(filename).readlines.join.scan(/^@begin Objects\n(.*?)^@end;/m)[0][0].
    >>   split("\n").map{|l| l.scan(/"(.*?)"/)[0][0]}
    >> Probably far from optimal, but it seems to do the trick.
    >
    > That's the most important thing :)
    >
    > I actually misread your example. If there's only one @begin Objects
    > section, then 'scan' is overkill; a simple regexp match will do.
    >
    > res = if File.read(filename) =~ /^@begin Objects$(.*?)^@end;$/m
    >   $1.scan(/^\s*"(.*?)"/).map { |r| r.first }
    > end


    If files are large, then a line-based approach is usually more
    feasible. In this case you can use the flip-flop operator in an if
    condition to select the lines we want:

    17:31:49 Temp$ ./lextr.rb
    ["n_cst_xml_utils.sru", "n_melding.sru"]
    17:48:32 Temp$ cat lextr.rb
    #!/bin/env ruby19

    ar = []

    DATA.each_line do |line|
      # flip-flop: true from the "@begin Objects" line through the next "@end;" line
      if /^@begin Objects/ =~ line .. /^@end;/ =~ line
        name = line[/^\s*"([^"]*)"/, 1] and ar << name
      end
    end

    p ar

    __END__
    Save Format v3.0(19990112)
    @begin Libraries
    "felles.pbl" "";
    @end;
    @begin Objects
    "n_cst_xml_utils.sru" "felles.pbl";
    "n_melding.sru" "felles.pbl";
    @end;
    17:49:30 Temp$
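
    The same idea applied to a real file might look like the sketch below,
    which streams the file with File.foreach instead of reading DATA. The
    helper name `object_names_from` and the temporary file are assumptions
    made so that the example runs on its own.

```ruby
require "tempfile"

# Hypothetical helper: reads the file one line at a time, so the whole
# file never has to fit in memory.
def object_names_from(path)
  names = []
  File.foreach(path) do |line|
    # Flip-flop: turns on at the "@begin Objects" line and stays on
    # up to and including the next "@end;" line.
    if (line =~ /^@begin Objects/)..(line =~ /^@end;/)
      name = line[/^\s*"([^"]*)"/, 1]
      names << name if name
    end
  end
  names
end

# Demo against a temporary file holding the sample from the first post.
names = Tempfile.create("sample") do |f|
  f.write(<<~EOS)
    Save Format v3.0(19990112)
    @begin Libraries
    "felles.pbl" "";
    @end;
    @begin Objects
    "n_cst_xml_utils.sru" "felles.pbl";
    "n_melding.sru" "felles.pbl";
    @end;
  EOS
  f.flush
  object_names_from(f.path)
end

p names
```

    The "@begin Objects" and "@end;" lines pass the flip-flop test too, but
    they contain no quoted string, so nothing is added for them.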

    Kind regards

    robert


    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
     
    Robert Klemme, Jul 5, 2010
    #5
  6. On Jul 5, 3:48 am, Rolf Pedersen <> wrote:
    > I'd like to extract an array of the filenames (first quote) in the @begin
    > Objects ... @end; block.
    > For the example above this should return ["n_cst_xml_utils.sru",
    > "n_melding.sru"]

    # String#to_a is gone as of Ruby 1.9; String#lines does the same job here
    puts DATA.read.scan(/^@begin Objects(.*?)^@end;/m).flatten.
    map{|s| s.strip.lines.to_a}.flatten.map{|s| s.split(/"/)[1]}

    __END__
    Save Format v3.0(19990112)
    @begin Libraries
    "felles.pbl" "";
    @end;
    @begin Objects
    "n_cst_xml_utils.sru" "felles.pbl";
    "n_melding.sru" "felles.pbl";
    @end;
    @begin Libraries
    "felles.pbl" "";
    @end;
    @begin Objects
    "n_cst_xml_utils.sru" "felles.pbl";
    "n_melding.sru" "felles.pbl";
    @end;
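
    The last step, taking the text between the first two quotation marks, can
    be looked at in isolation. The sketch below compares the split used above
    with a regexp capture on a single line from the sample; the variable names
    are only illustrative.

```ruby
line = %q{"n_cst_xml_utils.sru" "felles.pbl";}

# Splitting on the quote character puts the first quoted value at index 1;
# index 0 is whatever precedes the first quote (here an empty string).
by_split = line.split('"')[1]

# The regexp route: capture everything up to the next quote.
by_regexp = line[/"([^"]*)"/, 1]

p by_split
p by_regexp
```

    Both give "n_cst_xml_utils.sru" for this line, and both give nil for a
    line that contains no quotation mark.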
     
    w_a_x_man, Jul 5, 2010
    #6
  7. Robert:
    The use of the flip-flop operator opened a new door for me. I didn't know of this
    before...
    And new knowledge is the best knowledge! :o)

    w_a_x_man:
    I can't believe I didn't think of the possibility to use a simple split
    instead of a scan to extract the filenames between the first two quotation
    marks!

    Thanks to all for the great input I've gotten on this issue!
    :o)

    Best regards,
    Rolf

     
    Rolf Pedersen, Jul 7, 2010
    #7