parse bib file in C

Discussion in 'C Programming' started by Rudra Banerjee, Jun 30, 2012.

  1. I would be grateful if somebody could show me how to parse a bib file in C.
    I am a novice in C, so in spite of finding many hits when I searched for file parsers in C, I failed to create a bib parser.
    The bib file structure is


    ###############################
    ## SAMPLE BIB FILE FORMAT ##
    @article{key1(alpha-numeric),
    Title="Some Title(char)",
    Author="Author List(char)",
    Year="2012(int)",
    volume="123(int)",
    Pages="321(int)"
    journal="Publishers(char)"
    }


    @book{key2(alpha-numeric),
    Title="Some OTHER Title(char)",
    Author="OTHER Author List(char)",
    Year="2010(int)",
    volume="1234(int)",
    Pages="4321(int)"
    Publishers="Publishers(char)",
    Address="Publishers Address(alpha-numeric)",
    Edition="Books Edition"
    }
    #################################
    I want to parse this type of file and put it in a 2d array.
    Please help.
     
    Rudra Banerjee, Jun 30, 2012
    #1

  2. Thanks a lot for trying to help me.

    On Saturday, 30 June 2012 10:00:18 UTC+5:30, Barry Schwarz wrote:
    > A 2d array of what? What parts of the above data do you want to save?


    I want 2d array of the entries, say,
    array[1][0]=article; array[1][1]=key1; array[1][2]=Some Title; array[1][3]=Author List
    array[2][0]=book; array[2][1]=key2; array[2][2]=Some OTHER Title; array[2][3]=OTHER Author List
    etc.

    > How many @article entries are there in the input? How many @book
    > entries? Do you intend for the entire array to be in memory at once?


    I would really love to make it general, so it should parse both @article and @book entries. Also, the @article entries do not all come before the @book entries. I would love to write the output to memory on the fly.

    > What will you do with the data once you parse it? Are the int values
    > actually imbedded in quotes?

    Yes
    >Is the order of the data fixed?


    No

    > Is every entry guaranteed to have all the data you show?


    No

    > Is the file well behaved (every left brace has a matching right brace, the reverse
    > also, every left parenthesis has a corresponding right parenthesis and
    > conversely, all quotes occur in pairs, etc)?


    Yes
    > Are the datum ID and the datum always on the same line?

    That is the general practice, but it's not a RULE.
    > Do any lines have multiple data?


    They may have, if there are several authors. But for my purpose, parsing the first two authors will be sufficient.

    > Are the volume and journal IDs really not capitalized?

    The initial letters are in Capital, like Phys. Rev. B.

    The following are three entries from a real bib file. I hope this helps.

    @article{Armgnac1930,
    author = "Armgnac, Alden C.",
    journal = "Popular Science",
    month = "December",
    pages = "31",
    title = "{New Steel Alloy Is Rust Proof}",
    year = "1930"
    }

    @book{ashcroftsolid,
    author = "Ashcroft, NW and Mermin, ND",
    booktitle = "{Solid State Physics}",
    publisher = "Brooks Cole",
    title = "{Solid State Physics}",
    x-fetchedfrom = "Google Scholar",
    year = "1976"
    }

    @article{Banerjee2010a,
    author = "Banerjee, Mitali and {\textbf{Rudra Banerjee}} and Majumdar, A.K. and Mookerjee, Abhijit and Sanyal, Biplab and Nigam, A.K.",
    doi = "10.1016/j.physb.2010.07.028",
    file = ":home/rudra/Documents/papers/sdarticle(3).pdf:pdf",
    issn = "09214526",
    journal = "Physica B: Condensed Matter",
    keywords = "Magnetic phases; Spin glasses",
    number = "20",
    pages = "4287--4293",
    publisher = "Elsevier",
    title = "{Magnetism in NiFeMo disordered alloys: Experiment and theory}",
    volume = "405",
    year = "2010"
    }
     
    Rudra Banerjee, Jun 30, 2012
    #2

  3. If anyone could kindly provide a small sample code, I can try to build on it and come back with any problems.
     
    Rudra Banerjee, Jun 30, 2012
    #3
  4. I would be grateful if someone could show me some sample code so that I can build on it and come back if I face any problem.
     
    Rudra Banerjee, Jun 30, 2012
    #4
  5. Rudra Banerjee

    none Guest

    In article <>,
    Rudra Banerjee <> wrote:
    >I would be grateful if somebody could show me how to parse a bib file in
    >C. I am a novice in C, so in spite of finding many hits when I searched
    >for file parsers in C, I failed to create a bib parser.


    You may want to have a look at the bibtex parser called btparse.
    You will find it at:

    http://www.cpan.org/authors/Greg_Ward/btparse-0.34.tar.gz

    But since you say you are a novice, then probably you will
    find it too advanced for you. You will need to know something
    about lexical parsers and analyzers to begin making sense
    of the program. Consider setting aside this project for a
    later time. Pick up something more suitable for a novice.

    --
    Rouben Rostamian
     
    none, Jun 30, 2012
    #5
  6. Thanks for the link.
    Previously I have done a naive xml to bibtex converter using libxml.
    So I hope that if someone shows me the first step, I can manage.

    Being a novice, I was not aware of terms like lexical parser. Thanks for letting me know the way.
     
    Rudra Banerjee, Jun 30, 2012
    #6
  7. Rudra Banerjee <> writes:

    > I would be grateful if somebody could show me how to parse a bib file
    > in C. I am a novice in C, so in spite of finding many hits when I
    > searched for file parsers in C, I failed to create a bib parser.


    The default answer to any such question is "use lex and yacc" (the GNU
    versions being flex and bison). Today there are many other similar
    programs, but lex and yacc have lots of tutorial material written about
    them so I think they might still be the beginner's choice. For Usenet
    help on using them, comp.unix.programmer might be the best place, though
    there may be more specific groups.

    But another question comes to mind. If you are a C novice, why are you
    doing this in C? You say elsewhere that you want the result in a "2d
    array" but that does not seem like the right structure for any bibtex
    processing that I can think of. What is the top-level task you are
    trying to achieve, and why do you think C is the right way to do it?

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Jun 30, 2012
    #7
  8. Rudra Banerjee

    Jorgen Grahn Guest

    On Sat, 2012-06-30, Rudra Banerjee wrote:
    > I would be grateful if somebody could show me how to parse a bib file
    > in C.

    ....
    > ## SAMPLE BIB FILE FORMAT ##
    > @article{key1(alpha-numeric),
    > Title="Some Title(char)",
    > Author="Author List(char)",
    > Year="2012(int)",
    > volume="123(int)",
    > Pages="321(int)"
    > journal="Publishers(char)"
    > }


    To be pedantic, you probably mean BibTeX. Bib is another, much older
    tool which used (more or less) the refer(1) file format:

    %T Some Title
    %A Author1
    %A Author2
    %A Author3
    %D 2012
    %V 123
    %J Publishers

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Oo o. . .
    \X/ snipabacken.se> O o .
     
    Jorgen Grahn, Jun 30, 2012
    #8
  9. On Saturday, 30 June 2012 17:22:02 UTC+5:30, Ben Bacarisse wrote:
    > But another question comes to mind. If you are a C novice, why are you
    > doing this in C? You say elsewhere that you want the result in a "2d
    > array" but that does not seem like the right structure for any bibtex
    > processing that I can think of. What is the top-level task you are
    > trying to achieve, and why do you think C is the right way to do it?
    > Ben.


    What I want to achieve is a JabRef-like viewer using GTK. My primary programming knowledge is in Fortran, and this is a pastime, so I don't think Python would be a very good option for me. As for the 2d array, say array[i][j], as shown previously:
    "I want 2d array of the entries, say,
    array[1][0]=article; array[1][1]=key1; array[1][2]=Some Title; array[1][3]=Author List
    array[2][0]=book; array[2][1]=key2; array[2][2]=Some OTHER Title; array[2][3]=OTHER Author List
    etc. "
     
    Rudra Banerjee, Jun 30, 2012
    #9
  10. Rudra Banerjee

    Stefan Ram Guest

    Rudra Banerjee <> writes:
    >I would be grateful if someone could show me some sample code so that I
    >can build on it and come back if I face any problem.


    To write a parser for a language, one does not want to use
    examples of that language, but a grammar for that language.
    Give a C programmer a grammar and some money and he'll
    happily write a parser for you. As for an example:

    (If answering to the following post, one should please not
    quote all of it, but only a few lines one directly refers to.)

    In order to interpret or translate an expression (term), it is
    decomposed into lexical units (tokens, words), which then are
    used by a parser to build symbols and a structured
    representation of the input. This representation then might be
    evaluated or translated into some other representation.

    The syntactical structuring resembles the rules for the
    construction of an expression, which often are given by
    so-called "productions" in EBNF (extended Backus-Naur Form)
    and which sometimes are left-recursive.

    When writing a parser, the left-recursive productions sometimes
    are a worry to the author, because it is not obvious how to
    avoid an infinite recursion. The solution is to rewrite them as
    right-recursive productions.

    The addition with a binary infix operator, for example, is
    left associative. However, it is simpler to analyze in a
    right-associative manner. Therefore, one analyzes the source
    using right-associative rules and then creates a result
    using a left-associative interpretation.

    A left-associative grammar might be, for example, as follows.

    <numeral> ::= '2' | '4' | '5'.
    <expression> ::= <numeral> | <expression> '+' <numeral>.
    start symbol: <expression>.

    To analyze this using a recursive descent parser, one
    prefers to use the following grammar.

    <numeral> ::= '2' | '4' | '5'.
    <expression> ::= <numeral>[ '+' <expression> ].
    start symbol: <expression>.

    This can be written using iteration as follows.

    <numeral> ::= '2' | '4' | '5'.
    <expression> ::= <numeral>{ '+' <numeral> }.
    start symbol: <expression>.

    However, the product is created in the sense of the
    first grammar. Example code follows.

    #include <stdio.h> /* printf */

    /* scanner */

    static inline char get()
    { static char const * const source = "2+4+5)";
    static int pos = 0;
    return source[ pos++ ]; }

    /* parser */

    static inline int numeral(){ return get() - '0'; }

    static int sum(){ int result = numeral();
    while( '+' == get() )result += numeral();
    return result; }

    /* main */

    int main( void ){ printf( "sum = %d\n", sum() ); }

    To be able to parse expressions with higher
    priority, the grammar can be extended.

    <numeral> ::= '2' | '4' | '5'.
    <product> ::= <numeral> | <product> '*' <numeral>.
    <sum> ::= <product> | <sum> '+' <product>.
    start symbol: <sum>.

    In iterative notation:

    <numeral> ::= '2' | '4' | '5'.
    <product> ::= <numeral>{ '*' <numeral> }.
    <sum> ::= <product>{ '+' <product> }.
    start symbol: <sum>.

    In C:

    #include <stdio.h> /* printf */

    /* scanner */

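    /* get( 0 ) peeks at the current character;
       get( 1 ) returns the current character and then advances */
    static inline char get( int const move )
    { static char const * const source = "2+4*5)";
    static int pos = 0;
    char const c = source[ pos ];
    pos += move;
    return c; }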

    /* parser */

    static inline int numeral(){ return get( 1 )- '0'; }

    static int product(){ int result = numeral();
    while( '*' == get( 0 )){ get( 1 ); result *= numeral(); }
    return result; }

    static int sum(){ int result = product();
    while( '+' == get( 1 ))result += product();
    return result; }

    /* main */

    int main( void ){ printf( "sum = %d\n", sum() ); }

    Exercises

    - What is the output of the above programs?

    - Extend the last grammar and the last program so as
    to handle subtraction.

    - Extend the result of the last exercise in order
    to handle division.

    - Extend the result of the last exercise so that also
    numbers with multiple digits are accepted.

    - Extend the result of the last exercise so that also
    terms in parentheses are accepted. The input "(2+4)*5)"
    should give the result "30".

    - Extend the result of the last exercise so that
    also a unary minus "-" is recognized.

    - Extend the result of the last exercise so that
    more operators and functions are recognized.

    - Extend the result of the last exercise so that
    meaningful error messages are created for all
    inputs that do not fulfill the rules of the input
    language.

    - Extend the result of the last exercise so that the
    error messages also show the location where the error
    was detected. It should be possible to enter an expression
    that spans multiple lines, and an error message should
    contain the number of the line where the error was
    detected.

    See also:

    http://compilers.iecc.com/crenshaw/
     
    Stefan Ram, Jun 30, 2012
    #10
  11. On Fri, 29 Jun 2012 22:00:46 -0700 (PDT), Rudra Banerjee
    <> wrote:

    >Thanks a lot for trying to help me.
    >
    >On Saturday, 30 June 2012 10:00:18 UTC+5:30, Barry Schwarz wrote:
    >> A 2d array of what? What parts of the above data do you want to save?

    >
    >I want 2d array of the entries, say,
    >array[1][0]=article; array[1][1]=key1; array[1][2]=Some Title; array[1][3]=Author List
    >array[2][0]=book; array[2][1]=key2; array[2][2]=Some OTHER Title; array[2][3]=OTHER Author List
    >etc.


    You have to make some design decisions. Can the array size be fixed
    (the dimensions known at the time you write the code) or do you need
    the ability to expand the array as you go through the input data? Do
    you want the array to contain the actual data or do you want the array
    to contain pointers to the data?

    If the array size is fixed and it contains the data, then you can
    define it as
        typedef char TYPE[MAX_DATA_ITEM_LENGTH];
        TYPE array[MAX_ENTRIES][MAX_DATA_ITEMS_PER_ENTRY];
    This will produce a three dimensional array of characters. array[i]
    is the array of data for the i-th article/book. array[i][j] is the text
    of the j-th data item for this article/book (for j = 0, the key; for j
    = 1, the title; etc). array[i][j][k] is the k-th character in that
    data item.

    (Since all the array elements must be the same size, this can waste a
    lot of memory. Title and Author probably need more space than year or
    volume. You might want to consider using a structure to hold an entry
    worth of data so you could size each member as appropriate. However,
    you would still need to size each member for the maximum amount of
    data in that data item. If one book had a 50 character title, then
    every book would have to have space for a 50 character title.)

    A fixed array that contains pointers can be defined with
        typedef char *TYPE;
        TYPE array[MAX_ENTRIES][MAX_DATA_ITEMS_PER_ENTRY];
    This will produce a two dimensional array of pointers. array[i] is
    the array for the i-th article/book. array[i][j] is the pointer to
    the text for the j-th data item. This pointer must be assigned the
    address where the text for this item is stored (probably a block of
    memory allocated with malloc). With this approach, you can allocate
    just enough space for each data item. However, calling malloc for
    each data item would probably impact performance and could lead to
    significant memory fragmentation.

    If the array will need to be adjusted, then you define a pointer to it
    and call malloc to allocate memory to hold an estimated number of
    entries. When all the entries are used up, you call realloc to
    allocate room for additional entries. You still get to decide if the
    array should hold data or pointers to data.
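A minimal sketch of the malloc/realloc growth described above, assuming each stored item is just a key string. The names `entry_list`, `list_append`, and `list_free` are hypothetical, and doubling the capacity is one common growth policy rather than the only reasonable one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entry_list {
    char **keys;       /* one malloc'ed string per entry */
    size_t count;      /* slots in use */
    size_t capacity;   /* slots allocated */
};

/* Append a copy of key, doubling the capacity when the array is full.
   Returns 0 on success, -1 if memory runs out (stored keys are unchanged). */
int list_append(struct entry_list *l, const char *key)
{
    char *copy;
    assert(l != NULL && key != NULL);
    if (l->count == l->capacity) {
        size_t ncap = l->capacity ? l->capacity * 2 : 8;
        char **tmp = realloc(l->keys, ncap * sizeof *tmp);
        if (tmp == NULL) return -1;      /* old block is still valid */
        l->keys = tmp;
        l->capacity = ncap;
    }
    copy = malloc(strlen(key) + 1);
    if (copy == NULL) return -1;
    strcpy(copy, key);
    l->keys[l->count++] = copy;
    return 0;
}

/* Release every stored key and the pointer array itself. */
void list_free(struct entry_list *l)
{
    size_t i;
    for (i = 0; i < l->count; i++) free(l->keys[i]);
    free(l->keys);
    l->keys = NULL;
    l->count = l->capacity = 0;
}
```

A caller would start from an empty list (`struct entry_list l = { NULL, 0, 0 };`), call `list_append(&l, key)` once per parsed entry, and call `list_free(&l)` at the end. Assigning the result of realloc to a temporary first avoids losing the old block if the call fails.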

    As a beginner in C, are you sure you want to do this?

    >
    >> How many @article entries are there in the input? How many @book
    >> entries? Do you intend for the entire array to be in memory at once?

    >
    >I would really love to make it general, so it should parse both @article and @book entries. Also, the @article entries do not all come before the @book entries. I would love to write the output to memory on the fly.
    >
    >> What will you do with the data once you parse it? Are the int values
    >> actually imbedded in quotes?

    >Yes
    >>Is the order of the data fixed?

    >
    >No


    This means for every input record, you need to determine what type of
    data it contains. You start by looking for an '@'. Then it must be
    followed by "article{" or "book{". Then you extract (up to the ',' or
    possibly '}') and save the key in the first array element. Then you
    look for next '=' and determine what type of data that follows by the
    identifier that precedes it. You extract the data (up to next ',' or
    '}'), probably strip off the quotes, and save it in the appropriate
    array element.
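The steps above can be sketched roughly as follows. This is not a full BibTeX parser: it assumes every value is written in double quotes with no quote character inside it, that fields are separated by commas, and that names and values fit the fixed buffers (longer text is truncated). It also accepts any entry type rather than only "article" and "book". `struct entry`, `parse_entry`, and the size limits are names made up for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 32
#define MAX_TEXT   512

struct field { char name[64]; char value[MAX_TEXT]; };

struct entry {
    char type[32];                    /* "article", "book", ... */
    char key[64];                     /* citation key */
    struct field fields[MAX_FIELDS];
    int nfields;
};

static const char *skip_ws(const char *p)
{
    while (isspace((unsigned char)*p)) p++;
    return p;
}

/* Copy characters from p into dst until one of the stop characters
   (or the end of the string) is reached; trailing blanks are trimmed.
   Returns a pointer to the stop character (or to the terminating '\0'). */
static const char *copy_until(const char *p, const char *stops,
                              char *dst, size_t cap)
{
    size_t n = 0;
    assert(cap > 0);
    while (*p != '\0' && strchr(stops, *p) == NULL) {
        if (n + 1 < cap) dst[n++] = *p;   /* silently truncate long text */
        p++;
    }
    while (n > 0 && isspace((unsigned char)dst[n - 1])) n--;
    dst[n] = '\0';
    return p;
}

/* Parse one entry starting at the first '@' in text.
   Returns a pointer just past the closing '}', or NULL on a syntax error. */
const char *parse_entry(const char *text, struct entry *e)
{
    const char *p = strchr(text, '@');
    if (p == NULL) return NULL;
    e->nfields = 0;
    p = copy_until(p + 1, "{", e->type, sizeof e->type);
    if (*p != '{') return NULL;
    p = copy_until(skip_ws(p + 1), ",}", e->key, sizeof e->key);
    while (*p == ',') {
        struct field *f;
        p = skip_ws(p + 1);
        if (*p == '}') break;             /* trailing comma before '}' */
        if (e->nfields == MAX_FIELDS) return NULL;
        f = &e->fields[e->nfields];
        p = copy_until(p, "=", f->name, sizeof f->name);
        if (*p != '=') return NULL;
        p = skip_ws(p + 1);
        if (*p != '"') return NULL;       /* only "quoted" values handled */
        p = copy_until(p + 1, "\"", f->value, sizeof f->value);
        if (*p != '"') return NULL;
        e->nfields++;
        p = skip_ws(p + 1);
    }
    return *p == '}' ? p + 1 : NULL;
}
```

Calling `parse_entry` repeatedly, each time passing the pointer it returned, walks through all entries in a buffer that holds the whole file. Because white space (including newlines) is skipped between tokens, a field name and its value need not sit on the same line.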

    >
    >> Is every entry guaranteed to have all the data you show?

    >
    >No


    So some array elements will not have data. It would probably pay to
    initialize the array (char to '\0' and pointers to NULL) so you can
    tell if the corresponding data item is omitted.
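As a small illustration of that point, here is a sketch using the fixed-size array of characters from earlier in this post. The three size limits, the `ITEM_*` indices, and the helpers `set_item` and `item_present` are assumptions made for the example. An object with static storage duration starts out all zero, so every item is initially an empty string.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES              100
#define MAX_DATA_ITEMS_PER_ENTRY 8
#define MAX_DATA_ITEM_LENGTH     256

typedef char TYPE[MAX_DATA_ITEM_LENGTH];

/* File-scope static object: zero-initialized, so every item is "". */
static TYPE array[MAX_ENTRIES][MAX_DATA_ITEMS_PER_ENTRY];

/* Hypothetical fixed positions for the data items of one entry. */
enum { ITEM_KEY, ITEM_TITLE, ITEM_AUTHOR, ITEM_YEAR };

/* Store text as item j of entry i, truncating it if it is too long. */
void set_item(int i, int j, const char *text)
{
    assert(i >= 0 && i < MAX_ENTRIES);
    assert(j >= 0 && j < MAX_DATA_ITEMS_PER_ENTRY);
    strncpy(array[i][j], text, MAX_DATA_ITEM_LENGTH - 1);
    array[i][j][MAX_DATA_ITEM_LENGTH - 1] = '\0';
}

/* An item that was never stored still holds the empty string. */
int item_present(int i, int j)
{
    return array[i][j][0] != '\0';
}
```

With this layout, an entry that has no title simply leaves `array[i][ITEM_TITLE]` empty, and `item_present(i, ITEM_TITLE)` returns 0 for it.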

    >
    >> Is the file well behaved (every left brace has a matching right brace, the reverse
    >> also, every left parenthesis has a corresponding right parenthesis and
    >> conversely, all quotes occur in pairs, etc)?

    >
    >Yes
    >> Are the datum ID and the datum always on the same line?

    >That is the general practice, but it's not a RULE.


    Then when you go through the extraction process mentioned above, you
    may have to continue processing the same data item on the next line.

    >> Do any lines have multiple data?

    >
    >They may have, if there are several authors. But for my purpose, parsing the first two authors will be sufficient.


    I was not asking whether a data item has multiple values, but whether
    multiple data items appear on the same line, as in
    Year="2012",volume="123",

    >
    >> Are the volume and journal IDs really not capitalized?

    >The initial letters are in Capital, like Phys. Rev. B.


    It doesn't matter if the data inside the quotes is capitalized. I was
    asking about the data identifiers (the stuff before the '='). "Title",
    "Author", "Year", "Pages", etc are capitalized. Is "journal"
    capitalized or not?
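One way to sidestep the capitalization question is to compare identifiers without regard to case, which matches how BibTeX itself treats field names. A minimal sketch follows; the helper name `same_name` is made up here, and it is hand-written because ISO C has no standard case-insensitive string comparison.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if a and b are equal ignoring ASCII case, otherwise 0. */
int same_name(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}
```

With such a helper, "Title", "title", and "TITLE" all select the same array element, while "title" and "booktitle" remain distinct identifiers.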

    >
    >The following are three entries from a real bib file. I hope this helps.
    >
    >@article{Armgnac1930,
    > author = "Armgnac, Alden C.",


    But now "author" is not capitalized! It makes a difference.

    > journal = "Popular Science",
    > month = "December",
    > pages = "31",
    > title = "{New Steel Alloy Is Rust Proof}",
    > year = "1930"
    >}
    >
    >@book{ashcroftsolid,
    > author = "Ashcroft, NW and Mermin, ND",
    > booktitle = "{Solid State Physics}",


    Is the identifier "booktitle" as you show here or "Title" as you
    showed in your original message? It makes a difference.

    > publisher = "Brooks Cole",
    > title = "{Solid State Physics}",


    Are there really two data items with the same data?

    > x-fetchedfrom = "Google Scholar",


    This is a new identifier (and there are more below). You are going to
    need a COMPLETE list of all identifiers.

    > year = "1976"
    >}
    >
    >@article{Banerjee2010a,
    > author = "Banerjee, Mitali and {\textbf{Rudra Banerjee}} and Majumdar, A.K. and Mookerjee, Abhijit and Sanyal, Biplab and Nigam, A.K.",
    > doi = "10.1016/j.physb.2010.07.028",
    > file = ":home/rudra/Documents/papers/sdarticle(3).pdf:pdf",
    > issn = "09214526",
    > journal = "Physica B: Condensed Matter",
    > keywords = "Magnetic phases; Spin glasses",
    > number = "20",
    > pages = "4287--4293",
    > publisher = "Elsevier",
    > title = "{Magnetism in NiFeMo disordered alloys: Experiment and theory}",
    > volume = "405",
    > year = "2010"
    >}
    >
    >
    >


    --
    Remove del for email
     
    Barry Schwarz, Jun 30, 2012
    #11
  12. Rudra Banerjee <> writes:

    > On Saturday, 30 June 2012 17:22:02 UTC+5:30, Ben Bacarisse wrote:
    >> But another question comes to mind. If you are a C novice, why are you
    >> doing this in C? You say elsewhere that you want the result in a "2d
    >> array" but that does not seem like the right structure for any bibtex
    >> processing that I can think of. What is the top-level task you are
    >> trying to achieve, and why do you think C is the right way to do it?
    >> Ben.

    >
    > What I want to achieve is a JabRef-like viewer using GTK.


    For what it's worth, I'd do this in Perl. It has good GTK bindings and
    is excellent for string processing but...

    > My primary programming knowledge is in Fortran, and this is a
    > pastime, so I don't think Python would be a very good option for me.


    .... if this is for fun and you want to learn C, then far be it from me to
    say otherwise. It will be a long slow process though: learning
    parsing + C + GTK.

    <snip>
    --
    Ben.
     
    Ben Bacarisse, Jun 30, 2012
    #12
  13. On Saturday, 30 June 2012 04:53:06 UTC+1, Rudra Banerjee wrote:
    > I would be grateful if somebody could show me how to parse a bib file in C.
    > I am a novice in C, so in spite of finding many hits when I searched for file parsers in C, I failed to create a bib parser.
    > The bib file structure is
    >

    Firstly you've got to decide what you want to achieve. Do you need to be able to read any bib file, or do you just have a subset which contains books and articles? Are you interested in all the information, or just some of it?

    So the first thing to do is to define the structure for your array. I think what you probably want is a 1d array of structures rather than a 2d array. It might look like this

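    #include <stdbool.h> /* bool */

    struct bibentry
    {
        bool bookorjournal;  /* distinguishes a book from a journal article */
        char publisher[256];
        int year;
        int Nauthors;        /* number of strings in author_list */
        char **author_list;
    };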

    the details are up to you.
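To illustrate how such a structure might be filled, here is a sketch that stores one name/value pair at a time into a cut-down version of the struct above. The helper `bibentry_set` is hypothetical, the author list is left out to keep the example short, and the field names "publisher" and "year" are taken from the sample entries earlier in the thread. It assumes the name has already been lower-cased.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Cut-down version of the struct above (no author list). */
struct bibentry {
    bool bookorjournal;
    char publisher[256];
    int year;              /* 0 when absent or not a plain number */
};

/* Store one field into the entry. Unknown names are ignored.
   Returns true if the name was recognized. */
bool bibentry_set(struct bibentry *e, const char *name, const char *value)
{
    assert(e != NULL && name != NULL && value != NULL);
    if (strcmp(name, "publisher") == 0) {
        strncpy(e->publisher, value, sizeof e->publisher - 1);
        e->publisher[sizeof e->publisher - 1] = '\0';
        return true;
    }
    if (strcmp(name, "year") == 0) {
        char *end;
        long y = strtol(value, &end, 10);
        /* accept the value only if the whole string was a number */
        e->year = (end != value && *end == '\0') ? (int)y : 0;
        return true;
    }
    return false;
}
```

Converting the year with strtol and checking where the conversion stopped means that a value such as "n.d." ends up as 0 instead of being misread.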


    Now you've got two approaches: hack it or do it properly. Hacking is a lot easier. Just go through the file, pulling out the fields you want, with ad hoc code to sort out the author list and other areas of difficulty.

    To do it properly, you need to parse the bib file, reading in all the information. So you need two passes: one to read in the file and create a .bib file structure, and a second to go through the structure you have created and extract the fields you're interested in into a flat array. The second way is harder. If you go onto my website you'll find a Basic interpreter, written in C. This will show you how to handle the general text parsing problem. But it's not something for a novice.
    --
    Take a look at MiniBasic
    http://www.malcolmmclean.site11.com/www
     
    Malcolm McLean, Jul 7, 2012
    #13