parse bib file in C

Rudra Banerjee

I would be grateful if somebody could show me how to parse a bib file with a C program.
I am a novice in C, so although searching for file parsers in C returns a large number of hits, I have failed to create a bib parser.
A bib file is structured like this:


###############################
## SAMPLE BIB FILE FORMAT ##
@article{key1(alpha-numeric),
Title="Some Title(char)",
Author="Author List(char)",
Year="2012(int)",
volume="123(int)",
Pages="321(int)"
journal="Publishers(char)"
}


@book{key2(alpha-numeric),
Title="Some OTHER Title(char)",
Author="OTHER Author List(char)",
Year="2010(int)",
volume="1234(int)",
Pages="4321(int)"
Publishers="Publishers(char)",
Address="Publishers Address(alpha-numeric)",
Edition="Books Edition"
}
#################################
I want to parse this type of file and put it in a 2d array.
Please help.
 
Rudra Banerjee

Thanks a lot for trying to help me.

A 2d array of what? What parts of the above data do you want to save?

I want 2d array of the entries, say,
array[1][0]=article;array[1][1]=key1; array[1][2]=Some Title, array[1][3]=Author List
array[2][0]=book; array[2][1]=key2; array[2][2]=Some OTHER Title; array[2][3]=OTHER Author List
etc.
How many @article entries are there in the input? How many @book
entries? Do you intend for the entire array to be in memory at once?

I would really like to keep it general, so it should parse both @article and @book entries. Also, the @article entries are not all followed by the @book entries; the two kinds can be mixed. I would like to write the output to memory on the fly.
What will you do with the data once you parse it? Are the int values
actually imbedded in quotes?
Yes
Is the order of the data fixed?
No

Is every entry guaranteed to have all the data you show?
No

Is the file well behaved (every left brace has a matching right brace, the reverse
also, every left parenthesis has a corresponding right parenthesis and
conversely, all quotes occur in pairs, etc)?
Yes
Are the datum ID and the datum always on the same line?
That is the general practice, but it's not a RULE.
Do any lines have multiple data?

They may have, if there are several authors. But for my purpose, parsing the first two authors will be sufficient.
Are the volume and journal IDs really not capitalized?
The initial letters are in Capital, like Phys. Rev. B.

Following are 3 entries from a real bib file. I hope this helps.

@article{Armgnac1930,
author = "Armgnac, Alden C.",
journal = "Popular Science",
month = "December",
pages = "31",
title = "{New Steel Alloy Is Rust Proof}",
year = "1930"
}

@book{ashcroftsolid,
author = "Ashcroft, NW and Mermin, ND",
booktitle = "{Solid State Physics}",
publisher = "Brooks Cole",
title = "{Solid State Physics}",
x-fetchedfrom = "Google Scholar",
year = "1976"
}

@article{Banerjee2010a,
author = "Banerjee, Mitali and {\textbf{Rudra Banerjee}} and Majumdar, A.K. and Mookerjee, Abhijit and Sanyal, Biplab and Nigam, A.K.",
doi = "10.1016/j.physb.2010.07.028",
file = ":home/rudra/Documents/papers/sdarticle(3).pdf:pdf",
issn = "09214526",
journal = "Physica B: Condensed Matter",
keywords = "Magnetic phases; Spin glasses",
number = "20",
pages = "4287--4293",
publisher = "Elsevier",
title = "{Magnetism in NiFeMo disordered alloys: Experiment and theory}",
volume = "405",
year = "2010"
}
 
Rudra Banerjee

If anyone could kindly provide a small piece of sample code, I can try to build on it and come back with any problems.
 
Rudra Banerjee

I will be grateful if someone shows me some sample code so that I can build on it and come back if I face any problem.
 
none

I would be grateful if somebody could show me how to parse a bib file
with a C program. I am a novice in C, so although searching for file
parsers in C returns a large number of hits, I have failed to create
a bib parser.

You may want to have a look at the bibtex parser called btparse.
You will find it at:

http://www.cpan.org/authors/Greg_Ward/btparse-0.34.tar.gz

But since you say you are a novice, then probably you will
find it too advanced for you. You will need to know something
about lexical parsers and analyzers to begin making sense
of the program. Consider setting aside this project for a
later time. Pick up something more suitable for a novice.
 
Rudra Banerjee

Thanks for the link.
Previously I wrote a naive XML to BibTeX converter using libxml.
So I hope that if someone shows me the first step, I can manage.

Being a novice, though, I was not aware of terms like lexical parser. Thanks for letting me know the way.
 
Ben Bacarisse

Rudra Banerjee said:
I would be grateful if somebody could show me how to parse a bib file
with a C program. I am a novice in C, so although searching for file
parsers in C returns a large number of hits, I have failed to create
a bib parser.

The default answer to any such question is "use lex and yacc" (the GNU
versions being flex and bison). Today there are many other similar
programs, but lex and yacc have lots of tutorial material written about
them so I think they might still be the beginner's choice. For Usenet
help on using them, comp.unix.programmer might be the best place, though
there may be more specific groups.

But another question comes to mind. If you are a C novice, why are you
doing this in C? You say elsewhere that you want the result in a "2d
array" but that does not seem like the right structure for any bibtex
processing that I can think of. What is the top-level task you are
trying to achieve, and why do you think C is the right way to do it?

<snip>
 
Jorgen Grahn

I would be grateful if somebody could show me how to parse a bib file
with a C program. ....
## SAMPLE BIB FILE FORMAT ##
@article{key1(alpha-numeric),
Title="Some Title(char)",
Author="Author List(char)",
Year="2012(int)",
volume="123(int)",
Pages="321(int)"
journal="Publishers(char)"
}

To be pedantic, you probably mean BibTeX. Bib is another, much older
tool which used (more or less) the refer(1) file format:

%T Some Title
%A Author1
%A Author2
%A Author3
%D 2012
%V 123
%J Publishers

/Jorgen
 
Rudra Banerjee

But another question comes to mind. If you are a C novice, why are you
doing this in C? You say elsewhere that you want the result in a "2d
array" but that does not seem like the right structure for any bibtex
processing that I can think of. What is the top-level task you are
trying to achieve, and why do you think C is the right way to do it?
Ben.

What I want to achieve is a JabRef-like viewer using GTK. My primary programming knowledge is in Fortran, and this is a pastime. So I don't think Python would be a very good option for me. By a 2d array I mean, say, array[i][j], as shown previously:
"I want 2d array of the entries, say,
array[1][0]=article;array[1][1]=key1; array[1][2]=Some Title, array[1][3]=Author List
array[2][0]=book; array[2][1]=key2; array[2][2]=Some OTHER Title; array[2][3]=OTHER Author List
etc. "
 
Stefan Ram

Rudra Banerjee said:
I will be grateful if someone shows me some sample code so that I
can build on it and come back if I face any problem.

To write a parser for a language, one does not want to use
examples of that language, but a grammar for that language.
Give a C programmer a grammar and some money and he'll
happily write a parser for you. As for an example:

(If answering to the following post, one should please not
quote all of it, but only a few lines one directly refers to.)

In order to interpret or translate an expression (term), it is
decomposed into lexical units (tokens, words), which then are
used by a parser to build symbols and a structured
representation of the input. This representation then might be
evaluated or translated into some other representation.

The syntactical structuring resembles the rules for the
construction of an expression, which often are given by
so-called "productions" of the EBNF (extended Backus-Naur Form)
and which sometimes are left-recursive.

When writing a parser, the left-recursive productions sometimes
are a worry to the author, because it is not obvious how to
avoid an infinite recursion. The solution is to rewrite them as
right-recursive productions.

Addition with a binary infix operator, for example, is
left-associative. However, it is simpler to analyze in a
right-associative manner. Therefore, one analyzes the source
using right-associative rules and then creates a result
using a left-associative interpretation.

A left-associative grammar might be, for example, as follows.

<numeral> ::= '2' | '4' | '5'.
<expression> ::= <numeral> | <expression> '+' <numeral>.
start symbol: <expression>.

To analyze this using a recursive descent parser, one
prefers to use the following grammar.

<numeral> ::= '2' | '4' | '5'.
<expression> ::= <numeral>[ '+' <expression> ].
start symbol: <expression>.

This can be written using iteration as follows.

<numeral> ::= '2' | '4' | '5'.
<expression> ::= <numeral>{ '+' <numeral> }.
start symbol: <expression>.

However, the product is created in the sense of the
first grammar. Example code follows.

#include <stdio.h> /* printf */

/* scanner */

static inline char get()
{ static char const * const source = "2+4+5)";
static int pos = 0;
return source[ pos++ ]; }

/* parser */

static inline int numeral(){ return get() - '0'; }

static int sum(){ int result = numeral();
while( '+' == get() )result += numeral();
return result; }

/* main */

int main( void ){ printf( "sum = %d\n", sum() ); }

To be able to parse expressions with higher
priority, the grammar can be extended.

<numeral> ::= '2' | '4' | '5'.
<product> ::= <numeral> | <product> '*' <numeral>.
<sum> ::= <product> | <sum> '+' <product>.
start symbol: <sum>.

In iterative notation:

<numeral> ::= '2' | '4' | '5'.
<product> ::= <numeral>{ '*' <numeral> }.
<sum> ::= <product>{ '+' <product> }.
start symbol: <sum>.

In C:

#include <stdio.h> /* printf */

/* scanner */

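static inline char get( int const move )
{ static char const * const source = "2+4*5)";
static int pos = 0;
char const c = source[ pos ]; /* read the current character first */
pos += move; /* then advance: 1 consumes it, 0 only peeks */
return c; }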

/* parser */

static inline int numeral(){ return get( 1 )- '0'; }

static int product(){ int result = numeral();
while( '*' == get( 0 )){ get( 1 ); result *= numeral(); }
return result; }

static int sum(){ int result = product();
while( '+' == get( 1 ))result += product();
return result; }

/* main */

int main( void ){ printf( "sum = %d\n", sum() ); }

Exercises

- What is the output of the above programs?

- Extend the last grammar and the last program so as
to handle subtraction.

- Extend the result of the last exercise in order
to handle division.

- Extend the result of the last exercise so that also
numbers with multiple digits are accepted.

- Extend the result of the last exercise so that also
terms in parentheses are accepted. The input "(2+4)*5)"
should give the result "30".

- Extend the result of the last exercise so that
also a unary minus "-" is recognized.

- Extend the result of the last exercise so that
more operators and functions are recognized.

- Extend the result of the last exercise so that
meaningful error messages are created for all
inputs that do not fulfill the rules of the input
language.

- Extend the result of the last exercise so that the
error messages also show the location where the error
was detected. It should be possible to enter an expression
that spans multiple lines, and an error message should
contain the number of the line where the error was
detected.

See also:

http://compilers.iecc.com/crenshaw/
 
Barry Schwarz

Thanks a lot for trying to help me.

A 2d array of what? What parts of the above data do you want to save?

I want 2d array of the entries, say,
array[1][0]=article;array[1][1]=key1; array[1][2]=Some Title, array[1][3]=Author List
array[2][0]=book; array[2][1]=key2; array[2][2]=Some OTHER Title; array[2][3]=OTHER Author List
etc.

You have to make some design decisions. Can the array size be fixed
(the dimensions known at the time you write the code) or do you need
the ability to expand the array as you go through the input data? Do
you want the array to contain the actual data or do you want the array
to contain pointers to the data?

If the array size is fixed and it contains the data, then you can
define it as
typedef char TYPE[MAX_DATA_ITEM_LENGTH];
TYPE array[MAX_ENTRIES][MAX_DATA_ITEMS_PER_ENTRY];
This will produce a three dimensional array of characters. array[i]
is the array of data for the i-th article/book. array[i][j] is the text
of the j-th data item for this article/book (for j = 0, the key; for j
= 1, the title; etc). array[i][j][k] is the k-th character in that
data item.

(Since all the array elements must be the same size, this can waste a
lot of memory. Title and Author probably need more space than year or
volume. You might want to consider using a structure to hold an entry
worth of data so you could size each member as appropriate. However,
you would still need to size each member for the maximum amount of
data in that data item. If one book had a 50 character title, then
every book would have to have space for a 50 character title.)

A fixed array that contains pointers can be defined with
typedef char *TYPE;
TYPE array[MAX_ENTRIES][MAX_DATA_ITEMS_PER_ENTRY];
This will produce a two dimensional array of pointers. array[i] is
the array for the i-th article/book. array[i][j] is the pointer to
the text for the j-th data item. This pointer must be assigned the
address where the text for this item is stored (probably a block of
memory allocated with malloc). With this approach, you can allocate
just enough space for each data item. However, calling malloc for
each data item would probably impact performance and could lead to
significant memory fragmentation.
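
The pointer variant could be sketched roughly as follows. This is only an illustration: the two limits are made-up values, and store_item is a hypothetical helper name, not something from the thread. It copies a piece of text into freshly allocated memory and hangs it on array[i][j].

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 100              /* assumed limit, pick your own */
#define MAX_DATA_ITEMS_PER_ENTRY 8   /* assumed limit, pick your own */

typedef char *TYPE;

/* Static storage, so every pointer starts out as NULL. */
static TYPE array[MAX_ENTRIES][MAX_DATA_ITEMS_PER_ENTRY];

/* Store a copy of the first len characters of text in array[i][j].
   Returns 0 on success, -1 on a bad index or allocation failure. */
static int store_item(size_t i, size_t j, const char *text, size_t len)
{
    char *copy;

    if (i >= MAX_ENTRIES || j >= MAX_DATA_ITEMS_PER_ENTRY)
        return -1;
    copy = malloc(len + 1);          /* +1 for the terminating '\0' */
    if (copy == NULL)
        return -1;
    memcpy(copy, text, len);
    copy[len] = '\0';
    free(array[i][j]);               /* free(NULL) is a no-op */
    array[i][j] = copy;
    return 0;
}
```

A call such as store_item(0, 1, start_of_title, title_length) would then fill the title slot of the first entry; a slot that is still NULL means the item was absent.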

If the array will need to be adjusted, then you define a pointer to it
and call malloc to allocate memory to hold an estimated number of
entries. When all the entries are used up, you call realloc to
allocate room for additional entries. You still get to decide if the
array should hold data or pointers to data.
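
As a rough sketch of the growing variant, assuming the array holds pointers and that four items per entry are enough (type, key, title, author): struct entry, struct table and table_add are hypothetical names, and doubling the capacity is just one common growth policy.

```c
#include <stdlib.h>

#define ITEMS_PER_ENTRY 4   /* assumed: type, key, title, author */

struct entry {
    char *item[ITEMS_PER_ENTRY];   /* pointers to the text of each item */
};

struct table {
    struct entry *rows;   /* NULL until the first entry is added */
    size_t used;          /* number of entries in use */
    size_t capacity;      /* number of entries allocated */
};

/* Return a pointer to a fresh entry whose items are all NULL,
   or NULL if memory runs out (existing entries stay valid). */
static struct entry *table_add(struct table *t)
{
    struct entry *e;
    size_t k;

    if (t->used == t->capacity) {
        size_t new_cap = t->capacity ? 2 * t->capacity : 16;
        /* realloc(NULL, n) behaves like malloc(n) */
        struct entry *p = realloc(t->rows, new_cap * sizeof *p);
        if (p == NULL)
            return NULL;
        t->rows = p;
        t->capacity = new_cap;
    }
    e = &t->rows[t->used++];
    for (k = 0; k < ITEMS_PER_ENTRY; k++)
        e->item[k] = NULL;
    return e;
}
```

Start with struct table t = { NULL, 0, 0 }; and call table_add(&t) once per '@' found in the input. Pointers returned earlier may become invalid after a later call, because realloc is allowed to move the block.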

As a beginner in C, are you sure you want to do this?
I would really like to keep it general, so it should parse both @article and @book entries. Also, the @article entries are not all followed by the @book entries; the two kinds can be mixed. I would like to write the output to memory on the fly.


No

This means for every input record, you need to determine what type of
data it contains. You start by looking for an '@'. Then it must be
followed by "article{" or "book{". Then you extract (up to the ',' or
possibly '}') and save the key in the first array element. Then you
look for next '=' and determine what type of data that follows by the
identifier that precedes it. You extract the data (up to next ',' or
'}'), probably strip off the quotes, and save it in the appropriate
array element.
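
The steps above might look something like this in C. It is a sketch under several assumptions: the whole file is already in one string, every value is enclosed in double quotes with no quote character inside it, and the destination buffers have a size of at least 1. The names copy_trimmed, read_header and read_field are made up for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy the text in [from, to) into dst without surrounding white
   space, truncating to size-1 characters, and terminate it. */
static void copy_trimmed(char *dst, size_t size,
                         const char *from, const char *to)
{
    size_t n;

    while (from < to && isspace((unsigned char)*from)) from++;
    while (to > from && isspace((unsigned char)to[-1])) to--;
    n = (size_t)(to - from);
    if (n >= size) n = size - 1;
    memcpy(dst, from, n);
    dst[n] = '\0';
}

/* Find the next "@type{key" at or after s.  Returns a pointer just
   past the key, or NULL if there is no further entry. */
static const char *read_header(const char *s, char *type, size_t tsize,
                               char *key, size_t ksize)
{
    const char *at = strchr(s, '@');
    const char *brace, *end;

    if (at == NULL) return NULL;
    brace = strchr(at, '{');
    if (brace == NULL) return NULL;
    end = brace + 1 + strcspn(brace + 1, ",}");   /* key stops at ',' or '}' */
    copy_trimmed(type, tsize, at + 1, brace);
    copy_trimmed(key, ksize, brace + 1, end);
    return end;
}

/* Read the next  name = "value"  pair at or after s.  Returns a pointer
   just past the closing quote, or NULL at the end of the entry. */
static const char *read_field(const char *s, char *name, size_t nsize,
                              char *value, size_t vsize)
{
    const char *eq, *open, *close;

    while (*s == ',' || isspace((unsigned char)*s)) s++;
    if (*s == '}' || *s == '\0') return NULL;     /* no more fields */
    eq = strchr(s, '=');
    if (eq == NULL) return NULL;
    open = strchr(eq, '"');
    if (open == NULL) return NULL;
    close = strchr(open + 1, '"');
    if (close == NULL) return NULL;
    copy_trimmed(name, nsize, s, eq);             /* identifier before '=' */
    copy_trimmed(value, vsize, open + 1, close);  /* quotes stripped */
    return close + 1;
}
```

Calling read_header once and then read_field in a loop until it returns NULL walks through one entry. Because the code looks for delimiters rather than line ends, an item that continues on the next line, or several items on one line, are handled the same way.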

So some array elements will not have data. It would probably pay to
initialize the array (char to '\0' and pointers to NULL) so you can
tell if the corresponding data item is omitted.
That is the general practice, but it's not a RULE.

Then when you go through the extraction process mentioned above, you
may have to continue processing the same data item on the next line.
They may have, if there are several authors. But for my purpose, parsing the first two authors will be sufficient.

I was not asking whether a data item can have "multiple values" but whether multiple data
items on the same line, as in
Year="2012",volume="123",
The initial letters are in Capital, like Phys. Rev. B.

It doesn't matter if the data inside the quotes is capitalized. I was
asking about the data identifiers (the stuff before the '='). "Title",
"Author", "Year", "Pages", etc are capitalized. Is "journal"
capitalized or not?
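
One way to stop capitalization from mattering is to compare the identifiers without regard to case (BibTeX itself does not distinguish "Title" from "title"). A small sketch follows; same_name, field_slot and the slot numbers are hypothetical, and the list of known identifiers is deliberately incomplete.

```c
#include <ctype.h>

/* Return 1 if a and b are equal when case is ignored, 0 otherwise. */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;   /* equal only if both strings ended together */
}

/* Hypothetical array slots for the identifiers of interest. */
enum { F_TITLE, F_AUTHOR, F_YEAR, F_JOURNAL, F_COUNT };

/* Map an identifier to its slot, or -1 if it is not one we keep. */
static int field_slot(const char *name)
{
    static const char *const known[F_COUNT] = {
        "title", "author", "year", "journal"
    };
    int k;

    for (k = 0; k < F_COUNT; k++)
        if (same_name(name, known[k]))
            return k;
    return -1;
}
```

With this, "Author" and "author" land in the same slot, while an identifier that is not in the list (such as "booktitle") is reported as -1 and can be skipped.
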
Following are 3 entries from a real bib file. I hope this helps.

@article{Armgnac1930,
author = "Armgnac, Alden C.",

But now "author" is not capitalized! It makes a difference.
journal = "Popular Science",
month = "December",
pages = "31",
title = "{New Steel Alloy Is Rust Proof}",
year = "1930"
}

@book{ashcroftsolid,
author = "Ashcroft, NW and Mermin, ND",
booktitle = "{Solid State Physics}",

Is the identifier "booktitle" as you show here or "Title" as you
showed in your original message? It makes a difference.
publisher = "Brooks Cole",
title = "{Solid State Physics}",

Are there really two data items with the same data?
x-fetchedfrom = "Google Scholar",

This is a new identifier (and there are more below). You are going to
need a COMPLETE list of all identifiers.
 
Ben Bacarisse

Rudra Banerjee said:
What I want to achieve is a JabRef-like viewer using GTK.

For what it's worth, I'd do this in Perl. It has good GTK bindings and
is excellent for string processing but...
My primary programming knowledge is in Fortran, and this is a
pastime. So I don't think Python would be a very good option for me.

.... if this is for fun and you want to learn C, then far be it from me to
say otherwise. It will be a long slow process though: learning
parsing + C + GTK.

<snip>
 
Malcolm McLean

On Saturday, 30 June 2012 04:53:06 UTC+1, Rudra Banerjee wrote:
I would be grateful if somebody could show me how to parse a bib file with a C program.
I am a novice in C, so although searching for file parsers in C returns a large number of hits, I have failed to create a bib parser.
A bib file is structured like this:
Firstly you've got to decide what you want to achieve. Do you need to be able to read any bib file, or do you just have a subset which contains books and articles? Are you interested in all the information, or just some of it?

So the first thing to do is to define the structure for your array. I think what you probably want is a 1d array of structures rather than a 2d array. It might look like this

struct bibentry
{
bool bookorjournal;
char publisher[256];
int year;
int Nauthors;
char **author_list;

};

the details are up to you.
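
To show how such a structure might be filled, here is a sketch that grows author_list one name at a time. The struct is the one above (with <stdbool.h> for bool); add_author and free_authors are hypothetical helpers, not part of any library.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct bibentry
{
    bool bookorjournal;
    char publisher[256];
    int year;
    int Nauthors;
    char **author_list;
};

/* Append a copy of name to e->author_list.
   Returns 0 on success, -1 if memory runs out. */
static int add_author(struct bibentry *e, const char *name)
{
    char **list;
    char *copy = malloc(strlen(name) + 1);

    if (copy == NULL)
        return -1;
    strcpy(copy, name);
    list = realloc(e->author_list,
                   (size_t)(e->Nauthors + 1) * sizeof *list);
    if (list == NULL) {
        free(copy);          /* the old list is untouched */
        return -1;
    }
    list[e->Nauthors++] = copy;
    e->author_list = list;
    return 0;
}

/* Release every name and the list itself. */
static void free_authors(struct bibentry *e)
{
    int k;

    for (k = 0; k < e->Nauthors; k++)
        free(e->author_list[k]);
    free(e->author_list);
    e->author_list = NULL;
    e->Nauthors = 0;
}
```

An entry should start with Nauthors set to 0 and author_list set to NULL, for example struct bibentry e = { false, "", 0, 0, NULL };, so that the first realloc behaves like malloc.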


Now you've got two approaches. Hack it or do it properly. Hacking is a lot easier. Just go through the file, pulling out the fields you want, with ad hoc code to sort out the author list and other areas of difficulty.
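
As an example of the ad hoc kind of code meant here, the author list can be cut at the " and " separators that BibTeX uses between names. This is a sketch only: split_authors and NAME_LEN are made-up names, overlong names are silently truncated, and an " and " hidden inside braces would be split wrongly.

```c
#include <string.h>

#define NAME_LEN 64   /* assumed maximum length of one name, incl. '\0' */

/* Copy up to max author names from list into out.
   Returns the number of names stored. */
static int split_authors(const char *list, char out[][NAME_LEN], int max)
{
    int n = 0;

    while (n < max && *list != '\0') {
        const char *sep = strstr(list, " and ");
        size_t len = sep ? (size_t)(sep - list) : strlen(list);

        if (len >= NAME_LEN)
            len = NAME_LEN - 1;      /* truncate overlong names */
        memcpy(out[n], list, len);
        out[n][len] = '\0';
        n++;
        if (sep == NULL)
            break;                   /* that was the last author */
        list = sep + 5;              /* skip past " and " */
    }
    return n;
}
```

Passing max = 2 keeps only the first two authors, which is all the original poster said he needs.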

To do it properly, you need to parse the bib file, reading in all the information. So you need two passes. One to read in the file and create a .bib file structure, and a second pass to go through the structure you have created and extract the fields you're interested in into a flat array. The second way is harder. If you go onto my website you'll find a Basic interpreter, written in C. This will show you how to handle the general text parsing problem. But it's not something for a novice.
 
