Thank you all very much for your help!
I'm sorry I didn't post my actual Perl scripts; I thought the examples
I gave were sufficient to show my problem. The real scripts are
included in this post.
In general, I'm trying to modify an existing Perl script (PREPv1-0.pl by
Christopher M. Frenz), which is designed to run a PubMed database search
from the command prompt. The script generates an HTML page with a
description of all results for a given PubMed query.
I don't want an HTML page; I want a text database containing only the
information that is relevant to me (year, journal, title, authors).
My current problem is that the filter in my foreach loop is only
carried out once, even if more lines match (in my example below it
gives me only one author, whereas it should give me all seven). A
corresponding while loop in another script does it correctly, but for
that I have to write the XML to a text file and run a separate script
on that file, while I want to perform all necessary actions in one
script.
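To make the symptom reproducible without the network, here is a minimal
sketch. The two-author XML string is made up for illustration and is not
real PubMed output. As far as I can tell, foreach ($results) gets a list
with a single element (the whole string), so the body runs once and each
regex only finds its first match, whereas looping over the separate lines
runs the body once per line:

```perl
use strict;
use warnings;

# Hypothetical two-author fragment standing in for the PubMed XML.
my $results = "<LastName>van Ingen</LastName>\n<LastName>Lasonder</LastName>\n";

# foreach over one scalar: the body runs once, with $_ set to the whole string.
my @once;
foreach ($results) {
    push @once, $1 if /<LastName>(.*)<\/LastName>/;
}

# foreach over the individual lines: the body runs once per line.
my @per_line;
foreach (split /\n/, $results) {
    push @per_line, $1 if /<LastName>(.*)<\/LastName>/;
}

print "single scalar: ", join(", ", @once), "\n";
print "split lines:   ", join(", ", @per_line), "\n";
```

The first loop finds only "van Ingen"; the second finds both names.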
With kind regards,
Jaap
My files:
I run the script on Windows 2000 (ActivePerl 5.8.8) like this:
c:\perl\bin\perl grabPubmed.pl van ingen jansen
(This uses the query "van Ingen Jansen", which results in one hit on
PubMed.)
grabPubmed.pl
--
#!c:\perl\bin\perl
use strict;
use warnings;
# PREP (Perl RegExps for Pubmed) is a script that allows the use of
# Perl regexs in the searching of Pubmed records, providing the ability
# to search records for textual patterns as well as keywords
# Copyright 2005- Christopher M. Frenz
# This script is free software; it may be used, copied, redistributed,
# and/or modified under the terms laid forth in the Perl Artistic License
# Please cite this script in any publication in which literature cited
# within the publication was located using the PREP.pl script.
# Usage: perl PREPv1-0.pl PubmedQueryTerms
# Usage of this script requires the LWP and XML::LibXML modules are
# installed
use LWP;
use XML::LibXML; #Version 1.58 used for development and testing
my $request;
my $response;
my $query;
# Concatenates arguments passed to script to form Pubmed query
$query=join(" ", @ARGV);
# Creates the URL to search Pubmed
my $baseurl="http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?";
my $url=$baseurl . "db=Pubmed&retmax=1&usehistory=y&term=" . $query;
# Searches Pubmed and Returns the number of results
# as well as the session information needed for results retrieval
$request=LWP::UserAgent->new();
$response=$request->get($url);
my $results= $response->content;
die unless $response->is_success;
print "PubMed Search Results \n";
$results=~/<Count>(\d+)<\/Count>/;
my $NumAbstracts=$1;
$results=~/<QueryKey>(\d+)<\/QueryKey>/;
my $QueryKey=$1;
$results=~/<WebEnv>(.*?)<\/WebEnv>/;
my $WebEnv=$1;
print "$NumAbstracts are Available \n";
my $parser=XML::LibXML->new;
my $retmax=500; #Number of records to be retrieved per request-Max 500
my $retstart=0; #Record number to start retrieval from
# Creates the URL needed to retrieve results
$baseurl="http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?";
my $url2="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=";
my $Count=0;
# Retrieves results in XML format
for($retstart=0;$retstart<=$NumAbstracts;$retstart+=$retmax){
print "Processing record # $retstart \n";
$url=$baseurl .
"rettype=abstract&retmode=xml&retstart=$retstart&retmax=$retmax&db=Pubmed&query_key=$QueryKey&WebEnv=$WebEnv";
$response=$request->get($url);
$results=$response->content;
die unless $response->is_success;
}
open my $OFile, '>', 'output.txt' or die "Can't open output file: $!";
# The tag "Year" occurs several times in the XML, so I only want to
# read the Year line beneath the PubDate tag.
my $tracker = 0;
foreach ($results){
next if /^#/; # skip comments
next if /^\s*$/; # skip empty lines
chomp; # remove line terminator
if ( /<PMID>/ ) {
/<PMID>(.*)<\/PMID>/;
print $OFile "$1 \n";
}
if ( /<PubDate>/ ) {
$tracker = 1;
}
if ( /<Year>/ ) {
if ($tracker == 1) {
/<Year>(.*)<\/Year>/;
print $OFile "$1 \n";
$tracker = 0;
}
}
if ( /<Title>/ ) {
/<Title>(.*)<\/Title>/;
print $OFile "$1 \n";
}
if ( /<ArticleTitle>/ ) {
/<ArticleTitle>(.*)<\/ArticleTitle>/;
print $OFile "$1 \n";
}
if ( /<LastName>/ ) {
/<LastName>(.*)<\/LastName>/;
print $OFile "$1 \n";
}
}
close $OFile;
--
The output file is not complete: it lists only the first author.
output.txt
--
14705930
2004
Biochemistry.
Extension of the binding motif of the Sin3 interacting domain of the
Mad family proteins.
van Ingen
--
When I then write the XML ($results) to a file:
grabPubmed_full.pl
--
#!c:\perl\bin\perl
use strict;
use warnings;
# PREP (Perl RegExps for Pubmed) is a script that allows the use of
# Perl regexs in the searching of Pubmed records, providing the ability
# to search records for textual patterns as well as keywords
# Copyright 2005- Christopher M. Frenz
# This script is free software; it may be used, copied, redistributed,
# and/or modified under the terms laid forth in the Perl Artistic License
# Please cite this script in any publication in which literature cited
# within the publication was located using the PREP.pl script.
# Usage: perl PREPv1-0.pl PubmedQueryTerms
# Usage of this script requires the LWP and XML::LibXML modules are
# installed
use LWP;
use XML::LibXML; #Version 1.58 used for development and testing
my $request;
my $response;
my $query;
# Concatenates arguments passed to script to form Pubmed query
$query=join(" ", @ARGV);
# Creates the URL to search Pubmed
my $baseurl="http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?";
my $url=$baseurl . "db=Pubmed&retmax=1&usehistory=y&term=" . $query;
# Searches Pubmed and Returns the number of results
# as well as the session information needed for results retrieval
$request=LWP::UserAgent->new();
$response=$request->get($url);
my $results= $response->content;
die unless $response->is_success;
print "PubMed Search Results \n";
$results=~/<Count>(\d+)<\/Count>/;
my $NumAbstracts=$1;
$results=~/<QueryKey>(\d+)<\/QueryKey>/;
my $QueryKey=$1;
$results=~/<WebEnv>(.*?)<\/WebEnv>/;
my $WebEnv=$1;
print "$NumAbstracts are Available \n";
my $parser=XML::LibXML->new;
my $retmax=500; #Number of records to be retrieved per request-Max 500
my $retstart=0; #Record number to start retrieval from
# Creates the URL needed to retrieve results
$baseurl="http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?";
my $url2="http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=";
my $Count=0;
# Retrieves results in XML format
for($retstart=0;$retstart<=$NumAbstracts;$retstart+=$retmax){
print "Processing record # $retstart \n";
$url=$baseurl .
"rettype=abstract&retmode=xml&retstart=$retstart&retmax=$retmax&db=Pubmed&query_key=$QueryKey&WebEnv=$WebEnv";
$response=$request->get($url);
$results=$response->content;
die unless $response->is_success;
}
open my $OFile, '>', 'output_full.txt' or die "Can't open output file: $!";
print $OFile $results;
close $OFile;
--
resulting in this file:
output_full.txt
--
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2006//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_060101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID>14705930</PMID>
<DateCreated>
<Year>2004</Year>
<Month>01</Month>
<Day>06</Day>
</DateCreated>
<DateCompleted>
<Year>2004</Year>
<Month>05</Month>
<Day>12</Day>
</DateCompleted>
<DateRevised>
<Year>2005</Year>
<Month>11</Month>
<Day>17</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0006-2960</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>43</Volume>
<Issue>1</Issue>
<PubDate>
<Year>2004</Year>
<Month>Jan</Month>
<Day>13</Day>
</PubDate>
</JournalIssue>
<Title>Biochemistry. </Title>
<ISOAbbreviation>Biochemistry</ISOAbbreviation>
</Journal>
<ArticleTitle>Extension of the binding motif of the Sin3 interacting domain of the Mad family proteins.</ArticleTitle>
<Pagination>
<MedlinePgn>46-54</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>Sin3 forms the scaffold for a
multiprotein corepressor complex that silences transcription via the
action of histone deacetylases. Sin3 is recruited to the DNA by several
DNA binding repressors, such as the helix-loop-helix proteins of the
Mad family. Here, we elaborate on the Mad-Sin3 interaction based on a
binding study, solution structure, and dynamics of the PAH2 domain of
mSin3 in complex to an extended Sin3 interacting domain (SID) of 24
residues of Mad1. We show that SID residues Met7 and Glu23, outside the
previously defined minimal binding motif, mediate additional
hydrophobic and electrostatic interactions with PAH2. On the basis of
these results we propose an extended consensus sequence describing the
PAH2-SID interaction specifically for the Mad family, showing that
residues outside the hydrophobic core of the SID interact with PAH2 and
modulate binding affinity to appropriate levels.</AbstractText>
</Abstract>
<Affiliation>Departments of Biophysical Chemistry and
Molecular Biology, NSRIM Center, University of Nijmegen, Toernooiveld
1, 6525 ED Nijmegen, The Netherlands.</Affiliation>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>van Ingen</LastName>
<ForeName>Hugo</ForeName>
<Initials>H</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Lasonder</LastName>
<ForeName>Edwin</ForeName>
<Initials>E</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Jansen</LastName>
<ForeName>Jacobus F A</ForeName>
<Initials>JF</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Kaan</LastName>
<ForeName>Anita M</ForeName>
<Initials>AM</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Spronk</LastName>
<ForeName>Christian A E M</ForeName>
<Initials>CA</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Stunnenberg</LastName>
<ForeName>Henk G</ForeName>
<Initials>HG</Initials>
</Author>
<Author ValidYN="Y">
<LastName>Vuister</LastName>
<ForeName>Geerten W</ForeName>
<Initials>GW</Initials>
</Author>
</AuthorList>
<Language>eng</Language>
<DataBankList CompleteYN="Y">
<DataBank>
<DataBankName>PDB</DataBankName>
<AccessionNumberList>
<AccessionNumber>1PD7</AccessionNumber>
</AccessionNumberList>
</DataBank>
</DataBankList>
<PublicationTypeList>
<PublicationType>Journal Article</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>United States</Country>
<MedlineTA>Biochemistry</MedlineTA>
<NlmUniqueID>0370623</NlmUniqueID>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Basic Helix-Loop-Helix Leucine Zipper
Transcription Factors</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Caenorhabditis elegans
Proteins</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>DNA-Binding Proteins</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Fungal Proteins</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>MXD1 protein, human</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Membrane Proteins</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>PAH2 protein, Pichia
angusta</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Repressor Proteins</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>SID-1 protein, C
elegans</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>SIN3 protein, S
cerevisiae</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Saccharomyces cerevisiae
Proteins</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Solutions</NameOfSubstance>
</Chemical>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance>Transcription
Factors</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Amino Acid
Motifs</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Amino Acid
Sequence</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName
MajorTopicYN="N">Animals</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Basic Helix-Loop-Helix
Leucine Zipper Transcription Factors</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Caenorhabditis elegans
Proteins</DescriptorName>
<QualifierName
MajorTopicYN="N">chemistry</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Comparative
Study</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Conserved
Sequence</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Crystallography,
X-Ray</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">DNA-Binding
Proteins</DescriptorName>
<QualifierName
MajorTopicYN="Y">chemistry</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Fungal
Proteins</DescriptorName>
<QualifierName
MajorTopicYN="N">chemistry</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName
MajorTopicYN="N">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Membrane
Proteins</DescriptorName>
<QualifierName
MajorTopicYN="N">chemistry</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Molecular Sequence
Data</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Multigene
Family</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Nuclear Magnetic
Resonance, Biomolecular</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Protein
Binding</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Protein Structure,
Tertiary</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y">Repressor
Proteins</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Research Support,
Non-U.S. Gov't</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y">Saccharomyces
cerevisiae Proteins</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Sequence
Alignment</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y">Sequence Homology,
Amino Acid</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName
MajorTopicYN="N">Solutions</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Surface Plasmon
Resonance</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName
MajorTopicYN="N">Thermodynamics</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N">Transcription
Factors</DescriptorName>
<QualifierName
MajorTopicYN="Y">chemistry</QualifierName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>2004</Year>
<Month>1</Month>
<Day>7</Day>
<Hour>5</Hour>
<Minute>0</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>2004</Year>
<Month>5</Month>
<Day>13</Day>
<Hour>5</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">14705930</ArticleId>
<ArticleId IdType="doi">10.1021/bi0355645</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
---
And then run another script:
parseXML.pl
--
#!c:\perl\bin\perl
use strict;
use warnings;
open my $INPUT, '<', 'output_full.txt' or die "Can't open data file: $!";
open my $OFile, '>', 'parsed_output.txt' or die "Can't open output file: $!";
my $tracker = 0;
#print OFile "$INPUT";
while (<$INPUT>) {
next if /^#/; # skip comments
next if /^\s*$/; # skip empty lines
chomp; # remove line terminator
if ( /<PMID>/ ) {
/<PMID>(.*)<\/PMID>/;
print $OFile "$1 \n";
}
if ( /<PubDate>/ ) {
$tracker = 1;
}
if ( /<Year>/ ) {
if ($tracker == 1) {
/<Year>(.*)<\/Year>/;
print $OFile "$1 \n";
$tracker = 0;
}
}
if ( /<Title>/ ) {
/<Title>(.*)<\/Title>/;
print $OFile "$1 \n";
}
if ( /<ArticleTitle>/ ) {
/<ArticleTitle>(.*)<\/ArticleTitle>/;
print $OFile "$1 \n";
}
if ( /<LastName>/ ) {
/<LastName>(.*)<\/LastName>/;
print $OFile "$1 \n";
}
}
close $OFile;
close $INPUT;
--
I get what I want:
parsed_output.txt
--
14705930
2004
Biochemistry.
Extension of the binding motif of the Sin3 interacting domain of the
Mad family proteins.
van Ingen
Lasonder
Jansen
Kaan
Spronk
Stunnenberg
Vuister
--
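One way I could perhaps keep everything in a single script is to read
$results through an in-memory filehandle, so that the while loop from
parseXML.pl works without the intermediate file. This is only a minimal
sketch, assuming Perl 5.8 or later (where open accepts a reference to a
scalar). The XML string is a shortened, made-up stand-in for the real
response, and @fields is a name I introduce here just to collect the
matches:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the XML held in $results after the fetch.
my $results = join "\n",
    '<PMID>14705930</PMID>',
    '<LastName>van Ingen</LastName>',
    '<LastName>Lasonder</LastName>',
    '';

# Open a read handle on the string itself instead of on output_full.txt.
open my $INPUT, '<', \$results or die "Can't open in-memory handle: $!";

my @fields;
while (<$INPUT>) {
    chomp;    # remove line terminator
    push @fields, $1 if /<PMID>(.*)<\/PMID>/;
    push @fields, $1 if /<LastName>(.*)<\/LastName>/;
}
close $INPUT;

print "$_\n" for @fields;
```

Because the handle yields one line at a time, the tracker logic for
PubDate and Year from parseXML.pl should carry over unchanged.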