Parsing XML and storing attributes in MySQL

Discussion in 'XML' started by jkugler, Jun 12, 2006.

  1. jkugler

    jkugler Guest

    Hello,

    I am trying to store huge amounts of data from xml files and put them
    into a MySQL database. The xml files all are in this format:

    <?xml version="1.0" encoding="UTF-8"?>
    <snp_submission xsi:schemaLocation="http://www.hapmap.org
    http://hapmap.cshl.org/xml-schema/2003-1
    1-04/hapmap.xsd"
    lsid="urn:LSID:ncbi.nlm.nih.gov:dbSNP/DCC_batch:34_chrom11_DELETE:005"
    xmlns="http
    ://www.hapmap.org"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <batch_info>
    <class>DELETE</class>
    <group lsid="urn:lsid:dcc.hapmap.cshl.org:Labgroup:DBSNP:1"/>
    <date_created>2005-09-15</date_created>
    <contact>
    <name>Steve Sherry</name>
    <email></email>
    </contact>
    </batch_info>
    <snps>
    <snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461544:111">
    <snp_class>
    <bac-overlap/>
    </snp_class>
    <sequence>

    <flank_5>caaaggaatataaatcattctattataaagatacatgcacAGGgctgggtgcagtggctcacacctgtaatccc
    agcactttgggaggccaaggcgggtggatcacctgaggacaggagtttgagaccagcctagccaacatggggaaactccatctctactaaaaatacaaa
    aattagccaggtatagtggtgcacacctgtaataccagctactttggaggctgaggcaggagaatcgctggaacccaggaggcagaggtcaaagtgagc
    caagatcataccattgcactccagcctgggcaacaagagcaaaactccatcttaaaaaaatatatatatatacatatacatacatatatatacacatat
    atatacatatatacagatattatatatgtaaatgtatatatatgtgtatatatatacacatatatatacatattataactacatatatatacacacaca
    catacatatacatgcacacatatgtttattgcagcactatttacgatagaaaatacatggaatcctcccaaatgcccatcaatgatatattggataaag
    aaaatgtgatatatattcaccatggaatactatgcagccgttaaaataaatgagatcatgttctttgcagggacatggatgaagctggaagccatcacc
    ctcagcaaactaacacaggaaaagaaaaccaaacaccacatgttctcagtcgtaagagggagttgaacaatgagagcaaacacatggatacatggaggg
    gaacaacacacaccagggcctctcagcgggacaggggtaggagaCCATCAGGACAaacacgtggatacatggaggggaacaacacacaccagggcctct
    cagggggacagggggtaggagaccatcaagacaaacacgtggatacatggaggggaacaacacacaccagggcctctcagggggacagggggtaggaga
    ccatcaggacaaacacgtggatacatggaggggaa</flank_5>
    <variation>
    <allele base="C"/>
    <allele base="T"/>
    </variation>

    <flank_3>aacacacaccagggcctctcagggggacagggggtaggagaccatcaRgacaaacacgtggRtacatggagggg
    aacaacacacaccagggcctctcagggggacggggggtagRagaccatcaggacaaatagctaatgcatgcagggcctcatacctaggtgatgggttga
    tgggtgcagcaaaccaccatggcacacatttacctatgtatcaaacctaYactttctgcacgtgtatcccagaacataaaataaaatttaaaaaatata
    taCACTGATTCATGATCTCCTttctctccttctgaaacactctttaaaactttttagcatttccccctctgtcttccatgtctcctaactacatgtttc
    ttattttccatgtctttattcctgtgttcattttggatagccccttctgacctatattacagtttactagttcactcttcaactgcttctaacatacta
    atattctgttaaaaccattcatttgggtttaaatttcaattatgttattctctatggacattctatttgttttcttttaatcttcttggccattctcta
    gagtttcctgttccattatgatatttttaattttttgttttactttaaacatactaaatatagttattttattttattttctgtatctgatactttcaa
    taactgcagtctttgctagtcttttttctgtgctcttgctcatagtttttttcatttgttttCATGATTagaaaaacagagagagaagaaggagagtaa
    agggaggaggcggaggaggagaaaagaagaaagcagagaagaagggacagagaaaaaaaggaagTTGGTTCTAACGTTTCTCTAACAACTGGCTTCAGT
    GAAACACTCCCACCTTGTGGATTTTTAGGTTATTGAAATTAACCAGTCTTCtgggtgcagcacaccaacatggcacatgtatacatatgtaacaaacct
    gcactttgtgcacatgtaccctaaaacttaaagta</flank_3>
    <length_class value="full-length"/>
    </sequence>
    <genomic_locations>
    <genomic_location active="true" type="exact">
    <assembly_version>ncbi_build34</assembly_version>
    <coordinates chrom="11" start="60749" stop="60749"
    strand="forward"/>
    </genomic_location>
    </genomic_locations>
    <neighbour_snps>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461545:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461546:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461547:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2686858:100"/>
    </neighbour_snps>
    </snp>


    I am trying to extract the following information: lsid, allele base,
    cooordinates chrom, start, and strand. I would ideally like to use a
    perl script to pull this data out and put it in MySQL, but I have had
    no luck with XML::parser since the information I am looking for is
    embedded in the element's attributes.

    If anyone has any ideas on how to get this data out of xml and into
    MySQL, it would be much appreciated.

    Thanks so much,

    --James
     
    jkugler, Jun 12, 2006
    #1
    1. Advertising

  2. jkugler wrote:

    > I am trying to store huge amounts of data from xml files and put them
    > into a MySQL database. The xml files all are in this format:


    This is genomic data, so you probably have
    GigaBytes of this stuff. Very few tools are
    capable of handling such amounts in acceptable
    time with acceptable resource usage.

    > I am trying to extract the following information: lsid, allele base,
    > cooordinates chrom, start, and strand. I would ideally like to use a
    > perl script to pull this data out and put it in MySQL, but I have had
    > no luck with XML::parser since the information I am looking for is
    > embedded in the element's attributes.


    Have a look at what Andrew Schorr did to store
    GigaByte of XML data into PostgreSQL:

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Loading-XML-data-into-PostgreSQL

    There is currently no interface for connecting
    to MySQL. But Andrew explained how such an interface
    should be implemented, when he answered a similar
    question today in comp.lang.awk:

    > Please take a look at the PostgreSQL API that I implemented
    > for xgawk: http://sourceforge.net/projects/xmlgawk. I think
    > a Mysql extension next would be nice.
    >
    > I had thought of building a general xgawk database API
    > over the libdbi library (http://sourceforge.net/projects/libdbi),
    > but I found that the libdbi API did not offer all of the PostgreSQL
    > API features that I needed to access. But as libdbi matures,
    > that may be a good layer to build upon.
    >
    > But xgawk would certainly be a great platform for you to
    > use in building a new database access mechanism. Contributions
    > are welcome.
    >
     
    =?ISO-8859-1?Q?J=FCrgen_Kahrs?=, Jun 12, 2006
    #2
    1. Advertising

  3. If you're dealing with gigabytes of data, you may want to look at IBM's
    new XML capabilities in DB2... or at "exploding" the XML into a standard
    database representation and operating on that. (XML is a good
    interchange format, but often is not the best back-end representation.)

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
     
    Joe Kesselman, Jun 12, 2006
    #3
  4. jkugler wrote:

    > I am trying to extract the following information: lsid, allele base,
    > cooordinates chrom, start, and strand. I would ideally like to use a


    I just had a look at the data. This is a structural
    outline of your XML snippet:

    snps
    snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461544:111'
    snp_class
    bac-overlap
    sequence
    flank_5
    variation
    allele base='C'
    allele base='T'
    flank_3
    length_class value='full-length'
    genomic_locations
    genomic_location active='true' type='exact'
    assembly_version
    coordinates chrom='11' start='60749' stop='60749' strand='forward'
    neighbour_snps
    neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461545:101'
    neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461546:101'
    neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461547:101'
    neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2686858:100'

    Why don't you convert it to CSV format and import
    the CSV data into your data base ?

    > perl script to pull this data out and put it in MySQL, but I have had
    > no luck with XML::parser since the information I am looking for is
    > embedded in the element's attributes.


    You probably mean lines likes this one:

    coordinates chrom='11' start='60749' stop='60749' strand='forward'

    This is particularly easy to handle in XMLgawk.
    I guess it would take about 10 to 20 lines of
    XMLgawk to convert your data to CSV format.
     
    =?ISO-8859-1?Q?J=FCrgen_Kahrs?=, Jun 12, 2006
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. toton
    Replies:
    11
    Views:
    712
    toton
    Oct 13, 2006
  2. Jonathan Wood
    Replies:
    1
    Views:
    511
    Jonathan Wood
    Jun 2, 2008
  3. John Levine
    Replies:
    0
    Views:
    732
    John Levine
    Feb 2, 2012
  4. Erik Wasser
    Replies:
    5
    Views:
    463
    Peter J. Holzer
    Mar 5, 2006
  5. jkugler
    Replies:
    5
    Views:
    463
    John Bokma
    Jun 13, 2006
Loading...

Share This Page