Parsing XML and storing attributes in MySQL

Discussion in 'XML' started by jkugler, Jun 12, 2006.

  1. jkugler

    jkugler Guest

    Hello,

    I am trying to store huge amounts of data from xml files and put them
    into a MySQL database. The xml files all are in this format:

    <?xml version="1.0" encoding="UTF-8"?>
    <snp_submission xsi:schemaLocation="http://www.hapmap.org
    http://hapmap.cshl.org/xml-schema/2003-1
    1-04/hapmap.xsd"
    lsid="urn:LSID:ncbi.nlm.nih.gov:dbSNP/DCC_batch:34_chrom11_DELETE:005"
    xmlns="http
    ://www.hapmap.org"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <batch_info>
    <class>DELETE</class>
    <group lsid="urn:lsid:dcc.hapmap.cshl.org:Labgroup:DBSNP:1"/>
    <date_created>2005-09-15</date_created>
    <contact>
    <name>Steve Sherry</name>
    <email></email>
    </contact>
    </batch_info>
    <snps>
    <snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461544:111">
    <snp_class>
    <bac-overlap/>
    </snp_class>
    <sequence>

    <flank_5>caaaggaatataaatcattctattataaagatacatgcacAGGgctgggtgcagtggctcacacctgtaatccc
    agcactttgggaggccaaggcgggtggatcacctgaggacaggagtttgagaccagcctagccaacatggggaaactccatctctactaaaaatacaaa
    aattagccaggtatagtggtgcacacctgtaataccagctactttggaggctgaggcaggagaatcgctggaacccaggaggcagaggtcaaagtgagc
    caagatcataccattgcactccagcctgggcaacaagagcaaaactccatcttaaaaaaatatatatatatacatatacatacatatatatacacatat
    atatacatatatacagatattatatatgtaaatgtatatatatgtgtatatatatacacatatatatacatattataactacatatatatacacacaca
    catacatatacatgcacacatatgtttattgcagcactatttacgatagaaaatacatggaatcctcccaaatgcccatcaatgatatattggataaag
    aaaatgtgatatatattcaccatggaatactatgcagccgttaaaataaatgagatcatgttctttgcagggacatggatgaagctggaagccatcacc
    ctcagcaaactaacacaggaaaagaaaaccaaacaccacatgttctcagtcgtaagagggagttgaacaatgagagcaaacacatggatacatggaggg
    gaacaacacacaccagggcctctcagcgggacaggggtaggagaCCATCAGGACAaacacgtggatacatggaggggaacaacacacaccagggcctct
    cagggggacagggggtaggagaccatcaagacaaacacgtggatacatggaggggaacaacacacaccagggcctctcagggggacagggggtaggaga
    ccatcaggacaaacacgtggatacatggaggggaa</flank_5>
    <variation>
    <allele base="C"/>
    <allele base="T"/>
    </variation>

    <flank_3>aacacacaccagggcctctcagggggacagggggtaggagaccatcaRgacaaacacgtggRtacatggagggg
    aacaacacacaccagggcctctcagggggacggggggtagRagaccatcaggacaaatagctaatgcatgcagggcctcatacctaggtgatgggttga
    tgggtgcagcaaaccaccatggcacacatttacctatgtatcaaacctaYactttctgcacgtgtatcccagaacataaaataaaatttaaaaaatata
    taCACTGATTCATGATCTCCTttctctccttctgaaacactctttaaaactttttagcatttccccctctgtcttccatgtctcctaactacatgtttc
    ttattttccatgtctttattcctgtgttcattttggatagccccttctgacctatattacagtttactagttcactcttcaactgcttctaacatacta
    atattctgttaaaaccattcatttgggtttaaatttcaattatgttattctctatggacattctatttgttttcttttaatcttcttggccattctcta
    gagtttcctgttccattatgatatttttaattttttgttttactttaaacatactaaatatagttattttattttattttctgtatctgatactttcaa
    taactgcagtctttgctagtcttttttctgtgctcttgctcatagtttttttcatttgttttCATGATTagaaaaacagagagagaagaaggagagtaa
    agggaggaggcggaggaggagaaaagaagaaagcagagaagaagggacagagaaaaaaaggaagTTGGTTCTAACGTTTCTCTAACAACTGGCTTCAGT
    GAAACACTCCCACCTTGTGGATTTTTAGGTTATTGAAATTAACCAGTCTTCtgggtgcagcacaccaacatggcacatgtatacatatgtaacaaacct
    gcactttgtgcacatgtaccctaaaacttaaagta</flank_3>
    <length_class value="full-length"/>
    </sequence>
    <genomic_locations>
    <genomic_location active="true" type="exact">
    <assembly_version>ncbi_build34</assembly_version>
    <coordinates chrom="11" start="60749" stop="60749"
    strand="forward"/>
    </genomic_location>
    </genomic_locations>
    <neighbour_snps>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461545:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461546:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461547:101"/>
    <neighbour_snp
    lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2686858:100"/>
    </neighbour_snps>
    </snp>


    I am trying to extract the following information: lsid, allele base,
    cooordinates chrom, start, and strand. I would ideally like to use a
    perl script to pull this data out and put it in MySQL, but I have had
    no luck with XML::parser since the information I am looking for is
    embedded in the element's attributes.

    If anyone has any ideas on how to get this data out of xml and into
    MySQL, it would be much appreciated.

    Thanks so much,

    --James
     
    jkugler, Jun 12, 2006
    #1
    1. Advertisements

  2. This is genomic data, so you probably have
    GigaBytes of this stuff. Very few tools are
    capable of handling such amounts in acceptable
    time with acceptable resource usage.
    Have a look at what Andrew Schorr did to store
    GigaByte of XML data into PostgreSQL:

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Loading-XML-data-into-PostgreSQL

    There is currently no interface for connecting
    to MySQL. But Andrew explained how such an interface
    should be implemented, when he answered a similar
    question today in comp.lang.awk:
     
    =?ISO-8859-1?Q?J=FCrgen_Kahrs?=, Jun 12, 2006
    #2
    1. Advertisements

  3. If you're dealing with gigabytes of data, you may want to look at IBM's
    new XML capabilities in DB2... or at "exploding" the XML into a standard
    database representation and operating on that. (XML is a good
    interchange format, but often is not the best back-end representation.)
     
    Joe Kesselman, Jun 12, 2006
    #3
  4. I just had a look at the data. This is a structural
    outline of your XML snippet:

    snps
    snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461544:111'
    snp_class
    bac-overlap
    sequence
    flank_5
    variation
    allele base='C'
    allele base='T'
    flank_3
    length_class value='full-length'
    genomic_locations
    genomic_location active='true' type='exact'
    assembly_version
    coordinates chrom='11' start='60749' stop='60749' strand='forward'
    neighbour_snps
    neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461545:101'
    neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461546:101'
    neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2461547:101'
    neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs2686858:100'

    Why don't you convert it to CSV format and import
    the CSV data into your data base ?
    You probably mean lines likes this one:

    coordinates chrom='11' start='60749' stop='60749' strand='forward'

    This is particularly easy to handle in XMLgawk.
    I guess it would take about 10 to 20 lines of
    XMLgawk to convert your data to CSV format.
     
    =?ISO-8859-1?Q?J=FCrgen_Kahrs?=, Jun 12, 2006
    #4
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.