Processing large files with TextFieldParser

Discussion in 'ASP .Net' started by Jon Spivey, Nov 30, 2009.

  1. Jon Spivey

    Jon Spivey Guest

    ASP.net 3.5/SQL Server 2008

    I've got a large (1.1 million row) CSV file which needs parsing and
    loading into SQL Server. The job needs doing every day, and if anything
    the CSV will get larger over time. At present I'm using a
    TextFieldParser to parse the CSV line by line and add each row to the
    database. This fails probably two times in three; when it falls over,
    it's usually at around 200,000 lines. I'm looking for suggestions on how
    to do this robustly on a shared server which doesn't allow bulk insert.
    It's fair to assume the server is a factor in the failures, but I can't
    upgrade just yet.

    Would I be better off breaking the CSV into, say, 5 separate files and
    processing each individually, or processing it in chunks, e.g.

    If TextFieldParser.LineNumber < 200000 Then
        ' process first chunk
    End If

    If TextFieldParser.LineNumber > 200000 AndAlso TextFieldParser.LineNumber < 400000 Then
        ' process next chunk
    End If

    etc.

    Or something else entirely?

    Cheers,
    Jon
     
    Jon Spivey, Nov 30, 2009
    #1

  2. "Jon Spivey" <> wrote in
    news:u#Orc$:

    > I've got a large (1.1 million row) CSV file which needs parsing and
    > loading into SQL Server. The job needs doing every day, and if
    > anything the CSV will get larger over time. At present I'm using a
    > TextFieldParser to parse the CSV line by line and add each row to the
    > database. This fails probably two times in three; when it falls over,
    > it's usually at around 200,000 lines. I'm looking for suggestions on
    > how to do this robustly on a shared server which doesn't allow bulk
    > insert. It's fair to assume the server is a factor in the failures,
    > but I can't upgrade just yet.


    Having done this numerous times, I find the best way is to use a
    StreamReader and read the file in line by line, especially with large
    files, as trying to store everything in memory (whether in a DataSet or
    objects) ends up unwieldy.
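
    A minimal sketch of that approach, assuming a simple comma-delimited
    file with no quoted fields; the ImportRow table, its columns and the
    parameter names below are only placeholders:

    Imports System.IO
    Imports System.Data.SqlClient

    Module CsvImport
        Sub ImportCsv(ByVal path As String, ByVal connectionString As String)
            Using reader As New StreamReader(path)
                Using conn As New SqlConnection(connectionString)
                    conn.Open()
                    Dim line As String = reader.ReadLine() ' read one row at a time
                    While line IsNot Nothing
                        ' Naive split - assumes no quoted fields containing commas
                        Dim fields() As String = line.Split(","c)
                        Using cmd As New SqlCommand( _
                            "INSERT INTO ImportRow (Col1, Col2) VALUES (@p1, @p2)", conn)
                            cmd.Parameters.AddWithValue("@p1", fields(0))
                            cmd.Parameters.AddWithValue("@p2", fields(1))
                            cmd.ExecuteNonQuery()
                        End Using
                        line = reader.ReadLine()
                    End While
                End Using
            End Using
        End Sub
    End Module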

    With a good regex, you can split out the fields, even if there is a
    text delimiter (usually some form of quote). I have written my own, but
    I would not be surprised if there are others available.
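
    One commonly used pattern splits on commas that are not inside a quoted
    field. A rough sketch, assuming balanced quotes on each line (the
    pattern and sample line are illustrative, not the regex Greg mentions):

    Imports System.Text.RegularExpressions

    Module CsvRegexExample
        Sub Main()
            ' Split on commas that are NOT inside a quoted field.
            ' Assumes the quotes on each line are balanced.
            Dim line As String = "1,""Smith, John"",London"
            Dim pattern As String = ",(?=(?:[^""]*""[^""]*"")*[^""]*$)"
            For Each field As String In Regex.Split(line, pattern)
                Console.WriteLine(field.Trim(""""c)) ' strip surrounding quotes
            Next
        End Sub
    End Module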

    Another way to attack this, since you are storing the data in SQL
    Server, is to use SSIS (or DTS in older versions). SSIS can read a CSV
    file directly.

    If this is a file format you can set up a BCP format file for, you can
    bulk load the rows into SQL Server as well. Note that this will not work
    if you have to split the CSV flat file out into multiple tables,
    however.
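
    For instance, a character-mode bcp load with a comma field terminator
    might look something like this (the database, table, server and login
    names are placeholders):

    bcp MyDatabase.dbo.DailyImport in daily.csv -S myserver -U myuser -P mypassword -c -t,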

    In the past, I architected a system that had multi-gigabyte files that
    had to be manipulated. The solution was to leave the data in flat files
    and work it out into files that mimicked the SQL Server tables. I then
    incremented the IDENTITY values and seeded the flat files with them.
    This required many passes and some file sorts to get things into SQL
    Server, so it is overkill if the file is very predictable and/or does
    not require extensive manipulation.

    Peace and Grace,


    --
    Gregory A. Beamer (MVP)

    Twitter: @gbworld
    Blog: http://gregorybeamer.spaces.live.com

    *******************************************
    | Think outside the box! |
    *******************************************
     
    Gregory A. Beamer, Nov 30, 2009
    #2

  3. Jon Spivey

    Jon Spivey Guest

    Hi Greg,
    Thanks for your reply. I've gone with a StreamReader as you suggested,
    tested it with the first CSV, and it worked perfectly. I'm going to test
    it a few more times to be sure, but it certainly seems to be the
    solution.

    Cheers,
    Jon

    "Gregory A. Beamer" <> wrote in message
    news:Xns9CD37DC4031Bgbworld@207.46.248.16...
    > "Jon Spivey" <> wrote in
    > news:u#Orc$:
    >
    >> I've got a large (1.1m rows) csv file which needs parsing and sticking
    >> into sql server - the job needs doing every day and if anything the
    >> csv will get larger over time. At present I'm using a TextFieldParser
    >> to parse the csv line by line and add to the database. This fails
    >> probably 2 times in 3, if it's going to fall over it's usually at
    >> around 200,000 lines. Looking for suggestions as to how to do this
    >> robustly, on a shared server which doesn't allow bulk insert. Fair to
    >> assume the server is a factor in failure but I can't upgrade just yet.

    >
    > Having done this numerous times, I find the best way is to use a
    > StreamReader and read in line by line, esp. with large files, as trying
    > to store everything in memory (whether DataSet or objects) ends up
    > unwieldy.
    >
    > With a good regex, you can divide out the elements, even if there is a
    > text delimiter (usually some form of quote). I have written my own, but
    > I would not be surprised if there are others.
    >
    > Another direction to conquer this, as you are storing in SQL Server, is
    > to use SSIS (or DTS in older versions). SSIS has the ability to read a
    > CSV file.
    >
    > If this is a file format you can set up a BCP file for, you can bulk
    > load the items into SQL Server, as well. NOTE that this will not work if
    > you have to manipulate the CSV flat file into multiple tables, however.
    >
    > In the past, I architected a system that had multiple GB files that had
    > to be manipulated. The solution was to leave the data in flat files and
    > manipulate out into files that mimicked SQL Server tables. I then
    > incremented the IDENTITY values and seeded the flat files. This required
    > many passes and some file sorts to get things into SQL Server, so it is
    > overkill if the file is very predictable and/or does not require
    > extensive manipulation.
    >
    > Peace and Grace,
    >
    >
    > --
    > Gregory A. Beamer (MVP)
    >
    > Twitter: @gbworld
    > Blog: http://gregorybeamer.spaces.live.com
    >
    > *******************************************
    > | Think outside the box! |
    > *******************************************
     
    Jon Spivey, Dec 1, 2009
    #3
  4. "Jon Spivey" <> wrote in
    news::

    > Thanks for your reply. I've gone with a StreamReader as you suggested,
    > tested it with the first CSV, and it worked perfectly. I'm going to
    > test it a few more times to be sure, but it certainly seems to be the
    > solution.


    The stream only has the overhead of its buffer, so it works very well
    with data that can be streamed. When you are working with files you
    generally process one row at a time, so it is a perfect solution when
    you are simply grabbing records and putting them elsewhere.

    Peace and Grace,

    --
    Gregory A. Beamer (MVP)

    Twitter: @gbworld
    Blog: http://gregorybeamer.spaces.live.com

    *******************************************
    | Think outside the box! |
    *******************************************
     
    Gregory A. Beamer, Dec 1, 2009
    #4
