Efficiently concatenating contents of multiple files

Discussion in 'Java' started by sasuke, Jul 2, 2008.

  1. sasuke

    sasuke Guest

    Hello to all Java programmers out there. :)

    I was just wondering what would be the most time / space efficient way
    of concatenating contents of different files to a single file. Sample
    usage would be:
    java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

    Using threads to open streams to the source files is out of the question
    since the data needs to be written in the same order in which it
    exists in the source files, i.e. no ad hoc writing. Reading the entire
    contents of a file into memory (by using a StringBuffer /
    StringBuilder) also isn't a good choice considering that we can come
    across really large text files (~10 MB, typical for db dumps). Reading
    the source file line by line doesn't seem attractive given that it
    would increase I/O and, for really large files, might turn out to
    be an I/O bottleneck. One solution which comes to mind is to read the
    file in chunks, i.e. read the data into a char array of 8 KB or a string
    array of size 100.

    My question here is: is there any ideal solution which comes to
    mind when solving this problem, or does the solution really depend on
    the domain in consideration and the kind of sacrifices we are ready to
    make (e.g. losing the ordering of data, the memory trade-off when reading
    an entire file into a buffer, the I/O hit)?

    Pardon me for asking such a trivial / silly question but just a
    thought. :)

    Regards,
    /~sasuke
     
    sasuke, Jul 2, 2008
    #1

  2. sasuke wrote:
    > Hello to all Java programmers out there. :)
    >
    > I was just wondering what would be the most time / space efficient way
    > of concatenating contents of different files to a single file. Sample
    > usage would be:
    > java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...


    The most efficient usage of your time is not to reinvent wheels.

    >
    > Using threads to open streams to the source files is out of the question
    > since the data needs to be written in the same order in which it
    > exists in the source files, i.e. no ad hoc writing.


    Having multiple threads doing I/O to the same disk is likely to slow
    things down.


    > Reading the entire
    > contents of a file into memory (by using a StringBuffer /
    > StringBuilder) also isn't a good choice considering that we can come
    > across really large text files (~10 MB, typical for db dumps).


    I see no benefit in reading a whole file into memory.


    > Reading
    > the source file line by line doesn't seem attractive given that it
    > would increase I/O and, for really large files, might turn out to
    > be an I/O bottleneck.


    You don't need the JVM to be doing conversion to UTF-16, or pointless
    line-oriented processing (e.g. scanning for line-endings).


    > One solution which comes to mind is to read the
    > file in chunks, i.e. read the data into a char array of 8 KB or a string
    > array of size 100.
    >
    > My question here is: is there any ideal solution which comes to
    > mind when solving this problem


    :)

    cat sourceFileOne.txt sourceFileTwo.txt ... > targetFile.txt

    or

    copy sourceFileOne.txt+sourceFileTwo.txt ... targetFile.txt

    depending on operating system

    > or does the solution really depend on
    > the domain in consideration and the kind of sacrifices we are ready to
    > make (e.g. losing the ordering of data, the memory trade-off when reading
    > an entire file into a buffer, the I/O hit)?



    I wouldn't reinvent this wheel, but if you are doing it, I suggest you
    treat the files as binary, not as text (and especially avoid anything
    that translates encodings). Reading in large fixed-size chunks would
    seem to be sensible. Given that the task is I/O bound, I wouldn't try too
    hard to optimise anything else.
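
    A minimal sketch of that approach, assuming Java 7 or later for
    try-with-resources. The class name ChunkedConcat and the 64 KB chunk
    size are illustrative assumptions, not anything measured or taken from
    this thread. The files are copied as raw bytes, in argument order, with
    no decoding and no line handling.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

public class ChunkedConcat {
    // Assumed chunk size; tune it for your own disks if it matters at all.
    private static final int CHUNK_SIZE = 64 * 1024;

    /** Writes each source file, in the order given, to the target as raw bytes. */
    public static void concat(String target, String... sources) throws IOException {
        byte[] buffer = new byte[CHUNK_SIZE];
        try (OutputStream out = new FileOutputStream(target)) {
            for (String source : sources) {
                try (InputStream in = new FileInputStream(source)) {
                    int n;
                    // read() may return fewer bytes than the buffer holds; -1 means end of file.
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java ChunkedConcat targetFile sourceFile...");
            return;
        }
        concat(args[0], Arrays.copyOfRange(args, 1, args.length));
    }
}
```

    Usage would follow the pattern from the original question:
    java ChunkedConcat targetFile.txt sourceFileOne.txt sourceFileTwo.txt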

    --
    RGB
     
    RedGrittyBrick, Jul 2, 2008
    #2

  3. sasuke wrote:
    > Hello to all Java programmers out there. :)
    >
    > I was just wondering what would be the most time / space efficient way
    > of concatenating contents of different files to a single file. Sample
    > usage would be:
    > java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...

    Why not use the concat task that comes with Ant? Or, if you can use a
    shell on a *nix box, use "cat". Or install the cat binary from Cygwin on
    the Windows box (the list goes on). There are many solutions out there,
    the least recommended being writing something like this from scratch
    (unless you are doing this just for learning or for fun).
    Abhijat
     
    Abhijat Vatsyayan, Jul 3, 2008
    #3
  4. sasuke

    sasuke Guest

    Thanks to all for your replies. True, when programming we must seek
    real-life solutions to real-world problems, and the only efficient way
    here seems to be making use of platform-specific trickery.

    I also completely agree with the general consensus that reading /
    writing raw bytes is much faster than reading in bytes,
    converting them into a string using a given or default encoding, and
    then writing the string to the target file, where it is again encoded
    into a byte array based on the encoding.

    A few queries though:

    On Jul 3, 1:32 am, Zig wrote:
    > What encoding are your text files in? If the source and target files are
    > in the same encoding, and do not have a BOM character at the beginning of
    > the file, then a binary transfer is the way to go. Take a look at
    > java.nio.channels.FileChannel.transferTo / transferFrom
    > http://java.sun.com/javase/6/docs/api/java/nio/channels/FileChannel.h...,
    > long, java.nio.channels.WritableByteChannel)


    Isn't this method an abstract method? So it implies that I need to
    subclass this class and create my own specialized class which deals
    with the content transfer? I wonder how that is any different from
    doing it the raw way...
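
    For what it's worth, FileChannel is abstract, but you normally don't
    subclass it: FileInputStream.getChannel() and
    FileOutputStream.getChannel() hand back ready-made instances. A minimal
    sketch of the transferTo route Zig mentions, assuming Java 7 or later;
    the class name ChannelConcat is a made-up name for illustration.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

public class ChannelConcat {
    /** Appends each source to the target, in order, using FileChannel.transferTo. */
    public static void concat(String target, String... sources) throws IOException {
        try (FileOutputStream fos = new FileOutputStream(target);
             FileChannel out = fos.getChannel()) {
            for (String source : sources) {
                try (FileInputStream fis = new FileInputStream(source);
                     FileChannel in = fis.getChannel()) {
                    long size = in.size();
                    long position = 0;
                    // transferTo may move fewer bytes than requested, so loop until done.
                    while (position < size) {
                        long n = in.transferTo(position, size - position, out);
                        if (n <= 0) {
                            break; // nothing more could be transferred (e.g. file shrank)
                        }
                        position += n;
                    }
                }
            }
        }
    }
}
```

    The bytes land at the output channel's current position, which advances
    as it is written to, so the sources end up in the order given.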

    > If you need to deal with different encodings (from your example usage, you
    > might check to see if your source files were using different BOMs), then
    > reading a block of characters (decoding from source), and writing them
    > back to the target (encoding them with the target file's encoding) may be
    > more appropriate. If they all have the same encoding, but use BOMs, then
    > you can use a binary transfer, skipping the BOM character from all but the
    > first source file.


    BOM? Googling says that this is some sort of Byte order mark but I
    don't think I have ever worked with BOM files before. If this is some
    special byte which occurs at the start of every file (like some sort
    of header) I wonder how you can call them plain text files?
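
    To make the BOM-skipping suggestion concrete: in UTF-8 the byte order
    mark is the three bytes EF BB BF at the very start of the file. A rough
    sketch, assuming every source file is UTF-8 (other encodings use
    different BOM bytes and are not handled here); the class name
    BomAwareConcat and the helper skipUtf8Bom are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BomAwareConcat {
    // UTF-8 byte order mark. Assumption: all inputs are UTF-8.
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    /** Consumes a leading UTF-8 BOM if present; otherwise leaves the stream untouched. */
    static boolean skipUtf8Bom(BufferedInputStream in) throws IOException {
        in.mark(UTF8_BOM.length);
        for (int expected : UTF8_BOM) {
            if (in.read() != expected) {
                in.reset(); // not a BOM: rewind to the start of the file
                return false;
            }
        }
        return true;
    }

    /** Binary concatenation that keeps the first file's BOM and drops the rest. */
    public static void concat(String target, String... sources) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        try (OutputStream out = new FileOutputStream(target)) {
            for (int i = 0; i < sources.length; i++) {
                try (BufferedInputStream in =
                         new BufferedInputStream(new FileInputStream(sources[i]))) {
                    if (i > 0) {
                        skipUtf8Bom(in);
                    }
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }
}
```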

    Your inputs are much appreciated.

    Thanks and regards,
    /sasuke
     
    sasuke, Jul 5, 2008
    #4
  5. Roedy Green

    Roedy Green Guest

    On Wed, 2 Jul 2008 09:51:55 -0700 (PDT), sasuke wrote, quoted or
    indirectly quoted someone who said:

    >I was just wondering what would be the most time / space efficient way
    >of concatenating contents of different files to a single file. Sample
    >usage would be:
    >java Concat targetFile.txt sourceFileOne.txt sourceFileTwo.txt ...


    1. If you want a platform-specific solution, you could spawn a command
    processor shell.

    2. The simplest code would just be to read each file with a
    BufferedReader using a whacking huge buffer size and write in turn to a
    buffered output. See http://mindprod.com/applet/fileio.html for
    sample code. That has needless overhead for converting from bytes to
    char and back, though in theory you could concatenate files of
    different encodings if you knew what they were.

    3. If you read the files as raw bytes rather than chars, you know
    their precise lengths, and the offset where they will fit in the final
    file. You could use random access to implement your thread idea.
    However, I doubt the game will be worth the candle unless the files to
    be gathered live on different _physical_ drives. All you will succeed
    in doing is jerking the heads all over.

    4. If you want a canned solution, use the FileTransfer class
    downloadable from
    http://mindprod.com/products.html#FILETRANSFER

    It does it rapidly in large raw-byte chunks.

    // test FileTransfer.append
    import com.mindprod.filetransfer.FileTransfer;
    import java.io.File;

    public class Concat
    {
        /**
         * test harness to concatenate c onto the end of b, leaving the
         * result in a.
         *
         * @param args not used
         */
        public static void main ( String[] args )
        {
            File a = new File ("C:/temp/temp.txt"); // does not exist yet
            File b = new File ("E:/mindprod/feedback/peaceincorrect.html");
            File c = new File ("E:/mindprod/jgloss/j.html");

            FileTransfer ft = new FileTransfer ( 50000 /* buffsize */ );
            // source, target
            ft.append( b, a );
            ft.append( c, a );
        }
    }
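
    Point 3 above (known lengths, so known offsets) can be sketched with the
    standard library alone. This is only an illustration under assumptions:
    Java 8 or later, a made-up class name OffsetConcat, and an arbitrary cap
    of 4 threads. The target is pre-sized first because
    FileChannel.transferFrom transfers nothing when the position lies beyond
    the current end of the file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OffsetConcat {
    public static void concat(String target, String... sources) throws Exception {
        // Each source starts where the previous ones end, measured in raw bytes.
        long[] offsets = new long[sources.length];
        long total = 0;
        for (int i = 0; i < sources.length; i++) {
            offsets[i] = total;
            total += new File(sources[i]).length();
        }
        // Pre-size the target so every offset already lies inside the file.
        try (RandomAccessFile raf = new RandomAccessFile(target, "rw")) {
            raf.setLength(total);
        }
        // Thread cap of 4 is an arbitrary assumption.
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, Math.min(sources.length, 4)));
        try {
            List<Future<?>> jobs = new ArrayList<>();
            for (int i = 0; i < sources.length; i++) {
                final Path src = Paths.get(sources[i]);
                final long start = offsets[i];
                jobs.add(pool.submit(() -> {
                    copyAt(src, Paths.get(target), start);
                    return null;
                }));
            }
            for (Future<?> job : jobs) {
                job.get(); // rethrows any failure from the worker
            }
        } finally {
            pool.shutdown();
        }
    }

    /** Copies all of src into target, starting at byte offset start. */
    static void copyAt(Path src, Path target, long start) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE)) {
            long size = in.size();
            long done = 0;
            while (done < size) {
                long n = out.transferFrom(in, start + done, size - done);
                if (n <= 0) {
                    break;
                }
                done += n;
            }
        }
    }
}
```

    The output order is fixed by the offsets, not by which thread finishes
    first. Whether it is any faster than a plain sequential copy depends on
    the drives, as noted above.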

    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
     
    Roedy Green, Jul 5, 2008
    #5
  6. Tom Anderson

    Tom Anderson Guest

    On Sat, 5 Jul 2008, Roedy Green wrote:

    > On Wed, 2 Jul 2008 09:51:55 -0700 (PDT), sasuke wrote, quoted or
    > indirectly quoted someone who said:
    >
    >> I was just wondering what would be the most time / space efficient way
    >> of concatenating contents of different files to a single file. Sample
    >> usage would be: java Concat targetFile.txt sourceFileOne.txt
    >> sourceFileTwo.txt ...

    >
    > 1. If you want a platform-specific solution, you could spawn a command
    > processor shell.
    >
    > 2. The simplest code would just be to read each file with a
    > BufferedReader using a whacking huge buffer size and write in turn to a
    > buffered output. See http://mindprod.com/applet/fileio.html for sample
    > code. That has needless overhead for converting from bytes to char and
    > back, though in theory you could concatenate files of different
    > encodings if you knew what they were.


    I think what I'd do is memory-map the input file using NIO, and then write
    the entire thing to the output in one go. And then cross my fingers and
    hope that the OS was smart enough to do the right thing here, rather than
    attempting to load the whole input file into memory first. If it does do
    the right thing, this avoids a lot of copying of bytes to and from Java,
    and might even avoid any copying across the kernel/userspace border.
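
    Roughly, that idea looks like the sketch below. Assumptions: Java 7 or
    later, a made-up class name MappedConcat, and source files under 2 GB
    each, since a single mapping cannot exceed Integer.MAX_VALUE bytes.
    Whether the OS really avoids the extra copies is not something this code
    can guarantee.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedConcat {
    /** Maps each source read-only and writes the mapped region to the target. */
    public static void concat(String target, String... sources) throws IOException {
        try (FileChannel out = FileChannel.open(Paths.get(target),
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (String source : sources) {
                try (FileChannel in =
                         FileChannel.open(Paths.get(source), StandardOpenOption.READ)) {
                    long size = in.size();
                    if (size == 0) {
                        continue; // nothing to map for an empty file
                    }
                    // Throws IllegalArgumentException if size exceeds Integer.MAX_VALUE.
                    MappedByteBuffer map = in.map(FileChannel.MapMode.READ_ONLY, 0, size);
                    // write() may not drain the buffer in one call, so loop.
                    while (map.hasRemaining()) {
                        out.write(map);
                    }
                }
            }
        }
    }
}
```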

    But yeah, running 'cat' is the right solution here.

    tom

     
    Tom Anderson, Jul 6, 2008
    #6