Class ExternalSort


  • public class ExternalSort
    extends Object
    Source copied from a publicly available library.
    See Also:
    https://code.google.com/p/externalsortinginjava
     Goal: offer a generic external-memory sorting program in Java.
     
     It must be : - hackable (easy to adapt) - scalable to large files - sensibly efficient.
     
     This software is in the public domain.
     
     Usage: java org/apache/oak/commons/sort//ExternalSort somefile.txt out.txt
     
     You can change the default maximal number of temporary files with the -t flag: java
     org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -t 3
     
     You can change the default maximum memory available with the -m flag: java
     org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -m 8192
     
     For very large files, you might want to use an appropriate flag to allocate more memory to
     the Java VM: java -Xms2G org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt
     
     By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan
     Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Jerry Yang, First published:
     April 2010 originally posted at
     http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
     
    • Constructor Detail

      • ExternalSort

        public ExternalSort()
    • Method Detail

      • estimateBestSizeOfBlocks

        public static long estimateBestSizeOfBlocks​(File filetobesorted,
                                                    int maxtmpfiles,
                                                    long maxMemory)
      • sortInBatch

        public static List<File> sortInBatch​(File file)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        Returns:
        a list of temporary flat files
        Throws:
        IOException
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        cmp - string comparator
        Returns:
        a list of temporary flat files
        Throws:
        IOException
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             boolean distinct)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
        Parameters:
        file - some flat file
        cmp - string comparator
        distinct - Pass true if duplicate lines should be discarded.
        Returns:
        a list of temporary flat files
        Throws:
        IOException
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             int maxtmpfiles,
                                             long maxMemory,
                                             Charset cs,
                                             File tmpdirectory,
                                             boolean distinct,
                                             int numHeader,
                                             boolean usegzip)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of lines to preclude before sorting starts
        usegzip - use gzip compression for the temporary files
        Returns:
        a list of temporary flat files
        Throws:
        IOException
      • sortInBatch

        public static <T> List<File> sortInBatch​(File file,
                                                 Comparator<T> cmp,
                                                 int maxtmpfiles,
                                                 long maxMemory,
                                                 Charset cs,
                                                 File tmpdirectory,
                                                 boolean distinct,
                                                 int numHeader,
                                                 boolean usegzip,
                                                 Function<T,​String> typeToString,
                                                 Function<String,​T> stringToType)
                                          throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        numHeader - number of lines to preclude before sorting starts
        usegzip - use gzip compression for the temporary files
        typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
        stringToType - function to map custom type to string. Used for storing sorted content back to file
        Returns:
        a list of temporary flat files
        Throws:
        IOException
      • sortInBatch

        public static List<File> sortInBatch​(File file,
                                             Comparator<String> cmp,
                                             int maxtmpfiles,
                                             long maxMemory,
                                             Charset cs,
                                             File tmpdirectory,
                                             boolean distinct)
                                      throws IOException
        This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
        Parameters:
        file - some flat file
        cmp - string comparator
        maxtmpfiles - maximal number of temporary files
        cs - character set to use (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        Returns:
        a list of temporary flat files
        Throws:
        IOException
      • sortAndSave

        public static File sortAndSave​(List<String> tmplist,
                                       Comparator<String> cmp,
                                       Charset cs,
                                       File tmpdirectory,
                                       boolean distinct,
                                       boolean usegzip)
                                throws IOException
        Sort a list and save it to a temporary file
        Parameters:
        tmplist - data to be sorted
        cmp - string comparator
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct - Pass true if duplicate lines should be discarded.
        Returns:
        the file containing the sorted data
        Throws:
        IOException
      • sortAndSave

        public static <T> File sortAndSave​(List<T> tmplist,
                                           Comparator<T> cmp,
                                           Charset cs,
                                           File tmpdirectory,
                                           boolean distinct,
                                           boolean usegzip,
                                           Function<T,​String> typeToString)
                                    throws IOException
        Sort a list and save it to a temporary file
        Parameters:
        tmplist - data to be sorted
        cmp - string comparator
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        distinct -
        typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
        Returns:
        the file containing the sorted data
        Throws:
        IOException
      • sortAndSave

        public static File sortAndSave​(List<String> tmplist,
                                       Comparator<String> cmp,
                                       Charset cs,
                                       File tmpdirectory)
                                throws IOException
        Sort a list and save it to a temporary file
        Parameters:
        tmplist - data to be sorted
        cmp - string comparator
        cs - charset to use for output (can use Charset.defaultCharset())
        tmpdirectory - location of the temporary files (set to null for default location)
        Returns:
        the file containing the sorted data
        Throws:
        IOException
      • mergeSortedFiles

        public static int mergeSortedFiles​(List<File> files,
                                           File outputfile)
                                    throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files -
        outputfile - file
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        IOException
      • mergeSortedFiles

        public static int mergeSortedFiles​(List<File> files,
                                           File outputfile,
                                           Comparator<String> cmp)
                                    throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files -
        outputfile - file
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        IOException
      • mergeSortedFiles

        public static int mergeSortedFiles​(List<File> files,
                                           File outputfile,
                                           Comparator<String> cmp,
                                           boolean distinct)
                                    throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files -
        outputfile - file
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        IOException
      • mergeSortedFiles

        public static <T> int mergeSortedFiles​(List<File> files,
                                               File outputfile,
                                               Comparator<String> cmp,
                                               Charset cs,
                                               boolean distinct,
                                               boolean append,
                                               boolean usegzip)
                                        throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        distinct - Pass true if duplicate lines should be discarded. ([email protected])
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        append - Pass true if result should append to File instead of overwrite. Default to be false for overloading methods.
        usegzip - assumes we used gzip compression for temporary files
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        IOException
        Since:
        v0.1.4
      • mergeSortedFiles

        public static <T> int mergeSortedFiles​(List<File> files,
                                               File outputfile,
                                               Comparator<T> cmp,
                                               Charset cs,
                                               boolean distinct,
                                               boolean append,
                                               boolean usegzip,
                                               Function<T,​String> typeToString,
                                               Function<String,​T> stringToType)
                                        throws IOException
        This merges a bunch of temporary flat files and deletes them on success or error.
        Parameters:
        files - The List of sorted Files to be merged.
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        distinct - Pass true if duplicate lines should be discarded. ([email protected])
        append - Pass true if result should append to File instead of overwrite. Default to be false for overloading methods.
        usegzip - assumes we used gzip compression for temporary files
        typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
        stringToType - function to map custom type to string. Used for storing sorted content back to file
        Throws:
        IOException
        Since:
        v0.1.4
      • mergeSortedFiles

        public static <T> int mergeSortedFiles​(List<File> files,
                                               BufferedWriter fbw,
                                               Comparator<T> cmp,
                                               Charset cs,
                                               boolean distinct,
                                               boolean usegzip,
                                               Function<T,​String> typeToString,
                                               Function<String,​T> stringToType)
                                        throws IOException
        This merges a bunch of temporary flat files and deletes them on success or error.
        Parameters:
        files - The List of sorted Files to be merged.
        fbw - Buffered writer used to store the sorted content
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        distinct - Pass true if duplicate lines should be discarded. ([email protected])
        usegzip - assumes we used gzip compression for temporary files
        typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
        stringToType - function to map custom type to string. Used for storing sorted content back to file
        Throws:
        IOException
        Since:
        v0.1.4
      • merge

        public static <T> int merge​(BufferedWriter fbw,
                                    Comparator<T> cmp,
                                    boolean distinct,
                                    List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer<T>> buffers,
                                    Function<T,​String> typeToString)
                             throws IOException
        This merges several BinaryFileBuffer to an output writer.
        Parameters:
        fbw - A buffer where we write the data.
        cmp - A comparator object that tells us how to sort the lines.
        distinct - Pass true if duplicate lines should be discarded. ([email protected])
        buffers - Where the data should be read.
        typeToString - function to map string to custom type. User for coverting line to custom type for the purpose of sorting
        Throws:
        IOException
      • mergeSortedFiles

        public static int mergeSortedFiles​(List<File> files,
                                           File outputfile,
                                           Comparator<String> cmp,
                                           Charset cs,
                                           boolean distinct)
                                    throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files - The List of sorted Files to be merged.
        distinct - Pass true if duplicate lines should be discarded. ([email protected])
        outputfile - The output File to merge the results to.
        cmp - The Comparator to use to compare Strings.
        cs - The Charset to be used for the byte to character conversion.
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        IOException
        Since:
        v0.1.2
      • mergeSortedFiles

        public static int mergeSortedFiles​(List<File> files,
                                           File outputfile,
                                           Comparator<String> cmp,
                                           Charset cs)
                                    throws IOException
        This merges a bunch of temporary flat files
        Parameters:
        files -
        outputfile - file
        cs - character set to use to load the strings
        Returns:
        The number of lines sorted. (P. Beaudoin)
        Throws:
        IOException
      • displayUsage

        public static void displayUsage()