public class ExternalSort extends Object
Goal: offer a generic external-memory sorting program in Java. It must be:
- hackable (easy to adapt)
- scalable to large files
- sensibly efficient

This software is in the public domain.

Usage: java org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt

You can change the default maximal number of temporary files with the -t flag: java org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -t 3

You can change the default maximum memory available with the -m flag: java org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt -m 8192

For very large files, you might want to use an appropriate flag to allocate more memory to the Java VM: java -Xms2G org/apache/oak/commons/sort/ExternalSort somefile.txt out.txt

By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Jerry Yang.

First published: April 2010, originally posted at http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
Modifier and Type | Field and Description |
---|---|
static Comparator<String> | defaultcomparator |
Constructor and Description |
---|
ExternalSort() |
Modifier and Type | Method and Description |
---|---|
static void | displayUsage() |
static long | estimateBestSizeOfBlocks(File filetobesorted, int maxtmpfiles, long maxMemory) |
static void | main(String[] args) |
static <T> int | merge(BufferedWriter fbw, Comparator<T> cmp, boolean distinct, List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer<T>> buffers, Function<T,String> typeToString): This merges several BinaryFileBuffer to an output writer. |
static <T> int | mergeSortedFiles(List<File> files, BufferedWriter fbw, Comparator<T> cmp, Charset cs, boolean distinct, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType): This merges a bunch of temporary flat files and deletes them on success or error. |
static int | mergeSortedFiles(List<File> files, File outputfile): This merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp): This merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct): This merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs): This merges a bunch of temporary flat files. |
static int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct): This merges a bunch of temporary flat files. |
static <T> int | mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip): This merges a bunch of temporary flat files. |
static <T> int | mergeSortedFiles(List<File> files, File outputfile, Comparator<T> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType): This merges a bunch of temporary flat files and deletes them on success or error. |
static void | sort(File input, File output) |
static File | sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory): Sort a list and save it to a temporary file. |
static File | sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip): Sort a list and save it to a temporary file. |
static <T> File | sortAndSave(List<T> tmplist, Comparator<T> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, Function<T,String> typeToString): Sort a list and save it to a temporary file. |
static <T> List<File> | sortInBatch(BufferedReader fbr, long actualFileSize, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) |
static List<File> | sortInBatch(File file): Loads the file in blocks of lines, sorts each block in memory, and writes each result to a temporary file to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp): Loads the file in blocks of lines, sorts each block in memory, and writes each result to a temporary file to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, boolean distinct): Loads the file in blocks of lines, sorts each block in memory, and writes each result to a temporary file to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct): Loads the file in blocks of lines, sorts each block in memory, and writes each result to a temporary file to be merged later. |
static List<File> | sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip): Loads the file in blocks of lines, sorts each block in memory, and writes each result to a temporary file to be merged later. |
static <T> List<File> | sortInBatch(File file, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType): Loads the file in blocks of lines, sorts each block in memory, and writes each result to a temporary file to be merged later. |
public static Comparator<String> defaultcomparator
public static void sort(File input, File output) throws IOException
Throws: IOException
public static long estimateBestSizeOfBlocks(File filetobesorted, int maxtmpfiles, long maxMemory)
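The page does not say how the block size is derived from the three parameters. As a rough sketch only, one plausible heuristic is to divide the file size by the number of allowed temporary files, then grow the block if it would use less than half of the available memory. The class name `BlockSizeSketch`, the method name `estimateBlockSize`, and the half-memory threshold are assumptions made for illustration; the real method may compute something different.

```java
// Hypothetical sketch; the real estimateBestSizeOfBlocks may use a different heuristic.
final class BlockSizeSketch {

    static long estimateBlockSize(long fileSizeBytes, int maxTmpFiles, long maxMemoryBytes) {
        // Ceiling division: the smallest block size that needs at most maxTmpFiles blocks.
        long blockSize = fileSizeBytes / maxTmpFiles
                + (fileSizeBytes % maxTmpFiles == 0 ? 0 : 1);
        // Assumption: avoid many tiny temporary files when memory is plentiful.
        long halfMemory = maxMemoryBytes / 2;
        return Math.max(blockSize, halfMemory);
    }
}
```

With these assumptions, a larger `maxtmpfiles` gives smaller blocks, while a larger `maxMemory` puts a floor under the block size.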
public static List<File> sortInBatch(File file) throws IOException
file - some flat file
Throws: IOException
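The batch phase described in the method summary (read a block of lines, sort it in memory, write it to a temporary file) can be sketched in a few lines of plain Java. This is an illustration of the idea, not the Oak implementation: the class `SortedRunsSketch`, the method `splitIntoSortedRuns`, and the use of a line count (rather than a byte estimate) as the block limit are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the batch phase; not the ExternalSort implementation.
final class SortedRunsSketch {

    // Splits the input into temporary files of at most maxLinesPerRun sorted lines each.
    static List<Path> splitIntoSortedRuns(Path input, int maxLinesPerRun,
                                          Comparator<String> cmp) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> block = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                block.add(line);
                if (block.size() == maxLinesPerRun) {
                    runs.add(writeSortedRun(block, cmp));
                    block.clear();
                }
            }
            if (!block.isEmpty()) {
                runs.add(writeSortedRun(block, cmp)); // last, possibly shorter, block
            }
        }
        return runs;
    }

    // Sorts one block in memory and writes it to a fresh temporary file.
    private static Path writeSortedRun(List<String> block, Comparator<String> cmp)
            throws IOException {
        block.sort(cmp);
        Path run = Files.createTempFile("sortInBatch", ".flatfile");
        run.toFile().deleteOnExit();
        Files.write(run, block, StandardCharsets.UTF_8);
        return run;
    }
}
```

Each returned file is sorted on its own; the files still have to be merged to obtain a fully sorted output.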
public static List<File> sortInBatch(File file, Comparator<String> cmp) throws IOException
file - some flat file
cmp - string comparator
Throws: IOException
public static List<File> sortInBatch(File file, Comparator<String> cmp, boolean distinct) throws IOException
file - some flat file
cmp - string comparator
distinct - Pass true if duplicate lines should be discarded.
Throws: IOException
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) throws IOException
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading lines to exclude before sorting starts
usegzip - use gzip compression for the temporary files
Throws: IOException
public static <T> List<File> sortInBatch(File file, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
file - some flat file
cmp - comparator for the custom type
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
numHeader - number of leading lines to exclude before sorting starts
usegzip - use gzip compression for the temporary files
typeToString - function to map custom type to string. Used for storing sorted content back to file
stringToType - function to map string to custom type. Used for converting a line to the custom type for the purpose of sorting
Throws: IOException
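To make the roles of the two conversion functions concrete, here is a small in-memory sketch. `stringToType` parses each line into the value the comparator works on, and `typeToString` turns each value back into a line for writing. The class `TypeConversionSketch` and the method `sortLines` are hypothetical helpers, not part of this API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of the two conversion functions; not part of ExternalSort.
final class TypeConversionSketch {

    static <T> List<String> sortLines(List<String> lines,
                                      Comparator<T> cmp,
                                      Function<String, T> stringToType,
                                      Function<T, String> typeToString) {
        List<T> parsed = new ArrayList<>();
        for (String line : lines) {
            parsed.add(stringToType.apply(line)); // line -> custom type
        }
        parsed.sort(cmp);                         // compare typed values, not raw text
        List<String> out = new ArrayList<>();
        for (T value : parsed) {
            out.add(typeToString.apply(value));   // custom type -> line
        }
        return out;
    }
}
```

For example, sorting the lines "10", "9", "100" as integers with `s -> Integer.valueOf(s)` and `n -> Integer.toString(n)` orders them numerically, whereas a plain string comparator would order them lexicographically.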
public static <T> List<File> sortInBatch(BufferedReader fbr, long actualFileSize, Comparator<T> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
Throws: IOException
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct) throws IOException
file - some flat file
cmp - string comparator
maxtmpfiles - maximal number of temporary files
cs - character set to use (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
Throws: IOException
public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip) throws IOException
tmplist - data to be sorted
cmp - string comparator
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
Throws: IOException
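As a sketch of what "sort a list and save it to a temporary file" can look like with the `distinct` and `usegzip` options, the following uses only the standard library. It is written under assumptions: the class `SortAndSaveSketch`, the temporary file prefix and suffix, and the exact duplicate check are illustrative, and the real method may differ in all of them.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.Comparator;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; not the ExternalSort implementation.
final class SortAndSaveSketch {

    static File sortAndSave(List<String> lines, Comparator<String> cmp, Charset cs,
                            File tmpDirectory, boolean distinct, boolean useGzip)
            throws IOException {
        lines.sort(cmp); // sorts the caller's list in place
        // A null directory selects the default temporary-file location.
        File tmp = File.createTempFile("sortInBatch", "flatfile", tmpDirectory);
        tmp.deleteOnExit();
        try (OutputStream raw = new FileOutputStream(tmp);
             OutputStream out = useGzip ? new GZIPOutputStream(raw) : raw;
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, cs))) {
            String previous = null;
            for (String line : lines) {
                // After sorting, duplicates are adjacent, so comparing with the
                // previously written line is enough to drop them.
                if (!distinct || previous == null || cmp.compare(line, previous) != 0) {
                    writer.write(line);
                    writer.newLine();
                    previous = line;
                }
            }
        }
        return tmp;
    }
}
```

A file written with `useGzip` set to true has to be read back through a `GZIPInputStream`, which is why the merge methods take a matching `usegzip` flag.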
public static <T> File sortAndSave(List<T> tmplist, Comparator<T> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, Function<T,String> typeToString) throws IOException
tmplist - data to be sorted
cmp - comparator for the custom type
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
distinct - Pass true if duplicate lines should be discarded.
typeToString - function to map custom type to string. Used for storing sorted content back to file
Throws: IOException
public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory) throws IOException
tmplist - data to be sorted
cmp - string comparator
cs - charset to use for output (can use Charset.defaultCharset())
tmpdirectory - location of the temporary files (set to null for default location)
Throws: IOException
public static int mergeSortedFiles(List<File> files, File outputfile) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
Throws: IOException
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
Throws: IOException
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
Throws: IOException
public static <T> int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip) throws IOException
files - The List of sorted Files to be merged.
distinct - Pass true if duplicate lines should be discarded.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
append - Pass true to append the result to the File instead of overwriting it. Overloads without this parameter default to false.
usegzip - Pass true if the temporary files were written with gzip compression.
Throws: IOException
public static <T> int mergeSortedFiles(List<File> files, File outputfile, Comparator<T> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare the custom-type values.
cs - The Charset to be used for the byte to character conversion.
distinct - Pass true if duplicate lines should be discarded.
append - Pass true to append the result to the File instead of overwriting it. Overloads without this parameter default to false.
usegzip - Pass true if the temporary files were written with gzip compression.
typeToString - function to map custom type to string. Used for storing sorted content back to file
stringToType - function to map string to custom type. Used for converting a line to the custom type for the purpose of sorting
Throws: IOException
public static <T> int mergeSortedFiles(List<File> files, BufferedWriter fbw, Comparator<T> cmp, Charset cs, boolean distinct, boolean usegzip, Function<T,String> typeToString, Function<String,T> stringToType) throws IOException
files - The List of sorted Files to be merged.
fbw - Buffered writer used to store the sorted content
cmp - The Comparator to use to compare the custom-type values.
cs - The Charset to be used for the byte to character conversion.
distinct - Pass true if duplicate lines should be discarded.
usegzip - Pass true if the temporary files were written with gzip compression.
typeToString - function to map custom type to string. Used for storing sorted content back to file
stringToType - function to map string to custom type. Used for converting a line to the custom type for the purpose of sorting
Throws: IOException
public static <T> int merge(BufferedWriter fbw, Comparator<T> cmp, boolean distinct, List<org.apache.jackrabbit.oak.commons.sort.BinaryFileBuffer<T>> buffers, Function<T,String> typeToString) throws IOException
fbw - A buffer where we write the data.
cmp - A comparator object that tells us how to sort the lines.
distinct - Pass true if duplicate lines should be discarded.
buffers - Where the data should be read.
typeToString - function to map custom type to string. Used for writing the merged content to the output
Throws: IOException
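Merging several sorted inputs into one sorted output is commonly done as a k-way merge with a priority queue keyed on the current head of each input. The sketch below shows that technique on in-memory lists, which stand in for the file buffers; it is not the Oak implementation. The names `KWayMergeSketch` and `Cursor` are hypothetical, and unlike the documented method, which returns an int, this helper returns the merged list so the result is easy to inspect.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a k-way merge; in-memory lists stand in for file buffers.
final class KWayMergeSketch {

    // Pairs the current head of one sorted source with the rest of that source.
    private static final class Cursor<T> {
        T head;
        final Iterator<T> rest;

        Cursor(Iterator<T> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    static <T> List<T> merge(List<List<T>> sortedSources, Comparator<T> cmp, boolean distinct) {
        // The queue always exposes the source whose head is smallest.
        PriorityQueue<Cursor<T>> queue =
                new PriorityQueue<>((a, b) -> cmp.compare(a.head, b.head));
        for (List<T> source : sortedSources) {
            if (!source.isEmpty()) {
                queue.add(new Cursor<>(source.iterator()));
            }
        }
        List<T> merged = new ArrayList<>();
        T last = null;
        boolean wroteAny = false;
        while (!queue.isEmpty()) {
            Cursor<T> smallest = queue.poll();
            T value = smallest.head;
            // With distinct set, skip a value equal to the one just written.
            if (!distinct || !wroteAny || cmp.compare(value, last) != 0) {
                merged.add(value);
                last = value;
                wroteAny = true;
            }
            if (smallest.rest.hasNext()) {
                smallest.head = smallest.rest.next();
                queue.add(smallest); // re-insert with its new head
            }
        }
        return merged;
    }
}
```

Only one element per source is held in the queue at a time, which is what lets a merge of this shape work on inputs far larger than memory when the sources are streamed from files.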
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct) throws IOException
files - The List of sorted Files to be merged.
distinct - Pass true if duplicate lines should be discarded.
outputfile - The output File to merge the results to.
cmp - The Comparator to use to compare Strings.
cs - The Charset to be used for the byte to character conversion.
Throws: IOException
public static int mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs) throws IOException
files - The List of sorted Files to be merged.
outputfile - The output File to merge the results to.
cs - character set to use to load the strings
Throws: IOException
public static void displayUsage()
public static void main(String[] args) throws IOException
Throws: IOException
Copyright © 2010 - 2020 Adobe. All Rights Reserved