public class AssemblyMAFFromAnchorWavePlugin
This class creates MAF files of assembly genomes aligned to reference genomes using the anchorwave program. Details on anchorwave may be found here: https://github.com/baoxingsong/AnchorWave
Input:
outputDir: directory to which output files will be written
keyFile: a keyfile that contains information on the assembly fastas, including where the source files may be found. NOTE: AssemblyServerDir and AssemblyGenomeFasta are not used here, but will be needed in a later step when storing the haplotypes. They are included here so that a single keyfile can be used throughout the assembly processing steps. The required headers for this file are:
AssemblyServerDir: server where the assembly files are hosted
AssemblyGenomeFasta: full path to the assembly fasta stored on the AssemblyServerDir
AssemblyDir: local directory where the assembly file is stored
AssemblyFasta: name of the assembly fasta file
AssemblyDBName: name to be used for this assembly in the database
gffFile: A Gff file for the reference fasta, to be used when creating a CDS fasta
refFasta: Full path to the reference fasta
numRuns: number of simultaneous anchorwave runs to perform
threadsPerRun: the number of threads to give each anchorwave run
Output:
For each aligned assembly, a UCSC-formatted MAF file is created and written to the output directory.
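The keyFile headers listed in the inputs above can be checked before starting a long alignment run. The following is a minimal sketch, not part of the plugin: the class name KeyFileHeaderCheck and the method missingColumns are hypothetical, and it assumes the keyfile is tab-delimited with the header on the first line, which the description above does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: reports which required keyfile headers are absent. */
final class KeyFileHeaderCheck {
    // The five headers named in the class description above.
    static final List<String> REQUIRED = List.of(
            "AssemblyServerDir", "AssemblyGenomeFasta",
            "AssemblyDir", "AssemblyFasta", "AssemblyDBName");

    /** Assumes a tab-delimited header line (an assumption, not confirmed by the source). */
    static List<String> missingColumns(String headerLine) {
        List<String> present = Arrays.asList(headerLine.trim().split("\t"));
        List<String> missing = new ArrayList<>();
        for (String column : REQUIRED) {
            if (!present.contains(column)) {
                missing.add(column);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String header = "AssemblyDir\tAssemblyFasta\tAssemblyDBName";
        // Lists the required headers that this example header lacks.
        System.out.println("Missing: " + missingColumns(header));
    }
}
```

An empty result means every required header is present; anything else can be reported to the user before anchorwave is launched.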
Note on threads: anchorwave can take up to 50G of memory per thread. Consider both the number of threads available on your machine and the memory each thread will use. For example, if your machine has 512G, no more than 10 threads may be used for a single anchorwave run. If you want to run 2 alignments in parallel on a 512G machine, use no more than 4 threadsPerRun.
memoryCost ~ numRuns * (80 + (threadsPerRun - 1) * 50) GB
E.g.: 2 assemblies with 4 threads each:
memoryCost ~ 2 * (80+(4-1)*50) == 460GB
In the calculation above, the first 80G is for the one thread that was subtracted from the thread count; each remaining thread adds 50G.
When a user has 10 threads available, there are different options, e.g. 2 runs with 5 threads each, or 5 runs with 2 threads each. AnchorWave uses one thread for each collinear block, and in some cases the last collinear block can take a long time while running on a single thread. As a result, 5 runs with 2 threads each is faster than 2 runs with 5 threads each, but costs more memory.
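The memory formula above can be turned into a small calculator for comparing run configurations. This is a rough sketch under the figures quoted above (80G for the first thread of each run, 50G for each additional thread); the class name AnchorwaveMemoryEstimate and the method estimateGb are hypothetical and are not part of the plugin.

```java
/** Hypothetical helper: applies the memoryCost formula quoted above. */
final class AnchorwaveMemoryEstimate {
    static final int FIRST_THREAD_GB = 80; // first thread of each run
    static final int EXTRA_THREAD_GB = 50; // each additional thread

    /** Estimated peak memory in GB for numRuns parallel runs of threadsPerRun threads each. */
    static int estimateGb(int numRuns, int threadsPerRun) {
        if (numRuns < 1 || threadsPerRun < 1) {
            throw new IllegalArgumentException("numRuns and threadsPerRun must be at least 1");
        }
        return numRuns * (FIRST_THREAD_GB + (threadsPerRun - 1) * EXTRA_THREAD_GB);
    }

    public static void main(String[] args) {
        System.out.println(estimateGb(2, 4)); // 460, the worked example above
        System.out.println(estimateGb(2, 5)); // 560
        System.out.println(estimateGb(5, 2)); // 650
    }
}
```

With 10 threads in total, the 5 runs of 2 threads configuration is estimated at 650G against 560G for 2 runs of 5 threads, which matches the note that the faster configuration costs more memory.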
Also on threads: anchorwave does not subtract from what the user gave for number of threads. Per Baoxing: "I did not subtract from the user given number. Since main function is costing a very little resource while the other threads are running."
Note on anchorwave: install it from conda. Running "anchorwave" actually executes a script that checks the CPU capabilities and then launches the correct anchorwave executable for the system's instruction set. Installing anchorwave from conda installs all of the different executables.
public AssemblyMAFFromAnchorWavePlugin(@Nullable java.awt.Frame parentFrame, boolean isInteractive)
public AssemblyMAFFromAnchorWavePlugin()
@Nullable public net.maizegenetics.plugindef.DataSet processData(@Nullable net.maizegenetics.plugindef.DataSet input)
public void runAnchorWaveMultiThread(@NotNull java.lang.String refFasta, @NotNull kotlin.Pair<? extends java.util.Map<java.lang.String,java.lang.Integer>,? extends java.util.List<? extends java.util.List<java.lang.String>>> colsAndData, @NotNull java.lang.String cdsFasta, @NotNull java.lang.String gffFile, @NotNull java.lang.String refSamOutFile)
public boolean createCDSfromRefData(@NotNull java.lang.String refFasta, @NotNull java.lang.String gffFile, @NotNull java.lang.String cdsFasta, @NotNull java.lang.String outputDir)
public void runAnchorwaveProali(@NotNull java.lang.String gffFile, @NotNull java.lang.String refFasta, @NotNull java.lang.String asmFasta, @NotNull java.lang.String cdsFasta, @NotNull java.lang.String refSam, @NotNull java.lang.String asmSam)
@Nullable public javax.swing.ImageIcon getIcon()
@NotNull public java.lang.String getButtonName()
@NotNull public java.lang.String getToolTipText()
@NotNull public java.lang.String outputDir()
Output directory for writing files
@NotNull public AssemblyMAFFromAnchorWavePlugin outputDir(@NotNull java.lang.String value)
Set Output Directory. Output directory for writing files
value - Output Directory
@NotNull public java.lang.String keyFile()
Name of the keyfile to process. Must have columns AssemblyServerDir, AssemblyGenomeFasta, RefDir, RefFasta, AssemblyDir, AssemblyFasta, and AssemblyDBName. The AssemblyFasta column should contain the name of the assembly fasta file for aligning. The AssemblyGenomeFasta column should contain the name of the full genome fasta from which the assembly fasta came (it may be the same name as the AssemblyFasta).
@NotNull public AssemblyMAFFromAnchorWavePlugin keyFile(@NotNull java.lang.String value)
Set keyFile. Name of the keyfile to process. Must have columns AssemblyServerDir, AssemblyGenomeFasta, RefDir, RefFasta, AssemblyDir, AssemblyFasta, and AssemblyDBName. The AssemblyFasta column should contain the name of the assembly fasta file for aligning. The AssemblyGenomeFasta column should contain the name of the full genome fasta from which the assembly fasta came (it may be the same name as the AssemblyFasta).
value - keyFile
@NotNull public java.lang.String gffFile()
Reference GFF3 file used to create the CDS fasta for minimap2 alignment
@NotNull public AssemblyMAFFromAnchorWavePlugin gffFile(@NotNull java.lang.String value)
Set Ref GFF3 File. Reference GFF3 file used to create the CDS fasta for minimap2 alignment
value - Ref GFF3 File
@NotNull public java.lang.String refFasta()
Full path to the reference fasta file. Use the docker-specific path if running in a Docker container.
@NotNull public AssemblyMAFFromAnchorWavePlugin refFasta(@NotNull java.lang.String value)
Set Reference Fasta File. Full path to the reference fasta file. Use the docker-specific path if running in a Docker container.
value - Reference Fasta File
public int threadsPerRun()
Number of threads to use for each assembly processed. Choose this value together with numRuns, based on the threads and memory available on the system.
@NotNull public AssemblyMAFFromAnchorWavePlugin threadsPerRun(int value)
Set Threads Per Run. Number of threads to use for each assembly processed. Choose this value together with numRuns, based on the threads and memory available on the system.
value - Threads Per Run
public int numRuns()
Number of assemblies to process simultaneously. The anchorwave application can take up to 50G per thread for each assembly processed, plus some overhead. Consider this memory cost when choosing values for threadsPerRun and numRuns.
@NotNull public AssemblyMAFFromAnchorWavePlugin numRuns(int value)
Set Num Runs. Number of assemblies to process simultaneously. The anchorwave application can take up to 50G per thread for each assembly processed, plus some overhead. Consider this memory cost when choosing values for threadsPerRun and numRuns.
value - Num Runs
@NotNull public java.lang.String minimap2Location()
Location of Minimap2 on file system. This defaults to use minimap2 if it is on the PATH environment variable.
@NotNull public AssemblyMAFFromAnchorWavePlugin minimap2Location(@NotNull java.lang.String value)
Set Location of Minimap2 Executable. Location of Minimap2 on file system. This defaults to use minimap2 if it is on the PATH environment variable.
value - Location of Minimap2 Executable
@NotNull public java.lang.String anchorwaveLocation()
Location of anchorwave on file system. This defaults to use anchorwave if it is on the PATH environment variable.
@NotNull public AssemblyMAFFromAnchorWavePlugin anchorwaveLocation(@NotNull java.lang.String value)
Set Location of anchorwave Executable. Location of anchorwave on file system. This defaults to use anchorwave if it is on the PATH environment variable.
value - Location of anchorwave Executable
public int refMaxAlignCov()
anchorwave proali parameter R, indicating the reference genome maximum alignment coverage.
@NotNull public AssemblyMAFFromAnchorWavePlugin refMaxAlignCov(int value)
Set Ref Max Align Cov. anchorwave proali parameter R, indicating the reference genome maximum alignment coverage.
value - Ref Max Align Cov
public int queryMaxAlignCov()
anchorwave proali parameter Q, indicating the query genome maximum alignment coverage.
@NotNull public AssemblyMAFFromAnchorWavePlugin queryMaxAlignCov(int value)
Set Query Max Align Cov. anchorwave proali parameter Q, indicating the query genome maximum alignment coverage.
value - Query Max Align Cov