-
- All Implemented Interfaces:
public class AssemblyProcessingUtils
This class contains methods useful for processing assembly haplotypes.
-
-
Method Summary
Modifier and Type, Method, Description
- static RangeMap<Position, List<Position>> getCoordsRangeMap(List<String> coordsList, String chrom)
Create range map from list of entries from a Mummer coordinates file.
- static String getEntryFromTabDelimitedLine(String mline, int entryColumn, int totalColumns)
Returns the value for a specific column in a tab-delimited string.
- static Tuple<Integer, Integer> getStartEndCoordinates(String entry, boolean ref)
Find the start/end coordinates from a tab-delimited Mummer4 coords file entry.
- static double calculateCoordDistance(Tuple<Integer, Integer> prev, Tuple<Integer, Integer> current)
Calculates the distance between two sets of Mummer coords file entries. This is called on coordinates that are either both ascending (start < end) or both descending (start > end), so the "sign" of the entries is not checked here.
- static boolean checkSnpEntryInRange(String mline, RangeMap<Position, List<Position>> coordsRangeMap, String chrom)
Verifies if the positions from a Mummer4 snp file fall within the range map of reference and assembly positions created from the Mummer4 coordinates files.
- static List<VariantContext> findVCListForAnchor(RangeMap<Position, VariantContext> positionRangeToVariantContextMap, Position refStart, Position refEnd)
- static Tuple<Integer, Double> getRegionCoverage(RangeSet<Integer> rangeSet, Range<Integer> targetRange)
This method takes a RangeSet of integers, and a single range.
- static RangeMap<Position, Tuple<String, String>> parseMummerSNPFile(String fileName, String chromosome)
Method to parse the Mummer SNP file into a RangeMap. The first String in the tuple is the reference call; the second String is the assembly call.
- static Map<Range<Position>, List<Position>> parseCoordinateRegions(String coordFile, String chromosome)
Method to parse out the reference coordinates into a map which, along with the SNP data, can then be used to create Variants.
- static RangeMap<Position, Position> createAsmCoordinatesRangeMap(Map<Range<Position>, List<Position>> refCoords)
Creates a RangeMap of asm positions from the given reference range map, using lower reference position as the value.
- static int calculateRegionCovered(RangeMap<Position, Position> asmCoveredMap, Range<Position> asmRange)
Given a map of ranges and a range, calculate the number of positions within the given range that are represented in the RangeMap.
- static Map<Range<Position>, List<Position>> mergeCoords(String coordFile, String chromosome)
Test method to try to merge overlapping coordinates.
- static Map<Range<Position>, List<Position>> resizeCoords(String coordFile, String chromosome)
Test method to resize the coordinate files so they are not overlapping.
- static void exportMergedRegions(Map<Range<Position>, List<Position>> mergedCoords, String fileName)
Utility method to export out the merged regions.
- static RangeMap<Position, Tuple<String, String>> setupIndelVariants(Map<Range<Position>, List<Position>> coordinates, GenomeSequence refSequence, GenomeSequence asmSequence)
Method to fill in the unmapped regions coming from nucmer.
- static RangeSet<Position> getIndelRanges(RangeMap<Position, List<Position>> coordinates)
- static List<VariantContext> createVCasRefBlock(GenomeSequence refSequence, String assemblyName, RangeSet<Position> anchors, Map<Range<Position>, List<Position>> refMappings)
Method to build list of VariantContexts as RefRangeVCs; used when the reference and assembly have identical chromosome data.
- static List<VariantContext> extractAnchorVariantContextsFromAssemblyAlignments(GenomeSequence refSequence, String assemblyName, RangeSet<Position> anchors, Map<Range<Position>, List<Position>> refMappings, RangeMap<Position, Tuple<String, String>> snps)
Method to build the list of VariantContexts based on the mapped coordinates and the SNPs.
- static List<VariantContext> splitRefRange(List<VariantContext> variantContexts, Map<Integer, ReferenceRange> anchorMapping, GenomeSequence refSequence)
Method to split up the reference range by anchor mappings.
- static VariantContext createRefRangeVC(GenomeSequence refSequence, String assemblyTaxon, Position refRangeStart, Position refRangeEnd, Position asmStart, Position asmEnd)
Helper method to create a Reference Range VariantContext for assemblies.
- static VariantContext createSNPVC(String assemblyTaxon, Position startPosition, Position endPosition, Tuple<String, String> calls, Position asmStart, Position asmEnd)
Helper method to create a SNP VariantContext for assemblies.
- static boolean isRefBlock(VariantContext vc)
Simple method to determine if the current variant context is a reference block or not.
- static List<VariantContext> resizeRefBlock(VariantContext vc, GenomeSequence refSequence, Position positionToSplit, boolean isStart)
Method which will take a VariantContext which needs to be split and will output 2 new variants while updating ASM_* annotations.
- static Tuple<Integer, Integer> findRefIndelStart(int refSnpPos, Collection<String> snpEntries)
Find the lowest reference start entry from a Mummer snp file list of entries.
- static int findAsmIndelStart(Collection<String> snpEntries)
- static RangeSet<Position> getAnchorRangeSet(Map<Integer, ReferenceRange> anchorEntries)
Create a RangeSet from a map of ranges.
- static Map<Integer, ReferenceRange> referenceRangeForChromMap(Connection database, String chrom)
Find all reference ranges for a particular chromosome. The query pulls all reference ranges for that chrom from the reference_ranges table.
- static Tuple<Integer, Integer> loadInitialAssemblyData(String assemblyName, String method, int clusterSize, Connection dbConn, Map<String, String> pluginParams, List<String> fastaInfo, boolean isTestMethod)
Load initial genotype and method data to the database.
- static void loadAssemblyDataToDB(int gamete_grp_id, String method, Connection dbConn, Map<Integer, AnchorDataPHG> anchorSequences, String chromosome, int genomeFileId)
Load the assembly haplotype data to the database.
-
Method Detail
-
getCoordsRangeMap
static RangeMap<Position, List<Position>> getCoordsRangeMap(List<String> coordsList, String chrom)
Create range map from list of entries from a Mummer coordinates file
- Parameters:
coordsList - List containing tab-delimited strings from a Mummer coordinates file
chrom - String with chromosome name
-
getEntryFromTabDelimitedLine
static String getEntryFromTabDelimitedLine(String mline, int entryColumn, int totalColumns)
Returns the value for a specific column in a tab-delimited string
- Parameters:
entryColumn - the column number (1-based) of the entry the caller wants
totalColumns - total number of tab-delimited columns in the line
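A minimal sketch of how a 1-based column lookup of this kind might work. The class name `TabColumnSketch`, the method name `entryAt`, and the choice to throw on a column-count mismatch are assumptions made for illustration; the documentation above does not state how the real method handles malformed lines.

```java
final class TabColumnSketch {

    // Hypothetical stand-in for getEntryFromTabDelimitedLine.
    // Assumption: a line with the wrong number of columns is rejected with an exception.
    static String entryAt(String line, int entryColumn, int totalColumns) {
        // A limit of -1 keeps trailing empty columns, so the count check is reliable.
        String[] fields = line.split("\t", -1);
        if (fields.length != totalColumns) {
            throw new IllegalArgumentException(
                "expected " + totalColumns + " columns but found " + fields.length);
        }
        if (entryColumn < 1 || entryColumn > totalColumns) {
            throw new IllegalArgumentException("column out of range: " + entryColumn);
        }
        // Convert the 1-based column number to a 0-based array index.
        return fields[entryColumn - 1];
    }

    public static void main(String[] args) {
        // Looks up the third of four columns.
        System.out.println(entryAt("100\t200\t95\t195", 3, 4));
    }
}
```

Passing `totalColumns` lets the caller detect truncated or over-long lines before trusting the extracted value.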
-
getStartEndCoordinates
static Tuple<Integer, Integer> getStartEndCoordinates(String entry, boolean ref)
Find the start/end coordinates from a tab-delimited Mummer4 coords file entry
- Parameters:
entry - Line from a Mummer coords file
ref - Boolean: if true, get the reference start and end.
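The following sketch illustrates the idea under stated assumptions: the coords line is taken to begin with the four tab-delimited columns S1, E1, S2, E2 (reference start/end, then assembly start/end), and passing `false` for `ref` is assumed to select the assembly pair. The class `CoordsEntrySketch` and the `int[]` return type are stand-ins; the real method returns a `Tuple<Integer, Integer>`.

```java
final class CoordsEntrySketch {

    // Hypothetical stand-in for getStartEndCoordinates.
    // Assumed column order: S1, E1 (reference), S2, E2 (assembly), then further columns.
    static int[] startEnd(String entry, boolean ref) {
        String[] fields = entry.split("\t");
        int first = ref ? 0 : 2; // index of the start column for the requested genome
        if (fields.length < first + 2) {
            throw new IllegalArgumentException("too few columns in coords entry");
        }
        int start = Integer.parseInt(fields[first].trim());
        int end = Integer.parseInt(fields[first + 1].trim());
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        // The assembly aligns in reverse here, so its start is larger than its end.
        String line = "1000\t2000\t5000\t4000\t1001\t1001\t99.5\tchr1\tasm_chr1";
        int[] refCoords = startEnd(line, true);
        int[] asmCoords = startEnd(line, false);
        System.out.println(refCoords[0] + "-" + refCoords[1]);
        System.out.println(asmCoords[0] + "-" + asmCoords[1]);
    }
}
```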
-
calculateCoordDistance
static double calculateCoordDistance(Tuple<Integer, Integer> prev, Tuple<Integer, Integer> current)
Calculates the distance between two sets of Mummer coords file entries. This is called on coordinates that are either both ascending (start < end) or both descending (start > end), so the "sign" of the entries is not checked here.
-
checkSnpEntryInRange
static boolean checkSnpEntryInRange(String mline, RangeMap<Position, List<Position>> coordsRangeMap, String chrom)
Verifies if the positions from a Mummer4 snp file fall within the range map of reference and assembly positions created from the Mummer4 coordinates files.
-
findVCListForAnchor
static List<VariantContext> findVCListForAnchor(RangeMap<Position, VariantContext> positionRangeToVariantContextMap, Position refStart, Position refEnd)
-
getRegionCoverage
static Tuple<Integer, Double> getRegionCoverage(RangeSet<Integer> rangeSet, Range<Integer> targetRange)
This method takes a RangeSet of integers, and a single range. It finds all the ranges in the set that intersect the targetRange. Calculate both the number of bases from the targetRange that are represented in the rangeSet, and the percentage of the bases represented. Return a Tuple with this information.
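The calculation described above can be sketched without Guava by using plain inclusive integer intervals. Everything named here (`RegionCoverageSketch`, `Coverage`, `coverage`) is hypothetical, the input ranges are assumed to be disjoint (as the ranges of a RangeSet are), and expressing the second value on a 0 to 100 scale is an assumption, since the text only says "percentage".

```java
import java.util.Arrays;
import java.util.List;

final class RegionCoverageSketch {

    // Small holder standing in for Tuple<Integer, Double>.
    static final class Coverage {
        final int bases;
        final double percent;

        Coverage(int bases, double percent) {
            this.bases = bases;
            this.percent = percent;
        }
    }

    // Each range is {start, end}, both inclusive; ranges are assumed not to overlap each other.
    static Coverage coverage(List<int[]> disjointRanges, int targetStart, int targetEnd) {
        if (targetEnd < targetStart) {
            throw new IllegalArgumentException("empty target range");
        }
        int targetLength = targetEnd - targetStart + 1;
        int covered = 0;
        for (int[] range : disjointRanges) {
            // Clip the range to the target; a non-empty result is the intersection.
            int lo = Math.max(range[0], targetStart);
            int hi = Math.min(range[1], targetEnd);
            if (lo <= hi) {
                covered += hi - lo + 1;
            }
        }
        return new Coverage(covered, 100.0 * covered / targetLength);
    }

    public static void main(String[] args) {
        List<int[]> ranges = Arrays.asList(new int[] {1, 10}, new int[] {21, 30});
        Coverage c = coverage(ranges, 5, 25);
        System.out.println(c.bases + " bases, " + c.percent + " percent");
    }
}
```

In the example, positions 5 to 10 and 21 to 25 of the target are covered, which is 11 of its 21 positions.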
-
parseMummerSNPFile
static RangeMap<Position, Tuple<String, String>> parseMummerSNPFile(String fileName, String chromosome)
Method to parse the Mummer SNP file into a RangeMap. The first String in the tuple is the reference call; the second String is the assembly call.
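As a rough illustration of the reference/assembly call pairing, the sketch below parses SNP lines that are already in memory. It assumes tab-delimited lines whose first three columns are the reference position, the reference call, and the assembly call; the class `SnpFileSketch` is hypothetical, a `TreeMap` keyed by reference position stands in for the `RangeMap<Position, Tuple<String, String>>`, and chromosome filtering and file reading are omitted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

final class SnpFileSketch {

    // Returns reference position -> {reference call, assembly call}.
    // Assumed column order: P1 (ref position), ref call, asm call, P2 (asm position), ...
    static TreeMap<Integer, String[]> parse(List<String> lines) {
        TreeMap<Integer, String[]> calls = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 4) {
                continue; // blank or malformed line
            }
            int refPos;
            try {
                refPos = Integer.parseInt(fields[0].trim());
            } catch (NumberFormatException e) {
                continue; // header line or other non-data row
            }
            // Simplification: a repeated reference position overwrites the earlier entry,
            // so runs of inserted bases would need extra handling in a real parser.
            calls.put(refPos, new String[] {fields[1], fields[2]});
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "[P1]\t[SUB]\t[SUB]\t[P2]",
            "100\tA\tG\t105",
            "250\t.\tT\t254");
        TreeMap<Integer, String[]> calls = parse(lines);
        for (Integer pos : calls.keySet()) {
            System.out.println(pos + " ref=" + calls.get(pos)[0] + " asm=" + calls.get(pos)[1]);
        }
    }
}
```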
-
parseCoordinateRegions
static Map<Range<Position>, List<Position>> parseCoordinateRegions(String coordFile, String chromosome)
Method to parse out the reference coordinates into a map which, along with the SNP data, can then be used to create Variants.
-
createAsmCoordinatesRangeMap
static RangeMap<Position, Position> createAsmCoordinatesRangeMap(Map<Range<Position>, List<Position>> refCoords)
Creates a RangeMap of asm positions from the given reference range map, using lower reference position as the value.
-
calculateRegionCovered
static int calculateRegionCovered(RangeMap<Position, Position> asmCoveredMap, Range<Position> asmRange)
Given a map of ranges and a range, calculate the number of positions within the given range that are represented in the RangeMap.
-
mergeCoords
@Deprecated() static Map<Range<Position>, List<Position>> mergeCoords(String coordFile, String chromosome)
Test method to try to merge overlapping coordinates. We thought Show-SNPs would recall the SNPs in the delta file based on this information, but this was not the case.
-
resizeCoords
static Map<Range<Position>, List<Position>> resizeCoords(String coordFile, String chromosome)
Test method to resize the coordinate files so they are not overlapping
-
exportMergedRegions
@Deprecated() static void exportMergedRegions(Map<Range<Position>, List<Position>> mergedCoords, String fileName)
Utility method to export out the merged regions
-
setupIndelVariants
static RangeMap<Position, Tuple<String, String>> setupIndelVariants(Map<Range<Position>, List<Position>> coordinates, GenomeSequence refSequence, GenomeSequence asmSequence)
Method to fill in the unmapped regions coming from nucmer. This method will create multi-bp indels between the mapped regions which can then be added to the SNP list for processing into Variants.
-
getIndelRanges
static RangeSet<Position> getIndelRanges(RangeMap<Position, List<Position>> coordinates)
-
createVCasRefBlock
static List<VariantContext> createVCasRefBlock(GenomeSequence refSequence, String assemblyName, RangeSet<Position> anchors, Map<Range<Position>, List<Position>> refMappings)
Method to build list of VariantContexts as RefRangeVCs - used when the reference and assembly have identical chromosome data
-
extractAnchorVariantContextsFromAssemblyAlignments
static List<VariantContext> extractAnchorVariantContextsFromAssemblyAlignments(GenomeSequence refSequence, String assemblyName, RangeSet<Position> anchors, Map<Range<Position>, List<Position>> refMappings, RangeMap<Position, Tuple<String, String>> snps)
Method to build the list of VariantContexts based on the mapped coordinates and the SNPs
-
splitRefRange
static List<VariantContext> splitRefRange(List<VariantContext> variantContexts, Map<Integer, ReferenceRange> anchorMapping, GenomeSequence refSequence)
Method to split up the reference range by anchor mappings. This method takes the variant contexts and the anchor coordinates. If a variant context is a reference block that spans the start or end of an anchor (which should happen frequently if anchor ends are truly conserved), it is broken up into two adjacent reference blocks, so that the anchor boundary becomes the start or end of one of the resulting variants. This allows easy querying of the list of Variants when loading into the db.
-
createRefRangeVC
static VariantContext createRefRangeVC(GenomeSequence refSequence, String assemblyTaxon, Position refRangeStart, Position refRangeEnd, Position asmStart, Position asmEnd)
Helper method to create a Reference Range VariantContext for assemblies. The DP value defaults to 0 for assemblies. If it is not set, GenotypeBuilder uses -1 as the default, which causes problems for assemblies down the line when the value is stored as a byte in a long.
-
createSNPVC
static VariantContext createSNPVC(String assemblyTaxon, Position startPosition, Position endPosition, Tuple<String, String> calls, Position asmStart, Position asmEnd)
Helper method to create a SNP VariantContext for assemblies. The DP value defaults to 0 for assemblies. If it is not set, GenotypeBuilder uses -1 as the default, which causes problems for assemblies down the line when the value is stored as a byte in a long.
-
isRefBlock
static boolean isRefBlock(VariantContext vc)
Simple method to determine if the current variant context is a reference block or not.
-
resizeRefBlock
static List<VariantContext> resizeRefBlock(VariantContext vc, GenomeSequence refSequence, Position positionToSplit, boolean isStart)
Method which will take a VariantContext which needs to be split and will output 2 new variants while updating ASM_* annotations. Depending on whether the splitting position is a start or an end, and whether the assembly coordinates are increasing or decreasing, the split has to be handled differently.
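The four cases can be sketched with plain integers. This is an interpretation, not the actual implementation: it assumes a reference block maps reference to assembly positions one-to-one, that a split position marked as a start opens the second piece, and that one marked as an end closes the first piece. The class `RefBlockSplitSketch` and the `{refStart, refEnd, asmStart, asmEnd}` array layout are hypothetical.

```java
final class RefBlockSplitSketch {

    // A block is {refStart, refEnd, asmStart, asmEnd}, all inclusive.
    // asmStart > asmEnd means the assembly coordinates are decreasing.
    static int[][] split(int[] block, int splitPos, boolean isStart) {
        int refStart = block[0];
        int refEnd = block[1];
        int asmStart = block[2];
        int asmEnd = block[3];

        // An anchor start opens the second piece; an anchor end closes the first piece.
        int firstRefEnd = isStart ? splitPos - 1 : splitPos;
        if (firstRefEnd < refStart || firstRefEnd >= refEnd) {
            throw new IllegalArgumentException("split position would leave an empty piece");
        }

        int firstLength = firstRefEnd - refStart + 1;
        int step = asmStart <= asmEnd ? 1 : -1; // direction of the assembly coordinates
        int firstAsmEnd = asmStart + step * (firstLength - 1);

        return new int[][] {
            {refStart, firstRefEnd, asmStart, firstAsmEnd},
            {firstRefEnd + 1, refEnd, firstAsmEnd + step, asmEnd}
        };
    }

    public static void main(String[] args) {
        // Decreasing assembly coordinates, splitting at an anchor start.
        int[][] pieces = split(new int[] {100, 199, 1199, 1100}, 150, true);
        for (int[] p : pieces) {
            System.out.println("ref " + p[0] + "-" + p[1] + " asm " + p[2] + "-" + p[3]);
        }
    }
}
```

Tracking the direction as a signed step keeps the increasing and decreasing cases in one code path instead of two mirrored branches.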
-
findRefIndelStart
static Tuple<Integer, Integer> findRefIndelStart(int refSnpPos, Collection<String> snpEntries)
Find the lowest reference start entry from a Mummer snp file list of entries. There could be more than 1 string of indels for this asm snp. Find the start and end of the string of indels whose positions overlap the reference position for the SNP in question. Return the start position and the length of this string of indels.
-
findAsmIndelStart
static int findAsmIndelStart(Collection<String> snpEntries)
-
getAnchorRangeSet
static RangeSet<Position> getAnchorRangeSet(Map<Integer, ReferenceRange> anchorEntries)
Create a RangeSet from a map of ranges
-
referenceRangeForChromMap
static Map<Integer, ReferenceRange> referenceRangeForChromMap(Connection database, String chrom)
Find all reference ranges for a particular chromosome. The query pulls all reference ranges for that chrom from the reference_ranges table. The assembly should be processed against all defined reference ranges.
-
loadInitialAssemblyData
static Tuple<Integer, Integer> loadInitialAssemblyData(String assemblyName, String method, int clusterSize, Connection dbConn, Map<String, String> pluginParams, List<String> fastaInfo, boolean isTestMethod)
Load initial genotype and method data to the database.
-
loadAssemblyDataToDB
static void loadAssemblyDataToDB(int gamete_grp_id, String method, Connection dbConn, Map<Integer, AnchorDataPHG> anchorSequences, String chromosome, int genomeFileId)
Load the assembly haplotype data to the database
-