Package 

Class AssemblyProcessingUtils

    public class AssemblyProcessingUtils
    
                        

    This class contains methods useful for processing assembly haplotypes.

    • Method Summary

      Modifier and Type Method Description
      static RangeMap<Position, List<Position>> getCoordsRangeMap(List<String> coordsList, String chrom) Create range map from list of entries from a Mummer coordinates file
      static String getEntryFromTabDelimitedLine(String mline, int entryColumn, int totalColumns) Returns the value for a specific column in a tab-delimited string
      static Tuple<Integer, Integer> getStartEndCoordinates(String entry, boolean ref) Find the start/end coordinates from a tab-delimited Mummer4 coords file entry
      static double calculateCoordDistance(Tuple<Integer, Integer> prev, Tuple<Integer, Integer> current) Calculates the distance between two sets of Mummer coords file entries. This is called on coordinates that are either both ascending (start < end) or both descending (start > end), so the "sign" of the entries is not checked here.
      static boolean checkSnpEntryInRange(String mline, RangeMap<Position, List<Position>> coordsRangeMap, String chrom) Verifies if the positions from a Mummer4 snp file fall within the range map of reference and assembly positions created from the Mummer4 coordinates files.
      static List<VariantContext> findVCListForAnchor(RangeMap<Position, VariantContext> positionRangeToVariantContextMap, Position refStart, Position refEnd)
      static Tuple<Integer, Double> getRegionCoverage(RangeSet<Integer> rangeSet, Range<Integer> targetRange) This method takes a RangeSet of integers, and a single range.
      static RangeMap<Position, Tuple<String, String>> parseMummerSNPFile(String fileName, String chromosome) Method to parse the Mummer SNP file into a RangeMap. The first String in the tuple is the reference call; the second String is the assembly call.
      static Map<Range<Position>, List<Position>> parseCoordinateRegions(String coordFile, String chromosome) Method to parse out the reference coordinates into a map which along with the SNP data can then be used to create Variants.
      static RangeMap<Position, Position> createAsmCoordinatesRangeMap(Map<Range<Position>, List<Position>> refCoords) Creates a RangeMap of asm positions from the given reference range map, using lower reference position as the value.
      static int calculateRegionCovered(RangeMap<Position, Position> asmCoveredMap, Range<Position> asmRange) Given a map of ranges and a range, calculate the number of positions within the given range that are represented in the RangeMap.
      static Map<Range<Position>, List<Position>> mergeCoords(String coordFile, String chromosome) Test method to try to merge overlapping coordinates.
      static Map<Range<Position>, List<Position>> resizeCoords(String coordFile, String chromosome) Test method to resize the coordinate files so they are not overlapping
      static void exportMergedRegions(Map<Range<Position>, List<Position>> mergedCoords, String fileName) Utility method to export out the merged regions
      static RangeMap<Position, Tuple<String, String>> setupIndelVariants(Map<Range<Position>, List<Position>> coordinates, GenomeSequence refSequence, GenomeSequence asmSequence) Method to fill in the unmapped regions coming from nucmer.
      static RangeSet<Position> getIndelRanges(RangeMap<Position, List<Position>> coordinates)
      static List<VariantContext> createVCasRefBlock(GenomeSequence refSequence, String assemblyName, RangeSet<Position> anchors, Map<Range<Position>, List<Position>> refMappings) Method to build list of VariantContexts as RefRangeVCs - used when the reference and assembly have identical chromosome data
      static List<VariantContext> extractAnchorVariantContextsFromAssemblyAlignments(GenomeSequence refSequence, String assemblyName, RangeSet<Position> anchors, Map<Range<Position>, List<Position>> refMappings, RangeMap<Position, Tuple<String, String>> snps) Method to build the list of VariantContexts based on the mapped coordinates and the SNPs
      static List<VariantContext> splitRefRange(List<VariantContext> variantContexts, Map<Integer, ReferenceRange> anchorMapping, GenomeSequence refSequence) Method to split up the reference range by anchor mappings.
      static VariantContext createRefRangeVC(GenomeSequence refSequence, String assemblyTaxon, Position refRangeStart, Position refRangeEnd, Position asmStart, Position asmEnd) Helper method to create a Reference Range VariantContext for assemblies.
      static VariantContext createSNPVC(String assemblyTaxon, Position startPosition, Position endPosition, Tuple<String, String> calls, Position asmStart, Position asmEnd) Helper method to create a SNP Variant context for assemblies.
      static boolean isRefBlock(VariantContext vc) Simple method to determine if the current variant context is a reference block or not.
      static List<VariantContext> resizeRefBlock(VariantContext vc, GenomeSequence refSequence, Position positionToSplit, boolean isStart) Method which takes a VariantContext that needs to be split and outputs 2 new variants while updating the ASM_* annotations.
      static Tuple<Integer, Integer> findRefIndelStart(int refSnpPos, Collection<String> snpEntries) Find the lowest reference start entry from a mummer snp file list of entries.
      static int findAsmIndelStart(Collection<String> snpEntries)
      static RangeSet<Position> getAnchorRangeSet(Map<Integer, ReferenceRange> anchorEntries) Create a RangeSet from a map of ranges
      static Map<Integer, ReferenceRange> referenceRangeForChromMap(Connection database, String chrom) Find all reference ranges for a particular chromosome. The query pulls all reference ranges for that chrom from the reference_ranges table.
      static Tuple<Integer, Integer> loadInitialAssemblyData(String assemblyName, String method, int clusterSize, Connection dbConn, Map<String, String> pluginParams, List<String> fastaInfo, boolean isTestMethod) Load initial genotype and method data to the database
      static void loadAssemblyDataToDB(int gamete_grp_id, String method, Connection dbConn, Map<Integer, AnchorDataPHG> anchorSequences, String chromosome, int genomeFileId) Load the assembly haplotype data to the database
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • getCoordsRangeMap

         static RangeMap<Position, List<Position>> getCoordsRangeMap(List<String> coordsList, String chrom)

        Create range map from list of entries from a Mummer coordinates file

        Parameters:
        coordsList - List containing tab-delimited strings from a Mummer coordinates file
        chrom - String with chromosome name
      • getEntryFromTabDelimitedLine

         static String getEntryFromTabDelimitedLine(String mline, int entryColumn, int totalColumns)

        Returns the value for a specific column in a tab-delimited string

        Parameters:
        entryColumn - the column number (1-based) of the entry the caller wants
        totalColumns - total number of tab-delimited columns in the line
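        The 1-based column lookup described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the class name TabColumnSketch, the method name column, and the choice to throw IllegalArgumentException on a column-count mismatch are all assumptions.

```java
final class TabColumnSketch {
    // Hypothetical stand-in for getEntryFromTabDelimitedLine; the real method may
    // validate or report errors differently.
    static String column(String line, int entryColumn, int totalColumns) {
        // A negative limit keeps trailing empty columns instead of dropping them.
        String[] fields = line.split("\t", -1);
        if (fields.length != totalColumns) {
            throw new IllegalArgumentException(
                "expected " + totalColumns + " columns, found " + fields.length);
        }
        if (entryColumn < 1 || entryColumn > totalColumns) {
            throw new IllegalArgumentException("column out of range: " + entryColumn);
        }
        // Convert the 1-based column number to a 0-based array index.
        return fields[entryColumn - 1];
    }

    public static void main(String[] args) {
        // Ask for column 2 of a 4-column line.
        System.out.println(column("1\t100\t5\t104", 2, 4)); // prints 100
    }
}
```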
      • getStartEndCoordinates

         static Tuple<Integer, Integer> getStartEndCoordinates(String entry, boolean ref)

        Find the start/end coordinates from a tab-delimited Mummer4 coords file entry

        Parameters:
        entry - line from a Mummer coords file
        ref - if true, get the reference start and end.
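        A rough sketch of this kind of extraction is shown below. It assumes the tab-delimited coords layout in which the first four columns are reference start, reference end, assembly start, and assembly end; that layout, the class name CoordsEntrySketch, and the int[] return type (in place of Tuple<Integer, Integer>) are assumptions made for illustration.

```java
final class CoordsEntrySketch {
    // Hypothetical stand-in for getStartEndCoordinates.
    // Assumed column order: refStart, refEnd, asmStart, asmEnd, then other fields.
    static int[] startEnd(String entry, boolean ref) {
        String[] fields = entry.split("\t", -1);
        // Reference coordinates sit in columns 1-2, assembly coordinates in columns 3-4.
        int offset = ref ? 0 : 2;
        int start = Integer.parseInt(fields[offset].trim());
        int end = Integer.parseInt(fields[offset + 1].trim());
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        // Made-up entry: the assembly side is descending (start > end).
        String entry = "100\t250\t900\t750\t151\t151\t99.5\tchr1\tchr1";
        int[] refCoords = startEnd(entry, true);
        int[] asmCoords = startEnd(entry, false);
        System.out.println("ref " + refCoords[0] + "-" + refCoords[1]);
        System.out.println("asm " + asmCoords[0] + "-" + asmCoords[1]);
    }
}
```

        Note that the assembly coordinates may be descending, so callers should not assume start <= end.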
      • calculateCoordDistance

         static double calculateCoordDistance(Tuple<Integer, Integer> prev, Tuple<Integer, Integer> current)

        Calculates the distance between two sets of Mummer coords file entries. This is called on coordinates that are either both ascending (start < end) or both descending (start > end), so the "sign" of the entries is not checked here.

      • checkSnpEntryInRange

         static boolean checkSnpEntryInRange(String mline, RangeMap<Position, List<Position>> coordsRangeMap, String chrom)

        Verifies if the positions from a Mummer4 snp file fall within the range map of reference and assembly positions created from the Mummer4 coordinates files.

      • findVCListForAnchor

         static List<VariantContext> findVCListForAnchor(RangeMap<Position, VariantContext> positionRangeToVariantContextMap, Position refStart, Position refEnd)
      • getRegionCoverage

         static Tuple<Integer, Double> getRegionCoverage(RangeSet<Integer> rangeSet, Range<Integer> targetRange)

        This method takes a RangeSet of integers and a single range. It finds all the ranges in the set that intersect the targetRange, then calculates both the number of bases from the targetRange that are represented in the rangeSet and the percentage of bases represented. It returns a Tuple with this information.
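        The coverage calculation can be illustrated with a small standalone sketch. It uses plain closed int intervals instead of Guava's RangeSet so that it runs without dependencies, and it assumes the ranges are disjoint (as they would be in a RangeSet) and that the percentage is expressed on a 0-100 scale. The names RegionCoverageSketch and Coverage are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class RegionCoverageSketch {
    // Simple holder standing in for Tuple<Integer, Double>.
    static final class Coverage {
        final int bases;
        final double percent;
        Coverage(int bases, double percent) {
            this.bases = bases;
            this.percent = percent;
        }
    }

    // Each int[] is a closed interval {start, end}; intervals are assumed disjoint.
    static Coverage coverage(List<int[]> disjointRanges, int targetStart, int targetEnd) {
        int covered = 0;
        for (int[] range : disjointRanges) {
            // Clip the range to the target and count what remains, if anything.
            int lo = Math.max(range[0], targetStart);
            int hi = Math.min(range[1], targetEnd);
            if (lo <= hi) {
                covered += hi - lo + 1;
            }
        }
        int targetLength = targetEnd - targetStart + 1;
        // Assumption: percentage is reported on a 0-100 scale.
        return new Coverage(covered, 100.0 * covered / targetLength);
    }

    public static void main(String[] args) {
        List<int[]> ranges = Arrays.asList(new int[] {1, 10}, new int[] {21, 30});
        Coverage c = coverage(ranges, 6, 25);
        System.out.println(c.bases + " bases, " + c.percent + "%");
    }
}
```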

      • parseMummerSNPFile

         static RangeMap<Position, Tuple<String, String>> parseMummerSNPFile(String fileName, String chromosome)

        Method to parse the Mummer SNP file into a RangeMap. The first String in the tuple is the reference call; the second String is the assembly call.

      • parseCoordinateRegions

         static Map<Range<Position>, List<Position>> parseCoordinateRegions(String coordFile, String chromosome)

        Method to parse out the reference coordinates into a map which along with the SNP data can then be used to create Variants.

      • createAsmCoordinatesRangeMap

         static RangeMap<Position, Position> createAsmCoordinatesRangeMap(Map<Range<Position>, List<Position>> refCoords)

        Creates a RangeMap of asm positions from the given reference range map, using lower reference position as the value.

      • calculateRegionCovered

         static int calculateRegionCovered(RangeMap<Position, Position> asmCoveredMap, Range<Position> asmRange)

        Given a map of ranges and a range, calculate the number of positions within the given range that are represented in the RangeMap.

      • mergeCoords

        @Deprecated() static Map<Range<Position>, List<Position>> mergeCoords(String coordFile, String chromosome)

        Test method to try to merge overlapping coordinates. We thought Show-SNPs would re-call the SNPs in the delta file based on this information, but this was not the case.

      • resizeCoords

         static Map<Range<Position>, List<Position>> resizeCoords(String coordFile, String chromosome)

        Test method to resize the coordinate files so they are not overlapping

      • setupIndelVariants

         static RangeMap<Position, Tuple<String, String>> setupIndelVariants(Map<Range<Position>, List<Position>> coordinates, GenomeSequence refSequence, GenomeSequence asmSequence)

        Method to fill in the unmapped regions coming from nucmer. This method will create multi-bp indels between the mapped regions which can then be added to the SNP list for processing into Variants.
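        The first step implied here, locating the unmapped stretches between mapped regions, might look like the sketch below. It only finds the gaps on one coordinate axis; building the multi-bp indel calls from the reference and assembly sequences is not shown. The class name UnmappedGapSketch and the closed int-interval representation are assumptions, not part of this class's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class UnmappedGapSketch {
    // Each int[] is a closed interval {start, end} of mapped positions.
    // Returns the closed intervals lying between consecutive mapped regions.
    static List<int[]> gaps(List<int[]> mapped) {
        List<int[]> sorted = new ArrayList<>(mapped);
        sorted.sort(Comparator.comparingInt(r -> r[0]));

        List<int[]> result = new ArrayList<>();
        boolean first = true;
        int furthestEnd = 0;
        for (int[] region : sorted) {
            // A gap exists only if this region starts beyond everything seen so far.
            if (!first && region[0] > furthestEnd + 1) {
                result.add(new int[] { furthestEnd + 1, region[0] - 1 });
            }
            // Track the furthest end so overlapping regions do not create false gaps.
            furthestEnd = first ? region[1] : Math.max(furthestEnd, region[1]);
            first = false;
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> mapped = Arrays.asList(
            new int[] {1, 100}, new int[] {151, 200},
            new int[] {180, 260}, new int[] {300, 400});
        for (int[] gap : gaps(mapped)) {
            System.out.println(gap[0] + "-" + gap[1]);
        }
    }
}
```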

      • getIndelRanges

         static RangeSet<Position> getIndelRanges(RangeMap<Position, List<Position>> coordinates)
      • createVCasRefBlock

         static List<VariantContext> createVCasRefBlock(GenomeSequence refSequence, String assemblyName, RangeSet<Position> anchors, Map<Range<Position>, List<Position>> refMappings)

        Method to build list of VariantContexts as RefRangeVCs - used when the reference and assembly have identical chromosome data

      • extractAnchorVariantContextsFromAssemblyAlignments

         static List<VariantContext> extractAnchorVariantContextsFromAssemblyAlignments(GenomeSequence refSequence, String assemblyName, RangeSet<Position> anchors, Map<Range<Position>, List<Position>> refMappings, RangeMap<Position, Tuple<String, String>> snps)

        Method to build the list of VariantContexts based on the mapped coordinates and the SNPs

      • splitRefRange

         static List<VariantContext> splitRefRange(List<VariantContext> variantContexts, Map<Integer, ReferenceRange> anchorMapping, GenomeSequence refSequence)

        Method to split up the reference range by anchor mappings. This method takes the variant contexts and the anchor coordinates. If a variant context is a reference block that spans the start or end of an anchor (which should happen frequently if anchor ends are truly conserved), it is broken into two adjacent reference blocks so that the anchor boundary becomes the start or end of one of the new blocks. This allows easy querying of the list of variants when loading into the db.

      • createRefRangeVC

         static VariantContext createRefRangeVC(GenomeSequence refSequence, String assemblyTaxon, Position refRangeStart, Position refRangeEnd, Position asmStart, Position asmEnd)

        Helper method to create a Reference Range VariantContext for assemblies. The DP value is defaulted to 0 for assemblies. If DP is not set, GenotypeBuilder uses -1 as the default, which causes problems down the line when the value is stored as a byte in a long.

      • createSNPVC

         static VariantContext createSNPVC(String assemblyTaxon, Position startPosition, Position endPosition, Tuple<String, String> calls, Position asmStart, Position asmEnd)

        Helper method to create a SNP VariantContext for assemblies. The DP value is defaulted to 0 for assemblies. If DP is not set, GenotypeBuilder uses -1 as the default, which causes problems down the line when the value is stored as a byte in a long.

      • isRefBlock

         static boolean isRefBlock(VariantContext vc)

        Simple method to determine if the current variant context is a reference block or not.

      • resizeRefBlock

         static List<VariantContext> resizeRefBlock(VariantContext vc, GenomeSequence refSequence, Position positionToSplit, boolean isStart)

        Method which takes a VariantContext that needs to be split and outputs 2 new variants while updating the ASM_* annotations. The split is handled differently depending on whether the splitting position is a start or an end, and on whether the assembly coordinates are increasing or decreasing.
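        The reference-side arithmetic of such a split can be sketched as below. This is an illustration under stated assumptions only: it assumes that when the split position is an anchor start the position goes to the right-hand block, and when it is an anchor end the position stays in the left-hand block. The ASM_* annotation updates and the increasing/decreasing assembly cases are not modeled, and the class name RefBlockSplitSketch is hypothetical.

```java
final class RefBlockSplitSketch {
    // Splits the closed block [blockStart, blockEnd] into two adjacent closed blocks.
    // Assumption: isStart == true means splitPos begins the right-hand block;
    // isStart == false means splitPos ends the left-hand block.
    static int[][] split(int blockStart, int blockEnd, int splitPos, boolean isStart) {
        int leftEnd = isStart ? splitPos - 1 : splitPos;
        if (leftEnd < blockStart || leftEnd >= blockEnd) {
            throw new IllegalArgumentException(
                "split position does not divide the block into two non-empty parts");
        }
        return new int[][] { { blockStart, leftEnd }, { leftEnd + 1, blockEnd } };
    }

    public static void main(String[] args) {
        int[][] atStart = split(100, 200, 150, true);
        int[][] atEnd = split(100, 200, 150, false);
        System.out.println(atStart[0][1] + " | " + atStart[1][0]); // prints 149 | 150
        System.out.println(atEnd[0][1] + " | " + atEnd[1][0]);     // prints 150 | 151
    }
}
```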

      • findRefIndelStart

         static Tuple<Integer, Integer> findRefIndelStart(int refSnpPos, Collection<String> snpEntries)

        Find the lowest reference start entry from a list of Mummer SNP file entries. There could be more than one run of indels for this asm SNP. Find the start and end of the run of indels whose positions overlap the reference position for the SNP in question. Return the start position and the length of this run of indels.

      • referenceRangeForChromMap

         static Map<Integer, ReferenceRange> referenceRangeForChromMap(Connection database, String chrom)

        Find all reference ranges for a particular chromosome. The query pulls all reference ranges for that chrom from the reference_ranges table. The assembly should be processed against all defined reference ranges.