public class GFFUtilsKt
@NotNull public static java.util.List<htsjdk.tribble.gff.Gff3Feature> readGFFtoGff3Feature(@NotNull java.lang.String gffFile)
Reads a GFF file into memory using htsjdk. Returns a List of Gff3Feature entries.
NOTE: GFF files must end with .gff3, .gff, .gff3.gz or .gff.gz. Any other extension causes the htsjdk feature reader to throw an exception.
The returned features may be written out using the htsjdk Gff3Writer.
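Because an unsupported extension only surfaces as an exception from the reader, a caller may want to check the file name first. The following is a minimal Java sketch of such a check, not part of GFFUtilsKt. The class and method names are hypothetical, and the comparison is assumed to be case-sensitive.

```java
import java.util.Arrays;
import java.util.List;

final class GffExtensionCheck {
    // Extensions listed in the note above.
    private static final List<String> SUPPORTED =
            Arrays.asList(".gff3", ".gff", ".gff3.gz", ".gff.gz");

    /** Returns true when fileName ends with one of the supported extensions. */
    static boolean hasSupportedExtension(String fileName) {
        for (String ext : SUPPORTED) {
            if (fileName.endsWith(ext)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasSupportedExtension("genes.gff3.gz"));
        System.out.println(hasSupportedExtension("genes.txt"));
    }
}
```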
@NotNull public static java.util.Map<java.lang.String,java.util.TreeMap> loadGFFsToGff3Feature(@NotNull java.lang.String keyFile)
Takes a key file with columns taxon, gffFile. For each taxon, reads the GFF file using the htsjdk GFF feature reader. Returns a map of taxon to associated Gff3Feature entries.
@NotNull public static java.util.Map<java.lang.String,java.lang.String> getTaxonToGffFileMap(@NotNull java.lang.String keyFile)
Reads a key file with columns for "taxon" and "Path/name of gff file". Returns a map of taxon -> fileName.
@NotNull public static java.util.TreeMap<net.maizegenetics.dna.map.Position,java.util.ArrayList> createTreeMapFromFeaturesCenter(@NotNull java.util.List<? extends htsjdk.tribble.gff.Gff3Feature> features)
Creates a mapping from each feature's center position, (posStart + posEnd)/2, to the list of Gff3Features centered at that position.
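The center-keyed map can be illustrated with a small Java sketch. It is a simplification under stated assumptions: the `Feature` class is a hypothetical stand-in for Gff3Feature, the key is a plain integer rather than a TASSEL Position (so the contig is ignored), and integer division is assumed for the center.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class FeatureCenterSketch {
    /** Hypothetical stand-in for Gff3Feature: an id plus inclusive start/end. */
    static final class Feature {
        final String id;
        final int start;
        final int end;

        Feature(String id, int start, int end) {
            this.id = id;
            this.start = start;
            this.end = end;
        }
    }

    /** Groups features by center position, (start + end) / 2, in sorted key order. */
    static TreeMap<Integer, List<Feature>> byCenter(List<Feature> features) {
        TreeMap<Integer, List<Feature>> map = new TreeMap<>();
        for (Feature f : features) {
            int center = (f.start + f.end) / 2; // integer division (assumption)
            map.computeIfAbsent(center, k -> new ArrayList<>()).add(f);
        }
        return map;
    }
}
```

A sorted map keeps the centers in coordinate order, which is what makes later range lookups against haplotype coordinates practical.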
@NotNull public static java.util.Set<htsjdk.tribble.gff.Gff3Feature> makeGffFromPath(@NotNull java.util.List<java.lang.Integer> path, @NotNull java.util.Map<java.lang.String,? extends java.util.TreeMap<net.maizegenetics.dna.map.Position,java.util.ArrayList<htsjdk.tribble.gff.Gff3Feature>>> centerGffs, @NotNull HaplotypeGraph graph, @Nullable java.lang.String outputFile)
This method takes a path (a list of integer haplotype ids), a PHG graph that is based on the haplotypeIds in the path, and an optional output file name.
From the graph it pulls the asm contig and coordinates for each item on the path, finds regions in the Gff3 entries that overlap with the graph haplotype asm coordinate entries, and creates a set of Gff3Feature entries.
If an outputFile is specified, the gff file is written to the specified path.
Return: A set of Gff3Features
public static void writeGffFile(@NotNull java.lang.String outputFile, @NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> features, @Nullable java.util.List<java.lang.String> comments, @Nullable java.util.Set<? extends htsjdk.tribble.gff.SequenceRegion> regions)
Using htsjdk classes, writeGffFile will separately write comments, sequenceRegions, and features. The list of comments (List) and the regions (Set) are optional. The features (Set) are required.
The data is written to a GFF3 formatted file.
@NotNull public static java.util.Set<htsjdk.tribble.gff.Gff3Feature> getOverlappingEntriesFromGff(@NotNull java.lang.String contig, @NotNull kotlin.ranges.IntRange haplotypeRange, @NotNull java.util.TreeMap<net.maizegenetics.dna.map.Position,java.util.ArrayList> asmCenterGffs)
Search the gff entry map for overlaps with the haplotypeNode asm coordinates. Requires as input a map keyed by TASSEL Position object as well as the contig and coordinates from the haplotypeNode.
@NotNull public static kotlin.ranges.IntRange getPseudoGenomeGFFCoordinates(@NotNull kotlin.ranges.IntRange asmGffRange, @NotNull kotlin.ranges.IntRange hapNodeRange, int offset)
Creates new start/end coordinates based on the parts of the haplotype node that intersect with the range for an assembly GFF3 entry. When calculating the new coordinates, the range overlaps and the offset from the start of the pseudo-genome are considered.
Return: IntRange holding the new start/end pseudo-genome coordinates.
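The exact arithmetic is not spelled out here, so the following Java sketch shows only one plausible reading, not the library's implementation. It assumes inclusive coordinates and that `offset` is the pseudo-genome coordinate at which the haplotype node starts. The class and method names are hypothetical.

```java
final class PseudoCoordsSketch {
    /**
     * Intersects the assembly GFF range with the haplotype node range, then
     * shifts the intersection so that hapStart maps to offset (assumption).
     * Returns {newStart, newEnd}.
     */
    static int[] toPseudoGenome(int asmStart, int asmEnd,
                                int hapStart, int hapEnd, int offset) {
        int overlapStart = Math.max(asmStart, hapStart);
        int overlapEnd = Math.min(asmEnd, hapEnd);
        if (overlapStart > overlapEnd) {
            throw new IllegalArgumentException("ranges do not overlap");
        }
        int newStart = offset + (overlapStart - hapStart);
        int newEnd = offset + (overlapEnd - hapStart);
        return new int[] {newStart, newEnd};
    }
}
```

For example, under these assumptions a GFF entry at 100..200 and a haplotype node at 150..300 with offset 1000 overlap at 150..200, which maps to 1000..1050.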
@NotNull public static htsjdk.tribble.gff.Gff3Feature createGffChromosomeEntry(@NotNull java.lang.String prevChrom, int offset)
Create the "chromosome" type line in the gff file
public static int gffSingleIDcount(@NotNull java.lang.String idToMatch, @NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet)
Counts the number of times a single GFF ID appears. User may specify the full ID or just part of it. The code checks for GFF entries where the ID contains the user specified string.
For example, if the user wanted to check for an ID of "Zm00001e000002", this could show up in the GFF3 file attributes column as any of: ID=gene:Zm00001e000002, ID=transcript:Zm00001e000002_T001, ID=Zm00001e000002
Or, there may be no ID, but a Parent attribute, e.g.: Parent=transcript:Zm00001e000002_T001
Based on the above, the code checks each feature first for an ID field in the attributes column and, if none is found, for a Parent field in the attributes column. It then verifies whether the user's "idToMatch" string is contained in the ID or Parent value.
The original code assumed there would always be an ID, but the ID is only required for features that have children; it is optional otherwise.
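The ID-then-Parent lookup described above can be sketched in Java as follows. This is an illustration, not the library code: each feature is reduced to a hypothetical attribute map with a single String value per key, whereas htsjdk stores attribute values as lists.

```java
import java.util.List;
import java.util.Map;

final class SingleIdCountSketch {
    /**
     * Counts features whose ID attribute (or, when ID is absent, Parent
     * attribute) contains idToMatch as a substring.
     */
    static int countMatches(String idToMatch, List<Map<String, String>> featureAttributes) {
        int count = 0;
        for (Map<String, String> attrs : featureAttributes) {
            String value = attrs.get("ID");
            if (value == null) {
                value = attrs.get("Parent"); // fall back only when there is no ID
            }
            if (value != null && value.contains(idToMatch)) {
                count++;
            }
        }
        return count;
    }
}
```

Because the match is a substring test, a short `idToMatch` can also match unrelated IDs that happen to contain it.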
@NotNull public static java.util.Map<java.lang.String,java.lang.Integer> gffAllIDcount(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet, int count)
Takes a set of Gff3Features and creates a mapping of GFF id (ID from the attributes column) to the number of times it appears. The "count" parameter gives the number of leading characters of the ID to keep.
For example: If the user only wants the first 8 characters of the ID field considered, count would be 8. This would allow for counting something like all instances of RST00015*.
If "count" is greater than the length of the ID, the full value is used as the key.
returns: a mapping of ID (truncated to its first "count" characters) to the number of times it occurs.
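The prefix-based counting can be sketched in Java as below. This is a simplified illustration that works on plain ID strings rather than Gff3Feature objects; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IdPrefixCountSketch {
    /** Maps each ID, truncated to its first count characters, to its number of occurrences. */
    static Map<String, Integer> countByPrefix(List<String> ids, int count) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String id : ids) {
            // IDs no longer than count are used in full.
            String key = id.length() > count ? id.substring(0, count) : id;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

With count = 8, the IDs "RST00015a" and "RST00015b" both fall under the key "RST00015".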
@NotNull public static java.util.Map<java.lang.String,java.lang.Integer> getGFFEntriesPerChrom(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet)
Counts the gff entries per chromosome
returns: Map of chromosome -> number of entries
public static int countDistinctID(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet, @Nullable java.lang.String contig)
Counts the number of distinct ID values either across the full GFF3Feature set (if contig==null) or for a specific contig
@NotNull public static java.util.List<java.lang.String> getDistinctGffIds(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet)
Gets the list of distinct GFF IDs in the full Gff3Feature Set. NOTE: what this returns is based on how the GFF3 has the IDs defined. If the user wants a count of something like "Zm000a2" but the ID is "ID=gene:Zm000a2", the result won't be what they want, because "ID=gene:Zm000a2" and "ID=transcript:Zm000a2" are different IDs.
returns: List of Strings, where each String is an ID value
@NotNull public static java.util.Map<java.lang.String,java.util.Set> getDistinctGffIdsByChrom(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet)
Creates map of contig(chrom) to set of distinct Gff3Feature ids associated with that contig
@NotNull public static java.util.Map<java.lang.String,java.lang.Integer> sumPerChromGFFBasePairs(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet)
Calculates the number of base pairs (BPs) covered by GFF entries for each contig/chromosome in the GFF file. Overlapping positions are only counted once.
return: A map of chromosome to number of included positions
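Counting overlapping positions only once amounts to computing the size of the union of intervals per contig. The Java sketch below shows one common way to do this (sort by start, then merge); it is not taken from the library. The `Feature` class is a hypothetical stand-in for Gff3Feature, and coordinates are assumed to be 1-based and inclusive, as in GFF3.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CoveredBasesSketch {
    /** Hypothetical stand-in for Gff3Feature: contig plus inclusive start/end. */
    static final class Feature {
        final String contig;
        final int start;
        final int end;

        Feature(String contig, int start, int end) {
            this.contig = contig;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns contig -> number of positions covered by at least one feature. */
    static Map<String, Integer> coveredBasePairs(List<Feature> features) {
        Map<String, List<int[]>> byContig = new TreeMap<>();
        for (Feature f : features) {
            byContig.computeIfAbsent(f.contig, k -> new ArrayList<>())
                    .add(new int[] {f.start, f.end});
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<int[]>> e : byContig.entrySet()) {
            List<int[]> ranges = e.getValue();
            ranges.sort((a, b) -> Integer.compare(a[0], b[0]));
            int total = 0;
            int curStart = ranges.get(0)[0];
            int curEnd = ranges.get(0)[1];
            for (int[] r : ranges) {
                if (r[0] > curEnd) {
                    // Gap found: close the current merged interval, start a new one.
                    total += curEnd - curStart + 1;
                    curStart = r[0];
                    curEnd = r[1];
                } else {
                    curEnd = Math.max(curEnd, r[1]); // extend the merged interval
                }
            }
            total += curEnd - curStart + 1;
            result.put(e.getKey(), total);
        }
        return result;
    }
}
```

For example, features at 1..100, 50..150 and 200..250 on one contig merge to 1..150 and 200..250, giving 150 + 51 = 201 covered positions.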
@NotNull public static java.util.Map<java.lang.String,java.lang.Double> percentPerChromGFFBasePairs(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet)
@NotNull public static java.util.Map<java.lang.String,java.lang.Integer> sumPerChromNonGFFBasePairs(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet)
Calculates the number of BPs for each contig/chromosome that were not present in the GFF file. For example: If the chromosome is of length 2000, and there are only 850 total base pairs for that chromosome in the GFF, then the non-represented number is 1150
return: a map of chromosome to number of bps NOT represented in the GFF file
@NotNull public static java.util.Map<java.lang.String,java.lang.Double> percentPerChromNonGFFBasePairs(@NotNull java.util.Set<? extends htsjdk.tribble.gff.Gff3Feature> gffSet)
@NotNull public static net.maizegenetics.taxa.TaxaList createTaxaListFromFileOrString(@NotNull java.lang.String taxa)