Package 

Class CreateIntervalsFileFromGffPlugin

  • All Implemented Interfaces:
    java.lang.Runnable , net.maizegenetics.plugindef.Plugin , net.maizegenetics.plugindef.PluginListener , net.maizegenetics.util.ProgressListener

    
    public class CreateIntervalsFileFromGffPlugin
    extends AbstractPlugin
                        

    This class creates the interval files needed to run the GATK haplotype caller, and the csv files needed to load reference sequence into the database. Two sets of files are created: one set has coordinates based only on the reference gene coordinates; the other has gene coordinates plus user-specified flanking regions.

    Algorithm:
      1. Read the GFF file and grab the gene coordinates.
      2. For each chromosome: merge genes that overlap and toss genes that are embedded within another gene. Store the list as mergedGeneList.
      3. Using the mergedGeneList from step 2, create a second per-chromosome coordinate list that includes the flanking regions.
      4. Write files:
         interval format (chrom:start-end):
           a. mergedGeneList
           b. mergedGeneList with flanking
         csv format (chr,anchorstart,anchorend,geneStart,geneEnd,geneName):
           a. mergedGeneList
           b. mergedGeneList with flanking
         debug files: a list of merged genes and a list of embedded genes, written for informational purposes.

    NOTE: the csv files contain the names of all genes contained in an anchor. This data is not stored in the DB. It is included because the biologists have at times asked for it, and this is a good place for it to be stored and retrieved.

    INPUT:
      1. refFile: String: path to the reference genome. Needed to find the size of each chromosome when adding flanking regions to the last entry of a chromosome.
      2. geneFile: String: path to a single file containing gene data for all chromosomes in GFF format, or path to a directory containing per-chromosome files with gene data in GFF format. These files must contain GFF gene data alone, not the full GFF.
      3. outputBase: String: directory, including trailing "/", where output files will be written.
      4. numFlanking: int: number of flanking bps to add on each end of the anchors.

    OUTPUT:
      1. intervals file based on gene coordinates
      2. intervals file based on gene coordinates + numFlanking bps
      3. csv file based on gene coordinates
      4. csv file based on gene coordinates + numFlanking bps
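
    The merge and flanking steps of the algorithm can be illustrated with a minimal sketch. This is not the plugin's actual code: the names GeneIntervalSketch, Range, mergeOverlapping, addFlanking and toInterval are hypothetical, coordinates are assumed to be 1-based and inclusive, and the sketch only clamps flanks to the chromosome bounds. How the plugin resolves flanks that collide between neighbouring anchors is not described here, so that case is left out.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from the plugin.
final class GeneIntervalSketch {

    /** A gene or anchor range on one chromosome (assumed 1-based, inclusive). */
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Merges overlapping genes; embedded genes are absorbed by the enclosing range. */
    static List<Range> mergeOverlapping(List<Range> genes) {
        List<Range> sorted = new ArrayList<>(genes);
        sorted.sort(Comparator.comparingInt((Range r) -> r.start)
                              .thenComparingInt(r -> r.end));
        List<Range> merged = new ArrayList<>();
        for (Range gene : sorted) {
            if (merged.isEmpty()) {
                merged.add(gene);
                continue;
            }
            Range last = merged.get(merged.size() - 1);
            if (gene.start <= last.end) {
                // Overlapping or embedded: keep one range that covers both.
                merged.set(merged.size() - 1,
                           new Range(last.start, Math.max(last.end, gene.end)));
            } else {
                merged.add(gene);
            }
        }
        return merged;
    }

    /** Adds numFlanking bps to each side, clamped to [1, chromLength]. */
    static List<Range> addFlanking(List<Range> merged, int numFlanking, int chromLength) {
        List<Range> flanked = new ArrayList<>();
        for (Range r : merged) {
            int start = Math.max(1, r.start - numFlanking);
            int end = Math.min(chromLength, r.end + numFlanking);
            flanked.add(new Range(start, end));
        }
        return flanked;
    }

    /** Formats a range in the interval format chrom:start-end. */
    static String toInterval(String chrom, Range r) {
        return chrom + ":" + r.start + "-" + r.end;
    }

    public static void main(String[] args) {
        List<Range> genes = List.of(
                new Range(100, 200),   // overlaps the next gene
                new Range(150, 300),
                new Range(160, 180),   // embedded within 150-300
                new Range(500, 600));
        List<Range> merged = mergeOverlapping(genes);
        // Assumed chromosome length of 620, so the last flank is clamped.
        for (Range r : addFlanking(merged, 50, 620)) {
            System.out.println(toInterval("1", r));
        }
        // prints 1:50-350 then 1:450-620
    }
}
```

    In this sketch the chromosome length passed to addFlanking plays the role that refFile plays for the plugin: it caps the flank added after the last gene on a chromosome.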