Package 

Class FindRampSeqContigsInAssemblies

  • All Implemented Interfaces:

    
    public class FindRampSeqContigsInAssemblies
    
                        

    This method takes a fasta of rampSeq short sequences and looks for them in an assembly genome. This is for Dan. It looks for exact matches of the 9000 sequences across all entries in the fasta file, checking both the original sequence and its reverse complement.

    This one works well: it runs each assembly in sequence. When processing an assembly, it parallelizes over every rampSeq contig in the rampSeq map (the file read into a map). This speeds things up considerably compared with parallel processing just over the assemblies. Using indexOf(seq,startPos) still seems quicker than the Knuth-Morris-Pratt method, perhaps because of the overhead of the latter.

    INPUT:
    - fasta of rampSeq short contigs
    - directory path, including trailing /, where the assembly genome fasta files live
    - directory path, including trailing /, to which output files will be written

    OUTPUT:
    - tab-delimited files without headers; the columns are BED file positions (0-based, inclusive/exclusive):
      ContigName AssemblyIDLine startPos endPos Strand

    In the above, Strand indicates whether the forward sequence (as presented in the file) or its reverse complement matched in the assembly file. The start/end positions are 0-based, inclusive/exclusive, as for BED files. One tab-delimited file is generated for each assembly, and the file name reflects the assembly name.
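
    The search described above can be sketched as follows. This is a minimal illustration, not the class's actual code: the class name RampSeqSearchSketch, its method names, and the Strand labels "forward" and "reverse" are all assumptions made for the example, and fasta reading and file writing are left out. It shows repeated String.indexOf(seq, startPos) calls for the original sequence and its reverse complement, parallelized over the entries of the rampSeq map, with each hit formatted as a BED-style line (0-based start, exclusive end).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch class; not part of the documented API.
class RampSeqSearchSketch {

    // Reverse complement of a DNA string; any character other than A/C/G/T becomes N.
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (Character.toUpperCase(seq.charAt(i))) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default:  sb.append('N');
            }
        }
        return sb.toString();
    }

    // Finds every exact occurrence of query in target and returns one
    // tab-delimited line per hit: ContigName AssemblyIDLine startPos endPos Strand.
    static List<String> findHits(String contigName, String query,
                                 String idLine, String target, String strand) {
        List<String> hits = new ArrayList<>();
        if (query.isEmpty()) {
            return hits; // an empty query would otherwise match at every position
        }
        int start = target.indexOf(query, 0);
        while (start >= 0) {
            int end = start + query.length(); // exclusive end, as in BED
            hits.add(String.join("\t", contigName, idLine,
                    String.valueOf(start), String.valueOf(end), strand));
            // Advance by one so overlapping occurrences are also reported.
            start = target.indexOf(query, start + 1);
        }
        return hits;
    }

    // Searches one assembly entry for all rampSeq contigs, in parallel over the map.
    // Strand labels "forward" and "reverse" are assumed for illustration.
    static List<String> searchAssemblyEntry(Map<String, String> rampSeqs,
                                            String idLine, String target) {
        return rampSeqs.entrySet().parallelStream()
                .flatMap(e -> {
                    List<String> hits = new ArrayList<>(
                            findHits(e.getKey(), e.getValue(), idLine, target, "forward"));
                    hits.addAll(findHits(e.getKey(), reverseComplement(e.getValue()),
                            idLine, target, "reverse"));
                    return hits.stream();
                })
                .collect(Collectors.toList());
    }
}
```

    In this sketch, searching the entry "ACGTTTGACCC" for a contig "TTG" reports a forward hit at 4-7, and a contig "GGGTC" reports a reverse hit at 6-11, because its reverse complement "GACCC" occurs there. Note that a sequence equal to its own reverse complement would be reported on both strands here; whether the real class does the same is not stated in the description.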