Package org.broadinstitute.hellbender.utils.tsv
File format description
A tab-separated values file may contain any number of comment lines (starting with "#"), a line containing the column names (the header line), and any number of data lines, one per record.
While comment lines can contain any sequence of characters, the header and data lines are divided into columns using exactly one column-separator character between consecutive values.
Blank lines are treated as having a single column with the empty string as the only value (or column name).
The header line is the first non-comment line, whereas any other non-comment line after it is considered a data line. Comment lines can appear anywhere in the file and their presence is ignored by the reader (TableReader implementations).
The header line values, i.e. the column names, must all be different (otherwise a formatting exception will be thrown), and every data line must have as many values as there are columns in the header line.
Values can be quoted using the quote character. This becomes necessary when the value contains special formatting characters such as a new line, the quote character itself, the column-separator character, or the escape character. Within quotes, special characters must be escaped using the escape character.
Example 1:
  # comment 1
  # comment 2
  CONTIG  START   END     NAME    SAMPLE1 SAMPLE2
  # comment 3
  chr1    123100  123134  tgt_0   100.0   102.0
  chr1    134012  134201  tgt_1   50      12
  # comment 4
  chr2    ...
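For illustration only (this snippet is not from the original page, and it assumes the conventional double quote as the quote character and the backslash as the escape character), a value containing special characters could be written as a quoted, escaped field:

  CONTIG  NAME
  chr1    "the \"best\" target"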
Reading tsv files
You will need to extend the class TableReader, either as a top-level or inner class, and override its createRecord method to map each input data line, wrapped into a DataLine, to your row element class of choice.
Example: a SimpleInterval reader from a tsv file with three columns, CONTIG, START and END:
...
public void doWork(final File inputFile) throws IOException {
    final TableReader<SimpleInterval> reader = new TableReader<SimpleInterval>(inputFile) {

        // Optional (but recommended) check that the columns in the file are the ones expected:
        @Override
        protected void processColumns(final TableColumnCollection columns) {
            if (!columns.containsExactly("CONTIG", "START", "END"))
                throw formatException("Bad column names");
        }

        @Override
        protected SimpleInterval createRecord(final DataLine dataLine) {
            return new SimpleInterval(dataLine.get("CONTIG"),
                                      dataLine.getInt("START"),
                                      dataLine.getInt("END"));
        }
    };

    for (final SimpleInterval interval : reader) {
        // whatever you want to do per interval.
    }
    reader.close();
    ...
}
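As a variant (an illustrative sketch, not part of the original example), createRecord can also access values by column position, matching the positional accessors used in the function-composition example later on this page:

    @Override
    protected SimpleInterval createRecord(final DataLine dataLine) {
        // positions 0, 1 and 2 correspond to CONTIG, START and END in this file.
        return new SimpleInterval(dataLine.get(0), dataLine.getInt(1), dataLine.getInt(2));
    }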
Writing tsv files
You will need to extend the class TableWriter, either as a top-level or inner class, and override its composeLine method to map your record object type to an output line, represented by a DataLine.
The DataLine to be populated is supplied to composeLine by the writer, as the example below shows. The column names are passed, in order, to the writer's constructor.
Example:
public void doWork(final File outputFile) throws IOException {
    final TableWriter<SimpleInterval> writer =
            new TableWriter<SimpleInterval>(outputFile, new TableColumnCollection("CONTIG", "START", "END")) {

        @Override
        protected void composeLine(final SimpleInterval interval, final DataLine dataLine) {
            // we can use append with confidence because we know the column order.
            dataLine.append(interval.getContig())
                    .append(interval.getStart(), interval.getEnd());
        }
    };

    for (final SimpleInterval interval : intervalsToWrite) {
        writer.writeRecord(interval);
    }
    writer.close();
    ...
}
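If you would rather not depend on the column order, values could in principle be assigned by column name instead. The sketch below assumes DataLine exposes set(columnName, value) style accessors; this page does not itself demonstrate them, so treat the exact method names as an assumption to verify against the DataLine API:

    @Override
    protected void composeLine(final SimpleInterval interval, final DataLine dataLine) {
        // assumed name-based setters; check the DataLine API before relying on them.
        dataLine.set("CONTIG", interval.getContig());
        dataLine.set("START", interval.getStart());
        dataLine.set("END", interval.getEnd());
    }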
Readers and Writers using function composition
TableUtils contains methods to create readers and writers without the need to explicitly extend TableReader or TableWriter, by instead specifying their behaviour through lambda functions.
Example of a reader:
final TableReader<SimpleInterval> reader = TableUtils.reader(inputFile,
    (columns, formatExceptionFactory) -> {
        // we check that the columns are what we expect them to be:
        if (!columns.matchesExactly("CONTIG", "START", "END"))
            throw formatExceptionFactory.apply("Bad header");
        // we return the lambda to translate dataLines into intervals.
        return (dataLine) -> new SimpleInterval(dataLine.get(0), dataLine.getInt(1), dataLine.getInt(2));
    });
The lambda that you need to provide may look a bit complicated, but it is not: it receives the columns found in the input and must return another lambda that translates data lines into records given those columns.
Before doing that, it checks whether the columns are the expected ones and in the correct order (always recommended).
The additional formatExceptionFactory parameter allows the reader implementation to correctly report formatting issues.
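As an illustration (a sketch building on the example above, not part of the library documentation), the returned data-line lambda can capture formatExceptionFactory in its closure and use it to report per-line problems as well:

final TableReader<SimpleInterval> reader = TableUtils.reader(inputFile,
    (columns, formatExceptionFactory) -> {
        if (!columns.matchesExactly("CONTIG", "START", "END"))
            throw formatExceptionFactory.apply("Bad header");
        return (dataLine) -> {
            final int start = dataLine.getInt(1);
            final int end = dataLine.getInt(2);
            // report a data problem through the same factory captured from the outer lambda:
            if (end < start)
                throw formatExceptionFactory.apply("END must not precede START");
            return new SimpleInterval(dataLine.get(0), start, end);
        };
    });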
Example of a writer:
final TableWriter<SimpleInterval> writer = TableUtils.writer(outputFile,
    new TableColumnCollection("CONTIG", "START", "END"),
    (interval, dataLine) -> {
        dataLine.append(interval.getContig())
                .append(interval.getStart(), interval.getEnd());
    });
The writer case is far simpler, as there is no need to report formatting errors: we are the ones producing the file.
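Putting it together (a sketch under the same assumptions as the example above, and assuming TableWriter is Closeable, consistent with the explicit close() calls shown earlier), the lambda-based writer is then used like any other TableWriter:

try (final TableWriter<SimpleInterval> writer = TableUtils.writer(outputFile,
        new TableColumnCollection("CONTIG", "START", "END"),
        (interval, dataLine) -> dataLine.append(interval.getContig())
                                        .append(interval.getStart(), interval.getEnd()))) {
    for (final SimpleInterval interval : intervalsToWrite) {
        writer.writeRecord(interval);   // one data line per interval
    }
}   // the writer is closed automatically, even if writeRecord throws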
Classes

Class                   Description
DataLine                Table data-line string array wrapper.
SimpleXSVWriter         A simple TSV/CSV/XSV writer with support for writing in the cloud with configurable delimiter.
TableColumnCollection   Represents a list of table columns.
TableReader<R>          Reads the contents of a tab separated value formatted text input into records of an arbitrary type R.
TableUtils              Common constants for table readers and writers.
TableWriter<R>          Class to write tab separated value files.