Package org.broadinstitute.hellbender.utils.tsv
File format description
A tab-separated values file may contain any number of comment lines (starting with "#"), a line containing the column names (the header line), and any number of data lines, one per record.
While comment lines can contain any sequence of characters, the header and data lines are divided into columns using exactly one column-separator character between consecutive values.
Blank lines are treated as having a single column with the empty string as the only value (or column name).
The header line is the first non-comment line, whereas any other non-comment line after it is considered a data line. Comment lines can appear anywhere in the file and their presence is ignored by the reader (TableReader implementations).
The header line values, i.e. the column names, must all be different (otherwise a formatting exception will be thrown), and every data line must have as many values as there are columns in the header line.
Values can be quoted using the quote character. This becomes necessary when the value contains special formatting characters such as a new line, the quote character itself, the column-separator character, or the escape character. Within quotes, special characters must be escaped using the escape character.
Example 1:
  # comment 1
  # comment 2
  CONTIG  START   END     NAME    SAMPLE1 SAMPLE2
  # comment 3
  chr1    123100  123134  tgt_0   100.0   102.0
  chr1    134012  134201  tgt_1   50      12
  # comment 4
  chr2    ...
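For illustration only (this snippet is not from the original page, and it assumes the conventional double quote as the quote character and the backslash as the escape character), a value containing special characters could be written as a quoted, escaped field:

  CONTIG  NAME
  chr1    "the \"best\" target"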
Reading tsv files
You will need to extend the class TableReader, either as a top-level or inner class, and override its createRecord method to map each input data line, wrapped into a DataLine, to your row element class of choice.
Example: a SimpleInterval reader from a tsv file with three columns, CONTIG, START and END:
...
public void doWork(final File inputFile) throws IOException {
    final TableReader<SimpleInterval> reader = new TableReader<SimpleInterval>(inputFile) {

        // Optional (but recommended) check that the columns in the file are the ones expected:
        @Override
        protected void processColumns(final TableColumnCollection columns) {
            if (!columns.containsExactly("CONTIG", "START", "END"))
                throw formatException("Bad column names");
        }

        @Override
        protected SimpleInterval createRecord(final DataLine dataLine) {
            return new SimpleInterval(dataLine.get("CONTIG"),
                                      dataLine.getInt("START"),
                                      dataLine.getInt("END"));
        }
    };

    for (final SimpleInterval interval : reader) {
        // whatever you want to do per interval.
    }
    reader.close();
    ...
}
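As a variant (an illustrative sketch, not part of the original example), createRecord can also access values by column position, matching the positional accessors used in the function-composition example later on this page:

    @Override
    protected SimpleInterval createRecord(final DataLine dataLine) {
        // positions 0, 1 and 2 correspond to CONTIG, START and END in this file.
        return new SimpleInterval(dataLine.get(0), dataLine.getInt(1), dataLine.getInt(2));
    }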
Writing tsv files
You will need to extend the class TableWriter, either as a top-level or inner class, and override its composeLine method to map your record object type to an output line, represented by a DataLine.
The DataLine to be populated is supplied to composeLine by the writer, as the example below shows. The column names are passed, in order, to the writer's constructor.
Example:
public void doWork(final File outputFile) throws IOException {
    final TableWriter<SimpleInterval> writer =
            new TableWriter<SimpleInterval>(outputFile, new TableColumnCollection("CONTIG", "START", "END")) {

        @Override
        protected void composeLine(final SimpleInterval interval, final DataLine dataLine) {
            // we can use append with confidence because we know the column order.
            dataLine.append(interval.getContig())
                    .append(interval.getStart(), interval.getEnd());
        }
    };

    for (final SimpleInterval interval : intervalsToWrite) {
        writer.writeRecord(interval);
    }
    writer.close();
    ...
}
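If you would rather not depend on the column order, values could in principle be assigned by column name instead. The sketch below assumes DataLine exposes set(columnName, value) style accessors; this page does not itself demonstrate them, so treat the exact method names as an assumption to verify against the DataLine API:

    @Override
    protected void composeLine(final SimpleInterval interval, final DataLine dataLine) {
        // assumed name-based setters; check the DataLine API before relying on them.
        dataLine.set("CONTIG", interval.getContig());
        dataLine.set("START", interval.getStart());
        dataLine.set("END", interval.getEnd());
    }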
Readers and Writers using function composition
TableUtils contains methods to create readers and writers without the need to explicitly extend TableReader or TableWriter, by instead specifying their behaviour through lambda functions.
Example of a reader:
final TableReader<SimpleInterval> reader = TableUtils.reader(inputFile,
    (columns, formatExceptionFactory) -> {
        // we check that the columns are what we expect them to be:
        if (!columns.matchesExactly("CONTIG", "START", "END"))
            throw formatExceptionFactory.apply("Bad header");
        // we return the lambda to translate dataLines into intervals.
        return (dataLine) -> new SimpleInterval(dataLine.get(0), dataLine.getInt(1), dataLine.getInt(2));
    });
The lambda that you need to provide may look a bit complicated, but it is not: it receives the columns found in the input and must return another lambda that translates data lines into records given those columns.
Before doing that, it checks whether the columns are the expected ones and in the correct order (always recommended).
The additional formatExceptionFactory parameter allows the reader implementation to correctly report formatting issues.
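As an illustration (a sketch building on the example above, not part of the library documentation), the returned data-line lambda can capture formatExceptionFactory in its closure and use it to report per-line problems as well:

final TableReader<SimpleInterval> reader = TableUtils.reader(inputFile,
    (columns, formatExceptionFactory) -> {
        if (!columns.matchesExactly("CONTIG", "START", "END"))
            throw formatExceptionFactory.apply("Bad header");
        return (dataLine) -> {
            final int start = dataLine.getInt(1);
            final int end = dataLine.getInt(2);
            // report a data problem through the same factory captured from the outer lambda:
            if (end < start)
                throw formatExceptionFactory.apply("END must not precede START");
            return new SimpleInterval(dataLine.get(0), start, end);
        };
    });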
Example of a writer:
final TableWriter<SimpleInterval> writer = TableUtils.writer(outputFile,
    new TableColumnCollection("CONTIG", "START", "END"),
    (interval, dataLine) -> {
        dataLine.append(interval.getContig())
                .append(interval.getStart(), interval.getEnd());
    });
The writer case is far simpler, as there is no need to report formatting errors: we are the ones producing the file.
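Putting it together (a sketch under the same assumptions as the example above, and assuming TableWriter is Closeable, consistent with the explicit close() calls shown earlier), the lambda-based writer is then used like any other TableWriter:

try (final TableWriter<SimpleInterval> writer = TableUtils.writer(outputFile,
        new TableColumnCollection("CONTIG", "START", "END"),
        (interval, dataLine) -> dataLine.append(interval.getContig())
                                        .append(interval.getStart(), interval.getEnd()))) {
    for (final SimpleInterval interval : intervalsToWrite) {
        writer.writeRecord(interval);   // one data line per interval
    }
}   // the writer is closed automatically, even if writeRecord throws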
Classes

Class                   Description
DataLine                Table data-line string array wrapper.
SimpleXSVWriter         A simple TSV/CSV/XSV writer with support for writing in the cloud with configurable delimiter.
TableColumnCollection   Represents a list of table columns.
TableReader<R>          Reads the contents of a tab separated value formatted text input into records of an arbitrary type R.
TableUtils              Common constants for table readers and writers.
TableWriter<R>          Class to write tab separated value files.