Class StreamingPythonScriptExecutor<T>

Type Parameters:
T - type of data that will be streamed to the Python process

public class StreamingPythonScriptExecutor<T> extends PythonExecutorBase
Python executor used to interact with a cooperative, keep-alive Python process. The executor issues commands to call Python functions in the tool module in the gatktool Python package. These include functions for managing an acknowledgement FIFO that is used to signal completion of Python commands, and a data FIFO that can be used to stream data to Python.

Typical sequence of use:
- construct the executor
- start the remote process (start(java.util.List<java.lang.String>))
- optionally call initStreamWriter to initialize and create a data transfer fifo
- send one or more synchronous or asynchronous commands to be executed in Python
- optionally send data of type T one or more times through the async writer
- execute Python code to close the data fifo
- terminate the executor (terminate())

Guidelines for writing GATK tools that use Python interactively:
- Program correctness should not rely on consumption of anything written by Python to stdout/stderr. All data should be transferred through the stream writer or a file.
- Python code should write errors to stderr.
- Prefer single-line commands that run a script over multi-line Python code embedded in Java.
- Terminate commands with a newline.
- Try not to be chatty (maximize use of the fifo buffer by writing to it in batches before reading from Python).
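The call order described above can be sketched without a Python process. The RecordingExecutor class below is a hypothetical stand-in, not the GATK class: it borrows a subset of the documented method names, simplifies their return types to void or boolean, and only records each call. The Python module and function names in the command strings (my_module, consume, close_data_fifo) are likewise assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in: mirrors documented method names but only records the calls it receives.
class RecordingExecutor<T> {
    final List<String> calls = new ArrayList<>();

    boolean start(List<String> pythonProcessArgs) { calls.add("start"); return true; }
    void initStreamWriter(Function<T, ByteArrayOutputStream> itemSerializer) { calls.add("initStreamWriter"); }
    void sendSynchronousCommand(String line) { calls.add("sync:" + line.trim()); }
    void startBatchWrite(String pythonCommand, List<T> batchList) { calls.add("batch:" + batchList.size()); }
    void waitForPreviousBatchCompletion() { calls.add("waitForPreviousBatchCompletion"); }
    void terminate() { calls.add("terminate"); }
}

class LifecycleSketch {
    // Walks through the documented sequence and returns the recorded call order.
    static List<String> run() {
        RecordingExecutor<String> executor = new RecordingExecutor<>();
        executor.start(new ArrayList<>());

        // Serializer for the data fifo: one UTF-8 encoded item per call.
        executor.initStreamWriter(item -> {
            byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
            ByteArrayOutputStream out = new ByteArrayOutputStream(bytes.length);
            out.write(bytes, 0, bytes.length);
            return out;
        });

        // Every command ends with a newline; the Python names are hypothetical.
        executor.sendSynchronousCommand("import my_module\n");
        executor.startBatchWrite("my_module.consume()\n", Arrays.asList("a", "b"));
        executor.waitForPreviousBatchCompletion();
        executor.sendSynchronousCommand("my_module.close_data_fifo()\n");
        executor.terminate();
        return executor.calls;
    }
}
```

With the real executor, the same sequence would apply, with the synchronous calls returning ProcessOutput as documented in the method details.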
  • Constructor Details

    • StreamingPythonScriptExecutor

      public StreamingPythonScriptExecutor(boolean ensureExecutableExists)
      The start method must be called to actually start the remote executable.
      Parameters:
      ensureExecutableExists - throw if the python executable cannot be located
    • StreamingPythonScriptExecutor

      public StreamingPythonScriptExecutor(PythonExecutorBase.PythonExecutableName pythonExecutableName, boolean ensureExecutableExists)
      The start method must be called to actually start the remote executable.
      Parameters:
      pythonExecutableName - name of the python executable to start
      ensureExecutableExists - throw if the python executable cannot be found
  • Method Details

    • start

      public boolean start(List<String> pythonProcessArgs)
      Start the Python process.
      Parameters:
      pythonProcessArgs - args to be passed to the python process
      Returns:
      true if the process is successfully started
    • start

      public boolean start(List<String> pythonProcessArgs, boolean enableJournaling, File profileResults)
      Start the Python process.
      Parameters:
      pythonProcessArgs - args to be passed to the python process
      enableJournaling - true to enable journaling, which records all interprocess IO to a file. This is expensive and should only be used for debugging purposes.
      Returns:
      true if the process is successfully started
    • sendSynchronousCommand

      public ProcessOutput sendSynchronousCommand(String line)
      Send a command to Python and wait for an ack, returning all accumulated output since the last call to either sendSynchronousCommand(String) or getAccumulatedOutput(). This is a blocking call: if no acknowledgment is received from the remote process, it will block indefinitely. If an exception is raised in the Python code, or a negative acknowledgment is received, a PythonScriptExecutorException will be thrown. The caller is required to terminate commands with the correct number of newline(s) as appropriate for the command being issued. Since white space is significant in Python, failure to do so properly can leave the Python parser blocked waiting for more newlines to terminate indented code blocks.
      Parameters:
      line - data to be sent to the remote process
      Returns:
      ProcessOutput
      Throws:
      UserException - if a timeout occurs
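The newline requirement can be illustrated with a small helper. This is a minimal sketch, not part of the GATK API: the class and method names are hypothetical, and it assumes the Python process reads commands the way an interactive interpreter does, where a single-line statement ends with one newline and an indented block is closed by an additional blank line.

```java
import java.util.List;

// Hypothetical helper that builds command strings with the newlines each form needs.
class CommandNewlines {
    // A single-line statement needs exactly one trailing newline.
    static String singleLine(String statement) {
        return statement.endsWith("\n") ? statement : statement + "\n";
    }

    // An indented block needs a newline after every line plus one blank line to close the block.
    static String block(String header, List<String> bodyLines) {
        StringBuilder sb = new StringBuilder(header).append('\n');
        for (String line : bodyLines) {
            sb.append("    ").append(line).append('\n');
        }
        return sb.append('\n').toString();
    }
}
```

For example, singleLine("x = 1") yields "x = 1\n", while a for loop built with block ends in two newlines so the parser is not left waiting for more input.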
    • sendAsynchronousCommand

      public void sendAsynchronousCommand(String line)
      Send a command to the remote process without waiting for a response. This method should only be used for commands that will block the remote process. NOTE: Before executing further synchronous statements after calling this method, getAccumulatedOutput should be called to enforce a synchronization point. The caller is required to terminate commands with the correct number of newline(s) as appropriate for the command being issued. Since white space is significant in Python, failure to do so properly can leave the Python parser blocked waiting for more newlines to terminate indented code blocks.
      Parameters:
      line - data to send to the remote process
    • waitForAck

      public ProcessOutput waitForAck()
      Wait for an acknowledgement (which must have been previously requested).
      Returns:
      ProcessOutput when positive acknowledgement (ack) has been received, otherwise throws
      Throws:
      PythonScriptExecutorException - if a negative acknowledgement (nck) was received
    • getApproximateCommandLine

      public String getApproximateCommandLine()
      Return a (not necessarily executable) string representing the current command line for this executor for error reporting purposes.
      Specified by:
      getApproximateCommandLine in class PythonExecutorBase
      Returns:
      A string representing the command line used for this executor.
    • initStreamWriter

      public void initStreamWriter(Function<T,ByteArrayOutputStream> itemSerializer)
      Initialize the stream writer that serializes and writes batches of items of type T on a background thread.
      Parameters:
      itemSerializer - Function that accepts items of type T and converts them to a ByteArrayOutputStream that is subsequently written to the stream
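As an illustration of the itemSerializer parameter, the sketch below shows one possible Function<String, ByteArrayOutputStream>. It assumes T is String and that the Python side expects newline-delimited UTF-8 text; both are assumptions made for this example, and the class and field names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

class LineSerializerSketch {
    // Encodes one String item as UTF-8 bytes followed by a newline delimiter.
    static final Function<String, ByteArrayOutputStream> LINE_SERIALIZER = item -> {
        byte[] bytes = (item + "\n").getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(bytes.length);
        out.write(bytes, 0, bytes.length);
        return out;
    };
}
```

A function of this shape could be passed to initStreamWriter; a delimiter (or a fixed-width record) lets the consuming Python code tell where one item ends and the next begins.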
    • startBatchWrite

      public void startBatchWrite(String pythonCommand, List<T> batchList)
      Request that a batch of items be written to the stream on a background thread. Any previously requested batch must have already been completed and retrieved via waitForPreviousBatchCompletion().
      Parameters:
      pythonCommand - command that will be executed asynchronously to consume the data written to the stream
      batchList - a list of items to be written
    • waitForPreviousBatchCompletion

      public Future<Integer> waitForPreviousBatchCompletion()
      Waits for a batch that was previously initiated via startBatchWrite(String, List) to complete, flushes the target stream and returns the corresponding completed Future. The Future representing a given batch can only be obtained via this method once. If no work is outstanding, and/or the previous batch has already been retrieved, null is returned.
      Returns:
      returns null if no previous work to complete, otherwise a completed Future
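The "retrieve once, then null" contract described above can be sketched with plain java.util.concurrent types. This is not the GATK implementation: SingleBatchWriter is a hypothetical class, the background task merely counts the items instead of writing them to a fifo, and the pythonCommand argument is omitted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a writer that allows one outstanding batch at a time.
class SingleBatchWriter {
    private final ExecutorService background = Executors.newSingleThreadExecutor();
    private Future<Integer> previousBatch;

    // Starts a batch on the background thread; the previous batch must already have been retrieved.
    void startBatchWrite(List<String> batchList) {
        if (previousBatch != null) {
            throw new IllegalStateException("Previous batch has not been retrieved");
        }
        Callable<Integer> task = () -> batchList.size(); // stand-in for writing the items
        previousBatch = background.submit(task);
    }

    // Blocks until the outstanding batch completes and hands its Future out exactly once.
    Future<Integer> waitForPreviousBatchCompletion() throws InterruptedException, ExecutionException {
        if (previousBatch == null) {
            return null;
        }
        previousBatch.get();
        Future<Integer> completed = previousBatch;
        previousBatch = null;
        return completed;
    }

    void shutdown() { background.shutdown(); }

    // Runs one batch, then asks for the result twice to show the second call returns null.
    static String demo() {
        SingleBatchWriter writer = new SingleBatchWriter();
        try {
            writer.startBatchWrite(Arrays.asList("a", "b", "c"));
            Future<Integer> first = writer.waitForPreviousBatchCompletion();
            Future<Integer> second = writer.waitForPreviousBatchCompletion();
            return first.get() + "," + second;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            writer.shutdown();
        }
    }
}
```

Clearing the stored Future after retrieval is what makes the second call return null, and it also lets startBatchWrite reject a new batch while an earlier one is still unretrieved.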
    • getProcess

      protected Process getProcess()
      Get the Process object associated with this executor. For testing only.
      Returns:
      the Process object associated with this executor
    • terminate

      public void terminate()
      Terminate the remote process, closing the fifo if any.
    • getAccumulatedOutput

      public ProcessOutput getAccumulatedOutput()
      Return all data accumulated since the last call to getAccumulatedOutput() (either directly, or indirectly through sendSynchronousCommand(java.lang.String)). Note that the output returned is somewhat non-deterministic, in that there is no guarantee that all of the output from the previous command has been flushed at the time this call is made.
      Returns:
      ProcessOutput containing all accumulated output from stdout/stderr
      Throws:
      UserException - if a timeout occurs waiting for output
      PythonScriptExecutorException - if a traceback is detected in the output