Class BucketUtils

java.lang.Object
org.broadinstitute.hellbender.utils.gcs.BucketUtils

public final class BucketUtils extends Object
Utilities for dealing with Google Cloud Storage (GCS) buckets.
  • Field Details

  • Method Details

    • isGcsUrl

      public static boolean isGcsUrl(String path)
      Parameters:
      path - path to inspect
      Returns:
      true if this path represents a gcs location
    • isGcsUrl

      public static boolean isGcsUrl(GATKPath pathSpec)
      Return true if this GATKPath represents a gcs URI.
      Parameters:
      pathSpec - specifier to inspect
      Returns:
      true if this GATKPath represents a gcs URI.
    • isEligibleForPrefetching

      public static boolean isEligibleForPrefetching(GATKPath pathSpec)
      Parameters:
      pathSpec - specifier to inspect
      Returns:
      true if this GATKPath represents a remote storage system which may benefit from prefetching (gcs or http(s))
    • isEligibleForPrefetching

      public static boolean isEligibleForPrefetching(Path path)
      Parameters:
      path - path to inspect
      Returns:
      true if this Path represents a remote storage system which may benefit from prefetching (gcs or http(s))
    • isHttpUrl

      public static boolean isHttpUrl(String path)
      Returns:
      true if the given path is an http or https URL.
    • isHadoopUrl

      public static boolean isHadoopUrl(String path)
      Returns true if the given path is an HDFS (Hadoop filesystem) URL.
    • isRemoteStorageUrl

      public static boolean isRemoteStorageUrl(String path)
      Returns true if the given path is a GCS, HDFS (Hadoop filesystem), or HTTP(S) URL.
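
      The URL predicates above can be pictured as plain scheme-prefix checks. The following is a minimal sketch under that assumption, not the GATK implementation: the class name UrlSchemeSketch is hypothetical, and the handling of null and of upper-case schemes is a guess.

```java
// Hypothetical stand-in for illustration; not the GATK implementation.
final class UrlSchemeSketch {
    private UrlSchemeSketch() {}

    static boolean isGcsUrl(String path) {
        // Assumption: null is treated as "not a GCS URL" rather than rejected.
        return path != null && path.startsWith("gs://");
    }

    static boolean isHttpUrl(String path) {
        return path != null && (path.startsWith("http://") || path.startsWith("https://"));
    }

    static boolean isHadoopUrl(String path) {
        return path != null && path.startsWith("hdfs://");
    }

    static boolean isRemoteStorageUrl(String path) {
        // Remote means any of the three schemes named in the documentation.
        return isGcsUrl(path) || isHadoopUrl(path) || isHttpUrl(path);
    }

    public static void main(String[] args) {
        System.out.println(isRemoteStorageUrl("gs://bucket/sample.bam"));
        System.out.println(isRemoteStorageUrl("relative/sample.bam"));
    }
}
```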
    • makeFilePathAbsolute

      public static String makeFilePathAbsolute(String path)
      Changes relative local file paths to be absolute file paths. Paths with a scheme are left unchanged.
      Parameters:
      path - the path
      Returns:
      an absolute file path if the original path was a relative file path, otherwise the original path
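
      As an illustration of the rule "paths with a scheme are left unchanged", here is a minimal sketch. It assumes that a scheme can be detected by the presence of "://", which is a simplification; the class name AbsolutePathSketch is hypothetical and the real method may detect schemes differently.

```java
import java.nio.file.Paths;

// Hypothetical stand-in for illustration; not the GATK implementation.
final class AbsolutePathSketch {
    private AbsolutePathSketch() {}

    static String makeFilePathAbsolute(String path) {
        // Assumption: anything containing "://" carries a scheme and is returned as-is.
        if (path.contains("://")) {
            return path;
        }
        // Relative local paths are resolved against the current working directory;
        // paths that are already absolute come back unchanged.
        return Paths.get(path).toAbsolutePath().toString();
    }

    public static void main(String[] args) {
        System.out.println(makeFilePathAbsolute("gs://bucket/sample.vcf"));
        System.out.println(makeFilePathAbsolute("data/sample.vcf"));
    }
}
```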
    • openFile

      public static InputStream openFile(String path)
      Opens a file for reading regardless of whether it's on GCS, HDFS or local disk. If the file name ends with .gz, this method attempts to wrap the stream in an appropriate unzipping stream.
      Parameters:
      path - the GCS, HDFS or local path to read from. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
      Returns:
      an InputStream that reads from the specified file.
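
      The ".gz" handling described for openFile can be sketched for the local-disk case only. This is an assumption-laden illustration using the JDK's GZIPInputStream: the class OpenFileSketch and the method openLocalFile are hypothetical names, and the real method may choose a different unzipping stream and also handles GCS and HDFS paths.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical local-only stand-in; not the GATK implementation.
final class OpenFileSketch {
    private OpenFileSketch() {}

    /** Opens a local file, decompressing it when the name ends with ".gz". */
    static InputStream openLocalFile(String path) throws IOException {
        InputStream raw = new BufferedInputStream(Files.newInputStream(Paths.get(path)));
        if (!path.endsWith(".gz")) {
            return raw;
        }
        try {
            return new GZIPInputStream(raw);
        } catch (IOException e) {
            raw.close(); // do not leak the underlying stream if the gzip header is invalid
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small gzip file, then read it back through the sketch.
        Path tmp = Files.createTempFile("open-file-sketch", ".txt.gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(tmp))) {
            out.write("hello".getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(openLocalFile(tmp.toString()), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        } finally {
            Files.delete(tmp);
        }
    }
}
```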
    • createFile

      public static OutputStream createFile(String path)
      Open a binary file for writing regardless of whether it's on GCS, HDFS or local disk. For writing to GCS it'll use the application/octet-stream MIME type.
      Parameters:
      path - the GCS, HDFS or local path to write to. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
      Returns:
      an OutputStream that writes to the specified file.
    • copyFile

      public static void copyFile(String sourcePath, String destPath) throws IOException
      Copies a file. Can be used to copy e.g. from GCS to local.
      Parameters:
      sourcePath - the path to read from. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
      destPath - the path to copy to. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
      Throws:
      IOException
    • deleteFile

      public static void deleteFile(String pathToDelete) throws IOException
      Deletes a file: local, GCS or HDFS.
      Parameters:
      pathToDelete - the path to delete. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
      Throws:
      IOException
    • getTempFilePath

      public static String getTempFilePath(String prefix, String extension)
      Gets a temporary file path based on the prefix and extension provided. This file (and any indexes associated with it) will be scheduled for deletion on shutdown.
      Parameters:
      prefix - a prefix for the file name. For remote paths this should be a valid URI to root the temporary file in (e.g. gs://hellbender/staging/). There is no guarantee that this will be used as the root of the tmp file name; a local prefix may be placed in the tmp folder, for example.
      extension - an extension for the temporary file path; the resulting path will end in this
      Returns:
      a path to use as a temporary file. On remote file systems which don't support an atomic tmp file reservation, a path is chosen with a long randomized name.
    • randomRemotePath

      public static String randomRemotePath(String stagingLocation, String prefix, String suffix)
      Picks a random name, by putting some random letters between "prefix" and "suffix".
      Parameters:
      stagingLocation - The folder where you want the file to be. Must start with "gs://" or "hdfs://"
      prefix - The beginning of the file name
      suffix - The end of the file name, e.g. ".tmp"
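
      A minimal sketch of the naming scheme follows. It assumes the random part is a UUID, which is a stand-in for the "random letters" mentioned above, and that the staging location may or may not end with a slash; the class name RandomRemotePathSketch is hypothetical.

```java
import java.util.UUID;

// Hypothetical stand-in for illustration; not the GATK implementation.
final class RandomRemotePathSketch {
    private RandomRemotePathSketch() {}

    static String randomRemotePath(String stagingLocation, String prefix, String suffix) {
        if (!stagingLocation.startsWith("gs://") && !stagingLocation.startsWith("hdfs://")) {
            throw new IllegalArgumentException(
                    "Staging location must start with gs:// or hdfs://: " + stagingLocation);
        }
        // Assumption: normalize the folder so exactly one '/' precedes the file name.
        String folder = stagingLocation.endsWith("/") ? stagingLocation : stagingLocation + "/";
        // Assumption: a UUID supplies the random part between prefix and suffix.
        return folder + prefix + UUID.randomUUID() + suffix;
    }

    public static void main(String[] args) {
        System.out.println(randomRemotePath("gs://hellbender/staging", "shard-", ".tmp"));
    }
}
```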
    • fileExists

      public static boolean fileExists(String path)
      Returns true if we can read the first byte of the file.
      Parameters:
      path - The path of the file to check (local, GCS or HDFS).
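
      The "read the first byte" check can be sketched for local files as shown here. This is an illustration only: the class name FileExistsSketch is hypothetical, and treating every IOException as "does not exist" is an assumption about error handling that the real method may not share.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical local-only stand-in; not the GATK implementation.
final class FileExistsSketch {
    private FileExistsSketch() {}

    static boolean fileExists(String path) {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            // The byte itself is ignored; the point is that the read does not fail.
            // An empty file returns -1 without throwing, so it still counts as existing.
            in.read();
            return true;
        } catch (IOException e) {
            // Assumption: any I/O failure is reported as "not there".
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fileExists("no-such-file-" + System.nanoTime() + ".txt"));
    }
}
```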
    • fileSize

      public static long fileSize(String path) throws IOException
      Returns the file size of a file pointed to by a GCS/HDFS/local path
      Parameters:
      path - The URL to the file whose size to return
      Returns:
      the file size in bytes
      Throws:
      IOException
    • dirSize

      public static long dirSize(GATKPath pathSpecifier)
      Returns the total file size of all files in a directory, or the file size if the path specifies a file. Note that sub-directories are ignored - they are not recursed into. Only supports HDFS and local paths.
      Parameters:
      pathSpecifier - The URL to the file or directory whose size to return
      Returns:
      the total size of all files in bytes
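
      The non-recursive behaviour of dirSize is easy to misread, so here is a minimal local-only sketch of it. It takes a java.nio.file.Path rather than a GATKPath, the class name DirSizeSketch is hypothetical, and HDFS handling is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical local-only stand-in; not the GATK implementation.
final class DirSizeSketch {
    private DirSizeSketch() {}

    /** Size of a file, or the summed size of the regular files directly inside a directory. */
    static long dirSize(Path path) {
        if (Files.isRegularFile(path)) {
            return sizeOf(path);
        }
        // Files.list returns only direct children, so sub-directories are not recursed into.
        try (Stream<Path> children = Files.list(path)) {
            return children.filter(Files::isRegularFile)
                    .mapToLong(DirSizeSketch::sizeOf)
                    .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dirSize(Paths.get(".")));
    }
}
```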
    • isFileUrl

      public static boolean isFileUrl(String path)
    • getBucket

      public static String getBucket(String path)
      Given a path of the form "gs://bucket/folder/folder/file", returns "bucket".
    • getPathWithoutBucket

      public static String getPathWithoutBucket(String path)
      Given a path of the form "gs://bucket/folder/folder/file", returns "folder/folder/file".
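
      The two examples for getBucket and getPathWithoutBucket amount to splitting a gs:// path at the first slash after the bucket name. A minimal sketch of that split follows; the class name GcsPathPartsSketch is hypothetical, and the behaviour for inputs without the gs:// prefix or without an object path is an assumption.

```java
// Hypothetical stand-in for illustration; not the GATK implementation.
final class GcsPathPartsSketch {
    private static final String GCS_PREFIX = "gs://";

    private GcsPathPartsSketch() {}

    static String getBucket(String path) {
        String rest = stripPrefix(path);
        int slash = rest.indexOf('/');
        // Everything up to the first slash is the bucket name.
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    static String getPathWithoutBucket(String path) {
        String rest = stripPrefix(path);
        int slash = rest.indexOf('/');
        // Assumption: a bare bucket with no object path yields an empty string.
        return slash < 0 ? "" : rest.substring(slash + 1);
    }

    private static String stripPrefix(String path) {
        // Assumption: inputs that are not gs:// paths are rejected outright.
        if (!path.startsWith(GCS_PREFIX)) {
            throw new IllegalArgumentException("Not a gs:// path: " + path);
        }
        return path.substring(GCS_PREFIX.length());
    }

    public static void main(String[] args) {
        String path = "gs://bucket/folder/folder/file";
        System.out.println(getBucket(path));            // prints "bucket"
        System.out.println(getPathWithoutBucket(path)); // prints "folder/folder/file"
    }
}
```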
    • setGlobalNIODefaultOptions

      public static void setGlobalNIODefaultOptions(int maxReopens, String requesterProject)
      Sets max_reopens, requester_pays, and generous timeouts as the global default. These will apply even to library code that creates its own paths to access with NIO.
      Parameters:
      maxReopens - If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection.
      requesterProject - Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
    • getPathOnGcs

      public static Path getPathOnGcs(String gcsUrl)
      Converts a String to a Path. This *should* not be necessary (use Paths.get(URI.create(...)) instead), but it currently is on Spark because using the fat, shaded jar breaks the registration of the GCS FilesystemProvider. To transform other types of string URLs into Paths, use IOUtils.getPath instead.
    • getCloudStorageConfiguration

      public static com.google.cloud.storage.contrib.nio.CloudStorageConfiguration getCloudStorageConfiguration(int maxReopens, String requesterProject)
      The config we want to use.
      Parameters:
      maxReopens - If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection.
      requesterProject - Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
    • getAuthenticatedGcs

      public static FileSystem getAuthenticatedGcs(String projectId, String bucket, byte[] credentials) throws IOException
      Gets an authenticated GCS-backed NIO FileSystem object representing the selected project and bucket. Credentials are found automatically when running on Compute/App Engine, when logged into gcloud, or if the GOOGLE_APPLICATION_CREDENTIALS environment variable is set. In that case leave credentials null. Otherwise, you must pass the contents of the service account credentials file. See https://github.com/GoogleCloudPlatform/gcloud-java#authentication. Note that most of the time it's enough to just open a file via Files.newInputStream(Paths.get(URI.create( path ))).
      Throws:
      IOException
    • addPrefetcher

      public static SeekableByteChannel addPrefetcher(int bufferSizeMB, SeekableByteChannel channel)
      Wrap a SeekableByteChannel with a prefetcher.
      Parameters:
      bufferSizeMB - buffer size in MB which the prefetcher should fetch ahead.
      channel - a channel that needs prefetching
    • getPrefetchingWrapper

      public static Function<SeekableByteChannel,SeekableByteChannel> getPrefetchingWrapper(int cloudPrefetchBuffer)
      Creates a wrapping function which adds a prefetcher if the buffer size is > 0. If the buffer size is <= 0, the wrapper returns the original channel.
      Parameters:
      cloudPrefetchBuffer - the prefetcher buffer size in MB
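
      The conditional wrapping can be sketched with plain java.util.function types. Because the actual prefetcher class is not shown on this page, the sketch takes the wrapping step as a hypothetical BiFunction parameter named addPrefetcher; the class name PrefetchWrapperSketch and that extra parameter are assumptions for illustration, not part of the documented signature.

```java
import java.nio.channels.SeekableByteChannel;
import java.util.function.BiFunction;
import java.util.function.Function;

// Hypothetical stand-in for illustration; not the GATK implementation.
final class PrefetchWrapperSketch {
    private PrefetchWrapperSketch() {}

    static Function<SeekableByteChannel, SeekableByteChannel> getPrefetchingWrapper(
            int cloudPrefetchBuffer,
            BiFunction<Integer, SeekableByteChannel, SeekableByteChannel> addPrefetcher) {
        if (cloudPrefetchBuffer <= 0) {
            // Prefetching disabled: hand back the channel untouched.
            return Function.identity();
        }
        // Prefetching enabled: delegate to the supplied (hypothetical) prefetcher factory.
        return channel -> addPrefetcher.apply(cloudPrefetchBuffer, channel);
    }

    public static void main(String[] args) {
        Function<SeekableByteChannel, SeekableByteChannel> disabled =
                getPrefetchingWrapper(0, (mb, channel) -> channel);
        // With a buffer size of 0 the wrapper is the identity function.
        System.out.println(disabled.apply(null) == null); // prints "true"
    }
}
```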
    • createSignedUrlToGcsObject

      public static String createSignedUrlToGcsObject(String path, long hoursToLive)
      Take a GCS path and return a signed url to the same resource which allows unauthenticated users to access the file.
      Parameters:
      path - String representing a GCS path
      hoursToLive - how long in hours the url will remain valid
      Returns:
      A signed url which provides access to the bucket location over http allowing unauthenticated users to access it
    • bucketPathToPublicHttpUrl

      public static String bucketPathToPublicHttpUrl(String path)
      Converts a GCS bucket location into the equivalent public http URL. This doesn't do any validation checking to be sure that the location actually exists or is accessible. It's just a String -> String conversion.
      Parameters:
      path - String representing the gs:// path to an object in a public bucket
      Returns:
      String representing the https:// path to the same object
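
      The String -> String conversion can be sketched as a prefix swap. This sketch assumes the public endpoint is https://storage.googleapis.com/ followed by the bucket and object name, and it performs no URL-encoding; the class name PublicUrlSketch is hypothetical and the real method may differ in both respects.

```java
// Hypothetical stand-in for illustration; not the GATK implementation.
final class PublicUrlSketch {
    private static final String GCS_PREFIX = "gs://";
    // Assumption: public objects are served from this host as <host>/<bucket>/<object>.
    private static final String PUBLIC_HOST = "https://storage.googleapis.com/";

    private PublicUrlSketch() {}

    static String bucketPathToPublicHttpUrl(String path) {
        if (!path.startsWith(GCS_PREFIX)) {
            throw new IllegalArgumentException("Not a gs:// path: " + path);
        }
        // Pure string rewrite: no check that the object exists or is publicly readable.
        return PUBLIC_HOST + path.substring(GCS_PREFIX.length());
    }

    public static void main(String[] args) {
        System.out.println(bucketPathToPublicHttpUrl("gs://bucket/folder/file.txt"));
    }
}
```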