Class BucketUtils
java.lang.Object
org.broadinstitute.hellbender.utils.gcs.BucketUtils
Utilities for dealing with google buckets.
-
Field Summary
Fields: GCS_PREFIX, HTTP_PREFIX, HTTPS_PREFIX, HDFS_SCHEME, HDFS_PREFIX, FILE_PREFIX
-
Method Summary
Modifier and Type | Method | Description
static SeekableByteChannel | addPrefetcher(int bufferSizeMB, SeekableByteChannel channel) | Wrap a SeekableByteChannel with a prefetcher.
static String | bucketPathToPublicHttpUrl | Convert a GCS bucket location into the equivalent public http url.
static void | copyFile | Copies a file.
static OutputStream | createFile(String path) | Open a binary file for writing regardless of whether it's on GCS, HDFS or local disk.
static String | createSignedUrlToGcsObject(String path, long hoursToLive) | Take a GCS path and return a signed url to the same resource which allows unauthenticated users to access the file.
static void | deleteFile(String pathToDelete) | Deletes a file: local, GCS or HDFS.
static long | dirSize | Returns the total file size of all files in a directory, or the file size if the path specifies a file.
static boolean | fileExists(String path) | Returns true if we can read the first byte of the file.
static long | fileSize | Returns the file size of a file pointed to by a GCS/HDFS/local path.
static FileSystem | getAuthenticatedGcs(String projectId, String bucket, byte[] credentials) | Get an authenticated GCS-backed NIO FileSystem object representing the selected project and bucket.
static String | getBucket | Given a path of the form "gs://bucket/folder/folder/file", returns "bucket".
static com.google.cloud.storage.contrib.nio.CloudStorageConfiguration | getCloudStorageConfiguration(int maxReopens, String requesterProject) | The config we want to use.
static Path | getPathOnGcs(String gcsUrl) | String -> Path.
static String | getPathWithoutBucket(String path) | Given a path of the form "gs://bucket/folder/folder/file", returns "folder/folder/file".
static Function<SeekableByteChannel,SeekableByteChannel> | getPrefetchingWrapper(int cloudPrefetchBuffer) | Creates a wrapping function which adds a prefetcher if the buffer size is > 0; if it is <= 0, the wrapper returns the original channel.
static String | getTempFilePath(String prefix, String extension) | Get a temporary file path based on the prefix and extension provided.
static boolean | isEligibleForPrefetching(Path path) |
static boolean | isEligibleForPrefetching(GATKPath pathSpec) |
static boolean | isFileUrl |
static boolean | isGcsUrl |
static boolean | isGcsUrl | Return true if this GATKPath represents a gcs URI.
static boolean | isHadoopUrl(String path) | Returns true if the given path is an HDFS (Hadoop filesystem) URL.
static boolean | isHttpUrl |
static boolean | isRemoteStorageUrl(String path) | Returns true if the given path is a GCS, HDFS (Hadoop filesystem), or Http(s) URL.
static String | makeFilePathAbsolute(String path) | Changes relative local file paths to be absolute file paths.
static InputStream | openFile | Open a file for reading regardless of whether it's on GCS, HDFS or local disk.
static String | randomRemotePath(String stagingLocation, String prefix, String suffix) | Picks a random name, by putting some random letters between "prefix" and "suffix".
static void | setGlobalNIODefaultOptions(int maxReopens, String requesterProject) | Sets max_reopens, requester_pays, and generous timeouts as the global default.
-
Field Details
-
GCS_PREFIX
-
HTTP_PREFIX
-
HTTPS_PREFIX
-
HDFS_SCHEME
-
HDFS_PREFIX
-
FILE_PREFIX
-
Method Details
-
isGcsUrl
Parameters:
path - path to inspect
Returns:
true if this path represents a gcs location
-
isGcsUrl
Return true if this GATKPath represents a gcs URI.
Parameters:
pathSpec - specifier to inspect
Returns:
true if this GATKPath represents a gcs URI.
-
isEligibleForPrefetching
Parameters:
pathSpec - specifier to inspect
Returns:
true if this GATKPath represents a remote storage system which may benefit from prefetching (gcs or http(s))
-
isEligibleForPrefetching
Parameters:
path - path to inspect
Returns:
true if this Path represents a remote storage system which may benefit from prefetching (gcs or http(s))
-
isHttpUrl
Returns:
true if the given path is an http or https URL.
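The isGcsUrl, isEligibleForPrefetching and isHttpUrl checks come down to inspecting how a location string starts. The following is a minimal sketch of that idea, not the GATK implementation. It assumes the prefixes "gs://", "http://" and "https://" (the values of GCS_PREFIX, HTTP_PREFIX and HTTPS_PREFIX are not shown on this page), works on plain strings rather than Path or GATKPath, and uses the hypothetical class name UrlKindSketch.

```java
// Illustrative sketch only; not the GATK implementation.
final class UrlKindSketch {
    // Assumed prefix values; the real constants' values are not shown in this reference.
    static final String GCS_PREFIX = "gs://";
    static final String HTTP_PREFIX = "http://";
    static final String HTTPS_PREFIX = "https://";

    // True if the string starts with the assumed GCS prefix.
    static boolean isGcsUrl(String path) {
        return path.startsWith(GCS_PREFIX);
    }

    // True if the string starts with either assumed http(s) prefix.
    static boolean isHttpUrl(String path) {
        return path.startsWith(HTTP_PREFIX) || path.startsWith(HTTPS_PREFIX);
    }

    // Prefetching is described as useful for gcs or http(s) locations.
    static boolean isEligibleForPrefetching(String path) {
        return isGcsUrl(path) || isHttpUrl(path);
    }

    public static void main(String[] args) {
        System.out.println(isGcsUrl("gs://bucket/file.bam"));
        System.out.println(isEligibleForPrefetching("/local/file.bam"));
    }
}
```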
-
isHadoopUrl
Returns true if the given path is an HDFS (Hadoop filesystem) URL.
-
isRemoteStorageUrl
Returns true if the given path is a GCS, HDFS (Hadoop filesystem), or Http(s) URL.
-
makeFilePathAbsolute
Changes relative local file paths to be absolute file paths. Paths with a scheme are left unchanged.
Parameters:
path - the path
Returns:
an absolute file path if the original path was a relative file path, otherwise the original path
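The rule described for makeFilePathAbsolute can be sketched with the standard library alone. This is an illustration under stated assumptions, not the GATK code: the scheme test (two or more scheme characters followed by a colon, so that single-letter Windows drive prefixes are not mistaken for schemes) and the class name AbsolutePathSketch are both hypothetical.

```java
import java.nio.file.Paths;

// Illustrative sketch only; not the GATK implementation.
final class AbsolutePathSketch {
    static String makeFilePathAbsolute(String path) {
        // Assumed scheme test: a letter, then one or more scheme characters, then ':'.
        if (path.matches("[A-Za-z][A-Za-z0-9+.-]+:.*")) {
            return path; // paths with a scheme are left unchanged
        }
        // Relative local paths are resolved against the current working directory.
        return Paths.get(path).toAbsolutePath().toString();
    }

    public static void main(String[] args) {
        System.out.println(makeFilePathAbsolute("gs://bucket/folder/file"));
        System.out.println(makeFilePathAbsolute("data/file.txt"));
    }
}
```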
-
openFile
Open a file for reading regardless of whether it's on GCS, HDFS or local disk. If the file name ends with .gz, attempts to wrap it in an appropriate unzipping stream.
Parameters:
path - the GCS, HDFS or local path to read from. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
Returns:
an InputStream that reads from the specified file.
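The .gz handling described for openFile can be illustrated for local files with java.util.zip. This is a sketch assuming a local path only (no GCS or HDFS handling) and a hypothetical method name openLocalFile. It is not the GATK implementation, which may pick its unzipping stream differently.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Illustrative sketch only; handles local paths, not GCS or HDFS.
final class OpenFileSketch {
    static InputStream openLocalFile(String path) throws IOException {
        InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(path)));
        if (path.endsWith(".gz")) {
            try {
                // Wrap the raw stream so callers read decompressed bytes.
                return new GZIPInputStream(in);
            } catch (IOException e) {
                in.close(); // do not leak the underlying stream if the gzip header is invalid
                throw e;
            }
        }
        return in;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            try (InputStream in = openLocalFile(arg)) {
                System.out.println(arg + ": first byte = " + in.read());
            }
        }
    }
}
```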
-
createFile
Open a binary file for writing regardless of whether it's on GCS, HDFS or local disk. For writing to GCS it'll use the application/octet-stream MIME type.
Parameters:
path - the GCS or local path to write to. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
Returns:
an OutputStream that writes to the specified file.
-
copyFile
Copies a file. Can be used to copy e.g. from GCS to local.
Parameters:
sourcePath - the path to read from. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
destPath - the path to copy to. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
Throws:
IOException
-
deleteFile
Deletes a file: local, GCS or HDFS.
Parameters:
pathToDelete - the path to delete. If GCS, it must start with "gs://", or "hdfs://" for HDFS.
Throws:
IOException
-
getTempFilePath
Get a temporary file path based on the prefix and extension provided. This file (and possible indexes associated with it) will be scheduled for deletion on shutdown.
Parameters:
prefix - a prefix for the file name. For remote paths this should be a valid URI to root the temporary file in (e.g. gs://hellbender/staging/). There is no guarantee that this will be used as the root of the tmp file name; a local prefix may be placed in the tmp folder, for example.
extension - an extension for the temporary file path; the resulting path will end in this.
Returns:
a path to use as a temporary file. On remote file systems which don't support an atomic tmp file reservation, a path is chosen with a long randomized name.
-
randomRemotePath
Picks a random name, by putting some random letters between "prefix" and "suffix".
Parameters:
stagingLocation - The folder where you want the file to be. Must start with "gs://" or "hdfs://"
prefix - The beginning of the file name
suffix - The end of the file name, e.g. ".tmp"
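The naming scheme described for randomRemotePath can be sketched as plain string assembly. This is an illustration only: using a UUID as the source of random characters, rejecting other staging locations with an IllegalArgumentException, and the class name RandomPathSketch are assumptions, not details confirmed by this page.

```java
import java.util.UUID;

// Illustrative sketch only; not the GATK implementation.
final class RandomPathSketch {
    static String randomRemotePath(String stagingLocation, String prefix, String suffix) {
        if (!stagingLocation.startsWith("gs://") && !stagingLocation.startsWith("hdfs://")) {
            throw new IllegalArgumentException(
                    "Staging location must start with gs:// or hdfs://: " + stagingLocation);
        }
        // Make sure exactly one '/' separates the folder from the file name.
        String folder = stagingLocation.endsWith("/") ? stagingLocation : stagingLocation + "/";
        // Assumption: a random UUID supplies the random part between prefix and suffix.
        return folder + prefix + UUID.randomUUID() + suffix;
    }

    public static void main(String[] args) {
        System.out.println(randomRemotePath("gs://bucket/staging", "sample-", ".tmp"));
    }
}
```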
-
fileExists
Returns true if we can read the first byte of the file.
Parameters:
path - the path of the file to check (local, GCS or HDFS).
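The "read the first byte" check described for fileExists can be sketched for local paths as follows. This is an assumption-laden illustration rather than the GATK code: it handles only local paths, treats any IOException as "does not exist", and uses the hypothetical class name FileExistsSketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch only; handles local paths, not GCS or HDFS.
final class FileExistsSketch {
    static boolean fileExists(String path) {
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            // Attempt to read the first byte; the value itself is ignored.
            // In this sketch an empty file also counts as existing, because
            // read() returns -1 at end of stream instead of throwing.
            in.read();
            return true;
        } catch (IOException e) {
            return false; // missing or unreadable
        }
    }

    public static void main(String[] args) {
        for (String arg : args) {
            System.out.println(arg + " exists: " + fileExists(arg));
        }
    }
}
```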
-
fileSize
Returns the file size of a file pointed to by a GCS/HDFS/local path.
Parameters:
path - The URL to the file whose size to return
Returns:
the file size in bytes
Throws:
IOException
-
dirSize
Returns the total file size of all files in a directory, or the file size if the path specifies a file. Note that sub-directories are ignored; they are not recursed into. Only supports HDFS and local paths.
Parameters:
pathSpecifier - The URL to the file or directory whose size to return
Returns:
the total size of all files in bytes
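The non-recursive behavior described for dirSize can be illustrated for local paths with java.nio.file. This sketch takes a Path rather than the pathSpecifier type used by the real method, covers local directories only, and uses the hypothetical class name DirSizeSketch. It is not the GATK implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Illustrative sketch only; handles local paths, not HDFS.
final class DirSizeSketch {
    static long dirSize(Path path) throws IOException {
        if (!Files.isDirectory(path)) {
            return Files.size(path); // a plain file: just its own size
        }
        // Files.list returns only the direct children, so sub-directories are not recursed into.
        try (Stream<Path> entries = Files.list(path)) {
            return entries.filter(Files::isRegularFile)
                    .mapToLong(DirSizeSketch::sizeUnchecked)
                    .sum();
        }
    }

    private static long sizeUnchecked(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(dirSize(Paths.get(args.length > 0 ? args[0] : ".")));
    }
}
```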
-
isFileUrl
-
getBucket
Given a path of the form "gs://bucket/folder/folder/file", returns "bucket".
-
getPathWithoutBucket
Given a path of the form "gs://bucket/folder/folder/file", returns "folder/folder/file".
-
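The splitting done by getBucket and getPathWithoutBucket can be sketched as simple string parsing. This is a minimal illustration assuming the "gs://" prefix from the example above. The error handling for non-GCS input, the empty result for a bucket-only path, and the class name GcsPathPartsSketch are assumptions, not documented behavior.

```java
// Illustrative sketch only; not the GATK implementation.
final class GcsPathPartsSketch {
    private static final String GCS_PREFIX = "gs://"; // assumed prefix value

    // "gs://bucket/folder/folder/file" -> "bucket"
    static String getBucket(String path) {
        String rest = stripPrefix(path);
        int slash = rest.indexOf('/');
        return slash < 0 ? rest : rest.substring(0, slash);
    }

    // "gs://bucket/folder/folder/file" -> "folder/folder/file"
    static String getPathWithoutBucket(String path) {
        String rest = stripPrefix(path);
        int slash = rest.indexOf('/');
        return slash < 0 ? "" : rest.substring(slash + 1);
    }

    private static String stripPrefix(String path) {
        if (!path.startsWith(GCS_PREFIX)) {
            throw new IllegalArgumentException("Not a gs:// path: " + path);
        }
        return path.substring(GCS_PREFIX.length());
    }

    public static void main(String[] args) {
        String example = "gs://bucket/folder/folder/file";
        System.out.println(getBucket(example));
        System.out.println(getPathWithoutBucket(example));
    }
}
```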
setGlobalNIODefaultOptions
Sets max_reopens, requester_pays, and generous timeouts as the global default. These will apply even to library code that creates its own paths to access with NIO.
Parameters:
maxReopens - If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection.
requesterProject - Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
-
getPathOnGcs
String -> Path. This *should* not be necessary (use Paths.get(URI.create(...)) instead), but it currently is on Spark because using the fat, shaded jar breaks the registration of the GCS FilesystemProvider. To transform other types of string URLs into Paths, use IOUtils.getPath instead.
-
getCloudStorageConfiguration
public static com.google.cloud.storage.contrib.nio.CloudStorageConfiguration getCloudStorageConfiguration(int maxReopens, String requesterProject)
The config we want to use.
Parameters:
maxReopens - If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection.
requesterProject - Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed.
-
getAuthenticatedGcs
public static FileSystem getAuthenticatedGcs(String projectId, String bucket, byte[] credentials) throws IOException
Get an authenticated GCS-backed NIO FileSystem object representing the selected project and bucket. Credentials are found automatically when running on Compute/App engine, logged into gcloud, or if the GOOGLE_APPLICATION_CREDENTIALS env. variable is set. In that case leave credentials null. Otherwise, you must pass the contents of the service account credentials file. See https://github.com/GoogleCloudPlatform/gcloud-java#authentication
Note that most of the time it's enough to just open a file via Files.newInputStream(Paths.get(URI.create( path ))).
Throws:
IOException
-
addPrefetcher
Wrap a SeekableByteChannel with a prefetcher.
Parameters:
bufferSizeMB - buffer size in MB which the prefetcher should fetch ahead.
channel - a channel that needs prefetching
-
getPrefetchingWrapper
public static Function<SeekableByteChannel,SeekableByteChannel> getPrefetchingWrapper(int cloudPrefetchBuffer)
Creates a wrapping function which adds a prefetcher if the buffer size is > 0. If it is <= 0, the wrapper returns the original channel.
Parameters:
cloudPrefetchBuffer - the prefetcher buffer size in MB
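The switch described for getPrefetchingWrapper (wrap when the buffer size is positive, pass through otherwise) can be sketched with java.util.function. In this illustration the prefetcher is passed in as a hypothetical BiFunction parameter standing in for addPrefetcher(bufferSizeMB, channel), so the sketch has an extra argument that the real method does not take. The class name PrefetchWrapperSketch is also an assumption.

```java
import java.io.IOException;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.BiFunction;
import java.util.function.Function;

// Illustrative sketch only; not the GATK implementation.
final class PrefetchWrapperSketch {
    // 'prefetcher' is a hypothetical stand-in for addPrefetcher(bufferSizeMB, channel).
    static Function<SeekableByteChannel, SeekableByteChannel> getPrefetchingWrapper(
            int cloudPrefetchBuffer,
            BiFunction<Integer, SeekableByteChannel, SeekableByteChannel> prefetcher) {
        if (cloudPrefetchBuffer <= 0) {
            return Function.identity(); // no prefetching: hand back the original channel
        }
        return channel -> prefetcher.apply(cloudPrefetchBuffer, channel);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("prefetch-sketch", ".bin");
        try (SeekableByteChannel raw = Files.newByteChannel(tmp)) {
            // With a non-positive buffer size the wrapper returns the same channel object.
            System.out.println(getPrefetchingWrapper(0, (mb, ch) -> ch).apply(raw) == raw);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```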
-
createSignedUrlToGcsObject
Take a GCS path and return a signed url to the same resource which allows unauthenticated users to access the file.
Parameters:
path - String representing a GCS path
hoursToLive - how long in hours the url will remain valid
Returns:
A signed url which provides access to the bucket location over http allowing unauthenticated users to access it
-
bucketPathToPublicHttpUrl
Convert a GCS bucket location into the equivalent public http url. This doesn't do any validation checking to be sure that the location actually exists or is accessible. It's just a string -> string conversion.
Parameters:
path - String representing the gs:// path to an object in a public bucket
Returns:
String representing the https:// path to the same object
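The string -> string conversion described for bucketPathToPublicHttpUrl can be sketched as a prefix swap. The target host https://storage.googleapis.com/ is an assumption here (this page only says the result is an https:// path), as are the rejection of non-gs:// input and the class name PublicUrlSketch. As in the description, nothing is checked against the network.

```java
// Illustrative sketch only; not the GATK implementation.
final class PublicUrlSketch {
    private static final String GCS_PREFIX = "gs://"; // assumed prefix value
    // Assumed public endpoint; the host is not stated in this reference.
    private static final String PUBLIC_HOST = "https://storage.googleapis.com/";

    static String bucketPathToPublicHttpUrl(String path) {
        if (!path.startsWith(GCS_PREFIX)) {
            throw new IllegalArgumentException("Not a gs:// path: " + path);
        }
        // Pure string rewrite: no check that the object exists or is publicly readable.
        return PUBLIC_HOST + path.substring(GCS_PREFIX.length());
    }

    public static void main(String[] args) {
        System.out.println(bucketPathToPublicHttpUrl("gs://bucket/folder/file"));
    }
}
```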
-