java.lang.Object
- com.yahoo.text.Utf8

```
public final class Utf8
extends java.lang.Object
```
Utility class with functions for handling UTF-8.

Author:

arnej27959, Steinar Knutsen, baldersheim

Constructor Summary

Constructors
Constructor Description

Utf8()

Method Summary

Modifier and Type	Method	Description
`static int`	`byteCount(java.lang.CharSequence str)`	Count the number of bytes needed to represent a given sequence of 16-bit char values as a UTF-8 encoded array.
`static int`	`byteCount(java.lang.CharSequence str, int offset, int length)`	Count the number of bytes needed to represent a given sequence of 16-bit char values as a UTF-8 encoded array.
`static int[]`	`calculateBytePositions(java.lang.CharSequence value)`	Returns an integer array of the same length as the input string plus one.
`static int[]`	`calculateStringPositions(byte[] utf8)`	Returns an array of the same length as the input array plus one.
`static int`	`codePointAsUtf8Length(int codepoint)`	Return the number of octets needed to encode a valid Unicode codepoint as UTF-8.
`static byte[]`	`encode(int codepoint)`	Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into a new allocated array.
`static int`	`encode(int codepoint, byte[] destination, int offset)`	Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into an array.
`static int`	`encode(int codepoint, java.io.OutputStream destination)`	Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into an OutputStream.
`static void`	`encode(int codepoint, java.nio.ByteBuffer destination)`	Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into a ByteBuffer.
`static java.nio.charset.Charset`	`getCharset()`	Returns the Charset instance for UTF-8
`static java.nio.charset.CharsetEncoder`	`getNewEncoder()`	Create a new UTF-8 encoder.
`static byte[]`	`toAsciiBytes(boolean v)`
`static byte[]`	`toAsciiBytes(long l)`	Encode a long as its decimal representation, i.e. toAsciiBytes(15L) will return "15" encoded as UTF-8.
`static byte[]`	`toBytes(java.lang.String string)`	Will try an optimistic approach to utf8 encoding.
`static byte[]`	`toBytes(java.lang.String str, int offset, int length)`	Utility method as toBytes(String).
`static int`	`toBytes(java.lang.String str, int srcOffset, int srcLen, byte[] dst, int dstOffset)`	Direct encoding of a String into an array.
`static void`	`toBytes(java.lang.String src, int srcOffset, int srcLen, java.nio.ByteBuffer dst, java.nio.charset.CharsetEncoder encoder)`	Encode a string directly into a ByteBuffer instance.
`static byte[]`	`toBytesStd(java.lang.String str)`	Uses String.getBytes directly.
`static java.lang.String`	`toString(byte[] utf8)`	Will try an optimistic approach to utf8 decoding.
`static java.lang.String`	`toString(byte[] data, int offset, int length)`	Utility method as toString(byte[]).
`static java.lang.String`	`toString(java.nio.ByteBuffer data)`	Fetch a string from a ByteBuffer instance.
`static java.lang.String`	`toStringStd(byte[] data)`	To be used instead of String.String(byte[] bytes)
`static int`	`totalBytes(byte firstByte)`	Inspects a byte assumed to be the first byte of a UTF-8 sequence to determine how many bytes the whole sequence will use.
`static int`	`unitCount(byte firstByte)`	Calculate the number of Unicode code units ("UTF-16 characters") needed to represent a given UTF-8 encoded code point.
`static int`	`unitCount(byte[] utf8)`	Count the number of Unicode code units ("UTF-16 characters") needed to represent a given array of UTF-8 characters.
`static int`	`unitCount(byte[] utf8, int offset, int length)`	Count the number of Unicode code units ("UTF-16 characters") needed to represent a given array of UTF-8 characters.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - Utf8
```
public Utf8()
```
- Method Detail
  - getCharset
```
public static java.nio.charset.Charset getCharset()
```
    Returns the Charset instance for UTF-8
  - toStringStd
```
public static java.lang.String toStringStd(byte[] data)
```
    To be used instead of String.String(byte[] bytes)
  - toString
```
public static java.lang.String toString(byte[] data,
                                        int offset,
                                        int length)
```
    Utility method that decodes a sub-range of the array; otherwise behaves as toString(byte[]).
    
    Parameters:
    
    data - bytes to decode
    
    offset - index of first byte to decode
    
    length - number of bytes to decode
    
    Returns:
    
    String decoded from UTF-8
  - toString
```
public static java.lang.String toString(java.nio.ByteBuffer data)
```
    Fetch a string from a ByteBuffer instance. ByteBuffer instances are stateful, so it is assumed that the caller manipulates the instance's limit if the entire buffer is not a string.
    
    Parameters:
    
    data - The UTF-8 data source
    
    Returns:
    
    a decoded String
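    
    The limit handling described above can be sketched with JDK classes only. This is a minimal illustration, not the library's implementation: `BufferLimitSketch` and `readString` are hypothetical names, and `StandardCharsets.UTF_8.decode` stands in for `Utf8.toString(ByteBuffer)`. The sketch assumes the caller knows the byte length of the string, for example from a length prefix.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of com.yahoo.text.Utf8.
final class BufferLimitSketch {
    /** Decodes byteLength bytes from the buffer's current position and advances past them. */
    static String readString(ByteBuffer buffer, int byteLength) {
        ByteBuffer view = buffer.duplicate();            // shares the bytes, has its own position and limit
        view.limit(view.position() + byteLength);        // the view now ends where the string ends
        // JDK stand-in for Utf8.toString(view)
        String decoded = StandardCharsets.UTF_8.decode(view).toString();
        buffer.position(buffer.position() + byteLength); // skip the consumed bytes in the original
        return decoded;
    }
}
```

    Working on a duplicate keeps the original buffer's limit untouched, so several strings can be read from the same buffer one after another.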
  - toBytesStd
```
public static byte[] toBytesStd(java.lang.String str)
```
    Uses String.getBytes directly.
  - toAsciiBytes
```
public static byte[] toAsciiBytes(long l)
```
    Encode a long as its decimal representation, i.e. toAsciiBytes(15L) will return "15" encoded as UTF-8. In other words, it is an optimized version of String.valueOf() followed by UTF-8 encoding. It avoids going through a String in order to get a simple UTF-8 sequence.
    
    Parameters:
    
    l - value to represent as a decimal number encoded as UTF-8
    
    Returns:
    
    byte array
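    
    To make the idea of skipping the intermediate String concrete, here is one way such a conversion could be written. It is a sketch under assumptions, not the library's code: the class name `AsciiDecimalSketch` is hypothetical, and the handling of `Long.MIN_VALUE` via `Long.toString` is a shortcut chosen for brevity.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the implementation in com.yahoo.text.Utf8.
final class AsciiDecimalSketch {
    /** Writes the decimal digits of value directly into a byte array. */
    static byte[] toAsciiBytes(long value) {
        if (value == Long.MIN_VALUE) {
            // -Long.MIN_VALUE overflows, so this rare case goes through the JDK
            return Long.toString(value).getBytes(StandardCharsets.US_ASCII);
        }
        boolean negative = value < 0;
        long remaining = negative ? -value : value;
        int digits = 1;
        for (long probe = remaining; probe >= 10; probe /= 10) {
            digits++;                                    // count decimal digits
        }
        byte[] out = new byte[digits + (negative ? 1 : 0)];
        int i = out.length;
        do {
            out[--i] = (byte) ('0' + (remaining % 10));  // fill from the least significant digit
            remaining /= 10;
        } while (remaining != 0);
        if (negative) {
            out[0] = '-';
        }
        return out;
    }
}
```

    Decimal digits and the minus sign are ASCII, and ASCII bytes are valid UTF-8 as they stand, which is why no encoder is needed.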
  - toAsciiBytes
```
public static byte[] toAsciiBytes(boolean v)
```
  - toBytes
```
public static byte[] toBytes(java.lang.String string)
```
    Will try an optimistic approach to UTF-8 encoding. That is 4.6x faster than the brute encode for ASCII, not accounting for reduced memory footprint and GC.
    
    Parameters:
    
    string - The string to encode.
    
    Returns:
    
    Utf8 encoded array
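    
    An "optimistic" encoder can be pictured as follows: assume the input is pure ASCII, copy chars to bytes one to one, and fall back to a general encoder on the first char that does not fit. The sketch below is an assumption about what such an approach might look like, not the library's actual code; `OptimisticEncodeSketch` is a hypothetical name and the fallback uses `String.getBytes`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of an ASCII fast path with a general fallback.
final class OptimisticEncodeSketch {
    static byte[] toBytes(String s) {
        byte[] out = new byte[s.length()];               // exact size if the input is ASCII
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80) {
                // Not ASCII after all: give up and use the general encoder.
                return s.getBytes(StandardCharsets.UTF_8);
            }
            out[i] = (byte) c;                           // ASCII chars map to a single identical byte
        }
        return out;
    }
}
```

    For ASCII input the fast path allocates exactly one array of the final size, which is consistent with the reduced memory footprint mentioned above.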
  - toString
```
public static java.lang.String toString(byte[] utf8)
```
    Will try an optimistic approach to utf8 decoding.
    
    Parameters:
    
    utf8 - the UTF-8 encoded bytes to decode
    
    Returns:
    
    the decoded String
  - toBytes
```
public static byte[] toBytes(java.lang.String str,
                             int offset,
                             int length)
```
    Utility method that encodes a substring; otherwise behaves as toBytes(String).
    
    Parameters:
    
    str - String to encode
    
    offset - index of first character to encode
    
    length - number of characters to encode
    
    Returns:
    
    substring encoded as UTF-8
  - toBytes
```
public static int toBytes(java.lang.String str,
                          int srcOffset,
                          int srcLen,
                          byte[] dst,
                          int dstOffset)
```
    Direct encoding of a String into an array.
    
    Parameters:
    
    str - string to encode
    
    srcOffset - index of first character in string to encode
    
    srcLen - number of characters in string to encode
    
    dst - destination for encoded data
    
    dstOffset - index of first position to write data
    
    Returns:
    
    the number of bytes written to the array.
  - toBytes
```
public static void toBytes(java.lang.String src,
                           int srcOffset,
                           int srcLen,
                           java.nio.ByteBuffer dst,
                           java.nio.charset.CharsetEncoder encoder)
```
    Encode a string directly into a ByteBuffer instance.
    This method is somewhat more cumbersome than the rest of the helper methods in this library, as it is intended for use cases in the following style, where extraneous copying is highly undesirable:
    
        String[] a = {"abc", "def", "ghiè"};
        int[] aLens = {3, 3, 5};
        CharsetEncoder ce = Utf8.getNewEncoder();
        ByteBuffer forWire = ByteBuffer.allocate(someNumber);
        for (int i = 0; i < a.length; i++) {
            forWire.putInt(aLens[i]);
            Utf8.toBytes(a[i], 0, a[i].length(), forWire, ce);
        }
    
    Parameters:
    
    src - the string to encode
    
    srcOffset - index of first character to encode
    
    srcLen - number of characters to encode
    
    dst - the destination ByteBuffer
    
    encoder - the character encoder to use
    
    See Also:
    
    getNewEncoder()
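    
    A variation on the length-prefixed pattern, written against the JDK only so it runs without this library. Instead of precomputed lengths, the sketch reserves a slot for the length and patches it after encoding. `WireEncodeSketch` and `putString` are hypothetical names, and `CharsetEncoder.encode` stands in for `Utf8.toBytes(String, int, int, ByteBuffer, CharsetEncoder)`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, JDK only.
final class WireEncodeSketch {
    /** Writes a 4-byte length prefix followed by the UTF-8 bytes of s, without an intermediate byte[]. */
    static void putString(String s, ByteBuffer dst, CharsetEncoder encoder) {
        int lengthSlot = dst.position();
        dst.putInt(0);                                   // placeholder, patched below
        int start = dst.position();
        encoder.reset();                                 // make the shared encoder reusable
        CoderResult result = encoder.encode(CharBuffer.wrap(s), dst, true);
        if (result.isUnderflow()) {
            result = encoder.flush(dst);
        }
        if (!result.isUnderflow()) {
            // overflow (buffer too small) or malformed input
            throw new IllegalStateException("encoding failed: " + result);
        }
        dst.putInt(lengthSlot, dst.position() - start);  // absolute put, does not move the position
    }
}
```

    Reusing one encoder across calls avoids allocating a new one per string; the encoder is stateful, so it should not be shared between threads.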
  - getNewEncoder
```
public static java.nio.charset.CharsetEncoder getNewEncoder()
```
    Create a new UTF-8 encoder.
    
    See Also:
    
    toBytes(String, int, int, ByteBuffer, CharsetEncoder)
  - byteCount
```
public static int byteCount(java.lang.CharSequence str)
```
    Count the number of bytes needed to represent a given sequence of 16-bit char values as a UTF-8 encoded array. This method is written to be cheap to invoke. Note: It is strongly assumed that the character sequence is valid.
  - byteCount
```
public static int byteCount(java.lang.CharSequence str,
                            int offset,
                            int length)
```
    Count the number of bytes needed to represent a given sequence of 16-bit char values as a UTF-8 encoded array. This method is written to be cheap to invoke. Note: It is strongly assumed that the character sequence is valid.
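    
    The count follows directly from the UTF-8 length rules, so it can be computed without encoding anything. The following is a sketch of that calculation, assuming valid input as the note above does; `ByteCountSketch` is a hypothetical name and the code is not taken from the library.

```java
// Hypothetical sketch: counts UTF-8 bytes for a valid UTF-16 sequence.
final class ByteCountSketch {
    static int byteCount(CharSequence str) {
        int count = 0;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c < 0x80) {
                count += 1;                              // ASCII
            } else if (c < 0x800) {
                count += 2;
            } else if (Character.isHighSurrogate(c)) {
                count += 4;                              // a surrogate pair is one 4-byte sequence
                i++;                                     // skip the low surrogate (input assumed valid)
            } else {
                count += 3;
            }
        }
        return count;
    }
}
```

    Because the input is assumed valid, an unpaired surrogate would be miscounted here; validating callers should check the sequence first.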
  - unitCount
```
public static int unitCount(byte[] utf8)
```
    Count the number of Unicode code units ("UTF-16 characters") needed to represent a given array of UTF-8 characters. This method is written to be cheap to invoke. Note: It is strongly assumed the sequence is valid.
  - unitCount
```
public static int unitCount(byte[] utf8,
                            int offset,
                            int length)
```
    Count the number of Unicode code units ("UTF-16 characters") needed to represent a given array of UTF-8 characters. This method is written to be cheap to invoke. Note: It is strongly assumed the sequence is valid.
    
    Parameters:
    
    utf8 - raw data
    
    offset - index of first byte of UTF-8 sequence to check
    
    length - number of bytes in the UTF-8 sequence to check
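    
    The reverse count can be sketched the same way: walk the lead bytes, where sequences of one to three bytes yield one UTF-16 code unit and four-byte sequences yield two. This is an illustration under the stated validity assumption, with the hypothetical name `UnitCountSketch`; it is not the library's code.

```java
// Hypothetical sketch: counts UTF-16 code units for valid UTF-8 input.
final class UnitCountSketch {
    static int unitCount(byte[] utf8, int offset, int length) {
        int units = 0;
        int end = offset + length;
        int i = offset;
        while (i < end) {
            int lead = utf8[i] & 0xFF;                   // treat the byte as unsigned
            if (lead < 0x80) {
                units += 1; i += 1;                      // 1-byte sequence
            } else if (lead < 0xE0) {
                units += 1; i += 2;                      // 2-byte sequence
            } else if (lead < 0xF0) {
                units += 1; i += 3;                      // 3-byte sequence
            } else {
                units += 2; i += 4;                      // 4-byte sequence becomes a surrogate pair
            }
        }
        return units;
    }
}
```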
  - unitCount
```
public static int unitCount(byte firstByte)
```
    Calculate the number of Unicode code units ("UTF-16 characters") needed to represent a given UTF-8 encoded code point.
    
    Parameters:
    
    firstByte - the first byte of a character encoded as UTF-8
    
    Returns:
    
    the number of UTF-16 code units needed to represent the given code point
  - totalBytes
```
public static int totalBytes(byte firstByte)
```
    Inspects a byte assumed to be the first byte of a UTF-8 sequence to determine how many bytes the whole sequence will use.
    
    Parameters:
    
    firstByte - the first byte of a UTF8 encoded character
    
    Returns:
    
    the number of bytes used to encode the character
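    
    The length of a UTF-8 sequence is encoded in the high bits of its first byte. A minimal sketch of that inspection follows. `LeadByteSketch` is a hypothetical name, and throwing on bytes that cannot start a sequence is a choice made for this example; how the library treats such bytes is not stated here.

```java
// Hypothetical sketch: derives the sequence length from the lead byte's bit pattern.
final class LeadByteSketch {
    static int totalBytes(byte firstByte) {
        int b = firstByte & 0xFF;                        // treat the byte as unsigned
        if (b < 0x80) return 1;                          // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;                // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;                // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;                // 11110xxx
        // 10xxxxxx is a continuation byte; 11111xxx is not used in UTF-8 (assumption: reject both)
        throw new IllegalArgumentException("not a UTF-8 lead byte: " + b);
    }
}
```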
  - calculateBytePositions
```
public static int[] calculateBytePositions(java.lang.CharSequence value)
```
    Returns an integer array of the same length as the input string plus one. For every index in the array, the corresponding value gives the index into the UTF-8 byte sequence that can be created from the input.
    
    Parameters:
    
    value - a String to generate UTF-8 byte indexes from
    
    Returns:
    
    an array containing corresponding UTF-8 byte indexes
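    
    A small worked example may help with the mapping. The sketch below builds such a table with JDK calls only. `BytePositionsSketch` is a hypothetical name, and mapping both halves of a surrogate pair to the start of the same 4-byte sequence is an assumption of this sketch; the library may treat that case differently.

```java
// Hypothetical sketch: maps each UTF-16 index to its UTF-8 byte offset.
final class BytePositionsSketch {
    static int[] bytePositions(CharSequence value) {
        int[] positions = new int[value.length() + 1];
        int bytePos = 0;
        int i = 0;
        while (i < value.length()) {
            int cp = Character.codePointAt(value, i);
            int units = Character.charCount(cp);         // 1, or 2 for a surrogate pair
            for (int u = 0; u < units; u++) {
                positions[i + u] = bytePos;              // assumption: both surrogates share one offset
            }
            bytePos += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            i += units;
        }
        positions[value.length()] = bytePos;             // final entry is the total byte length
        return positions;
    }
}
```

    For the input "aèb" the UTF-8 bytes are 61 C3 A8 62, so the table is {0, 1, 3, 4}: 'a' starts at byte 0, 'è' at byte 1, 'b' at byte 3, and the last entry is the total length.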
  - calculateStringPositions
```
public static int[] calculateStringPositions(byte[] utf8)
```
    Returns an array of the same length as the input array plus one. For every index in the array, the corresponding value gives the index into the Java string (UTF-16 sequence) that can be created from the input.
    
    Parameters:
    
    utf8 - a byte array containing a string encoded as UTF-8. Note: It is strongly assumed that this sequence is correct.
    
    Returns:
    
    an array containing corresponding UTF-16 character indexes. If input array is empty, returns an array containing a single zero.
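    
    The inverse table can be sketched along the same lines. `StringPositionsSketch` is a hypothetical name, and assigning continuation bytes the index of the character they belong to is an assumption of this sketch rather than documented behavior.

```java
// Hypothetical sketch: maps each UTF-8 byte index to a UTF-16 index.
final class StringPositionsSketch {
    static int[] stringPositions(byte[] utf8) {
        int[] positions = new int[utf8.length + 1];
        int charPos = 0;
        int i = 0;
        while (i < utf8.length) {
            int lead = utf8[i] & 0xFF;
            int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
            for (int k = 0; k < len && i + k < utf8.length; k++) {
                positions[i + k] = charPos;              // assumption: continuation bytes share the index
            }
            charPos += (len == 4) ? 2 : 1;               // 4-byte sequences decode to a surrogate pair
            i += len;
        }
        positions[utf8.length] = charPos;                // final entry is the UTF-16 length
        return positions;
    }
}
```

    For the bytes 61 C3 A8 62 ("aèb") the table is {0, 1, 1, 2, 3}, and an empty input gives {0}, matching the return description above.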
  - encode
```
public static byte[] encode(int codepoint)
```
    Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into a new allocated array.
    
    Parameters:
    
    codepoint - Unicode codepoint to encode
    
    Returns:
    
    a newly allocated array containing the UTF-8 encoding of the code point
    
    Throws:
    
    java.lang.IndexOutOfBoundsException - if there is insufficient room for the encoded data in the given array
  - encode
```
public static int encode(int codepoint,
                         byte[] destination,
                         int offset)
```
    Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into an array.
    
    Parameters:
    
    codepoint - Unicode codepoint to encode
    
    destination - array to write into
    
    offset - index of first byte written
    
    Returns:
    
    index of the first byte after the last byte written (i.e. offset plus number of bytes written)
    
    Throws:
    
    java.lang.IndexOutOfBoundsException - if there is insufficient room for the encoded data in the given array
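    
    For readers unfamiliar with the bit layout, the following sketch shows how a code point can be split into UTF-8 bytes and how the returned index relates to the offset. It assumes a valid code point, uses the hypothetical name `EncodeCodePointSketch`, and is not the library's implementation.

```java
// Hypothetical sketch: writes one code point as UTF-8 and returns the next free index.
final class EncodeCodePointSketch {
    static int encode(int codepoint, byte[] destination, int offset) {
        int i = offset;
        if (codepoint < 0x80) {
            destination[i++] = (byte) codepoint;
        } else if (codepoint < 0x800) {
            destination[i++] = (byte) (0xC0 | (codepoint >> 6));
            destination[i++] = (byte) (0x80 | (codepoint & 0x3F));
        } else if (codepoint < 0x10000) {
            destination[i++] = (byte) (0xE0 | (codepoint >> 12));
            destination[i++] = (byte) (0x80 | ((codepoint >> 6) & 0x3F));
            destination[i++] = (byte) (0x80 | (codepoint & 0x3F));
        } else {
            destination[i++] = (byte) (0xF0 | (codepoint >> 18));
            destination[i++] = (byte) (0x80 | ((codepoint >> 12) & 0x3F));
            destination[i++] = (byte) (0x80 | ((codepoint >> 6) & 0x3F));
            destination[i++] = (byte) (0x80 | (codepoint & 0x3F));
        }
        return i;                                        // offset plus number of bytes written
    }
}
```

    Returning the next free index lets calls be chained: the result of one call is the offset argument of the next. In this sketch a too-small array surfaces as the JVM's `ArrayIndexOutOfBoundsException`, possibly after some bytes have already been written.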
  - encode
```
public static void encode(int codepoint,
                          java.nio.ByteBuffer destination)
```
    Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into a ByteBuffer.
    
    Parameters:
    
    codepoint - Unicode codepoint to encode
    
    destination - buffer to write into
    
    Throws:
    
    java.nio.BufferOverflowException - if the buffer's limit is met while writing (propagated from the ByteBuffer)
    
    java.nio.ReadOnlyBufferException - if the buffer is read only (propagated from the ByteBuffer)
  - encode
```
public static int encode(int codepoint,
                         java.io.OutputStream destination)
                  throws java.io.IOException
```
    Encode a valid Unicode codepoint as a sequence of UTF-8 bytes into an OutputStream.
    
    Parameters:
    
    codepoint - Unicode codepoint to encode
    
    destination - stream to write into
    
    Returns:
    
    number of bytes written
    
    Throws:
    
    java.io.IOException - propagated from stream
  - codePointAsUtf8Length
```
public static int codePointAsUtf8Length(int codepoint)
```
    Return the number of octets needed to encode a valid Unicode codepoint as UTF-8.
    
    Parameters:
    
    codepoint - the Unicode codepoint to inspect
    
    Returns:
    
    the number of bytes needed for UTF-8 representation
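    
    The thresholds behind this calculation are fixed by the UTF-8 encoding itself. A minimal sketch, assuming a valid code point and using the hypothetical name `Utf8LengthSketch`:

```java
// Hypothetical sketch: UTF-8 length of a valid Unicode code point.
final class Utf8LengthSketch {
    static int utf8Length(int codepoint) {
        if (codepoint < 0x80) return 1;                  // U+0000..U+007F
        if (codepoint < 0x800) return 2;                 // U+0080..U+07FF
        if (codepoint < 0x10000) return 3;               // U+0800..U+FFFF
        return 4;                                        // U+10000..U+10FFFF
    }
}
```

    A typical use is sizing a destination array before encoding, as in `new byte[utf8Length(codepoint)]`.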
