Class DataUtilities
- java.lang.Object
-
- com.upokecenter.util.DataUtilities
-
public final class DataUtilities extends java.lang.Object
Contains methods useful for reading and writing text strings. It is designed to have no dependencies other than the basic runtime class library.
Many of these methods work with text encoded in UTF-8, an encoding form of the Unicode Standard which uses one byte to encode the most basic characters and two to four bytes to encode other characters. For example, the GetUtf8Bytes method converts a text string to an array of bytes in UTF-8.
In C# and Java, text strings are represented as sequences of 16-bit values called chars. These sequences are well-formed under UTF-16, a 16-bit encoding form of Unicode, except if they contain unpaired surrogate code points. (A surrogate code point is used to encode supplementary characters, those with code points U+10000 or higher, in UTF-16. A surrogate pair is a high surrogate, U+D800 to U+DBFF, followed by a low surrogate, U+DC00 to U+DFFF. An unpaired surrogate code point is a surrogate not appearing in a surrogate pair.) Many of the methods in this class allow setting the behavior to follow when unpaired surrogate code points are found in text strings, such as throwing an error or treating the unpaired surrogate as a replacement character (U+FFFD).
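The definition of an unpaired surrogate given above can be illustrated with a short scan over a string's chars. This is a minimal sketch that uses only the JDK; the class and method names are hypothetical and are not part of DataUtilities.

```java
final class SurrogateScanSketch {
    /** Returns the index of the first unpaired surrogate in s, or -1 if there is none. */
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // well-formed pair: skip the low surrogate
                } else {
                    return i; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnpairedSurrogate("a\uD83D\uDE00b")); // prints -1
        System.out.println(firstUnpairedSurrogate("a\uD83Db"));       // prints 1
    }
}
```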
-
-
Method Summary
static int CodePointAt(java.lang.String str, int index)
Gets the Unicode code point at the given index of the string.
static int CodePointAt(java.lang.String str, int index, int surrogateBehavior)
Gets the Unicode code point at the given index of the string.
static int CodePointBefore(java.lang.String str, int index)
Gets the Unicode code point just before the given index of the string.
static int CodePointBefore(java.lang.String str, int index, int surrogateBehavior)
Gets the Unicode code point just before the given index of the string.
static int CodePointCompare(java.lang.String strA, java.lang.String strB)
Compares two strings in Unicode code point order.
static int CodePointLength(java.lang.String str)
Finds the number of Unicode code points in the given text string.
static byte[] GetUtf8Bytes(java.lang.String str, boolean replace)
Encodes a string in UTF-8 as a byte array.
static byte[] GetUtf8Bytes(java.lang.String str, boolean replace, boolean lenientLineBreaks)
Encodes a string in UTF-8 as a byte array.
static long GetUtf8Length(java.lang.String str, boolean replace)
Calculates the number of bytes needed to encode a string in UTF-8.
static java.lang.String GetUtf8String(byte[] bytes, boolean replace)
Generates a text string from a UTF-8 byte array.
static java.lang.String GetUtf8String(byte[] bytes, int offset, int bytesCount, boolean replace)
Generates a text string from a portion of a UTF-8 byte array.
static int ReadUtf8(java.io.InputStream stream, int bytesCount, java.lang.StringBuilder builder, boolean replace)
Reads a string in UTF-8 encoding from a data stream.
static int ReadUtf8FromBytes(byte[] data, int offset, int bytesCount, java.lang.StringBuilder builder, boolean replace)
Reads a string in UTF-8 encoding from a byte array.
static java.lang.String ReadUtf8ToString(java.io.InputStream stream)
Reads a string in UTF-8 encoding from a data stream in full and returns that string.
static java.lang.String ReadUtf8ToString(java.io.InputStream stream, int bytesCount, boolean replace)
Reads a string in UTF-8 encoding from a data stream and returns that string.
static java.lang.String ToLowerCaseAscii(java.lang.String str)
Returns a string with the basic upper-case letters A to Z (U+0041 to U+005A) converted to lower-case.
static java.lang.String ToUpperCaseAscii(java.lang.String str)
Returns a string with the basic lower-case letters a to z (U+0061 to U+007A) converted to upper-case.
static int WriteUtf8(java.lang.String str, int offset, int length, java.io.OutputStream stream, boolean replace)
Writes a portion of a string in UTF-8 encoding to a data stream.
static int WriteUtf8(java.lang.String str, int offset, int length, java.io.OutputStream stream, boolean replace, boolean lenientLineBreaks)
Writes a portion of a string in UTF-8 encoding to a data stream.
static int WriteUtf8(java.lang.String str, java.io.OutputStream stream, boolean replace)
Writes a string in UTF-8 encoding to a data stream.
-
-
-
Method Detail
-
GetUtf8String
public static java.lang.String GetUtf8String(byte[] bytes, boolean replace)
Generates a text string from a UTF-8 byte array.
- Parameters:
bytes - A byte array containing text encoded in UTF-8.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
- Returns:
- A string represented by the UTF-8 byte array.
- Throws:
java.lang.NullPointerException - The parameter bytes is null.
java.lang.IllegalArgumentException - The string is not valid UTF-8 and replace is false.
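The difference between the two settings of replace can be illustrated with the JDK's own decoder. This is only a sketch of comparable behavior, not the implementation used by DataUtilities; the class and method names are hypothetical, and the number of replacement characters the JDK emits for a given malformed sequence may differ from what this library produces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8DecodeSketch {
    /** Decodes UTF-8, either replacing invalid input with U+FFFD or rejecting it. */
    static String decode(byte[] bytes, boolean replace) {
        CodingErrorAction action =
            replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes)) // NullPointerException if bytes is null
                .toString();
        } catch (CharacterCodingException ex) {
            // Only reachable with REPORT, that is, when replace is false
            throw new IllegalArgumentException("Invalid UTF-8", ex);
        }
    }

    public static void main(String[] args) {
        byte[] invalid = {0x41, (byte) 0xFF, 0x42}; // 0xFF never appears in valid UTF-8
        System.out.println(decode(invalid, true).length()); // prints 3
    }
}
```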
-
CodePointLength
public static int CodePointLength(java.lang.String str)
Finds the number of Unicode code points in the given text string. Unpaired surrogate code points increase this number by 1. This is not necessarily the length of the string in chars.
- Parameters:
str - A text string.
- Returns:
- The number of Unicode code points in the given string.
- Throws:
java.lang.NullPointerException - The parameter str is null.
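To make the distinction between code points and chars concrete, the following sketch counts code points with the JDK's String.codePointCount, which also counts each unpaired surrogate as one code point. The class and method names are hypothetical; this is not the library's implementation.

```java
final class CodePointLengthSketch {
    /** Counts code points; a surrogate pair counts once, an unpaired surrogate counts once. */
    static int codePointLength(String str) {
        if (str == null) {
            throw new NullPointerException("str");
        }
        return str.codePointCount(0, str.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', one supplementary character, 'b'
        System.out.println(s.length());         // prints 4
        System.out.println(codePointLength(s)); // prints 3
    }
}
```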
-
GetUtf8String
public static java.lang.String GetUtf8String(byte[] bytes, int offset, int bytesCount, boolean replace)
Generates a text string from a portion of a UTF-8 byte array.
- Parameters:
bytes - A byte array containing text encoded in UTF-8.
offset - Offset into the byte array to start reading.
bytesCount - Length, in bytes, of the UTF-8 text string.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
- Returns:
- A string represented by the UTF-8 byte array.
- Throws:
java.lang.NullPointerException - The parameter bytes is null.
java.lang.IllegalArgumentException - The portion of the byte array is not valid UTF-8 and replace is false.
java.lang.IllegalArgumentException - The parameter offset is less than 0, bytesCount is less than 0, or offset plus bytesCount is greater than the length of bytes.
-
GetUtf8Bytes
public static byte[] GetUtf8Bytes(java.lang.String str, boolean replace)
Encodes a string in UTF-8 as a byte array. This method does not insert a byte-order mark (U+FEFF) at the beginning of the encoded byte array.
REMARK: It is not recommended to use Encoding.UTF8.GetBytes in .NET, or the getBytes() method in Java to do this. For instance, getBytes() encodes text strings in a default (so not fixed) character encoding, which can be undesirable.
- Parameters:
str - A text string.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
- Returns:
- The string encoded in UTF-8.
- Throws:
java.lang.NullPointerException - The parameter str is null.
java.lang.IllegalArgumentException - The string contains an unpaired surrogate code point and replace is false, or an internal error occurred.
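The replace behavior described above can be approximated with the JDK alone by checking for unpaired surrogates before encoding with an explicitly named charset. The sketch below is an illustration under that assumption; the class and method names are hypothetical and the code is not taken from DataUtilities.

```java
import java.nio.charset.StandardCharsets;

final class Utf8EncodeSketch {
    /** Encodes str as UTF-8, replacing or rejecting unpaired surrogates. */
    static byte[] encode(String str, boolean replace) {
        StringBuilder sb = new StringBuilder(str.length()); // NullPointerException if str is null
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < str.length()
                    && Character.isLowSurrogate(str.charAt(i + 1))) {
                sb.append(c).append(str.charAt(++i)); // keep a well-formed pair
            } else if (Character.isSurrogate(c)) {
                if (!replace) {
                    throw new IllegalArgumentException("Unpaired surrogate at index " + i);
                }
                sb.append('\uFFFD'); // replacement character
            } else {
                sb.append(c);
            }
        }
        // The charset is named explicitly, so the platform default is never used
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("A\uD800", true).length); // prints 4
    }
}
```

Naming the charset explicitly avoids the platform-default pitfall mentioned in the remark; the pre-pass is there because the JDK's own replacement for an unpaired surrogate when encoding is not U+FFFD.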
-
GetUtf8Bytes
public static byte[] GetUtf8Bytes(java.lang.String str, boolean replace, boolean lenientLineBreaks)
Encodes a string in UTF-8 as a byte array. This method does not insert a byte-order mark (U+FEFF) at the beginning of the encoded byte array.
REMARK: It is not recommended to use Encoding.UTF8.GetBytes in .NET, or the getBytes() method in Java to do this. For instance, getBytes() encodes text strings in a default (so not fixed) character encoding, which can be undesirable.
- Parameters:
str - A text string.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
lenientLineBreaks - If true, replaces carriage return (CR) not followed by line feed (LF) and LF not preceded by CR with CR-LF pairs.
- Returns:
- The string encoded in UTF-8.
- Throws:
java.lang.NullPointerException - The parameter str is null.
java.lang.IllegalArgumentException - The string contains an unpaired surrogate code point and replace is false, or an internal error occurred.
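The line-break rule stated for lenientLineBreaks can be sketched on its own, separate from the UTF-8 encoding step. The following is a hypothetical illustration of that rule, working on a string rather than on encoded bytes; it is not the library's code.

```java
final class LineBreakSketch {
    /** Turns every bare CR and every bare LF into CR-LF; existing CR-LF pairs are kept. */
    static String normalizeLineBreaks(String str) {
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == '\r') {
                sb.append("\r\n");
                if (i + 1 < str.length() && str.charAt(i + 1) == '\n') {
                    i++; // already a CR-LF pair: consume its LF
                }
            } else if (c == '\n') {
                sb.append("\r\n"); // bare LF (the LF of a pair was consumed above)
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String out = normalizeLineBreaks("a\nb\rc\r\nd");
        System.out.println(out.equals("a\r\nb\r\nc\r\nd")); // prints true
    }
}
```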
-
GetUtf8Length
public static long GetUtf8Length(java.lang.String str, boolean replace)
Calculates the number of bytes needed to encode a string in UTF-8.
- Parameters:
str - A text string.
replace - If true, treats unpaired surrogate code points as having 3 UTF-8 bytes (the UTF-8 length of the replacement character U+FFFD).
- Returns:
- The number of bytes needed to encode the given string in UTF-8, or -1 if the string contains an unpaired surrogate code point and replace is false.
- Throws:
java.lang.NullPointerException - The parameter str is null.
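The byte counts follow directly from the UTF-8 encoding ranges: 1 byte below U+0080, 2 bytes below U+0800, 3 bytes for the rest of the Basic Multilingual Plane, and 4 bytes for supplementary code points. A minimal sketch of such a calculation, with hypothetical names and not taken from the library, might look like this:

```java
final class Utf8LengthSketch {
    /** Returns the UTF-8 length of str, or -1 on an unpaired surrogate when replace is false. */
    static long utf8Length(String str, boolean replace) {
        long size = 0;
        for (int i = 0; i < str.length(); i++) { // NullPointerException if str is null
            char c = str.charAt(i);
            if (c < 0x80) {
                size += 1;
            } else if (c < 0x800) {
                size += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < str.length()
                    && Character.isLowSurrogate(str.charAt(i + 1))) {
                size += 4; // surrogate pair: one supplementary code point
                i++;
            } else if (Character.isSurrogate(c)) {
                if (!replace) {
                    return -1;
                }
                size += 3; // counted as the replacement character U+FFFD
            } else {
                size += 3;
            }
        }
        return size;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A\u00E9\u20AC\uD83D\uDE00", false)); // prints 10
    }
}
```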
-
CodePointBefore
public static int CodePointBefore(java.lang.String str, int index)
Gets the Unicode code point just before the given index of the string.
- Parameters:
str - A text string.
index - Index of the current position into the string.
- Returns:
- The Unicode code point at the previous position. Returns -1 if index is 0 or less, or is greater than the string's length. Returns the replacement character (U+FFFD) if the code point at the previous position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
- Throws:
java.lang.NullPointerException - The parameter str is null.
-
CodePointBefore
public static int CodePointBefore(java.lang.String str, int index, int surrogateBehavior)
Gets the Unicode code point just before the given index of the string.
- Parameters:
str - A text string.
index - Index of the current position into the string.
surrogateBehavior - Specifies what kind of value to return if the previous code point is an unpaired surrogate code point: if 0, return the replacement character (U+FFFD); if 1, return the value of the surrogate code point; if neither 0 nor 1, return -1.
- Returns:
- The Unicode code point at the previous position. Returns -1 if index is 0 or less, or is greater than the string's length. Returns a value as specified under surrogateBehavior if the code point at the previous position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
- Throws:
java.lang.NullPointerException - The parameter str is null.
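The return values listed above, including the three surrogateBehavior cases, can be sketched as follows. This is an illustration written from the description only, using hypothetical names; the library's actual code may differ.

```java
final class CodePointBeforeSketch {
    /** Returns the code point ending just before index, following the rules described above. */
    static int codePointBefore(String str, int index, int surrogateBehavior) {
        if (str == null) {
            throw new NullPointerException("str");
        }
        if (index <= 0 || index > str.length()) {
            return -1;
        }
        char c = str.charAt(index - 1);
        if (Character.isLowSurrogate(c) && index >= 2
                && Character.isHighSurrogate(str.charAt(index - 2))) {
            // Surrogate pair: combine into one supplementary code point
            return Character.toCodePoint(str.charAt(index - 2), c);
        }
        if (Character.isSurrogate(c)) {
            // Unpaired surrogate: 0 gives U+FFFD, 1 gives the surrogate itself, else -1
            return surrogateBehavior == 0 ? 0xFFFD : (surrogateBehavior == 1 ? c : -1);
        }
        return c;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";
        System.out.println(Integer.toHexString(codePointBefore(s, 3, 0))); // prints 1f600
    }
}
```

When walking a string backwards with such a method, a return value of 0x10000 or greater means the index should move back by two chars instead of one.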
-
CodePointAt
public static int CodePointAt(java.lang.String str, int index)
Gets the Unicode code point at the given index of the string.
- Parameters:
str - A text string.
index - Index of the current position into the string.
- Returns:
- The Unicode code point at the given position. Returns -1 if index is less than 0, or is the string's length or greater. Returns the replacement character (U+FFFD) if the code point at that position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
- Throws:
java.lang.NullPointerException - The parameter str is null.
-
CodePointAt
public static int CodePointAt(java.lang.String str, int index, int surrogateBehavior)
Gets the Unicode code point at the given index of the string.
The following example shows how to iterate a text string code point by code point, terminating the loop when an unpaired surrogate is found.
for (int i = 0; i < str.length(); ++i) {
  int codePoint = DataUtilities.CodePointAt(str, i, 2);
  if (codePoint < 0) {
    break; /* Unpaired surrogate */
  }
  System.out.println("codePoint:" + codePoint);
  if (codePoint >= 0x10000) {
    i++; /* Supplementary code point */
  }
}
- Parameters:
str - A text string.
index - Index of the current position into the string.
surrogateBehavior - Specifies what kind of value to return if the code point at the given index is an unpaired surrogate code point: if 0, return the replacement character (U+FFFD); if 1, return the value of the surrogate code point; if neither 0 nor 1, return -1.
- Returns:
- The Unicode code point at the given position. Returns -1 if index is less than 0, or is the string's length or greater. Returns a value as specified under surrogateBehavior if the code point at that position is an unpaired surrogate code point. If the return value is 65536 (0x10000) or greater, the code point takes up two UTF-16 code units.
- Throws:
java.lang.NullPointerException - The parameter str is null.
-
ToLowerCaseAscii
public static java.lang.String ToLowerCaseAscii(java.lang.String str)
Returns a string with the basic upper-case letters A to Z (U+0041 to U+005A) converted to lower-case. Other characters remain unchanged.
- Parameters:
str - A text string.
- Returns:
- The converted string, or null if str is null.
-
ToUpperCaseAscii
public static java.lang.String ToUpperCaseAscii(java.lang.String str)
Returns a string with the basic lower-case letters a to z (U+0061 to U+007A) converted to upper-case. Other characters remain unchanged.
- Parameters:
str - A text string.
- Returns:
- The converted string, or null if str is null.
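The ASCII-only case conversions described above touch nothing outside the 26 basic letters, unlike String.toLowerCase and String.toUpperCase, which apply Unicode and locale rules. A minimal sketch of that narrower behavior, with hypothetical names and not taken from the library:

```java
final class AsciiCaseSketch {
    /** Lower-cases only A to Z; returns null for null input. */
    static String toLowerCaseAscii(String str) {
        if (str == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 0x20) : c);
        }
        return sb.toString();
    }

    /** Upper-cases only a to z; returns null for null input. */
    static String toUpperCaseAscii(String str) {
        if (str == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            sb.append(c >= 'a' && c <= 'z' ? (char) (c - 0x20) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toLowerCaseAscii("Content-TYPE")); // prints content-type
        System.out.println(toUpperCaseAscii("utf-8"));        // prints UTF-8
    }
}
```

This kind of conversion suits protocol keywords and header names, where locale-dependent case mapping would be a source of bugs.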
-
CodePointCompare
public static int CodePointCompare(java.lang.String strA, java.lang.String strB)
Compares two strings in Unicode code point order. Unpaired surrogate code points are treated as individual code points.
- Parameters:
strA - The first string. Can be null.
strB - The second string. Can be null.
- Returns:
- A value indicating which string is "less" or "greater":
- 0: Both strings are equal or null.
- Less than 0: strA is null and strB isn't; or the first code point that's different is less in strA than in strB; or strB starts with strA and is longer than strA.
- Greater than 0: strB is null and strA isn't; or the first code point that's different is greater in strA than in strB; or strA starts with strB and is longer than strB.
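Code point order differs from the UTF-16 code unit order used by String.compareTo whenever a supplementary character is compared with a character in the range U+E000 to U+FFFF. The sketch below illustrates the ordering rules listed above using JDK calls only; the names are hypothetical and this is not the library's implementation.

```java
final class CodePointCompareSketch {
    /** Compares by code point; null sorts before any non-null string. */
    static int compare(String a, String b) {
        if (a == null) {
            return b == null ? 0 : -1;
        }
        if (b == null) {
            return 1;
        }
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            // String.codePointAt returns an unpaired surrogate as its own value
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca < cb ? -1 : 1;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        if (i < a.length()) {
            return 1; // a starts with b and is longer
        }
        return j < b.length() ? -1 : 0;
    }

    public static void main(String[] args) {
        String bmp = "\uFF61";                 // a code point below U+10000
        String supplementary = "\uD83D\uDE00"; // a code point above U+FFFF
        System.out.println(compare(bmp, supplementary) < 0);   // prints true
        System.out.println(bmp.compareTo(supplementary) < 0);  // prints false
    }
}
```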
-
WriteUtf8
public static int WriteUtf8(java.lang.String str, int offset, int length, java.io.OutputStream stream, boolean replace) throws java.io.IOException
Writes a portion of a string in UTF-8 encoding to a data stream.
- Parameters:
str - A string to write.
offset - The index, starting at 0, where the string portion to write begins.
length - The length of the string portion to write.
stream - A writable data stream.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
- Returns:
- 0 if the entire string portion was written; or -1 if the string portion contains an unpaired surrogate code point and replace is false.
- Throws:
java.lang.NullPointerException - The parameter str is null or stream is null.
java.io.IOException - An I/O error occurred.
java.lang.IllegalArgumentException - Either offset or length is less than 0 or greater than str's length, or str's length minus offset is less than length.
-
WriteUtf8
public static int WriteUtf8(java.lang.String str, int offset, int length, java.io.OutputStream stream, boolean replace, boolean lenientLineBreaks) throws java.io.IOException
Writes a portion of a string in UTF-8 encoding to a data stream.
- Parameters:
str - A string to write.
offset - The index, starting at 0, where the string portion to write begins.
length - The length of the string portion to write.
stream - A writable data stream.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
lenientLineBreaks - If true, replaces carriage return (CR) not followed by line feed (LF) and LF not preceded by CR with CR-LF pairs.
- Returns:
- 0 if the entire string portion was written; or -1 if the string portion contains an unpaired surrogate code point and replace is false.
- Throws:
java.lang.NullPointerException - The parameter str is null or stream is null.
java.lang.IllegalArgumentException - The parameter offset is less than 0, length is less than 0, or offset plus length is greater than the string's length.
java.io.IOException - An I/O error occurred.
-
WriteUtf8
public static int WriteUtf8(java.lang.String str, java.io.OutputStream stream, boolean replace) throws java.io.IOException
Writes a string in UTF-8 encoding to a data stream.
- Parameters:
str - A string to write.
stream - A writable data stream.
replace - If true, replaces unpaired surrogate code points with the replacement character (U+FFFD). If false, stops processing when an unpaired surrogate code point is seen.
- Returns:
- 0 if the entire string was written; or -1 if the string contains an unpaired surrogate code point and replace is false.
- Throws:
java.lang.NullPointerException - The parameter str is null or stream is null.
java.io.IOException - An I/O error occurred.
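The return-code convention described above (0 on success, -1 on an unpaired surrogate when replace is false) can be sketched with a hand-written UTF-8 encoder that writes directly to a stream. This is an illustrative sketch with hypothetical names, not the library's code. In this sketch, bytes preceding the unpaired surrogate have already been written when -1 is returned; whether the library behaves the same way is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class WriteUtf8Sketch {
    /** Returns 0 on success, or -1 if an unpaired surrogate is found and replace is false. */
    static int writeUtf8(String str, OutputStream stream, boolean replace) throws IOException {
        if (str == null || stream == null) {
            throw new NullPointerException();
        }
        for (int i = 0; i < str.length(); i++) {
            int cp = str.codePointAt(i);
            if (cp >= 0x10000) {
                i++; // a surrogate pair was consumed
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                if (!replace) {
                    return -1; // earlier bytes have already been written
                }
                cp = 0xFFFD;
            }
            if (cp < 0x80) {
                stream.write(cp);
            } else if (cp < 0x800) {
                stream.write(0xC0 | (cp >> 6));
                stream.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                stream.write(0xE0 | (cp >> 12));
                stream.write(0x80 | ((cp >> 6) & 0x3F));
                stream.write(0x80 | (cp & 0x3F));
            } else {
                stream.write(0xF0 | (cp >> 18));
                stream.write(0x80 | ((cp >> 12) & 0x3F));
                stream.write(0x80 | ((cp >> 6) & 0x3F));
                stream.write(0x80 | (cp & 0x3F));
            }
        }
        return 0;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(writeUtf8("A\u00E9", out, false)); // prints 0
        System.out.println(out.size());                       // prints 3
    }
}
```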
-
ReadUtf8FromBytes
public static int ReadUtf8FromBytes(byte[] data, int offset, int bytesCount, java.lang.StringBuilder builder, boolean replace)
Reads a string in UTF-8 encoding from a byte array.
- Parameters:
data - A byte array containing a UTF-8 text string.
offset - Offset into the byte array to start reading.
bytesCount - Length, in bytes, of the UTF-8 text string.
builder - A string builder object where the resulting string will be stored.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
- Returns:
- 0 if the entire string was read without errors, or -1 if the string is not valid UTF-8 and replace is false.
- Throws:
java.lang.NullPointerException - The parameter data is null or builder is null.
java.lang.IllegalArgumentException - The parameter offset is less than 0, bytesCount is less than 0, or offset plus bytesCount is greater than the length of data.
-
ReadUtf8ToString
public static java.lang.String ReadUtf8ToString(java.io.InputStream stream) throws java.io.IOException
Reads a string in UTF-8 encoding from a data stream in full and returns that string. Replaces invalid encoding with the replacement character (U+FFFD).
- Parameters:
stream - A readable data stream.
- Returns:
- The string read.
- Throws:
java.io.IOException - An I/O error occurred.
java.lang.NullPointerException - The parameter stream is null.
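For comparison, reading a whole stream as UTF-8 with replacement of invalid input can be done with the JDK's InputStreamReader, which substitutes U+FFFD for malformed sequences when it is constructed with a Charset. The sketch below uses hypothetical names and only approximates the method described above; the exact number of replacement characters produced for a given malformed sequence may differ from this library's output.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class ReadUtf8ToStringSketch {
    /** Reads the stream to its end as UTF-8, replacing malformed input with U+FFFD. */
    static String readAll(InputStream stream) throws IOException {
        if (stream == null) {
            throw new NullPointerException("stream");
        }
        Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0x68, 0x69, (byte) 0xFF}; // "hi" followed by an invalid byte
        System.out.println(readAll(new ByteArrayInputStream(data)).length()); // prints 3
    }
}
```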
-
ReadUtf8ToString
public static java.lang.String ReadUtf8ToString(java.io.InputStream stream, int bytesCount, boolean replace) throws java.io.IOException
Reads a string in UTF-8 encoding from a data stream and returns that string.
- Parameters:
stream - A readable data stream.
bytesCount - The length, in bytes, of the string. If this is less than 0, this function will read until the end of the stream.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, throws an error if invalid UTF-8 is seen.
- Returns:
- The string read.
- Throws:
java.io.IOException - An I/O error occurred; or, the string is not valid UTF-8 and replace is false.
java.lang.NullPointerException - The parameter stream is null.
-
ReadUtf8
public static int ReadUtf8(java.io.InputStream stream, int bytesCount, java.lang.StringBuilder builder, boolean replace) throws java.io.IOException
Reads a string in UTF-8 encoding from a data stream.
- Parameters:
stream - A readable data stream.
bytesCount - The length, in bytes, of the string. If this is less than 0, this function will read until the end of the stream.
builder - A string builder object where the resulting string will be stored.
replace - If true, replaces invalid encoding with the replacement character (U+FFFD). If false, stops processing when invalid UTF-8 is seen.
- Returns:
- 0 if the entire string was read without errors, -1 if the string is not valid UTF-8 and replace is false, or -2 if the end of the stream was reached before the last character was read completely (which is only the case if bytesCount is 0 or greater).
- Throws:
java.io.IOException - An I/O error occurred.
java.lang.NullPointerException - The parameter stream is null or builder is null.
-
-