java.lang.Object
  net.sf.mmm.util.component.base.AbstractComponent
    net.sf.mmm.util.component.base.AbstractLoggableComponent
      net.sf.mmm.util.io.base.EncodingUtilImpl
@Singleton @Named public class EncodingUtilImpl
This is the implementation of the EncodingUtil interface.
See Also: getInstance()
Nested Class Summary | |
---|---|
protected static class | EncodingUtilImpl.AsciiProcessor - This inner class is used to process the bytes from the underlying InputStream in ASCII mode. |
protected static class | EncodingUtilImpl.Surrogate - This enum represents the type of a surrogate from a UTF-16 sequence. |
protected static class | EncodingUtilImpl.UtfDetectionProcessor - This inner class is used to perform the actual UTF detection. |
protected class | EncodingUtilImpl.UtfDetectionReader |
Field Summary | |
---|---|
private static EncodingUtil | instance |
private static int | RANK_BOM - The rank gain if a proper ByteOrderMark was detected. |
private static int | RANK_UTF16_SURROGATE - The rank gain if a UTF-16 surrogate pair was detected. |
private static int | RANK_UTF8_SEQUNCE - The rank gain if a proper UTF-8 multi-byte sequence was detected. |
static byte | UTF_16_FIRST_SURROGATE_MAX - A UTF-16 four-byte sequence consists of two two-byte sequences called surrogates. |
static byte | UTF_16_FIRST_SURROGATE_MIN - A UTF-16 four-byte sequence consists of two two-byte sequences called surrogates. |
static byte | UTF_16_SECOND_SURROGATE_MAX - A UTF-16 four-byte sequence consists of two two-byte sequences called surrogates. |
static byte | UTF_16_SECOND_SURROGATE_MIN - A UTF-16 four-byte sequence consists of two two-byte sequences called surrogates. |
static byte | UTF_8_CONTINUATION_BYTE_MAX - In a UTF-8 multi-byte sequence, all bytes except the first one have the form 10xxxxxx. |
static byte | UTF_8_CONTINUATION_BYTE_MIN - In a UTF-8 multi-byte sequence, all bytes except the first one have the form 10xxxxxx. |
static byte | UTF_8_FOUR_BYTE_MAX - A UTF-8 four-byte sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. |
static byte | UTF_8_FOUR_BYTE_MIN - A UTF-8 four-byte sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. |
static byte | UTF_8_THREE_BYTE_MAX - A UTF-8 three-byte sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. |
static byte | UTF_8_THREE_BYTE_MIN - A UTF-8 three-byte sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. |
static byte | UTF_8_TWO_BYTE_MAX - A UTF-8 two-byte sequence has the form 110xxxxx 10xxxxxx. |
static byte | UTF_8_TWO_BYTE_MIN - A UTF-8 two-byte sequence has the form 110xxxxx 10xxxxxx. |
Constructor Summary | |
---|---|
EncodingUtilImpl()
The constructor. |
Method Summary | |
---|---|
EncodingDetectionReader | createUtfDetectionReader(InputStream inputStream, String nonUtfEncoding) - This method creates a new Reader for the given inputStream. |
static EncodingUtil | getInstance() - This method gets the singleton instance of this EncodingUtilImpl. |
Methods inherited from class net.sf.mmm.util.component.base.AbstractLoggableComponent |
---|
doInitialize, getLogger, setLogger |
Methods inherited from class net.sf.mmm.util.component.base.AbstractComponent |
---|
doInitialized, getInitializationState, initialize |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
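public static final byte UTF_8_CONTINUATION_BYTE_MIN
In a UTF-8 multi-byte sequence, all bytes except the first one have the form 10xxxxxx. This is the lower bound to detect such a byte.
public static final byte UTF_8_CONTINUATION_BYTE_MAX
In a UTF-8 multi-byte sequence, all bytes except the first one have the form 10xxxxxx. This is the upper bound to detect such a byte.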
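As a rough illustration of how such bounds can be applied, the sketch below tests whether a byte has the form 10xxxxxx. The class name, the constant names, and the values 0x80 and 0xBF are assumptions derived from the bit pattern; the actual values used by EncodingUtilImpl are not shown on this page.

```java
/** Hypothetical helper for illustration; not part of EncodingUtilImpl. */
final class ContinuationByteSketch {

  // Assumed values: 10000000 and 10111111 are the smallest and the largest
  // byte of the form 10xxxxxx.
  static final byte CONTINUATION_MIN = (byte) 0x80;
  static final byte CONTINUATION_MAX = (byte) 0xBF;

  /** Returns true if the given byte has the form 10xxxxxx. */
  static boolean isContinuationByte(byte b) {
    // Java bytes are signed: 0x80..0xBF map to -128..-65, so the range stays
    // contiguous and a plain signed comparison is sufficient here.
    return b >= CONTINUATION_MIN && b <= CONTINUATION_MAX;
  }
}
```

Because Java bytes are signed, this direct comparison only works for ranges that do not cross 0x7F/0x80; masking with `& 0xFF` first is the more general approach.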
public static final byte UTF_8_TWO_BYTE_MIN
A UTF-8 two-byte sequence has the form 110xxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence. A first byte of 0xC0 or 0xC1 would indicate a two-byte sequence with a code point <= 127, which makes no sense.
public static final byte UTF_8_TWO_BYTE_MAX
A UTF-8 two-byte sequence has the form 110xxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence.
public static final byte UTF_8_THREE_BYTE_MIN
A UTF-8 three-byte sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence.
public static final byte UTF_8_THREE_BYTE_MAX
A UTF-8 three-byte sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence.
public static final byte UTF_8_FOUR_BYTE_MIN
A UTF-8 four-byte sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence.
public static final byte UTF_8_FOUR_BYTE_MAX
A UTF-8 four-byte sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence. A first byte of 0xF5, 0xF6, or 0xF7 would lead to a four-byte sequence with a code point greater than 10FFFF, which is excluded by RFC 3629.
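The first-byte ranges described above can be sketched as a small classifier. The class and method names are hypothetical, and the numeric bounds are assumptions inferred from the bit patterns and the exclusions mentioned in the text (0xC0, 0xC1 and 0xF5 to 0xF7); they are not copied from the constants of EncodingUtilImpl.

```java
/** Hypothetical helper for illustration; not part of EncodingUtilImpl. */
final class Utf8LeadByteSketch {

  /**
   * Returns the expected length (2, 3 or 4) of the UTF-8 sequence started by
   * the given byte, or 0 if the byte cannot start a multi-byte sequence.
   */
  static int sequenceLength(byte leadByte) {
    int b = leadByte & 0xFF; // view the signed byte as a value in 0..255
    if (b >= 0xC2 && b <= 0xDF) {
      return 2; // 110xxxxx, excluding the overlong lead bytes 0xC0 and 0xC1
    } else if (b >= 0xE0 && b <= 0xEF) {
      return 3; // 1110xxxx
    } else if (b >= 0xF0 && b <= 0xF4) {
      return 4; // 11110xxx, excluding 0xF5..0xF7 (code points above 10FFFF)
    }
    return 0; // ASCII byte, continuation byte, or invalid lead byte
  }
}
```

For example, `sequenceLength((byte) 0xE2)` yields 3, matching the first byte of the three-byte encoding of the euro sign (E2 82 AC).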
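public static final byte UTF_16_FIRST_SURROGATE_MIN
A UTF-16 four-byte sequence consists of two two-byte sequences called surrogates. The first surrogate has the form 110110xx xxxxxxxx. This is the lower bound to detect the first byte of such a surrogate.
public static final byte UTF_16_FIRST_SURROGATE_MAX
A UTF-16 four-byte sequence consists of two two-byte sequences called surrogates. The first surrogate has the form 110110xx xxxxxxxx. This is the upper bound to detect the first byte of such a surrogate.
public static final byte UTF_16_SECOND_SURROGATE_MIN
A UTF-16 four-byte sequence consists of two two-byte sequences called surrogates. The second surrogate has the form 110111xx xxxxxxxx. This is the lower bound to detect the first byte of such a surrogate.
public static final byte UTF_16_SECOND_SURROGATE_MAX
A UTF-16 four-byte sequence consists of two two-byte sequences called surrogates. The second surrogate has the form 110111xx xxxxxxxx. This is the upper bound to detect the first byte of such a surrogate.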
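A minimal sketch of how the two surrogate patterns could be told apart by looking at the high-order byte of a 16-bit code unit. The class, the enum, and the bounds 0xD8 to 0xDB and 0xDC to 0xDF are assumptions derived from the bit patterns 110110xx and 110111xx; the nested EncodingUtilImpl.Surrogate type may look different.

```java
/** Hypothetical helper for illustration; not part of EncodingUtilImpl. */
final class Utf16SurrogateSketch {

  /** Hypothetical classification result. */
  enum Kind { FIRST, SECOND, NONE }

  /** Classifies the high-order byte of a 16-bit UTF-16 code unit. */
  static Kind classify(byte highByte) {
    int b = highByte & 0xFF; // view the signed byte as a value in 0..255
    if (b >= 0xD8 && b <= 0xDB) {
      return Kind.FIRST; // 110110xx
    }
    if (b >= 0xDC && b <= 0xDF) {
      return Kind.SECOND; // 110111xx
    }
    return Kind.NONE; // not part of a surrogate pair
  }
}
```

For the code point U+1F600, which is encoded as the code units D83D DE00, the high-order bytes 0xD8 and 0xDE classify as FIRST and SECOND respectively.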
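private static final int RANK_BOM
The rank gain if a proper ByteOrderMark was detected.
private static final int RANK_UTF8_SEQUNCE
The rank gain if a proper UTF-8 multi-byte sequence was detected.
private static final int RANK_UTF16_SURROGATE
The rank gain if a UTF-16 surrogate pair was detected.
private static EncodingUtil instance
See Also: getInstance()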
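The RANK_* fields suggest that detected evidence adds to a score per candidate encoding, but neither their values nor the scoring logic appear on this page. The sketch below therefore only illustrates the byte order mark check that would trigger such a gain; the class and method names are hypothetical, and the byte sequences are the standard BOMs for UTF-8 and UTF-16.

```java
/** Hypothetical helper for illustration; not part of EncodingUtilImpl. */
final class BomSketch {

  /**
   * Returns the charset name indicated by a leading byte order mark, or null
   * if the data does not start with a UTF-8 or UTF-16 BOM. UTF-32 BOMs are
   * deliberately ignored in this sketch, so FF FE 00 00 is reported as UTF-16LE.
   */
  static String detectBom(byte[] data) {
    if (data.length >= 3
        && (data[0] & 0xFF) == 0xEF
        && (data[1] & 0xFF) == 0xBB
        && (data[2] & 0xFF) == 0xBF) {
      return "UTF-8";
    }
    if (data.length >= 2) {
      int b0 = data[0] & 0xFF;
      int b1 = data[1] & 0xFF;
      if (b0 == 0xFE && b1 == 0xFF) {
        return "UTF-16BE";
      }
      if (b0 == 0xFF && b1 == 0xFE) {
        return "UTF-16LE";
      }
    }
    return null; // no BOM found
  }
}
```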
Constructor Detail |
---|
public EncodingUtilImpl()
Method Detail |
---|
public static EncodingUtil getInstance()
This method gets the singleton instance of this EncodingUtilImpl. For IoC usage, ignore the static getInstance() methods and construct new instances via the container framework of your choice (such as Plexus, Pico, or the Spring Framework). To wire up the dependent components, everything is properly annotated using common-annotations (JSR-250). If your container does NOT support this, you should consider using a better one.
public EncodingDetectionReader createUtfDetectionReader(InputStream inputStream, String nonUtfEncoding)
This method creates a new Reader for the given inputStream. The EncodingDetectionReader automatically detects UTF (Unicode Transformation Format) encodings. If the data provided by inputStream is NOT in such an encoding, it will use the given nonUtfEncoding as fallback.
The EncodingDetectionReader will behave like InputStreamReader but with an encoding that is automatically detected whilst reading. It will use a lookahead buffer to detect the encoding. As long as no UTF characteristic has been detected and only ASCII characters (< 128) are hit, the encoding remains EncodingUtil.ENCODING_US_ASCII. As soon as a UTF sequence is detected (e.g. EncodingUtil.ENCODING_UTF_8 or EncodingUtil.ENCODING_UTF_16_BE), the encoding switches to that encoding. If a non-ASCII character is hit and no UTF encoding is detected, the EncodingDetectionReader switches to the given nonUtfEncoding.
Specified by:
createUtfDetectionReader in interface EncodingUtil
Parameters:
inputStream - is the InputStream to decode and read.
nonUtfEncoding - is the encoding to use in case the data is NOT encoded in UTF (e.g. EncodingUtil.ENCODING_ISO_8859_15). It is pointless to use a UTF-based encoding or EncodingUtil.ENCODING_US_ASCII here.
Returns:
an EncodingDetectionReader that can be used to read the inputStream.
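The decision rule described above (stay with US-ASCII, switch to UTF on evidence, otherwise fall back) can be approximated with the JDK alone. The following is a simplified sketch under stated assumptions, not the implementation of EncodingDetectionReader: the class and method names are hypothetical, it inspects a complete byte array instead of a streaming lookahead buffer, and it only checks for UTF-8, leaving out UTF-16 and byte order marks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of EncodingUtilImpl. */
final class EncodingGuessSketch {

  /**
   * Guesses the charset of the given bytes: US-ASCII while only bytes below
   * 128 are seen, UTF-8 if every non-ASCII byte belongs to a well-formed
   * multi-byte sequence, otherwise the given fallback.
   */
  static Charset guess(byte[] lookahead, Charset nonUtfFallback) {
    boolean sawMultiByteSequence = false;
    int i = 0;
    while (i < lookahead.length) {
      int b = lookahead[i] & 0xFF;
      if (b < 0x80) {
        i++; // plain ASCII, no evidence either way
        continue;
      }
      int length;
      if (b >= 0xC2 && b <= 0xDF) {
        length = 2;
      } else if (b >= 0xE0 && b <= 0xEF) {
        length = 3;
      } else if (b >= 0xF0 && b <= 0xF4) {
        length = 4;
      } else {
        return nonUtfFallback; // non-ASCII byte that cannot start a UTF-8 sequence
      }
      if (i + length > lookahead.length) {
        return nonUtfFallback; // simplification: a truncated sequence counts as non-UTF
      }
      for (int k = 1; k < length; k++) {
        if ((lookahead[i + k] & 0xC0) != 0x80) {
          return nonUtfFallback; // expected a continuation byte of the form 10xxxxxx
        }
      }
      sawMultiByteSequence = true;
      i += length;
    }
    return sawMultiByteSequence ? StandardCharsets.UTF_8 : StandardCharsets.US_ASCII;
  }
}
```

With the documented API, the equivalent decision is made internally while reading; going by the signature above, a caller would pass the stream and a fallback such as EncodingUtil.ENCODING_ISO_8859_15 to createUtfDetectionReader and then read from the returned reader.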