EncodingUtilImpl (util-core 2.0.1 API)

net.sf.mmm.util.io.base
Class EncodingUtilImpl

java.lang.Object
  net.sf.mmm.util.component.base.AbstractComponent
      net.sf.mmm.util.component.base.AbstractLoggableComponent
          net.sf.mmm.util.io.base.EncodingUtilImpl

All Implemented Interfaces:: EncodingUtil

@Singleton @Named public class EncodingUtilImpl
extends AbstractLoggableComponent
implements EncodingUtil

This is the implementation of the EncodingUtil interface.

Since:: 1.0.1
Author:: Joerg Hohwiller (hohwille at users.sourceforge.net)
See Also:: getInstance()

Nested Class Summary
`protected static class`	`EncodingUtilImpl.AsciiProcessor` This inner class is used to process the bytes from the underlying `InputStream` in ASCII mode.
`protected static class`	`EncodingUtilImpl.Surrogate` This enum represents the type of a surrogate in a UTF-16 sequence.
`protected static class`	`EncodingUtilImpl.UtfDetectionProcessor` This inner class is used to perform the actual UTF detection.
`protected class`	`EncodingUtilImpl.UtfDetectionReader`

Field Summary
`private static EncodingUtil`	`instance`
`private static int`	`RANK_BOM` The rank gain if a proper `ByteOrderMark` was detected.
`private static int`	`RANK_UTF16_SURROGATE` The rank gain if a UTF-16 surrogate pair was detected.
`private static int`	`RANK_UTF8_SEQUNCE` The rank gain if a proper UTF-8 multi-byte sequence was detected.
`static byte`	`UTF_16_FIRST_SURROGATE_MAX` A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
`static byte`	`UTF_16_FIRST_SURROGATE_MIN` A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
`static byte`	`UTF_16_SECOND_SURROGATE_MAX` A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
`static byte`	`UTF_16_SECOND_SURROGATE_MIN` A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
`static byte`	`UTF_8_CONTINUATION_BYTE_MAX` In a UTF-8 multi-byte-sequence all bytes except the first one have the form `10xxxxxx`.
`static byte`	`UTF_8_CONTINUATION_BYTE_MIN` In a UTF-8 multi-byte-sequence all bytes except the first one have the form `10xxxxxx`.
`static byte`	`UTF_8_FOUR_BYTE_MAX` A UTF-8 four-byte-sequence has the form `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`.
`static byte`	`UTF_8_FOUR_BYTE_MIN` A UTF-8 four-byte-sequence has the form `11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`.
`static byte`	`UTF_8_THREE_BYTE_MAX` A UTF-8 three-byte-sequence has the form `1110xxxx 10xxxxxx 10xxxxxx`.
`static byte`	`UTF_8_THREE_BYTE_MIN` A UTF-8 three-byte-sequence has the form `1110xxxx 10xxxxxx 10xxxxxx`.
`static byte`	`UTF_8_TWO_BYTE_MAX` A UTF-8 two-byte-sequence has the form `110xxxxx 10xxxxxx`.
`static byte`	`UTF_8_TWO_BYTE_MIN` A UTF-8 two-byte-sequence has the form `110xxxxx 10xxxxxx`.

Fields inherited from interface net.sf.mmm.util.io.api.EncodingUtil
ENCODING_CP_437, ENCODING_CP_737, ENCODING_CP_850, ENCODING_CP_852, ENCODING_CP_855, ENCODING_CP_857, ENCODING_CP_858, ENCODING_CP_860, ENCODING_CP_861, ENCODING_CP_863, ENCODING_CP_865, ENCODING_CP_866, ENCODING_CP_869, ENCODING_ISO_8859_1, ENCODING_ISO_8859_10, ENCODING_ISO_8859_11, ENCODING_ISO_8859_12, ENCODING_ISO_8859_13, ENCODING_ISO_8859_14, ENCODING_ISO_8859_15, ENCODING_ISO_8859_16, ENCODING_ISO_8859_2, ENCODING_ISO_8859_3, ENCODING_ISO_8859_4, ENCODING_ISO_8859_5, ENCODING_ISO_8859_6, ENCODING_ISO_8859_7, ENCODING_ISO_8859_8, ENCODING_ISO_8859_9, ENCODING_KOI8_R, ENCODING_KOI8_U, ENCODING_US_ASCII, ENCODING_UTF_16, ENCODING_UTF_16_BE, ENCODING_UTF_16_LE, ENCODING_UTF_32, ENCODING_UTF_32_BE, ENCODING_UTF_32_LE, ENCODING_UTF_8, ENCODING_WINDOWS_1250, ENCODING_WINDOWS_1251, ENCODING_WINDOWS_1252, ENCODING_WINDOWS_1253, ENCODING_WINDOWS_1254, ENCODING_WINDOWS_1255, ENCODING_WINDOWS_1256, ENCODING_WINDOWS_1257, ENCODING_WINDOWS_1258, SYSTEM_DEFAULT_ENCODING

Constructor Summary
`EncodingUtilImpl()` The constructor.

Method Summary
`EncodingDetectionReader`	`createUtfDetectionReader(InputStream inputStream, String nonUtfEncoding)` This method creates a new `Reader` for the given `inputStream`.
`static EncodingUtil`	`getInstance()` This method gets the singleton instance of this `EncodingUtilImpl`.

Methods inherited from class net.sf.mmm.util.component.base.AbstractLoggableComponent
`doInitialize, getLogger, setLogger`

Methods inherited from class net.sf.mmm.util.component.base.AbstractComponent
`doInitialized, getInitializationState, initialize`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

UTF_8_CONTINUATION_BYTE_MIN

public static final byte UTF_8_CONTINUATION_BYTE_MIN

In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. This is the lower bound to detect such a byte.

See Also:: Constant Field Values

UTF_8_CONTINUATION_BYTE_MAX

public static final byte UTF_8_CONTINUATION_BYTE_MAX

In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. This is the upper bound to detect such a byte.

See Also:: Constant Field Values

UTF_8_TWO_BYTE_MIN

public static final byte UTF_8_TWO_BYTE_MIN

A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence.
ATTENTION:
The bytes 0xC0 and 0xC1 would indicate a two-byte-sequence for a code point <= 127, which makes no sense because such code points are encoded as a single byte.

See Also:: Constant Field Values

UTF_8_TWO_BYTE_MAX

public static final byte UTF_8_TWO_BYTE_MAX

A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence.

See Also:: Constant Field Values

UTF_8_THREE_BYTE_MIN

public static final byte UTF_8_THREE_BYTE_MIN

A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence.

See Also:: Constant Field Values

UTF_8_THREE_BYTE_MAX

public static final byte UTF_8_THREE_BYTE_MAX

A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence.

See Also:: Constant Field Values

UTF_8_FOUR_BYTE_MIN

public static final byte UTF_8_FOUR_BYTE_MIN

A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the lower bound to detect the first byte of such a sequence.

See Also:: Constant Field Values

UTF_8_FOUR_BYTE_MAX

public static final byte UTF_8_FOUR_BYTE_MAX

A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the upper bound to detect the first byte of such a sequence.
ATTENTION:
The bytes 0xF5, 0xF6, and 0xF7 would lead to a four-byte-sequence with a code point greater than U+10FFFF, which RFC 3629 prohibits.

See Also:: Constant Field Values
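
The bit patterns and ATTENTION notes above imply a simple range check on each byte. The following is a minimal sketch of such a check, not the library's code: the class name `Utf8ByteClassifier` and the numeric bounds are assumptions derived from the patterns described here, since the actual constant values are only listed under Constant Field Values.

```java
final class Utf8ByteClassifier {

    // Assumed bounds, derived from the documented bit patterns.
    static final int CONTINUATION_MIN = 0x80; // 10000000
    static final int CONTINUATION_MAX = 0xBF; // 10111111
    static final int TWO_BYTE_MIN = 0xC2;     // 0xC0 and 0xC1 are excluded (code point <= 127)
    static final int TWO_BYTE_MAX = 0xDF;     // 11011111
    static final int THREE_BYTE_MIN = 0xE0;   // 11100000
    static final int THREE_BYTE_MAX = 0xEF;   // 11101111
    static final int FOUR_BYTE_MIN = 0xF0;    // 11110000
    static final int FOUR_BYTE_MAX = 0xF4;    // 0xF5..0xF7 are excluded (beyond U+10FFFF)

    /**
     * Returns the sequence length announced by a lead byte:
     * 1 for ASCII, 2 to 4 for a multi-byte lead byte, 0 for anything else.
     */
    public static int sequenceLength(byte b) {
        int u = b & 0xFF; // Java bytes are signed, so compare the unsigned value
        if (u < 0x80) {
            return 1;
        }
        if (u >= TWO_BYTE_MIN && u <= TWO_BYTE_MAX) {
            return 2;
        }
        if (u >= THREE_BYTE_MIN && u <= THREE_BYTE_MAX) {
            return 3;
        }
        if (u >= FOUR_BYTE_MIN && u <= FOUR_BYTE_MAX) {
            return 4;
        }
        return 0; // continuation byte or invalid lead byte
    }

    /** Returns true if the byte has the form 10xxxxxx. */
    public static boolean isContinuation(byte b) {
        int u = b & 0xFF;
        return u >= CONTINUATION_MIN && u <= CONTINUATION_MAX;
    }
}
```

Masking with `0xFF` matters because a `byte` such as `(byte) 0xC3` is negative in Java, so a direct comparison against the bounds would fail.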

UTF_16_FIRST_SURROGATE_MIN

public static final byte UTF_16_FIRST_SURROGATE_MIN

A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The first has the form 110110xx xxxxxxxx. This is the lower bound to detect the first byte of such a surrogate.

See Also:: Constant Field Values

UTF_16_FIRST_SURROGATE_MAX

public static final byte UTF_16_FIRST_SURROGATE_MAX

A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The first has the form 110110xx xxxxxxxx. This is the upper bound to detect the first byte of such a surrogate.

See Also:: Constant Field Values

UTF_16_SECOND_SURROGATE_MIN

public static final byte UTF_16_SECOND_SURROGATE_MIN

A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The second has the form 110111xx xxxxxxxx. This is the lower bound to detect the first byte of such a surrogate.

See Also:: Constant Field Values

UTF_16_SECOND_SURROGATE_MAX

public static final byte UTF_16_SECOND_SURROGATE_MAX

A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The second has the form 110111xx xxxxxxxx. This is the upper bound to detect the first byte of such a surrogate.

See Also:: Constant Field Values
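
To illustrate the surrogate patterns above, here is a small sketch that checks the high byte of each two-byte unit and combines a big-endian pair into a code point. The class name `Utf16SurrogateSketch`, its methods, and the numeric bounds are assumptions derived from the bit patterns 110110xx and 110111xx. They are not taken from the library source.

```java
final class Utf16SurrogateSketch {

    // Assumed bounds for the high byte of each surrogate.
    static final int FIRST_SURROGATE_MIN = 0xD8;  // 11011000
    static final int FIRST_SURROGATE_MAX = 0xDB;  // 11011011
    static final int SECOND_SURROGATE_MIN = 0xDC; // 11011100
    static final int SECOND_SURROGATE_MAX = 0xDF; // 11011111

    public static boolean isFirstSurrogate(byte highByte) {
        int u = highByte & 0xFF;
        return u >= FIRST_SURROGATE_MIN && u <= FIRST_SURROGATE_MAX;
    }

    public static boolean isSecondSurrogate(byte highByte) {
        int u = highByte & 0xFF;
        return u >= SECOND_SURROGATE_MIN && u <= SECOND_SURROGATE_MAX;
    }

    /**
     * Decodes a big-endian surrogate pair starting at offset.
     * Returns the code point, or -1 if the four bytes are not a valid pair.
     */
    public static int decodeBigEndianPair(byte[] data, int offset) {
        if (offset < 0 || data.length - offset < 4) {
            return -1;
        }
        if (!isFirstSurrogate(data[offset]) || !isSecondSurrogate(data[offset + 2])) {
            return -1;
        }
        char high = (char) (((data[offset] & 0xFF) << 8) | (data[offset + 1] & 0xFF));
        char low = (char) (((data[offset + 2] & 0xFF) << 8) | (data[offset + 3] & 0xFF));
        return Character.toCodePoint(high, low);
    }
}
```

For little-endian data the high byte of each unit is the second byte rather than the first, so the same range checks would apply to `data[offset + 1]` and `data[offset + 3]`.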

RANK_BOM

private static final int RANK_BOM

The rank gain if a proper ByteOrderMark was detected.

See Also:: Constant Field Values
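
As background for what detecting a byte order mark involves, the sketch below matches the standard Unicode BOM byte patterns at the start of a buffer. It is an illustration only: the class `BomSketch` and its method are hypothetical, and it uses neither the library's `ByteOrderMark` type nor the actual value of `RANK_BOM`.

```java
final class BomSketch {

    /** Returns the encoding name indicated by a leading BOM, or null if there is none. */
    public static String detectBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        // UTF-32 is checked before UTF-16 because FF FE 00 00 starts with the UTF-16LE BOM.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) {
            return "UTF-32BE";
        }
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) {
            return "UTF-32LE";
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```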

RANK_UTF8_SEQUNCE

private static final int RANK_UTF8_SEQUNCE

The rank gain if a proper UTF-8 multi-byte sequence was detected.

See Also:: Constant Field Values

RANK_UTF16_SURROGATE

private static final int RANK_UTF16_SURROGATE

The rank gain if a UTF-16 surrogate pair was detected.

See Also:: Constant Field Values

instance

private static EncodingUtil instance

See Also:: getInstance()

Constructor Detail

EncodingUtilImpl

public EncodingUtilImpl()

The constructor.

Method Detail

getInstance

public static EncodingUtil getInstance()

This method gets the singleton instance of this EncodingUtilImpl.
This design is the best compromise between easy access (via this indirection you have direct, static access to all offered functionality) and IoC-style design which allows extension and customization.
For IoC usage, simply ignore all static getInstance() methods and construct new instances via the container-framework of your choice (like plexus, pico, springframework, etc.). To wire up the dependent components everything is properly annotated using common-annotations (JSR-250). If your container does NOT support this, you should consider using a better one.

Returns:: the singleton instance.
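
The compromise described above can be pictured with a small stand-alone sketch. The names `TextUtil` and `TextUtilImpl` are hypothetical and exist only for this illustration. The lazy, synchronized initialization is one possible way to provide the static accessor and is not necessarily how `EncodingUtilImpl` does it.

```java
interface TextUtil {
    String shout(String text);
}

class TextUtilImpl implements TextUtil {

    private static TextUtil instance;

    // Public constructor: an IoC container can create and wire its own instance.
    public TextUtilImpl() {
    }

    // Static access for code that runs without a container.
    // The shared default instance is created on first use.
    public static synchronized TextUtil getInstance() {
        if (instance == null) {
            instance = new TextUtilImpl();
        }
        return instance;
    }

    @Override
    public String shout(String text) {
        return text.toUpperCase(java.util.Locale.ROOT) + "!";
    }
}
```

Code without a container calls `TextUtilImpl.getInstance()`, while a container ignores the static method and constructs, and possibly replaces or extends, the implementation itself.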

createUtfDetectionReader

public EncodingDetectionReader createUtfDetectionReader(InputStream inputStream,
                                                        String nonUtfEncoding)

This method creates a new Reader for the given inputStream. The EncodingDetectionReader automatically detects UTF (Unicode Transformation Format) encodings. If the data provided by inputStream is NOT in such an encoding, it will use the given nonUtfEncoding as fallback.
The EncodingDetectionReader will behave like InputStreamReader but with an encoding that is automatically detected whilst reading. It will use a lookahead buffer to detect the encoding. As long as no UTF characteristic has been detected and only ASCII characters (<128) are hit, the encoding remains EncodingUtil.ENCODING_US_ASCII. As soon as a UTF sequence is detected (e.g. EncodingUtil.ENCODING_UTF_8 or EncodingUtil.ENCODING_UTF_16_BE), the encoding switches to that encoding. If a non-ASCII character is hit and no UTF encoding is detected, the EncodingDetectionReader switches to the given nonUtfEncoding.

Specified by:: createUtfDetectionReader in interface EncodingUtil

Parameters:: inputStream - is the InputStream to decode and read.; nonUtfEncoding - is the encoding to use in case the data is NOT encoded in UTF (e.g. EncodingUtil.ENCODING_ISO_8859_15). It is pointless to use a UTF-based encoding or EncodingUtil.ENCODING_US_ASCII here.
Returns:: a new EncodingDetectionReader that can be used to read the inputStream.
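
The switching behaviour described for this method can be approximated by a much simplified, stand-alone sketch. It is not the library's algorithm: the class `UtfDetectionSketch` and its `detect` method are hypothetical, it works on a complete byte array instead of a lookahead buffer, and it only considers UTF-8 (no BOM, UTF-16, or ranking).

```java
final class UtfDetectionSketch {

    /**
     * Returns "US-ASCII" while only bytes below 128 are seen, "UTF-8" if every
     * non-ASCII byte is part of a well-formed UTF-8 sequence, and the given
     * fallback encoding as soon as a non-ASCII byte does not fit UTF-8.
     */
    public static String detect(byte[] data, String nonUtfEncoding) {
        boolean sawUtf8Sequence = false;
        int i = 0;
        while (i < data.length) {
            int u = data[i] & 0xFF; // unsigned value of the current byte
            if (u < 0x80) {
                i++; // plain ASCII, encoding stays undecided
                continue;
            }
            // Length announced by the lead byte, 0 if it is not a valid lead byte.
            int length = (u >= 0xC2 && u <= 0xDF) ? 2
                    : (u >= 0xE0 && u <= 0xEF) ? 3
                    : (u >= 0xF0 && u <= 0xF4) ? 4 : 0;
            if (length == 0 || i + length > data.length) {
                return nonUtfEncoding; // simplification: truncated sequences also fall back
            }
            for (int k = 1; k < length; k++) {
                int c = data[i + k] & 0xFF;
                if (c < 0x80 || c > 0xBF) {
                    return nonUtfEncoding; // expected a continuation byte 10xxxxxx
                }
            }
            sawUtf8Sequence = true;
            i += length;
        }
        return sawUtf8Sequence ? "UTF-8" : "US-ASCII";
    }
}
```

For example, the bytes `0xC3 0xA9` (the UTF-8 form of "é") yield `"UTF-8"`, whereas a lone `0xE9` followed by an ASCII letter (the ISO-8859-15 form of "é") yields the fallback encoding passed in.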