net.sf.mmm.util.io.base
Class EncodingUtilImpl

java.lang.Object
  extended by net.sf.mmm.util.component.base.AbstractComponent
      extended by net.sf.mmm.util.component.base.AbstractLoggableComponent
          extended by net.sf.mmm.util.io.base.EncodingUtilImpl
All Implemented Interfaces:
EncodingUtil

@Singleton
@Named
public class EncodingUtilImpl
extends AbstractLoggableComponent
implements EncodingUtil

This is the implementation of the EncodingUtil interface.

Since:
1.0.1
Author:
Joerg Hohwiller (hohwille at users.sourceforge.net)
See Also:
getInstance()

Nested Class Summary
protected static class EncodingUtilImpl.AsciiProcessor
          This inner class is used to process the bytes from the underlying InputStream in ASCII mode.
protected static class EncodingUtilImpl.Surrogate
          This enum represents the type of a EncodingUtilImpl.Surrogate from a UTF-16 sequence.
protected static class EncodingUtilImpl.UtfDetectionProcessor
          This inner class is used to perform the actual UTF detection.
protected  class EncodingUtilImpl.UtfDetectionReader
           
 
Field Summary
private static EncodingUtil instance
           
private static int RANK_BOM
          The rank gain if a proper ByteOrderMark was detected.
private static int RANK_UTF16_SURROGATE
          The rank gain if a UTF-16 surrogate pair was detected.
private static int RANK_UTF8_SEQUNCE
          The rank gain if a proper UTF-8 multi-byte sequence was detected.
static byte UTF_16_FIRST_SURROGATE_MAX
          A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
static byte UTF_16_FIRST_SURROGATE_MIN
          A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
static byte UTF_16_SECOND_SURROGATE_MAX
          A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
static byte UTF_16_SECOND_SURROGATE_MIN
          A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates.
static byte UTF_8_CONTINUATION_BYTE_MAX
          In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx.
static byte UTF_8_CONTINUATION_BYTE_MIN
          In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx.
static byte UTF_8_FOUR_BYTE_MAX
          A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
static byte UTF_8_FOUR_BYTE_MIN
          A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
static byte UTF_8_THREE_BYTE_MAX
          A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx.
static byte UTF_8_THREE_BYTE_MIN
          A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx.
static byte UTF_8_TWO_BYTE_MAX
          A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx.
static byte UTF_8_TWO_BYTE_MIN
          A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx.
 
Fields inherited from interface net.sf.mmm.util.io.api.EncodingUtil
ENCODING_CP_437, ENCODING_CP_737, ENCODING_CP_850, ENCODING_CP_852, ENCODING_CP_855, ENCODING_CP_857, ENCODING_CP_858, ENCODING_CP_860, ENCODING_CP_861, ENCODING_CP_863, ENCODING_CP_865, ENCODING_CP_866, ENCODING_CP_869, ENCODING_ISO_8859_1, ENCODING_ISO_8859_10, ENCODING_ISO_8859_11, ENCODING_ISO_8859_12, ENCODING_ISO_8859_13, ENCODING_ISO_8859_14, ENCODING_ISO_8859_15, ENCODING_ISO_8859_16, ENCODING_ISO_8859_2, ENCODING_ISO_8859_3, ENCODING_ISO_8859_4, ENCODING_ISO_8859_5, ENCODING_ISO_8859_6, ENCODING_ISO_8859_7, ENCODING_ISO_8859_8, ENCODING_ISO_8859_9, ENCODING_KOI8_R, ENCODING_KOI8_U, ENCODING_US_ASCII, ENCODING_UTF_16, ENCODING_UTF_16_BE, ENCODING_UTF_16_LE, ENCODING_UTF_32, ENCODING_UTF_32_BE, ENCODING_UTF_32_LE, ENCODING_UTF_8, ENCODING_WINDOWS_1250, ENCODING_WINDOWS_1251, ENCODING_WINDOWS_1252, ENCODING_WINDOWS_1253, ENCODING_WINDOWS_1254, ENCODING_WINDOWS_1255, ENCODING_WINDOWS_1256, ENCODING_WINDOWS_1257, ENCODING_WINDOWS_1258, SYSTEM_DEFAULT_ENCODING
 
Constructor Summary
EncodingUtilImpl()
          The constructor.
 
Method Summary
 EncodingDetectionReader createUtfDetectionReader(InputStream inputStream, String nonUtfEncoding)
          This method creates a new Reader for the given inputStream.
static EncodingUtil getInstance()
          This method gets the singleton instance of this EncodingUtilImpl.
 
Methods inherited from class net.sf.mmm.util.component.base.AbstractLoggableComponent
doInitialize, getLogger, setLogger
 
Methods inherited from class net.sf.mmm.util.component.base.AbstractComponent
doInitialized, getInitializationState, initialize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UTF_8_CONTINUATION_BYTE_MIN

public static final byte UTF_8_CONTINUATION_BYTE_MIN
In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. This is the lower bound for detecting such a byte.

See Also:
Constant Field Values

UTF_8_CONTINUATION_BYTE_MAX

public static final byte UTF_8_CONTINUATION_BYTE_MAX
In a UTF-8 multi-byte-sequence all bytes except the first one have the form 10xxxxxx. This is the upper bound for detecting such a byte.

See Also:
Constant Field Values

UTF_8_TWO_BYTE_MIN

public static final byte UTF_8_TWO_BYTE_MIN
A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. This is the lower bound for detecting the first byte of such a sequence.
ATTENTION:
The bytes 0xC0 and 0xC1 would indicate a two-byte-sequence with a code-point <= 127, which makes no sense because such code-points already fit into a single byte.

See Also:
Constant Field Values

UTF_8_TWO_BYTE_MAX

public static final byte UTF_8_TWO_BYTE_MAX
A UTF-8 two-byte-sequence has the form 110xxxxx 10xxxxxx. This is the upper bound for detecting the first byte of such a sequence.

See Also:
Constant Field Values

UTF_8_THREE_BYTE_MIN

public static final byte UTF_8_THREE_BYTE_MIN
A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the lower bound for detecting the first byte of such a sequence.

See Also:
Constant Field Values

UTF_8_THREE_BYTE_MAX

public static final byte UTF_8_THREE_BYTE_MAX
A UTF-8 three-byte-sequence has the form 1110xxxx 10xxxxxx 10xxxxxx. This is the upper bound for detecting the first byte of such a sequence.

See Also:
Constant Field Values

UTF_8_FOUR_BYTE_MIN

public static final byte UTF_8_FOUR_BYTE_MIN
A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the lower bound for detecting the first byte of such a sequence.

See Also:
Constant Field Values

UTF_8_FOUR_BYTE_MAX

public static final byte UTF_8_FOUR_BYTE_MAX
A UTF-8 four-byte-sequence has the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. This is the upper bound for detecting the first byte of such a sequence.
ATTENTION:
The bytes 0xF5, 0xF6, and 0xF7 would lead to a four-byte-sequence with a code-point greater than U+10FFFF, which is prohibited by RFC 3629.

See Also:
Constant Field Values
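
The UTF-8 bounds described above can be illustrated with a small classifier. This is only a sketch: the class Utf8ByteClassifier and its method names are hypothetical, and the numeric values are assumptions derived from the bit patterns and ATTENTION notes in this section, not values copied from the library's Constant Field Values page. Because Java bytes are signed, the sketch masks each byte to the range 0..255 before comparing.

```java
/** Hypothetical helper for illustration only; not part of the documented library. */
final class Utf8ByteClassifier {

  // Assumed bounds, derived from the bit patterns documented above.
  static final int CONTINUATION_MIN = 0x80; // 10000000
  static final int CONTINUATION_MAX = 0xBF; // 10111111
  static final int TWO_BYTE_MIN = 0xC2;     // 0xC0 and 0xC1 are excluded (code-point <= 127)
  static final int TWO_BYTE_MAX = 0xDF;     // 11011111
  static final int THREE_BYTE_MIN = 0xE0;   // 11100000
  static final int THREE_BYTE_MAX = 0xEF;   // 11101111
  static final int FOUR_BYTE_MIN = 0xF0;    // 11110000
  static final int FOUR_BYTE_MAX = 0xF4;    // 0xF5 to 0xF7 are excluded (beyond U+10FFFF)

  private Utf8ByteClassifier() {
  }

  /**
   * Returns the total sequence length announced by a first byte: 1 for ASCII,
   * 2 to 4 for the first byte of a multi-byte-sequence, and 0 if the byte
   * cannot start a sequence (continuation byte or invalid value).
   */
  static int sequenceLength(byte b) {
    int value = b & 0xFF; // mask the signed byte to 0..255
    if (value < 0x80) {
      return 1;
    }
    if (value >= TWO_BYTE_MIN && value <= TWO_BYTE_MAX) {
      return 2;
    }
    if (value >= THREE_BYTE_MIN && value <= THREE_BYTE_MAX) {
      return 3;
    }
    if (value >= FOUR_BYTE_MIN && value <= FOUR_BYTE_MAX) {
      return 4;
    }
    return 0;
  }

  /** Returns true if the byte has the form 10xxxxxx. */
  static boolean isContinuation(byte b) {
    int value = b & 0xFF;
    return value >= CONTINUATION_MIN && value <= CONTINUATION_MAX;
  }
}
```

For instance, the UTF-8 encoding of U+00E9 is the byte pair 0xC3 0xA9: the first byte announces a two-byte-sequence and the second is a continuation byte.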

UTF_16_FIRST_SURROGATE_MIN

public static final byte UTF_16_FIRST_SURROGATE_MIN
A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The first has the form 110110xx xxxxxxxx. This is the lower bound for detecting the first byte of such a surrogate.

See Also:
Constant Field Values

UTF_16_FIRST_SURROGATE_MAX

public static final byte UTF_16_FIRST_SURROGATE_MAX
A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The first has the form 110110xx xxxxxxxx. This is the upper bound for detecting the first byte of such a surrogate.

See Also:
Constant Field Values

UTF_16_SECOND_SURROGATE_MIN

public static final byte UTF_16_SECOND_SURROGATE_MIN
A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The second has the form 110111xx xxxxxxxx. This is the lower bound for detecting the first byte of such a surrogate.

See Also:
Constant Field Values

UTF_16_SECOND_SURROGATE_MAX

public static final byte UTF_16_SECOND_SURROGATE_MAX
A UTF-16 four-byte-sequence consists of two two-byte-sequences called surrogates. The second has the form 110111xx xxxxxxxx. This is the upper bound for detecting the first byte of such a surrogate.

See Also:
Constant Field Values
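
The surrogate bounds above can be applied as in the following sketch. The class Utf16SurrogateClassifier is hypothetical, and the values are assumptions derived from the documented bit patterns 110110xx and 110111xx rather than taken from the library's constants. The pair check assumes big-endian byte order, where the more significant byte of each two-byte-sequence comes first.

```java
/** Hypothetical helper for illustration only; not part of the documented library. */
final class Utf16SurrogateClassifier {

  // Assumed bounds for the more significant byte of each surrogate.
  static final int FIRST_SURROGATE_MIN = 0xD8;  // 11011000
  static final int FIRST_SURROGATE_MAX = 0xDB;  // 11011011
  static final int SECOND_SURROGATE_MIN = 0xDC; // 11011100
  static final int SECOND_SURROGATE_MAX = 0xDF; // 11011111

  private Utf16SurrogateClassifier() {
  }

  /** Returns true if the byte has the form 110110xx. */
  static boolean isFirstSurrogate(byte highByte) {
    int value = highByte & 0xFF; // mask the signed byte to 0..255
    return value >= FIRST_SURROGATE_MIN && value <= FIRST_SURROGATE_MAX;
  }

  /** Returns true if the byte has the form 110111xx. */
  static boolean isSecondSurrogate(byte highByte) {
    int value = highByte & 0xFF;
    return value >= SECOND_SURROGATE_MIN && value <= SECOND_SURROGATE_MAX;
  }

  /**
   * Checks whether the four bytes starting at offset form a surrogate pair,
   * assuming big-endian byte order.
   */
  static boolean isSurrogatePairBigEndian(byte[] data, int offset) {
    return offset >= 0
        && offset + 3 < data.length
        && isFirstSurrogate(data[offset])
        && isSecondSurrogate(data[offset + 2]);
  }
}
```

As an example, U+1F600 is encoded in UTF-16BE as the bytes 0xD8 0x3D 0xDE 0x00, so the first and third byte fall into the first and second surrogate range respectively.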

RANK_BOM

private static final int RANK_BOM
The rank gain if a proper ByteOrderMark was detected.

See Also:
Constant Field Values

RANK_UTF8_SEQUNCE

private static final int RANK_UTF8_SEQUNCE
The rank gain if a proper UTF-8 multi-byte sequence was detected.

See Also:
Constant Field Values

RANK_UTF16_SURROGATE

private static final int RANK_UTF16_SURROGATE
The rank gain if a UTF-16 surrogate pair was detected.

See Also:
Constant Field Values

instance

private static EncodingUtil instance
See Also:
getInstance()
Constructor Detail

EncodingUtilImpl

public EncodingUtilImpl()
The constructor.

Method Detail

getInstance

public static EncodingUtil getInstance()
This method gets the singleton instance of this EncodingUtilImpl.
This design is the best compromise between easy access (via this indirection you have direct, static access to all offered functionality) and IoC-style design which allows extension and customization.
For IoC usage, simply ignore all static getInstance() methods and construct new instances via the container-framework of your choice (like plexus, pico, springframework, etc.). To wire up the dependent components everything is properly annotated using common-annotations (JSR-250). If your container does NOT support this, you should consider using a better one.

Returns:
the singleton instance.
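
The compromise described above (a static accessor next to a public constructor) can be sketched as follows. The names TextUtil and TextUtilImpl are hypothetical stand-ins, and lazy creation on first access is an assumption; this page only documents a private static instance field and the public static getInstance() method.

```java
/** Hypothetical API interface, standing in for an interface such as EncodingUtil. */
interface TextUtil {

  String trimToEmpty(String value);
}

/** Hypothetical implementation that offers both static access and IoC-style construction. */
class TextUtilImpl implements TextUtil {

  private static TextUtil instance;

  /** Public constructor, so a container can create and wire its own instance. */
  public TextUtilImpl() {
  }

  /** Static shortcut: creates the shared instance on first access (an assumption). */
  public static synchronized TextUtil getInstance() {
    if (instance == null) {
      instance = new TextUtilImpl();
    }
    return instance;
  }

  @Override
  public String trimToEmpty(String value) {
    return (value == null) ? "" : value.trim();
  }
}
```

Code without a container calls TextUtilImpl.getInstance(), while a container ignores the static method and instantiates the class through its public constructor.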

createUtfDetectionReader

public EncodingDetectionReader createUtfDetectionReader(InputStream inputStream,
                                                        String nonUtfEncoding)
This method creates a new Reader for the given inputStream. The EncodingDetectionReader automatically detects UTF (Unicode Transformation Format) encodings. If the data provided by inputStream is NOT in such an encoding, it will use the given nonUtfEncoding as fallback.
The EncodingDetectionReader will behave like InputStreamReader but with an encoding that is automatically detected while reading. It will use a lookahead buffer to detect the encoding. As long as no UTF characteristic has been detected and only ASCII characters (<128) are hit, the encoding remains EncodingUtil.ENCODING_US_ASCII. As soon as a UTF sequence is detected (e.g. EncodingUtil.ENCODING_UTF_8 or EncodingUtil.ENCODING_UTF_16_BE), the reader switches to that encoding. If a non-ASCII character is hit and no UTF encoding is detected, the EncodingDetectionReader switches to the given nonUtfEncoding.

Specified by:
createUtfDetectionReader in interface EncodingUtil
Parameters:
inputStream - is the InputStream to decode and read.
nonUtfEncoding - is the encoding to use in case the data is NOT encoded in UTF (e.g. EncodingUtil.ENCODING_ISO_8859_15). It is pointless to use a UTF-based encoding or EncodingUtil.ENCODING_US_ASCII here.
Returns:
a new EncodingDetectionReader that can be used to read the inputStream.
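
The decision rule described for this method can be approximated by the following simplified sketch. It is not the library's implementation: the class SimpleUtfDetector and its detect method are hypothetical, it inspects a complete byte array instead of detecting incrementally through a lookahead buffer, it uses no ranking, and it covers only byte order marks for UTF-8 and UTF-16 plus strict UTF-8 validation (UTF-32 and UTF-16 without a byte order mark are left out).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified detector for illustration only. */
final class SimpleUtfDetector {

  private SimpleUtfDetector() {
  }

  /** Returns the detected charset, or nonUtfEncoding if no UTF encoding matches. */
  static Charset detect(byte[] data, Charset nonUtfEncoding) {
    // A byte order mark identifies the UTF flavour directly.
    if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
      return StandardCharsets.UTF_8;
    }
    if (startsWith(data, 0xFE, 0xFF)) {
      return StandardCharsets.UTF_16BE;
    }
    if (startsWith(data, 0xFF, 0xFE)) {
      return StandardCharsets.UTF_16LE;
    }
    // Only ASCII bytes (<128): the encoding remains US-ASCII.
    boolean asciiOnly = true;
    for (byte b : data) {
      if (b < 0) { // signed byte below zero means the value is >= 128
        asciiOnly = false;
        break;
      }
    }
    if (asciiOnly) {
      return StandardCharsets.US_ASCII;
    }
    // Non-ASCII bytes: accept UTF-8 only if every sequence is well-formed.
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data));
      return StandardCharsets.UTF_8;
    } catch (CharacterCodingException e) {
      // No UTF characteristic found: fall back to the caller's encoding.
      return nonUtfEncoding;
    }
  }

  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) {
        return false;
      }
    }
    return true;
  }
}
```

The sketch also shows why passing a UTF-based encoding or US-ASCII as the fallback is pointless: the fallback is only reached once the data is known to contain non-ASCII bytes that are not valid UTF.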


Copyright © 2001-2010 mmm-Team. All Rights Reserved.