net.sf.mmm.util.io.api
Interface EncodingUtil

All Known Implementing Classes:
EncodingUtilImpl

@ComponentSpecification
public interface EncodingUtil

This is the interface for a collection of utility functions that help to deal with encodings. An encoding defines a mapping from the characters of a charset to bytes and vice versa.
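To make the character-to-byte mapping concrete, the following is a minimal sketch that uses only the JDK (not this library). The class name EncodingMappingDemo and the helper encode are hypothetical names chosen for illustration. It encodes the same character with three charsets and prints the resulting byte values.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingMappingDemo {

  /** Encodes the text with the given charset and returns the bytes as unsigned values (0-255). */
  static int[] encode(String text, Charset charset) {
    byte[] bytes = text.getBytes(charset);
    int[] unsigned = new int[bytes.length];
    for (int i = 0; i < bytes.length; i++) {
      unsigned[i] = bytes[i] & 0xFF; // Java bytes are signed, so mask to get 0-255
    }
    return unsigned;
  }

  public static void main(String[] args) {
    String text = "\u00e4"; // U+00E4, LATIN SMALL LETTER A WITH DIAERESIS
    System.out.println(Arrays.toString(encode(text, StandardCharsets.ISO_8859_1))); // [228]
    System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_8)));      // [195, 164]
    System.out.println(Arrays.toString(encode(text, StandardCharsets.UTF_16BE)));   // [0, 228]
  }
}
```

The same character yields one byte in ISO-8859-1 and two bytes in UTF-8 and UTF-16, which is why a decoder needs to know (or detect) the encoding before it can read the bytes back.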

Since:
1.0.1
Author:
Joerg Hohwiller (hohwille at users.sourceforge.net)
See Also:
EncodingUtilImpl

Field Summary
static String ENCODING_CP_437
          The encoding CP437 also called DOS-US.
static String ENCODING_CP_737
          The encoding CP737.
static String ENCODING_CP_850
          The encoding CP850.
static String ENCODING_CP_852
          The encoding CP852.
static String ENCODING_CP_855
          The encoding CP855.
static String ENCODING_CP_857
          The encoding CP857.
static String ENCODING_CP_858
          The encoding CP858.
static String ENCODING_CP_860
          The encoding CP860.
static String ENCODING_CP_861
          The encoding CP861.
static String ENCODING_CP_863
          The encoding CP863.
static String ENCODING_CP_865
          The encoding CP865.
static String ENCODING_CP_866
          The encoding CP866.
static String ENCODING_CP_869
          The encoding CP869.
static String ENCODING_ISO_8859_1
          The encoding ISO-8859-1 also called Latin-1.
static String ENCODING_ISO_8859_10
          The encoding ISO-8859-10 also called Latin-6.
static String ENCODING_ISO_8859_11
          The encoding ISO-8859-11.
static String ENCODING_ISO_8859_12
          Deprecated. 
static String ENCODING_ISO_8859_13
          The encoding ISO-8859-13 also called Latin-7.
static String ENCODING_ISO_8859_14
          The encoding ISO-8859-14 also called Latin-8.
static String ENCODING_ISO_8859_15
          The encoding ISO-8859-15 also called Latin-9.
static String ENCODING_ISO_8859_16
          The encoding ISO-8859-16 also called Latin-10.
static String ENCODING_ISO_8859_2
          The encoding ISO-8859-2 also called Latin-2.
static String ENCODING_ISO_8859_3
          The encoding ISO-8859-3 also called Latin-3.
static String ENCODING_ISO_8859_4
          The encoding ISO-8859-4 also called Latin-4.
static String ENCODING_ISO_8859_5
          The encoding ISO-8859-5.
static String ENCODING_ISO_8859_6
          The encoding ISO-8859-6.
static String ENCODING_ISO_8859_7
          The encoding ISO-8859-7.
static String ENCODING_ISO_8859_8
          The encoding ISO-8859-8.
static String ENCODING_ISO_8859_9
          The encoding ISO-8859-9 also called Latin-5.
static String ENCODING_KOI8_R
          The encoding KOI8-R.
static String ENCODING_KOI8_U
          The encoding KOI8-U.
static String ENCODING_US_ASCII
          The encoding US-ASCII (American Standard Code for Information Interchange) also just called ASCII.
static String ENCODING_UTF_16
          The encoding UTF-16.
static String ENCODING_UTF_16_BE
          The encoding UTF-16, big-endian.
static String ENCODING_UTF_16_LE
          The encoding UTF-16, little-endian.
static String ENCODING_UTF_32
          The encoding UTF-32.
static String ENCODING_UTF_32_BE
          The encoding UTF-32, big-endian.
static String ENCODING_UTF_32_LE
          The encoding UTF-32, little-endian.
static String ENCODING_UTF_8
          The encoding UTF-8.
static String ENCODING_WINDOWS_1250
          The encoding CP1250 also called Windows-1250.
static String ENCODING_WINDOWS_1251
          The encoding CP1251 also called Windows-1251.
static String ENCODING_WINDOWS_1252
          The encoding CP1252 also called Windows-1252.
static String ENCODING_WINDOWS_1253
          The encoding CP1253 also called Windows-1253.
static String ENCODING_WINDOWS_1254
          The encoding CP1254 also called Windows-1254.
static String ENCODING_WINDOWS_1255
          The encoding CP1255 also called Windows-1255.
static String ENCODING_WINDOWS_1256
          The encoding CP1256 also called Windows-1256.
static String ENCODING_WINDOWS_1257
          The encoding CP1257 also called Windows-1257.
static String ENCODING_WINDOWS_1258
          The encoding CP1258 also called Windows-1258.
static String SYSTEM_DEFAULT_ENCODING
          The default encoding used by this JVM as fallback if no explicit encoding is specified.
 
Method Summary
 EncodingDetectionReader createUtfDetectionReader(InputStream inputStream, String nonUtfEncoding)
          This method creates a new Reader for the given inputStream.
 

Field Detail

SYSTEM_DEFAULT_ENCODING

static final String SYSTEM_DEFAULT_ENCODING
The default encoding used by this JVM as fallback if no explicit encoding is specified.
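How this constant is initialized is not stated here. As a hedged illustration only, the JDK itself exposes the JVM default through Charset.defaultCharset() and the file.encoding system property. The class name DefaultEncodingDemo and the method jvmDefaultEncoding below are hypothetical.

```java
import java.nio.charset.Charset;

class DefaultEncodingDemo {

  /** Returns the canonical name of the default charset of the running JVM. */
  static String jvmDefaultEncoding() {
    return Charset.defaultCharset().name();
  }

  public static void main(String[] args) {
    // Both values depend on the platform and JVM settings, so no fixed output is shown.
    System.out.println("Default charset: " + jvmDefaultEncoding());
    System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
  }
}
```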


ENCODING_US_ASCII

static final String ENCODING_US_ASCII
The encoding US-ASCII (American Standard Code for Information Interchange) also just called ASCII.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_UTF_8

static final String ENCODING_UTF_8
The encoding UTF-8. It is an 8-bit Unicode Transformation Format.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_UTF_16

static final String ENCODING_UTF_16
The encoding UTF-16. It is a 16-bit Unicode Transformation Format. The byte order is determined by an optional byte order mark (BOM).
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_UTF_16_LE

static final String ENCODING_UTF_16_LE
The encoding UTF-16, little-endian. It is a 16-bit Unicode Transformation Format.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_UTF_16_BE

static final String ENCODING_UTF_16_BE
The encoding UTF-16, big-endian. It is a 16-bit Unicode Transformation Format.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_UTF_32

static final String ENCODING_UTF_32
The encoding UTF-32. It is a 32-bit Unicode Transformation Format. The byte order is determined by an optional byte order mark (BOM).
ATTENTION:
UTF-32 is NOT yet supported by Java.

See Also:
Constant Field Values

ENCODING_UTF_32_LE

static final String ENCODING_UTF_32_LE
The encoding UTF-32, little-endian. It is a 32-bit Unicode Transformation Format.
ATTENTION:
UTF-32 is NOT yet supported by Java.

See Also:
Constant Field Values

ENCODING_UTF_32_BE

static final String ENCODING_UTF_32_BE
The encoding UTF-32, big-endian. It is a 32-bit Unicode Transformation Format.
ATTENTION:
UTF-32 is NOT yet supported by Java.

See Also:
Constant Field Values
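Whether a particular encoding is available depends on the JVM that runs the code, so the support notes in this section may not hold for every runtime. The following is a minimal sketch of a runtime check that uses the JDK method Charset.isSupported. The class name CharsetSupportCheck and the method checkSupport are hypothetical, and the charset names passed in main are assumptions, because the actual values of the constants are not shown on this page.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.LinkedHashMap;
import java.util.Map;

class CharsetSupportCheck {

  /** Maps each given charset name to whether the running JVM supports it. */
  static Map<String, Boolean> checkSupport(String... names) {
    Map<String, Boolean> result = new LinkedHashMap<>();
    for (String name : names) {
      boolean supported;
      try {
        supported = Charset.isSupported(name);
      } catch (IllegalCharsetNameException e) {
        supported = false; // syntactically illegal names are treated as unsupported
      }
      result.put(name, supported);
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumed names; the results vary between JVM versions and vendors.
    Map<String, Boolean> support =
        checkSupport("UTF-8", "UTF-32", "UTF-32LE", "UTF-32BE", "ISO-8859-10", "KOI8-U");
    for (Map.Entry<String, Boolean> entry : support.entrySet()) {
      System.out.println(entry.getKey() + " supported: " + entry.getValue());
    }
  }
}
```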

ENCODING_ISO_8859_1

static final String ENCODING_ISO_8859_1
The encoding ISO-8859-1 also called Latin-1. It covers most Western European languages.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_2

static final String ENCODING_ISO_8859_2
The encoding ISO-8859-2 also called Latin-2. It covers the Central and Eastern European languages that use the Latin alphabet.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_3

static final String ENCODING_ISO_8859_3
The encoding ISO-8859-3 also called Latin-3. It covers the South European languages.
This is an extended encoding for Java contained in lib/charsets.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_4

static final String ENCODING_ISO_8859_4
The encoding ISO-8859-4 also called Latin-4. It covers the North European languages.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_5

static final String ENCODING_ISO_8859_5
The encoding ISO-8859-5. It covers mostly Slavic languages that use a Cyrillic alphabet.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_6

static final String ENCODING_ISO_8859_6
The encoding ISO-8859-6. It covers common Arabic language characters.
This is an extended encoding for Java contained in lib/charsets.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_7

static final String ENCODING_ISO_8859_7
The encoding ISO-8859-7. It covers modern Greek.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_8

static final String ENCODING_ISO_8859_8
The encoding ISO-8859-8. It covers modern Hebrew (used in Israel).
This is an extended encoding for Java contained in lib/charsets.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_9

static final String ENCODING_ISO_8859_9
The encoding ISO-8859-9 also called Latin-5. It covers Turkish and Kurdish.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_10

static final String ENCODING_ISO_8859_10
The encoding ISO-8859-10 also called Latin-6. It is used for Nordic languages.
ATTENTION:
This encoding is NOT supported by Java.

See Also:
Constant Field Values

ENCODING_ISO_8859_11

static final String ENCODING_ISO_8859_11
The encoding ISO-8859-11. The canonical name, however, is x-iso-8859-11. It covers common Thai language characters.

See Also:
Constant Field Values

ENCODING_ISO_8859_12

@Deprecated
static final String ENCODING_ISO_8859_12
Deprecated. 
The encoding ISO-8859-12. The work on this encoding for Devanagari was stopped so it does NOT exist at all.

See Also:
Constant Field Values

ENCODING_ISO_8859_13

static final String ENCODING_ISO_8859_13
The encoding ISO-8859-13 also called Latin-7. It covers Baltic languages.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_14

static final String ENCODING_ISO_8859_14
The encoding ISO-8859-14 also called Latin-8. It covers Celtic languages.
This encoding is NOT supported by Java.

See Also:
Constant Field Values

ENCODING_ISO_8859_15

static final String ENCODING_ISO_8859_15
The encoding ISO-8859-15 also called Latin-9. It is very similar to Latin-1 but adds the euro-sign and 7 other characters by replacing rarely used ones.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_ISO_8859_16

static final String ENCODING_ISO_8859_16
The encoding ISO-8859-16 also called Latin-10. It covers South-Eastern European languages and includes the euro-sign.
This encoding is NOT supported by Java.

See Also:
Constant Field Values

ENCODING_KOI8_R

static final String ENCODING_KOI8_R
The encoding KOI8-R. It covers Russian and Bulgarian. It is therefore related to ENCODING_ISO_8859_5 and ENCODING_WINDOWS_1251.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_KOI8_U

static final String ENCODING_KOI8_U
The encoding KOI8-U. It covers Ukrainian. It is related to ENCODING_KOI8_R, ENCODING_ISO_8859_5 and ENCODING_WINDOWS_1251.
ATTENTION:
This encoding is NOT supported by Java.

See Also:
Constant Field Values

ENCODING_CP_437

static final String ENCODING_CP_437
The encoding CP437 also called DOS-US. It is used by MS-DOS and is based on ENCODING_US_ASCII but NOT completely compatible.

See Also:
Constant Field Values

ENCODING_CP_737

static final String ENCODING_CP_737
The encoding CP737. It is used by MS-DOS for Greek and is therefore related to ENCODING_CP_869 and ENCODING_ISO_8859_7.

See Also:
Constant Field Values

ENCODING_CP_850

static final String ENCODING_CP_850
The encoding CP850. It is used by MS-DOS for Western European languages and is therefore related to ENCODING_ISO_8859_1.

See Also:
Constant Field Values

ENCODING_CP_852

static final String ENCODING_CP_852
The encoding CP852. It is used by MS-DOS for Central European languages and is therefore related to ENCODING_ISO_8859_2.

See Also:
Constant Field Values

ENCODING_CP_855

static final String ENCODING_CP_855
The encoding CP855. It is used by MS-DOS for Cyrillic letters and is therefore related to ENCODING_ISO_8859_5.

See Also:
Constant Field Values

ENCODING_CP_857

static final String ENCODING_CP_857
The encoding CP857. It is used by MS-DOS for Turkish and is therefore related to ENCODING_ISO_8859_9.

See Also:
Constant Field Values

ENCODING_CP_858

static final String ENCODING_CP_858
The encoding CP858. It is used by MS-DOS for Western European languages and is like ENCODING_CP_850 but replaces one character with the euro-sign. It is therefore related to ENCODING_ISO_8859_15.

See Also:
Constant Field Values

ENCODING_CP_860

static final String ENCODING_CP_860
The encoding CP860. It is used by MS-DOS for Portuguese and is therefore related to ENCODING_ISO_8859_1.

See Also:
Constant Field Values

ENCODING_CP_861

static final String ENCODING_CP_861
The encoding CP861. It is used by MS-DOS for Nordic languages especially for Icelandic and is therefore related to ENCODING_ISO_8859_10.

See Also:
Constant Field Values

ENCODING_CP_863

static final String ENCODING_CP_863
The encoding CP863. It is used by MS-DOS for French and is therefore related to ENCODING_ISO_8859_15.

See Also:
Constant Field Values

ENCODING_CP_865

static final String ENCODING_CP_865
The encoding CP865. It is used by MS-DOS for Nordic languages except Icelandic for which ENCODING_CP_861 is used. It is therefore related to ENCODING_ISO_8859_10.

See Also:
Constant Field Values

ENCODING_CP_866

static final String ENCODING_CP_866
The encoding CP866. It is used by MS-DOS for Cyrillic letters and is therefore related to ENCODING_CP_855 and ENCODING_ISO_8859_5.

See Also:
Constant Field Values

ENCODING_CP_869

static final String ENCODING_CP_869
The encoding CP869. It is used by MS-DOS for Greek and is therefore related to ENCODING_CP_737 and ENCODING_ISO_8859_7.

See Also:
Constant Field Values

ENCODING_WINDOWS_1250

static final String ENCODING_WINDOWS_1250
The encoding CP1250 also called Windows-1250. It is used by Microsoft Windows for Central European languages and is similar to ENCODING_ISO_8859_2.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_WINDOWS_1251

static final String ENCODING_WINDOWS_1251
The encoding CP1251 also called Windows-1251. It is used by Microsoft Windows for Cyrillic letters and is similar to ENCODING_ISO_8859_5.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_WINDOWS_1252

static final String ENCODING_WINDOWS_1252
The encoding CP1252 also called Windows-1252. It is used by Microsoft Windows for Western European languages and is similar to ENCODING_ISO_8859_1.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_WINDOWS_1253

static final String ENCODING_WINDOWS_1253
The encoding CP1253 also called Windows-1253. It is used by Microsoft Windows for Greek and is similar to ENCODING_ISO_8859_7.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_WINDOWS_1254

static final String ENCODING_WINDOWS_1254
The encoding CP1254 also called Windows-1254. It is used by Microsoft Windows for Turkish and is similar to ENCODING_ISO_8859_9.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_WINDOWS_1255

static final String ENCODING_WINDOWS_1255
The encoding CP1255 also called Windows-1255. It is used by Microsoft Windows for Hebrew and is similar to ENCODING_ISO_8859_8.

See Also:
Constant Field Values

ENCODING_WINDOWS_1256

static final String ENCODING_WINDOWS_1256
The encoding CP1256 also called Windows-1256. It is used by Microsoft Windows for Arabic and is similar to ENCODING_ISO_8859_6.

See Also:
Constant Field Values

ENCODING_WINDOWS_1257

static final String ENCODING_WINDOWS_1257
The encoding CP1257 also called Windows-1257. It is used by Microsoft Windows for Baltic languages and is similar to ENCODING_ISO_8859_13.
This is a basic encoding for Java contained in lib/rt.jar.

See Also:
Constant Field Values

ENCODING_WINDOWS_1258

static final String ENCODING_WINDOWS_1258
The encoding CP1258 also called Windows-1258. It is used by Microsoft Windows for Vietnamese and is similar to ENCODING_WINDOWS_1252.

See Also:
Constant Field Values

Method Detail

createUtfDetectionReader

EncodingDetectionReader createUtfDetectionReader(InputStream inputStream,
                                                 String nonUtfEncoding)
This method creates a new Reader for the given inputStream. The EncodingDetectionReader automatically detects UTF (Unicode Transformation Format) encodings. If the data provided by inputStream is NOT in such an encoding, it will use the given nonUtfEncoding as fallback.
The EncodingDetectionReader will behave like InputStreamReader but with an encoding that is automatically detected whilst reading. It will use a lookahead buffer to detect the encoding. As long as no UTF characteristic has been detected and only ASCII characters (<128) are hit, the encoding remains ENCODING_US_ASCII. As soon as a UTF sequence is detected (e.g. ENCODING_UTF_8 or ENCODING_UTF_16_BE), the encoding switches to that encoding. If a non-ASCII character is hit and no UTF encoding is detected, the EncodingDetectionReader switches to the given nonUtfEncoding.

Parameters:
inputStream - is the InputStream to decode and read.
nonUtfEncoding - is the encoding to use in case the data is NOT encoded in UTF (e.g. ENCODING_ISO_8859_15). It is pointless to use a UTF-based encoding or ENCODING_US_ASCII here.
Returns:
a new EncodingDetectionReader that can be used to read the inputStream.
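The detection rules described above can be sketched with the JDK alone. This is NOT the implementation of EncodingDetectionReader: it is a simplified illustration that reads the whole stream into memory instead of using a lookahead buffer, recognizes only the UTF-8 and UTF-16 byte order marks (UTF-32 is ignored), and otherwise tests whether the bytes are valid UTF-8. The class name UtfDetectionSketch, its methods, and the returned charset names are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class UtfDetectionSketch {

  /** Reads the entire stream into a byte array (simplification: no lookahead buffer). */
  static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int count;
    while ((count = in.read(buffer)) != -1) {
      out.write(buffer, 0, count);
    }
    return out.toByteArray();
  }

  /** Checks whether the data starts with the given unsigned byte values. */
  static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  /** Returns the name of the detected encoding, or nonUtfEncoding as fallback. */
  static String detectEncoding(byte[] data, String nonUtfEncoding) {
    // 1. A byte order mark is an unambiguous UTF characteristic.
    if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
    if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
    if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
    // 2. Only bytes below 128: the data stays plain ASCII.
    boolean asciiOnly = true;
    for (byte b : data) {
      if ((b & 0xFF) >= 128) {
        asciiOnly = false;
        break;
      }
    }
    if (asciiOnly) return "US-ASCII";
    // 3. Non-ASCII bytes: accept UTF-8 only if the data decodes without errors.
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data));
      return "UTF-8";
    } catch (CharacterCodingException e) {
      // 4. Not valid UTF-8: use the caller-supplied fallback.
      return nonUtfEncoding;
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] latin = {'G', 'r', (byte) 0xFC, (byte) 0xDF, 'e'}; // "Grüße" in ISO-8859-15
    byte[] data = readAll(new ByteArrayInputStream(latin));
    String encoding = detectEncoding(data, "ISO-8859-15");
    System.out.println(encoding + ": " + new String(data, encoding));
  }
}
```

Passing a UTF-based encoding or US-ASCII as the fallback would gain nothing in this sketch either, since those cases are already covered before the fallback is reached.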


Copyright © 2001-2010 mmm-Team. All Rights Reserved.