A utility class to parse, clean up, and extract email addresses from messages per RFC2822 syntax. Designed to integrate with Javamail (this class will
require that you have a javamail mail.jar in your classpath), but you could easily change the existing methods around to not use Javamail at all. For
example, if you're changing the code, see the difference between getInternetAddress and getDomain: the latter doesn't depend on any javamail code. This is
all a by-product of what this class was written for, so feel free to modify it to suit your needs.
For real-world addresses, this class is roughly 3-4 times slower than parsing with InternetAddress (although recent versions of this class might be faster),
but it can handle a whole lot more. Because of sensible design tradeoffs made in javamail, if InternetAddress has trouble parsing, it might throw an
exception, but often it will silently leave the entire original string in the result of ia.getAddress(). This class can be trusted to only provide
validated results.
This class has been successfully used on many billions of real-world addresses, live in production environments, but it's not perfect yet.
Comments/Questions/Corrections welcome: https://github.com/bbottema/email-rfc2822-validator/issues
History:
Started with code by Les Hazlewood: leshazlewood.com.
Modified/added (Casey Connor): removed some functions, added support for CFWS token, corrected FWSP token, added some boolean flags, added getInternetAddress
and extractHeaderAddresses and other methods, some optimization.
Modified/added (Benny Bottema): modularized the code and separated configuration, validation and extraction functions.
Where Mr. Hazlewood's version was more for ensuring certain forms that were passed in during registrations, etc., this handles more types of verifying as well
as a few forms of extracting the data in predictable, cleaned-up chunks.
Note: CFWS means the "comments and/or folding white space" token from RFC 2822: in other words, whitespace, plus comment text that is enclosed in ()'s.
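To make the CFWS idea concrete, here is a minimal sketch of recognising runs of whitespace and flat (non-nested) comments with java.util.regex. It is not this class's actual grammar: the names CfwsSketch, COMMENT, CFWS and stripCfws are hypothetical, the whitespace rule is simplified to \s+, and quoted strings are ignored, so it is only meaningful on input without quotes.

```java
import java.util.regex.Pattern;

// Illustrative only: a simplified stand-in for the CFWS token, not the validator's own pattern.
final class CfwsSketch {
    // One non-nested comment: "(", then any mix of ordinary characters
    // (anything except parentheses and backslash) or backslash-escaped pairs, then ")".
    static final String COMMENT = "\\((?:[^()\\\\]|\\\\.)*\\)";

    // Simplified CFWS: one or more consecutive runs of whitespace and/or comments.
    static final Pattern CFWS = Pattern.compile("(?:\\s+|" + COMMENT + ")+");

    // Removes every CFWS run. Quoted strings are not handled, so a "(" inside
    // quotes would wrongly be treated as the start of a comment.
    static String stripCfws(String s) {
        return CFWS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Flat comments are removed cleanly.
        System.out.println(stripCfws("bob (comment) (other comment) @example.com"));
        // prints: bob@example.com

        // A nested comment is not matched as one unit, so debris is left behind.
        System.out.println(stripCfws("bob (outer (inner) comment) @example.com"));
        // prints: bob(outercomment)@example.com
    }
}
```

The second call shows why a flat comment pattern cannot cope with comments inside comments: regular expressions of this kind have no way to balance the parentheses.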
Limitations: doesn't support nested CFWS (comments within (other) comments), doesn't support mailbox groups except when flat-extracting addresses from
headers or when doing verification, doesn't support any of the obs-* tokens. Also: the getInternetAddress and extractHeaderAddresses methods return
InternetAddress objects; if the personal name has any quotes or \'s in it at all, the InternetAddress object will always escape the name entirely and put it
in quotes, so multiple-token personal names with those characters somewhere in them will always be munged into one big escaped string. This is not really a
big deal at all, but I mention it anyway. (And you could get around it by a simple modification to those methods to not use InternetAddress objects.) See the
docs of those methods for more info.
Note: Unlike InternetAddress, this class will preserve any RFC-2047-encoding of international characters. Thus doing my_internetaddress.getPersonal() will
return the 2047-encoded string, ready for use in an RFC-2822-compliant message, whereas the common InternetAddress constructor (when used outside the context
of EmailAddressValidator) would return the decoded version of the text, if any was needed. If you need the decoded form, you can do something like this
(where ia is the InternetAddress object returned from an EmailAddressValidator method):
ia.setPersonal(javax.mail.internet.MimeUtility.decodeText(ia.getPersonal()));
...subsequent calls to ia.getPersonal() will then return the decoded text.
Note: This class does not do any header-length-checking. There are no such limitations on the email address grammar in 2822, though email headers in general
do have length restrictions. So if the return path is 40000 unfolded characters long, but otherwise valid under 2822, this class will pass it.
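If you need a length check, it has to happen on the caller's side. A minimal sketch, assuming the header is available as a folded string with CRLF line breaks: RFC 2822 section 2.1.1 says each line must be no more than 998 characters and should be no more than 78, excluding the CRLF. The class and method names here (HeaderLengthSketch, longestLine, withinHardLimit, withinSoftLimit) are hypothetical and not part of this library.

```java
// Illustrative only: a caller-side line-length check, separate from address validation.
final class HeaderLengthSketch {
    // RFC 2822 section 2.1.1 limits, both excluding the trailing CRLF.
    static final int HARD_LIMIT = 998; // MUST NOT be exceeded
    static final int SOFT_LIMIT = 78;  // SHOULD NOT be exceeded

    // Returns the length of the longest physical line in a folded header.
    static int longestLine(String foldedHeader) {
        int longest = 0;
        // Limit -1 keeps trailing empty segments; they have length 0 and do not affect the maximum.
        for (String line : foldedHeader.split("\r\n", -1)) {
            longest = Math.max(longest, line.length());
        }
        return longest;
    }

    static boolean withinHardLimit(String foldedHeader) {
        return longestLine(foldedHeader) <= HARD_LIMIT;
    }

    static boolean withinSoftLimit(String foldedHeader) {
        return longestLine(foldedHeader) <= SOFT_LIMIT;
    }

    public static void main(String[] args) {
        String folded = "To: a@example.com,\r\n b@example.com";
        System.out.println(withinSoftLimit(folded)); // prints: true
    }
}
```

A 40000-character unfolded return path would fail withinHardLimit even though its address syntax may be valid, which is why the two checks are best kept separate.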
Examples of passing (2822-valid) addresses, believe it or not:
bob @example.com
"bob" @ example.com
bob (comment) (other comment) @example.com (personal name)
"<bob \" (here) " < (hi there) "bob(the man)smith" (hi) @ (there) example.com (hello) > (again)
(none of which are permitted by javamail's InternetAddress parsing, incidentally)
By using getInternetAddress(), you can retrieve an InternetAddress object that, when toString()'ed, would reveal that the parser had converted the above
into:
<bob@example.com>
<bob@example.com>
"personal name" <bob@example.com>
"<bob \" (here)" <"bob(the man)smith"@example.com>
(respectively)
If parsing headers, however, you'll probably be calling extractHeaderAddresses().
A future improvement may be to use this class to extract info from corrupted addresses, but for now, it does not permit them.
Some of the configuration booleans allow a bit of tweaking already. The source code can be compiled with these booleans in various states. They are
configured to what is probably the most commonly-useful state.