java - Library for canonicalizing (normalizing, but NOT just cleansing) email addresses
There are multiple ways to produce email address strings that differ under a straight string comparison (see below) but are logically equivalent (i.e. mail sent to both goes to the same mailbox). This often allows users to give seemingly unique email addresses, even if strict equality is disallowed.
I was hoping to find a library that would try to do this normalization, to allow finding duplicates in large sets of email addresses. The goal here is to find as many duplicates as possible. Given how useful this is for multiple purposes (in my case simple abuse detection, since abuse accounts tend to (try to) reuse accounts), I am thinking there might be existing solutions.
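To make the goal concrete, here is a minimal sketch of the duplicate search itself, assuming some canonicalization function already exists. The class name `DuplicateFinder` and the `canonicalizer` parameter are hypothetical names used only for illustration; the interesting work is in whatever function gets passed in.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper: groups addresses by their canonical form.
final class DuplicateFinder {

    /**
     * Returns a map from canonical address to the original addresses that
     * share it, keeping only groups with two or more members.
     */
    static Map<String, List<String>> findDuplicates(
            Collection<String> addresses, UnaryOperator<String> canonicalizer) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String address : addresses) {
            String key = canonicalizer.apply(address);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(address);
        }
        // Drop canonical forms that were only seen once.
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }
}
```

With a hash map this is a single pass over the input, so it should scale to large sets as long as the canonical keys fit in memory.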
So what kinds of things can vary? I know of at least things like:
- The domain name part is case-insensitive (as per DNS); the local part may or may not be, depending on the mail provider (for example, Gmail treats it as case-insensitive)
- Many domains have aliases (googlemail.com is equivalent to gmail.com)
- Some email providers allow other variations that they ignore (Gmail, for example, ignores dots in the local part of the address!)
Ideally the library would be in Java, although scripting languages would also work (as a command-line tool).
I could find a few bits of code on Google by searching for "normalize email address", but nothing thorough enough. I'm afraid you will have to write your own tool. If I were to write such a tool, here are a few rules I think I would apply:
First, the tool would lower-case the domain name (the part after the @). This shouldn't be hard, unless you want to handle emails with international domain names. For example, joe@cafÉ.fr (note the accent on the E) should first go through the Nameprep algorithm and then be Punycode-encoded. This leads to joe@xn--caf-dma.fr. I have never seen such an international email address, but I suspect you might find some in China or Japan, for example.
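A minimal sketch of this first step, using the JDK's `java.net.IDN` class (which implements the older IDNA 2003 rules, including Nameprep). The class and method names `DomainNormalizer`, `normalizeDomain` and `normalizeAddress` are made up for this example, and splitting on the last @ is a simplifying assumption.

```java
import java.net.IDN;
import java.util.Locale;

// Hypothetical helper for the domain part only.
final class DomainNormalizer {

    /** Returns the ASCII (Punycode) lower-case form of a domain name. */
    static String normalizeDomain(String domain) {
        // IDN.toASCII runs Nameprep on labels containing non-ASCII characters,
        // which also case-folds them. Pure ASCII labels are returned unchanged,
        // hence the explicit lower-casing afterwards.
        return IDN.toASCII(domain.trim()).toLowerCase(Locale.ROOT);
    }

    /** Normalizes only the domain part, leaving the local part untouched. */
    static String normalizeAddress(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("No @ in address: " + address);
        }
        return address.substring(0, at) + "@" + normalizeDomain(address.substring(at + 1));
    }
}
```

For instance, `DomainNormalizer.normalizeAddress("joe@caf\u00C9.fr")` should give `joe@xn--caf-dma.fr`. Note that `IDN.toASCII` throws `IllegalArgumentException` on input it cannot convert, so a real tool would need to decide what to do with such addresses.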
RFC 5321 states that the local part of an email address (before the @) is case sensitive, but the de facto standard is that virtually all providers ignore case (I have never seen a case-sensitive email address actually used by a human being, but I suppose there are still sysadmins out there who use Un*x email accounts, where case matters). I think the tool should have an option to ignore case for a list of domain names (or, on the contrary, to be case sensitive for a list of domain names). At this point, the email address joe@cafÉ.fr has been normalized to joe@xn--caf-dma.fr.
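The option described above could look roughly like this. It is only a sketch: `LocalPartCasePolicy` is a hypothetical name, and it takes the "case sensitive for a list of domain names" variant, assuming the domain has already been normalized.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical policy object: lower-cases the local part unless the domain
// is on an explicit list of case-sensitive domains.
final class LocalPartCasePolicy {

    private final Set<String> caseSensitiveDomains;

    LocalPartCasePolicy(Set<String> caseSensitiveDomains) {
        // Defensive copy so later changes by the caller have no effect.
        this.caseSensitiveDomains = new HashSet<>(caseSensitiveDomains);
    }

    /** Expects the domain to be already normalized (lower-case ASCII). */
    String apply(String localPart, String normalizedDomain) {
        if (caseSensitiveDomains.contains(normalizedDomain)) {
            return localPart;
        }
        // Locale.ROOT avoids locale-specific surprises such as the Turkish dotless i.
        return localPart.toLowerCase(Locale.ROOT);
    }
}
```

Defaulting to case-insensitive matches the de facto behaviour and finds more duplicates; the inverse default would be safer but would miss most of them.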
Once again, the question of international (a.k.a. non-ASCII) email addresses pops up. What if the local part is non-ASCII? For example 甲斐@黒川.日本 (disclaimer: I don't speak Japanese). RFC 5322 forbids this, but more recent RFCs support it (see this Wikipedia article). A lot of languages have no notion of lower or upper case. When they do, and you want to change to the lower-case form, make sure you use the appropriate Unicode lower-casing algorithms, because it's not trivial. For example, in German, the lower case of the word "Großes" may be either "grosses" or "großes" (disclaimer: I don't speak German either). So at this point, the email address "Großes@cafÉ.fr" should have been normalized to "grosses@xn--caf-dma.fr".
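To illustrate the two possible outcomes for "Großes", here is a small sketch. Java has no built-in Unicode case-folding method as far as I know, so `approximateFold` uses an upper-case-then-lower-case round trip as a stand-in; that is an approximation, not the official case-folding algorithm, and both method names are invented for this example.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper showing two ways to lower-case a non-ASCII local part.
final class UnicodeCase {

    /** Plain lower-casing: "Gro\u00DFes" becomes "gro\u00DFes" (the sharp s is kept). */
    static String simpleLower(String s) {
        // NFC first, so composed and decomposed accents compare equal.
        return Normalizer.normalize(s, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
    }

    /**
     * Approximate case folding: upper-casing expands the sharp s to "SS",
     * so "Gro\u00DFes" becomes "grosses". This is not full Unicode case folding.
     */
    static String approximateFold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC)
                .toUpperCase(Locale.ROOT)
                .toLowerCase(Locale.ROOT);
    }
}
```

Scripts without case, such as the Japanese example above, pass through both methods unchanged. Which variant is right for duplicate detection depends on how aggressive you want the matching to be.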
I haven't read RFC 5322 in detail, but I think there's a possibility of having comments in an email address, either at the beginning or at the end of the local part, such as (sir)john.lennon@beatles.com or john.lennon(ono)@beatles.com. These comments should be stripped (this would lead to john.lennon@beatles.com). Stripping comments is not entirely trivial, because I don't know how to handle nested comments, and comments enclosed in double quotes should not be stripped, according to the RFC (unless I am mistaken). For example, the comment in the following email address should not be stripped, according to the RFC: "john.(ono).lennon"@beatles.com.
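A possible way to strip comments is a small character scanner that tracks whether it is inside a quoted string and how deep it is inside parentheses. The RFC 5322 grammar does allow comments to nest, which the depth counter below handles. This is a sketch working on the local part only, not a full RFC 5322 parser, and `CommentStripper` is a hypothetical name.

```java
// Hypothetical helper: removes RFC 5322 style comments from a local part.
final class CommentStripper {

    static String stripComments(String localPart) {
        StringBuilder out = new StringBuilder();
        boolean inQuotes = false; // inside a "quoted string"
        int depth = 0;            // nesting depth of (comments)
        for (int i = 0; i < localPart.length(); i++) {
            char c = localPart.charAt(i);
            if (c == '\\' && i + 1 < localPart.length() && (inQuotes || depth > 0)) {
                // Backslash escape: keep the pair inside quotes, drop it inside comments.
                if (depth == 0) {
                    out.append(c).append(localPart.charAt(i + 1));
                }
                i++;
            } else if (inQuotes) {
                // Everything inside quotes is kept, parentheses included.
                if (c == '"') {
                    inQuotes = false;
                }
                out.append(c);
            } else if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
            } else if (depth == 0) {
                if (c == '"') {
                    inQuotes = true;
                }
                out.append(c);
            }
            // Any other character at depth > 0 is comment text and is skipped.
        }
        return out.toString();
    }
}
```

With this sketch, `(sir)john.lennon` and `john.lennon(ono)` both become `john.lennon`, while the quoted `"john.(ono).lennon"` is returned unchanged.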
Once the email address is normalized this way, you can apply the "provider-specific" rules you suggest. For example, stripping dots in Gmail addresses and merging equivalent domain names (googlemail.com == gmail.com, for example). I think I would keep this separate from the previous normalization steps.
Note that Gmail also ignores the plus sign (+) and everything after it, so for example s.m.i.t.h+hello_world@gmail.com is equivalent to smith@gmail.com.
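The provider-specific step could be kept as a small table-driven pass, along these lines. The rule tables contain only the Gmail rules mentioned here; `ProviderRules` and its constants are hypothetical names, the input is assumed to be already lower-cased, and `Map.of`/`Set.of` require Java 9 or later.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical rule tables; only the Gmail rules from the text are included.
final class ProviderRules {

    private static final Map<String, String> DOMAIN_ALIASES =
            Map.of("googlemail.com", "gmail.com");
    private static final Set<String> DOTS_IGNORED = Set.of("gmail.com");
    private static final Set<String> PLUS_TAG_IGNORED = Set.of("gmail.com");

    /** Expects an address that has already been through the generic steps. */
    static String apply(String address) {
        int at = address.lastIndexOf('@');
        if (at < 0) {
            throw new IllegalArgumentException("No @ in address: " + address);
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1);

        // Map alias domains to one canonical domain first.
        domain = DOMAIN_ALIASES.getOrDefault(domain, domain);

        // Drop the plus sign and everything after it.
        if (PLUS_TAG_IGNORED.contains(domain)) {
            int plus = local.indexOf('+');
            if (plus >= 0) {
                local = local.substring(0, plus);
            }
        }
        // Drop dots in the local part.
        if (DOTS_IGNORED.contains(domain)) {
            local = local.replace(".", "");
        }
        return local + "@" + domain;
    }
}
```

Keeping the rules in data rather than code makes it easier to update them when a provider changes its behaviour. Addresses on domains without any rule pass through unchanged.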
I'm not aware of other providers' rules. The thing is, these rules may change at any time, and you would have to keep track of them all.
I think that's it. If you come up with working code, I would be interested to see it.
Cheers!