Non-English Domain Names Likely Delayed

Forwarded Message from Neil Harris <neil@tonal.clara.co.uk> ---

Fergie (Paul Ferguson) wrote:

...sez Vint...due to the prevalence of phishing:

http://www.msnbc.msn.com/id/8586332/

- ferg

Paul,

I'm not registered as a poster on the Nanog list, so I thought I'd let
you know that this problem is already well under control.

After extensive analysis and discussion, the Mozilla community and Opera
have already produced a fix for this, based on only displaying Unicode
IDN labels where the registry publishes and enforces well-defined
anti-homograph policies, and displaying the Punycode equivalent
otherwise. All that is needed is a couple of lines of code in the
Punycode -> Unicode translation code in the application, and a whitelist
of TLDs. See
IDN Display Algorithm - MozillaWiki for
more details. This delegates the responsibility of catching homographs
to the registries, rather than trying to catch them using ad-hoc
heuristics at the browser end.
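
As a rough illustration of the display rule described above (a sketch only, not the actual Mozilla or Opera code), the decision can be written in a few lines. The names `IDN_WHITELIST` and `display_hostname` are hypothetical, and the whitelist contents are an illustrative subset, not the vendors' real list.

```python
# Hypothetical whitelist of TLDs whose registries publish anti-homograph
# policies. Illustrative subset only; the real list is kept by the vendors.
IDN_WHITELIST = {"cn", "tw", "museum", "info"}


def display_hostname(ace_hostname: str, whitelist=IDN_WHITELIST) -> str:
    """Return the form of a hostname that a browser would show to the user.

    Labels are decoded from Punycode to Unicode only when the TLD is on the
    whitelist; otherwise the ASCII (Punycode) form is shown unchanged.
    """
    labels = ace_hostname.lower().rstrip(".").split(".")
    if labels[-1] not in whitelist:
        return ".".join(labels)  # untrusted TLD: keep the Punycode form

    decoded = []
    for label in labels:
        if label.startswith("xn--"):
            try:
                # Python's built-in "punycode" codec handles the part after
                # the "xn--" prefix.
                decoded.append(label[4:].encode("ascii").decode("punycode"))
            except UnicodeError:
                decoded.append(label)  # malformed label: show it as-is
        else:
            decoded.append(label)
    return ".".join(decoded)


if __name__ == "__main__":
    ace = "xn--" + "münchen".encode("punycode").decode("ascii")
    print(display_hostname(ace + ".info"))  # TLD on the whitelist
    print(display_hostname(ace + ".com"))   # TLD not on the whitelist
```

The browser needs no confusables table of its own in this scheme; the only data it carries is the set of TLD names.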

In many cases, this can be as simple as restricting labels within a TLD
to use a small set of non-confusable characters. In others, with wider
character sets, techniques such as bundling and blocking sets of
confusable labels using homograph tables can be used. RFC 3743 is a case
in point. For an excellent summary of the technical details, which is
intended to help anyone attempting to eliminate homographs from a naming
system, see the latest, much-expanded, version of Unicode TR #36, which
also links to machine-readable confusables tables.
http://www.unicode.org/reports/tr36/

Already, some 21 TLDs are whitelisted, including .cn, .tw, a number of
European ccTLDs, .museum, and .info. Any other registrars who want to be
supported can simply E-mail Gerv at the Mozilla Foundation, or his Opera
counterpart, and give them a pointer to their anti-spoofing rules.

You might want to summarize to the list.

-- Neil

> After extensive analysis and discussion, the Mozilla community and Opera
> have already produced a fix for this, based on only displaying Unicode
> IDN labels where the registry publishes and enforces well-defined
> anti-homograph policies, and displaying the Punycode equivalent

1. It's strange that so many months of discussion and debate about this elsewhere missed such an obvious and complete solution.

2. Who is the authority that decides whether a TLD uses an acceptable policy?

3. How does this apply to subordinate domains that might or might not enforce "acceptable" policies, given that not all policy-making is at the TLD level?

    d/

  Dave Crocker
  Brandenburg InternetWorking
  +1.408.246.8253
  dcrocker a t ...
  WE'VE MOVED to: www.bbiw.net

a message of 49 lines which said:

Forwarded Message from Neil Harris <neil@tonal.clara.co.uk> ---

...

After extensive analysis and discussion, the Mozilla community and Opera
have already produced a fix for this,

Which is highly questionable and is rejected by most European
ccTLDs.

Already, some 21 TLDs are whitelisted, including .cn, .tw, a number
of European ccTLDs, .museum, and .info. Any other registrars who
want to be supported can simply E-mail Gerv at the Mozilla
Foundation, or his Opera counterpart, and give them a pointer to
their anti-spoofing rules.

The Polish registry already refused to comply, saying that the Mozilla
foundation has no legitimacy deciding the registration rules in ".pl".

a message of 25 lines which said:

2. Who is the authority that decides whether a TLD uses an
acceptable policy?

That's the big problem with this so-called "solution".

Stephane Bortzmeyer <bortzmeyer@nic.fr> writes:

Already, some 21 TLDs are whitelisted, including .cn, .tw, a number
of European ccTLDs, .museum, and .info. Any other registrars who
want to be supported can simply E-mail Gerv at the Mozilla
Foundation, or his Opera counterpart, and give them a pointer to
their anti-spoofing rules.

The Polish registry already refused to comply, saying that the Mozilla
foundation has no legitimacy deciding the registration rules in ".pl".

And it's completely their right to do this, however, if they are at
all subject to pressure from their constituency this policy will
probably change over time if this scheme becomes a de-facto standard
(say, for instance, M$ and Apple decide to run the same whitelist, the
discussion is effectively over).

What's the drawback again to letting commercial forces help shape the
discussion here? I forget...

                                        ---Rob

Stephane Bortzmeyer wrote:

Forwarded Message from Neil Harris <neil@tonal.clara.co.uk> ---
   

...

After extensive analysis and discussion, the Mozilla community and Opera have already produced a fix for this,
   
Which is highly questionable and is rejected by most European
ccTLDs.

Already, some 21 TLDs are whitelisted, including .cn, .tw, a number
of European ccTLDs, .museum, and .info. Any other registrars who
want to be supported can simply E-mail Gerv at the Mozilla
Foundation, or his Opera counterpart, and give them a pointer to
their anti-spoofing rules.
   
The Polish registry already refused to comply, saying that the Mozilla
foundation has no legitimacy deciding the registration rules in ".pl".

Stephane, can I ask you what your detailed objections are to the Moz/Opera mechanism, and could you let me know your proposal for an alternative mechanism for preventing IDN spoofing?

I completely understand the need for registries to define and control their own rules, since every registry has different needs. Thus, I agree with you that the Mozilla foundation does not have, and should not have, any right whatsoever to decide registries' registration rules.

However, by the same principle, Mozilla, Opera and other software vendors also have the right to choose their policy for how they display domain names in their products' GUI. Ultimately, the decision of what policy is used devolves to the user, who decides what software they want to install on their machine.

The Moz/Opera anti-spoofing mechanism is the result of widespread public analysis and discussion, and has the following advantages:
* it deals with the actual problem: the visual representation of characters to the user -- the problem is, quite literally, in the eye of the beholder
* it is simple to code and deploy: about ten lines of code for the Mozilla implementation.
* it is based on simple and non-political principles
* it requires only a minimal amount of data to be distributed with the software
* it is the sole survivor of a large number of alternative proposals that were considered and rejected. Unlike most of the other rejected proposals, it does not need any modifications to the DNS protocol, or distribution of "language" codes for labels, nor does it require multiple DNS lookups, large character tables in the browser, or real-time access to WHOIS information. (I can tell you in great detail about some of the flawed alternative proposals, if you like).
* it is based on a much more thorough analysis of the problem than the earlier ICANN proposals, and builds on the experience of the Unicode community, and the earlier analysis of the spoofing problem for the CJK languages performed for RFC 3743. For example, simple script restrictıons alone, as per ICANN, do not solve the problem -- there are plenty of subtle homographs in the Latin alphabet, such as the one embedded in this sentence.
* it does not treat IDNs as second-class citizens
* it is language- and script-agnostic
* it is scalable on a per-registry basis, so there's no need for a "flag day", and requires no action on behalf of the registry beyond that which might be expected as a service to their customers, who have a reasonable expectation that their domains not be easily spoofed.
* and, most of all, it uses human, and not technical, means to provide a chain of trust from the registry to the application to the user
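
To make the embedded homograph in the list above visible, here is a small sketch that compares the spoofed word with its plain spelling and names the characters that differ. The function name `differing_characters` is hypothetical; `unicodedata.name` is the standard-library lookup.

```python
import unicodedata


def differing_characters(a: str, b: str):
    """List (index, char_a, name_a, char_b, name_b) for each mismatch.

    Only positions present in both strings are compared.
    """
    return [
        (i, x, unicodedata.name(x), y, unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


# The word as it appears in the message, with U+0131 in place of "i".
spoof = "restrict\u0131ons"
plain = "restrictions"

if __name__ == "__main__":
    for entry in differing_characters(spoof, plain):
        print(entry)
```

Both characters belong to the Latin script, so a rule that only forbids mixing scripts would not flag this label.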

I must say that, from a user's perspective, I find it hard to understand why any registry would not want to put their anti-spoofing policy -- assuming they have one -- on public display, thus encouraging software vendors to regard their IDN labels as safe to display within their software.

In the long run, of course, it makes sense for best common registry anti-spoofing practices to be codified, probably in an RFC, or through the Unicode consortium. However, until then, the maintenance of an ad-hoc list by software vendors seems to be a powerful incentive in the short term for registries to implement and publish anti-spoofing policies which encourage trust.

There are a vast number of possible policies which registries could introduce, any of which might serve this purpose.

For example, for .fr, it could be as simple as saying something like "labels in .fr must consist only of characters from the set -, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, à, â, æ, ç, è, é, ê, ë, î, ï, ô, ù, û, ü, ÿ, œ", putting that statement on their website, and letting the software makers know about it.

For .pl, which appears to want to support multiple character sets including the Cyrillic alphabet, it could be to say "we implement the character set restrictions of draft-bartosiewicz-idn-pltld-06.txt, together with blocking bundling using the confusables.txt table as per UTR #36-3".

In my opinion, either of these statements would persuade me that the registry was applying due diligence in avoiding homograph spoofs, and I would imagine that browser vendors would take the same view.
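
A registry-side check for a character-set rule like the .fr example could be as small as the following sketch. The set is copied from the hypothetical policy statement above, not from any real registry rule, and the names `FR_ALLOWED` and `label_allowed` are made up for illustration.

```python
import unicodedata

# Character set taken from the hypothetical ".fr" statement in this message.
FR_ALLOWED = set("-0123456789abcdefghijklmnopqrstuvwxyzàâæçèéêëîïôùûüÿœ")


def label_allowed(label: str, allowed=FR_ALLOWED) -> bool:
    """Return True if every character of the label is in the permitted set.

    The label is lowercased and NFC-normalized first, so that "e" followed by
    a combining acute accent is treated the same as the precomposed "é".
    """
    normalized = unicodedata.normalize("NFC", label.lower())
    return bool(normalized) and all(ch in allowed for ch in normalized)


if __name__ == "__main__":
    for candidate in ["école", "cafe\u0301", "pla\u0131sir", "\u0430bc"]:
        print(repr(candidate), label_allowed(candidate))
```

Under this rule a label containing dotless-i (U+0131) or Cyrillic a (U+0430) is simply refused at registration time.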

Again, if this is unworkable, please let me know a better alternative.

-- Neil

Dave Crocker wrote:

> After extensive analysis and discussion, the Mozilla community and Opera
> have already produced a fix for this, based on only displaying Unicode
> IDN labels where the registry publishes and enforces well-defined
> anti-homograph policies, and displaying the Punycode equivalent

...snip...

3. How does this apply to subordinate domains that might or might not enforce "acceptable" policies, given that not all policy-making is at the TLD level?

It assumes that the TLD registry enforces its policy at the organization level, for every domain under which it issues names.

The assumption is made that operators and users of websites and other services have to place their trust in the chain of organizations delegating the DNS for their domain, and in particular, the one that registered the domain with the TLD registry. This reflects common practice, in which most services involving any significant value or risk are generally operated from their own domains in order to reduce the number of third parties to be trusted as far as possible.

-- Neil

Stephane, can I ask you what your detailed objections are to the
Moz/Opera mechanism, and could you let me know your proposal for an
alternative mechanism for preventing IDN spoofing?

I would suggest that an alternative mechanism should include
a set of code points to be used for the on-the-wire DNS
protocol and the registry databases. This set of codepoints
will greatly restrict the possibility of ambiguity. Right
now it is utterly impossible to represent the ambiguity
of IBM, ibm, IBM or IbM in the DNS because the set of
codepoints only allows for one code to be shared by I and i.
This principle could be extended to other scripts so that,
for instance, codes for the 2nd and 4th letters of the
Cyrillic alphabet could be added while not adding codes
for the 1st and 3rd letters because A and B are already there.

Two additional items needed are translation tables. One
translation table would be the PREFERRED mapping from the
DNS codepoints to Unicode. I say "preferred" because while
some people will be happy to see the "b" as in "ibm", others
may prefer to see it as "B" especially Cyrillic users who
use "B" for a completely different letter most of the time.
Also, Arabs may prefer to map first and last letters of a
domain to the initial and final forms of the letter and
use medials for the rest because it looks better most of
the time. This does not create exploitable ambiguity.

The second item is a comprehensive mapping for all of
UNICODE that maps each code point into one of the DNS
code points. This should be defined as an algorithm because
that allows for a combination of mapping tables and more
efficient ways of defining and executing the mapping.
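
To make the two tables concrete, here is a toy sketch of how I read this proposal. Everything in it is an assumption for illustration: the names `FOLD`, `PREFERRED_CYRILLIC`, `to_dns_codepoints` and `display` are hypothetical, and the tables hold only a handful of entries rather than a real mapping of all of Unicode.

```python
# Hypothetical "reverse" table: Unicode characters that would share one DNS
# code point. Characters not listed fall back to simple lowercasing.
FOLD = {
    "\u0410": "a", "\u0430": "a",  # Cyrillic A shares the code for Latin a
    "\u0412": "b", "\u0432": "b",  # Cyrillic Ve shares the code for Latin b
}

# Hypothetical PREFERRED table for a Cyrillic-localized system: how each DNS
# code point would be shown to that user.
PREFERRED_CYRILLIC = {"a": "\u0410", "b": "\u0412"}


def to_dns_codepoints(name: str) -> str:
    """Map typed Unicode text onto the restricted set of DNS code points."""
    return "".join(FOLD.get(ch, ch.lower()) for ch in name)


def display(dns_name: str, preferred: dict) -> str:
    """Render DNS code points using a locale's preferred characters."""
    return "".join(preferred.get(ch, ch) for ch in dns_name)


if __name__ == "__main__":
    for typed in ["IBM", "ibm", "IbM", "I\u0412M"]:
        print(repr(typed), "->", to_dns_codepoints(typed))
    print(display("ab", PREFERRED_CYRILLIC))
```

Many typed forms collapse to one registered name, while the display table only changes what the user sees.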

It may be painful to upgrade the DNS, but if we are going
to do so, we need to try to make it a solution that will
work for a long time, not just quick fix patches.

I have nothing against the Mozilla solution as a quick
fix but I hope that it is used to demonstrate the need
for upgrading DNS and fixing the problem at its root.

For example, simple script
restrictıons alone, as per ICANN, do not solve the problem -- there are
plenty of subtle homographs in the Latin alphabet, such as the one
embedded in this sentence.

Personally, I consider that to be the Turkish alphabet, not the
Latin one. Turkic speakers who use Cyrillic also have a habit
of adopting munged up characters in their alphabets. I think this
is solved by defining the PREFERRED mapping as described above.
Turkey would implement it keeping the distinction between the
i with and without the dot. Many other countries would opt for
sticking in some code like "?" to indicate that there is a weird
character there. If I localize my computer to allow Turkish text
entry and Turkish fonts, no doubt I would also get the Turkish
domain name mapping preferences. And no doubt, Central Asian countries
speaking Turkic languages but using the Cyrillic alphabet would map
all the codes into their familiar Cyrillic forms.

This is possible because the reverse mapping allows one to type
in many different possible UNICODE character forms of a domain name
in order to get the same single unambiguous registered domain name.

* it is scalable on a per-registry basis, so there's no need for a "flag
day", and requires no action on behalf of the registry beyond that which
might be expected as a service to their customers, who have a reasonable
expectation that their domains not be easily spoofed.

I think if we are going to upgrade the DNS, then registries will have
to adapt in the same way as everybody else. And if that includes a
flag day, then so be it. I suspect, however, that we will find some
less disruptive way to transition, perhaps with two flag days to
indicate the beginning and the end of a transition period.

For example, for .fr, it could be as simple as saying something like
"labels in .fr must consist only of characters from the set -, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q,
r, s, t, u, v, w, x, y, z, à, â, æ, ç, è, é, ê, ë, î, ï, ô, ù, û, ü, ÿ,
œ", putting that statement on their website, and letting the software
makers know about it.

And if a Turkish cultural centre in Paris wants to register a domain
name with the undotted i, then what? National boundaries have no
relationship to cultural boundaries. Admittedly, in my solution
suggested above, if such a Turkish domain name did exist, anyone who
did not have a localized system supporting entry of the undotted i
would not be able to enter the name of the domain. They could still
access the website by leveraging a website that allowed them to access
it by clicking a link, in the same way that http://www.translit.ru
provides a Cyrillic keyboard for computers without Cyrillic
localization installed.

--Michael Dillon

Michael, your idea of mapping confusable characters to a single "master" character was one of the options which was considered, but rejected.

To see why, consider the Turkish dotless-i in your second example. Now, to most non-Turkish readers, dotless-i is a homograph of the more common dotted-i character. If we map both to ASCII code 105, we've eliminated the homograph for non-Turkish users, but we then deny Turkish users the useful distinction between the two letters. Adding epicycles to this scheme, with character-set tags or filter rules based on the locale setting on the client, unfortunately makes things worse, not better.

This example actually illustrates rather nicely why it is so important that different TLDs, particularly ccTLDs, should be able to have different rules. For example, it's possible (I don't know Turkish) that there may be some pair of names in Turkish which may be distinguished entirely by the difference between dotted and dotless-i.

Any procedure for preventing spoofing must bear in mind the fact that registries process vast numbers of registrations daily, and human oversight is not possible in the general case.

Bundling using confusables-tables, with appropriate considerations for cultural variations in what is confusable, is a much more effective approach, and allows subtle distinctions to be retained for those labels for which they are useful.

For example, the case of a dotless-i in a name registered in .fr could be easily dealt with by bundling, even if for French purposes dotted and dotless-i were normalized to the same equivalence set of confusable characters, provided that no potentially confusable French name had been registered first.
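
A minimal sketch of the blocking side of this, assuming a tiny hand-written confusables table (not the real confusables.txt from UTR #36) and hypothetical names `CONFUSABLES`, `skeleton` and `Registry`:

```python
# Hypothetical excerpt of a confusables table: each character maps to the
# representative of its equivalence set. A real registry would load a full
# table and adjust it for local cultural conventions.
CONFUSABLES = {
    "\u0131": "i",  # LATIN SMALL LETTER DOTLESS I
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}


def skeleton(label: str) -> str:
    """Reduce a label to the key shared by all of its confusable variants."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())


class Registry:
    """Toy registry that blocks labels confusable with an existing one."""

    def __init__(self):
        self._bundles = {}  # skeleton -> label that was registered first

    def register(self, label: str) -> bool:
        key = skeleton(label)
        if key in self._bundles:
            return False  # this bundle is already taken (first come wins)
        self._bundles[key] = label
        return True


if __name__ == "__main__":
    registry = Registry()
    print(registry.register("\u0131stanbul"))  # dotless-i form arrives first
    print(registry.register("istanbul"))       # confusable variant, blocked
```

The dotless-i name keeps its exact spelling in the registry; only its look-alike variants are held back, which is the distinction that folding everything to one code point would lose.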

-- Neil