RE: Mozilla Implements TLD Whitelist for Firefox in Response to IDN Homographs Spoofing

Phil said:

Does anyone else think that it's not the job of a web browser to do this?

Yes, it's recognized by Mozilla and others as the job of the Internet
Architecture Board (in particular, the IAB-IDN group) to make a final
decision on how to deal with homographs. However, for early adopters
like Mozilla with a released software package that supports IDNs, they
are taking an intermediate action until the committee comes up with a
guideline.

I think you're both right -- most applications can't feasibly manage
their own Unicode Philosophy, but Mozilla needs to do something for the
short term.

-Jason

* Jason Sloderbeck:

Yes, it's recognized by Mozilla and others as the job of the Internet
Architecture Board (in particular, the IAB-IDN group) to make a final
decision on how to deal with homographs.

Homographs are a classical example of a PR attack. It's a complete
non-issue. In practice, people don't use domain names to assess the
credibility of web sites. 1/l/I and 0/O are homographs as well, and
the Internet hasn't collapsed as a result.

The really stunning thing about the whole mess is that nobody seems to
grasp that technically, TLDs are not in a position to restrict name
server operators to any character sets in the domain names they use.
After all, I can add any domain name I want to my zone files.

Florian Weimer wrote:

* Jason Sloderbeck:

Yes, it's recognized by Mozilla and others as the job of the Internet
Architecture Board (in particular, the IAB-IDN group) to make a final
decision on how to deal with homographs.
   
Homographs are a classical example of a PR attack. It's a complete
non-issue. In practice, people don't use domain names to assess the
credibility of web sites. 1/l/I and 0/O are homographs as well, and
the Internet hasn't collapsed as a result.

The really stunning thing about the whole mess is that nobody seems to
grasp that technically, TLDs are not in a position to restrict name
server operators to any character sets in the domain names they use.
After all, I can add any domain name I want to my zone files.

Indeed you can.

But since the TLD registry operators can, and do, control the delegation of their TLDs, they have de facto control over the set of second-level labels that are publicly visible within their TLDs, unless you can persuade people to reach your nameserver by some route other than the normal delegation from the root. This means that they can, if they so wish, apply character set restrictions to those labels. Your TLD registry, for example, can and does enforce such a policy. (http://www.denic.de/en/richtlinien.html)

On the other hand, there's nothing anyone can do to stop you resolving whatever labels you like on your own public nameservers, within your third-level, fourth-level and so on domains. However, this is unlikely to cause security problems for anyone apart from yourself and/or your customers.
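
To make the registry-side restriction concrete, here is a minimal sketch of the kind of check a registry could run before delegating a second-level label. The allowed set and the function name are assumptions for illustration only; this is not DENIC's actual policy or code.

```python
# Hypothetical registry policy: ASCII letters, digits and hyphen,
# plus a small registry-chosen set of extra letters (illustrative only).
ALLOWED_EXTRA = set("äöüß")

def label_allowed(label):
    """Return True if every character of the label is in the permitted set."""
    label = label.lower()
    # Empty labels and labels with a leading or trailing hyphen are refused.
    if not label or label.startswith("-") or label.endswith("-"):
        return False
    for ch in label:
        if ch.isascii() and (ch.isalnum() or ch == "-"):
            continue
        if ch in ALLOWED_EXTRA:
            continue
        return False
    return True

print(label_allowed("müller"))          # permitted by this sample policy
print(label_allowed("yah\u03bf\u03bf")) # contains Greek omicrons, refused
```

A check like this only covers labels the registry itself delegates; as noted above, it says nothing about third-level and deeper names served from someone else's zone files.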

-- Neil


* Neil Harris:

But since the TLD registry operators can, and do, control the delegation
of their TLDs, they have de facto control over the sets of labels that
can be used for second-level domain labels that are publicly visible
within their TLD domains,

I just don't see why this label is particularly important. If the
domain name is sufficiently long, it's not even displayed by current
browsers.

Even if this is fixed, how many users are aware that you have to read
domain names from right to left?

Homographs are a classical example of a PR attack. It's a complete
non-issue.

I am inclined to agree.

But since the TLD registry operators can, and do, control the delegation
of their TLDs, they have de facto control over the sets of labels that
can be used for second-level domain labels that are publicly visible
within their TLD domains

Indeed. The actual problem is that ICANN has been captured by the
trademark community (WIPO, basically) and has internalized two bad
ideas, that domains are like trademarks, and it is ICANN's job to
protect them. Once the registrars and registries realized that this
meant a thousand first-day registrations in a new domain (you may be
sure that disney.xxx has been presold), there hasn't been any serious
opposition so there are continuing inane arguments about how to
prevent 2LD homographs, even as everyone agrees that it's impossible.

Mozilla's approach strikes me as the least bad way to appease the
trademark crazies without interfering too badly with useful work. I
will be interested to see what they do when a cctld declares that
their policy is that they permit any name.

R's,
John
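
For readers who have not seen the whitelist idea spelled out, the following is a rough sketch of how a browser could decide what to show in the location field: Unicode for TLDs it trusts, the raw xn-- form for everything else. The whitelist entries and the function name are hypothetical, and this is not Mozilla's actual code.

```python
# Hypothetical whitelist of TLDs whose registries publish a homograph policy.
# The entries are illustrative, not the list any browser actually ships.
TLD_WHITELIST = {"de", "at", "jp"}

def display_form(ace_hostname, whitelist=TLD_WHITELIST):
    """Return the Unicode form for whitelisted TLDs, else the raw ACE form."""
    tld = ace_hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld in whitelist:
        # Python's built-in "idna" codec converts xn-- labels back to Unicode.
        return ace_hostname.encode("ascii").decode("idna")
    return ace_hostname

# Build the ACE (xn--) forms from readable names for the demonstration.
trusted = "müller.de".encode("idna").decode("ascii")
untrusted = "müller.com".encode("idna").decode("ascii")

print(display_form(trusted))    # shown as Unicode
print(display_form(untrusted))  # left in its xn-- form
```

Under this scheme a ccTLD that permits any name would simply never be added to the whitelist, so its IDNs would keep showing in the xn-- form.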

English-speaking folks actually do often notice the difference between 1/l/I
and 0/O, partly because they're usually (in browsers) lower case -- hence
1/l/i and 0/o (while 1/l is still close, the users are trained by years to
know the difference). It's an implicit Turing-test factor based on
linguistic experience.

Homographs where the glyphs are almost or completely identical, but
the code points are completely different, are where this *really* breaks
down. There are several sets of glyphs that can mimic nearly all of the
Latin alphabet -- and in most fonts, they look *identical* to the Latin
glyphs (some fonts simply remap to use the Latin glyph's data).

Unfortunately, Pine isn't really a UTF-8 mailer, or I'd demonstrate on list
for you. However, if you have a UTF-capable browser (chances are, you do),
the following should demonstrate identical-glyph homographs nicely.

    http://www.duh.org/homographs.cgi

(Hint: In each group of three lines, the strings of characters are NOT
identical, regardless of what your eyes may tell you.)
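
For anyone who cannot load the page, the same point can be checked locally. The snippet below is a small sketch using Python's standard unicodedata module; the helper name describe is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' string per character in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]

# Three letters that most fonts draw with the same glyph.
latin_o = "o"           # U+006F
greek_o = "\u03bf"      # U+03BF
cyrillic_o = "\u043e"   # U+043E

for ch in (latin_o, greek_o, cyrillic_o):
    print(describe(ch)[0])

# The strings compare unequal even though they look alike on screen.
print(latin_o == greek_o, latin_o == cyrillic_o)
```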

* Todd Vierling:

Homographs are a classical example of a PR attack. It's a complete
non-issue. In practice, people don't use domain names to assess the
credibility of web sites. 1/l/I and 0/O are homographs as well, and
the Internet hasn't collapsed as a result.

English-speaking folks actually do often notice the difference between 1/l/I
and 0/O, partly because they're usually (in browsers) lower case -- hence
1/l/i and 0/o (while 1/l is still close, the users are trained by years to
know the difference). It's an implicit Turing-test factor based on
linguistic experience.

But case is controlled by the attacker. Maybe users would be alerted
if they saw a capitalized domain name, which rules out the O/0
replacement. But the l/1/I issue still remains.

Homographs where the glyphs are almost or completely identical, but
the code points are completely different, are where this *really* breaks
down. There are several sets of glyphs that can mimic nearly all of the
Latin alphabet -- and in most fonts, they look *identical* to the Latin
glyphs (some fonts simply remap to use the Latin glyph's data).

So what? For most .DE domains, I still can get the corresponding
.DE.VU domain. Apart from the trailing .VU, the strings are even
bitwise identical.

Let me repeat my other argument: Users don't use domain names in trust
assessments. The smarter ones seem to recall how they got to a
particular page. This is quite consistent with real-world behavior.
Most people tend not to forget that they are in some questionable part
of the city just because they meet an attractive member of the
appropriate sex (or something like that, you get the idea).

(Hint: In each group of three lines, the strings of characters are NOT
identical, regardless of what your eyes may tell you.)

They appear differently because even though they are from a single
font, the characters have slightly different widths. This wouldn't
matter in the location field, of course.

Let me repeat my other argument: Users don't use domain names in trust
assessments. The smarter ones seem to recall how they got to a
particular page. This is quite consistent with real-world behavior.

Uh, I beg to differ -- most of my family would see

    h t t p : / / w w w . y a h <omicron> <omicron> . g r /

and think "the Yahoo site in Greece". After all, it renders as precisely

    http://www.yahoo.gr/

on-screen, same character glyph, width, and all. This isn't a PR attack;
it's a real inverse-Turing-test type of attack. People do look at URLs
visually, and many can recognize the difference with simple homographs, but
most, I assure you, cannot.
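
Software can see what the eye cannot. The following is a rough sketch of a mixed-script check for the example above; the script guess (first word of the Unicode character name) is a crude assumption made for brevity, and the function names are made up.

```python
import unicodedata

def scripts_in(label):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

def is_mixed_script(label):
    """True if the label's letters come from more than one script."""
    return len(scripts_in(label)) > 1

real = "yahoo"
spoof = "yah\u03bf\u03bf"   # last two letters are GREEK SMALL LETTER OMICRON

print(is_mixed_script(real), is_mixed_script(spoof))

# On the wire the spoofed name travels in its ASCII-compatible (xn--) form,
# which is visibly different from the all-ASCII name.
print(("www." + spoof + ".gr").encode("idna"))
```

A check like this only catches labels that mix scripts. A label written entirely in look-alike Cyrillic or Greek letters would pass it, which is part of why the problem is hard.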

> (Hint: In each group of three lines, the strings of characters are NOT
> identical, regardless of what your eyes may tell you.)

They appear differently because even though they are from a single
font, the characters have slightly different widths.

Actually, out of all the fonts and OSs I tried, including one I prefer not
to use or name but which many people do use, only the Cyrillic lowercase on
one font on one OS had different widths, for exactly one character -- all
others had identical widths.

So you probably have a lucky font -- and fortunately you're already
technically knowledgeable enough to know what a Unicode character is and
how it differs from plain ASCII. Most users are *NOT* so lucky, as much as
you'd hope for that.

This wouldn't matter in the location field, of course.

How so? The movement is in the direction of rendering IDNs natively as
Unicode in the Location field, so this is exactly the same problem.

(Hm. I'm beginning to smell the T-word, but I'll wait and see how thick the
skull material is first.)

John Levine wrote:

Homographs are a classical example of a PR attack. It's a complete
non-issue.
     
I am inclined to agree.

But since the TLD registry operators can, and do, control the delegation
of their TLDs, they have de facto control over the sets of labels that
can be used for second-level domain labels that are publicly visible
within their TLD domains
   
Indeed. The actual problem is that ICANN has been captured by the
trademark community (WIPO, basically) and has internalized two bad
ideas, that domains are like trademarks, and it is ICANN's job to
protect them. Once the registrars and registries realized that this
meant a thousand first-day registrations in a new domain (you may be
sure that disney.xxx has been presold), there hasn't been any serious
opposition so there are continuing inane arguments about how to
prevent 2LD homographs, even as everyone agrees that it's impossible.

Mozilla's approach strikes me as the least bad way to appease the
trademark crazies without interfering too badly with useful work. I
will be interested to see what they do when a cctld declares that
their policy is that they permit any name.

R's,
John

On the first point, yes, I agree, it's probably the least-worst solution.

On the second point: Mozilla, I imagine, would do nothing at all.

-- Neil