May 4, 2004
Internationalized Domain Names
The Internet is becoming increasingly international, accessed by people
who speak a wide variety of different languages. However, the character
set used by DNS and other core protocols hasn't kept up. This must change
if IP technology is to reach broader acceptance among non-English-speaking
audiences, and breaking the Internet's dependency on seven-bit ASCII
is a good place to start. One critical advance toward this objective
was made last year when the IETF published "Internationalizing Domain
Names in Applications (IDNA)" (RFC 3490). IDNA specifies the use of Internationalized
Domain Names (IDNs) to display characters from foreign languages and
alphabets.
For people in predominantly English-speaking countries, international
characters may seem irrelevant, or at best a distraction from more pressing
needs. However, large-scale changes to the global infrastructure will
affect every network whose users communicate internationally. Sending
e-mail to users in another country may eventually require an upgrade
to IDNs. Companies selling products or services worldwide may want to
register an IDN that accurately represents their wares, and anyone with
international clientele will need to be prepared for support issues.
The IDNA Transformation Model
IDNA describes algorithms for the presentation and encapsulation of
IDNs. This means that an IDN can exist either in a "rich" form that uses
characters from international languages, or in an encoded form that's
compatible with legacy ASCII. Applications can then use whichever form
is most appropriate. For example, the rich form is likely to be presented
to users for display purposes, while the encoded ASCII form can be passed
through to underlying applications and protocols. Eventually, underlying
applications may also be able to use the rich form internally.
Domain names are encoded in ASCII form by default. This way, they're
compatible with legacy host name rules and can thus be supported by every
Internet application. These legacy rules allow only the letters A through
Z (in either case), the digits 0 through 9, and the hyphen. They also
restrict where the hyphen may appear (a label can't begin or end with
one) and limit each label to 63 octets and the overall domain name to
255 octets.
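As a rough illustration of these legacy rules, the following Python sketch
checks whether a name would pass as a traditional host name. The function
name is_legacy_hostname is made up for this example, and the limits it
enforces (63 characters per label, 253 characters for the dotted name, no
leading or trailing hyphen) are assumptions taken from the usual reading of
RFCs 1035 and 1123, not from IDNA itself.

```python
import re

# One label: letters, digits, and hyphens only; 1 to 63 characters long;
# must not begin or end with a hyphen.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_legacy_hostname(name: str) -> bool:
    """Return True if name fits the ASCII-only legacy host name rules."""
    if name.endswith("."):          # tolerate a fully qualified trailing dot
        name = name[:-1]
    # 253 characters of dotted text corresponds to the 255-octet wire limit.
    if not name or len(name) > 253:
        return False
    return all(_LDH_LABEL.fullmatch(label) for label in name.split("."))


print(is_legacy_hostname("www.xn--exmple-cua.com"))  # True
print(is_legacy_hostname("www.exämple.com"))         # False
```

An encoded name passes this check while its rich form does not, which is
why the encoded form can travel through software that knows nothing about
IDNs.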
The IDNA encoding mechanism produces ASCII syntax that's compatible
with legacy DNS rules. This ensures that all legacy protocols and services
using domain names are able to interoperate, even if they can't display
the full rich form to the end user. For example, SMTP uses e-mail addresses
as identifiers, while HTTP uses URLs and host names for various operations.
These protocols must continue to use ASCII until they're extended to
use the full IDN.
For the moment, the internationalized form is used mainly for input
and output operations that interface with a human user. For example,
a user may be able to type an IDN into a Web browser's URL bar, but the
browser must convert that domain name into the ASCII-encoded form before
performing a DNS lookup. Similarly, HTTP will issue a GET message using
ASCII.
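The browser-side conversion might look something like the following Python
sketch. It relies on Python's built-in "idna" codec, which implements the
RFC 3490 operations; the build_get_request helper is a hypothetical name
used only for illustration, and a real browser does considerably more.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request for a possibly rich host name."""
    # The "idna" codec applies the RFC 3490 ToASCII operation to each label.
    ascii_host = host.encode("idna").decode("ascii")
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {ascii_host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    # Every byte sent is plain ASCII; the same ascii_host string is what
    # would be handed to the DNS resolver.
    return request.encode("ascii")


print(build_get_request("www.exämple.com").decode("ascii"))
```

The user types the rich form once, and everything below the user interface
sees only the encoded name.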
Full IDNA support requires that this conversion also be performed in
reverse, presenting encoded domain names in their rich form. If a user
clicks on a link leading to a Web page in an internationalized domain,
the browser should render that domain name using non-ASCII characters
for display, printing, and bookmarking. Similarly, an e-mail client should
allow internationalized addresses to be stored in the client's address
book.
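For the display direction, a minimal sketch using the same built-in Python
codec could look like this. The display_form name is hypothetical, and
falling back to the ASCII form when a label can't be decoded is a design
assumption of this example, not something the article prescribes.

```python
def display_form(ascii_host: str) -> str:
    """Return the rich form of an encoded host name for display."""
    try:
        # The "idna" codec applies the RFC 3490 ToUnicode operation per label.
        return ascii_host.encode("ascii").decode("idna")
    except UnicodeError:
        # If a label can't be decoded, show the name exactly as received.
        return ascii_host


print(display_form("www.xn--exmple-cua.com"))  # www.exämple.com
```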
Unfortunately, not all applications support this seamless process.
For instance, e-mail and Usenet messages all contain a globally unique
Message-ID header field that includes a domain name. If these are converted
on the fly during search and fetch operations, the resulting values won't
match the originals, causing problems for indexing software and mail
databases.
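The mismatch is easy to reproduce. In the Python sketch below, the
Message-ID value, the dictionary standing in for a mail index, and the
to_rich_message_id helper are all invented for illustration.

```python
# A hypothetical mail index keyed by the Message-ID exactly as it arrived.
stored_id = "<20040504.1234@xn--exmple-cua.com>"
index = {stored_id: "message body ..."}


def to_rich_message_id(message_id: str) -> str:
    """Convert the domain part of a Message-ID to its rich form."""
    local, _, domain = message_id.strip("<>").rpartition("@")
    return f"<{local}@{domain.encode('ascii').decode('idna')}>"


converted = to_rich_message_id(stored_id)
print(converted in index)   # False: the converted value no longer matches
```

Identifiers of this kind are safest when stored and compared in the form in
which they were received, with conversion reserved for display.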
The Internationalization Wave
To date, few applications have implemented IDNA's transformation service.
Among those that have, the implementations have been of variable quality
and aren't always complete.
For example, the Web browser component of Mozilla 1.6 offers support
for IDNA-to-ASCII conversion, but not the reverse: It will accept and
process an IDN entered into the URL bar, but won't display the rich form
of one reached through a hyperlink. Mozilla's e-mail client offers no
support, so users can't enter an e-mail address containing an IDN. Recent
versions of Opera and Konqueror offer pretty good support, although both
still include some minor bugs.
Microsoft doesn't currently offer IDNA support for either Internet
Explorer or Outlook, but has announced plans to implement it in future
releases. Users who don't want to wait can add IDNA to current versions
of Explorer and Outlook using third-party plug-ins.
Most Internet software doesn't perform any kind of IDNA transformation
yet. Everyday applications such as Traceroute will have
to be extended to perform input and output conversions before the Internet
can appear to be anything other than an ASCII-centric network. Similarly,
basic services such as DHCP and SNMP will need to be upgraded before
they can be used to reach domains containing non-ASCII characters. A
100 percent international experience requires a 100 percent replacement
of every user-facing piece of code on the planet, from ping to printer
drivers.
This means that the entire network needs to undergo a forklift upgrade
before IDNA can truly internationalize the Internet. On top of that,
several core technologies require enhancements to support IDNs internally.
For example, even though the domain element within an e-mail address
can be internationalized through IDNA transformation technology, the
local element that defines a username or mailbox is still limited to
ASCII.
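A short Python sketch makes the asymmetry concrete. The encode_address
helper is a hypothetical name, and rejecting a non-ASCII local part
outright is one possible policy chosen for this example.

```python
def encode_address(address: str) -> str:
    """Encode the domain of an e-mail address; require an ASCII local part."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a valid address: {address!r}")
    if not local.isascii():
        # IDNA covers domain names only; it offers no encoding for mailboxes.
        raise ValueError(f"local part must be ASCII: {local!r}")
    return f"{local}@{domain.encode('idna').decode('ascii')}"


print(encode_address("info@exämple.com"))  # info@xn--exmple-cua.com
```

An address such as "jürgen@example.com" is rejected by this sketch even
though its domain is plain ASCII, because the non-ASCII character sits in
the local part.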
The IDNA-to-ASCII Conversion Process
IDNs require two different types of conversion: one in which the rich
domain name is encoded into ASCII, and the other in which the ASCII form
is decoded back into a rich domain name. These conversion operations are
described in RFC 3490 as the ToASCII and ToUnicode functions, respectively.
The ToASCII function requires the following steps to be performed:
- Separate the domain name into its component labels (the fields separated
by dots) and check each one for international characters. If a label
only contains ordinary ASCII characters, it doesn't require conversion.
- Convert any extended characters into Unicode, the international standard
character set that covers most of the world's writing systems. Many OSs
and applications use other character sets, but the encoding routines
require Unicode.
- Normalize the characters to a particular form as specified in RFC
3491, "Nameprep: A Stringprep Profile for Internationalized Domain Names." This
step is required because different Unicode strings can represent the
same domain name. Uppercase characters must be converted to lowercase,
and accents entered as separate characters must be combined with their
base letters. For example, if a user enters "www.Ex¨Ample.com," the
sequence will need to be normalized to "www.exämple.com". Some domain
name registries apply additional restrictions. For instance, the Polish
top-level domain (.pl) only allows characters from European and Middle
Eastern languages. However, these restrictions only need to be considered
if the domain name is actually registered; DNS will simply return a "Name
not found" error if an illegal domain name is requested.
- Apply the conversion algorithm specified in RFC 3492, "Punycode:
A Bootstring Encoding of Unicode for Internationalized Domain Names
in Applications." At this point, each non-ASCII character is removed
from the label and re-encoded as a sequence of ASCII characters appended
to the end. For instance, the "exämple" label would be converted into
"exmple-cua," where the "-cua" suffix encodes which character was removed
and where it belongs.
- Prepend a special tagging sequence, "xn--," to the beginning of the
resulting label. This allows systems to recognize that the label contains
an encoded domain name. A system that doesn't support rich characters
will thus display "www.exämple.com" as "www.xn--exmple-cua.com".
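The steps above can be sketched in Python as follows. This is a simplified
illustration, not a complete implementation: lowercasing plus NFKC
normalization stands in for Nameprep, which also applies mapping tables
and prohibited-character checks, and the to_ascii_label and to_ascii names
are hypothetical. Production code would be better served by a full
implementation such as Python's built-in "idna" codec.

```python
import unicodedata

ACE_PREFIX = "xn--"


def to_ascii_label(label: str) -> str:
    """Simplified sketch of ToASCII for a single label."""
    # Step 1: a label containing only ASCII needs no conversion.
    if label.isascii():
        return label
    # Steps 2 and 3: Python strings are already Unicode, so only the
    # normalization remains. Lowercase + NFKC is a stand-in for Nameprep.
    normalized = unicodedata.normalize("NFKC", label.lower())
    if normalized.isascii():
        return normalized
    # Step 4: Punycode, via Python's built-in "punycode" codec.
    encoded = normalized.encode("punycode").decode("ascii")
    # Step 5: prepend the tag that marks an encoded label.
    ace = ACE_PREFIX + encoded
    if len(ace) > 63:               # legacy per-label length limit
        raise ValueError("encoded label exceeds 63 characters")
    return ace


def to_ascii(domain: str) -> str:
    """Apply to_ascii_label to each dot-separated label."""
    return ".".join(to_ascii_label(label) for label in domain.split("."))


# "a" followed by U+0308 (combining diaeresis) composes to a single "ä".
print(to_ascii("www.Exa\u0308mple.com"))  # www.xn--exmple-cua.com
```

Note that the encoded label is "exmple-cua": Punycode drops the "ä" from
the run of ASCII characters and records it in the suffix; it does not
substitute a plain "a" for it.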
When the "www.xn--exmple-cua.com" domain name is received by an IDNA-aware
application, it will recognize the "xn--" tag in the middle label as
indicating an encoded IDN, and can convert the domain name back
to the "www.exämple.com" form. Note that since IDNA does not preserve
non-normalized information during conversion and encoding, it is not
possible to recover the name as the user originally typed it, with
mixed case and separated accents.
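The reverse direction, and the loss of the original spelling, can be seen
in a short Python sketch. The to_unicode_label and to_unicode helpers are
hypothetical and deliberately simplified: the full ToUnicode operation in
RFC 3490 also re-encodes the result to verify it, a check omitted here.

```python
ACE_PREFIX = "xn--"


def to_unicode_label(label: str) -> str:
    """Simplified sketch of ToUnicode for a single label."""
    if not label.lower().startswith(ACE_PREFIX):
        return label                # not an encoded label
    # Strip the tag and reverse the Punycode step.
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


def to_unicode(domain: str) -> str:
    """Apply to_unicode_label to each dot-separated label."""
    return ".".join(to_unicode_label(label) for label in domain.split("."))


typed = "www.Exa\u0308mple.com"     # mixed case, accent as a separate character
encoded = typed.encode("idna").decode("ascii")
restored = to_unicode(encoded)
print(encoded)                      # www.xn--exmple-cua.com
print(restored == typed)            # False: the original spelling is gone
```

The restored name is the normalized "www.exämple.com", which is equivalent
to what the user typed but not identical to it.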
Written by Eric A. Hall.
Copyright © 2004 CMP Media, Inc. Used with permission.