Punycode
Punycode is an encoding that represents Unicode
characters using the limited ASCII character set
(letters, digits, and hyphens) allowed in Domain Name
System (DNS) hostnames. Internationalized domain
names (IDNs) containing non-ASCII characters are
converted to ASCII-Compatible Encoding (ACE) labels
prefixed with xn--, allowing DNS infrastructure to
handle them without modification.
Usage
The DNS was designed for ASCII hostnames. Domain names containing characters from scripts like Arabic, Chinese, Cyrillic, Devanagari, or accented Latin letters cannot be processed directly by DNS resolvers. Punycode bridges this gap by encoding Unicode labels into a restricted ASCII character set supported natively by DNS.
The encoding applies exclusively to individual labels within a hostname. Characters in the path, query, or fragment of a URL use Percent-Encoding instead.
Hostname labels only
Punycode encodes individual hostname labels (the parts separated by dots). Each label is encoded independently. The path and query components of a URL use Percent-Encoding for non-ASCII characters.
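The split between the two encodings can be sketched in Python. This is a minimal illustration, not a full IDNA implementation: the helper names encode_host and encode_url are hypothetical, the https scheme is assumed, and the label encoding uses the standard library's raw punycode codec without any IDNA mapping or validation.

```python
from urllib.parse import quote


def encode_host(host: str) -> str:
    """Encode each non-ASCII label of a hostname independently (hypothetical helper)."""
    labels = []
    for label in host.split("."):
        if label.isascii():
            labels.append(label)  # ASCII labels pass through unchanged
        else:
            # Raw Punycode only; IDNA mapping and validation are not applied here.
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


def encode_url(host: str, path: str) -> str:
    """Build a URL: Punycode for the hostname, percent-encoding for the path."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters and keeps "/".
    return "https://" + encode_host(host) + quote(path)


print(encode_url("münchen.example.re", "/straße"))
# https://xn--mnchen-3ya.example.re/stra%C3%9Fe
```

The same character class (non-ASCII letters) ends up in two different forms depending on which URL component it appears in.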
ACE labels
An ASCII-Compatible Encoding (ACE) label is the
Punycode-encoded form of a Unicode label, prefixed
with xn--. The prefix signals to IDNA-aware software
that the label contains encoded Unicode content.
The xn-- prefix applies to each label
independently. In a domain like münchen.example.re,
only the first label is encoded:
xn--mnchen-3ya.example.re
ACE labels follow the same constraints as standard DNS labels: a maximum of 63 characters per label and 253 characters for the full domain name.
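Because the limits apply to the encoded form, a Unicode label that looks short enough can still be too long once converted. The following sketch checks both limits after conversion. The function names to_ace and length_problems are hypothetical, and the conversion uses the raw punycode codec without IDNA validation.

```python
ACE_PREFIX = "xn--"
MAX_LABEL_LEN = 63   # per-label limit
MAX_NAME_LEN = 253   # limit for the full domain name


def to_ace(domain: str) -> str:
    """Convert non-ASCII labels to ACE labels (raw Punycode, no IDNA checks)."""
    labels = []
    for label in domain.split("."):
        if not label.isascii():
            label = ACE_PREFIX + label.encode("punycode").decode("ascii")
        labels.append(label)
    return ".".join(labels)


def length_problems(domain: str) -> list:
    """Return a list of length violations found in the ACE form of a domain."""
    ace = to_ace(domain)
    problems = []
    if len(ace) > MAX_NAME_LEN:
        problems.append(f"name is {len(ace)} characters (limit {MAX_NAME_LEN})")
    for label in ace.split("."):
        if not 1 <= len(label) <= MAX_LABEL_LEN:
            problems.append(f"label is {len(label)} characters (limit {MAX_LABEL_LEN})")
    return problems


print(length_problems("münchen.example.re"))          # []
print(length_problems("ü" + "a" * 60 + ".example.re"))  # one label violation
```

The second call uses a label of 61 Unicode characters. Its ACE form adds the four-character prefix, the delimiter, and the encoded digits, which pushes it past 63 characters.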
Character set
Punycode uses a base-36 character set:
- Lowercase letters: a-z (values 0-25)
- Digits: 0-9 (values 26-35)
- Hyphen: - (delimiter only)
Letters are case-insensitive in Punycode. The last hyphen in an encoded label serves as the separator between the literal ASCII portion and the encoded non-ASCII portion; a label with no ASCII characters contains no delimiter.
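The digit values and the role of the delimiter can be illustrated with a few lines of Python. The names DIGITS, digit_value, and split_encoded are hypothetical helpers chosen for this sketch.

```python
import string

# Index in this string equals the digit value: a-z -> 0-25, 0-9 -> 26-35.
DIGITS = string.ascii_lowercase + string.digits


def digit_value(ch: str) -> int:
    """Return the base-36 value of a Punycode digit (case-insensitive)."""
    return DIGITS.index(ch.lower())


def split_encoded(encoded: str) -> tuple:
    """Split an encoded label (without xn--) at the last hyphen."""
    literal, _, digits = encoded.rpartition("-")
    return literal, digits


print(digit_value("a"), digit_value("z"), digit_value("0"), digit_value("9"))
# 0 25 26 35
print(split_encoded("mnchen-3ya"))  # ('mnchen', '3ya')
print(split_encoded("r8jz45g"))     # ('', 'r8jz45g')
```

The second split shows a label with no ASCII characters: there is no delimiter, so the whole string is encoded digits.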
How encoding works
Punycode is an implementation of Bootstring, a general algorithm for representing a sequence of Unicode code points using a smaller set of basic code points. Bootstring works by:
- Copying all ASCII characters from the input to the output as literal characters
- Inserting a hyphen as a delimiter after the literal characters (when ASCII characters are present)
- Encoding the positions and values of non-ASCII characters as a series of variable-length integers using base-36
- Applying a bias adaptation step that adjusts the digit thresholds after each encoded value (the base itself stays fixed at 36), producing shorter output for commonly occurring patterns
The algorithm is deterministic. The same input always produces the same ACE label, and decoding always recovers the exact original Unicode string.
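The steps above can be sketched as a compact encoder. This is an illustrative sketch that uses the Bootstring parameter values RFC 3492 assigns to Punycode (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128). It omits overflow checks and case handling, and the function names are chosen for this example.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta: int, numpoints: int, first: bool) -> int:
    """Bias adaptation: recompute the bias after each encoded delta."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def threshold(k: int, bias: int) -> int:
    """Digit threshold for position k, clamped to the range TMIN..TMAX."""
    return max(TMIN, min(TMAX, k - bias))


def punycode_encode(text: str) -> str:
    # Step 1: copy the ASCII (basic) code points as literals.
    out = [c for c in text if ord(c) < INITIAL_N]
    b = h = len(out)
    # Step 2: add the delimiter only when literals are present.
    if b:
        out.append("-")
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    # Step 3: insert the non-ASCII code points in increasing order of value.
    while h < len(text):
        m = min(ord(c) for c in text if ord(c) >= n)
        delta += (m - n) * (h + 1)
        n = m
        for c in text:
            cp = ord(c)
            if cp < n:
                delta += 1
            elif cp == n:
                # Emit delta as a generalized variable-length integer.
                q, k = delta, BASE
                while True:
                    t = threshold(k, bias)
                    if q < t:
                        break
                    out.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                out.append(DIGITS[q])
                # Step 4: adapt the bias for the next delta.
                bias = adapt(delta, h + 1, h == b)
                delta = 0
                h += 1
        delta += 1
        n += 1
    return "".join(out)


print(punycode_encode("münchen"))  # mnchen-3ya
print(punycode_encode("例え"))      # r8jz45g
```

As a sanity check, the output can be compared against Python's built-in punycode codec, and decoding with that codec should return the original string, which illustrates the round-trip property described above.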
IDNA protocol
Internationalized Domain Names in Applications (IDNA) is the protocol framework governing how applications process internationalized domain names. IDNA defines the rules for which Unicode characters are allowed in domain names and how to convert between Unicode labels (U-labels) and ACE labels (A-labels).
IDNA2003 and IDNA2008
Two versions of the IDNA protocol exist, and they differ in how certain characters are handled.
IDNA2003 applies Unicode
normalization and case folding before encoding.
The German Eszett (ß) is mapped to ss, and the
Greek final sigma (ς) is mapped to standard sigma
(σ). These mappings are irreversible: a domain
registered with ß resolves the same as one with
ss.
IDNA2008 treats characters
like ß and ς as distinct. A domain containing
ß encodes differently from one containing ss,
and the two resolve to different DNS records.
IDNA2008 also removes the dependency on a specific
Unicode version by basing character validity on
Unicode properties rather than static tables.
The characters affected by this difference are
called deviation characters. The domain
faß.example.re resolves differently depending on
which IDNA version the application uses:
| Version | Lookup label |
|---|---|
| IDNA2003 | fass.example.re |
| IDNA2008 | xn--fa-hia.example.re |
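The two rows of the table can be reproduced in Python. The built-in idna codec follows the IDNA2003 rules, so it maps ß to ss. The second helper approximates the IDNA2008 result by encoding the label with the raw punycode codec and no mapping; it performs none of the IDNA2008 validity checks, and both function names are hypothetical.

```python
def idna2003_lookup(domain: str) -> str:
    """IDNA2003 conversion via Python's built-in idna codec (maps ß to ss)."""
    return domain.encode("idna").decode("ascii")


def unmapped_ace(domain: str) -> str:
    """Encode non-ASCII labels as-is, approximating IDNA2008 for deviation characters."""
    labels = []
    for label in domain.split("."):
        if not label.isascii():
            # No mapping step: ß stays ß and is carried into the Punycode digits.
            label = "xn--" + label.encode("punycode").decode("ascii")
        labels.append(label)
    return ".".join(labels)


print(idna2003_lookup("faß.example.re"))  # fass.example.re
print(unmapped_ace("faß.example.re"))     # xn--fa-hia.example.re
```

The two results are different DNS names, which is why the same typed address can reach different servers depending on the IDNA version in use.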
UTS #46 (Unicode IDNA Compatibility Processing) is a compatibility layer maintained by the Unicode Consortium. Browsers and applications use UTS #46 to bridge IDNA2003 and IDNA2008 behavior. UTS #46 has deprecated transitional processing, and modern implementations follow IDNA2008 behavior for deviation characters.
Example
The Unicode domain münchen.example.re contains
the character ü (U+00FC). Punycode encodes only
the label containing the non-ASCII character.
Unicode (U-label)
münchen.example.re
Punycode (A-label)
xn--mnchen-3ya.example.re
A domain using Japanese characters:
Unicode
例え.example.re
Punycode
xn--r8jz45g.example.re
The ACE prefix xn-- signals that the label
contains Punycode-encoded content. Everything after
the prefix is the encoded representation.
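Going the other way, an A-label can be turned back into its U-label by stripping the prefix and decoding the remainder. The sketch below uses Python's built-in punycode codec; the function name to_unicode is hypothetical, and no IDNA validation is performed on the result.

```python
def to_unicode(domain: str) -> str:
    """Decode every xn-- label in a domain back to Unicode (no IDNA validation)."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith("xn--"):
            # Everything after the four-character prefix is the Punycode payload.
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


print(to_unicode("xn--mnchen-3ya.example.re"))  # münchen.example.re
print(to_unicode("xn--r8jz45g.example.re"))     # 例え.example.re
```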
Homograph attacks
Homograph attacks exploit the visual similarity between characters from different scripts, or between similar-looking characters within one script. An attacker registers a domain that looks identical to a legitimate domain but uses different Unicode code points.
Consider a legitimate domain www.example.re. An
attacker registers a visually similar domain using
the Polish ł (U+0142) in place of l (U+006C):
Legitimate domain
www.example.re
Homograph domain (with ł instead of l)
www.exampłe.re
The Punycode form reveals the difference:
www.xn--exampe-7db.re
At a glance, the two domains are nearly indistinguishable in many fonts. A visitor clicking a link to the homograph domain reaches the attacker's server instead.
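A program does not share the visitor's difficulty, because it compares code points rather than glyphs. The sketch below lists the positions where the two labels differ and prints the ACE form of the homograph label. The helper names are hypothetical.

```python
import unicodedata


def differing_positions(a: str, b: str) -> list:
    """Return (index, char_a, char_b) for each position where two labels differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def ace_label(label: str) -> str:
    """Raw Punycode form of a label, with the xn-- prefix when needed."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


legit, spoof = "example", "exampłe"
for index, x, y in differing_positions(legit, spoof):
    print(index, f"U+{ord(x):04X}", unicodedata.name(x))
    print(index, f"U+{ord(y):04X}", unicodedata.name(y))
print(ace_label(spoof))  # xn--exampe-7db
```

The loop reports a single difference at index 5: U+006C in the legitimate label and U+0142 in the homograph label.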
Browser protections
Modern browsers apply IDN display policies to counter homograph attacks. When a domain name triggers certain heuristic rules, the browser shows the Punycode form in the address bar instead of the Unicode rendering.
Common protection rules:
- Mixed-script detection: labels mixing Latin and Cyrillic characters, or Latin and Greek characters, display as Punycode
- Whole-script confusables: labels composed entirely of Latin look-alike characters from another script (like Cyrillic) display as Punycode
- TLD allowlists: some browsers maintain lists of TLDs with their own IDN policies and allow Unicode display for labels registered under those TLDs
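The mixed-script rule can be approximated in a few lines. This is a rough sketch only: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name stands in for its script, which is an assumption that holds for Latin, Cyrillic, and Greek letters but is not how browsers implement the check. The function names are hypothetical, and the whole-script and TLD rules are not covered.

```python
import unicodedata


def rough_scripts(label: str) -> set:
    """Approximate the set of scripts in a label from Unicode character names."""
    # Assumption: the first word of the name (LATIN, CYRILLIC, GREEK, ...)
    # is used as a stand-in for the script. Digits and hyphens are skipped.
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def mixes_latin_with_lookalike_script(label: str) -> bool:
    """True when Latin letters are mixed with Cyrillic or Greek letters."""
    found = rough_scripts(label)
    return "LATIN" in found and bool(found & {"CYRILLIC", "GREEK"})


print(mixes_latin_with_lookalike_script("paypal"))       # False
print(mixes_latin_with_lookalike_script("p\u0430ypal"))  # True (U+0430 is Cyrillic а)
print(mixes_latin_with_lookalike_script("пример"))       # False (single script)
```

A label flagged by a check like this would be shown in its xn-- form instead of its Unicode rendering.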
Homograph detection limits
Browser heuristics catch many homograph attacks but not all. Intra-script homographs (characters within the same script looking alike) are harder to detect automatically. Domain registrars also implement confusable-character policies at registration time to reduce this risk.
Takeaway
Punycode encodes Unicode domain name labels into
ASCII-compatible strings prefixed with xn--,
enabling DNS to handle internationalized domain
names. The IDNA protocol framework governs which
Unicode characters are valid in domain names, with
IDNA2008 treating characters like the German Eszett
as distinct rather than mapping them. Browsers apply
display policies to show Punycode in the address bar
when a domain name triggers homograph attack
heuristics.
See also
- RFC 3492: Punycode
- RFC 5891: IDNA Protocol
- RFC 5892: The Unicode Code Points and IDNA
- UTS #46: Unicode IDNA Compatibility Processing
- Percent-Encoding
- URL
- URI
- HTTP headers