Punycode

Punycode is an encoding that represents Unicode characters using the limited ASCII character set (letters, digits, and hyphens) permitted in Domain Name System (DNS) hostnames. Internationalized domain names (IDNs) containing non-ASCII characters are converted to ASCII-Compatible Encoding (ACE) labels prefixed with xn--, allowing DNS infrastructure to handle them without modification.

Usage

The DNS was designed for ASCII hostnames. Domain names containing characters from scripts like Arabic, Chinese, Cyrillic, Devanagari, or accented Latin letters cannot be processed directly by DNS resolvers. Punycode bridges this gap by encoding Unicode labels into a restricted ASCII character set supported natively by DNS.

The encoding applies exclusively to individual labels within a hostname. Characters in the path, query, or fragment of a URL use Percent-Encoding instead.

Hostname labels only

Each label (a part of the hostname separated by dots) is encoded independently, so a single hostname can mix plain ASCII labels with ACE labels. Nothing outside the hostname is affected: non-ASCII characters in the path and query components of a URL are percent-encoded as UTF-8 bytes.
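
The split between the two encodings can be sketched in Python. This is a minimal illustration, not a full URL serializer: the helper name to_ascii_url is hypothetical, and the built-in "idna" codec used for the hostname follows IDNA2003 rules.

```python
from urllib.parse import quote


def to_ascii_url(host: str, path: str) -> str:
    """Build an ASCII-only URL from a Unicode host and path (illustrative helper)."""
    # Hostname: each label is converted on its own. Python's built-in "idna"
    # codec (IDNA2003) adds the xn-- prefix only to labels that need it.
    ascii_host = host.encode("idna").decode("ascii")
    # Path: non-ASCII characters become percent-encoded UTF-8 bytes.
    ascii_path = quote(path)
    return f"https://{ascii_host}{ascii_path}"


print(to_ascii_url("münchen.example.re", "/straße"))
# https://xn--mnchen-3ya.example.re/stra%C3%9Fe
```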

ACE labels

An ASCII-Compatible Encoding (ACE) label is the Punycode-encoded form of a Unicode label, prefixed with xn--. The prefix signals to IDNA-aware software that the label contains encoded Unicode content. The DNS itself treats an ACE label as an ordinary ASCII label.

The xn-- prefix applies to each label independently. A domain like münchen.example.re encodes only the first label:

xn--mnchen-3ya.example.re

ACE labels follow the same constraints as standard DNS labels: a maximum of 63 characters per label (counting the xn-- prefix) and 253 characters for the full domain name.
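
A rough sketch of per-label conversion with the length limits applied, using Python's built-in "punycode" codec. The function names are illustrative assumptions, and the sketch deliberately skips the normalization and character validation that a real IDNA implementation performs before encoding.

```python
ACE_PREFIX = "xn--"
MAX_LABEL_LENGTH = 63   # per-label limit, prefix included
MAX_NAME_LENGTH = 253   # limit for the whole dotted name


def to_a_label(label: str) -> str:
    """Return the label unchanged if it is ASCII, else its xn-- form."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_ace_name(name: str) -> str:
    """Convert each label independently, then enforce the DNS length limits."""
    labels = [to_a_label(label) for label in name.split(".")]
    for label in labels:
        if not 1 <= len(label) <= MAX_LABEL_LENGTH:
            raise ValueError(f"label length out of range: {label!r}")
    ace_name = ".".join(labels)
    if len(ace_name) > MAX_NAME_LENGTH:
        raise ValueError("domain name exceeds 253 characters")
    return ace_name


print(to_ace_name("münchen.example.re"))  # xn--mnchen-3ya.example.re
```

The limits apply to the encoded form, so a Unicode label that looks short can still be rejected once it is converted.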

Character set

Punycode uses a base-36 character set:

  • Lowercase letters: a-z (values 0-25)
  • Digits: 0-9 (values 26-35)
  • Hyphen: - (delimiter only)

Letters are case-insensitive in Punycode. The last hyphen in an encoded label separates the literal ASCII portion from the encoded non-ASCII portion; any earlier hyphens belong to the literal portion.
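
The digit-to-character mapping above can be written as a pair of lookups. The function names are illustrative, not part of any library.

```python
# Digit values 0-25 map to a-z and 26-35 map to 0-9, as listed above.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def digit_to_char(value: int) -> str:
    """Return the character for a base-36 digit value (0-35)."""
    return ALPHABET[value]


def char_to_digit(char: str) -> int:
    """Return the digit value for a character, ignoring letter case."""
    return ALPHABET.index(char.lower())


print(digit_to_char(0), digit_to_char(25), digit_to_char(26), digit_to_char(35))
# a z 0 9
print(char_to_digit("Y"), char_to_digit("3"))
# 24 29
```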

How encoding works

Punycode is an implementation of Bootstring, a general algorithm for representing a sequence of Unicode code points using a smaller set of basic code points. Bootstring works by:

  1. Copying all ASCII characters from the input to the output as literal characters
  2. Inserting a hyphen as a delimiter after the literal characters (when ASCII characters are present)
  3. Encoding the positions and values of non-ASCII characters as a series of deltas, each written as a variable-length integer in base-36
  4. Applying a bias adaptation algorithm after each delta to adjust the per-digit thresholds (the base itself stays at 36), keeping the output short when successive deltas are of similar size

The algorithm is deterministic. The same input always produces the same ACE label, and decoding always recovers the exact original Unicode string.
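
The steps above can be sketched as a compact encoder. This is an illustrative sketch, not a reference implementation: the parameter values (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128) are the ones RFC 3492 specifies for Punycode, while the function and variable names are chosen here for readability. It omits the overflow checks the RFC requires in fixed-width languages, since Python integers do not overflow.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta: int, num_points: int, first_time: bool) -> int:
    """Bias adaptation: recompute the bias after each encoded delta."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def threshold(k: int, bias: int) -> int:
    """Clamp k - bias into the range [TMIN, TMAX]."""
    return min(max(k - bias, TMIN), TMAX)


def punycode_encode(text: str) -> str:
    code_points = [ord(ch) for ch in text]
    basic = [cp for cp in code_points if cp < 0x80]
    # Steps 1 and 2: copy ASCII characters, then add the delimiter if any exist.
    output = [chr(cp) for cp in basic]
    if basic:
        output.append("-")
    handled = len(basic)
    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while handled < len(code_points):
        # Next smallest code point that has not been encoded yet.
        m = min(cp for cp in code_points if cp >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for cp in code_points:
            if cp < n:
                delta += 1
            elif cp == n:
                # Step 3: write delta as a variable-length base-36 integer.
                q, k = delta, BASE
                while True:
                    t = threshold(k, bias)
                    if q < t:
                        break
                    output.append(ALPHABET[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(ALPHABET[q])
                # Step 4: adapt the bias for the next delta.
                bias = adapt(delta, handled + 1, handled == len(basic))
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)


print(punycode_encode("münchen"))  # mnchen-3ya
print(punycode_encode("例え"))      # r8jz45g
```

The result is the bare Punycode string; the xn-- prefix is added separately when forming an ACE label.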

IDNA protocol

Internationalized Domain Names in Applications (IDNA) is the protocol framework governing how applications process internationalized domain names. IDNA defines the rules for which Unicode characters are allowed in domain names and how to convert between Unicode labels (U-labels) and ACE labels (A-labels).

IDNA2003 and IDNA2008

Two versions of the IDNA protocol exist, and they differ in how certain characters are handled.

IDNA2003 applies Unicode normalization and case folding before encoding. The German Eszett (ß) is mapped to ss, and the Greek final sigma (ς) is mapped to standard sigma (σ). These mappings are irreversible: a domain registered with ß resolves the same as one with ss.

IDNA2008 treats characters like ß and ς as distinct. A domain containing ß encodes differently from one containing ss, and the two resolve to different DNS records. IDNA2008 also removes the dependency on a specific Unicode version by basing character validity on Unicode properties rather than static tables.

The characters affected by this difference are called deviation characters. The domain faß.example.re resolves differently depending on which IDNA version the application uses:

Version    Lookup label
IDNA2003   fass.example.re
IDNA2008   xn--fa-hia.example.re
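
Both lookup labels can be reproduced with Python's standard library. This is a sketch under stated assumptions: the built-in "idna" codec implements IDNA2003, and the IDNA2008-style result is approximated here by Punycode-encoding the label directly, without the code point validation that a real IDNA2008 implementation performs.

```python
name = "faß.example.re"

# IDNA2003: the built-in "idna" codec applies Nameprep, which maps ß to ss,
# so the label ends up as plain ASCII with no xn-- prefix.
idna2003 = name.encode("idna").decode("ascii")

# IDNA2008-style (simplified): keep ß and Punycode-encode the first label as-is.
first_label, rest = name.split(".", 1)
idna2008 = "xn--" + first_label.encode("punycode").decode("ascii") + "." + rest

print(idna2003)  # fass.example.re
print(idna2008)  # xn--fa-hia.example.re
```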

UTS #46 (Unicode IDNA Compatibility Processing) is a compatibility layer maintained by the Unicode Consortium. Browsers and applications use UTS #46 to bridge IDNA2003 and IDNA2008 behavior. UTS #46 has deprecated transitional processing, and modern implementations follow IDNA2008 behavior for deviation characters.

Example

The Unicode domain münchen.example.re contains the character ü (U+00FC). Punycode encodes only the label containing the non-ASCII character.

Unicode (U-label)

münchen.example.re

Punycode (A-label)

xn--mnchen-3ya.example.re

A domain using Japanese characters:

Unicode

例え.example.re

Punycode

xn--r8jz45g.example.re

The ACE prefix xn-- signals that the label contains Punycode-encoded content. Everything after the prefix is the encoded representation.
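
Decoding reverses the process: strip the prefix, then decode the remainder. A minimal sketch using Python's built-in "punycode" codec follows; the helper name to_u_label is illustrative.

```python
ACE_PREFIX = "xn--"


def to_u_label(label: str) -> str:
    """Decode an A-label to its Unicode form; other labels pass through."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


for ace_name in ("xn--mnchen-3ya.example.re", "xn--r8jz45g.example.re"):
    print(".".join(to_u_label(label) for label in ace_name.split(".")))
# münchen.example.re
# 例え.example.re
```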

Homograph attacks

Homograph attacks exploit the visual similarity between different characters, often drawn from different scripts. An attacker registers a domain that looks identical to a legitimate one but is built from different Unicode code points.

Consider a legitimate domain www.example.re. An attacker registers a visually similar domain using the Polish ł (U+0142) in place of l (U+006C):

Legitimate domain

www.example.re

Homograph domain (with ł instead of l)

www.exampłe.re

The Punycode form reveals the difference:

www.xn--exampe-7db.re

At a glance, the two domains are nearly indistinguishable in many fonts. A visitor clicking a link to the homograph domain reaches the attacker's server instead.
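
A short sketch of how the difference can be surfaced programmatically: compare the two labels code point by code point, then print the ACE form. The helper name differing_code_points is hypothetical, and the comparison assumes both labels have the same length.

```python
import unicodedata


def differing_code_points(a: str, b: str) -> list[tuple[int, str, str]]:
    """List (index, code point in a, code point in b) where two labels differ."""
    return [
        (i, f"U+{ord(x):04X}", f"U+{ord(y):04X}")
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


legitimate = "example"
homograph = "examp\u0142e"  # \u0142 is ł, written as an escape to make it visible

print(differing_code_points(legitimate, homograph))
# [(5, 'U+006C', 'U+0142')]
print(unicodedata.name("\u0142"))
# LATIN SMALL LETTER L WITH STROKE
print("xn--" + homograph.encode("punycode").decode("ascii"))
# xn--exampe-7db
```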

Browser protections

Modern browsers apply IDN display policies to counter homograph attacks. When a domain name triggers certain heuristic rules, the browser shows the Punycode form in the address bar instead of the Unicode rendering.

Common protection rules:

  • Mixed-script detection: labels mixing Latin and Cyrillic characters, or Latin and Greek characters, display as Punycode
  • Whole-script confusables: labels written in a single non-Latin script but composed entirely of characters that look like Latin letters (as many Cyrillic letters do) display as Punycode
  • TLD allowlists: some browsers maintain lists of TLDs with their own IDN policies and allow Unicode display for labels registered under those TLDs
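
The mixed-script rule can be approximated in a few lines. This is a crude sketch, not how any browser implements it: Python's standard library does not expose the Unicode Script property, so the first word of each character's Unicode name stands in for its script. Real policies also permit certain combinations (for example, Japanese labels mixing kanji and kana) that this sketch would flag.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Approximate the scripts in a label via the first word of each letter's name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():  # skip digits and hyphens, which are shared across scripts
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters come from more than one (approximated) script."""
    return len(scripts_in(label)) > 1


spoof = "ex\u0430mple"  # \u0430 is CYRILLIC SMALL LETTER A, not Latin a
print(sorted(scripts_in(spoof)))      # ['CYRILLIC', 'LATIN']
print(looks_mixed_script(spoof))      # True
print(looks_mixed_script("example"))  # False
```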

Homograph detection limits

Browser heuristics catch many homograph attacks but not all. Intra-script homographs (characters within the same script looking alike) are harder to detect automatically. Domain registrars also implement confusable-character policies at registration time to reduce this risk.

Takeaway

Punycode encodes Unicode domain name labels into ASCII-compatible strings prefixed with xn--, enabling DNS to handle internationalized domain names. The IDNA protocol framework governs which Unicode characters are valid in domain names, with IDNA2008 treating characters like the German Eszett as distinct rather than mapping them. Browsers apply display policies to show Punycode in the address bar when a domain name triggers homograph attack heuristics.

Last updated: March 6, 2026