Punycode

Punycode is an encoding that represents Unicode characters using the limited ASCII character set allowed in Domain Name System (DNS) hostnames: letters, digits, and the hyphen. It is specified in RFC 3492. Internationalized domain name (IDN) labels containing non-ASCII characters are converted to ASCII-Compatible Encoding (ACE) labels prefixed with xn--, allowing existing DNS infrastructure to handle them without modification.

Usage

The DNS was designed for ASCII hostnames. Domain names containing characters from scripts like Arabic, Chinese, Cyrillic, Devanagari, or accented Latin letters cannot be processed directly by DNS resolvers. Punycode bridges this gap by encoding Unicode labels into a restricted ASCII character set supported natively by DNS.

The encoding applies exclusively to individual labels within a hostname. Characters in the path, query, or fragment of a URL use Percent-Encoding instead.

Hostname labels only

Punycode encodes individual hostname labels (the parts separated by dots). Each label is encoded independently. The path and query components of a URL use Percent-Encoding for non-ASCII characters.
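
As a rough illustration of this split, the sketch below converts a URL to an ASCII-only form with Python's standard library. The function name to_ascii_url is hypothetical, the built-in idna codec follows the older IDNA2003 rules, and userinfo and port are ignored for brevity, so treat it as a sketch rather than a complete URL serializer.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Punycode the hostname and percent-encode the rest (illustrative only)."""
    parts = urlsplit(url)
    # Hostname: each non-ASCII label becomes an xn-- label.
    host = parts.hostname.encode("idna").decode("ascii")
    # Path, query, fragment: non-ASCII characters become UTF-8 %XX escapes.
    path = quote(parts.path, safe="/")
    query = quote(parts.query, safe="=&")
    fragment = quote(parts.fragment)
    return urlunsplit((parts.scheme, host, path, query, fragment))


print(to_ascii_url("https://münchen.example.re/straße?q=käse"))
# https://xn--mnchen-3ya.example.re/stra%C3%9Fe?q=k%C3%A4se
```

The same character (here ß or ä) ends up in a different ASCII form depending on which URL component it appears in.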

ACE labels

An ASCII-Compatible Encoding (ACE) label is the Punycode-encoded form of a Unicode label, prefixed with xn--. The prefix signals to IDNA-aware software that the label contains encoded Unicode content; DNS servers themselves treat it as an ordinary ASCII label.

The xn-- prefix applies to each label independently. A domain like münchen.example.re encodes only the first label:

xn--mnchen-3ya.example.re

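ACE labels follow the same constraints as standard DNS labels: a maximum of 63 characters per label and 253 characters for the full domain name. Both limits apply to the encoded form, so the xn-- prefix and the Punycode digits count toward the 63 characters.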
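
A minimal sketch of per-label encoding, assuming Python's built-in punycode codec. The helper name to_ace_domain is hypothetical, and the sketch skips the IDNA mapping and validation steps a real client performs before encoding.

```python
def to_ace_domain(domain: str) -> str:
    """Encode each non-ASCII label on its own and add the xn-- prefix."""
    labels = []
    for label in domain.split("."):
        if not label.isascii():
            # Only labels with non-ASCII characters are encoded and prefixed.
            label = "xn--" + label.encode("punycode").decode("ascii")
        if len(label) > 63:
            raise ValueError(f"label longer than 63 characters: {label!r}")
        labels.append(label)
    ace = ".".join(labels)
    if len(ace) > 253:
        raise ValueError("domain name longer than 253 characters")
    return ace


print(to_ace_domain("münchen.example.re"))  # xn--mnchen-3ya.example.re
```

The ASCII-only labels example and re pass through unchanged; the length checks run against the encoded form.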

Character set

Punycode output is written with 36 digit characters (base 36), plus the hyphen:

  • Lowercase letters: a-z (values 0-25)
  • Digits: 0-9 (values 26-35)
  • Hyphen: - (not a digit; used as the delimiter and as a literal ASCII character)

Letters used as digits are case-insensitive in Punycode. The last hyphen in an encoded label separates the literal ASCII portion from the encoded non-ASCII portion.
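
The value assignments above can be sketched as two small lookup helpers, plus a split at the last hyphen. The function names are hypothetical and chosen only for illustration.

```python
def digit_to_char(value: int) -> str:
    """Map a digit value 0-35 to its Punycode character."""
    if not 0 <= value < 36:
        raise ValueError("digit value must be in 0..35")
    if value < 26:
        return chr(ord("a") + value)        # 0-25  -> a-z
    return chr(ord("0") + value - 26)       # 26-35 -> 0-9


def char_to_digit(ch: str) -> int:
    """Map a Punycode character to its digit value, ignoring letter case."""
    ch = ch.lower()
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"not a Punycode digit: {ch!r}")


def split_encoded(encoded: str) -> tuple[str, str]:
    """Split at the last hyphen into (literal ASCII part, encoded digits)."""
    literal, _, digits = encoded.rpartition("-")
    return literal, digits


print(split_encoded("mnchen-3ya"))  # ('mnchen', '3ya')
print(split_encoded("r8jz45g"))     # ('', 'r8jz45g')
```

A label with no ASCII characters has no delimiter, so its literal part is empty.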

How encoding works

Punycode is an implementation of Bootstring, a general algorithm for representing a sequence of Unicode code points using a smaller set of basic code points. Bootstring works by:

  1. Copying all ASCII characters from the input to the output as literal characters
  2. Inserting a hyphen as a delimiter after the literal characters (when ASCII characters are present)
  3. Encoding the positions and values of non-ASCII characters as a series of variable-length integers using base-36
  4. Applying a bias adaptation algorithm that adjusts the digit thresholds after each encoded value (the base itself stays 36), keeping the output short when successive code points are close together

The algorithm is deterministic. The same input always produces the same ACE label, and decoding always recovers the exact original Unicode string.
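
The four steps can be sketched in Python as follows. This is an illustrative encoder written from the parameters RFC 3492 gives for Punycode (base 36, tmin 1, tmax 26, skew 38, damp 700, initial bias 72, initial n 128); the function names are hypothetical, and input validation is omitted.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta: int, num_points: int, first_time: bool) -> int:
    """Bias adaptation: rescale delta and derive the next bias."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def punycode_encode(label: str) -> str:
    code_points = [ord(ch) for ch in label]
    # Steps 1 and 2: copy the ASCII characters, then add the delimiter.
    output = [chr(cp) for cp in code_points if cp < INITIAL_N]
    basic_count = handled = len(output)
    if basic_count:
        output.append("-")

    n, delta, bias = INITIAL_N, 0, INITIAL_BIAS
    while handled < len(code_points):
        # Smallest code point that has not been encoded yet.
        m = min(cp for cp in code_points if cp >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for cp in code_points:
            if cp < n:
                delta += 1
            elif cp == n:
                # Step 3: write delta as a variable-length base-36 integer.
                q, k = delta, BASE
                while True:
                    t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                # Step 4: adapt the bias for the next delta.
                bias = adapt(delta, handled + 1, handled == basic_count)
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)


print(punycode_encode("münchen"))  # mnchen-3ya
```

Each delta encodes both the code point value and its insertion position, which is why a single integer per non-ASCII character is enough for the decoder to rebuild the original string.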

IDNA protocol

Internationalized Domain Names in Applications (IDNA) is the protocol framework governing how applications process internationalized domain names. IDNA defines the rules for which Unicode characters are allowed in domain names and how to convert between Unicode labels (U-labels) and ACE labels (A-labels).

IDNA2003 and IDNA2008

Two versions of the IDNA protocol exist, and they differ in how certain characters are handled.

IDNA2003 applies Unicode normalization and case folding before encoding. The German Eszett (ß) is mapped to ss, and the Greek final sigma (ς) is mapped to standard sigma (σ). These mappings are irreversible: a domain registered with ß resolves the same as one with ss.

IDNA2008 treats characters like ß and ς as distinct. A domain containing ß encodes differently from one containing ss, and the two resolve to different DNS records. IDNA2008 also removes the dependency on a specific Unicode version by basing character validity on Unicode properties rather than static tables.

The characters affected by this difference are called deviation characters. The domain faß.example.re resolves differently depending on which IDNA version the application uses:

Version     Lookup label
IDNA2003    fass.example.re
IDNA2008    xn--fa-hia.example.re
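
The two lookup labels can be reproduced with Python's standard library, with one caveat: the built-in idna codec follows IDNA2003, so the IDNA2008 result below is only approximated by Punycode-encoding the label without any mapping. A real IDNA2008 implementation also validates the label.

```python
label = "faß"

# Python's built-in "idna" codec follows IDNA2003: ß is mapped to ss,
# so the result is plain ASCII and needs no xn-- prefix.
idna2003 = label.encode("idna").decode("ascii")

# Approximation of the IDNA2008 result: ß is kept and the label is
# Punycode-encoded as-is (no validation is performed here).
idna2008 = "xn--" + label.encode("punycode").decode("ascii")

print(idna2003)  # fass
print(idna2008)  # xn--fa-hia
```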

UTS #46 (Unicode IDNA Compatibility Processing) is a compatibility layer maintained by the Unicode Consortium. Browsers and applications use UTS #46 to bridge IDNA2003 and IDNA2008 behavior. UTS #46 has deprecated transitional processing, and modern implementations follow IDNA2008 behavior for deviation characters.

Example

The Unicode domain münchen.example.re contains the character ü (U+00FC). Punycode encodes only the label containing the non-ASCII character.

Unicode (U-label)

münchen.example.re

Punycode (A-label)

xn--mnchen-3ya.example.re

A domain using Japanese characters:

Unicode

例え.example.re

Punycode

xn--r8jz45g.example.re

The ACE prefix xn-- signals the label contains Punycode-encoded content. Everything after the prefix is the encoded representation.
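
Going the other way, a sketch that detects the prefix and decodes only the labels carrying it, assuming Python's built-in punycode codec. The helper name to_unicode_domain is hypothetical, and no IDNA validation is applied to the decoded result.

```python
ACE_PREFIX = "xn--"


def to_unicode_domain(domain: str) -> str:
    """Decode every xn-- label; leave ordinary ASCII labels untouched."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith(ACE_PREFIX):
            # Everything after the prefix is the Punycode payload.
            payload = label[len(ACE_PREFIX):]
            label = payload.encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


print(to_unicode_domain("xn--mnchen-3ya.example.re"))  # münchen.example.re
print(to_unicode_domain("xn--r8jz45g.example.re"))     # 例え.example.re
```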

Homograph attacks

Homograph attacks exploit the visual similarity between characters from different scripts. An attacker registers a domain that looks identical to a legitimate one but is built from different Unicode code points.

Consider a legitimate domain www.example.re. An attacker registers a visually similar domain using the Polish ł (U+0142) in place of l (U+006C):

Legitimate domain

www.example.re

Homograph domain (with ł instead of l)

www.exampłe.re

The Punycode form reveals the difference:

www.xn--exampe-7db.re

At a glance, the two domains are nearly indistinguishable in many fonts. A visitor clicking a link to the homograph domain reaches the attacker's server instead.
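
A short sketch that makes the difference visible by comparing code points rather than glyphs, using Python's unicodedata module. The helper name differing_code_points is hypothetical, and it only compares labels of equal length.

```python
import unicodedata


def differing_code_points(a: str, b: str) -> list[tuple[int, str, str]]:
    """List positions where two equal-length labels use different code points."""
    return [
        (i,
         f"U+{ord(x):04X} {unicodedata.name(x)}",
         f"U+{ord(y):04X} {unicodedata.name(y)}")
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


legitimate = "example"
homograph = "examp\u0142e"  # ł written as an escape to make it unambiguous

print(differing_code_points(legitimate, homograph))
# [(5, 'U+006C LATIN SMALL LETTER L', 'U+0142 LATIN SMALL LETTER L WITH STROKE')]
print(homograph.encode("punycode").decode("ascii"))  # exampe-7db
```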

Browser protections

Modern browsers apply IDN display policies to counter homograph attacks. When a domain name triggers certain heuristic rules, the browser shows the Punycode form in the address bar instead of the Unicode rendering.

Common protection rules:

  • Mixed-script detection: labels mixing Latin and Cyrillic characters, or Latin and Greek characters, display as Punycode
  • Whole-script confusables: labels written entirely in another script whose characters look like Latin letters (for example, an all-Cyrillic label that reads as a Latin word) can display as Punycode
  • TLD allowlists: some browsers maintain lists of TLDs with their own IDN policies and allow Unicode display for labels registered under those TLDs
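
The mixed-script rule can be approximated in a few lines. The sketch below is not how any particular browser implements its policy: Python's standard library does not expose the Unicode Script property, so the script is guessed from the first word of each character name, and the function names are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Rough script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def display_form(label: str) -> str:
    """Return the Punycode form for Latin+Cyrillic or Latin+Greek mixes."""
    scripts = scripts_in(label)
    mixed = "LATIN" in scripts and bool(scripts & {"CYRILLIC", "GREEK"})
    if mixed:
        return "xn--" + label.encode("punycode").decode("ascii")
    return label


print(scripts_in("ex\u0430mple"))    # Latin letters plus Cyrillic а
print(display_form("ex\u0430mple"))  # shown as an xn-- label
print(display_form("münchen"))       # single script, shown as Unicode
```

A label that is entirely Cyrillic passes this check, which is why browsers pair it with a separate whole-script rule.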

Homograph detection limits

Browser heuristics catch many homograph attacks but not all. Intra-script homographs (characters within the same script looking alike) are harder to detect automatically. Domain registrars also implement confusable-character policies at registration time to reduce this risk.

Security risks

Punycode itself is a deterministic encoding without inherent vulnerabilities, but the ability to register visually deceptive domain names creates attack surface.

Homograph phishing is the primary risk. Attackers register domains using characters from non-Latin scripts visually similar to Latin letters. A domain using Cyrillic а (U+0430) in place of Latin a (U+0061) looks identical in most fonts but resolves to a different server. Browser protections catch many cases through mixed-script detection, but intra-script homographs remain difficult to detect automatically.

Domain spoofing at scale is possible because IDNA allows thousands of Unicode characters in domain labels. Registrars implement confusable-character policies to block obvious look-alikes, but coverage varies across TLDs and registries.
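
A registration-time confusable check might look roughly like the sketch below. The look-alike table is hypothetical and deliberately tiny (a handful of Cyrillic letters); real policies rely on much larger data sets such as the confusables data published with UTS #39, and the function names are illustrative only.

```python
# Hypothetical, deliberately tiny look-alike table (Cyrillic -> Latin).
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def skeleton(label: str) -> str:
    """Replace each known look-alike with the Latin letter it resembles."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in label.lower())


def conflicts_with(candidate: str, registered: set[str]) -> set[str]:
    """Return registered labels that share a skeleton with the candidate."""
    target = skeleton(candidate)
    return {
        name for name in registered
        if name != candidate and skeleton(name) == target
    }


registered = {"example", "other"}
print(conflicts_with("ex\u0430mple", registered))  # {'example'}
```

A registry applying such a check would reject or flag the candidate whenever the returned set is non-empty.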

Certificate issuance for homograph domains adds to the risk. An attacker obtaining a valid TLS certificate for a Punycode domain gains a padlock icon in the browser, increasing the appearance of legitimacy.

Alternatives to Punycode

No direct replacement for Punycode exists in DNS. Hostname rules restrict labels to ASCII letters, digits, and hyphens, so some ASCII encoding is necessary for internationalized names.

Approaches that avoid Punycode altogether include using ASCII-only domain names and placing internationalized text in the URL path or query components, where Percent-Encoding handles non-ASCII characters directly. Some services use QR codes or app-based links to bypass domain display entirely.

Within the IDNA framework, UTS #46 (Unicode IDNA Compatibility Processing) changes how labels are mapped and validated before encoding, but it still relies on Punycode as the underlying encoding for DNS transport.

Last updated: April 4, 2026