Punycode

Punycode is an encoding that represents Unicode strings using a limited subset of the ASCII character set: letters, digits, and hyphens. It is used to encode internationalized hostnames in URIs.

Usage

Punycode is used to convert internationalized hostnames that contain Unicode characters to ASCII and vice versa. ASCII is based on the unaccented Latin (English) alphabet and cannot represent characters outside it, whereas Unicode includes characters from virtually all writing systems.

Note

Punycode applies only to the hostname in a URL. In the rest of the URL, percent-encoding is used to represent non-ASCII characters.
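The split between the two encodings can be sketched with Python's standard library. This is a minimal illustration, not a complete URL normalizer: the helper name to_ascii_url is hypothetical, the built-in "idna" codec follows the older IDNA 2003 rules, and any port or user information in the URL is dropped for brevity.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url: str) -> str:
    """Return an ASCII-only form of a URL (hypothetical helper, simplified)."""
    parts = urlsplit(url)
    # Hostname: each non-ASCII label becomes "xn--" + Punycode.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    # Path: non-ASCII characters become UTF-8 percent-escapes; "/" is kept.
    path = quote(parts.path)
    # Assumption: port and userinfo are ignored in this sketch.
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(to_ascii_url("https://éxample.ai/café"))
# https://xn--xample-9ua.ai/caf%C3%A9
```

The é in the hostname ends up inside an xn-- label, while the é in the path becomes %C3%A9.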

Definition

An Internationalized Domain Name (IDN) contains characters outside the ASCII range, such as accented letters or letters from non-Latin scripts. These characters are not supported by common internet protocols such as the Domain Name System (DNS), so they need to be encoded in a form that can be universally processed.

Bootstring

Punycode is an instance of Bootstring, a general algorithm that represents an arbitrary sequence of code points as a sequence of basic code points. Punycode is Bootstring with its parameters chosen for internationalized domain names.
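As a rough illustration of the Bootstring layout, Python's built-in "punycode" codec produces the raw encoding without any hostname prefix. In the output, the basic (ASCII) code points are copied first, followed by a hyphen delimiter and then the encoded positions and values of the non-ASCII code points. The variable names below are only for illustration.

```python
# Raw Punycode (Bootstring) output, with no "xn--" prefix added.
raw = "éxample".encode("punycode")
print(raw)  # b'xample-9ua'

# The last hyphen separates the copied basic code points from the
# encoded non-basic ones.
basic, _, encoded = raw.rpartition(b"-")
print(basic)    # b'xample'
print(encoded)  # b'9ua'

# Decoding restores the original code point sequence.
print(raw.decode("punycode"))  # éxample
```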

Character set

Punycode-encoded hostname labels are made up only of the following ASCII characters, which Bootstring refers to as basic code points:

  • Lowercase letters: a-z
  • Digits: 0-9
  • A single special character, the hyphen: -

Note

Punycode-encoded labels in hostnames start with xn--. The prefix is applied to each label of a hostname separately, so it can appear in a subdomain, in the domain label, and in the top-level domain.
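The per-label behavior can be sketched as follows. The functions encode_label and encode_hostname are hypothetical helpers written for this illustration, and they deliberately skip the normalization and validation steps that a real IDNA implementation performs before encoding.

```python
def encode_label(label: str) -> str:
    """Encode one hostname label; ASCII-only labels are left unchanged."""
    if label.isascii():
        return label
    # Assumption: the label is already lowercase and normalized.
    return "xn--" + label.encode("punycode").decode("ascii")


def encode_hostname(hostname: str) -> str:
    """Encode each dot-separated label independently."""
    return ".".join(encode_label(label) for label in hostname.split("."))


print(encode_hostname("bücher.éxample.ai"))
# xn--bcher-kva.xn--xample-9ua.ai
```

Only the labels that contain non-ASCII characters receive the xn-- prefix; the ai label is passed through untouched.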

Example

In the following example, notice that the Unicode version of the domain name appears to be example.ai, but the first letter, e, has an accent, making it é.

Unicode

éxample.ai

Punycode

xn--xample-9ua.ai
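The pair above can be reproduced with Python's built-in "idna" codec, which applies Punycode to each label and adds the xn-- prefix. This is only a quick check; the codec implements the older IDNA 2003 rules, so other tools may treat some characters differently.

```python
unicode_name = "éxample.ai"

# Unicode hostname -> ASCII form
ascii_name = unicode_name.encode("idna").decode("ascii")
print(ascii_name)  # xn--xample-9ua.ai

# ASCII form -> Unicode hostname
round_trip = ascii_name.encode("ascii").decode("idna")
print(round_trip)  # éxample.ai
```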

Security concerns

Punycode is an effective way to re-encode Unicode characters so that IDNs can be more easily typed or searched. However, IDNs are sometimes abused to carry out homograph attacks. A homograph attack is one where the user clicks on a link that appears legitimate but leads to a malicious site.

Consider a user who wants to click on a link for www.example.ai.

The link that they click on looks similar: www.exampłe.ai

It can be difficult at a glance to see that the l in “example” is actually ł, which is Unicode character U+0142. When encoded using Punycode, the hostname is actually www.xn--exampe-7db.ai instead of www.example.ai.

When the user clicks on the link, they are sent to the malicious site and the attack continues.
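One way to make such a substitution visible is to decode every xn-- label and list the non-ASCII characters it contains. The sketch below uses a hypothetical helper, non_ascii_chars, and is meant only as an illustration; it is not a complete defense against homograph attacks.

```python
import unicodedata


def non_ascii_chars(hostname: str) -> list:
    """Return (character, Unicode name) pairs found inside xn-- labels."""
    found = []
    for label in hostname.lower().split("."):
        if label.startswith("xn--"):
            # Strip the prefix and decode the raw Punycode payload.
            decoded = label[4:].encode("ascii").decode("punycode")
            for ch in decoded:
                if ord(ch) > 127:
                    found.append((ch, unicodedata.name(ch, "UNKNOWN")))
    return found


print(non_ascii_chars("www.xn--exampe-7db.ai"))
# [('ł', 'LATIN SMALL LETTER L WITH STROKE')]
```

A hostname with no xn-- labels, such as www.example.ai, yields an empty list.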

Takeaway

Punycode is a method of encoding non-ASCII characters in ASCII, making them compatible with internet services and protocols such as DNS. It is useful for processing internationalized domain names but can be a point of concern because it can facilitate homograph attacks in some circumstances.

Last updated: August 2, 2023