Punycode is an encoding that represents Unicode strings using a limited subset of the ASCII character set: lowercase letters, digits, and hyphens. It is used to encode the hostnames of URIs.
Punycode is used to convert internationalized hostnames that contain Unicode characters to ASCII and vice versa. ASCII is based on the Latin (English) alphabet and does not support characters outside that alphabet, whereas Unicode includes characters from the writing systems of virtually all languages.
An Internationalized Domain Name (IDN) contains special characters or letters that are not in the Latin alphabet. These characters are not supported by common internet protocols such as the Domain Name System (DNS), so they need to be encoded in a form that can be processed universally.
Punycode is an instance of a more general algorithm called Bootstring, which represents an arbitrary sequence of code points as a sequence of basic code points.
Punycode operates on what is referred to as the base character set, which is:
- Lowercase letters: a-z
- Digits: 0-9
- A single special character, the hyphen: -
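As a minimal sketch of this restriction, the following Python snippet encodes a single label with the standard library's `punycode` codec and checks that the result uses only the characters listed above. The helper name `punycode_label` and the sample label are illustrative assumptions, not part of any specification.

```python
import string

# The base character set listed above: a-z, 0-9 and the hyphen.
BASIC = set(string.ascii_lowercase + string.digits + "-")


def punycode_label(label: str) -> str:
    """Encode one label with Python's built-in punycode codec.

    Hypothetical helper for illustration; it returns the raw Punycode
    form of the label without any hostname-specific handling.
    """
    return label.encode("punycode").decode("ascii")


encoded = punycode_label("exampłe")
print(encoded)                # exampe-7db
print(set(encoded) <= BASIC)  # True: only base characters remain
```

The ASCII letters of the label are copied first, and the position and value of the non-ASCII character are packed into the characters after the final hyphen.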
Punycode strings in hostnames start with xn--. The prefix can apply to each element of a hostname, for example the subdomain, the domain label, and the top-level domain.
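The per-label behavior can be sketched as follows. This is a simplified illustration, assuming that labels are separated by dots and that no normalization is needed; real IDNA processing also applies mapping and validation rules that are omitted here. The function name `to_ascii_hostname` is hypothetical.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Encode each non-ASCII label of a hostname and prefix it with xn--.

    Simplified sketch: labels that are already ASCII are left untouched,
    and no IDNA normalization or validation is performed.
    """
    labels = []
    for label in hostname.split("."):
        if label.isascii():
            # Plain ASCII labels need no encoding and get no prefix.
            labels.append(label)
        else:
            encoded = label.encode("punycode").decode("ascii")
            labels.append("xn--" + encoded)
    return ".".join(labels)


print(to_ascii_hostname("www.exampłe.ai"))  # www.xn--exampe-7db.ai
```

Because each label is handled independently, a hostname can mix encoded and unencoded labels, as in the output above.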
For example, the Unicode version of a domain name can appear to be example.ai even though the first letter, e, has an accent, making it a different domain name with its own Punycode form.
Punycode is an effective way to re-encode Unicode characters so that IDNs can be more easily typed or searched. However, it is sometimes exploited in homograph cyber-attacks. A homograph attack is one where the user clicks on a link that appears legitimate, yet leads to a malicious site.
Consider a user who wants to click on a link for www.example.ai. The link that they actually click on looks similar: www.exampłe.ai. It can be difficult at a glance to see that the l in “example” is actually ł, which is Unicode character U+0142. When translated using Punycode, the hostname is actually www.xn--exampe-7db.ai instead of www.example.ai.
When the user clicks on the link, they are sent to the malicious site and the attack continues.
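One way to make such a lookalike visible is to decode any xn-- labels and list the non-ASCII characters they contain. The sketch below uses only the Python standard library; the function names `decode_hostname` and `non_ascii_characters` are hypothetical, and the check is a simple illustration rather than a complete homograph defense.

```python
import unicodedata


def decode_hostname(hostname: str) -> str:
    """Decode every xn-- label of a hostname back to Unicode."""
    labels = []
    for label in hostname.split("."):
        if label.lower().startswith("xn--"):
            # Strip the prefix and decode the remaining Punycode.
            labels.append(label[4:].encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)


def non_ascii_characters(hostname: str):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    decoded = decode_hostname(hostname)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in decoded
        if not ch.isascii()
    ]


print(decode_hostname("www.xn--exampe-7db.ai"))
for entry in non_ascii_characters("www.xn--exampe-7db.ai"):
    print(entry)
```

For the hostname in this example, the only character reported is ł (U+0142), which is what distinguishes the malicious link from the legitimate one.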
Punycode is a method of encoding non-Latin alphabet characters in ASCII, making them compatible with internet services and protocols such as DNS. It is useful for processing internationalized domain names but can be a point of concern because it facilitates homograph cyber-attacks in some circumstances.