Punycode
Punycode is an encoding that represents Unicode strings using only a limited subset of the ASCII character set: letters, digits, and hyphens. It is used to encode the hostnames of URLs that contain non-ASCII characters.
Usage
Punycode is used to convert internationalized hostnames that contain Unicode characters to ASCII and vice versa. ASCII is based on the Latin (English) alphabet and cannot represent accented letters or characters from other scripts, whereas Unicode covers the characters of virtually all writing systems.
Note
Punycode only applies to the hostname in a URL. In the rest of the URL (such as the path and query), percent-encoding is used to represent non-ASCII characters.
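To make the split between the two encodings concrete, here is a minimal Python sketch. The helper name `build_url` and the sample host and path are assumptions chosen for illustration. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules, and on `urllib.parse.quote` for percent-encoding.

```python
from urllib.parse import quote


def build_url(host: str, path: str) -> str:
    """Hypothetical helper: encode the host with IDNA/Punycode, the path with percent-encoding."""
    # The hostname goes through the IDNA codec, which applies Punycode per label.
    ascii_host = host.encode("idna").decode("ascii")
    # The path is converted to UTF-8 bytes and each non-ASCII byte becomes %XX.
    ascii_path = quote(path)
    return f"https://{ascii_host}{ascii_path}"


print(build_url("éxample.ai", "/café"))
# https://xn--xample-9ua.ai/caf%C3%A9
```

The same character (é) ends up encoded in two different ways depending on which part of the URL it appears in.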
Definition
An Internationalized Domain Name (IDN) contains characters outside the ASCII range, such as accented letters or letters from non-Latin scripts. Hostnames in the Domain Name System (DNS) are conventionally restricted to ASCII letters, digits, and hyphens, so these names need to be encoded such that they can be universally processed.
Bootstring
Punycode is an instance of a more general algorithm called Bootstring, which represents an arbitrary sequence of code points as a sequence of basic code points drawn from a small set.
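The following is a minimal sketch of the encoding direction, written from the parameters and pseudocode published in RFC 3492. The function names (`adapt`, `threshold`, `punycode_encode`) are illustrative, and the sketch omits overflow checks, case handling, and the `xn--` prefix, so it should not be treated as a production implementation.

```python
# Bootstring parameters as specified for Punycode in RFC 3492.
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta: int, num_points: int, first_time: bool) -> int:
    """Bias adaptation: keeps the variable-length integers short."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def threshold(k: int, bias: int) -> int:
    """Digit threshold for position k, clamped to [TMIN, TMAX]."""
    return min(max(k - bias, TMIN), TMAX)


def punycode_encode(text: str) -> str:
    # Copy the basic (ASCII) code points first, followed by the delimiter.
    basic = [c for c in text if ord(c) < 128]
    output = list(basic)
    handled = len(basic)
    if basic:
        output.append("-")
    n, delta, bias, first = INITIAL_N, 0, INITIAL_BIAS, True
    while handled < len(text):
        # Next smallest code point that has not been inserted yet.
        m = min(ord(c) for c in text if ord(c) >= n)
        delta += (m - n) * (handled + 1)
        n = m
        for c in text:
            cp = ord(c)
            if cp < n:
                delta += 1
            elif cp == n:
                # Emit delta as a generalized variable-length integer.
                q, k = delta, BASE
                while True:
                    t = threshold(k, bias)
                    if q < t:
                        break
                    output.append(DIGITS[t + (q - t) % (BASE - t)])
                    q = (q - t) // (BASE - t)
                    k += BASE
                output.append(DIGITS[q])
                bias = adapt(delta, handled + 1, first)
                first = False
                delta = 0
                handled += 1
        delta += 1
        n += 1
    return "".join(output)


print(punycode_encode("éxample"))  # xample-9ua
```

The ASCII characters are copied through unchanged, and the suffix after the last hyphen records which non-ASCII code point to insert and where.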
Character set
Punycode-encoded hostname labels are built from a small set of ASCII characters, which is:
- Lowercase letters: a-z
- Digits: 0-9
- A single special character, the hyphen: -
Note
Punycode strings in hostnames start with xn--, which can apply to each label of a hostname, for example a subdomain, the domain name, or the top-level domain.
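As a rough illustration of the prefix being applied label by label, the sketch below splits a hostname on dots and encodes only the labels that contain non-ASCII characters. The function name `encode_hostname` and the sample hostname are assumptions. The sketch uses Python's raw `punycode` codec and skips the normalization and validation steps that a full IDNA implementation performs.

```python
def encode_hostname(hostname: str) -> str:
    """Hypothetical helper: Punycode-encode each non-ASCII label separately."""
    encoded = []
    for label in hostname.split("."):
        if label.isascii():
            # Pure ASCII labels are left untouched and get no prefix.
            encoded.append(label)
        else:
            # Non-ASCII labels are encoded and marked with the xn-- prefix.
            encoded.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(encoded)


print(encode_hostname("café.éxample.ai"))
# xn--caf-dma.xn--xample-9ua.ai
```

Here the subdomain and the domain label each receive their own prefix, while the ASCII top-level domain is unchanged.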
Example
In the following example, notice that the Unicode version of the domain name appears to be example.ai, but the first letter, e, has an accent, making it é.
Unicode
éxample.ai
Punycode
xn--xample-9ua.ai
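The conversion above can be reproduced in both directions with Python's built-in `idna` codec. This is a small sketch with assumed variable names. The codec follows the older IDNA 2003 rules, which are sufficient for this hostname.

```python
unicode_host = "éxample.ai"

# Unicode to ASCII: each non-ASCII label is Punycode-encoded and prefixed with xn--.
ascii_host = unicode_host.encode("idna").decode("ascii")
print(ascii_host)  # xn--xample-9ua.ai

# ASCII back to Unicode: the xn-- labels are decoded again.
round_trip = ascii_host.encode("ascii").decode("idna")
print(round_trip)  # éxample.ai
```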
Security concerns
Punycode is an effective way to re-encode Unicode characters so that IDNs can be handled by systems that only support ASCII. However, IDNs are sometimes abused to carry out homograph attacks. A homograph attack is one where the user clicks on a link that appears legitimate, yet leads to a malicious site.
Consider a user who wants to click on a link for www.example.ai.
The link that they click on looks similar: www.exampłe.ai.
It can be difficult to see at a glance that the l in "example" is actually ł, which is Unicode character U+0142. When converted using Punycode, the hostname is actually www.xn--exampe-7db.ai instead of www.example.ai.
When the user clicks on the link, they are sent to the malicious site and the attack continues.
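One simple defensive step is to look at the ASCII form of a hostname before trusting it. The sketch below uses assumed helper names (`ascii_form`, `has_punycode_label`) to flag any hostname whose ASCII form contains an xn-- label. It is a heuristic rather than a complete defense, since legitimate IDNs are flagged too.

```python
def ascii_form(host: str) -> str:
    """Convert a hostname to its ASCII form using Python's built-in idna codec."""
    return host.encode("idna").decode("ascii")


def has_punycode_label(host: str) -> bool:
    """Return True if any label of the ASCII form starts with the xn-- prefix."""
    return any(label.startswith("xn--") for label in ascii_form(host).split("."))


trusted = "www.example.ai"
clicked = "www.exampłe.ai"

print(ascii_form(clicked))          # www.xn--exampe-7db.ai
print(has_punycode_label(trusted))  # False
print(has_punycode_label(clicked))  # True
```

Showing the ASCII form to the user, or comparing it against an expected hostname, makes the substituted character visible even when the two names look the same on screen.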
Takeaway
Punycode is a method of encoding non-ASCII characters in ASCII, making them compatible with internet services and protocols such as DNS. It is useful for processing internationalized domain names but can be a point of concern because IDNs facilitate homograph attacks in some circumstances.