The url-generic-parse-url parser does not obey RFC 3986 in one respect: it allows non-ASCII characters in URI strings.
Strictly speaking, RFC 3986 compatible URIs may only consist of ASCII characters; non-ASCII characters are represented by converting them to UTF-8 byte sequences, and performing percent encoding on the bytes. For example, the o-umlaut character (ö) is converted to the UTF-8 byte sequence ‘\xC3\xB6’, then percent encoded to ‘%C3%B6’. (Certain “reserved” ASCII characters must also be percent encoded when they appear in URI components.)
The function url-encode-url can be used to convert a URI string containing arbitrary characters to one that is properly percent-encoded in accordance with RFC 3986.
This function returns a properly URI-encoded version of its argument, url-string. It also performs URI normalization, e.g., converting the scheme component to lowercase if it was previously uppercase.
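As a minimal sketch of the behavior described above, the following call encodes a URI that has an uppercase scheme and a non-ASCII character in its path. The URI itself is a made-up example, and the sketch assumes url-encode-url is loaded from the url-util library.

```elisp
(require 'url-util)

;; Hypothetical input: uppercase scheme, o-umlaut in the path.
;; The scheme should come back lowercased and the o-umlaut
;; percent-encoded as its UTF-8 bytes.
(url-encode-url "HTTP://example.com/föo")
;; => "http://example.com/f%C3%B6o"
```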
To convert between a string containing arbitrary characters and a percent-encoded all-ASCII string, use the functions url-hexify-string and url-unhex-string:
The function url-hexify-string performs percent-encoding on its argument string, and returns the result.
If string is multibyte, it is first converted to a UTF-8 byte string. Each byte corresponding to an allowed character is left as-is, while all other bytes are converted to a three-character sequence: ‘%’ followed by two upper-case hex digits.
The allowed characters are specified by allowed-chars. If this argument is nil, the allowed characters are those specified as unreserved characters by RFC 3986 (see the variable url-unreserved-chars). Otherwise, allowed-chars should be either a list of allowed chars, or a vector whose Nth element is non-nil if character N is allowed.
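The two ways of calling url-hexify-string can be sketched as follows. The input strings are illustrative only, and the second call assumes that url-unreserved-chars is a list of characters, so that one more allowed character can be consed onto it.

```elisp
(require 'url-util)

;; Default: only the RFC 3986 unreserved characters are left as-is.
;; The o-umlaut becomes its two UTF-8 bytes; the space becomes %20.
(url-hexify-string "föo bar")
;; => "f%C3%B6o%20bar"

;; Custom allowed-chars given as a list: additionally keep "/" unencoded.
(url-hexify-string "a/b c" (cons ?/ url-unreserved-chars))
;; => "a/b%20c"
```

Note that a list passed as allowed-chars replaces the default set rather than extending it, which is why the sketch builds on url-unreserved-chars instead of passing only the extra character.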
The function url-unhex-string replaces percent-encoding sequences in its argument string with their character equivalents, and returns the resulting string.
If allow-newlines is non-nil, it allows the decoding of carriage returns and line feeds, which are normally forbidden in URIs.
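A short sketch of decoding follows. It assumes that, for percent-encoded non-ASCII text, the string returned by url-unhex-string holds raw UTF-8 bytes, so that an explicit decode-coding-string call is needed to recover the original characters. The inputs are illustrative only.

```elisp
(require 'url-util)

;; Percent-decode, then decode the resulting bytes as UTF-8
;; (assumption: the input was produced from UTF-8 text).
(decode-coding-string (url-unhex-string "f%C3%B6o") 'utf-8)
;; => "föo"

;; With allow-newlines non-nil, an encoded line feed is decoded as-is.
(url-unhex-string "a%0Ab" t)
;; => "a\nb"
```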