I’m supposed to be doing language revision but instead have found a far more interesting distraction in the form of international domain names, and as this is NetScaler related I thought I’d share the results…
NetScaler can be configured to support and content switch Internationalised Domain Names (IDNs), such as those containing Cyrillic characters.
Domain names can contain non-Latin characters by using an encoding scheme called Punycode (defined as part of a framework called Internationalised Domain Names in Applications, or IDNA). An encoded label starts with the prefix “xn--”, followed by any ASCII characters from the original name, then a “-” separator, and finally the encoded non-Latin characters (if the name has no ASCII characters at all, the separator is omitted). This of course could allow for lots of phishing-related fun.
As an example of both the possible phishing fun and the encoding scheme, here is citriх.com with a Cyrillic “х” (which is pronounced Ha for anyone who’s interested), which encodes to xn--citri-uze.com. You’ll note this goes to my website and not the real citrix.com site. The difference between “х” and “x” is more difficult to spot if I change the font.
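As a rough illustration of the encoding, here is a minimal Python sketch that uses the built-in "idna" codec (which implements the older IDNA 2003 rules, so treat it as an approximation of what a modern browser does). The helper name to_ascii is made up for this example.

```python
def to_ascii(hostname):
    """Convert a Unicode hostname to its ASCII (Punycode) form."""
    # The "idna" codec handles each dot-separated label in turn and adds
    # the "xn--" prefix to any label containing non-ASCII characters.
    return hostname.encode("idna").decode("ascii")


# "citri" followed by Cyrillic "х" (U+0445), not Latin "x" (U+0078)
lookalike = "citri\u0445.com"
genuine = "citrix.com"

print(to_ascii(lookalike))   # xn--citri-uze.com
print(to_ascii(genuine))     # citrix.com
print(lookalike == genuine)  # False
```

The two names look the same on screen but are different strings, and only the look-alike picks up the “xn--” prefix.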
While the example above is interesting, in my case I’m interested in “райт.com” as this is
a) how you would write “Wright” in Russian (“Wright” is my last name, and I’m studying Russian)
b) a source of some childish amusement, as “pants” are a chap’s underwear in the UK 🙂
Matching a path written in Cyrillic
Having configured DNS for www.xn--80astk.com and pointed it towards my NetScaler appliance, which is running 10.5 firmware, I wanted to configure content switching rules; I’d like to switch based on a request for either www.райт.com/стивен (/steven) or www.райт.com/Даниил (/daniel) — this was far more difficult than it would first appear.
• Discovery number one was content switching policy names are restricted to Latin characters only. A policy name of “райт_Политика1” returns an error.
• Discovery number two was HTTP.REQ.URL.PATH.CONTAINS(“стивен”) produces the error “Advanced expression function does not accept non-ASCII arguments [.URL.PATH.CONTAINS(“,18]”
What I hadn’t done was change the character set. The required expression was HTTP.REQ.URL.SET_CHAR_SET(UTF_8).PATH.CONTAINS(“стивен”)
While this expression is correct it would still only work via the GUI. Trying the expression on the CLI produced the following:
> add cs policy test5_policy -rule “HTTP.REQ.URL.SET_CHAR_SET(UTF_8).PATH.CONTAINS(“стивен”)” -action deliver_test_site_action
ERROR: No such argument [стивен”)”]
Not very friendly at all. Examining the running config after adding this rule via the GUI showed the expression had been stored with each character written as the hex values of its UTF-8 bytes.
add cs policy test6_policy -rule “HTTP.REQ.URL.SET_CHAR_SET(UTF_8).PATH.CONTAINS(\”\xd1\x81\xd1\x82\xd0\xb8\xd0\xb2\xd0\xb5\xd0\xbd\”)” -action deliver_test_site_action
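The hex form in the running config is simply the UTF-8 bytes of the Cyrillic text, two bytes per letter here. A minimal Python sketch of that conversion follows; the function names are hypothetical helpers for this example, not anything NetScaler provides.

```python
def hex_escape_utf8(text):
    # One \xNN escape per UTF-8 byte of the input text.
    return "".join(f"\\x{byte:02x}" for byte in text.encode("utf-8"))


def hex_unescape_utf8(escaped):
    # Reverse the conversion: drop the \x markers and decode the bytes.
    return bytes.fromhex(escaped.replace("\\x", "")).decode("utf-8")


escaped = hex_escape_utf8("стивен")
print(escaped)
# \xd1\x81\xd1\x82\xd0\xb8\xd0\xb2\xd0\xb5\xd0\xbd
print(hex_unescape_utf8(escaped))
# стивен
```

The output matches the string inside CONTAINS(...) in the saved policy, which suggests a rule for the CLI could be prepared by escaping the text this way first, although I have only the GUI-generated config above as evidence.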
That solves matching the path – now I need to match the domain.
Matching a Cyrillic domain
Initially I tried a long shot with this expression HTTP.REQ.HEADER(“Host”).EQ(“райт.com”) and unsurprisingly this also gives an error.
Next I tried HTTP.REQ.HEADER(“Host”).SET_CHAR_SET(UTF_8).EQ(“райт.com”) which doesn’t give an error, but also doesn’t work.
It appears that for the host header there is no automatic encoding with Punycode, so I did this manually and created the following expression (working, GUI only): HTTP.REQ.HEADER(“Host”).EQ(“www.xn--80astk.com”) && HTTP.REQ.URL.SET_CHAR_SET(UTF_8).PATH.CONTAINS(“стивен”)
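To see why the Host comparison has to use the encoded name, here is a small Python sketch of the manual conversion. It assumes the client sends the ASCII form of the hostname, as browsers do, and again relies on Python's built-in "idna" codec (IDNA 2003); host_header_for is a hypothetical helper name.

```python
def host_header_for(hostname):
    # ASCII form of the hostname, i.e. what ends up in the Host header.
    return hostname.encode("idna").decode("ascii")


typed = "www.райт.com"
sent = host_header_for(typed)

print(sent)           # www.xn--80astk.com
print(sent == typed)  # False
# Decoding the ASCII form recovers the name as typed.
print(sent.encode("ascii").decode("idna"))  # www.райт.com
```

A string comparison against “райт.com” can never equal the header value, whatever the character set, because the header only ever carries the “xn--” form.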
I can now match both the domain and a path of Daniel (Даниил), or the domain and a path of Steven (стивен). What I wanted was a single rule to do both (/Даниил || /стивен). For that I needed a regular expression.
A combined policy
As a starting point I began with a Latin character policy of HTTP.REQ.URL.PATH.REGEX_MATCH(re#\bblue\b|\bred\b#) which matches a word boundary, red or blue, then another word boundary.
Setting the match to UTF_8 gives an expression of HTTP.REQ.URL.SET_CHAR_SET(UTF_8).PATH.REGEX_MATCH(re#\bстивен\b|\bДаниил\b#) but to my surprise this didn’t match anything.
Removing the word boundaries solved the problem, which makes sense when I think about it: \b marks the edge between a word character and a non-word character, and if a word character is defined as [A-Za-z0-9_] then my Cyrillic word contains no word characters at all, so there is never a boundary to match.
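The same behaviour can be reproduced outside NetScaler. The Python sketch below uses the re.ASCII flag as a stand-in for a regex engine whose word characters are limited to [A-Za-z0-9_]; that this is how NetScaler's engine behaves is an inference from the results above rather than something the sketch proves.

```python
import re

path = "/стивен"

# ASCII-only word characters: Cyrillic letters are "non-word",
# so there is no word boundary between "/" and "с".
ascii_words = re.compile(r"\bстивен\b", re.ASCII)

# Unicode-aware word characters: Cyrillic letters count as word characters.
unicode_words = re.compile(r"\bстивен\b")

print(ascii_words.search(path) is None)         # True (no match)
print(unicode_words.search(path) is not None)   # True (match)

# The Latin starting point still works under the ASCII definition.
print(re.search(r"\bblue\b", "/blue", re.ASCII) is not None)  # True
```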
For testing I replaced the word boundary match with one for “/”: HTTP.REQ.URL.SET_CHAR_SET(UTF_8).PATH.REGEX_MATCH(re#/стивен/|/Даниил/#) which gives success but of course now requires “/name/” and doesn’t match “/name” without that final “/”.
A little more messing with the expression and the problem is solved HTTP.REQ.URL.SET_CHAR_SET(UTF_8).PATH.REGEX_MATCH(re#/стивен(/|$)|/Даниил(/|$)#)
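The pattern itself can be sanity-checked with Python's re module before it goes anywhere near a policy. This is only a sketch: it assumes REGEX_MATCH searches anywhere in the path (as the earlier \bblue\b example implies), and the sample paths are made up for the test.

```python
import re

pattern = re.compile(r"/стивен(/|$)|/Даниил(/|$)")

samples = [
    "/стивен",        # bare name at end of path
    "/стивен/",       # trailing slash
    "/Даниил/page",   # name followed by a further segment
    "/стивенсон",     # longer word that starts with the name
    "/даниил",        # lower-case first letter
    "/blog/стивен",   # name deeper in the path
]

results = {path: pattern.search(path) is not None for path in samples}
for path, matched in results.items():
    print(f"{path!r}: {matched}")
```

Two things stand out from the results: the match is case-sensitive, so “/даниил” is rejected, and because the pattern is not anchored at the start it also accepts “/blog/стивен”. Adding a leading ^ would restrict it to the first path segment if that matters.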
Finally – I have an expression of HTTP.REQ.HEADER(“Host”).EQ(“www.xn--80astk.com”) && HTTP.REQ.URL.SET_CHAR_SET(UTF_8).PATH.REGEX_MATCH(re#/стивен(/|$)|/Даниил(/|$)#)
Now I need to get an SSL certificate!
For anyone still interested (just me maybe) 😉 it appears the common name in an X.509 cert can be UTF-8, so the actual unencoded name can be used. Examining domains on the .рф ccTLD, I see this is quite common, with the Punycode version set as a subject alternative name.
This is something I need to investigate further, as the StartCom domain validation system (which I normally get certificates from) refuses to accept UTF-8 domains.