PHP, System Locale and I18n
A large number of PHP functions base their behavior on the system locale: number_format, money_format …
If we take a closer look to the
documentation, here is what is says about its first parameter
`category` is a named constant specifying the category of the functions affected by the locale setting:
LC_ALLfor all of the below
LC_COLLATEfor string comparison, see strcoll()
LC_CTYPEfor character classification and conversion, for example strtoupper()
LC_NUMERICfor decimal separator (See also localeconv())
LC_TIMEfor date and time formatting with strftime()
LC_MESSAGESfor system responses (available if PHP was compiled with libintl)
And just below this explanation, the only given examples in the doc make
But let’s take a concrete example:
<?php $table = 'Administrator_Roles'; setlocale(LC_ALL, 'tr_TR.UTF-8', 'tr'); $class = preg_replace('/[^A-Z0-9]/i','_',$table); print $class.PHP_EOL; ?>
What is the output going to be ?
Find the culprit: PCRE!
…or maybe developers like me, who did not spend enough time hunting down the information in the PHP manual ;)
Rather than copying the whole contents referenced by the two above
links, let us summarize how
setlocale could end up producing such
PHP builds in PCRE module (Perl-Compatible Regular Expressions), whose standard escape sequences might be locale-specific.
Unfortunately, the official PHP documentation is not so helpful (see http://us2.php.net/manual/en/regexp.reference.escape.php):
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE’s character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
LC_ALL whenever possible. Select the most restrictive
"category" you need.
LC_NUMERIC seem to be good candidates for
e-commerce websites for instance. By the way, you can apply several
categories by calling
setlocale multiple times in a row (bitwise OR
should NOT work).
If you really cannot change
LC_ALL to a more specific category, be
extremely cautious with regular expressions, string comparisons etc.
Your application could grow enough to have to support new languages, and this could break as in the example of this post.
Note that if your application handles Unicode representation (crazy idea in 2011, I know), then you can use Unicode in your expressions: http://php.net/manual/en/regexp.reference.unicode.php. However, the documentation says:
Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.
It might be worth ignoring this piece of advice until it is proven to be a problem for your application. As they say, premature optimization is the root of all evil.
And that’s all, folks!