Showing posts with label utf-8. Show all posts

Wednesday, October 17, 2012

Installing Multi-Byte PHP Functions

I received this error today:

Fatal error: Call to undefined function mb_strlen()

The error looks the same no matter which multi-byte PHP function you call: mb_strpos(), mb_substr(), etc. Having this extension installed is increasingly important if you work with international, multi-byte character sets like UTF-8.

The fix is super easy. All you need to do is install the PHP multi-byte extension with yum (if you're on CentOS or RHEL). Do the following from the command line:

yum install php-mbstring
service httpd restart

That installs the extension and restarts Apache. Then you're good to go!
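
To confirm that the extension really is loaded after the restart, a quick check like the one below can help. It is only a sketch: the sample string is arbitrary and $hasMbstring is a name made up for the example.

```php
<?php
// True once php-mbstring is installed and loaded by this PHP binary.
$hasMbstring = extension_loaded('mbstring');
echo $hasMbstring ? "mbstring is loaded\n" : "mbstring is missing\n";

if ($hasMbstring) {
    $word = "h\xC3\xA9llo"; // "héllo" written as raw UTF-8 bytes
    // strlen() counts bytes, mb_strlen() counts characters.
    echo strlen($word), " bytes, ", mb_strlen($word, 'UTF-8'), " characters\n";
}
```

Running it through the web server rather than the command line is the safer test, since the two can be configured differently.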

Monday, August 1, 2011

PHP UTF-8 Script Input Cleaner

When you switch to UTF-8 on your website, there are a few things that everyone recommends (for good reason), like using multi-byte functions (e.g., mb_strlen()) and declaring the charset in an HTTP-EQUIV meta tag. Another commonly recommended step is to clean up all user-submitted input.

With UTF-8 comes the ability to submit a lot of crazy characters to the script, either by POST or GET. These crazy characters might be control characters, invalid UTF-8 characters or some other charset that was mixed in for good measure. So, I created the following function to help clean my inputs:

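<?PHP

// Cleans request input in place: strips magic-quote slashes (legacy PHP only),
// re-encodes as UTF-8 and removes Unicode "Other" (\p{C}) characters.
function cleanUTF8(&$input, $stripSlashes = true) {
 // Magic quotes only exist before PHP 5.4, and get_magic_quotes_gpc() is gone in PHP 8.
 if ($stripSlashes) $stripSlashes = version_compare(PHP_VERSION, '5.4.0', '<') && get_magic_quotes_gpc();
 if (is_array($input)) {
  foreach ($input as $k => $v) cleanUTF8($input[$k], $stripSlashes);
 } elseif (is_string($input)) {
  if ($stripSlashes) $input = stripslashes($input);
  // Invalid byte sequences are replaced with the mbstring substitution character.
  $input = mb_convert_encoding($input, "UTF-8", "UTF-8");
  // \p{C} also matches tabs and line breaks, so they are removed too.
  $input = preg_replace('!\p{C}!u', '', $input);
 }
}

?>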

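The function is recursive: if the input is an array, it calls itself on every element, so nested arrays (which form fields named like field[] produce) are cleaned too. It will also strip slashes for you if your version of PHP still has magic quotes on (magic quotes were removed in PHP 5.4).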
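
For comparison, the same leaf-by-leaf traversal can be sketched with PHP's built-in array_walk_recursive(). The helper name cleanLeaf() and the sample array are made up for illustration; this is not the function from this post, and it skips the magic-quotes handling.

```php
<?php
// Hypothetical helper: cleans a single string value in place.
function cleanLeaf(&$value) {
    if (!is_string($value)) {
        return; // leave non-string leaves alone
    }
    $value = mb_convert_encoding($value, 'UTF-8', 'UTF-8');
    $stripped = preg_replace('/\p{C}/u', '', $value);
    if ($stripped !== null) { // preg_replace() returns null on failure
        $value = $stripped;
    }
}

// Made-up sample shaped like nested form input (e.g. fields named tags[]).
$sample = array(
    'name' => "Ann\x00",
    'tags' => array("red\x07", array('deep' => "blue\x1F")),
);

// Visits every non-array element, however deeply nested.
array_walk_recursive($sample, 'cleanLeaf');
var_dump($sample);
```

The hand-written recursion has one advantage over this version: it also accepts a plain string, not just an array.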

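The mb_convert_encoding() call handles invalid characters: converting from UTF-8 to UTF-8 replaces any byte sequence that is not valid UTF-8 with the mbstring substitution character ("?" by default; call mb_substitute_character("none") first if you want those bytes dropped instead). Lastly, the fancy preg_replace() call removes everything in the Unicode "Other" category. (The \p{C} covers control characters, invisible format characters, and unassigned or private-use code points; note that this includes tabs and line breaks. The !! are delimiters, the same as // or ##, and the trailing "u" modifier means "treat the pattern and the subject as UTF-8.")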
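
The two cleaning steps can be tried in isolation. The following is a sketch with made-up sample strings; the last pattern, which keeps tabs and line breaks, is an assumed variation and not part of the function above.

```php
<?php
$bad = "a\xFFb"; // the byte 0xFF can never appear in valid UTF-8

// Default behaviour: the invalid byte is swapped for the substitution character.
$replaced = mb_convert_encoding($bad, 'UTF-8', 'UTF-8');

// With the substitution character set to "none", invalid bytes are dropped.
$previous = mb_substitute_character();
mb_substitute_character('none');
$dropped = mb_convert_encoding($bad, 'UTF-8', 'UTF-8');
mb_substitute_character($previous); // restore the earlier setting

// \p{C} strips tab, newline and NUL alike.
$text = "a\tb\nc\x00d";
$all = preg_replace('/\p{C}/u', '', $text);

// Assumed variation: remove \p{C} characters except tab, CR and LF.
$keepWhitespace = preg_replace('/[^\P{C}\t\r\n]/u', '', $text);

var_dump($replaced, $dropped, $all, $keepWhitespace);
```

The variation may be worth considering for multi-line fields such as textareas, where losing every line break is usually not wanted.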

At the top of your script add this to iterate over the input array and clean up the data:

<?PHP

cleanUTF8($_POST);
cleanUTF8($_GET);

?>