Monday, August 1, 2011

PHP UTF-8 Script Input Cleaner

When you switch to UTF-8 on your website, there are a few things that everyone recommends (for good reason), like using the multi-byte string functions (e.g., mb_strlen()) and declaring the charset in a meta tag with the HTTP-EQUIV attribute. Another commonly recommended step is to clean up all user-submitted input.

With UTF-8 comes the ability to submit a lot of crazy characters to the script, either by POST or GET. These crazy characters might be control characters, invalid UTF-8 byte sequences, or bytes from some other charset that were mixed in for good measure. So, I created the following function to help clean my inputs:


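function cleanUTF8(&$input, $stripSlashes = true) {
 // Magic quotes were removed in PHP 5.4, so only check for them on older versions.
 if ($stripSlashes) $stripSlashes = PHP_VERSION_ID < 50400 && get_magic_quotes_gpc();
 if (is_array($input)) {
  // Recurse into nested arrays (e.g., name="field[]" form inputs).
  foreach ($input as $k => $v) cleanUTF8($input[$k], $stripSlashes);
 } else {
  if ($stripSlashes) $input = stripslashes($input);
  // Re-encode UTF-8 as UTF-8 to get rid of invalid byte sequences.
  $input = mb_convert_encoding($input, "UTF-8", "UTF-8");
  // Strip every character in the Unicode "Other" (C) category.
  $input = preg_replace('!\p{C}!u', '', $input);
 }
}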


The function is recursive: if the input is an array (as form input sometimes is), it walks into every element, however deeply nested. It will also strip slashes for you if your version of PHP still has magic quotes on (the feature was removed in PHP 5.4).

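It removes invalid characters through the mb_convert_encoding() function. Converting from UTF-8 to UTF-8 forces every byte sequence to be validated, and anything that is not valid UTF-8 is replaced with the mbstring substitute character ("?" by default; call mb_substitute_character("none") beforehand if you want it dropped instead). Lastly, the fancy preg_replace() function removes all control characters. (The \p{C} matches the Unicode "Other" category, which covers control characters as well as format, private-use, and unassigned code points; the !! are delimiters, the same as // or ##; and the last "u" modifier means "treat this as UTF-8.")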
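
To make the two steps concrete, here is a minimal sketch that runs them one after the other on a made-up string. The sample value and the variable names are assumptions for illustration only, and it assumes the mbstring extension is loaded with its default settings.

```php
<?php
// Hypothetical sample: "abc", one invalid byte (0xFF), a NUL control
// character, then "def".
$raw = "abc\xFF\x00def";

// Step 1: re-encode UTF-8 as UTF-8 so invalid byte sequences are replaced.
$valid = mb_convert_encoding($raw, "UTF-8", "UTF-8");

// Step 2: strip every character in the Unicode "Other" (C) category.
$clean = preg_replace('!\p{C}!u', '', $valid);

// Compare the validity of the string before and after cleaning.
var_dump(mb_check_encoding($raw, "UTF-8"));
var_dump(mb_check_encoding($clean, "UTF-8"));

// Hex dump of the result, handy for spotting leftover bytes.
echo bin2hex($clean), "\n";
```

The order matters: preg_replace() with the "u" modifier returns NULL when the subject is not valid UTF-8, so the re-encoding step has to run first.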

At the top of your script add this to iterate over the input array and clean up the data:
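
A minimal sketch of those calls, assuming the inputs you care about are the $_GET, $_POST, and $_COOKIE superglobals (that choice is an assumption; adjust it to whatever your script actually reads). The function is repeated here only so the snippet runs on its own, with the magic-quotes check limited to PHP versions older than 5.4 so that it also works on current PHP.

```php
<?php
// Copy of cleanUTF8() so this snippet is self-contained; in a real script,
// define it once (or include its file) before making these calls.
function cleanUTF8(&$input, $stripSlashes = true) {
    if ($stripSlashes) $stripSlashes = PHP_VERSION_ID < 50400 && get_magic_quotes_gpc();
    if (is_array($input)) {
        foreach ($input as $k => $v) cleanUTF8($input[$k], $stripSlashes);
    } else {
        if ($stripSlashes) $input = stripslashes($input);
        $input = mb_convert_encoding($input, "UTF-8", "UTF-8");
        $input = preg_replace('!\p{C}!u', '', $input);
    }
}

// The arrays are passed by reference, so they are cleaned in place.
cleanUTF8($_GET);
cleanUTF8($_POST);
cleanUTF8($_COOKIE);
```

After these calls, the rest of the script can read the superglobals as usual and get the cleaned values.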





Andy Richter said...

This was very helpful. Thank You!!

Eric said...

No problem. Just a note: if you use textareas in your forms, this will remove line breaks as well (they are considered control characters, and in most inputs they are something you'd want to remove).
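
For textarea fields, one possible workaround is to exclude the line-break characters from the pattern. This is only a sketch and not part of the original function; the helper name stripControlKeepNewlines() and the sample text are made up for illustration.

```php
<?php
// Hypothetical variant of the final step for multi-line fields: removes
// \p{C} characters except carriage return, line feed, and tab.
function stripControlKeepNewlines($text) {
    // [^\P{C}\r\n\t] reads as "a \p{C} character that is not \r, \n, or \t".
    return preg_replace('![^\P{C}\r\n\t]!u', '', $text);
}

// Sample textarea value containing a stray NUL byte.
$comment = "line one\r\nline\x00 two\ttabbed";
echo stripControlKeepNewlines($comment), "\n";
```

As with the original pattern, the input should already be valid UTF-8 (for example, after the mb_convert_encoding() step) before it reaches preg_replace().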