With UTF-8 comes the ability to submit a lot of crazy characters to the script, either by POST or GET. These crazy characters might be control characters, invalid UTF-8 characters or some other charset that was mixed in for good measure. So, I created the following function to help clean my inputs:
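<?PHP
function cleanUTF8(&$input, $stripSlashes = true) {
    // Magic quotes were removed in PHP 8, so check that the function exists.
    if ($stripSlashes)
        $stripSlashes = function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc();
    if (is_array($input)) {
        // Recurse into arrays, cleaning each element in place.
        foreach ($input as $k => $v)
            cleanUTF8($input[$k], $stripSlashes);
    } else {
        if ($stripSlashes)
            $input = stripslashes($input);
        // Re-encode UTF-8 as UTF-8 to replace invalid byte sequences.
        $input = mb_convert_encoding($input, "UTF-8", "UTF-8");
        // Strip control and other \p{C} characters.
        $input = preg_replace('!\p{C}!u', '', $input);
    }
}
?>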
The function is recursive: if the input is an array, it descends into each element, including any nested arrays, which request data sometimes contains. It will also strip slashes for you if your version of PHP still has magic quotes on.
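It removes invalid characters through the mb_convert_encoding() function. Converting from UTF-8 to UTF-8 replaces any byte sequence that is not valid UTF-8 with mbstring's substitute character (a "?" by default; call mb_substitute_character("none") beforehand if you want such bytes dropped instead). Lastly, the fancy preg_replace() call removes all control characters. (The \p{C} matches the Unicode "Other" category, which covers control characters as well as format, unassigned, private-use and surrogate code points; the !! are delimiters, the same as // or ##; and the last "u" modifier means "treat this as UTF-8.")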
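To make the two steps concrete, here is a minimal sketch that runs them by hand on a made-up input. The sample string and the variable names ($raw, $step1, $step2) are only for illustration, and it assumes the mbstring extension is loaded with its default substitute character.

```php
<?php
// Hypothetical input: a stray Latin-1 byte (0xE9), a NUL byte and a newline.
$raw = "caf\xE9 ok\x00\n";

// Step 1: re-encode UTF-8 as UTF-8 so invalid byte sequences are replaced
// with mbstring's substitute character.
$step1 = mb_convert_encoding($raw, "UTF-8", "UTF-8");

// Step 2: strip everything in the Unicode "Other" category (\p{C}),
// which includes control characters such as NUL and newline.
$step2 = preg_replace('!\p{C}!u', '', $step1);

var_dump(mb_check_encoding($raw, "UTF-8"));   // bool(false)
var_dump(mb_check_encoding($step2, "UTF-8")); // bool(true)
var_dump($step2);
```

The exact replacement for the bad byte depends on your mbstring settings, so it is worth dumping the result on your own server before relying on it.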
At the top of your script add this to iterate over the input array and clean up the data:
<?PHP
cleanUTF8($_POST);
cleanUTF8($_GET);
?>
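One thing to watch for: \p{C} also matches tabs and line breaks, so multi-line fields lose their formatting. Below is a rough sketch of a variant that keeps tab, LF and CR. The name cleanUTF8KeepBreaks() is made up for this example, and unlike cleanUTF8() it returns the cleaned value instead of modifying its argument, and it skips the magic quotes handling.

```php
<?php
// Hypothetical variant of the cleaning step that keeps tab, LF and CR.
function cleanUTF8KeepBreaks($input)
{
    if (is_array($input)) {
        // Clean each element, recursing into nested arrays (keys are kept).
        return array_map('cleanUTF8KeepBreaks', $input);
    }
    // Replace invalid UTF-8 byte sequences, as in the original function.
    $input = mb_convert_encoding((string) $input, "UTF-8", "UTF-8");
    // [^\P{C}\t\n\r] reads as "in \p{C}, but not tab, LF or CR".
    return preg_replace('![^\P{C}\t\n\r]!u', '', $input);
}

$message = cleanUTF8KeepBreaks("line one\r\nline\x00 two\x07");
var_dump($message === "line one\r\nline two"); // bool(true)
```

You could run cleanUTF8() on most inputs and reserve a variant like this for the few fields where line breaks matter.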
2 comments:
This was very helpful. Thank You!!
No problem. Just a note: if you use textareas in your forms, this will remove line breaks as well (they are considered control characters, which you would want to remove from most inputs).