Tips for Supporting UTF-8 in Your PHP5 Applications

Here are some great tips that will help any PHP5 developer support UTF-8 in their applications.  This list isn't comprehensive, but it will allow you to do most basic operations while avoiding troublesome character encoding issues.

Most articles on this subject tend to be very confusing, as they try to explain in detail how encoding works.  This article, however, was written for the average PHP developer who simply wants to support UTF-8 characters in their applications.  If you have additional tips, please feel free to share them in the comments below.

Note that, for the purpose of this article, the term extended characters refers to any UTF-8 character that falls outside of the ISO-8859-1 range (for example, Chinese characters, Russian characters, and certain Polish characters).

Webpages

  • First and foremost, make sure your files are UTF-8 encoded.
    • Most text editors have an option for this somewhere in the File menu or in their preferences.  Look for something called File Encoding.
    • It's important that the files themselves be UTF-8 encoded, especially if they contain extended characters.  Otherwise, you will almost certainly run into display issues.
    • If possible, select UTF-8 without BOM.
  • Make sure every page has the UTF-8 meta tag:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

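    • This works on most servers, but not all, because a charset sent in the server's HTTP Content-Type header takes precedence over the meta tag.  If the meta tag option doesn't work, you can try adding this line to your .htaccess file: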
      AddDefaultCharset UTF-8
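
If you can't edit .htaccess, the charset can also be sent from PHP itself.  The sketch below is only one way to do it: header() is a standard PHP function, but has_utf8_bom() is a hypothetical helper I made up to show why the "without BOM" advice matters (a BOM counts as output, and output sent before header() stops the header from being set).

```php
<?php
// Hypothetical helper (not a PHP built-in): true when $bytes starts with
// the three-byte UTF-8 byte order mark.
function has_utf8_bom($bytes)
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Send the charset in the HTTP header. This has to happen before any
// output, which is why a BOM at the top of a PHP file gets in the way.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}
```

To check a file you suspect was saved with a BOM, something like has_utf8_bom(file_get_contents($path)) should do, where $path is whatever file you want to inspect.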

Working with Strings

  • Include this at the beginning of every page:
    mb_internal_encoding("UTF-8");
  • STOP USING THE NORMAL PHP STRING FUNCTIONS ON UTF-8 INPUT!
    • Until PHP6 comes out, it's important to use the Multibyte String Functions instead of the standard string functions.
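    • Many of the MB String Functions mimic standard string functions, for example mb_strlen, mb_substr, mb_strpos, and mb_strtolower in place of strlen, substr, strpos, and strtolower.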
    • Using standard string functions on strings containing extended characters will result in erroneous output!
  • Stick with PCRE functions when working with regular expressions.
    • preg_match, preg_replace, etc.
    • If you absolutely need to use ereg_ functions, use their mb_ equivalents (above).
  • Use htmlspecialchars instead of htmlentities.
    • Using htmlentities will cause problems with extended characters, usually leading to erroneous output!
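
To make the string tips above concrete, here is a small sketch comparing the byte-oriented functions with their multibyte counterparts.  The sample word is my own choice, and the file is assumed to be saved as UTF-8.  The /u pattern modifier and the explicit 'UTF-8' argument to htmlspecialchars go slightly beyond the tips above, but both are standard PHP.

```php
<?php
mb_internal_encoding('UTF-8');

$word = 'Привет'; // 6 Cyrillic characters, 12 bytes in UTF-8

// The standard function counts bytes, the mb_ function counts characters.
$byteLength = strlen($word);      // 12
$charLength = mb_strlen($word);   // 6

// mb_substr cuts on character boundaries; substr could split a character.
$firstThree = mb_substr($word, 0, 3);   // 'При'

// mb_strtoupper handles non-Latin letters.
$upper = mb_strtoupper($word);    // 'ПРИВЕТ'

// With PCRE, the /u modifier treats pattern and subject as UTF-8,
// so "." matches one whole character rather than one byte.
preg_match_all('/./u', $word, $matches);
$charCount = count($matches[0]);  // 6

// htmlspecialchars escapes only the HTML special characters and leaves
// the extended characters alone.
$escaped = htmlspecialchars('<b>Привет</b>', ENT_QUOTES, 'UTF-8');
// '&lt;b&gt;Привет&lt;/b&gt;'
```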

Database Settings (MySQL)

  • Set your database collation to UTF-8 (utf8_general_ci)
    • This can be done in phpMyAdmin using the Operations tab. (Make sure you have the database itself selected, not one of the tables.)
  • Make sure all your database tables are UTF-8 (utf8_general_ci)
    • Again, this can be done in phpMyAdmin by selecting a table first, then selecting the Operations tab.
  • Make sure all your fields are UTF-8 (utf8_general_ci), where applicable.
    • This includes all fields of type text, varchar, enum, char, etc.  (Anything that stores textual data.)
  • Always use the following immediately after opening your database connection:
    mysql_set_charset('utf8',$link);
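
If you'd rather not click through phpMyAdmin, the same collation changes can be written as SQL.  The sketch below only builds the statements as strings; both helper functions and the names 'blog', 'posts', and 'comments' are made up for illustration.  CONVERT TO rewrites the existing text columns, so it's worth trying on a backup first.

```php
<?php
// Hypothetical helper: wraps a database or table name in backticks for MySQL.
function quote_identifier($name)
{
    return '`' . str_replace('`', '``', $name) . '`';
}

// Hypothetical helper: builds the statements that switch a database and
// its tables to utf8 / utf8_general_ci.
function utf8_conversion_sql($database, array $tables)
{
    $sql = array(
        'ALTER DATABASE ' . quote_identifier($database)
            . ' CHARACTER SET utf8 COLLATE utf8_general_ci',
    );
    foreach ($tables as $table) {
        // CONVERT TO also changes the table's existing text columns.
        $sql[] = 'ALTER TABLE ' . quote_identifier($table)
            . ' CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci';
    }
    return $sql;
}

// Made-up database and table names, for illustration only.
$statements = utf8_conversion_sql('blog', array('posts', 'comments'));
```

Each string in $statements can then be run through whatever database connection you already have open.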

Working with Files

In PHP5, I have yet to find a way to work with files that have extended characters in their filenames.  I am led to believe that this is a problem within PHP itself, which is why version 6 is taking so long to develop.  (PHP6 is scheduled to have full Unicode support.)

There are absolutely no problems with reading from and writing to files that contain extended characters in their content, but the filenames themselves cannot contain them.  This is currently the only limitation that I personally haven't found a workaround for.  Fortunately, it's usually not an issue, since web-safe filenames don't currently include the extended characters that cause problems.
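
As a quick illustration of the content side working fine, here is a minimal round-trip sketch.  It assumes the script is saved as UTF-8 and the system temp directory is writable; the sample text is my own.  The filename stays plain ASCII and only the content carries extended characters.

```php
<?php
// ASCII-only filename in the system temp directory.
$path = tempnam(sys_get_temp_dir(), 'utf8');

$text = "Привет, мир! 你好，世界！\n";

// file_put_contents and file_get_contents copy bytes as-is,
// so the UTF-8 content comes back unchanged.
file_put_contents($path, $text);
$readBack = file_get_contents($path);

$isIdentical = ($readBack === $text);
$isValidUtf8 = mb_check_encoding($readBack, 'UTF-8');

// Clean up the temporary file.
unlink($path);
```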

About the author

New Hampshirite building web apps in Florida. Creator of Surreal CMS, Postleaf, and DirtyMarkup.