How to unify text file encoding to utf-8 in PHP


I’m Nishimura, the creator of QuizGenerator. I did not take part in the early development of learningBOX, but I have been involved in its development since around the release of version 2.0. I received a pull request regarding character code handling, and it made me think I should talk a bit more about character codes.
In this article, I will introduce “How to unify text file character codes to utf-8 in PHP”. I hope you enjoy this article. Let’s begin!

Table of contents


  • 1. Shift-JIS is inevitable
  • 2. How to avoid garbled text
  • 3. How do we determine the character code?
  • 4. Summary


Shift-JIS is inevitable

Systems such as learningBOX and QuizGenerator may receive text files such as CSVs. The files are supposed to be utf-8, but as a practical matter Shift_JIS files are uploaded fairly often, and the resulting garbled text is often reported as a bug.
So, when QuizGenerator receives a Shift_JIS file, it converts the file to utf-8 and then continues processing.

What is Shift-JIS?

Shift_JIS is one of the character codes standardized as a JIS standard for text that includes Japanese. It is a variation on the JIS code: while the JIS code represents characters in 7-bit units, Shift_JIS represents ASCII and half-width katakana in one byte and other characters such as kanji in two bytes (16 bits).


mb_convert_encoding is unreliable

PHP has a function called mb_convert_encoding, which converts the character code of a string. At first glance it looks as if this one function could both determine the character code and convert to utf-8, but in practice it cannot be trusted to do so.

For example, mb_convert_encoding("あああ", "utf-8", "utf-8, sjis-win") should leave "あああ" as it is if it is already utf-8, and convert it to utf-8 if it is Shift_JIS (at least as far as the official documentation is concerned), but it actually does a terrible job of it.
It takes a string passed in as utf-8, forces it to be interpreted as Shift_JIS, and then converts that to utf-8, which gives us a nonsensical value.


How to avoid garbled characters

▼ “mb_convert_encoding will work if you specify the source character code”
mb_convert_encoding works correctly as long as you specify the source character code. In other words, convert from Shift_JIS to utf-8 only in the case of Shift_JIS and do nothing in the case of utf-8, which basically works.

mb_convert_encoding("あああ", "utf-8", "sjis-win")
The above code works fine if "あああ" is Shift_JIS. If the string is already utf-8, use it as it is without converting.
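As a small illustration of converting with an explicit source encoding, the sketch below builds the Shift_JIS bytes for "あああ" by hand (each "あ" is the two bytes 0x82 0xA0) and converts them. The variable names are only for this example, and the result is printed as hex so that it does not depend on the terminal's encoding.

```php
<?php
// "あああ" encoded as Shift_JIS: each "あ" is the byte pair 0x82 0xA0.
$sjis = "\x82\xA0\x82\xA0\x82\xA0";

// The source encoding is given explicitly, so nothing has to be guessed.
$utf8 = mb_convert_encoding($sjis, 'UTF-8', 'SJIS-win');

// In UTF-8, "あ" (U+3042) is the three bytes E3 81 82.
echo bin2hex($utf8), PHP_EOL; // e38182e38182e38182
```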


How do we determine the character code?

There is a function called mb_detect_encoding, but if it worked correctly in the first place, the problem could have been solved with mb_convert_encoding alone.

If the standard function doesn’t work, you’ll have to do it manually

Checking whether a byte string conforms to the utf-8 specification is not very difficult, and that check is enough to solve the problem.
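One way such a manual check might look is sketched below. The function name is_valid_utf8 is hypothetical, and this is not necessarily the check QuizGenerator uses; it simply walks the bytes and applies the well-formedness rules of utf-8 (lead byte ranges, continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF).

```php
<?php
// Hypothetical helper: returns true when $bytes is well-formed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b <= 0x7F) {                      // 1-byte sequence (ASCII)
            $i++;
            continue;
        }
        if ($b >= 0xC2 && $b <= 0xDF) {        // lead byte of a 2-byte sequence
            $extra = 1;
            $min = 0x80;
            $max = 0xBF;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {  // lead byte of a 3-byte sequence
            $extra = 2;
            $min = ($b === 0xE0) ? 0xA0 : 0x80; // rejects overlong forms
            $max = ($b === 0xED) ? 0x9F : 0xBF; // rejects surrogates
        } elseif ($b >= 0xF0 && $b <= 0xF4) {  // lead byte of a 4-byte sequence
            $extra = 3;
            $min = ($b === 0xF0) ? 0x90 : 0x80; // rejects overlong forms
            $max = ($b === 0xF4) ? 0x8F : 0xBF; // rejects values above U+10FFFF
        } else {
            return false;                      // stray continuation or invalid lead byte
        }
        if ($i + $extra >= $len) {
            return false;                      // sequence is cut off at the end
        }
        // The first continuation byte has the restricted range chosen above.
        $c = ord($bytes[$i + 1]);
        if ($c < $min || $c > $max) {
            return false;
        }
        // Any remaining continuation bytes must be in 0x80..0xBF.
        for ($k = 2; $k <= $extra; $k++) {
            $c = ord($bytes[$i + $k]);
            if ($c < 0x80 || $c > 0xBF) {
                return false;
            }
        }
        $i += $extra + 1;
    }
    return true;
}
```

For instance, is_valid_utf8("\xE3\x81\x82") (utf-8 "あ") returns true, while is_valid_utf8("\x82\xA0") (Shift_JIS "あ") returns false because 0x82 cannot start a utf-8 sequence. Note that a few short Shift_JIS strings, such as certain pairs of half-width katakana, also happen to be valid utf-8, so this kind of check is a heuristic rather than a proof.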


What if it’s not utf-8?

If it’s not utf-8, we simply treat it as Shift_JIS. I can’t support other encodings until someone actually brings in an euc-jp or utf-16 file.
People who upload such files presumably know about encodings, so they are asked to convert the files themselves. The specification says to use utf-8 in the first place.
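Put together, the policy described above (keep utf-8 as it is, treat everything else as Shift_JIS) could be sketched roughly as follows. The function name to_utf8 is hypothetical, and mb_check_encoding is used here only as a compact stand-in for a manual utf-8 check; it validates a string against one named encoding and does not try to guess between several.

```php
<?php
// Hypothetical helper: normalise an uploaded text to UTF-8.
function to_utf8(string $bytes): string
{
    // Stand-in for a manual UTF-8 well-formedness check (an assumption of this sketch).
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already UTF-8: leave it untouched
    }
    // Anything else is assumed to be Shift_JIS (Windows-31J).
    return mb_convert_encoding($bytes, 'UTF-8', 'SJIS-win');
}

// Shift_JIS "あ" (0x82 0xA0) becomes UTF-8 "あ" (E3 81 82).
echo bin2hex(to_utf8("\x82\xA0")), PHP_EOL; // e38182
```

Files in euc-jp or utf-16 would be misread as Shift_JIS by this sketch, which matches the limitation stated above.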


Another trap

I have used the term Shift_JIS many times in this article, but in many cases, what is currently called Shift_JIS is Windows-31J (MS932), which is an extension of Shift_JIS.
However, if you specify Shift_JIS in PHP, all characters other than those specified in the original Shift_JIS specification will be garbled. Unless you have a special reason, please use Windows-31J or sjis-win instead of Shift_JIS.
The official documentation says to use Windows-31J, yet only sjis-win is listed there, which is a strange situation. At least with PHP 7.3.13, both of them worked fine.
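To see the difference between these names on your own PHP version, a quick check along the following lines may help. It converts the two bytes 0x87 0x40, which are the circled digit one (①) in Windows-31J and are not part of the original Shift_JIS table. What the plain SJIS row prints can vary with the PHP version, so no output is claimed for it here; SJIS-win and Windows-31J should both yield the utf-8 bytes for ①.

```php
<?php
// 0x87 0x40 is the circled digit one in Windows-31J (CP932);
// it lies outside the original Shift_JIS (JIS X 0208) table.
$bytes = "\x87\x40";

foreach (['SJIS', 'SJIS-win', 'Windows-31J'] as $from) {
    $out = mb_convert_encoding($bytes, 'UTF-8', $from);
    // Hex output avoids depending on the terminal's encoding.
    printf("%-12s => %s\n", $from, bin2hex($out));
}
```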


Will character code be unified in the future?

When I first started web programming about 20 years ago, UTF-8 was not yet the standard and garbled characters were an everyday occurrence. In web development, at least, things seem to be converging on utf-8.
Smartphones appeared after utf-8 had spread, so they are built on the premise of utf-8, and files in other encodings such as Shift_JIS tend to be garbled on them. On the other hand, the files exchanged on Windows are often Shift_JIS.



In this article, I have introduced “How to unify the encoding of text files into utf-8 in PHP”. As we are a company from Japan, we want to make our products easy for Japanese people to use, so we will keep Shift_JIS in mind for a while longer.


