How to unify text file encoding to utf-8 in PHP
I am Nishimura, the creator of QuizGenerator. Actually, I did not participate in the initial development of learningBOX, but started participating in earnest around the release of version 2.0. I recently received a pull request regarding the handling of character encoding, and thought I should say a little more about the topic, so I put it together in this article.
In this article, I will introduce "How to unify the character encoding of text files to utf-8 in PHP". I hope you find it useful.
Table of Contents
1. Shift-JIS is unavoidable
2. How to avoid garbled characters
3. How to determine character encoding
4. Summary
Shift-JIS is unavoidable.
Systems like learningBOX and QuizGenerator sometimes receive text files such as CSV. In modern Web systems, the character encoding of text files should be utf-8, and we would rather not accept any other encoding. In reality, however, Shift_JIS files are occasionally uploaded and then reported as a problem.
Therefore, when QuizGenerator receives a Shift_JIS file, it converts it to utf-8 and continues processing.
What is Shift-JIS?
Shift-JIS is a character encoding, standardized under JIS, that covers a variety of characters including Japanese. It is used as the standard Japanese encoding on many personal computers. It is an adaptation of the JIS code defined in the JIS standard. While the JIS code is built from 7-bit units, Shift-JIS uses 8-bit bytes: ASCII-range characters and half-width katakana take 1 byte, and other characters such as kanji take 2 bytes (16 bits).
mb_convert_encoding is not reliable.
PHP has a function called mb_convert_encoding to perform character encoding conversion. At first glance, this function seems to be able to determine the character encoding and convert to utf-8, but in fact, this function is not reliable.
mb_convert_encoding("aaa", "utf-8", "utf-8, sjis-win") should leave "aaa" as it is if it is utf-8, and convert it to utf-8 if it is Shift_JIS (at least that is how I read the official documentation), but in fact, it does something ridiculous.
It forcefully interprets a string passed in as utf-8 as Shift_JIS, breaks it, and converts it to utf-8, returning a value that is incomprehensible.
How to avoid character corruption
mb_convert_encoding works properly as long as you specify the character encoding of the conversion source. In other words, the strategy of converting from Shift_JIS to utf-8 only if it is Shift_JIS, and doing nothing if it is utf-8, basically works.
mb_convert_encoding("aaaa", "utf-8", "sjis-win")
The above code works fine if "aaaa" is Shift_JIS. If the string is already utf-8, skip the conversion and use it as is.
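As a minimal sketch of the call above, assuming the mbstring extension is enabled: the sample bytes below are my own illustration (the word "日本語" encoded as Shift_JIS), written as escape sequences so the example does not depend on the encoding of the source file.

```php
<?php
// "日本語" encoded as Shift_JIS: 0x93FA 0x967B 0x8CEA (sample data for illustration).
$sjisBytes = "\x93\xfa\x96\x7b\x8c\xea";

// Convert with the source encoding stated explicitly, so nothing is guessed.
$utf8 = mb_convert_encoding($sjisBytes, "UTF-8", "SJIS-win");

// The same word encoded as utf-8: E6 97 A5, E6 9C AC, E8 AA 9E.
$expected = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";

var_dump($utf8 === $expected);
```

Because the source encoding is given as a single name rather than a list of candidates, the function has no detection step to get wrong.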
How do we determine the character encoding?
There is a function called mb_detect_encoding, but if it worked correctly, mb_convert_encoding alone would have solved the problem in the first place.
If the standard functions don't work, you're on your own.
Checking whether a byte sequence conforms to the utf-8 specification is not that difficult, so we simply implement that check ourselves.
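One way such a check could look is sketched below. This is not the actual QuizGenerator code; the function name is_valid_utf8 is hypothetical, and the byte ranges follow the well-formed sequences defined for utf-8 (no overlong forms, no surrogates, nothing above U+10FFFF).

```php
<?php
/**
 * Hypothetical helper: returns true when $bytes is well-formed utf-8.
 */
function is_valid_utf8(string $bytes): bool
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b <= 0x7F) {                       // 1 byte: ASCII
            $i++;
            continue;
        }
        // $need = number of continuation bytes,
        // [$min, $max] = allowed range for the FIRST continuation byte.
        if ($b >= 0xC2 && $b <= 0xDF) {         // 2-byte sequence
            $need = 1; $min = 0x80; $max = 0xBF;
        } elseif ($b === 0xE0) {                // 3 bytes, reject overlong forms
            $need = 2; $min = 0xA0; $max = 0xBF;
        } elseif ($b === 0xED) {                // 3 bytes, reject surrogates
            $need = 2; $min = 0x80; $max = 0x9F;
        } elseif ($b >= 0xE1 && $b <= 0xEF) {   // other 3-byte sequences
            $need = 2; $min = 0x80; $max = 0xBF;
        } elseif ($b === 0xF0) {                // 4 bytes, reject overlong forms
            $need = 3; $min = 0x90; $max = 0xBF;
        } elseif ($b >= 0xF1 && $b <= 0xF3) {   // other 4-byte sequences
            $need = 3; $min = 0x80; $max = 0xBF;
        } elseif ($b === 0xF4) {                // 4 bytes, stay within U+10FFFF
            $need = 3; $min = 0x80; $max = 0x8F;
        } else {
            return false;                       // not a valid lead byte
        }
        if ($i + $need >= $len) {
            return false;                       // sequence is cut off
        }
        $c = ord($bytes[$i + 1]);
        if ($c < $min || $c > $max) {
            return false;
        }
        for ($k = 2; $k <= $need; $k++) {       // remaining continuation bytes
            $c = ord($bytes[$i + $k]);
            if ($c < 0x80 || $c > 0xBF) {
                return false;
            }
        }
        $i += $need + 1;
    }
    return true;
}

var_dump(is_valid_utf8("\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e")); // utf-8 sample
var_dump(is_valid_utf8("\x93\xfa\x96\x7b\x8c\xea"));             // Shift_JIS sample
```

Note that a check like this only says whether the bytes are well-formed utf-8. A few short Shift_JIS byte sequences (for example, certain pairs of half-width katakana) also happen to be well-formed utf-8, so the result is a strong hint rather than a proof.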
What if it is not utf-8?
If it's not utf-8, we treat it as Shift_JIS... We can't support people who submit euc-jp or utf-16 files. People who do that should at least know about encodings, so please handle it yourselves. In the first place, the specification asks for utf-8.
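Put together, the strategy might be sketched as follows. The function name to_utf8 is hypothetical, the mbstring extension is assumed to be enabled, and for brevity the utf-8 check here uses preg_match with the u modifier (which fails on malformed utf-8 input) as a compact stand-in for a hand-written check.

```php
<?php
/**
 * Hypothetical helper: returns $bytes as utf-8.
 * Valid utf-8 is returned untouched; anything else is assumed to be
 * Shift_JIS (Windows-31J) and converted.
 */
function to_utf8(string $bytes): string
{
    // With the "u" modifier, preg_match() returns false when the subject
    // is not well-formed utf-8, and 1 when the empty pattern matches.
    if (preg_match('//u', $bytes) === 1) {
        return $bytes;
    }
    // Not utf-8: treat it as Shift_JIS and state the source encoding explicitly.
    return mb_convert_encoding($bytes, 'UTF-8', 'SJIS-win');
}

$fromSjis = to_utf8("\x93\xfa\x96\x7b\x8c\xea");             // Shift_JIS sample
$fromUtf8 = to_utf8("\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e"); // utf-8 sample

var_dump($fromSjis === $fromUtf8);
```

For an uploaded file, the same function could be applied to the whole contents read with file_get_contents before parsing it as CSV. A euc-jp or utf-16 file would still come out garbled, which matches the policy above.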
Another trap
I have used the term Shift_JIS many times in this article, but what is currently referred to as Shift_JIS is often Windows-31J (MS932), which is an extension of Shift_JIS.
However, if you specify Shift_JIS in PHP, characters outside the original Shift_JIS specification will be corrupted. Unless you have a special reason, use Windows-31J or sjis-win instead of Shift_JIS. The official documentation says to use Windows-31J, yet only sjis-win appears in its list of encodings, which is a strange state of affairs. At least with PHP 7.3.13, either name worked fine.
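The difference can be seen with a character that exists only in the Windows extension. The sketch below assumes the mbstring extension is enabled and uses the circled digit "①" (bytes 0x87 0x40 in Windows-31J) as an example of my own choosing. No expected output is given for the plain Shift_JIS line, since what comes out may depend on the PHP version.

```php
<?php
// 0x87 0x40 is the circled digit "①" in Windows-31J.
// This position is unassigned in the original Shift_JIS character set.
$bytes = "\x87\x40";

$asWindows31J = mb_convert_encoding($bytes, 'UTF-8', 'SJIS-win');
$asShiftJis   = mb_convert_encoding($bytes, 'UTF-8', 'SJIS');

// Print as hex so the result is readable on any terminal.
echo bin2hex($asWindows31J), "\n"; // e291a0, i.e. U+2460 "①"
echo bin2hex($asShiftJis), "\n";   // compare this line with the one above
```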
Will a unified future ever come?
About 20 years ago, when I first started web programming, UTF-8 was not the standard and garbled characters were a common occurrence. Today, in the Web world, it seems safe to say that the unification to utf-8 has been completed.
Smartphones were designed on the assumption of utf-8, since they were born in a world where utf-8 was already widely used. On the other hand, files exchanged on Windows are still often in Shift_JIS.
Conclusion
In this article, we introduced "How to unify the character encoding of text files to utf-8 in PHP". As a company from Japan, we would like to keep usability for Japanese users in mind, so we will continue to develop without forgetting about Shift_JIS for a while longer. (We really want to forget about IE11...)