How to unify text file encoding to utf-8 in PHP

 learningBOX, Admin Blog

Handling of character codes

I'm Nishimura, the creator of QuizGenerator. I did not take part in the early development of learningBOX, but I have been involved since the release of version 2.0. I thought I should talk a bit about the handling of character codes, so I've put together an article about it.
This article introduces how to unify text file encoding to utf-8 in PHP. I hope you find it useful.

Content

  • 1. Shift-JIS is unavoidable.
  • 2. How to avoid garbled text
  • 3. How do we determine the character code?
  • 4. Summary

 

Shift-JIS is unavoidable.

Systems such as learningBOX and QuizGenerator may receive text files such as CSVs. In a modern web system, you should use utf-8 as the character code for text files, and ideally you would not accept anything else. In practice, however, Shift_JIS files are uploaded often, and the result gets reported as a defect.
So QuizGenerator converts Shift_JIS files to utf-8 and then continues processing.

What is Shift-JIS?

Shift_JIS is one of the character codes for Japanese text standardized under the JIS standards. It is a variant of the JIS code: while the JIS code is a 7-bit encoding, Shift_JIS is an 8-bit encoding that represents ASCII characters and half-width katakana in 1 byte and kanji and other full-width characters in 2 bytes.

 

I don't trust mb_convert_encoding.

PHP has a function called mb_convert_encoding, which converts text from one character code to another. At first glance, it looks as if this function alone can determine the character code and convert the text to utf-8, but in practice its detection is not trustworthy.

Consider mb_convert_encoding("ah", "utf-8", "utf-8, sjis-win"), where "ah" stands in for the input text. If the input is utf-8, it should be returned as utf-8 unchanged, and if it is Shift_JIS, it should be converted to utf-8 (at least as far as the official documentation is concerned). In practice, it can do something awful.
It can take a string passed in as utf-8, force it to be interpreted as Shift_JIS, which breaks it, and then convert the result to utf-8, leaving you with an incomprehensible value.

 

How to avoid garbled text

mb_convert_encoding works correctly as long as you specify a single source character code. In other words, convert from Shift_JIS to utf-8 only when the input is Shift_JIS, and do nothing when it is already utf-8. That basically works.

mb_convert_encoding("ah", "utf-8", "sjis-win")
The above code works correctly when the input string (shown here as "ah") is Shift_JIS. When the input is already utf-8, skip the conversion and use it as it is.
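
As a small illustration of naming the source character code explicitly, the sketch below converts the Shift_JIS bytes of one hiragana character. The variable names are made up for this example, and the input is written as byte escapes so that the encoding of the script file itself does not matter.

```php
<?php
// Bytes of the hiragana "a" (U+3042) encoded as Shift_JIS: 0x82 0xA0.
$sjisBytes = "\x82\xA0";

// Name one source encoding instead of passing a list of candidates,
// so mb_convert_encoding never has to guess.
$utf8 = mb_convert_encoding($sjisBytes, 'UTF-8', 'SJIS-win');

echo bin2hex($utf8), PHP_EOL; // e38182, the UTF-8 bytes of U+3042
```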

 

How do we determine the character code?

There is a function called mb_detect_encoding, but if it worked correctly in the first place, the problem would already be solved just by using mb_convert_encoding.

If the standard function doesn't work, you'll have to do it on your own.

Checking whether a byte sequence meets the utf-8 specification is not that hard, so just do it.
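
A hand-written check along these lines might look like the sketch below. It is not the code used in QuizGenerator. The function name looksLikeUtf8 is hypothetical, and the rules it applies (lead byte ranges, continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF) are the general well-formedness rules for utf-8.

```php
<?php
/**
 * Hypothetical helper: returns true if $bytes is well-formed UTF-8.
 */
function looksLikeUtf8(string $bytes): bool
{
    $len = strlen($bytes);
    $i = 0;
    while ($i < $len) {
        $b = ord($bytes[$i]);
        if ($b <= 0x7F) {                       // 1-byte sequence (ASCII)
            $need = 0; $cp = $b; $min = 0;
        } elseif ($b >= 0xC2 && $b <= 0xDF) {   // lead byte of a 2-byte sequence
            $need = 1; $cp = $b & 0x1F; $min = 0x80;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {   // lead byte of a 3-byte sequence
            $need = 2; $cp = $b & 0x0F; $min = 0x800;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {   // lead byte of a 4-byte sequence
            $need = 3; $cp = $b & 0x07; $min = 0x10000;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        if ($i + $need >= $len) {
            return false; // sequence is cut off at the end of the input
        }
        for ($k = 1; $k <= $need; $k++) {
            $c = ord($bytes[$i + $k]);
            if (($c & 0xC0) !== 0x80) {
                return false; // not a continuation byte (10xxxxxx)
            }
            $cp = ($cp << 6) | ($c & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates, and values above U+10FFFF.
        if ($cp < $min || ($cp >= 0xD800 && $cp <= 0xDFFF) || $cp > 0x10FFFF) {
            return false;
        }
        $i += $need + 1;
    }
    return true;
}

var_dump(looksLikeUtf8("\xE3\x81\x82")); // bool(true): UTF-8 bytes of U+3042
var_dump(looksLikeUtf8("\x82\xA0"));     // bool(false): the same character in Shift_JIS
```

Keep in mind that this is a heuristic: pure ASCII passes as utf-8 (harmlessly), and a short Shift_JIS input, such as certain runs of half-width katakana, can occasionally happen to be valid utf-8 as well. PHP's built-in mb_check_encoding($bytes, 'UTF-8') performs a comparable validity check if you prefer not to maintain your own.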

 

What if it's not utf-8?

If it's not utf-8... treat it as Shift_JIS. I won't support euc-jp or utf-16 until someone actually brings me such a file. People who produce files in those encodings presumably know about character codes, so I ask them to handle the conversion themselves.
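
Putting the two rules together (leave utf-8 alone, treat everything else as Shift_JIS) could look roughly like this. The function name toUtf8 is hypothetical, and mb_check_encoding is used here only as a stand-in for a utf-8 validity check; a hand-written check could be swapped in.

```php
<?php
/**
 * Hypothetical helper: returns $bytes as UTF-8.
 * Assumption: anything that is not valid UTF-8 is Shift_JIS (Windows-31J).
 * euc-jp, utf-16 and other encodings are deliberately not handled.
 */
function toUtf8(string $bytes): string
{
    // Already valid UTF-8: use it as it is.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    // Otherwise convert with one explicitly named source encoding.
    return mb_convert_encoding($bytes, 'UTF-8', 'SJIS-win');
}

echo bin2hex(toUtf8("\x82\xA0")), PHP_EOL;     // e38182 (converted from Shift_JIS)
echo bin2hex(toUtf8("\xE3\x81\x82")), PHP_EOL; // e38182 (already UTF-8, unchanged)
```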

 

Another trap.

I've used the term Shift_JIS many times in this article, but what people call Shift_JIS today is often actually Windows-31J (MS932), an extension of Shift_JIS.
However, if you specify Shift_JIS in PHP, characters outside the original Shift_JIS specification will be garbled. Unless you have a special reason, please use Windows-31J or sjis-win instead of Shift_JIS. Oddly, the official documentation says to use Windows-31J, yet only sjis-win appears in its list. At least with PHP 7.3.13, both names worked fine.
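
To see the difference on your own PHP version, you could compare the two encoding names on a character that exists only in the Windows extension. The sketch below uses the circled digit one (U+2460), which Windows-31J encodes as the bytes 0x87 0x40. What plain SJIS returns for these bytes may vary between PHP versions, so no output is claimed for it here.

```php
<?php
// Circled digit one in Windows-31J (an NEC special character): 0x87 0x40.
$bytes = "\x87\x40";

$asSjisWin = mb_convert_encoding($bytes, 'UTF-8', 'SJIS-win');
$asSjis    = mb_convert_encoding($bytes, 'UTF-8', 'SJIS');

echo 'SJIS-win: ', bin2hex($asSjisWin), PHP_EOL; // e291a0, the UTF-8 bytes of U+2460
echo 'SJIS:     ', bin2hex($asSjis), PHP_EOL;    // compare this with the line above
```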

 

Will a unified future ever come?

About 20 years ago, when I first started web programming, UTF-8 was non-standard and garbled characters were an everyday occurrence. Today it seems safe to say that, on the web, unification on utf-8 is largely complete.
Smartphones were born after the spread of utf-8, so they are built on the premise of utf-8 (and therefore files in other encodings, such as Shift_JIS, tend to come out garbled). On the other hand, files exchanged on Windows are often Shift_JIS.

 

Summary

In this article, I have introduced how to unify the encoding of text files to utf-8 in PHP. As a company from Japan, we want to make our products easy to use for Japanese users, so we will keep Shift_JIS in mind for a while longer. (I really wish I could forget about IE11...)
