{"id":31943,"date":"2020-02-17T14:54:53","date_gmt":"2020-02-17T05:54:53","guid":{"rendered":"https:\/\/learningbox.online\/?p=31943"},"modified":"2020-09-11T17:27:11","modified_gmt":"2020-09-11T08:27:11","slug":"php-mb-convert-encoding-utf8-shift-jis","status":"publish","type":"post","link":"https:\/\/learningbox.online\/en\/column\/php-mb-convert-encoding-utf8-shift-jis\/","title":{"rendered":"How to unify text file encoding to utf-8 in PHP"},"content":{"rendered":"<p class=\"well\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/f2a31f2538a98c122fe3a9d2b7f04733.png\" alt=\"Handling of character codes\" width=\"889\" height=\"304\" class=\"aligncenter size-full wp-image-32082\" srcset=\"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/f2a31f2538a98c122fe3a9d2b7f04733.png 889w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/f2a31f2538a98c122fe3a9d2b7f04733-300x103.png 300w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/f2a31f2538a98c122fe3a9d2b7f04733-768x263.png 768w\" sizes=\"auto, (max-width: 889px) 100vw, 889px\" \/><\/p>\n<p>I'm Nishimura, the creator of QuizGenerator. In fact, I didn't take part in the early development of learningBOX; I joined around the release of version 2.0. While working on it, I felt I should say a bit more about <span class=\"yellowline\"><strong>handling of character codes<\/strong><\/span>, so I've put together this article.<br \/>\nIn this article, I introduce <span style=\"border-bottom:solid 2px red;\"><strong>how to unify text file encoding to utf-8 in PHP<\/strong><\/span>. I hope you find it useful.<\/p>\n<p class=\"mokujimidasi\">Contents<\/p>\n<ul class=\"mokuji\">\n<li>1. Shift-JIS is unavoidable.<\/li>\n<li>2. 
How to avoid garbled text<\/li>\n<li>3. How do we determine the character code?<\/li>\n<li>4. Summary<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h2><b>Shift-JIS is unavoidable.<\/b><\/h2>\n<p>Systems such as learningBOX and QuizGenerator sometimes receive text files such as CSVs. In a modern web system, text files should be encoded in utf-8, and we would rather not accept any other character code, but in practice <strong>Shift_JIS<\/strong> files are rarely<del datetime=\"2020-02-14T07:37:00+00:00\">often<\/del> uploaded and then reported as a defect.<br \/>\nSo QuizGenerator converts <strong>Shift_JIS<\/strong> files to utf-8 before processing continues.<\/p>\n<div class=\"box27\">\n    <span class=\"box-title\"><b>What is Shift-JIS?<\/b><\/span><\/p>\n<p>Shift JIS is one of the character codes for Japanese text and is standardized as part of the JIS standards. It is an adaptation of the JIS code: while the JIS code represents characters in 7-bit units, Shift JIS is a variable-width encoding that uses 1 byte for ASCII-range characters and half-width katakana and 2 bytes (16 bits) for kanji and other full-width characters.<\/p>\n<\/div>\n<p>&nbsp;<\/p>\n<h3><b>I don't trust mb_convert_encoding.<\/b><\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/PHP-1024x229.png\" alt=\"PHP\" width=\"1024\" height=\"229\" class=\"aligncenter size-large wp-image-32076\" srcset=\"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/PHP-1024x229.png 1024w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/PHP-300x67.png 300w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/PHP-768x171.png 768w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/PHP.png 1133w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><br \/>\nPHP has a function called mb_convert_encoding, which converts a string from one character code to another. 
At first glance, it looks as if this function alone could determine the character code and convert the text to utf-8, but in practice it is not trustworthy.<\/p>\n<p>If you call mb_convert_encoding(\"ah\", \"utf-8\", \"utf-8, sjis-win\"), then, at least as far as the official documentation goes, \"ah\" should be left as it is if it is utf-8 and converted to utf-8 if it is Shift_JIS. In practice, it does something terrible.<br \/>\nIt can take a string passed in as utf-8, force it to be interpreted as Shift_JIS, corrupt it, convert the result to utf-8, and return an incomprehensible value.<\/p>\n<p>&nbsp;<\/p>\n<h2><b>How to avoid garbled text<\/b><\/h2>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/550a5b916cb51308120dc04f2362c166.png\" alt=\"PHP - Garbled characters\" width=\"784\" height=\"521\" class=\"aligncenter size-full wp-image-32078\" srcset=\"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/550a5b916cb51308120dc04f2362c166.png 784w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/550a5b916cb51308120dc04f2362c166-300x199.png 300w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/550a5b916cb51308120dc04f2362c166-768x510.png 768w\" sizes=\"auto, (max-width: 784px) 100vw, 784px\" \/><br \/>\n<strong>If you specify the source character code, mb_convert_encoding will work properly.<\/strong><br \/>\nAs long as you specify a single source character code, mb_convert_encoding converts correctly. In other words, it basically works if you convert from Shift_JIS to utf-8 only when the input is Shift_JIS and do nothing when it is utf-8.<\/p>\n<p><span style=\"border-bottom:solid 2px red;\"><strong>mb_convert_encoding(\"ah\", \"utf-8\", \"sjis-win\")<\/strong><\/span><br \/>\nThe above code works fine if \"ah\" is Shift_JIS. 
If it is already utf-8, use the string as it is.<\/p>\n<p>&nbsp;<\/p>\n<h2><b>How do we determine the character code?<\/b><\/h2>\n<p>There is a function called mb_detect_encoding, but if it worked correctly in the first place, mb_convert_encoding alone would have solved the problem.<\/p>\n<h3><b>If the standard function doesn't work, you'll have to do it on your own.<\/b><\/h3>\n<p>Checking whether a byte sequence conforms to the utf-8 specification is not that hard, so just write it yourself.<br \/>\n<script src=\"https:\/\/gist.github.com\/ynishi2014\/5a1809d126273898e2a1e6e9afc0f077.js\"><\/script><\/p>\n<p>&nbsp;<\/p>\n<h3><b>What if it's not utf-8?<\/b><\/h3>\n<p>If it's not utf-8 ... treat it as Shift_JIS. I won't add support for euc-jp or utf-16 until someone actually brings such files. People who use those encodings presumably know about character codes, so I ask them to handle the conversion themselves.<\/p>\n<p>&nbsp;<\/p>\n<h3><b>Another trap.<\/b><\/h3>\n<p>I've used the term Shift_JIS many times in this article, but what is called Shift_JIS today is often actually <strong>Windows-31J (MS932)<\/strong>, an extension of Shift_JIS.<br \/>\nHowever, if you specify Shift_JIS in PHP, all characters other than those defined in the original Shift_JIS specification will be garbled. Unless you have a special reason, please use Windows-31J or sjis-win instead of Shift_JIS. 
Strangely, the official documentation says to use Windows-31J while listing only sjis-win, but at least with PHP 7.3.13 both names worked fine.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/2020-02-14_1708.png\" alt=\"\" width=\"951\" height=\"422\" class=\"alignnone size-full wp-image-32022\" srcset=\"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/2020-02-14_1708.png 951w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/2020-02-14_1708-300x133.png 300w, https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/2020-02-14_1708-768x341.png 768w\" sizes=\"auto, (max-width: 951px) 100vw, 951px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h3><b>Will a unified future ever come?<\/b><\/h3>\n<p>About 20 years ago, when I first started web programming, utf-8 was not yet the standard and garbled characters were an everyday occurrence. It seems safe to say that the web has since been unified on utf-8.<br \/>\nSmartphones appeared after utf-8 had spread, so they are built on the premise of utf-8 (and therefore tend to garble files in other encodings such as Shift_JIS). On the other hand, files exchanged on Windows are often Shift_JIS.<\/p>\n<p>&nbsp;<\/p>\n<h2><b>Summary<\/b><\/h2>\n<p>In this article, I have introduced \"How to unify the encoding of text files into utf-8 in PHP\". As we are a company from Japan, we would like to keep Shift_JIS in mind for a while longer as we try to make our products easy to use for Japanese users.<del datetime=\"2020-02-14T07:37:00+00:00\">(I really wish I could forget about IE11...)<\/del><\/p>","protected":false},"excerpt":{"rendered":"I am Nishimura, the creator of QuizGenerator. 
Actually, I did not participate in the initial development of learningBOX, and around the time of the release of version 2.0...","protected":false},"author":5,"featured_media":32064,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_crdt_document":"","content-type":"","_lmt_disableupdate":"yes","_lmt_disable":"","advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[41,40],"tags":[177,176,178],"class_list":["post-31943","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-learningbox","category-blog","tag-utf-8","tag-176","tag-178"],"acf":[],"modified_by":null,"jetpack_featured_media_url":"https:\/\/learningbox.online\/wp-content\/uploads\/2020\/02\/1696211.jpg","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pgWaOl-8jd","_links":{"self":[{"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/posts\/31943","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/comments?post=31943"}],"version-history":[{"count":153,"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/posts\/31943\/revisions"}],"predecessor-version":[{"id":32105,"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/posts\/31943\/revisions\/32105"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/media\/32064"}],"wp:attachment":[{"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/media?parent=31943"}],"wp:term":[{"taxonomy":"category","embeddable":true,"
href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/categories?post=31943"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/learningbox.online\/en\/wp-json\/wp\/v2\/tags?post=31943"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}