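The fastest way to split a Japanese string into individual characters in PHP
For UTF-8 strings, this is probably the fastest approach. The empty pattern // matches at every position, and the u modifier makes PCRE treat the subject as UTF-8, so the split falls between characters rather than between bytes. PREG_SPLIT_NO_EMPTY drops the empty pieces at the start and end.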
<?php
$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";
$chars = preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($chars);
?>
Reference: http://we-b.anchortag.jp/434.html
Result
Array ( [0] => 科 [1] => 学 [2] => の [3] => 力 [4] => で [5] => は [6] => ど [7] => う [8] => し [9] => よ [10] => う [11] => も [12] => で [13] => き [14] => な [15] => い [16] => 、 [17] => 魑 [18] => 魅 [19] => 魍 [20] => 魎 [21] => な [22] => ど [23] => の [24] => 奇 [25] => 怪 [26] => な [27] => 輩 [28] => に [29] => 立 [30] => ち [31] => 向 [32] => か [33] => う [34] => 胡 [35] => 散 [36] => 臭 [37] => い [38] => 男 )
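One thing the snippet above does not cover is input that is not valid UTF-8. With the u modifier, PCRE rejects such a subject, and the failure can be detected through preg_last_error(). The following is a minimal sketch of a wrapper, not code from the original post. The function name split_utf8_chars and the choice to return null on failure are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original post): split a UTF-8 string
// into an array of characters, or return null if PCRE rejects the input.
function split_utf8_chars($str)
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

    // Check both the return value and the PCRE error state, so that
    // malformed UTF-8 is reported instead of being silently passed on.
    if ($chars === false || preg_last_error() !== PREG_NO_ERROR) {
        return null;
    }
    return $chars;
}

print_r(split_utf8_chars("魑魅魍魎"));    // four one-character elements
var_dump(split_utf8_chars("\xff"));      // a lone 0xFF byte is not valid UTF-8
```

Returning null keeps the caller's check simple. Throwing an exception would work just as well if that fits the surrounding code better.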
Let's compare this with extracting the characters one at a time using mb_substr.
The source code is taken from tiny_segmenter.phps (http://www.programming-magic.com/20080816010106/).
<?php
$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";
$result = array();
$length = mb_strlen($str, 'UTF-8');
for ($i = 0; $i < $length; ++$i) {
    $result[] = mb_substr($str, $i, 1, 'UTF-8');
}
print_r($result);
?>
Benchmark results with ab -n 10000
preg_split
Requests per second: 1242.96 [#/sec] (mean)
mb_substr
Requests per second: 620.58 [#/sec] (mean)
preg_split is about twice as fast (1242.96 vs. 620.58 requests per second).
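The ab figures time the whole HTTP request, including Apache and PHP startup. To compare only the two splitting techniques, they can also be timed in-process. The following is a rough sketch under stated assumptions, not the benchmark used in this post: the helper names (split_with_preg, split_with_mb_substr, time_calls) and the iteration count are made up for illustration, and the timings will vary by PHP version and machine.

```php
<?php
// Hypothetical helpers mirroring the two scripts compared in the post.
function split_with_preg($str)
{
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

function split_with_mb_substr($str)
{
    $result = array();
    $length = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $length; ++$i) {
        $result[] = mb_substr($str, $i, 1, 'UTF-8');
    }
    return $result;
}

// Call $fn on $str $iterations times and return the elapsed wall-clock seconds.
function time_calls($fn, $str, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; ++$i) {
        call_user_func($fn, $str);
    }
    return microtime(true) - $start;
}

$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";
$iterations = 10000; // arbitrary; raise it for more stable numbers

foreach (array('split_with_preg', 'split_with_mb_substr') as $fn) {
    printf("%-22s %.4f sec\n", $fn, time_calls($fn, $str, $iterations));
}
```

Running the script several times and comparing the ratio between the two lines gives a fairer picture than any single run.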
Benchmark details
Environment: Debian lenny stable (on VMware).
preg_split version
debian:~# ab -n 10000 "http://192.168.195.129/~rti/test/preg_split.php"
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.195.129 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        Apache/2.2.9
Server Hostname:        192.168.195.129
Server Port:            80

Document Path:          /~rti/test/preg_split.php
Document Length:        624 bytes

Concurrency Level:      1
Time taken for tests:   8.045 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      9770000 bytes
HTML transferred:       6240000 bytes
Requests per second:    1242.96 [#/sec] (mean)
Time per request:       0.805 [ms] (mean)
Time per request:       0.805 [ms] (mean, across all concurrent requests)
Transfer rate:          1185.91 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    1   0.3      1      17
Waiting:        0    0   0.1      0       5
Total:          0    1   0.3      1      17

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%     17 (longest request)
mb_substr version (served as mb_split.php)
debian:~# ab -n 10000 "http://192.168.195.129/~rti/test/mb_split.php"
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.195.129 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests

Server Software:        Apache/2.2.9
Server Hostname:        192.168.195.129
Server Port:            80

Document Path:          /~rti/test/mb_split.php
Document Length:        624 bytes

Concurrency Level:      1
Time taken for tests:   16.114 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      9770000 bytes
HTML transferred:       6240000 bytes
Requests per second:    620.58 [#/sec] (mean)
Time per request:       1.611 [ms] (mean)
Time per request:       1.611 [ms] (mean, across all concurrent requests)
Transfer rate:          592.10 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       2
Processing:     1    2   0.3      2      15
Waiting:        0    1   0.3      1      15
Total:          1    2   0.3      2      15

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      2
  95%      2
  98%      2
  99%      2
 100%     15 (longest request)