The fastest way to split a Japanese string into single characters in PHP
If the string is UTF-8, I think this is probably the fastest method.
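<?php
$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";
// The empty pattern with /u splits between UTF-8 characters; PREG_SPLIT_NO_EMPTY drops empty pieces.
$chars = preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($chars);
?>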
Reference: http://we-b.anchortag.jp/434.html
Result
Array ( [0] => 科 [1] => 学 [2] => の [3] => 力 [4] => で [5] => は [6] => ど [7] => う [8] => し [9] => よ [10] => う [11] => も [12] => で [13] => き [14] => な [15] => い [16] => 、 [17] => 魑 [18] => 魅 [19] => 魍 [20] => 魎 [21] => な [22] => ど [23] => の [24] => 奇 [25] => 怪 [26] => な [27] => 輩 [28] => に [29] => 立 [30] => ち [31] => 向 [32] => か [33] => う [34] => 胡 [35] => 散 [36] => 臭 [37] => い [38] => 男 )
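One thing the snippet above does not show is what happens with malformed input: with the /u modifier, preg_split returns false instead of an array when the string is not valid UTF-8. The following is a minimal sketch of a guard for that case. The function name split_utf8_chars is hypothetical and not part of the original code, and returning an empty array on failure is just one possible choice.

```php
<?php
// Hypothetical helper (not from the original post): split a UTF-8 string
// into single characters, falling back to an empty array on invalid input.
function split_utf8_chars($str)
{
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on failure, e.g. malformed UTF-8 with /u.
    return $chars === false ? array() : $chars;
}

$chars = split_utf8_chars("魑魅魍魎");
echo count($chars), "\n";        // 4
echo implode(",", $chars), "\n"; // 魑,魅,魍,魎

$broken = split_utf8_chars("\xE3\x81"); // truncated 3-byte sequence
echo count($broken), "\n";       // 0
```

Depending on the application, throwing an exception or checking preg_last_error() may be a better fit than silently returning an empty array.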
Let's compare this with cutting out the characters one at a time with mb_substr.
The source code is taken from tiny_segmenter.phps (http://www.programming-magic.com/20080816010106/).
<?php
$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";
$result = array();
$length = mb_strlen($str, 'UTF-8');
// Cut out one character at a time by character offset.
for ($i = 0; $i < $length; ++$i) {
    $result[] = mb_substr($str, $i, 1, 'UTF-8');
}
print_r($result);
?>
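Before comparing speed, it may be worth confirming that the two approaches really produce the same array. The following is a small sketch of such a check. The wrapper names split_with_preg and split_with_mb_substr are hypothetical, introduced only for this illustration, and the mbstring extension is assumed to be enabled.

```php
<?php
// Hypothetical wrappers around the two approaches compared in this post.
function split_with_preg($str)
{
    return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

function split_with_mb_substr($str)
{
    $result = array();
    $length = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $length; ++$i) {
        $result[] = mb_substr($str, $i, 1, 'UTF-8');
    }
    return $result;
}

$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";

// Strict array comparison: same keys, same values, same order.
var_dump(split_with_preg($str) === split_with_mb_substr($str)); // bool(true)
```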
Benchmark results with ab -n 10000
preg_split
Requests per second: 1242.96 [#/sec] (mean)
mb_strlen + mb_substr
Requests per second: 620.58 [#/sec] (mean)
preg_split is about twice as fast (1242.96 vs. 620.58 requests per second).
Benchmark details
Environment: Debian lenny (stable), running on VMware.
preg_split version
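The ab figures measure whole HTTP requests, so they also include Apache and print_r overhead. To time only the splitting itself, something like the following could be run from the command line. This is a rough sketch, not the measurement used in this post: the iteration count is arbitrary (chosen to mirror ab -n 10000), the variable names are made up for illustration, and the mbstring extension is assumed to be enabled.

```php
<?php
$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";
$iterations = 10000; // arbitrary; mirrors the request count used with ab

// Time the preg_split approach.
$start = microtime(true);
for ($n = 0; $n < $iterations; ++$n) {
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}
$pregSeconds = microtime(true) - $start;

// Time the mb_strlen + mb_substr loop.
$start = microtime(true);
for ($n = 0; $n < $iterations; ++$n) {
    $result = array();
    $length = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $length; ++$i) {
        $result[] = mb_substr($str, $i, 1, 'UTF-8');
    }
}
$mbSeconds = microtime(true) - $start;

printf("preg_split: %.4f s\n", $pregSeconds);
printf("mb_substr : %.4f s\n", $mbSeconds);
```

The absolute numbers will depend on the PHP version and machine, so only the ratio between the two timings is meaningful.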
debian:~# ab -n 10000 "http://192.168.195.129/~rti/test/preg_split.php"
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.195.129 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: Apache/2.2.9
Server Hostname: 192.168.195.129
Server Port: 80
Document Path: /~rti/test/preg_split.php
Document Length: 624 bytes
Concurrency Level: 1
Time taken for tests: 8.045 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 9770000 bytes
HTML transferred: 6240000 bytes
Requests per second: 1242.96 [#/sec] (mean)
Time per request: 0.805 [ms] (mean)
Time per request: 0.805 [ms] (mean, across all concurrent requests)
Transfer rate: 1185.91 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 0 1 0.3 1 17
Waiting: 0 0 0.1 0 5
Total: 0 1 0.3 1 17
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 1
100% 17 (longest request)
mb_substr version (mb_split.php)
debian:~# ab -n 10000 "http://192.168.195.129/~rti/test/mb_split.php"
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.195.129 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: Apache/2.2.9
Server Hostname: 192.168.195.129
Server Port: 80
Document Path: /~rti/test/mb_split.php
Document Length: 624 bytes
Concurrency Level: 1
Time taken for tests: 16.114 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 9770000 bytes
HTML transferred: 6240000 bytes
Requests per second: 620.58 [#/sec] (mean)
Time per request: 1.611 [ms] (mean)
Time per request: 1.611 [ms] (mean, across all concurrent requests)
Transfer rate: 592.10 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 2
Processing: 1 2 0.3 2 15
Waiting: 0 1 0.3 1 15
Total: 1 2 0.3 2 15
Percentage of the requests served within a certain time (ms)
50% 2
66% 2
75% 2
80% 2
90% 2
95% 2
98% 2
99% 2
100% 15 (longest request)