The fastest way to split a Japanese string into single characters in PHP

If the string is UTF-8, I think this is probably the fastest method.

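<?php
$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";
// An empty pattern with the u (UTF-8) modifier matches between every character;
// PREG_SPLIT_NO_EMPTY drops the empty pieces at the start and end.
$chars = preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($chars);
?>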

Reference: http://we-b.anchortag.jp/434.html


Result

Array ( [0] => 科 [1] => 学 [2] => の [3] => 力 [4] => で [5] => は [6] => ど [7] => う [8] => し [9] => よ [10] => う [11] => も [12] => で [13] => き [14] => な [15] => い [16] => 、 [17] => 魑 [18] => 魅 [19] => 魍 [20] => 魎 [21] => な [22] => ど [23] => の [24] => 奇 [25] => 怪 [26] => な [27] => 輩 [28] => に [29] => 立 [30] => ち [31] => 向 [32] => か [33] => う [34] => 胡 [35] => 散 [36] => 臭 [37] => い [38] => 男 ) 
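
The u modifier is what makes this work: it tells PCRE to treat the pattern and the subject as UTF-8, so the empty pattern matches between characters rather than between bytes. Here is a minimal sketch of the difference, assuming the file is saved as UTF-8. The shortened sample string and the variable names are only for illustration.

```php
<?php
// Assumes this file is saved as UTF-8.
$str = "科学の力";

// With the u modifier, the empty pattern matches at every character boundary.
$chars = preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);

// Without it, the subject is handled byte by byte, so every
// multi-byte character is torn apart into its individual bytes.
$bytes = preg_split("//", $str, -1, PREG_SPLIT_NO_EMPTY);

echo count($chars), "\n"; // 4 (characters)
echo count($bytes), "\n"; // 12 (each of these characters is 3 bytes in UTF-8)
```

So if the u is forgotten, the array comes back full of broken byte fragments instead of characters.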

Let's compare this with cutting out the characters one at a time using mb_substr.
The source code is taken from tiny_segmenter.phps (http://www.programming-magic.com/20080816010106/).

<?php
$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";

$result = array();
$length = mb_strlen($str, 'UTF-8');
for($i=0; $i<$length; ++$i){
	$result[] = mb_substr($str, $i, 1, 'UTF-8');
}
print_r($result);
?>

Benchmark results with ab -n 10000

preg_split
Requests per second: 1242.96 [#/sec] (mean)
mb_substr
Requests per second: 620.58 [#/sec] (mean)

preg_split is about twice as fast (1242.96 vs. 620.58 requests per second).
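
Note that ab measures the whole HTTP request, so Apache and print_r are included in those numbers. To time only the two splitting approaches, something like the following could be run from the command line. This is just a sketch: the helper function names and the iteration count are made up for illustration, it assumes the mbstring extension is loaded, and I am not claiming any particular result for it.

```php
<?php
// Sketch only: helper names and the iteration count are illustrative.
// Assumes the mbstring extension is loaded and this file is saved as UTF-8.

function split_with_preg($str)
{
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}

function split_with_mb_substr($str)
{
    $result = array();
    $length = mb_strlen($str, 'UTF-8');
    for ($i = 0; $i < $length; ++$i) {
        $result[] = mb_substr($str, $i, 1, 'UTF-8');
    }
    return $result;
}

// Calls the function named $fn on $str $iterations times
// and returns the elapsed wall-clock time in seconds.
function time_split($fn, $str, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; ++$i) {
        $fn($str);
    }
    return microtime(true) - $start;
}

$str = "科学の力ではどうしようもできない、魑魅魍魎などの奇怪な輩に立ち向かう胡散臭い男";
$iterations = 10000;

// Check that both approaches produce the same array before timing them.
var_dump(split_with_preg($str) === split_with_mb_substr($str));

printf("preg_split: %.4f s\n", time_split('split_with_preg', $str, $iterations));
printf("mb_substr:  %.4f s\n", time_split('split_with_mb_substr', $str, $iterations));
```

The absolute times will depend on the PHP version and the machine, so only the ratio between the two lines is worth comparing.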

Benchmark details

Environment: Debian lenny (stable), running on VMware.

preg_split version

debian:~# ab -n 10000 "http://192.168.195.129/~rti/test/preg_split.php"
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.195.129 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        Apache/2.2.9
Server Hostname:        192.168.195.129
Server Port:            80

Document Path:          /~rti/test/preg_split.php
Document Length:        624 bytes

Concurrency Level:      1
Time taken for tests:   8.045 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      9770000 bytes
HTML transferred:       6240000 bytes
Requests per second:    1242.96 [#/sec] (mean)
Time per request:       0.805 [ms] (mean)
Time per request:       0.805 [ms] (mean, across all concurrent requests)
Transfer rate:          1185.91 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0    1   0.3      1      17
Waiting:        0    0   0.1      0       5
Total:          0    1   0.3      1      17

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%     17 (longest request)

mb_substr version (the script is named mb_split.php)

debian:~# ab -n 10000 "http://192.168.195.129/~rti/test/mb_split.php"
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 192.168.195.129 (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests


Server Software:        Apache/2.2.9
Server Hostname:        192.168.195.129
Server Port:            80

Document Path:          /~rti/test/mb_split.php
Document Length:        624 bytes

Concurrency Level:      1
Time taken for tests:   16.114 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      9770000 bytes
HTML transferred:       6240000 bytes
Requests per second:    620.58 [#/sec] (mean)
Time per request:       1.611 [ms] (mean)
Time per request:       1.611 [ms] (mean, across all concurrent requests)
Transfer rate:          592.10 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       2
Processing:     1    2   0.3      2      15
Waiting:        0    1   0.3      1      15
Total:          1    2   0.3      2      15

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      2
  95%      2
  98%      2
  99%      2
 100%     15 (longest request)