Fast line iteration in PHP
It's been a long time since I wrote anything here. During the last two years of study I haven't had much time to finish anything. Well, now I have the time. One of the first things on the list is an optimized mail parser written completely in PHP (so you don't need to install any extensions). Since mail is a line based format the parser will have to iterate over lines quite often, so it's worth doing some microbenchmarks on that.
I'll focus on two use cases:
- Iterate over lines from a PHP stream. For example an opened file (MBox, mail dir, …), a network connection (HTTP, IMAP, NNTP, …) or any other PHP stream.
- Iterate over lines of a memory buffer. The data is already loaded into one large string. This can be useful to reduce method call overhead. The parser can be called only once for the entire buffer instead of once for each line.
Please be aware that this is a microbenchmark! Results can vary quite a bit depending on the rest of your code. The values here are a starting point for your own function or library microbenchmarks, not a definitive answer.
Test setup
All tests were performed with PHP 5.4.6 on Linux. The test data was a 508 MiByte dump of the digitalmars.D newsgroup with CRLF ("\r\n") line breaks and 12 930 435 lines (on average 41.2 Bytes per line). The NNTP, IMAP and HTTP protocols also use these line breaks.
I measured the throughput and memory usage of different line iteration approaches. Most of them were already listed on Stack Overflow. Throughput is the dump size divided by the time it took to crunch through the 508 MiByte dump. The time was measured with the microtime() function. On the memory side I measured the peak memory usage with memory_get_peak_usage() after the iteration was done. Not perfect, but I also tested the memory usage during iteration: practically the same values, but the measurement degraded the throughput significantly. So we stick to the peak memory usage after iteration.
All tests were performed 11 times and the result is the average of test runs 2 to 11. The first run only served to load the entire file into the kernel read cache. Therefore the result of the first run is discarded. I tried to eliminate IO factors here since they can vary extremely if your data comes from a normal hard disk, an SSD or directly from a network connection.
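To make the procedure concrete, here is a minimal sketch of how a single run might be measured. The file name test-data.txt is just an assumption and the real scripts differ in the details:

$file = 'test-data.txt';  // assumed location of the dump
$start = microtime(true);

$fd = fopen($file, 'rb');
while ( ($line = fgets($fd)) !== false ) {
    // a real test would do something with $line here
}
fclose($fd);

$duration = microtime(true) - $start;
printf("throughput:  %.1f MiByte/s\n", filesize($file) / 1024 / 1024 / $duration);
printf("peak memory: %.2f MiByte\n", memory_get_peak_usage() / 1024 / 1024);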
All tests are online at GitHub. Take a look for any details not mentioned here.
Facts first
The results are grouped by use case: first the results of file iteration, then the results of memory buffer iteration. For the memory buffer tests I subtracted the buffer size from the peak memory usage. This leaves only the memory usage of the iteration itself (that's what's interesting, right?). Details on the different approaches follow further down.
Results
- If you don't care if the line break is still at the end of the line, fgets() is the clear winner, no matter if you read directly from a file or a memory buffer.
- If you can't tolerate the line break, stream_get_line() seems to be quite fast but has some quirks.
- Other approaches excel in some special cases:
  - fscanf() is fast for simple line based data: configuration files, simple YAML, headers, that kind of stuff.
  - strtok() is fast when you don't need blank lines (it skips them).
  - strpos() consumes almost no memory. Useful when working with very large buffers.
  - preg_split() excels at consuming memory. :)
Source and details of different approaches
fgets() from file
$fd = fopen(…, 'rb');
while( ($line = fgets($fd)) !== false ) {
    // do something with $line
}
fclose($fd);
- Works with "\n" or "\r\n" line breaks
- Line breaks still appended to $line
- Using rtrim() to get rid of line breaks reduces throughput to about 47% (see "fgets with rtrim" in chart above and the sketch below). In that case it's faster to use stream_get_line().
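The "fgets with rtrim" variant from the chart is essentially the loop above with one extra call per line, roughly like this (test-data.txt stands in for whatever file you read):

$fd = fopen('test-data.txt', 'rb');
while ( ($line = fgets($fd)) !== false ) {
    // strip the trailing "\r\n" (or lone "\n") from the line
    $line = rtrim($line, "\r\n");
    // do something with $line
}
fclose($fd);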
stream_get_line() from file
$fd = fopen(…, 'rb');
while( ($line = stream_get_line($fd, 0, "\r\n")) !== false ) {
    // do something with $line
}
fclose($fd);
- Works only with the specified kind of line break ("\r\n" above). Most protocols define exactly one line break, so this is not a problem in many situations. When using local files it might be a good idea to detect the line break used on the first line.
- Reads only lines up to a maximal line length (second argument). 0 represents the default value of 8192 Bytes. Many line based protocols also define a maximal line length. For mails RFC 5322 defines it to be 1000 Bytes.
- You can search for more complex "line endings" with this function. For example when parsing a mail you can search for the next MIME boundary instead of iterating over all the lines until you find it (see the sketch below). fgets() can't do that. But the maximal line length might spoil the fun here.
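To illustrate the last point, a minimal sketch of skipping ahead to the next MIME boundary. The boundary string and the file name are made up here; a real parser would take the boundary from the Content-Type header:

$fd = fopen('mail.txt', 'rb');
$boundary = 'frontier';  // hypothetical, normally read from the Content-Type header

// Each call returns everything up to (but not including) the next boundary
// line, no matter how many lines that spans (within the maximal length).
$delimiter  = "\r\n--" . $boundary . "\r\n";
$preamble   = stream_get_line($fd, 1024 * 1024, $delimiter);
$first_part = stream_get_line($fd, 1024 * 1024, $delimiter);

fclose($fd);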
fscanf() from file
$fd = fopen(..., 'rb');
$size = fstat($fd)['size'];
while (true) {
    $matches = fscanf($fd, "%[^\r\n]");
    if ($matches === false)
        break;
    $line = $matches[0];
    // do something with $line
}
fclose($fd);
- $line will be null for empty lines.
Not really useful for raw line iteration but fscanf() can be useful for simple line based text data like configuration files. For example take a simple config file:
key: "value"
key: "value" # comment
# comment
Kind of simplified YAML without nesting. fscanf() parsed it with 77 MiByte/s (see file_conf_fscanf.php). fgets() followed by a simple preg_match() got about 42 MiByte/s (see file_conf_regexpr.php). Both with a peak memory usage of 0.25 MiB.
For simple data fscanf() can be quite fast. Note that you won't find useful documentation in PHP's manual. The only useful documentation of fscanf() I know of is the Linux man page or the C language specification.
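As a rough idea of how such a config parser could look with fscanf(). The format string below is my own guess, not necessarily what file_conf_fscanf.php uses, and config.txt is just a made-up file name:

$fd = fopen('config.txt', 'rb');
while ( ($matches = fscanf($fd, "%[^:]: \"%[^\"]\"")) !== false ) {
    // fscanf() consumes one line per call; on lines that don't fit the
    // format (comments, blank lines) the conversions stay unmatched.
    if ( !isset($matches[0], $matches[1]) )
        continue;
    list($key, $value) = $matches;
    // do something with $key and $value
}
fclose($fd);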
fgets() from memory buffer
$fd = fopen('php://memory', 'r+');
fwrite($fd, $data);
rewind($fd);
while( ($line = fgets($fd)) !== false ) {
    // do something with $line
}
fclose($fd);
- See fgets() above.
strtok() from memory buffer
$data = file_get_contents('../mails.txt');
$line = strtok($data, "\r\n");
while ($line !== false) {
    // do something with $line
    $line = strtok("\r\n");
}
- Works with "\n" and "\r\n" line breaks.
- Empty lines are ignored (strtok() skips empty tokens). That pretty much eliminates this option for protocols where empty lines are important (e.g. HTTP, mail); see the sketch below.
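A quick illustration of that quirk with a made-up mail fragment: the empty line separating header and body simply disappears.

$data = "Subject: hi\r\n\r\nHello!\r\n";  // made-up example data

for ($line = strtok($data, "\r\n"); $line !== false; $line = strtok("\r\n"))
    var_dump($line);

// Prints "Subject: hi" and "Hello!", the empty line between the header
// and the body is silently swallowed.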
stream_get_line() from memory buffer
$fd = fopen('php://memory', 'r+');
fwrite($fd, $data);
rewind($fd);
while( ($line = stream_get_line($fd, 0, "\r\n")) !== false ) {
    // do something with $line
}
fclose($fd);
- See stream_get_line() above
strpos() from memory buffer
$data = file_get_contents(...);
$pos = 0;
while ( ($line_end_pos = strpos($data, "\r\n", $pos)) !== false ) {
    $line = substr($data, $pos, $line_end_pos - $pos);
    // do something with $line
    $pos = $line_end_pos + 2;
}
- Works only with one kind of delimiter or line break. But it can easily be extended to any kind of delimiter (see the sketch below).
- Almost no memory overhead. If you can't afford to duplicate the memory footprint of the buffer this is the way to go.
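One possible extension, just as a sketch: accept both "\n" and "\r\n" line breaks by searching for "\n" and stripping an optional trailing "\r". This variant wasn't part of the benchmark.

$data = file_get_contents('test-data.txt');  // assumed input file
$pos = 0;
while ( ($line_end_pos = strpos($data, "\n", $pos)) !== false ) {
    $line = substr($data, $pos, $line_end_pos - $pos);
    // strip an optional "\r" so "\r\n" line breaks work as well
    if ( substr($line, -1) === "\r" )
        $line = substr($line, 0, -1);
    // do something with $line
    $pos = $line_end_pos + 1;
}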
preg_split with foreach from memory buffer
$data = file_get_contents(...);
foreach( preg_split('/\r?\n/', $data) as $line ) {
    // do something with $line
}
- Easy for small data sets but the worst you can do for large ones.
Closing remarks
Again this is just a microbenchmark. Results can vary greatly depending on what you're actually doing. Most problems aren't as simple as pure line iteration… especially since the benchmark doesn't do anything with the lines. A very clever compiler would optimize the entire thing away.
Remember that the stream based approaches (e.g. fgets() or stream_get_line()) can use the full power of PHP's streams. They can read data from pretty much everything and you can combine them with filters to shovel work like base64 decoding or decompression into internal PHP routines. I haven't tested that though.
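A minimal sketch of that idea, assuming a file mail-part.b64 containing base64 encoded data; this wasn't benchmarked, it just shows how the pieces fit together:

$fd = fopen('mail-part.b64', 'rb');
// All data read from $fd now passes through PHP's base64 decoder first.
stream_filter_append($fd, 'convert.base64-decode', STREAM_FILTER_READ);

while ( ($line = fgets($fd)) !== false ) {
    // do something with the decoded $line
}
fclose($fd);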
At the end just a small thing to put the throughput of PHP into perspective. The fastest value I measured was 261 MiByte/s. The line count command pv test-data.txt | wc -l got a throughput of 1.6 to 1.8 GiByte/s (pv is like cat but displays a progress bar and some statistics). So if you really need performance don't bother with PHP. Grab C, C++ or D and get going. :)
6 comments for this post
Thanks for these benchmarks & remarks! You seem to have fun in your "Winter of code".
What hardware are you on? On my AMD Athlon X2 64 bit PC with an internal 1 TB HDD I got only 82 MiB/s with an 8.4 GiB text file:
$ pv /media/bigone/headers-dupl2.txt | wc -l
8,35GB 0:01:50 [77,6MB/s] [==================================>] 100%

Average over 10 s in the middle of the operation:

$ sudo iotop -d 10 -a -b -o
Total DISK READ: 82.80 M/s | Total DISK WRITE: 31.27 K/s

A bit faster when throwing away the output:

$ pv headers-dupl2.txt > /dev/null
8,35GB 0:01:45 [81,2MB/s] [==================================>] 100%

Input file: I concatenated the Linux kernel headers n times, the file looks like this:

$ ll headers-dupl2.txt -h
-rwxrwxrwx 1 root root 8,4G Sep 27 11:32 headers-dupl2.txt

$ head -5 headers-dupl.txt
/usr/src/linux-headers-3.2.0-41/
/usr/src/linux-headers-3.2.0-41/block
/usr/src/linux-headers-3.2.0-41/block/Kconfig
/usr/src/linux-headers-3.2.0-41/block/Kconfig.iosched
/usr/src/linux-headers-3.2.0-41/block/Makefile
OK it turns out it is very hardware dependent, on an Athlon Opteron 2427 server with 12 cores there is more throughput. The line counting has a lot of effect on the server:

# pv headers-dupl2.txt | wc -l
8.35GB 0:00:12 [ 677MB/s]

# pv headers-dupl2.txt > /dev/null
8.35GB 0:00:03 [2.28GB/s]
Thanks. :)
The tests were run on a Core i5-3570K with 16 GiByte of RAM and a Western Digital WD10EADS-00L. In my case the kernel cached the entire 508 MiByte file after the first test run. This was just an easy way to eliminate IO interference from the benchmark. Otherwise I would have tested the read throughput of my disk instead of the PHP code. The effect is best seen when reading the dump multiple times: the first run is limited by the read throughput of the disk, the following runs get the data straight from the kernel's read cache.
In your case the dump is probably too big for the kernel to cache it. To cache 8 GiByte of data you'll probably need 16 or 32 GiByte of RAM in the machine. Therefore your results pretty much mirror the throughput of your disk. Nevertheless, if the server is the server I think it is, its SAN has a nice throughput (no idea how fast a SAN should be, actually). :)
To eliminate the IO component you can try using smaller text files (a few hundred MiByte) and run the test multiple times.
How exactly did you code "fgets with rtrim"? Thanks!
Just like the fgets() test case but with an additional rtrim() call on each line. The code is on GitHub: https://github.com/arkanis/nntp-forum/blob/master/experiments/line-iteration-benchmark/file_fgets_trimmed.php
You load the whole content with "file_get_contents" and then use strpos. It consumes a lot of memory because the whole file will be in memory. fgets will load the file chunk by chunk, which causes no memory exhaustion.
When the data is stored in a file fgets() is usually the winner (see Results in the article).
But when you already have the data in a memory buffer things look different. For example when you're reading a mail directly from a network connection. When you can't take the memory hit of duplicating your data strpos() is the way to go. Just as you implied. But when you have the memory and want more performance you can open the data as a stream and use fgets(), too. In my measurements this was more than twice as fast.
But be aware, the measurements are quite old (2013). The situation might have changed by now.