Comments on "Measurements of system call performance and overhead" (Arkanis Development)

Comment by Optional (2023-06-14T09:23:44+02:00)
<p>The read/write results say that the vDSO doesn't really help for those cases (except for one, which is worth an investigation). So it's really only useful as a unified break-out for the few syscalls that can do without the context switch part.* This stands to reason, because on x86_64 the vDSO will execute the syscall instruction anyway.</p>
<p>Not so on 32-bit x86: There the interface is to use int 0x80, except where you can break out, or where you can use sysenter or syscall depending on the manufacturer. That ought to speed up all syscalls. Except that you can't really do it yourself, because the interfaces are muddled and yucky, even after figuring out whether you're on Intel or AMD. That's not a job for a program, so Linux does it through the vDSO. Though that too is rather complicated and poorly documented if libc doesn't do it for you. And that means a non-static binary and linking your assembly with libc again. *Sigh.*</p>
<p>So it would be more interesting to benchmark on 32-bit Linux: Does it pay to use the vDSO from assembly rather than just use int 0x80?</p>
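<p>For what it's worth, a minimal sketch of the two raw entry paths (using getpid() as the probe; the syscall numbers are Linux-specific, and this deliberately bypasses the vDSO, so it only shows the int 0x80 side of the comparison):</p>

```c
#include <stdio.h>

// Invoke getpid() directly, bypassing libc and the vDSO.
static long raw_getpid(void) {
    long ret;
#if defined(__x86_64__)
    // On x86_64 the syscall instruction is the only kernel entry point;
    // the vDSO falls back to it for non-accelerated calls.
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(39)             /* __NR_getpid on x86_64 */
                      : "rcx", "r11", "memory");
#elif defined(__i386__)
    // On 32-bit x86, int 0x80 always works; sysenter/syscall are only
    // reachable portably through the vDSO's __kernel_vsyscall.
    __asm__ volatile ("int $0x80"
                      : "=a"(ret)
                      : "a"(20)             /* __NR_getpid on i386 */
                      : "memory");
#else
#error "unsupported architecture"
#endif
    return ret;
}

int main(void) {
    printf("pid: %ld\n", raw_getpid());
    return 0;
}
```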
<p>* Which is one or two, a small handful at most. The one use case that comes to mind is from a Solaris(!) report where financial software was found to call time() a few million times per second, so speeding that up sped up the software considerably. I consider it a design defect: It should've just taken one time() value, run a batch of transactions, taken another time() value to confirm the batch completed within one second, and then recorded that value for all transactions. Rather than several time() values for each transaction, which, being single-second granular, will be the same value for the bulk of the transactions anyway.</p>
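<p>A minimal sketch of that batching pattern (process_transaction() and BATCH_SIZE are hypothetical placeholders):</p>

```c
#include <stdio.h>
#include <time.h>

#define BATCH_SIZE 1000

// Hypothetical per-transaction work.
static void process_transaction(int i) { (void)i; }

int main(void) {
    // One time() pair per batch instead of several calls per transaction.
    time_t before = time(NULL);
    for (int i = 0; i < BATCH_SIZE; i++)
        process_transaction(i);
    time_t after = time(NULL);

    if (after == before)
        printf("whole batch stamped with %ld\n", (long)before);
    else
        printf("batch crossed a second boundary, split it or re-stamp\n");
    return 0;
}
```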
Comment by Alex Petrov (2023-05-26T10:26:11+02:00)
<p>It would be interesting to see the same tests run on modern AMD/Intel/ARM64 hardware.</p>
Comment by Stephan (2021-08-30T01:24:55+02:00)
<p>That's a nice way to look at it. Thanks for the compact description, Paul. :)</p>
Comment by Paul Falke (2021-08-23T17:16:21+02:00)
<p>Thanks for your work! On the old write() versus fwrite() discussion: You mention that fwrite() uses a 4 kByte buffer. This increases throughput, but it increases latency, too. Therefore write() is the low-latency solution and fwrite() is the high-throughput solution. On the next level of detail, you can use fflush() together with fwrite() to get the best of both worlds. On the last level of detail, you have to remember that next to the userland buffers of glibc there are kernel buffers. To flush these you use fsync() if it is disk I/O, or ioctl() for network connections. Still, your work is a nice introduction to the topic.</p>
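<p>A minimal sketch of those three layers in order (the file name is a hypothetical placeholder):</p>

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    FILE *f = fopen("log.txt", "w");  // hypothetical output file
    if (!f)
        return 1;

    fwrite("event\n", 1, 6, f);  // lands in glibc's userland buffer (cheap)
    fflush(f);                   // write()s that buffer to the kernel (low latency)
    fsync(fileno(f));            // forces the kernel buffers onto the disk (durable)

    fclose(f);
    return 0;
}
```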
Comment by Michael Witten (2021-07-22T18:47:27+02:00)
<p>Thanks for sharing your knowledge.</p>
Comment by Stephan (2019-06-01T18:49:22+02:00)
<p>Hi Basile,</p>
<p>Please note that the current benchmark numbers are not what they appear to be (see comment #6 by Rob Bradford). Unfortunately I haven't had the time since then to write up a new article.</p>
<p>Regarding mmap() or malloc(): Quite difficult. What would a meaningful benchmark of those calls look like? The performance of malloc() depends heavily on the implementation (does it use linked lists, memory pools, etc.). If you can, you should do a micro-benchmark for your target system and situation. With mmap() I guess it depends on how much memory is available right now, how many memory maps there are, and so on. I don't know the implementation though.</p>
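<p>A minimal sketch of such a micro-benchmark (block size and iteration count are arbitrary placeholders; the result says nothing beyond this exact allocation pattern on your machine):</p>

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERATIONS 1000000
#define BLOCK_SIZE 64

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < ITERATIONS; i++) {
        char *p = malloc(BLOCK_SIZE);
        ((volatile char *)p)[0] = 1;  // touch the block so it isn't optimized away
        free(p);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per malloc/free pair\n", ns / ITERATIONS);
    return 0;
}
```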
<p>I've read the RefPerSys project page. From that I guess you're primarily interested in garbage collection. In that case you might be interested in the madvise() system call, specifically the MADV_DONTNEED flag. With it you can allocate one large memory area and invalidate parts of it after garbage collection. MADV_SEQUENTIAL and MADV_NORMAL might also be of interest: Before scanning a GC space you can switch to MADV_SEQUENTIAL (in case you do sequential scanning) and when done you can switch back to MADV_NORMAL. But I haven't used those two flags myself yet. I've only ever written a Baker GC for a small Lisp interpreter, so I don't know how well this applies to other GCs.</p>
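<p>A minimal sketch of that madvise() pattern, assuming a GC that owns one large heap mapping (the size and which half gets invalidated are made up):</p>

```c
#include <sys/mman.h>

#define HEAP_SIZE (64 * 1024 * 1024)

int main(void) {
    // One large anonymous mapping serving as the GC heap.
    char *heap = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (heap == MAP_FAILED)
        return 1;

    // Hint sequential access before a linear GC scan, then switch back.
    madvise(heap, HEAP_SIZE, MADV_SEQUENTIAL);
    /* ... scan the heap here ... */
    madvise(heap, HEAP_SIZE, MADV_NORMAL);

    // After collection, drop the physical pages of a dead region.
    // The mapping stays valid; the pages return zero-filled on next access.
    madvise(heap + HEAP_SIZE / 2, HEAP_SIZE / 2, MADV_DONTNEED);

    munmap(heap, HEAP_SIZE);
    return 0;
}
```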
<p>pthread_mutex_lock() is another beast entirely. I suppose the performance will depend heavily on the contention of the mutex. As far as I know it's implemented with futexes so it should be fast if only one thread uses it. But it will get slow if many threads try to lock it. There should be some good futex benchmarks out there (I hope).</p>
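<p>A minimal sketch of the uncontended fast path (single thread, so the futex never enters the kernel; contended behaviour would need a multi-threaded harness instead; build with -pthread):</p>

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000

int main(void) {
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&mutex);    // uncontended: one atomic op, no syscall
        pthread_mutex_unlock(&mutex);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per lock/unlock pair\n", ns / ITERATIONS);
    return 0;
}
```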
<p>I've also got some questions regarding RefPerSys: If every mutable object has its own lock, how do you avoid deadlocks? Primarily with more complex objects that consist of many atomic objects. From the "persistent values" section I also got the impression that the language operates on a global memory model, meaning every variable can be accessed from every tasklet. This will be quite the garbage collection challenge. Not to mention the chaotic memory access patterns and fragmentation (entailing false sharing of cache lines, etc.). Have you considered keeping most variables local to tasklets and only explicitly exchanging variables with other tasklets (by transferring ownership, via message passing, etc.)? That way each agenda could run its own thread-local garbage collector, which is way simpler than a multi-threaded one. Also you could do pretty cool stuff like running a tasklet, collecting its survivor variables when it's done, and storing those in the tasklet's own private memory area. This would compact the memory of each tasklet and lower the overhead of not-running tasklets. But well, I'm dreaming right now…</p>
<p>This got a bit longer than intended. Sorry about that. It's not often that I can talk about those things. :D</p>
<p>Happy programming,
Stephan</p>
Comment by Basile Starynkevitch (2019-05-30T15:22:12+02:00)
<p>Do you have any `mmap`-related benchmarks? Or `malloc`-related ones? Or `pthread_mutex_lock`-related ones?</p>
<p>A big thanks for your benchmarks. I am coding <a href="https://gitlab.com/bstarynk/refpersys">RefPerSys</a> and that is why I needed them.</p>
<p>Basile Starynkevitch <a href="http://starynkevitch.net/Basile/">http://starynkevitch.net/Basile/</a>
Bourg La Reine, France
basile@starynkevitch.net</p>
Comment by Anonymus (2018-12-30T22:16:36+01:00)
<p>Good experiment! I'm trying to learn about operating systems, and my most recent topic was system call latency. It was great to see some real numbers about it!</p>
Comment by oussama (2018-04-04T13:32:54+02:00)
<p>Thank you for this information. I really needed to read this and get some idea of what happens in hardware tests.</p>
Comment by Stephan (2018-01-29T15:12:50+01:00)
<p>Thanks for the heads-up and the link, Rob! I liberated the idea of using getpid() as a test from the Exokernel paper and blindly assumed Linux would do the same. Kind of stupid of me. I should have read the getpid() man page… they even state it there.</p>
<p>Well, that calls for some revising and a new round of benchmarking. :) I think I'll use time() or gettimeofday() next, as they're listed in the vDSO man page for x86-64 systems. I don't expect it to change much, but that remains to be seen. With a bit of luck I can do a new benchmark in a few months' time. Maybe even with some new CPUs…</p>
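<p>A minimal sketch of that planned comparison (iteration count is arbitrary): gettimeofday() through glibc, which uses the vDSO on x86-64 and never enters the kernel, versus the same call forced through the real syscall path:</p>

```c
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#define ITERATIONS 1000000

static double bench(int force_syscall) {
    struct timeval tv;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++) {
        if (force_syscall)
            syscall(SYS_gettimeofday, &tv, NULL);  // always enters the kernel
        else
            gettimeofday(&tv, NULL);               // vDSO fast path on x86-64
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return ((end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec)) / ITERATIONS;
}

int main(void) {
    printf("vDSO:    %.1f ns per call\n", bench(0));
    printf("syscall: %.1f ns per call\n", bench(1));
    return 0;
}
```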