Welcome to Arkanis Development


Unconventional Java programming


TLDR: Java is still a pain for me, but it can be a lot less than I used to belief.

Old prejudices

I usually try to avoid Java projects. In normal C code it's pretty easy to see what the machine really does and well written code (in most languages) can also communicate what the programmer meant to do. Both are important aspects of a program, especially when you have to debug it. But when reading Java code I mostly just get impressions of the kinds of diagrams people drew on whiteboards (or in their heads) or how their teams were structured… or of confusion.

There seems to be a lot of object orientation religion involved and no, you can't properly solve a problem by blindly throwing "design patterns" at it until the problem rage-quits. Understanding the problem itself is important. But this might just be because I mostly see Java code in academia or university. But the few commercial projects I've seen don't exactly inspire confidence in other areas. Java projects also seem to be rather fond of a lot of tooling but fortunately I wasn't to involved with that. It doesn't sound motivating to spend more time debugging skyscraper sized tooling stacks than to work on the actual problem.

I also have mixed feelings about the language design and API design of the standard library. I mostly get the point of what they were trying to achieve (making it simple enough so you can easily swap programmers, static exception handling, etc.) but I can't say I've ever seen most of that work out in real life. But then it hardly does for any language. And like any other language Java has some pretty nice parts as well. The way you can package your entire application into one JAR is nice and can make deployment very simple if you bother to care about it. So maybe my problem is mostly not with the language but maybe more with how the community uses it, I guess.

A more recent experience

Sorry for the short rant. Anyway, a while ago I stumbled into a Java codebase again. A friend of mine has to work with RDF and tripple stores and they're using Fuseki as a database server. You basically send a query via HTTP and get back the results. Neat and simple. Especially if you fetch the results as JSON-LD which also does some post-processing to make the data almost human readable (the so called compacted form). But in our case we needed to do some further processing of the data and that post-processing made it a lot more complicated. So I wanted to disable it (get the so called "expanded" form).

Jena (the library behind Fuseki) supports that exact output format. At least the functionality is in their API docs. How about a quick patch for Fuseki to add that output format? How hard can it be, right? Even with content negotiation, which isn't all that complicated when you get down to it. Just add another mapping of the expanded JSON-LD content type to the corresponding Jena output format. Or another if statement or a branch in a switch case statement. Something like that. Probably.

I gave up after a couple of hours. To me it looked like a horrible mess of interconnected classes for even the simplest of tasks. Abstractions that don't make stuff simpler but mostly create confusion and limit what you can do. But the whole ordeal got me thinking: Do you have to program in Java that way?

Java changed over the years

A few weeks later and pretty much by accident I ended up skimming through the list of JDK Enhancement Proposals (by JDK version, full list). Most languages have stuff like that and for me it was a convenient way to see what changed over the years. Without all that marketing bullshit or oversimplification that most normal online articles do.

And there is a lot of interesting stuff there:

Of course there are a lot of other features like Switch expressions, Text Blocks and Records but the stuff above was the most relevant to my style of doing a lot of small iterative experiments.

Playing around with Jena

The final stretch of my Java just-for-fun weekend was working with Jena. Basically I wanted to write a small HTTP server that executes SPARQL construct queries and returns the result as expanded JSON-LD.

Turns out it's just about 90 lines of code. That includes a small HTML page to enter the SPARQL query, a bit of JavaScript to send the query as JSON and a small example construct query. All within a text block. I didn't put it on GitHub because, well, it's just one file. The JSON stuff is mostly there because I wanted to test out JSON parsing in Java. Unfortunately I didn't find a way to do it with the standard library but Jena provides an API for that anyway.

Given, I had to manually download Jena and used their JavaDoc quite a bit. But all of that without any big IDE, Maven and all that. Just with some text editor (Sublime Text in this case) and browsing the API docs. Which are funnily enough easier to work with than the endless prose of the Python documentation.

For seasoned Java developers this might seem a bit weird but I work on a lot of different projects in a lot of different languages and contexts. From Assembly to PHP, Ruby, JavaScript, C, GLSL and what not. So I usually stick to the basics and don't spend much time with specialized tooling that needs regular tending (like IDEs or complex tool stacks). Pros and cons, as always.

Well, that was a fun little exploration weekend. The Java eco-system is still not something I enjoy working with but the language itself isn't too bad. It came a long way in the last 10 years or so since I actively started avoiding it. Who knows, maybe it will be a good fit for some future project. :)

Visualizing process address space and thread activity


I've spend the last couple of days tinkering around at a process visualization tool. It shows the virtual memory of a process on a 2D map and threads as small dots jumping around (at the instructions they're executing). The map also shows the public symbols within a given shared library (e.g. poll() in libc). Just to give you an idea where threads are hanging out.

Visualizing a media player process. Looks like threads are sticking around libavcodec, poll(), pthread_cond_wait() and a few other places.

If you want to try for yourself you can get the code at github. But it's a linux-only tool and you'll have to compile it yourself.

The main purpose of the tools is to give students an easier and intuitive idea of what a process is. It's not a serious debugging tool something like that. I've had the idea while I was still giving the operating system lecture at university. Back then I saw the xkcd Map of the Internet, and poof, stuff happens in your head and you can't unthink the idea.

That was quite some time ago but it was a nice pretext to fiddle around with raylib and ptrace. Thanks to Chris from bedroomcoders.co.uk for giving me the final push to try out raylib. It really is a nice library to work with.

Aside from raylib the project had other interesting aspects:

Hilbert curve: This is a nice way of mapping the 1D address space of a process onto a 2D map. Quite a neat idea but for now the mathematical details are beyond me. Maybe in the future I'll find the time to derive the largest completely enclosed 2D rectangle for a given 1D range on a Hilbert curve. That would be rather neat to properly position the labels of address ranges or the linker symbols in them. But who knows when that fancy will strike me.

ptrace was another Linux API I wanted to try for a while now. It's basically the Linux API for debuggers like gdb and to be honest that kind of shines through. I probably abused the API for what I did here. :D

ELF dynamic linker symbols and DWARF debug symbols: Here we are again in fantasy land. Where I write this on something that was formerly called the GNOME 2 desktop (now MATE). Some fantasy authors would have been proud. Anyway, I wanted to put at least some functions on the map. Primarily to see whats inside address ranges that contain executable code. While not simple ELF is straight forward enough to read the dynamic linking symbols from its .dynsym section. And those have to be there for every shared library.

Also showing DWARF debug symbols would have been nice. But unfortunately this got a bit to complex for a just-for-fun operation. And getting debug symbols for e.g. Debian packages is kinda complicated (how to load them is nicely documented though). Well, maybe some other time. :)

I hope someone else finds this tool illuminating or at least helpful. :)

If anyone is interested I can explain stuff in more detail. For example how one texture is used to draw the map regions and borders without polygons (plus one texture for each address region with symbols in it). I'm just to lazy to go into details if nobody is interested. :D

Performance measurements of brute-force search for a substring in lines


A friend of mine recently worked on a search for a system that used a triplestore. The problem was a simple one: Find all the stuff with a given string in it, even if the string isn't a whole word. They ended up with about 5 million (id, text) pairs, used Elasticsearch and the queries are usually faster than 1 second.

I don't know the details of what else is going on (ranking, sorting, etc.) so please don't take this a representative performance of Elasticsearch. ;) But I asked myself:

How fast would that be if we brute-force it with grep?

What if we write all the (id, text) pairs out to a file, one pair per line. Then use the grep command line program to get all lines that contain a specific substring. And from those matching lines we can extract the IDs. A few lines should do the trick, right?

For how much data would this be a good sweet spot of complexity vs. performance (read: The right tool for the job)? So I wrote a microbenchmark. Because what else do you do over Christmas? :)

In case you couldn't tell already: This is going to be something only programmers can obsess over. You have been warned. ;)

The overview

Just in case you only want the gist of it and don't care about the details:

Now on to the data itself:

Searching for a string in a line with different tools and languages

I also wrote a C programs that do the same as above. And since multi-threadding in C is easier than in many interpreted languages I threw some threads at the problem. Note the different scale.

C programs to search for a string in a line, including multi-threadding. Added grep for comparison.

So much for the "short" version. If you're still interested in the details feel free to read on. :)

The test setup

All the programs, scripts and commands used are available in a GitHub repository. I also put the raw benchmark results in there (this includes the versions of everything used. The benchmarks were run with the command ./bench.sh > debug.log 2> results.txt.

To get a reproducible data set I wrote a small ruby script to generate id: text lines. The script uses a fixed seed so it always generates the same lines. A small sample:

http://example.com/3073701/Ybyld6r9vNai95TYXiQH4FuQV+l/CHMwP4i3jQ==   qO76BFQLlj2g
http://example.com/3874850/zjQdPR3iPXe/f28d0x6D2fQONt8x   j5gSmB0dQhRpZ5yjnMBU9la1Dl00wzcD2C7J
http://example.com/4659995/zLtSsvjmP6o6vYwQQm0VRTtgxM0xUbY=   QvzHvFPq7/VrKDJbjoQ9xnzgUUEHP2F/RW2WbIOfnq0wgS5xNCCwF4w=
http://example.com/9674460/OGuKOpNt22+aPUfoW3XTBDOfh/q/lc8ysJpil+M=   WURbI/R069glGkYLjfOizaobX5gJRUjF71pe

Pretty much base64 encoded random garbage. Each line is on average about 150 bytes long, some a lot longer, some a lot shorter.

The 150 byte length doesn't have any deeper meaning. I just picked something. You might have way fewer documents with a lot more text in them or way more things with just short text. That's why I give the performance in MiByte/s. It doesn't matter that way.

lines-100k.txt 14.4 MiByte
lines-1m.txt 144.9 MiByte
lines-10m.txt 1.4 GiByte
lines-50m.txt 7.2 GiByte
lines-100m.txt 14.3 GiByte

I generated a few of those files with 100 thousand, 1, 10, 50 and 100 million lines. Just to see if the programs would behave differently.

The last step was to pick some substring in those lines (e.g. "VrKDJbjoQ9xnzg") and searched for it with different tools. Each program was executed multiple times and the results were averaged. From 10 times for the small files like lines-1m.txt down to 2 times for lines-100m.txt. You can look at bench.sh for the details.

Just like grep each tool is supposed to search a text file for lines that contain the given substring. No matter where it is or if it's an entire word or just part of a word. I just tested the tools I had already installed and what struck my fancy. So don't be surprised if it doesn't include your favorite tool.

I also wanted to know how fast naive implementations of such a search would be in different languages. That's how the PHP, Ruby, Python and C programs found their way into the benchmark. Again, not to compare performance but rather to know what to expect in a given environment.

Measuring execution time

/usr/bin/time was used to measure how long it took for a program to go through an entire file (logged the wall time and some other stuff). This isn't the time builtin of shells like bash or zsh but an extra program.

A small caveat here: The time program only reports times with 10ms precision (e.g. 0.21s). Therefore I didn't do runs with 100k lines. Some programs were simply to fast and I got division by zero errors. And I was to lazy to look for a tool to properly read the timing information form the Linux kernel or write a skript for it. All that stuff is available under /proc/[pid] in stat, io, smaps and more, see the proc man page.

The test systems

Two systems had the honor of not running fast enough when I had the idea for the benchmark. So they had to search text files for substrings for the next two days. :)

One PC (can't blame it, it can't run):

Linux 4.15.0-91-generic #92-Ubuntu SMP x86_64
CPU: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
RAM: 15.6 GiByte
Memory bandwith: 15.96 GiByte/s (reported by `mbw 2048`)
Disk: 4TB WD Red "WDC WD40EFRX-68WT0N0 (82.00A82)"
Read speed: 103 MiByte/s (measured with `wc -l`)

And one notebook (no excuses here):

Linux 5.3.0-46-generic #38~18.04.1-Ubuntu SMP x86_64
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
RAM: 31.1 GiByte
Memory bandwith: 13.8 GiByte/s (reported by `mbw 2048`)
SSD: 1 TB Samsung "MZVLB1T0HALR-000L7"
Read speed: 1.64 GiByte/s (measured with `wc -l`)

The rest of the article only shows the results of the lines-10m.txt run (10 million lines, 1.4 GiByte) on the notebook. It's pretty representative of the other runs. No need to show more data that doesn't say anything interesting. Funnily enough both CPUs performed pretty much the same. Well, the task at hand isn't exactly black magic, even for older CPUs. Anyway, the full data of all runs is in the repository.

Assumption: All the data fits into memory

Or rather in the page cache. In case this is new for you: When you read data from a file the operating system does something nifty. If it has unused memory left it keeps the data in there. The next time a program wants to read the same chunk ("page") of the same file the operating system takes it right out of the page cache. That way it doesn't have to wait for the disk again to get the data. Well, free memory doesn't help anyone, so might as well use it to cache some data that might help someone in the future. :)

In the benchmarks below I read the file once with wc -l. From then on it's in the page cache and all consecutive programs got their data straigt from there. Thanks to that the benchmarks are not bound by the read performance of the disk.

As long as the file fits into the page cache that is. If the file is for example larger than the memory your program is always limited by how fast your disk can provide data (e.g. 1.64 GiByte/s on the notebooks SSD).

Results and findings

Ok, that's it for the "preliminaries". Now on to the real meat: The results and what suprised me.

Funny thing: grep is faster with longer search strings

While testing out grep I noticed that longer search strings seem to perform better:

Searching for substrings of a different length with grep.

Here are the individual commans:

short, 2310 MiByte/s:
$ grep VrKDJbjoQ9xnzg lines-10m.txt
medium, 4177 MiByte/s:
$ grep V6TGqGjjnmhlHRcYEq1IJCgzUNSx09bCkwJnEK lines-10m.txt
long, 5207 MiByte/s:
$ grep FMzaoLmVeEoJ3PAPERDFSO2RFEo5/mO17YTQrXz4jr0Ud9w0854q6/rcRu11AocX3vzl4q7O0f6c lines-10m.txt

At first I was a bit startled by that. Especially given that grep doesn't only search for simple substrings but can also search for regular expressions.

But then I realized that the longer the match the smaller the characters in a line where the match could start. In other words: If we search for a string that is 20 chars long we can skip the last 19 chars in each line. They can't possibly contain a match. At least that's what I think is happening here. This might also be a general optimization for regular expressions in grep. I haven't looked into it.

Update 2020-12-30: It's probably the Boyer–Moore string-search algorithm. Forgot about that one. :D

In the rest of the article I quoted the performance for the smallest substring. In my experience long substring searches are somewhat rare. But in the end that depends on your usecase.

The awk side of things

I also wanted to test out awk a bit. If you've never heard about it: awk is a command line tool for line based processing. Perfect for processing logs, CSVs, statistics and stuff like that. It basically executes scripted actions before processing a file, then on each line that matches a pattern and after processing a file. It can do much more than just search each line for a substring.

And what's really cool: awk reads the input line by line. It doesn't have to load the entire file into memory. So you can happily crunch gigabytes or terabytes of data all day long without worying about running out of memory. Well, not that this is reflected in the benchmark since the entire file was in memory when those benchmarks were run. :P

I threw sed into the chart as well because it had no where else to go and felt lonely.

Searching for substrings with awk and sed.

GNU awk seems to be the default version, at least on Ubunt based systems. The performance is pretty decent and it could handle about 350 MiByte/s on the test system.

mawk is another implementation and uses a bytecode interpreter. I had it flagged down for a while because it seemed to be quite a bit faster than GNU awk. Looks like it actually is. :)

Also kind of funny: When I used the index() function of mawk to look for a substring instead of a regular expression it got quite a bit slower (the "mawk index" bar).

Naive line-by-line search in different languages

As mentioned before: I wanted to know what performance I can expect from a naive line-by-line search in different languages. This might be useful to decide if I go with a simple low-effort brute-force solution. Or if I have to spend a bit more time on evaluating alternatives (a database, library or whatever). It all depends on the requirements and environment of a project. The performance numbers given here are just datapoints that help to figure out how long a simple brute-force line-by-line search is fast enough.

Again: Please don't take this as a performance comparison. Because the world isn't that simple. I hope you don't choose the programming language you use for a project based on it's naive find-substring-in-line performance. ;)

With all that out of the way, on to the programs. The code is simple enough, here for example in Ruby:

File.open(filename) do |f|
    f.each_line do |line|
        puts line if line.include? search_term
Searching for substrings line-by-line in PHP, Python and Ruby.

I wrote the same in PHP, Python and C as well. But we'll look at the C programs later.

The nice thing of this approach is that we only need the memory for one line at a time. No matter how large the entire file is we won't run out of memory. As long as the entire file isn't just one big line that is. :D

Pretty decent performance for just a few lines of code. Probably fast enough for a few hundred MiBytes of data. But for larger data sets I would probably execute a grep command if the code only needs to run on Linux.

Search the whole file instead of each line

While playing around with the strstr() function in C I noticed that the function by itself pretty darn fast. So wouldn't it be great if we could eliminate the line-iteration overhead and our code could spend all its time in strstr()?

What if we had the entire file in a single string. Then we could search this one large string with consecutive strstr() calls until we've found all the matches and reached the end of the string. This is more like searching in a file with a text editor than searching each line individually.

Each time we find a match we could search for the start and end of the current line and print the entire line (just like grep does). Then our code wouldn't have to loop over each line, wasting instructions and CPU time doing so.

Might be faster but it would no longer be easy to search for multiple substrings in the same line. Also the code is more complicated than before, e.g. see ruby2-all-index.rb.

Another slight problem is that we would need the entire contents of the text file in one huge string variable to do that. But reading several hundred MiBytes kind of sucks: We would go over the entire file just to load it into memory and then go over that memory again to search it.

Well, somewhat obscure (depending on your background) operating system knowledge to the rescue! :)

Operating systems can "map" files into memory. Basically the operating system reserves a part of your memory (address space to be more exact) that is as large as the target file. When you access that memory at a given place the operating system automatically reads the corresponding chunk of data directly from the file. Meaning you can read data from a file just by looking at the right memory address.

And big surprise: A string is just another (usually small) part of your memory. So with memory mapping we can use the entire file as one large string without having to read it first. The operating system reads the parts that are used as soon as we access them.

There's more to memory maps than that. For example the operating system can easily evict seldomly used data chunks and use the memory for more important purposes. If need be it can always read the data chunks from the disk again. This can be useful when some other programs quickly need more memory (e.g. when you open a particularly "modern" website in your browser).

Anyway, I gave it a spin and lo and behold now we're talking. :)

Searching for substrings in one big mmaped string in PHP, Python and Ruby.

PHP and Ruby really shift up a gear here. Easily achieving several times their previous performance. Depending on the usecase this might be complexity well invested.

Python is pretty interesting here: It has an explicit mmap() function that maps a file into memory. The 2nd Python program uses that but otherwise does a line-by-line search just as before. Just with that it's a bit faster, but this is probably not because of the memory mapping itself.

Usually reading lines from a file in Python returns an str for each line. Those are encoding-aware and might do some extra stuff for that. Reading lines from a memory mapped file tough returns bytes. Those are just raw data without any encoding. When adding mmap() I also changed the search from str to bytes and I suspect that this causes the difference in processing speed.

Another note about PHP and Ruby: Both languages don't expose an explicit way to memory map a file. Of course there are extensions for that but I wanted to keep it simple. Instead they both have functions to read the entire file into a string (file_get_contents() and File.read()).

I hoped those use the C function fopen() internally because that in turn uses mmap() and maps the file into memory instead of reading it chunk by chunk. This worked out in my micro-benchmark here but I wouldn't count on that behaviour for a real project without further research. Especially if the file is larger than the systems memory. ;)

Going faster - C and multithreading

It might suprise some people but I still write quite a bit of code in C. Some tasks are simply easier done there than in other languages. Naturally I also wanted to know how fast the above programs would be when written in C.

Oddly enough threadding is something I find way easier in C than e.g. in Python or Ruby. No need to worry about the global interpreter lock (aka GIL), garbage collector and so on. We just let each thread search a slice of the data, e.g. 4 slices for 4 threads.

All previous benchmark results were still quite a bit away from the systems memory bandwith of 13.8 GiByte/s (as reported by mbw 2048). That is a very rough estimate of how fast the CPU can read from and write to RAM. Maybe multithreadding will help us getting closer to that limit. Sorry, unfortunately the sky isn't the limit for a CPU. :P

Searching for substrings with C: Line-by-line, in one big mmaped string and multithreaded. Added grep for scale (sorry, no bananas here).

Well, it turns out that i7-8550U can read more than 13.8 GiByte/s. :D Incidently the PC system peaked at 4 threads with 16.8 GiByte/s.

Turns out that if you really need performany C is the way to go. By a wide margin. With 4 threads the test system could search through 1 GiByte of data in just 65ms. I've seen a lot of SQL queries that operated on a lot less data and took a lot longer. But if that makes sense depends on your project and environment (as usual).

The downside is the complexity of the task in C. You have to be careful to properly slice the data and how to put a zero-terminator at the end of each slice (strstr() needs one). Not as easy as writing a hand full of lines in Ruby. But not particularly difficult if you're fit in C. You can look at c5-threaded-strstr.c for more details or feel free to just ask.

Putting the pedal to the metal

Ok, I didn't. To utilize the maximum memory bandwith of modern CPUs you have to use vector instructions to fetch data from memory into vector registers. Not to mention that we can use vector instructions to scan a lot of characters in parallel. More details about that (and much more) are in Agner's optimization manuals. But a fair bit of warning, this is pretty advanced and detailed stuff over there. ;)

As much fun as it would be to write an AVX2 assembler version my motivation didn't cary me that far… yet. Honestly I can envision myself using and maintaining the multithreaded C version in some circumstances. But an AVX2 version… not very much. But well, who knows what the future will bring. :D

When the data is a tight fit

In all previous runs the entire data easily fit into memory. I wanted to see how far I could take this until we get limited by the read bandwith of the drive on which the file is stored. My PC has 15.6 GiByte RAM (as reported by the free program), so I did a run on it with 100 million lines (14.3 GiByte). I didn't use the notebook because I was to lazy to generate a 31 GiByte test file. :P

This run only contains some of the programs that can handle files that are larger than the memory. Either by iterating line-by-line or by explicitly memory mapping the file. While all the data might still fit into memory I didn't want to risk thrashing.

Searching a 14.3 GiByte file on a PC with 15.6 GiByte RAM and 104 MiByte/s disk read speed.

grep and the search-by-line C program are straight out limited by the read speed of the disk (104 MiByte/s). Where we previously had hundreds of MiByte/s or even several GiByte/s we're now down to whatever the disk can provide. Remember, this one chart shows benchmark results from the PC. There the file was stored a HDD (104 MiByte/s), not an SSD like on the notebook (1.64 GiByte/s).

It starts to get interesting when the first program runs that explicitly uses memory mapping. The first time it still reads the entire data from disk. But after that it and all the following programs (also using memory mapping) get the data from the page cache.

I'm not sure why that is. But given that this is a corner case I didn't dig any deeper. I can't really see myself wanting to use pretty much all of my memory as a cache for a substring search. ;)

A bit of perspective

As much fun as it is to optimize the brute-force approach… at some point it will no longer be the right tool to solve the problem.

If your data is just a few MiBytes in size, or even a few hundred MiBytes, the brute-force appraoch is quite attractive. You only need to write the file with all the data and execute a grep command when someone does a seach. All in all just a few lines of code. Pretty nice complexity to performance ratio I think :)

Personally pretty much all the projects I did for customers were in that category (e.g. entire database SQL dump of a few hundred MiBytes). So for me those performance numbers are actually kind of relevant. :o

But sometimes you have a lot more data, and at some point brute-force just isn't viable anymore. Either because the text file gets to large or you don't want to waste the memory and I/O bandwidth. For those situations other tools are better suited to get the job done. Databases have indices for a reason (but I don't know a good index structure for substring search).

In that situation my first approach would be: Maintain a list of unique words in the searchable data. Use that for an autocomplete on the client side so the server only ever gets full words (not substrings). Then use a word based search on the server (a simple Python dict lookup or a hash table, a text index in SQL, Elasticsearch, whatever).

But again: Only if the naive approach would be to slow or large. After all a dozzen lines of code are quickly thrown away. And sometimes the naive approach might carray you further than you think.

In the end it isn't about which solution is the "best", but instead about which solution is a better fit for a given problem in a specific environment. And I hope this blog post at least somewhat hinted at the boundries of the brute-force search method. That it showed the complexity-to-performance sweet spot for this approach: When it can be used and when it stops being useful.

If anyone is reading this far: Wow, I'm amazed at your endurence. Take a cookie and enjoy the holidays. :)

sdt_dead_reckoning.h - Single header file library to create signed distance fields


Another small C function that came out of the project I'm currently working on. This one is to create a signed distance field out of a black and white image (see pictures below). Signed distance fields are quite useful for rendering 2D or 3D shapes on the GPU. I came across them while investigating text rendering but they're very versatile constructs.

While searching for a function to create one I stumbled across the paper "The dead reckoning signed distance transform" by George J. Grevera. It contained a quite nice pseudo-code implementation of the algorithm so I ended up implementing it in C. This is how code using it looks like:

// Load the mask image with stb_image.h
int width = 0, height = 0;
uint8_t* mask = stbi_load("mask.png", &width, &height, NULL, 1);

// Allocate and create the distance field for the mask. Pixels > 16 in the
// mask are considered inside. Negative distances in the distance field are
// inside, positive ones outside.
float* distance_field = malloc(width * height * sizeof(float));
sdt_dead_reckoning(width, height, 16, mask, distance_field);

// Create an 8 bit version of the distance field by mapping the distance
// -128..127 to the brightness 0..255
uint8_t* distance_field_8bit = malloc(width * height * sizeof(uint8_t));
for(int n = 0; n < width * height; n++) {
    float mapped_distance = distance_field[n] + 128;
    float clamped_distance = fmaxf(0, fminf(255, mapped_distance))
    distance_field_8bit[n] = clamped_distance;

The distance field itself is written into an float buffer. I've added the "mapping" part above because in most use cases you want to store the distance field not as floats but as something else (e.g. as 8 bit image). The function leaves that to you because it would make the API quite a lot more bloated.

Below is an example of an 500x500 pixel black and white mask image and the distance field computed for it. On my old Intel Core i5-3570K @3.40GHz it took about 5ms to compute it (single threaded, compiled with -O2 -ffast-math). Not useful for fancy realtime stuff but fast enough for my needs. And above all: Simple to use. :)

A mask image for a 2D shape. White pixels are considered inside the shape, black pixels outside (pretty intuitive for humans).
The resulting distance field. Each pixel contains the distance (in pixels) to the outline. Gray (128) is a distance of 0, white (255) is 127 pixels outside of the outline and black (0) is 128 pixels inside the outline.

While it doesn't look like much a distance field is really easy to render with a GLSL shader. And you get high-quality anti-aliasing almost for free. Another very handy thing is that you can shrink or grow the shape by simply adding or subtracting a value from the distance in the shader. And a lot of other cool things.

Without going into to much detail… if you have to render 2D shapes look into signed distance fields. :)

iir_gauss_blur.h - A Gaussian blur single header file library


A few days ago I needed a C function to blur an image. I had an old implementation of a Gaussian blur filter lying around so I reused that. It worked rather well and it's just one C function so I decided to clean it up and package it into a single header file library: iir_gauss_blur.h. This is what it looks like:

int width = 0, height = 0, components = 1;
uint8_t* image = stbi_load("foo.png", &width, &height, &components, 0);

float sigma = 10;
iir_gauss_blur(width, height, components, image, sigma);

Here stb_image.h is used to load the image before it is blurred. iir_gauss_blur.h compiles as C and C++ so you can just plug it in and use it (tested it on GCC only though).

The function is an implementation of the paper "Recursive implementation of the Gaussian filter" by Ian T. Young and Lucas J. van Vliet. It has nothing to do with recursive function calls, instead it's a special way to construct a filter. Other (convolution based) gauss filters apply a kernel for each pixel and the kernel grows as the blur strength (sigma) gets larger. Meaning their performance degrades the more blurry you want your image to be.

Instead the algorithm in the paper sweeps across the image four times: From left to right, right to left, top to bottom and bottom to top. This is enough to propagate the blur across the entire image, no matter how large or small the blur strength (sigma) is. Thanks to that the performance always stays the same. You can have some ridiculously large sigmas without any performance hit.

The meaning of life… eh, sigma (σ)

Sigma (σ) is a number (at least 0.5) that defines the blur strength:

The same image blurred with different sigmas. The original (top left), sigma 1.0 (top right), sigma 5.0 (bottom left) and sigma 10.0 (bottom right).

It seems the usual approach is to start out with some sigma value and then adjust it until you get the blurriness you want.

There's also the concept of a "blur radius" but I have no idea what exactly it's supposed to mean. I suspect it's the kernel size for convolution based algorithms. There seem to be several rules of thumb out there, e.g. radius = 2 * sigma. So if you want to have a blur radius of 10px you can use sigma = (1.0 / 2.0) * radius to get the sigma for it (5.0). I also found other interesting equations here and here (again for kernel sizes I guess).

All this got me thinking about what a useful "blur radius" would actually mean for me. Mentally I think that the filter blurs each pixel and I want to configure the radius by which a pixel is blurred. In my current project I have two difference scenarios where that's important for me:

I tried to solve the equation of the normal distribution to calculate the proper sigma for a given radius and threshold (e.g. 16/256 or 1/256 as above). But after a bit of tinkering (and a lot of math confusion) I gave up and went the empirical way: I made a test image and blurred it with different sigmas. Then used GIMP to measured how far the white seeped into the black region (again for the thresholds of 16 and 1). This is what I got for the threshold of 16:

A white bar (top and bottom) blurred with different sigmas. The red bars show the distance from the original white bar until the brightness drops below 16.

I did the measurements one image (and sigma) at a time. Combining many into the image above makes the lower gray values a bit more visible than they're otherwise. Anyway, I did the same for the threshold of 1 and plugged it into Wolfram Alpha to approximate the values with a linear equation. Again, after some fiddling around this is what I got:

sigma = (1.0 / 1.42) * radius16
sigma = (1.0 / 3.66) * radius1

It seems to work more or less well for a sigma up to 100. At least that's good enough for my use cases.

But given how crude these approximations are take them with a grain of salt. Or even better a whole tee spoon of salt. They work for me but might not work for anything you have in mind. That's why I decided to leave these calculations out of the iir_gauss_blur() function.

Just in case they work for you and you also want to use them via the command line I created a corresponding CLI tool.

By the way: The algorithm looks like it could be implemented quite nicely on a GPU with OpenCL or compute shaders. I assume for small sigmas a convolution based GLSL implementation is probably faster. It can use the bilinear interpolation of the hardware and is probably more local and texture cache friendly. But for larger sigmas the "recursive" algorithm will outperform it, even if it needs 4 passes over the data. Also its performance doesn't degrade with the blur strength so that might be the killer argument in some cases. Anyway, I'm very happy with the (single-threadded) CPU performance so I won't do that any time soon.

I hope this stuff is useful to someone or was at least interesting. :)

js2plot - plot 2D functions written in JavaScript


A while ago I wanted to plot a simple mathematical function for a project of mine. No problem, there are a lot of good function plotters out there. Then I stumbled over the polynomial smooth min function by Inigo Quilez. It was written for GLSL shaders so it uses functions like clamp() and mix():

float smin( float a, float b, float k )
    float h = clamp( 0.5+0.5*(b-a)/k, 0.0, 1.0 );
    return mix( b, a, h ) - k*h*(1.0-h);

Ok, getting that into one of those plotters seemed like a hassle. Some of those tools can do amazing things, but I’m a programmer so I wanted to write the functions in some programming language. I found code snippets and small libraries to draw function plots into a canvas but nothing simple with panning and zooming. I know there’s insert your favorite tool here but my Google bubble wasn’t trained for that and I didn’t find it. Of course there are the heavy-weights like Octave or CindyJS but those didn’t appeal to me at all.

I think you can guess the rest. :D How hard can it be to write a simple plotter with panning and zooming? So I wrote a little tool for my own personal needs: js2plot (GitHub project).

It executes a piece of JavaScript and in there you can call plot() to show the graph of the function. You can then explore the graph with panning and zooming. Or add more functions, or use variables, JavaScripts Math functions, and so on. All in all nothing special but for me it’s more convenient to write mathematical code that way (see smooth min example).

In case that you have the same problem I hope the tool is useful to you. The plotting part is written as a small library without dependencies. With it you can turn a canvas element into a graph with panning and zooming. But you have to provide a way for users to write the plotting code.

ps.: I only use the tool on desktops and notebooks so I didn’t implement touch panning and zooming. Panning isn’t a problem but handling to a pinch-zoom still seems like a pain. Reimplementing that gesture isn’t my idea of fun so I skipped it for now.

Chromium and WebKit bug when switching stylesheets


A few days ago I implemented a style switcher for my latest project. Unfortunately I ran into a rather obscure Chromium (and WebKit) bug. This post should give a small heads up to anyone running into the same weird effects. I opened an issue for the bug, so please post your comments there.

To switch between styles you just have to disable the current stylesheet and enable another one. I had no trouble with that in Firefox and Internet Explorer. But in Chromium and WebKit enabling the other stylesheet doesn't work the first time. It just stays disabled. This leaves the page unstyled since the first stylesheet gets disabled (disabling works). The bug only happens the first time. When you disable it and enable it again it works.

And that’s also the workaround: Instead of just enabling the other stylesheet you can enable it, disable it and then enable it again. All in one go.

Time to illustrate this with some code. This is the code of a small page usually styled by blue.css. As soon as the page is loaded the style is switched to green.css:

<!DOCTYPE html>
<title>Switch when loaded</title>
<link href="blue.css" rel="stylesheet" title="blue">
<link href="green.css" rel="alternate stylesheet" title="green">
    window.addEventListener("load", function(){
        var blue_style = document.querySelector("link[title=blue]");
        var green_style = document.querySelector("link[title=green]");

        blue_style.disabled = true;
        green_style.disabled = false;

Note that green.css is an alternate stylesheet, meaning it’s not used by default (the browser ignores it). When the page is loaded the styles’ disabled attributes are set accordingly. That attribute is part of the HTMLLinkElement interface. Or you can use the newer StyleSheet interface in which case it would be green_style.sheet.disabled = false; (same bug applies).

The workaround looks rather silly but is effective never the less:

<!DOCTYPE html>
<title>Switch when loaded</title>
<link href="blue.css" rel="stylesheet" title="blue">
<link href="green.css" rel="alternate stylesheet" title="green">
    window.addEventListener("load", function(){
        var blue_style = document.querySelector("link[title=blue]");
        var green_style = document.querySelector("link[title=green]");

        blue_style.disabled = true;
        green_style.disabled = false;
        // Workaround for Chromium and WebKit bug (Chromium issue 843887)
        green_style.disabled = true;
        green_style.disabled = false;

You can try it with this minimalistic standalone style swticher. The styles just change the color of the first paragraph. When it’s black the page is unstyled. This only happens when you load the page and click on "green". Everything else after that or every other combination of clicks works.

Finding the cause of the bug was quite maddening… and I hope this post spares you the same experience.

smtp_send - A simple PHP function to send mails


A few years ago I wrote a small PHP function to send mails via SMTP. Not so long ago a friend asked for it. So I cleaned it up and wrote a bit of documentation: smtp_send. It won’t work for everyone but it’s only about 90 lines of code and you can easily change it to suite your needs.

If you don’t care about dependencies or code footprint you can use one of the many libraries out there. But if you’re tired of PHPs mail() function, overly complicated APIs or huge amounts of dependencies just to send mails this might be worth a look. The function is meant for people who want to construct the mails themselves. Either for simplicity or because of more complex cases. Then you can use the function to simply send the result.

Usage example:

$message = <<<EOD
From: "Mr. Sender" <sender@dep1.example.com>
To: "Mr. Receiver" <receiver@dep2.example.com>
Subject: SMTP Test
Date: Thu, 21 Dec 2017 16:01:07 +0200
Content-Type: text/plain; charset=utf-8

Hello there. Just a small test. 😊
End of message.
smtp_send('sender@dep1.example.com', 'receiver@dep2.example.com', $message,
    'mail.dep1.example.com', 587,
    array('user' => 'sender', 'pass' => 'secret')

I put it online because my friend searched for a simple PHP mailer but couldn’t find one matching the projects requirements. I had the same experience a few years ago. In the end decided to look at the SMTP RFC and write my own little function. With a bit of luck this makes your life easier if you’re caught in same corner cases.

The documentation also contains examples of mails with attachments, alternative content (HTML mail with plain text fallback) or both (alternative content nested in mixed content). For some reason this information seems to be surprisingly hard to find.

Measurements of system call performance and overhead


I was kind of bored for once and figured I could clean up an old microbenchmark and kick the results out of the door. This is gonna be another very technical post for programmers. Be prepared. ;)

A while ago I came across a paper about "exokernels", very light-weight operating system kernels that leave most of the work up to libraries (Exokernel: An Operating System Architecture for Application-Level Resource Management). The paper is from 1995 and among other things they implemented high performance system calls.

I always heard system calls are expensive but I never saw any real numbers or measurements. Only the stuff the C standard library was doing with fread() and fwrite(): Batching I/O to minimize system calls. But system calls evolved a lot since the x86 32 bit days and today we have a dedicated syscall instruction for them.

This got me wondering: How fast are system calls on Linux today compared to normal function calls? In the paper they compared a function call with no arguments to the getpid() system call. That system call just does the minimum of work: Switch into kernel mode, read one variable and return its value. Sounds simple, so I wrote a set of small microbenchmarks and did the same on my Linux notebook.

When I asked my brother for some old CPUs he gave me this…
…an entire stack of CPU chips.

Back when reading the paper I just ran the microbenchmarks on my own notebook (an old Intel Core 2 Duo T6600 from 2009). But right now I'm with my brother and he owns quite a few computers and I asked him to run the microbenchmarks on some of them. We also run the benchmarks on some older CPUs (see the photo :)). Anyway, it gave me a nice perspective of how the performance of system calls evolved over time.

Mind you, I just took whatever PC was close by (no idea how fast or slow) as well as a few older CPUs. I was curious how system call performance changed over the years on x86_64 and don't really care about what CPU is faster or slower. Don't take these numbers and say that CPU X is faster than CPU Y. It just doesn't make sense. System calls are one of many factors that make up the performance of a CPU and software but by no means a defining one. So don't take the numbers to serious. This is just for fun after all. ;)

Before we get to the numbers I have to explain a bit about how Linux optimizes some system calls: On x86 64 bit system calls are usually made through the syscall instruction. You put the arguments into the proper registers and the syscall instruction then transitions into kernel mode and calls the proper kernel function. This incurs some overhead since the CPU has to switch into a different address space and this might need updates to the TLB, etc.

For some system calls the Linux kernel provides optimized versions via the vDSO. In some cases these optimized versions don't need to transition into kernel space and thus avoid the overhead. By default the C runtime on Linux (glibc) uses these optimized functions automatically whenever there is one available. I'm interested in how both of these mechanisms perform.

On to the numbers. Just for a bit of context: Reading a value from the main memory takes about 100ns, see Latency Numbers Every Programmer Should Know.

Syscalls incur quite a heavy overhead compared to normal function calls. That got better with time. The vDSO implementation of the getpid() system call is pretty good at mitigating the system call overhead and is almost as fast as a normal function call. On the Intel Celeron D 341 from 2004 the a system call via the syscall instruction was about 25 times slower than a system call via the vDSO. On the Intel Core i7-4790K from 2014 it's only about 12 times slower. For me I'll use 10 times slower as rule of thumb for modern CPUs and 25 times for older CPUs.

In case you're wondering about the details:

Performance of fread() and fwrite() I/O batching

I was also wondering about the I/O batching the standard C library does with fread() and fwrite(). Is it really worth the effort? Or are that many function calls wasted since system calls can be pretty fast, too?

A few microbenchmarks answered that question, too. This time I measured 1 million 4 byte reads and writes. Again directly via the syscall instruction and via the vDSO. But this time also via fread() and fwrite() instead of using system calls. Here's what I got:

So, yeah, in this scenario the I/O batching definitely is worth it. It's about 10 to 5 times faster.

But keep in mind, these were 4 byte reads and writes. I took a look with strace and they got batched into 4 KiByte reads and writes. So when you're reading or writing a few KiBytes the speedup is likely not that large.

It's also interesting that the vDSO doesn't help much for the read() and write() system calls. But usually it doesn't hurt either. So maybe these system calls can't be optimized in that way.

If you're still with me: Thanks for reading. I hope my strange little experiment was interesting. :)

Math 3D - A simple vector and matrix library for C


If you've done a bit of graphics programming with OpenGL you're probably familiar with vector and matrix math. Every point or direction in 3D space can be represented as a vector and most manipulations of those can be represented as matrices (moving them around, rotating them, projecting them on the screen, etc.).

I wasn't happy with the C math libraries I found so I wrote a new one (surprise!). Read on if you want to know why. But here is what using it looks like:

mat4_t projection = m4_perspective(60, 800.0 / 600.0, 1, 10);
vec3_t from = vec3(0, 0.5, 2), to = vec3(0, 0, 0), up = vec3(0, 1, 0);
mat4_t transform = m4_look_at(from, to, up);

vec3_t world_space = vec3(1, 1, -1);
mat4_t world_to_screen_space = m4_mul(projection, transform);
vec3_t screen_space = m4_mul_pos(world_to_screen_space, world_space);

OpenGL provides a lot of vector and matrix math for shaders that run on the GPU but on the CPU you have to do it yourself. I like it this way because it allows you to choose what kind of 3D math library you want: One that tries to optimize every last bit of performance but is complicated to use or one that is easy to use but lax on performance. Depending on the project you can pick what you need.

I do pretty much all of my low level graphics programming in C (I know, I'm weird, but OpenGL is a C API after all). And I haven't found a 3D math library I'm happy with for C. I don't know all of them but the ones I've looked at were complicated to use and vague on their semantics. Where they written with the OpenGL or Direct3D conventions in mind? How are matrices laid out in memory? Can I pass them into OpenGL directly or do I have to transpose them before hand?

There are good C++ math libraries that perfectly fit my purpose but I'm an awful C++ programmer. I spend way more time fiddling around with the language than I spend on solving the problem. So I stayed with C. Maybe someone else will find non-templated math code similarly appealing.

All this drove me to write my own small 3D math library for C. Just the basics, nothing fancy. A friend of mine joined in and together we started out from scratch. It was quite a nice learning experience. We spend a lot of time on the whiteboard, calculated a lot of the math by hand and wrote a lot of tests until we understood what the math should actually mean. It was just there that we realized that the perspective divide step in OpenGL is just an fixed function hack to fit a perspective projection into a 4x4 matrix. Kind of pointless with shaders...

Anyway, if you ever do OpenGL stuff in C and need a simple to use math library you can pick it up here: math_3d.h. It's a single header-file library in the style of stb_image.h so it's easy to integrate into projects.

It covers basic 3D vector math, transformation and camera matrices (translation, rotation, scaling and look at), projection matrices (orthographic and perspective) as well as basic matrix math (matrix-matrix, matrix-point and matrix-direction multiplication, inversion of affine transformations). The documentation is in the file itself so take a look if you're interested.

Happy programming. :)

Thanks for scrolling down all the way, it can get quite lonely here…
Anyway, looking for older entries? Want to know more? Take a look at the archive.