Concurrent applications need a way to let different tasks make progress at the “same” time. The most widely used mechanism for this is the thread, an execution unit provided by the operating system. However, multithreading comes with significant overhead, especially when each thread only performs small, lightweight tasks.

To understand the overhead introduced by multithreading, we need to look into how the operating system implements threads. In Linux, a thread is just a process that shares resources like virtual memory with its parent process. Therefore, when a thread yields control and begins to wait on an event, the operating system takes over, finds the next suitable process to run, and switches to it, just as it would for an ordinary process. This context switch introduces significant overhead for almost all kinds of multithreaded applications, especially those requiring frequent message passing between threads. Additionally, when a thread is created or destroyed, the operating system has to initialize or tear down a large chunk of process state (task_struct in Linux) along with the associated resources (user stack, etc.), which costs both memory and CPU time, and particularly hurts applications that need to run millions of tasks concurrently or frequently create and destroy tasks.

Coroutines are a technique for reducing multithreading overhead. The idea behind coroutines is that, instead of letting the OS manage our tasks, we manage them ourselves in userspace.

Stackless Coroutines

There are two classes of coroutine systems, stackless and stackful. A stackless coroutine system is usually implemented with language-level transformations, such as CPS transformation or a switch-based method similar to Duff’s Device. A program in CPS (Continuation Passing Style) form, as its name suggests, passes the results of computations not via return values but via “continuations”, which looks like this:

// Original form
let a = f1()
let b = f2(a)
print(b)

// CPS form
f1(
    a => f2(
        a,
        b => print(b)
    )
)

If you are familiar with JavaScript, the code above should remind you of JavaScript callbacks. JavaScript is by design single-threaded and asynchronous, and blocking operations have to be implemented with callbacks. In fact, such callbacks are an explicit form of CPS, regardless of whether Promise is used to avoid problems like callback hell. To simplify programming, async/await syntax was introduced later; it lets users write code that looks synchronous but is in fact asynchronous, and it is implemented with automatic CPS transformation.

Another approach to stackless coroutines, which resembles Duff’s Device, makes use of a big switch statement:

// Original form
int f1(int a) {
    int i, v = 0;
    for(i = 0; i < a; i++) v += i;
    return v;
}

// Stackless coroutine: the return value is the next state to resume from,
// or -1 when the coroutine has finished and *out holds the result.
struct Local {
    int i;
    int v;
};
int f1(int a, int state, struct Local *local, int *out) {
    switch(state) {
        case 0: // entry: initialize the locals
            local->i = 0;
            local->v = 0;
            return 1;
        case 1: // one loop iteration per resume
            if(local->i < a) {
                local->v += local->i;
                local->i++;
                return 1;
            } else {
                *out = local->v;
                return -1; // finished
            }
    }
    return -1; // unreachable for valid states
}

In the code above, we maintain the execution state of f1 outside of the function itself, which allows us to suspend and resume its execution at any point, as long as we preserve state and local.
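
As a quick illustration, here is a minimal driver loop for f1. It is not part of the original example; it simply follows the convention above that the return value is the next state and -1 means the coroutine has finished:

// Hypothetical driver: keep resuming f1 until it reports completion.
int run_f1(int a) {
    struct Local local;
    int out = 0;
    int state = 0;
    while (state != -1)
        state = f1(a, state, &local, &out);
    return out; // sum of 0 .. a-1
}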

Stackful Coroutines

Stackless coroutines are “stackless” because they do not explicitly preserve the call stack, while stackful coroutine systems, like native threads, allocate a stack for each coroutine. Compared to stackless coroutines, stackful coroutines usually do not need language-level transformations, which makes them easier to integrate into different programming languages.

When the OS switches tasks, it needs to save the machine state of the old task and load that of the new one. A machine state contains all registers in use, along with the in-memory data that must be preserved, e.g. the call stack. Since we are going to switch coroutines in userspace, we have to implement the machine-state switching logic ourselves.

For convenience, in this section we assume that our platform is Linux on an x86_64 machine and that we are implementing cooperative scheduling. Therefore, we need to follow the x86_64 System V ABI when dealing with calling conventions. The ABI specifies that the registers rbx, rbp, r12, r13, r14 and r15 are callee-saved, so they must be saved before switching to another coroutine.

Let’s start from the coroutine switching function:

asm(
    ".globl co_switch\n"
    "co_switch:\n"
    // Save the current coroutine's callee-saved registers on its stack.
    "push %rbx\n"
    "push %rbp\n"
    "push %r12\n"
    "push %r13\n"
    "push %r14\n"
    "push %r15\n"
    // Swap the saved stack pointer at (%rdi) with the current %rsp.
    "mov (%rdi), %rax\n"
    "mov %rsp, (%rdi)\n"
    "mov %rax, %rsp\n"
    // Restore the target coroutine's callee-saved registers and return into it.
    "pop %r15\n"
    "pop %r14\n"
    "pop %r13\n"
    "pop %r12\n"
    "pop %rbp\n"
    "pop %rbx\n"
    "ret\n"
);

The co_switch function pushes the current callee-saved registers onto the current stack, swaps (%rdi) (the memory pointed to by the first argument, which holds the target coroutine’s saved stack pointer) with the stack pointer %rsp, pops the callee-saved registers of the target coroutine, and then executes ret to return into the target coroutine.

The code above might be a bit hard to understand without experience in coroutine systems, but I will try to explain it here.

Suppose we have a coroutine A that has just executed the instruction mov %rax, %rsp. At this point, we know that A’s old stack, whose stack pointer has just been stored into the memory at (%rdi), has the following layout (from lower address to higher address):

[r15] [r14] [r13] [r12] [rbp] [rbx] [return_address]

A coroutine can only be switched out through co_switch, so the target coroutine’s stack must have the same layout. Therefore, since we are already on the target stack after executing mov %rax, %rsp, we can pop the saved registers in reverse order, from r15 to rbx. Then the ret instruction pops the return address and jumps to it, and the caller code in the target coroutine observes a normal return from co_switch. Okay, we are on the target coroutine!

However, we have just assumed that a coroutine was already switched out at some earlier point. How do we start a coroutine in the first place?

We have to construct the stack layout:

[r15] [r14] [r13] [r12] [rbp] [rbx] [return_address]

The values of the saved registers can be arbitrary, since a new coroutine has no old register values to restore. To construct a valid “return” address, a trampoline is needed:

asm(
    ".globl pre_call_entry\n"
    "pre_call_entry:\n"
    "pop %rax\n" // entry
    "pop %rdi\n" // userdata
    "call *%rax\n"
    "ud2\n" // trap if entry ever returns
);

pre_call_entry pops a function pointer entry and the associated userdata off the stack, calls entry with userdata as its only argument, and ends with ud2, which traps if entry ever returns (it never should). Therefore, we need to include entry and userdata in our stack construction:

[r15] [r14] [r13] [r12] [rbp] [rbx] [return_address] [entry] [userdata]

With the two assembly functions and the expected stack layout, now we can write the code to start a coroutine:

void co_switch(void **stack);
void pre_call_entry(void);

struct Coroutine;

typedef void (*CoEntry)(struct Coroutine *co);

struct Coroutine {
    void *stack;     // saved stack pointer while the coroutine is suspended
    CoEntry entry;   // coroutine body
    int terminated;  // set once entry has returned
};

void call_entry(struct Coroutine *co) {
    // Yield back to start_coroutine immediately; co->entry only starts
    // running when the coroutine is resumed for the first time.
    co_switch(&co->stack);
    co->entry(co);
    co->terminated = 1;
    // Switch away for the last time; this call never returns.
    co_switch(&co->stack);
}

void start_coroutine(struct Coroutine *co) {
    // co->stack points to the top of the coroutine's stack region; it is
    // assumed to be 16-byte aligned so that %rsp is properly aligned when
    // pre_call_entry performs the call.
    void **stack = (void **) co->stack;

    *(--stack) = co;                      // userdata, popped by pre_call_entry
    *(--stack) = (void *) call_entry;     // entry, popped by pre_call_entry

    *(--stack) = (void *) pre_call_entry; // "return address" consumed by ret in co_switch

    *(--stack) = 0;                       // rbx
    *(--stack) = 0;                       // rbp
    *(--stack) = 0;                       // r12
    *(--stack) = 0;                       // r13
    *(--stack) = 0;                       // r14
    *(--stack) = 0;                       // r15

    co->stack = (void *) stack;

    // Enter the coroutine once; call_entry switches straight back, leaving
    // the coroutine parked and ready to be resumed by co_switch(&co->stack).
    co_switch(&co->stack);
}

The C code above should be trivial if you’ve understood how co_switch and pre_call_entry work.
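
To see how the pieces fit together, here is a minimal usage sketch. It is not part of the original code: the stack size, the entry function my_task and the malloc-based allocation are my own assumptions, and the allocation is assumed to return 16-byte-aligned memory (glibc’s malloc does on x86_64):

#include <stdio.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

// Hypothetical coroutine body: yields once, then finishes.
void my_task(struct Coroutine *co) {
    printf("step 1\n");
    co_switch(&co->stack); // yield back to the resumer
    printf("step 2\n");
}

int main(void) {
    struct Coroutine co = {0};
    char *region = malloc(STACK_SIZE);

    co.stack = region + STACK_SIZE; // the stack grows downward from the top
    co.entry = my_task;

    start_coroutine(&co);           // set up and park the coroutine
    while (!co.terminated)
        co_switch(&co.stack);       // resume until my_task finishes

    free(region);
    return 0;
}

The expected output is “step 1” followed by “step 2”, with control bouncing back to main between the two prints.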

The full example code is available on my GitHub Gist.

Optimizations and Tricks

  • ret in co_switch can be replaced by an explicit pop and an indirect jmp to avoid return-address prediction failures (see the first sketch after this list).
  • If your coroutines are extremely lightweight and usually use only a small portion of the stack, we can use a fixed memory region as the execution stack (a “shared stack”) and copy the used range (stack_end - sp) of the coroutine stack in and out when switching a coroutine in or out. Although this requires memory copying, it can save a lot of memory while still allowing guard pages to be used for safety (see the second sketch after this list).
  • To implement cross-platform stackful coroutines without writing assembly, the pthread_* functions can be used along with setjmp and longjmp to initialize the machine state on a new stack and switch between stacks, without actually using OS threads.
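
For the first item, the change only affects the tail of co_switch. The variant below (named co_switch_nr purely for illustration; it is not part of the original code) pops the return address explicitly and jumps to it through %rcx, which is caller-saved and therefore free to clobber at this point:

asm(
    ".globl co_switch_nr\n"
    "co_switch_nr:\n"
    "push %rbx\n"
    "push %rbp\n"
    "push %r12\n"
    "push %r13\n"
    "push %r14\n"
    "push %r15\n"
    "mov (%rdi), %rax\n"
    "mov %rsp, (%rdi)\n"
    "mov %rax, %rsp\n"
    "pop %r15\n"
    "pop %r14\n"
    "pop %r13\n"
    "pop %r12\n"
    "pop %rbp\n"
    "pop %rbx\n"
    "pop %rcx\n"  // pop the return address explicitly
    "jmp *%rcx\n" // indirect jump instead of ret
);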
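
For the shared-stack item, the bookkeeping could look roughly like the sketch below. This is my own assumption of how such a scheme might be laid out, not the original implementation: the scheduler, running on its own stack, copies the used range out to a per-coroutine heap buffer after a coroutine yields, and copies it back to exactly the same addresses before resuming, so pointers into the stack remain valid:

#include <stdlib.h>
#include <string.h>

// One fixed region serves as the execution stack for every coroutine.
static char shared_stack[64 * 1024];
#define SHARED_STACK_END (shared_stack + sizeof(shared_stack))

struct SharedCo {
    void *sp;        // stack pointer saved inside the shared region
    char *save_buf;  // heap copy of the used stack range (starts as NULL)
    size_t save_len; // SHARED_STACK_END - sp at switch-out time
};

// Called by the scheduler, on its own stack, after a coroutine yields.
void save_shared(struct SharedCo *co) {
    co->save_len = (size_t) (SHARED_STACK_END - (char *) co->sp);
    co->save_buf = realloc(co->save_buf, co->save_len);
    memcpy(co->save_buf, co->sp, co->save_len);
}

// Called by the scheduler right before resuming a coroutine.
void restore_shared(struct SharedCo *co) {
    co->sp = SHARED_STACK_END - co->save_len;
    memcpy(co->sp, co->save_buf, co->save_len);
}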

Scheduling, I/O, and More

For both stackless and stackful coroutines, we take over the job of task scheduling from the OS. Therefore, a userspace scheduler is needed to let multiple coroutines make progress at the “same” time. Meanwhile, high-performance I/O is one of the most important uses of coroutine-based systems, if not the most important one. The next article will cover scheduling, I/O and more techniques in coroutine-based systems.