K8s-mtp: a multi-tenant Kubernetes platform pt. 2
In part 1 we saw the foundation of this project take shape: a zero-framework HTTP server with structured logging. Solid foundations are very important, but now it's time to have a go (pun intended) at the fundamental part of this project: we want to automatically provision secure, isolated namespaces with proper RBAC. This is the "paved road" I wanted to build, and these are the first steps to getting there!
The Bad Stuff (and why it's good!)
When I started this project, one of my main objectives was to keep external dependencies to a bare minimum. But I also know that this can often lead to reinventing the wheel, and with certain complex subjects that is a huge pitfall: a quick and dirty implementation is prone to be extremely faulty, and a correct one would take forever. So there must be situations where we fetch help from the outside.
While I was building this part of the project, two areas where I'd need to compromise jumped right out: on one hand, the Kubernetes API; and on the other, the usage of controller-runtime. And while this frustrates me a little bit, I do realize that this is the only pragmatic solution. As I said, a proper implementation of either of these components would take me forever to build, and I very much doubt I could do it without introducing a whole host of bugs. So, while it's (for me) bad that I'm using these external dependencies, the truth of the matter is that this is bound to improve the quality of this project, as these are very mature projects used in thousands and thousands of production environments. So, good news!
Guardrails are here
I had promised some guardrails, and I'm proud to say that they are here. They're not very comprehensive yet, that's true, but we can certainly see them in place. One example of how these guardrails work is the differentiation between an admin and an operator. This is, in effect, the translation of business requirements into Kubernetes primitives via the API we imported earlier. We can see this in practice by having a look at the code that instantiates the respective roles (some parts omitted for brevity):
func buildAdminRole(t *v1.Tenant) *rbacv1.Role {
    return &rbacv1.Role{
        Rules: []rbacv1.PolicyRule{
            {
                APIGroups: []string{""},
                Resources: []string{
                    "pods", "pods/log", "pods/status", "pods/exec",
                    "services", "services/proxy", "endpoints",
                    "configmaps", "secrets", "serviceaccounts",
                    "events", "limitranges", "podtemplates",
                },
                Verbs: []string{
                    "get", "list", "watch", "create", "update",
                    "patch", "delete", "deletecollection",
                },
            },
        },
    }
}
func buildOperatorRole(t *v1.Tenant) *rbacv1.Role {
    return &rbacv1.Role{
        Rules: []rbacv1.PolicyRule{
            {
                APIGroups: []string{""},
                Resources: []string{
                    "pods", "pods/log", "pods/status", "pods/exec",
                    "services", "endpoints",
                    "configmaps", "secrets",
                    "events",
                },
                Verbs: []string{
                    "get", "list", "watch",
                    "create", "update", "patch",
                    "delete",
                },
            },
        },
    }
}
For now, all these rules are hard-coded into the binary this project produces, but it wouldn't be difficult to make them configurable. The question is: do you trust yourself with this responsibility? (more on this later)
Another place where the guardrails are being erected is each namespace's resource consumption. I implemented a builder for resource quotas that guarantees you won't have runaway costs, while also making resource allocation much more predictable. This is how each quota is built:
func (r *TenantReconciler) buildResourceQuota(t *v1.Tenant) *corev1.ResourceQuota {
    return &corev1.ResourceQuota{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "tenant-quota",
            Namespace: fmt.Sprintf("tenant-%s-%s", t.Name, t.Spec.Name),
            OwnerReferences: []metav1.OwnerReference{
                {
                    APIVersion: v1.SchemeGroupVersion.String(),
                    Kind:       "Tenant",
                    Name:       t.Name,
                    UID:        t.UID,
                },
            },
        },
        Spec: corev1.ResourceQuotaSpec{
            Hard: corev1.ResourceList{
                corev1.ResourceCPU:    resource.MustParse("500m"),
                corev1.ResourceMemory: resource.MustParse("1Gi"),
                corev1.ResourcePods:   resource.MustParse("20"),
            },
        },
    }
}
This is just one example of how a quota can be built; the logic is always the same. For now there's only a single quota in place, but by default I expect to ship at least three, with the possibility of creating configurable quotas via a configuration file.
The Reconciler Pattern
Perhaps the most central piece of code in this whole project is the reconciler. This is what allows us to translate the desired state, expressed in a configuration, into the actual state of a cluster. And this process has two properties that are fundamental for a system like this: it's an ongoing convergence that continuously reconciles the two states; and it's idempotent, which is a must-have, as it makes no sense for the same configuration to generate two different states.
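To make "desired state expressed in a configuration" concrete, here's what a Tenant resource could look like. This manifest is illustrative: the API group and spec fields are placeholders, not the project's actual CRD schema.

```yaml
# Hypothetical Tenant manifest; apiVersion and spec fields are placeholders.
apiVersion: platform.example.com/v1
kind: Tenant
metadata:
  name: acme
spec:
  name: payments   # combined with metadata.name into namespace "tenant-acme-payments"
```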
This pattern is expressed in 4 different phases: Fetch, Build, Apply, and Update/Report. In terms of code, we can see it in the following snippet:
func (r *TenantReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Fetch the Tenant CR
    tenant := &v1.Tenant{}
    if err := r.Get(ctx, req.NamespacedName, tenant); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Build desired state
    desiredNamespace := r.buildNamespace(tenant)
    desiredQuota := r.buildResourceQuota(tenant)
    // ... more builders

    // 3. Apply/create each resource
    if err := r.createOrUpdate(ctx, desiredNamespace); err != nil {
        return ctrl.Result{}, err
    }
    // ... etc

    // 4. Update status
    tenant.Status.Phase = v1.TenantPhaseActive
    // ...
    return ctrl.Result{}, nil
}
The Fetch part is simple to grasp: using the controller-runtime package we imported before, we try to get the information for a given Tenant. After we have that information, we build the desired state, and we already saw how that works in the previous section. That leaves the last two. Update/Report doesn't have much to it, other than being important from a systems-audit perspective. So we're left with the Apply part, which is where most of the magic happens.
The createOrUpdate function is indeed kind of magic. Let’s have a look and break it apart:
func (r *TenantReconciler) createOrUpdate(ctx context.Context, obj client.Object) error {
    if err := r.Create(ctx, obj); err != nil {
        if !errors.IsAlreadyExists(err) {
            return fmt.Errorf("failed to create: %w", err)
        }
        key := client.ObjectKeyFromObject(obj)
        current := obj.DeepCopyObject().(client.Object)
        if err := r.Get(ctx, key, current); err != nil {
            return fmt.Errorf("failed to get existing: %w", err)
        }
        if !needsUpdate(current, obj) {
            return nil
        }
        obj.SetResourceVersion(current.GetResourceVersion())
        if err := r.Update(ctx, obj); err != nil {
            return fmt.Errorf("failed to update: %w", err)
        }
    }
    return nil
}
While this looks like a simple function, that's somewhat deceptive, because we make use of a trick. We start by optimistically trying to create every resource, and while this might seem odd at first, the pattern is enabled by the fact that the Kubernetes API's errors package can tell us, via IsAlreadyExists, that an object is already there. So, in fewer than five lines, we've already taken care of two thirds of the possible states: we try to create the resource and check for errors; if there were none, it was created successfully; if there was an error, we return it, except when the error says the resource already exists. Only in that situation do we need to worry about the more complicated logic of handling an update.
The Good Stuff (and why it's really good!)
As it is, this project already displays some quite nice basic features that are bound to be extremely useful for production environments.
One such case is the self-healing built into the system. If an admin deletes (say) a Role by mistake, our reconciler will make sure it's recreated quickly; and if permissions are changed manually, the drift will be detected and reverted. It's hard to overstate how important this is in a production environment.
Another good feature is that since our state management configuration is declarative, we can build a GitOps system to operate everything. This opens up a whole avenue for integrating already-existing systems, and allows better auditing of the system at a different layer. And while we're talking about auditing, the fact that every change has to go through the controller means changes are remarkably easy to audit.
What's Next
For the next phase of this project, I'll concentrate on extending the existing features and on improving the overall security of the system, by making network isolation mandatory and enforcing pod privilege restrictions. Stay tuned!