tl;dr:

  • 1557 lines of code added
  • 17 fix commits vs 8 feature commits
  • Almost 60% of the commits were related to debugging the platform
  • Complete rewrite of the webhook TLS logic

It’s been a while since I last posted about this project. The reason is not that it was abandoned, quite the contrary: in the meantime, I’ve been able to add ~1500 lines of code to it. I know this might not sound like a lot in this day and age, but trust me: the problem is never how many lines of code you write, but whether the code is good quality. And at this point, it was time to start testing what we had built so far, so much so that ~60% of my commits in this period have been related to fixing existing issues in the actual implementation. But I’m getting ahead of myself.

Let’s just say: if part 2 was about building the foundations of this platform, this post will be about discovering that foundations crack when subject to real-world pressure!

What’s new

Admission Webhooks, our gatekeeper

If you recall, my promise at the end of part 2 was that I’d start improving the security of the platform. The first step for this, and where most of the new code was added, was the admission webhook. It allows the platform to intercept K8s API requests and validate them against the security standards we want to enforce.

Here the choice was to deploy the webhook separately from the controller and the API calls receiver. Still, it needs to be integrated with them, as they all need to work in tandem. That’s the easy part to implement:

webhookMgr := &webhook.WebhookManager{
	Client:    directClient,
	Logger:    logger,
	Namespace: "k8s-mtp",
	Image:     cfg.WebhookImage,
}

ctx := ctrl.SetupSignalHandler()

if err = webhookMgr.EnsureAll(ctx); err != nil {
	logger.Error("unable to create webhook", "error", err)
	os.Exit(1)
}

The webhook manager then works to guarantee that all the guardrails are in place:

func (w *WebhookManager) EnsureAll(ctx context.Context) error {
	if err := w.ensureTLSSecret(ctx); err != nil {
		return fmt.Errorf("failed to ensure TLS secret: %w", err)
	}

	if err := w.ensureDeployment(ctx); err != nil {
		return fmt.Errorf("failed to ensure deployment: %w", err)
	}

	if err := w.ensureService(ctx); err != nil {
		return fmt.Errorf("failed to ensure service: %w", err)
	}

	if err := w.ensureValidatingWebhookConfig(ctx); err != nil {
		return fmt.Errorf("failed to ensure webhook config: %w", err)
	}

	if err := w.ensureMutatingWebhookConfiguration(ctx); err != nil {
		return fmt.Errorf("failed to ensure mutating webhook config: %w", err)
	}

	return nil
}
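
The exact shape of what ensureValidatingWebhookConfig creates isn’t shown here. Assuming the webhook name that appears in the rejection messages later in this post, it would be along these lines (the Service name, path, and rules are my guesses, not the project’s actual values):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: k8s-mtp-webhook.k8s-mtp.io
webhooks:
  - name: k8s-mtp-webhook.k8s-mtp.io
    clientConfig:
      service:
        name: k8s-mtp-webhook   # hypothetical Service name
        namespace: k8s-mtp
        path: /validate          # hypothetical path
      caBundle: "<base64-encoded self-signed CA>"  # injected from the TLS secret
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        # the subresource is what lets us catch ephemeral containers too
        resources: ["pods", "pods/ephemeralcontainers"]
    failurePolicy: Fail
    sideEffects: None
    admissionReviewVersions: ["v1"]
```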

Of note in the current webhook design is that we validate not only pods and containers, but also ephemeral containers: they’re useful for debugging, but must still obey the constraints set by the policies in place. This is how the first guardrail is set:

func validateContainer(container corev1.Container) (bool, metav1.Status) {
	if container.Resources.Limits == nil {
		return false, metav1.Status{
			Status:  "Failure",
			Message: fmt.Sprintf("Container %s must have resource limits", container.Name),
			Reason:  "MissingResourceLimits",
			Code:    400,
		}
	}

	if container.Resources.Limits.Cpu().IsZero() {
		return false, metav1.Status{
			Status:  "Failure",
			Message: fmt.Sprintf("Container %s must have CPU limits", container.Name),
			Reason:  "MissingCPULimit",
			Code:    400,
		}
	}

	if container.Resources.Limits.Memory().IsZero() {
		return false, metav1.Status{
			Status:  "Failure",
			Message: fmt.Sprintf("Container %s must have Memory limits", container.Name),
			Reason:  "MissingMemoryLimit",
			Code:    400,
		}
	}

	if container.SecurityContext == nil {
		return false, metav1.Status{
			Status:  "Failure",
			Message: fmt.Sprintf("Container %s must have SecurityContext defined", container.Name),
			Reason:  "MissingSecurityContext",
			Code:    400,
		}
	}

	if (container.SecurityContext.Privileged != nil && *container.SecurityContext.Privileged) ||
		(container.SecurityContext.RunAsNonRoot != nil && !*container.SecurityContext.RunAsNonRoot) ||
		(container.SecurityContext.RunAsUser != nil && *container.SecurityContext.RunAsUser == 0) {
		return false, metav1.Status{
			Status:  "Failure",
			Message: fmt.Sprintf("Container %s failed privilege check", container.Name),
			Reason:  "Conflict",
			Code:    400,
		}
	}

	if container.SecurityContext.ReadOnlyRootFilesystem == nil ||
		!*container.SecurityContext.ReadOnlyRootFilesystem {
		return false, metav1.Status{
			Status:  "Failure",
			Message: fmt.Sprintf("Container %s must have read only root filesystem", container.Name),
			Reason:  "Conflict",
			Code:    400,
		}
	}

	if container.SecurityContext.AllowPrivilegeEscalation == nil ||
		*container.SecurityContext.AllowPrivilegeEscalation {
		return false, metav1.Status{
			Status:  "Failure",
			Message: fmt.Sprintf("Container %s must not be allowed to escalate privileges", container.Name),
			Reason:  "Conflict",
			Code:    400,
		}
	}
	return true, metav1.Status{}
}

Through this we achieve the following guarantees: 1) limits are set, and are actually non-zero; 2) a security context is set; and 3) the security context enforces non-privileged containers, with a read-only root filesystem, and explicitly disallows privilege escalation. All of these conditions must be met simultaneously for the webhook to accept the request, apply labels automatically (which lets us track resources), and pass the request on to the reconciler. Finally, we also handle TLS with self-signed certificates issued for a set of DNS names of the webhook service, along with a certificate management lifecycle.

Network Isolation to keep tenants apart

A good rule for any security-aware context, such as the one we seek to implement, is to have network isolation in place. This is achieved by building a policy at the reconciler level (just like we did before with, e.g., the roles) that guarantees a properly locked-down ingress and egress posture, i.e., a default deny-all in both directions with specific allowances:

func (r *TenantReconciler) buildNetworkPolicy(t *v1.Tenant) *networkingv1.NetworkPolicy {
	ns := fmt.Sprintf("tenant-%s-%s", t.Name, t.Spec.Name)

	ingressRules := []networkingv1.NetworkPolicyIngressRule{
		{
			From: []networkingv1.NetworkPolicyPeer{
				{
					PodSelector: &metav1.LabelSelector{},
				},
			},
		},
		{
			From: []networkingv1.NetworkPolicyPeer{
				{
					NamespaceSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{
							"name": "k8s-mtp",
						},
					},
					PodSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{
							r.Config.PlatformAccessLabel: "true",
						},
					},
				},
			},
		},
	}

	egressRules := []networkingv1.NetworkPolicyEgressRule{
		{
			To: []networkingv1.NetworkPolicyPeer{
				{
					PodSelector: &metav1.LabelSelector{},
				},
			},
		},
		{
			To: []networkingv1.NetworkPolicyPeer{
				{
					NamespaceSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{
							"name": "kube-system",
						},
					},
				},
			},
			Ports: []networkingv1.NetworkPolicyPort{
				{
					Protocol: ptr.To(corev1.ProtocolUDP),
					Port:     ptr.To(intstr.FromInt(53)),
				},
				{
					Protocol: ptr.To(corev1.ProtocolTCP),
					Port:     ptr.To(intstr.FromInt(53)),
				},
			},
		},
		{
			To: []networkingv1.NetworkPolicyPeer{
				{
					NamespaceSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{
							"name": "k8s-mtp",
						},
					},
				},
			},
		},
	}

	if r.Config.TenantEgressPolicy == "internet" {
		egressRules = append(egressRules, networkingv1.NetworkPolicyEgressRule{
			To: []networkingv1.NetworkPolicyPeer{
				{
					IPBlock: &networkingv1.IPBlock{
						CIDR: "0.0.0.0/0",
					},
				},
			},
		})
	}

	return &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "tenant-isolation",
			Namespace: ns,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: v1.SchemeGroupVersion.String(),
				Kind:       "Tenant",
				Name:       t.Name,
				UID:        t.UID,
			}},
		},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{},
			PolicyTypes: []networkingv1.PolicyType{
				networkingv1.PolicyTypeIngress,
				networkingv1.PolicyTypeEgress,
			},
			Ingress: ingressRules,
			Egress:  egressRules,
		},
	}
}

In effect, what we achieve here is the following: 1) traffic is only allowed within the same namespace; 2) DNS queries may go to the kube-system namespace (k3s provides the coredns service to enable DNS resolution); 3) the platform services in the k8s-mtp namespace remain reachable; and 4) internet egress is possible if and only if the “TenantEgressPolicy” configuration option is enabled. Of note here is the fact that we disallow traffic between tenant namespaces, which is a critical security boundary!
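
Rendered out as YAML for the test tenant used later in this post, the generated policy would look roughly like the following (the platform-access label key is a stand-in, since the real value comes from r.Config.PlatformAccessLabel):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-test-acme-acme
spec:
  podSelector: {}              # applies to every pod in the tenant namespace
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector: {}      # any pod in the same namespace
    - from:
        - namespaceSelector:
            matchLabels:
              name: k8s-mtp
          podSelector:
            matchLabels:
              platform-access: "true"   # hypothetical PlatformAccessLabel value
  egress:
    - to:
        - podSelector: {}      # same-namespace traffic
    - to:
        - namespaceSelector:
            matchLabels:
              name: kube-system
      ports:                   # DNS only
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    - to:
        - namespaceSelector:
            matchLabels:
              name: k8s-mtp
```

Because the spec lists both policy types, anything not matched by these rules is denied by default, including traffic to and from other tenant-* namespaces.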

LimitRanges for resource guardrails

The LimitRange spec allows us to make sure that, if the admin doesn’t specify limits in their policies, we still apply sane defaults for the different tiers. This prevents resource starvation from a misconfigured pod. Thankfully, this is a relatively simple policy to build:

func (r *TenantReconciler) buildLimitRange(t *v1.Tenant) *corev1.LimitRange {
	ns := fmt.Sprintf("tenant-%s-%s", t.Name, t.Spec.Name)

	var cpu, memory string
	switch t.Spec.Tier {
	case v1.TierFree:
		cpu = r.Config.LimitRangeDefaults.Free.DefaultCPU
		memory = r.Config.LimitRangeDefaults.Free.DefaultMemory
	case v1.TierPro:
		cpu = r.Config.LimitRangeDefaults.Pro.DefaultCPU
		memory = r.Config.LimitRangeDefaults.Pro.DefaultMemory
	case v1.TierEnterprise:
		cpu = r.Config.LimitRangeDefaults.Enterprise.DefaultCPU
		memory = r.Config.LimitRangeDefaults.Enterprise.DefaultMemory
	default:
		cpu = "100m"
		memory = "128Mi"
	}

	return &corev1.LimitRange{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "tenant-defaults",
			Namespace: ns,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: v1.SchemeGroupVersion.String(),
				Kind:       "Tenant",
				Name:       t.Name,
				UID:        t.UID,
			}},
		},
		Spec: corev1.LimitRangeSpec{
			Limits: []corev1.LimitRangeItem{
				{
					Type: corev1.LimitTypeContainer,
					Default: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse(cpu),
						corev1.ResourceMemory: resource.MustParse(memory),
					},
					DefaultRequest: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse(cpu),
						corev1.ResourceMemory: resource.MustParse(memory),
					},
				},
			},
		},
	}
}
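
For a concrete picture, here is what the generated object would look like for the test tenant, using the hard-coded fallback values from the switch above (the real per-tier values come from r.Config.LimitRangeDefaults):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
  namespace: tenant-test-acme-acme
spec:
  limits:
    - type: Container
      default:           # injected as limits when a container specifies none
        cpu: 100m
        memory: 128Mi
      defaultRequest:    # injected as requests
        cpu: 100m
        memory: 128Mi
```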

Testing our platform

As we’d been building a lot of code, I felt this was the time and place to finally start testing that what was built actually works. As I said in the beginning, this is where the bulk of the commits in this phase of the project actually went. While these were mostly small fixes of at most a dozen lines or so, they were fundamental to guaranteeing that things worked as expected.

Right off the bat, I have to say: holy moly, the K8s spec is a monster! There are so many different pieces to it, and while there is some regularity to how it all fits together, there are still plenty of footguns that I only faced once I actually started trying to deploy the app. But that’s part of the learning experience.

An ownership trap

Here’s something that you don’t see often when people talk about Kubernetes: did you know that an OwnerReference can only reference resources in the same namespace? Of course, once this is spelled out, it becomes sort of obvious. But when I designed the platform, I had: a tenant CR in the k8s-mtp namespace; and all the owned resources in tenant-* namespaces.

Turns out that the error message for this is extremely cryptic: OwnerRefInvalidNamespace: ownerRef ... does not exist in namespace "tenant-test-acme-acme" ??? After a lot of back and forth, I finally understood what was happening, and the fix was as “simple” as changing the CRD scope from Namespaced to Cluster. Once that fix was in, everything fell into place.
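
In manifest terms, the fix is a single field. A fragment of what the CRD would look like (the resource names are inferred from the Tenant’s apiVersion; if the CRD is generated with controller-gen, the equivalent change is the `// +kubebuilder:resource:scope=Cluster` marker on the type):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: tenants.multitenant.k8s-mtp.io
spec:
  group: multitenant.k8s-mtp.io
  names:
    kind: Tenant
    plural: tenants
  scope: Cluster  # was Namespaced; a cluster-scoped owner may own namespaced resources
  # versions and schema omitted for brevity
```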

Deployment of a proper tenant

First things first, I tested whether the platform is actually able to deploy a tenant. So, we started relatively simple:

apiVersion: multitenant.k8s-mtp.io/v1
kind: Tenant
metadata:
  name: test-acme
  namespace: k8s-mtp
spec:
  name: acme 
  tier: free
  ownerEmail: admin@acme.org
  admins:
    - alice@acme.com
  operators: 
    - bob@acme.com

This allowed us to discover quite a few bugs in the implementation: that the webhook needs to work with a FQDN and not only with simple names; that there are caching and timing issues when the reconciler is set up; that the self-signed certs need to be in the correct format and decoded properly; that an empty slice and a slice containing an empty string are two different things; and a few more. But after fixing all of these, everything was working as expected! So, it was time to start testing whether our security guardrails actually worked.

The deployment of a bad tenant

Here is where I was expecting the most trouble, as policies in K8s can be quite tricky to get right. And indeed, we did face some trouble with the proper RBAC for the ClusterRoles. But with that out of the way, we could test a few scenarios:

apiVersion: v1
kind: Pod
metadata:
  name: privileged-pod
  namespace: tenant-test-acme-acme
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        cpu: "100m"
        memory: "128Mi"
    securityContext:
      privileged: true  # Webhook rejects: privileged containers not allowed
      runAsNonRoot: true
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
---
apiVersion: v1
kind: Pod
metadata:
  name: root-pod
  namespace: tenant-test-acme-acme
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        cpu: "100m"
        memory: "128Mi"
    securityContext:
      runAsNonRoot: false  # Webhook rejects: must run as non-root
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
---
apiVersion: v1
kind: Pod
metadata:
  name: no-secctx-pod # Webhook rejects: no security context
  namespace: tenant-test-acme-acme
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        cpu: "100m"
        memory: "128Mi"

These very simple pods should all be rejected. And it turns out that they all are!

[lv@omega test]$ kubectl apply -f test-tenant-reject.yaml
Error from server: error when creating "test-tenant-reject.yaml": admission webhook "k8s-mtp-webhook.k8s-mtp.io" denied the request: Container nginx failed privilege check
Error from server: error when creating "test-tenant-reject.yaml": admission webhook "k8s-mtp-webhook.k8s-mtp.io" denied the request: Container nginx failed privilege check
Error from server: error when creating "test-tenant-reject.yaml": admission webhook "k8s-mtp-webhook.k8s-mtp.io" denied the request: Container nginx must have SecurityContext defined

Seeing this was a great feeling! Not only are all the pieces working for the deployment, but they correctly implement the guardrails we’ve developed previously.
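
For contrast, a pod that satisfies every check in validateContainer, i.e. non-zero limits plus a fully locked-down security context, should be admitted. A sketch (the UID is arbitrary, and whether nginx actually runs happily as non-root with a read-only root filesystem is a separate runtime question):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: compliant-pod
  namespace: tenant-test-acme-acme
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        cpu: "100m"
        memory: "128Mi"
    securityContext:
      privileged: false
      runAsNonRoot: true
      runAsUser: 1000
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
```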

Closing thoughts, and what’s next

I can’t deny that I feel a bit proud of what was accomplished so far. With the current state of the platform, we show that it’s possible to implement something that has security at its core from the ground up; and with some clever design and a multi-layered implementation, the security guarantees become much stronger.

But, of course, there’s also the cost of complexity. The objective of this platform is to abstract away a lot of the complexity of properly setting up a K8s cluster, without compromising on security or functionality. This means we have to build extremely precise and concrete policies and specs, and that does come with writing a lot of verbose, boilerplate-heavy code. But such is the price we pay for building on top of K8s: we can build something very exact, but it takes a while to get there.

What’s next? Glad you asked! Let’s keep it short and sweet: authentication, an API gateway with rate limiting built-in, and if I’m feeling generous both a CLI tool and a Web UI! Stay tuned for more.