Why are Kubernetes operators hard to install and manage?
Installing software dependencies is one of those tasks that seems simple but is actually quite complex under the hood. One key piece of software meant to address the problem is the package manager, a tool like apt or pip that provides a simple UX for installing packages in your own software project. Simply run pip install <my-package> and the package is available to use in your project.
This is simple enough in the base case. However, as projects grow, their dependencies increase in size and complexity. What if <my-package> itself has a dependency that is shared with your project, but <my-package> requires a later version, and updating it across the project would break other dependencies? Dependency resolution of this kind can be expressed as an instance of the boolean satisfiability problem (SAT). In a nutshell, the list of dependencies is transformed into a set of boolean constraints, and a solver then looks for an assignment (a choice of TRUE or FALSE for each variable) under which every constraint evaluates to TRUE. SAT is NP-complete, meaning no known algorithm solves every instance efficiently, and in the worst case the search time grows exponentially with the number of variables. A given set of constraints may also have no solution at all, in which case the dependencies simply cannot be satisfied together. However, thanks to good heuristics and faster hardware, most SAT instances that arise in practice can be handled by a SAT solver, a tool that takes a set of constraints and either finds a satisfying assignment or reports that none exists. The Operator Lifecycle Manager uses the open-source gini SAT solver developed by IRI France to resolve dependencies between operator packages.
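To make the reduction concrete, here is a minimal sketch of the version conflict described above, written as boolean clauses and solved by brute force. The package names (app, lib-a, shared-1, shared-2) are hypothetical, and the exhaustive search is only for illustration: gini and other real solvers use far more efficient algorithms, and this is not the encoding OLM itself uses.

```python
from itertools import product

# Hypothetical package versions. Each one becomes a boolean variable
# meaning "this version is installed".
VARIABLES = ["app", "lib-a", "shared-1", "shared-2"]

# Each clause is a list of (variable, wanted_value) literals. A clause is
# satisfied when at least one literal matches the assignment (CNF form).
CLAUSES = [
    [("app", True)],                             # we want app installed
    [("app", False), ("lib-a", True)],           # app requires lib-a
    [("app", False), ("shared-1", True)],        # app requires shared v1
    [("lib-a", False), ("shared-2", True)],      # lib-a requires shared v2
    [("shared-1", False), ("shared-2", False)],  # at most one version of shared
]

# Same constraints, except app now accepts either version of shared.
RELAXED = [
    [("app", True)],
    [("app", False), ("lib-a", True)],
    [("app", False), ("shared-1", True), ("shared-2", True)],
    [("lib-a", False), ("shared-2", True)],
    [("shared-1", False), ("shared-2", False)],
]


def solve(variables, clauses):
    """Return the first satisfying assignment, or None if there is none."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(
            any(assignment[name] == wanted for name, wanted in clause)
            for clause in clauses
        ):
            return assignment
    return None


print(solve(VARIABLES, CLAUSES))  # None: the two version requirements conflict
print(solve(VARIABLES, RELAXED))  # installs app, lib-a and shared-2 only
```

The first call finds no solution because app pins shared v1 while lib-a needs v2. Loosening a single constraint is enough for the solver to find an installable set, which is the kind of answer a package manager needs before it touches anything.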
What makes dependency management harder in Kubernetes?
Installing packages inside a running Kubernetes cluster is very different from installing a program on a local disk or pulling a library into a software project. Consider the most popular ways to install resources inside a cluster: GitOps and Helm. These solutions let the user state “here’s what I want” via a YAML specification, and after some templating the resources are delivered on-cluster. However, there is no easy way to know ahead of time whether these resources will work correctly on-cluster, or whether they will cause issues with existing cluster resources. For example, an operator being installed may depend on a CRD that is also used by operators already on the cluster (say, the Prometheus ServiceMonitor resource). If the required versions conflict, installing the operator will fail.
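As a rough illustration of the pre-flight check that plain templating does not perform, the sketch below compares the CRD versions a new operator needs against the versions already served on the cluster. The cluster snapshot is hard-coded and the v2 requirement is hypothetical; a real check would read the CRDs from the API server.

```python
# Hypothetical snapshot of the cluster: CRD name -> API versions it serves.
CLUSTER_CRDS = {
    "servicemonitors.monitoring.coreos.com": {"v1"},
}


def find_conflicts(required, cluster_crds):
    """Return CRDs that exist on the cluster but lack the required version."""
    conflicts = []
    for crd_name, version in sorted(required.items()):
        served = cluster_crds.get(crd_name)
        # An absent CRD is not a conflict: the installer can simply create it.
        if served is not None and version not in served:
            conflicts.append((crd_name, version, sorted(served)))
    return conflicts


# Hypothetical operator that expects a ServiceMonitor version the cluster
# does not serve, plus a CRD of its own that is not installed yet.
new_operator = {
    "servicemonitors.monitoring.coreos.com": "v2",
    "widgets.example.com": "v1",
}

for crd_name, wanted, served in find_conflicts(new_operator, CLUSTER_CRDS):
    print(f"{crd_name}: needs {wanted}, cluster serves {served}")
```

Running the check before applying anything turns a failed install into an up-front report of what would conflict.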
There’s no simple way to back out of an installation either: once the state of the cluster has been modified, it’s practically impossible to go back to the exact previous state. Because Kubernetes is constantly reconciling the state of the overall system, it’s hard to be certain about the effect of adding one new resource to the cluster, particularly if that resource is as complex as an operator.
What are some solutions to make things easier?
A robust package manager for Kubernetes should make it easy not only to install new packages but also to see the changes and manually approve them, provide guarantees around upgrades (both package upgrades and cluster upgrades), and ideally provide a clean uninstall mechanism. Let’s look at the Operator Lifecycle Manager (OLM) and some of the design decisions it makes to provide a good package manager experience for operator users.
- Every resource installed by OLM is referenced in an InstallPlan resource in the namespace that the package is installed in. This InstallPlan is an auditable reference to all the resources installed, and can be approved automatically (resulting in a seamless install experience) or manually (in which case the cluster admin must approve the InstallPlan before the resources are actually applied to the cluster). This is similar in spirit to a dry-run installation in Helm.
- Certain resources, like CRDs, require great care when installing, and OLM provides additional safety checks to ensure new CRDs don’t break existing CRs on the cluster. For example, existing CRs on the cluster must still validate against the schema of the new CRD, and the new CRD must not drop a storage version that existing CRs are still persisted in. Otherwise the upgrade risks data loss.
- Packages can be installed with a custom ServiceAccount that limits the permissions the resource has on-cluster. It’s important to reason about the RBAC permissions a resource has in the cluster, especially one developed by third parties and installed over the internet, and OLM can ensure the package runs with the user-specified ServiceAccount.
- Supporting webhooks. Webhooks are key resources for limiting the types of resources allowed on a cluster, and users should be able to install them. A validating admission webhook can deny certain resources from being created on-cluster, whereas a mutating admission webhook can enforce certain default values. There is some additional complexity in installing a webhook successfully, particularly around certificate management (the API server calls the webhook over TLS and must trust its certificate), but a good package manager should help solve this, either directly or via a third-party solution like cert-manager.
- Uninstalling packages. As stated previously, it can be tricky to remove packages after installation if something goes wrong. Simply deleting everything based on a label selector works in principle but can have unintended consequences. For example, consider an operator that creates resources outside the cluster (such as an S3 bucket) as part of its workload. The operator can be deleted, but what about the external cloud resources? To avoid this issue, it’s recommended to use finalizers on custom resources, and OLM respects these finalizers when cleaning up operators during an uninstall.
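The finalizer behavior mentioned in the last point can be sketched with a small in-memory simulation. Nothing here talks to a real cluster: the store dictionary stands in for the API server, external_buckets stands in for cloud storage, and the finalizer name is hypothetical. The point is only the ordering, where an object marked for deletion lingers until the operator has cleaned up and released its finalizer.

```python
from datetime import datetime, timezone

FINALIZER = "example.com/bucket-cleanup"  # hypothetical finalizer name

# Stand-ins for external cloud state and for objects stored in the API server.
external_buckets = {"my-bucket"}
store = {
    "my-cr": {
        "finalizers": [FINALIZER],
        "deletionTimestamp": None,
        "bucket": "my-bucket",
    },
}


def request_delete(name):
    """Mimic a delete call: mark the object, remove it only if no finalizers remain."""
    obj = store[name]
    if obj["deletionTimestamp"] is None:
        obj["deletionTimestamp"] = datetime.now(timezone.utc).isoformat()
    if not obj["finalizers"]:
        del store[name]


def reconcile(name):
    """Mimic the operator: clean up external state, then release the finalizer."""
    obj = store.get(name)
    if obj is None or obj["deletionTimestamp"] is None:
        return  # nothing to do unless the object is being deleted
    if FINALIZER in obj["finalizers"]:
        external_buckets.discard(obj["bucket"])  # external cleanup comes first
        obj["finalizers"].remove(FINALIZER)
    if not obj["finalizers"]:
        del store[name]  # with no finalizers left, the object can go away


request_delete("my-cr")
still_present = "my-cr" in store  # True: the finalizer blocks removal
reconcile("my-cr")
print(still_present, store, external_buckets)
```

If the operator were removed before reconcile ran, the custom resource would stay stuck in its deleting state and the bucket would be orphaned, which is why the uninstall order matters.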
Conclusion
Package management is actually quite complex, and layered on top of an already complex system like Kubernetes it can prove extremely challenging. The success of package managers like Helm shows that there is high demand for an intuitive solution that lets users install packages onto their cluster. However, it’s one thing to simply install resources, and another to ensure that they are working as intended and not breaking other workflows on the cluster. It’s also important to provide uninstallation capabilities so users can feel more comfortable installing packages. None of Helm, OLM, or GitOps is perfect in addressing these needs, but each addresses some of these fundamental challenges. As Kubernetes matures, these package managers will only grow in sophistication and hopefully provide the full UX that users require.