When we were given the task of performing a full Kubernetes audit for a client, we encountered, for the first time, a scenario in which Kubernetes runs completely on-prem, on physical machines arranged in clusters, waiting for the final green light before going to production. To make things more interesting, the client was an MVNO in Germany. Telephony and Kubernetes are definitely not a combination you encounter every day.
On-prem Kubernetes cluster: It sounds simple, but…
The client’s requests were seemingly simple: we would get read-only access to the dev cluster, make a detailed inventory of everything with a description of how it works to make sure we understood their system, and assess whether the chosen solutions were adequate – and, if not, suggest alternatives.
Sounds simple. Time-consuming, but simple.
We thought it was.
We were wrong.
Getting back to roots: Old school dev vs cloud-first team
As a team, we are accustomed to the cloud – all the little things that characterize it and all the benefits it provides. So we were mildly surprised when (quite expectedly) we discovered that some resources are not “cloud-like”. The two of us who worked on this project have a combined forty or so years of IT experience dating back to before the cloud, so it did not take us long to realize that we were back on an old-school system and that we had to think accordingly.
What does it look like when many developers design Kubernetes architecture?
We got up to speed with the dev cluster relatively easily, working on the assumption that dev, staging and the future production are the same, apart from the differences that would be diligently pointed out to us in time. The cluster itself followed the philosophy that every best-practice segment requires its own solution. This led to a large number of components being installed in Kubernetes, each with its own role.
Our first question was: are all these components really necessary?
When we raised this with the developers, we got an answer rooted in typical programming thinking: everything is viewed as a separate unit, and each requirement that must be met is one component. It cannot be said that this way of thinking is wrong.
The Ops approach is to build a skeleton and install as little as possible, while reusing everything that can be reused. In the case of our client, the approach was to ensure that every need was met with its own, dedicated solution. Since we could not dispute that, we tried to adapt.
Challenges and proposed solutions
To begin with, we needed to understand what the client does, how they do it, and how their application works. Although we received an application diagram with an explanation, we had to manually trace the complete internal communication between components to understand all the relationships within Kubernetes and the stack itself.
Since the nature of the application is communication and information exchange, it is essential that all of it is completely secure and stays within the cluster. Because read-only access allows us to view resources but not to open a session to them, we relied on analyzing the resources and the configuration files stored as ConfigMaps. Service names, Ingresses and Endpoints are just some of the objects that revealed, through configuration alone, which service talks to which. During the analysis we found that the teams had opted for quite cutting-edge solutions, all of them from the FOSS ecosystem, which is very commendable.
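For illustration, this is the kind of configuration that reveals communication paths even with read-only access; the names and namespaces below are invented, not taken from the client’s cluster:

```yaml
# Hypothetical ConfigMap - names and namespaces are invented for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: billing-service-config
  namespace: telco-dev
data:
  application.yaml: |
    upstream:
      # In-cluster DNS names like these expose which services talk to each other,
      # without any need to exec into pods or open a session.
      rating-engine: http://rating-engine.telco-dev.svc.cluster.local:8080
      subscriber-db: postgres://subscriber-db.telco-dev.svc.cluster.local:5432/subscribers
```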
Challenge 1: HA storage
One challenge I would single out is the client’s decision to use local disks on the servers through the local path provisioner. It is very difficult to use storage this way and still have high availability. The client had set up GlusterFS on top of it, which is not an ideal solution because Kubernetes is not aware of it, and the provisioned volumes cannot be handled properly in terms of backup and recovery.
We addressed this by proposing to abandon GlusterFS and switch to the combination of Ceph and Rook. Ceph, as a distributed storage system, can bind together a wide range of storage, and Rook can manage it from within Kubernetes. Our proposal was also to relocate the storage entirely to a dedicated server within the data center that can be backed up offsite, thereby addressing the distributed-filesystem (DFS) challenge as well.
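As a rough sketch of what the Rook + Ceph setup could look like, assuming the standard Rook operator is installed; the pool name, replica count and the omitted CSI secret parameters are illustrative, not the client’s values:

```yaml
# Minimal sketch: a replicated Ceph block pool managed by Rook, plus a StorageClass
# that provisions RBD volumes from it. CSI secret parameters are omitted for brevity.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  replicated:
    size: 3            # three replicas so a single node or disk failure is survivable
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
```

Unlike the local path provisioner, volumes created this way survive the loss of any single node, and Kubernetes remains fully aware of them.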
Challenge 2: Configuration
The second challenge was to ensure process control within Kubernetes, so that the automation constantly follows rules that must not be violated. The client chose the combination of Falco and Kyverno as the solution. Neither had been configured beyond the demo rules, which is acceptable, but the problem is that these tools are meant to operate in a mature, fully wired-up setup. They are difficult to configure on a dev cluster that is itself not fully configured, and it was also clear that the dev environment was not the same as the production environment. Our challenge was to tie the two policy engines together, which we proposed to do with two sets of policies that work in tandem.
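As an example of the kind of starting policy set we had in mind, here is a minimal Kyverno ClusterPolicy; the rule itself is illustrative and not one of the client’s policies:

```yaml
# Illustrative Kyverno policy: require CPU and memory limits on every container.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Audit   # audit on the dev cluster, switch to Enforce for production
  rules:
    - name: check-container-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "CPU and memory limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"
                    memory: "?*"
```

Running the same rules in audit mode on dev and enforce mode in production is one simple way to keep the two policy sets working together without blocking an environment that is still being built.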
Challenge 3: Monitoring
During the analysis we discovered a problem with distributed monitoring: it can be taken down by the crash of a single node. Namely, the client uses a master–worker setup in which each environment has the same number of master and worker nodes. Like storage, monitoring suffers from the same ailment: what happens when something inevitably fails on a bare-metal server, or a node has to be shut down for maintenance?
We had a dilemma about where the monitoring should live. The client was unwilling, at this stage of the project, to relocate the monitoring to a dedicated server outside the cluster. The solution they had chosen was the standard Prometheus stack, which was never meant to be distributed. Admittedly, there is federation, but it does not solve this problem. Our proposal, confirmed through joint discussion, was to use Thanos – a distributed overlay on top of Prometheus that allows independent Prometheus instances to exist no matter where they are located.
Our solution relied on the DFS we had proposed: the storage layer remains abstract to the rest of the system, and changes at that layer have no consequences for the constant inflow of monitoring data. Thanos thus becomes the de facto solution for the entire application and all three environments the moment the client decides to move the monitoring stack out of the Kubernetes cluster.
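A sketch of how this could look, assuming the Prometheus Operator is in use: the Thanos sidecar on each Prometheus instance ships metric blocks to object storage, so the instances stay independent while the long-term data survives the loss of any single node. The bucket, endpoint and resource names are assumptions:

```yaml
# Object-storage configuration consumed by the Thanos sidecar (values are placeholders).
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore
  namespace: monitoring
stringData:
  objstore.yml: |
    type: S3
    config:
      bucket: metrics-long-term
      endpoint: s3.example.internal:9000
      access_key: <redacted>
      secret_key: <redacted>
---
# Prometheus Operator resource with the Thanos sidecar enabled.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 2
  thanos:
    objectStorageConfig:
      name: thanos-objstore
      key: objstore.yml
```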
Challenge 4: Backup Solution
The tool the client intended to use for backup is Velero. It is a mature backup project that can also back up volumes in Kubernetes, but that is not its primary purpose or idea.
Our proposal was in line with the client’s developer approach. Velero is a backup solution, but it was originally designed to take a comprehensive backup of the Kubernetes cluster itself as a container orchestration system. Kubernetes consists of a number of components, and there is no built-in backup mechanism that would allow an irreversibly damaged cluster to be restored. Velero solves this problem by shipping the vital configuration to a remote, secure backup location of your choice. In our case, that was object storage the client already used and which was not located in the same data center.
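A hedged sketch of the kind of scheduled cluster backup this implies; the schedule, namespaces and storage-location name below are assumptions, not the client’s actual configuration:

```yaml
# Illustrative Velero schedule: nightly backup of all namespaces to offsite object storage.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # every night at 02:00
  template:
    includedNamespaces:
      - "*"                      # back up all namespaces
    storageLocation: offsite-object-storage
    ttl: 720h                    # keep backups for 30 days
```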
Challenge 5: Volume Backup
On the same line as the backup challenge was the volume backup. Here Kanister enters the scene, an open-source project that lets you define an API within Kubernetes, in the form of a resource that is invoked and performs the backup exactly the way the client needs. Kanister ships with definitions for backing up databases of various types: SQL, NoSQL, graph, time-series and so on. Backing up plain storage is defined as a job, just like restore.
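To illustrate how Kanister is invoked, here is a sketch of an ActionSet that triggers a backup action defined in a Blueprint; the blueprint, workload and profile names are invented for the example:

```yaml
# Sketch: an ActionSet invoking the "backup" action of a Blueprint against a StatefulSet.
apiVersion: cr.kanister.io/v1alpha1
kind: ActionSet
metadata:
  name: backup-subscriber-db
  namespace: kanister
spec:
  actions:
    - name: backup                      # action name defined in the blueprint
      blueprint: postgres-blueprint     # blueprint with the database-specific backup logic
      object:
        kind: statefulset
        name: subscriber-db
        namespace: telco-dev
      profile:
        name: offsite-s3-profile        # Kanister Profile pointing at the backup object storage
        namespace: kanister
```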
What we’ve learned
Working with traditional technologies is a big challenge in the modern age. The cloud-centric IT scene changes us all, but it is always a good exercise to go back to the roots and think manually, component by component. We also saw that we as Ops are still compatible with the dev side, and that their approach of one component per problem, which decentralizes processes and makes the system look (unnecessarily) robust, actually works extremely well and is very resilient to catastrophic scenarios.
In the course of the project, we also discovered new tools that are useful in different situations, such as Kanister, Reflector, Karpenter and Kyverno.
We also learned that not every distributed solution is distributed in every situation. A solution has to be purposefully designed for on-premise scenarios; otherwise it behaves like a simple union of resources that depends on when and how those resources were added. Such a pool of resources is not resilient and therefore not particularly efficient.
Discover how Mainstream can improve your business.
Contact us at sales@mainstream.eu or fill out our contact form.