You’ve probably heard the joke that Kubernetes is essentially a giant pile of YAML files held together by hope and a few very stressed-out SREs. While that's a bit of an exaggeration, it gets to a core truth: Kubernetes is incredibly complex. It's a powerful orchestrator, but that power comes with a steep learning curve. When you're moving fast—deploying updates multiple times a day, scaling pods across regions, and managing service meshes—it is almost inevitable that something will be misconfigured.
The problem is that in a cloud-native environment, a tiny typo in a manifest or a "temporary" permission tweak can open a massive hole in your security. We've all been there. You just wanted the pod to start, so you gave it cluster-admin privileges "just for a second" to debug a connection issue. Then you forgot about it. Six months later, that pod is still running, and it's now a golden ticket for any attacker who manages to get a shell inside your container.
Fixing common misconfigurations in Kubernetes clusters isn't just about running a security scanner and checking off boxes. It's about understanding the "why" behind the risks. If you don't understand how a privileged container can lead to a full node breakout, you'll just keep making the same mistakes every time you write a new deployment file.
In this guide, we're going to walk through the most frequent mistakes we see in the wild and, more importantly, exactly how to fix them. We'll look at everything from Role-Based Access Control (RBAC) nightmares to the dangers of the default namespace. By the end, you should have a clear roadmap for hardening your cluster without breaking your applications.
The Danger of Over-Privileged Service Accounts (RBAC)
Role-Based Access Control (RBAC) is the heart of Kubernetes security. It dictates who can do what and where. However, RBAC is where most people start cutting corners. When a developer says, "I can't get my CI/CD pipeline to deploy the app," the easiest fix is often to grant the service account cluster-admin permissions.
It works. The pipeline turns green. Everyone is happy. But you've just created a massive vulnerability. If your CI/CD secret is leaked, the attacker doesn't just have access to one app; they have the keys to your entire kingdom.
The "Cluster-Admin" Trap
The cluster-admin role is a built-in ClusterRole that gives unrestricted access to every resource in the cluster. Using this for application-level service accounts is a cardinal sin of K8s security.
The Fix: The Principle of Least Privilege (PoLP)
Instead of using broad roles, you need to define specific roles that only permit the exact actions required.
For example, if a pod only needs to read ConfigMaps in its own namespace to start up, don't give it a ClusterRole. Give it a Role (which is namespaced) with only the get and list verbs for configmaps.
Example of a Tightened Role:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: app-namespace
  name: config-reader
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list"]
Avoiding the Default Service Account
By default, every namespace has a service account named default. If you don't specify a service account for a pod, Kubernetes assigns it this one. Historically, the default service account had broad permissions. While modern versions are better, many legacy clusters still have the default account bound to overly permissive roles.
The Fix: Explicit Service Accounts
Never rely on the default. Create a dedicated service account for every single application; a sketch of the full setup follows this list.
- Create the ServiceAccount.
- Create a Role with the minimum necessary permissions.
- Create a RoleBinding to link the two.
- Explicitly set serviceAccountName in your Pod spec.
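Here's a minimal sketch of those four steps wired together, reusing the config-reader Role from above. The my-app name and the image are placeholders:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: app-namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-config-reader
  namespace: app-namespace
subjects:
# The dedicated account created above
- kind: ServiceAccount
  name: my-app
  namespace: app-namespace
roleRef:
  # The namespaced Role defined earlier
  kind: Role
  name: config-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: app-namespace
spec:
  # Explicit, so the pod never falls back to "default"
  serviceAccountName: my-app
  containers:
  - name: app
    image: registry.example.com/my-app:v1.2.3  # placeholder image
```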
If your app doesn't need to talk to the Kubernetes API at all (which is true for most web apps), take it a step further. Set automountServiceAccountToken: false in your pod specification. This prevents the API token from being mounted into the pod, meaning even if an attacker gets in, they have no token to use for lateral movement within the cluster.
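The change is a single line in the pod spec; a minimal sketch for an app that never calls the API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend
spec:
  # No token is mounted, so a compromised container has no API credentials
  automountServiceAccountToken: false
  containers:
  - name: web
    image: registry.example.com/web-frontend:v1.0.0  # placeholder image
```

The same field can also be set on the ServiceAccount object itself, which makes the opt-out the default for every pod that uses that account.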
Hardening the Pod Security Context
When a container runs, it doesn't just run "in" the cluster; it runs as a process on a Linux node. If that process is running as the root user, and there's a vulnerability in the container runtime or the kernel, the attacker can potentially "escape" the container and gain root access to the host machine. This is known as a container breakout.
The "Privileged: True" Problem
You'll often see privileged: true in YAML files. This essentially tells Kubernetes to give the container almost all the capabilities of the host machine. This is rarely necessary for standard applications. It's usually only needed for specialized system tools (like CNI plugins or storage drivers).
The Fix: Stop Using Privileged Mode
If you find yourself needing privileged: true, ask why. Do you just need to change a network setting? Do you need to mount a specific device?
Instead of full privileged mode, use capabilities. Linux capabilities allow you to break down "root" power into smaller pieces. For example, if you only need to modify network interfaces, use CAP_NET_ADMIN instead of giving the pod full root access.
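A sketch of that drop-everything-then-add-back pattern (note that Kubernetes capability names omit the CAP_ prefix, so CAP_NET_ADMIN becomes NET_ADMIN):

```yaml
# Container-level securityContext (capabilities are set per container)
securityContext:
  # Block setuid binaries and similar escalation paths
  allowPrivilegeEscalation: false
  capabilities:
    # Start from zero...
    drop:
    - ALL
    # ...and add back only what this workload actually needs
    add:
    - NET_ADMIN
```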
Running as Root
Many Docker images are built to run as root by default. If you deploy these as-is, your process is running with UID 0. This is a huge risk.
The Fix: Use a non-root user
You should enforce non-root execution both in the Dockerfile and in the Kubernetes securityContext.
In your deployment YAML, add a securityContext section:
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
runAsNonRoot: true tells Kubernetes to verify whether the image is trying to run as root and to refuse to start the container if it is. This forces your team to build images with a dedicated user (e.g., USER 1000 in the Dockerfile).
Read-Only Root Filesystems
Most applications don't actually need to write to their own root filesystem. They write to logs (which should go to stdout/stderr) or to mounted volumes. If an attacker gains access to a container, the first thing they usually do is download a toolkit or a script to the local disk. If the filesystem is read-only, that attack vector is blocked.
The Fix: Set readOnlyRootFilesystem to true
securityContext:
  readOnlyRootFilesystem: true
If your app needs to write temporary files, don't turn off the read-only filesystem. Instead, mount an emptyDir volume to the specific path where the app needs to write (like /tmp).
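A sketch of that combination, assuming the app only writes scratch files to /tmp:

```yaml
spec:
  containers:
  - name: app
    image: registry.example.com/my-app:v1.2.3  # placeholder image
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
    # Writable scratch space at the one path the app needs
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}
```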
Managing Your Attack Surface: Network Policies
By default, Kubernetes has a "flat" network. This means any pod in any namespace can talk to any other pod in any other namespace. While this is great for connectivity, it's a nightmare for security. If your frontend web server is compromised, the attacker can freely scan your internal network and find your database, your cache, and your internal admin tools.
The Lack of Segmentation
Imagine a house where there are no doors between rooms—just one big open space. If a burglar gets through the front window, they have immediate access to the bedroom, the kitchen, and the safe. That's exactly how a default K8s cluster works.
The Fix: Implement a Default-Deny Policy
The safest way to manage network traffic is to start by blocking everything and then explicitly allowing only what is necessary. This is the "Zero Trust" approach.
First, create a policy that drops all ingress and egress traffic for the namespace:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
Once everything is blocked, you create specific "allow" rules. For instance, if your frontend pod needs to talk to your backend pod on port 8080, you write a policy that specifically allows traffic from the frontend label to the backend label on that port.
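That allow rule might look like this, assuming the pods carry app: frontend and app: backend labels:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  # This policy attaches to the backend pods...
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  # ...and admits only frontend pods, only on port 8080
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
```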
Controlling Egress Traffic
Most people focus on who can get into their cluster, but they forget about what's going out. If a pod is compromised, the attacker will try to "phone home" to a command-and-control (C2) server to receive instructions or exfiltrate data.
The Fix: Restrict Egress
Unless your pod needs to talk to the public internet (which is rare for a backend service), block all egress. If it does need internet access (e.g., to call a third-party API like Stripe or Twilio), use a service mesh like Istio or Linkerd, or use an Egress Gateway to whitelist specific external domains.
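Plain NetworkPolicies speak IPs and ports, not domains, so a baseline sketch can only allow DNS plus a known external range; 203.0.113.0/24 below is a placeholder, and true domain-level whitelisting needs the mesh or egress gateway mentioned above:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-egress
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Egress
  egress:
  # Allow DNS lookups so names still resolve
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Allow HTTPS only to a known external range (placeholder CIDR)
  - to:
    - ipBlock:
        cidr: 203.0.113.0/24
    ports:
    - protocol: TCP
      port: 443
```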
The "Point-in-Time" Trap and Continuous Testing
One of the biggest misconfigurations isn't a line of code—it's a process. Many companies do a "security audit" once a quarter. They hire a firm, the firm finds ten misconfigurations, the team fixes them, and everyone breathes a sigh of relief.
But Kubernetes environments are dynamic. You might change a ConfigMap, update a Helm chart, or add a new sidecar container tomorrow. That "clean" audit from last month is now irrelevant. This is what we call "point-in-time" security, and in a cloud-native world, it's dangerous.
This is where a shift toward Continuous Threat Exposure Management (CTEM) becomes necessary. You can't just scan for vulnerabilities; you need to simulate how an attacker actually moves through your cluster.
If you have a pod with an overly permissive RBAC role and a misconfigured Network Policy, a simple vulnerability scan might flag them as two separate "medium" risks. But in reality, together they create a "critical" path: an attacker exploits a web vulnerability, uses the RBAC role to list secrets, finds a database password, and uses the open network to dump your data.
Tools like Penetrify help bridge this gap. Instead of a static report that gathers dust, Penetrify provides on-demand security testing that scales with your cloud environment. It helps you identify these "chains" of misconfigurations—the way a small RBAC error combines with a network gap—before a malicious actor does. By moving toward a "Penetration Testing as a Service" (PTaaS) model, you stop guessing if your cluster is secure and start knowing.
Securing the API Server and Control Plane
The API server is the brain of your cluster. Everything—from kubectl commands to internal controller logic—goes through it. If the API server is exposed or misconfigured, your entire cluster is compromised.
Publicly Accessible API Servers
In the rush to get a cluster running, some teams leave the API server open to the entire internet. While it's protected by authentication, exposing the endpoint allows attackers to attempt brute-force attacks, exploit zero-day vulnerabilities in the API server itself, or perform denial-of-service attacks.
The Fix: Use Private Endpoints and Authorized Networks
If you're using a managed service like EKS, GKE, or AKS, enable the "Private Cluster" option. This ensures the API server is only accessible from within your VPC or via a VPN/bastion host. If you must keep it public, use "Authorized Networks" (IP whitelisting) to restrict access to only your office IP or your CI/CD runner IPs.
Anonymous Authentication
Some older clusters or custom installations might have anonymous authentication enabled. This allows requests to the API server that aren't authenticated to be treated as a special system:anonymous user. Depending on your RBAC settings, this user might accidentally have permissions to view pods or nodes.
The Fix: Disable Anonymous Auth
Ensure the --anonymous-auth=false flag is set on your kube-apiserver. If you can't disable it for some reason, ensure that the system:anonymous user is not bound to any roles that provide information about your cluster.
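On self-managed clusters, the flag usually lives in the kube-apiserver static pod manifest (on kubeadm installs, typically /etc/kubernetes/manifests/kube-apiserver.yaml); a trimmed sketch:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # Reject any request that doesn't present valid credentials
    - --anonymous-auth=false
    # ...remaining flags unchanged...
```

On managed services (EKS, GKE, AKS) you generally can't set API server flags yourself, so focus on making sure system:anonymous and system:unauthenticated are not bound to any roles.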
Etcd Security
Etcd is the database where Kubernetes stores all its state. If an attacker gets access to etcd, they have everything: every secret, every config, and the ability to modify the state of the cluster without going through the API server.
The Fix: Encrypt and Isolate etcd
- Encryption at Rest: Enable encryption for secrets in etcd so that if the disk is stolen or accessed, the secrets are useless (an example config follows this list).
- Mutual TLS (mTLS): Ensure that the API server and etcd communicate using certificates. No one should be able to talk to etcd without a valid client certificate.
- Network Isolation: etcd should be on a completely separate network or protected by strict firewall rules so that only the API server can reach it.
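For the encryption-at-rest item, the API server reads an EncryptionConfiguration file passed via its --encryption-provider-config flag; a minimal sketch, where the key name and the base64 value are placeholders you'd generate yourself (e.g., from 32 random bytes):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  # New and rewritten secrets are encrypted with this AES-CBC key
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded-32-byte-key>
  # identity keeps previously stored, unencrypted secrets readable
  - identity: {}
```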
Managing Secrets without the "Secret"
The name Secret in Kubernetes is a bit of a lie. By default, Kubernetes secrets are only base64 encoded, not encrypted. Anyone with access to the API or the etcd backup can decode your database password in about two seconds using a simple terminal command: echo "base64-string" | base64 --decode.
Storing Secrets in Git (The Ultimate Sin)
It happens more often than people admit. A developer puts a secret.yaml file into a Git repository "just for a few minutes" to help a teammate, and then forgets to delete it. Now, that password lives in the Git history forever.
The Fix: External Secret Management
Stop using native K8s secrets for sensitive data. Instead, use a dedicated secret manager.
- HashiCorp Vault: The industry standard. It provides dynamic secrets and tight access control.
- AWS Secrets Manager / Azure Key Vault / GCP Secret Manager: Great if you are already locked into a cloud provider.
To integrate these with Kubernetes, use tools like External Secrets Operator (ESO) or Secrets Store CSI Driver. These tools pull the secret from the external vault and inject it into the pod as a volume or a temporary K8s secret, ensuring the actual "source of truth" never lives in your YAML files or Git.
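A sketch of an External Secrets Operator manifest, assuming a ClusterSecretStore named aws-secrets-manager is already configured and a secret called prod/db exists in AWS Secrets Manager (check the API version against your installed ESO release):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager  # assumed pre-existing store
    kind: ClusterSecretStore
  target:
    # ESO creates this K8s Secret and keeps it in sync
    name: db-credentials
  data:
  - secretKey: password
    remoteRef:
      key: prod/db          # name of the secret in AWS Secrets Manager
      property: password    # JSON field inside that secret
```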
Secret Rotation
Most teams set a password and leave it for three years. If that password is leaked, the attacker has a permanent back door.
The Fix: Automated Rotation
If you use an external manager like Vault, you can implement automatic rotation. The secret manager updates the password in the database and then updates the value in Kubernetes. Because the app reads the secret from a volume or via an API, it picks up the new password without needing a full redeploy.
Resource Limits and the "Noisy Neighbor" Problem
While not a "security" misconfiguration in the traditional sense, failing to set resource limits is a stability and availability risk. In Kubernetes, if one pod goes rogue and starts consuming all the CPU or RAM on a node, it can starve other pods—including critical system components—leading to a node crash. This is essentially a self-inflicted Denial of Service (DoS).
The Danger of "Unlimited" Pods
If you don't define resources.limits, a pod can use as much of the node's resources as it wants. If you have a memory leak in one of your apps, it will slowly eat all the RAM on the node until the Linux OOM (Out of Memory) killer starts killing processes. The problem? The OOM killer might kill your most important pod first.
The Fix: Set Requests and Limits
Every container should have a request (what it needs to start) and a limit (the maximum it's allowed to use).
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"
- Requests: Used by the scheduler to find a node with enough room.
- Limits: Enforced by the container runtime to prevent the pod from hogging the node.
Handling CPU Throttling
Be careful with CPU limits. Unlike memory (where hitting a limit kills the pod), hitting a CPU limit just "throttles" the pod. Your app will get slow, but it won't crash. If you see high latency in your apps, check your Prometheus metrics for CPU throttling. You might need to increase your limits or optimize your code.
Image Security and Supply Chain Risks
Your cluster is only as secure as the images you run in it. If you're pulling my-app:latest from a public registry, you're essentially running code written by a stranger on your production infrastructure.
Using the :latest Tag
Using :latest is a recipe for disaster. You have no idea which version of the code is actually running. When a pod restarts, it might pull a new version of the image that contains a breaking change or, worse, a malicious payload.
The Fix: Use Immutable Tags or Digests
Always use a specific version tag (e.g., my-app:v1.2.3) or, for maximum security, the SHA256 digest of the image. This ensures that the exact same bytes are deployed every single time.
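In a pod spec, the options look like this (the registry name is a placeholder, and the digest is left as a stand-in you'd copy from your registry):

```yaml
containers:
- name: app
  # Risky: mutable, silently changes between pulls
  # image: my-app:latest
  # Better: a pinned, immutable version tag
  image: registry.example.com/my-app:v1.2.3
  # Best: a content-addressed digest that pins the exact bytes
  # image: registry.example.com/my-app@sha256:<digest>
```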
Running Images from Untrusted Registries
Public registries are full of "typo-squatted" images—images that look like popular ones but contain malware.
The Fix: Private Registry and Image Scanning
- Use a Private Registry: Pull public images, scan them, and then push them to your own private registry (like ECR, ACR, or Harbor).
- Automated Scanning: Use a scanner like Trivy or Clair to check for known CVEs (Common Vulnerabilities and Exposures) in your images.
- Admission Controllers: Use a tool like Kyverno or OPA (Open Policy Agent) to prevent any image from being deployed if it hasn't been scanned or if it contains "Critical" vulnerabilities; a sample policy follows this list.
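As an example of that last item, here's a sketch modeled on Kyverno's well-known disallow-latest-tag sample policy:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  # Enforce blocks the deployment instead of just logging a warning
  validationFailureAction: Enforce
  rules:
  - name: require-pinned-image-tag
    match:
      any:
      - resources:
          kinds:
          - Pod
    validate:
      message: "Images must not use the :latest tag."
      pattern:
        spec:
          containers:
          - image: "!*:latest"
```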
A Comprehensive Kubernetes Hardening Checklist
To make this actionable, here is a summary checklist you can use during your next sprint or security review.
RBAC & Access
- No service accounts granted cluster-admin (unless absolutely necessary).
- No apps using the default service account.
- automountServiceAccountToken: false set for pods that don't need API access.
- API Server is not open to the public internet.
- Anonymous authentication is disabled on the API server.
Pod Security
- privileged: true is forbidden for application pods.
- runAsNonRoot: true and runAsUser are defined for all containers.
- readOnlyRootFilesystem: true is enabled where possible.
- CPU and memory limits and requests are set for every container.
Network Security
- A default-deny-all NetworkPolicy is in place for every namespace.
- Explicit "Allow" rules are used for necessary service-to-service traffic.
- Egress traffic is restricted to known, required endpoints.
Data & Secrets
- No secrets stored in Git repositories.
- Secrets are managed via an external vault (Vault, AWS SM, etc.).
- Secrets are encrypted at rest in etcd.
- etcd is isolated and requires mTLS for communication.
Supply Chain
- Images are pulled from a private, trusted registry.
- No images use the :latest tag in production.
- All images are scanned for CVEs before deployment.
- An admission controller blocks non-compliant images.
Common Misconfiguration Scenarios: Before and After
To give you a clearer picture of how these fixes look in practice, let's look at a few "Before and After" scenarios.
Scenario 1: The "Lazy" Deployment
Before: A developer deploys a simple Node.js API. They use the default service account, run as root, have no resource limits, and no network policies.
- Risk: A vulnerability in a Node package allows an attacker to get a shell. Since they are root and have a default token, they can probe the internal network, find the database, and potentially escalate privileges to the node.
After:
- Use a custom ServiceAccount with no API permissions.
- Set runAsNonRoot: true.
- Define cpu and memory limits.
- Apply a NetworkPolicy that only allows traffic from the Ingress Controller.
- Result: Even if the attacker gets a shell, they are a low-privileged user, they can't talk to other pods, and they can't crash the node by consuming all the RAM.
Scenario 2: The "Secret" Leak
Before: The team stores their database password in a Kubernetes Secret created from a local YAML file. The YAML file was accidentally committed to a feature branch.
- Risk: Anyone with read access to the Git history now has the production database password.
After:
- Move the password to AWS Secrets Manager.
- Install the External Secrets Operator.
- The K8s secret is now a "shadow" of the AWS secret and is never stored in Git.
- Result: The secret is centralized, rotated automatically, and never touches a developer's local disk in plain text.
Frequently Asked Questions (FAQ)
1. Won't setting resource limits cause my pods to crash?
Yes, if you set the memory limit too low, your pod will be OOMKilled. This is actually a good thing—it's better for one pod to crash and restart than for one pod to crash your entire physical node. The trick is to monitor your app's actual usage using tools like Prometheus and Grafana, then set your limits slightly above the peak usage.
2. Is it really necessary to use a Service Mesh for network security?
For small clusters, a Service Mesh (like Istio) might be overkill. Standard Kubernetes NetworkPolicies are usually enough to get you 80% of the way there. However, if you need advanced features like mutual TLS (mTLS) between all services or complex traffic splitting, a service mesh is the right move.
3. How do I find all my current misconfigurations without checking every YAML file?
Doing this manually is impossible once you have more than a few apps. You should use automated tooling. Tools like kube-bench (which checks against CIS benchmarks) and kube-hunter are great for finding holes. For a more continuous, "attacker's eye view" of your cluster, a platform like Penetrify can automatically map your attack surface and find the paths an attacker would actually take.
4. Does runAsNonRoot work if the image was built as root?
If you set runAsNonRoot: true and the image's metadata says it runs as root (UID 0), Kubernetes will refuse to start the pod. You must go back to the Dockerfile and add a user (e.g., RUN useradd -u 1000 appuser followed by USER 1000 on its own line; a numeric UID lets Kubernetes verify the user is non-root without also setting runAsUser).
5. Can I apply these security settings to an existing cluster without downtime?
Yes, but do it carefully. Don't apply a default-deny-all network policy to your entire production namespace at once, or you'll take your whole site offline. Apply policies one service at a time. Similarly, test your securityContext changes in a staging environment to ensure your app doesn't need a specific root permission to write to a folder.
Moving From Reactive to Proactive Security
Fixing misconfigurations is a constant battle. As you add more services, more developers, and more complex integrations, the "security surface" of your cluster grows. If you rely on a manual checklist or a once-a-year audit, you're essentially playing a game of whack-a-mole where the mole has a rocket launcher.
The goal should be to move from a reactive state—where you fix things after they're flagged—to a proactive state. This means integrating security into your CI/CD pipeline (DevSecOps) and using continuous testing.
Automation is the only way to keep up with the speed of Kubernetes. When you can automatically scan your manifests for privileged: true before they even hit the cluster, you've won half the battle. When you use a tool like Penetrify to continuously simulate attacks on your environment, you're no longer guessing if your network policies work—you have proof.
Remember, security isn't a destination; it's a process of reducing risk. You'll never have a "100% secure" cluster, but by fixing these common misconfigurations, you make it so expensive and difficult for an attacker that they'll likely move on to an easier target.
Ready to stop guessing about your cluster's security? Don't wait for a breach to find out your RBAC is too open or your network policies have holes. Visit Penetrify to discover how automated, on-demand security testing can help you find and fix vulnerabilities in real-time, keeping your cloud-native infrastructure truly secure.