What is the difference between rootless mode and running as a non-root user inside the container?

These are two different security layers that stack together. Running as a non-root user inside the container (USER directive in Dockerfile) means the application process runs as a non-privileged user (uid 1000 or similar) inside the container namespace. If the application is compromised, the attacker has non-root access inside the container. Rootless mode goes further — the entire Docker daemon itself runs without root privileges on the host. This means even if an attacker escapes the container, they land as a non-root user on the host. The strongest configuration uses both: rootless Docker daemon on the host, and a non-root user inside each container.

Should we use Alpine or distroless as our base image?

Alpine (5-7 MB) includes a shell, a package manager (apk), and musl libc. These are useful during development and debugging but add more attack surface than production needs. Distroless images (roughly 2-20 MB for the static and base variants, larger for language runtimes such as Java) contain no shell, no package manager, and no unnecessary OS utilities. If your team needs to exec into running containers for debugging, Alpine is more practical. If your organization has mature observability and does not need container shells in production, distroless is more secure. Many teams use Alpine for staging and distroless for production. The key is to never use full OS images (Debian, Ubuntu, CentOS) in production containers.

How do we handle Docker secrets in CI/CD pipelines securely?

Never pass secrets through ARG or ENV instructions in a Dockerfile, because they are persisted in image layer metadata and visible via docker history. Instead, use Docker BuildKit secret mounts: the --mount=type=secret flag mounts a secret file during a specific build step and excludes it from the final image layers. For runtime secrets, use Docker Secrets (Swarm mode), Kubernetes Secrets with encryption at rest enabled, or an external secrets manager like HashiCorp Vault or AWS Secrets Manager. In CI/CD, store secrets in the pipeline platform (GitHub Actions secrets, GitLab CI variables) and inject them as build-time secret mounts, never as environment variables in the Dockerfile.

Docker Security Best Practices: Hardening Containers in Production

Docker containers are not secure by default. The default Docker configuration prioritizes convenience and compatibility over security: containers run as root, keep a default set of Linux capabilities that most applications never need, share the host kernel, and can bind-mount any host directory the operator specifies. Every one of these defaults is a security risk in production.

Hardening Docker containers means systematically removing unnecessary privileges, minimizing the attack surface, and implementing controls that limit the blast radius when a container is compromised. This guide covers every layer of Docker security: the image build process, the container runtime configuration, the Docker daemon settings, and the host-level protections that prevent container escapes.

These practices are based on the CIS Docker Benchmark v1.7, real-world incident post-mortems, and the configurations used by organizations running thousands of containers in production.

Base Image Selection: Your First Security Decision

The base image determines your container's attack surface before you write a single line of application code. Every package, utility, and library in the base image is a potential vulnerability. The goal is to ship the smallest possible image that can run your application — nothing more.

Base Image	Size	Packages	Shell	Package Manager	CVE Surface	Use Case
ubuntu:24.04	78 MB	400+	Yes	apt	High (120+ CVEs typical)	Development only — never production
debian:bookworm-slim	52 MB	90+	Yes	apt	Medium (40-60 CVEs typical)	Legacy apps needing glibc and shell access
alpine:3.20	5.5 MB	15	Yes (ash)	apk	Low (5-15 CVEs typical)	General production — good balance of size and usability
gcr.io/distroless/static	2 MB	0	No	No	Minimal (0-3 CVEs typical)	Go binaries, Rust binaries, static compiled apps
gcr.io/distroless/base	20 MB	glibc only	No	No	Very low (2-8 CVEs typical)	C/C++ apps, apps needing glibc
gcr.io/distroless/java21	190 MB	JRE only	No	No	Low (JRE CVEs only)	Java applications
cgr.dev/chainguard/static	1.6 MB	0	No	No	Minimal (0-1 CVEs typical)	Hardened static binaries, FIPS compliance

The difference between running Ubuntu and distroless is not marginal — it is the difference between 120+ known vulnerabilities and essentially zero. Every unnecessary package is an unnecessary risk. If your application does not need a shell, do not ship a shell. If it does not need a package manager, do not include one.

Multi-Stage Builds: Separating Build from Runtime

Multi-stage builds are the most important Dockerfile technique for production security. They separate the build environment (compilers, build tools, test frameworks, source code) from the runtime environment (just the compiled binary and its dependencies). Without multi-stage builds, your production images contain everything used during compilation — attack surface that serves no runtime purpose.

Multi-Stage Build Architecture

A proper multi-stage build has three or four stages:

Stage	Base Image	Purpose	What It Contains	Included in Final Image?
Builder	Language SDK image (golang, node, rust)	Compile code, run tests	Source code, compilers, build tools, test results	No — discarded
Dependencies	Same SDK or minimal image	Install and audit production dependencies	Package manifests, resolved dependency tree	Only final artifacts
Security scan	Trivy/Grype image	Scan compiled binary and dependencies for CVEs	Scanner, scan results	No — discarded
Runtime	Distroless or Alpine	Run the application	Only compiled binary + runtime deps + non-root user	Yes — this is the production image

The runtime stage should copy only the exact files needed from the builder stage. Every file you copy is a deliberate decision, not a side effect of the build process.
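As a rough illustration of the runtime-stage rule, the sketch below inspects only the last stage of a Dockerfile and reports two of the problems discussed above: a full OS base image and a missing non-root USER. The sample Dockerfile, the image tags, and the list of "full OS" image names are assumptions made for this example, not a complete linter.

```python
# Hypothetical two-stage build; image tags and paths are illustrative only.
SAMPLE_DOCKERFILE = """\
FROM golang:1.22 AS builder
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app ./cmd/app

FROM gcr.io/distroless/static AS runtime
COPY --from=builder /out/app /app
USER 65532:65532
ENTRYPOINT ["/app"]
"""

# Image names this sketch treats as full OS images (an assumed list).
FULL_OS_IMAGES = {"ubuntu", "debian", "centos"}


def final_stage_findings(dockerfile_text):
    """Return problems found in the last stage of a Dockerfile."""
    base, user = None, None
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].startswith("#"):
            continue
        instruction = parts[0].upper()
        if instruction == "FROM":
            # Each FROM starts a new stage, so forget the previous one.
            base = next((p for p in parts[1:] if not p.startswith("--")), None)
            user = None
        elif instruction == "USER" and len(parts) > 1:
            user = parts[1]

    if base is None:
        return ["no FROM instruction found"]

    findings = []
    # Strip registry path, digest, and tag to get the bare image name.
    image_name = base.split("/")[-1].split("@")[0].split(":")[0]
    if image_name in FULL_OS_IMAGES:
        findings.append(f"final stage uses full OS image {base}")
    if user is None or user.split(":")[0] in ("root", "0"):
        findings.append("final stage runs as root")
    return findings


if __name__ == "__main__":
    print(final_stage_findings(SAMPLE_DOCKERFILE))
```

A check like this could run in CI before the image is built. It only reads the Dockerfile text, so it complements an image scan rather than replacing one.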

Secret Handling During Builds

Build-time secrets (API keys for private registries, SSH keys for private git repos, authentication tokens) require special handling. The wrong approaches persist secrets in image layers where they are extractable by anyone who pulls the image.

Method	Security	Details
ENV or ARG in Dockerfile	Insecure — visible in docker history and image metadata	Never use for secrets. Even multi-stage builds do not protect these — the build stage layers are cached and can be extracted.
COPY then DELETE	Insecure — secret persists in earlier layer	Docker images are layered. Deleting a file in a later layer does not remove it from the layer where it was added.
BuildKit --mount=type=secret	Secure — mounted only during specific RUN step, never persisted	Use this for all build-time secrets. Mount the secret, use it, and it disappears when the RUN step completes. Not stored in any layer.
BuildKit --mount=type=ssh	Secure — SSH agent forwarded without exposing key	For cloning private git repositories. Forwards the SSH agent from the host without copying the private key into the image.
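The BuildKit secret mount has two halves: a RUN instruction that declares the mount, and a docker build invocation that supplies the secret. The sketch below shows both, assuming a hypothetical secret id (npmrc), file path, and image tag. The helper only assembles the command line; it does not run Docker.

```python
import shlex

# Dockerfile side (illustrative): the secret file is mounted at the target path
# only while this RUN step executes and is not written to any image layer.
DOCKERFILE_STEP = "RUN --mount=type=secret,id=npmrc,target=/root/.npmrc npm ci"


def build_command(tag, secrets, context="."):
    """Assemble a `docker build` argv that passes secrets via BuildKit.

    `secrets` maps a secret id to the path of a file holding its value.
    """
    argv = ["docker", "build", "--tag", tag]
    for secret_id, path in sorted(secrets.items()):
        argv += ["--secret", f"id={secret_id},src={path}"]
    argv.append(context)
    return argv


if __name__ == "__main__":
    # Hypothetical tag and secret path, as a CI job might supply them.
    cmd = build_command("registry.example.com/app:1.0", {"npmrc": "/run/ci/npmrc"})
    print(shlex.join(cmd))
```

Because the secret travels as a file reference rather than an ARG or ENV value, it should not appear in docker history output for the resulting image.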

Runtime Security Configuration

Even with a perfectly hardened image, a misconfigured runtime can undo all your build-time security work. The docker run command (or the equivalent Kubernetes pod spec) determines what privileges the container actually has at runtime.

The Non-Negotiable Runtime Controls

These controls should be applied to every production container without exception:

Control	Docker Flag	What It Does	Why It Matters
Non-root user	--user 1000:1000	Runs the container process as uid 1000 instead of root	Prevents the compromised process from having root privileges inside the container. First defense against privilege escalation.
Read-only filesystem	--read-only	Mounts the container's root filesystem as read-only	Prevents attackers from writing backdoors, downloading exploit tools, or modifying application files. Use tmpfs mounts for directories that need writes (such as /tmp or log directories).
Drop all capabilities	--cap-drop=ALL	Removes all Linux capabilities from the container	Containers inherit a default set of 14 capabilities. Most applications need zero capabilities. Dropping all and selectively adding back only what is needed follows least privilege.
No new privileges	--security-opt=no-new-privileges	Prevents processes from gaining additional privileges via setuid/setgid binaries	Even if a setuid binary exists in the image, it cannot escalate privileges. Blocks a common privilege escalation technique.
Memory limit	--memory=512m	Caps memory usage at the specified limit	Prevents resource exhaustion attacks. A compromised container cannot OOM-kill other containers on the host.
CPU limit	--cpus=1.0	Limits CPU usage to the specified number of cores	Prevents cryptomining and resource-based denial of service from consuming all host CPU.
No privileged mode	Never use --privileged	Privileged mode gives the container full access to host devices and disables all security protections	A privileged container is essentially root on the host. There is almost never a legitimate reason to run privileged in production.
PID limits	--pids-limit=256	Limits the number of processes the container can create	Prevents fork bomb attacks that exhaust the host process table and crash the node.

Docker ships with insecure defaults optimized for developer convenience. Every production container must apply the hardened configuration to prevent container escapes and privilege escalation.
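The controls in the table can be combined into a single docker run invocation. The following sketch assembles that command line from the flags listed above; the function name, default values, and image reference are assumptions for illustration, and the limits should be tuned per workload.

```python
def hardened_run_args(image, uid=1000, gid=1000, memory="512m", cpus="1.0",
                      pids_limit=256, add_caps=(), tmpfs_paths=("/tmp",)):
    """Assemble a `docker run` argv that applies the runtime controls above."""
    argv = [
        "docker", "run", "--detach",
        "--user", f"{uid}:{gid}",            # non-root user
        "--read-only",                       # read-only root filesystem
        "--cap-drop=ALL",                    # drop every capability first
        "--security-opt=no-new-privileges",  # ignore setuid/setgid escalation
        f"--memory={memory}",
        f"--cpus={cpus}",
        f"--pids-limit={pids_limit}",
    ]
    for path in tmpfs_paths:
        # Writable scratch space for paths that must accept writes.
        argv += ["--tmpfs", path]
    for cap in add_caps:
        # Add back only the capabilities the application really needs.
        argv.append(f"--cap-add={cap}")
    argv.append(image)
    return argv


if __name__ == "__main__":
    # Hypothetical image that needs to bind a port below 1024.
    print(" ".join(hardened_run_args("registry.example.com/app:1.0",
                                     add_caps=("NET_BIND_SERVICE",))))
```

The helper never emits --privileged, so callers cannot opt into privileged mode through it.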

Seccomp Profiles: Filtering System Calls

Seccomp (Secure Computing Mode) restricts which system calls a container can make. Docker ships with a default seccomp profile that blocks approximately 44 of the 300+ Linux system calls — preventing the most dangerous operations (reboot, kernel module loading, clock manipulation) while allowing most normal application behavior.

For production hardening, create a custom seccomp profile that allows only the system calls your application actually uses. The process:

1. Run your application with the default profile and strace/sysdig to record all system calls made during normal operation (exercise all code paths, including error handling).
2. Generate a custom seccomp profile containing only those observed system calls.
3. Test the profile thoroughly in staging. A missing system call typically fails with EPERM and can crash the application.
4. Deploy the custom profile in production with monitoring for EPERM denials.
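The profile generation step can be sketched as follows, assuming the observed system call names have already been collected into a list. The function name and the x86-64-only architecture list are assumptions. A real profile also needs the system calls the container runtime makes while starting the process, so treat this as a starting point.

```python
import json


def seccomp_profile(observed_syscalls):
    """Build an allowlist-style seccomp profile from observed syscall names."""
    names = sorted(set(observed_syscalls))
    return {
        # Anything not explicitly allowed fails with an error (EPERM by default).
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
    }


if __name__ == "__main__":
    # Hypothetical trace result; a real list is usually much longer.
    observed = ["read", "write", "openat", "close", "read", "exit_group"]
    print(json.dumps(seccomp_profile(observed), indent=2))
```

The resulting JSON file can then be applied with --security-opt seccomp=/path/to/profile.json on docker run.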

Docker's default seccomp profile is a reasonable starting point for most applications. Custom profiles are worth the effort for high-security workloads or containers that handle sensitive data.

AppArmor and SELinux: Mandatory Access Control

AppArmor (Ubuntu/Debian) and SELinux (Red Hat/CentOS) provide mandatory access control (MAC) that restricts what files a container process can access, what network operations it can perform, and what capabilities it can exercise — regardless of the process's user ID or Linux capabilities.

Docker automatically applies the docker-default AppArmor profile on systems with AppArmor enabled. This profile denies dangerous operations such as writing to sensitive paths under /proc and /sys and mounting filesystems. For additional hardening, create application-specific AppArmor profiles that restrict file access to only the directories your application needs.

Docker Content Trust and Image Verification

Docker Content Trust (DCT) ensures that every image you run has been cryptographically signed by a trusted publisher. Without DCT, anyone who compromises your registry or performs a man-in-the-middle attack can substitute a malicious image for a legitimate one, and Docker will run it without question.

How DCT Works

DCT uses The Update Framework (TUF) through the Notary service. When a publisher pushes a signed image:

1. The publisher's signing key creates a digital signature over the image manifest (the cryptographic hash of every layer).
2. The signature is stored in a Notary server (separate from the image registry).
3. When a consumer pulls the image with DCT enabled, Docker verifies the signature against the publisher's public key before running the image.
4. If the signature is missing, invalid, or the image has been modified since signing, Docker refuses to run the image.

Enable DCT by setting the environment variable DOCKER_CONTENT_TRUST=1. Once enabled, Docker will refuse to pull or run any unsigned image. This is a powerful supply chain security control — but it requires that all images in your registry are signed, which means your CI/CD pipeline must include a signing step.

Cosign: Modern Image Signing

Cosign (from the Sigstore project) is the modern alternative to DCT/Notary for image signing. It is simpler to deploy, supports keyless signing (using OIDC identity providers like GitHub Actions, Google, or Microsoft Entra ID), and stores signatures as OCI artifacts alongside the image in any OCI-compliant registry.

Keyless signing with Cosign is particularly powerful in CI/CD pipelines: the pipeline authenticates via OIDC, Sigstore's Fulcio CA issues a short-lived signing certificate, the image is signed, and the signature can be verified by anyone using the Rekor transparency log. No long-lived signing keys to manage, rotate, or protect.
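A keyless sign-and-verify round trip can be sketched as two command lines. The helpers below only assemble the cosign arguments. The image digest, workflow identity, and issuer URL in the usage section are hypothetical placeholders that would come from your own pipeline.

```python
def cosign_sign_args(image_digest_ref):
    """Argv for keyless signing; expects an image reference pinned by digest."""
    # --yes skips the interactive confirmation prompt, which CI cannot answer.
    return ["cosign", "sign", "--yes", image_digest_ref]


def cosign_verify_args(image_ref, identity, issuer):
    """Argv that verifies the signer's OIDC identity and issuer."""
    return [
        "cosign", "verify",
        "--certificate-identity", identity,
        "--certificate-oidc-issuer", issuer,
        image_ref,
    ]


if __name__ == "__main__":
    # Hypothetical values for a GitHub Actions pipeline.
    image = "registry.example.com/app@sha256:<digest>"
    identity = "https://github.com/example-org/app/.github/workflows/release.yml@refs/heads/main"
    issuer = "https://token.actions.githubusercontent.com"
    print(" ".join(cosign_sign_args(image)))
    print(" ".join(cosign_verify_args(image, identity, issuer)))
```

Pinning the expected identity and issuer at verification time is what ties a signature to a specific pipeline rather than to anyone who can obtain a certificate.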

Docker Daemon Hardening

The Docker daemon (dockerd) runs as root on the host by default and has complete control over all containers. Securing the daemon itself is as important as securing individual containers.

Setting	Configuration	Why It Matters
Enable user namespace remapping	"userns-remap": "default" in daemon.json	Maps root inside the container to a non-privileged user on the host. Even if an attacker breaks out of the container as root, they land as a non-root user on the host.
Restrict network traffic between containers	"icc": false in daemon.json	Disables inter-container communication by default. Containers can only communicate if explicitly linked or on the same user-defined network.
Enable live restore	"live-restore": true in daemon.json	Containers continue running if the Docker daemon stops. Prevents daemon restarts from causing container downtime and potential security gaps during restart.
Use TLS for remote API	"tlsverify": true with client certificates	The Docker API socket grants full control over all containers. Without TLS, anyone with network access to the socket can start, stop, or create containers.
Log level and auditing	"log-level": "info" and audit logging enabled	Security events (container creation, privilege changes, image pulls) must be logged for incident response and compliance.
Default ulimits	"default-ulimits": with nofile and nproc limits	Sets resource limits for all containers by default, protecting against resource exhaustion when individual container limits are not specified.
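Most of the settings in the table live in /etc/docker/daemon.json. The sketch below generates such a file. The ulimit numbers are placeholder assumptions. TLS is left out because it also requires certificate paths and a TCP listener specific to each host.

```python
import json


def daemon_config():
    """Return a daemon.json structure covering the settings in the table."""
    return {
        "userns-remap": "default",   # map container root to an unprivileged host user
        "icc": False,                # no inter-container traffic on the default bridge
        "live-restore": True,        # keep containers running across daemon restarts
        "log-level": "info",
        "default-ulimits": {
            # Placeholder limits; size these for your workloads.
            "nofile": {"Name": "nofile", "Soft": 1024, "Hard": 4096},
            "nproc": {"Name": "nproc", "Soft": 512, "Hard": 1024},
        },
    }


if __name__ == "__main__":
    print(json.dumps(daemon_config(), indent=2))
```

The daemon has to be restarted after the file changes. Enabling userns-remap changes where image and container data is stored, so it is best introduced on a fresh host or during a planned migration.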

Rootless Docker Mode

Rootless mode runs the entire Docker daemon and all containers as a non-root user on the host. This is the strongest host-level security configuration because even a complete container escape only grants non-privileged host access.

Trade-offs of rootless mode:

Networking limitations: rootless mode routes traffic through a user-space network stack (slirp4netns or pasta), which means slightly higher network latency (5-10 percent) and no support for host networking mode.
Storage driver: rootless mode needs either the overlay2 driver, which works natively for unprivileged users on kernel 5.11+, or fuse-overlayfs on older kernels.
Port binding: cannot bind to ports below 1024 without additional configuration (net.ipv4.ip_unprivileged_port_start=0 sysctl).
No cgroup v1 support on some systems: resource limits in rootless mode require cgroup v2, which is the default on modern distros (Ubuntu 22.04+, Fedora 31+).

For most production workloads, these trade-offs are acceptable. The security benefit — eliminating root-level access on the host — outweighs the minor networking overhead.

Docker Network Security

Docker networking defaults are designed for convenience, not security. All containers on the default bridge network can communicate with each other and with external networks. In production, network access should follow least privilege — each container should only be able to reach the services it explicitly needs.

Network Segmentation Strategy

Network Type	Isolation Level	When to Use
Default bridge	None — all containers can communicate	Never in production. Development only.
User-defined bridge networks	Containers can only communicate within the same network	Standard production isolation. Create separate networks for each application tier (frontend, backend, database).
Internal networks	No external access — containers can only reach other containers on the same internal network	Database containers, cache containers, and other services that should never be directly accessible from outside.
Encrypted overlay (Swarm)	Multi-host overlay network with IPsec encryption between nodes	Multi-host deployments where inter-node traffic crosses untrusted networks.
None	Complete isolation — no network access at all	Batch processing containers that should never make network connections.
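The segmentation strategy in the table maps onto a handful of docker network create commands. The sketch below assembles them for a hypothetical three-tier application. The network naming scheme is an assumption, and the commands are returned rather than executed.

```python
def network_commands(app):
    """Argv lists that create one network per tier for a given application."""
    return [
        # User-defined bridge: only containers attached to it can talk to each other.
        ["docker", "network", "create", f"{app}-frontend"],
        # Internal network: no route to or from the outside world.
        ["docker", "network", "create", "--internal", f"{app}-db"],
    ]


def encrypted_overlay_command(name):
    """Argv for a Swarm overlay network with encryption between nodes."""
    return ["docker", "network", "create", "--driver", "overlay",
            "--opt", "encrypted", name]


if __name__ == "__main__":
    for cmd in network_commands("shop"):
        print(" ".join(cmd))
    print(" ".join(encrypted_overlay_command("shop-mesh")))
```

A backend service that must reach both tiers can be attached to both networks, while the database container joins only the internal one.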

CIS Docker Benchmark: Production Compliance Scoring

The Center for Internet Security (CIS) publishes the Docker Benchmark — a comprehensive checklist of security configurations across five areas: host configuration, Docker daemon, Docker daemon configuration files, container images, and container runtime. The benchmark contains 115+ individual checks, each classified as scored (pass/fail) or not-scored (informational).

Automated Compliance with Docker Bench for Security

Docker Bench for Security is an open-source script that automates CIS Benchmark checks. It runs approximately 100 checks in under 60 seconds and produces a scored report showing which checks pass and which fail.

Target compliance levels for production:

Benchmark Section	Checks	Target Score	Common Failures
1 - Host Configuration	18 checks	90%+	Audit rules not configured, Docker partition not separate
2 - Docker Daemon	18 checks	85%+	User namespace remapping not enabled, ICC not disabled
3 - Docker Daemon Files	22 checks	95%+	File permissions too permissive on docker.sock, daemon.json
4 - Container Images	11 checks	90%+	Images running as root, no HEALTHCHECK, secrets in image layers
5 - Container Runtime	31 checks	85%+	Privileged containers, missing resource limits, no seccomp profile

CIS Docker Benchmark compliance targets by section. The automated Docker Bench for Security tool runs all checks in under 60 seconds and produces a pass/fail report.
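To make the targets concrete, the sketch below compares per-section pass counts against the thresholds in the table. The section names, check counts, and targets are copied from the table; the pass counts in the usage section are invented sample data, not real Docker Bench output.

```python
# (number of checks, target percent) per section, copied from the table above.
SECTIONS = {
    "1 - Host Configuration": (18, 90),
    "2 - Docker Daemon": (18, 85),
    "3 - Docker Daemon Files": (22, 95),
    "4 - Container Images": (11, 90),
    "5 - Container Runtime": (31, 85),
}


def section_report(passed):
    """Map each section to (score percent, meets target) given pass counts."""
    report = {}
    for name, (checks, target) in SECTIONS.items():
        ok = passed.get(name, 0)
        # Integer comparison avoids floating-point rounding at the threshold.
        meets = ok * 100 >= target * checks
        report[name] = (round(100 * ok / checks, 1), meets)
    return report


def overall_percent(passed):
    """Overall pass rate across all sections."""
    total = sum(checks for checks, _ in SECTIONS.values())
    return round(100 * sum(passed.get(name, 0) for name in SECTIONS) / total, 1)


if __name__ == "__main__":
    # Invented sample results for illustration.
    sample = {
        "1 - Host Configuration": 17,
        "2 - Docker Daemon": 15,
        "3 - Docker Daemon Files": 21,
        "4 - Container Images": 10,
        "5 - Container Runtime": 27,
    }
    for name, (score, meets) in section_report(sample).items():
        print(f"{name}: {score}% {'ok' if meets else 'below target'}")
    print(f"overall: {overall_percent(sample)}%")
```

With the sample data, 15 of 18 daemon checks is 83.3 percent, which falls short of the 85 percent target even though the overall pass rate is 90 percent. A per-section view catches gaps that an overall score hides.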

Image Lifecycle Security

Container security does not end when the image is built. Images accumulate vulnerabilities over time as new CVEs are discovered in their base images and dependencies. A production image that scanned clean on deployment day may have 30+ critical vulnerabilities six months later.

Image Lifecycle Controls

Control	Frequency	Purpose
Registry scanning	Continuous — scan all images in registry daily	Detect new CVEs in deployed images before they are exploited. Alert when critical CVEs appear in images currently running in production.
Base image updates	Weekly for non-breaking updates, monthly for major versions	Pick up security patches in base images (Alpine, distroless). Automated via Dependabot, Renovate, or custom CI pipelines.
Image expiry/rotation	Maximum 90-day age policy	No image should run in production for more than 90 days without being rebuilt. Enforce via admission controller that checks image creation timestamp.
Vulnerability SLA	Critical: 7 days. High: 30 days. Medium: 90 days	Set clear timelines for patching CVEs by severity. Track compliance as a team metric.
SBOM generation	Every build — attached as image annotation	Generate a Software Bill of Materials for every production image in SPDX or CycloneDX format. Required for FDA, DoD, and increasingly for enterprise procurement.
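Two of these controls, the 90-day image age policy and the vulnerability SLA, reduce to simple date arithmetic. The sketch below shows one way to evaluate them. The function names and the sample findings are hypothetical, and the inputs would normally come from registry metadata and scanner reports.

```python
from datetime import date, timedelta

MAX_IMAGE_AGE = timedelta(days=90)
# Days allowed to patch a CVE, by severity, from the table above.
SLA_DAYS = {"critical": 7, "high": 30, "medium": 90}


def image_expired(built_on, today):
    """True when the image is older than the 90-day rebuild policy allows."""
    return today - built_on > MAX_IMAGE_AGE


def overdue_findings(findings, today):
    """Return ids of findings whose patch deadline has passed.

    `findings` is a list of (finding_id, severity, detected_on) tuples.
    Severities without an SLA entry are ignored.
    """
    overdue = []
    for finding_id, severity, detected_on in findings:
        days = SLA_DAYS.get(severity.lower())
        if days is not None and today - detected_on > timedelta(days=days):
            overdue.append(finding_id)
    return overdue


if __name__ == "__main__":
    today = date(2024, 4, 1)
    print(image_expired(date(2024, 1, 1), today))
    # Placeholder finding ids, not real CVE numbers.
    sample = [
        ("FINDING-1", "critical", date(2024, 3, 20)),
        ("FINDING-2", "high", date(2024, 3, 20)),
    ]
    print(overdue_findings(sample, today))
```

An admission controller or scheduled CI job could run the same comparison against the image creation timestamp and block or flag images that fail it.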

The Docker Production Security Checklist

Use this checklist before promoting any container to production. Each item maps to specific CIS Benchmark checks and real-world attack patterns that have caused breaches:

Category	Check	Priority
Image	Base image is minimal (Alpine, distroless, or Chainguard). No full OS images.	Critical
Image	Multi-stage build. No build tools, compilers, or source code in production image.	Critical
Image	No secrets in image layers. Verified via docker history and secret scanning.	Critical
Image	Image is signed (DCT or Cosign) and verified before deployment.	High
Image	SBOM generated and attached. Vulnerability scan shows zero critical CVEs.	High
Runtime	Container runs as non-root user (USER directive or --user flag).	Critical
Runtime	Read-only root filesystem with tmpfs for necessary write paths.	High
Runtime	All Linux capabilities dropped (--cap-drop=ALL). Only needed caps added back.	Critical
Runtime	no-new-privileges security option enabled.	High
Runtime	Memory and CPU limits set. PID limit configured.	High
Runtime	Seccomp profile applied (default or custom).	Medium
Daemon	User namespace remapping enabled on host.	High
Network	Container on isolated user-defined network. No default bridge.	High
Network	Only necessary ports published. No --publish-all.	High
Monitoring	Container health check defined (HEALTHCHECK in Dockerfile).	Medium
Monitoring	Runtime security tool deployed (Falco, Sysdig, or equivalent).	High

No container should reach production unless it passes every Critical and High priority check in this list. The Medium priority items should be implemented for workloads handling sensitive data or exposed to untrusted input. Running Docker Bench for Security and scoring above 85 percent validates that these controls are correctly applied.
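The promotion rule can be expressed as a small gate function. The sketch below assumes each checklist result is recorded as a (check, priority, passed) tuple, which is a hypothetical data shape chosen for this example.

```python
BLOCKING_PRIORITIES = {"Critical", "High"}


def promotion_gate(results):
    """Return (allowed, blocking_checks) for a list of checklist results.

    `results` is a list of (check, priority, passed) tuples. Any failed
    Critical or High check blocks promotion; Medium failures do not.
    """
    blocking = [
        check for check, priority, passed in results
        if priority in BLOCKING_PRIORITIES and not passed
    ]
    return (not blocking, blocking)


if __name__ == "__main__":
    # Invented sample results for illustration.
    sample = [
        ("Container runs as non-root user", "Critical", True),
        ("Read-only root filesystem", "High", False),
        ("Seccomp profile applied", "Medium", False),
    ]
    allowed, blocking = promotion_gate(sample)
    print(allowed, blocking)
```

Returning the list of blocking checks alongside the verdict lets a pipeline report exactly which items need attention.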