systemd can sandbox a service with kernel features (namespaces, seccomp, capabilities) — no container required. Add these to [Service], then loosen only what breaks.
A solid baseline
[Service]
# Privilege
NoNewPrivileges=true
CapabilityBoundingSet=
AmbientCapabilities=
# Filesystem
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/lib/myapp /var/log/myapp
# Kernel & devices
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
ProtectClock=true
PrivateDevices=true
# Network & syscalls
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
MemoryDenyWriteExecute=true
LockPersonality=true
SystemCallFilter=@system-service
SystemCallFilter=~@privileged @resources
SystemCallArchitectures=native
What the key ones do:
- NoNewPrivileges=true: the process and its children can never gain privileges, which blocks setuid escalation. The single highest-value line.
- ProtectSystem=strict: the entire filesystem is read-only except /dev, /proc, and /sys; combine with ReadWritePaths= to punch through for the directories the service actually writes. (ProtectSystem=full is the softer version: only /usr, /boot, and /etc are read-only.)
- ProtectHome=true: /home, /root, and /run/user are inaccessible and appear empty.
- PrivateTmp=true: a private /tmp that no other service can see.
- CapabilityBoundingSet= (empty): drop all capabilities; add back only what’s needed, e.g. CapabilityBoundingSet=CAP_NET_BIND_SERVICE to bind a low port.
- SystemCallFilter=@system-service, then ~@privileged @resources: allow the normal service syscall set, then subtract the dangerous groups.
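One way to confirm that the filesystem directives behave as described is to probe from inside the sandbox. The sketch below is a hypothetical helper (probe_write is an illustrative name, not part of systemd) that reports whether a directory accepts a new file.

```shell
# Hypothetical helper: report whether a file can be created in directory $1.
# The subshell keeps a failed redirection from aborting the calling shell.
probe_write() {
  probe_file="$1/.sandbox-probe.$$"
  if ( : > "$probe_file" ) 2>/dev/null; then
    rm -f "$probe_file"
    echo "$1: writable"
  else
    echo "$1: not writable"
  fi
}
```

Saved as a script and run inside the unit (for instance from a temporary ExecStartPre= line that calls it for /usr, /var/lib/myapp and /tmp), it should report /usr as not writable and the ReadWritePaths= directories as writable under the baseline above.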
Allow back what it needs
- Writes files? ReadWritePaths=/path (or StateDirectory=name for /var/lib/name).
- Binds a port < 1024? AmbientCapabilities=CAP_NET_BIND_SERVICE, plus the same capability in CapabilityBoundingSet=.
- Only talks Unix sockets? Drop AF_INET/AF_INET6 from RestrictAddressFamilies=.
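As an illustration of the low-port case, the sketch below writes a drop-in that re-grants a single capability. The function name write_bind_dropin and the file name 10-bind-low-port.conf are hypothetical. The directory you pass is assumed to be the unit's drop-in directory, for example /etc/systemd/system/myapp.service.d for the myapp.service unit used in this article.

```shell
# Hypothetical helper: write a drop-in that re-grants CAP_NET_BIND_SERVICE.
# $1: the unit's drop-in directory (assumed, e.g. /etc/systemd/system/myapp.service.d)
write_bind_dropin() {
  dropin_dir="$1"
  mkdir -p "$dropin_dir" || return 1
  cat > "$dropin_dir/10-bind-low-port.conf" <<'EOF'
[Service]
# Add the one capability back to the otherwise empty bounding set
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Hand it to the process at exec time (matters when the service runs as a non-root User=)
AmbientCapabilities=CAP_NET_BIND_SERVICE
EOF
}
```

Drop-ins are read after the main unit file, so the assignment here adds to the empty CapabilityBoundingSet= from the baseline instead of being reset by it. Run systemctl daemon-reload afterwards so systemd picks up the new file.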
Apply and score it
sudo systemctl daemon-reload
sudo systemctl restart myapp.service
systemd-analyze security myapp.service # 0 = locked down, 10 = exposed
journalctl -u myapp.service -e # check nothing got blocked
systemd-analyze security lists every directive with its exposure weight, so you can see exactly what to tighten next.
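To track the score over time, or to fail a deployment script when it regresses, the number can be pulled out of the report. This is a minimal sketch that assumes the summary line has the form "Overall exposure level for <unit>: <score> <LABEL>"; the function names and the limit are illustrative, and the parsing may need adjusting if your systemd version prints the line differently.

```shell
# Print the numeric score found in `systemd-analyze security <unit>` output on stdin.
# Assumes a summary line containing "Overall exposure level for <unit>: <score> ...".
exposure_score() {
  awk '/Overall exposure level for/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[0-9]+(\.[0-9]+)?$/) { print $i; exit }
  }'
}

# Succeed when the score read from stdin is at or below the limit given as $1.
score_within() {
  limit="$1"
  score="$(exposure_score)"
  [ -n "$score" ] && awk -v s="$score" -v l="$limit" 'BEGIN { exit !(s <= l) }'
}
```

A possible use: systemd-analyze security myapp.service | score_within 3.0 || echo "hardening regressed". The limit of 3.0 is an arbitrary example, not a recommended target.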
Method: start strict, restart, watch the journal for EPERM (“Operation not permitted”), and relax one directive at a time.
Gotchas: MemoryDenyWriteExecute=true breaks JITs (some JVMs, V8); drop it for those. SystemCallFilter= can block a syscall that a runtime needs at startup, and by default a filtered call kills the process with SIGSYS rather than returning an error, so if a service dies instantly after you add the filter, that is usually the cause. Test on a non-production box first.
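When the journal is noisy, a small filter can help surface the lines that point at a sandbox denial. The patterns below are assumptions about typical messages, not a complete list: the errno strings come from the application itself, while status=31/SYS and status=226/NAMESPACE are how systemd is assumed to report a seccomp kill and a failed namespace setup (for example a missing ReadWritePaths= directory). The function name find_denials is hypothetical.

```shell
# Hypothetical helper: keep only journal lines on stdin that hint at a sandbox denial.
# The pattern list is illustrative and can be extended per application.
find_denials() {
  grep -E -i 'operation not permitted|permission denied|read-only file system|status=31/SYS|status=226/NAMESPACE'
}
```

For example: journalctl -u myapp.service -b --no-pager | find_denials. An empty result does not prove nothing was blocked, since an application may swallow the error without logging it.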