systemd can keep a service alive across crashes — but you also want a backoff delay and a limit so a service that crashes instantly on startup doesn’t spin forever.
The unit
[Unit]
Description=My resilient worker
# Stop trying if it fails 5 times within 60s (a crash loop):
StartLimitIntervalSec=60
StartLimitBurst=5
[Service]
ExecStart=/usr/local/bin/worker
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Restart= values, in plain terms
on-failure: restart on a non-zero exit, a signal (crash), a timeout, or a watchdog ping miss. The usual choice.
always: restart even on a clean exit 0. Use for a daemon that should never stop.
on-abnormal: only on signal/timeout/watchdog, not on a non-zero exit code.
no (default): never restart.
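RestartSec=5 waits 5 seconds before each restart attempt. The delay is fixed, so it is a pause rather than a growing backoff. Without it, systemd restarts after its default of 100 ms.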
The rate limit (StartLimit)
StartLimitIntervalSec= and StartLimitBurst= belong in the [Unit] section on systemd v230 and later. The example above allows at most 5 starts within 60 seconds; the next start attempt is refused and the unit is put in a failed state instead of looping. Reset a unit that hit the limit with:
sudo systemctl reset-failed myapp.service
sudo systemctl start myapp.service
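The limit only trips if enough restarts fit inside the window, so it is worth checking RestartSec against the two StartLimit values. The sketch below does that arithmetic under an idealized assumption: the service dies the instant it starts, so attempts land at t = 0, RestartSec, 2 × RestartSec, and so on. The function name crash_loop_verdict is made up for this illustration, and real timings will drift by however long each start takes.

```shell
#!/bin/sh
# Idealized model (an assumption): the service crashes immediately, so start
# attempts happen at t = 0, RestartSec, 2*RestartSec, ...
# Attempt number (StartLimitBurst + 1) is the one systemd would refuse,
# but only if it still falls inside the StartLimitIntervalSec window.
crash_loop_verdict() {
    restart_sec=$1 burst=$2 interval=$3
    refused_at=$((restart_sec * burst))   # time of attempt number burst+1
    if [ "$refused_at" -lt "$interval" ]; then
        echo "limit trips at about t=${refused_at}s"
    else
        echo "limit never trips: the unit restarts forever"
    fi
}

crash_loop_verdict 5 5 60     # values from the unit above
crash_loop_verdict 15 5 60    # hypothetical slower RestartSec
```

With the unit above the sixth attempt would come at roughly 25 seconds and be refused. With RestartSec=15 only four or five starts fit into any 60-second window, so the unit would keep restarting indefinitely. As a rule of thumb, keep RestartSec × StartLimitBurst comfortably below StartLimitIntervalSec.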
Apply and verify
sudo systemctl daemon-reload
sudo systemctl restart myapp.service
journalctl -u myapp.service -f # watch it die and come back
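Besides tailing the journal, you can ask systemd for its own bookkeeping. The sketch below assumes a reasonably recent systemd that exposes the NRestarts and Result properties through systemctl show, and that a unit stopped by the rate limit reports Result=start-limit-hit. The helper restart_summary is a hypothetical name; it only parses KEY=VALUE lines from standard input.

```shell
#!/bin/sh
# Reads KEY=VALUE lines (the format printed by `systemctl show`) on stdin
# and prints a one-line summary of the restart state.
restart_summary() {
    restarts="?" result="?"
    while IFS='=' read -r key value; do
        case "$key" in
            NRestarts) restarts=$value ;;
            Result)    result=$value ;;
        esac
    done
    if [ "$result" = "start-limit-hit" ]; then
        echo "gave up after ${restarts} restarts (start limit hit)"
    else
        echo "restarts so far: ${restarts}, last result: ${result}"
    fi
}

# On a systemd host, with the unit name used above:
#   systemctl show myapp.service -p NRestarts -p Result | restart_summary
```

If the summary says the start limit was hit, the reset-failed and start commands shown earlier bring the unit back.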
Tip: to escalate beyond a restart, add StartLimitAction=reboot (or poweroff) so a host whose critical service can’t recover reboots itself. To restart a process that is still running but has hung, set WatchdogSec= and have the program call sd_notify(0, "WATCHDOG=1") at regular intervals; a missed ping counts as a failure, which Restart=on-failure then handles.
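If the worker is a shell script rather than a program linked against libsystemd, the same keep-alive can be sent with the systemd-notify command. This is a minimal sketch under several assumptions: systemd exports WATCHDOG_USEC (the timeout in microseconds) and NOTIFY_SOCKET to the service, the unit usually needs NotifyAccess=all so that messages from the short-lived systemd-notify process are accepted, and do_one_unit_of_work is a placeholder for the real job. The function names are made up for illustration.

```shell
#!/bin/sh
# Seconds between pings: half of the watchdog timeout that systemd exports
# in WATCHDOG_USEC (microseconds). Prints 0 when no watchdog is configured.
watchdog_ping_interval() {
    echo $(( ${WATCHDOG_USEC:-0} / 2000000 ))
}

# Send one keep-alive. A no-op when NOTIFY_SOCKET is unset, i.e. when the
# script is not running under a systemd unit that accepts notifications.
ping_watchdog() {
    [ -n "${NOTIFY_SOCKET:-}" ] || return 0
    systemd-notify WATCHDOG=1
}

# Hypothetical main loop of the worker:
#   while do_one_unit_of_work; do
#       ping_watchdog
#   done
```

Pinging at about half the timeout leaves room for one slow iteration before systemd decides the process has hung.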