
Your VPS can be online while your website is completely inaccessible. A crashed Nginx service, an unresponsive MySQL process, or an exhausted PHP-FPM pool can bring applications down even though the server itself is still running. Traditional monitoring tools can tell you something has failed, but they rarely fix the problem. That’s why effective server monitoring and following VPS best practices go hand in hand with maintaining reliable infrastructure.
A self-healing VPS takes the next step. Instead of waiting for someone to investigate an alert, it automatically checks critical services, attempts recovery, records every action, and escalates only when local recovery is no longer enough.
In this guide, you’ll learn how to build a lightweight self-healing VPS using Bash and Cron, with the option to extend it using the SolusVM API for infrastructure-level recovery when your hosting environment supports it.
How a Self-Healing VPS Actually Works
The term sounds complex, but the concept is not. A self-healing VPS runs a small monitoring script every minute via cron. The script checks whether your critical services are running. If one is down, it attempts a restart, waits a few seconds, and checks again. If the restart succeeds, it logs the recovery and continues monitoring. If not, it retries on later runs. If all recovery attempts fail, it escalates to a VPS reboot, which can be requested at the infrastructure level through BigCloudy’s SolusVM API.
Everything from the service crash to the VPS coming back can happen without you touching a keyboard.
No monitoring agent subscription. No third-party SaaS dashboard. No complex setup. Just native Linux tools and the SolusVM API included with BigCloudy VPS plans.
Self-Healing Infrastructure vs Traditional Monitoring
Most monitoring tools are designed to detect failures and notify administrators. Self-healing infrastructure takes the next step by attempting recovery automatically before human intervention is required.
| Feature | Traditional Monitoring | Self-Healing Infrastructure |
|---|---|---|
| Failure Detection | Detects failures | Detects failures and attempts automatic recovery |
| Response | Sends alerts to administrators | Takes corrective action automatically |
| Manual Intervention | Requires administrator response | Handles common failures independently |
| Recovery Time (MTTR) | Higher Mean Time to Recovery (MTTR) | Lower Mean Time to Recovery (MTTR) |
| Approach | Reactive approach | Proactive recovery approach |
Traditional monitoring is still useful, but it only tells you that something is broken. A self-healing system combines monitoring with automated recovery to minimize downtime and restore services quickly.
Before You Start
This guide targets Ubuntu 20.04 and Ubuntu 22.04. The commands work the same on Debian 11 and 12. Everything here also works on AlmaLinux and Rocky Linux, apart from slight differences in the package names used for PHP.
What you need:
- A VPS with full root access.
- The ability to log in to the server via SSH.
- Nginx, Apache, MySQL, or PHP-FPM already installed and managed by systemd.
- Your SolusVM API Key and Hash from the BigCloudy control panel (as described in Step 4).
If you want to confirm your services use systemd before continuing, run this:
systemctl status nginx
If you see active, you are set. If you see inactive or failed, the service is installed but not running. That is exactly the situation this script will handle automatically going forward.
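To check several services in one pass, a loop like the following can help. This is a minimal sketch: `report_units` is a hypothetical helper name, and the unit names passed to it are examples to replace with your own.

```bash
#!/bin/bash
# Print the systemd state of each unit given as an argument.
report_units() {
    local unit state
    for unit in "$@"; do
        if command -v systemctl >/dev/null 2>&1; then
            # is-active exits non-zero for stopped units, so ignore the status
            state="$(systemctl is-active "$unit" 2>/dev/null || true)"
        else
            state="systemctl not available"
        fi
        echo "$unit: ${state:-unknown}"
    done
}

report_units nginx apache2 httpd
```

Units that are not installed on your distribution simply report a non-active state, which is harmless here.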
How the Recovery Works, Layer by Layer
Before writing a single line of code, it helps to understand what the script actually does at each stage. This is not abstract theory. These are the exact decisions the script makes every time it runs.
Layer 1: Service check: The script calls systemctl is-active for each service you configure. If the service is running, it logs an OK entry and takes no action. The recovery logic only runs when a service is down.
Layer 2: Cooldown check: If a service is down and the script tried to restart it less than 5 minutes ago, it skips the restart and waits. This prevents the script from hammering a broken service every 60 seconds.
Layer 3: Restart attempt: The script calls systemctl restart, waits 5 seconds, and checks again. If the service recovered, it logs the success and resets the retry counter.
Layer 4: Retry tracking: If the restart failed, the script increments a counter stored in /var/lib/selfheal. After 3 failed attempts across multiple runs, it stops trying service-level fixes and escalates.
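Layer 5: VPS reboot: Once a service has exhausted its retries, the script escalates to a reboot. As listed, it reboots from inside the guest with /sbin/shutdown -r now. With SolusVM API access, you can replace that with a reboot request sent via curl, which is a VPS-level restart performed outside the guest operating system and works even if the OS itself is partially frozen.
Layer 6: Reboot storm protection: The reboot path has its own guard rails. After a reboot is triggered, another one is blocked for 60 minutes (REBOOT_COOLDOWN_MINUTES), and a reboot is skipped entirely while disk usage is above DISK_LIMIT. In both cases, the log and alerts tell you manual investigation is required.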
This layered approach means a single transient crash does not trigger a full VPS reboot. Only genuine, repeated, unrecoverable failures escalate that far.
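The decision flow across these layers can be sketched as one small function. This is an illustration only, not part of the monitoring script: `decide_action` and its argument order are hypothetical, and the thresholds mirror the MAX_RETRIES and COOLDOWN_MINUTES values used in Step 2.

```bash
#!/bin/bash
MAX_RETRIES=3
COOLDOWN_MINUTES=5

# decide_action STATE ELAPSED_MIN RETRIES
#   STATE        "active", or anything else (treated as down)
#   ELAPSED_MIN  minutes since the last restart attempt, -1 if none yet
#   RETRIES      failed restart attempts recorded so far
decide_action() {
    local state="$1" elapsed="$2" retries="$3"
    if [ "$state" = "active" ]; then
        echo "none"        # Layer 1: healthy, nothing to do
    elif [ "$elapsed" -ge 0 ] && [ "$elapsed" -lt "$COOLDOWN_MINUTES" ]; then
        echo "wait"        # Layer 2: cooldown still running
    elif [ "$retries" -ge "$MAX_RETRIES" ]; then
        echo "escalate"    # Layers 4-5: retry budget used up
    else
        echo "restart"     # Layer 3: try a service restart
    fi
}

decide_action active -1 0   # prints: none
decide_action failed -1 0   # prints: restart
decide_action failed 2 1    # prints: wait
decide_action failed 6 3    # prints: escalate
```

Keeping the decision separate from the side effects (restarting, rebooting) makes the ordering of the checks easy to reason about: cooldown is evaluated before the retry limit, so an exhausted service is never escalated in the middle of a cooldown window.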
Step 1: Set Up the Script Directory
Create a dedicated folder for everything related to the monitoring setup. This keeps the script, credentials, and config in one place with proper permissions.
mkdir -p /opt/selfheal
touch /opt/selfheal/monitor.sh
chmod +x /opt/selfheal/monitor.sh
touch /var/log/selfheal.log
chmod 640 /var/log/selfheal.log
Step 2: The Complete Monitoring Script
Open the script file:
nano /opt/selfheal/monitor.sh
Copy and paste the entire script below. Every section is commented so you understand what each part does, not just what to paste.
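#!/bin/bash
# Self-Healing VPS Monitor with Email, Telegram, Slack Alerts

# Treat unset variables as errors and fail a pipeline if any stage fails
set -u
set -o pipefail

# Paths for the log, the persistent state, and the single-instance lock
LOGFILE="/var/log/selfheal.log"
STATE_DIR="/var/lib/selfheal"
LOCKFILE="/tmp/selfheal.lock"

# Recovery limits and resource thresholds (percent, except LOAD_LIMIT)
MAX_RETRIES=3
COOLDOWN_MINUTES=5
REBOOT_COOLDOWN_MINUTES=60
DISK_LIMIT=90
MEMORY_LIMIT=95
LOAD_LIMIT=20
HOSTNAME="$(hostname -f 2>/dev/null || hostname)"

# Alert Settings
ENABLE_EMAIL_ALERTS="no"
EMAIL_TO="admin@example.com"
ENABLE_TELEGRAM_ALERTS="no"
TELEGRAM_BOT_TOKEN="YOUR_BOT_TOKEN"
TELEGRAM_CHAT_ID="YOUR_CHAT_ID"
ENABLE_SLACK_ALERTS="no"
SLACK_WEBHOOK_URL="YOUR_SLACK_WEBHOOK_URL"

# Services to monitor; entries that are not installed are skipped
SERVICES=(
    nginx
    apache2
    httpd
    lsws
)

mkdir -p "$STATE_DIR"
touch "$LOGFILE"

# Hold an exclusive lock so overlapping cron runs exit immediately
exec 200>"$LOCKFILE"
flock -n 200 || exit 0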
# Append a timestamped line to the log file
log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1" >> "$LOGFILE"
}

# Log an alert and forward it to every enabled notification channel
send_alert() {
    local SUBJECT="$1"
    local MESSAGE="$2"
    local FULL_MESSAGE="[$HOSTNAME] $MESSAGE"

    log "ALERT: $SUBJECT - $MESSAGE"

    if [ "$ENABLE_EMAIL_ALERTS" = "yes" ]; then
        if command -v mail >/dev/null 2>&1; then
            echo "$FULL_MESSAGE" | mail -s "$SUBJECT - $HOSTNAME" "$EMAIL_TO"
        elif command -v mailx >/dev/null 2>&1; then
            echo "$FULL_MESSAGE" | mailx -s "$SUBJECT - $HOSTNAME" "$EMAIL_TO"
        else
            log "ERROR: Email alert enabled but mail/mailx command not found."
        fi
    fi

    if [ "$ENABLE_TELEGRAM_ALERTS" = "yes" ]; then
        if command -v curl >/dev/null 2>&1; then
            curl -s -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
                -d chat_id="$TELEGRAM_CHAT_ID" \
                -d text="$FULL_MESSAGE" >/dev/null 2>&1
        else
            log "ERROR: Telegram alert enabled but curl not found."
        fi
    fi

    if [ "$ENABLE_SLACK_ALERTS" = "yes" ]; then
        if command -v curl >/dev/null 2>&1; then
            curl -s -X POST -H 'Content-type: application/json' \
                --data "{\"text\":\"$FULL_MESSAGE\"}" \
                "$SLACK_WEBHOOK_URL" >/dev/null 2>&1
        else
            log "ERROR: Slack alert enabled but curl not found."
        fi
    fi
}
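# Usage of the root filesystem as a whole number (no % sign).
# -P keeps each filesystem on one line even with long device names.
get_disk_usage() {
    df -P / | awk 'NR==2 {print $5}' | tr -d '%'
}

# True if systemd has a unit file named <name>.service.
# grep -q is avoided on purpose: it exits on the first match, which can
# SIGPIPE the upstream commands and fail the pipeline under pipefail.
unit_exists() {
    systemctl list-unit-files | awk '{print $1}' | grep -Fx -- "$1.service" >/dev/null
}

# True if the service is known to systemd or its binary is on PATH
service_exists() {
    local SERVICE="$1"
    if unit_exists "$SERVICE"; then
        return 0
    fi
    if command -v "$SERVICE" >/dev/null 2>&1; then
        return 0
    fi
    return 1
}

# Validate the web server configuration before attempting a restart
test_web_config() {
    local SERVICE="$1"
    case "$SERVICE" in
        nginx)
            command -v nginx >/dev/null 2>&1 && nginx -t >/dev/null 2>&1
            ;;
        apache2)
            command -v apache2ctl >/dev/null 2>&1 && apache2ctl configtest >/dev/null 2>&1
            ;;
        httpd)
            command -v httpd >/dev/null 2>&1 && httpd -t >/dev/null 2>&1
            ;;
        lsws)
            return 0
            ;;
        *)
            # No config test available for other services
            return 0
            ;;
    esac
}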
# Warn when root filesystem usage exceeds DISK_LIMIT
check_disk() {
    local DISK_USAGE
    DISK_USAGE=$(get_disk_usage)
    if [ "$DISK_USAGE" -gt "$DISK_LIMIT" ]; then
        log "WARNING: Disk usage critical (${DISK_USAGE}%). Reboot blocked."
        send_alert "Disk Critical" "Disk usage is ${DISK_USAGE}%. Reboot will be blocked."
        return 1
    fi
    log "OK: Disk usage normal (${DISK_USAGE}%)."
    return 0
}

# Warn when used memory exceeds MEMORY_LIMIT (used / total, in percent)
check_memory() {
    local MEMORY_USAGE
    MEMORY_USAGE=$(free | awk '/Mem:/ {printf "%.0f", $3/$2 * 100}')
    if [ "$MEMORY_USAGE" -gt "$MEMORY_LIMIT" ]; then
        log "WARNING: Memory usage critical (${MEMORY_USAGE}%)."
        send_alert "Memory Critical" "Memory usage is ${MEMORY_USAGE}%."
        return 1
    fi
    log "OK: Memory usage normal (${MEMORY_USAGE}%)."
    return 0
}

# Warn when the 1-minute load average (integer part) exceeds LOAD_LIMIT
check_load() {
    local LOAD
    LOAD=$(awk '{print int($1)}' /proc/loadavg)
    if [ "$LOAD" -gt "$LOAD_LIMIT" ]; then
        log "WARNING: High system load detected (${LOAD})."
        send_alert "High Load" "System load is high: ${LOAD}."
        return 1
    fi
    log "OK: System load normal (${LOAD})."
    return 0
}
check_service() {
    local SERVICE="$1"
    local RETRY_FILE="$STATE_DIR/${SERVICE}_retries"
    local COOLDOWN_FILE="$STATE_DIR/${SERVICE}_cooldown"

    # Skip services that are not installed on this server
    if ! service_exists "$SERVICE"; then
        log "SKIPPED: $SERVICE not found."
        return 0
    fi

    # Healthy: clear any leftover state and move on
    if systemctl is-active --quiet "$SERVICE" 2>/dev/null; then
        rm -f "$RETRY_FILE" "$COOLDOWN_FILE"
        log "OK: $SERVICE is running."
        return 0
    fi

    # Down, but a restart was attempted recently: wait out the cooldown
    if [ -f "$COOLDOWN_FILE" ]; then
        local LAST_ATTEMPT NOW ELAPSED
        LAST_ATTEMPT=$(cat "$COOLDOWN_FILE")
        NOW=$(date +%s)
        ELAPSED=$(( (NOW - LAST_ATTEMPT) / 60 ))
        if [ "$ELAPSED" -lt "$COOLDOWN_MINUTES" ]; then
            log "INFO: $SERVICE cooldown active."
            return 0
        fi
    fi

    local RETRIES=0
    if [ -f "$RETRY_FILE" ]; then
        RETRIES=$(cat "$RETRY_FILE")
    fi

    # Retry budget exhausted: stop restarting and escalate
    if [ "$RETRIES" -ge "$MAX_RETRIES" ]; then
        log "ESCALATION: $SERVICE exceeded retry limit."
        send_alert "Service Escalation" "$SERVICE exceeded retry limit. Manual check required."
        return 1
    fi

    log "WARNING: $SERVICE is down. Restart attempt $((RETRIES + 1)) of $MAX_RETRIES."
    send_alert "Service Down" "$SERVICE is down. Restart attempt $((RETRIES + 1)) of $MAX_RETRIES."
    systemctl reset-failed "$SERVICE" >/dev/null 2>&1

    # Do not restart a web server whose configuration fails validation
    if ! test_web_config "$SERVICE"; then
        echo $((RETRIES + 1)) > "$RETRY_FILE"
        date +%s > "$COOLDOWN_FILE"
        log "FAILED: $SERVICE config test failed."
        send_alert "Config Test Failed" "$SERVICE config test failed. Restart skipped."
        return 1
    fi

    # Restart, give the service 5 seconds to settle, then verify
    if systemctl restart "$SERVICE" >/dev/null 2>&1; then
        date +%s > "$COOLDOWN_FILE"
        sleep 5
        if systemctl is-active --quiet "$SERVICE" 2>/dev/null; then
            rm -f "$RETRY_FILE" "$COOLDOWN_FILE"
            log "RECOVERED: $SERVICE restarted successfully."
            send_alert "Service Recovered" "$SERVICE restarted successfully."
            return 0
        fi
    fi

    # Restart failed: record the attempt and start a new cooldown
    echo $((RETRIES + 1)) > "$RETRY_FILE"
    date +%s > "$COOLDOWN_FILE"
    log "FAILED: $SERVICE restart unsuccessful."
    send_alert "Restart Failed" "$SERVICE restart unsuccessful."
    return 1
}
reboot_if_required() {
    local DISK_USAGE
    local REBOOT_FILE="$STATE_DIR/server_reboot_cooldown"
    DISK_USAGE=$(get_disk_usage)

    # A reboot will not free disk space, so skip it while the disk is full
    if [ "$DISK_USAGE" -gt "$DISK_LIMIT" ]; then
        log "CRITICAL: Recovery failed, but disk usage is ${DISK_USAGE}%. Reboot skipped."
        send_alert "Reboot Blocked" "Recovery failed, but disk usage is ${DISK_USAGE}%. Reboot skipped."
        return 1
    fi

    # Reboot storm protection: at most one reboot per cooldown window
    if [ -f "$REBOOT_FILE" ]; then
        local LAST_REBOOT NOW ELAPSED
        LAST_REBOOT=$(cat "$REBOOT_FILE")
        NOW=$(date +%s)
        ELAPSED=$(( (NOW - LAST_REBOOT) / 60 ))
        if [ "$ELAPSED" -lt "$REBOOT_COOLDOWN_MINUTES" ]; then
            log "CRITICAL: Reboot required but cooldown active."
            send_alert "Reboot Cooldown Active" "Recovery failed, but reboot cooldown is active."
            return 1
        fi
    fi

    # Record the reboot time first so the cooldown survives the restart
    date +%s > "$REBOOT_FILE"
    log "CRITICAL: Recovery failed. Disk safe at ${DISK_USAGE}%. Rebooting server."
    send_alert "Server Rebooting" "Recovery failed. Disk safe at ${DISK_USAGE}%. Server is rebooting now."
    /sbin/shutdown -r now "Self-healing reboot: web service recovery failed"
}
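log "----- Self-healing check started -----"
ESCALATED_COUNT=0

# Resource checks only log and alert; they never stop the run
check_disk || true
check_memory || true
check_load || true

for SERVICE in "${SERVICES[@]}"; do
    # Read the retry count before the check. A failure with the retry
    # budget already used up means check_service took the ESCALATION path;
    # earlier failures are retried on later runs instead of rebooting.
    PREV_RETRIES=$(cat "$STATE_DIR/${SERVICE}_retries" 2>/dev/null || echo 0)
    if ! check_service "$SERVICE" && [ "$PREV_RETRIES" -ge "$MAX_RETRIES" ]; then
        ESCALATED_COUNT=$((ESCALATED_COUNT + 1))
    fi
done

if [ "$ESCALATED_COUNT" -gt 0 ]; then
    reboot_if_required
fi

log "----- Self-healing check completed -----"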
Note: If your VPS provider offers SolusVM API access, you can extend this setup beyond service recovery and automate VPS-level actions such as reboot, start, or stop operations during critical failures.
For example, BigCloudy’s API endpoints follow a structure similar to:
https://manage.bigcloudy.com/api/v1/servers/
To target a specific VPS, append the server ID:
https://manage.bigcloudy.com/api/v1/servers/38
This allows your recovery workflow to trigger infrastructure-level actions when local service restarts are no longer effective. You can also integrate email, Slack, Telegram, or other notification systems to receive alerts whenever recovery actions, failures, or VPS reboots occur.
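As a rough sketch of how that URL structure could be used from Bash, the helper below builds the per-server endpoint and fetches its details. Treat every name here as an assumption: `server_url`, `fetch_server`, and `API_TOKEN` are hypothetical, the base URL is copied from the example above, and Bearer-token authentication should be confirmed against your provider’s API documentation. No reboot or other action endpoint is shown, because its exact path is not given here.

```bash
#!/bin/bash
# Base URL taken from the example above; confirm it in your own dashboard.
API_BASE="https://manage.bigcloudy.com/api/v1"

# Build the per-server endpoint by appending a numeric server ID.
server_url() {
    local id="$1"
    case "$id" in
        ''|*[!0-9]*)
            echo "server ID must be numeric" >&2
            return 1
            ;;
    esac
    printf '%s/servers/%s\n' "$API_BASE" "$id"
}

# Fetch the server details (read-only). Not called in this sketch.
# API_TOKEN is a hypothetical variable; Bearer auth is an assumption.
fetch_server() {
    local url
    url="$(server_url "$1")" || return 1
    curl --silent --fail --max-time 15 \
        -H "Authorization: Bearer ${API_TOKEN:?set API_TOKEN first}" \
        -H "Accept: application/json" \
        "$url"
}

server_url 38   # prints: https://manage.bigcloudy.com/api/v1/servers/38
```

Validating the ID before building the URL keeps a typo or an empty variable from turning into a request against the wrong endpoint.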
Save with Ctrl+O, Enter, then Ctrl+X.
Make the script executable if you have not already:
chmod +x /opt/selfheal/monitor.sh
Run it once manually to confirm there are no errors:
bash /opt/selfheal/monitor.sh
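Check the log. If all your services are healthy, you will see the start and end markers of the run with OK entries between them, plus SKIPPED entries for services that are not installed. WARNING, FAILED, or CRITICAL entries appear only when something needs attention.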
cat /var/log/selfheal.log
Step 3: Schedule Automated Health Checks with Cron
Now that your monitoring script is ready, schedule it to run automatically using Cron. This allows the VPS to perform health checks at regular intervals without any manual intervention.
Open your crontab:
crontab -e
Add the following line to run the monitoring script every minute:
* * * * * /bin/bash /opt/selfheal/monitor.sh
Save and exit the editor. Cron will automatically pick up the new schedule, and the script will begin checking your services every 60 seconds.
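If you prefer to add the entry without opening an editor, for example from a provisioning script, one possible approach is to filter the existing crontab and append the line once. This is a sketch: `merge_cron_entry` is a hypothetical helper, and the backup job in the preview is a made-up example entry.

```bash
#!/bin/bash
CRON_LINE='* * * * * /bin/bash /opt/selfheal/monitor.sh'

# Read an existing crontab on stdin, drop any old monitor.sh entry,
# and print the result with exactly one fresh entry appended.
merge_cron_entry() {
    # grep exits non-zero when nothing is left to print; that is fine here
    grep -vF '/opt/selfheal/monitor.sh' || true
    printf '%s\n' "$CRON_LINE"
}

# Preview only: this prints the merged crontab and installs nothing.
printf '%s\n' '0 3 * * * /usr/bin/backup.sh' "$CRON_LINE" | merge_cron_entry
```

To apply it for real, pipe your current crontab through the function: `crontab -l 2>/dev/null | merge_cron_entry | crontab -`. Because the old entry is removed before the new one is appended, running it repeatedly never creates duplicates.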
Verify the Cron Job
To confirm the job has been added successfully, run:
crontab -l
You should see the scheduled entry listed.
Confirm the Script Is Running
Wait a minute or two, then verify that Cron is executing the script.
On Ubuntu and Debian:
grep "monitor.sh" /var/log/syslog | tail -10
On RHEL, Rocky Linux, AlmaLinux, or CentOS:
grep "monitor.sh" /var/log/cron | tail -10
If your system uses systemd-journald instead of traditional log files, you can also check Cron activity with:
journalctl -u cron --since "10 minutes ago"
or on RHEL-based systems:
journalctl -u crond --since "10 minutes ago"
Once you confirm the script is running every minute, your VPS is continuously monitoring critical services and can begin responding to failures automatically.
Step 4: Connect to SolusVM
This is what separates a good monitoring script from truly self-healing infrastructure. If the service-level restarts do not succeed, the script can call BigCloudy’s SolusVM API and request a VPS reboot at the virtualization level. The reboot happens outside the guest OS, so it works even if the guest OS is unresponsive.
Here’s how to obtain your API credentials from BigCloudy:
- Sign in to your BigCloudy client area.
- Navigate to Services, then to My Services and then click on your VPS.
- Click on Login to SolusVM to access the server control panel.
- Go to the API section in the left sidebar of the SolusVM panel.
- If it is disabled, enable API Access.
- Copy the API Key and API Hash that are displayed. They look similar to this:
Key: ABCDE-FGHIJ-KLMNO
Hash: 7f3d2a1b9c4e8a0d2b4c6e8f0a2b4c9e
Now store them in a separate credentials file rather than directly in the script:
nano /opt/selfheal/.solusvm_creds
Add these two lines with your actual values:
SOLUSVM_KEY="YOUR_API_KEY_HERE"
SOLUSVM_HASH="YOUR_API_HASH_HERE"
Lock the file so only root can read it:
chmod 600 /opt/selfheal/.solusvm_creds
Then load that file from the monitor script. Open monitor.sh and add this line near the top, below the alert settings:
source /opt/selfheal/.solusvm_creds
Test the API connection before relying on it. This command only fetches status and does not reboot anything:
curl -X GET \
  "https://cloud.bigcloudy.com/api/v1/servers" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Accept: application/json"
A working response looks like this:
<status>success</status>
<statusmsg>online</statusmsg>
If you see <status>error</status>, double-check that you copied the key and hash correctly and that API access is enabled in the SolusVM panel. If the request times out, verify the panel URL from your BigCloudy dashboard.
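If you want the monitoring script to verify a response before trusting the API, a check along these lines is one option. It is a sketch that assumes the XML-style body shown above; `api_response_ok` is a hypothetical name, and if your panel returns JSON instead you would adapt the pattern.

```bash
#!/bin/bash
# Return 0 only when the response body contains a success status element.
api_response_ok() {
    case "$1" in
        *"<status>success</status>"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample body in the format shown above, used here instead of a live call.
sample='<status>success</status><statusmsg>online</statusmsg>'
if api_response_ok "$sample"; then
    echo "API reachable"
else
    echo "API check failed"
fi
```

In practice you would pass the output of your curl call to the function and log the result, so a misconfigured key is discovered long before a reboot is needed.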
Step 5: Read Your Logs Like a Pro
The log file at /var/log/selfheal.log tells you exactly what your server has been doing while you were not watching. Here is what each type of entry means. Every notification also writes a matching ALERT line, omitted from most examples below for brevity.
Clean recovery, service came back on the first try:
2025-06-10 03:14:22 WARNING: nginx is down. Restart attempt 1 of 3.
2025-06-10 03:14:27 RECOVERED: nginx restarted successfully.
Service down, but cooldown is active:
2025-06-10 03:15:28 INFO: nginx cooldown active.
All retries failed, escalating:
2025-06-10 03:29:40 ESCALATION: nginx exceeded retry limit.
2025-06-10 03:29:42 CRITICAL: Recovery failed. Disk safe at 42%. Rebooting server.
Reboot cooldown active, manual action needed:
2025-06-10 04:02:11 CRITICAL: Reboot required but cooldown active.
2025-06-10 04:02:11 ALERT: Reboot Cooldown Active - Recovery failed, but reboot cooldown is active.
To filter only the events that needed action:
grep -E "WARNING|FAILED|ESCALATION|CRITICAL" /var/log/selfheal.log
To see only today’s entries:
grep "$(date '+%Y-%m-%d')" /var/log/selfheal.log
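For a quick overview rather than individual lines, you can count entries per level. The sketch below assumes the "DATE TIME LEVEL: message" layout written by the script’s log() function; `summarize_log` is a hypothetical helper, and the demo runs against a throwaway file so the real log is untouched.

```bash
#!/bin/bash
# Count log entries per level (the third field, minus its trailing colon).
# Lines without a "LEVEL:" field, such as run markers, are ignored.
summarize_log() {
    awk '$3 ~ /:$/ { sub(/:$/, "", $3); count[$3]++ }
         END { for (level in count) print level, count[level] }' "$1" | sort
}

# Demo against a temporary file containing a few sample entries.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
2025-06-10 03:14:22 WARNING: nginx is down. Restart attempt 1 of 3.
2025-06-10 03:14:27 RECOVERED: nginx restarted successfully.
2025-06-10 03:15:01 OK: nginx is running.
2025-06-10 03:16:01 OK: nginx is running.
EOF
summarize_log "$demo"
rm -f "$demo"
```

Run it against the real file with `summarize_log /var/log/selfheal.log`. A rising FAILED or CRITICAL count between two runs is a quick signal that something deserves a closer look.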
To prevent the log from growing indefinitely, set up automatic rotation:
nano /etc/logrotate.d/selfheal
Paste this:
/var/log/selfheal.log {
weekly
rotate 4
compress
missingok
notifempty
}
This keeps 4 weeks of compressed history and runs automatically. No manual maintenance required.
Step 6: Lock Down the Script
This script can reboot your entire VPS. Treat its security accordingly.
# Script: only root can read or run it
chmod 700 /opt/selfheal/monitor.sh
# Credentials: only root can read it
chmod 600 /opt/selfheal/.solusvm_creds
# Log file: root can write, others cannot read sensitive recovery data
chmod 640 /var/log/selfheal.log
# Confirm ownership
chown root:root /opt/selfheal/monitor.sh
chown root:root /opt/selfheal/.solusvm_creds
Two non-negotiable rules: never put the script or credentials anywhere under your web root, and never change the SolusVM API calls from HTTPS to HTTP, even temporarily.
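To confirm the permissions actually stuck, a small audit helper can compare each file against the mode set above. This is a sketch: `check_mode` and `audit_selfheal` are hypothetical names, and it relies on GNU stat (the `-c` option), which is what the distributions covered in this guide ship.

```bash
#!/bin/bash
# Compare a file's octal permission bits with the expected value.
check_mode() {
    local path="$1" expected="$2" actual
    actual="$(stat -c '%a' "$path" 2>/dev/null)" || {
        echo "MISSING $path"
        return 1
    }
    if [ "$actual" = "$expected" ]; then
        echo "OK $path ($actual)"
    else
        echo "MISMATCH $path (have $actual, want $expected)"
        return 1
    fi
}

# Check the three files from this step; returns non-zero if any is off.
audit_selfheal() {
    local failed=0
    check_mode /opt/selfheal/monitor.sh 700 || failed=1
    check_mode /opt/selfheal/.solusvm_creds 600 || failed=1
    check_mode /var/log/selfheal.log 640 || failed=1
    return "$failed"
}

audit_selfheal || echo "Fix the entries flagged above."
```

Run it as root after any change to the setup, since package updates or manual edits can quietly reset file modes.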
Step 7: Test Everything Before You Rely on It
A recovery system you have never tested is a script you cannot trust. Here are three tests that verify the whole chain works.
Test 1: Basic service recovery
Open two terminal windows. In the first, watch the log in real time:
tail -f /var/log/selfheal.log
In the second, stop Nginx:
systemctl stop nginx
Within 60 seconds, you should see the WARNING entry appear in the first terminal, followed by the RECOVERED entry. Nginx should be running again:
systemctl is-active nginx
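If you would rather script this test than watch two terminals, a polling helper is one way to do it. The sketch below is an assumption-laden illustration: `wait_for_new_entry` is a hypothetical name, and it only considers lines appended after it starts, so older RECOVERED entries do not produce a false pass.

```bash
#!/bin/bash
# Wait until a NEW log line matching PATTERN appears, or TIMEOUT seconds pass.
# Usage: wait_for_new_entry PATTERN TIMEOUT [FILE]
wait_for_new_entry() {
    local pattern="$1" timeout="$2" file="${3:-/var/log/selfheal.log}"
    local start=0 waited=0
    # Remember how many lines exist now so only later lines are searched
    if [ -f "$file" ]; then
        start="$(wc -l < "$file")"
    fi
    while [ "$waited" -lt "$timeout" ]; do
        if tail -n "+$((start + 1))" "$file" 2>/dev/null | grep -q -- "$pattern"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Example (run as root on the VPS, commented out here on purpose):
#   systemctl stop nginx
#   wait_for_new_entry "RECOVERED: nginx" 90 && echo "recovery confirmed"
```

The 90-second timeout in the example leaves room for the next cron run plus the script’s own 5-second settle time.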
Test 2: Retry limit and escalation
Simulate a service that cannot restart by breaking its configuration:
# Back up the config first
cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.backup
# Inject a syntax error
echo "invalid_directive;" >> /etc/nginx/nginx.conf
# Stop the service
systemctl stop nginx
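Watch the log for the next 15-20 minutes. Because the script validates the Nginx configuration before restarting, each attempt is logged as a config test failure. You should see three failed attempts, separated by the 5-minute cooldown, followed by the ESCALATION entry. Escalation ends in a real reboot of the VPS, so run this test on a server you can afford to restart.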
When you finish testing, restore everything:
cp /etc/nginx/nginx.conf.backup /etc/nginx/nginx.conf
systemctl start nginx
rm -f /var/lib/selfheal/nginx_*
Test 3: SolusVM API connection
Run the status check from Step 4. Confirm you get a success response before the reboot path is ever needed in a real incident.

What Problems This Actually Solves
Not everything benefits from automated restart, but most common VPS failures do. Here is an honest breakdown:
Nginx crashes due to memory pressure: The script restarts it in under 60 seconds. Users experience a brief hiccup at most.
MySQL hitting a connection or memory limit: Usually recoverable with a restart, once you add your database unit (for example, mysql or mariadb) to the SERVICES array. If it keeps crashing, the log will show repeated FAILED entries, which tells you there is an underlying resource or config issue to investigate.
PHP-FPM worker exhaustion: Add php8.1-fpm (or the unit name for your PHP version) to the SERVICES array. Exhausted worker pools restart cleanly and recover without data loss.
Disk full causing silent MySQL crashes: The check_disk function flags this. It logs a warning and sends an alert when disk usage exceeds DISK_LIMIT (90% by default), and the script refuses to reboot while the disk is that full. Freeing space is still a manual step, for example with journalctl --vacuum-time=3d.
Frozen VPS where SSH is unreachable: The SolusVM API reboot handles this. It operates at the hypervisor level, outside the guest OS entirely.
Kernel panic or hardware failure: Automation cannot help here. When the reboot limit is reached, and the server still does not recover, the log tells you clearly. That is your signal to open a support ticket with BigCloudy.
The Part Most Guides Skip: Resetting After an Incident
Once you have investigated and fixed the underlying problem that triggered the automation, you need to reset the counters before the script will attempt recovery again.
rm -f /var/lib/selfheal/*
This clears all retry counters, service cooldown timers, and the reboot cooldown. The script starts fresh on the next cron run. Do not skip this step, or the script may still be sitting at its retry limit or inside a cooldown window and do nothing useful on the next failure.
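A slightly more cautious variant is to list the state files before deleting them. The sketch below assumes the state directory is /var/lib/selfheal, as set by STATE_DIR in the script, and that state files end in _retries or _cooldown; `reset_state` is a hypothetical helper name.

```bash
#!/bin/bash
# Remove selfheal state files from a directory.
# Usage: reset_state [DIR] [--dry-run]
reset_state() {
    local dir="${1:-/var/lib/selfheal}" mode="${2:-}"
    local f
    for f in "$dir"/*_retries "$dir"/*_cooldown; do
        # Unmatched globs stay literal, so skip anything that does not exist
        [ -e "$f" ] || continue
        if [ "$mode" = "--dry-run" ]; then
            echo "would remove $f"
        else
            rm -f -- "$f" && echo "removed $f"
        fi
    done
}

# Preview what would be cleared; nothing is deleted in dry-run mode.
reset_state /var/lib/selfheal --dry-run
```

Once the preview looks right, run `reset_state /var/lib/selfheal` as root. Listing the files first also doubles as a record of which services had been failing.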
Final Thoughts
A self-healing VPS won’t prevent every outage, but it can significantly reduce the time it takes to recover from one.
With Bash, Cron, structured logging, and SolusVM-powered recovery, you can automate common fixes, respond to failures faster, and spend less time reacting to routine server issues.
The goal isn’t to build a server that never fails. It’s to build one that knows how to recover when it does.
