Architecture
bgrun is split into 4 crates with a strict dependency graph. The daemon and CLI never link directly — they communicate over a Unix socket using a plain-text NDJSON protocol.
Crate layout
- bgrun-proto: zero I/O. Types only. Any crate can depend on it.
- bgrun-core: zero I/O. Pure logic and data structures. Depends only on bgrun-proto.
- bgrun-cli: depends on bgrun-proto + bgrun-core. Talks to the daemon over a Unix socket.
- bgrun-daemon: depends on bgrun-proto + bgrun-core. Full daemon with tokio, nix, tracing.
The CLI never imports bgrun-daemon. Path utilities (socket_path, state_dir) are in bgrun-proto::paths so both crates share them without coupling.
Daemon lifecycle
Startup
- The CLI detects whether the daemon socket exists.
- If missing, it spawns bgrun-daemon as a child process (no arguments).
- The daemon process forks itself:
  - The parent exits immediately, returning control to the CLI.
  - The child continues as the actual daemon.
- The child calls setsid() to create a new session and detach from the terminal.
- BGRUN_DAEMONIZED=1 is set to prevent recursive double-fork.
Socket setup
The daemon binds a Unix domain socket at $XDG_RUNTIME_DIR/bgrun/daemon.sock (or /tmp/bgrun-$UID/daemon.sock as fallback). It listens in a tokio accept loop, spawning a new task per connection.
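The fallback rule can be sketched as a pure function. This is an illustration only: the name socket_path_from and its parameters are hypothetical, and the real helper in bgrun-proto::paths may read the environment and UID differently.

```rust
use std::path::PathBuf;

/// Hypothetical helper: resolves the daemon socket path from explicit inputs,
/// so the fallback rule can be exercised without touching the environment.
pub fn socket_path_from(xdg_runtime_dir: Option<&str>, uid: u32) -> PathBuf {
    match xdg_runtime_dir {
        // Assumption: an empty variable is treated the same as an unset one.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir).join("bgrun").join("daemon.sock"),
        // Fallback: a per-user directory under /tmp.
        _ => PathBuf::from(format!("/tmp/bgrun-{uid}")).join("daemon.sock"),
    }
}
```

A caller would typically pass std::env::var("XDG_RUNTIME_DIR").ok().as_deref() together with the real UID.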
Orphan re-adoption
On startup, the daemon reads all persisted jobs from $XDG_DATA_DIR/bgrun/jobs/$ID/:
- If PID is alive (kill(pid, 0) succeeds): the job is re-inserted into the in-memory store and a background task monitors it every 2 seconds.
- If PID is dead: the job is marked Crashed in status.json.
This ensures no monitoring gap after a daemon restart.
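The re-adoption decision can be sketched as follows. PersistedJob, Recovery, and plan_recovery are hypothetical names, and the liveness probe is passed in as a closure. In the daemon it would presumably wrap kill(pid, 0).

```rust
/// Hypothetical, trimmed-down view of a job loaded from disk.
#[derive(Debug, Clone, PartialEq)]
pub struct PersistedJob {
    pub id: String,
    pub pid: u32,
}

/// What the daemon should do with each persisted job on startup.
#[derive(Debug, PartialEq)]
pub enum Recovery {
    /// PID still alive: put the job back in the store and monitor it.
    Readopt(PersistedJob),
    /// PID gone: record the job as Crashed in status.json.
    MarkCrashed(PersistedJob),
}

/// Splits persisted jobs into re-adopted and crashed ones.
/// `is_alive` stands in for the real liveness probe.
pub fn plan_recovery<F>(jobs: Vec<PersistedJob>, is_alive: F) -> Vec<Recovery>
where
    F: Fn(u32) -> bool,
{
    jobs.into_iter()
        .map(|job| {
            if is_alive(job.pid) {
                Recovery::Readopt(job)
            } else {
                Recovery::MarkCrashed(job)
            }
        })
        .collect()
}
```

Keeping the probe injectable makes the startup scan testable without real processes.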
Shutdown
The daemon runs until killed (pkill bgrun-daemon) or an idle timeout expires. On restart via the CLI, the new instance re-adopts orphaned children as described above.
Reactive idle auto-shutdown
The daemon spawns a background monitor that waits on tokio::sync::Notify, so it is woken only by job events and uses no CPU while idle:
- A global LIFECYCLE_NOTIFY static is defined in runner.rs.
- Every job spawn (spawn_job, spawn_pty_job) and job exit (handle_job_exit) calls .notify_one().
- The monitor loop:
  - Active jobs > 0: parks indefinitely on .notified().await (0 CPU usage).
  - Active jobs = 0: races .notified() against BGRUN_IDLE_TIMEOUT (default 60s).
- On timeout expiry, the socket file is deleted and the process exits.
This replaces any periodic polling approach, ensuring the daemon consumes no CPU while idle.
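The shape of the monitor loop can be illustrated with a std-only analogue. This sketch deliberately swaps tokio::sync::Notify for a std::sync::mpsc channel, with a blocking recv() standing in for .notified().await and recv_timeout() standing in for the race against the idle timeout. The Lifecycle enum and run_idle_monitor are hypothetical names. The real monitor reads the active job count from the store rather than counting events.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical lifecycle events; the daemon only sends a bare notification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    JobSpawned,
    JobExited,
}

/// Blocks until no job has been active for `idle_timeout`, then returns the
/// number of events handled. The caller would delete the socket and exit.
pub fn run_idle_monitor(events: Receiver<Lifecycle>, idle_timeout: Duration) -> usize {
    let mut active: usize = 0;
    let mut handled: usize = 0;
    loop {
        let event = if active > 0 {
            // Jobs are running: park with no timeout until the next event.
            match events.recv() {
                Ok(event) => event,
                Err(_) => return handled, // all senders dropped
            }
        } else {
            // No jobs: race the next event against the idle timeout.
            match events.recv_timeout(idle_timeout) {
                Ok(event) => event,
                Err(RecvTimeoutError::Timeout) => return handled,
                Err(RecvTimeoutError::Disconnected) => return handled,
            }
        };
        handled += 1;
        match event {
            Lifecycle::JobSpawned => active += 1,
            Lifecycle::JobExited => active = active.saturating_sub(1),
        }
    }
}
```

In both branches the thread sleeps inside the channel primitive, which mirrors the no-polling property described above.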
Protocol
The CLI and daemon communicate over a Unix socket using NDJSON (Newline-Delimited JSON).
Request
{"id":"req-uuid","command":"Run","args":{"cmd":["sleep","300"],"name":"sleeper",...}}
One JSON object per line. The command field uses serde’s adjacently tagged enum representation (#[serde(tag = "command", content = "args")]), so all commands share the same outer envelope.
Response
{"id":"req-uuid","ok":true,"data":{"id":"abc123","state":"running",...}}
{"id":"req-uuid","ok":false,"error":"job not found"}
Every request gets exactly one response line. The response format is:
- req_id: matches the request’s id
- ok: boolean success indicator
- data: command-specific payload (only when ok is true)
- error: error message string (only when ok is false)
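A client-side round trip under this framing might look like the sketch below. It uses only the standard library and treats the JSON as an opaque string. The function name round_trip is hypothetical, and the real CLI presumably serializes typed requests with serde.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::os::unix::net::UnixStream;

/// Sends one NDJSON request and reads exactly one response line.
/// `request_json` must already be a serialized single-line JSON object.
pub fn round_trip(stream: &UnixStream, request_json: &str) -> io::Result<String> {
    // An embedded newline would split the request into two frames.
    if request_json.contains('\n') {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "request must be a single line",
        ));
    }

    // `&UnixStream` implements Write, so no mutable stream is needed.
    let mut writer = stream;
    writer.write_all(request_json.as_bytes())?;
    writer.write_all(b"\n")?;
    writer.flush()?;

    // One request yields one response line, so one read_line is enough.
    // Note: the BufReader may buffer past the newline; create it once and
    // reuse it if several requests share the connection.
    let mut reader = BufReader::new(stream);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    Ok(line.trim_end_matches('\n').to_string())
}
```

The caller is expected to compare the id in the response with the id it sent before trusting the payload.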
Commands
| Command | Request args | Response data |
|---|---|---|
| Run | RunArgs { cmd, name, workspace, readiness, restart, pty, max_runtime_ms, max_rss_mb, env, after, cwd, pty_cols, pty_rows } | JobRecord |
| RunGroup | { jobs: Vec<RunArgs> } | Vec<JobRecord> |
| Status | { id } | JobStatus |
| List | { workspace? } | Vec<JobRecord> |
| Kill | { id?, workspace? } | { killed: Vec<String> } |
| Tail | { id, lines, digest, level?, strip_ansi } | { lines: Vec<LogLine> } or LogDigest |
| Diff | { id, lines?, strip_ansi } | { cursor: u64, lines: Vec<LogLine> } |
| Wait | { id, timeout_ms } | WaitResult { ready: bool, elapsed_ms: u64 } |
| Send | { id, data } | { ok: true } |
| Stats | { id } | ResourceStats { cpu_pct, rss_mb, uptime_secs } |
| Attach | { id } | hijacks socket for raw byte streaming |
| Expect | { id, pattern, is_regex, timeout_ms } | { matched: bool, line_number, content } |
| ResizePty | { id, cols, rows } | { resized: true } |
Process lifecycle
Spawn (runner.rs)
- Idempotency check: if a named job already exists and is alive, return its record.
- Dependency wait: if after is set, poll the store until the named job reaches Ready, Exited, Crashed, or Killed (120s timeout).
- Spawn: tokio::process::Command with piped stdin/stdout/stderr, process group 0.
- Output capture: stdout and stderr are piped to async tasks that write to stdout.log with log rotation at 50MB.
- Stdin handle: stored in a global HashMap<String, ChildStdin> keyed by job ID.
- Ready check: if readiness is set, spawn a background task that polls every 200ms up to 60s.
- Max runtime: if max_runtime_ms is set, a tokio sleep task fires after the duration, killing the job if still alive.
- Memory limit: if max_rss_mb is set, a background task polls RSS every 1s and kills the job if exceeded.
- Lifecycle notify: LIFECYCLE_NOTIFY.notify_one() is called to wake the auto-shutdown monitor.
- Persist: write meta.json (full JobRecord) and status.json to disk.
Monitor
After spawn, the daemon spawns monitor_job() which:
- Awaits the child process exit.
- Determines exit type (SIGKILL → Crashed, non-zero exit → Crashed, zero exit → Exited).
- If restart = OnCrash and the process crashed, spawns a new instance after backoff_ms delay.
- Writes status.json.
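The exit classification can be sketched with the standard library alone. ExitKind, classify_exit, and should_restart are hypothetical names. The RestartPolicy::Never variant is an assumption, since only OnCrash is named above, and treating every signal termination (not just SIGKILL) as a crash is also an assumption.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

#[derive(Debug, PartialEq, Eq)]
pub enum ExitKind {
    Exited,
    Crashed,
}

/// Only `OnCrash` appears in the text; `Never` is a hypothetical placeholder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestartPolicy {
    Never,
    OnCrash,
}

/// Maps a process exit status to the job's terminal state.
pub fn classify_exit(status: ExitStatus) -> ExitKind {
    // Terminated by a signal such as SIGKILL: no exit code is available.
    if status.signal().is_some() {
        return ExitKind::Crashed;
    }
    match status.code() {
        Some(0) => ExitKind::Exited,
        _ => ExitKind::Crashed, // non-zero exit code
    }
}

/// A new instance is spawned only for crashed jobs under `OnCrash`.
pub fn should_restart(policy: RestartPolicy, kind: &ExitKind) -> bool {
    policy == RestartPolicy::OnCrash && *kind == ExitKind::Crashed
}
```

A job stopped through the kill flow ends up in Killed instead, so a real monitor would check for that before applying this mapping.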
Kill
- Send SIGTERM to the process group (killpg).
- Wait up to 5 seconds.
- If still alive, send SIGKILL to the process group.
- Transition state to Killed.
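The escalation logic can be sketched independently of the signal-sending mechanism. Here the signal delivery and the liveness probe are injected as closures, standing in for killpg and a PID check. kill_with_escalation and the Signal enum are hypothetical names, and the poll interval is an assumption.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signal {
    Term,
    Kill,
}

/// Sends Term, waits up to `grace` for the process group to exit, then
/// escalates to Kill. Returns the signals that were actually sent.
pub fn kill_with_escalation<S, A>(
    mut send: S,
    mut is_alive: A,
    grace: Duration,
    poll: Duration,
) -> Vec<Signal>
where
    S: FnMut(Signal),
    A: FnMut() -> bool,
{
    send(Signal::Term);
    let mut sent = vec![Signal::Term];

    // Grace period: return early as soon as the group is gone.
    let deadline = Instant::now() + grace;
    while Instant::now() < deadline {
        if !is_alive() {
            return sent;
        }
        std::thread::sleep(poll);
    }

    // Still alive after the grace period: force-kill.
    if is_alive() {
        send(Signal::Kill);
        sent.push(Signal::Kill);
    }
    sent
}
```

With the values from the steps above, grace would be 5 seconds. The caller then transitions the job to Killed.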
Readiness system
Four checker implementations, all implementing the ReadinessChecker async trait:
| Strategy | Checker | Mechanism |
|---|---|---|
| LogPattern("string") | LogPatternChecker | Reads the log file from last offset, searches for substring. Offset-tracked to avoid re-scanning. |
| TcpPort(3000) | TcpPortChecker | TcpStream::connect("127.0.0.1:3000") |
| HttpPoll("http://...") | HttpPollChecker | reqwest::GET, 500ms timeout, checks for 2xx |
| FileExists("/path") | FileExistsChecker | tokio::fs::metadata() |
The readiness_loop polls the checker every 200ms and stops as soon as one of the following holds:
- Checker returns true → transition to Ready, persist status.
- Job is no longer alive (killed/crashed) → stop polling.
- 60-second timeout elapses → stop polling (job stays in Running).
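The loop and one checker can be sketched synchronously with the standard library. This is a blocking analogue of the async trait described above, not the daemon's code. ReadinessCheck, TcpPortCheck, Outcome, and the 200ms connect timeout are all assumptions made for illustration.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Blocking stand-in for the async ReadinessChecker trait.
pub trait ReadinessCheck {
    fn check(&mut self) -> bool;
}

/// Ready once something accepts TCP connections on 127.0.0.1:<port>.
pub struct TcpPortCheck {
    pub port: u16,
}

impl ReadinessCheck for TcpPortCheck {
    fn check(&mut self) -> bool {
        let addr = SocketAddr::from(([127, 0, 0, 1], self.port));
        TcpStream::connect_timeout(&addr, Duration::from_millis(200)).is_ok()
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Ready,    // caller transitions the job to Ready and persists status
    JobGone,  // job was killed or crashed while waiting
    TimedOut, // job stays in Running
}

/// Polls `checker` every `interval` until one of the three exit conditions.
pub fn readiness_loop<C, A>(
    checker: &mut C,
    mut job_alive: A,
    interval: Duration,
    timeout: Duration,
) -> Outcome
where
    C: ReadinessCheck,
    A: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        if !job_alive() {
            return Outcome::JobGone;
        }
        if checker.check() {
            return Outcome::Ready;
        }
        if Instant::now() >= deadline {
            return Outcome::TimedOut;
        }
        std::thread::sleep(interval);
    }
}
```

With the documented values, interval would be 200ms and timeout 60s.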
Persistence
Jobs are persisted to disk under $XDG_DATA_DIR/bgrun/jobs/$ID/:
jobs/abc123/
├── meta.json # Full JobRecord (cmd, name, workspace, pid, state, readiness, restart, pty, max_runtime, max_rss_mb, env)
├── status.json # Current state, exit_code, ready_at, restart_count, cursor
└── stdout.log # Captured stdout/stderr (rotated at 50MB → stdout.log.1)
- meta.json is written once on spawn. Contains all configuration needed to re-create the job.
- status.json is updated on state transitions. Restored on daemon restart.
- stdout.log is append-only and is rotated to stdout.log.1 once it reaches 50MB. The daemon opens it with tokio::fs::OpenOptions (create(true).append(true)).
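Size-based rotation with a single backup file can be sketched as below, using blocking std::fs calls in place of tokio::fs. rotate_if_needed and open_for_append are hypothetical names. Reading 50MB as 50 * 1024 * 1024 bytes, and checking the size before each write, are assumptions.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Assumption: "50MB" is interpreted as 50 MiB.
pub const MAX_LOG_BYTES: u64 = 50 * 1024 * 1024;

/// Renames `log` to `<log>.1` once it has reached `max_bytes`.
/// Returns Ok(true) if a rotation happened.
pub fn rotate_if_needed(log: &Path, max_bytes: u64) -> io::Result<bool> {
    let len = match fs::metadata(log) {
        Ok(meta) => meta.len(),
        // Nothing written yet: nothing to rotate.
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    if len < max_bytes {
        return Ok(false);
    }
    // "stdout.log" -> "stdout.log.1"; rename replaces any previous backup.
    let mut rotated = log.as_os_str().to_owned();
    rotated.push(".1");
    fs::rename(log, &rotated)?;
    Ok(true)
}

/// Opens the log for appending, creating it if it does not exist.
pub fn open_for_append(log: &Path) -> io::Result<fs::File> {
    fs::OpenOptions::new().create(true).append(true).open(log)
}
```

Because only one backup is kept, at most roughly twice the limit is retained per job under this scheme.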
An audit log at $XDG_DATA_DIR/bgrun/audit.log records daemon startup timestamps in NDJSON format.
Resource monitoring
A global SYSINFO_SYSTEM static (once_cell::sync::Lazy<Arc<Mutex<System>>>) is initialized once in runner.rs. Both get_stats and the memory monitor share this single instance, avoiding per-call allocation:
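use sysinfo::{Pid, ProcessesToUpdate};

// Lock the shared System and refresh the process table before reading stats.
let mut sys = SYSINFO_SYSTEM.lock().unwrap();
sys.refresh_processes(ProcessesToUpdate::All);
// process() returns None if the PID has already exited.
if let Some(proc) = sys.process(Pid::from_u32(pid)) {
    // proc.cpu_usage(), proc.memory(), proc.run_time()
}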
Memory RSS guardrails
When --max-rss <MB> is passed to bgrun run, the daemon spawns monitor_memory_limit() — a tokio task that polls RSS every 1 second. If the process exceeds the limit, it is killed through the normal kill flow (SIGTERM → SIGKILL).
The max_rss_mb value is persisted in meta.json and restored on daemon restart, so memory limits keep applying to re-adopted jobs.
Tool schemas
bgrun schema <command> prints JSON Schema (draft-07) for any command’s argument struct using the schemars crate. The derive macros are on RunArgs, KillArgs, TailArgs, Command, ReadinessStrategy, and RestartPolicy in bgrun-proto.
This allows AI agents to discover expected input shapes at runtime without hardcoded tool definitions.
ID resolution
JobStore::resolve_id() accepts three formats:
- Full UUID: exact match against the job’s canonical ID.
- Job name: exact match against the name index (--name).
- Unique prefix: at least 4 characters matching exactly one job’s UUID.
This is called at the start of every daemon handler, so bgrun tail abc1, bgrun status my-server, and bgrun kill 55f3a all work transparently.
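The three lookup formats can be sketched over a plain slice. The Job struct, the ResolveError enum, and the precedence order (full ID, then name, then prefix) are assumptions. The real JobStore uses a name index rather than a linear scan.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    NotFound,
    Ambiguous,
}

/// Hypothetical minimal job entry.
pub struct Job {
    pub id: String,
    pub name: Option<String>,
}

/// Resolves a user-supplied query to a canonical job ID.
pub fn resolve_id<'a>(jobs: &'a [Job], query: &str) -> Result<&'a str, ResolveError> {
    // 1. Exact match on the full ID.
    if let Some(job) = jobs.iter().find(|j| j.id == query) {
        return Ok(&job.id);
    }
    // 2. Exact match on the job name.
    if let Some(job) = jobs.iter().find(|j| j.name.as_deref() == Some(query)) {
        return Ok(&job.id);
    }
    // 3. Prefix of at least 4 characters that matches exactly one ID.
    if query.chars().count() >= 4 {
        let mut matches = jobs.iter().filter(|j| j.id.starts_with(query));
        match (matches.next(), matches.next()) {
            (Some(job), None) => return Ok(&job.id),
            (Some(_), Some(_)) => return Err(ResolveError::Ambiguous),
            _ => {}
        }
    }
    Err(ResolveError::NotFound)
}
```

Reporting an ambiguous prefix separately from a missing job lets the CLI suggest typing more characters.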
Log tail implementation
tail_lines uses a two-pass approach to avoid loading the entire file into memory:
- Pass 1: scan forward, tracking newline byte positions in a ring buffer of N+1 entries.
- Pass 2: seek to the start offset of the Nth-from-last line, read only that tail portion.
diff_since seeks directly to the cursor offset from status.json, counting lines from start for correct line numbering.
This works for log files of any size without O(file_size) memory usage.
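A blocking sketch of the two-pass tail follows. It is a variant of the approach described above: it keeps the start offsets of the last N lines in a bounded deque rather than N+1 newline positions, and it uses std::fs instead of tokio. It also assumes the log is valid UTF-8.

```rust
use std::collections::VecDeque;
use std::fs::File;
use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom};
use std::path::Path;

/// Returns the last `n` lines of the file at `path`.
pub fn tail_lines(path: &Path, n: usize) -> io::Result<Vec<String>> {
    if n == 0 {
        return Ok(Vec::new());
    }
    let mut reader = BufReader::new(File::open(path)?);

    // Pass 1: scan forward, remembering only the start offsets of the
    // most recent `n` lines. Only one line is held in memory at a time.
    let mut starts: VecDeque<u64> = VecDeque::new();
    let mut offset: u64 = 0;
    let mut line = Vec::new();
    loop {
        line.clear();
        let read = reader.read_until(b'\n', &mut line)?;
        if read == 0 {
            break; // end of file
        }
        if starts.len() == n {
            starts.pop_front();
        }
        starts.push_back(offset);
        offset += read as u64;
    }

    // Empty file: nothing to return.
    let Some(&start) = starts.front() else {
        return Ok(Vec::new());
    };

    // Pass 2: seek to the first wanted line and read only the tail.
    reader.seek(SeekFrom::Start(start))?;
    let mut tail = String::new();
    reader.read_to_string(&mut tail)?;
    Ok(tail.lines().map(str::to_owned).collect())
}
```

Memory use is bounded by the deque of N offsets plus the size of the returned tail, regardless of the total file size.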