Tenax SOC On-Call Automation

1

When a Threat Fires

A piece of malware runs on a client machine at 2 am. Three minutes later, a phone is ringing. Here is exactly what happened in between.

What this app actually does

The Tenax SOC On-Call Automation is a custom alarm system for a cybersecurity team. It watches security tools, decides who needs to know, and makes sure the right person actually picks up.

🔍

Watch

Polls SentinelOne and the SIEM every 30 seconds for new threats. The SIEM is the live paging source today; SentinelOne is currently in Slack-only preview (it posts detections for visibility but opens no ticket and fires no page).

📣

Notify

Posts to Slack instantly. Creates a ticket in ConnectWise for tracking.

📞

Escalate

Calls the on-call analyst by phone if the alert isn't acknowledged in time. Then their backup. Then the manager.

Trace an alert from detection to phone call

Click "Next Step" to follow a real SentinelOne threat through the system.

🛡️

SentinelOne

⚙️

Worker

🗄️

PostgreSQL

💬

Slack

🎫

ConnectWise

📞

Twilio

The first filter: contained is not closed

Before deciding what to do with an S1 threat, the system asks two different questions. SentinelOne’s agent contains many threats automatically (it quarantines or kills the file) — but the agent acting is not the same as a human reviewing the case. We are a SOC: a contained threat we’ve never looked at could still have surrounding alerts, a root cause, or lateral spread.

sentinelone.py


def is_contained(threat: dict) -> bool:
    # The agent ACTED (automatic). Says nothing about review.
    m = threat["threatInfo"].get("mitigationStatus", "").lower()
    return m in ("mitigated", "blocked", "remediated")

def is_closed(threat: dict) -> bool:
    # A HUMAN / the case resolved it.
    ti = threat["threatInfo"]
    return (ti.get("incidentStatus", "").lower() == "resolved"
        or ti.get("analystVerdict", "").lower()
           in ("false_positive", "true_positive"))

Plain English

is_contained — did the S1 agent act on the file?

It quarantined, blocked, or remediated it. This is automatic…

…and tells us nothing about whether a person looked at it.

is_closed — did a human or the case actually resolve it?

Either the S1 incident itself was closed…

…or an analyst rendered a verdict (false or true positive).

Only this means "reviewed — nothing more to do."

💡

Why split them?

A real case: a signed malware dropper (pdfguruhub.msi) landed on a managed endpoint. The agent auto-quarantined it — contained — but analystVerdict was undefined and the incident was unresolved: not closed. The old single is_mitigated() check treated "agent quarantined" the same as "case closed" and silently dropped it, so nobody at the SOC ever saw it. The fix: drop only is_closed threats; surface is_contained-but-not-closed ones for triage. How loudly is the subject of Module 6.
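The split can be exercised on a threat shaped like the dropper case. This is a minimal sketch: the two predicates are copied from sentinelone.py above, while route() and the sample payloads are hypothetical illustrations. The field values are modelled on the description, not captured API output.

```python
def is_contained(threat: dict) -> bool:
    # The agent acted (automatic). Says nothing about review.
    m = threat["threatInfo"].get("mitigationStatus", "").lower()
    return m in ("mitigated", "blocked", "remediated")


def is_closed(threat: dict) -> bool:
    # A human / the case resolved it.
    ti = threat["threatInfo"]
    return (ti.get("incidentStatus", "").lower() == "resolved"
            or ti.get("analystVerdict", "").lower()
            in ("false_positive", "true_positive"))


def route(threat: dict) -> str:
    """Hypothetical helper: name the bucket a threat falls into."""
    if is_closed(threat):
        return "drop"      # a human already reviewed it
    if is_contained(threat):
        return "triage"    # agent acted, nobody has looked yet
    return "active"        # may still be running


# Assumed payload shape for the dropper: quarantined, but never reviewed.
dropper = {"threatInfo": {
    "mitigationStatus": "mitigated",
    "analystVerdict": "undefined",
    "incidentStatus": "unresolved",
}}
# The same threat after an analyst renders a verdict.
reviewed = {"threatInfo": {**dropper["threatInfo"],
                           "analystVerdict": "true_positive"}}
```

Under the old single check the dropper would have been indistinguishable from the reviewed case; here route(dropper) lands in the triage bucket and only the reviewed copy is dropped.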

Check your understanding

An analyst tells you the worker was restarted after an outage and immediately paged them for a 20-minute-old SentinelOne threat. They think this is a bug. Is it?

A SentinelOne threat gets automatically quarantined (contained) by S1 while the on-call analyst is still being paged — but no analyst has rendered a verdict yet. What happens?

SentinelOne sends the same threat alert twice in quick succession (this can happen when S1 re-reports). What does this system do?

2

Meet the Cast

Six services that run 24/7. Click any component to learn what it does and why it exists.

The system map — click to explore

🌐 Internet boundary (Caddy reverse proxy handles TLS)

📦 Docker containers

🚀 oncall-api

⚙️ worker

🗄️ postgres

⚡ redis

☁️ External services (called over HTTPS)

🛡️ SentinelOne

📞 Twilio

💬 Slack

🎫 ConnectWise

The heartbeat: what runs, and when

The worker runs a Celery beat scheduler. Think of it as a list of alarms that go off on a timer.

worker/tasks.py


celery_app.conf.beat_schedule = {
    "poll-sentinelone": {
        "task": "app.worker.tasks.poll_sentinelone",
        "schedule": float(settings.s1_poll_interval),
    },
    "check-auto-resolve": {
        "task": "app.worker.tasks.check_auto_resolve",
        "schedule": 30.0,
    },
    "check-deferred-escalations": {
        "task": "app.worker.tasks.check_deferred_escalations",
        "schedule": 30.0,
    },
}

Plain English

This dictionary is the worker's to-do list — tasks it runs automatically on a timer.

Every 30 seconds (configurable), ask SentinelOne: "Any new threats?"

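Every 30 seconds, re-check open S1 incidents: "Has a human closed this upstream, or has the agent contained it?" Closed threats are auto-resolved; contained-but-unreviewed ones move to monitored and stay open for triage.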

Every 30 seconds, check: "Are there any business-hours incidents that the SOC hasn't picked up in time?" If yes, start calling.

⏱️

Why 30 seconds?

30 seconds is the sweet spot between "fast enough to page before an analyst's screen locks" and "not so aggressive it hammers the S1 API." The poll interval is read from an environment variable, so you can tune it without a code change.
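One way such a tunable interval might be read is sketched below. The variable name S1_POLL_INTERVAL and the helper poll_interval_seconds() are assumptions for illustration; the source only shows the value arriving as settings.s1_poll_interval.

```python
import os

DEFAULT_POLL_SECONDS = 30.0


def poll_interval_seconds(env=os.environ) -> float:
    """Read the poll interval, falling back to 30 s on missing or bad input.

    S1_POLL_INTERVAL is a hypothetical variable name, not confirmed by the source.
    """
    raw = env.get("S1_POLL_INTERVAL", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_POLL_SECONDS
    # A zero or negative interval would make the scheduler spin; reject it.
    return value if value > 0 else DEFAULT_POLL_SECONDS
```

Falling back to the default instead of crashing keeps a typo in the environment from stopping threat polling altogether.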

The internal conversation: how components hand off work

Watch how the API and worker stay decoupled using Redis as a message broker.

🔑

Why Redis in the middle?

An escalation call takes 3–5 minutes. If that ran inside the API web process, no other alerts could be processed until the call finished. By handing the work to the worker via Redis, the API stays instant and the multi-minute call loop runs safely in the background.
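The hand-off pattern can be illustrated without Redis or Celery. In this sketch a standard-library queue stands in for the broker and a thread stands in for the worker process; all names are hypothetical and the sleep is a stand-in for the multi-minute call loop.

```python
import queue
import threading
import time

jobs = queue.Queue()   # stand-in for the Redis broker
handled = []           # incident ids the worker has finished with


def worker_loop() -> None:
    """Background consumer: plays the role of the Celery worker."""
    while True:
        incident_id = jobs.get()
        if incident_id is None:        # sentinel: shut down
            break
        time.sleep(0.05)               # stand-in for a slow call loop
        handled.append(incident_id)


def receive_alert(incident_id: str) -> str:
    """API side: enqueue and return at once, never wait for the call."""
    jobs.put(incident_id)
    return "accepted"


worker = threading.Thread(target=worker_loop, daemon=True)
worker.start()
replies = [receive_alert(f"inc-{n}") for n in range(3)]   # returns immediately
jobs.put(None)
worker.join()
```

All three alerts are accepted before the first slow job finishes, which is the property the API needs.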

Where to find what

Every file has one job. This makes it easy to know exactly where to look when something breaks — or when you want to add a feature.

app/

main.py - FastAPI app startup, router wiring

config.py - All environment variables in one place

services/

escalation.py - Core logic: create_incident, routing, ack, resolve

schedule.py - Who is on-call right now? Reads schedule.yaml

twilio_voice.py - Phone calls + IVR scripts (what gets spoken aloud)

sentinelone.py - S1 API polling and threat parsing

slack.py - All Slack notification message builders

connectwise.py - CW ticket create/update/close

worker/

tasks.py - Celery beat schedule + all background task bodies

schedule.yaml - Team roster, rotation blocks, holidays

Check your understanding

A client calls to report they can't reach anyone on-call. You want to trace why the escalation chain didn't run. Which process would you check logs for first?

You want to add a new analyst to the escalation chain. Which file needs to change?

3

Should We Page?

Not every alert wakes someone up. The system runs a triage checklist before dialing a single number — and the logic is more interesting than you’d expect.

The triage checklist, in order

Every incoming alert passes through these gates. The first gate that says "stop" wins.

1

Duplicate?

Same source_id already open? Suppress silently. No double-paging.

2

Bundleable?

SentinelOne alert on the same machine within 10 minutes of an open parent? Attach it as a child. One page for the storm, not fifty.

3

High severity?

A SIEM alert pages only at or above the level threshold (default 15, labeled critical). The band just below it (12–14, labeled high) opens a CW ticket on the SOC board but never rings a phone. Direct calls always page. For SentinelOne it’s state-aware: an active threat (or contained ransomware) pages; a threat the agent already contained opens a SOC triage ticket only — see Module 6.

4

After hours?

Business hours (8 am–5 pm CT, Mon–Fri, non-holiday)? The SOC gets a 15-minute window to pick up the ticket before paging starts. After hours, page immediately.

5

Override active?

Admin override can force ALWAYS mode (page immediately, whatever the clock says) or NEVER mode (always apply the business-hours grace window, even at 2am). AUTO (default) uses normal time-of-day logic.
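The severity, time-of-day, and override gates for a SIEM alert can be sketched as one pure function. This is an illustration under stated assumptions, not the production routing: the function names are hypothetical, the timestamp is assumed to be a naive datetime already in Central Time, and direct calls are assumed to count as high severity because the text says they always page.

```python
from datetime import date, datetime

PAGE_LEVEL = 15   # default SIEM paging threshold given in the text


def is_business_hours(now_ct: datetime, holidays: frozenset = frozenset()) -> bool:
    """True for 08:00-16:59 Mon-Fri on a non-holiday (now_ct is Central Time)."""
    if now_ct.weekday() >= 5 or now_ct.date() in holidays:
        return False
    return 8 <= now_ct.hour < 17


def paging_decision(level: int, now_ct: datetime, mode: str = "AUTO",
                    direct: bool = False,
                    holidays: frozenset = frozenset()) -> str:
    """Return 'no_page', 'page_now', or 'defer' (grace window first)."""
    high = direct or level >= PAGE_LEVEL   # assumption: direct calls are always high
    if not high:
        return "no_page"                   # ticket and Slack only
    if mode == "ALWAYS":
        after_hours = True                 # treat every incident as after hours
    elif mode == "NEVER":
        after_hours = False                # treat everything as business hours
    else:
        after_hours = not is_business_hours(now_ct, holidays)
    return "page_now" if (after_hours or direct) else "defer"


tuesday_10am = datetime(2024, 1, 9, 10, 0)
tuesday_2am = datetime(2024, 1, 9, 2, 0)
```

Note that ALWAYS changes when a high-severity alert pages, not whether a lower-severity alert pages at all: the severity gate runs first.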

The exact routing decision in code

This is the moment the system decides: page now, defer, or skip entirely.

services/escalation.py


    high = await _is_high_severity(incident)
    direct = source in ("voice", "callback")

    if not high:
        # Low severity: SOC ticket only, never page
        db.add(IncidentEvent(event_type="page_suppressed" ...))
        return incident

    if is_after or direct:
        run_escalation.delay(str(incident.id), title)
    else:
        delay = await get_escalation_delay_minutes()
        # SOC has {delay} min to pick up before paging

Plain English

Ask: "Is this alert severe enough to page someone?" (SIEM uses a level threshold; for S1, an active threat or contained ransomware is — contained malware is a ticket only.)

Ask: "Did a human call in directly?" Voice calls always escalate immediately, any hour.

If the answer to "high severity?" is NO:

Write a note saying the page was intentionally suppressed (so the deferred-page checker doesn’t pick it up later).

Done — open a ticket, post to Slack, but never call anyone.

If it IS high severity AND (after hours OR a direct call):

Fire the escalation chain right now as a background task.

Otherwise (business hours, SOC is likely at their desks):

Read the configured delay (default 15 min) from the database.

The deferred-escalation checker will pick this up after the grace window expires.

The storm problem: 50 alerts, one page

Imagine a pen tester runs a script on a client machine. SentinelOne fires 50 threat alerts in 10 minutes — one per executable. Without bundling, that’s 50 parallel phone calls to the analyst.

⚡

The Problem

A single attack event can produce dozens of individual S1 alerts — one per malicious file, per process, per connection. Each one would normally create its own incident and page chain.

🪄

The Solution

If a new S1 alert arrives for the same client machine within 10 minutes of an open parent incident, it becomes a "child" of that parent. Only the parent escalates. Children just attach silently.

📊

The Throttle

Even Slack notifications are throttled: first 3 siblings post verbatim, then only milestone counts (10, 25, 50, 100…). So 100 siblings produce ~7 messages, not 100.
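The arithmetic behind "about 7 messages" can be checked with a small sketch. The function name and the 1-based sibling numbering are assumptions; the milestone list stops at 100 because the text elides what follows.

```python
VERBATIM_LIMIT = 3
MILESTONES = (10, 25, 50, 100)   # the text elides the milestones after 100


def should_post(sibling_number: int) -> bool:
    """sibling_number is 1-based: 1 for the first child attached to a parent."""
    return sibling_number <= VERBATIM_LIMIT or sibling_number in MILESTONES


# Which of the first 100 siblings would produce a Slack message?
posted = [n for n in range(1, 101) if should_post(n)]
```

Three verbatim posts plus four milestone posts gives seven messages for a 100-alert storm.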

🧠

Severity-monotonic bundling for SIEM

The SIEM bundler has an extra rule: a new alert only bundles into a parent if the parent's severity level is equal or higher. A noisy low-level parent can’t "swallow" and silence a more serious alert on the same machine — that one opens its own incident and pages.
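A sketch of the bundling test with the severity-monotonic rule follows. The dictionary keys and the function name are hypothetical, and it assumes the SIEM bundler shares the same-machine, 10-minute window described for SentinelOne, measured from when the parent was opened.

```python
from datetime import datetime, timedelta

BUNDLE_WINDOW = timedelta(minutes=10)


def can_bundle(parent: dict, alert: dict) -> bool:
    """True if the new alert may attach to the parent as a silent child."""
    same_host = parent["host"] == alert["host"]
    age = alert["seen_at"] - parent["opened_at"]
    in_window = timedelta(0) <= age <= BUNDLE_WINDOW
    # Severity-monotonic rule: never fold a more serious alert
    # into a quieter parent.
    not_more_severe = alert["level"] <= parent["level"]
    return parent["open"] and same_host and in_window and not_more_severe


parent = {"host": "ACCOUNTING-PC-01", "level": 6, "open": True,
          "opened_at": datetime(2024, 1, 9, 10, 0)}
louder = {"host": "ACCOUNTING-PC-01", "level": 12,
          "seen_at": datetime(2024, 1, 9, 10, 5)}
quieter = {"host": "ACCOUNTING-PC-01", "level": 5,
           "seen_at": datetime(2024, 1, 9, 10, 5)}
```

The level 12 alert is refused by the level 6 parent, so it opens its own incident and goes through the normal severity gates.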

Three paths through the triage desk

🚨

New Alert

💬

Slack

🎫

CW Ticket

⏱️

15 min timer

📞

Phone call

Click "Next Step" to trace a business-hours alert

Check your understanding

During business hours, a high-severity SIEM alert fires. An analyst opens the ConnectWise ticket and moves it to "In Progress" 8 minutes later. Does the phone ring?

An IP Pathways support rep calls the Tenax SOC hotline at 10am on a Tuesday. Does the on-call analyst get paged immediately or after the 15-minute grace window?

Machine 'ACCOUNTING-PC-01' already has an open SIEM incident (severity level 6). A new SIEM alert fires for the same machine at severity level 12. What happens?

4

The Phone Tree

Who gets called, in what order, and what happens when they don’t answer. The escalation chain is smarter than a simple list of numbers.

How the chain gets built from a schedule file

The chain isn’t hardcoded. It’s derived every time from schedule.yaml using a set of role-priority rules.

services/schedule.py


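def _derive_chain(primary, config):
    chain = []

    def push(member):
        # Dedup guard: each person appears in the chain at most once.
        if member not in chain:
            chain.append(member)

    push(primary)

    # Analysts in YAML order; push() skips the primary if they are one.
    analysts = [m for m in config["team"]
                if m["role"] == "analyst"]

    if primary["role"] == "analyst":
        for a in analysts: push(a)
    elif primary["role"] == "manager":
        for a in analysts: push(a)

    # Management tail: manager, then VP, then president.
    for role in ["manager", "vp", "president"]:
        for m in config["team"]:
            if m["role"] == role: push(m)

    return chain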

Plain English

Build the chain starting with whoever is the primary this week.

Collect all people with the analyst role from the team roster.

If the primary is an analyst: add all other analysts next (peer backup).

If the primary is a manager: fill secondary/tertiary with analysts first.

Always append the management tail in order: manager → VP → president.

Walk every team member at each role level in YAML order.

The dedup guard (push) ensures no one appears twice.

🔄

Three override layers on top of YAML

Before the derived chain is used, three overrides are applied in order: (1) a holiday on_call field for today’s exact date, (2) a per-week primary override set in the admin UI, and (3) a full admin chain-reorder that replaces the entire derived chain. Later layers supersede earlier ones, so when more than one applies, the last wins.

The typical five-tier chain

Each level gets two call attempts before the chain advances. No one in the chain is passed over: every level must either acknowledge or exhaust both rings.

1️⃣

Primary analyst

This week’s on-call person from schedule.yaml. They get two calls: the first ring, then a second about a minute later (gap configurable, default 60s) to punch through iPhone DND mode.

2️⃣

Secondary analyst

The other analyst on the team. Same two-attempt pattern. If the team only has one analyst, this level is skipped automatically.

3️⃣

Manager

If both analysts are unreachable, the manager gets the call. They are always in the chain tail regardless of who the primary is.

4️⃣

VP

Next in the org chart. Reached only if the manager also doesn’t pick up both calls. Rare in practice.

5️⃣

President

Last resort. If the chain is exhausted without an ack, a self-alert fires to notify the system owner that no one answered.
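The two-attempts-per-level walk can be sketched as a loop with the phone call injected as a function, so it runs without Twilio. The names run_chain, place_call, and wait are hypothetical; the two attempts and the 60-second default gap come from the text.

```python
from typing import Callable, Optional, Sequence, Tuple


def run_chain(chain: Sequence[str],
              place_call: Callable[[str], bool],
              wait: Callable[[float], None],
              attempts: int = 2,
              gap_seconds: float = 60.0) -> Tuple[Optional[str], int]:
    """Return (who acknowledged, or None) and the total calls placed."""
    calls = 0
    for person in chain:
        for attempt in range(attempts):
            if attempt > 0:
                wait(gap_seconds)      # gap between first and second ring
            calls += 1
            if place_call(person):     # True means they pressed 1
                return person, calls
    return None, calls                 # exhausted: caller would raise a self-alert


answers = {"manager"}                  # only the manager picks up
acked_by, total = run_chain(
    ["analyst_a", "analyst_b", "manager", "vp", "president"],
    place_call=lambda person: person in answers,
    wait=lambda seconds: None,         # skip real waiting in the example
)
```

With two analysts who never answer, four calls are placed before the manager's first ring, and the manager's pickup is the fifth call overall.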

Inside a single escalation level: two attempts, then advance

Watch what happens when the system calls the primary analyst and they don’t pick up.

What the phone actually says when someone answers

Twilio reads a scripted announcement aloud via text-to-speech. The script is built at call time from the incident data.

services/twilio_voice.py


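def _announcement_text(
    source, client, severity, title):
    src = _pretty_source(source)
    # Assumed derivation: "critical" -> "critical severity"; empty if unset.
    severity_phrase = f"{severity} severity" if severity else ""
    parts = ["This is the Tenax SOC."]

    intro_bits = [src]
    if severity_phrase:
        intro_bits.append(severity_phrase)
    intro_bits.append("alert")
    intro = " ".join(intro_bits)

    if client:
        intro += f" for {client}"
    parts.append(intro + ".")
    if title:
        parts.append(f"{title}.")
    parts.append("Press 1 to acknowledge.")
    # Join the sentences into one script for text-to-speech.
    return " ".join(parts)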

Plain English — sample output

Build a spoken script from four fields: source, severity, client, title.

Convert the raw source name: "sentinelone" → "SentinelOne".

Always start: "This is the Tenax SOC."

Build the intro clause left to right:

Start with the source tool name.

Add the severity word if present ("critical severity", "high severity").

Close the clause with "alert".

Append "for [client]" if the alert belongs to a specific client.

Example so far: "SentinelOne critical severity alert for Acme Corp."

Add the alert title: "Malicious activity detected — ACCOUNTING-PC-01."

End with the IVR prompt: "Press 1 to acknowledge."

📞

How the DND bypass actually works

The bypass comes from placing a second call to the same person about a minute after the first goes unanswered (the gap is configurable, default 60s). iOS’s Repeated Calls feature lets a second call from the same number within three minutes break through Do Not Disturb / Focus. The two Gather blocks you see in the TwiML are just re-prompts within a single call — the full announcement, then a shorter repeat — not the bypass mechanism itself.
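The two Gather re-prompts inside a single call might look like the TwiML built below. This is a sketch using only the standard library rather than the Twilio SDK; build_twiml and the /voice/ack action URL are hypothetical, while Response, Gather, Say, and the numDigits, action, and method attributes are standard TwiML.

```python
import xml.etree.ElementTree as ET


def build_twiml(full_text: str, short_text: str, action_url: str) -> str:
    """Build a Response with two Gather prompts: full, then a shorter repeat."""
    response = ET.Element("Response")
    for text in (full_text, short_text):
        gather = ET.SubElement(
            response, "Gather",
            {"numDigits": "1", "action": action_url, "method": "POST"},
        )
        ET.SubElement(gather, "Say").text = text
    return ET.tostring(response, encoding="unicode")


twiml = build_twiml(
    "This is the Tenax SOC. SentinelOne critical severity alert for Acme Corp. "
    "Press 1 to acknowledge.",
    "Press 1 to acknowledge.",
    "/voice/ack",   # hypothetical webhook path
)
```

Both prompts play within one answered call. The Do Not Disturb bypass still depends on the separate second call placed after the first goes unanswered.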

Check your understanding

The manager is this week's primary on-call. When _derive_chain() builds the chain, where does the manager appear?

An after-hours incident fires and neither analyst answers any calls. How many total Twilio calls are placed before the manager gets their first ring?

An incident is queued for escalation at 2:05am. At 2:06am an admin applies a full-chain override. The worker picks up the task at 2:07am. Which chain does the worker use?

5

Stand Down

How an incident ends — whether by human acknowledgment, automatic mitigation, or the system monitoring itself to know it’s still working.

Six states an incident can be in

Every incident row in the database moves through a defined set of statuses. The transitions determine whether paging continues, pauses, or stops entirely.

new

Waiting

Incident was created. Either waiting for the deferred-escalation timer (business hours) or about to page immediately (after hours).

paging

Calling

The escalation chain is actively running. Twilio is placing calls. The chain advances person-by-person until someone presses 1 or the chain is exhausted.

acked

Acknowledged

A human pressed 1 on their phone, or the SOC moved the CW ticket to “In Progress”. Paging stops. The CW ticket is promoted to the SOC board if it was staged on Tier 1.

monitored

Contained, awaiting triage

S1’s agent contained the threat but no human has reviewed it (is_contained, not is_closed). The paging chain stops mid-ring, but the incident stays open as a triage item until an analyst renders a verdict. Contained ≠ closed.

resolved

Auto-resolved

The threat is now closed upstream — an analyst rendered a verdict, or the S1 case was resolved (is_closed) — and check_auto_resolve caught it. The paging chain stops mid-ring if active. Slack gets a stand-down message.

closed

Closed

Manually closed from the admin UI, or the CW ticket was closed and the system synced the status. Final state — no further actions fire.
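The six statuses can be modelled as an enumeration with a helper for the one question the escalation chain cares about. This is a sketch: the class and helper names are hypothetical, and the grouping is read off the descriptions above instead of taken from the real schema.

```python
from enum import Enum


class IncidentStatus(str, Enum):
    NEW = "new"
    PAGING = "paging"
    ACKED = "acked"
    MONITORED = "monitored"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Statuses in which no further calls are placed, per the descriptions above.
PAGING_STOPPED = frozenset({
    IncidentStatus.ACKED,
    IncidentStatus.MONITORED,
    IncidentStatus.RESOLVED,
    IncidentStatus.CLOSED,
})


def halts_paging(status: IncidentStatus) -> bool:
    """True once the chain should place no more calls."""
    return status in PAGING_STOPPED


def awaits_triage(status: IncidentStatus) -> bool:
    """True for contained-but-unreviewed incidents that stay open."""
    return status is IncidentStatus.MONITORED
```

Keeping monitored distinct from resolved is what lets an incident stop ringing while remaining on the triage list.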

The automatic stand-down: how S1 mitigation stops the page

Every 30 seconds, the worker checks open SentinelOne incidents and asks: “Did S1 handle this on its own?”

worker/tasks.py + services/sentinelone.py


# Beat schedule, every 30s
"check-auto-resolve": {
    "task": "...check_auto_resolve",
    "schedule": 30.0,
},

# For each open S1 incident:
if is_closed(threat_data):        # human/case resolved
    await resolve_incident(incident.id)   # → resolved
elif is_contained(threat_data):    # agent acted, no review
    await mark_monitored(incident.id)    # → monitored (open)

Plain English

The Celery beat scheduler fires this task every 30 seconds, around the clock.

For every open SentinelOne incident, re-query the threat’s current state, then:

If a human/case closed it (verdict or resolved) → mark the incident resolved and post a Slack stand-down.

If the agent only contained it (no review yet) → move it to "monitored": stop the page, but keep it open…

…so a contained-but-untriaged threat doesn’t vanish unreviewed. It waits in the triage queue.

⏱️

The chain polls for terminal status between calls

While waiting for each ring to complete, wait_for_terminal_status() polls the database every 3 seconds. resolved and monitored are both chain-halting states: if check_auto_resolve lands either one mid-ring, the escalation chain sees it on the next poll and stops cleanly — no further calls are placed. (A threat that self-contains mid-page stops ringing but stays open for triage.)
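A polling loop of this kind can be sketched with the status lookup and the sleep injected, so it can be exercised without a database. Only the function name and the 3-second interval come from the text; the signature, the return convention, and the inclusion of acked and closed in the halting set are assumptions.

```python
import time
from typing import Callable, Optional

# resolved and monitored are named in the text; acked and closed are assumed.
HALTING = frozenset({"resolved", "monitored", "acked", "closed"})


def wait_for_terminal_status(get_status: Callable[[], str],
                             ring_seconds: float,
                             poll_seconds: float = 3.0,
                             sleep: Callable[[float], None] = time.sleep
                             ) -> Optional[str]:
    """Poll until a halting status appears or the ring window elapses.

    Returns the halting status, or None if the ring timed out unanswered.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in HALTING:
            return status
        if waited >= ring_seconds:
            return None
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated database reads: the threat self-contains on the third poll.
reads = iter(["paging", "paging", "monitored"])
naps = []
outcome = wait_for_terminal_status(lambda: next(reads), ring_seconds=30,
                                   sleep=naps.append)
```

A None result tells the chain to move on to the next attempt; any halting status tells it to stop placing calls.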

Four safety systems you probably won’t need, until you do

🕹️

Admin override: force ALWAYS / NEVER page

The admin UI has a three-position switch: AUTO (default — use time-of-day logic), ALWAYS (treat every incident as after-hours and page immediately regardless of clock), and NEVER (treat everything as business hours — SOC gets the grace window even at 2am). Useful during planned downtime or drills.

🫀

Worker heartbeat: is the system itself alive?

Every 15 seconds, the worker writes a timestamp to Redis (worker_heartbeat). Every 60 seconds, the monitoring_watchdog task checks that timestamp. If the heartbeat is stale, a self-alert fires — paging the system owner that the worker has stopped executing tasks.

🔭

Poller staleness: is threat detection still running?

The watchdog also checks when each scheduled task (S1 poll, SIEM poll, auto-resolve, deferred-escalation checker) last completed successfully. If any of them goes silent past its threshold, a self-alert fires. The system watches itself so it can’t silently fail.

🚨

Self-alert: what fires when the system detects its own failure

A self-alert is an internal incident that pages the system owner’s personal phone and posts to a separate Slack channel. It fires for: heartbeat stale, chain exhausted with no ack, Twilio call failures, and stuck-escalation backstop trips. These are never shown to clients — they’re operational health signals.
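The heartbeat and poller-staleness checks reduce to comparing last-seen timestamps against per-signal thresholds. In this sketch a plain dictionary stands in for Redis, and the threshold values and task key names are assumptions; the text gives the write and check cadence but not the staleness limits.

```python
def stale_signals(last_seen: dict, thresholds: dict, now: float) -> list:
    """Return the names whose last timestamp is missing or older than its limit."""
    stale = []
    for name, limit in thresholds.items():
        seen = last_seen.get(name)
        if seen is None or now - seen > limit:
            stale.append(name)
    return sorted(stale)


# Assumed limits in seconds; not taken from the source.
thresholds = {"worker_heartbeat": 60.0,
              "poll_sentinelone": 120.0,
              "poll_siem": 120.0}
# Stand-in for the timestamps the worker would write to Redis.
last_seen = {"worker_heartbeat": 990.0, "poll_sentinelone": 700.0}
alerts = stale_signals(last_seen, thresholds, now=1000.0)
```

Treating a missing timestamp as stale means a task that has never run is reported rather than silently ignored.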

Full lifecycle: alert to stand-down

🚨

S1 Threat

💬

Slack + CW

📞

Phone call

🛡️

S1 mitigates

👁️

Monitored

Click “Next Step” to watch an incident self-close

Check your understanding

A high-severity S1 threat fires after hours. The analyst is being paged. SentinelOne auto-quarantines the threat 20 seconds into the first ring (no analyst verdict yet). What happens next?

An admin sets the override to ALWAYS. A medium-severity SIEM alert fires at 10am on a Tuesday. What happens?

The Celery worker crashes and stops executing tasks. How does the system detect this?

6

SentinelOne Triage

SentinelOne’s agent contains most threats on its own. The hard question isn’t “did we stop it?” — it’s “has a human actually looked?” This module is how we cut the noise without going blind.

The principle: containment is not review

S1 in protect mode auto-quarantines malware the moment it’s written to disk. That’s containment — the file can’t run. But as a SOC for other companies, our job doesn’t end at “the file is quarantined.” We still owe the customer the investigation: were there surrounding alerts? How did it get there? Did anything spread before the agent caught it?

🧭

The one question that routes everything

“Can this wait until the SOC is staffed?” A threat that may still be executing can’t wait → page. A threat the agent already contained almost always can → triage ticket, worked the next business day — unless it’s a high-blast-radius class (ransomware), where the wait-cost is too high. A threat a human already reviewed → nothing.

The routing matrix

Every in-scope S1 detection lands in exactly one of these buckets. The bucket decides the loudness.

page

Active threat — not contained, not closed

The threat may still be running. Pages immediately after hours; during business hours the SOC grace window applies first. Unchanged from before.

page

Contained ransomware / critical

Even contained, ransomware signals campaign & blast-radius risk — one node contained doesn’t mean the others are. Worth a call.

ticket

Contained malware — malicious, untriaged

The bulk of the queue. SOC ticket only, never pages (a Slack FYI while S1 is in preview). Worked next business day. This is your triage list.

drop

Suspicious / PUA / adware

Low confidence, high false-positive rate. Suppressed — not worth a ticket each. (Revisit the threshold if something slips through.)

drop

Already reviewed — analyst verdict or case resolved

is_closed. A human already judged it (even a false positive) — nothing more to surface.

🎚️

The threshold that keeps it quiet

Contained threats are surfaced only when contained_needs_triage passes: confidence = malicious and the classification isn’t PUA/adware/blank. That’s what keeps routine auto-quarantine noise out of the queue — measured at ~8 real malicious-malware detections per week across all managed customers, not hundreds. The whole feature is behind the s1_surface_contained toggle (default off).
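The matrix and the threshold can be combined into one routing sketch. Only contained_needs_triage and s1_surface_contained are names from the text; route_s1, the parameter names, and the order of the checks are assumptions. In particular, it assumes the toggle gates both contained buckets and that contained ransomware must also pass the threshold.

```python
NOISE_CLASSES = {"", "pua", "adware"}


def contained_needs_triage(confidence: str, classification: str) -> bool:
    """Malicious confidence and a classification that is not PUA, adware, or blank."""
    return (confidence.lower() == "malicious"
            and classification.strip().lower() not in NOISE_CLASSES)


def route_s1(*, contained: bool, closed: bool, confidence: str,
             classification: str, s1_surface_contained: bool = False) -> str:
    """Return 'page', 'ticket', or 'drop' for one detection."""
    if closed:
        return "drop"                  # a human already judged it
    if not contained:
        return "page"                  # active threat, may still be running
    if not s1_surface_contained:
        return "drop"                  # feature off: contained threats stay quiet
    if not contained_needs_triage(confidence, classification):
        return "drop"                  # suspicious / PUA / adware noise
    if classification.strip().lower() == "ransomware":
        return "page"                  # high blast radius, worth a call
    return "ticket"                    # contained malware, worked next business day
```

Checking is_closed first means a reviewed threat never re-enters the queue, whatever its classification.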

You picked up a contained-threat ticket. Now what?

These tickets are about investigation, not firefighting — the file is already quarantined. Work them in order.

1

Confirm the containment

In the S1 console, verify the agent quarantined/killed it and the endpoint is healthy. Is the machine network-connected or isolated? Did mitigation fully succeed, or partially?

2

Look for surrounding alerts

Pivot on the host, site, and user around the detection time — in S1 and the SIEM. One quarantined dropper can be the visible tip of a malvertising/phishing campaign hitting other users.

3

Find the root cause

How did it land? Download, email, USB, lateral movement? Note the source and the storyline — that’s what the customer actually needs from us.

4

Render a verdict in S1

Mark the threat true positive or false positive (or resolve the case). That sets is_closed — check_auto_resolve then stands our incident down automatically on its next poll. Closing the loop is the point.

5

Escalate if it’s bigger than one file

Evidence of spread, multiple hosts, credential theft, or a live foothold? Treat it as an active incident — loop in the customer and follow the standard IR path, don’t just close the ticket.

Worked example: the dropper that almost went unseen

This is the real case that drove the whole feature.

🕵️

pdfguruhub.msi — contained, but nobody knew

A user at a managed customer downloaded a signed malicious MSI dropper. S1 in protect mode auto-quarantined it on write — mitigationStatus: mitigated. But analystVerdict was undefined and incidentStatus was unresolved: contained, not closed.

Under the old single is_mitigated() check, that threat was silently dropped at ingest — no Slack post, no ticket, no record. A real malicious dropper on a customer endpoint, and the SOC never saw it. With the contained-vs-closed split, that same detection now surfaces as a contained-malware triage ticket: no 2am page (it was contained), but it’s in the queue, guaranteed to be worked — surrounding alerts checked, root cause found, verdict rendered. Mitigation bought time; investigation is still the job.

Check your understanding

S1 auto-quarantines a malicious Trojan on a managed endpoint. No analyst has touched it. With s1_surface_contained on, what does the system do?

Same situation, but the contained, malicious detection is classified Ransomware. What changes?

You finish triaging a contained-malware ticket and mark the threat a true positive in the S1 console. What happens to our on-call incident?