When a Threat Fires
A piece of malware runs on a client machine at 2 am. Three minutes later, a phone is ringing. Here is exactly what happened in between.
What this app actually does
The Tenax SOC On-Call Automation is a custom alarm system for a cybersecurity team. It watches security tools, decides who needs to know, and makes sure the right person actually picks up.
Polls SentinelOne and the SIEM every 30 seconds for new threats. The SIEM is the live paging source today; SentinelOne is currently in Slack-only preview (it posts detections for visibility but opens no ticket and fires no page).
Posts to Slack instantly. Creates a ticket in ConnectWise for tracking.
Calls the on-call analyst by phone if the alert isn't acknowledged in time. Then their backup. Then the manager.
Trace an alert from detection to phone call
The first filter: contained is not closed
Before deciding what to do with an S1 threat, the system asks two different questions. SentinelOne’s agent contains many threats automatically (it quarantines or kills the file) — but the agent acting is not the same as a human reviewing the case. We are a SOC: a contained threat we’ve never looked at could still have surrounding alerts, a root cause, or lateral spread.
def is_contained(threat: dict) -> bool:
    # The agent ACTED (automatic). Says nothing about review.
    m = threat["threatInfo"].get("mitigationStatus", "").lower()
    return m in ("mitigated", "blocked", "remediated")

def is_closed(threat: dict) -> bool:
    # A HUMAN / the case resolved it.
    ti = threat["threatInfo"]
    return (ti.get("incidentStatus", "").lower() == "resolved"
            or ti.get("analystVerdict", "").lower()
            in ("false_positive", "true_positive"))
A real case: a signed malware dropper (pdfguruhub.msi) landed on a managed endpoint.
The agent auto-quarantined it — contained — but analystVerdict was
undefined and the incident was unresolved: not closed. The old single
is_mitigated() check treated "agent quarantined" the same as "case closed" and silently
dropped it, so nobody at the SOC ever saw it. The fix: drop only is_closed threats; surface
is_contained-but-not-closed ones for triage. How loudly is the subject of Module 6.
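The fix described above can be sketched as a three-way ingest decision. This is a minimal illustration rather than the production code: ingest_action and its return labels are hypothetical names, and the two predicates are repeated (with a small guard against null fields) so the block runs by itself.

```python
def is_contained(threat: dict) -> bool:
    """True when the S1 agent acted on its own (says nothing about review)."""
    status = (threat["threatInfo"].get("mitigationStatus") or "").lower()
    return status in ("mitigated", "blocked", "remediated")


def is_closed(threat: dict) -> bool:
    """True when a human rendered a verdict or the case was resolved."""
    info = threat["threatInfo"]
    resolved = (info.get("incidentStatus") or "").lower() == "resolved"
    verdict = (info.get("analystVerdict") or "").lower()
    return resolved or verdict in ("false_positive", "true_positive")


def ingest_action(threat: dict) -> str:
    """Hypothetical helper: decide what ingest does with one threat."""
    if is_closed(threat):
        return "drop"      # already reviewed, nothing to surface
    if is_contained(threat):
        return "triage"    # agent acted, but nobody has looked yet
    return "active"        # may still be running


# Field values taken from the dropper case described above.
dropper = {"threatInfo": {"mitigationStatus": "mitigated",
                          "analystVerdict": "undefined",
                          "incidentStatus": "unresolved"}}
print(ingest_action(dropper))  # triage
```

The order of the checks matters: closed is tested first, so a threat that is both contained and reviewed is dropped rather than re-surfaced.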
Check your understanding
An analyst tells you the worker was restarted after an outage and immediately paged them for a 20-minute-old SentinelOne threat. They think this is a bug. Is it?
A SentinelOne threat gets automatically quarantined (contained) by S1 while the on-call analyst is still being paged, but no analyst has rendered a verdict yet. What happens?
SentinelOne sends the same threat alert twice in quick succession (this can happen when S1 re-reports). What does this system do?
Meet the Cast
Six services that run 24/7.
The system map
The heartbeat: what runs, and when
The worker runs a Celery beat scheduler. Think of it as a list of alarms that go off on a timer.
celery_app.conf.beat_schedule = {
    "poll-sentinelone": {
        "task": "app.worker.tasks.poll_sentinelone",
        "schedule": float(settings.s1_poll_interval),
    },
    "check-auto-resolve": {
        "task": "app.worker.tasks.check_auto_resolve",
        "schedule": 30.0,
    },
    "check-deferred-escalations": {
        "task": "app.worker.tasks.check_deferred_escalations",
        "schedule": 30.0,
    },
}
30 seconds is the sweet spot between "fast enough to page before an analyst's screen locks" and "not so aggressive it hammers the S1 API." The poll interval is an environment variable, so you can tune it without a code change.
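How settings.s1_poll_interval is populated is not shown here. One plausible shape, as a sketch only: read an environment variable and fall back to 30 seconds when it is missing or invalid. The variable name S1_POLL_INTERVAL and the helper poll_interval are assumptions made for illustration.

```python
import os

DEFAULT_POLL_SECONDS = 30.0


def poll_interval(env=None) -> float:
    """Return the S1 poll interval in seconds, falling back to 30."""
    env = os.environ if env is None else env
    raw = env.get("S1_POLL_INTERVAL", "")  # hypothetical variable name
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_POLL_SECONDS
    # Reject zero or negative intervals rather than spin the scheduler.
    return value if value > 0 else DEFAULT_POLL_SECONDS


print(poll_interval({"S1_POLL_INTERVAL": "45"}))  # 45.0
print(poll_interval({}))                          # 30.0
```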
The internal conversation: how components hand off work
Watch how the API and worker stay decoupled using Redis as a message broker.
An escalation call takes 3–5 minutes. If that ran inside the API web process, no other alerts could be processed until the call finished. By handing the work to the worker via Redis, the API stays instant and the multi-minute call loop runs safely in the background.
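The handoff can be imitated with nothing but the standard library. In this sketch a queue.Queue stands in for Redis and a thread stands in for the Celery worker; it is not how the real system is wired, and every name in it is hypothetical. It only shows why the API side returns immediately while the slow job runs elsewhere.

```python
import queue
import threading
import time

broker = queue.Queue()   # stand-in for the Redis broker
handled = []


def api_receive_alert(incident_id: str) -> str:
    """API side: enqueue the job and return immediately."""
    broker.put(incident_id)
    return "accepted"


def worker_loop() -> None:
    """Worker side: run slow escalation jobs one at a time."""
    while True:
        incident_id = broker.get()
        if incident_id is None:      # shutdown signal
            break
        time.sleep(0.05)             # stand-in for a multi-minute call loop
        handled.append(incident_id)


worker = threading.Thread(target=worker_loop)
worker.start()

start = time.monotonic()
replies = [api_receive_alert(f"inc-{n}") for n in range(3)]
api_seconds = time.monotonic() - start   # the API never waits on the slow work

broker.put(None)
worker.join()
print(replies, handled)
```

All three alerts are accepted before the first simulated call loop finishes, and the worker still processes them in order.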
Where to find what
Every file has one job. This makes it easy to know exactly where to look when something breaks, or when you want to add a feature.
Check your understanding
A client calls to report they can't reach anyone on-call. You want to trace why the escalation chain didn't run. Which process would you check logs for first?
You want to add a new analyst to the escalation chain. Which file needs to change?
Should We Page?
Not every alert wakes someone up. The system runs a triage checklist before dialing a single number, and the logic is more interesting than you’d expect.
The triage checklist, in order
Every incoming alert passes through these gates. The first gate that says "stop" wins.
Same source_id already open? Suppress silently. No double-paging.
SentinelOne alert on the same machine within 10 minutes of an open parent? Attach it as a child. One page for the storm, not fifty.
A SIEM alert pages only at or above the level threshold (default 15, labeled critical). The band just below it (12–14, labeled high) opens a CW ticket on the SOC board but never rings a phone. Direct calls always page. For SentinelOne it’s state-aware: an active threat (or contained ransomware) pages; a threat the agent already contained opens a SOC triage ticket only — see Module 6.
Business hours (8–5 CT, Mon–Fri, non-holiday)? SOC gets a 15-minute window to pick up the ticket before paging starts. After hours: page immediately.
Admin override can force ALWAYS or NEVER page mode regardless of time-of-day. AUTO (default) uses normal logic.
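The last two gates, the business-hours window and the admin override, can be sketched as two small functions. This is an illustration under stated assumptions: the function names are hypothetical, the datetime is assumed to be already converted to Central time, and the window is assumed to be 08:00 inclusive to 17:00 exclusive.

```python
from datetime import datetime, time


def is_business_hours(local_dt: datetime, holidays: set) -> bool:
    """local_dt must already be in Central time (conversion not shown)."""
    if local_dt.weekday() >= 5:            # Saturday=5, Sunday=6
        return False
    if local_dt.date() in holidays:
        return False
    # Assumption: the window is 08:00 inclusive to 17:00 exclusive.
    return time(8, 0) <= local_dt.time() < time(17, 0)


def treat_as_after_hours(mode: str, local_dt: datetime, holidays: set) -> bool:
    """Apply the admin override first, then fall back to the clock."""
    if mode == "ALWAYS":
        return True      # page immediately regardless of time
    if mode == "NEVER":
        return False     # SOC always gets the grace window
    return not is_business_hours(local_dt, holidays)   # AUTO


tuesday_10am = datetime(2024, 3, 5, 10, 0)
print(treat_as_after_hours("AUTO", tuesday_10am, set()))    # False
print(treat_as_after_hours("ALWAYS", tuesday_10am, set()))  # True
```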
The exact routing decision in code
This is the moment the system decides: page now, defer, or skip entirely.
high = await _is_high_severity(incident)
direct = source in ("voice", "callback")

if not high:
    # Low severity: SOC ticket only, never page
    db.add(IncidentEvent(event_type="page_suppressed" ...))
    return incident

if is_after or direct:
    run_escalation.delay(str(incident.id), title)
else:
    delay = await get_escalation_delay_minutes()
    # SOC has {delay} min to pick up before paging
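The same decision can be written as a pure function, which makes the three outcomes easy to test. This is a sketch, not the real routing code: route and its return labels are hypothetical, only the SIEM level threshold is modeled, and it assumes the severity check treats direct calls as high severity (consistent with "direct calls always page").

```python
def route(source: str, level: int, after_hours: bool,
          threshold: int = 15) -> str:
    """Return 'page_now', 'defer', or 'ticket_only' for one incident."""
    direct = source in ("voice", "callback")
    # Assumption: direct calls count as high severity regardless of level.
    high = direct or level >= threshold
    if not high:
        return "ticket_only"     # SOC ticket only, never page
    if after_hours or direct:
        return "page_now"        # start the escalation chain immediately
    return "defer"               # SOC gets the grace window first


print(route("siem", 15, after_hours=True))    # page_now
print(route("siem", 15, after_hours=False))   # defer
print(route("siem", 13, after_hours=True))    # ticket_only
print(route("voice", 0, after_hours=False))   # page_now
```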
The storm problem: 50 alerts, one page
Imagine a pen tester runs a script on a client machine. SentinelOne fires 50 threat alerts in 10 minutes, one per executable. Without bundling, that’s 50 parallel phone calls to the analyst.
A single attack event can produce dozens of individual S1 alerts: one per malicious file, per process, per connection. Each one would normally create its own incident and page chain.
If a new S1 alert arrives for the same client machine within 10 minutes of an open parent incident, it becomes a "child" of that parent. Only the parent escalates. Children just attach silently.
Even Slack notifications are throttled: first 3 siblings post verbatim, then only milestone counts (10, 25, 50, 100, …). So 100 siblings produce ~7 messages, not 100.
The SIEM bundler has an extra rule: a new alert only bundles into a parent if the parent's severity level is equal or higher. A noisy low-level parent can’t "swallow" and silence a more serious alert on the same machine; that one opens its own incident and pages.
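The bundling rule and the Slack throttle reduce to two small predicates. The sketch below uses hypothetical function names and assumes the 10-minute window is measured from when the parent incident opened; the milestone set is cut off at 100 for the demo.

```python
from datetime import datetime, timedelta

BUNDLE_WINDOW = timedelta(minutes=10)
MILESTONES = {10, 25, 50, 100}   # the source list continues past 100


def should_bundle(parent_opened: datetime, parent_level: int,
                  alert_time: datetime, alert_level: int,
                  same_machine: bool) -> bool:
    """SIEM rule: attach as a child only if the parent is at least as severe."""
    in_window = timedelta(0) <= alert_time - parent_opened <= BUNDLE_WINDOW
    return same_machine and in_window and parent_level >= alert_level


def should_post_sibling(n: int) -> bool:
    """n is the 1-based sibling count: first 3 post, then milestones only."""
    return n <= 3 or n in MILESTONES


opened = datetime(2024, 3, 5, 2, 0)
later = opened + timedelta(minutes=4)
# A level-12 alert does not bundle into a level-6 parent.
print(should_bundle(opened, 6, later, 12, True))            # False
# 3 verbatim posts + milestones 10, 25, 50, 100.
print(sum(should_post_sibling(n) for n in range(1, 101)))   # 7
```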
Three paths through the triage desk
Check your understanding
During business hours, a high-severity SIEM alert fires. An analyst opens the ConnectWise ticket and moves it to "In Progress" 8 minutes later. Does the phone ring?
An IP Pathways support rep calls the Tenax SOC hotline at 10am on a Tuesday. Does the on-call analyst get paged immediately or after the 15-minute grace window?
Machine 'ACCOUNTING-PC-01' already has an open SIEM incident (severity level 6). A new SIEM alert fires for the same machine at severity level 12. What happens?
The Phone Tree
Who gets called, in what order, and what happens when they don’t answer. The escalation chain is smarter than a simple list of numbers.
How the chain gets built from a schedule file
The chain isn’t hardcoded. It’s derived every time from schedule.yaml using a set of role-priority rules.
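def _derive_chain(primary, config):
    chain = []

    def push(member):
        # Assumed implementation (helper body not shown): skip duplicates.
        if member not in chain:
            chain.append(member)

    push(primary)
    analysts = [m for m in config["team"]
                if m["role"] == "analyst"]
    # Analyst and manager primaries are both followed by the analysts.
    if primary["role"] in ("analyst", "manager"):
        for a in analysts:
            push(a)
    # The management tail is always appended, in org-chart order.
    for role in ["manager", "vp", "president"]:
        for m in config["team"]:
            if m["role"] == role:
                push(m)
    return chain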
The push helper ensures no one appears twice. Before the derived chain is used, three overrides are checked in priority order: (1) a holiday on_call field for today’s exact date, (2) a per-week primary override set in the admin UI, and (3) a full admin chain-reorder that replaces the entire derived chain. The last wins.
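To see the role-priority rules produce a concrete chain, here is a runnable sketch with a made-up team. The names, the "name" dedup key, and the shape of the config are assumptions (the real schedule.yaml schema is not shown), and the overrides are not modeled.

```python
def derive_chain(primary: dict, config: dict) -> list:
    """Build the escalation order from role-priority rules."""
    chain = []

    def push(member: dict) -> None:
        # Skip anyone already in the chain so no one is called twice.
        if all(m["name"] != member["name"] for m in chain):
            chain.append(member)

    push(primary)
    if primary["role"] in ("analyst", "manager"):
        for member in config["team"]:
            if member["role"] == "analyst":
                push(member)
    for role in ("manager", "vp", "president"):
        for member in config["team"]:
            if member["role"] == role:
                push(member)
    return chain


# Hypothetical team, for illustration only.
team = {"team": [
    {"name": "Avery", "role": "analyst"},
    {"name": "Blake", "role": "analyst"},
    {"name": "Casey", "role": "manager"},
    {"name": "Devon", "role": "vp"},
    {"name": "Emery", "role": "president"},
]}
manager = team["team"][2]
print([m["name"] for m in derive_chain(manager, team)])
# ['Casey', 'Avery', 'Blake', 'Devon', 'Emery']
```

With the manager as primary, the manager is called first and then skipped in the tail, because the dedup check already saw them.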
The typical five-tier chain
Each level gets two call attempts before the chain advances. No one is skipped: every level must either acknowledge or exhaust both rings.
This week’s on-call person from schedule.yaml. They get two calls: the first ring, then a second about a minute later (gap configurable, default 60s) to punch through iPhone DND mode.
The other analyst on the team. Same two-attempt pattern. If the team only has one analyst, this level is skipped automatically.
If both analysts are unreachable, the manager gets the call. They are always in the chain tail regardless of who the primary is.
Next in the org chart. Reached only if the manager also doesn’t pick up both calls. Rare in practice.
Last resort. If the chain is exhausted without an ack, a self-alert fires to notify the system owner that no one answered.
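The walk through the tiers amounts to a nested loop: two attempts per level, stop at the first acknowledgment. In this sketch place_call is an injected stand-in for the real Twilio call, the tier labels are placeholders, and the wait between attempts is left out so the example runs instantly.

```python
def run_chain(chain, place_call, attempts_per_level=2):
    """Call each level up to twice; stop at the first acknowledgment.

    place_call(name, attempt) is a stand-in for the Twilio call and
    returns True when the person presses 1.
    """
    calls = []
    for name in chain:
        for attempt in range(1, attempts_per_level + 1):
            # Real code waits the configured gap (default 60 s) before
            # the second attempt; omitted here.
            calls.append((name, attempt))
            if place_call(name, attempt):
                return name, calls
    return None, calls          # chain exhausted: a self-alert would fire


chain = ["primary", "backup", "manager", "vp", "president"]
acked_by, calls = run_chain(chain, lambda name, attempt: name == "manager")
print(acked_by, len(calls))  # manager 5
```

Four calls go unanswered (two per analyst) before the manager's first ring, which is the fifth call overall.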
Inside a single escalation level: two attempts, then advance
Watch what happens when the system calls the primary analyst and they don’t pick up.
What the phone actually says when someone answers
Twilio reads a scripted announcement aloud via text-to-speech. The script is built at call time from the incident data.
def _announcement_text(
        source, client, severity, title):
    src = _pretty_source(source)
    parts = ["This is the Tenax SOC."]
    intro_bits = [src]
    # severity_phrase is derived from severity; that step is not shown here.
    if severity_phrase:
        intro_bits.append(severity_phrase)
    intro_bits.append("alert")
    intro = " ".join(intro_bits)
    if client:
        intro += f" for {client}"
    parts.append(intro + ".")
    if title:
        parts.append(f"{title}.")
    parts.append("Press 1 to acknowledge.")
    return " ".join(parts)  # assumed: the excerpt ends before the return
The bypass comes from placing a second call to the same person about a minute after the first goes unanswered (the gap is configurable, default 60s). iOS’s Repeated Calls feature lets a second call from the same number within three minutes break through Do Not Disturb / Focus. The two Gather blocks you see in the TwiML are just re-prompts within a single call (the full announcement, then a shorter repeat), not the bypass mechanism itself.
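The two-Gather layout might look like the sketch below, built with the standard library's ElementTree so it runs without the Twilio SDK. Gather, Say, numDigits, and action are standard TwiML; the webhook URL, the client name, and the build_twiml helper are hypothetical, and the real TwiML may carry more attributes.

```python
import xml.etree.ElementTree as ET


def build_twiml(full_text: str, short_text: str, action_url: str) -> str:
    """Return TwiML with two Gather prompts inside a single call."""
    response = ET.Element("Response")
    for text in (full_text, short_text):
        # Each Gather waits for one digit and posts it to action_url.
        gather = ET.SubElement(response, "Gather",
                               {"numDigits": "1", "action": action_url})
        ET.SubElement(gather, "Say").text = text
    return ET.tostring(response, encoding="unicode")


twiml = build_twiml(
    "This is the Tenax SOC. SentinelOne alert for Acme. "
    "Press 1 to acknowledge.",        # "Acme" is a made-up client
    "Press 1 to acknowledge.",
    "https://example.invalid/ack",    # hypothetical webhook URL
)
print(twiml)
```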
Check your understanding
The manager is this week's primary on-call. When _derive_chain() builds the chain, where does the manager appear?
An after-hours incident fires and neither analyst answers any calls. How many total Twilio calls are placed before the manager gets their first ring?
An incident is queued for escalation at 2:05am. At 2:06am an admin applies a full-chain override. The worker picks up the task at 2:07am. Which chain does the worker use?
Stand Down
How an incident ends β whether by human acknowledgment, automatic mitigation, or the system monitoring itself to know it’s still working.
Six states an incident can be in
Every incident row in the database moves through a defined set of statuses. The transitions determine whether paging continues, pauses, or stops entirely.
Incident was created. Either waiting for the deferred-escalation timer (business hours) or about to page immediately (after hours).
The escalation chain is actively running. Twilio is placing calls. The chain advances person-by-person until someone presses 1 or the chain is exhausted.
A human pressed 1 on their phone, or the SOC moved the CW ticket to “In Progress”. Paging stops. The CW ticket is promoted to the SOC board if it was staged on Tier 1.
S1’s agent contained the threat but no human has reviewed it (is_contained, not is_closed). The paging chain stops mid-ring, but the incident stays open as a triage item until an analyst verdicts it. Contained ≠ closed.
The threat is now closed upstream — an analyst rendered a verdict, or the S1 case was resolved (is_closed) — and check_auto_resolve caught it. The paging chain stops mid-ring if active. Slack gets a stand-down message.
Manually closed from the admin UI, or the CW ticket was closed and the system synced the status. Final state: no further actions fire.
The automatic stand-down: how S1 mitigation stops the page
Every 30 seconds, the worker checks open SentinelOne incidents and asks: “Did S1 handle this on its own?”
# Beat schedule, every 30s
"check-auto-resolve": {
    "task": "...check_auto_resolve",
    "schedule": 30.0,
},

# For each open S1 incident:
if is_closed(threat_data):                 # human/case resolved
    await resolve_incident(incident.id)    # → resolved
elif is_contained(threat_data):            # agent acted, no review
    await mark_monitored(incident.id)      # → monitored (open)
While waiting for each ring to complete, wait_for_terminal_status() polls the database every 3 seconds. resolved and monitored are both chain-halting states: if check_auto_resolve lands either one mid-ring, the escalation chain sees it on the next poll and stops cleanly, and no further calls are placed. (A threat that self-contains mid-page stops ringing but stays open for triage.)
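A polling loop of this kind can be sketched with asyncio. The signature below is a guess, not the real one: get_status is an injected stand-in for the database read, and "acknowledged" and "escalating" are assumed status names (only resolved and monitored are named in the text).

```python
import asyncio

# "resolved" and "monitored" come from the text; "acknowledged" is an
# assumed name for the state reached when someone presses 1.
CHAIN_HALTING = {"resolved", "monitored", "acknowledged"}


async def wait_for_terminal_status(get_status, ring_seconds, poll_seconds=3.0):
    """Poll until a chain-halting status appears or the ring times out."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + ring_seconds
    while True:
        status = await get_status()
        if status in CHAIN_HALTING:
            return status
        if loop.time() >= deadline:
            return None              # nobody acted: caller advances the chain
        await asyncio.sleep(poll_seconds)


async def demo():
    # The third read simulates check_auto_resolve landing mid-ring.
    statuses = iter(["escalating", "escalating", "monitored"])

    async def fake_db_read():
        return next(statuses)

    return await wait_for_terminal_status(fake_db_read, ring_seconds=1.0,
                                          poll_seconds=0.01)


print(asyncio.run(demo()))  # monitored
```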
Two safety systems you probably won’t need β until you do
The admin UI has a three-position switch: AUTO (default: use time-of-day logic), ALWAYS (treat every incident as after-hours and page immediately regardless of clock), and NEVER (treat everything as business hours, so the SOC gets the grace window even at 2am). Useful during planned downtime or drills.
Every 15 seconds, the worker writes a timestamp to Redis (worker_heartbeat). Every 60 seconds, the monitoring_watchdog task checks that timestamp. If the heartbeat is stale, a self-alert fires, paging the system owner because the worker has stopped executing tasks.
The watchdog also checks when each scheduled task (S1 poll, SIEM poll, auto-resolve, deferred-escalation checker) last completed successfully. If any of them goes silent past its threshold, a self-alert fires. The system watches itself so it can’t silently fail.
A self-alert is an internal incident that pages the system owner’s personal phone and posts to a separate Slack channel. It fires for: heartbeat stale, chain exhausted with no ack, Twilio call failures, and stuck-escalation backstop trips. These are never shown to clients β they’re operational health signals.
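The heartbeat check boils down to comparing a stored timestamp against the clock. In this sketch a plain dict stands in for Redis, the function names are hypothetical, and the 60-second staleness threshold is an assumption (the actual threshold is not stated).

```python
import time

store = {}                      # stand-in for Redis
STALE_AFTER_SECONDS = 60.0      # assumed threshold; the source does not state it


def write_heartbeat(now=None) -> None:
    """Worker side: record the time of the latest heartbeat."""
    store["worker_heartbeat"] = time.time() if now is None else now


def heartbeat_is_stale(now=None) -> bool:
    """Watchdog side: True when the worker has gone quiet."""
    now = time.time() if now is None else now
    last = store.get("worker_heartbeat")
    # A missing key counts as stale: the worker has never checked in.
    return last is None or now - last > STALE_AFTER_SECONDS


write_heartbeat(now=1000.0)
print(heartbeat_is_stale(now=1015.0))   # False
print(heartbeat_is_stale(now=1100.0))   # True
```

Passing the clock in as a parameter keeps the check testable without sleeping.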
Full lifecycle: alert to stand-down
Check your understanding
A high-severity S1 threat fires after hours. The analyst is being paged. SentinelOne auto-quarantines the threat 20 seconds into the first ring (no analyst verdict yet). What happens next?
An admin sets the override to ALWAYS. A medium-severity SIEM alert fires at 10am on a Tuesday. What happens?
The Celery worker crashes and stops executing tasks. How does the system detect this?
SentinelOne Triage
SentinelOne’s agent contains most threats on its own. The hard question isn’t “did we stop it?” — it’s “has a human actually looked?” This module is how we cut the noise without going blind.
The principle: containment is not review
S1 in protect mode auto-quarantines malware the moment it’s written to disk. That’s containment — the file can’t run. But as a SOC for other companies, our job doesn’t end at “the file is quarantined.” We still owe the customer the investigation: were there surrounding alerts? How did it get there? Did anything spread before the agent caught it?
“Can this wait until the SOC is staffed?” A threat that may still be executing can’t wait → page. A threat the agent already contained almost always can → triage ticket, worked the next business day — unless it’s a high-blast-radius class (ransomware), where the wait-cost is too high. A threat a human already reviewed → nothing.
The routing matrix
Every in-scope S1 detection lands in exactly one of these buckets. The bucket decides the loudness.
The threat may still be running. Pages after-hours / SOC grace window during business hours. Unchanged from before.
Even contained, ransomware signals campaign & blast-radius risk — one node contained doesn’t mean the others are. Worth a call.
The bulk of the queue. SOC ticket only, never pages (a Slack FYI while S1 is in preview). Worked next business day. This is your triage list.
Low confidence, high false-positive rate. Suppressed — not worth a ticket each. (Revisit the threshold if something slips through.)
is_closed. A human already judged it (even a false positive) — nothing more to surface.
Contained threats are surfaced only when contained_needs_triage passes:
confidence = malicious and the classification isn’t PUA/adware/blank.
That’s what keeps routine auto-quarantine noise out of the queue — measured at ~8 real
malicious-malware detections per week across all managed customers, not hundreds. The whole feature
is behind the s1_surface_contained toggle (default off).
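The routing matrix and the contained_needs_triage gate can be sketched together. Treat this as an illustration under assumptions: the field names confidenceLevel and classification, the exact classification strings, and the route_s1 helper are not confirmed by this document, the s1_surface_contained toggle is not modeled, and ransomware is checked before the confidence gate.

```python
NOISE_CLASSES = {"pua", "adware", ""}   # assumed spellings, compared lowercase


def _field(threat: dict, key: str) -> str:
    """Read a threatInfo field as a lowercase string ('' if missing)."""
    return (threat["threatInfo"].get(key) or "").strip().lower()


def is_contained(threat: dict) -> bool:
    return _field(threat, "mitigationStatus") in ("mitigated", "blocked",
                                                  "remediated")


def is_closed(threat: dict) -> bool:
    return (_field(threat, "incidentStatus") == "resolved"
            or _field(threat, "analystVerdict") in ("false_positive",
                                                    "true_positive"))


def contained_needs_triage(threat: dict) -> bool:
    # Field names are assumptions about the S1 payload.
    return (_field(threat, "confidenceLevel") == "malicious"
            and _field(threat, "classification") not in NOISE_CLASSES)


def route_s1(threat: dict) -> str:
    if is_closed(threat):
        return "nothing"            # a human already judged it
    if not is_contained(threat):
        return "page"               # may still be running
    if _field(threat, "classification") == "ransomware":
        return "page"               # contained, but blast-radius risk
    if contained_needs_triage(threat):
        return "triage_ticket"      # SOC ticket only, never pages
    return "suppress"               # low confidence or PUA/adware/blank


trojan = {"threatInfo": {"mitigationStatus": "mitigated",
                         "confidenceLevel": "malicious",
                         "classification": "Trojan"}}
print(route_s1(trojan))  # triage_ticket
```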
You picked up a contained-threat ticket. Now what?
These tickets are about investigation, not firefighting — the file is already quarantined. Work them in order.
In the S1 console, verify the agent quarantined/killed it and the endpoint is healthy. Is the machine network-connected or isolated? Did mitigation fully succeed, or partially?
Pivot on the host, site, and user around the detection time — in S1 and the SIEM. One quarantined dropper can be the visible tip of a malvertising/phishing campaign hitting other users.
How did it land? Download, email, USB, lateral movement? Note the source and the storyline — that’s what the customer actually needs from us.
Mark the threat true positive or false positive (or resolve the case). That sets is_closed — check_auto_resolve then stands our incident down automatically on its next poll. Closing the loop is the point.
Evidence of spread, multiple hosts, credential theft, or a live foothold? Treat it as an active incident — loop in the customer and follow the standard IR path, don’t just close the ticket.
Worked example: the dropper that almost went unseen
This is the real case that drove the whole feature.
A user at a managed customer downloaded a signed malicious MSI dropper. S1 in protect
mode auto-quarantined it on write — mitigationStatus: mitigated. But
analystVerdict was undefined and incidentStatus was
unresolved: contained, not closed.
Under the old single is_mitigated() check, that threat was silently dropped at ingest —
no Slack post, no ticket, no record. A real malicious dropper on a customer endpoint, and the SOC never saw it.
With the contained-vs-closed split, that same detection now surfaces as a contained-malware triage ticket:
no 2am page (it was contained), but it’s in the queue, guaranteed to be worked — surrounding alerts checked,
root cause found, verdict rendered. Mitigation bought time; investigation is still the job.
Check your understanding
S1 auto-quarantines a malicious Trojan on a managed endpoint. No analyst has touched it. With s1_surface_contained on, what does the system do?
Same situation, but the contained, malicious detection is classified Ransomware. What changes?
You finish triaging a contained-malware ticket and mark the threat a true positive in the S1 console. What happens to our on-call incident?