Sensor Architecture Can Help Keep Us Up and Running: Part 1
In the constant press of rolling out ever better products and services to our customers, it can be easy, and often necessary, to fall into a reactive mode around reliability. When our systems break, we have an incident process that pulls people together to figure out the problem and fix it as quickly as possible. That process works well, and it is necessary and important for particularly complex system issues, but it is also potentially costly in disruption to customers and staff and in staff wellness. If we look closely, there are sometimes phenomena that we accept out of inertia as being routinely resolved in a relatively costly incident mode, when they could be handled much more cheaply and routinely. With a little sensor architecture thinking and development, we can engineer routine monitoring and follow-up processes, minimizing disruption to all and preserving our incident response energy for the really big, complex, and unforeseeable issues.
Sensor systems architecture is a somewhat niche, unofficial engineering field that grew over many decades out of systems analysis and sensor engineering for national security. When the decisions that sensors trigger are loaded with risk, a whole science develops around making sensors work to manage that risk. Instead of looking at a pile of available data and asking ourselves how to use it as-is for our problems, we start by asking what decisions we need to make and what information is required to make those decisions with confidence. Then we purposefully design the concert of sensors, analysis, and communications systems that will provide that information with the timeliness and quality needed. If available data can contribute, great, but we need to be aware of the gap between what that data actually means and what we need, and engineer the difference.
Here's an example. Let's say there's a type of incident that is happening repeatedly, because an alarm that was purchased to detect intrusions in multiple locations is going off much more frequently than actual intrusions are occurring. Maybe there hasn't even been a real intrusion yet, just false alerts. Every time this alarm goes off, an incident bridge is spun up and people are called away from their normal work to look, in emergency mode, for supporting information to assess whether an actual intrusion occurred. There's a general playbook of what data sources to check. Rarely are the incident subject matter experts (SMEs) truly confident about the meaning of what they find there. However, the evidence indicates the alert was caused by something other than an adversarial intrusion, and there haven't yet been bad consequences from a misled assessment. People shrug, close the incident, reset the sensor, and move on to their other work. Rinse and repeat.
This is bad for two reasons. First, resources are being wasted on incident-pace work when a little more data could combine with the alarm to drastically reduce the operational load of alerts. Second, it builds up an expectation that alerts associated with this alarm are false, leading people into a confirmation bias trap that may cause them to overlook a real event in the future, as in the story of the "Boy Who Cried Wolf". This is a case of what can happen when a sensor is purchased and deployed without architectural thinking.
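To make the point about "a little more data" concrete, here is a minimal sketch using Bayes' rule. Every number in it is a hypothetical assumption chosen for illustration, not a measurement from any real alarm: a 0.1% prior chance of a real intrusion at a site on a given day, an alarm that fires on 95% of real intrusions and on 5% of benign days, and a second corroborating data source (also hypothetical, and assumed independent of the alarm) that shows supporting evidence for 90% of real intrusions and 2% of benign events. The function name `posterior_intrusion` is likewise made up for this example.

```python
def posterior_intrusion(prior, p_signal_given_intrusion, p_signal_given_benign):
    """Bayes' rule: probability of a real intrusion given that a signal fired."""
    p_signal = (p_signal_given_intrusion * prior
                + p_signal_given_benign * (1.0 - prior))
    return p_signal_given_intrusion * prior / p_signal


# All values below are illustrative assumptions, not real sensor data.
PRIOR = 0.001          # chance of a real intrusion at a site on a given day
ALARM_HIT = 0.95       # alarm fires when there is a real intrusion
ALARM_FALSE = 0.05     # alarm fires on a benign day
CORROB_HIT = 0.90      # second source agrees when the intrusion is real
CORROB_FALSE = 0.02    # second source agrees when the event is benign

# Belief after the alarm alone.
alarm_only = posterior_intrusion(PRIOR, ALARM_HIT, ALARM_FALSE)

# Belief after the alarm AND the second source, treating them as independent,
# so the alarm-only posterior becomes the prior for the second update.
with_corroboration = posterior_intrusion(alarm_only, CORROB_HIT, CORROB_FALSE)

print(f"P(intrusion | alarm)                 = {alarm_only:.3f}")
print(f"P(intrusion | alarm + corroboration) = {with_corroboration:.3f}")
```

Under these assumed numbers, an alert from the alarm alone is overwhelmingly likely to be false, which matches the shrug-and-reset pattern described above, while an alert backed by a second source carries far more weight. Only the second kind would need an incident bridge; the rest could go to a routine follow-up queue. The independence assumption is doing a lot of work here and would need to be checked against real data.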
In the next post, we'll explore how a sensor systems architect would approach this problem so that the sensors deployed serve our decisions instead of just creating more work and uncertainty.