Incident management system at scale.
A multi-million-line incident management system was under active development but increasingly hard to work with. The mobile app, sending heavy telemetry from the field, was unreliable. Cloud costs were ballooning. Server-side changes were destabilizing the platform, slowing the pace of new features.
We took end-to-end ownership for nearly two years and kept shipping throughout:
- Stabilizing the mobile app under heavy telemetry load
- Reducing cloud infrastructure costs by 50%
- Stabilizing the server side so new features could land safely
- Moving deployment to infrastructure-as-code
- Introducing AI automation incrementally into the operational workflow
What it tells you: we can take ownership of a large incident management system, stabilize it under real production load, and keep shipping new capabilities in that environment.