How CIT Responds to Unplanned Service Disruptions and Outages
This article applies to: Essentials for IT Professionals
This overview illustrates the process that a CIT service team member, support staff member, or other individual tasked with responding is expected to follow when an unplanned performance issue, disruption, or outage affects a CIT-supported service.
- The CIT service delivery team verifies the disruption or outage.
- They create a parent (major) incident in the ticketing system and associate related incidents with it.
- In online chat, CIT service delivery team members record that the event is occurring, and they continue to monitor it.
For IT Status Alerts:
- If a custom notification is not required, CIT service delivery team members post an initial alert about the disruption or outage on the IT Status Alerts page. The alert also appears on IT@Cornell pages and CUInfo Campus Alerts, and an email is sent to the Net-Announce-L e-list.
- When a custom notification or a status update is needed, the CIT service delivery team member drafts the message in chat and then works with communicators to post the update.
For a high-severity major incident, whether confirmed or suspected, CIT's high-severity incident manager is engaged. If additional or alternate channels are needed, it.cornell.edu/alert will indicate where to check in for chat and voice incident response.
To get help from additional technical resources, such as a DBA or system administrator, CIT staff can use CIT's on-call list.
Service delivery team staff continue working on the issue and provide timely updates. They identify the corrective actions being taken and, if possible, estimate the time to resolution. They log updates in chat, including any new information, milestones, or setbacks, or a note that the status has not changed. They coordinate with communicators to post regular updates on the IT Status Alerts page. Staff working the incident agree among themselves on when updates will occur.
As a general rule, during a high-severity incident CIT is expected to provide an update every half hour, even if only to say that work is still continuing.
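For teams that track update deadlines with a script, the half-hour expectation reduces to simple time arithmetic. The sketch below is a minimal illustration only; the function names and the 30-minute constant are hypothetical and are not part of any CIT tool.

```python
from datetime import datetime, timedelta

# Hypothetical constant reflecting the half-hour update expectation for
# high-severity incidents; adjust if the team agrees on a different cadence.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime, interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return the time by which the next status update should be posted."""
    return last_update + interval


def update_overdue(last_update: datetime, now: datetime,
                   interval: timedelta = UPDATE_INTERVAL) -> bool:
    """Return True if no update has been posted within the expected interval."""
    return now > next_update_due(last_update, interval)


if __name__ == "__main__":
    last = datetime(2024, 1, 15, 9, 0)  # illustrative timestamp only
    print(next_update_due(last))  # 2024-01-15 09:30:00
    print(update_overdue(last, datetime(2024, 1, 15, 9, 45)))  # True
```

Even a "no change" update resets the clock, which matches the expectation that CIT posts something every half hour while work continues.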
Once the issue is resolved:
- In chat, CIT service team members record that the problem has been resolved.
- They ask the IT event manager or IT Communication to close the alert on the IT Status Alerts page.
- They resolve the major incident ticket and its child tickets.
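The parent/child relationship used throughout this process, where a major incident is opened, related incidents are associated with it, and all are resolved together, can be modeled as a small data structure. The sketch below is a hypothetical illustration; the `Incident` class, its methods, and the ticket numbers are invented for this example and do not reflect the ticketing system's actual schema or API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"


@dataclass
class Incident:
    """Hypothetical ticket record; not the real ticketing system's data model."""

    number: str
    summary: str
    status: Status = Status.OPEN
    children: list["Incident"] = field(default_factory=list)

    def associate(self, child: "Incident") -> None:
        """Attach a related incident as a child of this parent (major) incident."""
        self.children.append(child)

    def resolve(self) -> None:
        """Resolve every child ticket, then the parent itself."""
        for child in self.children:
            child.status = Status.RESOLVED
        self.status = Status.RESOLVED

    def open_tickets(self) -> list[str]:
        """List ticket numbers (parent or child) that are still open."""
        tickets = [self, *self.children]
        return [t.number for t in tickets if t.status is Status.OPEN]


if __name__ == "__main__":
    major = Incident("MAJ-001", "Email service outage")
    major.associate(Incident("INC-101", "Cannot send mail"))
    major.associate(Incident("INC-102", "Webmail login fails"))
    print(major.open_tickets())  # ['MAJ-001', 'INC-101', 'INC-102']
    major.resolve()
    print(major.open_tickets())  # []
```

Resolving from the parent keeps the child tickets from being left open after the major incident is closed.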