How CIT Responds to Unplanned Service Disruptions and Outages
This article applies to: Essentials for IT Professionals
This overview illustrates the process that a CIT service team member, support staff member, or other individual tasked with response is expected to follow when an unplanned performance issue, disruption, or outage affects a CIT-supported service.
1. Verify the disruption or outage
Using standard methods, CIT staff verify the alarm, disruption, outage, or reported issue.
2. Record the disruption or outage
In online chat:
- CIT service team members record that the event is occurring.
- They continue to monitor.
For IT Status Alerts:
If a custom notification is not required, CIT service team members post the initial disruption or outage alert on the IT Status Alerts page. The alert then appears on the IT Status Alerts and IT@Cornell pages, and an email is sent to Net-Announce-L.
When a custom notification or a status update is needed, the CIT service team member outlines the message in chat, then calls the IT event manager and asks them to post it.
When assistance is needed with wording, the service team member can contact IT Communication on-call staff.
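The choice between posting directly and going through the IT event manager can be sketched as a small decision function. This is only an illustration of the steps described above, not a CIT tool: the `AlertRequest` class, its field names, and `notification_steps` are hypothetical names introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AlertRequest:
    """Hypothetical description of what the service team member needs."""
    needs_custom_notification: bool = False
    is_status_update: bool = False
    needs_wording_help: bool = False


def notification_steps(request: AlertRequest) -> list[str]:
    """Return the actions the article describes for this kind of request."""
    steps = []
    if request.needs_wording_help:
        # Wording help comes from IT Communication on-call staff.
        steps.append("Contact IT Communication on-call staff for help with wording")
    if request.needs_custom_notification or request.is_status_update:
        # Custom notifications and status updates go through the IT event manager.
        steps.append("Outline the message in chat")
        steps.append("Call the IT event manager to have the message posted")
    else:
        # A standard initial alert is posted directly by the service team member.
        steps.append("Post the initial alert on the IT Status Alerts page")
    return steps
```

For example, `notification_steps(AlertRequest(is_status_update=True))` returns the two steps that route the update through the IT event manager.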
IT event manager:
The CIT service team member engages the on-call IT event manager...
- Whenever they need a status update posted on the IT Status Alerts page.
- If they need assistance coordinating during the disruption or outage (including engaging CIT's major incident manager).
3. Engage additional resources as needed
- For an IT major incident (or suspected major incident), the IT event manager engages CIT's major incident manager. If an IT major incident is declared, it.cornell.edu/alert will indicate where to check in for chat and voice incident response.
- To get help from additional technical resources, such as a DBA or system administrator, CIT staff can use CIT's on-call list.
- For help writing custom updates, CIT staff can contact IT Communication.
4. Continue working on the issue and providing timely updates
- CIT service team members identify corrective actions being taken. If possible, they estimate time to resolution.
- They log updates in chat (any new information, milestones, or setbacks, or a note that the status has not changed).
- They coordinate with the IT event manager or IT Communication to post regular updates on the IT Status Alerts page. Staff working the incident agree among themselves on when updates will occur.
During a major incident, CIT is generally expected to provide an update every half hour, even if only to say that work is continuing.
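As a rough illustration of the half-hour cadence, the sketch below computes when the next update would be due. It assumes a fixed 30-minute interval measured from the last posted update; the names `UPDATE_INTERVAL`, `next_update_due`, and `is_update_due` are hypothetical and do not refer to any CIT system. In practice, staff working the incident set the actual expectations.

```python
from datetime import datetime, timedelta

# Assumed interval, based on the "every half hour" expectation for major incidents.
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(last_update: datetime) -> datetime:
    """Return the time by which the next status update would be expected."""
    return last_update + UPDATE_INTERVAL


def is_update_due(last_update: datetime, now: datetime) -> bool:
    """Return True once the interval since the last update has elapsed."""
    return now >= next_update_due(last_update)
```

A reminder script could call `is_update_due(last_update, datetime.now())` and prompt the team to post a "work is continuing" note when it returns `True`.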
5. Notify everyone when resolved
- In chat, CIT service team members record that the problem has been resolved.
- They ask the IT event manager or IT Communication to close the alert on the IT Status Alerts page.
- They submit a CIT change request (CRQ) form to log the unplanned service disruption.