Outages Affecting Shared File Services
Types of outages that could affect Shared File Services (SFS)
This article applies to: Shared File Services
Many types of events could cause outages at the primary site (Rhodes Hall). Outages may be short term or long term, planned or unplanned; some result in data loss at the primary site, while others only make the primary site temporarily inaccessible. Each of these outage categories is covered below. Not all outage event types can be anticipated, however, so some discretion may be used in deciding how to respond to a particular event, based on its circumstances.
Almost all routine network and SFS maintenance, including code updates, can be performed without taking the primary site down for more than a few minutes. For this type of outage, the replicated copy would not be promoted to primary status. Notification of any such brief planned outage would be sent well in advance of the scheduled maintenance activity.
Lengthy planned outages are very rare. However, if a lengthy planned outage (more than 4 hours) does occur in the future, a proactive "synchronized promotion" may be performed. Because this would be a planned promotion, it could be performed with zero data loss (an RPO of zero). Both the initial promotion and the later resynchronization and reversion to the Rhodes Hall copy would be scheduled and announced in advance. The promotion and the reversion would each involve a brief interruption in service. During such lengthy planned outages, non-replicated shares would be unavailable.
Short Term, No Data Loss
Because promoting a replicated share to primary status can result in the loss of recently made changes to that share (i.e., an RPO > 0), we will not promote a replicated share in this event. Rather, to preserve all updates, we will wait until the outage is resolved, restoring accessibility to the primary share's copy in Rhodes as soon as possible.
Long Term, No Data Loss
If it is clear that the primary copy in Rhodes Hall will be inaccessible for a lengthy time (greater than 8 hours), but that no data has been lost, then CIT will promote all replicated shares in CCC to primary status. This will result in the loss of up to the last 4 hours of updates made to those shares. When the event is resolved, shares will be resynchronized and reverted to the Rhodes Hall copy at a scheduled off-peak time. This reversion will not result in any additional data loss, but will involve a short interruption of service.
Unpredictable Term, No Data Loss
Some events could make the Rhodes Hall copy inaccessible for an unpredictable amount of time. This can happen when the root cause of an outage cannot be easily ascertained, so it is not possible to predict when the outage will be resolved. These events will be handled on a case-by-case basis, using the best information available at the time. The secondary CCC copy is not promoted to primary status immediately because doing so would result in the loss of up to the last 4 hours of changes. If, after 8 hours, there is still significant uncertainty about how long the outage will last, then we would promote the secondary CCC copies to primary status at that time.
Any Term, Data Loss
If there is an event in Rhodes Hall that results in the destruction of data (e.g., a fire), then the replicated CCC copy would be promoted to primary status as soon as possible. This will result in the loss of up to the last 4 hours of updates made. In this event, the time it takes to promote the secondary CCC copy to primary status (the Recovery Time Objective, or RTO) will depend on the availability of SFS administrators. During normal business hours, the expectation is that such a promotion would happen within 1 hour. During non-business hours, the goal would be to have the promotion completed within 4 hours. The promotion itself is fairly quick, but during off-hours it may take time for an administrator to be contacted and become able to perform it.
In such an event (e.g., a fire), data that is NOT replicated will require restoration from the Offsite Disaster Recovery Copy. This will result in the loss of up to the last 24 hours of updates made, as the Offsite Disaster Recovery Copy is refreshed once per day. In this event, the time for restoration (the Recovery Time Objective, or RTO) will be measured in days to weeks.
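The promotion rules described above for replicated shares can be summarized as a small decision function. This is only an illustrative sketch, not a tool CIT uses: the names Outage, Decision, and decide are hypothetical, the thresholds are copied from this article, and real events may still be handled with discretion on a case-by-case basis.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from this article; all names are illustrative assumptions.
LONG_PLANNED_HOURS = 4    # planned outage longer than this counts as lengthy
LONG_UNPLANNED_HOURS = 8  # unplanned outage longer than this counts as lengthy
REPLICA_RPO_HOURS = 4     # updates that may be lost on an unsynchronized promotion


@dataclass
class Outage:
    planned: bool
    data_lost: bool                   # data destroyed at the primary site
    expected_hours: Optional[float]   # None means the duration is unpredictable
    elapsed_hours: float = 0.0        # time since the outage began


@dataclass
class Decision:
    promote: bool
    max_data_loss_hours: int
    note: str


def decide(outage: Outage) -> Decision:
    """Return the promotion decision for replicated shares only."""
    if outage.data_lost:
        # Any term, data loss: promote the CCC copy as soon as possible.
        return Decision(True, REPLICA_RPO_HOURS, "promote as soon as possible")
    if outage.planned:
        if outage.expected_hours is not None and outage.expected_hours > LONG_PLANNED_HOURS:
            # Synchronized promotion loses no data (RPO of zero).
            return Decision(True, 0, "synchronized promotion may be scheduled")
        return Decision(False, 0, "brief maintenance; no promotion")
    if outage.expected_hours is None:
        # Unpredictable term: wait, then promote if still unresolved at 8 hours.
        if outage.elapsed_hours >= LONG_UNPLANNED_HOURS:
            return Decision(True, REPLICA_RPO_HOURS, "still uncertain after 8 hours")
        return Decision(False, 0, "handle case by case; wait for now")
    if outage.expected_hours > LONG_UNPLANNED_HOURS:
        # Long term, no data loss: promote and accept the replication lag.
        return Decision(True, REPLICA_RPO_HOURS, "lengthy outage; promote")
    # Short term, no data loss: wait so that no updates are lost.
    return Decision(False, 0, "wait for the primary site to return")


if __name__ == "__main__":
    print(decide(Outage(planned=False, data_lost=False, expected_hours=2)))
    print(decide(Outage(planned=False, data_lost=False, expected_hours=None, elapsed_hours=9)))
```

Non-replicated shares are deliberately left out of the sketch: they have no secondary copy to promote, and after a destructive event they would be restored from the Offsite Disaster Recovery Copy as described above.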