Outages Affecting Shared File Services

Types of outages that could affect Shared File Services (SFS)

This article applies to: Shared File Services

Outage Types

Many types of events can cause outages at the primary site (Rhodes Hall). Some outages are short term and some long term; some are planned and some unplanned; some result in data loss at the primary site, while others only make the primary site temporarily inaccessible. Each of these outage categories is covered below. Every outage is different, so each one requires fresh decisions based on its particular circumstances.

SFS Replication

Most of what follows applies only to SFS (Shared File Services) shares that are replicated. When requesting a share, you are prompted to say yes or no to this replication option.

In this context, the word “promotion” means making the replicated copy the active share. In other words, what used to be the replication destination is turned into the “new” share.

We use asynchronous replication. This means that the replica is not continuously synchronized with the source; instead, data is mirrored from Rhodes Hall to the destination in CCC on a four-hour schedule. As a result, the replicated data can be up to four hours behind the source.
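The effect of the four-hour schedule on potential data loss can be sketched as a small calculation. This is only an illustration: the function name and the example timestamps are hypothetical, and the actual times at which replication runs are not stated in this article.

```python
from datetime import datetime, timedelta

# Interval stated in this article; the schedule's actual start times are an assumption.
REPLICATION_INTERVAL = timedelta(hours=4)

def data_loss_window(last_replication: datetime, outage_start: datetime) -> timedelta:
    """Return how much recent work would be missing from the replica.

    Changes made after the last completed replication run exist only on the
    primary, so promoting the replica at outage_start would discard them.
    """
    if outage_start < last_replication:
        raise ValueError("outage_start must not precede last_replication")
    return outage_start - last_replication

# Hypothetical example: replication finished at 08:00, outage began at 10:30.
last_sync = datetime(2024, 1, 15, 8, 0)
outage = datetime(2024, 1, 15, 10, 30)
window = data_loss_window(last_sync, outage)
print(window)                          # 2:30:00
print(window <= REPLICATION_INTERVAL)  # True
```

As long as replication runs on schedule, the window never exceeds the four-hour interval, which is why promotion is described as losing "up to" four hours of updates.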

For shares that are not replicated, recovery will require restoring from the DR (disaster recovery) copy. The time this recovery takes will vary depending on the circumstances, but it is likely to require weeks to complete.

Planned Outages

SFS routine maintenance, including code updates, and other CIT maintenance (notably network maintenance) that impacts SFS can often be performed without taking the primary site down for more than a few minutes. For this type of outage, no promotion would be done. Users are notified well in advance of these scheduled maintenance activities.

Lengthy planned outages are very rare. If we do have a lengthy planned outage (that is, more than four hours), a proactive “synchronized promotion” may be performed. Because this would be a planned promotion, it could be coordinated with SFS customers to reduce data loss to zero. These activities would be scheduled and announced in advance. The promotion and the later reversion would each involve a brief interruption to service. During such lengthy planned outages, non-replicated shares would be unavailable.

Unplanned Outages

Short Term, Primary Data Intact but Inaccessible

Because promoting a replicated share to primary status can result in the loss of recently made changes to that share, we will not promote a replicated share in this event. Rather, to preserve all updates, we will wait until the outage is resolved, restoring accessibility to the primary share's copy in Rhodes as soon as possible.

Long Term, Primary Data Intact but Inaccessible

If it is clear that the primary copy in Rhodes Hall will be inaccessible for an extended period (greater than eight hours), CIT will promote replicated shares in CCC to primary status. This will result in the loss of up to the last four hours of updates made to those shares. When the event is resolved, shares will be resynchronized and reverted to the Rhodes Hall copy at a scheduled off-peak time. This reversion will not result in additional data loss, but will involve a brief interruption of service.

Unpredictable Term, Primary Data Intact but Inaccessible

Some events could make the Rhodes Hall copy inaccessible for an unpredictable amount of time. This can happen when the root cause of an outage and the time to recovery cannot be easily ascertained. These events will be handled on a case-by-case basis, using the best information available at the time. We do not immediately promote the secondary CCC copy to primary status because doing so would result in the loss of up to the last four hours of changes. If, after eight hours, there is still significant uncertainty about how long the outage will last, we would promote the secondary CCC copies to primary status.
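The three "intact but inaccessible" cases above can be summarized as a simple decision rule. The sketch below is a rough illustration only, not CIT's actual procedure: the function and enum names are hypothetical, and real decisions are made case by case with the information available at the time.

```python
from enum import Enum
from typing import Optional

# Threshold stated in this article for a "long term" outage.
LONG_OUTAGE_THRESHOLD_HOURS = 8

class Action(Enum):
    WAIT = "wait for the primary copy in Rhodes Hall to return"
    PROMOTE = "promote the replicated copy in CCC"

def decide(elapsed_hours: float, expected_total_hours: Optional[float]) -> Action:
    """Hypothetical helper for outages where primary data is intact but inaccessible.

    expected_total_hours is None when the outage duration cannot be estimated.
    """
    if expected_total_hours is not None:
        # Known duration: promote only when it clearly exceeds the threshold,
        # because promotion discards up to four hours of updates.
        if expected_total_hours > LONG_OUTAGE_THRESHOLD_HOURS:
            return Action.PROMOTE
        return Action.WAIT
    # Unknown duration: keep waiting, then promote once eight hours have passed
    # (one interpretation of "after eight hours" in the text above).
    if elapsed_hours >= LONG_OUTAGE_THRESHOLD_HOURS:
        return Action.PROMOTE
    return Action.WAIT

print(decide(1, 2).name)     # WAIT
print(decide(1, 24).name)    # PROMOTE
print(decide(9, None).name)  # PROMOTE
```

The rule deliberately favors waiting, since waiting preserves all updates while promotion does not.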

Any Term, Data Loss

If an event in Rhodes Hall results in the destruction of data (for example, a fire), the replicated CCC copy would be promoted to primary status as soon as possible. This will result in the loss of up to the last four hours of updates. In this event, the time it takes to promote the secondary CCC copy to primary status (the Recovery Time Objective, or RTO) depends on the availability of SFS administrators and affiliated services (DNS, networking, etc.). During normal business hours, the expectation is that such a promotion would happen within approximately one hour. During non-business hours, the goal would be to have the promotion completed within four hours.

In such an event, data that is not replicated will require restoration from the Offsite Disaster Recovery Copy. This will result in the loss of up to the last 24 hours of updates, as the Offsite Disaster Recovery Copy is refreshed once per day. In this event, the time for restoration (RTO) will be measured in days to weeks.
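The data-loss figures in this section can be collected into a small lookup for comparison. This is a sketch built only from the numbers given above; the class and function names are hypothetical, and "RPO" (Recovery Point Objective) is used here as the conventional label for maximum data loss, a term this article does not itself use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecoveryObjective:
    max_data_loss_hours: int  # RPO: the most recent updates that may be lost
    recovery_time: str        # RTO, kept as text because the article gives ranges

def recovery_objective(replicated: bool, business_hours: bool) -> RecoveryObjective:
    """Return the stated objectives for a data-loss event at the primary site."""
    if replicated:
        # Replica is refreshed every four hours; promotion time depends on staffing.
        rto = "approximately 1 hour" if business_hours else "within 4 hours (goal)"
        return RecoveryObjective(4, rto)
    # Non-replicated shares are restored from the daily offsite copy.
    return RecoveryObjective(24, "days to weeks")

print(recovery_objective(replicated=True, business_hours=False))
print(recovery_objective(replicated=False, business_hours=True))
```

The comparison shows what the replication option changes: a smaller data-loss window and a far shorter recovery time.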
