Homer Alex et al. Cloud Design Patterns

Posted on July 31, 2016
@book{homer2014,
author = {Alex Homer and John Sharp and Larry Brader and Masashi Narumoto and Trent Swanson},
title = {{Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications}},
publisher = {Microsoft},
year = {2014},
isbn = {1621140369},
url = {https://msdn.microsoft.com/en-us/library/dn568099.aspx?f=255}
}

Definitions

Availability

“Availability defines the proportion of time that a system is functional and working. It will be affected by system errors, infrastructure problems, malicious attacks, and system load. It is usually measured as a percentage of uptime.” (Homer et al., 2014, p. 1)

Cache-Aside Pattern

The Cache-Aside Pattern is used to optimise repeated access to information held in a data store by loading data on demand into a cache from a data store and also handling the situations in which the data has become stale. (Homer et al., 2014, pp. 9–13)

Circuit Breaker Pattern

The Circuit Breaker Pattern is used to improve the stability and resilience of an application by handling faults that may take a variable amount of time to rectify when connecting to a remote service or resource. (Homer et al., 2014, p. 14)

Compensating Transaction Pattern

The Compensating Transaction Pattern is used to undo the work performed by a series of steps if one or more steps fail in a distributed application that relies on eventual consistency as opposed to strong transactional consistency. (Homer et al., 2014, p. 23)

Patterns

Cache-Aside Pattern

The Cache-Aside Pattern is used to optimise repeated access to information held in a data store by loading data on demand into a cache from a data store and also handling the situations in which the data has become stale. (Homer et al., 2014, pp. 9–13)

Problem

Caching data is challenging for two reasons. The first is that cached data should be kept as up to date as possible whilst minimising access to the data store. The second is that the situations that arise when the data in the cache has become stale must be detected and handled.

Solution

Apply the Cache-Aside Pattern, which consists of manually implementing a read-through and a write-through strategy.

1. Read-through: load the data into the cache on demand when it is first requested.

2. Write-through: if an application updates information, it can emulate a write-through strategy by a) making a modification to the data store, and b) invalidating the corresponding item in the cache so that the new version is retrieved the next time the data item is requested from the cache.
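The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `CacheAsideStore`, its `store_reads` counter, and the use of plain dictionaries in place of a real cache and a real data store are all hypothetical and do not come from Homer et al.

```python
from typing import Dict, Optional


class CacheAsideStore:
    """Hypothetical wrapper pairing a dict-based cache with a dict-based data store."""

    def __init__(self, data_store: Dict[str, str]) -> None:
        self.data_store = data_store         # stands in for the real data store
        self.cache: Dict[str, str] = {}      # stands in for the real cache
        self.store_reads = 0                 # counts trips to the data store

    def get(self, key: str) -> Optional[str]:
        # Read path: try the cache first, fall back to the data store on a miss.
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.data_store.get(key)
        if value is not None:
            self.cache[key] = value          # load on demand
        return value

    def update(self, key: str, value: str) -> None:
        # Write path: a) modify the data store, b) invalidate the cached item.
        self.data_store[key] = value
        self.cache.pop(key, None)


store = CacheAsideStore({"user:1": "Ada"})
first = store.get("user:1")       # miss: read from the data store
second = store.get("user:1")      # hit: served from the cache
store.update("user:1", "Grace")   # invalidates the cached entry
third = store.get("user:1")       # miss again: the new version is loaded
```

Invalidating the item on update, rather than overwriting it in the cache, means the next read always goes back to the data store for the new version.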

Further Notes

Homer et al. (2014, p. 10) mention a number of issues and trade-offs that must be taken into account when considering this pattern:

1. Cached Data Lifetime: An expiry period that is too short will result in excessive data retrieval actions. An expiry period that is too long will render the cached data stale.

2. Data Eviction Policy: Caches often need to evict data because they are smaller than the underlying data store whose access they are meant to improve. A global least-recently-used eviction policy may not always be appropriate; expiry times and eviction policies per data item type may sometimes be necessary.

3. Cache Priming: Prepopulating the cache at startup time may be useful for data items that are always expected to be accessed even if they may eventually expire or be evicted.

4. Consistency: The Cache-Aside Pattern does not guarantee consistency between the cache and the data store since the data store may be changed at any time by external processes.

5. Local (In-Memory) Caching: In a cluster scenario, each instance of an application is likely to have its own private copy of the same cached data. If the inconsistency among the various copies becomes problematic, a shared or distributed caching mechanism would be preferable, since it would propagate eviction and expiry events across the cluster.
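The lifetime and priming considerations can be illustrated with a small local cache that stores an expiry time per item. This is a sketch only: `ExpiringCache`, its `prime` method, and the injectable `clock` parameter are hypothetical names chosen for the example, and the clock is replaced with a fake one so that the behaviour does not depend on real time.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringCache:
    """Hypothetical in-memory cache with a per-item expiry period."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[str, float]] = {}  # key -> (value, expires_at)

    def put(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]             # expired: treat as a cache miss
            return None
        return value

    def prime(self, items: Dict[str, str], ttl_seconds: float) -> None:
        # Cache priming: prepopulate items that are expected to be accessed.
        for key, value in items.items():
            self.put(key, value, ttl_seconds)


now = [0.0]                                   # fake clock for the example
cache = ExpiringCache(clock=lambda: now[0])
cache.prime({"country:GB": "United Kingdom"}, ttl_seconds=3600)
cache.put("price:42", "9.99", ttl_seconds=5)  # volatile data, short lifetime
now[0] = 10.0                                 # ten seconds later
expired = cache.get("price:42")               # None: the short-lived item has expired
still_there = cache.get("country:GB")         # the primed item is still cached
```

Giving each item its own lifetime is one way of applying different expiry settings per data item type, as opposed to a single global policy.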

Circuit Breaker Pattern

The Circuit Breaker Pattern is used to improve the stability and resilience of an application by handling faults that may take a variable amount of time to rectify when connecting to a remote service or resource. (Homer et al., 2014, p. 14)

Problem

Repeated invocations from a consumer to a service provider whose dependency or resource is failing may exacerbate the underlying problem and result in cascading failures. An example is a service that times out because it connects to a database management system that is overloaded with an excessive number of complex, long-running queries.

For example, if the database management system is not responding in a timely fashion because it is running too many queries, insisting on throwing more queries at it will do little to help. It is also probable that the northbound interaction systems will try even more often to invoke the affected service in the event of failures, making the problem even worse.

Cascading failures (Homer et al., 2014, p. 14) may occur when the consumers, or intermediary systems, return an error to the service requester in reaction to an error in the dependency. There is also the potential issue of increased resource allocation (memory, threads, blocked IO handlers, transaction locks, database connections, and so on) whilst waiting for the affected service to recover. This behaviour may lead to new points of contention and failures.

Solution

Implement a proxy that monitors the health of the failure-prone operation and that is able to dynamically enable and disable it depending on its health state.

Disabling an operation means preventing it from being executed, and throwing an exception or error immediately back to the requester. (Homer et al., 2014, p. 15)

Homer et al. (2014, p. 15) suggest that the proxy may be implemented as a state machine using the following states:

• Closed: requests are routed through to the operation’s underlying logic.

• Open: requests result in an immediate exception or error.

• Half-Open: a limited number of requests are allowed in order to determine whether the fault that was previously causing the failure has been fixed.

The transitions among the above states follow this rationale:

• Closed -> Open: The maximum number of failures per time period has been exceeded, or a parallel process has detected a fault in the operation itself or in one or more of its dependencies.

• Open -> Half-Open: The circuit breaking period has elapsed and the health-check process has started.

• Half-Open -> Closed: A sufficient number of requests have proven successful after the circuit has been reestablished.

• Half-Open -> Open: A request has failed during the health check, so the fault is assumed to be still present and the circuit breaking period starts again.
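The state machine can be sketched as a small Python class. Several simplifications are assumptions of this example rather than part of the book: the names `CircuitBreaker` and `CircuitOpenError` are hypothetical, failures are counted consecutively instead of per time period, the class is not thread-safe, and the clock is injectable only so that the example can run without waiting. Following the pattern as described by Homer et al., successful trial requests in the Half-Open state close the circuit, and a failed trial request opens it again.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised immediately while the circuit is Open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, open_seconds: float = 30.0,
                 success_threshold: int = 2,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.open_seconds:
                raise CircuitOpenError("operation disabled")
            self.state = "half-open"         # Open -> Half-Open
            self._successes = 0
        try:
            result = operation()
        except Exception:
            if self.state == "half-open":
                self._trip()                 # Half-Open -> Open
            else:
                self._failures += 1
                if self._failures >= self.failure_threshold:
                    self._trip()             # Closed -> Open
            raise
        if self.state == "half-open":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "closed"        # Half-Open -> Closed
                self._failures = 0
        else:
            self._failures = 0               # a success resets the failure count
        return result


def failing() -> str:
    raise ConnectionError("database overloaded")


now = [0.0]                                  # fake clock for the example
breaker = CircuitBreaker(failure_threshold=2, open_seconds=10,
                         success_threshold=1, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
state_after_failures = breaker.state         # the circuit has been tripped
```

While the circuit is Open, `call` raises `CircuitOpenError` without invoking the operation at all, which is what relieves the pressure on the failing dependency.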

Further Notes

Homer et al. (2014, pp. 17–18) mention a number of issues and trade-offs that must be taken into account when considering this pattern:

1. Exception Handling: The requester must be prepared to handle the exception.

2. Fault Classification: The circuit breaker proxy may apply more relevant strategies according to the nature of the fault. For example, more “permanent” failures that require manual intervention, such as a disk being full, could result in the circuit remaining for longer in the Open state.

3. Logging: The circuit breaker should log failed requests so that an administrator can monitor the health of the operations that it encapsulates.

4. Recoverability: The recovery window must be chosen to strike a balance between waiting long enough for the operation to recover and avoiding unnecessary exceptions once the underlying operation has already recovered.

5. Ping Strategy: A circuit breaker can have a parallel ping process to test whether the operation has recovered rather than relying on the Half-Open state approach.

6. Manual Override: An administrator should be able to override the circuit breaker’s automated behaviour and force the underlying operation into a given state.

7. Concurrency: The behaviour of the circuit breaker in a multi-threaded environment must be sound; for example, it should not block concurrent requests.

8. Resource Differentiation: The granularity of the underlying faults must be understood to avoid generalising operations that belong to different fault classes. For example, a fault concerning a database management system may be specific to a table; in this case, putting all operations that interact with the system in the Open state might not be ideal.

9. Accelerated Circuit Breaking: The operation’s underlying fault may provide sufficient information to determine that a resolution will take a long time, so that the circuit can be tripped immediately rather than waiting for a given number of failures to be logged. A disk-full fault is one example.

10. Replaying Failed Requests: The circuit breaker may record the details of each request to a journal so that they can be replayed when the operation’s underlying resource becomes available.

11. Blocking the Circuit Breaker: If the operation that runs under a circuit breaker is blocking, a timeout setting that is too long (whether its own or that of its underlying resource) may lead to a disproportionate number of threads being tied up.

12. Local versus Remote Resource: This pattern is mainly applicable to operations that rely on an external network resource. It is not meant to substitute regular exceptions for local in-memory business logic.

Compensating Transaction Pattern

The Compensating Transaction Pattern is used to undo the work performed by a series of steps if one or more steps fail in a distributed application that relies on eventual consistency as opposed to strong transactional consistency. (Homer et al., 2014, p. 23)

Problem

In distributed applications, strong transactional consistency (in other words, transactions that abide by the ACID properties) cannot be implemented since the data is not managed by a single data store that benefits from complete control over the atomic operations that make up the transaction. Therefore, transactions rely on an eventual consistency assumption: a transaction may be inconsistent for a while, in the expectation that it will become consistent at some point in the future.

Homer et al. (2014, p. 25) use the example of a travel booking transaction involving multiple flights and hotel reservations:

1. Book a seat on flight F1 from Seattle to London
2. Book a seat on flight F2 from London to Paris
3. Book a seat on flight F3 from Paris to Seattle
4. Reserve a room at hotel H1 in London
5. Reserve a room at hotel H2 in Paris

In the above example, we can see that whilst the customer is performing step 3, the transaction is not yet consistent since the flights F1 and F2 have been booked but the flight F3 is still in progress; likewise the reservations for hotels H1 and H2 are still pending.

However, what happens if another passenger buys the last remaining seat on flight F3 and the entire booking must be aborted? There is no roll-back capability as in the case of a traditional ACID database management system.

Solution

The solution is the implementation of a compensating transaction mechanism. A compensating transaction mechanism—often called a workflow—keeps track of each “undo” or “revert” action that must be performed to cancel the changes produced by each operation in a transaction.

Unlike in a single data store, the “do” and “undo” actions may have radically different shapes, so the programmer usually needs to instrument the workflow with precise instructions about both scenarios for every step.
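A minimal Python sketch of such a workflow pairs every “do” action with its “undo” action and runs the undo actions in reverse order when a step fails. The function `run_with_compensation`, the helpers `book`, `cancel` and `sold_out`, and the in-memory `log` are hypothetical; a real workflow would call remote booking services and would have to cope with undo actions that fail themselves.

```python
from typing import Callable, List, Tuple

# Each step is (name, do action, undo action); the two may look very different.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run each step; on failure, undo the completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        _, do, _ = step
        try:
            do()
        except Exception:
            for _, _, undo_done in reversed(completed):
                undo_done()
            return False
        completed.append(step)
    return True


log: List[str] = []


def book(name: str) -> Callable[[], None]:
    return lambda: log.append(f"book {name}")


def cancel(name: str) -> Callable[[], None]:
    return lambda: log.append(f"cancel {name}")


def sold_out() -> None:
    raise RuntimeError("no seats left on F3")


ok = run_with_compensation([
    ("F1", book("F1"), cancel("F1")),
    ("F2", book("F2"), cancel("F2")),
    ("F3", sold_out, cancel("F3")),      # the last seat has just been sold
])
```

In this run the bookings for F1 and F2 succeed, F3 fails, and the workflow cancels F2 and then F1; the undo action for F3 is never called because its “do” action did not complete.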

Considerations

When a transaction is being aborted, it is also in an inconsistent state until all changes have been reverted. Therefore, a distributed transaction is inconsistent until it has either been successfully completed, or all of its atomic steps have been reverted in the event of a failure.

In general, a transaction manager (or workflow) that implements the Compensating Transaction Pattern may exhibit the following states:

• Started: The transaction manager is waiting for a first action to be performed.

• In Progress: The first action has been performed.

• Rollback In Progress: The transaction has failed or aborted and the transaction manager is in the process of undoing the previously performed actions.

• Failed In Progress: The next action could not be executed, so a retry or human assistance may be required, but the overall transaction has not been aborted.

• Failed During Rollback: An action could not be undone, so the transaction could not be rolled back successfully.

• Completed: All actions that make up the transaction have been performed.

• Rollback Completed: All rollback actions have been performed.
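The states listed above can be tracked with an enumeration inside a small transaction manager. The sketch below is one possible reading of the list, not a prescribed design: `TxState`, `TransactionManager`, and the rules for when each state is entered are assumptions made for the example.

```python
from enum import Enum
from typing import Callable, List


class TxState(Enum):
    STARTED = "Started"
    IN_PROGRESS = "In Progress"
    ROLLBACK_IN_PROGRESS = "Rollback In Progress"
    FAILED_IN_PROGRESS = "Failed In Progress"
    FAILED_DURING_ROLLBACK = "Failed During Rollback"
    COMPLETED = "Completed"
    ROLLBACK_COMPLETED = "Rollback Completed"


class TransactionManager:
    """Hypothetical manager that records undo actions and its current state."""

    def __init__(self) -> None:
        self.state = TxState.STARTED
        self._undo_stack: List[Callable[[], None]] = []

    def perform(self, do: Callable[[], None], undo: Callable[[], None]) -> bool:
        try:
            do()
        except Exception:
            # Not aborted yet: the caller may retry or ask for assistance.
            self.state = TxState.FAILED_IN_PROGRESS
            return False
        self._undo_stack.append(undo)
        self.state = TxState.IN_PROGRESS
        return True

    def complete(self) -> None:
        self.state = TxState.COMPLETED

    def rollback(self) -> None:
        self.state = TxState.ROLLBACK_IN_PROGRESS
        while self._undo_stack:
            undo = self._undo_stack[-1]
            try:
                undo()
            except Exception:
                # The failed undo stays on the stack so it can be retried.
                self.state = TxState.FAILED_DURING_ROLLBACK
                return
            self._undo_stack.pop()
        self.state = TxState.ROLLBACK_COMPLETED


def fail() -> None:
    raise RuntimeError("step could not be executed")


manager = TransactionManager()
manager.perform(lambda: None, lambda: None)   # first action performed
manager.perform(fail, lambda: None)           # next action fails
state_after_failure = manager.state
manager.rollback()                            # undo the first action
state_after_rollback = manager.state
```

Keeping a failed undo action on the stack leaves the manager in a position to resume the rollback later, which matters because the transaction remains inconsistent until every change has been reverted.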

Homer, A., Sharp, J., Brader, L., Narumoto, M., Swanson, T., 2014. Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications. Microsoft.