purebill.com

Stephen Jones writing on billing and application migration


Column - 29 August 2008

Applications fail - design for ease of recovery

Summary

Infrastructure designs allow for hardware failure by employing redundancy, failover and a range of other techniques. The software equivalent is to bake 'ease of recovery' into the application's initial design. When a software problem affects the application's core data and related tasks, a design that helps the support staff limit the problem's scope, identify its impact and resume normal processing will pay off.

Applications range from the transactional (websites) through to batch-driven processing (billing), so a software failure will look different in each context and the scope of its impact will vary: from a single account in error being placed in suspense through to a core business application halted until the problem is resolved.
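As an illustration of the milder end of that range, here is a minimal Python sketch of a batch run that places a failing account in suspense rather than halting. The names run_batch, BatchResult and demo_bill, the account ids and the error text are all hypothetical and are not taken from any particular billing system.

```python
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    processed: list = field(default_factory=list)
    suspended: dict = field(default_factory=dict)  # account id -> error text


def run_batch(accounts, bill_account):
    """Bill each account; park failures in suspense instead of halting the run."""
    result = BatchResult()
    for account_id in accounts:
        try:
            bill_account(account_id)
        except Exception as exc:  # broad on purpose: isolate any per-account failure
            result.suspended[account_id] = str(exc)
        else:
            result.processed.append(account_id)
    return result


def demo_bill(account_id):
    # Hypothetical billing step: account "A2" carries bad data.
    if account_id == "A2":
        raise ValueError("missing rate plan")


result = run_batch(["A1", "A2", "A3"], demo_bill)
```

The suspended entries record which accounts failed and why, so support staff can see the scope of the problem and reprocess only those accounts once a fix is in place.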

Failure at each step in the processing chain needs to be considered for its impact on other work being performed, and for how its recovery would be carried out. Any recovery path that requires the entire application to be halted needs special design attention: it points to a core application process, and once the fix has started, no business activity can be performed until it completes.

Questions that can guide a recovery design review include:

  • If a failure occurs, how long before it is detected?
  • How far downstream will the data have gone? How will the scope of the failure's impact be identified to external groups?
  • Once a problem has been identified, how does the design stop the identified problem from getting larger?
  • Can the support staff easily stop or disable processing to minimise a problem's impact? (Can they restart processing easily once a fix has been deployed?)
  • Does the design allow individual fixes to be addressed without stopping the application's other processing streams?
  • Are concurrent streams performing the same work isolated from the stream being fixed? Will the planned fix impact their processing?
  • What tools are required to perform expected fixes? Are these common and well understood?
  • Is the data required to perform the fix still available if a problem is identified after a short delay?
  • What is the expected recovery time? What are the data volume assumptions around that timeline?
  • Are external support teams required, or can the application's support staff address all but the worst problems themselves?
  • Is the recovery nuanced, or is it always roll-back (with its delay), fix and roll-forward?
  • Do the answers change by day-of-week, day-of-month or when specific processing is being performed?
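
The questions about stopping, restarting and isolating processing streams can be made concrete with a small Python sketch. It assumes a hypothetical per-stream switch (StreamControl) that support staff can flip; the stream names and work items are invented for illustration, and a real application would persist both the switch state and the queues.

```python
class StreamControl:
    """Hypothetical per-stream on/off switch for support staff."""

    def __init__(self, streams):
        self._enabled = {name: True for name in streams}

    def disable(self, name):
        self._enabled[name] = False

    def enable(self, name):
        self._enabled[name] = True

    def is_enabled(self, name):
        return self._enabled[name]


def process_cycle(control, work_by_stream, handler):
    """Run one cycle; disabled streams keep their work queued for later."""
    done = []
    for name, queue in work_by_stream.items():
        if not control.is_enabled(name):
            continue  # leave queued items untouched until the stream is re-enabled
        while queue:
            item = queue.pop(0)
            handler(name, item)
            done.append((name, item))
    return done


control = StreamControl(["invoices", "payments"])
work = {"invoices": ["inv-1", "inv-2"], "payments": ["pay-1"]}

control.disable("invoices")  # a problem is found in the invoice stream
first = process_cycle(control, work, lambda stream, item: None)
queued_while_disabled = list(work["invoices"])

control.enable("invoices")  # the fix has been deployed
second = process_cycle(control, work, lambda stream, item: None)
```

While the invoice stream is disabled, the payment stream carries on and the invoice work stays queued; once the stream is re-enabled, the next cycle picks the held work up without a roll-back.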

Modest recovery times, achieved by the application's own support staff using tools they employ on a daily basis, suggest that the application's design supports the recovery goal well.

It is better to consider the recovery approach in a calm, measured way up front, when the application's design can still be changed and tested if required, than to do the thinking with the application broken, crucial data lost and possibly unrecoverable, and the business owner asking questions...

Note: This column was first posted by me on the 97 Things Every Software Architect Should Know website under a Creative Commons Attribution 3 license.



Comments welcome: stephenjones(at)purebill.com. Stephen Jones © 2004-2010.