Kaiser Permanente




Building High Reliability Organizations:

Anticipating Failure to Increase Reliability - Creating an HRO Culture

High reliability organizations (HROs) are organizations that succeed in avoiding catastrophe in environments where accidents might be expected to occur due to the combination of complex processes and risk factors.  Large healthcare providers frequently find themselves in such an environment.  What is the human cost of a thousand "wrong" prescriptions being mailed out in a single day because of a corrupted database?  What is the cost when three hundred patients' medical records suddenly disappear because a server goes down, and patients die as a result?  These were some of the very real concerns facing the IT department of a major healthcare provider as it updated its systems and processes and moved to electronic record keeping.

The Challenge

When one of FCG's clients, a large U.S. healthcare organization with more than 37 medical centers and over 400 office buildings, implemented an electronic medical records system a few years ago, its Information Technology (IT) department moved from the relative safety of the "back office" into a high-stakes environment - the operating room; the exam room; the pharmacy; even members' homes.  Huge investments in IT applications were taking place.  The department's new centralization resulted in a major shift in power and geography.  IT employees were suddenly faced with unknowable complexity and the potential for catastrophic occurrences - events that could impact patient safety, interrupt the business and result in breaches of security.

If systems failed, a database was corrupted or the network crashed, patients could "disappear," along with their digital medical records.  Without traditional paper files, x-rays and test results, doctors would be operating "blind."  While the new technology promised huge advances from an administrative and clinical perspective, the last thing a doctor prepping for surgery wanted to worry about was whether the latest version of the software had been adequately stress-tested before going into production.  Conversely, when IT did its job, magic was possible.  X-rays and MRIs could be read by experts anywhere in the country.  When a physician entered a patient's symptoms into a laptop, best practice guidelines for medical treatment popped up on the screen.  A patient's entire medical history traveled with them anywhere they went - care was no longer compromised by a lack of critical, up-to-date information.  The situation presented IT leadership with a challenge - how best to hold employees to a new level of accountability for doing the right thing, at the right time and in the right way, when there was so much at stake.

The Solution

It quickly became clear that simply improving how IT did things was too small an intervention.  Changing what IT did - altering the structure, technology, processes and/or practices - was a more attractive option; for example, reducing system variation 15% a year, building a new software testing lab or staffing up the risk management organization.  However, this approach, too, fell short of the magnitude of change that was required.  Incremental and significant changes of this kind contributed to overall patient safety and well-being, but they weren't aggressive enough given the time constraints and pressures IT faced.  The harsh reality was that IT had to change who they were and completely reinvent themselves.

FCG helped the organization frame the issue strategically, assess the current IT culture, and develop a robust understanding of the target culture - an HRO culture of reliability.  With FCG's help, the organization set out to build the operational effectiveness (systems, processes and execution) required to support an entirely new level of reliability performance.  FCG designed cultural "immersion sessions" for all 5,500 IT employees, to begin building the desired culture and help employees understand the profound shift that would be required of them, while deeply connecting them to the delivery of health care.

FCG conducted in-depth interviews with IT employees to discover what they believed about their work, the organization itself, and the relationship between behaviors on the one hand, and results and consequences on the other.  The findings came as a surprise to top management.  IT employees already believed that their work had a direct impact on people's lives and health; on alleviating suffering; and on increasing quality of life for the company's patients and members.  In many cases, they were willing to expend a tremendous amount of effort and attention to make that impact.  In their view, however, the organization's structure, policies and processes often got in the way of the positive impact they wanted to make through their work.  Many felt frustrated and disillusioned.

It became clear that calling out the connection to health outcomes was not going to materially change the accountability dynamics.  It would not give the organization new insight into the high-stakes environment created by moving to an electronic medical records system, or let individuals and workgroups know what they needed to do differently.  What was required was to come together around a new way of thinking and doing things; a way that would rally the culture around achieving improved performance and results.  Performing successfully as an organization would allow IT employees to deliver high-quality, highly reliable services, resulting in better healthcare outcomes and reducing the number of potential failures and errors.

The desired results for the large scale change effort were threefold.  First, increase systems availability from 93% to a sustained 99.7%.  Second, reinforce the commitment of all 5,500 IT employees in their contribution to and responsibility for delivering high quality health care.  And last, dramatically improve both the quality and reliability of IT products and services.
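To put the first target in perspective, availability percentages can be converted into hours of downtime per year.  The sketch below is illustrative only and is not taken from the engagement: it assumes a service expected to run 24 hours a day over a 365-day year, and the function name annual_downtime_hours is hypothetical.  Under those assumptions, 93% availability corresponds to roughly 613 hours of downtime a year, while 99.7% corresponds to roughly 26 hours.

```python
# Assumption: the service is expected to be up 24 hours a day, 365 days a year.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_pct: float) -> float:
    """Return the hours per year a service is down at a given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    # Baseline and target availability figures quoted in the text.
    for pct in (93.0, 99.7):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% available: about {hours:.1f} hours of downtime per year")
```

How the organization actually measured availability (which systems counted, and whether planned maintenance was excluded) is not stated here, so these figures should be read as an order-of-magnitude illustration.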

Four workstreams were undertaken to achieve the desired results.  The first was to engage, educate and enroll all IT employees in high reliability principles through experiential sessions.  The second was to craft a high reliability maturity model, as well as a high reliability maturity assessment instrument, to develop a baseline for the current HRO maturity level within IT.  The third workstream involved building "video" case studies highlighting IT service disruptions that could be used in high reliability Immersion and Introduction sessions.  And the fourth workstream required that the organization develop an internal capability to deliver these sessions, as well as build case studies for ongoing learning purposes.

In designing a simulation that would make the truth of the present situation more visceral for IT employees, FCG needed to create an environment of increasing complexity, with the potential for catastrophic failure, and with systems that were "tightly-coupled" (more interdependencies, more coordination, more information flow).  The solution involved using dominos - as many as 10,000 in one session - to create complicated patterns in four separate regions which ultimately linked to giant dominos in the center of the room-sized grid.  The goal?  In groups of 40-50, build out the system to serve as many patients as possible, as reliably as possible.  This translated as "keep ALL the dominos standing," including a 70-pound red domino representing 1,000 patients.  The simulation not only illustrated the critical importance of operational effectiveness and reliability, but also helped employees discover key approaches they could adopt to improve in these areas.

The Domino Simulation used red and white dominos to simulate "technology" and "adverse medical events."  The simulation mirrored the history of IT's growth, and used role specificity and "stop action" debriefs to reinforce the five hallmarks of an HRO.  It was also an exercise in growing mindfulness.

The Outcome

The organization has had numerous HRO success stories.  High reliability principles were applied to a successful second data center power upgrade after the initial effort went horribly wrong, creating a four-day, system-wide outage that endangered patient safety, disrupted business operations and seriously damaged equipment.  High reliability principles and concepts were followed in the build of a new Data Center, eliminating embedded design flaws present in previous efforts.  HRO principles and concepts were also instrumental in formulating a new Disaster Recovery Plan and acquiring the requisite system capabilities.  Systems availability increased from 93% to a sustained 99.9%, which in turn allowed the electronic medical records system to be fully implemented and begin generating its anticipated ROI.

The HRO work continued.  Senior management followed through on their goal of having all IT employees participate in either the half-day Introduction session or the full-day Immersion session.  The capability to deliver these sessions internally was fully realized in 2007, and all HRO sessions have been delivered by internal resources since then.  With this level of emphasis, the IT organization found itself rapidly moving up the HRO Maturity Model scale - from Level One (Surviving multiple failures) to Level Two (Building containment processes and avoiding a catastrophic event), and on its way to reaching Level Three (Organizational Stabilization).

What is the cost of losing a data center that thousands of employees and doctors depend on 24 hours a day to do their jobs, and that millions of members stake their lives on?  Developing an HRO culture does not guarantee that a catastrophic event such as this will never occur.  However, when every employee takes personal responsibility for their own actions and is constantly vigilant for the miscues of others, the odds of failure decrease significantly.  When something does go wrong, employees are prepared to act, not react.  They move quickly and efficiently to correct what's gone wrong, as opposed to making things worse.  When everyone in the organization thinks and acts on the premise that the system is endangered until there is conclusive proof that it is not, the risk of a catastrophic event is significantly reduced.