Death March Project

A software project aptly characterized by the book DeathMarch.


INTRODUCTION

IT projects with a greater than 50% chance of failure are commonly referred to as "Death March" projects. Although not openly discussed in the past, Death March (DM) projects are now a frequent topic around the water cooler, on Internet blogs, and even in professional journals and periodicals. Discussing and acknowledging DMs is an important part of learning about them and provides a variety of benefits: an understanding of how they occur (i.e., what causes them), how best to manage them, strategies for surviving them, and even therapeutic value for those who actually lived through one. This paper chronicles a real-life death march project and examines the reasons why it occurred, in hopes that others may learn from the experience.

BACKGROUND

In the early 1990s the American government engineered and built the world's largest Differential GPS (DGPS) system to provide 100% DGPS coverage along the coastal United States, including Puerto Rico, Hawaii, Alaska, and the Mississippi River. The initial system included 50+ remote broadcast sites controlled by two control station systems running a custom software application written in the Ada programming language. The system provided only a limited ability to control the broadcast sites, and did not produce data or reports for analysis and system management. By the mid-90s, political pressure from farmers in the Midwest produced a congressional mandate and funding to expand the DGPS system to provide free differential GPS nationwide. Meeting this mandate required an additional 100+ remote broadcast sites, more than the existing control station could manage. Consequently, phase one of the nationwide expansion involved re-engineering the control station system to support the control and management of 200 remote broadcast sites, and was scheduled for completion in 24 months. Phase two was scheduled to begin immediately after the new control station was completed, and involved building 60+ new remote broadcast sites over a 60-month period.

THE BEGINNING

The center hired to re-engineer the DGPS control station used a combination of SpiralModel development, ObjectOrientedDesign (OOD), and the C++ language, mixed with traditional engineering principles. At the time they did not have a complete SystemsDevelopmentLifeCycle (SDLC) process to guide them. As a result, several key steps were overlooked at the beginning of development.

Development began with a small team of mechanical and electrical engineers extracting FunctionalRequirements from the existing system. After analyzing the existing system, they met with the customers and end users to gather the additional functionality required to support system expansion. One of the team members then entered the complete set of requirements into the Dynamic Object-Oriented Requirements System (DOORS), and the lead engineer began designing a system using OOD. While the system was being designed in-house, a small group of four contractors was hired to begin building the communications components of the application.

The engineering team failed to work through the concept phase of the SDLC process, which contains three key elements: a BusinessCase, a ProjectManagement Plan, and a Funding Plan. These three elements create the solid foundation upon which the rest of a project is built. Without them, the future of the project was in jeopardy from the start.

The first stage of the project lasted two years and involved four lead ProjectManagers (PMs). This was also problematic, because none of the PMs worked on the project long enough to completely understand the full scope of the work or the level of effort required to bring it to fruition. As a result, the marketing staff overcommitted the project to the customers, agreed to delivery dates without a project plan, and no one requested a funding increase to move from what should have been labeled the concept phase into full-scale development.

THE MIDDLE

The middle of the project involved requirements validation, design, development, and testing. The fourth and final PM assigned to the project was a recent college graduate with a degree in computer science. Within two months of working on the project, he recognized the project was in trouble and immediately began taking action to correct the situation.

The first step taken by the new PM was to meet with the project's customers and validate the core system requirements and functionality. Afterward, he reviewed the system design documentation and created a business case supported by a project management plan and a funding plan. These documents were presented to the Executive Steering Committee at the engineering center and used to request additional funding or an extension of the time allotted for development. Since the project was working against a congressionally mandated deadline for beginning operation, the steering committee was forced to triple the project's budget over the next four years.

With a revised budget in place, the PM immediately hired three new programmers, a database administrator, a technical writer, and an administrative assistant. The software development team now comprised 10 people, and an additional 15 engineers were assigned to a hardware and networking group. Together the entire team of 25 would spend the next two years working 60-80 hour weeks to bring the system to completion.

The first version of the application was finished in 12 months using the original OOD plan created by the first PM. During that first year, team members enjoyed 40-hour work weeks, vacation time with family, and a relaxed work environment. Unfortunately, it wasn't until the first application was completed that the team realized it contained a critical flaw. Although the system functioned correctly, it performed at only 20% of the required 200-site minimum, which meant the new control station could manage only 50% of the existing broadcast sites. This was completely unacceptable.

The team immediately went to work identifying BottleNecks in the system. In the end, the flaw was so fundamental that the whole application had to be redesigned. Once a new design was developed and several concepts were tested, the team began working in earnest on the second version of the application. The second year of application development was nothing like the first. Members routinely worked 80-100 hour weeks; weekends and vacation time were consumed by work; and as a result several junior members quit and left the project.

The second application took 10 months to develop, and when finished it supported 60% of the minimum 200-site requirement. Although this was 40% below the minimum, the DGPS system only operated 80 remote broadcast sites at the time, and the new control station supported expansion to 120 sites, which was an acceptable level to the customer. Consequently, the control station was fielded, and engineers could start building the new remote broadcast sites. This marked the beginning of phase two of the DGPS system expansion, and a Milestone for the project.

THE END

Although the first version of the control station was fielded, the pressure was not yet off the software development team, because the system operated at only 60% of the minimum performance requirement and contained less than 45% of the core system functionality. So the team continued to work 60-80 hour weeks for the next year, until it was able to deliver a system meeting the performance requirements with all of the core functionality.

The control station was unique in that it was required to receive, decipher, and interpret up to 2,000 text messages every minute from up to 200 remote locations on a 24 x 7 x 365 basis with 99.7% system availability. The system was also required to provide a real-time system status display on an interactive GIS map with built-in control functionality. The control functionality created and sent messages back to the remote locations at burst rates nearly matching the incoming data rates. And as if that were not enough, the customer wanted to run performance reports using the same system hardware and LAN, as well as mirror data between three locations to allow control of the system to shift between two coasts and an emergency backup system.
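To put those numbers in perspective, 2,000 messages per minute is roughly 33 messages per second, and 99.7% availability permits only about 26 hours of downtime per year. Below is a minimal C++ sketch of the kind of multithreaded ingest loop such a load implies, a thread-safe queue feeding worker threads; the names (SiteMessage, MessageQueue) and the structure are illustrative assumptions, not the actual control station design.

  // Illustrative sketch only: a thread-safe queue feeding worker threads,
  // sized against the stated load of roughly 2,000 messages per minute
  // (~33 per second) from up to 200 remote sites. Names and structure are
  // assumptions for illustration, not the actual DGPS control station code.
  #include <condition_variable>
  #include <iostream>
  #include <mutex>
  #include <queue>
  #include <string>
  #include <thread>
  #include <vector>

  struct SiteMessage {
      int         siteId;   // 1..200 remote broadcast sites
      std::string payload;  // raw text message to decipher and interpret
  };

  class MessageQueue {
  public:
      void push(SiteMessage msg) {
          std::lock_guard<std::mutex> lock(mutex_);
          queue_.push(std::move(msg));
          cv_.notify_one();
      }
      // Blocks until a message is available.
      SiteMessage pop() {
          std::unique_lock<std::mutex> lock(mutex_);
          cv_.wait(lock, [this] { return !queue_.empty(); });
          SiteMessage msg = std::move(queue_.front());
          queue_.pop();
          return msg;
      }
  private:
      std::queue<SiteMessage> queue_;
      std::mutex              mutex_;
      std::condition_variable cv_;
  };

  int main() {
      MessageQueue inbound;

      // Worker threads decode and act on messages concurrently; the real
      // system also had to build and send control messages back to the
      // remote sites at comparable burst rates.
      std::vector<std::thread> workers;
      for (int i = 0; i < 4; ++i) {
          workers.emplace_back([&inbound] {
              for (;;) {
                  SiteMessage msg = inbound.pop();
                  // Decipher, interpret, update the status display, log, ...
                  std::cout << "site " << msg.siteId << ": " << msg.payload << '\n';
              }
          });
      }

      // Stand-in for the receivers: one status message from each site.
      for (int i = 0; i < 200; ++i) {
          inbound.push({i + 1, "pseudorange correction status"});
      }

      // The workers loop forever in this sketch, so join() never returns;
      // a real system would add an orderly shutdown path.
      for (auto& w : workers) {
          w.join();
      }
      return 0;
  }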

These extreme system constraints posed a significant challenge to Commercial Off The Shelf (COTS) products, which became a limiting factor in nearly all of the major system functionality. In many cases, the team had to build custom applications either to work around bugs in COTS products or to eliminate the COTS products from the control station system entirely.

For example, version control became problematic early on because the team was building a multithreaded application to run on multiprocessor hardware using concurrent development. Managing the application code simultaneously among four developers was difficult, and none of the products met all of their needs. The final solution adopted by the team was to document work-arounds that became Standard Operating Practices everyone was required to follow religiously.

This makes no sense to me. What does version control / configuration control have to do with multithreaded software? Issues of thread safety and concurrency, or the lack of them, are totally different from version control. Managing revisions to code contributed by four developers is difficult? Has nobody heard of CVS, Perforce, ClearCase, Subversion, etc.? Sounds like a junior team with no experienced staff.

Reporting software also became a challenge while developing the second version of the system, because products either lacked what the project considered basic functionality or contained so many bugs that they were not functional on the system. After several months of trying numerous products, the programmers on the team knew more details about two of the most popular products on the market than the products' technical support staff did. In fact, the team stopped reporting bugs they found in the software because the COTS providers couldn't or wouldn't fix them. In response to this lack of support, the programmers built custom applications to overcome the COTS limitations.

The second version of the control station was fielded four years after the original requirements gathering was begun by the small team of engineers. It passed all performance tests and contained 100% of the core system functionality. The effort required to bring the entire project to fruition included:

     * Over 218,800 man hours of work;
     * $2.9M in contracted labor costs;
     * 22 COTS products;
     * 7 Custom software applications;
     * More than 15,000 lines of programming code; 
     * And over $5M in hardware costs.

In the end, the customer was delighted with the results, and the government's DGPS system began providing free service nationwide in 2004.

The numbers cited here are bizarre: 218,800 man-hours of work is 27,350 man-days. With $2.9 million in labor costs, that's a billing rate of only $106 a day! The absolute cheapest legitimate India offshore contractors are $300/day (in 2007 dollars), so something is really wrong here! Plus, 15,000 lines of code works out to 0.55 LOC/day, which is possible, and LOC/day isn't a great metric, but it still indicates that something is off.

CONCLUSION

There were several contributing factors that led to this death march project. The first was the failure of the initial development team to follow a detailed SystemsDevelopmentLifeCycle plan. Failure to follow an SDLC allowed project development to begin and progress without a project plan or a fully developed budget. This created a situation where the project was overcommitted and woefully underfunded.

Problems were exacerbated by the fact that four lead PMs were assigned to the project within the first two years of development. As a result, only the last PM assigned was able to comprehend the full scope and effort required to develop the system. None of the other PMs communicated with the marketing manager, who overpromised the project and established delivery deadlines with the customer without discussing them with the PM. This caused a wide gap in expectations between the customer and the development team.

In the rush to get the project going, the fourth PM failed to recognize how performance would be affected by the software architecture and design. Despite having just completed a degree in computer science, nothing he was taught in school prepared him to evaluate system performance from design diagrams. This led to a complete re-engineering of the system design and an additional year of development and testing.

Finally, the limitations of COTS products were not anticipated during project development. Here again, the problem was due to the inexperience of several team members and a lack of understanding of how performance would be affected by running multiple COTS products on the same hardware platforms as the main control software. This led to the creation of five custom applications required to overcome COTS limitations, and contributed an additional four months of development time on the second system version.

In the end, it was the strength and conviction of all members of the team that produced a happy ending to this story. Without their naïve commitment, boundless energy, and selfless efforts this project would never have survived. The Nationwide DGPS system is the largest system of its kind in the world, and it officially began providing free service across the continental U.S. in 2004, only 84 months after the project began, and on schedule.

FrankKlucznik

I'm afraid your story is almost commonplace, except that Congress is not usually involved. I've worked on a number of projects that were oversold, underestimated, understaffed, underfunded, and poorly managed, and where people were overworked.

