Is correlation among failures so strong that a system should explicitly consider such effects?
Here we investigate how strong failure correlation is for PlanetLab nodes and for the collection of web servers. Because correlated failures tend to be rare, a long trace is needed to observe them; we therefore do not study DNS_trace because of its short duration. We are interested in the distribution of the number of near-simultaneous failures. In each probe interval, we determine this number by counting the nodes that are unavailable in that interval but were available in the previous one. Figures 7 and 8 plot the PDF of the number of near-simultaneous failures in PL_trace and WS_trace, respectively. We observe that large-scale correlated failures do happen: PL_trace shows an event in which 58 nodes failed near-simultaneously, while WS_trace has a failure event involving 42 web servers. Both graphs also show fits of the beta-binomial distribution (BBD) and the geometric distribution to the measured data. BBD is used in [11] to model correlated failures in