
Is correlation among failures so strong that a system should explicitly consider such effects?

April 26, 2017

Tags: correlation, explicitly, failures, strong, system


1 Answer


Here we investigate how strong failure correlation is among PlanetLab nodes and among the collection of web servers. Because correlated failures tend to be rare, a long trace is needed to observe them, so we do not study DNS_trace, whose duration is too short. We are interested in the distribution of the number of near-simultaneous failures. In each probe interval, we determine this number by counting the nodes that are unavailable in that interval but were available in the previous one. Figures 7 and 8 plot the PDF of the number of near-simultaneous failures in PL_trace and WS_trace, respectively. We observe that large-scale correlated failures do happen: PL_trace contains an event in which 58 nodes failed near-simultaneously, and WS_trace contains a failure event involving 42 web servers. Both graphs also show fits of the beta-binomial distribution (BBD) and the geometric distribution to the measured data. BBD is used in [11] to model correlated failures in
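
The counting rule described in the answer (a node counts as a near-simultaneous failure if it is unavailable in the current probe interval but was available in the previous one) can be illustrated with a minimal Python sketch. The trace layout (one list of booleans per probe interval), the function names, and the toy data are assumptions made for illustration; they are not taken from PL_trace, WS_trace, or the original study.

```python
from collections import Counter


def near_simultaneous_failures(availability):
    """Count new failures in each probe interval.

    availability[t][i] is True if node i was reachable in interval t
    (a hypothetical layout, not the format of the original traces).
    A node is counted in interval t only if it was up in t-1 and is down in t.
    """
    counts = []
    for prev, curr in zip(availability, availability[1:]):
        counts.append(
            sum(1 for was_up, is_up in zip(prev, curr) if was_up and not is_up)
        )
    return counts


def empirical_pmf(counts):
    """Fraction of probe intervals that saw exactly k near-simultaneous failures."""
    total = len(counts)
    freq = Counter(counts)
    return {k: freq[k] / total for k in sorted(freq)}


# Toy trace: 4 probe intervals, 4 nodes (made-up data).
trace = [
    [True, True, True, True],
    [True, False, False, True],   # nodes 1 and 2 fail together
    [True, False, True, True],    # node 1 is still down, so it is not counted again
    [False, False, True, False],  # nodes 0 and 3 fail together
]

counts = near_simultaneous_failures(trace)
pmf = empirical_pmf(counts)
print(counts)
print(pmf)
```

A node that stays down across several intervals is counted only once, in the interval where it first becomes unavailable, which is what keeps long outages from inflating the failure counts.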