New this year!

This year, DSN will host a renovated tutorial track featuring the following exciting changes:

  • Multi-day tutorial schedule: tutorials will be held throughout the conference schedule, giving attendees more opportunities to interact with leaders in different areas of system dependability. The program includes 6 exciting tutorials with substantial hands-on activities, 5 of which have a strong industry presence.
  • Flat Tutorial Pass: registered tutorial attendees will have access to up to three tutorials held during the workshop and conference days.
  • Tutorial Certificates: Tutorial attendees will receive a DSN Certificate on Advanced Training in System Dependability after completing at least one tutorial. 
Tutorials Chairs
Catello Di Martino (catello.di_martino [at], Bell Labs - Alcatel Lucent, US
Roberto Baldoni (baldoni [at], Sapienza Univ. of Rome, IT

Tutorial 1: Common Safety Method for Risk Evaluation and Assessment (CSM-RA) and Hazard Analysis Tutorial 
Nuno Silva and Francisco Moreira (Critical Software)
Date: June 28, duration: 4h 

Safety systems must avoid accidents. This is covered by application standards, processes, techniques and tools that support the identification, analysis, and elimination of system risks and hazards, or their reduction to an acceptable level. Ideally, a safety system should be free of hazards. However, both industry and academia have struggled to ensure appropriate risk and hazard analysis, especially regarding completeness of the hazards, formalization, and analysis timely enough to influence the specifications and the implementation. This tutorial will provide insights into the fundamentals of CSM-RA, complemented with Hazard Analysis, and into when and how to apply them. The relationship and similarities of these processes to industry standards and system life cycles will be highlighted, and a dedicated hands-on session will guide attendees through several example cases applying CSM-RA in the railway domain, including the identification and management of hazards related to the system or to proposed system changes.

You can download the flyer, including the abstract and bios, here

Tutorial 2: Reliability and Availability Modeling in Practice 
Kishor Trivedi (Duke University) and Andrea Bobbio (Università del Piemonte Orientale)
Date: June 28, duration: 4h

Introductory Video

The spread of IT into every area of human activity requires highly dependable digital systems and calls for accurate modeling techniques. In this tutorial we will present methods used in practice for reliability, availability, performability and survivability modeling and analysis of systems. Non-state-space solution methods are often used to solve reliability block diagrams, fault trees and reliability graphs. Relatively efficient algorithms are known that handle systems with hundreds of components, and they have been implemented in many software packages. We will show the use of these model types through practical examples and via the software package SHARPE. Nevertheless, many practical problems cannot be handled by such algorithms. Bounding algorithms are then used in such cases, as was done for a major subsystem of the Boeing 787.
Non-state-space methods derive their efficiency from the independence assumption, which is often violated in practice. State-space methods based on Markov chains, stochastic Petri nets, semi-Markov and Markov regenerative processes can be used to capture various kinds of dependencies among system components. Markov models, Markov reward models and stochastic Petri nets will be illustrated through practical problems using the SHARPE software package. However, the resulting state-space explosion severely restricts the size of the problem that can be solved. Hierarchical and fixed-point iterative methods provide a scalable alternative that combines the strengths of state-space and non-state-space methods, and they have been used extensively to solve real-life problems. The use of hierarchical and fixed-point iterative methods will also be illustrated via large system examples and the SHARPE software package.
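
As a rough illustration of the two model families mentioned above, the following minimal Python sketch evaluates a small reliability block diagram under the independence assumption and computes the steady-state availability of a single component modeled as a two-state (up/down) Markov chain. It is not SHARPE code and is not taken from the tutorial material; the function names and the numeric values are hypothetical and chosen only for the example.

```python
from math import prod


def series_reliability(reliabilities):
    """Series block: works only if every (independent) component works."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """Parallel block: fails only if every (independent) component fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


def two_state_availability(failure_rate, repair_rate):
    """Steady-state availability of an up/down Markov chain: mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)


# Hypothetical system: one component (0.99) in series with a redundant pair (0.9 each).
system = series_reliability([0.99, parallel_reliability([0.9, 0.9])])
print(f"system reliability: {system:.4f}")

# Hypothetical rates: 1 failure and 9 repairs per unit of time.
print(f"availability: {two_state_availability(1.0, 9.0):.4f}")
```

The block-diagram functions are valid only when component failures are independent; capturing dependencies such as shared repair crews is what motivates the state-space models described above.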

You can download the flyer, including the abstract and bios, here

Tutorial 3: Activating Protection and Exercising Recovery Against Large-Scale Outages on the Cloud
HariGovind Ramasamy, Long Wang, Richard Harper and Ruchi Mahindru (IBM Research)
Date: June 28, duration: 6h

Introductory Video

The tutorial is designed to be hands-on and will be organized as a full-day activity. First, we will introduce terminology, theory, concepts, and metrics for providing resiliency on a cloud platform. We will catalog the factors that make building resilient applications on the cloud easy in some cases and particularly complicated in others. The bulk of the tutorial will consist of a series of hands-on exercises in which attendees will access a pre-created cloud virtual infrastructure and applications, activate protection against outages at multiple levels of the cloud stack, orchestrate the recovery procedure for a simulated site-level outage, and orchestrate failback to the primary site (simulating the reconstruction of the primary site). The hands-on exercises will be tailored to give audience members a strong grasp of the practical challenges involved in cloud resiliency, e.g., determining recovery priorities based on business criticality, recovery groups, and coordinated recovery across multiple virtual machines constituting a business application. Through the exercises, we will reinforce core design principles and design elements for building resilient cloud applications. We will recap with a survey of commercial and academic solutions and conclude with emerging areas (e.g., container-based resiliency) and future research challenges in cloud resiliency.

You can download the flyer, including the abstract and bios, here

Tutorial 4: Measuring Resiliency through Field Data: Techniques, Tools and Challenges 
Antonio Pecchia (Critiware), Marcello Cinque (University of Naples Federico II) and Veena Mendiratta (Bell Labs – Nokia) 
Date: June 29, duration: 4h 30min

Data collected under real workload conditions can provide troves of valuable information about the stresses the systems encounter and their responses to them. Textual/numeric data and log files produced by applications, operating systems, networks, and other monitoring sources play a key role for assessing system reliability. Practitioners, academia, and industry strongly recognize the inherent value of log data. Data-driven evaluation deepens our understanding of the system dependability behavior, and enables stronger design and better monitoring strategies. However, in spite of recent advances, data-driven reliability evaluation keeps posing challenging questions due to the scale, complexity and diversity of applications. 
This full-day tutorial focuses on the methodologies, tools and state-of-the-art techniques underlying data-driven system reliability evaluation. The goal of the tutorial is to deliver a well-balanced mix of theory and practice by (i) introducing state-of-the-art techniques to characterize and model failures starting from data, (ii) presenting industrial case studies and assessments of real-world systems, and (iii) providing exciting hands-on sessions where attendees will be guided in the analysis of real log data. Research issues and novel directions will be introduced during the tutorial to foster discussion among attendees.

You can download the flyer, including the abstract and bios, here

Tutorial 5: Building Highly-Available Distributed SDN Applications with ONOS 
Thomas Vachuska, Brian O'Connor and Ali Al-Shabibi (OnLABS)
Date: June 30, duration: 3h 30min


ONOS (Open Network Operating System) is a distributed applications platform aimed at building SDN applications for service provider networks. The size and critical nature of these networks dictate that the platform and the control applications built atop it must be resilient to failures, scalable, and fast in terms of both reaction latency and throughput of control operations.
In this tutorial, the attendees will implement a distributed ONOS application called BYON (Build Your Own Network). Through hands-on exercises, the audience will get familiar with the ONOS SDK and experience how to implement an ONOS service, a distributed ONOS store, and how to use parts of the CLI and Northbound API provided by the ONOS platform.

You can download the flyer, including the abstract and bios, here

Tutorial 6: Resilience for Scientific Computing: From Theory to Practice
Franck Cappello (Argonne National Lab) and George Bosilca (University of Tennessee)
Date: July 1, duration: 4h

Introductory Video

Resilience is becoming a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerance techniques for high-performance computing, with a fair balance between practice and theory. It is organized along four main topics:
(i) An overview of failure types (software/hardware, transient/fail-stop) observed in the field and typical probability distributions (Exponential, Weibull, Log-Normal) used to model failure inter-arrival times.
(ii) General-purpose techniques, which include several fault-tolerance protocols, replication, prediction and silent error detection.
(iii) Application-specific techniques, such as ABFT for grid-based algorithms or fixed-point convergence for iterative applications.
(iv) Practical deployment of fault-tolerance techniques. Relevant examples based on computational solver routines will be protected with a mix of checkpoint-restart and advanced recovery techniques in a hands-on session.
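
To give a flavor of the probabilistic models in topics (i) and (iv), here is a minimal Python sketch. The abstract does not say which checkpointing model the tutorial uses; as an assumption for illustration, the sketch applies Young's first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x MTBF), and shows how a Weibull scale parameter can be chosen to match a target mean time between failures. All function names and numbers are hypothetical.

```python
import math
import random


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpoint interval (assumed model)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def weibull_scale_for_mtbf(mtbf, shape):
    """Scale such that a Weibull(scale, shape) has mean mtbf: mean = scale * Gamma(1 + 1/shape)."""
    return mtbf / math.gamma(1.0 + 1.0 / shape)


def sample_inter_arrival_times(mtbf, shape, count, seed=0):
    """Draw failure inter-arrival times; shape == 1 reduces to the Exponential case."""
    rng = random.Random(seed)
    scale = weibull_scale_for_mtbf(mtbf, shape)
    return [rng.weibullvariate(scale, shape) for _ in range(count)]


# Hypothetical platform: 60 s to write a checkpoint, 3000 s mean time between failures.
interval = young_interval(checkpoint_cost=60.0, mtbf=3000.0)
print(f"checkpoint roughly every {interval:.0f} s")

# A shape below 1 gives a decreasing failure rate over time.
samples = sample_inter_arrival_times(mtbf=3000.0, shape=0.7, count=5)
print([round(s) for s in samples])
```

Young's formula is a first-order estimate that is reasonable when the checkpoint cost is small compared with the MTBF; more refined models may be preferable outside that regime.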

The tutorial is open to all DSN'16 attendees who are interested in the current status and expected promise of resilience approaches for scientific applications. There are no prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.
