Debugging Big Data Applications

Over the last few years, a vast amount of data has become available from a variety of heterogeneous sources, including social networks and cyber-physical systems. This state of affairs has pushed recent research toward computing platforms and programming environments that support processing massive quantities of data. Systems like MapReduce, Spark, Flink, Storm, Hive, Pig Latin, Hadoop, and HDFS have emerged to address the technical challenges posed by the nature of these computations, including parallelism, distribution, network communication, and fault tolerance.

The problem of debugging such applications, however, is still open. Traditional debuggers can hardly help because of the design of Big Data frameworks: the code that developers write is not directly executed, but is deployed on a system that features distribution, parallel execution, re-execution of slow computations, and re-allocation of failed computations – all aspects that are not explicit in programmers’ code. To complicate things further, many Big Data frameworks are based on a declarative functional programming model, which clashes with the abstractions used by traditional debuggers (e.g., stepping over statements, inspecting state).

This thesis aims to design a debugging system for Big Data applications. The objective is to use data provenance techniques to track the flow of data through the application. This way, a faulty result can be traced back to the input records and transformations that produced it, significantly reducing the time required to debug Big Data software.
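To make the idea of provenance-based debugging concrete, the following is a minimal single-machine sketch in Python. It is not the system the thesis will build and does not use the API of any of the frameworks named above; the names Tracked, load, tmap, treduce_by_key and trace_back, as well as the sample data, are hypothetical and chosen only for illustration. Each record carries the set of input positions it was derived from, so a suspicious output can be traced back to the raw inputs that contributed to it.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Tracked:
    """A value together with the indices of the input records it derives from."""
    value: Any
    sources: FrozenSet[int]


def load(raw: List[Any]) -> List[Tracked]:
    # Every input record starts with itself as its only source.
    return [Tracked(v, frozenset({i})) for i, v in enumerate(raw)]


def tmap(f: Callable[[Any], Any], records: List[Tracked]) -> List[Tracked]:
    # A map keeps the provenance of the record it transforms.
    return [Tracked(f(r.value), r.sources) for r in records]


def treduce_by_key(f: Callable[[Any, Any], Any], records: List[Tracked]) -> List[Tracked]:
    # Records hold (key, value) pairs; merging two values unions their provenance.
    groups: Dict[Any, Tracked] = {}
    for r in records:
        key, val = r.value
        if key in groups:
            prev = groups[key]
            merged = (key, f(prev.value[1], val))
            groups[key] = Tracked(merged, prev.sources | r.sources)
        else:
            groups[key] = r
    return list(groups.values())


def trace_back(record: Tracked, raw: List[Any]) -> List[Any]:
    # Return the raw inputs that contributed to the given output record.
    return [raw[i] for i in sorted(record.sources)]


def parse(line: str):
    name, amount = line.split(",")
    return (name, int(amount))


# Hypothetical input: one record contains an implausible amount.
raw_lines = ["alice,10", "bob,7", "alice,-300", "bob,5"]

totals = treduce_by_key(lambda a, b: a + b, tmap(parse, load(raw_lines)))

# Treat negative totals as faulty results and trace them back to their inputs.
faulty = [r for r in totals if r.value[1] < 0]
suspicious_inputs = [line for r in faulty for line in trace_back(r, raw_lines)]
print(suspicious_inputs)
```

In this sketch the negative total for "alice" is narrowed down to the two input lines that fed it, without stepping through any execution. A real system would have to propagate such provenance across partitions, shuffles, and re-executed tasks, which is where the research challenge lies.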


Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10.