Analysis and Testing of Big Data Applications

Over the last few years, a vast amount of data has become available from a variety of heterogeneous sources, including social networks and cyber-physical systems. This state of affairs has pushed recent research toward computing platforms and programming environments that support processing massive quantities of data. Systems such as MapReduce, Spark, Flink, Storm, Hive, Pig Latin, Hadoop, and HDFS have emerged to address the technical challenges posed by the nature of these computations, including parallelism, distribution, network communication, and fault tolerance.

Despite the popularity of such systems, little attention has been paid to aspects of the development process other than programming itself. For example, testing Big Data applications remains largely unexplored. This is all the more surprising given that testing has a long tradition in Software Engineering, both in research (e.g., concolic testing, mutation testing) and in practice, where established testing techniques and tools (e.g., JUnit) are widespread in industry.

The goal of this thesis is to develop a testing methodology for Big Data applications, focusing on the Apache Spark platform. The candidate will apply testing techniques based on symbolic execution to the setting of Big Data. Ideally, the thesis will include a comparison of different approaches as well as the development of a new methodology specifically tailored to Big Data.


Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud'10). USENIX Association, Berkeley, CA, USA, 10-10.