Evaluating Code-Similarity Metrics
The goal of this thesis is to build a heuristic that can determine the quality of code-containing answers from the community question-and-answer (CQA) site Stack Overflow. If high- and poor-quality answers can be identified, Stack Overflow can be mined for such answers and their code snippets, which can then be used to improve developer-assistance tools that rely on code snippets as training examples, most notably code-recommendation systems. These systems support IDE users with intelligent, context-sensitive help and typically gather their examples from traditional code repositories. Grading the quality of code-containing answers is the first step towards opening up millions of Stack Overflow answers as possible positive and negative examples for these systems.
Our approach is to conduct a user study in which respondents rate answers with code snippets from Stack Overflow. Compared to previous work, our study employs a comprehensive definition of answer quality that covers whether an answer is correct, well explained, and contains good-practice code. About 50 users actively participate in the study, providing ratings for 131 answers. The resulting ratings serve as a ground truth for building and verifying heuristics of answer quality. We evaluate 10 features of Stack Overflow’s posts and users for their predictive power. The best cross-validated prediction is achieved by a heuristic that incorporates four features: the number of down-votes, whether the answer is accepted, the length of the text accompanying the code snippet, and the answerer’s reputation points. A linear-regression heuristic with these features achieves an improvement of 13 percent over its baseline and an R2 value of 0.3.
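To make the final heuristic concrete, the sketch below fits an ordinary-least-squares regression on the four features named above and reports R2. All data here is synthetic and the feature values are invented for illustration; the actual study fits against the 131 human-rated answers, and the helper names (`fit_linear_regression`, `r_squared`) are our own, not part of any described tooling.

```python
# Sketch of the four-feature linear-regression heuristic on synthetic data.
# Uses the normal equations (X^T X) w = X^T y, solved with Gaussian elimination,
# so no third-party libraries are required.

def fit_linear_regression(X, y):
    """Ordinary least squares; each row of X starts with 1 for the intercept."""
    n = len(X[0])
    # Normal-equation system: A = X^T X, b = X^T y.
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, n))) / A[r][r]
    return w

def r_squared(X, y, w):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    preds = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic rows: [intercept, down_votes, is_accepted, text_length, reputation]
X = [
    [1, 0, 1, 420, 15000],
    [1, 2, 0,  50,   300],
    [1, 0, 1, 310,  8200],
    [1, 1, 0, 120,  1100],
    [1, 0, 0, 260,  4700],
    [1, 3, 0,  30,   150],
]
y = [4.5, 1.5, 4.0, 2.5, 3.0, 1.0]  # hypothetical quality ratings (1-5 scale)

w = fit_linear_regression(X, y)
print("weights:", w)
print("R^2 on training data:", r_squared(X, y, w))
```

Note that in the study the R2 of 0.3 is a cross-validated figure on real ratings; the in-sample fit on this tiny synthetic set is not comparable to it.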