Retrieval-augmented code models

Bachelor Thesis, Master Thesis

State-of-the-art models for code generation and understanding, such as CodeBERT and CodeT5, are essentially LLMs originally designed for natural language (BERT, T5) but trained on a large corpus of code. However, LLMs are known to hallucinate, i.e., they often produce sentences and information that are factually incorrect. One way to prevent, or at least reduce the impact of, hallucination in LLMs is to augment them with information retrieval.
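As a toy illustration of the retrieval-augmentation idea, the sketch below retrieves the code snippets most similar to a query (here via simple bag-of-words cosine similarity, standing in for a learned retriever) and prepends them to the prompt given to the code model. All function names and the mini-corpus are hypothetical, chosen only for this example.

```python
import math
import re
from collections import Counter

def tokenize(code):
    # crude code tokenizer: split on non-alphanumeric characters and underscores
    return [t.lower() for t in re.split(r"[\W_]+", code) if t]

def cosine(a, b):
    # cosine similarity between two token-count vectors
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    # rank corpus snippets by similarity to the query; a real system
    # would use a learned dense retriever instead of bag-of-words
    qv = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda s: cosine(qv, Counter(tokenize(s))),
                    reverse=True)
    return ranked[:k]

def augment_prompt(query, corpus, k=1):
    # prepend retrieved snippets as grounding context for the code model
    context = "\n".join(retrieve(query, corpus, k))
    return f"# Retrieved context:\n{context}\n# Task:\n{query}"

corpus = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def sort_list(xs): return sorted(xs)",
]
print(augment_prompt("read a file", corpus))
```

The hope is that grounding the model's generation in retrieved, verifiably real code reduces its tendency to invent nonexistent APIs or facts.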

It is currently unknown whether these models also hallucinate when used for code intelligence tasks. This thesis would focus on developing methods to detect and understand hallucinations in these models, and further explore retrieval-based augmentation as a way to reduce their effect.