Automatic Code Summarization Using Abbreviation Expansion and Subword Segmentation

DOI: 10.1111/exsy.13835 ISSN: 0266-4720

Yu‐Guo Liang, Gui‐Sheng Fan, Hui‐Qun Yu, Ming‐Chen Li, Zi‐Jie Huang

ABSTRACT

Automatic code summarization refers to generating concise natural language descriptions for code snippets. It is vital for improving the efficiency of program understanding among software developers and maintainers. Despite the impressive strides made by deep learning‐based methods, limitations still exist in their ability to understand and model semantic information due to the unique nature of programming languages. We propose two methods to boost code summarization models: context‐based abbreviation expansion and unigram language model‐based subword segmentation. We use heuristics to expand abbreviations within identifiers, reducing semantic ambiguity and improving the language alignment of code summarization models. Furthermore, we leverage subword segmentation to tokenize code into finer subword sequences, providing more semantic information during training and inference, thereby enhancing program understanding. These methods are model‐agnostic and can be readily integrated into existing automatic code summarization approaches. Experiments conducted on two widely used Java code summarization datasets demonstrated the effectiveness of our approach. Specifically, by fusing original and modified code representations into the Transformer model, our Semantic Enhanced Transformer for Code Summarization (SETCS) serves as a robust semantic‐level baseline. By simply modifying the datasets, our methods achieved performance improvements of up to 7.3%, 10.0%, 6.7%, and 3.2% for representative code summarization models in terms of BLEU‐4, METEOR, ROUGE‐L, and SIDE, respectively.
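To make the two preprocessing ideas concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a small hypothetical lookup table (`ABBREVIATIONS`) in place of the context‐based heuristics, which the abstract does not specify, and it uses made‐up log‐probabilities for the unigram model. The function names `split_identifier`, `expand_identifier`, and `unigram_segment` are illustrative only.

```python
import math
import re

# Hypothetical abbreviation table (an assumption for illustration);
# the paper's context-based heuristics are not reproduced here.
ABBREVIATIONS = {"btn": "button", "msg": "message", "idx": "index", "cnt": "count"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def expand_identifier(name, table=ABBREVIATIONS):
    """Replace each word of an identifier that appears in the table."""
    return [table.get(word, word) for word in split_identifier(name)]


def unigram_segment(word, log_probs):
    """Return the most probable segmentation of `word` under a unigram
    model, found with a Viterbi search over piece log-probabilities."""
    n = len(word)
    # best[i] = (best score for word[:i], start index of the last piece)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [word]  # no segmentation found; keep the word whole
    pieces = []
    end = n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


# Toy vocabulary with invented log-probabilities (assumption).
TOY_LOG_PROBS = {"un": -2.0, "read": -2.0, "able": -2.0,
                 "unread": -5.0, "a": -6.0, "ble": -6.0}

print(expand_identifier("getBtnIdx"))
print(unigram_segment("unreadable", TOY_LOG_PROBS))
```

In practice the piece probabilities would be learned from a corpus by a unigram language model tokenizer rather than written by hand, and the expansion step would consult the surrounding code context before choosing a long form.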
