A Comprehensive Review of Code Embedding Models: Architectures, Applications, Evaluation, and Latest Developments

Dec 15, 2025

Architectures and Methodologies of Prominent Code Embedding Models

This section provides a detailed overview of the foundational architectures and methodologies of prominent code embedding models, specifically focusing on CodeBERT, GraphCodeBERT, CodeT5, and UniXCoder. These models aim to convert source code into continuous vector spaces to facilitate various code-related tasks by employing unique approaches to source code representation and self-supervised pre-training tasks.

CodeBERT

CodeBERT is a bimodal pre-trained model designed for both programming and natural languages, built upon the Transformer architecture, specifically an encoder-only model that follows the design of BERT. Its primary approach to source code representation is to treat code as a sequence of tokens. CodeBERT is pre-trained on natural language (NL) and programming language (PL) pairs across six programming languages: Python, Java, JavaScript, PHP, Ruby, and Go 1.

The pre-training objectives for CodeBERT involve two self-supervised tasks:

  1. Masked Language Modeling (MLM): This objective entails randomly masking input tokens in both natural language comments and source code, then training the model to predict the original tokens. This task encourages the model to align natural language and programming language representations by leveraging comment context when the code context is insufficient to infer masked tokens.
  2. Replaced Token Detection (RTD): This task involves randomly replacing input tokens and training the model to detect whether specific tokens have been replaced 2.
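
As a concrete illustration, the publicly released microsoft/codebert-base checkpoint can be loaded through the Hugging Face transformers library to produce an embedding for a code snippet. This is a minimal sketch; the mean-pooling step is an illustrative choice for obtaining a single vector and is not part of CodeBERT's pre-training objectives.

```python
# Minimal sketch: obtaining a fixed-size embedding for a code snippet with CodeBERT.
# Assumes the transformers and torch packages and the public
# "microsoft/codebert-base" checkpoint; mean pooling over token states is an
# illustrative pooling choice, not prescribed by the CodeBERT paper.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state            # (1, seq_len, 768)

mask = inputs["attention_mask"].unsqueeze(-1)             # ignore padding positions
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pooled code vector
print(embedding.shape)                                     # torch.Size([1, 768])
```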

GraphCodeBERT

GraphCodeBERT is a Transformer-based encoder model built upon BERT, distinguished by its design to consider the inherent structure of code. It extends the standard Transformer by incorporating a graph-guided masked attention function, explicitly integrating code structure into its architecture. Unlike models that solely treat code as a token sequence, GraphCodeBERT leverages semantic-level code structure, specifically data flow, during pre-training. Data flow is represented as a graph where variables are nodes, and directed edges indicate the "where-the-value-comes-from" relationship between these variables. This semantic structure is considered less complex and more efficient for representing dependencies compared to abstract syntax trees (ASTs). The input to the model includes the paired comment, source code, and the extracted set of variables (nodes from the data flow graph).

GraphCodeBERT employs structure-aware pre-training tasks in addition to standard masked language modeling:

  1. Masked Language Modeling (MLM): This follows the standard BERT approach, where 15% of tokens from the source code and paired comment are masked, and the model predicts the original tokens.
  2. Data Flow Edge Prediction: To learn representations from the data flow, the model randomly samples a percentage (e.g., 20%) of variable nodes, masks their incoming or outgoing data flow edges, and then predicts these masked edges. This task promotes the learning of structure-aware representations.
  3. Variable-Alignment across Source Code and Data Flow (Node Alignment): This task aims to align representations between source code tokens and data flow variables. It masks the connections between code tokens and a sample of variable nodes, then predicts, from the data flow information, which code token each masked variable originates from.
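
To make the "where-the-value-comes-from" relation concrete, the toy extractor below derives data flow edges for straight-line Python assignments using the standard ast module. It only illustrates what a data flow edge is; GraphCodeBERT's own pipeline is more general and not reproduced here.

```python
# Toy data-flow extractor: for straight-line Python assignments, record
# "where-the-value-comes-from" edges of the form (target, source).
# Illustrative only; GraphCodeBERT's actual extraction handles far more cases.
import ast

def dataflow_edges(source: str):
    tree = ast.parse(source)
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            # variables read on the right-hand side feed the assigned targets
            sources = [n.id for n in ast.walk(node.value)
                       if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)]
            targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
            edges += [(tgt, src) for tgt in targets for src in sources]
    return edges

print(dataflow_edges("x = a + b\ny = x * 2"))
# [('x', 'a'), ('x', 'b'), ('y', 'x')]
```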

CodeT5

CodeT5 is an encoder-decoder model that follows the architecture of T5 (Text-to-Text Transfer Transformer). It adapts the T5 model by considering crucial token type information, particularly from identifiers, to leverage code-specific structural details in its source code representation.

CodeT5 proposes specific pre-training tasks designed to capture code-specific characteristics:

  1. Identifier-aware Pre-training Task: This task aims to leverage structural information specifically related to code identifiers 2.
  2. Bimodal Dual Generation Pre-training Task: Designed to augment the alignment between natural language and programming language representations 2.
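
One ingredient of identifier-aware pre-training is knowing which tokens are identifiers and turning them into prediction targets. The sketch below uses Python's tokenize module to tag identifiers and replace them with T5-style sentinel tokens; it is a rough illustration of how such training pairs can be constructed, not CodeT5's exact data pipeline.

```python
# Rough illustration of building identifier-masked training pairs: every
# identifier occurrence is replaced by a T5-style sentinel token and the
# original names become the target sequence. This mimics the spirit of
# CodeT5's identifier-aware objectives, not its exact data pipeline.
import io
import keyword
import tokenize

def mask_identifiers(code: str):
    tokens = list(tokenize.generate_tokens(io.StringIO(code).readline))
    masked, targets, sentinel = [], [], 0
    for tok in tokens:
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            masked.append(f"<extra_id_{sentinel}>")   # identifier -> sentinel
            targets.append(tok.string)
            sentinel += 1
        elif tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING):
            masked.append(tok.string)                 # keywords, operators, literals kept
    return " ".join(masked), " ".join(targets)

src, tgt = mask_identifiers("def add(a, b): return a + b\n")
print(src)  # def <extra_id_0> ( <extra_id_1> , <extra_id_2> ) : return <extra_id_3> + <extra_id_4>
print(tgt)  # add a b a b
```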

UniXCoder

UniXCoder is a unified cross-modal pre-trained model based on a multi-layer Transformer 3. It is designed to support both understanding and generation tasks and offers flexibility to operate in encoder-only, decoder-only, or encoder-decoder modes by utilizing mask attention matrices with prefix adapters ([Enc], [Dec], [E2D]) to control context access 3. UniXCoder goes beyond simple code sequences by integrating multi-modal contents, specifically Abstract Syntax Tree (AST) and code comments, alongside the source code itself 3. To incorporate AST (which is tree-structured) into a sequential model, it proposes a one-to-one mapping method to transform an AST into a sequence that preserves all its structural information. This flattened AST sequence, along with the code comment, is then used as part of the model's input in parallel with the source code 3.
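
The idea of flattening a tree into a sequence that still encodes structure can be sketched with the standard ast module, as below. The left/right markers loosely follow the spirit of UniXCoder's mapping; the exact one-to-one mapping in the paper additionally interleaves the original code tokens and is not reproduced here.

```python
# Simplified AST flattening: serialize a Python AST into a bracketed token
# sequence that preserves the tree structure (a stand-in for UniXCoder's
# one-to-one mapping, which also interleaves the original code tokens).
import ast

def flatten(node):
    label = type(node).__name__
    children = list(ast.iter_child_nodes(node))
    if not children:
        return [label]
    seq = [f"{label}::left"]          # marks where this node's subtree starts
    for child in children:
        seq += flatten(child)
    seq.append(f"{label}::right")     # marks where this node's subtree ends
    return seq

tree = ast.parse("x = a + b")
print(" ".join(flatten(tree)))
# Module::left Assign::left Name::left Store Name::right BinOp::left ... Module::right
```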

UniXCoder employs a comprehensive set of pre-training tasks:

  1. Masked Language Modeling (MLM): Applied in encoder-only mode, where 15% of input tokens (from code, comment, and flattened AST) are masked, and the model predicts the original tokens 3. This task leverages semantic information from comments and syntactic information from AST to infer masked code tokens 3.
  2. Unidirectional Language Modeling (ULM): Used for decoder-only mode, this task trains the model to predict the next token auto-regressively, enabling support for tasks like code completion 3.
  3. Denoising Objective: This task is utilized when the model operates in encoder-decoder mode 3.
  4. Multi-modal Contrastive Learning: This objective leverages the AST to enhance the semantics of code fragment embeddings 3 (a generic sketch of this style of loss follows this list).
  5. Cross-modal Generation: This task uses code comments to align embeddings between different programming languages 3.
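
Multi-modal contrastive learning generally pulls together embeddings of matching views of the same fragment (e.g., code versus its comment-and-AST view) while pushing apart non-matching pairs within a batch. The following is a generic in-batch InfoNCE loss in PyTorch, shown only to make the mechanism concrete; it is not UniXCoder's exact formulation or hyperparameters.

```python
# Generic in-batch contrastive (InfoNCE) loss over two views of the same code
# fragments, as commonly used for embedding pre-training. Illustrative only;
# UniXCoder's exact loss and temperature are not reproduced here.
import torch
import torch.nn.functional as F

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.05):
    """view_a, view_b: (batch, dim) embeddings where row i of each view
    describes the same code fragment (e.g., code vs. comment+AST)."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature      # cosine similarities of all cross pairs
    labels = torch.arange(a.size(0))      # the matching pair sits on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```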

The following table summarizes the key characteristics of these prominent code embedding models:

| Model | Foundational Architecture | Key Source Code Representation Approach | Primary Pre-training Objectives |
| --- | --- | --- | --- |
| CodeBERT | Transformer (Encoder-only, BERT-based), Bimodal NL/PL | Sequence of tokens (code, comments) | Masked Language Modeling (MLM), Replaced Token Detection (RTD) |
| GraphCodeBERT | Transformer (Encoder, BERT-based), Graph-guided attention | Data flow (semantic structure via graphs of variables and dependencies) | MLM, Data Flow Edge Prediction, Variable-Alignment |
| CodeT5 | Transformer (Encoder-decoder, T5-based) | Token type information, especially identifiers | Identifier-aware Pre-training Task, Bimodal Dual Generation Pre-training Task |
| UniXCoder | Multi-layer Transformer (flexible modes: encoder-only, decoder-only, encoder-decoder) | Multi-modal: Abstract Syntax Tree (AST) as flattened sequence, code comments, source code | MLM, Unidirectional Language Modeling (ULM), Denoising Objective, Multi-modal Contrastive Learning, Cross-modal Generation |

Applications and Use Cases of Code Embedding Models

Code embedding models have revolutionized the way machine learning is applied to source code by transforming complex programmatic structures into numerical representations that capture both semantic and syntactic properties. This capability extends their utility beyond basic code search to critical tasks such as automated program repair, vulnerability detection, and intelligent code generation systems.

Automated Program Repair (APR)

Automated Program Repair (APR) systems aim to automatically identify and fix software bugs. Code Language Models (CLMs) have emerged as powerful tools in this domain, leveraging code embeddings to understand and manipulate code structures.

Effectiveness: CLMs demonstrate significant capabilities in automated program repair. Even without fine-tuning, these models can fix 72% more bugs than state-of-the-art deep-learning (DL)-based APR techniques 4. Fine-tuning CLMs with specific APR training data further enhances their performance, yielding improvements from 31% to 1,267% and enabling them to fix 46% to 164% more bugs than existing DL-based APR techniques 4. For instance, the fine-tuned InCoder-6B model fixed 100 (164%) more bugs than the best DL-based APR techniques across four benchmarks 4. Models like PLBART, CodeGen, and InCoder show competitive fixing capabilities even without fine-tuning, especially on benchmarks such as QuixBugs and HumanEval-Java 4. Furthermore, Large Language Models (LLMs) like Code Llama and Mistral, when fine-tuned on C/C++ vulnerability datasets, significantly improve repair accuracy and adaptability compared to older methods such as VulRepair for automated code vulnerability repair 5.

Underlying Mechanisms: APR systems typically use data-driven models that learn transformations from pairs of buggy and fixed code 6. CLMs are trained on massive, unlabeled code corpora for general language modeling tasks, such as next token prediction 4. Deep learning-based APR techniques adapt DL models to accept a buggy program as input and produce a patched version 4. Fine-tuning CLMs involves training them with specific APR datasets, which is crucial for learning effective bug-fixing patterns 4. These methods often incorporate hybrid approaches, combining insights from natural language processing (NLP) with the formal structure of code, including Abstract Syntax Trees (AST), data flow, and control flow graphs 6.
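
Fine-tuning a CLM for APR amounts to supervised learning on (buggy, fixed) pairs. The sketch below shows one training step with a seq2seq code model; the public Salesforce/codet5-base checkpoint is used purely as an example, since the cited studies fine-tune several different CLMs with their own data and hyperparameters.

```python
# Sketch of one supervised fine-tuning step for APR: the buggy function is the
# input sequence and the fixed function is the target. Uses the public
# Salesforce/codet5-base checkpoint as an example; the cited studies fine-tune
# several different code language models.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

buggy = "def is_even(n):\n    return n % 2 == 1"   # buggy predicate
fixed = "def is_even(n):\n    return n % 2 == 0"   # human-written fix

inputs = tokenizer(buggy, return_tensors="pt", truncation=True)
labels = tokenizer(fixed, return_tensors="pt", truncation=True).input_ids

loss = model(**inputs, labels=labels).loss   # cross-entropy over the patch tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()
```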

Challenges: Traditional DL-based APR tools often generate numerous candidate patches (hundreds to thousands) and require hours for validation, which is impractical for developers who prefer few patches and quick responses 4. Without fine-tuning, CLMs struggle to effectively utilize information from buggy lines and can even perform worse when given this context, sometimes generating incorrect or uncompilable patches by attempting to follow the flawed code 4. Fine-tuned CLMs might also potentially over-rely on buggy lines 4. There is also a challenge in ensuring fair and comprehensive evaluations, necessitating the use of test datasets devoid of training samples to prevent data leakage. Repairing security vulnerabilities specifically is an underexplored area compared to general bug repair 6.

Vulnerability Detection

Code embedding models combined with deep learning classifiers have proven highly effective in automating the detection of security vulnerabilities in source code.

Effectiveness: In Python source code vulnerability detection, the combination of Bidirectional Long Short-Term Memory (BiLSTM) with Word2Vec embeddings achieved superior performance, with an average precision of 96.2%, recall of 93.3%, F-score of 94.7%, and accuracy of 98.6% 7. Convolutional Neural Network (CNN) combined with GraphCodeBERT also showed strong results, attaining an average precision of 94.4%, recall of 91.9%, F-score of 93.3%, and accuracy of 97.3% 7. In some studies, Word2Vec has been shown to outperform CodeBERT and FastText in terms of precision, recall, and F-score when used with LSTM and GRU classifiers 7.

Here is a summary of observed performance in Python vulnerability detection:

| Model Combination | Precision | Recall | F-score | Accuracy |
| --- | --- | --- | --- | --- |
| BiLSTM + Word2Vec | 96.2% | 93.3% | 94.7% | 98.6% |
| CNN + GraphCodeBERT | 94.4% | 91.9% | 93.3% | 97.3% |

Underlying Mechanisms: Automated vulnerability detection systems require source code to be transformed into structured numeric formats using code embeddings 7. Commonly used embedding techniques include:

  • Word2Vec: Used to encode Python code tokens into continuous vector representations, often trained on a corpus of Python GitHub repositories 7.
  • CodeBERT: A pre-trained model based on the BERT architecture, designed for both programming and natural languages, which captures syntactic and semantic features from large datasets of paired code and natural language 7.
  • GraphCodeBERT: An extension of CodeBERT that integrates code structure information like data flow and syntax graphs to represent richer semantic relationships 7.

These embeddings serve as feature vectors for deep learning classifiers such as BiLSTM networks, which process sequential data in both forward and backward directions, and CNNs, which are adept at extracting spatial features from structured data 7.
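
A compact sketch of this embedding-plus-classifier pipeline is shown below, using gensim's Word2Vec and a Keras BiLSTM. The toy corpus, whitespace tokenizer, and layer sizes are placeholders and do not reflect the cited study's configuration.

```python
# Sketch of the Word2Vec + BiLSTM pipeline for vulnerability detection:
# learn token vectors from code, map each snippet to a fixed-length sequence of
# vectors, and classify it with a bidirectional LSTM. Layer sizes and the naive
# whitespace tokenizer are placeholders, not the cited study's configuration.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import layers, models

snippets = ["eval ( user_input )", "print ( 'hello' )"]   # toy corpus
labels = np.array([1, 0])                                  # 1 = vulnerable

tokenized = [s.split() for s in snippets]
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)

max_len = 20
def vectorize(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv][:max_len]
    vecs += [np.zeros(100)] * (max_len - len(vecs))        # pad to fixed length
    return np.stack(vecs)

X = np.stack([vectorize(t) for t in tokenized])            # (samples, 20, 100)

model = models.Sequential([
    layers.Input(shape=(max_len, 100)),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=1, batch_size=2)
```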

Challenges: Traditional static analysis tools often generate many false positives 7. Manually defining features for rule-based systems is time-consuming and relies heavily on human expertise 7. Dynamic analysis requires extensive, expert-crafted test cases 7. The effectiveness of deep learning models in vulnerability detection is significantly influenced by the quality and relevance of the code representation 7. There is a critical need to select appropriate combinations of embedding techniques and classifiers to optimize performance 7. Python, despite its widespread use, has received relatively limited attention in vulnerability detection research compared to languages like C 7.

Intelligent Code Generation Systems

Intelligent code generation systems, powered by code embeddings and large language models, significantly boost developer productivity by automating code creation.

Effectiveness: AI code generation systems provide instant coding support, offering completions, snippets, and even entire functions 8. This technology significantly boosts developer productivity; for instance, GitHub Copilot alone generated over 82 billion lines of code in its first year 8. Researchers estimate that AI code generation can save developers up to 30% of their coding time 8. Furthermore, it democratizes software development, making coding more accessible to individuals with less experience 8.

Underlying Mechanisms: AI code generation operates on machine learning algorithms trained on vast amounts of existing source code, often from open-source projects 8. It leverages advanced Large Language Models (LLMs) and Generative AI techniques 8. Key mechanisms include:

  • Autocomplete features: AI predicts and suggests code completions based on learned patterns 8.
  • Natural language input: Users describe desired functionality in natural language, and the AI generates corresponding code snippets or full functions 8.
  • Direct interaction: Conversational interfaces allow developers to request code or bug fixes 8.

These systems employ deep learning algorithms and extensive neural networks trained on diverse code datasets 8. Examples include GitHub Copilot (using OpenAI's Codex), ChatGPT and other GPT models (fine-tuned for code generation), Amazon Q Developer, Google's Gemini and Vertex AI (which use PaLM 2), Code Llama (an open-source model), and TabNine 8. These tools support various programming languages like Python, C++, Java, JavaScript, and more 8.
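
The natural-language-input mechanism can be exercised with any hosted or local code-capable LLM. The snippet below uses the OpenAI Python client as one example; the model name and prompt are placeholders, and a comparable provider or open model could be substituted.

```python
# Example of the "natural language input" mechanism: describe the desired
# behaviour and let a code-capable LLM produce the function. The model name
# and prompt are placeholders; any comparable hosted or local LLM would do.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever code-capable model is available
    messages=[
        {"role": "system", "content": "You write concise, well-documented Python."},
        {"role": "user", "content": "Write a function that parses an ISO-8601 date "
                                    "string and returns a datetime.date."},
    ],
)
print(response.choices[0].message.content)  # generated code still needs human review
```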

Challenges: A primary challenge is ensuring the quality and reliability of AI-generated code, as it can sometimes be buggy or insecure, necessitating rigorous human review and testing 8. A study on Copilot-generated code found only 28.7% of problems were solved correctly, with 51.2% partially correct and 20.1% incorrect 8. Maintainability is another concern; Generative AI may produce overly complex or "over-engineered" code, leading to unnecessary abstractions or intricate logic that complicate future debugging and collaboration 8. This can increase technical debt, as models might prioritize syntactical correctness over efficiency or long-term maintainability 8. There are also concerns about a potential "loss of control," where over-reliance on AI might diminish developers' fundamental coding skills and expertise 8. Additionally, issues around licensing and copyright infringement for AI-generated code remain 8. Effective integration requires developers to understand the strengths and limitations of these tools and use them as aids rather than replacements for human judgment 8.

Latest Developments, Trends, and Research Progress

The past 24 months (approximately May 2022 to May 2024) have witnessed a profound transformation in code embedding models, largely driven by the ascendancy of Large Language Models for Code (LLMs4Code). This period marks a significant paradigm shift from traditional encoder-only architectures to advanced decoder-only LLMs, impacting model architectures, training techniques, and performance across various code-related tasks 9.

I. The Emergence of Large Language Models for Code (LLMs4Code)

The landscape of representation learning for code has transitioned from encoder-only models, which primarily capture bidirectional context, to larger, decoder-only LLMs such as GPT, LLaMA, and Mistral for embeddings 9. These LLMs benefit from significantly more parameters and extensive pre-training corpora, increasingly showcasing emergent capabilities in understanding and processing code 9. Many foundational LLMs released or substantially updated within this timeframe exhibit varying code capabilities 10.

Key Foundational LLMs with Code Capabilities (2022-2024) 10:

| Year | Model Name | Month |
| --- | --- | --- |
| 2022 | BLOOM | November |
| 2023 | LLaMA | February |
| 2023 | GPT-4 | March |
| 2023 | LLaMA 2 | July |
| 2023 | Phi-1.5 | September |
| 2023 | Baichuan 2 | September |
| 2023 | Qwen | September |
| 2023 | Mistral | October |
| 2023 | Gemini | December |
| 2023 | Phi-2 | December |
| 2023 | YAYI2 | December |
| 2024 | DeepSeek | January |
| 2024 | Mixtral | January |
| 2024 | DeepSeekMoE | January |
| 2024 | Orion | January |
| 2024 | OLMo | February |
| 2024 | Gemma | February |
| 2024 | Claude 3 | March |
| 2024 | Yi | March |
| 2024 | Poro | April |
| 2024 | JetMoE | April |
| 2024 | LLaMA 3 | April |
| 2024 | Reka Core | April |
| 2024 | Phi-3 | April |
| 2024 | OpenELM | April |
| 2024 | Tele-FLM | April |
| 2024 | DeepSeek-V2 | May |
| 2024 | GECKO | May |
| 2024 | MAP-Neo | May |

Additionally, several general-purpose LLMs have been specifically adapted for code through additional pre-training on code-related data 10:

| Adapted LLM Name | Based On | Release Date |
| --- | --- | --- |
| Minerva | PaLM | June 2022 |
| PaLM 2 | - | May 2023 |
| Code LLaMA | LLaMA 2 | August 2023 |
| Lemur | LLaMA 2 | October 2023 |
| BTX | LLaMA 2 | March 2024 |
| HiRoPE | - | March 2024 |
| CodeGemma | Gemma | April 2024 |

II. New Model Architectures

Architectural innovations during this period have aimed at enhancing the efficiency, versatility, and contextual understanding of code embedding models:

  • Mixture of Experts (MoE) Models: The introduction of Sparse MoE architectures, exemplified by Mixtral 8x7B (January 2024), demonstrates superior performance by efficiently activating a subset of expert layers compared to larger dense LLMs. DeepSeekMoE (January 2024) is another notable example.
  • Bidirectional Contextualization for LLMs: While many generative LLMs use mono-directional attention, models like Gecko (2024) and LLM2vec (2024) integrate bidirectional attention mechanisms to capture broader contextual dependencies for improved embeddings. GritLM (2024) further unifies embedding and generative tasks using bidirectional attention. However, BGE-ICL (2024) suggests that enabling bidirectional attention during embedding fine-tuning might conflict with the model's original generative pre-training setup 9.
  • Matryoshka Embeddings (ME): This technique trains embeddings to store crucial information in their initial dimensions, allowing flexible truncation for trade-offs in performance, speed, and memory usage 9 (see the truncation sketch after this list).
  • Multimodal LLMs: Frameworks such as NVLM (September 2024) explore various multimodal architectures, including Unified Embedding-Decoder and Cross-Modality Attention. DreamLLM (ICLR 2024) is another multimodal LLM framework designed for comprehensive comprehension and creation of interleaved documents by directly sampling in raw multimodal space.
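
Matryoshka-style embeddings can simply be cut to their leading dimensions and re-normalized at query time. The sketch below shows this trade-off mechanically; the dimension sizes are arbitrary, and the trick only pays off if the embedding was actually trained with an ME objective.

```python
# Matryoshka-style truncation: because ME training concentrates the most useful
# information in the leading dimensions, a full embedding can be cut to its
# first k dimensions and re-normalized for cheaper storage and faster search.
# Dimension sizes here are arbitrary; this assumes the model was trained with
# a Matryoshka objective.
import numpy as np

def truncate_embedding(vec: np.ndarray, k: int) -> np.ndarray:
    small = vec[:k]
    return small / np.linalg.norm(small)       # re-normalize for cosine similarity

full = np.random.randn(1024)                   # stand-in for a trained ME embedding
for k in (64, 256, 1024):
    print(k, truncate_embedding(full, k)[:3])  # smaller k: cheaper, usually less accurate
```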

III. Advancements in Pre-training and Fine-tuning Techniques

The evolution of code embedding models is marked by sophisticated advancements in training methodologies:

  • Embedding Derivation Strategies from LLMs:
    • Tuning-Free Methods: These methods extract embeddings directly from LLM hidden states without explicit training for embedding tasks. Examples include PromptBERT, PromptEOL, PromptSTH, PromptSUM, MetaEOL, GenEOL, and echo embeddings 9.
    • Tuning-Based Methods: These involve continued supervised fine-tuning using contrastive learning with paired text data 9. Embeddings are typically derived from the [EOS] token (e.g., SGPT (2022), GTE-Qwen2-7B (2023), E5-mistral-7b (2024), Echo-mistral-7b (2024), BGE-ICL (2024)) or by mean pooling of the last hidden layer (e.g., Instructor-XL (2023), GritLM-7B (2024), LLM2vec (2024), Gecko (2024), NV-Retriever (2024)); both pooling strategies are sketched after this list. More advanced techniques like using a latent attention layer are explored in NV-Embed (2024) 9.
  • Parameter-Efficient Tuning (PEFT): Techniques like Low-Rank Adaptation (LoRA) and its extension, Weight-Decomposed Low-Rank Adaptation (DoRA) (February 2024), have become essential for efficiently adapting LLMs. While LoRA may learn less new knowledge from distinct domains, it significantly reduces catastrophic forgetting, offering a balance for fine-tuning with limited resources.
  • Specialized Pre-training for Code Models:
    • Encoders: Beyond Masked Language Modeling (MLM), objectives now include Replaced Token Detection (RTD), Data Flow Graph (DFG) edge/node prediction (GraphCodeBERT, 2020), Identifier Prediction, AST Edge Prediction (SynCoBERT, 2021), Node Type MLM (DISCO, 2021), Type Inference (Code-MVP, May 2022), Deobfuscation (CodeSage, 2024), and various contrastive learning schemes 10.
    • Decoders: Causal Language Modeling (CLM) and Fill-in-the-Middle (FIM) are dominant objectives. Notable models include SantaCoder (January 2023), CodeGeeX (March 2023), StarCoder (May 2023), CodeFuse (October 2023), DeepSeek Coder (January 2024), StarCoder2 (February 2024), CodeShell (March 2024), CodeQwen1.5 (April 2024), and Granite (May 2024) 10.
    • Encoder-Decoders: Techniques such as Span Corruption (CodeT5+, May 2023), DAE, Identifier Tagging, Text2Code, Code2Text, and Text-Code Contrastive Learning are employed 10.
  • Instruction Fine-tuning (SFT) and Alignment: This has become a prevalent technique to enhance LLMs for code-specific tasks. Models like MFTCoder (November 2023), Magicoder (December 2023), WaveCoder (December 2023), Astraios (January 2024), DolphCoder (February 2024), SafeCoder (February 2024), CCT (March 2024), SAT (April 2024), CodeFort (April 2024), XFT (April 2024), AIEV-Instruct (May 2024), and AlchemistCoder (May 2024) leverage techniques including Evol-Instruct, Ranking Feedback (RRTF), multi-task fine-tuning, and specialized instruction sets 10.
  • Reinforcement Learning (RL) on Code: RL methods are increasingly utilized to improve code LLMs, often incorporating external feedback mechanisms. Examples include CodeRL (July 2022), PPOCoder (January 2023) (execution-based feedback), RLTF (July 2023) (unit test feedback), B-Coder (October 2023) (value-based), IRCoCo (January 2024) (immediate rewards), StepCoder (February 2024) (compiler feedback), and RLPF & DPA (April 2024) (performance-aligned) 10.
  • Contextual Expansion: Recent efforts involve expanding the context of queries or documents with few-shot examples, neighbor document information, generated thoughts, or synthesized user queries to enhance embeddings. The Reinforced Information Retrieval framework exemplifies this by mutually enhancing query-expansion and embedding models 9.
  • Dataset Development: The FineWeb Dataset (June 2024) provides 15 trillion tokens for LLM pre-training, facilitating the development of larger models. Similarly, the MathCodeInstruct dataset (ICLR 2024), used by MathCoder, combines natural language, code, and execution results for mathematical reasoning.
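
The two pooling strategies mentioned above (last-token/[EOS] pooling and attention-masked mean pooling of the final hidden layer) can be written generically for any decoder-only model exposed through transformers. The checkpoint name below is a placeholder chosen only because it is small and public; the specific models cited above use their own pooling and prompting recipes.

```python
# Two common ways to turn a decoder-only LLM's hidden states into a single
# embedding: last-token ([EOS]-style) pooling and attention-masked mean pooling.
# The checkpoint name is a placeholder; any causal LM on the Hugging Face Hub
# that exposes last_hidden_state works the same way.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # placeholder decoder-only model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

batch = tokenizer(["def add(a, b): return a + b"], return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state             # (1, seq_len, dim)
mask = batch["attention_mask"]

# Last-token pooling: take the state of the final non-padding token.
last_index = mask.sum(dim=1) - 1
last_token_emb = hidden[torch.arange(hidden.size(0)), last_index]

# Mean pooling: average all non-padding token states.
mean_emb = (hidden * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)

print(last_token_emb.shape, mean_emb.shape)
```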

IV. Impact on Performance and Capabilities in Code-Related Tasks

LLMs for code have demonstrated significant advancements across various software engineering tasks:

  • Code Completion: LLMs substantially enhance code completion performance across multiple programming languages and contexts, boosting developer productivity by accurately predicting relevant code snippets 11.
  • Code Generation: While LLMs show strong aptitude, complex tasks still present challenges, especially regarding self-repair capabilities unless augmented with robust feedback from stronger models or human input. CodeChain (ICLR 2024) improves modularity and correctness in code generation for complex problems by encouraging self-revision and reuse of sub-modules 12.
  • Mathematical Reasoning: MathCoder (ICLR 2024) models, fine-tuned on code and execution results, have achieved state-of-the-art scores in mathematical datasets, even surpassing GPT-4 on competition-level problems 12.
  • Code-based Reasoning: AgentBench (ICLR 2024) evaluates LLMs as agents, indicating that training on code and high-quality alignment data improves their reasoning and decision-making in interactive environments. Projects under "Coding for Reasoning" (e.g., PAL, PoT, CoC) utilize LLMs to generate code for enhanced reasoning tasks.
  • Resource Efficiency: SliceGPT (ICLR 2024) introduces a novel sparsification technique that reduces LLM embedding dimensions and overall parameters (up to 25% for models like OPT 66B and Llama-2 70B) without significant accuracy loss, leading to faster inference and lower compute costs. CLEX (ICLR 2024) effectively extends Transformer context windows up to 4x training length with negligible impact on latency 12.
  • Retrieval-Augmented Generation (RAG): Retrieval-augmentation proves as effective as long context windows, and combining both strategies further boosts performance in tasks like question answering and query-based summarization. SuRe (Summarized Retrieval, ICLR 2024) enhances open-domain QA by providing summaries of retrieved passages, leading to more grounded answers 12.

V. Observed Shifts in Research Paradigms

The research landscape for code embedding models has evolved significantly:

  • LLMs as Foundational Embedders: A clear trend exists towards leveraging the inherent capabilities of large generative LLMs directly for embedding tasks, moving beyond traditional, task-specific embedding models 9.
  • Focus on Data Quality and Quantity: There is an increased emphasis on creating vast, high-quality, and diverse datasets (e.g., FineWeb) for pre-training, and on developing sophisticated data generation and curation techniques for fine-tuning.
  • Alignment and Behavioral Control: Significant investment in instruction tuning and reinforcement learning from human or machine feedback is evident to align LLMs' behavior with specific code-related goals and human preferences. Insights suggest that sometimes in-context learning alone can achieve alignment comparable to weight updates (e.g., Urial, ICLR 2024).
  • Computational Efficiency: Efforts to make LLMs more deployable and sustainable continue through innovations such as MoE architectures, parameter-efficient fine-tuning (LoRA, DoRA), and optimized inference-time computation.
  • Modularity and Advanced Reasoning: A growing trend towards developing LLMs that can handle complex, multi-step reasoning tasks by promoting modular code generation and internal "journey learning" processes, such as in OpenAI's O1 (discussed in O1 Replication Journey, October 2024).
  • Multimodality Integration: Increasing exploration of combining code with other modalities (e.g., images) within LLM frameworks to support broader applications is also a developing area 13.

Introduction to Code Embedding Models

Code embedding models are a fundamental concept in applying machine learning to software engineering, designed to convert source code into continuous vector spaces. This transformation facilitates various code-related tasks by enabling machines to understand and process programming languages more effectively 14. By representing code as dense numerical vectors, these models can capture the intricate semantic and syntactic relationships within code, which is crucial for automating complex software development processes. The importance of code embeddings spans across numerous applications in modern software engineering and machine learning, including but not limited to code search, code completion, bug detection, vulnerability analysis, and code generation.

This report delves into the foundational architectures, source code representation approaches, and pre-training objectives of several prominent code embedding models. Specifically, we will explore CodeBERT, which treats code primarily as sequences of tokens; GraphCodeBERT, which incorporates code structure through data flow; CodeT5, an encoder-decoder model leveraging token type information; and UniXCoder, a unified cross-modal model integrating Abstract Syntax Trees (ASTs) and comments 14. These models showcase diverse strategies for effectively embedding source code to support a wide array of downstream applications.

Evaluation Metrics and Benchmarking

To assess the quality and performance of code embedding models, researchers utilize a variety of standard and specialized evaluation metrics alongside diverse benchmark datasets tailored to specific programming tasks and languages. These evaluations aim to quantify and compare the effectiveness of models in tasks such as code search, code completion, bug detection, and code generation 15. Evaluation approaches are broadly categorized as intrinsic, which measure properties of the embedding space directly, and extrinsic, which measure performance on downstream tasks 17. While intrinsic metrics are fast and interpretable, their accuracy in predicting real-world performance can be limited 18.

Evaluation Metrics

Evaluation metrics for code embedding models can be categorized into similarity-based, execution-based, feedback-based/qualitative, and general large language model (LLM) evaluation metrics.

  1. Similarity-Based Metrics: These metrics assess the resemblance between generated code and reference solutions.

    • Traditional NLP Metrics:
      • BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between generated and reference code; it remains widely used in preliminary evaluations even though code's rigid syntax and freely renamable identifiers limit how well surface overlap reflects correctness 16. CodeXGLUE employs BLEU for tasks such as code repair, translation, text-to-code generation, and summarization 19.
      • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Focuses on the recall of n-grams, which is useful for capturing logical steps even with different variable names 16.
      • METEOR (Metric for Evaluation of Translation with Explicit ORdering): Combines precision, recall, lexical matching, and semantic relationships to offer a nuanced approach to code evaluation 16.
      • Exact Match (EM): A stringent metric requiring the generated code to be identical to the reference in syntax, structure, and content 16. It is applied in CodeXGLUE for tasks including code completion, code repair, text-to-code generation, and code summarization 19.
      • Edit Distance (Levenshtein Distance): Quantifies the minimum single-character edits needed to transform one code sequence into another, providing a flexible measure that tolerates minor differences 16. CodeXGLUE uses it for line-level code completion 19.
    • Code-Specific Similarity Metrics:
      • CodeBLEU: Extends the traditional BLEU metric by integrating weighted n-gram matching, syntactic Abstract Syntax Tree (AST) matching, and semantic data flow matching to capture internal logic and functionality 16. It has shown a higher Pearson correlation with programmer scoring for natural language to code generation 16, and is utilized in CodeXGLUE for code repair, code translation, text-to-code generation, and code summarization 19.
      • CrystalBLEU: A refined BLEU-based metric designed to measure code similarity more precisely by reducing noise from trivially shared n-grams inherent in programming languages, while maintaining language-agnosticism and efficiency 20.
      • Data Flow Analysis and Semantic Similarity: These metrics evaluate code quality by comparing data flows or the functional and behavioral equivalence of code snippets 16.
  2. Execution-Based Metrics: These metrics focus on the functional correctness and performance of generated code.

    • Compilation/Interpretation Success Rate: Assesses whether generated code can be executed without syntactic errors, indicating adherence to programming language rules 16.
    • Unit Test Pass Rate (Pass@k): Measures the proportion of generated code snippets that successfully pass a set of predefined unit tests 16. Pass@k estimates the probability that at least one of k generated samples for a problem passes all tests; it is a foundational metric, particularly for benchmarks like HumanEval 16 (a standard estimator is sketched after this list).
    • Performance and Efficiency Evaluation: Quantifies the actual runtime performance, including time and space complexity, to ensure generated code is not only correct but also optimized for practical applications 16. Benchmarks such as EffiBench and Mercury are specifically designed for this purpose 16.
  3. Feedback-Based and Qualitative Metrics: These involve human judgment or high-level analysis.

    • Blind Peer Review: Reviewers evaluate code quality, including functionality, clarity, and maintainability, without knowing the source model to mitigate bias 16.
    • Real-World Application Evaluation: Generated code is deployed in live environments to assess its practicality, reliability, error rates, debugging effort, and maintenance costs 16.
    • Readability Evaluation: Involves human assessment of code clarity, logical structure, naming conventions, and commenting practices 16.
    • Maintainability Evaluation: Focuses on the ease with which code can be updated or modified in the future, considering modular design, documentation, and adherence to coding standards 16.
    • Embedding Visualization: Techniques like t-distributed Stochastic Neighbor Embedding (t-SNE) reduce high-dimensional embeddings to 2D or 3D representations, allowing for qualitative analysis of clustering patterns, outliers, and semantic relationships 18.
  4. General LLM Evaluation Metrics (applicable to code outputs):

    • Factuality: Verifies the accuracy and correctness of information in the output 22.
    • Relevance: Determines how well the output addresses the given input or query 22.
    • Coherence and Fluency: Evaluate the logical flow, consistency, grammatical correctness, and naturalness of the generated text 22.
    • Safety and Moderation: Checks for harmful content, bias, or violations of content policies 22.
    • Specialized Format Checks: Include JSON validity, SQL correctness (syntactic and semantic), and Numeric Difference for structured, query, and numerical outputs respectively 22.
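
For Pass@k specifically, the widely used unbiased estimator introduced with HumanEval computes, for n generated samples of which c pass the tests, the probability that a random size-k subset contains at least one passing sample. A small implementation is sketched below; the sample counts in the example are arbitrary.

```python
# Unbiased pass@k estimator (as popularized by the HumanEval evaluation): given
# n generated samples for a problem, of which c pass all unit tests, estimate
# the probability that at least one of k randomly chosen samples passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k subset must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 13 of them pass: estimate pass@1 and pass@10
print(pass_at_k(200, 13, 1))   # ~0.065
print(pass_at_k(200, 13, 10))  # ~0.49
```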

Benchmark Datasets

Benchmark datasets provide standardized environments for evaluating code embedding models across various tasks and programming languages:

| Benchmark Name | Primary Task(s) | Key Features / Description | Programming Languages | Key Metrics |
| --- | --- | --- | --- | --- |
| HumanEval | Code Generation | 164 hand-written Python programming tasks, each with natural language descriptions and unit tests for functional correctness. Widely used for assessing LLM capabilities 15. | Python | Pass@k (Pass@1, Pass@5, Pass@10) 15 |
| MBPP | Code Generation | 974 entry-level Python tasks from natural language descriptions, covering common programming concepts, with example solutions and test cases 15. | Python | Pass@k 16 |
| CodeXGLUE | Code Understanding & Generation | Features 14 datasets and 10 diverse tasks including clone detection, defect detection, code completion, translation, search, repair, text-to-code generation, summarization, and documentation translation 15. | Java, C/C++, Python, PHP, JavaScript, Ruby, Go, C# | BLEU, Exact Match Accuracy, F1 score, CodeBLEU, Mean Average Precision (MAP), Mean Reciprocal Rank (MRR), Levenshtein Edit Similarity 16 |
| SWE-bench | Bug Detection / Program Repair | Challenges models to resolve real-world software issues from GitHub, sourcing over 2200 issues and pull requests from 12 widely used Python repositories 15. | Python | Successful patch generation (implied) 15 |
| DS-1000 | Code Generation | 1000 coding challenges derived from StackOverflow questions, focusing on data science tasks and spanning seven popular Python libraries (including NumPy, Pandas, TensorFlow, PyTorch, and scikit-learn) 15. | Python | Functional correctness via test cases, adherence to API usage constraints 15 |
| APPS | Code Generation | 10,000 problems from competitive programming platforms (Codeforces, Kattis), ranging from simple to complex algorithmic challenges, with test cases and ground-truth solutions 15. | Python | Successful code generation and test passage (implied) 15 |
| EvalPlus | Functional Correctness | An evaluation framework that significantly augments test cases for benchmarks like HumanEval (80x) and MBPP (35x) using an automatic test input generator 15. | Python (extends existing benchmarks) | Functional correctness 15 |
| CrossCodeEval | Cross-File Code Completion | A multilingual benchmark designed to evaluate code completion across multiple files within a project, capturing real-world dependencies and modularity 15. | Python, Java, TypeScript, C# | Accurate cross-file completion (implied) 15 |
| RepoBench | Repository-Level Code Auto-completion | Comprises three tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline, combining retrieval and completion), reflecting real-world challenges with cross-file information 15. | Python, Java | Retrieval relevance, next line prediction (implied) 15 |
| Code Lingua | Programming Language Translation | Evaluates programming language translation by assessing models' ability to understand code semantics and translate them into a target language, tracking bug introduction and fixes 15. Includes datasets like CodeNet and Avatar. | C, C++, Go, Java, Python | Semantic fidelity, bug tracking (introduced/fixed) 15 |
| ClassEval | Class-Level Code Generation | A manually crafted benchmark consisting of 100 class-level Python coding tasks (over 400 methods), designed with dependencies to mirror real-world software engineering scenarios 15. | Python | Correctness via provided test suites 15 |
| LiveCodeBench | Code Generation, Self-Repair, Execution, Prediction | Evaluates LLM coding abilities on 400 problems from platforms like LeetCode, AtCoder, and CodeForces, with problem release dates to assess generalization to unseen tasks 15. | Multiple (from competitive platforms) | Code generation, self-repair, code execution, test output prediction 15 |
| CodeElo | Competition-Style Code Generation | Uses problems from Codeforces to evaluate LLMs on competitive programming tasks, employing an Elo rating system for performance comparison against human contestants 15. | Not specified | Elo rating 15 |
| ResearchCodeBench | Research Code Implementation | Contains 212 coding challenges derived from recent machine learning research papers, tasking LLMs to implement executable code from conceptual descriptions and context 15. | Not specified | Functional correctness against curated tests 15 |
| SciCode | Scientific Problem Solving (Code Generation) | Curated by scientists, this benchmark includes 80 main problems (338 subproblems) in six natural science domains (math, physics, chemistry, biology, materials science, computational) to test knowledge recall, reasoning, and code synthesis 15. | Not specified | Functional correctness via gold-standard solutions and test cases 15 |
| CodeNet | Code Classification, Completion, Similarity | Developed by IBM, it is a large-scale dataset with over 14 million code samples and approximately 5000 problems 16. | 55 languages | Task-dependent (e.g., classification accuracy, completion accuracy, similarity scores) 16 |
| CoderUJB | Functional Code Generation, Test Generation, Program Repair, Defect Detection | A comprehensive Java benchmark with 2,239 programming problems extracted from 17 open-source Java projects, providing complete project context for various tasks 16. | Java | Pass@k, count@n, coverage@n, accuracy 16 |
| VerilogEval | Verilog Code Generation & Verification | Dedicated to hardware design tasks (combinational/sequential logic, state machine design), with detailed natural language descriptions and design constraints 16. | Verilog (HDL) | Synthesis success rate, simulation pass rate, design performance (timing, power, area) 16 |
| Spider | Text-to-SQL Generation | A complex and cross-domain benchmark comprising thousands of natural language questions and their corresponding SQL queries 16. | SQL | SQL correctness (implied) 16 |
| AtCoder, CodeContest | Competitive Programming | Datasets with coding problems from competitive programming platforms like AtCoder, Codeforces, LeetCode, and HackerRank, assessing models' ability to generate correct and efficient solutions 16. | Multiple | Not explicitly listed |
| Defects4J | Bug Detection / Program Repair | A benchmark specifically designed for bug detection and automated program repair 16. | Not specified | Bug detection accuracy, successful repair (implied) 16 |