CodeBERT model has the state-of-the-art performances in tasks related to programming language processing23. It features capturing semantic connections between natural language and programming language. According to Yuan et al.34, CodeBERT can achieve 61% of accuracy in software vulnerabilities discovery which is generally higher than mainstream models Word2Vec35, FastText36 and GloVe37 (46%, 41% and 29% respectively). In our research, smart contracts are based on programming language Solidity. Therefore, we optimize the CodeBERT model and employ it in our study. CNN is a commonly used and typical deep learning model with an excellent generality in processing images and texts. LSTM is also a deep learning model featuring in processing long texts and it can effectively learn time sequence in texts which CNN is not adaptive to do. Both CNN and LSTM have achieved significantly high accuracy (0.958 and 0.959 respectively) in source code vulnerabilities detection, according to Xu et al.38. We attempt to employ CNN and LSTM models as comparisons with CodeBERT model and further analyze the performances of them in our tasks.
Figure 1 illustrates the complete process of developing a vulnerability detection model called Lightning Cat for smart contracts, which consists of three stages. The first stage involves building and preprocessing the labeled dataset of vulnerable Solidity code. In the second stage, training three models (Optimized-CodeBERT, Optimized-LSTM, and Optimized-CNN) and comparing their performance to determine the best one. Finally, in the third stage, the selected model is evaluated using the Sodifi-benchmark dataset to assess its effectiveness in detecting vulnerabilities.
Lightning Cat Model Development Process.
During the data preprocessing phase, we collect three datasets and subsequently perform data cleaning. Finally, we employ the CodeBERT model to encode the data.
Our primary training dataset comprises three main sources: 10,000 contracts from the Slither Audited Smart Contracts Dataset39, 20,000 contracts from smartbugs-wild40, and 1,000 typical smart contracts with vulnerabilities identified through expert audits, overall 31,000 contracts. To effectively compare results with other auditing tools, we choose to use the SolidiFI benchmark dataset41 as our test set, a dataset containing contracts containing 9,369 bugs.
Within our test set SolidiFI-benchmark, there are three static detection tools which are Slither, Mythril, and Smatcheck as well as all identified four common vulnerability types which are Re-entrancy, Timestamp-Dependency, Unhandled-Exception, and tx.origin. To ensure the completeness and fairness of the results, our proposed Lightning Cat model primarily focused on these four types of vulnerabilities for comparison. Table 1 displays the mapping between the four types of vulnerabilities and the three auditing tools.
Considering that a complete contract might consist of multiple Solidity files and a single Solidity file might contain several vulnerable code snippets, we utilized the Slither tool to extract 30,000 functions containing these four types of vulnerabilities from the data sources39,40. Additionally, we manually annotate the problematic code snippets within the contracts audited by experts, overall 1,909 snippets. The training set comprises 31,909 code snippets. For the test set, we extract 5,434 code snippets related to these four vulnerabilities from the SolidiFI-benchmark dataset. The processing procedures for the training and test sets can be seen in Fig. 2.
The length of a smart contract typically depends on its functionality and complexity. Some complex contracts can exceed several thousand tokens. However, handling long text has been a long-standing challenge in deep learning42. Transformer-based models can only handle a maximum of 512 tokens. Therefore, we attempted two methods to address the issue of text length exceeding 510 tokens.
The data is split into chunks of 510 tokens each, and all the chunks are assigned the same label. For example, if we have a group of Re-entrancy vulnerability code with a length of 2000 tokens, it would be split into four chunks, each containing 510 tokens. If there are chunks with fewer than 510 tokens, we pad them with zeros. However, the training results show that the models loss does not converge. We speculate that this is due to the introduction of noise from unrelated chunks, which negatively affects the models generalization capability.
Audit experts extracted the function code of vulnerabilities from smart contracts and assigned corresponding vulnerability labels. If the extracted code exceeds 510 tokens, it is truncated, and if the code falls short of 510 tokens, it is padded with zeros. This approach ensures consistent input data length, addresses the length limitation of Transformer models, and preserves the characteristics of the vulnerabilities.
After comparing the two methods, we observed that training on vulnerability-based function code helped the models loss function converge better. Therefore, we chose to use this data processing method in subsequent experiments. Additionally, we removed unrelated characters such as comments and newline characters from the functions to enhance the models performance. As shown in Fig. 3, we only extracted the function parts containing the vulnerability code, reducing the length of the training dataset while maintaining the vulnerability characteristics. This approach not only improves the models accuracy, but also enhances its generalization ability.
Extraction of Vulnerable Function Code (We partition the smart contract as a whole and extract only the functions where the vulnerabilities are present. In the provided image, we focus on the withdrawALL function, which serves as our training dataset. If a contract contains multiple vulnerabilities, we extract multiple corresponding functions).
CodeBERT is a pretraining model based on the Transformer architecture, specifically designed for learning and processing source code. By undergoing pretraining on extensive code corpora, CodeBERT acquires knowledge of the syntax and semantic relationships inherent in source code, as well as the interactive dynamics between different code segments.
During the data preprocessing stage, CodeBERT is employed due to its strong representation ability. The source code undergoes tokenization, where it is segmented into tokens that represent semantic units. Subsequently, the tokenized sequence is encoded into numerical representations, with each token mapped to a unique integer ID, forming the input token ID sequence. To meet the models input requirements, padding and truncation operations are applied, ensuring a fixed sequence length. Additionally, an attention mask is generated to distinguish relevant positions from padded positions containing invalid information. Thus, the processed data includes input IDs and attention masks, transforming the source code text into a numericalized format compatible with the model while indicating the relevant information through the attention mask.
For Optimized-LSTM and Optimized-CNN models, direct processing of input IDs and masks is not feasible. Therefore, CodeBERT is utilized to further process the data and convert it into tensor representations of embedding vectors. The input IDs and attention masks obtained from the preprocessing steps are passed to the CodeBERT model to obtain meaningful representations of the source code data. These embedding vectors can be used as inputs for Optimized-LSTM and Optimized-CNN models, facilitating their integration for subsequent vulnerability detection.
In the current stage, our approach involves the utilization of three machine learning models: Optimized-CodeBERT, Optimized-LSTM, and Optimized-CNN. The CodeBERT model is specifically fine-tuned to enhance its compatibility with the target task by accepting preprocessed input IDs and attention masks as input. However, in the case of Optimized-LSTM and Optimized-CNN models, we do not conduct any fine-tuning on the CodeBERT model for data preprocessing.
CodeBERT is a specialized application that utilizes the Transformer model for learning code representations in code-related tasks. In this paper, we focus on fine-tuning the CodeBERT model to specifically address the needs of smart contract vulnerability detection. The CodeBERT model is built upon the Transformer architecture, which comprises multiple encoder layers. Prior to entering the encoder layers of CodeBERT, the input data undergoes an embedding process. Following the encoding stage of CodeBERT, fully connected layers are added for classification purposes. The model architecture of our CodeBERT implementation is depicted in Fig. 4.
Our Optimized-CodeBERT Model Architecture.
Word Embedding and Position Encoding In the data preprocessing stage, we have utilized a specialized CodeBERT tokenizer to process each word into the input information. In this model tranining stage, the tokenizer employs embedding methods, which are used to convert text or symbol data into vector representations. This processing transforms each word into a 512-dimensional word embedding. In addition, we introduce position embedding, which is a technique introduced to assist the model in understanding the positional information within the sequence. It associates each position with a specific vector representation to express the relative positions of tokens in the sequence. For a given position i and dimension k, the Position Encoding (text {PE}(i, k)) is computed as follows:
$$begin{aligned} text {PE}(i, k) = {left{ begin{array}{ll} sin left( frac{i}{10000^{2k/d}}right) &{} text {if } k text { is even} \ cos left( frac{i}{10000^{2k/d}}right) &{} text {if } k text { is odd} end{array}right. } end{aligned}$$
Here, d represents the dimension of the input sequence. The formula utilizes sine and cosine functions to generate position vectors, injecting positional information into the embeddings. The exponential term (frac{i}{10000^{2k/d}}) controls the rate of change of the position encoding, ensuring differentiation among positions. By adding the Position Encoding to the Word Embedding, positional information is integrated into the embedded representation of the input sequence. This enables CodeBERT to better comprehend the semantics and contextual relationships of different positions in the code. The processing steps are illustrated in Fig. 5.
Word and Position Embedding Process.
Encoder layers The CodeBERT model performs deep representation learning by stacking multiple encoder layers. Each encoder layer comprises two sub-layers: multi-head self-attention and feed-forward neural network. The self-attention mechanism helps encode the relationships and dependencies between different positions in the input sequence. The feed-forward neural network is responsible for independently transforming and mapping the features at each position.
The multi-head self-attention mechanism calculates attention weights, denoted as (w_{ij}), for each position i in the input code sequence. The attention weights are computed using the following equation:
$$begin{aligned} w_{ij} = text {Softmax}left( frac{{q_i cdot k_j}}{sqrt{d}}right) end{aligned}$$
Here, (q_i) represents the query at position i, (k_j) denotes the key at position j, and d is the dimension of the queries and keys. The output of the self-attention mechanism at position i, denoted as (o_i), is obtained by multiplying the attention weights (w_{ij}) with their corresponding values (v_j) and summing them up:
$$begin{aligned} o_i = sum _{j=1}^{n} w_{ij} cdot v_j end{aligned}$$
where n is the length of the input sequence.
Each encoder layer also contains a feed-forward neural network sub-layer, which processes the output of the self-attention sub-layer using the following equation:
$$begin{aligned} text {FFN}(x) = text {ReLU}(x cdot W_1 + b_1) cdot W_2 + b_2 end{aligned}$$
Here, x represents the output of the self-attention sub-layer, and (W_1, b_1) and (W_2, b_2) are the parameters of the feed-forward neural network.
Fully connected layers To output the classification labels, we added fully connected layers. Firstly, we added a new linear layer with 100 features on top of the existing linear layer. To avoid the limited capacity of a single linear layer, we utilized the ReLU activation function. Additionally, to prevent overfitting, we introduced a dropout layer with a dropout rate of 0.1 after the activation layer. Lastly, we used a linear layer with four features for the output. During the fine-tuning process, the parameters of these new layers were updated.
The Optimized-LSTM model is specifically designed for processing sequential data, capable of capturing temporal dependencies and syntactic-semantic information43. For the task of smart contract vulnerability detection, our constructed Optimized-LSTM model provides a serialization-based representation of Solidity source code, taking into account the order of statements and function calls. The Optimized-LSTM model captures the syntax, semantics, and dependencies within the code, enabling an understanding of the logical structure and execution flow. Compared to traditional RNNs, the Optimized-LSTM model we constructed addresses the issue of vanishing or exploding gradients when handling long sequences44. This is accomplished through the key mechanism of gated cells, which enable selective retention or forgetting of previous states. The model consists of shared components across time steps, including the cell, input gate, output gate, and forget gate. In the Optimized-LSTM model, we have defined an LSTM layer and a fully connected layer, with the LSTM layer being the core component. Within the LSTM layer, the input (x^{(t)}), the output from the previous time step (h^{(t-1)}), and the cell state from the previous time step (c^{(t-1)}) are fed into an LSTM unit. This unit contains a forget gate (f^{(t)}), an input gate (i^{(t)}), and an output gate (o^{(t)}), as shown in Fig. 6.
The Architecture of Optimized-LSTM.
In the model, we utilize a bidirectional Optimized-LSTM, where the forward Optimized-LSTM and backward Optimized-LSTM are independent and concatenated at the final step. This allows for better capture of long-term dependencies and local correlations within the sequence. During the forward propagation of the model, the input x is first passed through the Optimized-LSTM layer to obtain the output h and the final cell state c. Since the lengths of the data instances may vary, we calculate the average output by averaging the outputs at each time step in h. Then, the average output is fed into a fully connected layer to obtain the final prediction output y. We used the cross-entropy loss function L for training, which is defined as:
$$begin{aligned} L_i=-sum _{j=1}^N y_{i,j}log {hat{y}}_{i,j}. end{aligned}$$
Here, N represents the number of classes, (y_{(i,j)}) denotes the probability of the jth class in the true label of sample i, and ({hat{y}}_{(i,j)}) represents the probability of sample i being predicted as the jth class by the model.
The Convolutional Neural Network (CNN) is a feedforward neural network that exhibits remarkable advantages when processing two-dimensional data, such as the two-dimensional structures represented by code45. In our model design, we transform the code token sequence into a matrix, and CNN efficiently extracts local features of the code and captures the spatial structure, effectively capturing the syntax structure, relationships between code blocks, and important patterns within the code.
The Optimized-CNN primarily consists of convolutional layers, pooling layers, fully connected layers, and activation functions. Its core idea is to extract features from input data through convolution operations, reduce the dimensionality of feature maps through pooling layers, and ultimately perform classification or regression tasks through fully connected layers46. The key module of the Optimized-CNN is the convolutional layer, which is computed as follows:
$$begin{aligned} y_{i,j}=sigma left( sum _{k=1}^Ksum _{l=1}^Lsum _{m=1}^M w_{k,l,m}x_{i+l-1,j+m-1,k}+bright) end{aligned}$$
Here, (x_{(i,j,k)}) represents the element value of the input data at the i-th row, j-th column, and k-th channel, (w_{(k,l,m)}) represents the weight value of the k-th channel, l-th row, and m-th column of the convolutional kernel, and b represents the bias term. (sigma) denotes the activation function, and in this case, we use the Rectified Linear Unit (ReLU).
The output of the convolutional layer is passed to the pooling layer for further processing. The commonly used pooling methods are Max Pooling and Average Pooling. In this case, we employ Max Pooling, and the calculation formula is as follows:
$$begin{aligned} y_{i,j}=max limits _{m=1}^Mmax limits _{n=1}^N x_{i+m-1,j+n-1} end{aligned}$$
Pooling operations can reduce the dimensionality of feature maps, model parameters, and to some extent alleviate overfitting issues. Finally, a fully connected layer is used to compute the model, which is expressed as:
$$begin{aligned} y=sigma (Wx+b) end{aligned}$$
Here, x represents the output of the previous layer, W and b denote the weights and bias terms, and (sigma) is the activation function. By stacking multiple convolutional layers, pooling layers, and fully connected layers, we construct a Optimized-CNN model as shown in Fig. 7, which has powerful feature extraction and classification capabilities for smart contract classification.
The Architecture of Optimized-CNN.
Read this article:
Deep learning-based solution for smart contract vulnerabilities ... - Nature.com
Read More..