Bit-LoRA as an application of BitNet and 1.58 bit neural network technologies – Towards Data Science

For the experiments I took my own proprietary data for a text generation task. The data itself is not that important here; it is essentially a small subset of an instruction dataset of the kind used to train instruction-following LLMs. As the base model I decided to use microsoft/Phi-3-mini-4k-instruct. I ran 3 epochs of LoRA adapter tuning with fp16 mixed-precision training using the Hugging Face Trainer and measured the evaluation loss. After that I implemented BitNet (replacing the linear layers in the LoRA adapters) and 1.58-bit LoRA training and report the results. During training I used 4-bit base model quantization with BitsAndBytes, i.e. the QLoRA configuration.

The following LoRA hyperparameters were used: rank = 32, alpha = 16, dropout = 0.05.

3.1. Classic LoRA training

For all LoRA experiments the QLoRA approach was used: the base model is quantized with NF4 and LoRA is applied to all linear layers of the base model. The optimizer is Paged AdamW with warmup and cosine annealing down to 90% of the maximum learning rate. The maximum learning rate equals 2e-4. The train/test split was random, with the test set being 10% of the whole dataset.
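For reference, here is a sketch of this configuration with transformers, peft and bitsandbytes; values that are not stated in the post (such as the warmup length and the output directory) are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

# QLoRA setup: 4-bit NF4 quantization of the base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
)

# LoRA on all linear layers with the hyperparameters listed above
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="lora-baseline",      # placeholder
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",      # the post's schedule anneals only to 90% of the max LR
    warmup_ratio=0.03,               # placeholder; the warmup length is not stated in the post
    optim="paged_adamw_32bit",
    fp16=True,
)
```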

3.2. LoRA BitNet implementation

For BitNet LoRA training, the approach from BitNet: Scaling 1-bit Transformers for Large Language Models was used, together with the code of its implementation. Following the BitNet paper, the weights of the LoRA layers are binarized with scaling:
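In the paper's notation:

$$\tilde{W} = \mathrm{Sign}(W - \alpha), \qquad \mathrm{Sign}(W_{ij}) = \begin{cases} +1, & W_{ij} > 0 \\ -1, & W_{ij} \le 0 \end{cases}, \qquad \alpha = \frac{1}{nm}\sum_{ij} W_{ij}$$

with the output rescaled by $\beta = \frac{1}{nm}\lVert W \rVert_1$, the mean absolute value of the layer's weights.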

At the same time, activations should also be quantized according to the paper:
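Again in the paper's notation:

$$\tilde{x} = \mathrm{Quant}(x) = \mathrm{Clip}\!\left(x \times \frac{Q_b}{\gamma},\, -Q_b + \epsilon,\, Q_b - \epsilon\right), \qquad \mathrm{Clip}(x, a, b) = \max(a, \min(b, x)), \qquad \gamma = \lVert x \rVert_\infty$$

where $Q_b = 2^{b-1}$ for $b$-bit activations. The input is normalized with $\mathrm{LN}(x)$ before quantization, so the full layer computes $y = \tilde{W}\,\mathrm{Quant}(\mathrm{LN}(x)) \times \frac{\beta\gamma}{Q_b}$.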

From the formulas above you can see that each parameter is transformed with the sign function into either +1 or -1, these binarized parameters are multiplied by the quantized and normalized input x, and the result is scaled with the mean absolute value of the layer's parameters. Code implementation:
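What follows is a minimal sketch of such a BitLinear layer written against the formulas above; it assumes a straight-through estimator for gradients and folds the per-layer scale into a weight_quant helper, so it is simplified relative to the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def weight_quant(w: torch.Tensor) -> torch.Tensor:
    """Binarize weights to {-1, +1} around their mean and rescale by the
    mean absolute value (the per-layer high-precision scale)."""
    alpha = w.mean()
    beta = w.abs().mean()
    return torch.sign(w - alpha) * beta


class BitLinear(nn.Linear):
    """Linear layer with binary weights and 8-bit activations; the master
    weights stay in high precision and are quantized on the fly."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        eps = 1e-5
        Qb = 127  # 8-bit activation range

        # LN(x): normalize the input, as in the BitNet paper
        x_norm = F.layer_norm(x, x.shape[-1:])

        # Absmax quantization of activations into [-Qb, Qb]
        gamma = x_norm.abs().max().clamp(min=eps)
        x_quant = (x_norm * Qb / gamma).clamp(-Qb + eps, Qb - eps)

        # Straight-through estimator: quantized values in the forward pass,
        # gradients flow into the full-precision master weights
        x_quant = x_norm + (x_quant - x_norm).detach()
        w_quant = self.weight + (weight_quant(self.weight) - self.weight).detach()

        # Matmul with binarized weights, then dequantize the activations back
        y = F.linear(x_quant, w_quant) * gamma / Qb
        if self.bias is not None:
            y = y + self.bias
        return y
```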

The original implementation can be found in the https://github.com/kyegomez/BitNet GitHub repository.

After LoRA training the adapter weights can be merged with the base model, because each LoRA adapter is just a pair of linear layers without biases or non-linear activations. However, the normalization of activations (LN(x)) and their quantization in this approach make merging the LoRA adapters more difficult: after the merge the LoRA adapter shares the same inputs as the base model's linear layers, and those layers work with activations without any additional modifications. That is why an additional experiment without normalization and activation quantization was conducted, and it led to better performance. To make this modification we only need to change the forward method of the BitLinear class:
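A sketch of the modified class, under the same assumptions as the layer above:

```python
class BitLinear(nn.Linear):
    """BitLinear without LN(x) and activation quantization, so that after the
    merge the adapter sees exactly the same inputs as the base model layers."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the weights are quantized (with a straight-through estimator);
        # activations pass through unchanged
        w_quant = self.weight + (weight_quant(self.weight) - self.weight).detach()
        return F.linear(x, w_quant, self.bias)
```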

The presented code is quantization-aware training: the master weights of each BitLinear layer are still kept in high precision, while we binarize the weights during the forward pass (the same can be done during model inference). The only issue is that we additionally have a scale parameter that is individual to each layer and stays in high precision.

After we have BitLinear layers, we need to replace the linear layers in the LoRA adapter with these new layers to apply the BitLinear modification to classic LoRA. To do so we can rewrite the update_layer method of the LoraLayer class (peft.tuners.lora.layer.LoraLayer) with the same method but with BitLinear layers instead of Linear:
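A sketch of such a class, assuming a reasonably recent peft version; the exact update_layer signature varies between releases (extra arguments are absorbed with **kwargs), and the device/dtype handling of the real method is omitted:

```python
import math
import torch.nn as nn
from peft.tuners.lora.layer import LoraLayer


class BitLoraLayer(LoraLayer):
    def update_layer(self, adapter_name, r, lora_alpha, lora_dropout,
                     init_lora_weights=True, **kwargs):
        # Same bookkeeping as peft's LoraLayer.update_layer, but the low-rank
        # projections are BitLinear modules instead of nn.Linear.
        if r <= 0:
            raise ValueError(f"`r` should be a positive integer, got {r}")

        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha
        self.lora_dropout[adapter_name] = (
            nn.Dropout(p=lora_dropout) if lora_dropout > 0.0 else nn.Identity()
        )

        # The low-rank update itself: two bias-free BitLinear layers
        self.lora_A[adapter_name] = BitLinear(self.in_features, r, bias=False)
        self.lora_B[adapter_name] = BitLinear(r, self.out_features, bias=False)
        self.scaling[adapter_name] = lora_alpha / r

        if init_lora_weights:
            # Standard LoRA initialization: A ~ Kaiming-uniform, B = 0
            nn.init.kaiming_uniform_(self.lora_A[adapter_name].weight, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B[adapter_name].weight)
```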

After we create such a class we can replace the update_layer method of the original LoraLayer with the new one:
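For example (the patch has to be applied before get_peft_model builds the adapters):

```python
from peft.tuners.lora.layer import LoraLayer

# Monkey-patch: every LoRA layer created after this point
# (e.g. by get_peft_model) will use BitLinear adapters
LoraLayer.update_layer = BitLoraLayer.update_layer
```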

3.3. 1.58 bit LoRA

For this experiment the approach from The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits was used. The conceptual difference is that, instead of binarizing weights to +1 and -1, the authors of this paper propose quantizing weights to -1, 0 and +1 for better accuracy.
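In the paper this is done with absmean quantization:

$$\tilde{W} = \mathrm{RoundClip}\!\left(\frac{W}{\gamma + \epsilon}, -1, 1\right), \qquad \mathrm{RoundClip}(x, a, b) = \max(a, \min(b, \mathrm{round}(x))), \qquad \gamma = \frac{1}{nm}\sum_{ij} |W_{ij}|$$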

The authors excluded activation scaling from the pipeline, which was exactly what created extra difficulties for merging with the base model in our experiments. In our experiments we additionally removed activation quantization from the pipeline to make merging the LoRA adapters simpler.

To tune the LoRA adapters with this approach we simply update the weight_quant function with the following:
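A sketch of such a ternary weight_quant, mirroring the common BitNet b1.58 reference implementations:

```python
def weight_quant(w: torch.Tensor) -> torch.Tensor:
    """Absmean quantization to {-1, 0, +1} (BitNet b1.58), rescaled back
    so the quantized weights keep the original magnitude."""
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    # Scale by the inverse mean absolute value, round to the nearest
    # integer, clip to the ternary set, then rescale back
    return (w * scale).round().clamp(-1, 1) / scale
```

The rest of the BitLinear forward pass stays the same; only the weight quantization changes from binary to ternary.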

For the 1.58-bit implementation I used the Binary Magic: Building BitNet 1.58bit Using PyTorch from Scratch publication as a starting point.

