Getting Started
The LLM Finetuning Toolkit is a config-based CLI tool for launching a series of fine-tuning experiments and gathering their results. From a single YAML config file, you can define the following:
Data
- Bring your own dataset in any of json, csv, and huggingface formats
- Define your own prompt format and inject desired columns into the prompt
Fine Tuning
- Configure desired hyperparameters for quantization and LoRA fine-tuning
Ablation
- Intuitively define multiple hyperparameter settings to iterate through
Inference
- Configure desired sampling algorithm and parameters
Testing
- Test desired properties such as length and similarity against reference text
Contents
This documentation page is organized into the following sections:
- Quick Start provides a quick overview of the toolkit and helps you get started running your own experiments
- Configuration walks you through all the changes that can be made to customize your experiments
- Developer Guides goes over how to extend each component for custom use-cases and for contributing to this toolkit
- API Reference details the underlying modules of this toolkit
Installation
Clone Repository
git clone https://github.com/georgian-io/LLM-Finetuning-Hub.git
cd LLM-Finetuning-Hub/
Install CLI
# build image
docker build -t llm-toolkit .

# launch container
docker run -it llm-toolkit             # with CPU
docker run -it --gpus all llm-toolkit  # with GPU
Running the Toolkit
The toolkit has everything you need to get started. This guide will walk you through the initial setup, explain the key components of the configuration, and offer advice on customizing your fine-tuning job. Let's dive in!
First, make sure you have read the installation guide above and installed all the dependencies. Then, to launch a LoRA fine-tuning job, run the following command in your terminal:
python3 toolkit.py
This command initiates the fine-tuning process using the settings specified in the default YAML configuration file, config.yaml. The full default configuration is shown under Default Config in the Finetuning section below.
Finetuning
Launch Finetuning Job
Once you have installed the toolkit as described above, launch a LoRA fine-tuning job by running the following command in your terminal:
python3 toolkit.py
Default Config
This command initiates the fine-tuning process using the settings specified in the default YAML configuration file, config.yaml:
save_dir: "./experiment/"

ablation:
  use_ablate: false

# Data Ingestion -------------------
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "yahma/alpaca-cleaned"
  prompt: >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    Below is an instruction that describes a task.
    Write a response that appropriately completes the request.
    ### Instruction: {instruction}
    ### Input: {input}
    ### Output:
  prompt_stub: >- # Stub to add for training at the end of prompt, for test set or inference, this is omitted; make sure only one variable is present
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42

# Model Definition -------------------
model:
  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
  quantize: true
  bitsandbytes:
    load_in_4bit: true
    bnb_4bit_compute_dtype: "bf16"
    bnb_4bit_quant_type: "nf4"

# LoRA Params -------------------
lora:
  task_type: "CAUSAL_LM"
  r: 32
  lora_alpha: 16
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj

# Training -------------------
training:
  training_args:
    num_train_epochs: 5
    per_device_train_batch_size: 4
    optim: "paged_adamw_32bit"
    learning_rate: 2.0e-4
    bf16: true # Set to true for mixed precision training on newer GPUs
    tf32: true
  sft_args:
    max_seq_length: 1024

inference:
  max_new_tokens: 1024
  do_sample: True
  top_p: 0.9
  temperature: 0.8
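To see how prompt and prompt_stub combine with your data, it can help to render them by hand. Below is a minimal illustration (not part of the toolkit) that formats one hypothetical alpaca-style row the way the {column} placeholders suggest: during training the stub ({output}) is appended to the prompt, while for the test set and at inference only the prompt is used.

# Illustrative sketch only: shows how {column} placeholders map onto one dataset row.
# Column names follow the yahma/alpaca-cleaned schema used in the default config.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction: {instruction}\n### Input: {input}\n### Output:"
)
prompt_stub = " {output}"

row = {  # hypothetical example row
    "instruction": "Summarize the text.",
    "input": "LoRA adds small trainable matrices to a frozen base model.",
    "output": "LoRA fine-tunes a model by training small added matrices.",
}

train_text = prompt.format(**row) + prompt_stub.format(**row)  # seen during training
infer_text = prompt.format(**row)                              # seen at test / inference time
print(train_text)
print(infer_text)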
🎉 Congratulations! You've run your first fine-tuning job using this toolkit!
Artefact Outputs
This config will run fine-tuning and save the artefacts under the directory ./experiment/[unique_hash]. Each unique configuration generates its own hash, which lets the tool pick up where it left off. For example, if you stop training before it finishes, you can relaunch the script and it will automatically load the dataset already generated in that directory, resuming from where you left off instead of starting over from the beginning.
After the script finishes running you will see these distinct artifacts:
- /config/config.yml: copy of the config file used for this experiment
- /dataset/dataset.pkl: generated pkl file in huggingface Dataset format
- /model/*: model weights saved using huggingface format
- /results/results.csv: csv of prompt, ground truth, and predicted values
- /qa/qa.csv: csv of quality assurance unit tests (e.g. vector similarity between gold and predicted output)
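As an optional illustration of how you might inspect these artefacts afterwards (assuming pandas is available in your environment; the directory name is a placeholder for your run's actual hash):

# Sketch for inspecting run artefacts; replace the path with your ./experiment/[unique_hash] directory.
import pickle
import pandas as pd

run_dir = "./experiment/your_unique_hash"  # hypothetical path

with open(f"{run_dir}/dataset/dataset.pkl", "rb") as f:
    dataset = pickle.load(f)  # dataset object saved by the toolkit (huggingface Dataset format)
print(dataset)

results = pd.read_csv(f"{run_dir}/results/results.csv")  # prompt, ground truth, predicted values
qa = pd.read_csv(f"{run_dir}/qa/qa.csv")                 # quality assurance test results
print(results.head())
print(qa.head())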
WARNING
If you swap in a different base model (via hf_model_ckpt), carefully review its documentation and requirements to ensure compatibility with your task and the toolkit.
Quality Assurance
Once the model is trained, it’s crucial to verify its readiness for production. We offer Quality Assurance testing specifically tailored for Language Model applications. This approach is distinct from conventional testing methods, as there’s currently no direct means of ensuring that a fine-tuned model meets enterprise standards. Moreover, developers have the flexibility to integrate their own tests into the process.
Available Tests
Generation Property
Generation Length
- Function: LengthTest
- Description: Determines the length of the summarized output and the input sentence. The output length is expected to exceed the input length, aligning with the specific use case.
Grammar Composition
- Description: Analyzes the grammar of the generated output, focusing on:
- Verb Percentage: Indicates the proportion of verbs present.
- Adjective Percentage: Indicates the proportion of adjectives present.
- Noun Percentage: Indicates the proportion of nouns present.
Word Similarity
- Function: WordOverLapTest
- Description: Measures the word overlap between the generated output and the reference text.
- Function: RougeScore
- Description: Computes the Rouge score for the output, providing insight into the quality of summarization.
Embedding Similarity
- Function: JaccardSimilarity
- Description: Calculates similarity by encoding inputs and outputs.
- Function: DotProductSimilarity
- Description: Computes the dot product between the encoded inputs and outputs.
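These checks all boil down to comparing the generated output against a reference text. As a rough, standalone illustration of what the word-overlap and Jaccard-style tests measure (this is not the toolkit's implementation; the embedding-based tests apply the same idea to encoded vectors rather than word sets):

# Standalone sketch of similarity-style QA checks (illustrative only).
def word_overlap(gold: str, pred: str) -> int:
    """Number of words shared between the reference and the prediction."""
    return len(set(gold.lower().split()) & set(pred.lower().split()))

def jaccard_similarity(gold: str, pred: str) -> float:
    """Shared words divided by the size of the union of both word sets."""
    a, b = set(gold.lower().split()), set(pred.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

gold = "LoRA fine-tunes a model by training small added matrices."
pred = "LoRA trains small added matrices to fine-tune a model."
print(word_overlap(gold, pred), jaccard_similarity(gold, pred))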
LoRA
The lora section configures the Low-Rank Adaptation (LoRA) settings. Supplied arguments are used to construct a peft LoraConfig object. It includes the following parameters:
Parameters
- task_type: Type of transformer architecture; for decoder-only models use CAUSAL_LM, for encoder-decoder models use SEQ_2_SEQ_LM.
- r: The rank of the LoRA adaptation matrices.
- lora_alpha: The scaling factor for the LoRA adaptation.
- lora_dropout: The dropout probability for the LoRA layers.
- target_modules: The list of module names to apply LoRA to.
- fan_in_fan_out: Flag to indicate if the layer weights are stored in a (fan_in, fan_out) order.
- modules_to_save: List of additional module names to save in the final checkpoint.
- layers_to_transform: The list of layer indices to apply LoRA to.
- layers_pattern: The regular expression pattern to match layer names for LoRA application.
Example
lora:
  r: 32
  lora_alpha: 16
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj
  fan_in_fan_out: false
  modules_to_save: null
  layers_to_transform: null
  layers_pattern: null
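Since these values are used to build a peft LoraConfig, a rough Python equivalent of the example above (assuming the peft package is installed) would be:

# Approximate peft equivalent of the lora section above; field names follow peft's LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
    fan_in_fan_out=False,
    modules_to_save=None,
)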
Advanced Usage
fan_in_fan_out
The fan_in_fan_out parameter is a boolean flag that indicates whether the weights of the layers being adapted are stored in a (fan_in, fan_out) order. This is important for correctly applying the LoRA adaptation.
lora:
  fan_in_fan_out: true
In this example, setting fan_in_fan_out to true indicates that the weights of the layers being adapted are stored in a (fan_in, fan_out) order. If the weights are stored in a different order, you should set this parameter to false.
layers_to_transform
The layers_to_transform parameter is used to specify the indices of the layers to which LoRA should be applied. This allows you to selectively apply LoRA to specific layers of the model.
lora:
  layers_to_transform: [2, 4, 6]
In this example, LoRA will be applied to the layers with indices 2, 4, and 6. The layer indices are zero-based, so the first layer has an index of 0, the second layer has an index of 1, and so on.
You can also specify a single layer index:
lora:
  layers_to_transform: 3
In this case, LoRA will be applied only to the layer with index 3.
layers_pattern
The layers_pattern parameter allows you to specify a regular expression pattern to match the names of the layers to which LoRA should be applied. This provides a more flexible way to select layers based on their names.
lora:
  layers_pattern: "transformer\.h\.\d+\.attn"
In this example, the regular expression pattern transformer\.h\.\d+\.attn will match the names of the attention layers in a transformer model. The pattern will match layer names like transformer.h.0.attn, transformer.h.1.attn, and so on.
You can adjust the regular expression pattern to match the specific layer names in your model.
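If you want to sanity-check a pattern before training, you can list your model's module names and filter them with the same regular expression. This optional sketch uses a small example checkpoint; your own model's module names will differ:

# Optional sketch: preview which module names a layers_pattern-style regex would match.
import re
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small example checkpoint
pattern = re.compile(r"transformer\.h\.\d+\.attn")

matches = [name for name, _ in model.named_modules() if pattern.fullmatch(name)]
print(matches[:5])  # e.g. ['transformer.h.0.attn', 'transformer.h.1.attn', ...]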
TIP
When use_ablate is set to true, the toolkit will generate multiple configurations by permuting the specified parameters. This allows you to easily compare different settings and their impact on the model's performance.
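The exact ablation syntax is defined in the repository's configuration guide; as an illustrative sketch (the list-valued fields below are an assumption, not verified syntax), an ablation over LoRA rank and learning rate might look like:

ablation:
  use_ablate: true

lora:
  r: [16, 32, 64]  # assumed syntax: multiple candidate values to permute into separate runs

training:
  training_args:
    learning_rate: [2.0e-4, 2.0e-5]  # assumed syntax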
Data
Ingestors
Ingestor
class src.data.ingestor.Ingestor( path: str )
The Ingestor class is an abstract base class for data ingestors.
path: str
- The path of the dataset.
to_dataset() -> Dataset
An abstract method to be implemented by subclasses. Converts the input data to a Dataset object.
Returns
Dataset
- The converted Dataset object.
JSON Ingestor
class src.data.ingestor.JsonIngestor( path: str )
The JsonIngestor class is a subclass of Ingestor for ingesting JSON data.
Parameters
path: str
- The path of the JSON dataset.
to_dataset() -> Dataset
Converts the JSON data to a Dataset object.
Returns
Dataset
- The converted Dataset object.
CSV Ingestor
class src.data.ingestor.CsvIngestor( path: str )
The CsvIngestor class is a subclass of Ingestor for ingesting CSV data.
Parameters
path: str
- The path of the CSV dataset.
to_dataset() -> Dataset
Converts the CSV data to a Dataset object.
Returns
Dataset
- The converted Dataset object.
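Based on the signatures above, using an ingestor directly is straightforward. This sketch assumes a local data.json file and that the toolkit's src package is importable from your working directory:

# Sketch: converting a local JSON file into a Dataset via the documented ingestor API.
from src.data.ingestor import JsonIngestor

ingestor = JsonIngestor(path="data.json")  # hypothetical local file
dataset = ingestor.to_dataset()            # returns a huggingface Dataset object
print(dataset)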
Dataset Generator
Dataset Generator
class src.data.dataset_generator.DatasetGenerator( file_type: str, path: str, prompt: str, prompt_stub: str, test_size: Union[float, int], train_size: Union[float, int], train_test_split_seed: int )
The DatasetGenerator class is responsible for generating and formatting datasets for training and testing.
Parameters
file_type: str
- The type of input file ("json", "csv", or "huggingface").
path: str
- The path to the input file or HuggingFace dataset.
prompt: str
- The prompt template for formatting the dataset.
prompt_stub: str
- The prompt stub used during training.
test_size: Union[float, int]
- The size of the test set.
train_size: Union[float, int]
- The size of the training set.
train_test_split_seed: int
- The random seed for splitting the dataset.
get_dataset( )
Generates and returns the formatted train and test datasets.
Returns
A tuple containing the train and test datasets.
save_dataset( save_dir: str )
Saves the generated dataset to the specified directory.
Parameters
save_dir: str
- The directory to save the dataset.
load_dataset_from_pickle( save_dir: str )
Loads the dataset from a pickle file in the specified directory.
Parameters
save_dir: str
- The directory containing the dataset pickle file.
Returns
A tuple containing the loaded train and test datasets.
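Putting the documented signatures together, a typical flow constructs the generator, builds the splits, and optionally persists and reloads them. This is a sketch based only on the parameters and methods listed above; the prompt text and paths are placeholders:

# Sketch based on the documented DatasetGenerator API; values are placeholders.
from src.data.dataset_generator import DatasetGenerator

generator = DatasetGenerator(
    file_type="huggingface",
    path="yahma/alpaca-cleaned",
    prompt="### Instruction: {instruction}\n### Input: {input}\n### Output:",
    prompt_stub="{output}",
    test_size=0.1,
    train_size=0.9,
    train_test_split_seed=42,
)

train_ds, test_ds = generator.get_dataset()              # formatted train/test splits
generator.save_dataset(save_dir="./experiment/example")  # writes dataset.pkl
train_ds, test_ds = generator.load_dataset_from_pickle("./experiment/example")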