Getting Started
The LLM Finetuning Toolkit is a config-based CLI tool for launching a series of fine-tuning experiments and gathering their results. From a single YAML config file, you can define the following:
Data
- Bring your own dataset in any of json, csv, and huggingface formats
- Define your own prompt format and inject desired columns into the prompt
Fine Tuning
- Configure desired hyperparameters for quantization and LoRA fine-tuning
Ablation
- Intuitively define multiple hyperparameter settings to iterate through
Inference
- Configure desired sampling algorithm and parameters
Testing
- Test desired properties such as length and similarity against reference text
Contents
This documentation page is organized into the following sections:
- Quick Start provides a quick overview of the toolkit and helps you get started running your own experiments
- Configuration walks you through all the changes that can be made to customize your experiments
- Developer Guides goes over how to extend each component for custom use-cases and for contributing to this toolkit
- API Reference details the underlying modules of this toolkit
Installation
Clone Repository
git clone https://github.com/georgian-io/LLM-Finetuning-Hub.git
cd LLM-Finetuning-Hub/
Install CLI
# build image
docker build -t llm-toolkit .

# launch container
docker run -it llm-toolkit             # with CPU
docker run -it --gpus all llm-toolkit  # with GPU
Running the Toolkit
The toolkit has everything you need to get started. This guide will walk you through the initial setup, explain the key components of the configuration, and offer advice on customizing your fine-tuning job. Let's dive in!
First, make sure you have read the installation guide above and installed all the dependencies. Then, to launch a LoRA fine-tuning job, run the following command in your terminal:
python3 toolkit.py
This command initiates the fine-tuning process using the settings specified in the default YAML configuration file, config.yaml. The full default configuration is shown under Default Config in the Finetuning section below.
Finetuning
Launch Finetuning Job
Once you have installed the toolkit as described above, launch a LoRA fine-tuning job by running the following command in your terminal:
python3 toolkit.py
Default Config
This command initiates the fine-tuning process using the settings specified in the default YAML configuration file, config.yaml:
save_dir: "./experiment/"

ablation:
  use_ablate: false

# Data Ingestion -------------------
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "yahma/alpaca-cleaned"
  prompt: >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    Below is an instruction that describes a task.
    Write a response that appropriately completes the request.
    ### Instruction: {instruction}
    ### Input: {input}
    ### Output:
  prompt_stub: >- # Stub to add for training at the end of prompt, for test set or inference, this is omitted; make sure only one variable is present
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42

# Model Definition -------------------
model:
  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
  quantize: true
  bitsandbytes:
    load_in_4bit: true
    bnb_4bit_compute_dtype: "bf16"
    bnb_4bit_quant_type: "nf4"

# LoRA Params -------------------
lora:
  task_type: "CAUSAL_LM"
  r: 32
  lora_alpha: 16
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj

# Training -------------------
training:
  training_args:
    num_train_epochs: 5
    per_device_train_batch_size: 4
    optim: "paged_adamw_32bit"
    learning_rate: 2.0e-4
    bf16: true # Set to true for mixed precision training on newer GPUs
    tf32: true
  sft_args:
    max_seq_length: 1024

inference:
  max_new_tokens: 1024
  do_sample: True
  top_p: 0.9
  temperature: 0.8
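To see how prompt and prompt_stub combine with your data, it can help to render them by hand. Below is a minimal illustration (not part of the toolkit) that formats one hypothetical alpaca-style row the way the {column} placeholders suggest: during training the stub ({output}) is appended to the prompt, while for the test set and at inference only the prompt is used.

# Illustrative sketch only: shows how {column} placeholders map onto one dataset row.
# Column names follow the yahma/alpaca-cleaned schema used in the default config.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction: {instruction}\n### Input: {input}\n### Output:"
)
prompt_stub = " {output}"

row = {  # hypothetical example row
    "instruction": "Summarize the text.",
    "input": "LoRA adds small trainable matrices to a frozen base model.",
    "output": "LoRA fine-tunes a model by training small added matrices.",
}

train_text = prompt.format(**row) + prompt_stub.format(**row)  # seen during training
infer_text = prompt.format(**row)                              # seen at test / inference time
print(train_text)
print(infer_text)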
🎉 Congratulations! You've run your first fine-tuning job using this toolkit!
Artefact Outputs
This config will run fine-tuning and save the artefacts under the directory ./experiment/[unique_hash]. Each unique configuration generates its own hash, which lets the tool pick up where it left off. For example, if you stop training before it finishes, you can relaunch the script and it will automatically load the dataset already generated in that directory, resuming from where you left off instead of starting over from the beginning.
After the script finishes running you will see these distinct artifacts:
- /config/config.yml: copy of the config file used for this experiment
- /dataset/dataset.pkl: generated pkl file in huggingface Dataset format
- /model/*: model weights saved using huggingface format
- /results/results.csv: csv of prompt, ground truth, and predicted values
- /qa/qa.csv: csv of quality assurance unit tests (e.g. vector similarity between gold and predicted output)
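As an optional illustration of how you might inspect these artefacts afterwards (assuming pandas is available in your environment; the directory name is a placeholder for your run's actual hash):

# Sketch for inspecting run artefacts; replace the path with your ./experiment/[unique_hash] directory.
import pickle
import pandas as pd

run_dir = "./experiment/your_unique_hash"  # hypothetical path

with open(f"{run_dir}/dataset/dataset.pkl", "rb") as f:
    dataset = pickle.load(f)  # dataset object saved by the toolkit (huggingface Dataset format)
print(dataset)

results = pd.read_csv(f"{run_dir}/results/results.csv")  # prompt, ground truth, predicted values
qa = pd.read_csv(f"{run_dir}/qa/qa.csv")                 # quality assurance test results
print(results.head())
print(qa.head())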
WARNING
If you swap in a different base model (via hf_model_ckpt), carefully review its documentation and requirements to ensure compatibility with your task and the toolkit.
Quality Assurance
Once the model is trained, it’s crucial to verify its readiness for production. We offer Quality Assurance testing specifically tailored for Language Model applications. This approach is distinct from conventional testing methods, as there’s currently no direct means of ensuring that a fine-tuned model meets enterprise standards. Moreover, developers have the flexibility to integrate their own tests into the process.
Available Tests
Generation Property
Generation Length
- Function: LengthTest
- Description: Determines the length of the summarized output and the input sentence. The output length is expected to exceed the input length, aligning with the specific use case.
Grammar Composition
- Description: Analyzes the grammar of the generated output, focusing on:
- Verb Percentage: Indicates the proportion of verbs present.
- Adjective Percentage: Indicates the proportion of adjectives present.
- Noun Percentage: Indicates the proportion of nouns present.
Word Similarity
- Function: WordOverLapTest
- Description: Measures the word overlap between the generated output and the reference text.
- Function: RougeScore
- Description: Computes the Rouge score for the output, providing insight into the quality of summarization.
Embedding Similarity
- Function: JaccardSimilarity
- Description: Calculates similarity by encoding inputs and outputs.
- Function: DotProductSimilarity
- Description: Computes the dot product between the encoded inputs and outputs.
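These checks all boil down to comparing the generated output against a reference text. As a rough, standalone illustration of what the word-overlap and Jaccard-style tests measure (this is not the toolkit's implementation; the embedding-based tests apply the same idea to encoded vectors rather than word sets):

# Standalone sketch of similarity-style QA checks (illustrative only).
def word_overlap(gold: str, pred: str) -> int:
    """Number of words shared between the reference and the prediction."""
    return len(set(gold.lower().split()) & set(pred.lower().split()))

def jaccard_similarity(gold: str, pred: str) -> float:
    """Shared words divided by the size of the union of both word sets."""
    a, b = set(gold.lower().split()), set(pred.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

gold = "LoRA fine-tunes a model by training small added matrices."
pred = "LoRA trains small added matrices to fine-tune a model."
print(word_overlap(gold, pred), jaccard_similarity(gold, pred))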
LoRA
The lora section configures the Low-Rank Adaptation (LoRA) settings. Supplied arguments are used to construct a peft LoraConfig object. It includes the following parameters:
Parameters
- task_type: Type of transformer architecture; for decoder-only models use CAUSAL_LM, for encoder-decoder models use SEQ_2_SEQ_LM.
- r: The rank of the LoRA adaptation matrices.
- lora_alpha: The scaling factor for the LoRA adaptation.
- lora_dropout: The dropout probability for the LoRA layers.
- target_modules: The list of module names to apply LoRA to.
- fan_in_fan_out: Flag to indicate if the layer weights are stored in a (fan_in, fan_out) order.
- modules_to_save: List of additional module names to save in the final checkpoint.
- layers_to_transform: The list of layer indices to apply LoRA to.
- layers_pattern: The regular expression pattern to match layer names for LoRA application.
Example
lora:
  r: 32
  lora_alpha: 16
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj
  fan_in_fan_out: false
  modules_to_save: null
  layers_to_transform: null
  layers_pattern: null
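Since these values are used to build a peft LoraConfig, a rough Python equivalent of the example above (assuming the peft package is installed) would be:

# Approximate peft equivalent of the lora section above; field names follow peft's LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "up_proj", "down_proj", "gate_proj"],
    fan_in_fan_out=False,
    modules_to_save=None,
)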
Advanced Usage
fan_in_fan_out
The fan_in_fan_out parameter is a boolean flag that indicates whether the weights of the layers being adapted are stored in a (fan_in, fan_out) order. This is important for correctly applying the LoRA adaptation.
lora:
  fan_in_fan_out: true
In this example, setting fan_in_fan_out to true indicates that the weights of the layers being adapted are stored in a (fan_in, fan_out) order. If the weights are stored in a different order, you should set this parameter to false.
layers_to_transform
The layers_to_transform parameter is used to specify the indices of the layers to which LoRA should be applied. This allows you to selectively apply LoRA to specific layers of the model.
lora:
  layers_to_transform: [2, 4, 6]
In this example, LoRA will be applied to the layers with indices 2, 4, and 6. The layer indices are zero-based, so the first layer has an index of 0, the second layer has an index of 1, and so on.
You can also specify a single layer index:
lora:
  layers_to_transform: 3
In this case, LoRA will be applied only to the layer with index 3.
layers_pattern
The layers_pattern parameter allows you to specify a regular expression pattern to match the names of the layers to which LoRA should be applied. This provides a more flexible way to select layers based on their names.
lora:
  layers_pattern: "transformer\.h\.\d+\.attn"
In this example, the regular expression pattern transformer\.h\.\d+\.attn will match the names of the attention layers in a transformer model. The pattern will match layer names like transformer.h.0.attn, transformer.h.1.attn, and so on.
You can adjust the regular expression pattern to match the specific layer names in your model.
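If you want to sanity-check a pattern before training, you can list your model's module names and filter them with the same regular expression. This optional sketch uses a small example checkpoint; your own model's module names will differ:

# Optional sketch: preview which module names a layers_pattern-style regex would match.
import re
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # small example checkpoint
pattern = re.compile(r"transformer\.h\.\d+\.attn")

matches = [name for name, _ in model.named_modules() if pattern.fullmatch(name)]
print(matches[:5])  # e.g. ['transformer.h.0.attn', 'transformer.h.1.attn', ...]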
TIP
When use_ablate is set to true, the toolkit will generate multiple configurations by permuting the specified parameters. This allows you to easily compare different settings and their impact on the model's performance.
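The exact ablation syntax is defined in the repository's configuration guide; as an illustrative sketch (the list-valued fields below are an assumption, not verified syntax), an ablation over LoRA rank and learning rate might look like:

ablation:
  use_ablate: true

lora:
  r: [16, 32, 64]  # assumed syntax: multiple candidate values to permute into separate runs

training:
  training_args:
    learning_rate: [2.0e-4, 2.0e-5]  # assumed syntax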
Data
Ingestors
Ingestor
class src.data.ingestor.Ingestor( path: str )
The Ingestor class is an abstract base class for data ingestors.
path: str
- The path of the dataset.
to_dataset() -> Dataset
An abstract method to be implemented by subclasses. Converts the input data to a Dataset object.
Returns
Dataset
- The converted Dataset object.
JSON Ingestor
class src.data.ingestor.JsonIngestor( path: str )
The JsonIngestor class is a subclass of Ingestor for ingesting JSON data.
Parameters
path: str
- The path of the JSON dataset.
to_dataset() -> Dataset
Converts the JSON data to a Dataset object.
Returns
Dataset
- The converted Dataset object.
CSV Ingestor
class src.data.ingestor.CsvIngestor( path: str )
The CsvIngestor class is a subclass of Ingestor for ingesting CSV data.
Parameters
path: str
- The path of the CSV dataset.
to_dataset() -> Dataset
Converts the CSV data to a Dataset object.
Returns
Dataset
- The converted Dataset object.
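Based on the signatures above, using an ingestor directly is straightforward. This sketch assumes a local data.json file and that the toolkit's src package is importable from your working directory:

# Sketch: converting a local JSON file into a Dataset via the documented ingestor API.
from src.data.ingestor import JsonIngestor

ingestor = JsonIngestor(path="data.json")  # hypothetical local file
dataset = ingestor.to_dataset()            # returns a huggingface Dataset object
print(dataset)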
Dataset Generator
Dataset Generator
class src.data.dataset_generator.DatasetGenerator( file_type: str, path: str, prompt: str, prompt_stub: str, test_size: Union[float, int], train_size: Union[float, int], train_test_split_seed: int )
The DatasetGenerator class is responsible for generating and formatting datasets for training and testing.
Parameters
file_type: str
- The type of input file ("json", "csv", or "huggingface").
path: str
- The path to the input file or HuggingFace dataset.
prompt: str
- The prompt template for formatting the dataset.
prompt_stub: str
- The prompt stub used during training.
test_size: Union[float, int]
- The size of the test set.
train_size: Union[float, int]
- The size of the training set.
train_test_split_seed: int
- The random seed for splitting the dataset.
get_dataset( )
Generates and returns the formatted train and test datasets.
Returns
A tuple containing the train and test datasets.
save_dataset( save_dir: str )
Saves the generated dataset to the specified directory.
Parameters
save_dir: str
- The directory to save the dataset.
load_dataset_from_pickle( save_dir: str )
Loads the dataset from a pickle file in the specified directory.
Parameters
save_dir: str
- The directory containing the dataset pickle file.
Returns
A tuple containing the loaded train and test datasets.
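Putting the documented signatures together, a typical flow constructs the generator, builds the splits, and optionally persists and reloads them. This is a sketch based only on the parameters and methods listed above; the prompt text and paths are placeholders:

# Sketch based on the documented DatasetGenerator API; values are placeholders.
from src.data.dataset_generator import DatasetGenerator

generator = DatasetGenerator(
    file_type="huggingface",
    path="yahma/alpaca-cleaned",
    prompt="### Instruction: {instruction}\n### Input: {input}\n### Output:",
    prompt_stub="{output}",
    test_size=0.1,
    train_size=0.9,
    train_test_split_seed=42,
)

train_ds, test_ds = generator.get_dataset()              # formatted train/test splits
generator.save_dataset(save_dir="./experiment/example")  # writes dataset.pkl
train_ds, test_ds = generator.load_dataset_from_pickle("./experiment/example")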