LLM Fine-tuning Toolkit

Getting Started

The LLM Fine-tuning Toolkit is a config-based CLI tool for launching a series of fine-tuning experiments and gathering their results. From a single YAML config file, users can define the following:


Data

  • Bring your own dataset in any of JSON, CSV, or Hugging Face formats.
  • Define prompt format and inject desired columns into the prompt.

Fine Tuning

  • Configure desired hyperparameters for quantization and LoRA fine-tuning.

Ablation

  • Define multiple hyperparameter settings to iterate through.

Inference

  • Configure desired sampling algorithm and parameters.

Testing

  • Test desired properties such as length and similarity against reference text.

This documentation page is organized in the following sections:

  • Quick Start provides a quick overview of the toolkit and helps you get started running your own experiments.
  • Configuration walks through all the changes that can be made to customize the experiments.
  • Developer Guides explains how to extend each component for custom use cases and how to contribute to this toolkit.
  • API Reference details the underlying modules of this toolkit.
Installation

Clone the repository:

git clone https://github.com/georgian-io/LLM-Finetuning-Hub.git
cd LLM-Finetuning-Hub/

Docker (recommended):

# build image
docker build -t llm-toolkit .
# launch container
docker run -it llm-toolkit            # with CPU
docker run -it --gpus all llm-toolkit # with GPU

The toolkit can also be installed directly with poetry (recommended), pip, or conda.

This guide is intended to walk through the initial setup, explain the key components of the configuration, and offer advice on customizing the fine-tuning job.

First, make sure you have read the installation guide above and installed all the dependencies. Then, to launch a LoRA fine-tuning job, run the following command in your terminal:

python3 toolkit.py

This command initiates the fine-tuning process using the settings specified in the default YAML configuration file config.yaml.

save_dir: "./experiment/"

ablation:
  use_ablate: false

# Data Ingestion -------------------
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "yahma/alpaca-cleaned"
  prompt: >- # prompt template; make sure column names are enclosed in {} braces and match your data
    Below is an instruction that describes a task.
    Write a response that appropriately completes the request.
    ### Instruction: {instruction}
    ### Input: {input}
    ### Output:
  prompt_stub: >- # stub appended to the prompt during training; omitted for the test set and inference. Make sure only one variable is present
    {output}
  test_size: 0.1 # Proportion of test as % of total; if integer then # of samples
  train_size: 0.9 # Proportion of train as % of total; if integer then # of samples
  train_test_split_seed: 42

# Model Definition -------------------
model:
  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
  quantize: true
  bitsandbytes:
    load_in_4bit: true
    bnb_4bit_compute_dtype: "bf16"
    bnb_4bit_quant_type: "nf4"

# LoRA Params -------------------
lora:
  task_type: "CAUSAL_LM"
  r: 32
  lora_alpha: 16
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj

# Training -------------------
training:
  training_args:
    num_train_epochs: 5
    per_device_train_batch_size: 4
    optim: "paged_adamw_32bit"
    learning_rate: 2.0e-4
    bf16: true # Set to true for mixed-precision training on newer GPUs (e.g., Ampere or later)
    tf32: true
  sft_args:
    max_seq_length: 1024

inference:
  max_new_tokens: 1024
  do_sample: true
  top_p: 0.9
  temperature: 0.8

Fine-tuning

This section explains the artifacts a fine-tuning run produces and how to customize the job. As described in the Quick Start, running python3 toolkit.py launches a LoRA fine-tuning job using the settings in the default YAML configuration file, config.yaml.

Artifact Outputs

This config runs fine-tuning and saves the artifacts under the directory ./experiment/[unique_hash]. Each unique configuration generates its own hash, so the tool can automatically pick up where it left off: if you stop training before it finishes, relaunching the script will load the dataset already generated in that directory and resume instead of starting over from the beginning.

After the script finishes running you will see these distinct artifacts:

/config/config.yml: copy of the config file used for this experiment

/dataset/dataset.pkl: generated pkl file in huggingface Dataset format

/model/*: model weights saved using huggingface format

/results/results.csv: csv of prompt, ground truth, and predicted values

/qa/qa.csv: csv of quality assurance unit tests (e.g. vector similarity between gold and predicted output)
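
Putting these together, the directory for a finished run might look roughly like this (the hash shown is illustrative):

./experiment/
└── A1b2C3d4/
    ├── config/config.yml
    ├── dataset/dataset.pkl
    ├── model/            (model weights)
    ├── results/results.csv
    └── qa/qa.csv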

Custom Fine-tuning

You can modify config.yaml to launch custom training jobs. For a more detailed and nuanced treatment of what you can input into the config file, please reference the "Configuration" section of the documentation.

Change the file_type and path under data in config.yaml to point to your custom dataset. Ensure your dataset is properly formatted and adjust the prompt accordingly.

New Config
...
data:
  file_type: "csv"
  path: "path/to/your/dataset.csv"
  prompt: "Your custom prompt template with {column_name} placeholders"
...

Adjust the r and lora_alpha parameters in the LoRA section to experiment with different adaptation strengths.

New Config
...
lora:
  r: 64
  lora_alpha: 32
...

Modify hf_model_ckpt to fine-tune a different base model. Ensure it is compatible with your task and make sure to specify the right modules to tune (different models may have different module names).

New Config
...
model:
  hf_model_ckpt: "EleutherAI/gpt-neo-1.3B"
lora:
  target_modules:
    - q_proj
    - k_proj
    - v_proj
    - out_proj
    - c_fc
    - c_proj
...

In the new config snippet for changing the model, we've updated hf_model_ckpt to use the "EleutherAI/gpt-neo-1.3B" model instead of "NousResearch/Llama-2-7b-hf". We've also adjusted target_modules (which lives under the lora section) to match the module names used by the GPT-Neo architecture.

WARNING

Remember to carefully review the documentation and requirements of the new model you choose to ensure compatibility with your task and the toolkit.

Quality Assurance

Once the model is trained, verify that it is ready for production. Quality assurance testing tailored to language model applications can help with this verification; it differs from conventional software testing because there is currently no direct way to guarantee that a fine-tuned model meets enterprise standards. Developers can also integrate their own tests into the process. The built-in tests are:

Generation Length

  • Function: LengthTest
  • Description: Measures the absolute difference in length between the reference text and the model's prediction, giving a quick check on whether the output length matches expectations for the use case.

POS Composition

  • Functions: VerbPercent, AdjectivePercent, NounPercent
  • Description: Analyzes the grammatical composition of the generated output, reporting the percentage of verbs, adjectives, and nouns respectively.

Word Overlap

  • Function: WordOverlapTest
  • Description: Calculates the percentage of word overlap between the reference text and the model's prediction after removing common stop words.

ROUGE Score

  • Function: RougeScoreTest
  • Description: Computes the ROUGE-1 precision of the output against the reference, providing insight into the quality of summarization.

Jaccard Similarity

  • Function: JaccardSimilarityTest
  • Description: Calculates the Jaccard similarity (intersection over union of unique words) between the reference text and the model output.

Dot Product (Cosine) Similarity

  • Function: DotProductSimilarityTest
  • Description: Computes the dot product between sentence embeddings of the reference text and the model output.

General Structure

The configuration file has a hierarchical structure with the following main sections:

  • save_dir: The directory where the experiment results will be saved.
  • ablation: Settings for ablation studies.
  • data: Configuration for data ingestion.
  • model: Model definition and settings.
  • lora: Configuration for LoRA (Low-Rank Adaptation).
  • training: Settings for the training process.
  • inference: Configuration for the inference stage.

Each section contains subsections and parameters that fine-tune the behavior of the toolkit.

Data

The data section defines how the input data is loaded and preprocessed. It includes the following parameters:

  • file_type: The type of the input file, which can be "json", "csv", or "huggingface".
  • path: The path to the input file or the name of the Hugging Face dataset.
  • prompt: The prompt template used for formatting the input data. Use curly braces ({}) to reference column names.
  • prompt_stub: The prompt stub appended to the prompt during training; it is omitted for the test set and at inference time so the model generates the completion. Use curly braces to reference the single output column.
  • train_size: The size of the training set, either as a float (proportion) or an integer (number of examples).
  • test_size: The size of the test set, either as a float (proportion) or an integer (number of examples).
  • train_test_split_seed: The random seed used for splitting the data into train and test sets.
data:
  file_type: "csv"
  path: "path/to/your/dataset.csv"
  prompt: >-
    Below is an instruction that describes a task.
    Write a response that appropriately completes the request.
    ### Instruction: {instruction}
    ### Input: {input}
    ### Output:
  prompt_stub: >-
    {output}
  test_size: 0.1
  train_size: 0.9
  train_test_split_seed: 42
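
Because train_size and test_size accept either floats or integers, you can also request absolute sample counts instead of proportions. For example (an illustrative variant of the data block above):

data:
  file_type: "csv"
  path: "path/to/your/dataset.csv"
  # prompt and prompt_stub omitted for brevity
  test_size: 500    # hold out 500 examples
  train_size: 5000  # train on 5,000 examples
  train_test_split_seed: 42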

Model

The model section defines the base model and load settings. It includes the following parameters:

  • hf_model_ckpt: The path or name of the pre-trained model checkpoint from the Hugging Face Model Hub.
  • device_map: The device map for model parallelism. Set to "auto" for automatic device mapping or specify a custom device map.
  • quantize: Boolean flag to enable quantization of the model weights; if true, the model is loaded with the bitsandbytes config below.
  • bitsandbytes: Settings for quantization using the BitsAndBytesConfig object from transformers.
    • load_in_8bit: Flag to enable 8-bit quantization.
    • llm_int8_threshold: Outlier threshold for 8-bit quantization.
    • llm_int8_skip_modules: List of module names to exclude from 8-bit quantization.
    • llm_int8_enable_fp32_cpu_offload: Flag to enable offloading of non-quantized weights to the CPU.
    • load_in_4bit: Flag to enable 4-bit quantization using bitsandbytes.
    • bnb_4bit_compute_dtype: Compute dtype for 4-bit quantization.
    • bnb_4bit_quant_type: Quantization data type for 4-bit quantization.
    • bnb_4bit_use_double_quant: Flag to enable double quantization for 4-bit quantization.
model:
  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
  device_map: "auto"
  quantize: true
  bitsandbytes:
    load_in_4bit: true
    bnb_4bit_compute_dtype: "bf16"
    bnb_4bit_quant_type: "nf4"
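
As an illustrative variant (not the default), the same bitsandbytes block can use the 8-bit flags documented above to load the base model in 8-bit instead of 4-bit; the threshold value shown here is simply the common bitsandbytes default:

model:
  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
  device_map: "auto"
  quantize: true
  bitsandbytes:
    load_in_8bit: true
    llm_int8_threshold: 6.0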

LoRA

The lora section configures the Low-Rank Adaptation (LoRA) settings. Supplied arguments are used to construct a peft LoraConfig object. It includes the following parameters:

  • task_type: Type of transformer architecture; use CAUSAL_LM for decoder-only models and SEQ_2_SEQ_LM for encoder-decoder models.
  • r: The rank of the LoRA adaptation matrices.
  • lora_alpha: The scaling factor for the LoRA adaptation.
  • lora_dropout: The dropout probability for the LoRA layers.
  • target_modules: The list of module names to apply LoRA to.
  • fan_in_fan_out: Flag to indicate if the layer weights are stored in a (fan_in, fan_out) order.
  • modules_to_save: List of additional module names to save in the final checkpoint.
  • layers_to_transform: The list of layer indices to apply LoRA to.
  • layers_pattern: The regular expression pattern to match layer names for LoRA application.
lora:
  r: 32
  lora_alpha: 16
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj
  fan_in_fan_out: false
  modules_to_save: null
  layers_to_transform: null
  layers_pattern: null

The fan_in_fan_out parameter is a boolean flag that indicates whether the weights of the layers being adapted are stored in (fan_in, fan_out) order. Setting it correctly is important for applying the LoRA adaptation properly.

Example
lora:
  fan_in_fan_out: true

In this example, setting fan_in_fan_out to true indicates that the weights of the adapted layers are stored in (fan_in, fan_out) order, as with GPT-2-style Conv1D layers. If the weights are stored in the usual (fan_out, fan_in) order used by standard linear layers, leave this parameter set to false.

The layers_to_transform parameter specifies the indices of the layers to which LoRA should be applied, allowing you to selectively adapt specific layers of the model.

Example
lora:
  layers_to_transform: [2, 4, 6]

In this example, LoRA will be applied to the layers with indices 2, 4, and 6. The layer indices are zero-based, so the first layer has an index of 0, the second layer has an index of 1, and so on.

You can also specify a single layer index:

Example
lora:
  layers_to_transform: 3

In this case, LoRA will be applied only to the layer with index 3.

The layers_pattern parameter allows the user to specify a regular expression pattern to match the names of the layers to which LoRA should be applied. This provides a more flexible way to select layers based on their names.

Example
lora:
  layers_pattern: 'transformer\.h\.\d+\.attn' # single quotes keep the backslashes literal in YAML

In this example, the regular expression pattern transformer\.h\.\d+\.attn will match the names of the attention layers in a transformer model. The pattern will match layer names like transformer.h.0.attn, transformer.h.1.attn, and so on.

You can adjust the regular expression pattern to match the specific layer names in your model.
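
If you are unsure which module names or layer name patterns your base model exposes, you can list them with a short transformers snippet. This is only a convenience for choosing target_modules and layers_pattern, not part of the toolkit:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-hf")

# Print every submodule name; pick the attention/MLP projection layers from this list.
for name, _ in model.named_modules():
    print(name)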

Training

The training section configures the training process. It includes two subsections:

  • training_args: General training arguments such as the number of epochs, batch size, gradient accumulation steps, optimizer, learning rate, etc.
    • num_train_epochs: Number of training epochs.
    • per_device_train_batch_size: Batch size per training device.
    • gradient_accumulation_steps: Number of steps for gradient accumulation.
    • gradient_checkpointing: Flag to enable gradient checkpointing.
    • optim: Optimizer to use for training.
    • logging_steps: Number of steps between logging.
    • learning_rate: Learning rate for the optimizer.
    • bf16: Flag to enable BF16 mixed-precision training.
    • tf32: Flag to enable TF32 mixed-precision training.
    • fp16: Flag to enable FP16 mixed-precision training.
    • max_grad_norm: Maximum gradient norm for gradient clipping.
    • warmup_ratio: Ratio of total training steps used for a linear warmup.
    • lr_scheduler_type: Type of learning rate scheduler.
  • sft_args: Arguments specific to the SFT (Supervised Fine-Tuning) process.
    • max_seq_length: Maximum sequence length for input sequences.
    • neftune_noise_alpha: Alpha parameter for NEFTune noise embeddings. If not null, NEFTune noise embeddings are activated.
training:
  training_args:
    num_train_epochs: 5
    per_device_train_batch_size: 4
    gradient_accumulation_steps: 4
    gradient_checkpointing: true
    optim: "paged_adamw_32bit"
    logging_steps: 100
    learning_rate: 2.0e-4
    bf16: true
    tf32: true
    max_grad_norm: 0.3
    warmup_ratio: 0.03
    lr_scheduler_type: "constant"
  sft_args:
    max_seq_length: 5000
    neftune_noise_alpha: null
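
As a quick sanity check on these values: the effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps × the number of devices, so the configuration above yields 4 × 4 = 16 examples per device per update.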

Inference

The inference section sets the parameters for the inference stage. It includes:

  • max_new_tokens: The maximum number of new tokens to generate.
  • use_cache: Whether to use the cache during inference.
  • do_sample: Whether to use sampling during inference.
  • top_p: The cumulative probability threshold for top-p sampling.
  • temperature: The temperature value for sampling.
inference:
  max_new_tokens: 1024
  use_cache: true
  do_sample: true
  top_p: 0.9
  temperature: 0.8

Quality Assurance

🚧 The qa section is not yet directly configurable from the configuration file and is currently being integrated into the CLI toolkit. In the meantime, you can run quality assurance manually and extend the toolkit with custom test classes that inherit from LLMQaTest, as sketched below.
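
For example, a minimal sketch of running the built-in tests by hand against a finished experiment might look like the following. It assumes the results csv uses columns named prompt, ground_truth, and prediction (check your own results.csv for the exact column names), and the path placeholder must be replaced with your run's hash:

import pandas as pd

from src.qa.qa import LLMTestSuite
from src.qa.qa_tests import LengthTest, RougeScoreTest, WordOverlapTest

# Illustrative path; substitute the hash directory of your experiment.
results = pd.read_csv("./experiment/<unique_hash>/results/results.csv")

suite = LLMTestSuite(
    tests=[LengthTest(), RougeScoreTest(), WordOverlapTest()],
    prompts=results["prompt"].tolist(),
    ground_truths=results["ground_truth"].tolist(),
    model_preds=results["prediction"].tolist(),
)
suite.run_tests()
suite.print_test_results()
suite.save_test_results("./experiment/<unique_hash>/qa/qa.csv")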

Ablation

The ablation section controls the settings for ablation studies. It includes:

  • use_ablate: Whether to perform ablation studies.
  • study_name: The name of the ablation study.

TIP

When use_ablate is set to true, the toolkit will generate multiple configurations by permuting the specified parameters. This allows you to compare different settings and their impact on the model's performance.

ablation:
  use_ablate: true
  study_name: "ablation_study_1"

Putting it All Together

To create a custom configuration file, start by copying the provided template and modify the parameters according to your needs. Pay attention to the structure and indentation of the YAML file to ensure it is parsed correctly.

Once you have defined your configuration, you can run the toolkit with your custom settings. The toolkit will load the configuration file, preprocess the data, train the model, perform inference and optionally run quality assurance tests and ablation studies based on the configuration.

Remember to adjust the paths, prompts, and other parameters to match your specific use case. Experiment with different settings to find the optimal configuration for your task.

Here's an example of a complete configuration file combining all the sections:

save_dir: "./experiments"

ablation:
  use_ablate: true
  study_name: "ablation_study_1"

data:
  file_type: "csv"
  path: "path/to/your/dataset.csv"
  prompt: "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: {instruction} ### Input: {input} ### Output:"
  prompt_stub: "{output}"
  test_size: 0.1
  train_size: 0.9
  train_test_split_seed: 42

model:
  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
  device_map: "auto"
  quantize: true
  bitsandbytes:
    load_in_4bit: true
    bnb_4bit_compute_dtype: "bf16"
    bnb_4bit_quant_type: "nf4"

lora:
  r: 32
  lora_alpha: 16
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj
  fan_in_fan_out: false
  modules_to_save: null
  layers_to_transform: null
  layers_pattern: null

training:
  training_args:
    num_train_epochs: 5
    per_device_train_batch_size: 4
    gradient_accumulation_steps: 4
    gradient_checkpointing: true
    optim: "paged_adamw_32bit"
    logging_steps: 100
    learning_rate: 2.0e-4
    bf16: true
    tf32: true
    max_grad_norm: 0.3
    warmup_ratio: 0.03
    lr_scheduler_type: "constant"
  sft_args:
    max_seq_length: 5000
    neftune_noise_alpha: null

inference:
  max_new_tokens: 1024
  use_cache: true
  do_sample: true
  top_p: 0.9
  temperature: 0.8

Extending Modules

The toolkit provides a modular and extensible architecture that allows developers to customize and enhance its functionality to suit their specific needs. Each component of the toolkit, such as data ingestion, fine-tuning, inference, and quality assurance testing, is designed to be easily extendable.


There are various scenarios where you might want to extend a particular module of the toolkit. For example:

Data Ingestion: If you have a custom data format or source that is not supported out of the box, the Ingestor class can be extended to handle your specific data format. For instance, if you have data stored in a proprietary binary format, you can create a new subclass of Ingestor that reads and processes your binary data and converts it into a compatible format for the toolkit.

Fine-tuning: If you want to experiment with different fine-tuning techniques or modify the fine-tuning process, you can extend the Finetune class. For example, if you want to incorporate a custom loss function or implement a new fine-tuning algorithm, you can create a subclass of Finetune and override the necessary methods to include your custom logic.

Inference: If you need to modify the inference process or add custom post-processing steps, you can extend the Inference class. For instance, if you want to apply domain-specific post-processing to the generated text or integrate the inference process with an external API, you can create a subclass of Inference and implement your custom functionality.

Quality Assurance (QA) Testing: If you have specific quality metrics or evaluation criteria that are not included in the existing QA tests, you can extend the LLMQaTest class to define your own custom tests. For example, if you want to evaluate the generated text based on domain-specific metrics or compare it against a custom benchmark, you can create a new subclass of LLMQaTest and implement your custom testing logic.

By extending the toolkit's components, you can tailor it to your specific requirements and incorporate custom functionality that is not provided by default. This flexibility allows you to adapt the toolkit to various domains, data formats, and evaluation criteria.

In the following sections, we will provide detailed guidance on how to extend each component of the toolkit, along with code examples.

To extend the data ingestor component, follow these steps:

  1. Open the file src/data/ingestor.py.
  2. Define a new class that inherits from the abstract base class Ingestor.
  3. Implement the required abstract method to_dataset in your custom ingestor class. This method should load and preprocess the data from the specified source and return a Dataset object.
  4. Update the `get_ingestor` function to include your custom ingestor class based on a new file type or data source.
Example
from src.data.ingestor import Ingestor

class CustomIngestor(Ingestor):
    def __init__(self, path):
        self.path = path

    def to_dataset(self):
        # Implement the logic to load and preprocess data from the specified path
        ...

def get_ingestor(data_type):
    if data_type == "custom":
        return CustomIngestor
    ...
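
As a slightly more concrete, purely illustrative sketch, here is what an ingestor for a hypothetical Parquet source might look like, assuming pandas and the datasets library are available:

import pandas as pd
from datasets import Dataset

from src.data.ingestor import Ingestor

class ParquetIngestor(Ingestor):
    """Hypothetical ingestor that reads a Parquet file into a HuggingFace Dataset."""

    def __init__(self, path):
        self.path = path

    def to_dataset(self) -> Dataset:
        # Read the Parquet file with pandas, then convert it into the Dataset
        # format the rest of the toolkit expects.
        df = pd.read_parquet(self.path)
        return Dataset.from_pandas(df)

You would then register this class in get_ingestor under a new file_type value such as "parquet".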

To extend the finetuning component, follow these steps:

  1. Create a new file in the src/finetune directory, e.g., custom_finetune.py.
  2. In this file, define a new class that inherits from the abstract base class Finetune from src/finetune/finetune.py.
  3. Implement the required abstract methods finetune and save_model in your custom finetuning class.
  4. The finetune method should take the training dataset and perform the finetuning process using the provided configuration.
  5. The save_model method should save the fine-tuned model to the specified directory.
  6. Modify the toolkit.py file to import your custom finetuning class and use it instead of the default LoRAFinetune class if needed.
Example
from src.finetune.finetune import Finetune

class CustomFinetune(Finetune):
    def finetune(self, train_dataset: Dataset):
        # Implement your custom finetuning logic here
        ...

    def save_model(self):
        # Implement the logic to save the finetuned model
        ...
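
As a purely illustrative sketch, a subclass that performs full-parameter fine-tuning with the standard transformers Trainer might look roughly like this. The constructor arguments and the assumption of a pre-tokenized dataset are hypothetical; the toolkit's own LoRAFinetune receives a Config and a DirectoryHelper instead (see the API Reference):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

from src.finetune.finetune import Finetune

class FullFinetune(Finetune):
    """Hypothetical full-parameter fine-tune (no LoRA), shown only as a sketch."""

    def __init__(self, model_ckpt: str, save_dir: str):
        self.save_dir = save_dir
        self.tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
        self.model = AutoModelForCausalLM.from_pretrained(model_ckpt)
        self.trainer = None

    def finetune(self, train_dataset: Dataset):
        # Assumes train_dataset is already tokenized into input_ids / labels.
        args = TrainingArguments(output_dir=self.save_dir, num_train_epochs=1)
        self.trainer = Trainer(model=self.model, args=args, train_dataset=train_dataset)
        self.trainer.train()

    def save_model(self):
        # Save the model weights and tokenizer for later inference.
        self.trainer.save_model(self.save_dir)
        self.tokenizer.save_pretrained(self.save_dir)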

To extend the inference component, follow these steps:

  1. Create a new file in the src/inference directory, e.g., custom_inference.py.
  2. In this file, define a new class that inherits from the abstract base class Inference from src/inference/inference.py.
  3. Implement the required abstract methods infer_one and infer_all in your custom inference class.
  4. The infer_one method should take a single prompt and generate the model's prediction.
  5. The infer_all method should iterate over the test dataset and generate predictions for each example.
  6. Modify the toolkit.py file to import your custom inference class and use it instead of the default LoRAInference class if needed.
Example
from src.inference.inference import Inference

class CustomInference(Inference):
    def infer_one(self, prompt: str):
        # Implement the logic to generate a prediction for a single prompt
        ...

    def infer_all(self):
        # Implement the logic to generate predictions for the entire test dataset
        ...
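
For illustration only, the body of infer_one typically wraps a standard generate call. The snippet below assumes a hypothetical constructor that receives an already-loaded model, tokenizer, and test dataset (the toolkit's own LoRAInference is constructed differently; see the API Reference), and it assumes the test dataset exposes a formatted_prompt column:

from src.inference.inference import Inference

class GreedyInference(Inference):
    """Hypothetical inference class that decodes greedily instead of sampling."""

    def __init__(self, model, tokenizer, test_dataset):
        self.model = model
        self.tokenizer = tokenizer
        self.test_dataset = test_dataset

    def infer_one(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=256, do_sample=False)
        # Drop the prompt tokens so only the newly generated text is returned.
        new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

    def infer_all(self):
        # The real toolkit writes results to a csv; here we simply collect them.
        return [self.infer_one(example["formatted_prompt"]) for example in self.test_dataset]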

To extend the quality assurance (QA) tests, follow these steps:

  1. Open the file src/qa/qa_tests.py.
  2. Define a new class that inherits from the abstract base class LLMQaTest from src/qa/qa.py.
  3. Implement the required abstract property test_name and the abstract method get_metric in your custom QA test class.
  4. The test_name property should return a string representing the name of the test.
  5. The get_metric method should take the prompt, ground truth, and model prediction, and return a metric value (e.g., float, int, or bool) indicating the test result.
  6. Include an instance of your new CustomQATest when instantiating the LLMTestSuite object.
Example
from src.qa.qa import LLMQaTest, LLMTestSuite
from src.qa.qa_tests import JaccardSimilarityTest

class CustomQATest(LLMQaTest):
    @property
    def test_name(self):
        return "Custom QA Test"

    def get_metric(self, prompt, ground_truth, model_prediction):
        # Implement the logic to calculate the metric for the custom QA test
        ...


test_suite = LLMTestSuite([JaccardSimilarityTest(), CustomQATest()], prompts, ground_truths, model_preds)
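
For a concrete, if deliberately simple, example of get_metric, the illustrative test below returns True when the prediction matches the ground truth after whitespace normalization:

from src.qa.qa import LLMQaTest

class ExactMatchTest(LLMQaTest):
    """Illustrative test: True when the prediction equals the ground truth."""

    @property
    def test_name(self):
        return "Exact Match"

    def get_metric(self, prompt, ground_truth, model_prediction):
        # Collapse whitespace so formatting-only differences are ignored.
        def normalize(text):
            return " ".join(text.split())

        return normalize(model_prediction) == normalize(ground_truth)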

Contribution Guide

Thank you for your interest in contributing to this open-source project.

To start contributing to the project, follow these steps:

  1. Fork the repository on GitHub.
  2. Clone your forked repository to your local machine.
  3. Create a new branch for your feature or bug fix.
  4. Make your changes and commit them with descriptive commit messages.
  5. Push your changes to your forked repository.
  6. Submit a pull request to the main repository's main branch.

Before submitting a pull request, ensure that your code follows the project's coding style and passes all existing tests (🚧 work in progress).

To set up the development environment, follow these steps:

  1. Install the required dependencies using one of the recommended installation methods described above.
  2. Run the existing tests to ensure everything is functioning correctly (🚧 work in progress).

When contributing code to the project, please adhere to the following guidelines:

  • Follow the PEP 8 style guide for Python code.
  • Use the black code formatter.
  • Use meaningful variable and function names that clearly convey their purpose.
  • Write docstrings for classes, methods, and functions to provide clear documentation.
  • Include inline comments to explain complex or non-obvious code sections.
  • Break down large functions or methods into smaller, reusable components.
  • Write unit tests for new features or bug fixes to ensure code correctness.

Improvements to the project's documentation (either the documentation site or the docstrings in the codebase) are appreciated. If you find any errors, inconsistencies, or areas that need clarification, please feel free to submit a pull request with the necessary changes.

When contributing to the documentation, follow these guidelines:

  • Use clear and concise language.
  • Provide step-by-step instructions or examples when appropriate.
  • Ensure that the documentation is up to date with the latest changes in the codebase.
  • Maintain a consistent formatting and structure throughout the documentation.

If you encounter any bugs or issues, or have suggestions for improvements, please submit an issue on the project's GitHub repository. When submitting an issue, provide as much detail as possible, including:

  • A clear and descriptive title.
  • Steps to reproduce the issue or bug.
  • Expected behavior and actual behavior.
  • Any relevant error messages or logs.
  • Your operating system and Python version.

When submitting a pull request, please ensure that:

  • Your code adheres to the project's coding guidelines.
  • Your changes are well-tested and do not introduce new bugs.
  • Your commit messages are descriptive and explain the purpose of the changes.
  • You have updated the relevant documentation, if necessary.

Once your pull request is submitted, the project maintainers will review your changes and provide feedback. Be prepared to make revisions or address any concerns raised during the review process.

By contributing to this project, you agree that your contributions will be licensed under the LICENSE file in the repository.

Data

src.data.ingestor.Ingestor

( path: str )

The Ingestor class is an abstract base class for data ingestors.

Parameters

  • path: str - The path of the dataset.

to_dataset() -> Dataset
(  )

An abstract method to be implemented by subclasses. Converts the input data to a Dataset object.

Returns

Dataset - The converted Dataset object.

src.data.ingestor.JsonIngestor

( path: str )

The JsonIngestor class is a subclass of Ingestor for ingesting JSON data.

Parameters

  • path: str - The path of the JSON dataset.

to_dataset() -> Dataset
(  )

Converts the JSON data to a Dataset object.

Returns

Dataset - The converted Dataset object.

src.data.ingestor.CsvIngestor

( path: str )

The CsvIngestor class is a subclass of Ingestor for ingesting CSV data.

Parameters

  • path: str - The path of the CSV dataset.

to_dataset() -> Dataset
(  )

Converts the CSV data to a Dataset object.

Returns

Dataset - The converted Dataset object.

src.data.ingestor.HuggingfaceIngestor

( path: str )

The HuggingfaceIngestor class is a subclass of Ingestor for ingesting data from a HuggingFace dataset.

Parameters

  • path: str - The path or name of the HuggingFace dataset.

to_dataset() -> Dataset
(  )

Converts the HuggingFace data to a Dataset object.

Returns

Dataset - The converted Dataset object.

src.data.ingestor.get_ingestor

( data_type: str )

A function to get the appropriate ingestor class based on the data type.

Parameters

  • data_type: str - The type of data ("json", "csv", or "huggingface").

Returns

Ingestor - The corresponding ingestor class.

src.data.dataset_generator.DatasetGenerator

( file_type: str, path: str, prompt: str, prompt_stub: str, test_size: Union[float, int], train_size: Union[float, int], train_test_split_seed: int )

The DatasetGenerator class is responsible for generating and formatting datasets for training and testing.

Parameters

  • file_type: str - The type of input file ("json", "csv", or "huggingface").
  • path: str - The path to the input file or HuggingFace dataset.
  • prompt: str - The prompt template for formatting the dataset.
  • prompt_stub: str - The prompt stub used during training.
  • test_size: Union[float, int] - The size of the test set.
  • train_size: Union[float, int] - The size of the training set.
  • train_test_split_seed: int - The random seed for splitting the dataset.

get_dataset
(  )

Generates and returns the formatted train and test datasets.

Returns

A tuple containing the train and test datasets.

save_dataset
( save_dir: str )

Saves the generated dataset to the specified directory.

Parameters

  • save_dir: str - The directory to save the dataset.

load_dataset_from_pickle
( save_dir: str )

Loads the dataset from a pickle file in the specified directory.

Parameters

  • save_dir: str - The directory containing the dataset pickle file.

Returns

A tuple containing the loaded train and test datasets.
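
A hypothetical standalone usage, with keyword names mirroring the parameters above (in the toolkit itself these values come from the data section of the config, and the save directory shown is just a placeholder):

from src.data.dataset_generator import DatasetGenerator

generator = DatasetGenerator(
    file_type="csv",
    path="path/to/your/dataset.csv",
    prompt="### Instruction: {instruction}\n### Output:",
    prompt_stub="{output}",
    test_size=0.1,
    train_size=0.9,
    train_test_split_seed=42,
)

train_dataset, test_dataset = generator.get_dataset()
generator.save_dataset("./experiment/example_run/dataset")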

Fine-tuning

src.finetune.finetune.Finetune

The Finetune class is an abstract base class for finetuning models.

finetune
(  )

An abstract method to be implemented by subclasses. Fine-tunes the model.

save_model
(  )

An abstract method to be implemented by subclasses. Saves the fine-tuned model.

src.finetune.lora.LoRAFinetune

( config: Config, directory_helper: DirectoryHelper )

The LoRAFinetune class is a subclass of Finetune for finetuning models using LoRA (Low-Rank Adaptation).

Parameters

  • config: Config - The configuration object.
  • directory_helper: DirectoryHelper - The directory helper object.

finetune
( train_dataset: Dataset )

Fine-tunes the model using the provided training dataset.

Parameters

  • train_dataset: Dataset - The training dataset.

save_model
(  )

Saves the fine-tuned model.

Inference

src.inference.inference.Inference

The Inference class is an abstract base class for performing inference.

infer_one
( prompt: str )

An abstract method to be implemented by subclasses. Performs inference on a single prompt.

Parameters

  • prompt: str - The input prompt.

infer_all
(  )

An abstract method to be implemented by subclasses. Performs inference on all test examples.

src.inference.lora.LoRAInference

( test_dataset: Dataset, label_column_name: str, config: Config, dir_helper: DirectoryHelper )

The LoRAInference class is a subclass of Inference for performing inference using LoRA models.

Parameters

  • test_dataset: Dataset - The test dataset.
  • label_column_name: str - The name of the label column in the test dataset.
  • config: Config - The configuration object.
  • dir_helper: DirectoryHelper - The directory helper object.

infer_one
( prompt: str )

Performs inference on a single prompt and returns the generated text.

Parameters

  • prompt: str - The input prompt.

Returns

str - The generated text.

infer_all
(  )

Performs inference on all test examples in test_dataset and saves the results in a csv.

Quality Assurance

src.qa.qa.LLMQaTest

The LLMQaTest class is an abstract base class for defining quality assurance tests for language models.

test_name
(  )

An abstract property to be implemented by subclasses. Returns the name of the test.

get_metric
( prompt: str, ground_truth: str, model_pred: str )

Computes the metric for the test based on the input prompt, the ground truth output, and the model's predicted output.

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_pred: str - The model's predicted output.

Returns

The computed metric, which can be a float, an int, or a bool.

src.qa.qa_tests.LengthTest

(   )

A quality assurance test that measures the absolute difference in length between the ground truth text and the model's prediction. This test aims to evaluate the summary length consistency, providing a straightforward metric for assessing how closely the model's output matches the expected length of the ground truth.

test_name
( ) -> str

Returns

str - "Summary Length Test"

get_metric
( prompt: str, ground_truth: str, model_pred: str ) -> int

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_pred: str - The model's predicted output.

Returns

int - the absolute difference in character length between the ground truth and model prediction

src.qa.qa_tests.JaccardSimilarityTest

(   )

Evaluates the similarity between the ground truth text and the model's prediction using the Jaccard Similarity measure. This metric calculates the size of the intersection divided by the size of the union of the sample sets, providing insight into how similar the predicted text is to the ground truth in terms of unique words.

test_name
( ) -> str

Returns

str - "Jaccard Similarity"

get_metric
(prompt: str, ground_truth: str, model_prediction: str) -> float

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_prediction: str - The model's predicted output.

Returns

float - the Jaccard Similarity score between the ground truth and model prediction.

src.qa.qa_tests.DotProductSimilarityTest

(   )

Evaluates the semantic similarity between the ground truth and the model's prediction by computing the dot product between their sentence embeddings. This test aims to capture the closeness of the meanings of the two texts, providing a more nuanced understanding of the model's performance in preserving semantic content.

test_name
( ) -> str

Returns

str - "Semantic Similarity"

get_metric
(prompt: str, ground_truth: str, model_prediction: str) -> float

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_prediction: str - The model's predicted output.

Returns

float - the dot product similarity score, indicating the degree of semantic similarity between the ground truth and model prediction.

src.qa.qa_tests.RougeScoreTest

(   )

Measures the Rouge score, specifically the precision of Rouge-1, between the model's prediction and the ground truth. This test focuses on the overlap of unigrams, providing a metric for assessing the model's ability to reproduce key words and phrases from the ground truth.

test_name
( ) -> str

Returns

str - "Rouge Score"

get_metric
( prompt: str, ground_truth: str, model_pred: str ) -> float

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_pred: str - The model's predicted output.

Returns

float - the precision component of the Rouge-1 score, reflecting the proportion of the model's unigrams found in the ground truth.

src.qa.qa_tests.WordOverlapTest

(   )

A test that calculates the percentage of word overlap between the ground truth and the model's prediction, after removing common stop words. This metric provides a straightforward measure of content similarity, emphasizing the shared vocabulary while ignoring frequent but less meaningful words.

test_name
( ) -> str

Returns

str - "Word Overlap Test"

get_metric
( prompt: str, ground_truth: str, model_pred: str ) -> float

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_pred: str - The model's predicted output.

Returns

float - the percentage of overlap in significant words between the ground truth and model prediction, indicating content overlap.

src.qa.qa_tests.VerbPercent

(   )

Assesses the composition of the model's prediction by calculating the percentage of verbs within the text. This test provides insights into the dynamic versus static nature of the content generated by the model, with a higher proportion of verbs potentially indicating more active and vivid descriptions.

test_name
( ) -> str

Returns

str - "Verb Composition"

get_metric
(prompt: str, ground_truth: str, model_prediction: str) -> float

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_prediction: str - The model's predicted output.

Returns

float - the percentage of words classified as verbs in the model prediction, shedding light on the action-oriented nature of the generated text.

src.qa.qa_tests.AdjectivePercent

(   )

Focuses on the proportion of adjectives in the model's prediction to evaluate the descriptiveness and detail within the generated text. This test helps gauge how well the model captures and conveys detailed attributes and qualities of subjects in its outputs.

test_name
( ) -> str

Returns

str - "Adjective Composition"

get_metric
( prompt: str, ground_truth: str, model_pred: str ) -> float

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_pred: str - The model's predicted output.

Returns

float - the proportion of adjectives, offering insight into the richness and descriptiveness of the model's language.

src.qa.qa_tests.NounPercent

(   )

Evaluates the model's prediction by calculating the percentage of nouns, providing a measure of how substantially the model generates content with tangible subjects and entities. This test can indicate the model's ability to maintain focus on key topics and to populate its narratives with relevant nouns.

test_name
( ) -> str

Returns

str - "Noun Composition"

get_metric
( prompt: str, ground_truth: str, model_pred: str ) -> float

Parameters

  • prompt: str - The input prompt.
  • ground_truth: str - The ground truth output.
  • model_pred: str - The model's predicted output.

Returns

float - the percentage of nouns in the text, reflecting on the subject matter density and relevance in the generated content.

src.qa.qa.LLMTestSuite

( tests: List[LLMQaTest], prompts: List[str], ground_truths: List[str], model_preds: List[str] )

The LLMTestSuite class represents a suite of quality assurance tests for language models.

Parameters

  • tests: List[LLMQaTest] - A list of LLMQaTest objects representing the tests to run.
  • prompts: List[str] - A list of input prompts.
  • ground_truths: List[str] - A list of ground truth outputs.
  • model_preds: List[str] - A list of the model's predicted outputs.

run_tests
( )

Runs all the tests in the suite and returns the results as a dictionary mapping test names to their corresponding metrics.

Returns

A dictionary mapping test names to their corresponding metrics, which can be floats, ints, or bools.

print_test_results
( )

Prints the test results in a tabular format.

save_test_results
( path: str )

Saves the test results to a CSV file.

Parameters

  • path: str - The path to save the CSV file.

User Interface

src.ui.ui.UI

The UI class is an abstract base class for user interface components. This class outlines a framework for displaying information and facilitating interaction with users across various stages of the toolkit's execution, including dataset creation, fine-tuning, inference, and quality assurance testing.

This component is designed to be subclassed, with specific implementations providing concrete methods for all interactions required by the toolkit. These interactions could range from input collection for dataset specifications to displaying fine-tuning progress, presenting inference results, and summarizing quality assurance test outcomes.

Utilities

src.utils.save_utils.DirectoryList

( save_dir: str, config_hash: str )

The DirectoryList class represents a structured approach to managing directories for saving various components of experiment results, ensuring organized storage and easy access to different types of data generated during the experiment's lifecycle.

Class Attributes

  • save_dir: str - The base directory where all experiment results are saved. It acts as the root directory for storing the outputs of different experiments.
  • config_hash: str - A unique identifier for each experiment configuration, used to create separate subdirectories under save_dir for different experiment runs.

Properties

  • experiment: str - Returns the path to the specific experiment directory, combining save_dir with config_hash. This directory acts as the container for all data related to a particular experiment configuration.
  • config: str - Returns the path to the configuration file within the experiment directory. This file stores the experiment's settings and parameters.
  • dataset: str - Returns the path to the directory where dataset files are stored, allowing for separation between different types of data used or generated by the experiment.
  • weights: str - Returns the path to the directory where model weights are saved, facilitating easy access to trained models.
  • results: str - Returns the path to the directory where experiment results, such as metrics or output files, are stored.
  • qa: str - Returns the path to the directory dedicated to quality assurance or testing results, ensuring that evaluation outputs are organized and retrievable.

src.utils.save_utils.DirectoryHelper

( config_path: str, config: Config )

The DirectoryHelper class provides helper methods for managing directories and saving configurations, facilitating the organization and preservation of experiment settings and outcomes.

Parameters

  • config_path: str - The path to the configuration file.
  • config: Config - The configuration object.

Attributes

  • config_path: str - The path to the configuration file. This attribute stores the location of the main configuration file used by the experiment, enabling the DirectoryHelper to access and manage experiment configurations.
  • config: Config - The configuration object. This attribute holds the actual configuration settings loaded from the configuration file, encapsulating all experiment parameters and settings in an accessible object format.
  • sqids: Sqids - An instance of the Sqids class, which is used for managing unique config IDs to track experiments.
  • save_paths: DirectoryList - Represents the structured list of directory paths associated with the current experiment.

save_config() -> None
(  )

Saves the configuration to a file, ensuring that experiment parameters are documented and reproducible.

src.utils.ablation_utils.generate_permutations

( yaml_dict: dict, model: BaseModel )

Generates permutations of a YAML dictionary based on specified ablations. This function is pivotal for creating variations in configurations or datasets, enabling thorough testing and exploration of different scenarios.

Parameters

  • yaml_dict: dict - The YAML dictionary containing the definitions for ablations.
  • model: BaseModel - The pydantic BaseModel object, which might be used to reference or validate the permutations against specific criteria or configurations.

Returns

A list of permuted dictionaries, each representing a variation of the original YAML dictionary based on the defined ablations, facilitating extensive testing or data variation analysis.

Pydantic Models

src.pydantic_models.config_model.Config

( save_dir: Optional[str], ablation: AblationConfig, accelerate: Optional[bool], data: DataConfig, model: ModelConfig, lora: LoraConfig, training: TrainingConfig, inference: InferenceConfig )

Represents the overall configuration for the toolkit, including all necessary settings and parameters required for its operation. This configuration encapsulates everything from data handling and model specification to training and inference settings, as well as ablation studies and optimization flags.

Attributes

  • save_dir: Directory for saving outputs.
  • ablation: Configuration for ablation studies.
  • accelerate: Enables multi-GPU training if set to True.
  • data: Data ingestion configuration.
  • model: Model configuration, including specifics for handling and optimization.
  • lora: Configuration for LoRA (Low-Rank Adaptation) adjustments.
  • training: Training configurations, including batch sizes, learning rates, and more.
  • inference: Inference settings, such as token limits and sampling strategies.

src.pydantic_models.config_model.DataConfig

( file_type: Literal['json', 'csv', 'huggingface'], path: Union[FilePath, HfModelPath], prompt: str, prompt_stub: str, train_size: Optional[Union[float, int]], test_size: Optional[Union[float, int]], train_test_split_seed: int )

Represents the configuration for data ingestion, specifying how data is loaded, prepared, and split for training and testing. This component is crucial for ensuring that the data feeding into the model is correctly formatted and segmented.

Attributes

  • file_type: The format of the dataset file (JSON, CSV, or a HuggingFace dataset).
  • path: Path to the dataset or HuggingFace model.
  • prompt: Template for generating model inputs.
  • prompt_stub: Template fragment used during training.
  • train_size: Specifies the size of the training dataset.
  • test_size: Specifies the size of the test dataset.
  • train_test_split_seed: Seed for reproducible train/test splits.


src.pydantic_models.config_model.ModelConfig

( hf_model_ckpt: Optional[str], device_map: Optional[str], quantize: Optional[bool], bitsandbytes: BitsAndBytesConfig )

Details the configuration for the model, including paths to pre-trained models, device mapping for training, and options for quantization to optimize model performance and resource usage.

Attributes

  • hf_model_ckpt: Path or identifier for a HuggingFace model checkpoint.
  • device_map: Specifies how the model should be distributed across available devices.
  • quantize: Enables model quantization for performance optimization.
  • bitsandbytes: Settings for BitsAndBytes quantization strategies.

src.pydantic_models.config_model.BitsAndBytesConfig

( load_in_8bit: Optional[bool], llm_int8_threshold: Optional[float], llm_int8_skip_modules: Optional[List[str]], llm_int8_enable_fp32_cpu_offload: Optional[bool], llm_int8_has_fp16_weight: Optional[bool], load_in_4bit: Optional[bool], bnb_4bit_compute_dtype: Optional[str], bnb_4bit_quant_type: Optional[str], bnb_4bit_use_double_quant: Optional[bool] )

Represents the configuration for BitsAndBytes quantization, offering detailed control over how models are quantized to improve performance and reduce memory footprint. These settings allow for advanced optimization techniques, including 8-bit and 4-bit quantization, with options for handling outliers and mixed precision.

Attributes

  • load_in_8bit: Enable 8-bit quantization with specifics on handling outliers and module exceptions.
  • llm_int8_threshold: Threshold for outlier detection in 8-bit quantization.
  • llm_int8_skip_modules: Modules to exclude from 8-bit quantization.
  • llm_int8_enable_fp32_cpu_offload: Offloads part of the model to CPU in fp32 to save memory.
  • llm_int8_has_fp16_weight: Allows 16-bit weights in conjunction with 8-bit quantization.
  • load_in_4bit: Enable 4-bit quantization for further size and speed optimization.
  • bnb_4bit_compute_dtype: Defines the computational datatype in 4-bit quantization.
  • bnb_4bit_quant_type: Specifies the quantization datatype in 4-bit layers.
  • bnb_4bit_use_double_quant: Enables nested quantization for potentially higher efficiency.

src.pydantic_models.config_model.LoraConfig

( r: Optional[int], task_type: Optional[str], lora_alpha: Optional[int], bias: Optional[str], lora_dropout: Optional[float], target_modules: Optional[List[str]], fan_in_fan_out: Optional[bool], modules_to_save: Optional[List[str]], layers_to_transform: Optional[Union[List[int], int]], layers_pattern: Optional[str] )

Details the configuration for applying LoRA (Low-Rank Adaptation) to a model, enhancing its ability to adapt to new tasks without extensive retraining. LoRA settings determine how and where these adaptations are applied within the model architecture.

Attributes

  • r: Rank for the LoRA adaptation, affecting the number of trainable parameters.
  • task_type: Indicates the model's task type during training to guide LoRA adjustments.
  • lora_alpha: Scaling factor for LoRA parameters.
  • bias: Specifies how biases are handled in LoRA-adapted layers.
  • lora_dropout: Dropout rate for LoRA layers, helping prevent overfitting.
  • target_modules: Model components targeted for LoRA adaptation.
  • fan_in_fan_out: Adjusts weight shape assumptions for compatibility with LoRA.
  • modules_to_save: Explicitly marks non-LoRA modules for retention and training.
  • layers_to_transform: Identifies specific layers for LoRA transformation.
  • layers_pattern: Pattern matching for selecting layers for adaptation.

src.pydantic_models.config_model.TrainingConfig

( training_args: TrainingArgs, sft_args: SftArgs )

Encapsulates the configuration for the training process, including both general training parameters and settings specific to Supervised Fine-Tuning (SFT). This dual configuration approach allows for fine-grained control over the training regimen.

Attributes

  • training_args: Core training arguments, covering epochs, batch sizes, and optimization strategies.
  • sft_args: Supervised Fine-Tuning arguments, providing additional options for fine-tuning performance.

src.pydantic_models.config_model.TrainingArgs

( num_train_epochs: Optional[int], per_device_train_batch_size: Optional[int], gradient_accumulation_steps: Optional[int], gradient_checkpointing: Optional[bool], optim: Optional[str], logging_steps: Optional[int], learning_rate: Optional[float], bf16: Optional[bool], tf32: Optional[bool], fp16: Optional[bool], max_grad_norm: Optional[float], warmup_ratio: Optional[float], lr_scheduler_type: Optional[str] )

Defines the core training parameters for the model, covering every aspect from epoch counts to specific hardware optimizations. These arguments provide a comprehensive toolkit for customizing the training process to suit different models, datasets, and hardware configurations.

Attributes

  • num_train_epochs: Specifies the total number of epochs for training.
  • per_device_train_batch_size: Sets the batch size for each training device.
  • gradient_accumulation_steps: Determines the number of steps to accumulate gradients before updating model parameters.
  • gradient_checkpointing: Enables memory-efficient gradient checkpointing.
  • optim: Chooses the optimizer for training.
  • logging_steps: Configures the frequency of logging for training metrics.
  • learning_rate: Sets the initial learning rate.
  • bf16: Activates BF16 training for compatible hardware.
  • tf32: Enables TF32 precision on NVIDIA Ampere GPUs.
  • fp16: Engages FP16 precision for faster computation and reduced memory usage.
  • max_grad_norm: Caps the norm of the gradients to prevent explosion.
  • warmup_ratio: Adjusts the learning rate as a proportion of the total training steps for a gradual start.
  • lr_scheduler_type: Specifies the learning rate scheduler to be used.

src.pydantic_models.config_model.SftArgs

( max_seq_length: Optional[int], neftune_noise_alpha: Optional[float] )

Captures the specific configurations for Supervised Fine-Tuning (SFT), including parameters that affect how models process input sequences and apply NEFTune noise embeddings. These settings help tailor the fine-tuning process to the instructional nuances of the dataset.

Attributes

  • max_seq_length: Limits the length of input sequences to the model.
  • neftune_noise_alpha: Activates NEFTune noise embeddings, which can significantly enhance model performance for instruction-based fine-tuning by introducing a controlled amount of noise into the embeddings.

src.pydantic_models.config_model.InferenceConfig

( max_new_tokens: Optional[int], use_cache: Optional[bool], do_sample: Optional[bool], top_p: Optional[float], temperature: Optional[float], epsilon_cutoff: Optional[float], eta_cutoff: Optional[float], top_k: Optional[int] )

Defines the parameters governing the inference phase, focusing on output generation and sampling behavior. These settings allow users to balance between creativity, diversity, and fidelity to the input prompt.

Attributes

  • max_new_tokens: Limits the number of new tokens generated.
  • use_cache: Enables caching for efficiency during generation.
  • do_sample: Activates stochastic sampling for output generation.
  • top_p: Controls the nucleus sampling threshold.
  • temperature: Adjusts the sharpness of the probability distribution.
  • epsilon_cutoff: Introduces cutoffs to refine sampling strategies.
  • eta_cutoff: Further refines sampling cutoffs for nuanced control.
  • top_k: Limits the sampling pool to the top-k most likely tokens.

src.pydantic_models.config_model.AblationConfig

( use_ablate: Optional[bool], study_name: Optional[str] )

Specifies whether ablation studies are to be conducted and, if so, under what overarching study name. This configuration is crucial for systematically exploring the impact of various model components and settings on performance.

Attributes

  • use_ablate: Enables the execution of ablation studies.
  • study_name: Provides a label for grouping related ablation experiments.