format_conversation_dataset

tools for formatting datasets for fine tuning

Convert your diarized content into a dataset that can be used to finetune a model!

Install

pip install format_conversation_dataset

How to use

Designate a speaker number as the ‘assistant’ and supply input and output file paths, and this module with do the rest.

from format_covnersation_dataset.core import *

convert_file(‘input/file/path’, ‘output/file/path’, 1, “You are participating in a conversation”)

This will output a json format like so:

{‘messages’: [ { ‘role’ : ‘system’, ‘content’ : ‘You are participating in a conversation’ }, { ‘role’ : ‘user’, ‘content’ : ‘SPEAKER_02 : Hello everyone SPEAKER_03 : Good morning SPEAKER_04 : Hi’ }, { ‘role’ : ‘assistant’, ‘content’ : ‘Hi!’ },
]}