format_conversation_dataset
tools for formatting datasets for fine tuning
Convert your diarized content into a dataset that can be used to finetune a model!
Install
pip install format_conversation_dataset
How to use
Designate a speaker number as the ‘assistant’ and supply input and output file paths, and this module with do the rest.
from format_covnersation_dataset.core import *
convert_file(‘input/file/path’, ‘output/file/path’, 1, “You are participating in a conversation”)
This will output a json format like so:
{‘messages’: [ { ‘role’ : ‘system’, ‘content’ : ‘You are participating in a conversation’ }, { ‘role’ : ‘user’, ‘content’ : ‘SPEAKER_02 : Hello everyone SPEAKER_03 : Good morning SPEAKER_04 : Hi’ }, { ‘role’ : ‘assistant’, ‘content’ : ‘Hi!’ },
]}