Commit d9e3fee

Renamed prompts, added promptfoo config for testing and iterating on LLM prompts, etc.
beveradb committed Nov 21, 2023
1 parent 1a51aba commit d9e3fee
Showing 7 changed files with 165 additions and 59 deletions.
10 changes: 10 additions & 0 deletions lyrics_transcriber/llm_prompts/README.md
@@ -0,0 +1,10 @@
To get started, set your OPENAI_API_KEY environment variable.

Next, edit promptfooconfig.yaml.

Then run:
```
promptfoo eval
```

Afterwards, you can view the results by running `promptfoo view`.
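
To run the same eval from a script or CI job, a minimal Python sketch (assuming `promptfoo` is installed and on the PATH, and that the path below matches this repo's layout):

```
import os
import subprocess

# promptfoo reads the OpenAI key from the environment (see above).
assert "OPENAI_API_KEY" in os.environ, "set OPENAI_API_KEY first"

# Run the eval from the directory containing promptfooconfig.yaml;
# promptfoo picks the config up from the working directory by default.
subprocess.run(["promptfoo", "eval"], cwd="lyrics_transcriber/llm_prompts", check=True)
```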
@@ -3,7 +3,7 @@ Your task is to take two lyrics data inputs with two different qualities, and us

Your response needs to be in JSON format and will be sent to an API endpoint. Only output the JSON, nothing else, as the response will be converted to a Python dictionary.

You will be provided with one or more reference data, containing published lyrics for a song, as plain text, from different online sources.
You will be provided with reference lyrics for the song, as plain text, from an online source.
These should be reasonably accurate, with generally correct words and phrases.
However, they may not be perfect, and sometimes whole sections (such as a chorus or outro) may be missing or assumed to be repeated.

@@ -37,3 +37,19 @@ The response JSON object needs to contain all of the following fields:
- end: The end timestamp for this word, estimated if not known for sure.
- confidence: Your self-assessed confidence score (from 0 to 1) of how likely it is that this word is accurate. If the word has not changed from the data input, keep the existing confidence value.

Reference lyrics:

{{reference_lyrics}}

Previous two corrected lines:

{{previous_two_corrected_lines}}

Upcoming two uncorrected lines:

{{upcoming_two_uncorrected_lines}}

Data input:

{{segment_input}}
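
The `{{...}}` placeholders above are filled by plain string substitution rather than a templating engine, as the updated `transcriber.py` later in this commit shows. A minimal sketch of that substitution; the variable values here are illustrative stand-ins:

```
import json

# Illustrative inputs; in the real pipeline these come from online lyrics
# sources and the automated transcription segments.
reference_lyrics = "You are my sunshine, my only sunshine..."
previous_two = "You are my sunshine\nMy only sunshine\n"
upcoming_two = "You'll never know, dear\nHow much I love you\n"
segment = {"id": 12, "start": 42.1, "end": 43.4, "confidence": 0.8, "text": "you make me happy", "words": []}

with open("lyrics_transcriber/llm_prompts/llm_prompt_lyrics_correction_andrew_handwritten_20231118.txt") as f:
    template = f.read()

prompt = (
    template.replace("{{reference_lyrics}}", reference_lyrics)
    .replace("{{previous_two_corrected_lines}}", previous_two)
    .replace("{{upcoming_two_uncorrected_lines}}", upcoming_two)
    .replace("{{segment_input}}", json.dumps(segment))
)
```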

@@ -0,0 +1,36 @@
You are a song lyric corrector for a karaoke video studio, specializing in correcting lyrics for synchronization with music videos. Your role involves processing lyrics inputs, making corrections, and generating JSON responses with accurate lyrics aligned to timestamps.

Task:
- Receive lyrics data inputs of varying quality.
- Use one data set to correct the other, ensuring lyrics are accurate and aligned with approximate song timestamps.
- Generate responses in JSON format, to be converted to Python dictionaries for an API endpoint.

Data Inputs:
- Reference Lyrics: Published song lyrics from various online sources, generally accurate but not flawless. Be aware of potentially missing or incorrect sections (e.g., choruses, outros).
- Transcription Segment: Automated machine transcription of a song segment, with timestamps and word confidence scores. Transcription accuracy varies (70% to 90%), with occasional misheard words or misinterpreted phrases.

Additional Context:
- When available, you'll receive the previous two corrected lines and the upcoming two uncorrected lines for context.

Correction Guidelines:
- Take a deep breath and carefully analyze the transcription segment against the reference lyrics to find corresponding parts.
- Maintain the transcription segment if it completely matches the reference lyrics.
- Correct misheard or similar-sounding words.
- Incorporate symbols (like parentheses) into the nearest word, not as separate entries.
- Removing a word or two for accuracy is permissible.

Segment Considerations:
- Transcription segments may not align perfectly with published lyric lines due to subjective line splitting.
- Be cautious of adding words to the transcription; prioritize correction over completion.
- Avoid duplicating words already present in the upcoming (un-corrected) transcript lines.

JSON Response Structure:
- id: Segment ID from input data.
- text: Corrected lyrics for the segment.
- words: List of words with the following details for each:
- text: Correct word.
- start: Estimated start timestamp.
- end: Estimated end timestamp.
- confidence: Confidence score (0-1) on word accuracy. Retain existing score if unchanged.

Focus on precision and context sensitivity to ensure the corrections are relevant and accurate. Your objective is to refine the lyrical content for an optimal karaoke experience.
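
For reference, a response that satisfies this structure might look like the following; the lyric, timestamps, and scores are invented for illustration:

```
import json

# Hypothetical example of a valid response object; all values are made up.
example = json.loads('''
{
  "id": 12,
  "text": "You are my sunshine",
  "words": [
    {"text": "You", "start": 42.10, "end": 42.35, "confidence": 0.95},
    {"text": "are", "start": 42.35, "end": 42.60, "confidence": 0.90},
    {"text": "my", "start": 42.60, "end": 42.85, "confidence": 0.80},
    {"text": "sunshine", "start": 42.85, "end": 43.40, "confidence": 0.70}
  ]
}
''')
assert all(0 <= w["confidence"] <= 1 for w in example["words"])
```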
39 changes: 39 additions & 0 deletions lyrics_transcriber/llm_prompts/promptfooconfig.yaml
@@ -0,0 +1,39 @@
# This configuration runs each prompt through a series of example inputs and checks if they meet requirements.
# Learn more: https://promptfoo.dev/docs/configuration/guide

prompts:
  - file://llm_prompt_lyrics_correction_*.txt
providers: [openai:gpt-3.5-turbo-0613, openai:gpt-4-1106-preview]
tests:
  - description: First test case - automatic review
    vars:
      var1: first variable's value
      var2: another value
      var3: some other value
    # For more information on assertions, see https://promptfoo.dev/docs/configuration/expected-outputs
    assert:
      - type: equals
        value: expected LLM output goes here
      - type: contains
        value: some text
      - type: javascript
        value: 1 / (output.length + 1) # prefer shorter outputs

  - description: Second test case - manual review
    # Test cases don't need assertions if you prefer to manually review the output
    vars:
      var1: new value
      var2: another value
      var3: third value

  - description: Third test case - other types of automatic review
    vars:
      var1: yet another value
      var2: and another
      var3: dear llm, please output your response in json format
    assert:
      - type: contains-json
      - type: similar
        value: ensures that output is semantically similar to this text
      - type: model-graded-closedqa
        value: ensure that output contains a reference to X
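
The built-in assertion types above are generic; for these prompts, the property worth asserting is the JSON schema the correction prompt demands. A sketch of a standalone check that could back a custom assertion (the function and its wiring into promptfoo are hypothetical):

```
import json

def is_valid_corrected_segment(output: str) -> bool:
    """Return True if an LLM response matches the corrected-segment schema."""
    try:
        segment = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not {"id", "text", "words"} <= segment.keys():
        return False
    # Every word needs text, timestamps, and a confidence score in [0, 1].
    return all(
        {"text", "start", "end", "confidence"} <= w.keys()
        and 0 <= w["confidence"] <= 1
        for w in segment["words"]
    )
```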
119 changes: 62 additions & 57 deletions lyrics_transcriber/transcriber.py
@@ -30,6 +30,8 @@ def __init__(
        log_formatter=None,
        transcription_model="medium",
        llm_model="gpt-4-1106-preview",
        llm_prompt_matching="lyrics_transcriber/llm_prompts/llm_prompt_lyrics_matching_andrew_handwritten_20231118.txt",
        llm_prompt_correction="lyrics_transcriber/llm_prompts/llm_prompt_lyrics_correction_andrew_handwritten_20231118.txt",
        render_video=False,
        video_resolution="360p",
        video_background_image=None,
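
The two new keyword arguments make the prompt files swappable from calling code, which pairs with the promptfoo config above for iterating on prompt wording. A hedged usage sketch; the class name and positional audio argument are assumed, and the alternative prompt filename is hypothetical:

```
from lyrics_transcriber.transcriber import LyricsTranscriber  # class name assumed

# Hypothetical: point the corrector at an experimental prompt variant
# without touching the library code.
transcriber = LyricsTranscriber(
    "song.flac",  # audio path argument assumed
    llm_prompt_correction="lyrics_transcriber/llm_prompts/my_experimental_prompt.txt",
)
```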
@@ -62,24 +64,29 @@ def __init__(

        self.transcription_model = transcription_model
        self.llm_model = llm_model
        self.llm_prompt_matching = llm_prompt_matching
        self.llm_prompt_correction = llm_prompt_correction
        self.openai_client = OpenAI()
        self.openai_client.log = self.log_level

        self.render_video = render_video
        self.video_resolution = video_resolution
        self.video_background_image = video_background_image
        self.video_background_color = video_background_color
        self.font_size = 100

        match video_resolution:
            case "4k":
                self.video_resolution_num = ("3840", "2160")
                self.font_size = 250
            case "1080p":
                self.video_resolution_num = ("1920", "1080")
                self.font_size = 140
            case "720p":
                self.video_resolution_num = ("1280", "720")
                self.font_size = 100
            case "360p":
                self.video_resolution_num = ("640", "360")
                self.font_size = 50
            case _:
                raise ValueError("Invalid video_resolution value. Must be one of: 4k, 1080p, 720p, 360p")
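
The font size now scales with the output resolution instead of staying fixed at 100. An equivalent table-driven form of the same mapping, shown here only as a sketch in case the mapping ever needs to be shared or unit-tested:

```
RESOLUTION_SETTINGS = {
    "4k":    (("3840", "2160"), 250),
    "1080p": (("1920", "1080"), 140),
    "720p":  (("1280", "720"), 100),
    "360p":  (("640", "360"), 50),
}

def resolution_settings(name):
    """Return ((width, height), font_size) for a resolution preset."""
    try:
        return RESOLUTION_SETTINGS[name]
    except KeyError:
        raise ValueError("Invalid video_resolution value. Must be one of: 4k, 1080p, 720p, 360p")
```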

@@ -170,7 +177,7 @@ def copy_files_to_output_dir(self):
    def validate_lyrics_match_song(self):
        at_least_one_online_lyrics_validated = False

        with open("lyrics_transcriber/llm_prompts/llm_lyrics_matching_prompt.txt", "r") as file:
        with open(self.llm_prompt_matching, "r") as file:
            llm_matching_instructions = file.read()

        for online_lyrics_source in ["genius", "spotify"]:
@@ -183,7 +190,7 @@ def validate_lyrics_match_song(self):
                f'Data input 1:\n{self.outputs["transcribed_lyrics_text"]}\nData input 2:\n{self.outputs[online_lyrics_text_key]}\n'
            )

            # self.logger.debug(f"llm_instructions:\n{llm_instructions}\ndata_input_str:\n{data_input_str}")
            # self.logger.debug(f"system_prompt:\n{system_prompt}\ndata_input_str:\n{data_input_str}")

            self.logger.debug(f"making API call to LLM model {self.llm_model} to validate {online_lyrics_source} lyrics match")
            response = self.openai_client.chat.completions.create(
@@ -245,93 +252,91 @@ def write_corrected_lyrics_data_file(self):

        corrected_lyrics_dict = {"segments": []}

        with open("lyrics_transcriber/llm_prompts/llm_lyrics_correction_prompt.txt", "r") as file:
            llm_instructions = file.read()
        with open(self.llm_prompt_correction, "r") as file:
            system_prompt_template = file.read()

        reference_data_count = 1

        if self.outputs["genius_lyrics_text"]:
            llm_instructions += f'\nReference data {reference_data_count}:\n{self.outputs["genius_lyrics_text"]}\n'
            reference_data_count += 1

        if self.outputs["spotify_lyrics_text"]:
            llm_instructions += f'\nReference data {reference_data_count}:\n{self.outputs["spotify_lyrics_text"]}\n'
            reference_data_count += 1

        # TODO: Add more to the LLM instructions (or consider post-processing cleanup) to get rid of overlapping segments
        # when there are background vocals or other overlapping lyrics
        reference_lyrics = self.outputs["genius_lyrics_text"] or self.outputs["spotify_lyrics_text"]
        system_prompt = system_prompt_template.replace("{{reference_lyrics}}", reference_lyrics)

        # TODO: Test if results are cleaner when using the vocal file from a background vocal audio separation model

        # TODO: Record more info about the correction process (e.g before/after diffs for each segment) to a file for debugging
        # TODO: Possibly add a step after segment-based correction to get the LLM to self-analyse the diff

        self.outputs["llm_transcript"] = ""
        self.outputs["llm_transcript_filepath"] = os.path.join(
            self.cache_dir, "lyrics-" + self.get_song_slug() + "-llm-correction-transcript.txt"
        )
        self.outputs["llm_transcript"] = ""

        total_segments = len(self.outputs["transcription_data_dict"]["segments"])
        self.logger.info(f"Beginning correction using LLM, total segments: {total_segments}")

        with open(self.outputs["llm_transcript_filepath"], "a", buffering=1) as llm_transcript_file:
            self.logger.debug(f"writing LLM chat instructions: {self.outputs['llm_transcript_filepath']}")
            llm_instructions_header = f"--- SYSTEM instructions passed in for all segments ---:\n\n"
            self.outputs["llm_transcript"] += llm_instructions_header + llm_instructions + "\n"
            llm_transcript_file.write(llm_instructions_header + llm_instructions + "\n")

            llm_transcript_header = f"--- SYSTEM instructions passed in for all segments ---:\n\n{system_prompt}\n"
            self.outputs["llm_transcript"] += llm_transcript_header
            llm_transcript_file.write(llm_transcript_header)

            for segment in self.outputs["transcription_data_dict"]["segments"]:
                # Don't waste dollars on GPT when testing, Andrew ;)
                # # Don't waste OpenAI dollars when testing!
                # if segment["id"] > 10:
                #     break

                simplified_segment = {
                    "id": segment["id"],
                    "start": segment["start"],
                    "end": segment["end"],
                    "confidence": segment["confidence"],
                    "text": segment["text"],
                    "words": segment["words"],
                }

                simplified_segment_str = json.dumps(simplified_segment)
                # continue
                # if segment["id"] < 20 or segment["id"] > 24:
                #     continue

                llm_transcript_segment = ""
                segment_input = json.dumps(
                    {
                        "id": segment["id"],
                        "start": segment["start"],
                        "end": segment["end"],
                        "confidence": segment["confidence"],
                        "text": segment["text"],
                        "words": segment["words"],
                    }
                )

                extra_context_prompt = ""
                previous_two_corrected_lines = ""
                upcoming_two_uncorrected_lines = ""

                if segment["id"] > 2:
                    extra_context_prompt = "Context: Previous two corrected lines:\n\n"

                for previous_segment in corrected_lyrics_dict["segments"]:
                    if previous_segment["id"] == (segment["id"] - 2):
                        extra_context_prompt += previous_segment["text"].strip() + "\n"
                        break

                for previous_segment in corrected_lyrics_dict["segments"]:
                    if previous_segment["id"] == (segment["id"] - 1):
                        extra_context_prompt += previous_segment["text"].strip() + "\n"
                        break
                    if previous_segment["id"] in (segment["id"] - 2, segment["id"] - 1):
                        previous_two_corrected_lines += previous_segment["text"].strip() + "\n"

                for next_segment in self.outputs["transcription_data_dict"]["segments"]:
                    if next_segment["id"] == (segment["id"] + 1):
                        extra_context_prompt += "Context: Next (un-corrected) transcript segment:\n\n"
                        extra_context_prompt += next_segment["text"].strip() + "\n"
                        break

                data_input_str = f"{extra_context_prompt}\nData input:\n\n{simplified_segment_str}\n"
                    if next_segment["id"] in (segment["id"] + 1, segment["id"] + 2):
                        upcoming_two_uncorrected_lines += next_segment["text"].strip() + "\n"

                llm_transcript_segment += f"--- Segment {segment['id']} / {total_segments} ---\n"
                llm_transcript_segment += f"Previous two corrected lines:\n\n{previous_two_corrected_lines}\nUpcoming two uncorrected lines:\n\n{upcoming_two_uncorrected_lines}\nData input:\n\n{segment_input}\n"

                # fmt: off
                segment_prompt = system_prompt_template.replace(
                    "{{previous_two_corrected_lines}}", previous_two_corrected_lines
                ).replace(
                    "{{upcoming_two_uncorrected_lines}}", upcoming_two_uncorrected_lines
                ).replace(
                    "{{segment_input}}", segment_input
                )

                self.logger.info(
                    f'Calling completion model {self.llm_model} with instructions and data input for segment {segment["id"]} / {total_segments}:'
                )
                # self.logger.debug(data_input_str)

                llm_transcript_segment = f"--- INPUT for segment {segment['id']} / {total_segments} ---:\n\n"
                llm_transcript_segment += data_input_str

                response = self.openai_client.chat.completions.create(
                    model=self.llm_model,
                    response_format={"type": "json_object"},
                    messages=[{"role": "system", "content": llm_instructions}, {"role": "user", "content": data_input_str}],
                    seed=10,
                    temperature=0.4,
                    messages=[
                        {
                            "role": "user",
                            "content": segment_prompt
                        }
                    ],
                )
                # fmt: on

                message = response.choices[0].message.content
                finish_reason = response.choices[0].finish_reason
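
The remainder of the loop is collapsed in this view. Whatever the hidden code does with `message`, a JSON-mode response still warrants defensive parsing; a hypothetical sketch of that step (not the actual continuation of the diff):

```
# Hypothetical handling sketch, not the code hidden behind the collapsed diff.
if finish_reason != "stop":
    self.logger.warning(f"segment {segment['id']}: response truncated ({finish_reason})")

try:
    corrected_segment = json.loads(message)
    corrected_lyrics_dict["segments"].append(corrected_segment)
except json.JSONDecodeError:
    # Fall back to the uncorrected transcription rather than dropping the segment.
    corrected_lyrics_dict["segments"].append(segment)
```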
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "lyrics-transcriber"
version = "0.12.6"
version = "0.12.7"
description = "Automatically create synchronised lyrics files in ASS and MidiCo LRC formats with word-level timestamps, using Whisper and lyrics from Genius and Spotify"
authors = ["Andrew Beveridge <[email protected]>"]
license = "MIT"
