Successfully got correction using openai chat completion model workin…

…g, with each lyric segment making a separate API call. Output JSON is combined and written to cache file, as are plain text lyrics for comparison
nomadkaraoke · Nov 17, 2023 · 2838dde · 2838dde
1 parent 34b7bb1
commit 2838dde
Show file tree

Hide file tree

Showing 7 changed files with 384 additions and 59 deletions.
diff --git a/.gitignore b/.gitignore
@@ -161,3 +161,5 @@ cython_debug/
 
 # Project specific
 /output/
+/input/
+
diff --git a/example-llm-chatcompletion-response.py b/example-llm-chatcompletion-response.py
@@ -0,0 +1,210 @@
+ChatCompletion(
+    id='chatcmpl-8LzBzNRHRTo8eKK3OWu1CyGM8R1ag', 
+    choices=[
+        Choice(finish_reason='stop', index=0, message=ChatCompletionMessage(content='
+        {
+            "segments": [
+                {
+                    "id": 4,
+                    "text": "I'm nobody's fool and yet it's clear to me",
+                    "words": [
+                        {
+                            "text": "I'm",
+                            "start": 32.58,
+                            "end": 32.7,
+                            "confidence": 0.854
+                        },
+                        {
+                            "text": "nobody's",
+                            "start": 32.7,
+                            "end": 33.4,
+                            "confidence": 0.992
+                        },
+                        {
+                            "text": "fool",
+                            "start": 33.4,
+                            "end": 33.66,
+                            "confidence": 0.997
+                        },
+                        {
+                            "text": "and",
+                            "start": 33.66,
+                            "end": 33.88,
+                            "confidence": 0.445
+                        },
+                        {
+                            "text": "yet",
+                            "start": 33.88,
+                            "end": 34.18,
+                            "confidence": 0.952
+                        },
+                        {
+                            "text": "it's",
+                            "start": 34.18,
+                            "end": 34.48,
+                            "confidence": 0.956
+                        },
+                        {
+                            "text": "clear",
+                            "start": 34.48,
+                            "end": 34.86,
+                            "confidence": 0.9
+                        },
+                        {
+                            "text": "to",
+                            "start": 34.86,
+                            "end": 35.16,
+                            "confidence": 0.843
+                        },
+                        {
+                            "text": "me",
+                            "start": 35.16,
+                            "end": 35.9,
+                            "confidence": 0.992
+                        }
+                    ]
+                },
+                {
+                    "id": 5,
+                    "text": "I don't have a strategy",
+                    "words": [
+                        {
+                            "text": "I",
+                            "start": 36.46,
+                            "end": 36.7,
+                            "confidence": 0.994
+                        },
+                        {
+                            "text": "don't",
+                            "start": 36.7,
+                            "end": 37.06,
+                            "confidence": 0.999
+                        },
+                        {
+                            "text": "have",
+                            "start": 37.06,
+                            "end": 37.3,
+                            "confidence": 0.999
+                        },
+                        {
+                            "text": "a",
+                            "start": 37.3,
+                            "end": 37.48,
+                            "confidence": 0.986
+                        },
+                        {
+                            "text": "strategy",
+                            "start": 37.48,
+                            "end": 38.52,
+                            "confidence": 0.999
+                        }
+                    ]
+                },
+                {
+                    "id": 6,
+                    "text": "It's just like taking candy from a baby",
+                    "words": [
+                        {
+                            "text": "It's",
+                            "start": 39.3,
+                            "end": 39.58,
+                            "confidence": 0.986
+                        },
+                        {
+                            "text": "just",
+                            "start": 39.58,
+                            "end": 39.8,
+                            "confidence": 0.992
+                        },
+                        {
+                            "text": "like",
+                            "start": 39.8,
+                            "end": 40.06,
+                            "confidence": 0.999
+                        },
+                        {
+                            "text": "taking",
+                            "start": 40.06,
+                            "end": 40.46,
+                            "confidence": 0.986
+                        },
+                        {
+                            "text": "candy",
+                            "start": 40.46,
+                            "end": 41.0,
+                            "confidence": 0.997
+                        },
+                        {
+                            "text": "from",
+                            "start": 41.0,
+                            "end": 41.38,
+                            "confidence": 0.996
+                        },
+                        {
+                            "text": "a",
+                            "start": 41.38,
+                            "end": 41.6,
+                            "confidence": 0.839
+                        },
+                        {
+                            "text": "baby",
+                            "start": 41.6,
+                            "end": 42.2,
+                            "confidence": 0.998
+                        }
+                    ]
+                },
+                {
+                    "id": 7,
+                    "text": "And I think I must be",
+                    "words": [
+                        {
+                            "text": "And",
+                            "start": 42.86,
+                            "end": 43.18,
+                            "confidence": 0.958
+                        },
+                        {
+                            "text": "I",
+                            "start": 43.18,
+                            "end": 43.4,
+                            "confidence": 0.982
+                        },
+                        {
+                            "text": "think",
+                            "start": 43.4,
+                            "end": 43.88,
+                            "confidence": 0.998
+                        },
+                        {
+                            "text": "I",
+                            "start": 43.88,
+                            "end": 44.2,
+                            "confidence": 0.984
+                        },
+                        {
+                            "text": "must",
+                            "start": 44.2,
+                            "end": 44.44,
+                            "confidence": 0.964
+                        },
+                        {
+                            "text": "be",
+                            "start": 44.44,
+                            "end": 44.6,
+                            "confidence": 0.993
+                        }
+                    ]
+                }
+            ]
+        }', 
+        role='assistant', 
+        function_call=None, 
+        tool_calls=None))
+    ], 
+    created=1700250803, 
+    model='gpt-4-1106-preview', 
+    object='chat.completion', 
+    system_fingerprint='fp_a24b4d720c', 
+    usage=CompletionUsage(completion_tokens=1210, prompt_tokens=2329, total_tokens=3539)
+)
diff --git a/lyrics_transcriber/llm_correction_instructions.txt b/lyrics_transcriber/llm_correction_instructions.txt
@@ -27,4 +27,3 @@ The response JSON object needs to contain all of the following fields:
     - start: The start timestamp for this word, estimated if not known for sure.
     - end: The end timestamp for this word, estimated if not known for sure.
     - confidence: Your self-assessed confidence score (from 0 to 1) of how likely it is that this word is accurate. If the word has not changed from data input 1, keep the existing confidence value.
-
diff --git a/lyrics_transcriber/llm_correction_instructions_2.txt b/lyrics_transcriber/llm_correction_instructions_2.txt
@@ -0,0 +1,29 @@
+You are a song lyric corrector for a karaoke video studio, responsible for reading lyrics inputs, correcting them and generating JSON-based responses containing the corrected lyrics according to predefined criteria. 
+Your task is to take two lyrics data inputs with two different qualities, and use the data in one to make a best effort attempt to correct the other, producing reasonably accurate lyrics.
+
+Your response needs to be in JSON format and will be sent to an API endpoint. Only output the JSON, nothing else, as the response will be converted to a Python dictionary.
+
+You will be provided with reference data containing published lyrics for a song, as plain text. 
+These should be reasonably accurate, with generally correct words and phrases. 
+However, they may not be perfect, and sometimes whole sections (such as a chorus or outro) may be missing or assumed to be repeated.
+
+Data input will contain one segment of an automated machine transcription of lyrics from a song, with start/end timestamps and confidence scores for every word in that segment.
+The timestamps for words are usually quite accurate, but the actual words which were heard by the transcription are typically only around 70% to 90% accurate.
+As such, it is common for there to be segments where most of the words are correct but one or two are wrong, or a single word may have been mistaken as two different words.
+
+Carefully analyse the segment in the data input, and compare with the lyrics in the reference data, attempting to find part of the lyrics which is most likely to correspond with this segment.
+If all of the words match up correctly with part of the published lyrics, great! You can add that whole segment to your response.
+If some of the words match up but there are a couple of differences, correct those differences.
+If you need to delete a word or two in order to correct the lyrics, that's acceptable.
+If you need to add a word or two which were missing from the transcription, that's acceptable - you'll need to estimate the start and end timestamps based on the timestamps of the surrounding words.
+
+The response JSON object needs to contain all of the following fields:
+
+- id: The id of the segment, from the data input
+- text: The full text of the corrected lyrics for this segment
+- words: this is a list
+  - text: The correct word
+  - start: The start timestamp for this word, estimated if not known for sure.
+  - end: The end timestamp for this word, estimated if not known for sure.
+  - confidence: Your self-assessed confidence score (from 0 to 1) of how likely it is that this word is accurate. If the word has not changed from the data input, keep the existing confidence value.
+
diff --git a/lyrics_transcriber/llm_correction_instructions_3.txt b/lyrics_transcriber/llm_correction_instructions_3.txt
@@ -0,0 +1,19 @@
+As a song lyric corrector for a karaoke video studio, your job involves processing lyrics inputs, making corrections, and generating JSON responses. 
+You work with two data sets: a reference data set of published lyrics and a machine-transcribed segment of a song. 
+Your primary task is to compare these datasets and correct the transcribed lyrics to match the reference data as closely as possible.
+
+Your response should be formatted in JSON, to be sent to an API endpoint. The JSON output will include:
+
+id: The identifier of the segment from the first data input.
+text: The corrected lyric text for the segment.
+words: A list containing each word in the segment, with fields for:
+ - text: The correct word.
+ - start: The start timestamp for the word, estimated if necessary.
+ - end: The end timestamp for the word, estimated if necessary.
+ - confidence: A score (0 to 1) indicating the confidence in the accuracy of the word. Retain existing confidence values for unchanged words.
+
+The reference data is generally accurate but may have imperfections or missing sections. 
+The transcribed data includes timestamps and confidence scores for each word, but the accuracy of the words is only about 70-90%. 
+Your role involves meticulously analyzing the transcribed segment, comparing it with the reference lyrics, and making necessary corrections. 
+This process may involve adding, deleting, or modifying words to ensure the final output is as accurate as possible.
+
Original file line number	Diff line number	Diff line change
Expand Up		@@ -161,3 +161,5 @@ cython_debug/

		# Project specific
		/output/
		/input/
Original file line number	Diff line number	Diff line change
Expand Up		@@ -27,4 +27,3 @@ The response JSON object needs to contain all of the following fields:
		- start: The start timestamp for this word, estimated if not known for sure.
		- end: The end timestamp for this word, estimated if not known for sure.
		- confidence: Your self-assessed confidence score (from 0 to 1) of how likely it is that this word is accurate. If the word has not changed from data input 1, keep the existing confidence value.