Word-precision timecodes for audio w/o lyrics in online databases, optionally providing manual transcript #6

Open

porg opened this issue Feb 4, 2024 · 4 comments

porg commented Feb 4, 2024

Are the following use cases supported?

Goal / Desired output: Lyrics or subtitle file with word-precision timecodes

Application for lyrics file: Karaoke, foreign language learning, etc.
Application for video subtitles: Foreign language learning, search and jump to specific word in video, etc.
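To make the desired output concrete, here is a minimal sketch of how word-level start times could be written out as one line of an "enhanced LRC" style lyrics file, where each word is preceded by its own `<mm:ss.xx>` tag. The helper names `fmt` and `to_enhanced_lrc` and the input shape are assumptions for illustration, not something this project is known to provide.

```python
def fmt(t):
    """Format seconds as mm:ss.xx, rounding to centiseconds first."""
    cs = round(t * 100)
    minutes, cs = divmod(cs, 6000)
    seconds, cs = divmod(cs, 100)
    return f"{minutes:02d}:{seconds:02d}.{cs:02d}"


def to_enhanced_lrc(line_words):
    """Render one lyrics line from a list of (word, start_seconds) pairs.

    Hypothetical helper: the line tag reuses the first word's start time,
    and every word gets its own <mm:ss.xx> tag.
    """
    if not line_words:
        return ""
    head = f"[{fmt(line_words[0][1])}]"
    body = " ".join(f"<{fmt(start)}> {word}" for word, start in line_words)
    return f"{head} {body}"
```

A subtitle format with word-level cues would need a different writer, but the input (words plus start times) would be the same.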

Starting point(s)

Unpublished audio file for which no public lyrics/transcripts exist yet.
- a) Voice only as an isolated audio track.
- b) Voice on top of instruments, no separation.
Optional: Provide a manually created transcript file (simple plain text, line by line, no timecodes) to aid the processing with reliable cues.
- Applications:
  - Less standardized language such as a dialect.
  - Or voices with an unusual pronunciation or tone, either naturally from the singer for artistic purposes (think singers in heavy metal or jazz, stylizations like vibrato or yodeling, or voices like Björk) or due to heavy effects such as vocoder, echo, distortion, etc.
- Added value for lyrics file creator: No need to create the timecodes by hand.

Author

porg commented Feb 4, 2024

Sample file

This is a short audio sample file with 4 lines. It tests whether word-precision speech-to-text works reliably, based on real phonetic detection and mapping, or whether it only uses simpler heuristic tricks, such as counting characters/syllables in the text and distributing the time among the words accordingly.
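For reference, the kind of text-only heuristic meant here could look like the following minimal sketch. It is not taken from any of the tested apps. The function name `naive_word_timings` and the use of character count as the weight are assumptions for illustration.

```python
def naive_word_timings(line, start, end):
    """Distribute the interval [start, end] (seconds) across the words of
    `line` in proportion to their character count.

    Hypothetical illustration of a text-only heuristic: the audio itself
    is never inspected, only the text and the line's total duration.
    """
    words = line.split()
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []
    timings = []
    cursor = start
    for w in words:
        duration = (end - start) * len(w) / total_chars
        timings.append((w, cursor, cursor + duration))
        cursor += duration
    return timings
```

Because the result depends only on the text, a stretched word or a long pause in the middle of a line shifts every later word. That is what the sung and spoken lines of the sample are meant to expose.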

Audio File

sing-rap-read.mp4

AAC audio codec in MP4 container
Changed suffix from .m4a to .mp4 to comply with GitHub's allowed file extensions.

Lyrics

Line by line lyrics

Word by word I sing to you!
And word by word I'm rappin' to you!
And word by word I'm reading out to you, indeed!
Thanks.

Remarks on each lyrics line — What it tests for

Singing: Some words are intentionally stretched quite long. Some words also contain a tonal change within.
Rapping: 1980s rapping style/tempo. An algorithm that performs pure text analysis will seem quite reliable here.
Speech: Containing extra long pauses between some words plus some word stretching. This detects any trickery quite brutally.
Final: Extra long pause before the line starts. And then only a single word. Any real phonetic-correlated timecoding should also get this correctly.

Test results of various AI lyrics detection apps

Published in followup messages.
So far I found that all of them did no real phonetic timecode mapping but just some correlation/estimation tricks.

Author

porg commented Feb 4, 2024 · edited

Croonify

Fails almost everywhere IMHO:

sing-rap-read--croonfiy.mp4

Author

porg commented Feb 4, 2024

Noraebang by Gaudio Lab

Overall verdict: Quite good at some positions, but it still fails miserably at pauses and stretched words. Possibly only its trickery/estimation is better. I doubt that real full phonetic mapping takes place, as the failure at word pauses indicates.

sing-rap-read--noraebang-by-gaudio-lab.mp4

Singing: The stretched words seem well in sync. A simple text/syllable estimation mapping would not get that too well.
Rap: Really exact word starts. No surprise, as rap is inherently quite rhythmic. But there is still some intermediate emphasis / de-emphasis, and especially the word starts still seem spot on.
Reading: Within the word stretching of "And word by word" it still seems quite in sync, but when some unusual pausing occurs, it totally loses sync.
Final: Already totally lost. It is already showing line 4 "Thanks" while I still have not uttered line 3's last word "indeed".

Author

porg commented Feb 4, 2024

Your software: Karaokenerds Lyrics Transcriber

Got audio only, no transcript.
✅ Produces amazingly good word by word synchronization!
ℹ️ Only some minor flaws in transcription and timing. See below, in the line by line evaluation.

sing-rap-read--karaokenerds-lyrics-transcriber.mp4

Singing: Perfect sync despite word stretching.
- Only flaw: The Present Simple expression "I sing to you" is transcribed as Present Continuous "I'm singing to you". Is there some grammar correction applied in some pre-processing or post-processing loop?
- Idea: Provide some fine tuning flag whether to take the input as literal as possible or whether to apply some level of plausibility checks / automatic grammar fixing.
Rap: Perfect sync!
Reading: Really good. Gets the pauses correctly.
- Little flaws: "to" and "you" start a bit prematurely.
- "Indeed" after the long pause is made into a new line. Legit.
  - In the spirit of this proposal (#6), would it be possible for the app to stick to the line wrapping as intentionally provided in the supplied unsynchronized lyrics file? E.g. the word "indeed!", being the sentence end after a pause, would stay on the same line.
Final single word: Perfect sync again.
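The line-wrapping request above can be sketched as a post-processing step: align the recognized words with the words of the supplied transcript, then group the timings by the transcript's own line breaks. This is only a sketch under assumptions. It uses Python's standard `difflib` for the alignment, and the function name `rewrap` and the `(word, start, end)` input shape are hypothetical, not this project's actual data model.

```python
import difflib
import re


def _norm(word):
    # Compare words case-insensitively, ignoring punctuation except apostrophes.
    return re.sub(r"[^\w']", "", word).lower()


def rewrap(timed_words, transcript_lines):
    """Regroup recognized (word, start, end) tuples by the transcript's lines.

    Returns one list per transcript line containing (word, start, end),
    using the transcript's spelling. Transcript words with no recognized
    counterpart get (word, None, None).
    """
    # Flatten the transcript, remembering which line each word came from.
    ref = [(i, w) for i, line in enumerate(transcript_lines) for w in line.split()]
    matcher = difflib.SequenceMatcher(
        a=[_norm(w) for _, w in ref],
        b=[_norm(w) for w, _, _ in timed_words],
        autojunk=False,
    )
    # Copy timings over for every pair of words the alignment matched.
    timing = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = timed_words[block.b + k]
            timing[block.a + k] = (start, end)
    lines = [[] for _ in transcript_lines]
    for idx, (line_no, word) in enumerate(ref):
        start, end = timing.get(idx, (None, None))
        lines[line_no].append((word, start, end))
    return lines
```

With this grouping, "indeed!" would stay on line 3 regardless of how long the pause before it is, and a mismatch such as "I'm singing" versus "I sing" would leave only the differing words untimed instead of shifting the rest of the line.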
