Word-precision timecodes for audio w/o lyrics in online databases, optionally providing manual transcript #6

Open
porg opened this issue Feb 4, 2024 · 4 comments


porg commented Feb 4, 2024

Are the following use cases supported?

Goal / Desired output: Lyrics or subtitle file with word-precision timecodes

  • Application for lyrics file: Karaoke, foreign language learning, etc.
  • Application for video subtitles: Foreign language learning, search and jump to specific word in video, etc.
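To make the desired output concrete, here is a minimal sketch of one possible target format. It assumes the convention often called "enhanced LRC", where a line tag `[mm:ss.xx]` is followed by per-word tags `<mm:ss.xx>`. The function names and the sample timings are made up for illustration and are not taken from this project.

```python
def lrc_stamp(seconds):
    """Format seconds as mm:ss.xx, working in whole centiseconds to avoid rounding glitches."""
    centis = round(seconds * 100)
    minutes, centis = divmod(centis, 6000)
    return f"{minutes:02d}:{centis // 100:02d}.{centis % 100:02d}"


def enhanced_lrc_line(words):
    """Build one lyrics line from a non-empty list of (word, start_seconds) pairs."""
    line_tag = f"[{lrc_stamp(words[0][1])}]"
    body = " ".join(f"<{lrc_stamp(start)}> {word}" for word, start in words)
    return f"{line_tag} {body}"


# Hypothetical word onsets, not measured from the sample file.
print(enhanced_lrc_line([("Word", 1.0), ("by", 1.8), ("word", 2.4)]))
```

A subtitle format such as WebVTT or SRT would need a different writer, but the same list of (word, start) pairs could feed it.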

Starting point(s)

  1. Unpublished audio file for which no public lyrics or transcript exist yet.
    • a) Voice only as an isolated audio track.
    • b) Voice on top of instruments, no separation.
  2. Optional: Provide a manually created transcript file (simple plain text, line by line, no timecodes) to aid the processing with reliable cues.
    • Applications:
      • Less standardized language such as a dialect.
      • Voices with an unusual pronunciation or tone, either naturally or for artistic purposes (think singers in Heavy Metal or Jazz, stylizations like vibrato or yodeling, or voices like Björk), or due to heavy effects such as vocoder, echo, distortion, etc.
    • Added value for lyrics file creator: No need to create the timecodes by hand.

porg commented Feb 4, 2024

Sample file

This is a short audio sample file with 4 lines:

  • to test whether speech to text with word precision works reliably based on real phonetic detection and mapping
  • or rather only uses simpler heuristic tricks like character/syllable counting in the text and distributing time among the words accordingly.
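For reference, the kind of heuristic I mean can be sketched in a few lines. This is only an assumption about what such a trick might look like, not the known implementation of any app: it spreads a line's duration over its words in proportion to their character counts and never looks at the audio. The function name and the 6-second line duration are made up.

```python
def naive_word_timings(words, line_start, line_end):
    """Distribute the line duration over the words by character count (no audio analysis)."""
    # Count only letters and digits so punctuation does not add weight.
    weights = [max(1, sum(ch.isalnum() for ch in word)) for word in words]
    total = sum(weights)
    duration = line_end - line_start
    timings = []
    cursor = line_start
    for word, weight in zip(words, weights):
        span = duration * weight / total
        timings.append((word, cursor, cursor + span))
        cursor += span
    return timings


# Hypothetical line boundaries of 0 s to 6 s.
for word, start, end in naive_word_timings("Word by word I sing to you!".split(), 0.0, 6.0):
    print(f"{start:5.2f} - {end:5.2f}  {word}")
```

Such an estimate cannot know about a stretched word or a long pause, which is exactly what the sample file is meant to expose.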

Audio File

sing-rap-read.mp4
  • AAC audio codec in MP4 container
  • Changed suffix from .m4a to .mp4 to comply with GitHub's allowed file extensions.

Lyrics

Line by line lyrics

  1. Word by word I sing to you!
  2. And word by word I'm rappin' to you!
  3. And word by word I'm reading out to you, indeed!
  4. Thanks.

Remarks on each lyrics line: what it tests for

  1. Singing: Some words are intentionally stretched quite long. Some words also contain a tonal change within.
  2. Rapping: 1980s rapping style/tempo. An algorithm which performs pure text analysis will seem quite reliable here.
  3. Speech: Containing extra long pauses between some words plus some word stretching. This detects any trickery quite brutally.
  4. Final: Extra long pause before the line starts. And then only a single word. Any real phonetic-correlated timecoding should also get this correctly.
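To judge the results less subjectively, the word onsets could be compared against hand-made reference timings. The following is a minimal sketch under that assumption; the function names, the 0.2 s tolerance, and all numbers are made up for illustration and are not measurements of the sample file.

```python
def onset_errors(reference, predicted):
    """Return (word, absolute onset error in seconds) for two lists of (word, start_seconds)."""
    if len(reference) != len(predicted):
        raise ValueError("word counts differ; align the transcripts first")
    return [(ref_word, abs(ref_start - pred_start))
            for (ref_word, ref_start), (_, pred_start) in zip(reference, predicted)]


def fraction_within(errors, tolerance=0.2):
    """Share of words whose onset error is at most `tolerance` seconds."""
    if not errors:
        return 0.0
    return sum(1 for _, err in errors if err <= tolerance) / len(errors)


# Invented numbers: the last word comes after a long pause that the tool missed.
reference = [("to", 14.0), ("you,", 14.6), ("indeed!", 17.0), ("Thanks.", 21.5)]
predicted = [("to", 14.1), ("you,", 14.7), ("indeed!", 15.2), ("Thanks.", 16.0)]
errors = onset_errors(reference, predicted)
print(f"{fraction_within(errors):.0%} of words within tolerance")
```

A tool that only estimates from the text should score well on line 2 and poorly on lines 3 and 4.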

Test results of various AI lyrics detection apps

  • Published in followup messages.
  • So far I found that all of them did no real phonetic timecode mapping but just some correlation/estimation tricks.


porg commented Feb 4, 2024

Croonify

  • Fails almost everywhere IMHO:
sing-rap-read--croonfiy.mp4


porg commented Feb 4, 2024

Noraebang by Gaudio Lab

Overall verdict: Quite good in some places, but it still fails badly at pauses and stretched words. Possibly only its estimation trickery is better. I doubt that real, full phonetic mapping takes place, as the failure at word pauses indicates.

sing-rap-read--noraebang-by-gaudio-lab.mp4
  1. The stretched words seem well in sync. A simple text/syllable estimation mapping would not get that right so well.
  2. Really exact word starts. No surprise though, as rap is inherently quite rhythmic. But there is still some intermediate emphasis / de-emphasis, and especially the word starts still seem spot on.
  3. Within the word stretching of "And word by word" it still seems quite in sync, but then when some unusual pausing occurs, it totally loses sync.
  4. Already totally lost. It is already showing line 4 "Thanks" while I still have not uttered line 3's last word "indeed".


porg commented Feb 4, 2024

Your software: Karaokenerds Lyrics Transcriber

  • Got audio only, no transcript.
  • ✅ Produces amazingly good word by word synchronization!
  • ℹ️ Only some minor flaws in transcription and timing. See below, in the line by line evaluation.
sing-rap-read--karaokenerds-lyrics-transcriber.mp4
  1. Singing: Perfect sync despite word stretching.
    • Only flaw: The Present Simple expression "I sing to you" is transcribed as Present Continuous "I'm singing to you". Is some grammar correction applied in a pre-processing or post-processing step?
    • Idea: Provide a fine-tuning flag to choose whether to take the input as literally as possible or to apply some level of plausibility checks / automatic grammar fixing.
  2. Rap: Perfect sync!
  3. Reading: Really good. Gets the pauses correctly.
  4. Final single word: Perfect sync again.
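Transcription differences like the one in line 1 can be spotted automatically by diffing the expected lyrics against the transcribed text word by word. This is a minimal sketch using Python's standard `difflib`; the function name `word_diff` is made up, and this is not how the transcriber itself works.

```python
import difflib


def word_diff(expected, transcribed):
    """Return the non-matching word spans between two lines, ignoring letter case."""
    a = expected.lower().split()
    b = transcribed.lower().split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


for tag, expected_words, transcribed_words in word_diff(
        "Word by word I sing to you!",
        "Word by word I'm singing to you!"):
    print(tag, expected_words, "->", transcribed_words)
```

When a manual transcript is provided, a check like this could flag where the recognized text deviates from it.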

@porg porg mentioned this issue Feb 4, 2024