Proposal: Add `fidelity="word"` attribute to the `<podcast:transcript>` tag #600

ryan-lp · 2024-01-31T04:37:47Z

ryan-lp
Jan 31, 2024

I have made several earlier proposals for how word timestamps could be implemented in JSON, SRT and VTT:

Proposal: Reserve the JSON transcript format for word timestamps #484
Proposal: Music lyrics file format #519
Proposal: Music lyrics format and general word timestamp format (WebVTT version) #599
(HTML could also support word timestamps in theory, although I will focus on the formats that are more widely used.)

The first proposal actually dates back to 2021, and it's probably unlikely that this proposal will be adopted at such a late stage now, since adopting it would make the vast amount of transcript data that has been published in the intervening time a violation of the spec.

However, since all of these different transcript formats have been demonstrated as being capable of encoding word timestamps, this suggests to me that we could actually solve this issue of declaring whether something is word level by bringing it outside of the format and putting it into the <podcast:transcript> tag. By specifying the attribute fidelity="word", the feed makes a clear declaration that this transcript can be relied on for word timestamps. NOT that it has mostly word timestamps, but that it has word timestamps ENTIRELY throughout.

Since there has been so little attention given to declaring word timestamps, I can't preempt what the objections might be. One objection for why we don't need this might be that an app could easily determine whether a JSON transcript has word timestamps by simply examining whether any segment within the transcript contains spaces. E.g. this transcript clearly doesn't have word timestamps, as evidenced by the spaces:

{
  "version": "1.0.0",
  "segments": [
    {
      "speaker": "tim-pritlove",
      "startTime": 0.0050000000000000001,
      "endTime": 1.97,
      "body": "Guten Morgen Linus."
    },
    {
      "speaker": "linus-neumann",
      "startTime": 0.0060000000000000001,
      "endTime": 3.7669999999999999,
      "body": "Guten Morgen Lino. Guten Morgen Tim."
    },
    {
      "speaker": "tim-pritlove",
      "startTime": 4.452,
      "endTime": 7.8710000000000004,
      "body": "Was du dann eigentlich auch noch auf der äh Toilettenparty."
    },

However, not all languages work like this. You will not find any spaces in the following transcript yet it is not word level:

{
  "version": "1.0.0",
  "segments": [
    {
      "speaker": "Speaker 1",
      "startTime": 6.792,
      "endTime": 7.974,
      "body": "みなさんおはようございます"
    },
    {
      "speaker": "Speaker 1",
      "startTime": 7.974,
      "endTime": 7.974,
      "body": " "
    },
    {
      "speaker": "Speaker 1",
      "startTime": 7.994,
      "endTime": 8.515,
      "body": "こんにちは"
    },
    {
      "speaker": "Speaker 1",
      "startTime": 8.515,
      "endTime": 8.515,
      "body": " "
    },
    {
      "speaker": "Speaker 1",
      "startTime": 8.66,
      "endTime": 10.165,
      "body": "ヨガと名産イソラクターの"
    },

The memory/CPU/storage resources required to accurately identify word boundaries in such arbitrary text for a language that doesn't use spaces is going to be prohibitive in most apps.

As a further example, here is a transcript of a podcast from Australian tourism targeting Japanese tourists. It uses word-level timestamps whenever the English speaker is speaking, but not when the Japanese speaker is speaking:

{
  "version": "1.0.0",
  "segments": [
    {
      "speaker": "Music",
      "startTime": 0.27,
      "endTime": 0.84,
      "body": "音楽"
    },
    {
      "speaker": "Shozo",
      "startTime": 23.91,
      "endTime": 24.24,
      "body": "みなさん、こんにちは～。"
    },
    {
      "speaker": "Shozo",
      "startTime": 23.91,
      "endTime": 24.24,
      "body": "初めまして。クイーンズランド州政府観光局のしばしょうこと、柴田です。太陽に愛されるオーストラリア・クイーンズランド州より旬の観光情報とオージーイングリッシュのワンポイントレッスンを隔週でお届けします。"
    },
    {
      "speaker": "Shozo",
      "startTime": 23.91,
      "endTime": 24.24,
      "body": "いつもはテレビやラジオ、YouTubeを聞いている時間のうち、15分間だけ、この番組にお付き合いいただければと思います。"
    },
    {
      "speaker": "Paul",
      "startTime": 23.91,
      "endTime": 24.24,
      "body": "Welcome"
    },
    {
      "speaker": "Paul",
      "startTime": 23.91,
      "endTime": 24.24,
      "body": "to"
    },
    {
      "speaker": "Paul",
      "startTime": 23.91,
      "endTime": 24.24,
      "body": "Sunset"
    },
    {
      "speaker": "Paul",
      "startTime": 23.91,
      "endTime": 24.24,
      "body": "Q.L.D."
    },

(Related proposals for multilingual transcripts: #483 and #370)

This proposal leaves open the possibility for future additions to the set of possible fidelity levels. E.g. phrase, sentence, paragraph/chapter.

I do also want to raise some potential concerns/issues with word timestamps:

We don't actually have a clear definition of what a word is. Do punctuation symbols collapse to the word or not? There is no standardisation on this. For example, It's called "Fun Animations" could be segmented either as ", Fun, Animations, " or "Fun, Animations".
In multilingual transcripts that mix languages with spaces and languages without spaces, it is not going to be clear how to correctly reinsert the spaces between words only when required, and not when not required. Whisper outputs transcripts in an interesting way based on tiktoken (similar to SentencePiece) which embeds the space within the word, before each word. This allows easy concatenation of all words, even if multilingual, to reproduce the original transcript without any tricky inference logic to guess where to put spaces. So in the previous example, it might be represented as ", Fun, Animations, " or with collapsing punctuation, "Fun, Animations". Or using the Japanese example, we might have いつも, は (note the absence of spaces allows for correct concatenation). Alternatively, the JSON format conceivably allows for the addition of language tags for each and every word which would support an inference approach.

If I were designing things from scratch, I would prefer to not collapse punctuation by default, and I would take the SentencePiece style approach of preserving the spaces in word timestamps to avoid any ambiguity about how to actually reconstruct the original transcript. But these practices are hard to change now that people have already adopted their own practices. Some people out there are collapsing punctuation, some are not. Some people (yes) are preserving spaces, some are not.

In any case, this will probably be my last proposal here. Sadly, there has been very little interest in, or support for, any of these transcript-related proposals. I am guessing that if I wait another 2 or 3 years, a few comments might trickle in, but nothing will change. It seems the only reason there has been any slight recognition in the past few days that there is work to be done with transcripts was when the transcript tag became the first tag that Apple publicly adopted. I have my own mission to spread accurate transcripts to all corners of the world, and especially in languages that are not currently supported adequately by the spec. I have decided to invest my time into other more productive ways to make this happen, other than trying to get the spec authors to take notice of these issues. For example, developing software to parse the existing transcripts published under the current standard, and reformat them into something that is actually useful to the apps, and ultimately to the listeners who depend on them (I'm not saying that is ideal, and this sort of processing would depend on heavy server-side resources if someone were to deploy it, but that seems to be where my focus would be better spent). Hopefully someone else out there can take over this mission of trying to get the spec changed, but that will no longer be me. Best wishes.

Dwev · 2024-02-08T21:59:14Z

Dwev
Feb 8, 2024

@ryan-lp Change takes time, and people need a chance to digest proposals, so even if there has been a lack of activity, it may not mean a lack of interest. As you highlighted, with Apple now supporting transcripts, that has created more interest in transcript-related ideas and proposals, so if you have an interest in this, it might be to your benefit to stick around. You have app developers who are active here, so it can be a useful channel with them. That said, there is a certain tone to some of your posts that might make people reluctant to engage. You have valuable things to say, but that can sometimes be obscured by how they are said. Unsolicited feedback, so take it as you like.

In response to your actual proposal here, though, I like the idea of marking the fidelity, so players know what they can do with the transcript. While I understand word level when editing (using tools like Descript to edit audio), the only purpose I've seen for word level for the consumer is marketing videos in the "shorts/TikTok" style, where each word is highlighted. I could see word level on music lyrics being more useful. Apart from these use cases, I'm curious why word level is desirable for the consumer?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Add `fidelity="word"` attribute to the `<podcast:transcript>` tag #600

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Proposal: Add fidelity="word" attribute to the <podcast:transcript> tag #600

ryan-lp Jan 31, 2024

Replies: 1 comment

Dwev Feb 8, 2024

Proposal: Add `fidelity="word"` attribute to the `<podcast:transcript>` tag #600

ryan-lp
Jan 31, 2024

Dwev
Feb 8, 2024