Proposal: Add fidelity="word" attribute to the <podcast:transcript> tag (#600)
Replies: 1 comment
-
@ryan-lp Change takes time, and people need a chance to digest proposals, so even if there has been a lack of activity, it may not mean a lack of interest. As you highlighted, Apple now supporting transcripts has created more interest in transcript-related ideas and proposals, so if you have an interest in this, it might be to your benefit to stick around. App developers are active here, so this can be a useful channel to them.

That said, there is a certain tone to some of your posts that might make people reluctant to engage. You have valuable things to say, but that can sometimes be obscured by how they are said. Unsolicited feedback, so take it as you like.

In response to your actual proposal, though: I like the idea of marking the fidelity, so players know what they can do with the transcript. While I understand word level when editing (using tools like Descript to edit audio), the only purpose I've seen for word-level timestamps for the consumer is marketing videos in the "shorts/TikTok" style, where each word is highlighted. I could see word level on music lyrics being more useful. Apart from these use cases, I'm curious why word level is desirable for the consumer?
-
I have made several earlier proposals for how word timestamps could be implemented in JSON, SRT, and VTT.

The first proposal actually dates back to 2021, and it is unlikely to be adopted at such a late stage now, since adopting it would make the vast amount of transcript data published in the intervening time a violation of the spec.
However, since all of these different transcript formats have been demonstrated as capable of encoding word timestamps, this suggests to me that we could solve this issue of declaring whether something is word level by bringing it outside of the format and putting it into the <podcast:transcript> tag. By specifying the attribute fidelity="word", the feed makes a clear declaration that this transcript can be relied on for word timestamps: NOT that it has mostly word timestamps, but that it has word timestamps ENTIRELY throughout.

Since there has been so little attention given to declaring word timestamps, I can't preempt what the objections might be. One objection for why we don't need this might be that an app could easily determine whether a JSON transcript has word timestamps by simply examining whether any segment within the transcript contains spaces. E.g. this transcript clearly doesn't have word timestamps, as evidenced by the spaces:
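The original example is not preserved in this copy of the thread, but a phrase-level transcript in the podcast namespace JSON transcript format (a hypothetical sketch using its segments/startTime/endTime/body shape) would look roughly like this; the spaces inside each body make it obvious the segments are phrases rather than words:

```json
{
  "version": "1.0.0",
  "segments": [
    { "startTime": 0.0, "endTime": 3.2, "body": "Welcome back to the show." },
    { "startTime": 3.2, "endTime": 6.8, "body": "Today we have a special guest." }
  ]
}
```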
However, not all languages work like this. You will not find any spaces in the following transcript, yet it is not word level:
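That example is also missing from this copy; a hypothetical Japanese equivalent makes the point. Each segment below is a full multi-word sentence, yet contains no spaces at all, so the space heuristic would wrongly classify it as word level:

```json
{
  "version": "1.0.0",
  "segments": [
    { "startTime": 0.0, "endTime": 2.5, "body": "いつもは朝早く起きます。" },
    { "startTime": 2.5, "endTime": 5.0, "body": "今日はゆっくり寝ました。" }
  ]
}
```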
The memory/CPU/storage resources required to accurately identify word boundaries in such arbitrary text, for a language that doesn't use spaces, are going to be prohibitive in most apps.
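The naive heuristic, and how it misfires, can be sketched in a few lines of Python. The `looks_word_level` helper and the sample data are hypothetical, assuming only the segments/body shape of the podcast namespace JSON transcript format:

```python
def looks_word_level(transcript: dict) -> bool:
    """Naive guess: if no segment body contains a space, assume word level."""
    return all(" " not in seg["body"].strip() for seg in transcript["segments"])

# Phrase-level English: internal spaces correctly reject it as word level.
english_phrases = {"segments": [{"body": "Welcome back to the show."}]}

# Phrase-level Japanese: no spaces anywhere, so the heuristic wrongly accepts it.
japanese_phrases = {"segments": [{"body": "いつもは朝早く起きます。"}]}

assert looks_word_level(english_phrases) == False
assert looks_word_level(japanese_phrases) == True  # false positive
```

An explicit fidelity declaration in the feed would make this kind of guessing unnecessary.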
As a further example, here is a transcript of a podcast from Australian tourism targeting Japanese tourists. It uses word-level timestamps whenever the English speaker is speaking, but not when the Japanese speaker is speaking:
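The actual feed is not reproduced here, but an illustrative sketch of that mixed-fidelity situation might look like this: the English speaker's segments carry one word each, while the Japanese reply is a single multi-word segment:

```json
{
  "version": "1.0.0",
  "segments": [
    { "speaker": "Guide", "startTime": 0.0, "endTime": 0.4, "body": "Welcome " },
    { "speaker": "Guide", "startTime": 0.4, "endTime": 0.7, "body": "to " },
    { "speaker": "Guide", "startTime": 0.7, "endTime": 1.3, "body": "Sydney. " },
    { "speaker": "Tourist", "startTime": 1.3, "endTime": 3.9, "body": "ありがとうございます。楽しみにしていました。" }
  ]
}
```

A transcript like this should NOT be declared fidelity="word", since the declaration promises word timestamps entirely throughout.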
(Related proposals for multilingual transcripts: #483 and #370)
This proposal leaves open the possibility for future additions to the set of possible fidelity levels. E.g. phrase, sentence, paragraph/chapter.
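Concretely, a feed adopting this proposal might declare the fidelity alongside the tag's existing attributes. The url, type, and language attributes below follow the current <podcast:transcript> spec; fidelity="word" is the new, proposed part, and the URL is a placeholder:

```xml
<podcast:transcript
  url="https://example.com/episode1/transcript.json"
  type="application/json"
  language="en"
  fidelity="word" />
```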
I do also want to raise some potential concerns/issues with word timestamps:
For example, the phrase `It's called "Fun Animations"` could be segmented either with punctuation as separate tokens, as `"`, `Fun`, `Animations`, `"`, or with collapsing punctuation, as `"Fun`, `Animations"`. Or, using the Japanese example, we might have `いつも`, `は` (note the absence of spaces allows for correct concatenation). Alternatively, the JSON format conceivably allows for the addition of language tags for each and every word, which would support an inference approach.

If I were designing things from scratch, I would prefer not to collapse punctuation by default, and I would take the SentencePiece-style approach of preserving the spaces in word timestamps, to avoid any ambiguity about how to actually reconstruct the original transcript. But these practices are hard to change now that people have already adopted their own. Some people out there are collapsing punctuation, some are not. Some people (yes) are preserving spaces, some are not.
In any case, this will probably be my last proposal here. Sadly, there has been very little interest in, or support for, any of these transcript-related proposals. I am guessing that if I wait another 2 or 3 years, a few comments might trickle in, but nothing will change. It seems the only reason there has been any slight recognition in the past few days that there is work to be done with transcripts is that the transcript tag became the first tag that Apple publicly adopted.

I have my own mission to spread accurate transcripts to all corners of the world, especially in languages that are not currently supported adequately by the spec. I have decided to invest my time into other, more productive ways to make this happen than trying to get the spec authors to take notice of these issues: for example, developing software to parse the existing transcripts published under the current standard and reformat them into something that is actually useful to the apps, and ultimately to the listeners who depend on them. (I'm not saying that is ideal, and this sort of processing would depend on heavy server-side resources if someone were to deploy it, but that seems to be where my focus would be better spent.) Hopefully someone else out there can take over this mission of trying to get the spec changed, but that will no longer be me. Best wishes.