# Proposal: JSON transcripts v2 #574

*ryan-lp started this conversation in Enhancement Proposal*
This proposal aims to overcome some of the limitations of the JSON transcript format v1 by:

- giving the format a unique MIME type and filename extension;
- restricting the format to word-level timestamps;
- supporting karaoke-style line breaks via optional whitespace prefixes;
- supporting language switching within a transcript;
- reporting transcription progress status; and
- tagging transcripts as AI generated and/or human edited.
## Proposal

### JSON Transcript File (v2)

MIME type: `application/jtf`

File extension: `.jtf`
The proposed v2 of the JSON Transcript File has the following attributes:

- `version`: the version of the transcript format specification
- `generator`: (optional) the name of the service or software used to generate the transcript, if applicable
- `status`: (optional) one of:
  - `PENDING` - the transcript is yet to be created
  - `IN_PROGRESS` - transcription is in progress, and the document may contain a partial transcription
  - `FAILED` - no transcript was produced due to a server error
  - `COMPLETED` - the transcript is complete (default)
- `edited`: (optional) one of:
  - `NO` - the transcript has not been edited by a human (default)
  - `IN_PROGRESS` - the transcript is being edited, and the document may contain partial edits
  - `YES` - the transcript has been edited by a human to human standards
- `retryAfter`: (optional) the number of seconds after which the client should re-fetch the transcript
- `items`: an array of information about each word:
  - `speaker`: the speaker of the word
  - `startTime`: the start time of the word in seconds
  - `endTime`: the end time of the word in seconds
  - `body`: the word transcript including punctuation, and optionally including leading whitespace
  - `language`: (optional) the language code of the word

Example:
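A short two-speaker transcript with a mid-transcript language switch might look like this (the speakers, times, wording and `generator` value are all illustrative):

```json
{
  "version": "2.0",
  "generator": "ACME Transcriber 1.4",
  "status": "COMPLETED",
  "edited": "NO",
  "items": [
    { "speaker": "Alice", "startTime": 0.00, "endTime": 0.35, "body": "Hello",     "language": "en" },
    { "speaker": "Alice", "startTime": 0.35, "endTime": 0.70, "body": " world.",   "language": "en" },
    { "speaker": "Bob",   "startTime": 1.10, "endTime": 1.55, "body": "\nBonjour", "language": "fr" },
    { "speaker": "Bob",   "startTime": 1.55, "endTime": 2.00, "body": " tout",     "language": "fr" },
    { "speaker": "Bob",   "startTime": 2.00, "endTime": 2.40, "body": " le",       "language": "fr" },
    { "speaker": "Bob",   "startTime": 2.40, "endTime": 2.90, "body": " monde.",   "language": "fr" }
  ]
}
```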
## Food for thought...
The above is fine, but there is also an alternative.
Borrowing from the way that AI transformer architectures represent transcripts as sequences of tokens, a radical departure from the above that would also be more efficient storage-wise would be to differentiate between word tokens, speaker tokens and language tokens. Thus, rather than putting a speaker and language attribute on every single word, speaker "changes" and language "changes" are inserted as separate tokens in the sequence among the words:
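Reworking the example above, and with an invented `type` discriminator that is not part of the proposal, such a token sequence might look like:

```json
{
  "version": "2.0",
  "items": [
    { "type": "speaker", "name": "Alice" },
    { "type": "language", "code": "en" },
    { "type": "word", "startTime": 0.00, "endTime": 0.35, "body": "Hello" },
    { "type": "word", "startTime": 0.35, "endTime": 0.70, "body": " world." },
    { "type": "speaker", "name": "Bob" },
    { "type": "language", "code": "fr" },
    { "type": "word", "startTime": 1.10, "endTime": 1.55, "body": "\nBonjour" },
    { "type": "word", "startTime": 1.55, "endTime": 2.00, "body": " tout" },
    { "type": "word", "startTime": 2.00, "endTime": 2.40, "body": " le" },
    { "type": "word", "startTime": 2.40, "endTime": 2.90, "body": " monde." }
  ]
}
```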
Now to get a little silly with this, AI models can sometimes go one step further by making even the timestamp into its own token. I'm not suggesting we go that far, but this is certainly more efficient storage-wise, since the `endTime` is almost always going to be equal to the `startTime` of the following word, and it is redundant to store the same timestamp twice. If timestamps are separate tokens that you can insert between words, you wouldn't need to duplicate the timestamp. Pauses are then represented by having two timestamp tokens in a row. Again, let's not go that far, because it actually poses difficulties for the parser, which must look ahead to find the end timestamp of the current word, and you would need to impose some additional rules on what constitutes a well-formed file, where, for example, you must at minimum have a timestamp token before each word, and after the last word.
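Purely to illustrate the idea being set aside, such a file might interleave invented `time` tokens with the words, with a pause appearing as two `time` tokens in a row (here the pause between 0.70 and 1.10):

```json
{
  "items": [
    { "type": "time", "at": 0.00 },
    { "type": "word", "body": "Hello" },
    { "type": "time", "at": 0.35 },
    { "type": "word", "body": " world." },
    { "type": "time", "at": 0.70 },
    { "type": "time", "at": 1.10 },
    { "type": "word", "body": "\nBonjour" },
    { "type": "time", "at": 1.55 }
  ]
}
```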
## Notes

### Unique MIME type and filename extension
We should avoid using the generic `.json` extension and MIME type for any format that we intend to be widely adopted as a standard. Having a distinct extension and MIME type allows different viewers/readers/parsers to be selected automatically based on the type of the file, which is evident from either the filename or the Content-Type header in HTTP. An example of this is the GeoJSON format (RFC 7946), which has a `.geojson` extension and an `application/geo+json` MIME type.

### Restrict use for word timestamps
The primary use case for apps that consume the JSON format is to get word timestamps. However, v1 is rather broad in what level of fidelity can in theory be supported: it is legal for a podcaster to put each word into its own timestamped segment (which is what the consuming apps will want), but it is equally legal to put the entire episode transcript into a single segment with only a single timestamp (I have seen this before; see the sketch below). Unfortunately, such flexibility means that this format can't be relied on by the apps that consume it. Rather than use one format for a flexible range of fidelities, this proposal suggests using the JSON format for word-based precision, and SRT for subtitle block-based precision (where a subtitle is one or two lines at a time).
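For instance, if memory serves on the v1 shape, a document along these lines is legal, yet useless to an app that needs word timestamps:

```json
{
  "version": "1.0.0",
  "segments": [
    {
      "startTime": 0,
      "endTime": 3600,
      "body": "The entire one-hour episode transcript crammed into a single segment..."
    }
  ]
}
```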
### Karaoke support
Karaoke-style subtitles/captions are used to highlight the individual words in the lyrics of a song while they are being sung (i.e. word timestamps), while also formatting the lyrics with phrasally appropriate line breaks (like SRT). SRT has the ability to satisfy both requirements, whereas the v1 JSON format satisfies only the first, and so the v2 proposal addresses this through the addition of optional whitespace prefixes. This whitespace can be freely stripped by apps that want to ignore the hints provided by the publisher; however, the hints suggest where lines should be broken, and also more generally where whitespace should go.
In the above example, notice that the body of each word can actually be concatenated together into one long string to give the original text-based transcript. This reads as follows, including whitespace:
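With the illustrative example above, that concatenation yields:

```
Hello world.
Bonjour tout le monde.
```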
The idea of whitespace prefixes comes from Byte Pair Encoding tokenisers, which are a method of taking a block of text, such as a transcript, and simply cutting it up into pieces, and those pieces tend to be words. Since written English includes spaces, these tokenisers end up treating the space as a natural part of the beginning of a word. In languages that don't use spaces, the tokens end up having no spaces; either way, the tokens can be perfectly stitched together and you'll get the original transcript with the appropriate spaces in the correct places.

I have merely added the newline character into the picture so that line breaks can also be reconstructed from the tokens.
Transcription services that wish to offer both transcription and editing facilities can thus provide a way for musicians to edit not only the words but also where to split new phrases onto new lines, as would be needed for karaoke-type displays.
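As a sketch of how a consuming app might use these hints, the following Python (assuming the illustrative example above, and the proposed `items` and `body` attribute names) splits the word stream into karaoke display lines wherever a body carries a leading newline; apps that want to ignore the hints can simply strip the whitespace instead:

```python
import json

def karaoke_lines(document: str) -> list[list[dict]]:
    """Group the word items of a .jtf document into display lines,
    starting a new line wherever a word's body carries the
    publisher's leading-newline hint."""
    items = json.loads(document)["items"]
    lines: list[list[dict]] = []
    current: list[dict] = []
    for item in items:
        if item["body"].startswith("\n") and current:
            lines.append(current)
            current = []
        # Strip the layout hint but keep any leading space on the word.
        current.append({**item, "body": item["body"].lstrip("\n")})
    if current:
        lines.append(current)
    return lines

# Each resulting line is a list of timed words, ready for
# word-by-word highlighting, e.g.:
#   for line in karaoke_lines(doc):
#       print("".join(word["body"] for word in line).strip())
```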
### Language switching
This is already illustrated sufficiently by the example.
### Progress status
It is worth noting that this sort of thing can also be done via HTTP headers. The purpose in including these attributes in the JSON file itself is so that self-hosted podcasters also have a way of setting these values.
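For example, a self-hosted podcaster could publish a stub like the following while transcription is still running (values illustrative), much as a server could send an HTTP `Retry-After` header:

```json
{
  "version": "2.0",
  "status": "IN_PROGRESS",
  "retryAfter": 300,
  "items": [
    { "speaker": "Alice", "startTime": 0.00, "endTime": 0.35, "body": "Hello", "language": "en" }
  ]
}
```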
### AI generated / human edited
We should really have this tagging option available in many other places, and I think it's important to get in there early and start tagging. There are many reasons why, including that podcast listeners may want to know what they're getting, and that once data is put out there without any tagging, it is going to be next to impossible to go back and tag the accumulated history of untagged content. In the case of transcripts, let's also realise that the people who actually depend on transcripts are the ones negatively impacted by poor-quality transcripts, so we can provide this audience a service by letting them know how much they can trust that what they're reading is an accurate reflection of what was spoken.

Another reason is that the next generation of AI transcription software will be trained on the current generation of transcripts. An objective of AI training is of course to train only on high-quality transcripts; without such tagging, the next generation of transcription software will end up being trained on the output of the previous generation, and that would lead to poor-quality transcripts well into the future.
## Feedback
This proposal collects together all of the things I think were missing from v1, but it may be incomplete, or there may be features you would take out. Feedback is welcome.