# Proposal: JSON transcripts v2 #574

*ryan-lp started this conversation in Enhancement Proposal*
This proposal aims to overcome some of the limitations of the JSON transcript format v1 by:

- giving the format a unique MIME type and filename extension;
- restricting the format to word-level timestamps;
- supporting karaoke-style line breaks via optional whitespace prefixes;
- supporting language switching within a transcript;
- reporting transcription progress status; and
- tagging transcripts as AI generated and/or human edited.
## Proposal

### JSON Transcript File (v2)

MIME type: `application/jtf`

File extension: `.jtf`
The proposed v2 of the JSON Transcript File has the following attributes:

- `version`: the version of the transcript format specification
- `generator`: (optional) the name of the service or software used to generate the transcript, if applicable
- `status`: (optional) one of:
  - `PENDING` - the transcript is yet to be created
  - `IN_PROGRESS` - transcription is in progress, and the document may contain a partial transcription
  - `FAILED` - no transcript was produced due to a server error
  - `COMPLETED` - the transcript is complete (default)
- `edited`: (optional) one of:
  - `NO` - the transcript has not been edited by a human (default)
  - `IN_PROGRESS` - the transcript is being edited, and the document may contain partial edits
  - `YES` - the transcript has been edited by a human to human standards
- `retryAfter`: (optional) the number of seconds after which the client should re-fetch the transcript
- `items`: an array of information about each word:
  - `speaker`: the speaker of the word
  - `startTime`: the start time of the word in seconds
  - `endTime`: the end time of the word in seconds
  - `body`: the word transcript including punctuation, and optionally including leading whitespace
  - `language`: (optional) the language code of the word

Example:
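A short two-speaker transcript with a mid-transcript language switch might look like this (the speakers, times, wording and `generator` value are all illustrative):

```json
{
  "version": "2.0",
  "generator": "ACME Transcriber 1.4",
  "status": "COMPLETED",
  "edited": "NO",
  "items": [
    { "speaker": "Alice", "startTime": 0.00, "endTime": 0.35, "body": "Hello",     "language": "en" },
    { "speaker": "Alice", "startTime": 0.35, "endTime": 0.70, "body": " world.",   "language": "en" },
    { "speaker": "Bob",   "startTime": 1.10, "endTime": 1.55, "body": "\nBonjour", "language": "fr" },
    { "speaker": "Bob",   "startTime": 1.55, "endTime": 2.00, "body": " tout",     "language": "fr" },
    { "speaker": "Bob",   "startTime": 2.00, "endTime": 2.40, "body": " le",       "language": "fr" },
    { "speaker": "Bob",   "startTime": 2.40, "endTime": 2.90, "body": " monde.",   "language": "fr" }
  ]
}
```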
## Food for thought...
The above is fine, but there is also an alternative.
Borrowing from the way that AI transformer architectures represent transcripts as sequences of tokens, a radical departure from the above that would also be more efficient storage-wise would be to differentiate between word tokens, speaker tokens and language tokens. Thus, rather than putting a speaker and language attribute on every single word, speaker "changes" and language "changes" are inserted as separate tokens in the sequence among the words:
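Reworking the example above, and with an invented `type` discriminator that is not part of the proposal, such a token sequence might look like:

```json
{
  "version": "2.0",
  "items": [
    { "type": "speaker", "name": "Alice" },
    { "type": "language", "code": "en" },
    { "type": "word", "startTime": 0.00, "endTime": 0.35, "body": "Hello" },
    { "type": "word", "startTime": 0.35, "endTime": 0.70, "body": " world." },
    { "type": "speaker", "name": "Bob" },
    { "type": "language", "code": "fr" },
    { "type": "word", "startTime": 1.10, "endTime": 1.55, "body": "\nBonjour" },
    { "type": "word", "startTime": 1.55, "endTime": 2.00, "body": " tout" },
    { "type": "word", "startTime": 2.00, "endTime": 2.40, "body": " le" },
    { "type": "word", "startTime": 2.40, "endTime": 2.90, "body": " monde." }
  ]
}
```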
Now to get a little silly with this, AI models can sometimes go one step further by making even the timestamp into its own token. I'm not suggesting we go that far, but this is certainly more efficient storage-wise, since the `endTime` is almost always going to be equal to the `startTime` of the following word, and it is redundant to store the same timestamp twice. If timestamps are separate tokens that you can insert between words, you wouldn't need to duplicate the timestamp. Pauses are then represented by having two timestamp tokens in a row. Again, let's not go that far, because it actually poses difficulties for the parser, which must look ahead to find the end timestamp of the current word, and you would need to impose some additional rules on what constitutes a well-formed file, where, for example, you must at minimum have a timestamp token before each word, and after the last word.
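Purely to illustrate the idea being set aside, such a file might interleave invented `time` tokens with the words, with a pause appearing as two `time` tokens in a row (here the pause between 0.70 and 1.10):

```json
{
  "items": [
    { "type": "time", "at": 0.00 },
    { "type": "word", "body": "Hello" },
    { "type": "time", "at": 0.35 },
    { "type": "word", "body": " world." },
    { "type": "time", "at": 0.70 },
    { "type": "time", "at": 1.10 },
    { "type": "word", "body": "\nBonjour" },
    { "type": "time", "at": 1.55 }
  ]
}
```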
## Notes

### Unique MIME type and filename extension
We should avoid using the generic `.json` extension and MIME type for any format that we intend to be widely adopted as a standard. Having a distinct extension and MIME type allows different viewers/readers/parsers to be selected automatically based on the type of the file, which is evident from either the filename or the Content-Type header in HTTP. An example of this is the GeoJSON format (RFC 7946), which has a `.geojson` extension and an `application/geo+json` MIME type.

### Restrict use for word timestamps
The primary use case for apps that consume the JSON format is to get word timestamps. However, v1 is rather broad in what level of fidelity can in theory be supported: it is legal for a podcaster to put each word into its own timestamped segment (which is what the consuming apps will want), but it is equally legal to put the entire episode transcript into a single segment with only a single timestamp (I have seen this before; see the sketch below). Unfortunately, such flexibility means that this format can't be relied on by the apps that consume it. Rather than use one format for a flexible range of fidelities, this proposal suggests using the JSON format for word-based precision, and SRT for subtitle block-based precision (where a subtitle is one or two lines at a time).
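For instance, if memory serves on the v1 shape, a document along these lines is legal, yet useless to an app that needs word timestamps:

```json
{
  "version": "1.0.0",
  "segments": [
    {
      "startTime": 0,
      "endTime": 3600,
      "body": "The entire one-hour episode transcript crammed into a single segment..."
    }
  ]
}
```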
### Karaoke support
Karaoke-style subtitles/captions are used to highlight the individual words in the lyrics of a song while they are being sung (i.e. word timestamps), while also formatting the lyrics with phrasally appropriate line breaks (like SRT). SRT has the ability to satisfy both requirements, whereas the v1 JSON format satisfies only the first, and so the v2 proposal addresses this through the addition of optional whitespace prefixes. This whitespace can be freely stripped by apps that want to ignore the hints provided by the publisher; however, the hints suggest where lines should be broken, and also more generally where whitespace should go.
In the above example, notice that the body of each word can actually be concatenated together into one long string to give the original text-based transcript. This reads as follows, including whitespace:
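With the illustrative example above, that concatenation yields:

```
Hello world.
Bonjour tout le monde.
```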
The idea of whitespace prefixes comes from Byte Pair Encoding tokenisers, which are a method of taking a block of text, such as a transcript, and simply cutting it up into pieces, and those pieces tend to be words. Since written English includes spaces, these tokenisers end up treating the space as a natural part of the beginning of a word. In languages that don't use spaces, the tokens end up having no spaces; either way, the tokens can be perfectly stitched together and you'll get the original transcript with the appropriate spaces in the correct places.

I have merely added the newline character into the picture so that line breaks can also be reconstructed from the tokens.
Transcription services that wish to offer both transcription and editing facilities can thus provide a way for musicians to edit not only the words but also where to split new phrases onto new lines, as would be needed for karaoke-type displays.
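As a sketch of how a consuming app might use these hints, the following Python (assuming the illustrative example above, and the proposed `items` and `body` attribute names) splits the word stream into karaoke display lines wherever a body carries a leading newline; apps that want to ignore the hints can simply strip the whitespace instead:

```python
import json

def karaoke_lines(document: str) -> list[list[dict]]:
    """Group the word items of a .jtf document into display lines,
    starting a new line wherever a word's body carries the
    publisher's leading-newline hint."""
    items = json.loads(document)["items"]
    lines: list[list[dict]] = []
    current: list[dict] = []
    for item in items:
        if item["body"].startswith("\n") and current:
            lines.append(current)
            current = []
        # Strip the layout hint but keep any leading space on the word.
        current.append({**item, "body": item["body"].lstrip("\n")})
    if current:
        lines.append(current)
    return lines

# Each resulting line is a list of timed words, ready for
# word-by-word highlighting, e.g.:
#   for line in karaoke_lines(doc):
#       print("".join(word["body"] for word in line).strip())
```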
### Language switching
This is already illustrated sufficiently by the example.
### Progress status
It is worth noting that this sort of thing can also be done via HTTP headers. The purpose in including these attributes in the JSON file itself is so that self-hosted podcasters also have a way of setting these values.
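For example, a self-hosted podcaster could publish a stub like the following while transcription is still running (values illustrative), much as a server could send an HTTP `Retry-After` header:

```json
{
  "version": "2.0",
  "status": "IN_PROGRESS",
  "retryAfter": 300,
  "items": [
    { "speaker": "Alice", "startTime": 0.00, "endTime": 0.35, "body": "Hello", "language": "en" }
  ]
}
```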
### AI generated / human edited
We should really have this tagging option available in many other places, and I think it's important to get in there early and start tagging. There are many reasons why, including that podcast listeners may want to know what they're getting, and that once data is put out there without any tagging, it is going to be next to impossible to go back and tag the accumulated history of untagged content. In the case of transcripts, let's also realise that the people who actually depend on transcripts are the ones negatively impacted by poor-quality transcripts, so we can provide this audience a service by letting them know how much they can trust that what they're reading is an accurate reflection of what was spoken.

Another reason is that the next generation of AI transcription software will be trained on the current generation of transcripts. An objective of AI training is of course to train only on high-quality transcripts; without such tagging, the next generation of transcription software will end up being trained on the output of the previous generation, and that would lead to poor-quality transcripts well into the future.
## Feedback
This proposal collects together all of the things I think were missing from v1, but it may be incomplete, or there may be features you would take out. Feedback is welcome.