Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: AirbyteRecordMessageMeta for per-record lineage and changes #56

Merged
merged 16 commits into from
Jan 24, 2024

Conversation

evantahler
Copy link
Contributor

@evantahler evantahler commented Dec 13, 2023

Related Doc and Protocol Change Doc

In order to help track the lineage of record changes along throughout a sync, a new optional property is added to AirbyteRecordMessage that will store per-record errors that will eventually be stored in the destination.

As a reminder of our starting assumptions:

  • "data errors" are not "sync errors", and therefore should not block/fail a sync.
  • Some data is just too big to process or store.
  • Any part of the Airbyte platform might need to modify a record field for size reasons.

For example, consider the following totally realistic example record:

{
  type: "RECORD",
  record: {
    stream: "users",
    emitted_at: 123456789,
    data: {
      id: 1,
      first_name: "evan",
      last_name: "tahler",
      image: "<40mb PNG image... PNGxxxxxxyyyyyzzzz>"
    }
  }
}

While the source might be able to handle serializing the record, the platform (with its current limit of 20mb per field), will not. To keep the lineage that the image field was modified, we add a new meta.errors array which contains the change information. After the record passes though the platform, it will look like:

{
  type: "RECORD",
  record: {
    stream: "users",
    emitted_at: 123456789,
    data: {
      id: 1,
      first_name: "evan",
      last_name: "tahler",
      image: null // <--- changed!
    },
    meta: {
      changes: [
        { field: "image", change: "NULLED", reason: "PLATFORM_FIELD_SIZE_LIMITATION" }
      ]
    } 
  }
}

V2 destinations will then be able to inspect the record's meta information, and pass that along to the data warehouse, producing per-row errors. While the protocol wishes to track these changes as "changes", destinations are likely to represent these changes as errors for users.

In the threads below, we discussed how we would represent an error with a sub-field. We decided on JSON-schema representations of the problematic properties:

{
  type: "RECORD",
  record: {
    stream: "users",
    emitted_at: 123456789,
    data: {
      id: 1,
      first_name: "evan",
      last_name: "tahler",
      image: {
        name: "profile.png",
        body // <--- changed!
      }
    },
    meta: {
      changes: [
        { field: "image.body", change: "NULLED", reason: "PLATFORM_FIELD_SIZE_LIMITATION"}
      ]
    } 
  }
}

@evantahler evantahler requested a review from bleonard December 13, 2023 01:10
@evantahler
Copy link
Contributor Author

Note for destinations folks: we don't have to use all the properties in meta.errors, but I wanted to be robust in the protocol

Copy link
Contributor

@edgao edgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some comments. Agree with the general thrust of this change (and it definitely solves the immediate problem), just some details to work out

@evantahler evantahler requested review from jbfbell and edgao December 15, 2023 00:31
@prateekmukhedkar
Copy link

@evantahler do you have any other scenarios in mind where we would adding to meta errors, in addition to a record exceeding a size limit?

@evantahler
Copy link
Contributor Author

evantahler commented Dec 15, 2023

@evantahler do you have any other scenarios in mind where we would adding to meta errors, in addition to a record exceeding a size limit?

I can imagine all sorts of problems, but I personally haven't seen them come up in OC issues yet. Some examples could be:

  • encoding errors / charset errors
  • unable to convert to Airbyte protocol type (E.g. a binary file that we can't convert to hex in transit)
  • range errors (e.g. we could say there's a lower bound on our timestamp types of the year 0, so B.C.E. dates would be dropped)

@davinchia
Copy link
Contributor

Thoughts:

  • What is the value of only nulling out the offending field vs nulling out the entire record except the primary key? The latter seems simpler implementation wise and easier to work with from a user POV.
  • Simplicity-wise, if the platform didn't have a limit, how immediately useful would the origin piece still be? Given this is the protocol, I'm tempted to be a bit more conservative and lean the way of YAGNI.

Apologies for late review!

Copy link
Contributor

@davinchia davinchia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Evan and I went through this in person. Confirmed the proposed fix works and the proposed protocol change here is the minimal change to do so.

@evantahler evantahler requested a review from cgardens January 3, 2024 00:12
@evantahler
Copy link
Contributor Author

evantahler commented Jan 12, 2024

After a protocol review session on Jan 11th, this PR has been updated to replace record.meta.errors with record.meta.changes, and add more structure to that object. This reflects a mental-model shift that we are building a lineage tracking system, not an error tracking system

@evantahler evantahler changed the title feat: AirbyteRecordMessageMeta feat: AirbyteRecordMessageMeta for per-record lineage and changes Jan 12, 2024
@evantahler
Copy link
Contributor Author

evantahler commented Jan 22, 2024

The final batch of changes, reflected above, made a few of the properties in meta.changes requried.

@evantahler
Copy link
Contributor Author

^ The latest commit removes the free-text message field in favor of adding more options to the enum of reasons. The matrix of options includes actors (source, destination, platform) x problem (total size, field size, ser/deser, and retrieval). There will certainly be more reasons discovered in the future, and we can add them to the protocol at that time

@evantahler evantahler merged commit 45461e1 into main Jan 24, 2024
6 checks passed
@evantahler evantahler deleted the evan/airbyte-meta branch January 24, 2024 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants