FaceMesh Extraction Explainer #125
Hi Alex, thanks for sharing this. One immediate comment is that you should consider including blendshapes with the mesh data. Most face detection platforms either supply them directly (e.g., ARKit) or, if they need to be calculated, that is probably best done at the native level. Since a common use is to animate a 3D model, not just draw the mesh, they will be essential.

I think a big question with this is a meta question: how (and if) to integrate WebXR and WebRTC features. To be useful to WebXR, the information coming out of the camera API needs to include

A "bonus" would be

Getting access to cameras is an oft-repeated request from folks who use WebXR, so they can do computer vision, use video in reflection maps, and so on. There are really two ways to do it: either by exposing the frames directly via WebXR, or by somehow associating the cameras on the device with WebRTC/gUM cameras. Some of the info (e.g., camera device intrinsics and extrinsics, which probably don't change over a session) could be exposed through WebXR APIs. Timestamps and video format are big questions.
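The kind of per-camera metadata mentioned above can be sketched concretely. Below is a minimal, hypothetical TypeScript sketch; none of these interface or field names come from any spec or from the explainer, they just illustrate intrinsics and extrinsics that are stable over a session plus a capture timestamp on the same clock WebXR uses:

```ts
// Hypothetical sketch only; none of these names exist in any spec.
interface CameraIntrinsics {
  focalLength: { x: number; y: number };      // in pixels
  principalPoint: { x: number; y: number };   // in pixels
  imageSize: { width: number; height: number };
}

interface CameraExtrinsics {
  // Pose of the camera relative to some WebXR space (e.g. the viewer space),
  // expressed as a column-major 4x4 matrix.
  matrix: Float32Array;
}

interface CameraFrameMetadata {
  intrinsics: CameraIntrinsics;
  extrinsics: CameraExtrinsics;
  // Capture time on the same monotonic clock WebXR uses, not the time the
  // frame happened to reach JavaScript.
  captureTime: DOMHighResTimeStamp;
}
```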
Thanks for the feedback, Blair.

As far as blendshapes go, from what I can tell ARCore doesn't really support those in the same way ARKit does. ARCore exposes the ability to request three specific region poses, but we don't really have any data beyond that. (My quick read is that blendshapes include not just the poses of facial landmarks but also some notion of data describing an associated "gesture"; please correct me if I'm wrong.) Given that it's not something natively supported on ARCore, I haven't really looked too much into it.

I did, however, decide not to expose those region poses. Given that the face mesh will need to be standardized for textures (and indeed, while the textureCoordinates and indices are part of the API, they'll likely also need to be specified somewhere; this is really just so developers don't need to hard-code a very large and error-prone blob of data), a given set of vertices will always represent the same region, so these can also be statically defined. However, given that different pages may be interested in different regions, it didn't feel right for the spec to more or less say "here are poses for the regions we thought were important" when others could still be calculated by the page, and indeed there is some disparity in the regions that the underlying runtimes would natively expose.

I don't really expand on this in the explainer, since it's out of the scope of the initial implementation that I'm targeting, but the way I would imagine integrating FaceMesh with WebXR is that it would re-use the FaceMesh type and expose it as extra data on the corresponding WebXR frame, which I think should ensure that most of the data you want (e.g. the timestamp, FOV, coordinate system, etc.) would be available. There are some restrictions on the current runtimes (which I address a little bit) that may require some further tweaks (e.g. an ARCore-backed implementation can only support the Viewer reference space).

More general integration with WebRTC sounds like you're asking about Raw Camera Access, which is a separate feature entirely from what I'm discussing here. @bialpio has started to do some work on that here: https://github.com/bialpio/webxr-raw-camera-access/blob/master/explainer.md
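To make the shapes involved concrete, here is a rough, hypothetical TypeScript sketch of the kind of data described above: vertices whose indices always map to the same facial region, plus standardized textureCoordinates and indices. This is not the explainer's actual IDL; all names, and the example vertex id, are placeholders:

```ts
// Rough sketch only; not the explainer's actual API surface.
interface FaceMeshSketch {
  // Per-frame vertex positions (x, y, z triples). Vertex i always corresponds
  // to the same facial region, so region lookups can be statically defined.
  vertices: Float32Array;
  // Stable across frames and (per the discussion) likely specified in the
  // standard rather than hard-coded by each page.
  textureCoordinates: Float32Array; // u, v pairs
  indices: Uint16Array;             // triangle list
}

// Example: a page-defined region lookup built on the standardized topology.
// The vertex index here is a placeholder, not a value from any spec.
const NOSE_TIP_VERTEX = 4;

function noseTipPosition(mesh: FaceMeshSketch): [number, number, number] {
  const i = NOSE_TIP_VERTEX * 3;
  return [mesh.vertices[i], mesh.vertices[i + 1], mesh.vertices[i + 2]];
}
```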
A few comments.

First, I'm not advocating for exposing what ARKit or ARCore do. Both have different features, and both will change over time. I was merely suggesting that blendshapes are much more useful than meshes for some uses (e.g., putting an animated head model where the face is, which would have its rig animated directly by the blendshapes), which is why ARKit supports them. Forcing individual apps to compute these seems to raise the bar for use quite a bit. I completely agree this would need a standardized mesh spec; otherwise, it's "just a mesh". I assume that Snapchat-like-filter apps would want to attach things "to the face", and the easy way to do that would be via specific places on the mesh.

Second, I'm not talking about Raw Camera Access specifically, although that is the use case most people want. I discussed these different options about a year and a half ago in https://github.com/immersive-web/computer-vision. If you are going to return anything relative to the camera, it needs to be related to the coordinate system in WebXR. For just "trackables" (like the face), if they are integrated with WebXR and return something like a WebXR anchor, then none of the details of the camera are required. But if it just returns a mesh in camera coordinates, some additional information (extrinsics, timestamps, etc.) will absolutely be needed to relate the pose of the camera when the face was detected to the timestamps in WebXR.

It will be VERY important in any API like this to NOT assume that there is a 1:1 relationship between camera frames and WebXR frames. While phone-based AR has this property, other AR devices (i.e., head-worn displays) do not. Cameras and head-pose trackers will run at different rates, and the camera frames will not match a WebXR frame. It is absolutely essential that we assume there is a common monotonically increasing timestamp used by both WebXR and camera capture, so that the data returned by both can be related. This assumes that the exact timestamp is provided for each camera frame (i.e., when it was captured, not some arbitrary time at which it was accessed by the native library or JavaScript). Any API that assumes a 1:1 mapping between camera and WebXR frames isn't going to work on HoloLens 2 or ML1, for example, and is thus DOA.
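As an illustration of the timestamp point above, here is a minimal, hypothetical TypeScript sketch of relating a face result's capture time to separately buffered device poses, rather than assuming it belongs to "the current" WebXR frame. All types and names here are assumptions for illustration only:

```ts
// Hypothetical sketch: camera frames and WebXR frames arrive at different
// rates, so a face result is associated with the buffered pose whose
// timestamp is closest to the capture time.
interface TimedPose {
  time: DOMHighResTimeStamp;
  pose: Float32Array; // 4x4 matrix
}

// Appended once per WebXR animation-frame callback.
const poseHistory: TimedPose[] = [];

function poseAtCaptureTime(captureTime: DOMHighResTimeStamp): Float32Array | null {
  let best: TimedPose | null = null;
  for (const p of poseHistory) {
    if (!best || Math.abs(p.time - captureTime) < Math.abs(best.time - captureTime)) {
      best = p;
    }
  }
  // A real implementation would likely interpolate between the two bracketing
  // poses instead of picking the nearest one.
  return best ? best.pose : null;
}
```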
Thanks, Blair. For your other points, I think those are outside the scope of the work I'm currently doing, which is integrating FaceMesh extraction into something that can be consumed from getUserMedia (and not yet integrated with WebXR). However, there's obviously significant overlap and interest from the group here, which is why I wanted to share my thinking on why the initial integration is not with WebXR. If/when we do begin working on WebXR integration, your points are definitely something we should keep in mind. However, I suspect those points will likely (hopefully) be raised and tackled as part of the Raw Camera Access API (or others) before I begin working on a WebXR integration.
I've filed alcooper91/face-mesh#2 to track the proposal to add blendshapes, and I believe most of the other points made are out of scope for my proposal (or at the very least for this issue). Let's continue discussion in that issue (or in new issues).
I'd like to +1 the recommendation to provide blendshapes. For most of the applications I have for facial tracking, blendshapes are by far a better solution than the mesh alone. Platforms that don't currently provide them can, to some extent, calculate them at native speeds, and authors will then be able to rely on them in addition to the mesh.
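To illustrate why blendshapes are convenient for this use case, here is a small, hypothetical TypeScript sketch of driving a model's morph targets directly from named blendshape weights. The interface and the ARKit-style coefficient names are assumptions for illustration only, not part of any proposed API:

```ts
// Named coefficients in [0, 1], e.g. { jawOpen: 0.4, eyeBlinkLeft: 0.9 }.
type BlendShapeWeights = Record<string, number>;

// Hypothetical abstraction over whatever rendering engine the page uses.
interface MorphTargetModel {
  morphTargetIndex(name: string): number | undefined;
  setMorphTargetInfluence(index: number, weight: number): void;
}

// Drives the rig directly from the tracking data, with no mesh analysis
// required by the page.
function applyBlendShapes(model: MorphTargetModel, weights: BlendShapeWeights): void {
  for (const [name, weight] of Object.entries(weights)) {
    const index = model.morphTargetIndex(name);
    if (index !== undefined) {
      model.setMorphTargetInfluence(index, weight);
    }
  }
}
```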
I've recently created an Explainer for what an API to extract, from a camera stream, a FaceMesh that can be rendered to/over would look like. While my current proposal is that it not (at first) integrate with WebXR (and thus it would likely be promoted to the WICG rather than here if/when it is ready), I know the topic is of interest to people in this CG, so I wanted to mention it and share a pointer during the next CG call.
The Explainer can be found at https://github.com/alcooper91/face-mesh
/agenda