FaceMesh Extraction Explainer #125
Hi Alex, thanks for sharing this. One immediate comment is that you should consider including blendshapes with the mesh data. Most face detection platforms either supply them directly (e.g., ARKit) or, if they need to be calculated, that is probably best done at the native level. Since a common use is to animate a 3D model, not just draw the mesh, they will be essential.

I think a big question with this is a meta question: how (and if) to integrate WebXR and WebRTC features. To be useful to WebXR, the information coming out of the camera API needs to include

A "bonus" would be

Getting access to cameras is an oft-repeated request from folks who use WebXR, so they can do computer vision, use video in reflection maps, and so on. There are really two ways to do it: either by exposing the frames directly via WebXR, or by somehow associating the cameras on the device with WebRTC/gUM cameras. Some of the info (e.g., camera device intrinsics and extrinsics, which probably don't change over a session) could be exposed through WebXR APIs. Timestamps and video format are big questions.
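The kind of per-camera metadata mentioned above can be sketched concretely. Below is a minimal, hypothetical TypeScript sketch; none of these interface or field names come from any spec or from the explainer, they just illustrate intrinsics and extrinsics that are stable over a session plus a capture timestamp on the same clock WebXR uses:

```ts
// Hypothetical sketch only; none of these names exist in any spec.
interface CameraIntrinsics {
  focalLength: { x: number; y: number };      // in pixels
  principalPoint: { x: number; y: number };   // in pixels
  imageSize: { width: number; height: number };
}

interface CameraExtrinsics {
  // Pose of the camera relative to some WebXR space (e.g. the viewer space),
  // expressed as a column-major 4x4 matrix.
  matrix: Float32Array;
}

interface CameraFrameMetadata {
  intrinsics: CameraIntrinsics;
  extrinsics: CameraExtrinsics;
  // Capture time on the same monotonic clock WebXR uses, not the time the
  // frame happened to reach JavaScript.
  captureTime: DOMHighResTimeStamp;
}
```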
Thanks for the feedback, Blair.

As far as blendshapes go, from what I can tell ARCore doesn't really support those in the same way ARKit does. ARCore exposes the ability to request three specific region poses, but we don't really have any data beyond that. (My quick read is that blendshapes include not just the poses of facial landmarks but also some notion of data describing an associated "gesture"; please correct me if I'm wrong.) Given that it's not something natively supported on ARCore, I haven't really looked too much into it.

I did, however, decide not to expose those region poses. Given that the face mesh will need to be standardized for textures (and indeed, while the textureCoordinates and indices are part of the API, they'll likely also need to be specified somewhere; this is really just so developers don't need to hard-code a very large and error-prone blob of data), a given set of vertices will always represent the same region, so these can also be statically defined. However, given that different pages may be interested in different regions, it didn't feel right for the spec to more or less say "here are poses for the regions we thought were important" when others could still be calculated by the page, and indeed there is some disparity in the regions that the underlying runtimes would natively expose.

I don't really expand on this in the explainer, since it's out of the scope of the initial implementation that I'm targeting, but the way I would imagine integrating FaceMesh with WebXR is that it would re-use the FaceMesh type and expose it as extra data on the corresponding WebXR frame, which I think should ensure that most of the data you want (e.g. the timestamp, FOV, coordinate system, etc.) would be available. There are some restrictions on the current runtimes (which I address a little bit) that may require some further tweaks (e.g. an ARCore-backed implementation can only support the Viewer reference space).

More general integration with WebRTC sounds like you're asking about Raw Camera Access, which is a separate feature entirely from what I'm discussing here. @bialpio has started to do some work on that here: https://github.com/bialpio/webxr-raw-camera-access/blob/master/explainer.md
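To make the shapes involved concrete, here is a rough, hypothetical TypeScript sketch of the kind of data described above: vertices whose indices always map to the same facial region, plus standardized textureCoordinates and indices. This is not the explainer's actual IDL; all names, and the example vertex id, are placeholders:

```ts
// Rough sketch only; not the explainer's actual API surface.
interface FaceMeshSketch {
  // Per-frame vertex positions (x, y, z triples). Vertex i always corresponds
  // to the same facial region, so region lookups can be statically defined.
  vertices: Float32Array;
  // Stable across frames and (per the discussion) likely specified in the
  // standard rather than hard-coded by each page.
  textureCoordinates: Float32Array; // u, v pairs
  indices: Uint16Array;             // triangle list
}

// Example: a page-defined region lookup built on the standardized topology.
// The vertex index here is a placeholder, not a value from any spec.
const NOSE_TIP_VERTEX = 4;

function noseTipPosition(mesh: FaceMeshSketch): [number, number, number] {
  const i = NOSE_TIP_VERTEX * 3;
  return [mesh.vertices[i], mesh.vertices[i + 1], mesh.vertices[i + 2]];
}
```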
A few comments.

First, I'm not advocating for exposing what ARKit or ARCore do. Both have different features, and both will change over time. I was merely suggesting that blendshapes are much more useful than meshes for some uses (e.g., putting an animated head model where the face is, which would have its rig animated directly by the blendshapes), which is why ARKit supports them. Forcing individual apps to compute these seems to raise the bar for use quite a bit. I completely agree this would need a standardized mesh spec; otherwise, it's "just a mesh". I assume that Snapchat-like-filter apps would want to attach things "to the face", and the easy way to do that would be via specific places on the mesh.

Second, I'm not talking about Raw Camera Access specifically, although that is the use case most people want. I discussed these different options about a year and a half ago in https://github.com/immersive-web/computer-vision. If you are going to return anything relative to the camera, it needs to be related to the coordinate system in WebXR. For just "trackables" (like the face), if they are integrated with WebXR and return something like a WebXR anchor, then none of the details of the camera are required. But if it just returns a mesh in camera coordinates, some additional information (extrinsics, timestamps, etc.) will absolutely be needed to relate the pose of the camera when the face was detected to the timestamps in WebXR.

It will be VERY important in any API like this to NOT assume that there is a 1:1 relationship between camera frames and WebXR frames. While phone-based AR has this property, other AR devices (i.e., head-worn displays) do not. Cameras and head-pose trackers will run at different rates, and the camera frames will not match a WebXR frame. It is absolutely essential that we assume there is a common monotonically increasing timestamp used by both WebXR and camera capture, so that the data returned by both can be related. This assumes that the exact timestamp is provided for each camera frame (i.e., when it was captured, not some arbitrary time at which it was accessed by the native library or JavaScript). Any API that assumes a 1:1 mapping between camera and WebXR frames isn't going to work on HoloLens 2 or ML1, for example, and is thus DOA.
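As an illustration of the timestamp point above, here is a minimal, hypothetical TypeScript sketch of relating a face result's capture time to separately buffered device poses, rather than assuming it belongs to "the current" WebXR frame. All types and names here are assumptions for illustration only:

```ts
// Hypothetical sketch: camera frames and WebXR frames arrive at different
// rates, so a face result is associated with the buffered pose whose
// timestamp is closest to the capture time.
interface TimedPose {
  time: DOMHighResTimeStamp;
  pose: Float32Array; // 4x4 matrix
}

// Appended once per WebXR animation-frame callback.
const poseHistory: TimedPose[] = [];

function poseAtCaptureTime(captureTime: DOMHighResTimeStamp): Float32Array | null {
  let best: TimedPose | null = null;
  for (const p of poseHistory) {
    if (!best || Math.abs(p.time - captureTime) < Math.abs(best.time - captureTime)) {
      best = p;
    }
  }
  // A real implementation would likely interpolate between the two bracketing
  // poses instead of picking the nearest one.
  return best ? best.pose : null;
}
```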
Thanks, Blair. For your other points, I think those are outside the scope of the work I'm currently doing, which is integrating FaceMesh extraction into something that can be consumed from getUserMedia (and not yet integrated with WebXR). However, there's obviously significant overlap and interest from the group here, which is why I wanted to share my thinking on why the initial integration is not with WebXR. If/when we do begin working on WebXR integration, your points are definitely something we should keep in mind. However, I suspect those points will likely (hopefully) be raised and tackled as part of the Raw Camera Access API (or others) before I begin working on a WebXR integration.
I've filed alcooper91/face-mesh#2 to track the proposal to add blendshapes, and I believe most of the other points made are out of scope for my proposal (or at the very least for this issue). Let's continue discussion in that issue (or in new issues).
I'd like to +1 the recommendation to provide blendshapes. For most of the applications I have for facial tracking, blendshapes are by far a better solution than the mesh alone. Platforms that don't currently provide them can, to some extent, calculate them at native speeds, and authors will then be able to rely on them in addition to the mesh.
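To illustrate why blendshapes are convenient for this use case, here is a small, hypothetical TypeScript sketch of driving a model's morph targets directly from named blendshape weights. The interface and the ARKit-style coefficient names are assumptions for illustration only, not part of any proposed API:

```ts
// Named coefficients in [0, 1], e.g. { jawOpen: 0.4, eyeBlinkLeft: 0.9 }.
type BlendShapeWeights = Record<string, number>;

// Hypothetical abstraction over whatever rendering engine the page uses.
interface MorphTargetModel {
  morphTargetIndex(name: string): number | undefined;
  setMorphTargetInfluence(index: number, weight: number): void;
}

// Drives the rig directly from the tracking data, with no mesh analysis
// required by the page.
function applyBlendShapes(model: MorphTargetModel, weights: BlendShapeWeights): void {
  for (const [name, weight] of Object.entries(weights)) {
    const index = model.morphTargetIndex(name);
    if (index !== undefined) {
      model.setMorphTargetInfluence(index, weight);
    }
  }
}
```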
I've recently created an Explainer for what an API to extract, from a camera stream, a FaceMesh that can be rendered to/over would look like. While my current proposal is that it not (at first) integrate with WebXR (and thus it would likely be promoted to the WICG rather than here if/when it is ready), I know the topic is of interest to people in this CG, so I wanted to mention it and share a pointer during the next CG call.
The Explainer can be found at https://github.com/alcooper91/face-mesh
/agenda