Get formatted schema and anomalies to visualize #146

Open
wakanapo opened this issue Nov 19, 2020 · 5 comments

@wakanapo

wakanapo commented Nov 19, 2020

I'm trying to run TFDV in a Kubeflow Pipeline and visualize the results in the pipeline UI.

For statistics, I can easily visualize them using get_statistics_html.
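
For context, a minimal sketch of how that statistics HTML might be surfaced in the pipeline UI. It assumes the Kubeflow Pipelines v1 output viewer accepts an inline `web-app` output; `build_web_app_metadata` is a hypothetical helper name, and the placeholder string stands in for the HTML that get_statistics_html would return.

```python
import json
from typing import Any, Dict


def build_web_app_metadata(html: str) -> Dict[str, Any]:
    """Wrap an HTML string as an inline `web-app` output (assumed KFP v1 viewer type)."""
    return {
        "outputs": [
            {
                "type": "web-app",
                "storage": "inline",
                "source": html,
            }
        ]
    }


# In a real component the HTML would come from TFDV, for example:
#   from tensorflow_data_validation.utils.display_util import get_statistics_html
#   html = get_statistics_html(stats)
# A placeholder is used here so the sketch runs without TFDV installed.
html = "<p>placeholder for the statistics HTML</p>"
metadata = build_web_app_metadata(html)

# The component would then write this JSON to /mlpipeline-ui-metadata.json.
serialized = json.dumps(metadata)
```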

However, for schema and anomalies, I struggled. There are display_schema and display_anomalies functions, but each one transforms the data and calls IPython display internally, so there is no way to get the formatted, visualizable data back.
In the end, I more or less copied the display functions and changed them to return a DataFrame.

FYI, the code looks like this:
import json
import logging
from typing import Optional

import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import display_util


def _transform_anomalies_to_df(anomalies) -> Optional[pd.DataFrame]:
    """Builds the table that display_anomalies shows; returns None if there are no anomalies."""
    anomaly_rows = []
    # One row per feature-level anomaly.
    for feature_name, anomaly_info in anomalies.anomaly_info.items():
        anomaly_rows.append(
            [
                # Private TFDV helper, used the same way display_anomalies uses it.
                display_util._add_quotes(feature_name),
                anomaly_info.short_description,
                anomaly_info.description,
            ]
        )
    # Dataset-level anomalies are stored separately from per-feature ones.
    if anomalies.HasField("dataset_anomaly_info"):
        anomaly_rows.append(
            [
                "[dataset anomaly]",
                anomalies.dataset_anomaly_info.short_description,
                anomalies.dataset_anomaly_info.description,
            ]
        )

    if not anomaly_rows:
        logging.info("No anomalies found.")
        return None
    else:
        logging.warning(f"{len(anomaly_rows)} anomalies found.")
        anomalies_df = pd.DataFrame(
            anomaly_rows,
            columns=[
                "Feature name",
                "Anomaly short description",
                "Anomaly long description",
            ],
        )
        return anomalies_df


def main(schema_file: str, stats_file: str, anomalies_file: str):
    schema = tfdv.load_schema_text(schema_file)
    stats = tfdv.load_statistics(stats_file)
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
    tfdv.write_anomalies_text(anomalies, anomalies_file)

    anomalies_df = _transform_anomalies_to_df(anomalies)
    if anomalies_df is not None:
        # Kubeflow Pipelines UI metadata: render the anomalies as an inline CSV table.
        metadata = {
            "outputs": [
                {
                    "type": "table",
                    "storage": "inline",
                    "format": "csv",
                    "header": anomalies_df.columns.tolist(),
                    "source": anomalies_df.to_csv(header=False, index=False),
                },
            ]
        }
        with open("/mlpipeline-ui-metadata.json", "w") as f:
            json.dump(metadata, f)

Does anyone know a better way?
What do you think about splitting each display function into a transforming function and a visualizing function, as is already done for statistics?
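
To make the proposed split concrete, here is a rough sketch of what the two halves could look like. This is not an existing TFDV API: `anomalies_to_rows` and `rows_to_kfp_table` are hypothetical names, the code relies only on the proto fields used in the snippet above, and it skips the private quoting helper for simplicity.

```python
import csv
import io
from typing import Any, Dict, List, Tuple

Row = Tuple[str, str, str]

HEADER = [
    "Feature name",
    "Anomaly short description",
    "Anomaly long description",
]


def anomalies_to_rows(anomalies: Any) -> List[Row]:
    """Transforming half: pure data extraction, no IPython and no I/O.

    `anomalies` is expected to look like the Anomalies proto (anomaly_info,
    HasField, dataset_anomaly_info).
    """
    rows: List[Row] = [
        (name, info.short_description, info.description)
        for name, info in anomalies.anomaly_info.items()
    ]
    if anomalies.HasField("dataset_anomaly_info"):
        info = anomalies.dataset_anomaly_info
        rows.append(("[dataset anomaly]", info.short_description, info.description))
    return rows


def rows_to_kfp_table(rows: List[Row]) -> Dict[str, Any]:
    """Visualizing half: one possible consumer, an inline CSV table output."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return {
        "type": "table",
        "storage": "inline",
        "format": "csv",
        "header": list(HEADER),
        "source": buffer.getvalue(),
    }
```

With this shape, a notebook display function and a pipeline component could both call the same transforming function and differ only in how they present the rows.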

@brills
Contributor

brills commented Nov 23, 2020

What do you mean by "visualizable formatted" data?

The schema and stats are protocol buffer [1] objects. They implement __str__, so if you print() them you get the Protobuf Text Format [2], which is intended for humans to read. Internally at Google, our users review and modify the schema in text format.

[1] https://developers.google.com/protocol-buffers
[2] https://googleapis.dev/python/protobuf/latest/google/protobuf/text_format.html (sorry, the spec for the TextFormat is not open-source).

@wakanapo
Author

wakanapo commented Nov 24, 2020

Thanks, @brills, and sorry, my explanation was unclear.
By "visualizable formatted" data I just mean a table format, like the DataFrame created in display_schema or display_anomalies.

I want to visualize the result like this on Kubeflow Pipeline.
[Image: Screenshot 2020-11-20 16 02 51]

For that, I want to get the DataFrame created in display_anomalies. Of course, I can build it myself the same way display_anomalies does, but re-implementing the same logic feels like a waste of time. So it would help me if, for example, display_anomalies returned the DataFrame.

@brills
Contributor

brills commented Nov 24, 2020

Thanks for the clarification.

We have noted it in our internal bug tracker. What you suggested makes sense to me, but I'll first check with the Kubeflow team to understand what their UI is capable of displaying.

In the meantime, please keep using your "hack". As you can see, that piece of logic has been stable (and the part of the schema it extracts from has also been stable).

@wakanapo
Author

I understand. Thanks!

@kennysong

A vote of support for this feature.

I was trying to do exactly the same thing – DataFrames are much easier to work with than protos, especially for visualization in JS.

I also ended up copying the display_schema() and display_anomalies() code, but a proper library function would be great!
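
As an illustration of the JS use case, a small sketch that turns an anomaly table into JSON records a front end could consume. It assumes the table is already available as a header plus rows (for example from a copied display_anomalies); `table_to_records` is a hypothetical helper and the sample row is made up for illustration.

```python
import json
from typing import Dict, List, Sequence


def table_to_records(
    header: Sequence[str], rows: Sequence[Sequence[str]]
) -> List[Dict[str, str]]:
    """Convert a header plus rows into a list of {column: value} records."""
    return [dict(zip(header, row)) for row in rows]


header = [
    "Feature name",
    "Anomaly short description",
    "Anomaly long description",
]
# Hypothetical sample row, not real TFDV output.
rows = [["'age'", "Out-of-range values", "Hypothetical long description."]]

# JSON string that a JS visualization could parse directly.
payload = json.dumps(table_to_records(header, rows))
```

With pandas available, calling to_json(orient="records") on the DataFrame would produce the same shape.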
