Get formatted schema and anomalies to visualize #146

wakanapo · 2020-11-19T08:17:02Z

I'm trying to run a TFDV step in a Kubeflow pipeline and visualize the results in the pipeline UI.

For statistics, I can easily visualize using get_statistics_html.

However, for schema and anomalies, I struggled. There are display_schema and display_anomalies functions, but each one transforms the data and calls IPython display internally, so there is no way to get the formatted data out for visualization elsewhere.
In the end, I copied most of the display functions and changed them to return a DataFrame.

FYI, here is the code.

import json
import logging
from typing import Optional

import pandas as pd
import tensorflow_data_validation as tfdv
from tensorflow_data_validation.utils import display_util


def _transform_anomalies_to_df(anomalies) -> Optional[pd.DataFrame]:
    """Build the table that display_anomalies shows, but return it instead."""
    anomaly_rows = []
    # One row per feature-level anomaly.
    for feature_name, anomaly_info in anomalies.anomaly_info.items():
        anomaly_rows.append(
            [
                # Note: _add_quotes is a private helper of display_util.
                display_util._add_quotes(feature_name),
                anomaly_info.short_description,
                anomaly_info.description,
            ]
        )
    # Dataset-level anomalies are stored in a separate field.
    if anomalies.HasField("dataset_anomaly_info"):
        anomaly_rows.append(
            [
                "[dataset anomaly]",
                anomalies.dataset_anomaly_info.short_description,
                anomalies.dataset_anomaly_info.description,
            ]
        )

    if not anomaly_rows:
        logging.info("No anomalies found.")
        return None

    logging.warning(f"{len(anomaly_rows)} anomalies found.")
    return pd.DataFrame(
        anomaly_rows,
        columns=[
            "Feature name",
            "Anomaly short description",
            "Anomaly long description",
        ],
    )


def main(schema_file: str, stats_file: str, anomalies_file: str):
    schema = tfdv.load_schema_text(schema_file)
    stats = tfdv.load_statistics(stats_file)
    anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)
    tfdv.write_anomalies_text(anomalies, anomalies_file)

    anomalies_df = _transform_anomalies_to_df(anomalies)
    if anomalies_df is not None:
        # Inline CSV table for the Kubeflow Pipelines UI.
        metadata = {
            "outputs": [
                {
                    "type": "table",
                    "storage": "inline",
                    "format": "csv",
                    "header": anomalies_df.columns.tolist(),
                    "source": anomalies_df.to_csv(header=False, index=False),
                },
            ]
        }
        with open("/mlpipeline-ui-metadata.json", "w") as f:
            json.dump(metadata, f)

Does anyone know a better way?
What do you think about splitting each display function into a transformation function and a visualization function, as is already done for statistics?
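
To make the proposed split concrete, here is a minimal sketch of what it could look like. It is only an illustration: anomalies_to_rows and format_rows are hypothetical names, not part of TFDV, and a SimpleNamespace object stands in for the real Anomalies proto so the snippet runs without TFDV installed. The plain single-quote wrapping is a simplified stand-in for display_util._add_quotes.

```python
from types import SimpleNamespace

COLUMNS = (
    "Feature name",
    "Anomaly short description",
    "Anomaly long description",
)


def anomalies_to_rows(anomalies):
    """Transformation only: one (name, short, long) tuple per anomaly."""
    rows = [
        # Simplified quoting; stands in for display_util._add_quotes.
        (f"'{name}'", info.short_description, info.description)
        for name, info in anomalies.anomaly_info.items()
    ]
    if anomalies.HasField("dataset_anomaly_info"):
        info = anomalies.dataset_anomaly_info
        rows.append(("[dataset anomaly]", info.short_description, info.description))
    return rows


def format_rows(rows):
    """Presentation only: render the rows as plain text."""
    if not rows:
        return "No anomalies found."
    lines = [" | ".join(COLUMNS)]
    lines.extend(" | ".join(row) for row in rows)
    return "\n".join(lines)


# Hypothetical stand-in for an Anomalies proto with one feature anomaly.
fake_anomalies = SimpleNamespace(
    anomaly_info={
        "age": SimpleNamespace(
            short_description="Out-of-range values",
            description="Unexpectedly large value: 200.",
        )
    },
    dataset_anomaly_info=None,
    HasField=lambda field_name: False,
)

print(format_rows(anomalies_to_rows(fake_anomalies)))
```

With this shape, a notebook user would call the formatting function, while a pipeline step could take the rows and serialize them however its UI requires.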

brills · 2020-11-23T17:16:13Z

What do you mean by "visualizeable formatted" data?

The schema and stats are protocol buffer [1] objects. They implement __str__, so if you print() them you get Protobuf Text Format [2], which is intended to be human-readable. Internally at Google, our users review and modify the schema in text format.

[1] https://developers.google.com/protocol-buffers
[2] https://googleapis.dev/python/protobuf/latest/google/protobuf/text_format.html (sorry, the spec for the TextFormat is not open-source).

wakanapo · 2020-11-24T01:18:02Z

Thanks, @brills, and sorry, my wording was unclear.
By "visualizable formatted" data I just mean a table format, like the DataFrame created inside display_schema or display_anomalies.

I want to visualize the result like this on Kubeflow Pipeline.

For that, I need the DataFrame that display_anomalies creates. Of course, I can build it myself the same way display_anomalies does, but reimplementing the same logic feels like a waste of time. So it would be helpful if, for example, display_anomalies returned the DataFrame.
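
For reference, the inline table metadata used earlier in this thread can also be produced from plain rows, without pandas. The sketch below assumes the same metadata layout as the code in the first comment (type "table", storage "inline", format "csv"); build_table_metadata is a hypothetical helper name, and the sample row is made up for illustration.

```python
import csv
import io
import json


def build_table_metadata(header, rows):
    """Return UI metadata describing an inline CSV table."""
    buffer = io.StringIO()
    # csv.writer quotes any field that contains a comma or a newline.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return {
        "outputs": [
            {
                "type": "table",
                "storage": "inline",
                "format": "csv",
                "header": list(header),
                "source": buffer.getvalue(),
            }
        ]
    }


# Made-up sample row; the description deliberately contains a comma.
metadata = build_table_metadata(
    ["Feature name", "Anomaly short description"],
    [["'age'", "Out-of-range values, see min/max"]],
)
print(json.dumps(metadata, indent=2))
```

Using the csv module rather than joining strings by hand matters here because anomaly descriptions are free text and may contain commas.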

brills · 2020-11-24T01:52:50Z

Thanks for the clarification.

We noted it in our internal bug tracker. What you suggested makes sense to me. But I'll check w/ the Kubeflow team to understand what their UI is capable of displaying first.

In the meantime, please keep using your "hack". As you can see, that piece of logic has been stable (and so has the part of the schema it reads from).

wakanapo · 2020-11-24T02:13:37Z

I understand. Thanks!

kennysong · 2020-12-04T06:12:53Z

A vote of support for this feature.

I was trying to do exactly the same thing – DataFrames are much easier to work with than protos, especially for visualization in JS.

I also ended up copying the display_schema() and display_anomalies() code, but a proper library function would be great!

rmothukuru self-assigned this Nov 21, 2020

rmothukuru added the type:support label Nov 21, 2020

rmothukuru assigned paulgc and unassigned rmothukuru Nov 23, 2020

rmothukuru added the stat:awaiting tensorflower label Nov 23, 2020
