Merge pull request #57 from bigcode-project/pii-ner
Add PII redaction for NER method
loubnabnl authored Aug 24, 2023
2 parents e430b4a + 8016ff0 commit 4f2c75a
Showing 32 changed files with 1,317 additions and 6 deletions.
11 changes: 5 additions & 6 deletions pii/README.md
@@ -2,17 +2,16 @@

We provide code to detect Names, Emails, IP addresses, Passwords, and API/SSH keys in text datasets (in particular datasets of source code).
## NER approach
For the **NER** model-based approach, go to the `ner_model` folder.
For the **NER** model-based approach (e.g. [StarPII](https://huggingface.co/bigcode/starpii)), please go to the `ner` folder.

We provide the code used for training a PII NER model to detect: Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)). You will also find the code (and `slurm` scripts) used for running PII inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata); we were able to detect PII in ~800GB of text in 800 GPU-hours on A100 80GB GPUs. To replace secrets we used the following tokens:
`<NAME>`, `<EMAIL>`, `<KEY>`, `<PASSWORD>`.
To mask IP addresses, we randomly selected an IP address from 5 synthetic, private, non-internet-facing IP addresses of the same type.
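
For illustration only, the sketch below shows how detected PII spans might be substituted with these tokens and with synthetic IP addresses. The entity format, the `redact_text` helper, and the IP pool here are hypothetical stand-ins, not the pipeline's actual code (the real redaction logic lives in `pii/ner/pii_redaction`):

```python
import random

# Hypothetical format: entities are (start, end, tag) spans over `text`.
REPLACEMENT_TOKENS = {"NAME": "<NAME>", "EMAIL": "<EMAIL>", "KEY": "<KEY>", "PASSWORD": "<PASSWORD>"}
SYNTHETIC_IPS = ["10.0.0.1", "172.16.31.10", "192.168.3.11"]  # illustrative private addresses

def redact_text(text, entities):
    # replace spans from the end so earlier offsets stay valid
    for start, end, tag in sorted(entities, key=lambda e: e[0], reverse=True):
        replacement = random.choice(SYNTHETIC_IPS) if tag == "IP_ADDRESS" else REPLACEMENT_TOKENS[tag]
        text = text[:start] + replacement + text[end:]
    return text

print(redact_text("contact me at jane@doe.com", [(14, 26, "EMAIL")]))
# -> contact me at <EMAIL>
```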

## Regex approach
Below we explain the regex-based approach to detect Emails, IP addresses and Keys only:
We use regexes for emails and IP addresses (adapted from the [BigScience PII pipeline](https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/02_pii)), and [detect-secrets](https://github.com/Yelp/detect-secrets) for finding secret keys. We additionally implement some filters on top to reduce the number of false positives. There is also some evaluation code to test the pipeline on a PII benchmark we annotated.
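
As a rough illustration of the regex side, here is a minimal sketch with simplified patterns and one false-positive filter; these are stand-ins, not the exact regexes or filters used in the pipeline:

```python
import re
import ipaddress

# Simplified, illustrative patterns (the pipeline's actual regexes are more thorough).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_pii_candidates(text: str):
    """Return (tag, match) pairs for emails and IP addresses in `text`."""
    matches = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    for m in IP_RE.finditer(text):
        try:
            ipaddress.ip_address(m.group())  # false-positive filter: keep only valid addresses
        except ValueError:
            continue  # e.g. "999.888.777.666" looks like an IP but is not one
        matches.append(("IP_ADDRESS", m.group()))
    return matches

print(find_pii_candidates("ping 8.8.8.8 or mail admin@example.com"))
```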


We also provide the code used for training and running [StarPII](https://huggingface.co/bigcode/starpii) (in `ner_model`), an NER model for PII detection of Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)). We provide the code (and `slurm` scripts) used for running inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata); we were able to detect PII in ~800GB of text in 800 GPU-hours on A100 80GB GPUs. To replace secrets we used the following tokens:
`<NAME>`, `<EMAIL>`, `<KEY>`, `<PASSWORD>`.
To mask IP addresses, we randomly selected an IP address from 5 synthetic, private, non-internet-facing IP addresses of the same type.

## Usage of the regex approach
```
pip install -r requirements.txt
```
7 changes: 7 additions & 0 deletions pii/ner/README.md
@@ -0,0 +1,7 @@
# PII detection and Redaction using an NER model
Here we provide code to:
- fine-tune an encoder model (like [StarEncoder](https://huggingface.co/bigcode/starencoder)) for the task of PII detection (NER): see folder `pii_train_ner`
- run inference with our fine-tuned [StarPII](https://huggingface.co/bigcode/starpii) for PII detection on multiple GPUs: see folder `pii_inference`
- redact/mask PII detected with the model: see folder `pii_redaction`

This is the code we used for PII anonymization in the 800GB dataset [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata).
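
As a quick sanity check outside of these folders, the fine-tuned model can be loaded with the `transformers` token-classification pipeline. This is a minimal sketch; model access/gating and the exact label set are assumptions to verify on the [StarPII](https://huggingface.co/bigcode/starpii) model card:

```python
from transformers import pipeline

# Minimal sketch: run StarPII on a single snippet (assumes access to the model on the Hub).
pii_detector = pipeline(
    "token-classification",
    model="bigcode/starpii",
    aggregation_strategy="simple",  # merge sub-token predictions into entity spans
)

code = 'smtp_user = "jane.doe@example.com"  # server at 192.168.1.10'
for entity in pii_detector(code):
    print(entity["entity_group"], entity["start"], entity["end"], entity["word"])
```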
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
14 changes: 14 additions & 0 deletions pii/ner/pii_redaction/README.md
@@ -0,0 +1,14 @@
# PII redaction
To run PII redaction on a dataset that went through PII detection with StarPII using the code in the `./pii_inference` folder:
```bash
mkdir ./logs
# DATA_PATH: path (or HF repo name) of the dataset produced by the PII detection step
LANG=python
python main_redact.py --dataset_name $DATA_PATH --target_dataset $LANG-no-pii --save_path_disk $LANG-no-pii-local
```

To run multiple `slurm` jobs, one for each programming language:

```bash
python run_pii_slurm.py --start 0 --end 88
```
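
The redaction script exposes more options than shown above (see the argument parser in `main_redact.py` for the full list). For example, to push the redacted dataset to the Hub instead of saving manual shards, something like the following should work; the dataset name and username are placeholders:

```bash
python main_redact.py \
    --dataset_name $DATA_PATH \
    --save_mode hub \
    --target_dataset python-no-pii \
    --hub_username my-hf-username \
    --num_proc 96 \
    --batch_size 100
```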
340 changes: 340 additions & 0 deletions pii/ner/pii_redaction/main_redact.py
@@ -0,0 +1,340 @@
"""Mask detected PII in a dataset.
"""

import argparse
import json
import logging
import random
import time
import numpy as np
from functools import partial
from pprint import pformat

from datasets import load_dataset
from datasets.utils.logging import set_verbosity_info

from manual_sharding import save_manual_shards
from utils import get_replacements, redact_pii_batch


REPONAME_TOKEN = "<reponame>"
FILENAME_TOKEN = "<filename>"
STARS_TOKEN = "<gh_stars>"


def get_num_stars_bucket(num_stars: int) -> str:
    if num_stars is None or num_stars == 0:
        return "0"
    elif num_stars <= 10:
        return "1-10"
    elif num_stars <= 100:
        return "10-100"
    elif num_stars <= 1000:
        return "100-1000"
    else:
        return "1000+"


def content_with_meta(example):
    # prepend repo-name, file-name and star-count metadata tokens, each with probability 0.2
    res = ""
    # repo-name
    if np.random.binomial(n=1, p=0.2):
        res += f"{REPONAME_TOKEN}{example['max_stars_repo_name']}"
    # file-name
    if np.random.binomial(n=1, p=0.2):
        res += f"{FILENAME_TOKEN}{example['max_stars_repo_path']}"
    # number of stars
    if np.random.binomial(n=1, p=0.2):
        num_stars = get_num_stars_bucket(example["max_stars_count"])
        res += f"{STARS_TOKEN}{num_stars}"
    if len(res) > 0:
        res += "\n"
    res += example["content"]

    return {"content_with_meta": res}


def parseArgs():
    parser = argparse.ArgumentParser(description="PII detection and redaction")
    parser.add_argument(
        "--dataset_name",
        default="bigcode/pii-for-code",
        type=str,
        help="HF repo name/path of the dataset.",
    )
    # add metadata to the text if set
    parser.add_argument(
        "--add_metadata",
        action="store_true",
        help="If set, we add metadata to the text",
    )
    parser.add_argument(
        "--num_load_proc",
        default=64,
        type=int,
        help="Number of processes to use for loading the dataset",
    )
    parser.add_argument(
        "--text_column",
        default="content",
        type=str,
        help="Text column to use; it will be renamed to content",
    )
    parser.add_argument(
        "--split",
        default="train",
        type=str,
        help="Dataset split to process",
    )
    parser.add_argument(
        "--batch_size",
        default=100,
        type=int,
        help="Batch size for the PII detection/redaction",
    )
    parser.add_argument(
        "--seed",
        default=0,
        type=int,
        help="Seed for random",
    )
    parser.add_argument(
        "--num_proc",
        default=96,
        type=int,
        help="Number of processes to use for the PII detection/redaction",
    )
    parser.add_argument(
        "--no_redaction",
        action="store_true",
        help="If set, we don't perform redaction",
    )
    parser.add_argument(
        "--load_replacements",
        default=True,
        help="If set, we load the replacements from file replacements.json",
    )
    parser.add_argument(
        "--add_reference_text",
        default=True,
        type=bool,
        help="If True we add the reference text with PII between delimiters \
            in the redacted text (used for visualization)",
    )
    parser.add_argument(
        "--check_all_files",
        action="store_true",
        help="If set, we check all files, not only the ones that contain PII",
    )
    parser.add_argument(
        "--check_sampling_size",
        default=0,
        type=int,
        help="Number of samples to check for PII",
    )
    # for saving the dataset: either push to HF or save locally with datasets or save manual shards
    parser.add_argument(
        "--save_mode",
        default="manual_shards",
        type=str,
        choices=["hub", "local", "manual_shards"],
        help="How to save the dataset",
    )
    parser.add_argument(
        "--save_mode_checks",
        default="hub",
        type=str,
        choices=["hub", "local", "manual_shards"],
        help="How to save the checks dataset",
    )
    # name of the dataset on the hub
    parser.add_argument(
        "--target_dataset",
        default="bigcode-pii2",
        type=str,
        help="HF repo name of the target dataset in save_mode=hub.",
    )
    parser.add_argument(
        "--hub_username",
        default="loubnabnl",
        type=str,
        help="Username for the hub",
    )
    parser.add_argument(
        "--save_path_disk",
        default="/fsx/loubna/data/the-stack-march-no-pii",
        type=str,
        help="Path to save the dataset on disk in save_mode=local.",
    )
    return parser.parse_args()


def get_check_ds(ds, args):
    if not args.check_all_files:
        ds_checks = ds.filter(
            lambda exs: exs["modified"],
            batched=True,
            batch_size=args.batch_size,
            num_proc=args.num_proc,
        )
    else:
        ds_checks = ds
    # sample `check_sampling_size` examples for inspection (use all of them if it is 0)
    sampling_size = (
        args.check_sampling_size if args.check_sampling_size else len(ds_checks)
    )
    idx_samples = random.sample(
        range(len(ds_checks)), min(len(ds_checks), sampling_size)
    )
    ds_checks = ds_checks.select(idx_samples)

    return ds_checks


def check_uniques(example, uniques):
"""Check if current id is still in set of unique id and remove if true."""
if example["id"] in uniques:
uniques.remove(example["id"])
return True
else:
return False


def main():
    set_verbosity_info()
    args = parseArgs()
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
        handlers=[
            logging.FileHandler(f"logs/pii-{args.dataset_name.split('/')[-1]}.log"),
            logging.StreamHandler(),
        ],
    )
    logger.info(
        f"** The job is running with the following arguments: **\n{args}\n **** "
    )

    logger.info(f" ===== Loading {args.dataset_name} =====")
    ds = load_dataset(
        args.dataset_name,
        split=args.split,
        use_auth_token=True,
        num_proc=args.num_load_proc,
    )
    if args.text_column != "content":
        ds = ds.rename_column(args.text_column, "content")

    logger.info(f" ===== Deduplicating dataset =====")
    # Deduplication based on ids
    uniques = set(ds["id"])
    frac = len(uniques) / len(ds)
    logger.info(f"Fraction of duplicates: {1-frac:.2%}")
    logger.info(f"Dataset:\n{ds}")
    # Deduplicate data and apply heuristics
    t_start = time.time()
    ds_pii = ds.filter(check_uniques, fn_kwargs={"uniques": uniques})
    logger.info(f"Time to filter dataset: {time.time()-t_start:.2f}")
    logger.info(f"Dataset after dedup:\n{ds_pii}")

    logger.info(
        f"Number of samples that contained PII: {sum([1 if x['entities'] else 0 for x in ds_pii])}"
    )
    # logger.info(
    #     f"Total number of secrets found: {sum([len(x['entities']) for x in ds_pii])}"
    # )

    # redact PII in the dataset
    logger.info(f" ===== Applying PII redaction =====")
    random.seed(args.seed)

    replacements = get_replacements()
    with open("replacements.json", "w") as f:
        json.dump(replacements, f)
    logging.info(f"Using the following replacements:\n{pformat(replacements)}")
    ds_pii = ds_pii.map(
        partial(
            redact_pii_batch,
            replacements=replacements,
            add_references=args.add_reference_text,
        ),
        batched=True,
        batch_size=args.batch_size,
        num_proc=args.num_proc,
    )
    logging.info(f"Dataset info after PII redaction:\n{ds_pii}")

    # check the dataset
    logger.info(
        f" ===== Checking {args.check_sampling_size} samples from those modified in the dataset ====="
    )
    ds_checks = get_check_ds(ds_pii, args)

    # save checks dataset
    if len(ds_checks) == 0:
        logger.info("Dataset was empty. Not saving anything.")
    else:
        logger.info(f"Checks dataset info {ds_checks}")
        if args.save_mode_checks == "hub":
            logger.info(
                f"Pushing the checks dataset to the Hub as {args.target_dataset}_checks"
            )
            ds_checks.push_to_hub(args.target_dataset + "_checks", private=True)

        elif args.save_mode_checks == "local":
            logger.info(f"Saving the checks dataset to disk")
            ds_checks.save_to_disk(args.save_path_disk + "_checks")

        elif args.save_mode_checks == "manual_shards":
            logger.info(f"Saving the checks dataset in manual shards")
            save_manual_shards(
                ds_checks,
                user=args.hub_username,
                remote_dataset_repo=args.target_dataset + "_checks",
                local_dir="/fsx/loubna/data/the-stack-march-no-pii_checks",
            )

    logger.info("Removing columns that are not needed for the final dataset")
    columns = ["content", "modified", "entities"]
    if args.add_reference_text:
        columns.append("references")
    ds_pii = ds_pii.remove_columns(columns)
    ds_pii = ds_pii.rename_column("new_content", "content")
    logger.info(f"Dataset info after removing columns:\n{ds_pii}")

    if args.add_metadata:
        logger.info(f" ===== Adding metadata =====")
        ds_pii = ds_pii.map(
            content_with_meta, remove_columns=["content"], num_proc=args.num_proc
        )
        ds_pii = ds_pii.rename_column("content_with_meta", "content")

    # save the final dataset
    if args.save_mode == "hub":
        logger.info(
            f" ===== Pushing the dataset to the Hub as: {args.target_dataset} ====="
        )
        ds_pii.push_to_hub(args.target_dataset, private=True)

    elif args.save_mode == "local":
        logger.info(f" ===== Saving the dataset to disk =====")
        ds_pii.save_to_disk(args.save_path_disk)

    elif args.save_mode == "manual_shards":
        logger.info(
            f" ===== Saving the dataset in manual shards to {args.save_path_disk} ====="
        )
        save_manual_shards(
            ds_pii,
            user=args.hub_username,
            remote_dataset_repo="the-stack-no-pii-march",
            local_dir=args.save_path_disk,
        )

    logger.info(f" ===== Dataset saved successfully =====")


if __name__ == "__main__":
    main()