Commit

Merge pull request #56 from bigcode-project/loubnabnl-patch-2
Update README.md
loubnabnl authored Jul 27, 2023
2 parents 0b3c1ba + 7a0ff73 commit 3984fef
Showing 1 changed file with 9 additions and 3 deletions.
12 changes: 9 additions & 3 deletions pii/README.md
@@ -1,6 +1,12 @@
# PII detection and redaction for Emails, IP addresses and Secret keys
# PII detection and redaction for code datasets

We provide code to detect Emails, IP addresses and API/SSH keys in text datasets (in particular datasets of source code). We use regexes for emails and IP addresses (they are adapted from the [BigScience PII pipeline](https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/02_pii)), and we use [detect-secrets](https://github.com/Yelp/detect-secrets) to find secret keys. We additionally implement some filters on top to reduce the number of false positives. There is also some evaluation code to test the pipeline on a PII benchmark we annotated.
We provide code to detect Names, Emails, IP addresses, Passwords and API/SSH keys in text datasets (in particular datasets of source code).
## NER approach
For the **NER** model-based approach, see the `ner_model` folder.

## Regex approach
Below we explain the regex-based approach, which detects Emails, IP addresses and keys only:
We use regexes for emails and IP addresses (they are adapted from the [BigScience PII pipeline](https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training/02_pii)), and we use [detect-secrets](https://github.com/Yelp/detect-secrets) to find secret keys. We additionally implement some filters on top to reduce the number of false positives. There is also some evaluation code to test the pipeline on a PII benchmark we annotated.
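
As an illustration of the regex step only, here is a minimal sketch of email and IPv4 detection using the Python standard library. The patterns `EMAIL_RE` and `IPV4_RE`, the `detect_pii` helper and the tag names are assumptions made for this example; they are deliberately simplified and are not the regexes or filters shipped in this repository. The validity check with `ipaddress` stands in for the kind of false-positive filtering mentioned above.

```python
import ipaddress
import re

# Simplified, illustrative patterns (assumptions, not the repository's regexes).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Four dot-separated groups of 1-3 digits, not embedded in a longer dotted number.
IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def is_valid_ipv4(candidate):
    """Return True if the candidate parses as an IPv4 address (octets <= 255)."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def detect_pii(text):
    """Return detected emails and IPv4 addresses, sorted by start offset."""
    matches = []
    for m in EMAIL_RE.finditer(text):
        matches.append(
            {"tag": "EMAIL", "value": m.group(), "start": m.start(), "end": m.end()}
        )
    for m in IPV4_RE.finditer(text):
        # Drop candidates such as 999.1.1.1 that only look like IP addresses.
        if is_valid_ipv4(m.group()):
            matches.append(
                {"tag": "IP_ADDRESS", "value": m.group(), "start": m.start(), "end": m.end()}
            )
    return sorted(matches, key=lambda d: d["start"])


if __name__ == "__main__":
    sample = 'admin = "jane.doe@example.com"; host = "192.168.1.10"; ver = "1.2.3.4.5"'
    for match in detect_pii(sample):
        print(match)
```

Version-like strings such as `1.2.3.4.5` are a common source of false positives in source code, which is why the sketch refuses matches that are part of a longer dotted number.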


We also provide, in `ner_model`, the code used for training and running [StarPII](https://huggingface.co/bigcode/starpii), an NER model for PII detection on Names, Emails, Keys, Passwords & IP addresses (more details in our paper: [StarCoder: May The Source Be With You](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view)). We provide the code (and `slurm` scripts) used for running inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata); we were able to detect PII in ~800GB of text in 800 GPU-hours on A100 80GB. To replace secrets we used the following tokens:
@@ -30,4 +36,4 @@ Make sure you have the `gibberish_data` folder in the same directory as the scri

## Notebooks
* `example.ipynb` is an example notebook to show how to use the pipeline.
* there are several notebooks in the `notebooks` folder with some of our experiments.
* there are several notebooks in the `notebooks` folder with some of our experiments.
