This project leverages two techniques for falsified content generation at scale: Grover to detect and generate falsified scientific papers, and DCGAN to generate believable scientific images.
The goal is to use Apache Tika to extract text from a dataset of scientific publications and use that text to train a new Grover model. This model will generate "fake" scientific literature. The project also involves training a DCGAN on the associated images and using it to generate believable imagery to accompany the fake papers. Finally, 500 fake scientific-paper PDFs will be produced by creating LaTeX files and converting them to PDF format.
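As a minimal sketch of the text-extraction step, the snippet below uses the `tika` Python package to pull plain text from every PDF in a directory; the directory paths are hypothetical placeholders, not the actual dataset layout.

```python
import os
from tika import parser  # requires `pip install tika` and a local Java runtime

PDF_DIR = "papers/"          # hypothetical input directory of scientific PDFs
OUT_DIR = "extracted_text/"  # hypothetical output directory for plain-text files
os.makedirs(OUT_DIR, exist_ok=True)

for fname in os.listdir(PDF_DIR):
    if not fname.lower().endswith(".pdf"):
        continue
    parsed = parser.from_file(os.path.join(PDF_DIR, fname))
    text = parsed.get("content") or ""  # Tika returns None when nothing could be extracted
    out_path = os.path.join(OUT_DIR, fname.replace(".pdf", ".txt"))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text.strip())
```

The resulting text files can then be assembled into whatever training format the Grover pipeline expects.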
Run the following commands to install Grover from the repo:
- !cd /content && rm -rf grover && git clone https://github.com/rowanz/grover.git
- %cd /content/grover
- !pip install -r requirements-gpu.txt
Libraries used:
- from selenium import webdriver
- from selenium.webdriver.chrome.options import Options
- from selenium.webdriver.chrome.service import Service
- from selenium.webdriver.common.by import By
- from selenium.webdriver.support.ui import WebDriverWait
- from selenium.webdriver.support import expected_conditions as EC
- from selenium.webdriver.common.keys import Keys
- tika
- tensorflow-gpu==1.15.0
- tensorboard==2.8.0
- from tensorflow.python.util.tf_export import keras_export
- random
Grover detected 54.54% of the modified Bik papers correctly. It did not perform very well; this could be because the Grover models were trained for neural fake-news detection, which is a very different domain from scientific paper text. If we could train the model on academic paper data, the Grover discriminator might perform better.
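For reference, the detection rate was tallied as the fraction of papers Grover labeled "machine"-written; the sketch below illustrates the bookkeeping with hypothetical placeholder labels, not the actual discriminator output or paper counts.

```python
# Hypothetical per-paper verdicts from the Grover discriminator.
grover_labels = ["machine", "human", "machine", "human", "machine"]
true_labels = ["machine"] * len(grover_labels)  # every paper in this set was a modified (fake) one

n_correct = sum(g == t for g, t in zip(grover_labels, true_labels))
print(f"Detection accuracy: {n_correct / len(true_labels):.2%}")
```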
We ran into issues running DCGAN on the scientific figures because of multiple problems with the code. We opted instead for the Kaggle celebrity-faces dataset, constrained to 5 epochs with 500 samples each. The dataset size was reduced from 100,000 to 10,000 images because the full run would have required over 10 hours of uninterrupted compute instead of roughly 70 minutes. Each epoch of falsified images looked slightly clearer than the previous one. The first epoch produced 500 candidate fake faces, but they were very blurry and hard to recognize as faces. This is consistent with the discriminator loss not quite converging with the generator loss even after more than 150 iterations (plot on left). The fifth and final epoch generated 500 fake faces that were sharper and looked closer to real faces, and the losses converge fairly quickly and stabilize at similar values (plot on right).
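As a rough, hedged sketch of the DCGAN architecture used in this kind of setup (assuming the tf.keras API shipped with TensorFlow 1.15; the 64x64 image size and layer widths are illustrative assumptions, not the exact configuration we ran):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

LATENT_DIM = 100  # size of the random noise vector fed to the generator

def build_generator():
    # Upsamples a noise vector into a 64x64 RGB image via transposed convolutions.
    return models.Sequential([
        layers.Dense(8 * 8 * 256, input_shape=(LATENT_DIM,)),
        layers.Reshape((8, 8, 256)),
        layers.Conv2DTranspose(128, 5, strides=2, padding="same"),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(64, 5, strides=2, padding="same"),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="tanh"),
    ])

def build_discriminator():
    # Downsamples a 64x64 image to a single real/fake logit.
    return models.Sequential([
        layers.Conv2D(64, 5, strides=2, padding="same", input_shape=(64, 64, 3)),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.Conv2D(128, 5, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(1),  # raw logit; pair with a from_logits=True cross-entropy loss
    ])
```

In a full training loop the two networks are optimized alternately with binary cross-entropy losses, and logging both losses per iteration produces convergence plots like the ones described above.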
Examples of the falsely generated articles can be found under the fake_articles folder.