This project leverages two techniques for falsified content generation at scale: Grover to detect and generate falsified scientific papers, and DCGAN to generate believable scientific images.
The goal is to use Apache Tika to extract text from a dataset of scientific publications and use that text to train a new Grover model. This model will generate "fake" scientific literature. The project also involves training a DCGAN on the associated images and using it to generate believable imagery to accompany the fake papers. Finally, 500 fake scientific-paper PDFs will be produced by creating LaTeX files and converting them to PDF format.
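As a minimal sketch of the text-extraction step, the snippet below uses the `tika` Python package to pull plain text from every PDF in a directory; the directory paths are hypothetical placeholders, not the actual dataset layout.

```python
import os
from tika import parser  # requires `pip install tika` and a local Java runtime

PDF_DIR = "papers/"          # hypothetical input directory of scientific PDFs
OUT_DIR = "extracted_text/"  # hypothetical output directory for plain-text files
os.makedirs(OUT_DIR, exist_ok=True)

for fname in os.listdir(PDF_DIR):
    if not fname.lower().endswith(".pdf"):
        continue
    parsed = parser.from_file(os.path.join(PDF_DIR, fname))
    text = parsed.get("content") or ""  # Tika returns None when nothing could be extracted
    out_path = os.path.join(OUT_DIR, fname.replace(".pdf", ".txt"))
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text.strip())
```

The resulting text files can then be assembled into whatever training format the Grover pipeline expects.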
Run the following commands to install Grover from the repo:
- !cd /content && rm -rf grover && git clone https://github.com/rowanz/grover.git
- %cd /content/grover
- !pip install -r requirements-gpu.txt
Libraries used:
- from selenium import webdriver
- from selenium.webdriver.chrome.options import Options
- from selenium.webdriver.chrome.service import Service
- from selenium.webdriver.common.by import By
- from selenium.webdriver.support.ui import WebDriverWait
- from selenium.webdriver.support import expected_conditions as EC
- from selenium.webdriver.common.keys import Keys
- tika
- tensorflow-gpu==1.15.0
- tensorboard==2.8.0
- from tensorflow.python.util.tf_export import keras_export
- random
Grover detected 54.54% of the modified Bik papers correctly. It did not perform very well; this could be because the Grover models were trained for neural fake-news detection, which is a very different domain from scientific paper text. If we could train the model on academic paper data, the Grover discriminator might perform better.
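For reference, the detection rate was tallied as the fraction of papers Grover labeled "machine"-written; the sketch below illustrates the bookkeeping with hypothetical placeholder labels, not the actual discriminator output or paper counts.

```python
# Hypothetical per-paper verdicts from the Grover discriminator.
grover_labels = ["machine", "human", "machine", "human", "machine"]
true_labels = ["machine"] * len(grover_labels)  # every paper in this set was a modified (fake) one

n_correct = sum(g == t for g, t in zip(grover_labels, true_labels))
print(f"Detection accuracy: {n_correct / len(true_labels):.2%}")
```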
We ran into issues running DCGAN on the scientific figures because of multiple problems with the code. We opted instead for the Kaggle celebrity-faces dataset, constrained to 5 epochs with 500 samples each. The dataset size was reduced from 100,000 to 10,000 images because the full run would have required over 10 hours of uninterrupted compute instead of roughly 70 minutes. Each epoch of falsified images looked slightly clearer than the previous one. The first epoch produced 500 candidate fake faces, but they were very blurry and hard to recognize as faces. This is consistent with the discriminator loss not quite converging with the generator loss even after more than 150 iterations (plot on left). The fifth and final epoch generated 500 fake faces that were sharper and looked closer to real faces, and the losses converge fairly quickly and stabilize at similar values (plot on right).
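As a rough, hedged sketch of the DCGAN architecture used in this kind of setup (assuming the tf.keras API shipped with TensorFlow 1.15; the 64x64 image size and layer widths are illustrative assumptions, not the exact configuration we ran):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

LATENT_DIM = 100  # size of the random noise vector fed to the generator

def build_generator():
    # Upsamples a noise vector into a 64x64 RGB image via transposed convolutions.
    return models.Sequential([
        layers.Dense(8 * 8 * 256, input_shape=(LATENT_DIM,)),
        layers.Reshape((8, 8, 256)),
        layers.Conv2DTranspose(128, 5, strides=2, padding="same"),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(64, 5, strides=2, padding="same"),
        layers.BatchNormalization(),
        layers.LeakyReLU(0.2),
        layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="tanh"),
    ])

def build_discriminator():
    # Downsamples a 64x64 image to a single real/fake logit.
    return models.Sequential([
        layers.Conv2D(64, 5, strides=2, padding="same", input_shape=(64, 64, 3)),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.Conv2D(128, 5, strides=2, padding="same"),
        layers.LeakyReLU(0.2),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(1),  # raw logit; pair with a from_logits=True cross-entropy loss
    ])
```

In a full training loop the two networks are optimized alternately with binary cross-entropy losses, and logging both losses per iteration produces convergence plots like the ones described above.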
Examples of the falsely generated articles can be found under the fake_articles folder.