This repo contains the code and data associated with an early pilot experiment on chatbot personalization from the project Generative Social Choice (paper, general audience report). This experiment was conducted in November 2023 as part of OpenAI's Democratic Inputs to AI program. We've since updated and improved our entire experimental pipeline, and conducted a follow-up experiment. If you want to build on our framework, we strongly recommend you use our new code and data (public link forthcoming). This repo only contains the necessary code to replicate the early pilot experiment on chatbot personalization.
Authors of Generative Social Choice: Sara Fish, Paul Gölz, David Parkes, Ariel Procaccia, Gili Rusak, Itai Shapira, and Manuel Wüthrich.
To set up the repo:

- In the folder where this `README.md` file is located, run `pip install -e .`
- Install dependencies: `pipenv install`
- Create a file `OPENAI_API_KEY` in `utils/`, and write your (personal) API key in it.
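For reference, here is a minimal sketch of how such a key file might be read and handed to the OpenAI SDK; this is an illustrative assumption, not the repo's actual loading code (which lives in `utils/`):

```python
# Minimal sketch (not the repo's actual loader): read the key file created above
# and register it with the OpenAI SDK.
from pathlib import Path

import openai

openai.api_key = Path("utils/OPENAI_API_KEY").read_text().strip()
```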
The repo is organized as follows:

- `data/` has all cleaned and anonymized data associated with the experiments in the paper:
  - `chatbot_personalization_data.csv`: our cleaned and anonymized survey data, collected on Prolific. Also available at the dedicated repo.
  - `validate_disc_query_logs.csv`: logs from our discriminative query validation experiment (Figure 1; replicate with `paper_replication/validate_discriminative_query.py`)
  - `gen_query_eval/`: logs from our generative query evaluation experiment (Figure 2; replicate with `paper_replication/gen_query_eval.py`)
  - `user_summaries_generation.csv` and `user_summaries_generation_raw_output.csv`: the user summaries (and logs) used in our slate generation (replicate with `paper_replication/generate_summaries.py`)
  - `ratings_and_matching.csv`: assignments of validation users to statements (Figures 4-5; replicate with `paper_replication/compute_matching.py`)
- `paper_replication/` has scripts for replicating the experiments in the paper:
  - `validate_discriminative_query.py`: validating discriminative queries (Figure 1)
  - `gen_query_eval.py`: evaluating generative queries (Figure 2)
  - `generate_summaries.py`: generating user summaries
  - `generate_slate.py`: generating the slate
  - `compute_matching.py`: computing the assignment of users to statements in the slate (Figures 4-5)
- `plots/` has code for generating each of the plots in the paper, as well as the plots themselves
- `queries/` has the implementation of the queries:
  - `query_chatbot_personalization.py` contains all of the chatbot-personalization-specific implementation
  - `query_interface.py` describes the interface for agents and generators. Anything that implements this interface should automatically work with our slate generation code (a hypothetical sketch of such an interface appears after this list).
- `slates/` has our implementation of the slate generation algorithm in `generate_slate_ensemble_greedy.py`
- `test/` has unit tests
- `utils/` has miscellaneous tools:
  - `gpt_wrapper.py` contains code for making LLM calls
  - `helper_functions.py` has `get_base_dir_path()` and `get_time_string()`
  - `dataframe_completion.py` contains code for dataframe-completion-style LLM calls, used for our summary generation and generative query
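For orientation, here is a hypothetical sketch of what an agent/generator interface in this spirit could look like. The class and method names below are illustrative assumptions; the actual interface is defined in `queries/query_interface.py` and may differ.

```python
# Illustrative sketch only: the real interface lives in queries/query_interface.py,
# and its class and method names may differ from this sketch.
from abc import ABC, abstractmethod


class Agent(ABC):
    """A survey participant whose opinion can be queried (discriminative query)."""

    @abstractmethod
    def get_approval(self, statement: str) -> float:
        """Return this agent's predicted approval/rating of `statement`."""


class Generator(ABC):
    """Proposes new candidate statements for a group of agents (generative query)."""

    @abstractmethod
    def generate(self, agents: list[Agent]) -> list[str]:
        """Return candidate statements tailored to the given agents."""
```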
Each figure in the paper can be generated using a dedicated notebook:

- Figure 1: `plots/fig1_disc_query_eval.ipynb`
- Figure 2: `plots/fig2_slate_composition.ipynb`
- Figure 3: N/A
- Figure 4: `plots/fig4_assigned_utilities_pie_chart.ipynb`
- Figure 5: `plots/fig5_assigned_utilities_histogram.ipynb`
To run unit tests with `gpt-4o-mini`, run the following command:

```
python -m unittest -k fast -v
```

To run unit tests using the exact LLMs used in the paper (for replication purposes), run the following command. This requires access to `gpt-4-base` and `gpt-4-32k-0613`.

```
python -m unittest -k replication -v
```

To run all unit tests, run the following command. This requires access to `gpt-4-base` and `gpt-4-32k-0613`.

```
python -m unittest -v
```
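The `-k` option selects tests whose names match the given pattern, so the "fast" and "replication" subsets are distinguished by test names. A hypothetical example (not taken from the repo's `test/` directory) of a test that `-k fast` would pick up:

```python
# Hypothetical example, not from the repo: unittest's -k option matches on
# (sub)strings of test case and method names, so "fast" in the name selects it.
import unittest
from datetime import datetime


class TestHelpersFast(unittest.TestCase):
    def test_fast_time_string_format(self):
        # Illustrative check of a timestamp format; the repo's real fast tests
        # exercise the pipeline with gpt-4o-mini.
        stamp = datetime(2023, 11, 1, 12, 0, 0).strftime("%Y-%m-%d_%H-%M-%S")
        self.assertEqual(stamp, "2023-11-01_12-00-00")


if __name__ == "__main__":
    unittest.main()
```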
The quickest and cheapest way to rerun our experiments is to use a more modern LLM such as `gpt-4o`. The commands below run the exact experiments from our paper, except that `gpt-4o` is used in place of `gpt-4-base` and `gpt-4-32k-0613`.
Generate summaries of all users:

```
python paper_replication/generate_summaries.py --model gpt-4o
```

Generate a summary for a single user (for testing):

```
python paper_replication/generate_summaries.py --model gpt-4o --num_agents 1
```

To run the full experiment empirically validating the discriminative query (600 LLM calls):

```
python paper_replication/validate_discriminative_query.py --model gpt-4o
```

To empirically validate a single discriminative query (for testing):

```
python paper_replication/validate_discriminative_query.py --model gpt-4o --num_samples 1
```

To run the full experiment empirically evaluating the generative query:

```
python paper_replication/gen_query_eval.py --model gpt-4o
```

To evaluate a single ensemble round (for testing):

```
python paper_replication/gen_query_eval.py --model gpt-4o --num_rounds 1
```

To generate a slate for all users:

```
python paper_replication/generate_slate.py --model gpt-4o
```

To generate a slate for only 10 users (for testing):

```
python paper_replication/generate_slate.py --model gpt-4o --num_agents 10
```
To "exactly" (subject to inherent LLM stochasticity) reproduce our experiments, run the below commands. These require access to gpt-4-base
and gpt-4-32k-0613
. These will write logs to data/chatbot_personalization/demo_data/
. To test on smaller sample sizes, use the --num_agents
and --num_samples
arguments (usage demonstrated above).
python paper_replication/generate_summaries.py --model default
python paper_replication/validate_discriminative_query.py --model default
python paper_replication/gen_query_eval.py --model default
python paper_replication/generate_slate.py --model default
Finally, compute the assignment of users to statements. This step uses Gurobi and makes no LLM calls.

```
python paper_replication/compute_matching.py
```
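As a rough illustration of the kind of optimization involved, here is a minimal assignment ILP in `gurobipy`. The utilities, the balanced-capacity constraint, and all names below are toy assumptions for illustration, not the formulation actually used in `compute_matching.py`:

```python
# Toy sketch of a user-to-statement assignment ILP in gurobipy.
# The data, constraints, and objective are illustrative assumptions only.
import gurobipy as gp
from gurobipy import GRB

# utilities[i][j]: rating of user i for statement j (toy numbers)
utilities = [
    [4, 1, 3],
    [2, 5, 2],
    [5, 2, 4],
    [1, 4, 5],
    [3, 3, 1],
    [2, 4, 4],
]
n_users, n_statements = len(utilities), len(utilities[0])
capacity = n_users // n_statements  # balanced assignment: equal users per statement

m = gp.Model("user_statement_matching")
x = m.addVars(n_users, n_statements, vtype=GRB.BINARY, name="assign")

# each user is assigned to exactly one statement
m.addConstrs((x.sum(i, "*") == 1 for i in range(n_users)), name="one_statement")
# each statement receives the same number of users
m.addConstrs((x.sum("*", j) == capacity for j in range(n_statements)), name="balanced")

# maximize total assigned utility
m.setObjective(
    gp.quicksum(utilities[i][j] * x[i, j] for i in range(n_users) for j in range(n_statements)),
    GRB.MAXIMIZE,
)
m.optimize()

for i in range(n_users):
    for j in range(n_statements):
        if x[i, j].X > 0.5:
            print(f"user {i} -> statement {j} (utility {utilities[i][j]})")
```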