SnipGen: A Dataset for Evaluating Large Language Models for Code via Prompt Engineering

Large Language Models (LLMs), transformer-based neural networks with billions of parameters, are becoming increasingly prevalent in software engineering. These models are trained on expansive datasets that encompass both natural and programming languages. However, assessing their efficacy on specific tasks is challenging due to their size and intricacy. To accurately gauge these models' capabilities and their evolving potential, it is imperative to employ appropriate inputs and varied examples while steering clear of biases.

To confront this challenge, we introduce SnipGen, a dataset that exploits prompt engineering across diverse downstream tasks for code generation. SnipGen furnishes meticulously curated data points to aid researchers and practitioners in assessing LLMs across various scenarios. Using a semi-automatic approach, we gathered approximately 227K data points from 338K recent changes to code bases on GitHub, with a focus on method-level granularity. Additionally, to ensure data quality, we carefully validated the collected samples.

Moreover, SnipGen offers a collection of templates that can be combined to formulate practical prompts for evaluating the performance of LLMs on different tasks. By providing both the dataset and the methodology for its construction, our goal is to empower researchers and practitioners to effectively evaluate and interpret the capabilities of LLMs.

Use Cases

The primary focus is on answering causal queries of interest related to LLMs. SnipGen enables ML researchers in software engineering to explain the causal effect of a set of confounders associated with the treatment, i.e., the input prompt. Some of the use cases include:

Code Auto-Completion:

SnipGen can be used to evaluate an LLM's performance in auto-completing code. Its non-contaminated code snippets can be used with our random-cut implementation, which cuts each code snippet at a random point after the function signature.
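As an illustration, a random cut over a single Python function might look like the sketch below; the function `random_cut` and its cutting heuristic are assumptions for illustration, not SnipGen's exact implementation.

```python
# Minimal sketch of a random cut for code-completion prompts, assuming each
# snippet is a single Python function: keep everything up to a random line
# strictly after the signature and ask the LLM to complete the rest.
import random

def random_cut(snippet: str, seed: int | None = None) -> str:
    rng = random.Random(seed)
    lines = snippet.splitlines(keepends=True)
    # Index of the line that closes the signature (first line ending with ":").
    sig_end = next(i for i, line in enumerate(lines) if line.rstrip().endswith(":"))
    # Keep at least the signature, but never the full body.
    keep = rng.randint(sig_end + 1, max(sig_end + 1, len(lines) - 1))
    return "".join(lines[:keep])

example = 'def add(a, b):\n    """Add two numbers."""\n    result = a + b\n    return result\n'
print(random_cut(example, seed=0))
```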

Code Summarization:

Additionally, SnipGen facilitates the examination of an LLM's capacity for code summarization. By leveraging the provided dataset and benchmark, researchers can investigate the influence of various confounders on the production of code summaries.

Code Generation:

ML researchers can use SnipGen to evaluate an LLM's proficiency in producing test cases. The benchmark provides a dedicated dataset tailored for test generation, enabling a comprehensive assessment and interpretation of code generation performance.

Bug Fixing:

SnipGen can also be used to analyze an LLM's effectiveness at bug fixing. Researchers can evaluate the causal relationship between different confounders and the accuracy of bug fixes generated by LLMs.

Vulnerability Detection:

We provide snippets with an identified vulnerability and its span position; a prompt can be used to ask the LLM to detect and fix the vulnerable code.

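A minimal sketch of assembling such a prompt is shown below; the sample fields ("code", "cwe", "span") and the wording are hypothetical, not the dataset's actual schema.

```python
# Hypothetical sketch: building a vulnerability-detection prompt from a sample.
# The keys used here ("code", "cwe", "span") are illustrative placeholders.
def vulnerability_prompt(sample: dict) -> str:
    start, end = sample["span"]  # line range of the flagged vulnerability
    return (
        f"The following Python snippet contains a potential {sample['cwe']} "
        f"vulnerability between lines {start} and {end}.\n"
        "Identify the issue and return a corrected version of the snippet.\n\n"
        f"{sample['code']}"
    )

sample = {
    "code": "import subprocess\n\ndef run(cmd):\n    subprocess.call(cmd, shell=True)\n",
    "cwe": "command injection (CWE-78)",
    "span": (4, 4),
}
print(vulnerability_prompt(sample))
```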

Testbed Generation and Curation:

The collection pipeline for this dataset is as follows:

(Figure: SnipGen testbed collection pipeline)

In the first step, popular GitHub repositories are filtered using the following query: language:Python fork:false size:>=30000 pushed:>2021-12-31 stars:>1000
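For illustration, the same filter could be issued against the GitHub search API roughly as follows; GITHUB_TOKEN is a placeholder, and pagination and rate-limit handling are omitted.

```python
# Sketch: filtering popular Python repositories with the GitHub search API,
# mirroring the query above. Replace GITHUB_TOKEN with a real token.
import requests

QUERY = "language:Python fork:false size:>=30000 pushed:>2021-12-31 stars:>1000"
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": QUERY, "sort": "stars", "order": "desc", "per_page": 100},
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": "Bearer GITHUB_TOKEN",
    },
)
resp.raise_for_status()
repos = [item["full_name"] for item in resp.json()["items"]]
print(f"{len(repos)} candidate repositories, e.g. {repos[:3]}")
```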

Given that ChatGPT and the other LLMs under analysis have a training data cutoff of roughly September 2021, we selected data from January 2, 2022 to January 1, 2023. We therefore claim that our testbeds help avoid data snooping.

We then extracted code- and documentation-related features from each data point. Next, using the Tree-sitter library, we parsed the AST of each data point and extracted AST-based variables. The resulting data points were de-duplicated, which reduced the testbed size to ~227k. Of these, ~77k data points had a valid docstring, that is, a docstring longer than 3 words, excluding inline comments.
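A rough sketch of computing two of these features (AST levels and AST errors) with the Python Tree-sitter bindings is shown below; the exact API depends on the py-tree-sitter version, and the feature definitions are simplified assumptions.

```python
# Sketch: extracting simple AST features from a snippet with Tree-sitter.
# Assumes py-tree-sitter >= 0.22 plus the tree-sitter-python grammar package;
# older versions build the Language/Parser objects slightly differently.
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def ast_features(code: str) -> dict:
    tree = parser.parse(code.encode("utf-8"))

    def depth(node, level=0):
        # Maximum nesting level of the syntax tree ("AST levels").
        return max((depth(child, level + 1) for child in node.children), default=level)

    def count_errors(node):
        # Tree-sitter marks unparseable regions with ERROR nodes.
        return int(node.type == "ERROR") + sum(count_errors(c) for c in node.children)

    return {"ast_levels": depth(tree.root_node), "ast_errors": count_errors(tree.root_node)}

print(ast_features("def add(a, b):\n    return a + b\n"))
```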

We then selected 960 of the ~227k data points from RawData and RawDataDocstring for manual validation. The remaining data points were verified automatically.

Our steps for verification are as follows (a sketch of a few of the automated checks appears after this list):

  • Verify that the push date of each commit is within the acceptable range (January 2, 2022 to January 1, 2023)
  • Confirm that the method associated with any given commit was actually changed in the changelog
  • Validate n_words
  • Validate confounders
  • Validate AST levels and AST errors using the Tree-sitter Playground
  • Validate cyclomatic complexity using PyCharm
  • Verify the whitespace count
  • Verify that the random cut falls after the function signature
  • Verify that the docstring is not empty
  • Remove all one-line functions and pass statements
  • Select functions with more than one return statement
  • Confirm that the summary is meaningful
  • Confirm that the summary/docstring accurately represents the code snippet's functionality
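The sketch below illustrates a few of the automated checks, assuming each sample exposes "code", "docstring", and "commit_date" fields; the field names and logic are illustrative rather than SnipGen's exact verification code.

```python
# Illustrative sketch of a few automated checks. Field names ("code",
# "docstring", "commit_date") are assumptions, not the dataset's exact schema.
import ast
from datetime import date

def passes_basic_checks(sample: dict) -> bool:
    # Push date must fall within the accepted window.
    if not (date(2022, 1, 2) <= sample["commit_date"] <= date(2023, 1, 1)):
        return False
    # For the docstring testbeds, the docstring must be longer than three words.
    if sample.get("docstring") is not None and len(sample["docstring"].split()) <= 3:
        return False
    # Drop one-line functions and functions whose body is only `pass`.
    fn = next(node for node in ast.walk(ast.parse(sample["code"]))
              if isinstance(node, ast.FunctionDef))
    if len(sample["code"].strip().splitlines()) <= 1:
        return False
    if all(isinstance(stmt, ast.Pass) for stmt in fn.body):
        return False
    return True

sample = {
    "code": "def add(a, b):\n    return a + b\n",
    "docstring": "Add two integers and return the sum.",
    "commit_date": date(2022, 6, 1),
}
print(passes_basic_checks(sample))  # True
```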

After this verification step, we sampled 3k data points from RawData to build five additional testbeds, each targeting a specific SE task. They are:

  • RandomCut
  • WithDocString
  • FromDocString

for analysis of code completion tasks, as well as:

  • CommitGen
  • SummarizationGen


The effectiveness of an LLM for code generation can be heavily influenced by the choice of prompt. How we formulate the questions and actions, and how we introduce the context, impacts both the results and the analysis of the LLMs. The necessary keywords and structure therefore constitute the key aspect of building a functional prompt. Additionally, prompts can be configured as single-step or multi-step: we can interact with the LLM and then, in a second prompt, provide more information or restrictions based on its answer. Chau et al. [1] describe how prompts can be combined in a multi-step configuration.

The following table lists the prompt templates that SnipGen can generate: five templates that support SE tasks using a single-step prompt configuration, and three processing templates that can be combined with the SE-task templates. The idea of a processing prompt is to guide and refine the answer generated in the first interaction. For instance, with SnipGen we can combine P1+P7, P3+P6, or P3+P7 for code completion.

(Table: SnipGen prompt templates)
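As an illustration, combining an SE-task template with a processing template could look like the sketch below; the template strings are hypothetical placeholders, since the actual P1-P7 wording is defined in the table above.

```python
# Illustrative sketch of combining SnipGen templates. The template strings are
# hypothetical placeholders, not the actual P1-P7 wording from the table above.
TEMPLATES = {
    "P1": "Complete the following Python method:\n{snippet}",
    "P3": "Write the body of this Python method from its signature and docstring:\n{snippet}",
    "P6": "Return only code, without any explanation.",
    "P7": "Preserve the original signature and coding style.",
}

def build_prompt(task_id: str, processing_id: str, snippet: str) -> str:
    # A single combined prompt, e.g. P1+P7 or P3+P6 for code completion.
    return TEMPLATES[task_id].format(snippet=snippet) + "\n\n" + TEMPLATES[processing_id]

print(build_prompt("P1", "P7", "def add(a, b):"))
```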
