SnipGen: A Dataset for Evaluating Large Language Models for Code via Prompt Engineering

Large Language Models (LLMs), transformer-based neural networks with billions of parameters, are becoming increasingly prevalent in software engineering. These models are trained on expansive datasets that encompass both natural and programming languages. However, assessing their efficacy on specific tasks is challenging due to their size and intricacy. To accurately gauge these models' capabilities and their evolving potential, it is imperative to employ appropriate inputs and varied examples while steering clear of biases.

To confront this challenge, we introduce SnipGen, a dataset that exploits prompt engineering across diverse downstream tasks for code generation. SnipGen furnishes meticulously curated data points to aid researchers and practitioners in assessing LLMs across various scenarios. Using a semi-automatic approach, we gathered approximately 227K data points from 338K recent changes to code bases on GitHub, with a focus on method-level granularity. Additionally, to ensure data quality, we carefully validated the collected samples.

Moreover, SnipGen offers a collection of templates that can be combined to formulate practical prompts for evaluating the performance of LLMs on different tasks. By providing both the dataset and the methodology for its construction, our goal is to empower researchers and practitioners to effectively evaluate and interpret the capabilities of LLMs.

Use Cases

The primary focus is on answering causal queries of interest related to LLMs. SnipGen enables ML researchers in software engineering to explain the causal effect of a set of confounders associated with the treatment, i.e., the input prompt. Some of the use cases include:

Code Auto-Completion:

SnipGen can be used to evaluate an LLM's performance in auto-completing code. Its non-contaminated code snippets can be used with our random-cut implementation, which cuts each code snippet at a random point after the function signature.
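As an illustration, a random cut over a single Python function might look like the sketch below; the function `random_cut` and its cutting heuristic are assumptions for illustration, not SnipGen's exact implementation.

```python
# Minimal sketch of a random cut for code-completion prompts, assuming each
# snippet is a single Python function: keep everything up to a random line
# strictly after the signature and ask the LLM to complete the rest.
import random

def random_cut(snippet: str, seed: int | None = None) -> str:
    rng = random.Random(seed)
    lines = snippet.splitlines(keepends=True)
    # Index of the line that closes the signature (first line ending with ":").
    sig_end = next(i for i, line in enumerate(lines) if line.rstrip().endswith(":"))
    # Keep at least the signature, but never the full body.
    keep = rng.randint(sig_end + 1, max(sig_end + 1, len(lines) - 1))
    return "".join(lines[:keep])

example = 'def add(a, b):\n    """Add two numbers."""\n    result = a + b\n    return result\n'
print(random_cut(example, seed=0))
```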

Code Summarization:

Additionally, SnipGen facilitates the examination of an LLM's capacity for code summarization. By leveraging the provided dataset and benchmark, researchers can investigate the influence of various confounders on the production of code summaries.

Code Generation:

ML researchers can use SnipGen to evaluate an LLM's proficiency in producing test cases. The benchmark provides a dedicated dataset tailored for test generation, enabling a comprehensive assessment and interpretation of code generation performance.

Bug Fixing:

SnipGen can also be used to analyze an LLM's effectiveness at bug fixing. Researchers can evaluate the causal relationship between different confounders and the accuracy of bug fixes generated by LLMs.

Vulnerability Detection:

We provide snippets with an identified vulnerability and its span position; a prompt can be used to ask the LLM to detect and fix the vulnerable code.

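A minimal sketch of assembling such a prompt is shown below; the sample fields ("code", "cwe", "span") and the wording are hypothetical, not the dataset's actual schema.

```python
# Hypothetical sketch: building a vulnerability-detection prompt from a sample.
# The keys used here ("code", "cwe", "span") are illustrative placeholders.
def vulnerability_prompt(sample: dict) -> str:
    start, end = sample["span"]  # line range of the flagged vulnerability
    return (
        f"The following Python snippet contains a potential {sample['cwe']} "
        f"vulnerability between lines {start} and {end}.\n"
        "Identify the issue and return a corrected version of the snippet.\n\n"
        f"{sample['code']}"
    )

sample = {
    "code": "import subprocess\n\ndef run(cmd):\n    subprocess.call(cmd, shell=True)\n",
    "cwe": "command injection (CWE-78)",
    "span": (4, 4),
}
print(vulnerability_prompt(sample))
```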

Testbed Generation and Curation:

The collection pipeline for this dataset is as follows:

(Figure: SnipGen testbed collection pipeline)

In the first step, popular GitHub repositories are filtered using the following query: language:Python fork:false size:>=30000 pushed:>2021-12-31 stars:>1000
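For illustration, the same filter could be issued against the GitHub search API roughly as follows; GITHUB_TOKEN is a placeholder, and pagination and rate-limit handling are omitted.

```python
# Sketch: filtering popular Python repositories with the GitHub search API,
# mirroring the query above. Replace GITHUB_TOKEN with a real token.
import requests

QUERY = "language:Python fork:false size:>=30000 pushed:>2021-12-31 stars:>1000"
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": QUERY, "sort": "stars", "order": "desc", "per_page": 100},
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": "Bearer GITHUB_TOKEN",
    },
)
resp.raise_for_status()
repos = [item["full_name"] for item in resp.json()["items"]]
print(f"{len(repos)} candidate repositories, e.g. {repos[:3]}")
```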

Given that ChatGPT and the other LLMs under analysis have a training data cutoff of roughly September 2021, we selected data from January 2, 2022 to January 1, 2023. We therefore claim that our testbeds help avoid data snooping.

We then extracted code- and documentation-related features from each data point. Next, using the Tree-sitter library, we parsed the AST of each data point and extracted AST-based variables. The resulting data points were de-duplicated, which reduced the testbed size to ~227k. Of these, ~77k data points had a valid docstring, that is, a docstring longer than 3 words, excluding inline comments.
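A rough sketch of computing two of these features (AST levels and AST errors) with the Python Tree-sitter bindings is shown below; the exact API depends on the py-tree-sitter version, and the feature definitions are simplified assumptions.

```python
# Sketch: extracting simple AST features from a snippet with Tree-sitter.
# Assumes py-tree-sitter >= 0.22 plus the tree-sitter-python grammar package;
# older versions build the Language/Parser objects slightly differently.
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def ast_features(code: str) -> dict:
    tree = parser.parse(code.encode("utf-8"))

    def depth(node, level=0):
        # Maximum nesting level of the syntax tree ("AST levels").
        return max((depth(child, level + 1) for child in node.children), default=level)

    def count_errors(node):
        # Tree-sitter marks unparseable regions with ERROR nodes.
        return int(node.type == "ERROR") + sum(count_errors(c) for c in node.children)

    return {"ast_levels": depth(tree.root_node), "ast_errors": count_errors(tree.root_node)}

print(ast_features("def add(a, b):\n    return a + b\n"))
```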

We then selected 960 of the ~227k data points from RawData and RawDataDocstring for manual validation. The remaining data points were verified automatically.

Our steps for verification are as follows (a sketch of a few of the automated checks appears after this list):

  • Verify that the push date of each commit is within the acceptable range (January 2, 2022 to January 1, 2023)
  • Confirm that the method associated with any given commit was actually changed in the changelog
  • Validate n_words
  • Validate confounders
  • Validate AST levels and AST errors using the Tree-sitter Playground
  • Validate cyclomatic complexity using PyCharm
  • Verify the whitespace count
  • Verify that the random cut falls after the function signature
  • Verify that the docstring is not empty
  • Remove all one-line functions and pass statements
  • Select functions with more than one return statement
  • Confirm that the summary is meaningful
  • Confirm that the summary/docstring accurately represents the code snippet's functionality
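The sketch below illustrates a few of the automated checks, assuming each sample exposes "code", "docstring", and "commit_date" fields; the field names and logic are illustrative rather than SnipGen's exact verification code.

```python
# Illustrative sketch of a few automated checks. Field names ("code",
# "docstring", "commit_date") are assumptions, not the dataset's exact schema.
import ast
from datetime import date

def passes_basic_checks(sample: dict) -> bool:
    # Push date must fall within the accepted window.
    if not (date(2022, 1, 2) <= sample["commit_date"] <= date(2023, 1, 1)):
        return False
    # For the docstring testbeds, the docstring must be longer than three words.
    if sample.get("docstring") is not None and len(sample["docstring"].split()) <= 3:
        return False
    # Drop one-line functions and functions whose body is only `pass`.
    fn = next(node for node in ast.walk(ast.parse(sample["code"]))
              if isinstance(node, ast.FunctionDef))
    if len(sample["code"].strip().splitlines()) <= 1:
        return False
    if all(isinstance(stmt, ast.Pass) for stmt in fn.body):
        return False
    return True

sample = {
    "code": "def add(a, b):\n    return a + b\n",
    "docstring": "Add two integers and return the sum.",
    "commit_date": date(2022, 6, 1),
}
print(passes_basic_checks(sample))  # True
```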

After this verification step, we sampled 3k data points from RawData to build five additional testbeds, each targeting a specific SE task. They are:

  • RandomCut
  • WithDocString
  • FromDocString

for analysis of code completion tasks, as well as:

  • CommitGen
  • SummarizationGen


The effectiveness of an LLM for code generation can be heavily influenced by the choice of prompt. How we formulate the questions and actions, and how we introduce the context, impacts both the results and the analysis of the LLMs. The necessary keywords and structure therefore constitute the key aspect of building a functional prompt. Additionally, prompts can be configured as single-step or multi-step: we can interact with the LLM and then, in a second prompt, provide more information or restrictions based on its answer. Chau et al. [1] describe how prompts can be combined in a multi-step configuration.

The following table lists the prompt templates that SnipGen can generate: five templates that support SE tasks using a single-step prompt configuration, and three processing templates that can be combined with the SE-task templates. The idea of a processing prompt is to guide and refine the answer generated in the first interaction. For instance, with SnipGen we can combine P1+P7, P3+P6, or P3+P7 for code completion.

(Table: SnipGen prompt templates)
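As an illustration, combining an SE-task template with a processing template could look like the sketch below; the template strings are hypothetical placeholders, since the actual P1-P7 wording is defined in the table above.

```python
# Illustrative sketch of combining SnipGen templates. The template strings are
# hypothetical placeholders, not the actual P1-P7 wording from the table above.
TEMPLATES = {
    "P1": "Complete the following Python method:\n{snippet}",
    "P3": "Write the body of this Python method from its signature and docstring:\n{snippet}",
    "P6": "Return only code, without any explanation.",
    "P7": "Preserve the original signature and coding style.",
}

def build_prompt(task_id: str, processing_id: str, snippet: str) -> str:
    # A single combined prompt, e.g. P1+P7 or P3+P6 for code completion.
    return TEMPLATES[task_id].format(snippet=snippet) + "\n\n" + TEMPLATES[processing_id]

print(build_prompt("P1", "P7", "def add(a, b):"))
```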
