This page is for planning the janitor package at a high level. Narrower questions and ideas can be handled via GitHub issues. This page is for, say, articulating what the package does or doesn't do and how it should be organized. If it turns out we need a more discussion- and comment-friendly format, we can move to Google Docs, but let's try commenting and editing here first.
Provide a framework and associated functions for checking and cleaning dirty data. There are two kinds of checks: interactive checks, like tabyl, and programmatic checks that, say, confirm in production that some variables contain no duplicate records or no missing values.
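To make the interactive/programmatic distinction concrete, here is a minimal sketch in base R. The data frame is made up, and `has_no_dupes()` and `has_no_missing()` are hypothetical names used only for illustration; they are not existing janitor functions and not a proposal for the final API.

```r
# Toy data (made up): one duplicated id and one missing score.
dat <- data.frame(
  id    = c(1, 2, 2, 3),
  score = c(10, NA, 7, 5)
)

# Interactive check: print counts and eyeball them, NAs included.
table(dat$id, useNA = "ifany")

# Programmatic checks: return a single TRUE/FALSE that a script can act on.
# Both function names are hypothetical.
has_no_dupes <- function(df, cols) {
  # anyDuplicated() returns 0 when no row is repeated across `cols`
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

has_no_missing <- function(df, cols) {
  # is.na() on a data.frame yields a logical matrix; any() collapses it
  !any(is.na(df[, cols, drop = FALSE]))
}

has_no_dupes(dat, "id")       # FALSE: id 2 appears twice
has_no_missing(dat, "score")  # FALSE: one score is NA
```

In a production script, a check like this could be wrapped in `stopifnot()` so the run halts when it fails. More thorough alerting of that kind is probably better left to a dedicated package such as assertr.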
- A thorough set of alerting functions for assertive data cleaning in production (instead, recommend assertr or another package)
- Where should the organizing framework go - vignette? Bookdown? Main page?
Establish organizing framework for the package
- Write up documentation/vignette showing check/clean iterative cycle
- Figure out new names for functions to fit schema, and rename them
- Redo vignette
- Redo homepage
New function: fuzzy dupes
New function family: bindability issues
Should have an option for returning with or without Ns as a column (it's much faster without)