
Homework 3 clarification #36

Open
sandraemry opened this issue Feb 20, 2017 · 6 comments
sandraemry commented Feb 20, 2017

Hi @aammd

Do we add assertions to our script that cleans the raw data? Or should I read in my tidy data set and write assertions for that one?

Thanks!

Sandra


aammd commented Feb 21, 2017

Hi @sandraemry, that is a good question! I think both are fine. Just make sure your reviewer knows where to find the assertions, perhaps by labelling that section of your R script with a prominent comment.
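For instance, a clearly labelled assertion block in base R might look like the following sketch (the data frame and column names here are hypothetical, not from the actual assignment):

```r
## ---- Assertions on the tidy data (reviewers: checks live here) ----------

# Hypothetical tidy data set, stands in for the real one.
tidy <- data.frame(
  tank.no = factor(c("1", "2", "3")),
  biomass = c(0.5, 1.2, 0.8)
)

stopifnot(
  is.factor(tidy$tank.no),   # tank number stored as a factor
  all(tidy$biomass > 0),     # biomass values must be positive
  !anyNA(tidy)               # no missing values remain
)
```

`stopifnot()` halts the script with an error naming the first failed condition, so a reviewer running the script top to bottom will see immediately if the data violate an expectation.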

katcheung commented:

Hi @aammd,
I tried to read in my tidied data and verify that certain columns were set up as factors, but the assertion fails. I verified with my original tidying scripts that the columns were set up correctly, but I seem to lose that information (e.g. tank.no as a factor) in my saved csv file. Is this normal? If so, should we assume that we're continuing to work with the final product of our tidying script (assignment 2), rather than reading in our tidied csv file? Sorry if this is confusing.
Thanks,
Katherine

sandraemry commented:

Hi @katcheung, you can read in your csv files with the type of each column specified explicitly. For me it looks like this:

mydata <- read_csv("./data/flowcam_sum_tidy.csv", col_types = cols(
  temp = col_integer(),
  litter = col_factor(c("H", "L")),
  rep = col_integer(),
  cell_density = col_integer(),
  cell_volume = col_double(),
  biomass = col_double()
))

Is that what you were asking about? Or maybe @aammd has a better solution?


aammd commented Feb 23, 2017

Hi @sandraemry & @katcheung ,

I think Sandra has a good answer here! You're right: factor information is not stored in a csv file; factors are created only when a csv or other file is read into R. So if you change the way you are reading the file, you change the way the result is represented in R. Sandra's example code shows one way to control exactly how each column is read.

Another answer to your question @katcheung is that you can choose to work in a clean script (reading in your tidy CSV) or on the bottom of your old one. Just make sure it is clear for your peer reviewer.
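To see the round-trip problem concretely, here is a small sketch with readr (the file name `demo.csv` is just for illustration): writing a factor column to csv stores only its labels, and the default `read_csv()` guesses a plain character column on the way back in.

```r
library(readr)

# A factor column, as produced by a tidying script.
df <- data.frame(litter = factor(c("H", "L", "H")))
write_csv(df, "demo.csv")        # the csv stores only "H"/"L" text

# Default read: the column comes back as character, not factor.
reread <- read_csv("demo.csv")
class(reread$litter)

# Restoring the factor at read time, as in Sandra's example:
reread2 <- read_csv("demo.csv",
                    col_types = cols(litter = col_factor(c("H", "L"))))
is.factor(reread2$litter)
```

This is why an assertion like `is.factor(...)` fails on a freshly re-read csv unless the `col_types` argument recreates the factor.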

LinneaSandell commented:

@aammd Regarding the metadata, should we make it a routine to work only with files that carry metadata? For example, should I save all my data files as csvy? It doesn't seem very useful to have metadata for only one part of a script (you add metadata in 01_rscript, but read the data in as csv in 02_analyse_data).
Let me know which files metadata should be attached to, and when it is optional.
Thank you.


aammd commented Feb 28, 2017

@LinneaSandell this is an interesting question, and one we should return to in class! Briefly, I think that we are drawing a distinction here between "in progress" data and the "final version" of the dataset. So we add metadata only when we are "happy" with the way the dataset is organized. However, there are many other workflows that could be imagined, where metadata is created at the beginning, or in the middle, of a project.
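For reference, a csvy file is just an ordinary csv with a YAML metadata block at the top, delimited by `---` lines. A rough sketch (the field names and units here are hypothetical):

```
---
name: flowcam_sum_tidy
fields:
  - name: tank.no
    type: string
    description: tank identifier, treated as a factor in R
  - name: biomass
    type: number
    unit: mg
---
tank.no,biomass
1,0.5
2,1.2
```

Because the metadata travels inside the same file as the data, it survives being passed between scripts, which addresses the 01_rscript/02_analyse_data split Linnea describes.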
