diff --git a/02_getting_started_with_r.qmd b/02_getting_started_with_r.qmd index 8c7ab9c..9965c06 100644 --- a/02_getting_started_with_r.qmd +++ b/02_getting_started_with_r.qmd @@ -9,8 +9,8 @@ The first screen we see in **R** should look like this: Here, there are three main window panes. The "Console" window is where we type the code to tell the computer to do stuff. The "Environment-History-Connections" window has three tabs. "Environment" -shows the objects that we have saved during our session (I will explain this -soon); "History" shows a record of all the code we asked the computer to run; and "Connections"—which we will not use here—shows us the connections we have to remote databases. +shows the objects that we save during our session (I will explain this +soon); "History" shows a record of all the code we ask the computer to run; and "Connections" (which we will not use here) shows us the connections we have to remote databases. The "Files-Plots-Packages" window has several tabs. "Files" shows us all the files in our current working directory, which is the default location where **R** will look for files we want to load and where it will put any files we save. "Plots" will display the plots we make in **R**. "Packages" shows us the packages installed in our computer and whether they are loaded in our current session. "Help" allows us to search and read the documentation for packages and @@ -87,7 +87,7 @@ If we try to do shoddy math, **R** will inform us (but not necessarily with an e ``` -**R** has "and" (`&`) and "or" (`|`) operators to combine multiple logical statements: +The logical operators "and" (`&`) and "or" (`|`) allow us to combine multiple logical statements: ```{r conjunction and disjunction} (5 < 6) & (7 == 8) @@ -121,7 +121,7 @@ store numbers and decimal positions. So, computers often use numbers with what we think of as `3` may actually be `3.000005`. 
This is common enough to justify the **R** [FAQ (7.31)](https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f). To quote from "The Elements of Programming Style", by Kernighan and Plauger: "10.0 times 0.1 is hardly ever 1.0." -Being able to do these computations is good and all, but what happens if we want to use a result later without redoing the computation? **R** can save numbers (and more) using something called "objects". +Being able to do these computations is nice, but what happens if we want to use a result later without redoing the computation? **R** can save numbers (and more) using something called "objects". ## Objects @@ -195,8 +195,8 @@ We can use almost any name we want, but there are a few rules: There are also a few suggestions that will save you many hours of frustration: + Avoid giving your object the same name as a built-in function. -+ If you need to create objects with multiple words in their name, separate them with an underscore (`my_value`) or a dot (`my.value`), or capitalize the different words (`MyValue`). Choose whichever format you like most. Just be consistent. -+ Use informative names. It is quick and easy to use names like `x` or `my_value`. But your code will be easier and faster to understand if your objects have names that illustrate what you want to do with them. Your colleagues and your future self will really appreciate it. ++ If you need to create objects with multiple words in their name, separate them with an underscore (`my_object`) or a dot (`my.object`), or capitalize the different words (`MyObject`). Choose whichever format you like most. Just be consistent. ++ Use informative names. It is quick and easy to use names like `x` or `value1`. But your code will be easier and faster to understand if your objects have names that illustrate what you want to do with them. Your colleagues and your future self will really appreciate it. 
## Using functions @@ -272,7 +272,7 @@ A shorter way of writing this is to use `?` before the name of the function. After you run the code, the help page is displayed in the "Help" tab in the "Files-Plots-Packages" pane (usually in the bottom right of **R**Studio). -As a novice user, help pages may seem arcane---probably because they aim for shortness and use technical terminology. But this short jargon makes (most of) the explanations precise, so we can use the information we need without having to read the entire document. Also, all help pages are organized similarly, so we don't have to relearn how to navigate them. With a bit of practice, you will be able to find exactly what you need in mere seconds. +As a novice user, help pages may seem arcane---perhaps because they aim for shortness and use jargon. But this short jargon makes (most of) the explanations precise, so we can use the information we need without having to read the entire document. Also, all help pages are organized similarly, so we don't have to relearn how to navigate them. With a bit of practice, you will be able to find exactly what you need in mere seconds. The first line of the help document displays the name of the function and the package that contains the function. Other sections are: @@ -298,9 +298,9 @@ Note that in this case we have to use quotes `" "` around the name of the functi ??round ``` -As before, the 'Help' tab in RStudio will display the results of the search. `help.search()` looks for the pattern in the help documentation, code demonstrations, and package vignettes and displays the results as clickable links that we can follow. +As before, the 'Help' tab in **R**Studio will display the results of the search. `help.search()` looks for the pattern in the help documentation, code demonstrations, and package vignettes and displays the results as clickable links that we can follow. 
-Another useful function is `apropos()`, which lists all functions containing a specified character string. For example, to find all functions with `round` in their name, we use +Another useful function is `apropos()`, which lists all functions containing a specified character string. For example, to find all functions with `round` in their name, we use^[This output should show 6 names instead of 3. I can not fix the output in the notes, but the code works fine inside **R**Studio.] ```{r} apropos("round") @@ -337,7 +337,7 @@ Until now, the console satisfied all our coding needs. But if we wanted to reuse ## Working with scripts -An **R** script is a plain text file that we can use to save our code. I think the best way to code in **R** is to use a script, and I strongly suggest you always use one. A script helps us edit, proofread, and reuse our code. It also allows us to save our code so we can share it with others or come back to it later. Also, it is cumbersome to write multi-line code in the console, but it is easy if we use a script. +An **R** script is a plain text file that we can use to save our code. I think the best way to code in **R** is to use a script, and I strongly suggest you always use one. A script helps us edit, proofread, and reuse our code. Also, it is cumbersome to write multi-line code in the console, but it is easy if we use a script. A script also allows us to save our code so we can share it with others or come back to it later. To open a script, click on `File > New File > R script` in the menu bar on the upper left-hand side of the screen. @@ -351,7 +351,7 @@ To save your script, you can click on the blue square^[Young reader, this square Moving forward, I will write all my code in a script and will assume you are doing so too. Trust me, it's for your own benefit. -Now that we know how to preserve our hard-obtained code, we can start doing more elaborate work. Previously we worked with one number at the time. 
But we can also work with groups of numbers (and of other stuff) by using something called "vectors". +Now that we know how to preserve our hard-gained code, we can start doing more elaborate work. Previously we worked with one number at a time. But we can also work with groups of numbers (and of other stuff) by using something called "vectors". ## Working with vectors @@ -381,7 +381,7 @@ Note that we must enclose words in quotation marks to let **R** know that we wan vector_of_words <- c(monday, lemon) ``` -Later on we will work more with vectors of words, but for now let's focus on numerical vectors. +Later on we will work more with vectors of words, but for now let's focus on vectors of numbers. ### Named vectors @@ -406,6 +406,11 @@ my_vec*my_vec In the last example, **R** did not obey the rules of linear algebra to multiply two vectors. Instead, **R** used "element-wise execution", which means that **R** applied the same operation to each member of the vector. For example, `my_vec + 7` adds `7` to each number inside `my_vec`. +Element-wise execution allows us to manipulate entire data variables rather than +one element at a time. When working with a data set, element-wise +execution will ensure that values from one observation or case are only paired +with values from the same observation or case. Element-wise execution also facilitates writing our functions in **R**. + To decide how to apply element-wise execution, **R** considers the *length* of the vectors, which refers to the number of elements inside them. When you use two vectors with the same length for an operation, **R** will line up the vectors and perform a sequence of individual operations. For instance, in `my_vec*my_vec`, **R** multiplies the first element of vector 1 by the first element of vector 2, then the second element of vector 1 by the second element of vector 2, and so on, until all elements are multiplied. The result will be a new vector with the same length as the first two.
If you give **R** two vectors with different lengths, it will repeat the shorter vector until it has as many elements as the longer vector, and then do the math. @@ -413,7 +418,7 @@ If you give **R** two vectors with different lengths, it will repeat the shorter my_vec * c(1, 2) ``` -If the length of the short vector does not divide evenly into the length of the long vector, R will do an incomplete repeat of the shorter vector and return a warning. +If the length of the short vector does not divide evenly into the length of the long vector, **R** will do an incomplete repeat of the shorter vector and return a warning. ```{r recycling vectors 2} my_vec * c(1, 2, 3, 4) @@ -421,25 +426,20 @@ my_vec * c(1, 2, 3, 4) Repeating the numbers of the vector is known as "vector recycling", and it helps **R** do element-wise operations. -Element-wise operations allow us to manipulate entire data variables rather than -one element at a time. When you start working with data sets, element-wise -operations will ensure that values from one observation or case are only paired -with values from the same observation or case. Element-wise operations also make -it easier to write your own programs and functions in **R**. - -**R** can do vector and matrix multiplications, but we have to explicitly ask for them. For example, to get the inner product, we need the operator `%*%`. And to get the outer product, we need `%o%`. If you are not familiar with matrix operations, don't worry, you won't need them in these notes. +**R** can do vector and matrix multiplications, but we have to ask for them +explicitly. For example, to get the inner product, we need the operator `%*%`. +And to get the outer product, we need `%o%`. Don't worry if you are not familiar +with matrix operations; you won't need them in these notes. ### Extracting elements -We can access specific elements of vectors using the square bracket `[ ]` notation. 
Write the name of the vector we want to extract from, followed by the square brackets with an index of the element we wish to extract. This index can be a position or the result of a logical test. - -To extract elements based on their position we simply write the position inside the `[ ]`. For example, to extract the 3rd value of `my_vec`, we use +We can access specific elements of vectors using the square bracket `[ ]` notation. To use it, we first write the name of the vector we want to extract from, followed by the square brackets with an index of the element we wish to extract. This index can be a position or a logical statement. To extract elements based on their position we simply write the position number inside the `[ ]`. For example, to extract the 3rd value of `my_vec`, we use ```{r extract value from my_vec} my_vec[3] ``` -We can store this value in another object +We can store this value in another object. ```{r} value_3 = my_vec[3] @@ -465,10 +465,10 @@ my_vec[3:6] ::: {.callout-note} ## Note -In **R**, the positional index starts at 1, so to call the first element of a vector we need to use `[1]`. In most other programming languages (like Python and C++), the positional index starts at 0. +In **R**, the positional index starts at 1, so to call the first element of a vector we need to use `[1]`. In many other programming languages (like Python and C++), the positional index starts at 0. ::: -If the elements of the vector have names, we can use the name instead of a positional index. +If the elements of the vector have names, we can use the name (surrounded by quotes) instead of a positional index. ```{r extracting named element} my_named_vec my_named_vec["lions"] @@ -480,7 +480,7 @@ Another convenient way to extract elements from a vector is to use a logical exp my_vec[my_vec > 4] ``` -This works because **R** uses element-wise operations even for logical statements. 
So, `my_vec > 4` asks if each item of `my_vec` meets the condition "greater than four" and returns the corresponding vector of `TRUE` and `FALSE`. Then, when we add this result to the square brackets, **R** examines each element of `my_vec` asking "should I extract this element?". If the answer is `TRUE`, the value is extracted; if it's `FALSE`, the value is ignored. Under the hood, `my_vec > 4` is equivalent to +This works because **R** uses element-wise execution even for logical statements. So, `my_vec > 4` asks if each item of `my_vec` meets the condition "greater than four" and returns the corresponding vector of `TRUE` and `FALSE`. Then, when we add this result to the square brackets, **R** examines each element of `my_vec` asking "should I extract this element?". If the answer is `TRUE`, the value is extracted; if it's `FALSE`, the value is ignored. Under the hood, using `my_vec > 4` is equivalent to ```{r} my_vec[c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)] @@ -504,7 +504,7 @@ my_vec[c(5, 6)] <- 55 my_vec ``` -**R** can also replace elements with an element wise operation: +**R** can also do element-wise replacement: ```{r} my_vec[c(5, 6)] <- c(100, 200) @@ -529,7 +529,7 @@ sorted_vec <- sort(my_vec) sorted_vec ``` -If we want to sort from highest to lowest, we need to set the optional argument `decreasing` to `TRUE` +If we want to sort from highest to lowest, we need to set the optional argument `decreasing` to `TRUE`. ```{r} sorted_vec_decreasing <- sort(my_vec, decreasing = TRUE) @@ -543,30 +543,28 @@ sorted_vec_decreasing <- rev(sort(my_vec)) sorted_vec_decreasing ``` -A more useful feature of vectors is that we can reorder their elements based on the values of other vectors. To show this, let's first create a vector of countries and another vector with (my guess of) their typical daily temperatures in degrees Fahrenheit. +A more useful feature of vectors is that we can reorder their elements based on the values of other vectors. 
To show this, let's first create a vector of cities and another vector with (my guess of) their typical daily temperatures in degrees Fahrenheit. ```{r} -countries <- c("Japan", "Egypt", "Mexico", "Finland") +cities <- c("Tokyo", "Cairo", "Mexico City", "Helsinki") temperatures_fahrenheit <- c(50, 90, 65, -10) ``` -Now imagine that we want to order the vector of countries, going from coldest to hottest. The first step to reorder the countries is to use `order()` to create a new variable called "temperatures_ordered". +Now imagine that we want to order the vector of cities, going from coldest to hottest. The first step is to use `order()` to create a new variable called "temperatures_ordered". ```{r} temperatures_ordered <- order(temperatures_fahrenheit) temperatures_ordered ``` -This output says that the lowest value in `temperatures_fahrenheit` is in the fourth position, the second lowest value is on the first position, and so on. So, we can think of `temperatures_ordered` as a vector of positional indices of temperatures in ascending order. Now we can use these indices to reorder the vector of countries. +This output says that the lowest value in `temperatures_fahrenheit` is in the fourth position, the second lowest value is on the first position, and so on. So, we can think of `temperatures_ordered` as a vector of positional indices of temperatures in ascending order. Now we can use these indices to reorder the vector of cities. ```{r} -countries_ordered <- countries[temperatures_ordered] -countries_ordered +cities_ordered <- cities[temperatures_ordered] +cities_ordered ``` -Ta-da! - -These vector manipulations can do more than dazzle your friends. Imagine we have a data set with two columns of data and we want to sort each column. If we use `sort()` on each column separately, the values of each column will become uncoupled from each other. Instead, we can use `order()` on one column to make a vector of positional indices that we can use on the second column. 
This will return a vector of values based on the order of the first column. +Ta-da! These vector manipulations can do more than dazzle your friends. Imagine that we have a data set with two columns of data and that we want to sort it based on the values of the first column. If we use `sort()` on each column separately, the values of each column will become uncoupled from each other. Instead, we can use `order()` on one column to make a vector of positional indices. Then we can use this vector on both columns to keep the values of each coupled in the original way. So far we have relied on **R**'s built-in capabilities to do everything we need. But sometimes we need to do something for which **R** doesn't have a pre-made function. In these cases, we can write a function ourselves and test it easily using a script. @@ -590,7 +588,7 @@ simple_function <- function() { :::callout-tip ## Indents make code more readable -Note that I indented the body of the function. This indentation doesn't affect our function, but it lets the reader know that the code is only supposed to run inside the function. +Indenting the body of the function helps the reader notice that the code is only supposed to run inside the function. Indentation doesn't affect our functions, but it is helpful and pervasive among **R** coders. ::: To run our function, we have to write its name followed by round brackets, just like with any other function: @@ -601,7 +599,7 @@ simple_function() Remember to write the round brackets even if they are empty. The round brackets make **R** *run* the code inside the function. If we don't write these brackets, **R** will *show us* the code inside the function (try it!). -Now let's write a function that will convert Fahrenheit degrees to Celsius. Since we want to use this function with different temperatures, we need to include an argument that will tell **R** which temperature to convert each time. +Let's write a function to convert Fahrenheit degrees to Celsius. 
We want to use this function with different temperatures, so we need to include an argument that will tell **R** which temperature to convert each time. ```{r fahrenheit to celsius without function} fahrenheit_to_celsius <- function(temperature) { @@ -626,7 +624,7 @@ Now all we need is to identify values of $a, b,$ and $c$ to pass as arguments. solve_quadratic(a = 1, b = -1, c = -3) ``` -Why didn't our function show a result? When we run a function, **R** runs the code in the body and returns the result of the last line of code. In `solve_quadratic()`, the last line saves one of the solutions, but does not show it. So, `solve_quadratic()` doesn't show a value either. We have to write something to ensure that the last line displays the solutions. Since we are using a script (right?) it is easy add one more line to our function: +Why didn't our function show a result? When we run a function, **R** runs the code in the body and returns the result of the last line of code. In `solve_quadratic()`, the last line saves the second solution only and it doesn't show it. So, running `solve_quadratic()` doesn't show a value either. We have to write something to ensure that the last line displays both solutions. Since we are using a script (right?), it is easy to add one more line to our function: ```{r solve_quadratic} solve_quadratic = function(a, b, c) { @@ -638,7 +636,7 @@ solve_quadratic = function(a, b, c) { solve_quadratic(a = 1, b = -1, c = -3) ``` -A more explicit way of ensuring our function will return a result is to use the `return()` statement inside the function: +A more explicit way of ensuring our function will display its result is to use the `return()` statement inside the function: ```{r solve_quadratic with return()} solve_quadratic = function(a, b, c) { @@ -650,7 +648,9 @@ solve_quadratic = function(a, b, c) { solve_quadratic(a = 1, b = -1, c = -3) ``` -`return()` does not need to be the last line of the function.
It can appear anywhere else and the function will still yield whatever `return()` contains. +`return()` does not need to be the last line of the function. It can appear anywhere else and the result will still be whatever `return()` contains. + +Note that the result of a function must always be a single object. If we tried to use `return(solution_1, solution_2)`, we would get an error message. This is why we combined both solutions from `solve_quadratic()` into a single vector. Our function can have as many arguments as we like. It is enough to add their names, separated by commas, in the parentheses that follow the function. When the function runs, **R** will replace each argument name in the function body with the corresponding value that we supply. If we don't supply a value, **R** will replace the argument name with the argument's default value (if we defined one). @@ -666,12 +666,12 @@ multiply_solutions(a = 1, b = -1, c = -3, multiplier = 10) ::: {.callout-note} ## Objects created in functions disappear -All of the objects that we create inside a function will disappear after it finishes running. Only the output will remain after the function runs, and we need to assign it to an object if we want to save it. +All of the objects that we create inside a function will disappear after it finishes running. Only the output will remain, and to save it we need to assign it to an object. ::: -## Acquiring external packages +Being able to write our own functions is great, but we don't need to reinvent the wheel every time we need to do something that is not available in **R**'s default version. We can easily download packages from CRAN's online repositories to get many useful functions. -We don't need to reinvent the wheel every time we need to do something that is not available in **R**'s default version. We can easily download packages from CRAN's online repositories to get many useful functions. 
+## Acquiring external packages To install a package from CRAN, we can use the `install.packages()` function. For example, if we wanted to install the package `readxl` (for loading .xslx files), we would need: ```{r installing readxl} @@ -689,11 +689,11 @@ After installing a package, we need to load it into **R** before we can use its library("readxl") ``` -Every time we start a new **R** session we need to load the packages we need. If we try to run a function without loading its package first, we will get an error message saying that **R** could not find it. +Whenever we start a new **R** session we need to load the packages we need. If we try to run a function without loading its package first, **R** will not be able to find it and will return an error message. -Writing all our `library()` statements at the top of our **R** scripts is almost always a good idea. This helps us know that we need to load the libraries at the start our sessions; and it helps others know quickly that they will need to have those libraries installed to be able to use our code. +Writing all our `library()` statements at the top of our **R** scripts is almost always good because it helps us know that we need to load the libraries at the start of our sessions. It also helps others know quickly that they will need to have those libraries installed to be able to use our code. -Sometimes we only need one or two functions from a library. To avoid loading the entire library, we can access the specific function directly by specifying the package name followed by two colons and then the function name. For example: +A library can contain many objects, but sometimes we only need one or two of its functions. To avoid loading the entire library, we can access the specific function directly by specifying the package name followed by two colons and then the function name.
For example: ```{r using specific function from library} #| eval: false readxl::read_xlsx("fake_data_file.xlsx") diff --git a/03_data_in_r.qmd b/03_data_in_r.qmd index 7231ef0..4b6b2e7 100644 --- a/03_data_in_r.qmd +++ b/03_data_in_r.qmd @@ -4,7 +4,7 @@ In the previous section we learned how to store, extract, and manipulate numeric ## Data types -Data types are classifications of data that help **R** conform to our intuition. For example, multiplying numbers by each other feels right, but multiplying words by each other does not. There are six data types in **R**: doubles, integers, logicals, characters, complex, and raw. Each type has different rules for storing and handling them. Learning these rules will allow us to analyze data later with less effort and fewer mistakes. +Data types are classifications of data that help **R** conform to our intuition. For example, multiplying numbers by each other feels right, but multiplying words by each other does not. There are six data types in **R**: doubles, integers, logicals, characters, complex, and raw, each with its own rules for storing and handling. Learning these rules will allow us to analyze data later with less effort and fewer mistakes. Data type **complex** is for imaginary numbers, and type **raw** represents raw bytes of data. It is unlikely that you will ever need these data types, so I will not explain them in these notes. @@ -19,7 +19,7 @@ typeof(my_double) my_integer <- 5L typeof(my_integer) ``` -Data scientists rarely use integers because we can save them as doubles. But **R** stores integers more precisely than doubles. So, integers are still helpful when dealing with complicated operations. +Data scientists rarely use integers because we can save them as doubles. But integers are still helpful when dealing with complicated operations because **R** stores them more precisely than doubles. 
**Logicals** are truth values `TRUE` and `FALSE`, which are useful when we compare numbers or objects: ```{r logical value} @@ -28,7 +28,7 @@ typeof(my_comparison) ``` ::: {.callout-tip} #### Write TRUE and FALSE explicitly -At the beginning of every session, **R** saves `T` and `F` as shortcuts for `TRUE` and `FALSE`. But `T` and `F` are not reserved words; they are regular variables that we can modify and even delete inadvertently. An accidental misuse of `T` or `F` will cost you more time and effort than whatever you may save by typing a single letter instead of a full word. So, I strongly suggest you always write the full words. +At the beginning of every session, **R** saves `T` and `F` as shortcuts for `TRUE` and `FALSE`. But `T` and `F` are not reserved words; they are regular objects that we can modify and even delete inadvertently. An accidental misuse of `T` or `F` will cost you more time and effort than whatever you may save by typing a single letter instead of a full word. So, I strongly suggest you always write the full words. ::: There is also a special type of logical value called `NA`, which denotes a missing value. @@ -42,14 +42,12 @@ NA > 5 NA == "Sancho" ``` - - -**Characters** are text (like "hello", "Elvis", or "Somewhere in La Mancha") or symbols that we want to handle as text (like "size 45", or "mail/u"). We can create a character object by typing a character or *string* of characters surrounded by quotes: +**Characters** are text, like "hello", "Elvis", or "Somewhere in La Mancha"; or symbols that we want to handle as text, like "size 45" or "mail/u". We can create a character object by typing a character or *string* of characters surrounded by quotes: ```{r character value} my_character <- "Somewhere in La Mancha" typeof(my_character) ``` -As you may notice, it is easy to confuse **R** objects with character strings because both appear as pieces of text in R code. 
For example, `the_thing` is the name of an **R** object named *the_thing* that can contain any type of data; but `"the_thing"` is a character string, i.e., it is itself a piece of data that we can assign to any name we want. If we forget to use the quotation marks when writing a name, **R** will look for an object that likely doesn't exist, so we will likely get an error. +As you may notice, it is easy to confuse objects with character strings because both appear as pieces of text in the code. For example, `the_thing` is the name of an object named *the_thing* that can contain any type of data. Conversely, `"the_thing"` is a character string, i.e., a piece of data that we can assign to any name. If we forget the quotation marks when writing a name, **R** will look for an object that likely doesn't exist, so we will likely get an error. ```{r} #| error: true noquotes @@ -80,7 +78,9 @@ Data structures are ways of organizing data that make it easier for us to manipu ### Atomic vectors -Atomic vectors are the most basic type of data structure. They are one-dimensional groups of data where all values must be of the same type. There is only one exception: any vector can include `NA` as a value regardless of the type of the other values. To create an atomic vector, we can group values using the combine function `c()`: +Atomic vectors are the most basic type of data structure. They are one-dimensional groups of data where all values must be of the same type. There is only one exception: any vector can include `NA` as a value regardless of the type of the other values. Vectors make it easy for us to store values that are supposed to measure the same property. It would be hard to understand what a vector represented if it had values like `"salsa"` and `sqrt(77)`. 
+ +To create an atomic vector, we can group values using the combine function `c()`: ```{r atomic vector} quijote_characters <- c("Don Quijote", "Sancho Panza", NA) @@ -99,7 +99,7 @@ length(c()) Adding different data types to the same atomic vector does not produce an error. Instead, **R** automatically follows specific rules to *coerce* everything inside the vector to be of the same type. If a character string is present in an atomic vector, **R** will convert all other values to character strings. If a vector only contains logicals and numbers, **R** will convert the logicals to numbers; every `TRUE` becomes a `1`, and every `FALSE` becomes a `0`. The only values that are not coerced are `NA`s. -Following these rules helps preserve information. It is easy, for example, to recognize the original types of strings `"TRUE"` and `"3.14"`. Or to transform a vector of `1`s and `0`s back to `TRUE`s and `FALSE`s. +Following these rules helps preserve information. It is easy, for example, to recognize the original types of strings `"TRUE"` and `"3.14"`, or to transform a vector of `1`s and `0`s back to `TRUE`s and `FALSE`s. ::: ### Matrices @@ -115,7 +115,7 @@ scores_mat <- matrix(data = scores_vec, ncol = 3) scores_mat ``` -Like atomic vectors, matrices can have any data type, but only one (or `NA`): +Like atomic vectors, matrices can have any data type, but only one (or `NA`). ```{r} character_mat <- matrix( data = c("Mario", "Peach", "Luigi", "Yoshi"), @@ -132,7 +132,7 @@ scores_mat <- matrix(data = scores_vec, nrow = 3, byrow = TRUE) scores_mat ``` -When showing a matrix, **R** shows expressions with square brackets (e.g., `[,1]`). The numbers inside the square brackets are positional indices that denote the "coordinates" of the matrix. Two-dimensional objects like matrices have two indices, one for each dimension. The first number always refers to the row, and the second always refers to the column. 
So, as with vectors, we can use square bracket notation `[ ]` to extract values from matrices.
+When showing a matrix, **R** shows expressions with square brackets (e.g., `[,1]`). The numbers inside the square brackets are positional indices that denote the "coordinates" of the matrix. Two-dimensional objects like matrices have two indices. The first index always refers to the row, and the second always refers to the column. So, as with vectors, we can use square bracket notation `[ ]` to extract values from matrices.
 ```{r extract value from matrix}
 scores_mat[c(1, 3), 2] # Rows 1 and 3 in column 2
 ```
@@ -172,9 +172,9 @@ scores_mat %*% scores_mat # Matrix multiplication
 ### Arrays
-The `array()` function creates an n-dimensional array. Using an n-dimensional array is like stacking groups of data. 1 dimension forms a column of data with multiple values; 2 dimensions are like a sheet of paper with several columns of data; 3 dimensions are like a book with several sheets; 4 dimensions are like a box with several books, and so on. Note that layers of an array have consistent sizes. All books have the same number of sheets, and all sheets have the same number of rows and columns.
+An array is a multidimensional object that stacks groups of data. Using 1 dimension in an array forms a column of data with multiple values; using 2 dimensions is like a sheet of paper with several columns of data; 3 dimensions are like a book with several sheets; 4 dimensions are like a box with several books, and so on. Note that layers of an array have consistent sizes. All books have the same number of sheets, and all sheets have the same number of rows and columns.
-To use `array()`, we need an atomic vector as the first argument, and a vector of dimension sizes `dim` as the second argument:
+We can use `array()` to create an n-dimensional array. The first argument in `array()` must be a vector with the values that we want to store in the array.
The second argument must be a vector where the length denotes the number of dimensions, and the values denote the size of each dimension.
 ```{r array with three dimensions}
 array(c(25:28, 35:38, 45:48), dim = c(2, 2, 3))
@@ -187,7 +187,7 @@ Note that the total number of elements in the array is equal to multiplying the
 ::: {.callout-tip}
 ## Applied inception
-Try to make an array with 4 dimensions. Following the metaphor from above, try to make a box that contains 3 books, each of which has 4 sheets with 2 columns and 2 rows each. See a quick solution below.
+Try to make an array with 4 dimensions. Following the metaphor from above, try to make an array with 3 books, each of which has 4 sheets with 2 columns and 2 rows each. See a quick solution below.
 ```{r inception solution}
 #| code-fold: true
@@ -196,7 +196,7 @@ array(c(1:48), dim = c(2, 2, 4, 3))
 ```
 :::
-Vectors, matrices, and arrays need all of its values to be of the same type. This requirement seems rigid, but it allows the computer to store large sets of numbers in a simple and efficient way; and it accelerates computations because **R** knows that it can manipulate all values in the object the same way. Also, vectors make it easy for us to store values that are supposed to measure the same property. It would be hard to understand what a vector represented if it had values like `"salsa"` and `sqrt(77)`.
+Vectors, matrices, and arrays need all of their values to be of the same type. This requirement seems rigid, but it allows the computer to store large sets of numbers in a simple and efficient way; and it accelerates computations because **R** knows that it can manipulate all values in the object the same way.
 However, we often need to store different types of data in a single place---maybe because all of that data belongs to the same underlying concept. For example, we can describe a dog based on its height, weight, and age (numerical values), and on its color and breed (character strings).
**R** can keep all of these in a single place.
@@ -218,7 +218,7 @@ new_list
 typeof(new_list)
 ```
-Or we can use double bracket notation `[[ ]]` to get only the contents of an element from the original list (we can not extract multiple elements this way).
+Or we can use double bracket notation `[[ ]]` to get only the subelements of an element from the original list (we cannot extract multiple elements this way).
 ```{r extracting a single element without more lists}
 new_item <- all_in_one_list[[1]]
 new_item
@@ -250,14 +250,14 @@ countries_info[["speak_spanish"]]
 Data frames are the most common storage structure for data analysis. We can think of them as a group of atomic vectors (columns) of the same length. Usually, each row of a data frame represents an individual observation and each column represents a different measurement or variable of that observation.
-Different vectors can have different data types, but they must have the same length. If we use vectors of different lengths, **R** will recycle values of some vectors to ensure that the data frame has a square shape.
+Different vectors in a data frame can have different data types, but they must all have the same length. If we use vectors of different lengths, **R** will recycle values of some vectors to ensure that the data frame has a square shape.
 We can create a data frame using the `data.frame()` function. Give `data.frame()` any number of vectors of equal length, each separated with a comma. Each vector should be set equal to a name that describes the vector.
`data.frame()` will turn each vector into a column of the new data frame:
 ```{r create data frame from scratch}
 aliens_df <- data.frame(
-  name = c("Axanim", "Blob", "Cloomin", "Dlemex"),
-  planet = c("Kepler-5", "Patzapuan", "Laodic_Prime", "Future_Earth"),
-  number_of_arms = c(5, NA, 1, 2.5)
+  name = c("Bender", "Fry", "Nibbler", "Zoidberg"),
+  species = c("Robot", "Human", "Nibblonian", "Decapodian"),
+  fingers_per_hand = c(3, 4, 3, NA)
 )
 aliens_df
 ```
@@ -281,7 +281,7 @@ This means that we can extract extract values from data frames the same way we e
 ```{r extracting values from data frame}
 aliens_df[2]
 # Equivalently
-aliens_df["planet"]
+aliens_df["species"]
 ```
 Or we can use double brackets `[[ ]]` or dollar sign notation `$` to get atomic vectors:
 ```{r}
@@ -290,7 +290,7 @@ aliens_df$name
 Also, as with matrices, we can use a single bracket with two indices (note that this produces an atomic vector):
 ```{r}
-aliens_df[c(1,2), "number_of_arms"]
+aliens_df[c(1,2), "fingers_per_hand"]
 ```
 Creating data frames from scratch is cumbersome and prone to errors. In the next section, we will see how to import data from different sources into **R**, as well as basic ways to prepare it for analysis.
diff --git a/04_basic_data_processing.qmd b/04_basic_data_processing.qmd
index a5f5d85..2461753 100644
--- a/04_basic_data_processing.qmd
+++ b/04_basic_data_processing.qmd
@@ -1,6 +1,6 @@
 # Basic data processing
-Now we can apply our understanding of **R** to work with files of pre-existing data. The first step when loading data into **R** is to locate our working directory. This is the default location where **R** will look for files we want to load and where it will put any files we save. The working directory will change on different computers. To find our current working directory, we run:
+Now we can apply our understanding of **R** to work with files of pre-existing data. The first step when loading data is to locate our working directory.
This is the default location where **R** will look for files we want to load and where it will put any files we save. The working directory will change on different computers. To find our current working directory, we run:
 ```{r get working directory}
 #| eval: false
 getwd()
@@ -44,7 +44,7 @@ flower_df <- read.table("data_files/flower.csv", header = TRUE, sep = ",")
 In the code above, I added arguments `header` and `sep`. `header` tells **R** whether the first line of the file contains variable names instead of values; this will help us identify the variables in the data frame. `sep` tells **R** the symbol that the file uses to separate the cells; this will help us preserve the correct location of the data cells.
-Sometimes a plain-text file starts with text that is not part of the data set. Or maybe we want to read only part of a data set. Argument `skip` tells **R** to skip a specific number of lines before it starts reading values from the file. Argument `nrow` tells **R** to only read a certain number of lines, starting from the top. Keep in mind that `nrow` does not count the header in the number of rows it reads.
+Other useful arguments are `skip` and `nrow`. `skip` tells **R** to skip a specific number of lines before it starts reading values from the file. This argument is helpful when the file starts with text that is not part of the data set, and when we want to read only part of a data set. `nrow` tells **R** to only read a certain number of lines, starting from the top. Keep in mind that `nrow` does not count the header in the number of rows it reads.
 ```{r}
 flower_df_chunk <- read.table(
@@ -57,7 +57,7 @@ flower_df_chunk <- read.table(
 flower_df_chunk
 ```
-`read.table()` has other arguments that we can tweak. You can consult the function's help page to know more about it.
+`read.table()` has other arguments that we can tweak. You can consult the function's help page to know more about them.
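As a rough illustration of the `skip` and `nrow` arguments described above, the sketch below writes a small made-up file to a temporary location and reads part of it back. The file contents and the names `tmp_file` and `toy_chunk` are invented for this example and are not part of the flower data. Note that the full name of the argument in `read.table()` is `nrows`; a shorter spelling such as `nrow` also works because **R** allows partial matching of argument names.

```r
# Hypothetical file: one line of notes, a header, and four rows of data
tmp_file <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Notes: this first line is not part of the data set",
    "plant,height",
    "a,10",
    "b,12",
    "c,9",
    "d,15"
  ),
  tmp_file
)

toy_chunk <- read.table(
  tmp_file,
  header = TRUE, # The first line we read contains the column names
  sep = ",",     # Cells are separated by commas
  skip = 1,      # Skip the line of notes at the top of the file
  nrows = 2      # Read two rows of data (the header does not count)
)
toy_chunk
```

The resulting data frame should contain only plants `a` and `b`, because we skipped the notes, used the next line as the header, and then stopped after two rows of data.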
 #### Shortcuts for read.table
@@ -106,9 +106,9 @@ flowers_fwf_df
 ### Excel files
-The best way to load data from Excel files (.xlsx) into **R** is to first save these files as .csv or .txt files and then use `read.table`. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats. All of these make it difficult for **R** to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.
+The best way to load data from Excel files (.xlsx) is to first save these files as .csv or .txt files and then use `read.table`. Excel files can include multiple spreadsheets, macros, colors, dynamic tables, and other complicated formats that make it difficult for **R** to read the files properly. Plain text files are simpler, so we can load and transfer them more easily.
-Still, there are ways to load Excel files into **R** if we *really* need to. **R** has no native way of loading these files, but we can use the package `readxl`, which works on all operating systems. We install it using `install.packages("readxl")`. Then we load it using `library(readxl)`. Once we load the package, we can use the function `read_excel()` to load files of the type .xls and .xlsx (see `help("read_excel")` for more information).
+Still, there are ways to load Excel files if we *really* need to. **R** has no native way of loading these files, but we can use the package `readxl`, which works on all operating systems. We install it using `install.packages("readxl")` and then load it using `library(readxl)`. Once we load the package, we can use the function `read_excel()` to load files of the type .xls and .xlsx (see `help("read_excel")` for more information).
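A minimal sketch of the `readxl` steps above, assuming the package is already installed. It uses `readxl_example()` to locate a sample workbook that ships with the package, so it does not depend on any file from this workshop. The object names `example_path` and `iris_xl_df` are made up for illustration.

```r
library(readxl)

# Path to a sample workbook that is bundled with readxl
example_path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook before choosing one
excel_sheets(example_path)

# Load one sheet by name; the result is a tibble, which is a type of data frame
iris_xl_df <- read_excel(example_path, sheet = "iris")
head(iris_xl_df)
```

Listing the sheets first is a design choice rather than a requirement: Excel files can hold several spreadsheets, and `read_excel()` reads only the first one unless we tell it otherwise.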
 ### Files from other programs
@@ -122,9 +122,9 @@ But sometimes we can't transform the file to a plain-text format---maybe because
 ## Cleaning data
-Once we load our data files as data.frames in **R**, we should verify that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as "data cleaning". We will practice data cleaning using a "messy" version of the flower data that we loaded above. You can get this messy version from [here](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower_messy.csv). Again, you can use `Ctrl+Shift+s` to download the file.
+Once we load our data files as data.frames, we should verify that all of the information has an appropriate format. The process of identifying, removing and correcting inaccurate information is often referred to as "data cleaning". We will practice data cleaning using a "messy" version of the flower data that we loaded above. You can get this messy version from [here](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower_messy.csv). Again, you can use `Ctrl+Shift+s` to download the file.
-Since this is a .csv file, we can load it into **R** using:
+Since this is a .csv file, we can load it using:
 ```{r loading messy flower data}
 flower_messy_df = read.csv("data_files/flower_messy.csv", header = TRUE)
 ```
@@ -155,7 +155,7 @@ colnames(flower_clean_df) <- new_colnames # Replace column names in data frame
 colnames(flower_clean_df) # Check our work
 ```
-The last change to these column names will be to substitute the periods in the names with underscores. In **R**, this is purely out of personal preference, but it's a good excuse to meet the function `gsub()`, which substitutes patterns of strings:
+The last change to these column names will be to substitute the periods in the names with underscores.
In **R**, this is purely out of personal preference, but it's a good excuse to meet `gsub()`, which substitutes patterns of strings:
 ```{r substitute periods with underscores in colnames}
 colnames(flower_clean_df) <- gsub(
   pattern = "\\.", # What we want to remove
@@ -204,7 +204,7 @@ flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen)
 str(flower_clean_df)
 ```
-Column "flowers" looks fine, but column "nitrogen" looks suspicious. It is supposed to have only three values ("low", "medium", and "high"), but its description counts eight values. Let's examine them more closely:
+Column "flowers" looks fine, but column "nitrogen" looks suspicious. It is supposed to have only three levels ("low", "medium", and "high"), but its description counts eight. Let's examine them more closely:
 ```{r check levels of nitrogen column}
 levels(flower_clean_df$nitrogen)
 ```
diff --git a/docs/02_getting_started_with_r.html b/docs/02_getting_started_with_r.html
index dcf2035..b2cbfb4 100644
--- a/docs/02_getting_started_with_r.html
+++ b/docs/02_getting_started_with_r.html
@@ -2,7 +2,7 @@
-
+
@@ -22,7 +22,7 @@
 }
 /* CSS for syntax highlighting */
 pre > code.sourceCode { white-space: pre; position: relative; }
-pre > code.sourceCode > span { line-height: 1.25; }
+pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
 pre > code.sourceCode > span:empty { height: 1.2em; }
 .sourceCode { overflow: visible; }
 code.sourceCode > span { color: inherit; text-decoration: inherit; }
@@ -83,13 +83,7 @@
 "collapse-after": 3,
 "panel-placement": "start",
 "type": "textbox",
- "limit": 50,
- "keyboard-shortcut": [
- "f",
- "/",
- "s"
- ],
- "show-item-context": false,
+ "limit": 20,
 "language": {
 "search-no-results-text": "No results",
 "search-matching-documents-text": "matching documents",
@@ -98,7 +92,6 @@
 "search-more-match-text": "more match in this document",
 "search-more-matches-text": "more matches in this document",
 "search-clear-button-title": "Clear",
- "search-text-placeholder": "",
 "search-detached-cancel-button-title": "Cancel",
 "search-submit-button-title": "Submit",
 "search-label": "Search"
@@ -108,33 +101,6 @@
-
-
@@ -143,12 +109,12 @@