library(dplyr)
library(tidyr)
Sometimes when you read in a package after you run base, then other parts of base are masked due to the same name in the new package that you loaded. if you do not want to see this then add in the warning = FALSE, message = FALSE in the setup
sometimes if you have a problem function that is masked and not working they you can just call it explicity For example: stats::filter() And generally use package_name::function_name()
Usefule to build a roadmap for your rmarkdown document with simple outline headings of what you plan to do
Reading in urls for windows machines will need a url wrapper with some more methods.
catch_original<-read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1", method = "libcurl"), stringsAsFactors = FALSE)
Remove the notes and all columns because we don’t need them. That information should really be in the metadata somewhere. Also good clean coding with piping will be to move the application to a different line, but do the line break after the pipe, NOT before. Mutate allows to add columns or change existing columns, left side of expression is what you want to change or add, then = to change it to whatever.
catch_clean<-catch_original %>%
select(-All, -notesRegCode) %>%
mutate(Chinook = ifelse(Chinook=="I",1,Chinook)) %>%
mutate(Chinook = as.numeric(Chinook))
head(catch_clean)
## Region Year Chinook Sockeye Coho Pink Chum
## 1 SSE 1886 0 5 0 0 0
## 2 SSE 1887 0 155 0 0 0
## 3 SSE 1888 0 224 16 0 0
## 4 SSE 1889 0 182 11 92 0
## 5 SSE 1890 0 251 42 0 0
## 6 SSE 1891 0 274 24 0 0
#summary(catch_clean)
Checking for NA values, just checks one columns Determine which values were converted to NA by as.numeric()
i <- which(is.na(catch_clean$Chinook))
catch_original[i,]
## [1] Region Year Chinook Sockeye Coho
## [6] Pink Chum All notesRegCode
## <0 rows> (or 0-length row.names)
Found issue: “I” in the Chinook column should be a 1. Looked up the reason that there was a character setting for the Chinook column and found that there was an entry with an I instead of a 1 in line 401 of the original data frame. in above code chunk was trying to coerce Chinook to numeric using “mutate(Chinook = as.numeric(Chinook))” in the pipe.
In general have an expectation of the number of rows and columns in your data and what they should be, check domain, range of values, missing value codes.
Can also make a lookup table that you
We want our data table to be in a long format. We are using pivot longer and or pivot wider, it’s the same as melt and cast in plyr, gather and spread is a little more recent. Plyr isn’t really supported any more, but it can be frustrating. pivot longer supports more of the column helper functions that you can use for selecting names.
Easier to sometimes just tell the pivot what to not pivot over, rather than typing out all the columns you want to keep. So in this case could type out all the salmon names with cols = c(), or just state which ones you don’t want with cols = -c(). So this is to change the table from the wide format to a long format, essentially creating a pivot table.
catch_long<- catch_clean %>%
pivot_longer(cols = -c(Region, Year),
names_to = "species",
values_to = "catch") %>%
rename(catch_thousands = catch) %>%
mutate(catch = catch_thousands * 1000) %>%
select(-catch_thousands)
head(catch_long)
## # A tibble: 6 x 4
## Region Year species catch
## <chr> <int> <chr> <dbl>
## 1 SSE 1886 Chinook 0
## 2 SSE 1886 Sockeye 5000
## 3 SSE 1886 Coho 0
## 4 SSE 1886 Pink 0
## 5 SSE 1886 Chum 0
## 6 SSE 1887 Chinook 0
and sometimes you want your data in wide format
catch_wide<-catch_long %>%
pivot_wider(names_from = "Year",
values_from = "catch")
head(catch_wide)
## # A tibble: 6 x 122
## Region species `1886` `1887` `1888` `1889` `1890` `1891` `1892` `1893` `1894`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SSE Chinook 0 0 0 0 0 0 0 0 0
## 2 SSE Sockeye 5000 155000 224000 182000 251000 274000 207000 189000 253000
## 3 SSE Coho 0 0 16000 11000 42000 24000 11000 1000 5000
## 4 SSE Pink 0 0 0 92000 0 0 8000 187000 529000
## 5 SSE Chum 0 0 0 0 0 0 0 0 0
## 6 NSE Chinook 0 0 0 0 0 0 0 0 7000
## # ... with 111 more variables: `1895` <dbl>, `1896` <dbl>, `1897` <dbl>,
## # `1898` <dbl>, `1899` <dbl>, `1900` <dbl>, `1901` <dbl>, `1902` <dbl>,
## # `1903` <dbl>, `1904` <dbl>, `1905` <dbl>, `1906` <dbl>, `1907` <dbl>,
## # `1908` <dbl>, `1909` <dbl>, `1910` <dbl>, `1911` <dbl>, `1912` <dbl>,
## # `1913` <dbl>, `1914` <dbl>, `1915` <dbl>, `1916` <dbl>, `1917` <dbl>,
## # `1918` <dbl>, `1919` <dbl>, `1920` <dbl>, `1921` <dbl>, `1922` <dbl>,
## # `1923` <dbl>, `1924` <dbl>, `1925` <dbl>, `1926` <dbl>, `1927` <dbl>,
## # `1928` <dbl>, `1929` <dbl>, `1930` <dbl>, `1931` <dbl>, `1932` <dbl>,
## # `1933` <dbl>, `1934` <dbl>, `1935` <dbl>, `1936` <dbl>, `1937` <dbl>,
## # `1938` <dbl>, `1939` <dbl>, `1940` <dbl>, `1941` <dbl>, `1942` <dbl>,
## # `1943` <dbl>, `1944` <dbl>, `1945` <dbl>, `1946` <dbl>, `1947` <dbl>,
## # `1948` <dbl>, `1949` <dbl>, `1950` <dbl>, `1951` <dbl>, `1952` <dbl>,
## # `1953` <dbl>, `1954` <dbl>, `1955` <dbl>, `1956` <dbl>, `1957` <dbl>,
## # `1958` <dbl>, `1959` <dbl>, `1960` <dbl>, `1961` <dbl>, `1962` <dbl>,
## # `1963` <dbl>, `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>,
## # `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
## # `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
## # `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
## # `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
## # `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
## # `1993` <dbl>, `1994` <dbl>, ...
Mean catch per region
mean_region<-catch_long %>%
group_by(Region,species) %>%
summarise(mean_catch = mean(catch),
n_obs = n())
head(mean_region)
## # A tibble: 6 x 4
## # Groups: Region [2]
## Region species mean_catch n_obs
## <chr> <chr> <dbl> <int>
## 1 ALU Chinook 23.0 87
## 2 ALU Chum 2908. 87
## 3 ALU Coho 195. 87
## 4 ALU Pink 191954. 87
## 5 ALU Sockeye 6839. 87
## 6 BER Chinook 19.6 102
Filter for only one species
sockeye_catch<-catch_long %>%
filter(species=="Sockeye") %>%
group_by(Year) %>%
summarise(mean_catch = mean(catch)) %>%
arrange(desc(mean_catch))
max(sockeye_catch$mean_catch)
## [1] 3783118
head(sockeye_catch)
## # A tibble: 6 x 2
## Year mean_catch
## <int> <dbl>
## 1 1993 3783118.
## 2 1995 3737235.
## 3 1992 3428471.
## 4 1983 3110294.
## 5 1990 3099529.
## 6 1994 3083353.
Want to get information from another table to get information on the region. This would be equivalent to a look-up table.
Read in the region defs tables
region_defs <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.303.1",method = "libcurl"),stringsAsFactors = FALSE) %>%
select(code, mgmtArea)
head(region_defs)
## code mgmtArea
## 1 GSE Unallocated Southeast Alaska
## 2 NSE Northern Southeast Alaska
## 3 SSE Southern Southeast Alaska
## 4 YAK Yakutat
## 5 PWSmgmt Prince William Sound Management Area
## 6 BER Bering River Subarea Copper River Subarea
catch_joined<- left_join(catch_long,region_defs,
by = c("Region"="code"))
head(catch_joined)
## # A tibble: 6 x 5
## Region Year species catch mgmtArea
## <chr> <int> <chr> <dbl> <chr>
## 1 SSE 1886 Chinook 0 Southern Southeast Alaska
## 2 SSE 1886 Sockeye 5000 Southern Southeast Alaska
## 3 SSE 1886 Coho 0 Southern Southeast Alaska
## 4 SSE 1886 Pink 0 Southern Southeast Alaska
## 5 SSE 1886 Chum 0 Southern Southeast Alaska
## 6 SSE 1887 Chinook 0 Southern Southeast Alaska
Need to specify which columns you should join over, they don’t need to be named the same.
Dummy df exercises to separate names and unity
sites_df <- data.frame(site = c("HAW-101",
"HAW-103",
"OAH-320",
"OAH-219",
"MAI-039"),
stringsAsFactors = FALSE)
sites_df %>%
separate(site, c("island", "site_number"), "-")
## island site_number
## 1 HAW 101
## 2 HAW 103
## 3 OAH 320
## 4 OAH 219
## 5 MAI 039
Unite
dates_df <- data.frame(year = c("1930",
"1930",
"1930"),
month = c("12",
"12",
"12"),
day = c("14",
"15",
"16"),
stringsAsFactors = FALSE)
dates_df %>%
unite(date, year, month, day, sep = "-")
## date
## 1 1930-12-14
## 2 1930-12-15
## 3 1930-12-16
If you want to do more with strings then use stringr package which also goes with the tidyverse.
make a change