library(dplyr)
library(tidyr)

Sometimes when you read in a package after you run base, then other parts of base are masked due to the same name in the new package that you loaded. if you do not want to see this then add in the warning = FALSE, message = FALSE in the setup

sometimes if you have a problem function that is masked and not working they you can just call it explicity For example: stats::filter() And generally use package_name::function_name()

Read in data

Usefule to build a roadmap for your rmarkdown document with simple outline headings of what you plan to do

Reading in urls for windows machines will need a url wrapper with some more methods.

catch_original<-read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1", method = "libcurl"), stringsAsFactors = FALSE)

Make data tidy

Remove the notes and all columns because we don’t need them. That information should really be in the metadata somewhere. Also good clean coding with piping will be to move the application to a different line, but do the line break after the pipe, NOT before. Mutate allows to add columns or change existing columns, left side of expression is what you want to change or add, then = to change it to whatever.

catch_clean<-catch_original %>% 
  select(-All, -notesRegCode) %>% 
  mutate(Chinook = ifelse(Chinook=="I",1,Chinook)) %>% 
  mutate(Chinook = as.numeric(Chinook))

head(catch_clean)
##   Region Year Chinook Sockeye Coho Pink Chum
## 1    SSE 1886       0       5    0    0    0
## 2    SSE 1887       0     155    0    0    0
## 3    SSE 1888       0     224   16    0    0
## 4    SSE 1889       0     182   11   92    0
## 5    SSE 1890       0     251   42    0    0
## 6    SSE 1891       0     274   24    0    0
#summary(catch_clean)

Data QA

Checking for NA values, just checks one columns Determine which values were converted to NA by as.numeric()

i <- which(is.na(catch_clean$Chinook))
catch_original[i,]
## [1] Region       Year         Chinook      Sockeye      Coho        
## [6] Pink         Chum         All          notesRegCode
## <0 rows> (or 0-length row.names)

Found issue: “I” in the Chinook column should be a 1. Looked up the reason that there was a character setting for the Chinook column and found that there was an entry with an I instead of a 1 in line 401 of the original data frame. in above code chunk was trying to coerce Chinook to numeric using “mutate(Chinook = as.numeric(Chinook))” in the pipe.

In general have an expectation of the number of rows and columns in your data and what they should be, check domain, range of values, missing value codes.

Can also make a lookup table that you

Convert to long format

We want our data table to be in a long format. We are using pivot longer and or pivot wider, it’s the same as melt and cast in plyr, gather and spread is a little more recent. Plyr isn’t really supported any more, but it can be frustrating. pivot longer supports more of the column helper functions that you can use for selecting names.

Easier to sometimes just tell the pivot what to not pivot over, rather than typing out all the columns you want to keep. So in this case could type out all the salmon names with cols = c(), or just state which ones you don’t want with cols = -c(). So this is to change the table from the wide format to a long format, essentially creating a pivot table.

catch_long<- catch_clean %>% 
  pivot_longer(cols = -c(Region, Year), 
               names_to = "species", 
               values_to = "catch") %>% 
  rename(catch_thousands = catch) %>% 
  mutate(catch = catch_thousands * 1000) %>% 
  select(-catch_thousands)


head(catch_long)
## # A tibble: 6 x 4
##   Region  Year species catch
##   <chr>  <int> <chr>   <dbl>
## 1 SSE     1886 Chinook     0
## 2 SSE     1886 Sockeye  5000
## 3 SSE     1886 Coho        0
## 4 SSE     1886 Pink        0
## 5 SSE     1886 Chum        0
## 6 SSE     1887 Chinook     0

and sometimes you want your data in wide format

catch_wide<-catch_long %>% 
  pivot_wider(names_from = "Year", 
              values_from = "catch")
head(catch_wide)
## # A tibble: 6 x 122
##   Region species `1886` `1887` `1888` `1889` `1890` `1891` `1892` `1893` `1894`
##   <chr>  <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 SSE    Chinook      0      0      0      0      0      0      0      0      0
## 2 SSE    Sockeye   5000 155000 224000 182000 251000 274000 207000 189000 253000
## 3 SSE    Coho         0      0  16000  11000  42000  24000  11000   1000   5000
## 4 SSE    Pink         0      0      0  92000      0      0   8000 187000 529000
## 5 SSE    Chum         0      0      0      0      0      0      0      0      0
## 6 NSE    Chinook      0      0      0      0      0      0      0      0   7000
## # ... with 111 more variables: `1895` <dbl>, `1896` <dbl>, `1897` <dbl>,
## #   `1898` <dbl>, `1899` <dbl>, `1900` <dbl>, `1901` <dbl>, `1902` <dbl>,
## #   `1903` <dbl>, `1904` <dbl>, `1905` <dbl>, `1906` <dbl>, `1907` <dbl>,
## #   `1908` <dbl>, `1909` <dbl>, `1910` <dbl>, `1911` <dbl>, `1912` <dbl>,
## #   `1913` <dbl>, `1914` <dbl>, `1915` <dbl>, `1916` <dbl>, `1917` <dbl>,
## #   `1918` <dbl>, `1919` <dbl>, `1920` <dbl>, `1921` <dbl>, `1922` <dbl>,
## #   `1923` <dbl>, `1924` <dbl>, `1925` <dbl>, `1926` <dbl>, `1927` <dbl>,
## #   `1928` <dbl>, `1929` <dbl>, `1930` <dbl>, `1931` <dbl>, `1932` <dbl>,
## #   `1933` <dbl>, `1934` <dbl>, `1935` <dbl>, `1936` <dbl>, `1937` <dbl>,
## #   `1938` <dbl>, `1939` <dbl>, `1940` <dbl>, `1941` <dbl>, `1942` <dbl>,
## #   `1943` <dbl>, `1944` <dbl>, `1945` <dbl>, `1946` <dbl>, `1947` <dbl>,
## #   `1948` <dbl>, `1949` <dbl>, `1950` <dbl>, `1951` <dbl>, `1952` <dbl>,
## #   `1953` <dbl>, `1954` <dbl>, `1955` <dbl>, `1956` <dbl>, `1957` <dbl>,
## #   `1958` <dbl>, `1959` <dbl>, `1960` <dbl>, `1961` <dbl>, `1962` <dbl>,
## #   `1963` <dbl>, `1964` <dbl>, `1965` <dbl>, `1966` <dbl>, `1967` <dbl>,
## #   `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
## #   `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
## #   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
## #   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
## #   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
## #   `1993` <dbl>, `1994` <dbl>, ...

Summarize

Mean catch per region

mean_region<-catch_long %>% 
  group_by(Region,species) %>% 
  summarise(mean_catch = mean(catch), 
            n_obs = n())

head(mean_region)
## # A tibble: 6 x 4
## # Groups:   Region [2]
##   Region species mean_catch n_obs
##   <chr>  <chr>        <dbl> <int>
## 1 ALU    Chinook       23.0    87
## 2 ALU    Chum        2908.     87
## 3 ALU    Coho         195.     87
## 4 ALU    Pink      191954.     87
## 5 ALU    Sockeye     6839.     87
## 6 BER    Chinook       19.6   102

Filter for only one species

sockeye_catch<-catch_long %>% 
  filter(species=="Sockeye") %>% 
  group_by(Year) %>% 
  summarise(mean_catch = mean(catch)) %>% 
  arrange(desc(mean_catch))

max(sockeye_catch$mean_catch)
## [1] 3783118
head(sockeye_catch)
## # A tibble: 6 x 2
##    Year mean_catch
##   <int>      <dbl>
## 1  1993   3783118.
## 2  1995   3737235.
## 3  1992   3428471.
## 4  1983   3110294.
## 5  1990   3099529.
## 6  1994   3083353.

Join to other data

Want to get information from another table to get information on the region. This would be equivalent to a look-up table.

Read in the region defs tables

region_defs <- read.csv(url("https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.303.1",method = "libcurl"),stringsAsFactors = FALSE) %>% 
  select(code, mgmtArea)

head(region_defs)
##      code                                  mgmtArea
## 1     GSE              Unallocated Southeast Alaska
## 2     NSE                 Northern Southeast Alaska
## 3     SSE                 Southern Southeast Alaska
## 4     YAK                                   Yakutat
## 5 PWSmgmt      Prince William Sound Management Area
## 6     BER Bering River Subarea Copper River Subarea
catch_joined<- left_join(catch_long,region_defs, 
                         by = c("Region"="code"))

head(catch_joined)
## # A tibble: 6 x 5
##   Region  Year species catch mgmtArea                 
##   <chr>  <int> <chr>   <dbl> <chr>                    
## 1 SSE     1886 Chinook     0 Southern Southeast Alaska
## 2 SSE     1886 Sockeye  5000 Southern Southeast Alaska
## 3 SSE     1886 Coho        0 Southern Southeast Alaska
## 4 SSE     1886 Pink        0 Southern Southeast Alaska
## 5 SSE     1886 Chum        0 Southern Southeast Alaska
## 6 SSE     1887 Chinook     0 Southern Southeast Alaska

Need to specify which columns you should join over, they don’t need to be named the same.

Dummy df exercises to separate names and unity

sites_df <- data.frame(site = c("HAW-101",
                                "HAW-103",
                                "OAH-320",
                                "OAH-219",
                                "MAI-039"),
                       stringsAsFactors = FALSE)

sites_df %>% 
  separate(site, c("island", "site_number"), "-")
##   island site_number
## 1    HAW         101
## 2    HAW         103
## 3    OAH         320
## 4    OAH         219
## 5    MAI         039

Unite

dates_df <- data.frame(year = c("1930",
                                "1930",
                                "1930"),
                       month = c("12",
                                "12",
                                "12"),
                       day = c("14",
                               "15",
                               "16"),
                       stringsAsFactors = FALSE)

dates_df %>% 
  unite(date, year, month, day, sep = "-")
##         date
## 1 1930-12-14
## 2 1930-12-15
## 3 1930-12-16

If you want to do more with strings then use stringr package which also goes with the tidyverse.

make a change