Biostatistics

By

First published on March 16, 2019





resources for biostatistics






Table of Contents

  1. Lesson 8: 3D scatterplot



    Learning Objectives

    Create 3D scatterplots

     

    Demonstration

    Here is the dataset that we are using in this demo:

    downs_bc.csv

    This dataset is about Down’s Syndrome in Canada. (I have slightly modified the original dataset for our purpose).

    https://vincentarelbundock.github.io/Rdatasets/doc/boot/downs.bc.html

     

    We will first import the dataset. Then we will generate a 3D scatterplot from the dataset downs_bc. See below for details.

     

    Step 1: Install the R package “scatterplot3d”

    There are two ways to install an R package. One is to type install.packages("scatterplot3d") at the console (the package name must be in quotes); the other is to install it via the graphical interface of RStudio.

    Step 2: Load the R package “scatterplot3d”

    Before we can use the R package to create 3D scatterplots, we need to load the package into R first:

    library(scatterplot3d)
    
    

    Step 3: Create a 3D scatterplot

    Our dataset is called downs_bc. We want to generate a 3D scatterplot for the three columns in that dataset: cases, age and births.

    library(scatterplot3d)
    # The formula has the form z ~ x + y: cases on the vertical z axis, age on x, births on y
    with(downs_bc, 
         scatterplot3d(cases ~ age + births, pch = 19, 
                       main = "Incidence of Down's Syndrome",
                       xlab = "Age", ylab = "Births", zlab = "Cases"))
    
    

    Explanation:

    We use the library function to load the R package “scatterplot3d”.

    Then we generate the 3D scatterplot from the dataset downs_bc. The formula has the form z ~ x + y:

    cases ~ age means that the variable cases (plotted on the vertical z axis) is explained by age (i.e. age is the explanatory variable, plotted on the x axis);

    + births means that the variable births is the third variable in this scatterplot (plotted on the y axis);

    pch = 19 means to use plotting symbol 19 (solid circle) in our plot. (Type ?pch at the prompt to find out more about pch.)
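
    To see how a formula of this form is split into coordinates, here is a minimal sketch using the base R function xyz.coords, which interprets a formula in the same z ~ x + y way. The data frame toy and its values are made up for illustration and are not taken from downs_bc.csv.

```r
# Made-up values for illustration only (not from downs_bc.csv)
toy <- data.frame(age    = c(20, 30, 40),
                  births = c(1500, 1200, 300),
                  cases  = c(1, 2, 4))

# xyz.coords() lives in grDevices, which is attached by default
coords <- with(toy, xyz.coords(cases ~ age + births))

coords$x  # values of age    (first term on the right-hand side)
coords$y  # values of births (second term on the right-hand side)
coords$z  # values of cases  (left-hand side, the vertical axis)
```

    This is why the axis labels in the scatterplot3d call should follow the order age (x), births (y), cases (z).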

     

    Here is the 3D scatterplot:

    -END-


  2. Lesson 9: Import data via R scripts



    Learning Objectives

    Write code to import data in .R scripts

     

    Demonstration

    Instead of using the Import Dataset button in RStudio as we have been doing so far, we will write R code in our .R script to import the data.

    The advantage of importing your data via your .R script is that you can re-import your dataset and re-run your script in one smooth operation so as to generate a clean output.

     

    Here are the steps:

    Step 1: Bring up the Import Dataset screen

    In RStudio,

    i) Click the Import Dataset button and choose From Text (readr)….

    ii) Then select the dataset that you are going to use BUT DO NOT click the IMPORT button at the bottom of the screen yet.

    iii) Instead, copy the lines of R code in the box Code Preview in the lower left-hand corner of the screen and then paste the lines of R code into your .R script.

    Here is what your .R script should look like after you paste the lines of R code and save your script:
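
    As a rough sketch only: for a CSV file, the Code Preview typically consists of library(readr), a read_csv() call with the path to your file, and a View() call. The version below uses base R's read.csv on a small stand-in file so that it runs without extra packages; the file contents and the temporary path are made up for illustration.

```r
# Stand-in file with made-up values so the sketch is self-contained;
# in your own script, point csv_path at the real downs_bc.csv instead.
csv_path <- file.path(tempdir(), "downs_bc.csv")
writeLines(c("age,births,cases",
             "20,1500,1",
             "30,1200,2"), csv_path)

# Base R import; with readr loaded, read_csv(csv_path) plays the same role
downs_bc <- read.csv(csv_path)

# Quick checks that the import worked
str(downs_bc)
head(downs_bc)
```

    Because the import is now part of the script, re-running the script from the top re-imports the data before any analysis code runs.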

     

    -END-


  3. Lesson 10: Matched-Pairs t tests



    Learning Objectives

    Perform a matched-pairs t test

     

    Demonstration

    Here is the dataset that we are using in this demo from #17.16 in our textbook:

    sitting_squatting.csv

     

    Goal: To conduct a two-sided matched-pairs t test of the null hypothesis of no difference

     

    Here are the null and alternative hypotheses in this example:

    \(H_0: \mu_1 = \mu_2 \)

    \(H_a: \mu_1 \neq \mu_2 \)

    where

    \( \mu_1 \) = population mean angle in squatting

    \( \mu_2 \)= population mean angle in sitting

     

    Here are the steps:

    Step 1: Import the dataset

    Step 2: Create a column of differences

    w = sitting_squatting$Sitting - sitting_squatting$Squatting
    
    

    Do not name your column of differences diff, because diff is a built-in function in R.

    In this demonstration, we name our column of differences w.

     

    Step 3: Draw a stemplot to check data

    Draw a stemplot of the column of differences to check that the data are roughly symmetric and without too many extreme outliers:

    stem(w)
    

     

    Step 4: Run the matched-pairs t test via the R function t.test

    t.test(sitting_squatting$Sitting, sitting_squatting$Squatting, 
                   mu=0, paired=TRUE, 
                   conf.level = 0.95)

    Explanation:

    Whenever R runs a t test, it also calculates the corresponding confidence interval. For a matched-pairs test, this is the range of values within which the population mean difference is estimated to lie.

    Given a set of data, the corresponding hypothesis test result and the confidence interval are closely related. Therefore if we want the significance level \(\alpha \) to be 0.05, then we set the argument conf.level = 0.95  because  conf.level = 1 – \(\alpha \) .

    By default, R automatically sets \(\mu = 0 \) and conf.level = 0.95 even if you don’t explicitly type these arguments. So you can skip typing these arguments into the t.test function if you are testing a two-sided alternative hypothesis with  \(\alpha =0.05\).
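
    A matched-pairs t test is the same as a one-sample t test on the column of differences. The sketch below checks this while relying on the defaults described above. The angles are made up for illustration and are not the textbook data; the names paired_res and one_sample_res are likewise chosen here.

```r
# Made-up angles for illustration only (not the textbook data)
sitting   <- c(100, 105, 98, 110, 102, 95)
squatting <- c(104, 109, 97, 115, 108, 99)

# Matched-pairs t test, relying on the defaults mu = 0 and conf.level = 0.95
paired_res <- t.test(sitting, squatting, paired = TRUE)

# Equivalent one-sample t test on the column of differences
w <- sitting - squatting
one_sample_res <- t.test(w)

# The two p-values agree
paired_res$p.value
one_sample_res$p.value
```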

     

    -END-


  4. Lesson 11: Two-Sample t tests



    Learning Objectives

    Perform a two-sample t hypothesis test

     

    Demonstration

    Goal:

    To conduct a two-sample t test with a two-sided alternative hypothesis. The following example tests whether there is a difference between the heights of plants grown with and without fertilizers (see p. 111 in [1]).

     

    Here are the null and alternative hypotheses in this example:

    \(H_0: \mu_1 = \mu_2 \)

    \(H_a: \mu_1 \neq \mu_2 \)

    where

    \( \mu_1 \) = population mean height of plants grown without fertilizers

    \( \mu_2 \)= population mean height of plants grown with fertilizers

     

    Here are the steps:

    Step 1: Enter the data

    We create a vector called cont to store heights of plants grown without fertilizers.

    cont = c(64.7, 86.6, 67.1, 62.6, 75.1, 83.8, 71.7, 83.4, 90.3, 82.7)
    
    

    We then create another vector called fert to store heights of plants grown with fertilizers.

    fert = c(110.3, 130.4, 114.0, 135.7, 129.9, 98.2, 109.4, 131.4, 127.9, 125.7)
    
    

    Step 2: Draw boxplots to check data

    We draw two boxplots to check if the data are roughly symmetric and without too many extreme outliers:

    boxplot(cont, fert, names = c("Control", "Fertilizer"),
            xlab = "Treatment", ylab = "Plant Height (cm)",
            main = "Plants with(out) Fertilizer", cex.lab = 1.5)
    

    Explanation:

    The argument cex.lab magnifies the axis labels (default value is 1).

     

    Step 3: Run the two-sample t test via the R function t.test

    t.test(cont, fert, mu = 0, conf.level = 0.99)

    Explanation:

    Whenever R runs a t test, it also calculates the corresponding confidence interval. For a two-sample test, this is the range of values within which the difference between the two population means is estimated to lie.

    Given a set of data, the corresponding hypothesis test result and the confidence interval are closely related. Therefore if we want the significance level \(\alpha \) to be 0.01, then we set the argument conf.level = 0.99  because  conf.level = 1 – \(\alpha \) .

    By default, R automatically sets  mu=0  and conf.level = 0.95 even if you don’t explicitly type these arguments. So you can skip typing these arguments into the t.test function if you are testing a two-sided alternative hypothesis with  \(\alpha =0.05\).
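
    As a small sketch of how the output can be used: the object returned by t.test stores the p-value and the confidence interval, so the decision at \(\alpha = 0.01\) can be read off directly. Unless var.equal = TRUE is given, t.test runs the Welch version of the two-sample test, which does not assume equal variances. The names res and alpha are chosen here for illustration.

```r
cont <- c(64.7, 86.6, 67.1, 62.6, 75.1, 83.8, 71.7, 83.4, 90.3, 82.7)
fert <- c(110.3, 130.4, 114.0, 135.7, 129.9, 98.2, 109.4, 131.4, 127.9, 125.7)

res <- t.test(cont, fert, mu = 0, conf.level = 0.99)

alpha <- 0.01
res$p.value           # p-value of the two-sided test
res$p.value < alpha   # TRUE means reject H0 at the 0.01 level
res$conf.int          # 99% confidence interval for mu_1 - mu_2

# conf.level changes only the interval, not the p-value
identical(res$p.value, t.test(cont, fert)$p.value)
```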

     

    References

    [1] Hartvigsen, G. 2014. A Primer in Biological Data Analysis and Visualization Using R. Columbia University Press.

    -END-


  5. Lesson 12: The Chi-Square Test for Goodness of Fit



    Learning Objectives

    Perform a chi-square goodness-of-fit test.

     

    Demonstration

    Goal: To conduct a chi-square goodness-of-fit test for #21.24 in our textbook.

     

    For #21.24, the null hypothesis is:

    \(H_0: p_{tall} = 0.75, p_{dwarf} = 0.25 \)

     

    Here is the R code:

    chisq.test(c(787, 277), p=c(0.75, 0.25))
    
    

    Explanation:

    We use the R function chisq.test to run a chi-square test.

    The arguments for the chisq.test function are:

    c(787, 277) is the vector that stores the two observed values

    p=c(0.75, 0.25) is the vector that stores the probabilities in the theoretical model (ratio 3:1). (We must name this argument p explicitly, because an unnamed second argument to chisq.test would be matched to y instead of p.)
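
    To make clear what chisq.test computes here, the following sketch reproduces the statistic and p-value by hand from the same counts. The category names tall and dwarf are assumed from the order of the probabilities in the null hypothesis, and the variable names are chosen for illustration.

```r
observed <- c(tall = 787, dwarf = 277)
p_model  <- c(0.75, 0.25)

# Expected counts under H0: total sample size times each model probability
expected <- sum(observed) * p_model

# Chi-square statistic: sum over categories of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom = number of categories - 1
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

# Compare with the built-in test
res <- chisq.test(observed, p = p_model)
c(by_hand = chi_sq, chisq.test = unname(res$statistic))
c(by_hand = p_value, chisq.test = res$p.value)
```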

    -END-



Content is copyright the author. Layout is copyright Corsbook. See Corsbook.com for further notices.