# Biostatistics

#### By Olivia Mah

First published on March 16, 2019

# Biostatistics

### By Olivia Mah

First published on March 16, 2019

resources for biostatistics

resources for biostatistics

1. ## 1. Lesson 8 — 3D scatterplot

### Learning Objectives

Create 3D scatterplots

### Demonstration

Here is the dataset that we are using in this demo:

downs_bc.csv

This dataset is about Down’s Syndrome in Canada. (I have slightly modified the original dataset for our purpose).

https://vincentarelbundock.github.io/Rdatasets/doc/boot/downs.bc.html

We will first import the dataset. Then we will generate a 3D scatterplot from the dataset downs_bc. See below for details.

#### Step 1: Install the R package “scatterplot3d”

Two ways to install a R package one way is to type install.packages(scatterplot3d); the other way is to install via the graphical interface of RStudio: #### Step 2: Install the R package “scatterplot3d”

Before we can use the R package to create 3D scatterplots, we need to load the package into R first:

> library(scatterplot3d)

#### Step 3: Create a 3D scatterplot

Our dataset is called downs_bc. We want to generate a 3D scatterplot for the three columns in that dataset: cases, age and births.

> library(scatterplot3d)
> with(downs_bc,
+      scatterplot3d(cases ~ age + births, pch = 19,
+                    main="Incidence of Down's Syndrome",
+                    xlab="Age", ylab="Cases", zlab="Births"))



Explanation:

We use the library function to load the R package “scatterplot3d“.

Then we use generate the 3D scatterplot from the databset downs_bc:

cases ~ age means that the variable cases is explained by age (i.e. age is the explanatory variable);

+ births means that the variable births is the third variable in this scatterplot;

pch = 19 means to use plotting symbol 19 (solid circle) in our plot. (Type ?pch at the prompt to find out more about pch.)

Here is the 3D scatterplot: -END-

2. ## 2. Lesson 9—Import data via R scripts

### Learning Objectives

Write code to important data in .R scripts

### Demonstration

Instead of using the Import Dataset button in RStudio as we have been doing so far, we will write R codes in our .R script to do the data importing.

The advantage of importing your data via your .R script is that you can re-import your dataset and re-run your script in one smooth operation so as to generate a clean output.

Here are the steps:

#### Step 1: Bring up the Import Dataset screen

In RStudio,

i) Click the Import Dataset button and choose From Text (readr)….

ii) Then select the dataset that you are going to use BUT DO NOT click the IMPORT button at the bottom of the screen yet.

iii) Instead, copy the lines of R code in the box Code Preview in the lower left-hand corner of the screen and then paste the lines of R code into your .R script. Here is what your .R script should look like after you paste the lines of R code and save your script: -END-

3. ## 3. Lesson 10—Matched-Pairs t tests

### Learning Objectives

Perform a matched-pair hypothesis t test

### Demonstration

Here is the dataset that we are using in this demo from #17.16 in our textbook:

sitting_squatting.csv

Goal: To conduct a two-sided t test of no difference for a matched-pair t test

Here are the null and alternative hypotheses in this example:

$$H_0: \mu_1 = \mu_2$$

$$H_a: \mu_1 \neq \mu_2$$

where

$$\mu_1$$ = population mean angle in squatting

$$\mu_2$$= population mean angle in sitting

Here are the steps:

#### Step 2: Create a column of difference

w = sitting_squatting$Sitting - sitting_squatting$Squatting



DO NOT use name your column of difference as diff  because diff is a built-in function in R.

In this demonstration, we name our column of difference w.

#### Step 3: Draw a stemplot to check data

Draw a stemplot on the column of difference to check if data are roughly symmetric and without too many extreme outliers:

stem(w)


#### Step 3: Run the matched-pair t test via the R function t.test

t.test(sitting_squatting$Sitting, sitting_squatting$Squatting,
mu=0, paired=TRUE,
conf.level = 0.95)

Explanation:

Whenever R runs a hypothesis test, R automatically calculates the corresponding confidence interval —the range of values which the population mean is estimated to lie within.

Given a set of data, the corresponding hypothesis test result and the confidence interval are closely related. Therefore if we want the significance level $$\alpha$$ to be 0.05, then we set the argument conf.level = 0.95  because  conf.level = 1 – $$\alpha$$ .

By default, R automatically sets $$\mu = 0$$ and conf.level = 0.95 even if you don’t explicitly type these arguments. So you can skip typing these arguments into the t.test function if you are testing a two-sided alternative hypothesis with  $$\alpha =0.05$$.

-END-

4. ## 4. Lesson 11—Two-Sample t tests

### Learning Objectives

Perform a two-sample t hypothesis test

### Demonstration

Goal:

To conduct a two-sample t test with a two-sided alternative hypothesis. The following example to test if there is a difference between heights of plants grown with and without fertilizers (see p111 in ).

Here are the null and alternative hypotheses in this example:

$$H_0: \mu_1 = \mu_2$$

$$H_a: \mu_1 \neq \mu_2$$

where

$$\mu_1$$ = population mean height of plants grown without fertilizers

$$\mu_2$$= population mean height of plants grown with fertilizers

Here are the steps:

#### Step 1: Enter the data

We create a vector called cont to store heights of plants grown without fertilizers.

cont = c(64.7, 86.6, 67.1, 62.6, 75.1, 83.8, 71.7, 83.4, 90.3, 82.7)



We then create another vector called fert to store heights of plants grown with fertilizers.

fert = c(110.3, 130.4, 114.0, 135.7, 129.9, 98.2, 109.4, 131.4, 127.9, 125.7)



#### Step 3: Draw boxplots to check data

We draw two boxplots to check if the data are roughly symmetric and without too many extreme outliers:

boxplot(cont, fert, names =c("Control", "Fertilizer"),
xlab = "Treatment", ylab = "Plant Height (cm)",
main = "Plants with(out) Fertilizer", cex.lab =1.5)


Explanation:

The argument cex.lab magnifies the labels (default value is 1).

#### Step 4: Run the two-sample t test via the R function t.test

t.test(cont, fert, mu = 0, conf.level = 0.99)

Explanation:

Whenever R runs a hypothesis test, R automatically calculates the corresponding confidence interval —the range of values which the population mean is estimated to lie within.

Given a set of data, the corresponding hypothesis test result and the confidence interval are closely related. Therefore if we want the significance level $$\alpha$$ to be 0.01, then we set the argument conf.level = 0.99  because  conf.level = 1 – $$\alpha$$ .

By default, R automatically sets  mu=0  and conf.level = 0.95 even if you don’t explicitly type these arguments. So you can skip typing these arguments into the t.test function if you are testing a two-sided alternative hypothesis with  $$\alpha =0.05$$.

### References

 Hartvigsen, G. 2014. A Premier in Biological Data Analysis and Visualization Using R. Columbia University Press.

-END-

5. ## 5. Lesson 12—The Chi-Square Test for Goodness of Fit

### Learning Objectives

Perform a chi-square goodness-of-fit test.

### Demonstration

Goal: To conduct a chi-square goodness-of-test for #21.24 in our textbook.

For #21.24, the null hypothesis is:

$$H_0: p_{tall} = 0.75, p_{dwarf} = 0.25$$

Here is the R code:

chisq.test(c(787, 277), p=c(0.75, 0.25))



Explanation:

We use the R function chisq.test to run a chi-square test.

The arguments for the chisq.test function are:

c(787, 277) is the vector that stores the two observed values

p=c(0.75, 0.25) is the vector that stores the probabilities in the theoretical model (ratio 3:1).  (We must use the letter p but not any other letters in this chisq.test function).

-END-