---
title: "Using GxEScanR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using GxEScanR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, include = FALSE}
library(GxEScanR)
library(BinaryDosage)
```

## Introduction

GxEScanR is designed to efficiently run genome-wide association study (GWAS) and
genome-wide by environment interaction study (GWEIS) scans using imputed genotypes
stored in the BinaryDosage format. The phenotype to be analyzed can either be a
continuous or binary trait. The GWEIS scan performs multiple tests that can be
used in two-step methods.

## Notation and Models

There are five models that can be fit using GxEScanR when running a GWEIS, plus
one that must be fit before running the GWEIS.

$Y$ represents the outcome or phenotype

$X$ represents the confounding covariates

$E$ represents the covariate that will have its genetic interaction tested

$G$ represents the genetic data

$\beta$ represents the effect estimate that will be fitted in the model

### Model 0 - Environment-only Model

This model contains the phenotype, confounding covariates, and the interaction
covariate. It must be fit prior to running the GWEIS using the glm() routine.

$$Y = \beta_X X + \beta_E E$$

### Model 1 - Gene-only Model

This model contains all the values from the environment-only model and adds the
genetic data.

$$Y = \beta_X X + \beta_E E + \beta_G G$$

$\beta_G$ is the only term tested in this model and is represented by `bg_ge`
in the output.

### Model 2 - GxE Interaction Model

This model contains all the values from the gene-only model and adds the
gene-environment interaction term.

$$Y = \beta_X X + \beta_E E + \beta_G G + \beta_{GxE} GE$$

$\beta_G$ and $\beta_{GxE}$ are both tested for this model as well as a joint
test for both terms. They are represented by `bg_gxe`, `bgxe`, and `joint`,
respectively, in the output.

### Model 3 - EG Model

This model tests for a relationship between the environmental value of interest
and the gene. If $E$ is coded 0,1, a logistic model is fitted; otherwise a
linear model is fitted.

$$E = \beta_X X + \beta_G G$$

$\beta_G$ is tested in this model and is represented by `bg_eg` in the output.

### Model 4 - Case-only Model

This model is the same as model 3 except it is run on cases only.
$\beta_G$ is tested and is represented by `bg_case` in the output.
This model can only be run on a dichotomous phenotype.

### Model 5 - Control-only Model

This model is the same as model 3 except it is run on controls only.
$\beta_G$ is tested and is represented by `bg_ctrl` in the output.
This model can only be run on a dichotomous phenotype.

## Steps to Run a GWEIS

There are five steps to running a GWEIS using GxEScanR:

1. Preparing the Data (Genetic Data + Subject Data)
2. Fitting the Base Model
3. Allocating Memory for the GWEIS
4. Running the GWEIS
5. Processing the Results

## Step 1: Preparing the Data

### Genetic Data

GxEScanR uses data stored in the BinaryDosage format. The BinaryDosage package
converts files from the VCF format to the BinaryDosage format using the
`vcftobd()` routine. The `getbdinfo()` routine is then used to load the
information about the BinaryDosage file needed to run the GWEIS.

The following example shows how to convert a VCF file to BinaryDosage format.
Note that `vcftobd()` requires the `vcfppR` package.

```{r genetic-data-convert, eval = FALSE}
vcffile <- system.file("extdata", "gendata.vcf.gz", package = "GxEScanR")

exampledir <- tempdir()
bdosefile <- file.path(exampledir, "gendata.bdose")

BinaryDosage::vcftobd(vcffile = vcffile, bdose_file = bdosefile)
bdinfo <- BinaryDosage::getbdinfo(bdosefile)
```

This vignette uses a pre-converted BinaryDosage file included with the package.

```{r genetic-data}
bdosefile <- system.file("extdata", "gendata.bdose", package = "GxEScanR")
bdinfo <- BinaryDosage::getbdinfo(bdosefile)
exampledir <- tempdir()
```

### Subject Data

The subject data consists of subject IDs used to link the genetic data, and the
phenotype and covariates used in the GWEIS. The data included with GxEScanR has
two phenotypes (one continuous, one dichotomous) and two covariates (x1 and x2).

**Important:** All values for phenotypes and covariates must be numeric; factors
are not allowed.

**Important:** All subject IDs must be character values.

**Important:** When using a dichotomous phenotype, it must be coded 0,1.

```{r subject-data}
subjectfile <- system.file("extdata", "subdata.rds", package = "GxEScanR")
subjectdata <- readRDS(subjectfile)
head(subjectdata)
```

## Step 2: Fitting the Base Model

The base model is fit using glm() using data that contains only the subject IDs,
phenotype, and covariates of interest — no genetic data.

It is important to:

- Remove subjects with missing data.
- Keep only subjects who have genetic data (listed in `bdinfo$samples`).
- Keep the subject IDs in the data frame, as they are used to match genetic data
  to the correct phenotype and covariates.

The formula passed to glm() must list the covariate whose interaction with the
genetic data is being tested **last**. In this example we test the interaction
with x1, so the formula is `y ~ x2 + x1`.

### Continuous Phenotype

```{r linear-model}
# Remove y_logistic from the subject data
lineardata <- subjectdata[, c(1, 2, 4, 5)]

# Keep only subjects with complete data
lineardata <- lineardata[complete.cases(lineardata), ]

# Keep only subjects with genetic data
lineardata <- lineardata[lineardata$subid %in% bdinfo$samples$sid, ]

# Fit the linear model
linearmodel <- glm(y_linear ~ x2 + x1, data = lineardata)
```

### Dichotomous Phenotype

```{r logistic-model}
# Remove y_linear from the subject data
logisticdata <- subjectdata[, c(1, 3, 4, 5)]

# Keep only subjects with complete data
logisticdata <- logisticdata[complete.cases(logisticdata), ]

# Keep only subjects with genetic data
logisticdata <- logisticdata[logisticdata$subid %in% bdinfo$samples$sid, ]

# Fit the logistic model
logisticmodel <- glm(y_logistic ~ x2 + x1, family = binomial, data = logisticdata)
```

## Step 3: Allocating Memory for the GWEIS

Memory must be allocated before running the GWEIS using `gweis.mem()`. This
pre-computation makes the scan substantially faster. Pass the fitted base model,
the subject IDs from that model's data, and the list of tests to perform.

Typical test sets:

- **Continuous phenotype:** `"bg_ge"`, `"bg_gxe"`, `"bgxe"`, `"joint"`
- **Dichotomous phenotype:** `"bg_ge"`, `"bg_gxe"`, `"bgxe"`, `"joint"`,
  `"bg_eg"`, `"bg_case"`, `"bg_ctrl"`

```{r allocate-memory}
# Continuous outcome
linearmem <- gweis.mem(gemdl = linearmodel,
                       subids = lineardata$subid,
                       tests = c("bg_ge", "bg_gxe", "bgxe", "joint"))

# Dichotomous outcome
logisticmem <- gweis.mem(gemdl = logisticmodel,
                         subids = logisticdata$subid,
                         tests = c("bg_ge", "bg_gxe", "bgxe", "joint",
                                   "bg_eg", "bg_case", "bg_ctrl"))
```

## Step 4: Running the GWEIS

Pass the memory object from `gweis.mem()`, the `bdinfo` object, a vector of SNP
indices or IDs to test, and an output file path to `rungweis()`. Results are
written as a tab-delimited text file.

```{r run-gweis}
snpindex <- 1:nrow(bdinfo$snps)

# Continuous outcome
linearresults <- file.path(exampledir, "linear.txt")
rungweis(gweismem = linearmem,
         bdinfo = bdinfo,
         snps = snpindex,
         outfilename = linearresults)

# Dichotomous outcome
logisticresults <- file.path(exampledir, "logistic.txt")
rungweis(gweismem = logisticmem,
         bdinfo = bdinfo,
         snps = snpindex,
         outfilename = logisticresults)
```

## Step 5: Processing the Results

Read the output file with `read.table()`. The following describes the output
columns, illustrated with the first three SNPs.

```{r read-results}
lineardf <- read.table(linearresults, header = TRUE, sep = "\t")
logisticdf <- read.table(logisticresults, header = TRUE, sep = "\t")
```

### SNP Information and Allele Frequency

Every row contains SNP ID (`snpid`), chromosome (`chr`), position (`loc`),
reference allele (`ref`), alternate allele (`alt`), and alternate allele
frequency (`aaf`). For a dichotomous outcome, `aaf_case` and `aaf_ctrl` are
also included.

```{r snp-info-linear}
knitr::kable(lineardf[1:3, 1:6], caption = "SNP information — continuous outcome")
```

```{r snp-info-logistic}
knitr::kable(logisticdf[1:3, 1:8], caption = "SNP information — dichotomous outcome")
```

### Model 1 Output: `bg_ge`

`bg_ge` is the $\beta_G$ estimate from model 1. `bg_ge_lrt` is the LRT 1 df
chi-squared statistic for $\beta_G = 0$. Output when `"bg_ge"` is in `tests`.

```{r model1-output}
knitr::kable(lineardf[1:3, 7:8], caption = "Model 1 results")
```

### Model 2 Output: `bg_gxe`, `bgxe`, `joint`

- `bg_gxe` / `bg_gxe_lrt`: $\beta_G$ estimate and LRT from model 2. Output
  when `"bg_gxe"` is in `tests`.
- `bgxe` / `bgxe_lrt`: $\beta_{GxE}$ estimate and LRT. Output when `"bgxe"`
  is in `tests`.
- `joint_lrt`: 2 df LRT for $\beta_G = 0$ and $\beta_{GxE} = 0$ jointly.
  Output when `"joint"` is in `tests`.

```{r model2-output}
knitr::kable(lineardf[1:3, 9:13], caption = "Model 2 results")
```

### Models 3–5 Output: `bg_eg`, `bg_case`, `bg_ctrl`

Each test produces a $\beta_G$ estimate and a 1 df LRT statistic. Output when
`"bg_eg"`, `"bg_case"`, or `"bg_ctrl"` are in `tests`, respectively.

```{r model3-5-output}
knitr::kable(logisticdf[1:3, 16:21], caption = "Models 3-5 results")
```