---
title: 'Contrastive PCA: Finding What''s Different Between Groups'
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Contrastive PCA: Finding What's Different Between Groups}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
params:
  family: red
css: albers.css
resource_files:
- albers.css
- albers.js
includes:
  in_header: |-
    <script src="albers.js"></script>
    <script>document.addEventListener('DOMContentLoaded',()=>document.body.classList.add('palette-red'));</script>

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4)
library(multivarious)
library(ggplot2)
```

## What is Contrastive PCA?

Imagine you're studying two groups: patients with a disease and healthy controls. Both groups show variation in their measurements, but you're specifically interested in what makes the patients *different*. Standard PCA would find the largest sources of variation across all samples, which might be dominated by age, sex, or other factors common to both groups. 

**Contrastive PCA (cPCA++) finds patterns that are enriched in one group (foreground) compared to another (background).**

## A Simple Example

Let's start with a practical example to see why contrastive PCA is useful:

```{r motivation_example}
set.seed(123)
n_samples <- 100
n_features <- 50

# Create background data (e.g., healthy controls)
# Main variation is in features 1-10
background <- matrix(rnorm(n_samples * n_features), n_samples, n_features)
background[, 1:10] <- background[, 1:10] * 3  # Strong common variation

# Create foreground data (e.g., patients)
# Has the same common variation PLUS disease-specific signal in features 20-25
foreground <- background[1:60, ]  # Reuse 60 control rows so the shared variation is identical
foreground[, 20:25] <- foreground[, 20:25] + matrix(rnorm(60 * 6, sd = 2), 60, 6)

# Standard PCA on combined data
all_data <- rbind(background, foreground)
regular_pca <- pca(all_data, ncomp = 2)

# Contrastive PCA
cpca_result <- cPCAplus(X_f = foreground, X_b = background, ncomp = 2)

# Compare what each method finds
loadings_df <- rbind(
  data.frame(
    feature = factor(1:30),
    value   = abs(regular_pca$v[1:30, 1]),
    method  = "Standard PCA"
  ),
  data.frame(
    feature = factor(1:30),
    value   = abs(cpca_result$v[1:30, 1]),
    method  = "Contrastive PCA"
  )
)

# Keep panel order consistent with the discussion below
loadings_df$method <- factor(loadings_df$method,
                             levels = c("Standard PCA", "Contrastive PCA"))

ggplot(loadings_df, aes(x = feature, y = value)) +
  geom_col(fill = "#1f78b4") +
  facet_wrap(~method, nrow = 1) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    x = "Feature",
    y = "|Loading|",
    title = "Top loading coefficients for PC1"
  )
```

Notice how standard PCA focuses on features 1-10 (the common variation), while contrastive PCA correctly identifies features 20-25 (the group-specific signal).
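
We can make this concrete by comparing the average loading magnitude on the two feature blocks. This quick check reuses the objects created above:

```{r loading_comparison}
# Mean |loading| on the shared-variation block (1-10) versus the
# group-specific block (20-25), for the first component of each method
block_summary <- data.frame(
  method         = c("Standard PCA", "Contrastive PCA"),
  common_1_10    = c(mean(abs(regular_pca$v[1:10, 1])),
                     mean(abs(cpca_result$v[1:10, 1]))),
  specific_20_25 = c(mean(abs(regular_pca$v[20:25, 1])),
                     mean(abs(cpca_result$v[20:25, 1])))
)
print(block_summary)
```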

## Using cPCAplus()

The `cPCAplus()` function makes contrastive PCA easy to use:

```{r basic_usage}
# Basic usage
cpca_fit <- cPCAplus(
  X_f = foreground,  # Your group of interest (foreground)
  X_b = background,  # Your reference group (background)
  ncomp = 5          # Number of components to extract
)

# The result is a bi_projector object with familiar methods
print(cpca_fit)

# Project new data
new_samples <- matrix(rnorm(10 * n_features), 10, n_features)
new_scores <- project(cpca_fit, new_samples)

# Reconstruct using top components
reconstructed <- reconstruct(cpca_fit, comp = 1:2)
```
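
Both results are plain matrices, so a quick shape check confirms what came back:

```{r output_shapes}
dim(new_scores)     # 10 new samples x 5 components
dim(reconstructed)  # reconstruction lives in the original feature space
```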

## Understanding the Output

`cPCAplus()` returns a `bi_projector` object containing:

- **`v`**: Loadings (feature weights) for each contrastive component
- **`s`**: Scores, i.e. the projections of the foreground samples onto those components
- **`sdev`**: Standard deviations of the foreground scores along each component
- **`values`**: The generalized eigenvalues, i.e. the ratio of foreground to background variance captured by each component

```{r understanding_output}
# Which features contribute most to the first contrastive component?
top_features <- order(abs(cpca_fit$v[, 1]), decreasing = TRUE)[1:10]
print(paste("Top contributing features:", paste(top_features, collapse = ", ")))

# How much more variable is each component in foreground vs background?
print(paste("Variance ratios:", paste(round(cpca_fit$values[1:3], 2), collapse = ", ")))
```
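
A simple way to decide how many contrastive components are worth keeping is to look for variance ratios well above 1 (a ratio of 1 means a direction is equally variable in both groups). A minimal sketch, assuming `values` holds one generalized eigenvalue per extracted component:

```{r variance_ratio_plot}
ratio_df <- data.frame(
  component = factor(seq_along(cpca_fit$values)),
  ratio     = cpca_fit$values
)

ggplot(ratio_df, aes(x = component, y = ratio)) +
  geom_col(fill = "#1f78b4") +
  geom_hline(yintercept = 1, linetype = "dashed") +
  theme_minimal(base_size = 12) +
  labs(
    x = "Component",
    y = "Foreground / background variance",
    title = "Contrastive variance ratios"
  )
```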

## Common Applications

### 1. Biomedical Studies
```{r biomedical_example, eval=FALSE}
# Identify disease-specific patterns
tumor_cpca <- cPCAplus(
  X_f = tumor_samples,
  X_b = healthy_tissue,
  ncomp = 10
)
```

### 2. Technical Variation Removal
```{r technical_example, eval=FALSE}
# Use technical replicates as background to find biological signal
bio_cpca <- cPCAplus(
  X_f = biological_samples,
  X_b = technical_replicates,
  ncomp = 5
)
```

### 3. Time-Based Contrasts
```{r time_example, eval=FALSE}
# Find patterns specific to treatment timepoint
treatment_cpca <- cPCAplus(
  X_f = after_treatment,
  X_b = before_treatment,
  ncomp = 5
)
```

## Advanced Options

### Handling High-Dimensional Data

When you have many more features than samples (p >> n), use the sample-space strategy, which solves the problem in the smaller n-dimensional sample space rather than the p-dimensional feature space:

```{r high_dim}
# Create high-dimensional example
n_f <- 50; n_b <- 80; p <- 1000
X_background_hd <- matrix(rnorm(n_b * p), n_b, p)
# Inject group-specific signal into the first 20 features
# (column-major fill places the rnorm draws in columns 1:20)
X_foreground_hd <- X_background_hd[1:n_f, ] +
                   matrix(c(rnorm(n_f * 20, sd = 2), rep(0, n_f * (p - 20))), n_f, p)

# Use sample-space strategy for efficiency
cpca_hd <- cPCAplus(X_f = X_foreground_hd, X_b = X_background_hd,
                    ncomp = 5, strategy = "sample")
```
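
As a sanity check, the leading contrastive direction should concentrate its weight on the first 20 features, where the signal was injected:

```{r high_dim_check}
# Fraction of the first component's total loading weight on the signal features
signal_weight <- sum(abs(cpca_hd$v[1:20, 1])) / sum(abs(cpca_hd$v[, 1]))
round(signal_weight, 2)
```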

### Regularization for Unstable Background

If your background covariance is nearly singular, add regularization:

```{r regularization}
# Small background sample size can lead to instability
small_background <- matrix(rnorm(20 * 100), 20, 100)
small_foreground <- matrix(rnorm(30 * 100), 30, 100)

# Add regularization
cpca_regularized <- cPCAplus(
  X_f = small_foreground,
  X_b = small_background,
  ncomp = 5,
  lambda = 0.1  # Regularization parameter for background covariance
)
```
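
There is no universal default for `lambda`. A pragmatic approach is to refit over a small grid and check how stable the leading variance ratios are; if they swing wildly between neighboring values, the background covariance estimate is probably too noisy. A minimal sketch:

```{r lambda_sweep}
# Refit across a coarse lambda grid and watch the top variance ratios
for (lam in c(0.01, 0.1, 1)) {
  fit <- cPCAplus(X_f = small_foreground, X_b = small_background,
                  ncomp = 2, lambda = lam)
  cat("lambda =", lam, "-> leading variance ratios:",
      paste(round(fit$values[1:2], 2), collapse = ", "), "\n")
}
```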

## When to Use Contrastive PCA

✓ **Use contrastive PCA when:**
- You have two groups and want to find patterns specific to one
- Background variation obscures your signal of interest
- You want to remove technical/batch effects captured by control samples

✗ **Don't use contrastive PCA when:**
- You only have one group (use standard PCA)
- Groups differ mainly in mean levels (use t-tests or LDA)
- The interesting variation is non-linear (consider kernel methods)

## Technical Details

<details>
<summary>Click for mathematical details</summary>

Contrastive PCA++ solves the generalized eigenvalue problem:

$$\mathbf{R}_f \mathbf{v} = \lambda \mathbf{R}_b \mathbf{v}$$

where:
- $\mathbf{R}_f$ is the foreground covariance matrix
- $\mathbf{R}_b$ is the background covariance matrix
- $\lambda$ represents the variance ratio (foreground/background)
- $\mathbf{v}$ are the contrastive directions

This finds directions that maximize the ratio of foreground to background variance, effectively highlighting patterns enriched in the foreground group.
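
Equivalently, each direction maximizes the generalized Rayleigh quotient

$$\mathbf{v}_1 = \arg\max_{\mathbf{v} \neq \mathbf{0}} \frac{\mathbf{v}^\top \mathbf{R}_f \mathbf{v}}{\mathbf{v}^\top \mathbf{R}_b \mathbf{v}},$$

with subsequent directions maximizing the same ratio subject to $\mathbf{R}_b$-orthogonality to the earlier ones.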

The `geneig()` function provides the underlying solver with multiple algorithm options:
- `"geigen"`: General purpose, handles non-symmetric matrices
- `"robust"`: Fast for well-conditioned problems
- `"primme"`: Efficient for very large sparse matrices

</details>

## References

- Abid, A., Zhang, M. J., Bagaria, V. K., & Zou, J. (2018). Exploring patterns enriched in a dataset with contrastive principal component analysis. *Nature Communications*, 9(1), 2134.

- Salloum, R., & Kuo, C. C. J. (2022). cPCA++: An efficient method for contrastive feature learning. *Pattern Recognition*, 124, 108378.

- Wu, M., Sun, Q., & Yang, Y. (2025). PCA++: How uniformity induces robustness to background noise in contrastive learning. *arXiv preprint arXiv:2511.12278*.

- Woller, J. P., Menrath, D., & Gharabaghi, A. (2025). Generalized contrastive PCA is equivalent to generalized eigendecomposition. *PLOS Computational Biology*, 21(10), e1013555.

## See Also

- `pca()` for standard principal component analysis
- `discriminant_projector()` for supervised dimensionality reduction
- `geneig()` for solving generalized eigenvalue problems directly
