---
title: Pre-processing pipelines in multiblock
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Pre-processing pipelines in multiblock}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
params:
  family: red
css: albers.css
resource_files:
- albers.css
- albers.js
includes:
  in_header: |-
    <script src="albers.js"></script>
    <script>document.addEventListener('DOMContentLoaded',()=>document.body.classList.add('palette-red'));</script>

---

```{r setup, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width=6, fig.height=4)
library(multivarious)
library(dplyr) # Needed for %>% and tibble manipulation
library(tibble)
library(ggplot2)
```

# 1. Why a pipeline at all?

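Calling `scale(X)` ad hoc is convenient in a script but risky inside reusable
functions: the means and SDs it computes are easy to lose, or to recompute on
the wrong data. A pre-processing pipeline object addresses this and offers: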

*   **Data-leak avoidance**: Fitted means/SDs live inside the pre-processor object, calculated only once (typically on training data).
*   **Reversibility**: `inverse_transform()` gives you proper back-transforms (handy for reconstruction error or publication plots).
*   **Composability**: You can nest simple steps together (e.g., `colscale(center())`).
*   **Partial input**: The same pipeline can process just the columns you pass (`transform(..., colind = 1:3)`), perfect for region-of-interest or block workflows.
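
The data-leak point can be shown with plain arithmetic. The base-R sketch below does not use the package and is not its internal code; the names `X_leak`, `mu_train` and `sd_train` are invented for illustration. It computes the parameters on training rows only and contrasts that with re-scaling the held-out rows on their own statistics.

```{r leak_sketch}
# Illustrative base-R sketch; not the package's implementation.
X_leak <- cbind(a = c(1, 2, 3, 4, 5, 6), b = c(10, 30, 20, 40, 60, 50))
train_rows <- 1:4
X_leak_train <- X_leak[train_rows, , drop = FALSE]
X_leak_test  <- X_leak[-train_rows, , drop = FALSE]

# "Fit": parameters come from the training rows only
mu_train <- colMeans(X_leak_train)
sd_train <- apply(X_leak_train, 2, sd)

# "Transform": reuse the stored parameters on the held-out rows
test_scaled <- sweep(sweep(X_leak_test, 2, mu_train, "-"), 2, sd_train, "/")

# Leaky alternative: scale() recomputes means/SDs from the test rows themselves
test_leaky <- scale(X_leak_test)

colMeans(test_scaled)  # clearly non-zero: test rows expressed in training units
colMeans(test_leaky)   # zero by construction, which hides the shift
```

The second result looks tidier, but it discards exactly the information (how far the new rows sit from the training data) that a model evaluated on held-out data needs.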

The grammar is tiny:

| Verb          | Role                           | Typical Call                       |
|---------------|--------------------------------|------------------------------------|
| `pass()`      | do nothing (placeholder)       | `fit(pass(), X)`                   |
| `center()`    | subtract column means          | `fit(center(), X)`                 |
| `standardize()` | centre and scale to unit SD    | `fit(standardize(), X)`            |
| `colscale()`  | user-supplied weights/scaling  | `fit(colscale(type="z"), X)`       |
| `...`         | (write your own)               | any function returning a node      |

The `fit()` verb is the bridge between defining your preprocessing steps (the *recipe*) and actually applying them. You call `fit()` on your recipe, providing your training dataset. `fit()` calculates and stores the necessary parameters (e.g., column means, standard deviations) from this data, returning a *fitted pre-processor* object.

Together with `fit()`, three verbs do the actual work. Note that `fit_transform()` fits the parameters itself, whereas `transform()` and `inverse_transform()` expect an already fitted object (`pp`):

| Call                       | Role                                          | Typical Use Case           |
|----------------------------|-----------------------------------------------|----------------------------|
| `fit_transform(prep, X)`   | fits parameters *and* transforms `X`          | Training set (convenience) |
| `transform(pp, Xnew)`      | applies stored parameters to new data         | Test/new data              |
| `inverse_transform(pp, Y)` | back-transforms data using stored parameters  | Interpreting results       |
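
To make the fit / transform / inverse idea concrete, here is a minimal base-R sketch of what a fitted standardizer conceptually stores. The helpers `sketch_fit()`, `sketch_transform()` and `sketch_inverse()` are hypothetical and exist only in this example; the package's own objects are richer than a plain list.

```{r fit_sketch}
# Hypothetical helpers, written only to illustrate the idea of a fitted object.
sketch_fit <- function(X) {
  list(mean = colMeans(X), sd = apply(X, 2, sd))   # parameters learned once
}
sketch_transform <- function(pp, X) {
  sweep(sweep(X, 2, pp$mean, "-"), 2, pp$sd, "/")  # apply stored parameters
}
sketch_inverse <- function(pp, Y) {
  sweep(sweep(Y, 2, pp$sd, "*"), 2, pp$mean, "+")  # undo in reverse order
}

X_sk  <- cbind(c(1, 2, 3, 4), c(10, 20, 40, 50))
pp_sk <- sketch_fit(X_sk)
Y_sk  <- sketch_transform(pp_sk, X_sk)

all.equal(sketch_inverse(pp_sk, Y_sk), X_sk)  # round trip recovers the input
```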


# 2. The 60-second tour

## 2.1 No-op and sanity check

```{r setup_data_preproc}
set.seed(0)
X <- matrix(rnorm(10*4), 10, 4)

pp_pass <- fit(pass(), X)        # == do nothing
Xp_pass <- transform(pp_pass, X) # applies nothing, just copies X
all.equal(Xp_pass, X)            # TRUE
```

## 2.2 Centre → standardise

```{r standardize_example}
# Fit the preprocessor (calculates means & SDs from X) and transform
pp_std <- fit(standardize(), X)
Xs     <- transform(pp_std, X)

# Check results
all(abs(colMeans(Xs)) < 1e-12)   # TRUE: data is centered
round(apply(Xs, 2, sd), 6)       # ~1: data is scaled

# Check back-transform
all.equal(inverse_transform(pp_std, Xs), X) # TRUE
```

## 2.3 Partial input (region-of-interest)

Imagine a sensor fails and you only observe columns 2 and 4:

```{r partial_transform}
X_cols24 <- X[, c(2,4), drop=FALSE] # Keep as matrix

# Apply the *already fitted* standardizer using only columns 2 & 4
Xs_cols24 <- transform(pp_std, X_cols24, colind = c(2,4))

# Compare original columns 2, 4 with their transformed versions
head(cbind(X_cols24, Xs_cols24))

# Back-transform works too
X_rev_cols24 <- inverse_transform(pp_std, Xs_cols24, colind = c(2,4))
all.equal(X_rev_cols24, X_cols24) # TRUE
```
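
Why partial input can work at all: the stored parameters are per column, so a column subset only needs the matching entries. The base-R sketch below assumes that mechanism for illustration; `mu_full`, `sd_full` and `sketch_transform_cols()` are made-up names, and the numbers are invented rather than fitted.

```{r colind_sketch}
# Hypothetical sketch: pretend these parameters were fitted on 4 columns.
mu_full <- c(1, 10, 100, 1000)
sd_full <- c(1, 2, 4, 8)

sketch_transform_cols <- function(X_sub, colind) {
  # One supplied column per requested index
  stopifnot(ncol(X_sub) == length(colind))
  sweep(sweep(X_sub, 2, mu_full[colind], "-"), 2, sd_full[colind], "/")
}

# Only columns 2 and 4 were observed
X_sub <- cbind(c(10, 12), c(1000, 1008))
Y_sub <- sketch_transform_cols(X_sub, colind = c(2, 4))
Y_sub
```

The first row equals the stored means for columns 2 and 4, so it maps to zero; the second row sits exactly one stored SD above them.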

# 3. Composing preprocessing steps

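Because preprocessing steps nest, you can build pipelines by composing them (e.g., `colscale(center())`). The simplest ready-made composition is `standardize()`, which centres and then scales to unit SD in a single recipe: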

```{r pipe_example}
# standardize() bundles two steps: center, then scale to unit SD.
# Fit the recipe to the data
pp_pipe <- fit(standardize(), X)

# Apply the pipeline
Xp_pipe <- transform(pp_pipe, X)
```
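
Composition has one rule worth spelling out: the forward pass applies the steps first to last, and the inverse must undo them last to first. The base-R sketch below illustrates that ordering with hypothetical step constructors (`step_center()`, `step_scale()`); it is a sketch of the general idea, not the package's nesting mechanism.

```{r compose_sketch}
# Hypothetical steps: each is a pair of forward/inverse functions.
step_center <- function(mu) list(
  forward = function(X) sweep(X, 2, mu, "-"),
  inverse = function(Y) sweep(Y, 2, mu, "+")
)
step_scale <- function(s) list(
  forward = function(X) sweep(X, 2, s, "/"),
  inverse = function(Y) sweep(Y, 2, s, "*")
)

X_cmp <- cbind(c(2, 4, 6), c(10, 20, 60))
steps <- list(step_center(colMeans(X_cmp)), step_scale(apply(X_cmp, 2, sd)))

# Forward: apply steps first to last
Y_cmp <- Reduce(function(acc, s) s$forward(acc), steps, X_cmp)

# Inverse: undo steps last to first
X_back <- Reduce(function(acc, s) s$inverse(acc), rev(steps), Y_cmp)

all.equal(X_back, X_cmp)
```

Dropping the `rev()` would add the means back before multiplying by the SDs, and the round trip would no longer recover `X_cmp`.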

## 3.1 Quick visual

```{r plot_pipeline}
# Compare first column before and after pipeline
df_pipe <- tibble(raw = X[, 1], processed = Xp_pipe[, 1])

ggplot(df_pipe) +
  geom_density(aes(raw), colour = "red", linewidth = 1) +
  geom_density(aes(processed), colour = "blue", linewidth = 1) +
  ggtitle("Column 1 Density: Before (red) and After (blue) Pipeline") +
  theme_minimal()
```

# 4. Block-wise concatenation

Large multiblock models often want different preprocessing per block.
`concat_pre_processors()` glues several *already fitted* pipelines into one
wide transformer that understands global column indices.

```{r concat_example}
# Two fake blocks with distinct scales
X1 <- matrix(rnorm(10 * 5, mean = 10, sd = 5), 10, 5)   # block 1: mean 10, SD 5
X2 <- matrix(rnorm(10 * 7, mean = 2,  sd = 7), 10, 7)   # block 2: mean 2, SD 7

# Fit separate preprocessors for each block
p1 <- fit(center(), X1)
p2 <- fit(standardize(), X2)

# Transform each block
X1p <- transform(p1, X1)
X2p <- transform(p2, X2)

# Concatenate the *fitted* preprocessors
block_indices_list <- list(1:5, 6:12)
pp_concat <- concat_pre_processors(
  list(p1, p2),
  block_indices = block_indices_list
)

# Apply the concatenated preprocessor to the combined data
X_combined <- cbind(X1, X2)
X_combined_p <- transform(pp_concat, X_combined)

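# Column means are ~0 in both blocks (both preprocessors center)
round(colMeans(X_combined_p), 2)

# SDs tell the blocks apart: block 1 keeps its raw spread, block 2 is ~1
round(apply(X_combined_p, 2, sd), 2)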

# Need only block 1 processed later? Use colind with global indices
X1_later_p <- transform(pp_concat, X1, colind = block_indices_list[[1]])
all.equal(X1_later_p, X1p) # TRUE

# Need block 2 processed?
X2_later_p <- transform(pp_concat, X2, colind = block_indices_list[[2]])
all.equal(X2_later_p, X2p) # TRUE
```
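
"Understands global column indices" amounts to a lookup: each global index belongs to one block and has a position inside it. The base-R sketch below shows one way such routing could work; `route_columns()` is a hypothetical helper and makes no claim about how `concat_pre_processors()` is implemented.

```{r concat_sketch}
# Hypothetical sketch of routing global column indices to block-local ones.
block_idx <- list(1:5, 6:12)

route_columns <- function(colind, block_idx) {
  lapply(seq_along(block_idx), function(b) {
    hit <- colind %in% block_idx[[b]]
    list(
      block = b,
      local = match(colind[hit], block_idx[[b]]),  # position within the block
      where = which(hit)                           # position in the supplied data
    )
  })
}

routes <- route_columns(c(2, 7, 12), block_idx)
routes[[1]]$local  # 2
routes[[2]]$local  # 2 7
```

Global column 2 is the second column of block 1, while global columns 7 and 12 are the second and seventh columns of block 2, so each block's own parameters can be applied to the right slice.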

### Check reversibility of concatenated pipeline

```{r concat_reversibility}
back_combined <- inverse_transform(pp_concat, X_combined_p)

# Compare first few rows/cols of original vs round-trip
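# Label columns explicitly: cbind() ignores argument names for matrix inputs
cmp <- cbind(X_combined[, 1:6], back_combined[, 1:6])
colnames(cmp) <- c(paste0("orig", 1:6), paste0("recon", 1:6))
knitr::kable(
  head(cmp, 3),
  digits = 2,
  caption = "First 3 rows, columns 1-6: Original vs Reconstructed"
)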

all.equal(X_combined, back_combined) # TRUE
```

# 5. Inside the weeds (for authors & power users)

| Helper                    | Purpose                                                              |
|---------------------------|----------------------------------------------------------------------|
| `fresh(pp)`               | return the un-fitted recipe skeleton. **Crucial for tasks like cross-validation (CV)**, as it allows you to re-`fit()` the pipeline using *only* the current training fold's data, preventing data leakage from other folds or the test set. |
| `concat_pre_processors()` | build one big transformer out of already-fitted pieces.               |
| `pass()` vs `fit(pass(), X)` | `pass()` is a recipe; `fit(pass(), X)` is a fitted identity transformer. |
| caching                   | Fitted preprocessor objects store parameters (means, SDs) for fast re-application. |

You rarely need to interact with these helpers directly; they exist so
model-writers (e.g. new PCA flavours) can avoid boiler-plate.
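
For cross-validation, the pattern is to re-fit inside every fold. The sketch below uses only calls that appear earlier in this vignette (`fit()`, `standardize()`, `transform()`); the toy data, the fold layout and the variable names are assumptions made for the example, and the guard simply skips the code if the package is unavailable.

```{r cv_sketch}
# Sketch only: skipped when the package is not installed.
if (requireNamespace("multivarious", quietly = TRUE)) {
  library(multivarious)

  # Deterministic toy data: 8 rows, 3 columns, 4 folds of 2 rows each
  X_cv  <- matrix(sqrt(seq_len(24)), nrow = 8, ncol = 3)
  folds <- rep(1:4, each = 2)
  max_train_mean <- numeric(0)

  for (k in unique(folds)) {
    X_train <- X_cv[folds != k, , drop = FALSE]
    X_test  <- X_cv[folds == k, , drop = FALSE]

    pp_k    <- fit(standardize(), X_train)  # parameters from this fold's training rows only
    train_s <- transform(pp_k, X_train)
    test_s  <- transform(pp_k, X_test)      # held-out rows reuse the stored parameters

    max_train_mean <- c(max_train_mean, max(abs(colMeans(train_s))))
  }

  max_train_mean  # near zero in every fold
}
```

When only a fitted object is at hand, `fresh(pp)` is described above as the way to get back the un-fitted recipe, which could then take the place of `standardize()` in the loop.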

# 6. Key take-aways

*   **Write once**: Define a preprocessing recipe (e.g., `colscale(center())`) and reuse it safely across CV folds using `fit()` on each fold's training data.
*   **No data leakage**: Parameters live inside the fitted preprocessor object, calculated only from training data.
*   **Composable & reversible**: Nest preprocessing steps, extract the original recipe with `fresh()`, and back-transform whenever you need results in original units using `inverse_transform()`.
*   **Block-aware**: The same mechanism powers multiblock PCA, CCA, ComDim…

Happy projecting!

---

# Session info

```{r session_info_preproc}
sessionInfo()
``` 
