---
title: "Data Validation Rules"
author: "Duncan Garmonsway"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Validation Rules}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## What data validation rules are

Data validation rules control what constants can be entered into a cell, e.g.
any whole number between 0 and 9, or one of several values from another part of
the spreadsheet.

‘xlsx_validation()’ returns each of the data validation rules in an xlsx file,
and the ranges of cells to which each rule applies.

Here is a rule that restricts input to integers between 0 and 9 inclusive, or no
value (blank).  If any other value is attempted, then an error message is
displayed with the imaginative title "message title", the informative body text
"message body", and a "stop" symbol.

```{r}
library(tidyxl)
library(dplyr)
library(tidyr)
examples <- system.file("extdata/examples.xlsx", package = "tidyxl")
glimpse(xlsx_validation(examples)[1, ])
```

The gamut of possible rules is given in the examples for `xlsx_validation()`.

```{r}
as.data.frame(xlsx_validation(examples))
```

## Joining rules to cells

> There are no built-in functions for joining ranges like `A1:D5,G8` to single
> cells like `B3`.  For now, use the snippets in this section.  In future I
> might develop a dplyr-like join function (this is hard currently because dplyr
> doesn't yet join on arbitrary functions, or even the standard inequalities
> like `>=`).  Help and advice would be gratefully accepted!

To join rules to cells, a naive method is to use the `sheet` and `ref` columns
to match the `sheet` and `address` columns to the output of `xlsx_cells()`.

```{r}
rules <- xlsx_validation(examples)
cells <- filter(xlsx_cells(examples), row >= 106, col == 1)

rules
cells

inner_join(rules, cells, by = c("sheet" = "sheet", "ref" = "address"))
```

Notice that only 9 cells were joined, even though 15 rules were defined.  Surely
at least 15 cells ought to be joined?  The reason why they are not is that the
cells for the other 6 rules don't exist -- rules can be defined for cells that
have no value, and cells with no value are not returned by `xlsx_cells()`,
otherwise all `r sprintf("%11.0f", 16384 * 1048576)` cells in a worksheet must
be returned.

A more subtle reason for certain cells not to have joined successfully is that
the `ref` column of the rules sometimes refers to more than one cell, and can
even refer to several, non-contiguous ranges of cells.  Specifically, the
seventh rule's `ref` column has ``r rules$ref[7]``.

Special treatment is needed here.  Ideally, some kind of join function would be
defined that can compare indidual cells with ranges.  But I haven't written one,
so what follows is a workaround.  First, the two ranges of cells must be
`unnested` into `A115` and `A121:122`.  Then the range `A121:122` must be
'unranged' into `A121` and `A122`.

```{r}
unrange <- function(x) {
  limits <- cellranger::as.cell_limits(x)
  rows <- seq(limits$ul[1], limits$lr[1])
  cols <- seq(limits$ul[2], limits$lr[2])
  rowcol <- expand.grid(rows, cols)
  cell_addrs <- cellranger::cell_addr(rowcol[[1]], rowcol[[2]])
  cellranger::to_string(cell_addrs, fo = "A1", strict = FALSE)
}

unnest_ref <- function(x, ref) {
  UseMethod("unnest_ref")
}

unnest_ref.default <- function(x, ref_col = ref) {
  stopifnot(is.character(x), length(x) == 1L)
  refs <- unlist(strsplit(x, ",", fixed = TRUE))
  unlist(lapply(refs, unrange))
}

unrange("A121:A122")
unnest_ref("A115,A121:A122")
```

The `unnest_ref()` function can also be defined for whole data frames, unnesting
them by a column of references.

```{r}
unnest_ref.data.frame <- function(x, ref_col) {
  ref <- rlang::enquo(ref_col)
  x[[rlang::quo_name(ref)]] <- lapply(x[[rlang::quo_name(ref)]], unnest_ref)
  tidyr::unnest(x, rlang::UQ(ref))
}

(nested_rule <- slice(rules, 7))
unnest_ref(nested_rule, ref)
```

Finally the new data frame of rules can be joined to a data frame of cells in
any of the usual ways, via the `sheet` and `ref` columns.

Problems with this approach occur with rules that are defined over large ranges
of cells: the 'unnesting' of those ranges results in very long vectors of
individual cell addresses, or (worse) huge data frames of rules.  Such cases are
commonplace, because rules are often defined for entire columns of a
spreadsheet, and a column has 1048576 rows.
