---
title: "Quick Start"
author: "Gilles Colling"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Quick Start}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(keyed)
library(dplyr)
set.seed(42)
```

## The Problem: Silent Data Corruption

You receive monthly customer exports from a CRM system. The data should have unique `customer_id` values and complete `email` addresses. One month, someone upstream changes the export logic. Now `customer_id` has duplicates and some emails are missing.

**Without explicit checks, you won't notice until something breaks downstream**—wrong row counts after a join, duplicated invoices, failed email campaigns.

```{r}
# January export: clean data
january <- data.frame(
  customer_id = c(101, 102, 103, 104, 105),
  email = c("alice@example.com", "bob@example.com", "carol@example.com",
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "premium", "basic", "premium")
)

# February export: corrupted upstream (duplicates + missing email)
february <- data.frame(
  customer_id = c(101, 102, 102, 104, 105),  # Note: 102 is duplicated
  email = c("alice@example.com", "bob@example.com", NA,
            "dave@example.com", "eve@example.com"),
  segment = c("premium", "basic", "basic", "basic", "premium")
)
```

The February data looks fine at a glance:

```{r}
head(february)
nrow(february)  # Same row count
```

But it will silently corrupt your analysis.
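
To see the damage concretely, join a one-invoice-per-customer table (invented here for illustration) against the February export. The duplicated `customer_id` silently duplicates an invoice:

```{r}
# Hypothetical invoices table for illustration: one invoice per customer
invoices <- data.frame(
  customer_id = c(101, 102, 104),
  amount = c(250, 125, 300)
)

# Customer 102's invoice now appears twice, with no error and no warning
invoices |>
  left_join(february, by = "customer_id")
```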

---

## The Solution: Make Assumptions Explicit

**keyed** catches these issues by making your assumptions explicit:

```{r}
# Define what you expect: customer_id is unique
january_keyed <- january |>
  key(customer_id) |>
  lock_no_na(email)

# This works - January data is clean
january_keyed
```

Now try the same with February's corrupted data:

```{r error=TRUE}
# This fails immediately - duplicates detected
february |>
  key(customer_id)
```

The error catches the problem **at import time**, not downstream when you're debugging a mysterious row count mismatch.
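
The missing email is flagged the same way; a sketch using `lock_no_na()` directly (assuming the lock checks also work before a key is set):

```{r error=TRUE}
# The NA email fails the no-missing-values check
february |>
  lock_no_na(email)
```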

---

## Workflow 1: Monthly Data Validation

**Goal**: Validate each month's export against expected constraints before processing.

**Challenge**: Data quality varies month-to-month. Silent corruption causes cascading errors.

**Strategy**: Define keys and assumptions once, apply consistently to each import.

### Define validation function

```{r}
validate_customer_export <- function(df) {
  df |>
    key(customer_id) |>
    lock_no_na(email) |>
    lock_nrow(min = 1)
}

# January: passes
january_clean <- validate_customer_export(january)
summary(january_clean)
```
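
February fails the same validation immediately, at the `key()` step:

```{r error=TRUE}
# February: fails fast on the duplicated customer_id
validate_customer_export(february)
```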

### Keys survive transformations

Once defined, keys persist through dplyr operations:

```{r}
# Filter preserves key
premium_customers <- january_clean |>
  filter(segment == "premium")

has_key(premium_customers)
get_key_cols(premium_customers)

# Mutate preserves key
enriched <- january_clean |>
  mutate(domain = sub(".*@", "", email))

has_key(enriched)
```
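
Other row-preserving verbs should behave the same way; a quick sketch with `arrange()` (assuming it is handled like `filter()` and `mutate()`):

```{r}
# Reordering rows leaves the key intact
january_clean |>
  arrange(desc(customer_id)) |>
  has_key()
```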

### Strict enforcement

If an operation breaks uniqueness, keyed errors and tells you to use `unkey()` first:

```{r error=TRUE}
# This creates duplicates - keyed stops you
january_clean |>
  mutate(customer_id = 1)
```

To proceed, you must explicitly acknowledge breaking the key:

```{r}
january_clean |>
  unkey() |>
  mutate(customer_id = 1)
```
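
If another column still uniquely identifies rows, you can define a fresh key afterwards; a sketch re-keying on `email`, which is unique in the January data:

```{r}
# After breaking customer_id, re-key on a column that is still unique
january_clean |>
  unkey() |>
  mutate(customer_id = 1) |>
  key(email)
```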

---

## Workflow 2: Safe Joins

**Goal**: Join customer data with orders without accidentally duplicating rows.

**Challenge**: Join cardinality mistakes are common and hard to debug. A "one-to-one" join that's actually one-to-many silently inflates your data.

**Strategy**: Use `diagnose_join()` to understand cardinality *before* joining.

### Create sample data

```{r}
customers <- data.frame(
  customer_id = 1:5,
  name = c("Alice", "Bob", "Carol", "Dave", "Eve"),
  tier = c("gold", "silver", "gold", "bronze", "silver")
) |>
  key(customer_id)

orders <- data.frame(
  order_id = 1:8,
  customer_id = c(1, 1, 2, 3, 3, 3, 4, 5),
  amount = c(100, 150, 200, 50, 75, 125, 300, 80)
) |>
  key(order_id)
```

### Diagnose before joining

```{r}
diagnose_join(customers, orders, by = "customer_id", use_joinspy = FALSE)
```

The diagnosis shows:

- **Cardinality is one-to-many**: each customer can have multiple orders

- **Coverage**: how many key values match between the two tables, and how many don't
Now you know what to expect. A `left_join()` will create 8 rows (one per order), not 5 (one per customer).
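
A sketch to confirm that count; since the joined result duplicates `customer_id`, we drop that key first with `unkey()`, following the pattern from Workflow 1:

```{r}
# The one-to-many join yields one row per order (8), not per customer (5)
customers |>
  unkey() |>
  left_join(orders, by = "customer_id") |>
  nrow()
```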

### Compare key structures

```{r}
compare_keys(customers, orders)
```

This shows the join key exists in both tables but with different uniqueness properties—essential information before joining.

---

## Workflow 3: Row Identity Tracking

**Goal**: Track which original rows survive through a complex pipeline.

**Challenge**: After filtering, aggregating, and joining, you lose track of which source rows contributed to your final data.

**Strategy**: Use `add_id()` to attach stable identifiers that survive transformations.

### Add row IDs

```{r}
# Add UUIDs to rows
customers_tracked <- customers |>
  add_id()

customers_tracked
```

### IDs survive transformations

```{r}
# Filter: IDs persist
gold_customers <- customers_tracked |>
  filter(tier == "gold")

get_id(gold_customers)

# Compare with original
compare_ids(customers_tracked, gold_customers)
```

The comparison shows exactly which rows were lost (filtered out) and which were preserved.

### Combining data with ID handling

When appending new data, `bind_id()` handles ID conflicts:

```{r}
batch1 <- data.frame(x = 1:3) |> add_id()
batch2 <- data.frame(x = 4:6)  # No IDs yet

# bind_id assigns new IDs to batch2 and checks for conflicts
combined <- bind_id(batch1, batch2)
combined
```
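
To double-check the result, `check_id()` (listed in the reference below) validates ID integrity; a sketch, assuming the data frame is its only required argument:

```{r}
# Sketch: confirm the combined IDs are present and unique
check_id(combined)
```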

---

## Workflow 4: Drift Detection

**Goal**: Detect when data changes unexpectedly between pipeline runs.

**Challenge**: Reference data (lookup tables, dimension tables) changes upstream without notice. Your pipeline silently uses stale assumptions.

**Strategy**: Commit snapshots with `stamp()` and check for drift with `check_drift()`.

### Commit a reference snapshot

```{r}
# Commit current state as reference
reference_data <- data.frame(
  region_id = c("US", "EU", "APAC"),
  tax_rate = c(0.08, 0.20, 0.10)
) |>
  key(region_id) |>
  stamp()
```

### Check for drift

```{r}
# No changes yet
check_drift(reference_data)
```

### Detect changes

```{r}
# Simulate upstream change: EU tax rate changed
modified_data <- reference_data
modified_data$tax_rate[2] <- 0.21

# Drift detected!
check_drift(modified_data)
```

The drift report shows exactly what changed, letting you decide whether to accept the new data or investigate.
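
If the change is legitimate, one way to accept it is to commit the new state as the reference; a sketch, assuming `stamp()` simply records whatever data it receives:

```{r}
# Accept the new rates as the reference going forward
accepted <- modified_data |> stamp()
check_drift(accepted)
```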

### Row-level diff

For a detailed cell-level comparison, use `diff()` with a keyed data frame and the newer version you want to compare it against:

```{r}
old_rates <- data.frame(
  region_id = c("US", "EU", "APAC"),
  tax_rate  = c(0.08, 0.20, 0.10)
) |>
  key(region_id)

new_rates <- data.frame(
  region_id = c("US", "EU", "APAC", "LATAM"),
  tax_rate  = c(0.08, 0.21, 0.10, 0.15)
)

diff(old_rates, new_rates)
```

### Cleanup

```{r}
# Remove snapshots when done
clear_all_snapshots()
```

---

## Quick Reference

### Core Functions

| Function | Purpose |
|----------|---------|
| `key()` | Define key columns (validates uniqueness) |
| `unkey()` | Remove key |
| `has_key()`, `get_key_cols()` | Query key status |

### Assumption Checks

| Function | Validates |
|----------|-----------|
| `lock_unique()` | No duplicate values |
| `lock_no_na()` | No missing values |
| `lock_complete()` | All expected values present |
| `lock_coverage()` | Reference values covered |
| `lock_nrow()` | Row count within bounds |

### Diagnostics

| Function | Purpose |
|----------|---------|
| `diagnose_join()` | Analyze join cardinality |
| `compare_keys()` | Compare key structures |
| `compare_ids()` | Compare row identities |
| `find_duplicates()` | Find duplicate key values |
| `key_status()` | Quick status summary |

### Row Identity

| Function | Purpose |
|----------|---------|
| `add_id()` | Add UUID to rows |
| `get_id()` | Retrieve row IDs |
| `bind_id()` | Combine data with ID handling |
| `make_id()` | Create deterministic IDs from columns |
| `check_id()` | Validate ID integrity |

### Drift Detection

| Function | Purpose |
|----------|---------|
| `stamp()` | Save reference snapshot |
| `check_drift()` | Compare against snapshot |
| `diff()` | Cell-level comparison of two data frames |
| `list_snapshots()` | View saved snapshots |
| `clear_snapshot()` | Remove specific snapshot |

---

## When to Use Something Else

keyed is designed for **flat-file workflows** without database infrastructure. If you need:

| Need | Better Alternative |
|------|-------------------|
| Enforced schema | Database (SQLite, DuckDB) |
| Version history | Git, git2r |
| Full data validation | pointblank, validate |
| Production pipelines | targets |

keyed fills a specific gap: lightweight key tracking for exploratory and semi-structured workflows where heavier tools add friction.

---

## See Also

- [Design Philosophy](philosophy.html) - The reasoning behind keyed's approach

- [Function Reference](https://gillescolling.com/keyed/reference/index.html) - Complete API documentation
