---
title: "Categorical summary tables in R"
description: >
  Build categorical summary tables in R with table_categorical(),
  including grouped cross-tabulations, effect sizes, confidence
  intervals, and export to gt, tinytable, flextable, Excel, or Word.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Categorical summary tables in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

build_rich_tables <- identical(Sys.getenv("IN_PKGDOWN"), "true")

pkgdown_dark_gt <- function(tab) {
  tab |>
    gt::opt_css(
      css = paste(
        ".gt_table, .gt_heading, .gt_col_headings, .gt_col_heading,",
        ".gt_column_spanner_outer, .gt_column_spanner, .gt_title,",
        ".gt_subtitle, .gt_sourcenotes, .gt_sourcenote {",
        "  background-color: transparent !important;",
        "  color: currentColor !important;",
        "}",
        sep = "\n"
      )
    )
}
```

```{r setup}
library(spicy)
```

`table_categorical()` builds publication-ready categorical tables suitable for
APA-style reporting in social science and data science research. With
`by`, it produces grouped cross-tabulation tables with chi-squared
\(p\)-values, effect sizes, confidence intervals, and multi-level
headers. Without `by`, it produces one-way frequency-style tables for
the selected variables. Tables can be exported to gt, tinytable,
flextable, Excel, or Word. This vignette walks through the main features.

## Basic usage

For grouped tables, provide a data frame, one or more selected
variables, and a grouping variable:

```{r basic}
table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education
)
```

The default, `output = "default"`, prints a styled ASCII table to
the console. Use `output = "data.frame"` to get a plain numeric
data frame suitable for further processing.

## One-way tables

Omit `by` to build a frequency-style table for the selected variables:

```{r oneway}
table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  output = "default"
)
```

## Output formats

`table_categorical()` supports several output formats. The table below
summarizes the options:

| Format | Description |
|---|---|
| `"default"` | Styled ASCII table in the console (default) |
| `"data.frame"` | Wide data frame, one row per modality |
| `"long"` | Long data frame, one row per modality x group |
| `"gt"` | Formatted gt table |
| `"tinytable"` | Formatted tinytable |
| `"flextable"` | Formatted flextable |
| `"excel"` | Excel file (requires `excel_path`) |
| `"clipboard"` | Copy to clipboard |
| `"word"` | Word document (requires `word_path`) |

### gt output

The `"gt"` format produces a table with APA-style borders, column
spanners, and proper alignment:

```{r gt, eval = build_rich_tables}
pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = c(smoking, physical_activity, dentist_12m),
    by = education,
    output = "gt"
  )
)
```

### tinytable output

```{r tinytable, eval = build_rich_tables}
table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  by = sex,
  output = "tinytable"
)
```

### Data frame output

Use `output = "data.frame"` for a wide numeric data frame (one row per
modality), or `output = "long"` for a long format (one row per
modality x group):

```{r data-frame}
table_categorical(
  sochealth,
  select = smoking,
  by = education,
  output = "data.frame"
)
```

## Custom labels

By default, `table_categorical()` uses variable names as row headers.
Use the `labels` argument to provide human-readable labels. Two
forms are accepted (matching `table_continuous()` and
`table_continuous_lm()`):

- A **named character vector** keyed by column name in `data` --
  the recommended form. Only listed columns are relabelled; others
  fall back to the column name.
- A **positional character vector** of the same length as `select`
  -- the legacy spicy < 0.11.0 form, kept for backward
  compatibility.

```{r labels, eval = build_rich_tables}
pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = c(smoking, physical_activity),
    by = education,
    labels = c(
      smoking           = "Smoking status",
      physical_activity = "Regular physical activity"
    ),
    output = "gt"
  )
)
```

## Association measures and confidence intervals

`table_categorical()` picks the association measure per row variable
based on the variable type (`assoc_measure = "auto"`, the default):

* **2x2** (binary row variable vs. binary `by`) -> `phi`,
* both ordered factors -> Kendall's `tau_b`,
* otherwise -> Cramer's `V`.

When the chosen measures differ across rows, the column header
collapses to `"Effect size"` and an APA-style `Note.` line documents
which measure was used for each variable.
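
The selection rule above can be sketched in a few lines of base R.
`pick_measure()` is a hypothetical helper written for illustration, not
a spicy function, and it assumes the rules are checked in the order
listed; the package's internal logic may differ in edge cases.

```r
# Hypothetical helper: mirrors the documented "auto" rule, nothing more.
pick_measure <- function(x, by) {
  x  <- droplevels(as.factor(x))
  by <- droplevels(as.factor(by))

  # 2x2: binary row variable vs. binary grouping variable
  if (nlevels(x) == 2 && nlevels(by) == 2) {
    return("phi")
  }
  # both variables are ordered factors
  if (is.ordered(x) && is.ordered(by)) {
    return("tau_b")
  }
  # everything else
  "cramer_v"
}

# Made-up data, only to exercise the three branches
binary_a <- factor(c("no", "yes", "no", "yes"))
binary_b <- factor(c("f", "m", "m", "f"))
ordered3 <- factor(c("low", "mid", "high", "mid"),
                   levels = c("low", "mid", "high"), ordered = TRUE)

pick_measure(binary_a, binary_b)  # 2x2
pick_measure(ordered3, ordered3)  # ordered x ordered
pick_measure(binary_a, ordered3)  # anything else
```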

Override with a single string for uniform application, or with a
named vector to mix measures per row:

```{r assoc-measure, eval = build_rich_tables}
# Uniform: same measure for every row variable
table_categorical(
  sochealth,
  select = smoking,
  by = education,
  assoc_measure = "lambda",
  output = "tinytable"
)
```

```{r assoc-measure-named, eval = build_rich_tables}
# Per-row: pick the right measure for each variable.
# `smoking` x `education` is 2x3 (binary x ordered) -> Cramer's V;
# `self_rated_health` x `education` is ordered x ordered -> Tau-b.
# The mixed result collapses the header to "Effect size" and adds an
# APA `Note.` line documenting the per-row measure.
table_categorical(
  sochealth,
  select = c(smoking, self_rated_health),
  by = education,
  assoc_measure = c(
    smoking           = "cramer_v",
    self_rated_health = "tau_b"
  ),
  output = "tinytable"
)
```
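
For orientation, phi and Cramer's V are both rescalings of the
chi-squared statistic. The sketch below uses only base R on a made-up
2x2 table; `cramers_v()` is a hypothetical helper, and spicy's own
computation is not shown in this vignette and may differ (for example
in how it treats corrections).

```r
# Hypothetical helper: textbook Cramer's V from an uncorrected chi-squared.
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  sqrt(chi2 / (n * (k - 1)))
}

# Made-up counts: rows are the row variable, columns the grouping variable
tab <- matrix(c(30, 10, 10, 30), nrow = 2)

# For a 2x2 table, V equals the absolute value of phi (0.5 here)
v <- cramers_v(tab)
v
```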

Add confidence intervals with `assoc_ci = TRUE`. In rendered formats
(gt, tinytable, flextable), the CI is shown inline:

```{r ci-rendered, eval = build_rich_tables}
pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = c(smoking, physical_activity),
    by = education,
    assoc_ci = TRUE,
    output = "gt"
  )
)
```

In data formats (`"data.frame"`, `"long"`, `"excel"`, `"clipboard"`),
separate `CI lower` and `CI upper` columns are added:

```{r ci-data}
table_categorical(
  sochealth,
  select = smoking,
  by = education,
  assoc_ci = TRUE,
  output = "data.frame"
)
```

## Weighted tables

Pass survey weights with the `weights` argument. Use `rescale = TRUE` so
the total weighted N matches the unweighted N:

```{r weighted, eval = build_rich_tables}
pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = c(smoking, physical_activity),
    by = education,
    weights = "weight",
    rescale = TRUE,
    output = "gt"
  )
)
```
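
As a rough illustration of what rescaling means, the base R sketch
below normalises a made-up weight vector so that the weights sum to the
number of observations. This assumes the usual normalisation; how
`rescale = TRUE` treats missing weights or missing values in the table
variables is not covered here.

```r
# Made-up data: four respondents with unequal weights
x <- c("yes", "no", "yes", "no")
w <- c(0.5, 1.5, 2.0, 4.0)

# Rescale so the weighted total equals the unweighted N
w_rescaled <- w * length(w) / sum(w)
sum(w_rescaled)

# Weighted counts per category using the rescaled weights
weighted_counts <- xtabs(w_rescaled ~ x)
weighted_counts
```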

## Handling missing values

By default, rows with missing values are dropped (`drop_na = TRUE`).
Set `drop_na = FALSE` to display them as a "(Missing)" category:

```{r missing, eval = build_rich_tables}
pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = income_group,
    by = education,
    drop_na = FALSE,
    output = "gt"
  )
)
```
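
The idea of turning `NA` into a visible category can be reproduced in
base R. The sketch below uses a made-up `income` factor and the
standard `addNA()` idiom; it illustrates the concept only and is not
how spicy is known to implement `drop_na = FALSE`.

```r
# Made-up factor with two missing values
income <- factor(c("Low", "High", NA, "Low", NA))

# Default tabulation silently drops NA
counts_dropped <- table(income)

# Promote NA to a real level, then give it a readable label
income_explicit <- addNA(income)
levels(income_explicit)[is.na(levels(income_explicit))] <- "(Missing)"
counts_explicit <- table(income_explicit)

counts_dropped
counts_explicit
```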

## Filtering and reordering levels

Use `levels_keep` to display only specific modalities. The order you
specify controls the display order, which is useful for placing
"(Missing)" first to highlight missingness:

```{r levels-keep, eval = build_rich_tables}
pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = income_group,
    by = education,
    drop_na = FALSE,
    levels_keep = c("(Missing)", "Low", "High"),
    output = "gt"
  )
)
```

## Formatting options

Control the number of digits for percentages (`percent_digits`),
p-values (`p_digits`), and the association measure (`v_digits`):

```{r formatting, eval = build_rich_tables}
pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = smoking,
    by = education,
    percent_digits = 2,
    p_digits = 4,
    v_digits = 3,
    output = "gt"
  )
)
```

`p_digits` drives both the displayed precision of the `p` column and
the small-*p* threshold (`p_digits = 3` -> `<.001`, `p_digits = 4`
-> `<.0001`), matching `table_continuous()` and
`table_continuous_lm()`.
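
The link between `p_digits` and the small-*p* threshold can be sketched
as follows. `format_p()` is a hypothetical base R helper that assumes
APA-style output without a leading zero; it is not the formatter spicy
uses internally.

```r
# Hypothetical helper: the threshold is 10^(-digits), so the smallest
# printable value and the "<" cutoff always share the same precision.
format_p <- function(p, digits = 3) {
  threshold <- 10^(-digits)
  strip_zero <- function(s) sub("^0\\.", ".", s)

  out <- strip_zero(formatC(p, format = "f", digits = digits))
  below <- !is.na(p) & p < threshold
  out[below] <- paste0(
    "<", strip_zero(formatC(threshold, format = "f", digits = digits))
  )
  out
}

p3 <- format_p(c(0.0004, 0.0312), digits = 3)  # "<.001" ".031"
p4 <- format_p(c(0.00004, 0.0004), digits = 4) # "<.0001" ".0004"
p3
p4
```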

## Decimal alignment

By default (`align = "decimal"`), numeric columns are aligned on the
decimal mark. This is the standard scientific-publication convention,
also used by SPSS, SAS, and LaTeX `siunitx`. The gt and tinytable
engines rely on their native primitives (`gt::cols_align_decimal()`
and `tinytable::style_tt(align = "d")`). Engines without a native
primitive (`flextable`, `word`, `clipboard`, ASCII print) get the
alignment via leading and trailing space padding, with `flextable` and
`word` switching the body font to `Consolas` so character widths match.
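
To make the padding approach concrete, here is a minimal base R sketch.
`pad_decimal()` is a hypothetical helper, not part of spicy; it assumes
`"."` as the decimal mark and a monospaced font, which is why the
padded strings line up.

```r
# Hypothetical helper: pad the integer part on the left and the
# fractional part on the right so every decimal mark shares a column.
pad_decimal <- function(x) {
  pos <- as.integer(regexpr(".", x, fixed = TRUE))  # -1 if no decimal mark
  int_part  <- ifelse(pos > 0, substr(x, 1, pos - 1), x)
  frac_part <- ifelse(pos > 0, substr(x, pos, nchar(x)), "")  # keeps "."

  left_pad  <- strrep(" ", max(nchar(int_part))  - nchar(int_part))
  right_pad <- strrep(" ", max(nchar(frac_part)) - nchar(frac_part))
  paste0(left_pad, int_part, frac_part, right_pad)
}

aligned <- pad_decimal(c("12.5", "3.25", "100"))
writeLines(aligned)
```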

Pass `align = "auto"` to revert to the legacy uniform
right-alignment used in spicy < 0.11.0:

```{r align}
table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  by = sex,
  align = "auto"
)
```

`align = "center"` and `align = "right"` apply the named alignment
literally.

## Tidying for downstream pipelines

`table_categorical()` returns an object that can be coerced to a
plain `data.frame` / `tbl_df` (stripping the spicy formatting
attributes) or piped into `broom::tidy()` / `broom::glance()` for
use with `gtsummary`, `modelsummary`, `parameters`, or any other
tidyverse-stats workflow:

```{r tidy-glance}
out <- table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  by = sex
)

# One row per (variable x level x group) with broom-style columns
# (outcome, level, group, n, proportion). The synthetic Total
# margin is excluded so each observation is counted once.
broom::tidy(out)

# One row per outcome with the omnibus chi-squared test and the
# chosen association measure (test_type, statistic, df, p.value,
# assoc_type, assoc_value, assoc_ci_lower / assoc_ci_upper, n_total).
broom::glance(out)
```

## Exporting to Excel, Word, or clipboard

For Excel export, provide a file path:

```r
table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education,
  output = "excel",
  excel_path = "my_table.xlsx"
)
```

For Word, use `output = "word"`:

```r
table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education,
  output = "word",
  word_path = "my_table.docx"
)
```

You can also copy directly to the clipboard for pasting into a
spreadsheet or a text editor:

```r
table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  by = education,
  output = "clipboard"
)
```
