---
title: "Introduction to phinterval"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to phinterval}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  markdown: 
    wrap: 72
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

```{r setup}
library(phinterval)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
```

# Introduction

The phinterval package extends
[{lubridate}](https://lubridate.tidyverse.org/) to support disjoint
("holey") and empty time spans. It implements the `<phinterval>` vector
class, a generalization of the standard contiguous `<Interval>`, which
can represent:

-   **Contiguous spans:** A contiguous interval bounded by a start and
    end point (e.g., the year 2025).
-   **Empty spans:** A set containing no time points (e.g., the
    intersection of your life and Napoleon's).
-   **Disjoint spans:** A set of multiple time spans separated by gaps
    (e.g., the days you attended school, excluding weekends and
    holidays).

The package is designed to integrate easily with existing lubridate
workflows. Any `<Interval>` vector can be converted to an equivalent
`<phinterval>` vector using `as_phinterval()`, and all phinterval
functions accept either `<Interval>` or `<phinterval>` inputs.

# When Time Isn't Continuous

Certain set operations on time spans naturally produce empty or disjoint
results, which are difficult to represent using a standard interval.
This section illustrates several such edge cases using the months of
January and November 2025, along with the full calendar year.

```{r}
jan <- interval(ymd("2025-01-01"), ymd("2025-02-01"))
nov <- interval(ymd("2025-11-01"), ymd("2025-12-01"))
full_2025 <- interval(ymd("2025-01-01"), ymd("2026-01-01"))
```

## Empty Intersections

Because January and November do not overlap, their intersection should
contain no time.

```{r}
lubridate::intersect(jan, nov)

phint_intersect(jan, nov)
```

lubridate resolves this by returning `NA`, while phinterval returns a
`<hole>`, which explicitly represents an empty span of time.

This distinction matters when performing downstream calculations. For
example, counting the number of days contained in both January and
November:

```{r}
lubridate::intersect(jan, nov) / duration(days = 1)

phint_intersect(jan, nov) / duration(days = 1)
```

## Punching Holes in Intervals

Next, consider subtracting the month of November from the full year of
2025.

```{r}
try(lubridate::setdiff(full_2025, nov))

phint_setdiff(full_2025, nov)
```

The result consists of two disjoint spans, January through October and
December, which can't be represented by a single interval, so lubridate
raises an error. phinterval represents the result as a single object
with an explicit gap.
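
To see why the difference needs two pieces, here is a minimal base R
sketch of the arithmetic. The helper `span_setdiff()` is hypothetical
and is not part of phinterval or lubridate; it only illustrates the
idea for a single pair of spans and makes no claim about how either
package is implemented.

```{r}
# Hypothetical helper: subtract the span [b_start, b_end] from
# [a_start, a_end], returning one row per remaining piece.
span_setdiff <- function(a_start, a_end, b_start, b_end) {
  pieces <- data.frame(
    # Candidate piece 1 runs from the start of `a` to the start of `b`;
    # candidate piece 2 runs from the end of `b` to the end of `a`.
    start = c(a_start, max(a_start, b_end)),
    end   = c(min(a_end, b_start), a_end)
  )
  # Keep only the pieces that have a positive duration
  pieces[pieces$start < pieces$end, , drop = FALSE]
}

whole_year <- as.Date(c("2025-01-01", "2026-01-01"))
november <- as.Date(c("2025-11-01", "2025-12-01"))

# Two rows remain: one before November and one after it
remaining <- span_setdiff(
  whole_year[1], whole_year[2],
  november[1], november[2]
)
remaining
```

A single start/end pair cannot hold both rows, which is why a vector
class that allows several spans per element is needed here.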

## Unions of Non-Overlapping Spans

Similarly, the union of January and November contains a gap from
February to October.

```{r}
lubridate::union(jan, nov)

phint_union(jan, nov)
```

In this case, lubridate returns the span from the beginning of January to
the end of November, implicitly filling in the gap. phinterval keeps the
two months as separate spans within a single element.

## Subtracting an Interval from Itself

Finally, consider subtracting an interval from itself. Intuitively, this
should result in an empty time span.

```{r}
lubridate::setdiff(jan, jan)

phint_setdiff(jan, jan)
```

In this case, lubridate returns the original interval, while
phinterval returns a `<hole>`.

# Case Study: Employment History

The phinterval package is most useful when working with tabular data and
vectorized workflows. To illustrate this, we’ll consider an abridged
employment history for several characters from the television show
*Succession*.

```{r}
jobs <- tribble(
  ~name,   ~job_title,             ~start,        ~end,
  "Greg",  "Mascot",               "2018-01-01",  "2018-06-03",
  "Greg",  "Executive Assistant",  "2018-06-10",  "2020-04-01",
  "Greg",  "Chief of Staff",       "2020-03-01",  "2020-11-28",
  "Tom",   "Chairman",             "2019-05-01",  "2020-11-10",
  "Tom",   "CEO",                  "2020-11-10",  "2020-12-31",
  "Shiv",  "Political Consultant", "2017-01-01",  "2019-04-01"
)
```

Suppose we know that Greg, Tom, and Shiv went on a Christmas vacation in December 2017.

```{r}
vacation <- interval(ymd("2017-12-23"), ymd("2017-12-29"))
```

If we want to analyze only the time spent working, and exclude time on
vacation, we might try to subtract the `vacation` interval from each span
in `jobs`. However, this approach breaks down when the vacation falls
strictly within a job interval, as it does for Shiv’s Political
Consultant role.

```{r}
try(
  jobs |>
    mutate(
      span = interval(start, end),
      span = setdiff(span, vacation)
    ) |>
    select(name, job_title, span)
)
```

Handling this correctly is surprisingly involved. One option is to split
Shiv’s job into two rows (one pre-vacation and one post-vacation),
breaking the one-row-per-job structure of the data. Another is to
represent each job as a list of intervals, complicating downstream
analysis. 

The main purpose of phinterval is to avoid these workarounds by providing
drop-in replacements for lubridate's interval functions. Because
phinterval functions accept either `<Interval>` or `<phinterval>` inputs,
existing code can typically be adapted by replacing a lubridate function
with its phinterval counterpart.

```{r}
jobs |>
  mutate(
    span = interval(start, end),
    span = phint_setdiff(span, vacation)
  ) |>
  select(name, job_title, span)
```

## Merging Intervals

Suppose we want to analyze only the total time each character spent
employed, without distinguishing between individual jobs. This can be
done using `phint_squash()`, which aggregates a vector of intervals into
a minimal set of non-overlapping spans within a scalar `<phinterval>`.

```{r, include = FALSE}
opts <- options(width = 120)
```

```{r}
employment <- jobs |>
  mutate(span = interval(start, end)) |>
  group_by(name) |>
  summarize(employed = phint_squash(span))

employment
```

Notice that:

-   *Greg* has multiple disjoint employment periods, which are preserved
    as separate spans within a single `<phinterval>` element.
-   *Tom* held two back-to-back positions (Chairman followed by CEO),
    which `phint_squash()` correctly merges into a single contiguous
    span.
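
The merging behaviour described above can be sketched in base R with a
sort-and-sweep pass. The helper `merge_spans()` is hypothetical and is
not how `phint_squash()` is necessarily implemented; it assumes
non-empty inputs with inclusive endpoints, so abutting spans are merged.

```{r}
# Hypothetical helper: merge overlapping or abutting spans into a
# minimal set of non-overlapping spans.
merge_spans <- function(start, end) {
  # Sort the spans by their start
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]

  out_start <- start[1]
  out_end <- end[1]
  for (i in seq_along(start)[-1]) {
    last <- length(out_end)
    if (start[i] <= out_end[last]) {
      # Overlaps or touches the current span: extend it
      out_end[last] <- max(out_end[last], end[i])
    } else {
      # Separated by a gap: begin a new span
      out_start <- c(out_start, start[i])
      out_end <- c(out_end, end[i])
    }
  }
  data.frame(start = out_start, end = out_end)
}

greg_merged <- merge_spans(
  start = as.Date(c("2018-01-01", "2018-06-10", "2020-03-01")),
  end   = as.Date(c("2018-06-03", "2020-04-01", "2020-11-28"))
)
tom_merged <- merge_spans(
  start = as.Date(c("2019-05-01", "2020-11-10")),
  end   = as.Date(c("2020-11-10", "2020-12-31"))
)

greg_merged
tom_merged
```

Under these assumptions, Greg's three jobs collapse to two spans (his
overlapping Executive Assistant and Chief of Staff roles are joined),
and Tom's two abutting jobs collapse to one.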

The `by` argument of `phint_squash()` and `datetime_squash()` (which takes 
`start` and `end` times directly) can be used in place of `dplyr::group_by()`.
The example below is equivalent to the previous code but is usually several 
times faster.

```{r}
datetime_squash(
  start = ymd(jobs$start),
  end = ymd(jobs$end),
  by = jobs$name,
  keep_by = TRUE,
  order_by = TRUE
)
```

```{r, include = FALSE}
options(opts)
```

As in `dplyr::summarize()`, the `by` argument can be a vector or data frame 
to support multiple grouping columns.

To return the dataset to a one-row-per-span format, use `phint_unnest()`,
which converts each `<phinterval>` element into separate rows:

```{r}
employment |>
  reframe(phint_unnest(employed, key = name))
```

## Finding Gaps

To analyze periods of unemployment, we need to identify the gaps between
employment intervals. The `phint_invert()` function returns the gaps
between spans in a `<phinterval>`.

```{r}
unemployment <- employment |>
  mutate(
    # Find the gaps between jobs
    unemployed = phint_invert(employed),
    
    # Calculate duration of unemployment
    days_unemployed = unemployed / ddays(1)
  ) |>
  select(name, unemployed, days_unemployed)

unemployment
```

Greg was unemployed for 7 days between his time as a Mascot and his role
as Executive Assistant. Tom and Shiv have no gaps in their employment
timelines, so their `unemployed` values are each a `<hole>`.
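
As a rough check on the 7-day figure, the gaps can be worked out by hand
in base R. The helper `span_gaps()` is hypothetical and not part of
phinterval; it assumes the spans are already sorted and non-overlapping,
and the dates below are Greg's merged employment spans from the `jobs`
table.

```{r}
# Hypothetical helper: each gap runs from the end of one span to the
# start of the next.
span_gaps <- function(start, end) {
  n <- length(start)
  data.frame(start = end[-n], end = start[-1])
}

greg_gaps <- span_gaps(
  start = as.Date(c("2018-01-01", "2018-06-10")),
  end   = as.Date(c("2018-06-03", "2020-11-28"))
)

# Length of each gap in days
greg_gap_days <- as.numeric(
  difftime(greg_gaps$end, greg_gaps$start, units = "days")
)
greg_gap_days

# A single contiguous span (as for Tom) has no gaps, so no rows
tom_gaps <- span_gaps(
  start = as.Date("2019-05-01"),
  end   = as.Date("2020-12-31")
)
nrow(tom_gaps)
```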

# Edge Cases and Gotchas

## Abutting Intervals and Intersection

Manipulating abutting intervals (intervals that share an endpoint) can
sometimes produce unexpected results. To demonstrate, consider a Monday
and a Tuesday in November 2025.

```{r}
monday <- interval(ymd("2025-11-10"), ymd("2025-11-11"))
tuesday <- interval(ymd("2025-11-11"), ymd("2025-11-12"))
```

By default, intervals in `<phinterval>` and `<Interval>` vectors have inclusive
endpoints, meaning that midnight on Tuesday, November 11th, 2025 (the instant
at which Monday ends and Tuesday begins) falls within both `monday` and `tuesday`:

```{r}
midnight_monday <- ymd_hms("2025-11-11 00:00:00")
phint_within(midnight_monday, monday)
phint_within(midnight_monday, tuesday)
```

As a result, the intersection of `monday` and `tuesday` is an instantaneous
interval at `midnight_monday`.

```{r}
phint_intersect(monday, tuesday) == as_phinterval(midnight_monday)
```

Perhaps surprisingly, this also means that the intersection of `monday` and its
complement is not empty, but consists of the two endpoints of `monday`.

```{r}
not_monday <- phint_complement(monday)
not_monday

phint_intersect(monday, not_monday)
```

The bounds argument in `phint_overlaps()`, `phint_within()`, and
`phint_intersect()` controls this behavior. When `bounds = "()"`, endpoints are
treated as exclusive:

```{r}
phint_overlaps(monday, tuesday, bounds = "()")
phint_intersect(monday, tuesday, bounds = "()")
```

With exclusive endpoints, `monday` and `tuesday` no longer overlap, and their
intersection is empty.
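
The difference between the two conventions comes down to whether the
endpoint comparison is strict. The base R sketch below illustrates this
with a hypothetical helper, `spans_overlap()`; both the helper and its
`"[]"` label for inclusive endpoints are assumptions made for this
example and are not part of phinterval.

```{r}
# Hypothetical helper: test whether two spans overlap, with inclusive
# ("[]") or exclusive ("()") endpoints.
spans_overlap <- function(a_start, a_end, b_start, b_end, bounds = "[]") {
  if (bounds == "[]") {
    # Inclusive: sharing a single endpoint counts as overlapping
    a_start <= b_end && b_start <= a_end
  } else {
    # Exclusive: the spans must share more than an endpoint
    a_start < b_end && b_start < a_end
  }
}

mon <- as.Date(c("2025-11-10", "2025-11-11"))
tue <- as.Date(c("2025-11-11", "2025-11-12"))

overlap_closed <- spans_overlap(mon[1], mon[2], tue[1], tue[2], bounds = "[]")
overlap_open <- spans_overlap(mon[1], mon[2], tue[1], tue[2], bounds = "()")

overlap_closed
overlap_open
```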

Strictly speaking, an open interval `(point, point)` contains no points, but
for convenience we allow these instants to exist. With `bounds = "()"`,
instants on the endpoint of an interval are outside of the interval, while
instants in the middle of an interval are considered to be within it:

```{r}
monday_at_9AM <- as_phinterval(ymd_hms("2025-11-10 09:00:00"))
phint_within(monday_at_9AM, monday, bounds = "()")
phint_within(midnight_monday, monday, bounds = "()")
```

To treat instantaneous intervals as empty, use `phint_sift()` to remove all
instants from an interval vector:

```{r}
phint <- phint_squash(c(monday_at_9AM, tuesday))
phint

phint_sift(phint)
```

## Instantaneous Intervals and Set Difference

Because phinterval elements are composed of non-overlapping, non-adjacent spans,
"punching" an instantaneous hole into an interval using `phint_setdiff()` has no 
effect on the interval. While removing a single point from an interval `[start, end]` 
would theoretically split it into `[start, point)` and `(point, end]`, in practice
these adjacent pieces are immediately merged back together:

```{r}
monday_noon <- as_phinterval(ymd_hms("2025-11-10 12:00:00"))
monday_lunch_break <- interval(
  ymd_hms("2025-11-10 12:00:00"), 
  ymd_hms("2025-11-10 13:00:00")
)

phint_setdiff(monday, monday_lunch_break) # Removes a non-zero interval
phint_setdiff(monday, monday_noon)        # Instantaneous - no effect
```

To create gaps, you must remove an interval with non-zero duration.

## Time Zones

To ensure that any `<Interval>` vector can be represented as an equivalent
`<phinterval>` vector, the `phinterval()` constructor accepts any time zone
permitted by `interval()`, including unrecognized zones.

```{r}
intvl <- interval(ymd("2020-01-01"), ymd("2020-01-02"), tzone = "nozone")
phint <- phinterval(ymd("2020-01-01"), ymd("2020-01-02"), tzone = "nozone")
intvl == phint
```

When a `<phinterval>` with an unrecognized time zone is formatted, its time points
are displayed using the UTC time zone:

```{r, include = FALSE}
rlang::reset_warning_verbosity("phinterval_warning_unrecognized_tzone")
```

```{r}
print(phint)
```

The `is_recognized_tzone()` function can be used to check whether a time zone is
recognized:

```{r}
is_recognized_tzone("America/New_York")
is_recognized_tzone("nozone")
is_recognized_tzone(NA_character_)
```

Some datetime vectors, such as `<POSIXct>`, are allowed to have an `NA` time zone.
When converted to a `<phinterval>`, the missing time zone is silently replaced
with UTC:

```{r}
na_zoned <- as.POSIXct("2021-01-01", tz = NA_character_)
as_phinterval(na_zoned)
```

Operations that combine two or more interval vectors, such as `phint_union()`,
use the time zone of the first argument. If the first argument's time zone is 
`""` (the user's local time zone), the second argument's time zone is used instead.

```{r}
int_est <- interval(ymd("2020-01-01"), ymd("2020-01-02"), tzone = "EST")
int_utc <- interval(ymd("2020-01-01"), ymd("2020-01-02"), tzone = "UTC")
int_lcl <- interval(ymd("2020-01-01"), ymd("2020-01-02"), tzone = "")

phint_union(int_est, int_utc)
phint_union(int_utc, int_est)
phint_union(int_lcl, int_est)
```
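
The rule described above can be restated as a one-line base R sketch.
The helper `resolve_tzone()` is hypothetical and exists only to spell
out the rule as stated; it is not a phinterval function.

```{r}
# Hypothetical helper: use the first time zone unless it is "" (local
# time), in which case fall back to the second.
resolve_tzone <- function(tz_first, tz_second) {
  if (identical(tz_first, "")) tz_second else tz_first
}

resolve_tzone("EST", "UTC")
resolve_tzone("UTC", "EST")
resolve_tzone("", "EST")
```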

## Comparison with Datetime Vectors

Comparison operators (`<=`, `<`, `>`, `>=`, `==`) work in unexpected ways
when comparing datetime vectors (`<Date>`, `<POSIXct>`, `<POSIXlt>`)
to `<phinterval>` or `<Interval>` vectors. For example:

```{r}
span <- phinterval(ymd("2000-08-05"), ymd("2000-11-29"))
date <- ymd("2021-01-01")

span == date
```

For the intended behavior, use `as_phinterval()` to convert datetime vectors 
into an equivalent `<phinterval>` first.

```{r}
span == as_phinterval(date)
```
