%\documentclass[article]{jss}
\documentclass[nojss,article]{jss}
%              ----- for the package-vignette, don't use JSS logo, etc
%
% \author{Martin Maechler\\ Seminar f\"ur Statistik \\ ETH Zurich, \ Switzerland
%   \\\email{maechler@stat.math.ethz.ch}}
\author{Martin M\"achler \\ ETH Zurich}
\title{Dip Test Distributions, P-values, and other Explorations}
% \def\mythanks{a version of this paper, for \pkg{nacopula} 0.4\_4, has been published
%     in JSS, \url{http://www.jstatsoft.org/v39/i09}.}
%% for pretty printing and a nice hypersummary also set:
\Plainauthor{Martin M\"achler} %% comma-separated
\Plaintitle{Dip Test Distributions, P-values, and other Explorations}
% \Shorttitle{}
%\date{April 2009 ({\tiny typeset on \tiny\today})}
%%\VignetteIndexEntry{Dip Test Distributions, P-values, and other Explorations}
%%\VignetteDepends{diptest}
\SweaveOpts{engine=R,keep.source=TRUE,strip.white=true}
%		     ^^^^^^^^^^^^^^^^
\SweaveOpts{eps=FALSE,pdf=TRUE,width=7,height=4}

%% an abstract and keywords
\Abstract{

... % FIXME

... % FIXME
}
%
\Keywords{MPFR, Abitrary Precision, Multiple Precision Floating-Point, R}
%% at least one keyword must be supplied

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{13}
%% \Issue{9}
%% \Month{September}
%% \Year{2004}
%% \Submitdate{2004-09-29}
%% \Acceptdate{2004-09-29}

%% The address of (at least) one author should be given
%% in the following format:
\Address{
	Martin M\"achler\\
	Seminar f\"ur Statistik, HG G~14.2\\
	ETH Zurich\\
	8092 Zurich, Switzerland\\
	E-mail: \email{maechler@stat.math.ethz.ch}\\
	URL: \url{https://people.math.ethz.ch/~maechler/}
}
%% It is also possible to add a telephone and fax number
%% before the e-mail in the following format:
%% Telephone: +43/1/31336-5053
%% Fax: +43/1/31336-734

%% for those who use Sweave please include the following line (with % symbols):
%% MM: this is "substituted" by  jss.cls:
%% need no \usepackage{Sweave.sty}

% \usepackage{myVignette}
% \usepackage{fullpage}% save trees ;-) --- FIXME use {geometry} package
% \usepackage[authoryear,round,longnamesfirst]{natbib}
% \bibliographystyle{plainnat}
%
%% Marius' packages
\usepackage[american]{babel}%for American English
% \usepackage{microtype}%for character protrusion and font expansion (only with pdflatex)
\usepackage{amsmath}%sophisticated mathematical formulas with amstex (includes \text{})
\usepackage{mathtools}%fix amsmath deficiencies
\usepackage{amssymb}%sophisticated mathematical symbols with amstex (includes \mathbb{})
% \usepackage{amsthm}%theorem environments
% \usepackage{bm}%for bold math symbols: \bm (= bold math)
% %NON-STANDARD:\RequirePackage{bbm}%only for indicator functions
% \usepackage{enumitem}%for automatic numbering of new enumerate environments
% \usepackage[
%   format=hang,
%   % NOT for JSS: labelsep=space,
%   justification=justified,
%   singlelinecheck=false%,
%   % NOT for JSS: labelfont=bf
% ]{caption}%for captions
% \usepackage{tikz}%sophisticated graphics package
% \usepackage{tabularx}%for special table environment (tabularx-table)
% \usepackage{booktabs}%for table layout

% This is already in jss above -- but withOUT the  fontsize=\small part !!
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontsize=\small,fontshape=sl}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontsize=\small}
\DefineVerbatimEnvironment{Scode}{Verbatim}{fontsize=\small,fontshape=sl}
% but when submitting, do get rid of too much vertical space between R
% input & output, i.e. between Sinput and Soutput:
\fvset{listparameters={\setlength{\topsep}{0pt}}}% !! quite an effect!
%%
%
\newcommand*{\R}{\proglang{R}}%{\textsf{R}}
\newcommand*{\Arg}[1]{\texttt{\itshape $\langle$#1$\rangle$}}
\newcommand*{\file}[1]{{`\normalfont\texttt{#1}'}}
\newcommand*{\eps}{\varepsilon}
%
%% Probability P[.],  Expectation E[.]  etc
\makeatletter
%% ==  subsection of our flexible-style "texab.sty" :
\newcommand{\@pkl}{[}     % Probability Klammer links
\newcommand{\@pkr}{]}
\newcommand{\@ekl}{[}     % Erwartungswert Klammer links
\newcommand{\@ekr}{]}     % Erwartungswert Klammer rechts
\DeclareMathOperator{\PRSymbol}{P}
% Next line (\makeright): if #1 == \left then \right #2 else #1 #2
\newcommand{\makeright}[2]{\ifx#1\left\right#2\else#1#2\fi}
%% the real commands
\newcommand{\PR}[2][\left]  {\PRSymbol     #1\@pkl #2 \makeright{#1}{\@pkr}}
\newcommand{\ERW}[2][\left] {\ERWSymbol    #1\@ekl #2 \makeright{#1}{\@ekr}}
\makeatother
\newcommand{\isD}{\ {\stackrel{\mathcal{D}}{=}}\ \ }
\newcommand*{\iid}{\mbox{ i.i.d. }}

%
\begin{document}
\setkeys{Gin}{width=\textwidth}
% Manuel has
\setlength{\abovecaptionskip}{-5pt}
%
%% include your article here, just as usual
%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.
% \section[About Java]{About \proglang{Java}}
%% Note: If there is markup in \(sub)section, then it has to be escape as above.
%% Note: These are explained in '?RweaveLatex' :
\begin{footnotesize}
<<preliminaries, echo=FALSE, results=hide>>=
options(SweaveHooks= list(fig=function() par(mar=c(5.1, 4.1, 1.1, 2.1))),
        width = 75,
        digits = 7, # <-- here, keep R's default!
        prompt = "R> ", # <- "yuck!" - required by JSS
        continue=" ")
set.seed(47)
Sys.setenv(LANGUAGE = "en")
if(.Platform$OS.type != "windows")
  Sys.setlocale("LC_MESSAGES","C")

## In order to  save() and load() expensive results
thisDir <- system.file('doc', package='diptest')
## not yet used:
xtraDir <- if(Sys.getenv("USER") == "maechler")
    "~/R/Pkgs/diptest/stuff" else thisDir
res1.file <- file.path(thisDir, "aggr_results.Rdata")

<<diagnose-lib, echo=FALSE>>=
if(nzchar(Sys.getenv("R_MM_PKG_CHECKING"))) print( .libPaths() )
@
\end{footnotesize}
% \maketitle
% \begin{abstract}
% \end{abstract}

\section[Introduction]{Introduction}% \small~\footnote{\mythanks}}
\label{sec:Intro}

%% MM
FIXME: Need notation

$D_n :=$\texttt{dip( runif(n) )};

but more generally,
\begin{equation}
  \label{eq:Dn.F}
  D_n(F) := D(X_1, X_2, \dots, X_n),  \mbox{ \ \ \texttt{where} } X_i \iid, X_i \sim F.
\end{equation}

\citet{HarJH85} in their ``seminal'' paper on the dip statistic $D_n$
already proved that $ \sqrt{n} \; D_n$ converges in distribution, i.e.,
\begin{equation}
  \label{eq:D.infty}
 \lim_{n\to\infty}\sqrt{n} \; D_n \isD  D_\infty.
\end{equation}

A considerable part of this paper is devoted to explore the distribution of $D_\infty$.

\bigskip
\section[History of the diptest package]{History of the \texttt{diptest} \textsf{R} package}

\citet{HarP85} published an implementation in Fortran of a concrete algorithm,
% ALGORITHM AS 217 APPL. STATIST. (1985) VOL.34, NO.3
where the code was also made available on Statlib\footnote{Statlib is now a
  website, of course, \url{http://lib.stat.cmu.edu/}, but then was \emph{the} preferred way
  for distributing algorithms for statistical computing, available years
  before the existence of the WWW, and entailing e-mail and (anonymous) FTP}

On July 28, 1994, Dario Ringach, then at NY University, asked on Snews (the
mailing list for S and S-plus users) about distributions and was helped by
me and then about \texttt{dyn.load} problems, again helped by
me. Subsequently he provided me with S-plus code which interfaced to
(a \texttt{f2c}ed version of) Hartigan's Fortran code, for computing the dip statistic.
and ended the (then private) e-mail with
\begin{quotation}
I am not going to have time to set this up for submission to StatLib.
If you want to do it, please go ahead.

Regards,
Dario
\end{quotation}



- several important bug fixes;
  last one Oct./Nov.~2003

  However, the Fortran code file \url{http://lib.stat.cmu.edu/apstat/217},
  was last changed {Thu 04 Aug 2005 03:43:28 PM CEST}.

  We have some results of the dip.dist of \emph{before} the bug fix;
  notably the ``dip of the dip'' probabilities have changed considerably!!

- see rcs log of ../../src/dip.c

\section{21st Century Improvement of  Hartigan$^2$'s Table}

((

Use listing package (or so to more or less ``cut \& paste'' the nice code in
\texttt{../../stuff/new-simul.Rout-1e6}

))

\section{The Dip in the Dip's Distribution}
\label{sec:dip_dip}
We have found empirically that the dip distribution itself starts with a ``dip''.
Specifically, the minimal possible value of $D_n$ is $\frac{1}{2n}$
\emph{and} the probability of reaching that value,
\begin{equation}
  \label{eq:P.Dn_min}
  \PR{D_n = \frac{1}{2n}},
\end{equation}
is large for small $n$.

E.g., consider an approximation of the dip distribution for $n=5$,
<<dip_n-is-5,fig=TRUE>>=
require("diptest") # after installing it ..
D5 <- replicate(10000, dip(runif(5)))
hist(D5, breaks=128, main = "Histogram of  replicate(10'000, dip(runif(5))))")
@

which looks as if there was a bug in the software --- but that look is misleading!
Note how the phenomenon is still visible for $n=8$,
<<dip_n-is-8,fig=TRUE>>=
D8 <- replicate(10000, dip(runif(8)))
hist(D8, breaks=128, main = "Histogram of  replicate(10'000, dip(runif(8))))")
@

Note that there is another phenomenon, in addition to the point mass at $1/(2n)$,
particularly visible, if we use \emph{many} replicates,
<<sim--n-eq-11, eval=FALSE>>=
 set.seed(11)
 n <- 11
 B.s11 <- 500000
 D11 <- replicate(B.s11, dip(runif(n)))
<<2nd-small-sample-phenomen--n-eq-11, echo=false>>=
if(file.exists(ff <- file.path(thisDir, "hist-D11.rda"))) {
  load(ff)
} else { ## takes a few minutes
<<sim--n-eq-11>>
  hD11 <- hist(D11, breaks=1e-6+(63:298)/(2*11*64), plot=FALSE)
  save(hD11, n, B.s11, file= ff)
}
<<2nd-small-sample-phenomen--n-eq-11, echo=false, fig=true>>=
B.str <- format(B.s11, sci=FALSE, big.mark="'")
plot(hD11, main = "",
     ## main = sprintf("Histogram of  replicate(%s, dip(runif(%d)))", B.str, n),
     border=NA, col="dark gray",
     xlab = substitute("Dip" ~~ D[.N.](U(group("[",list(0,1),"]"))), list(.N. = n)))
title(xlab= substitute(B == .B.SIM. ~ "replicates", list(.B.SIM. = B.str)),
      adj = .88)
lcol <- adjustcolor("orange4", 0.4)
abline(v = (1:3)/(2*n), col=lcol, lty=3, lwd=2)
axis(1, pos=0, at = (1:3)/(2*n),
     labels = expression(1/22, 2/22, 3/22), col=lcol, col.axis=lcol)
@

FIXME:\\
use \file{../../stuff/sim-minProb.R} \\
and \file{../../stuff/minProb-anal.R}

Further, it can be seen that the \emph{maximal} dip statistic
is $\frac 1 4 = 0.25$ and this upper bound can be reached simply (for even
$n$) using the the data $(0,0,\dots,0, \; 1, 1,\dots,1)$, a bi-point mass
with equal mass at both points.

\section{P-values for the Dip Test}
\label{sec:Pvals}
Note that it is not obvious how to compute $p$-values for ``the dip test'',
as that means determining the distribution of the test statistic, i.e.,
$D_n$ under the null hypothesis, but a natural null,
$H_o: F \in \{F \mathrm{cadlag} \mid f := \frac d{dx} F is unimodal \}$
is too large.  Hartigans'(1985) argued for using the uniform $U[0,1]$ i.e.,
$F'(x) = f(x)= \mathbf{1}_{[0,1]}(x) = [0 \le x \le 1]$ (Iverson bracket)
instead, even though they showed that it is not quite the ``least
favorable'' one.
Following Hartigans', we will define the $p$-value of an observed $d_n$ as
\begin{equation}
  \label{eq:Pval}
  P_{d_n} := \PR{D_n \ge d_n} := \PR{\mathrm{dip}(U_1,\dots,U_n) \ge d_n}, \ \
  \mathrm{where} \ U_i \sim U[0,1], \ \, \iid
\end{equation}

\subsection{Interpolating the Dip Table}
\label{sec:interpol}
Because of the asymptotic distribution,
$ \lim_{n\to\infty}\sqrt{n} \; D_n \isD  D_\infty$,
it is makes sense to consider the ``$\sqrt{n} D_n$''--scale,
even for finite $n$ values:
<<sqrt-n-qdip,fig=true>>=
data(qDiptab)
dnqd <- dimnames(qDiptab)
(nn. <- as.integer(dnqd[["n"]]))
matplot(nn., qDiptab*sqrt(nn.), type ="o", pch=1, cex = 0.4,
        log="x", xlab="n   [log scaled]",
        ylab = expression(sqrt(n) %*% q[D[n]]))
## Note that  1/2n  is the first possible value (with finite mass),,
## clearly visible for (very) small n:
lines(nn., sqrt(nn.)/(2*nn.), col=adjustcolor("yellow2",0.5), lwd=3)

P.p <- as.numeric(print(noquote(dnqd[["Pr"]])))
## Now look at one well known data set:
D <- dip(x <- faithful$waiting)
n <- length(x)
points(n, sqrt(n)*D, pch=13, cex=2, col= adjustcolor("blue2",.5), lwd=2)
## a simulated (approximate) $p$-value for D  is
mean(D <= replicate(10000, dip(runif(n)))) ## ~ 0.002
@

but we can use our table to compute a deterministic (but still approximate,
as the table is from simulation too) $p$-value:
<<interpolate-dip-table>>=
## We are in this interval:
n0 <- nn.[i.n <- findInterval(n, nn.)]
n1 <- nn.[i.n +1] ; c(n0, n1)
f.n <- (n - n0)/(n1 - n0)# in [0, 1]
## Now "find" y-interval:
y.0 <- sqrt(n0)* qDiptab[i.n  ,]
y.1 <- sqrt(n1)* qDiptab[i.n+1,]
(Pval <- 1 - approx(y.0 + f.n*(y.1 - y.0),
                    P.p,
                    xout = sqrt(n) * D)[["y"]])
## 0.018095
@

Finally, in May 2011, after several years of people asking for it,
I have implemented a \code{dip.test} function which makes use of a ---
somewhat more sophisticated --- interpolation scheme like the one above,
to compute a $p$-value.
As \code{qDiptab} has been based on $10^6$ samples, the interpolation
yields accurate $p$-values, unless in very extreme cases.
Here is the small ($n=63$) example from Hartigan$^2$,
<<statfac-dip.test>>=
data(statfaculty)
dip.test(statfaculty)
@
where, from a $p$-value of 8.7\%, we'd conclude that there is not enough
evidence against unimodality.

\subsection{Asymptotic Dip Distribution}
\label{sec:asymp}
We have conducted extensive simulations in order to explore the limit
distribution of $D_\infty$, i.e., the limit of $\sqrt{n} \; D_n$, (\ref{eq:D.infty}).

Our current \R\ code is in \file{ ../../stuff/asymp-distrib.R }
but the simulation results (7 Megabytes for each $n$) cannot be assumed to
be part of the package, nor maybe even to be simply accessible via the internet.
%% is bandwidth a problem ?  probably no longer in the near future?


%% Maybe
\section{Less Conservative Dip Testing}


\section{Session Info}

<<sessionInfo, results=tex>>=
toLatex(sessionInfo())
@

\bibliography{diptest}

\end{document}
