Using ncdfCF

What is NetCDF

“NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is also a community standard for sharing scientific data.”

NetCDF is developed by UCAR/Unidata and is widely used for climate and weather data as well as for other environmental datasets. The netcdf library is ported to a wide variety of operating systems and platforms, from laptop computers to large mainframes. Datasets are typically large arrays with dimensions for longitude, latitude and time, with other dimensions, such as depth, added according to the nature of the data. Other types of data are also commonly found.

Importantly, “a netCDF file includes information about the data it contains”. This comes in two flavours:

- Structural metadata that describe the dimensions and variables in the resource and how they relate to each other.
- Descriptive metadata, stored as attributes, that explain what the dimensions and variables represent.

Both types of metadata are necessary to “understand” the NetCDF resource.

Conventions

The descriptive metadata are not defined by the netcdf library. To ensure interoperability, several “conventions” have been developed over the years such that users of NetCDF data can correctly interpret what data developers have put in the resource. The most important of the conventions is the CF Metadata Conventions. These conventions define a large number of standards that help interpret NetCDF resources.

Other common conventions are related to climate prediction data, such as CMIP-5 and CMIP-6.

Using NetCDF resources in R

Basic access

The RNetCDF package is developed and maintained by the same team that develops and maintains the netcdf library. It provides an interface to the netcdf library that stays very close to the API of the library. As a result, it lacks the intuitive user experience and workflow that R users are familiar with.

Package ncdf4, the most widely used package to access NetCDF resources, does one better by performing the tedious task of reading the structural metadata from the resource that is needed for a basic understanding of the contents, such as dimension and variable details, but the library API concept remains with functions that directly map to the netcdf library functions.

One would really need to understand the NetCDF data model and implementation details to effectively use these packages. For instance, most data describing a dimension is stored as a variable. So to read the dimnames() of a dimension you’d have to call var.get.nc() or ncvar_get(). Neither package loads the attributes of the dimensions, variables and the dataset (“global” variables), which is essential to understand what the dimensions and variables represent.
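To illustrate, a minimal ncdf4 sketch of reading the values and units of the longitude dimension: the coordinate values live in a variable of the same name, and the attribute that explains them must be fetched separately. This uses the example file introduced later in this document.

```r
# Reading a "dimension" with ncdf4: coordinate values are stored in a
# variable of the same name; attributes must be fetched separately.
library(ncdf4)

fn <- system.file("extdata", "ERA5land_Rwanda_20160101.nc", package = "ncdfCF")
nc <- nc_open(fn)

lon <- ncvar_get(nc, "longitude")                    # the coordinate values
lon_units <- ncatt_get(nc, "longitude", "units")$value  # "degrees_east"

nc_close(nc)
```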

While both packages are very good at what they do, on their own they are clearly not enough for a comfortable R workflow.

Extending the base packages

Several packages have been developed to address some of these issues and make access to the data easier. Unfortunately, none of these packages provide a comprehensive R-style solution to accessing and interpreting NetCDF resources in an intuitive way.

ncdfCF

Package ncdfCF provides a high-level interface using functions and methods that are familiar to the R user. It reads the structural metadata and also the attributes upon opening the resource. In the process, the ncdfCF package also applies CF Metadata Conventions to interpret the data. This currently applies to:

- The axis designation. The dimensions in the example below are identified as the X, Y and T axes from their metadata.
- The time dimension. Time is interpreted through the CFtime package, including its calendar, so that raw numeric offsets can be expressed as timestamps.
- Bounds information on a dimension, when set.

Basic usage

Opening and inspecting the contents of a NetCDF resource is very straightforward:

suppressPackageStartupMessages(library(ncdfCF))

# Get a NetCDF file, here hourly data for 2016-01-01 over Rwanda
fn <- system.file("extdata", "ERA5land_Rwanda_20160101.nc", package = "ncdfCF")

# Open the file, all metadata is read
ds <- open_ncdf(fn)

# Easy access in understandable format to all the details
ds
#> Dataset   : /private/var/folders/gs/s0mmlczn4l7bjbmwfrrhjlt80000gn/T/RtmpyTj0rZ/Rinst1444e29697798/ncdfCF/extdata/ERA5land_Rwanda_20160101.nc 
#> 
#> Variables :
#>  id name long_name             units dimensions               
#>  3  t2m  2 metre temperature   K     longitude, latitude, time
#>  4  pev  Potential evaporation m     longitude, latitude, time
#>  5  tp   Total precipitation   m     longitude, latitude, time
#> 
#> Dimensions:
#>  id axis name      dims                                              unlim
#>  1  X    longitude [31: 28 ... 31]                                        
#>  2  Y    latitude  [21: -1 ... -3]                                        
#>  0  T    time      [24: 2016-01-01 00:00:00 ... 2016-01-01 23:00:00] U    
#> 
#> Attributes:
#>  id name        type    length value                                   
#>  0  CDI         NC_CHAR  64    Climate Data Interface version 2.4.1 ...
#>  1  Conventions NC_CHAR   6    CF-1.6                                  
#>  2  history     NC_CHAR 482    Tue May 28 18:39:12 2024: cdo seldate...
#>  3  CDO         NC_CHAR  64    Climate Data Operators version 2.4.1 ...

# Variables can be accessed through standard list-type extraction syntax
t2m <- ds[["t2m"]]
t2m
#> Variable: [3] t2m | 2 metre temperature
#> 
#> Dimensions:
#>  id axis      name                                              dims unlim
#>   1    X longitude                                   [31: 28 ... 31]      
#>   2    Y  latitude                                   [21: -1 ... -3]      
#>   0    T      time [24: 2016-01-01 00:00:00 ... 2016-01-01 23:00:00]     U
#> 
#> Attributes:
#>  id name          type      length value              
#>  0  long_name     NC_CHAR   19     2 metre temperature
#>  1  units         NC_CHAR    1     K                  
#>  2  add_offset    NC_DOUBLE  1     292.664569285614   
#>  3  scale_factor  NC_DOUBLE  1     0.00045127252204996
#>  4  _FillValue    NC_SHORT   1     -32767             
#>  5  missing_value NC_SHORT   1     -32767

# Same with dimensions, but now without first putting the object in a variable
ds[["longitude"]]
#> Dimension: [1] longitude
#> Axis     : X 
#> Length   : 31  
#> Range    : 28 ... 31 degrees_east 
#> Bounds   : (not set) 
#> 
#> Attributes:
#>  id name          type    length value       
#>  0  standard_name NC_CHAR  9     longitude   
#>  1  long_name     NC_CHAR  9     longitude   
#>  2  units         NC_CHAR 12     degrees_east
#>  3  axis          NC_CHAR  1     X

# Regular base R operations simplify life further
dimnames(ds[["pev"]]) # A variable: list of dimension names
#>   longitude    latitude        time 
#> "longitude"  "latitude"      "time"

dimnames(ds[["longitude"]]) # A dimension: vector of dimension element values
#>  [1] 28.0 28.1 28.2 28.3 28.4 28.5 28.6 28.7 28.8 28.9 29.0 29.1 29.2 29.3 29.4
#> [16] 29.5 29.6 29.7 29.8 29.9 30.0 30.1 30.2 30.3 30.4 30.5 30.6 30.7 30.8 30.9
#> [31] 31.0

# Access attributes
attribute(ds[["pev"]], "long_name")
#> [1] "Potential evaporation"

Extracting data

One of the perpetual headaches of users of NetCDF files is extracting the data. If you want to get all the data for a variable then neither RNetCDF nor ncdf4 is particularly troublesome:

library(ncdf4)
nc <- nc_open(fn)
vars <- names(nc$var)
d <- ncvar_get(nc, vars[[1]])

But what if you are interested in only a small area or a month of data while the resource has global data spanning multiple years? With both RNetCDF and ncdf4 you’d have to work out how your real-world boundaries translate into indices into the variable array of interest, and then populate the start and count arguments to pass to var.get.nc() or ncvar_get(). That may be feasible for longitude and latitude dimensions, but for a time dimension it becomes more complicated (reading and parsing the “units” attribute of the dimension), or nigh-on impossible when the dimension uses a non-standard calendar. Many R users default to simply reading the entire array and then extracting the area of interest using standard R tools. That is wasteful at best (lots of I/O, RAM usage and CPU cycles) and practically impossible with larger NetCDF resources whose variables run upwards of 1GB in size.
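A sketch of that manual index bookkeeping with ncdf4, for a time window on the example file used in this document. It assumes the standard calendar and the units string “hours since 1900-01-01 00:00:00.0” shown in the output elsewhere in this document; with a non-standard calendar this approach breaks down entirely.

```r
# Manually translating real-world times into start/count indices with ncdf4.
library(ncdf4)

fn <- system.file("extdata", "ERA5land_Rwanda_20160101.nc", package = "ncdfCF")
nc <- nc_open(fn)

# Read the time coordinate and interpret its "units" attribute by hand
time_vals <- ncvar_get(nc, "time")
origin <- as.POSIXct("1900-01-01 00:00:00", tz = "UTC")
stamps <- origin + time_vals * 3600          # "hours since ..." to POSIXct

# Translate the requested window into array indices
idx <- which(stamps >= as.POSIXct("2016-01-01 09:00:00", tz = "UTC") &
             stamps <  as.POSIXct("2016-01-01 15:00:00", tz = "UTC"))

# Read only that slab; count = -1 reads a whole dimension
d <- ncvar_get(nc, "t2m",
               start = c(1, 1, idx[1]),
               count = c(-1, -1, length(idx)))
nc_close(nc)
```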

Enter ncdfCF. With ncdfCF you have two options to extract data for a variable:

- Standard R array selection with the single-bracket operator, using integer indices into the dimensions.
- The subset() method, using real-world coordinates and timestamps.

# Extract a timeseries for a specific location
ts <- t2m[5, 4, ]
str(ts)
#>  num [1, 1, 1:24] 293 292 292 291 291 ...
#>  - attr(*, "dimnames")=List of 3
#>   ..$ : chr "28.4"
#>   ..$ : chr "-1.3"
#>   ..$ : chr [1:24] "2016-01-01 00:00:00" "2016-01-01 01:00:00" "2016-01-01 02:00:00" "2016-01-01 03:00:00" ...
#>  - attr(*, "axis")= Named chr [1:3] "X" "Y" "T"
#>   ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "time"
#>  - attr(*, "time")=List of 1
#>   ..$ time:Formal class 'CFtime' [package "CFtime"] with 4 slots
#>   .. .. ..@ datum     :Formal class 'CFdatum' [package "CFtime"] with 5 slots
#>   .. .. .. .. ..@ definition: chr "hours since 1900-01-01 00:00:00.0"
#>   .. .. .. .. ..@ unit      : int 3
#>   .. .. .. .. ..@ origin    :'data.frame':   1 obs. of  8 variables:
#>   .. .. .. .. .. ..$ year  : int 1900
#>   .. .. .. .. .. ..$ month : num 1
#>   .. .. .. .. .. ..$ day   : num 1
#>   .. .. .. .. .. ..$ hour  : num 0
#>   .. .. .. .. .. ..$ minute: num 0
#>   .. .. .. .. .. ..$ second: num 0
#>   .. .. .. .. .. ..$ tz    : chr "+0000"
#>   .. .. .. .. .. ..$ offset: num 0
#>   .. .. .. .. ..@ calendar  : chr "gregorian"
#>   .. .. .. .. ..@ cal_id    : int 1
#>   .. .. ..@ resolution: num 1
#>   .. .. ..@ offsets   : num [1:24] 1016832 1016833 1016834 1016835 1016836 ...
#>   .. .. ..@ bounds    : logi FALSE

# Extract the full spatial extent for one time step
ts <- t2m[, , 12]
str(ts)
#>  num [1:31, 1:21, 1] 300 300 300 300 300 ...
#>  - attr(*, "dimnames")=List of 3
#>   ..$ : chr [1:31] "28" "28.1" "28.2" "28.3" ...
#>   ..$ : chr [1:21] "-1" "-1.1" "-1.2" "-1.3" ...
#>   ..$ : chr "2016-01-01 11:00:00"
#>  - attr(*, "axis")= Named chr [1:3] "X" "Y" "T"
#>   ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "time"
#>  - attr(*, "time")=List of 1
#>   ..$ time:Formal class 'CFtime' [package "CFtime"] with 4 slots
#>   .. .. ..@ datum     :Formal class 'CFdatum' [package "CFtime"] with 5 slots
#>   .. .. .. .. ..@ definition: chr "hours since 1900-01-01 00:00:00.0"
#>   .. .. .. .. ..@ unit      : int 3
#>   .. .. .. .. ..@ origin    :'data.frame':   1 obs. of  8 variables:
#>   .. .. .. .. .. ..$ year  : int 1900
#>   .. .. .. .. .. ..$ month : num 1
#>   .. .. .. .. .. ..$ day   : num 1
#>   .. .. .. .. .. ..$ hour  : num 0
#>   .. .. .. .. .. ..$ minute: num 0
#>   .. .. .. .. .. ..$ second: num 0
#>   .. .. .. .. .. ..$ tz    : chr "+0000"
#>   .. .. .. .. .. ..$ offset: num 0
#>   .. .. .. .. ..@ calendar  : chr "gregorian"
#>   .. .. .. .. ..@ cal_id    : int 1
#>   .. .. ..@ resolution: num NA
#>   .. .. ..@ offsets   : num 1016843
#>   .. .. ..@ bounds    : logi FALSE

Note that the results contain degenerate dimensions (of length 1). This is by design because it allows attributes to be attached, and then inspected by the user, in a consistent manner.
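When you do want a plain matrix, for instance for plotting, base R’s drop() removes the degenerate dimension. Note that this discards the extra attributes attached to the result; the sketch below re-creates the t2m object from the example file so it stands on its own.

```r
# Dropping the degenerate time dimension of a single time slice.
library(ncdfCF)

fn <- system.file("extdata", "ERA5land_Rwanda_20160101.nc", package = "ncdfCF")
t2m <- open_ncdf(fn)[["t2m"]]

slice <- t2m[, , 12]   # 31 x 21 x 1 array, as shown above
m <- drop(slice)       # 31 x 21 matrix; dimnames of the remaining dims are kept
```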

# Extract a specific region, full time dimension
ts <- subset(t2m, list(X = 29:30, Y = -1:-2))
str(ts)
#>  num [1:10, 1:10, 1:24] 290 291 291 292 293 ...
#>  - attr(*, "dimnames")=List of 3
#>   ..$ : chr [1:10] "29" "29.1" "29.2" "29.3" ...
#>   ..$ : chr [1:10] "-1" "-1.1" "-1.2" "-1.3" ...
#>   ..$ : chr [1:24] "2016-01-01 00:00:00" "2016-01-01 01:00:00" "2016-01-01 02:00:00" "2016-01-01 03:00:00" ...
#>  - attr(*, "axis")= Named chr [1:3] "X" "Y" "T"
#>   ..- attr(*, "names")= chr [1:3] "longitude" "latitude" "time"
#>  - attr(*, "time")=List of 1
#>   ..$ time:Formal class 'CFtime' [package "CFtime"] with 4 slots
#>   .. .. ..@ datum     :Formal class 'CFdatum' [package "CFtime"] with 5 slots
#>   .. .. .. .. ..@ definition: chr "hours since 1900-01-01 00:00:00.0"
#>   .. .. .. .. ..@ unit      : int 3
#>   .. .. .. .. ..@ origin    :'data.frame':   1 obs. of  8 variables:
#>   .. .. .. .. .. ..$ year  : int 1900
#>   .. .. .. .. .. ..$ month : num 1
#>   .. .. .. .. .. ..$ day   : num 1
#>   .. .. .. .. .. ..$ hour  : num 0
#>   .. .. .. .. .. ..$ minute: num 0
#>   .. .. .. .. .. ..$ second: num 0
#>   .. .. .. .. .. ..$ tz    : chr "+0000"
#>   .. .. .. .. .. ..$ offset: num 0
#>   .. .. .. .. ..@ calendar  : chr "gregorian"
#>   .. .. .. .. ..@ cal_id    : int 1
#>   .. .. ..@ resolution: num 1
#>   .. .. ..@ offsets   : num [1:24] 1016832 1016833 1016834 1016835 1016836 ...
#>   .. .. ..@ bounds    : logi FALSE

# Extract specific time slices for a specific region
# Note that the dimensions are specified out of order and using alternative
# specifications: only the extreme values are used.
ts <- subset(t2m, list(T = c("2016-01-01 09:00", "2016-01-01 15:00"),
                       X = c(29.6, 28.8),
                       Y = seq(-2, -1, by = 0.05)))
dimnames(ts)
#> [[1]]
#> [1] "28.8" "28.9" "29"   "29.1" "29.2" "29.3" "29.4" "29.5"
#> 
#> [[2]]
#>  [1] "-1"   "-1.1" "-1.2" "-1.3" "-1.4" "-1.5" "-1.6" "-1.7" "-1.8" "-1.9"
#> 
#> [[3]]
#> [1] "2016-01-01 09:00:00" "2016-01-01 10:00:00" "2016-01-01 11:00:00"
#> [4] "2016-01-01 12:00:00" "2016-01-01 13:00:00" "2016-01-01 14:00:00"

# Same extraction but now with the upper bound included.
ts <- subset(t2m, list(T = c("2016-01-01 09:00", "2016-01-01 15:00"),
                       X = c(29.6, 28.8),
                       Y = seq(-2, -1, by = 0.05)),
             rightmost.closed = TRUE)
dimnames(ts)
#> [[1]]
#> [1] "28.8" "28.9" "29"   "29.1" "29.2" "29.3" "29.4" "29.5" "29.6"
#> 
#> [[2]]
#>  [1] "-1"   "-1.1" "-1.2" "-1.3" "-1.4" "-1.5" "-1.6" "-1.7" "-1.8" "-1.9"
#> [11] "-2"  
#> 
#> [[3]]
#> [1] "2016-01-01 09:00:00" "2016-01-01 10:00:00" "2016-01-01 11:00:00"
#> [4] "2016-01-01 12:00:00" "2016-01-01 13:00:00" "2016-01-01 14:00:00"
#> [7] "2016-01-01 15:00:00"

These methods will read data from the NetCDF resource, but only as much as is requested.

Both approaches lend themselves well to the apply() family of functions for processing. Because the data is read on demand, you can tune the size of each request to the capacity of your computer without exhausting RAM.
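For example, a sketch of per-cell aggregation with apply(): computing the daily mean temperature for each grid cell from the 24 hourly slices in the example file. It assumes that, as with base R arrays, empty indices select a whole dimension.

```r
# Per-cell aggregation over the time dimension with apply().
library(ncdfCF)

fn <- system.file("extdata", "ERA5land_Rwanda_20160101.nc", package = "ncdfCF")
t2m <- open_ncdf(fn)[["t2m"]]

ts <- t2m[, , ]                      # read the full 31 x 21 x 24 array
daily_mean <- apply(ts, 1:2, mean)   # collapse the time dimension
```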