Create a Date Dimension

data warehouse
Author

Paulius Alaburda

Published

June 29, 2023

What’s a date dimension anyway?

I have been searching for a clean solution to building a date dimension so here I am writing my own. In R!

If you don’t speak Kimballese, a date dimension is a calendar. It contains a row for each day in a year as well as attributes about those dates. For example, a date dimension for 2023 would contain 365 rows and have columns such as week number, day of week, whether the day is a holiday etc. There’s really no correct way of building a date dimension as different organisations might have different ways of aggregating data across time but it is useful to have a “starter” table.

Building the date table

First, we’ll build the table using lubridate. My locale is Lithuanian, but your output should adapt to your locale:

library(lubridate)

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(tibble)

date_dimension <- tibble(date = seq(as.Date('2011-01-01'),as.Date('2011-01-31'),by = 1),
                         full_date_description = format(date, format="%Y m. %B %d d."),
                         day_of_week = wday(date, label=FALSE, week_start = 1),
                         day_of_week_name = wday(date, label=TRUE, abbr = FALSE),
                         calendar_iso_week = isoweek(date),
                         calendar_week = week(date),
                         calendar_month = month(date),
                         calendar_month_name = month(date, label = TRUE, abbr = FALSE),
                         calendar_quarter = quarter(date),
                         calendar_quarter_name = paste0("Q",quarter(date)),
                         calendar_year = year(date),
                         is_weekday = as.integer(wday(date, week_start = 1) < 6))

head(date_dimension)
# A tibble: 6 × 12
  date       full_date_description day_of_week day_of_week_name
  <date>     <chr>                       <dbl> <ord>           
1 2011-01-01 2011 m. sausis 01 d.            6 šeštadienis     
2 2011-01-02 2011 m. sausis 02 d.            7 sekmadienis     
3 2011-01-03 2011 m. sausis 03 d.            1 pirmadienis     
4 2011-01-04 2011 m. sausis 04 d.            2 antradienis     
5 2011-01-05 2011 m. sausis 05 d.            3 trečiadienis    
6 2011-01-06 2011 m. sausis 06 d.            4 ketvirtadienis  
# ℹ 8 more variables: calendar_iso_week <dbl>, calendar_week <dbl>,
#   calendar_month <dbl>, calendar_month_name <ord>, calendar_quarter <int>,
#   calendar_quarter_name <chr>, calendar_year <dbl>, is_weekday <int>

If this were a typical date dimension post, all you would have to do is adjust the date range and either export the table or write to a database. But this date dimension could be improved:

  • It doesn’t contain bank holidays
  • If your use case is different enough from mine, you would have to edit more than the date range to get a desired result.

Let’s add holidays

To get bank holidays, you have a few options:

  • Nager.Date - free to use, no sign up required
  • abstract - free tier available but sign up required
  • Calendarific - free tier available but sign up required
  • Use The Holiday API - free for previous years but paid for current and future years
  • Scrape Public Holidays Global - free to use but I haven’t checked their scraping policy

Let’s use Nager.Date for this exercise. The API returns a list of dates that are bank holidays with quite a few useful attributes but I am keeping just the English and local holiday names.

get_holidays <- function(country_code = "lt", year = "2023") {
  
  rs <- httr::GET(glue::glue("https://date.nager.at/api/v3/publicholidays/{year}/{country_code}"))
  
  out <- httr::content(rs) %>% 
    tibble::enframe() %>% 
    dplyr::select(value) %>% 
    tidyr::unnest_wider(value) %>% 
    dplyr::transmute(date = as.Date(date),
           holiday_name_local = localName,
           holiday_name = name)
  
  return(out)
  
}

get_holidays()
# A tibble: 14 × 3
   date       holiday_name_local                       holiday_name             
   <date>     <chr>                                    <chr>                    
 1 2023-01-01 Naujieji metai                           New Year's Day           
 2 2023-02-16 Lietuvos valstybės atkūrimo diena        The Day of Restoration o…
 3 2023-03-11 Lietuvos nepriklausomybės atkūrimo diena Day of Restoration of In…
 4 2023-04-09 Velykos                                  Easter Sunday            
 5 2023-04-10 Antroji Velykų diena                     Easter Monday            
 6 2023-05-01 Tarptautinė darbo diena                  International Working Day
 7 2023-06-24 Joninės, Rasos                           St. John's Day           
 8 2023-07-06 Valstybės diena                          Statehood Day            
 9 2023-08-15 Žolinė                                   Assumption Day           
10 2023-11-01 Visų šventųjų diena                      All Saints' Day          
11 2023-11-02 Vėlinės                                  All Souls' Day           
12 2023-12-24 Šv. Kūčios                               Christmas Eve            
13 2023-12-25 Šv. Kalėdos                              Christmas Day            
14 2023-12-26 Šv. Kalėdos                              St. Stephen's Day        

Make it a function

To turn the code above into a function we are making the following changes:

  • Pulling start and end dates, date format, week start number and country code as function arguments
  • Passing the years from the date dimension to the get_holiday function. I want to create tables that span multiple years so we’re using purrr::map_df to get holidays for multiple years.
date_dimension <- function(date_from, date_to, full_date_format = "%Y m. %B %d d.", week_start = 1, country_code = "LT") {
  
  date_skeleton <- tibble(date = seq(as.Date(date_from),as.Date(date_to),by = 1),
                          full_date_description = format(date, format=full_date_format),
                          day_of_week = wday(date, label=FALSE, week_start = week_start),
                          day_of_week_name = wday(date, label=TRUE, abbr = FALSE),
                          calendar_iso_week = isoweek(date),
                          calendar_week = week(date),
                          calendar_month = month(date),
                          calendar_month_name = month(date, label = TRUE, abbr = FALSE),
                          calendar_quarter = quarter(date),
                          calendar_quarter_name = paste0("Q",quarter(date)),
                          calendar_year = year(date),
                          is_weekday = as.integer(wday(date, week_start = 1) < 6))

  holidays <- purrr::map_df(unique(date_skeleton$calendar_year), ~get_holidays(country_code = country_code, year = .))

  out <- date_skeleton %>%
    dplyr::left_join(holidays, by = "date") %>%
    dplyr::mutate(is_workday = is_weekday*is.na(holiday_name))


  return(out)

}

date_dimension("2024-01-01","2024-05-01")
# A tibble: 122 × 15
   date       full_date_description day_of_week day_of_week_name
   <date>     <chr>                       <dbl> <ord>           
 1 2024-01-01 2024 m. sausis 01 d.            1 pirmadienis     
 2 2024-01-02 2024 m. sausis 02 d.            2 antradienis     
 3 2024-01-03 2024 m. sausis 03 d.            3 trečiadienis    
 4 2024-01-04 2024 m. sausis 04 d.            4 ketvirtadienis  
 5 2024-01-05 2024 m. sausis 05 d.            5 penktadienis    
 6 2024-01-06 2024 m. sausis 06 d.            6 šeštadienis     
 7 2024-01-07 2024 m. sausis 07 d.            7 sekmadienis     
 8 2024-01-08 2024 m. sausis 08 d.            1 pirmadienis     
 9 2024-01-09 2024 m. sausis 09 d.            2 antradienis     
10 2024-01-10 2024 m. sausis 10 d.            3 trečiadienis    
# ℹ 112 more rows
# ℹ 11 more variables: calendar_iso_week <dbl>, calendar_week <dbl>,
#   calendar_month <dbl>, calendar_month_name <ord>, calendar_quarter <int>,
#   calendar_quarter_name <chr>, calendar_year <dbl>, is_weekday <int>,
#   holiday_name_local <chr>, holiday_name <chr>, is_workday <int>

That’s it! If you want to learn more about date dimensions, be sure to read Kimball’s Data Warehousing Toolkit Chapter 3.