Latin Hypercube Sampling

What is LHS?

Latin hypercube sampling aims to bring the best of both worlds: the unbiased random sampling of monte carlo simulation; and the even coverage of a grid search over the decision space. It does this by ensuring values for all variables are as uncorrelated and widely varying as possible (over the range of permitted values).

Why do I need one?

The background to this is that I have been working on improving the convergence of an optimisation package that uses genetic algorithms. Having a good distribution of “genes”, or decisions within the initial population is key to allowing a GA to effectively explore the decision space. The ga() function within the GA package in R allows for a suggestions argument, which takes a matrix of decision values and places them within the starting population. Initially I wrote one function to create evenly spaced sequences for every decision between the lower and upper bound of allowable values, from which I enumerated all possible combinations using expand.grid(). I then wrote another function to take a model object and a user-defined population size and automatically work out the nearest population count that allowed for even sampling over the model’s decision bounds.

For models with large numbers of decisions the potential number of combinations is enormous. This is true even if you only select the upper and lower bound per decision. Already with 30 independent decisions with two possible values the number of combinations is 2^30 = 10,737,41,824, 10 times more than the number of stars in the average galaxy. Combinatorial algorithms are the worst type for growth in complexity (\(O(n!)\)). I needed a way of randomly sampling from the decisions but in a way that ensured the starting population had lots of diversity. This is the exact use case for LHS.

Practical examples

Let’s assume we have three decisions, where:

  • decision a can be between 1 and 3,
  • decision b can be TRUE or FALSE, and
  • decision c can be red, green, blue or black

There are \(3 * 2 * 4 = 24\) possible unique combinations of these decisions (if a can only take on integer values). An individual in the starting population of a genetic algorithm would be of the form, 1_TRUE_red, for instance.

Let’s assume we would like only 12 individuals from the 24 potential unique combinations, but we still want good representation of all/most possible decisions. Using the lhs library from R, we first create 12 random uniform distributions between 0 and 1 for each of the three decisions, a, b and c.

library(lhs)
library(dplyr)
library(purrr)

# Test data
a <- 1:3
b <- c(TRUE, FALSE)
c <- c("red", "green", "blue", "black")
all_decisions <- rbind(list(a,b,c))
sample <- as.data.frame(randomLHS(12, 3))
names(sample) <- c("a", "b", "c")

# Uniform random values between 0 and 1
sample
##             a          b          c
## 1  0.41652181 0.18546382 0.87467267
## 2  0.81220155 0.14374777 0.18730695
## 3  0.44328892 0.81169787 0.46899786
## 4  0.05260195 0.07355818 0.71380152
## 5  0.23585575 0.74240078 0.79222327
## 6  0.29855073 0.40655744 0.04641022
## 7  0.64784589 0.58572787 0.99830407
## 8  0.88174995 0.90155272 0.15134718
## 9  0.56699472 0.32409568 0.29897915
## 10 0.94901332 0.99298967 0.59935152
## 11 0.68635953 0.46947453 0.40373326
## 12 0.12035170 0.50435556 0.57419772

Next we map the 0-1 distributions onto the real distributions of a, b and c. For instance, c has four possible values so the distribution should be 1-4. We also convert the values to factors in order to label them properly.

# Random choices for each decision with distributions based on a, b and c
choices <- map2(sample, all_decisions,
                ~cut(.x, length(.y), labels = .y)) %>%
  bind_rows()
  
choices
## # A tibble: 12 x 3
##    a     b     c    
##    <fct> <fct> <fct>
##  1 2     TRUE  black
##  2 3     TRUE  red  
##  3 2     FALSE green
##  4 1     TRUE  blue 
##  5 1     FALSE black
##  6 1     TRUE  red  
##  7 2     FALSE black
##  8 3     FALSE red  
##  9 2     TRUE  green
## 10 3     FALSE blue 
## 11 3     TRUE  green
## 12 1     TRUE  blue

For convenience we can bring this into a single function like so.

lhs_sample <- function(decisions, n){
  stopifnot(is.list(decisions))
  len_decisions <- length(decisions)
  samples <- lhs::randomLHS(n, len_decisions) %>%
    as.data.frame()
  names(samples) <- names(decisions)
  choices <- purrr::map2(samples, decisions,
                  ~cut(.x, length(.y), labels = .y))
  bind_rows(choices)
}

lhs_sample(list(cars = rownames(mtcars), species = unique(iris$Species), letter = LETTERS[24:26]), n = 10)
## # A tibble: 10 x 3
##    cars               species    letter
##    <fct>              <fct>      <fct> 
##  1 Pontiac Firebird   virginica  X     
##  2 Fiat 128           versicolor Y     
##  3 Hornet Sportabout  setosa     Y     
##  4 Merc 280           setosa     X     
##  5 Mazda RX4          setosa     X     
##  6 Ford Pantera L     setosa     Z     
##  7 Dodge Challenger   versicolor Z     
##  8 Volvo 142E         virginica  Y     
##  9 Duster 360         versicolor X     
## 10 Cadillac Fleetwood virginica  Z

Testing coverage

In theory, each sample should be orthogonal, or independent, of each other sample to the greatest possible extent. Another way of putting it is that there should neither be any one value overrepresented or under-represented in the sample set. Let’s test this in practice:

library(ggplot2)
big_sample <- lhs_sample(decisions = list(a = a,b = b,c = c), n = 1000)

ggplot(big_sample, aes(x = c, fill = c)) +
  geom_bar() +
  scale_fill_manual(values = c("red", "green", "blue", "black")) +
  labs(title = "For a single decision, each value has been sampled equally")

ggplot(big_sample, aes(x = a, y = b, colour = c)) +
  geom_jitter() +
  scale_colour_manual(values = c("red", "green", "blue", "black")) +
  labs(title = "Every value combination across all decisions has also been sampled equally")

Richard Davey
Richard Davey
Analytical Consultant

My interests include earth science, numerical modelling and problem solving through optimisation.

Related