Title: | Gaussian Mixture Graphical Model Learning and Inference |
---|---|
Description: | Gaussian mixture graphical models include Bayesian networks and dynamic Bayesian networks (their temporal extension) whose local probability distributions are described by Gaussian mixture models. They are powerful tools for graphically and quantitatively representing nonlinear dependencies between continuous variables. This package provides a complete framework to create, manipulate, learn the structure and the parameters, and perform inference in these models. Most of the algorithms are described in the PhD thesis of Roos (2018) <https://tel.archives-ouvertes.fr/tel-01943718>. |
Authors: | Jérémy Roos [aut, cre, cph], RATP Group [fnd, cph] |
Maintainer: | Jérémy Roos <[email protected]> |
License: | GPL-3 |
Version: | 1.1.2 |
Built: | 2025-03-08 03:30:54 UTC |
Source: | https://github.com/jeremyroos/gmgm |
This package provides a complete framework to deal with Gaussian mixture graphical models, which covers Bayesian networks and dynamic Bayesian networks (their temporal extension) whose local probability distributions are described by Gaussian mixture models. It includes a wide range of functions for:
- creating or modifying the structure of Gaussian mixture models (add_var, gmm, merge_comp, remove_var, rename_var, reorder, split_comp) or graphical models (add_arcs, add_nodes, gmbn, gmdbn, relevant, remove_arcs, remove_nodes, rename_nodes);
- describing or visualizing Gaussian mixture models (conditional, ellipses, summary.gmm) or graphical models (network, structure, summary.gmbn, summary.gmdbn);
- computing densities, expectations, or sampling Gaussian mixture models (density, expectation, sampling);
- computing scores of Gaussian mixture models (AIC.gmm, BIC.gmm, logLik.gmm) or graphical models (AIC.gmbn, AIC.gmdbn, BIC.gmbn, BIC.gmdbn, logLik.gmbn, logLik.gmdbn);
- learning the structure and/or the parameters of Gaussian mixture models (em, smem, stepwise) or graphical models (param_em, param_learn, struct_em, struct_learn);
- performing inference in Gaussian mixture graphical models (aggregation, filtering, inference, particles, prediction, propagation, smoothing).
Descriptions of these functions are provided in this manual with related references. Most of the algorithms are described in the PhD thesis of Roos (2018, in French). To better handle this package, two real-world datasets are provided (data_air, data_body) with examples of Gaussian mixture models and graphical models (gmbn_body, gmdbn_air, gmm_body).
Roos, J. (2018). Prévision à court terme des flux de voyageurs : une approche par les réseaux bayésiens. PhD thesis, University of Lyon.
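As a quick orientation, these functions chain together naturally. The following minimal sketch (using the bundled data_body dataset; the variable selection and number of components are illustrative, not a recommendation) initializes a Gaussian mixture model, fits it with the EM algorithm and scores the result:

library(gmgm)
data(data_body)

# Initialize a 3-component GMM over two variables, fit it by EM, score it
gmm_0 <- split_comp(add_var(NULL, data_body[, c("WAIST", "WEIGHT")]),
                    n_sub = 3)
res <- em(gmm_0, data_body)
BIC(res$gmm, data_body)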
This function adds arcs to a Gaussian mixture graphical model. For each added arc, a variable related to the start node is added to the Gaussian mixture model describing the local distribution over the end node and its parents, with mean 0 and variance 1 for each mixture component.
add_arcs(gmgm, arcs)
gmgm | An object of class gmbn or gmdbn. |
arcs | A data frame containing the added arcs. The column from describes the start nodes and the column to the end nodes; an optional column lag gives the time lags of the start nodes (as in the examples below). |
The gmbn or gmdbn object after adding the arcs.
add_nodes, relevant, remove_arcs, remove_nodes, rename_nodes
data(gmbn_body)
gmbn_1 <- add_arcs(gmbn_body,
                   data.frame(from = c("GENDER", "AGE"),
                              to = c("GLYCO", "WEIGHT")))
data(gmdbn_air)
gmdbn_1 <- add_arcs(gmdbn_air,
                    list(b_2 = data.frame(from = "WIND", to = "NO2", lag = 1),
                         b_13 = data.frame(from = c("NO2", "NO2"),
                                           to = c("O3", "O3"),
                                           lag = c(0, 1))))
This function adds nodes to a Gaussian mixture graphical model. If this model is a dynamic Bayesian network, the nodes are added to each of its transition models. For each added node, a one-component univariate Gaussian mixture model is created with mean 0 and variance 1.
add_nodes(gmgm, nodes)
gmgm | An object of class gmbn or gmdbn (or NULL to initialize a new gmbn object). |
nodes | A character vector containing the added nodes. |
The gmbn or gmdbn object after adding the nodes.
add_arcs, relevant, remove_arcs, remove_nodes, rename_nodes
data(gmbn_body)
gmbn_1 <- add_nodes(gmbn_body, c("CHOL", "TRIGLY"))
data(gmdbn_air)
gmdbn_1 <- add_nodes(gmdbn_air, "PM10")
This function adds variables to a Gaussian mixture model.
add_var(gmm, var)
gmm | An object of class gmm (or NULL to initialize a new one-component gmm object). |
var | A character vector containing the added variables, or a data frame or numeric matrix whose columns are named after the added variables. In the first case, for each mixture component, the marginal mean vector of the added variables is 0 and the marginal covariance matrix the identity matrix. In the second case, these mean vector and covariance matrix are computed from the data (after removing the rows that contain missing values). |
The gmm object after adding the variables.
data(gmm_body, data_body)
gmm_1 <- add_var(gmm_body, "GENDER")
gmm_2 <- add_var(gmm_body, data_body[, "GENDER"])
This function aggregates particles to obtain inferred values. Assuming that the particles have been propagated to a given time slice t, the weighted average of the samples is computed to estimate the state of the system at t or at previous time slices (Koller and Friedman, 2009).
aggregation(part, nodes, col_seq = NULL, col_weight = "weight", lag = 0)
part | A data frame containing the particles propagated to time slice t, as returned by function propagation. |
nodes | A character vector containing the inferred nodes. |
col_seq | A character vector containing the column names of part that describe the observation sequences. If NULL (the default), all the particles belong to a single sequence. |
col_weight | A character string corresponding to the column name of part that describes the particle weight. |
lag | A non-negative integer vector containing the time lags for which the inferred values are estimated. |
If lag has one element, a data frame (tibble) containing the aggregated values of the inferred nodes and their observation sequences (if col_seq is not NULL). If lag has two or more elements, a list of data frames (tibbles) containing these values for each time lag.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
library(dplyr)
set.seed(0)
data(gmdbn_air, data_air)
evid <- data_air %>%
  group_by(DATE) %>%
  slice(1:3) %>%
  ungroup()
evid$NO2[sample.int(150, 30)] <- NA
evid$O3[sample.int(150, 30)] <- NA
evid$TEMP[sample.int(150, 30)] <- NA
evid$WIND[sample.int(150, 30)] <- NA
aggreg <- particles(data.frame(DATE = unique(evid$DATE))) %>%
  propagation(gmdbn_air, evid, col_seq = "DATE", n_times = 3) %>%
  aggregation(c("NO2", "O3", "TEMP", "WIND"), col_seq = "DATE", lag = c(0, 1))
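Since lag = c(0, 1) is used above, aggregation() returns one tibble per time lag. A minimal sketch, assuming positional access to that list:

# Lag 0 gives filtered estimates, lag 1 one-slice fixed-lag smoothing
aggreg_filt <- aggreg[[1]]
aggreg_smooth <- aggreg[[2]]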
This function computes the Akaike Information Criterion (AIC) of a Gaussian mixture model or graphical model:

AIC = logLik - n_par

where logLik is the log-likelihood and n_par the number of free parameters.
## S3 method for class 'gmm'
AIC(object, data, y = NULL, regul = 0.01, ...)

## S3 method for class 'gmbn'
AIC(object, data, col_seq = NULL, ...)

## S3 method for class 'gmdbn'
AIC(object, data, col_seq = NULL, ...)
object | An object of class gmm, gmbn or gmdbn. |
data | A data frame containing the data used to compute the AIC. Its columns must explicitly be named after the variables (or nodes) of object. |
y | A character vector containing the dependent variables if a conditional AIC is computed. If NULL (the default), the joint AIC is computed. |
regul | A positive numeric value corresponding to the regularization constant if a penalty term is added for Bayesian regularization. If NULL, no penalty term is added. |
... | Unused arguments from the generic function. |
col_seq | A character vector containing the column names of data that describe the observation sequences. If NULL (the default), all the observations belong to a single sequence. |
If object is a gmm object, a numeric value corresponding to the AIC.
If object is a gmbn or gmdbn object, a list with elements:
global | A numeric value corresponding to the global AIC. |
local | For a gmbn object, a numeric vector containing the local conditional AICs. For a gmdbn object, a list of numeric vectors containing these values for each transition model. |
data(gmm_body, data_body)
aic_1 <- AIC(gmm_body, data_body)
aic_2 <- AIC(gmm_body, data_body, y = "WAIST")
data(gmbn_body, data_body)
aic_3 <- AIC(gmbn_body, data_body)
data(gmdbn_air, data_air)
aic_4 <- AIC(gmdbn_air, data_air, col_seq = "DATE")
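For graphical models, the returned list separates the global score from the per-node local scores (see the return value above); for instance, with aic_3 from the example:

# Global AIC of the Bayesian network and local AICs of its nodes
aic_3$global
aic_3$local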
This function computes the Bayesian Information Criterion (BIC) of a Gaussian mixture model or graphical model:

BIC = logLik - (log(n_obs) / 2) * n_par

where logLik is the log-likelihood, n_obs the number of observations in the data and n_par the number of free parameters.
## S3 method for class 'gmm'
BIC(object, data, y = NULL, regul = 0.01, ...)

## S3 method for class 'gmbn'
BIC(object, data, col_seq = NULL, ...)

## S3 method for class 'gmdbn'
BIC(object, data, col_seq = NULL, ...)
object | An object of class gmm, gmbn or gmdbn. |
data | A data frame containing the data used to compute the BIC. Its columns must explicitly be named after the variables (or nodes) of object. |
y | A character vector containing the dependent variables if a conditional BIC is computed. If NULL (the default), the joint BIC is computed. |
regul | A positive numeric value corresponding to the regularization constant if a penalty term is added for Bayesian regularization. If NULL, no penalty term is added. |
... | Unused arguments from the generic function. |
col_seq | A character vector containing the column names of data that describe the observation sequences. If NULL (the default), all the observations belong to a single sequence. |
If object is a gmm object, a numeric value corresponding to the BIC.
If object is a gmbn or gmdbn object, a list with elements:
global | A numeric value corresponding to the global BIC. |
local | For a gmbn object, a numeric vector containing the local conditional BICs. For a gmdbn object, a list of numeric vectors containing these values for each transition model. |
data(gmm_body, data_body)
bic_1 <- BIC(gmm_body, data_body)
bic_2 <- BIC(gmm_body, data_body, y = "WAIST")
data(gmbn_body, data_body)
bic_3 <- BIC(gmbn_body, data_body)
data(gmdbn_air, data_air)
bic_4 <- BIC(gmdbn_air, data_air, col_seq = "DATE")
This function conditionalizes a Gaussian mixture model (Sun et al., 2006).
conditional(gmm, y = rownames(gmm$mu)[1])
gmm | An object of class gmm. |
y | A character vector containing the dependent variables (by default the first variable of gmm). |
A list with elements:
alpha | A numeric vector containing the mixture proportions. |
mu_x | A numeric matrix containing the marginal mean vectors of the explanatory variables bound by column. |
sigma_x | A list containing the marginal covariance matrices of the explanatory variables. |
coeff | A list containing the regression coefficient matrices of the dependent variables on the explanatory variables. |
sigma_c | A list containing the conditional covariance matrices. |
Sun, S., Zhang, C. and Yu, G. (2006). A Bayesian Network Approach to Traffic Flow Forecasting. IEEE Transactions on Intelligent Transportation Systems, 7(1):124–132.
data(gmm_body)
cond <- conditional(gmm_body)
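The returned elements parameterize each mixture component as a linear-Gaussian regression of the dependent variables on the explanatory ones; for instance, the parameters of the first component can be inspected with:

# Regression coefficients and conditional covariance of component 1
cond$coeff[[1]]
cond$sigma_c[[1]]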
This dataset includes hourly air pollutants and weather data measured at the Dongsi air quality monitoring site in Beijing (China) for 320 complete days of the year 2015. These data are taken from the Beijing Multi-Site Air Quality Dataset published in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data (Zhang et al., 2017).
data_air
A data frame (tibble) with 7680 rows and 6 columns:
DATE: day's date;
HOUR: hour of the day;
NO2: nitrogen dioxide concentration (μg/m³);
O3: ozone concentration (μg/m³);
TEMP: temperature (°C);
WIND: wind speed (m/s).
Zhang, S., Guo, B., Dong, A., He, J., Xu, Z. and Chen, S. X. (2017). Cautionary Tales on Air-Quality Improvement in Beijing. Proceedings of the Royal Society A, 473.
data_body, gmbn_body, gmdbn_air, gmm_body
This dataset includes body composition data measured in 2148 adults aged 20 to 59 years in the United States. These data are taken from the National Health and Nutrition Examination Survey (NHANES) 2017-2018: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017 (Centers for Disease Control and Prevention, 2020).
data_body
A data frame (tibble) with 2148 rows and 8 columns:
ID: respondent identifier;
GENDER: gender (0: male, 1: female);
AGE: age (years);
HEIGHT: height (cm);
WEIGHT: weight (kg);
FAT: body fat (%);
WAIST: waist circumference (cm);
GLYCO: glycohemoglobin (%).
Centers for Disease Control and Prevention (2020). National Health and Nutrition Examination Survey Data.
data_air, gmbn_body, gmdbn_air, gmm_body
This function computes densities of a Gaussian mixture model.
density(gmm, data, y = NULL, log = FALSE)
gmm | An object of class gmm. |
data | A data frame or numeric matrix containing the observations whose densities are computed. Its columns must explicitly be named after the variables of gmm. |
y | A character vector containing the dependent variables if conditional densities are computed. If NULL (the default), joint densities are computed. |
log | A logical value indicating whether the densities are returned as log-densities. |
A numeric vector containing the (log-)densities.
data(gmm_body, data_body)
dens_1 <- density(gmm_body, data_body, log = TRUE)
dens_2 <- density(gmm_body, data_body, y = "WAIST", log = TRUE)
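Log-densities are convenient for ranking observations by how typical they are under the model; a small sketch, reusing dens_1 from the example above:

# Indexes of the 10 least likely observations under the joint model
head(order(dens_1), 10)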
This function displays the mixture components of a Gaussian mixture model. For each pair of variables, the covariance matrices are represented by confidence ellipses.
ellipses(gmm, data = NULL, y = rownames(gmm$mu), x = rownames(gmm$mu),
  level = 0.95)
gmm | An object of class gmm. |
data | A data frame or numeric matrix containing the data displayed with the mixture components. Its columns must explicitly be named after the variables of gmm. If NULL (the default), no data are displayed. |
y | A character vector containing the variables displayed on the y-axis (by default all the variables of gmm). |
x | A character vector containing the variables displayed on the x-axis (by default all the variables of gmm). |
level | A numeric value in [0, 1[ corresponding to the confidence level of the ellipses. |
A ggplot object displaying the mixture components (and the data if required).
set.seed(0)
data(gmm_body)
ellipses(gmm_body, sampling(gmm_body, n = 500))
This function estimates the parameters of a Gaussian mixture model using the expectation-maximization (EM) algorithm. Given an initial model, this algorithm iteratively updates the parameters, monotonically increasing the log-likelihood until convergence to a local maximum (Bilmes, 1998). A Bayesian regularization is applied by default to prevent a mixture component from collapsing onto a single data point, which would lead to a zero covariance matrix (Ormoneit and Tresp, 1996). Although the EM algorithm only applies to the joint model, good parameters can also be found for a derived conditional model. However, care should be taken in that case, as the monotonic increase of the conditional log-likelihood is not guaranteed.
em(gmm, data, regul = 0.01, epsilon = 1e-06, max_iter_em = 100,
  verbose = FALSE)
gmm | An initial object of class gmm. |
data | A data frame or numeric matrix containing the data used in the EM algorithm. Its columns must explicitly be named after the variables of gmm. |
regul | A positive numeric value corresponding to the regularization constant if a Bayesian regularization is applied. If NULL, no regularization is applied. |
epsilon | A positive numeric value corresponding to the convergence threshold for the increase in log-likelihood. |
max_iter_em | A non-negative integer corresponding to the maximum number of iterations. |
verbose | A logical value indicating whether iterations in progress are displayed. |
A list with elements:
gmm | The final gmm object. |
posterior | A numeric matrix containing the posterior probabilities for each observation. |
seq_loglik | A numeric vector containing the sequence of log-likelihoods measured initially and after each iteration. |
Bilmes, J. A. (1998). A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical report, International Computer Science Institute.
Ormoneit, D. and Tresp, V. (1996). Improved Gaussian Mixture Density Estimates Using Bayesian Penalty Terms and Network Averaging. In Advances in Neural Information Processing Systems 8, pages 542–548.
data(data_body)
gmm_1 <- split_comp(add_var(NULL,
                            data_body[, c("WAIST", "AGE", "FAT", "HEIGHT",
                                          "WEIGHT")]),
                    n_sub = 3)
res_em <- em(gmm_1, data_body, verbose = TRUE)
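The seq_loglik element makes it easy to check convergence; a minimal sketch, reusing res_em from the example above:

# The log-likelihood should increase monotonically across EM iterations
plot(res_em$seq_loglik, type = "b",
     xlab = "EM iteration", ylab = "Log-likelihood")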
This function computes expectations of a Gaussian mixture model.
expectation(gmm, data_x = NULL)
gmm | An object of class gmm. |
data_x | A data frame or numeric matrix containing observations of the explanatory variables if conditional expectations are computed. Its columns must explicitly be named after the explanatory variables. If NULL (the default), the joint expectation is computed. |
A numeric matrix containing the expectations.
data(gmm_body, data_body)
expect_1 <- expectation(gmm_body)
expect_2 <- expectation(gmm_body,
                        data_body[, c("WEIGHT", "FAT", "HEIGHT", "AGE")])
This function performs filtering inference in a Gaussian mixture dynamic Bayesian network. For a sequence of T time slices, this task consists in estimating the state of the system at each time slice t (for 1 <= t <= T) given all the data (the evidence) collected up to t. This function is also designed to perform fixed-lag smoothing inference, which consists in defining a time lag l such that at each time slice t (for l + 1 <= t <= T), the state at t - l is estimated given the evidence collected up to t (Murphy, 2002). Filtering and fixed-lag smoothing inference are performed by sequential importance resampling, which is a particle-based approximate method (Koller and Friedman, 2009).
filtering(gmdbn, evid, nodes = names(gmdbn$b_1), col_seq = NULL, lag = 0,
  n_part = 1000, max_part_sim = 1e+06, min_ess = 1, verbose = FALSE)
gmdbn | An object of class gmdbn. |
evid | A data frame containing the evidence. Its columns must explicitly be named after nodes of gmdbn and can contain missing values. |
nodes | A character vector containing the inferred nodes (by default all the nodes of gmdbn). |
col_seq | A character vector containing the column names of evid that describe the observation sequences. If NULL (the default), all the observations belong to a single sequence. |
lag | A non-negative integer vector containing the time lags for which fixed-lag smoothing inference is performed. If 0 (the default), filtering inference is performed. |
n_part | A positive integer corresponding to the number of particles generated for each observation sequence. |
max_part_sim | An integer greater than or equal to n_part corresponding to the maximum number of particles that can be processed simultaneously; to respect this limit, evid is divided into smaller subsets that are processed sequentially. |
min_ess | A numeric value in [0, 1] corresponding to the minimum ESS (expressed as a proportion of n_part) under which the renewal step of sequential importance resampling is performed. |
verbose | A logical value indicating whether subsets of evid in progress are displayed. |
If lag has one element, a data frame (tibble) with a structure similar to evid containing the estimated values of the inferred nodes and their observation sequences (if col_seq is not NULL). If lag has two or more elements, a list of data frames (tibbles) containing these values for each time lag.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
Murphy, K. (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California.
inference, prediction, smoothing
set.seed(0)
data(gmdbn_air, data_air)
evid <- data_air
evid$NO2[sample.int(7680, 1536)] <- NA
evid$O3[sample.int(7680, 1536)] <- NA
evid$TEMP[sample.int(7680, 1536)] <- NA
evid$WIND[sample.int(7680, 1536)] <- NA
filt <- filtering(gmdbn_air, evid, col_seq = "DATE", lag = c(0, 1),
                  verbose = TRUE)
This function creates a Gaussian mixture Bayesian network as an object of S3 class gmbn. A Bayesian network is a probabilistic graphical model that represents the conditional dependencies and independencies between random variables by a directed acyclic graph. It encodes a global joint distribution over the nodes, which decomposes into a product of local conditional distributions:

p(X_1, ..., X_n) = prod_{i = 1}^n p(X_i | Pa(X_i))

where Pa(X_i) is the set of parents of X_i in the graph. In a Gaussian mixture Bayesian network, each local joint distribution over a node and its parents is described by a Gaussian mixture model, which means that the global distribution is a product of local conditional Gaussian mixture models (Davies and Moore, 2000). The gmbn class can be extended to the time factor by regarding the nodes as the state of the system at a given time slice t (denoted X^(t)) and allowing them to have parents at previous time slices. This makes it possible to create a (k + 1)-slice temporal Bayesian network that encodes the transition distribution p(X^(t) | X^(t - 1), ..., X^(t - k)) (Hulst, 2006). Finally, note that a Gaussian mixture Bayesian network can be created with functions add_nodes (by passing NULL as argument gmgm) and add_arcs, which allows a gmbn object to be quickly initialized and then passed to a learning function.
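As an illustration of this initialization shortcut, the following minimal sketch (the node names are arbitrary) builds an unparameterized structure that could then be passed to a learning function:

# Initialize a gmbn object from scratch: each node starts as a
# one-component GMM with mean 0 and variance 1
gmbn_0 <- add_nodes(NULL, c("A", "B", "C"))
gmbn_0 <- add_arcs(gmbn_0, data.frame(from = c("A", "B"), to = c("B", "C")))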
gmbn(...)
... | Objects of class gmm describing the local joint distributions over the nodes and their parents. Each argument must be named after the node it describes, and each gmm object must contain a variable named after the node as well as variables named after its parents (a suffix .1, .2, ... indicating the time lag of a temporal parent, as in the second example below). |
A list of class gmbn containing the gmm objects passed as arguments.
Davies, S. and Moore, A. (2000). Mix-nets: Factored Mixtures of Gaussians in Bayesian Networks with Mixed Continuous And Discrete Variables. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pages 168–175, Stanford, CA, USA.
Hulst, J. (2006). Modeling physiological processes with dynamic Bayesian networks. Master's thesis, Delft University of Technology.
data(data_body)
gmbn_1 <- gmbn(
  AGE = split_comp(add_var(NULL, data_body[, "AGE"]), n_sub = 3),
  FAT = split_comp(add_var(NULL,
                           data_body[, c("FAT", "GENDER", "HEIGHT",
                                         "WEIGHT")]),
                   n_sub = 2),
  GENDER = split_comp(add_var(NULL, data_body[, "GENDER"]), n_sub = 2),
  GLYCO = split_comp(add_var(NULL, data_body[, c("GLYCO", "AGE", "WAIST")]),
                     n_sub = 2),
  HEIGHT = split_comp(add_var(NULL, data_body[, c("HEIGHT", "GENDER")])),
  WAIST = split_comp(add_var(NULL,
                             data_body[, c("WAIST", "AGE", "FAT", "HEIGHT",
                                           "WEIGHT")]),
                     n_sub = 3),
  WEIGHT = split_comp(add_var(NULL, data_body[, c("WEIGHT", "HEIGHT")]),
                      n_sub = 2)
)

library(dplyr)
data(data_air)
data <- data_air %>%
  group_by(DATE) %>%
  mutate(NO2.1 = lag(NO2), O3.1 = lag(O3), TEMP.1 = lag(TEMP),
         WIND.1 = lag(WIND)) %>%
  ungroup()
gmbn_2 <- gmbn(
  NO2 = split_comp(add_var(NULL, data[, c("NO2", "NO2.1", "WIND")]),
                   n_sub = 3),
  O3 = split_comp(add_var(NULL,
                          data[, c("O3", "NO2", "NO2.1", "O3.1", "TEMP",
                                   "TEMP.1")]),
                  n_sub = 3),
  TEMP = split_comp(add_var(NULL, data[, c("TEMP", "TEMP.1")]), n_sub = 3),
  WIND = split_comp(add_var(NULL, data[, c("WIND", "WIND.1")]), n_sub = 3)
)
This Gaussian mixture Bayesian network is learned from the NHANES body composition dataset, following the example provided in the documentation page of function struct_learn.
gmbn_body
A gmbn object.
data_air, data_body, gmdbn_air, gmm_body
This function creates a Gaussian mixture dynamic Bayesian network as an object of S3 class gmdbn. Assuming that the system evolves over time (possibly non-stationary) and denoting by X^(t) its state at time slice t, a dynamic Bayesian network is a probabilistic graphical model that encodes the joint distribution over any finite time sequence:

p(X^(1), ..., X^(T)) = p(X^(1)) prod_{t = 2}^T p(X^(t) | X^(t - 1), ..., X^(1))

It is defined by a sequence of transition models B_{t_1}, B_{t_2}, ..., B_{t_n} associated with transition time slices t_1 = 1 < t_2 < ... < t_n, where:

- B_{t_1} is a Bayesian network that encodes the distribution p(X^(t)) for 1 <= t <= t_2 - 1, assuming that the states at these time slices do not depend on previous states;
- for each i >= 2, B_{t_i} is a (k_i + 1)-slice temporal Bayesian network (where k_i < t_i) that encodes the transition distribution p(X^(t) | X^(t - 1), ..., X^(t - k_i)) for t_i <= t <= t_{i + 1} - 1 (or t >= t_i if i = n), assuming that the states at these time slices only depend on the k_i previous states (Hourbracq et al., 2017).

In a Gaussian mixture dynamic Bayesian network, these transition models are Gaussian mixture Bayesian networks (Roos et al., 2017).
gmdbn(...)
... | Objects of class gmbn corresponding to the transition models. Each argument must be named b_i, where i is the first time slice to which the transition model applies (e.g. b_1, b_2, b_13). |
A list of class gmdbn containing the gmbn objects passed as arguments.
Hourbracq, M., Wuillemin, P.-H., Gonzales, C. and Baumard, P. (2017). Learning and Selection of Dynamic Bayesian Networks for Non-Stationary Processes in Real Time. In Proceedings of the 30th International Flairs Conference, pages 742–747, Marco Island, FL, USA.
Roos, J., Bonnevay, S. and Gavin, G. (2017). Dynamic Bayesian Networks with Gaussian Mixture Models for Short-Term Passenger Flow Forecasting. In Proceedings of the 12th International Conference on Intelligent Systems and Knowledge Engineering, Nanjing, China.
library(dplyr)
data(data_air)
data <- data_air %>%
  group_by(DATE) %>%
  mutate(NO2.1 = lag(NO2), O3.1 = lag(O3), TEMP.1 = lag(TEMP),
         WIND.1 = lag(WIND)) %>%
  ungroup()
gmdbn_1 <- gmdbn(
  b_2 = gmbn(
    NO2 = split_comp(add_var(NULL, data[, c("NO2", "NO2.1", "WIND")]),
                     n_sub = 3),
    O3 = split_comp(add_var(NULL,
                            data[, c("O3", "NO2", "NO2.1", "O3.1", "TEMP",
                                     "TEMP.1")]),
                    n_sub = 3),
    TEMP = split_comp(add_var(NULL, data[, c("TEMP", "TEMP.1")]), n_sub = 3),
    WIND = split_comp(add_var(NULL, data[, c("WIND", "WIND.1")]), n_sub = 3)
  ),
  b_13 = gmbn(
    NO2 = split_comp(add_var(NULL, data[, c("NO2", "NO2.1", "WIND")]),
                     n_sub = 3),
    O3 = split_comp(add_var(NULL,
                            data[, c("O3", "O3.1", "TEMP", "TEMP.1",
                                     "WIND")]),
                    n_sub = 3),
    TEMP = split_comp(add_var(NULL, data[, c("TEMP", "TEMP.1")]), n_sub = 3),
    WIND = split_comp(add_var(NULL, data[, c("WIND", "WIND.1")]), n_sub = 3)
  )
)
This Gaussian mixture dynamic Bayesian network is learned from the Beijing air quality dataset, following the example provided in the documentation page of function struct_learn.
gmdbn_air
A gmdbn object.
data_air, data_body, gmbn_body, gmm_body
This function creates a Gaussian mixture model as an object of S3 class gmm. A Gaussian mixture model is a weighted sum of multivariate Gaussian distributions:

p(x) = sum_{i = 1}^M alpha_i N(x; mu_i, sigma_i)

where alpha_i is the ith mixture proportion such that alpha_i > 0 and sum_{i = 1}^M alpha_i = 1, mu_i the mean vector and sigma_i the covariance matrix of the ith mixture component (Bilmes, 1998). Since conditional distributions can be derived from joint distributions, the gmm class is also used to work with conditional Gaussian mixture models (see function conditional to make their parameters explicit). Finally, note that a one-component Gaussian mixture model can be created with function add_var (by passing NULL as argument gmm), which allows a gmm object to be quickly initialized and then passed to a learning function.
gmm(alpha, mu, sigma, var = rownames(mu))
alpha | A positive numeric vector containing the mixture proportions. If the sum of these proportions is not 1, a normalization is performed by dividing them by this sum. |
mu | A numeric matrix containing the mean vectors bound by column. |
sigma | A list containing the covariance matrices. |
var | A character vector containing the variable names (by default the row names of mu). |
A list of class gmm containing the elements alpha, mu and sigma passed as arguments (completed with the variable names passed as argument var).
Bilmes, J. A. (1998). A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. Technical report, International Computer Science Institute.
gmm_1 <- gmm(alpha = c(0.2, 0.5, 0.3),
             mu = matrix(c(109, 91, 44, 160, 41,
                           99, 87, 27, 173, 40,
                           86, 65, 35, 161, 40), nrow = 5),
             sigma = list(matrix(c(208, 240, 32, 17, -6,
                                   240, 378, 40, 55, -38,
                                   32, 40, 15, -2, 1,
                                   17, 55, -2, 47, -13,
                                   -6, -38, 1, -13, 127), nrow = 5),
                          matrix(c(242, 270, 82, 10, 49,
                                   270, 363, 83, 44, 19,
                                   82, 83, 38, -2, 15,
                                   10, 44, -2, 45, -7,
                                   49, 19, 15, -7, 137), nrow = 5),
                          matrix(c(109, 102, 41, 11, 29,
                                   102, 128, 34, 38, 10,
                                   41, 34, 36, -9, 16,
                                   11, 38, -9, 56, -5,
                                   29, 10, 16, -5, 138), nrow = 5)),
             var = c("WAIST", "WEIGHT", "FAT", "HEIGHT", "AGE"))
This Gaussian mixture model is learned from the NHANES body composition dataset, following the example provided in the documentation page of function stepwise.
gmm_body
A gmm object.
data_air, data_body, gmbn_body, gmdbn_air
This function performs inference in a (non-temporal) Gaussian mixture Bayesian network. This task consists in estimating the state of the system given partial observations of it (the evidence). Inference is performed by likelihood weighting, which is a particle-based approximate method (Koller and Friedman, 2009).
inference(gmbn, evid, nodes = names(gmbn), n_part = 1000,
  max_part_sim = 1e+06, verbose = FALSE)
gmbn | A (non-temporal) object of class gmbn. |
evid | A data frame containing the evidence. Its columns must explicitly be named after nodes of gmbn and can contain missing values. |
nodes | A character vector containing the inferred nodes (by default all the nodes of gmbn). |
n_part | A positive integer corresponding to the number of particles generated for each observation. |
max_part_sim | An integer greater than or equal to n_part corresponding to the maximum number of particles that can be processed simultaneously; to respect this limit, evid is divided into smaller subsets that are processed sequentially. |
verbose | A logical value indicating whether subsets of evid in progress are displayed. |
A data frame (tibble) with a structure similar to evid containing the estimated values of the inferred nodes.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
filtering, prediction, smoothing
set.seed(0)
data(gmbn_body, data_body)
evid <- data_body
evid$GENDER[sample.int(2148, 430)] <- NA
evid$AGE[sample.int(2148, 430)] <- NA
evid$HEIGHT[sample.int(2148, 430)] <- NA
evid$WEIGHT[sample.int(2148, 430)] <- NA
evid$FAT[sample.int(2148, 430)] <- NA
evid$WAIST[sample.int(2148, 430)] <- NA
evid$GLYCO[sample.int(2148, 430)] <- NA
infer <- inference(gmbn_body, evid, verbose = TRUE)
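Since the result has the same structure as evid, the estimates can be compared directly with the original data; a rough sketch (assuming the rows stay aligned with data_body):

# Root-mean-square error of the inferred values for two nodes
sqrt(colMeans((infer[, c("HEIGHT", "WEIGHT")] -
               data_body[, c("HEIGHT", "WEIGHT")])^2))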
This function computes the log-likelihood of a Gaussian mixture model or graphical model.
## S3 method for class 'gmm'
logLik(object, data, y = NULL, regul = 0.01, ...)

## S3 method for class 'gmbn'
logLik(object, data, col_seq = NULL, ...)

## S3 method for class 'gmdbn'
logLik(object, data, col_seq = NULL, ...)
object | An object of class gmm, gmbn or gmdbn. |
data | A data frame containing the data used to compute the log-likelihood. Its columns must explicitly be named after the variables (or nodes) of object. |
y | A character vector containing the dependent variables if a conditional log-likelihood is computed. If NULL (the default), the joint log-likelihood is computed. |
regul | A positive numeric value corresponding to the regularization constant if a penalty term is added for Bayesian regularization. If NULL, no penalty term is added. |
... | Unused arguments from the generic function. |
col_seq | A character vector containing the column names of data that describe the observation sequences. If NULL (the default), all the observations belong to a single sequence. |
If object is a gmm object, a numeric value corresponding to the log-likelihood.
If object is a gmbn or gmdbn object, a list with elements:
global | A numeric value corresponding to the global log-likelihood. |
local | For a gmbn object, a numeric vector containing the local conditional log-likelihoods. For a gmdbn object, a list of numeric vectors containing these values for each transition model. |
data(gmm_body, data_body)
loglik_1 <- logLik(gmm_body, data_body)
loglik_2 <- logLik(gmm_body, data_body, y = "WAIST")
data(gmbn_body, data_body)
loglik_3 <- logLik(gmbn_body, data_body)
data(gmdbn_air, data_air)
loglik_4 <- logLik(gmdbn_air, data_air, col_seq = "DATE")
This function merges mixture components of a Gaussian mixture model (Zhang et al., 2003).
merge_comp(gmm, comp = seq_along(gmm$alpha))
gmm | An object of class gmm. |
comp | An integer vector containing the indexes of the merged mixture components (by default all the components of gmm). |
The gmm object after merging the mixture components.
Zhang, Z., Chen, C., Sun, J. and Chan, K. L. (2003). EM algorithms for Gaussian mixtures with split-and-merge operation. Pattern Recognition, 36(9):1973–1983.
data(gmm_body)
gmm_1 <- merge_comp(gmm_body, c(1, 2))
This function displays the graphical structure of a Gaussian mixture Bayesian network.
network(gmbn)
gmbn | An object of class gmbn. |
A visNetwork object displaying the graphical structure.
data(gmbn_body)
network(gmbn_body)
data(gmdbn_air)
network(gmdbn_air$b_2)
This function learns the parameters of a Gaussian mixture graphical model with incomplete data using the parametric EM algorithm. At each iteration, inference (smoothing inference for a dynamic Bayesian network) is performed to complete the data given the current estimate of the parameters (E step). The completed data are then used to update the parameters (M step), and so on. Each iteration is guaranteed to increase the log-likelihood until convergence to a local maximum (Koller and Friedman, 2009). In practice, due to the sampling process inherent in particle-based inference, it may happen that the monotonic increase no longer occurs when approaching the local maximum, resulting in an earlier termination of the algorithm.
param_em(gmgm, data, nodes = structure(gmgm)$nodes, col_seq = NULL,
  n_part = 1000, max_part_sim = 1e+06, min_ess = 1, max_iter_pem = 5,
  verbose = FALSE, ...)
gmgm | An object of class gmbn or gmdbn. |
data | A data frame containing the data used for learning. Its columns must explicitly be named after nodes of gmgm and can contain missing values. |
nodes | A character vector containing the nodes whose local conditional models are learned (by default all the nodes of gmgm). |
col_seq | A character vector containing the column names of data that describe the observation sequences. If NULL (the default), all the observations belong to a single sequence. |
n_part | A positive integer corresponding to the number of particles generated for each observation (if gmgm is a gmbn object) or observation sequence (if gmgm is a gmdbn object) during inference. |
max_part_sim | An integer greater than or equal to n_part corresponding to the maximum number of particles that can be processed simultaneously; to respect this limit, data is divided into smaller subsets that are processed sequentially. |
min_ess | A numeric value in [0, 1] corresponding to the minimum ESS (expressed as a proportion of n_part) under which the renewal step of sequential importance resampling is performed (if gmgm is a gmdbn object). |
max_iter_pem | A non-negative integer corresponding to the maximum number of iterations. |
verbose | A logical value indicating whether iterations in progress are displayed. |
... | Additional arguments passed to function em. |
A list with elements:
gmgm | The final gmbn or gmdbn object. |
data | A data frame (tibble) containing the complete data used to learn the final gmbn or gmdbn object. |
seq_loglik | A numeric matrix containing the sequence of log-likelihoods measured after the E and M steps of each iteration. |
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
param_learn, struct_em, struct_learn
set.seed(0)
data(data_body)
data_1 <- data_body
data_1$GENDER[sample.int(2148, 430)] <- NA
data_1$AGE[sample.int(2148, 430)] <- NA
data_1$HEIGHT[sample.int(2148, 430)] <- NA
data_1$WEIGHT[sample.int(2148, 430)] <- NA
data_1$FAT[sample.int(2148, 430)] <- NA
data_1$WAIST[sample.int(2148, 430)] <- NA
data_1$GLYCO[sample.int(2148, 430)] <- NA
gmbn_1 <- gmbn(
  AGE = split_comp(add_var(NULL, data_1[, "AGE"]), n_sub = 3),
  FAT = split_comp(add_var(NULL,
                           data_1[, c("FAT", "GENDER", "HEIGHT", "WEIGHT")]),
                   n_sub = 2),
  GENDER = split_comp(add_var(NULL, data_1[, "GENDER"]), n_sub = 2),
  GLYCO = split_comp(add_var(NULL, data_1[, c("GLYCO", "AGE", "WAIST")]),
                     n_sub = 2),
  HEIGHT = split_comp(add_var(NULL, data_1[, c("HEIGHT", "GENDER")])),
  WAIST = split_comp(add_var(NULL,
                             data_1[, c("WAIST", "AGE", "FAT", "HEIGHT",
                                        "WEIGHT")]),
                     n_sub = 3),
  WEIGHT = split_comp(add_var(NULL, data_1[, c("WEIGHT", "HEIGHT")]),
                      n_sub = 2)
)
res_learn_1 <- param_em(gmbn_1, data_1, verbose = TRUE)

library(dplyr)
set.seed(0)
data(data_air)
data_2 <- data_air
data_2$NO2[sample.int(7680, 1536)] <- NA
data_2$O3[sample.int(7680, 1536)] <- NA
data_2$TEMP[sample.int(7680, 1536)] <- NA
data_2$WIND[sample.int(7680, 1536)] <- NA
data_3 <- data_2 %>%
  group_by(DATE) %>%
  mutate(NO2.1 = lag(NO2), O3.1 = lag(O3), TEMP.1 = lag(TEMP),
         WIND.1 = lag(WIND)) %>%
  ungroup()
gmdbn_1 <- gmdbn(
  b_2 = gmbn(
    NO2 = split_comp(add_var(NULL, data_3[, c("NO2", "NO2.1", "WIND")]),
                     n_sub = 3),
    O3 = split_comp(add_var(NULL,
                            data_3[, c("O3", "NO2", "NO2.1", "O3.1", "TEMP",
                                       "TEMP.1")]),
                    n_sub = 3),
    TEMP = split_comp(add_var(NULL, data_3[, c("TEMP", "TEMP.1")]),
                      n_sub = 3),
    WIND = split_comp(add_var(NULL, data_3[, c("WIND", "WIND.1")]),
                      n_sub = 3)
  ),
  b_13 = gmbn(
    NO2 = split_comp(add_var(NULL, data_3[, c("NO2", "NO2.1", "WIND")]),
                     n_sub = 3),
    O3 = split_comp(add_var(NULL,
                            data_3[, c("O3", "O3.1", "TEMP", "TEMP.1",
                                       "WIND")]),
                    n_sub = 3),
    TEMP = split_comp(add_var(NULL, data_3[, c("TEMP", "TEMP.1")]),
                      n_sub = 3),
    WIND = split_comp(add_var(NULL, data_3[, c("WIND", "WIND.1")]),
                      n_sub = 3)
  )
)
res_learn_2 <- param_em(gmdbn_1, data_2, col_seq = "DATE", verbose = TRUE)
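The seq_loglik element can be used to check that the iterations behaved as expected; for instance, reusing res_learn_1 from the example above:

# Log-likelihoods after the E and M steps of each parametric EM iteration
res_learn_1$seq_loglik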
This function learns the parameters of a Gaussian mixture graphical model. Using the local decomposability of the log-likelihood, this task consists in learning each local conditional model independently with the EM algorithm (Koller and Friedman, 2009).
param_learn(gmgm, data, nodes = structure(gmgm)$nodes, col_seq = NULL,
  verbose = FALSE, ...)
gmgm | An initial object of class gmbn or gmdbn. |
data | A data frame containing the data used for learning. Its columns must explicitly be named after the nodes of gmgm. |
nodes | A character vector containing the nodes whose local conditional models are learned (by default all the nodes of gmgm). |
col_seq | A character vector containing the column names of data that describe the observation sequences. If NULL (the default), all the observations belong to a single sequence. |
verbose | A logical value indicating whether learned nodes in progress are displayed. |
... | Additional arguments passed to function em. |
A list with elements:
gmgm | The final gmbn or gmdbn object. |
evol_loglik | A list with elements global and local, containing the global and local log-likelihoods before and after learning. |
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
param_em, struct_em, struct_learn
data(data_body)
gmbn_1 <- gmbn(
  AGE = split_comp(add_var(NULL, data_body[, "AGE"]), n_sub = 3),
  FAT = split_comp(add_var(NULL,
                           data_body[, c("FAT", "GENDER", "HEIGHT",
                                         "WEIGHT")]),
                   n_sub = 2),
  GENDER = split_comp(add_var(NULL, data_body[, "GENDER"]), n_sub = 2),
  GLYCO = split_comp(add_var(NULL, data_body[, c("GLYCO", "AGE", "WAIST")]),
                     n_sub = 2),
  HEIGHT = split_comp(add_var(NULL, data_body[, c("HEIGHT", "GENDER")])),
  WAIST = split_comp(add_var(NULL,
                             data_body[, c("WAIST", "AGE", "FAT", "HEIGHT",
                                           "WEIGHT")]),
                     n_sub = 3),
  WEIGHT = split_comp(add_var(NULL, data_body[, c("WEIGHT", "HEIGHT")]),
                      n_sub = 2)
)
res_learn_1 <- param_learn(gmbn_1, data_body, verbose = TRUE)

library(dplyr)
data(data_air)
data <- data_air %>%
  group_by(DATE) %>%
  mutate(NO2.1 = lag(NO2), O3.1 = lag(O3), TEMP.1 = lag(TEMP),
         WIND.1 = lag(WIND)) %>%
  ungroup()
gmdbn_1 <- gmdbn(
  b_2 = gmbn(
    NO2 = split_comp(add_var(NULL, data[, c("NO2", "NO2.1", "WIND")]),
                     n_sub = 3),
    O3 = split_comp(add_var(NULL,
                            data[, c("O3", "NO2", "NO2.1", "O3.1", "TEMP",
                                     "TEMP.1")]),
                    n_sub = 3),
    TEMP = split_comp(add_var(NULL, data[, c("TEMP", "TEMP.1")]), n_sub = 3),
    WIND = split_comp(add_var(NULL, data[, c("WIND", "WIND.1")]), n_sub = 3)
  ),
  b_13 = gmbn(
    NO2 = split_comp(add_var(NULL, data[, c("NO2", "NO2.1", "WIND")]),
                     n_sub = 3),
    O3 = split_comp(add_var(NULL,
                            data[, c("O3", "O3.1", "TEMP", "TEMP.1",
                                     "WIND")]),
                    n_sub = 3),
    TEMP = split_comp(add_var(NULL, data[, c("TEMP", "TEMP.1")]), n_sub = 3),
    WIND = split_comp(add_var(NULL, data[, c("WIND", "WIND.1")]), n_sub = 3)
  )
)
res_learn_2 <- param_learn(gmdbn_1, data_air, col_seq = "DATE",
                           verbose = TRUE)
This function initializes particles to perform (approximate) inference in a Gaussian mixture graphical model. Particles consist of weighted sample sequences that are propagated forward in time by sampling the model and aggregated to obtain the inferred values (Koller and Friedman, 2009).
particles(seq = NULL, col_weight = "weight", n_part = 1000)
seq | A data frame containing the observation sequences for which particles are initialized. If NULL (the default), a single sequence of particles is initialized. |
col_weight | A character string corresponding to the column name of the resulting data frame that describes the particle weight. |
n_part | A positive integer corresponding to the number of particles initialized for each observation sequence. |
A data frame (tibble) containing the initial particles.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
data(data_air)
part <- particles(data.frame(DATE = unique(data_air$DATE)))
This function performs predictive inference in a Gaussian mixture dynamic Bayesian network. For a sequence of T time slices, this task consists in defining a time horizon h such that at each time slice t (for 0 <= t <= T - h), the state of the system at t + h is estimated given all the data (the evidence) collected up to t. Although the states at t + 1, ..., t + h are observed in the future, some information about them can be known a priori (such as contextual information or features controlled by the user). This "predicted" evidence can be taken into account when propagating the particles from t to t + h in order to improve the predictions. Predictive inference is performed by sequential importance resampling, which is a particle-based approximate method (Koller and Friedman, 2009).
prediction(gmdbn, evid, evid_pred = NULL, nodes = names(gmdbn$b_1),
  col_seq = NULL, horizon = 1, n_part = 1000, max_part_sim = 1e+06,
  min_ess = 1, verbose = FALSE)
gmdbn | An object of class gmdbn. |
evid | A data frame containing the evidence. Its columns must explicitly be named after nodes of gmdbn and can contain missing values. |
evid_pred | A data frame containing the "predicted" evidence. Its columns must explicitly be named after nodes of gmdbn and can contain missing values. If NULL (the default), no predicted evidence is taken into account. |
nodes | A character vector containing the inferred nodes (by default all the nodes of gmdbn). |
col_seq | A character vector containing the column names of evid that describe the observation sequences. If NULL (the default), all the observations belong to a single sequence. |
horizon | A positive integer vector containing the time horizons for which predictive inference is performed. |
n_part | A positive integer corresponding to the number of particles generated for each observation sequence. |
max_part_sim | An integer greater than or equal to n_part corresponding to the maximum number of particles that can be processed simultaneously; to respect this limit, evid is divided into smaller subsets that are processed sequentially. |
min_ess | A numeric value in [0, 1] corresponding to the minimum ESS (expressed as a proportion of n_part) under which the renewal step of sequential importance resampling is performed. |
verbose | A logical value indicating whether subsets of evid in progress are displayed. |
If horizon has one element, a data frame (tibble) with a structure similar to evid containing the predicted values of the inferred nodes and their observation sequences (if col_seq is not NULL). If horizon has two or more elements, a list of data frames (tibbles) containing these values for each time horizon.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
filtering, inference, smoothing
set.seed(0)
data(gmdbn_air, data_air)
evid <- data_air
evid$NO2[sample.int(7680, 1536)] <- NA
evid$O3[sample.int(7680, 1536)] <- NA
pred <- prediction(gmdbn_air, evid, evid[, c("DATE", "TEMP", "WIND")],
                   nodes = c("NO2", "O3"), col_seq = "DATE",
                   horizon = c(1, 2), verbose = TRUE)
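With horizon = c(1, 2), the result is a list with one tibble per horizon; a minimal sketch, assuming positional access to that list:

# One-step-ahead and two-step-ahead predictions
pred_h1 <- pred[[1]]
pred_h2 <- pred[[2]]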
This function propagates particles forward in time. Assuming that the particles have been propagated to a given time slice t, the aim is to propagate them to a later time slice according to the Gaussian mixture graphical model and to the evidence collected over time. At first, a renewal step is performed if the effective sample size (ESS) is below a given threshold (Doucet and Johansen, 2009). This step consists in randomly selecting new particles among the old ones proportionately to their current weights. Upon receiving the data (the evidence) of time slice t + 1, each particle is used to generate samples for the unknown values. Its weight is then updated to the likelihood for the observed values. The higher this likelihood, the more likely the particle is to be selected at the next renewal step for propagation to t + 2, and so on (Koller and Friedman, 2009).
propagation(part, gmgm, evid = NULL, col_seq = NULL, col_weight = "weight",
  n_times = 1, min_ess = 1)
part | A data frame containing the particles propagated to time slice t, as returned by function particles or a previous call to propagation. |
gmgm | An object of class gmbn or gmdbn. |
evid | A data frame containing the evidence of time slices t + 1 to t + n_times. Its columns must explicitly be named after nodes of gmgm and can contain missing values. If NULL (the default), no evidence is taken into account. |
col_seq | A character vector containing the column names of part and evid that describe the observation sequences. If NULL (the default), all the observations belong to a single sequence. |
col_weight | A character string corresponding to the column name of part that describes the particle weight. |
n_times | A non-negative integer corresponding to the number of time slices over which the particles are propagated. |
min_ess | A numeric value in [0, 1] corresponding to the minimum ESS (expressed as a proportion of the number of particles) under which the renewal step is performed. If 1 (the default), the renewal step is performed at each time slice. |
A data frame (tibble) containing the particles supplemented with the samples of time slices t + 1 to t + n_times.
Doucet, A. and Johansen, A. M. (2009). A Tutorial on Particle Filtering and Smoothing: Fifteen years later. Handbook of nonlinear filtering, 12:656–704.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
library(dplyr)
set.seed(0)
data(gmdbn_air, data_air)
evid <- data_air %>%
  group_by(DATE) %>%
  slice(1:3) %>%
  ungroup()
evid$NO2[sample.int(150, 30)] <- NA
evid$O3[sample.int(150, 30)] <- NA
evid$TEMP[sample.int(150, 30)] <- NA
evid$WIND[sample.int(150, 30)] <- NA
part <- particles(data.frame(DATE = unique(evid$DATE))) %>%
  propagation(gmdbn_air, evid, col_seq = "DATE", n_times = 3)
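The particle weights (stored in the column named by col_weight, here the default "weight") can be inspected to gauge degeneracy before the next renewal step; a minimal sketch:

# Distribution of the current particle weights
summary(part$weight)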
This function extracts the minimal sub-Gaussian mixture graphical model required to infer a subset of nodes (i.e. the sub-model relevant to these nodes). The nodes that do not contribute to inference are removed, which includes those that are d-separated from the inferred ones by the nodes whose values are observed, as well as the barren nodes (Druzdzel and Suermondt, 1994).
relevant(gmgm, nodes, nodes_obs = NULL, nodes_miss = NULL)
gmgm | An object of class gmbn or gmdbn. |
nodes | A character vector containing the inferred nodes. |
nodes_obs | A character vector containing the nodes whose values are observed. |
nodes_miss | A character vector containing the nodes whose values are missing. Note that if a node is neither in nodes_obs nor in nodes_miss, its values are treated as possibly observed and possibly missing. |
The gmbn or gmdbn object relevant to the subset of nodes.
Druzdzel, M. J. and Suermondt, H. J. (1994). Relevance in Probabilistic Models: "Backyards" in a "Small World". In Working Notes of the AAAI 1994 Fall Symposium Series: Relevance, pages 60–63, New Orleans, LA, USA.
add_arcs, add_nodes, remove_arcs, remove_nodes, rename_nodes
data(gmbn_body)
gmbn_1 <- relevant(gmbn_body, "AGE",
                   nodes_obs = c("FAT", "HEIGHT", "WEIGHT"),
                   nodes_miss = "GLYCO")
data(gmdbn_air)
gmdbn_1 <- do.call("gmdbn", gmdbn_air[c("b_1", "b_2")])
gmdbn_2 <- relevant(gmdbn_1, "O3", nodes_obs = "NO2")
This function removes arcs from a Gaussian mixture graphical model.
remove_arcs(gmgm, arcs)
gmgm | An object of class gmbn or gmdbn. |
arcs | A data frame containing the removed arcs. The column from describes the start nodes and the column to the end nodes; an optional column lag gives the time lags of the start nodes (as in the examples below). |
The gmbn or gmdbn object after removing the arcs.
add_arcs, add_nodes, relevant, remove_nodes, rename_nodes
data(gmbn_body)
gmbn_1 <- remove_arcs(gmbn_body,
                      data.frame(from = c("HEIGHT", "AGE"),
                                 to = c("FAT", "WAIST")))
data(gmdbn_air)
gmdbn_1 <- remove_arcs(gmdbn_air,
                       list(b_2 = data.frame(from = c("NO2", "TEMP"),
                                             to = c("O3", "O3"),
                                             lag = c(1, 1)),
                            b_13 = data.frame(from = "TEMP", to = "O3",
                                              lag = 1)))
This function removes nodes from a Gaussian mixture graphical model. If this model is a dynamic Bayesian network, the nodes are removed from each of its transition models.
remove_nodes(gmgm, nodes)
gmgm | An object of class `gmbn` or `gmdbn`. |
nodes | A character vector containing the removed nodes. |
The `gmbn` or `gmdbn` object after removing the nodes.
add_arcs, add_nodes, relevant, remove_arcs, rename_nodes
data(gmbn_body)
gmbn_1 <- remove_nodes(gmbn_body, c("FAT", "GLYCO"))
data(gmdbn_air)
gmdbn_1 <- remove_nodes(gmdbn_air, "TEMP")
This function removes variables from a Gaussian mixture model.
remove_var(gmm, var)
gmm | An object of class `gmm`. |
var | A character vector containing the removed variables. |
The `gmm` object after removing the variables.
data(gmm_body)
gmm_1 <- remove_var(gmm_body, "FAT")
This function renames nodes of a Gaussian mixture graphical model. If this model is a dynamic Bayesian network, the nodes are renamed for each of its transition models.
rename_nodes(gmgm, nodes, names)
gmgm | An object of class `gmbn` or `gmdbn`. |
nodes | A character vector containing the renamed nodes. |
names | A character vector containing the respective new names of the nodes. |
The `gmbn` or `gmdbn` object after renaming the nodes.
add_arcs, add_nodes, relevant, remove_arcs, remove_nodes
data(gmbn_body)
gmbn_1 <- rename_nodes(gmbn_body, c("FAT", "GLYCO"),
                       c("BODY_FAT", "GLYCOHEMOGLOBIN"))
data(gmdbn_air)
gmdbn_1 <- rename_nodes(gmdbn_air, "TEMP", "TEMPERATURE")
This function renames variables of a Gaussian mixture model.
rename_var(gmm, var, names)
gmm | An object of class `gmm`. |
var | A character vector containing the renamed variables. |
names | A character vector containing the respective new names of the variables. |
The `gmm` object after renaming the variables.
data(gmm_body)
gmm_1 <- rename_var(gmm_body, "FAT", "BODY_FAT")
This function reorders the variables and the mixture components of a Gaussian mixture model.
reorder(gmm, var = NULL, comp = NULL)
gmm | An object of class `gmm`. |
var | A character vector containing the variables in the desired order. If variables are not specified, they are added after the ordered ones. If `NULL` (the default), the variable order is left unchanged. |
comp | An integer vector containing the indexes of the mixture components in the desired order. If components are not specified, they are added after the ordered ones. If `NULL` (the default), the component order is left unchanged. |
The reordered `gmm` object.
data(gmm_body)
gmm_1 <- reorder(gmm_body,
                 var = c("WAIST", "AGE", "FAT", "HEIGHT", "WEIGHT"),
                 comp = c(2, 1, 3))
This function samples a Gaussian mixture model.
sampling(gmm, data_x = NULL, n = 1)
gmm | An object of class `gmm`. |
data_x | A data frame or numeric matrix containing observations of the explanatory variables if conditional sampling is performed. Its columns must explicitly be named after the explanatory variables. If `NULL` (the default), unconditional sampling is performed. |
n | A non-negative integer corresponding to the number of samples. If conditional sampling is performed, this argument is ignored. |
A numeric matrix containing the samples.
set.seed(0)
data(gmm_body, data_body)
# unconditional sampling of 500 draws
sampl_1 <- sampling(gmm_body, n = 500)
# conditional sampling given observations of the explanatory variables
sampl_2 <- sampling(gmm_body,
                    data_body[, c("WEIGHT", "FAT", "HEIGHT", "AGE")])
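Since the returned samples form a plain numeric matrix, base R suffices for quick sanity checks; a minimal sketch reusing sampl_1 from the example above:

colMeans(sampl_1)      # empirical mean of each variable of gmm_body
apply(sampl_1, 2, sd)  # empirical standard deviations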
This function selects the number of mixture components and estimates the parameters of a Gaussian mixture model using a split-and-merge EM (SMEM) algorithm. At the first iteration, the classic EM algorithm is performed to update the parameters of the initial model. Then each following iteration consists in splitting a component into two or merging two components, before re-estimating the parameters with the EM algorithm. The selected split or merge operation is the one that maximizes a scoring function (after the re-estimation process). To avoid testing all possible operations, the split and merge candidates are initially ranked according to relevant criteria (Zhang et al., 2003). At first, the top-ranked split and top-ranked merge operations are tested. If neither of them increases the score, the second-ranked ones are considered, and so on. The SMEM algorithm stops if a given maximum rank is reached without improving the score.
smem(
  gmm,
  data,
  y = NULL,
  score = "bic",
  split = TRUE,
  merge = TRUE,
  min_comp = 1,
  max_comp = Inf,
  space = 0.5,
  max_rank = 1,
  max_iter_smem = 10,
  verbose = FALSE,
  ...
)
gmm | An initial object of class `gmm`. |
data | A data frame or numeric matrix containing the data used in the SMEM algorithm. Its columns must explicitly be named after the variables of `gmm`. |
y | A character vector containing the dependent variables if a conditional model is estimated (which involves maximizing a conditional score). If `NULL` (the default), a joint model is estimated. |
score | A character string (`"aic"`, `"bic"` or `"loglik"`) corresponding to the scoring function. |
split | A logical value indicating whether split operations are allowed (if `FALSE`, no split operation is performed). |
merge | A logical value indicating whether merge operations are allowed (if `FALSE`, no merge operation is performed). |
min_comp | A positive integer corresponding to the minimum number of mixture components. |
max_comp | A positive integer corresponding to the maximum number of mixture components. |
space | A numeric value in [0, 1[ corresponding to the space between two subcomponents resulting from a split. |
max_rank | A positive integer corresponding to the maximum rank for testing the split and merge candidates. |
max_iter_smem | A non-negative integer corresponding to the maximum number of iterations. |
verbose | A logical value indicating whether iterations in progress are displayed. |
... | Additional arguments passed to function `em`. |
A list with elements:
gmm | The final `gmm` object. |
posterior | A numeric matrix containing the posterior probabilities for each observation. |
seq_score | A numeric vector containing the sequence of scores measured initially and after each iteration. |
seq_oper | A character vector containing the sequence of split and merge operations performed at each iteration. |
Zhang, Z., Chen, C., Sun, J. and Chan, K. L. (2003). EM algorithms for Gaussian mixtures with split-and-merge operation. Pattern Recognition, 36(9):1973–1983.
data(data_body)
# build an initial model over the five variables
gmm_1 <- add_var(NULL, c("WAIST", "AGE", "FAT", "HEIGHT", "WEIGHT"))
res_smem <- smem(gmm_1, data_body, max_comp = 3, verbose = TRUE)
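The returned sequences make the search path easy to inspect; a minimal sketch reusing res_smem from the example above:

res_smem$seq_oper      # split or merge operation chosen at each iteration
res_smem$seq_score     # score measured initially and after each iteration
summary(res_smem$gmm)  # size of the selected model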
This function performs smoothing inference in a Gaussian mixture dynamic Bayesian network. For a sequence of T time slices, this task consists in estimating the state of the system at each time slice t (for 1 ≤ t ≤ T) given all the data (the evidence) collected up to T. Smoothing inference is performed by sequential importance resampling, which is a particle-based approximate method (Koller and Friedman, 2009).
smoothing(
  gmdbn,
  evid,
  nodes = names(gmdbn$b_1),
  col_seq = NULL,
  n_part = 1000,
  max_part_sim = 1e+06,
  min_ess = 1,
  verbose = FALSE
)
gmdbn | An object of class `gmdbn`. |
evid | A data frame containing the evidence. Its columns must explicitly be named after nodes of `gmdbn` and can contain missing values (as `NA`). |
nodes | A character vector containing the inferred nodes (by default all the nodes of `gmdbn`). |
col_seq | A character vector containing the column names of `evid` that describe the observation sequence. If `NULL` (the default), all the observations belong to a single sequence. |
n_part | A positive integer corresponding to the number of particles generated for each observation sequence. |
max_part_sim | An integer greater than or equal to `n_part` corresponding to the maximum number of particles simulated simultaneously. |
min_ess | A numeric value in [0, 1] corresponding to the minimum ESS (expressed as a proportion of `n_part`) under which the renewal step is performed. If 1 (the default), the renewal step is performed at each time slice. |
verbose | A logical value indicating whether subsets of `evid` in progress are displayed. |
A data frame (tibble) with a structure similar to `evid` containing the estimated values of the inferred nodes and their observation sequences (if `col_seq` is not `NULL`).
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
filtering, inference, prediction
set.seed(0)
data(gmdbn_air, data_air)
# mask 20% of the values of each node to create incomplete evidence
evid <- data_air
evid$NO2[sample.int(7680, 1536)] <- NA
evid$O3[sample.int(7680, 1536)] <- NA
evid$TEMP[sample.int(7680, 1536)] <- NA
evid$WIND[sample.int(7680, 1536)] <- NA
smooth <- smoothing(gmdbn_air, evid, col_seq = "DATE", verbose = TRUE)
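Because the evidence above was obtained by masking values of data_air, the smoothed estimates can be compared with the original values. A minimal sketch, assuming the rows of the output align with those of data_air:

masked <- is.na(evid$O3)                            # entries where O3 was masked
mean(abs(smooth$O3[masked] - data_air$O3[masked]))  # mean absolute error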
This function splits a mixture component of a Gaussian mixture model using the singular value decomposition of the covariance matrix (Zhang et al., 2003).
split_comp(gmm, comp = 1, n_sub = 2, space = 0.5)
gmm | An object of class `gmm`. |
comp | An integer corresponding to the index of the split mixture component. |
n_sub | A positive integer corresponding to the number of subcomponents. |
space | A numeric value in [0, 1[ corresponding to the space between the subcomponents. |
The `gmm` object after splitting the mixture component.
Zhang, Z., Chen, C., Sun, J. and Chan, K. L. (2003). EM algorithms for Gaussian mixtures with split-and-merge operation. Pattern Recognition, 36(9):1973–1983.
data(gmm_body)
gmm_1 <- split_comp(gmm_body, n_sub = 3)
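A minimal sketch reusing the objects above: the summary method shows that the split component has been replaced by three subcomponents:

summary(gmm_body)  # number of components before the split
summary(gmm_1)     # two additional components after the split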
This function selects the explanatory variables and the number of mixture components, and estimates the parameters of a conditional Gaussian mixture model using a stepwise algorithm. At the first iteration, the SMEM algorithm is performed to update the number of components and the parameters of the initial model. Each following iteration then consists in adding or removing a candidate explanatory variable, before re-estimating the model with the SMEM algorithm. The selected add or remove operation is the one that maximizes a conditional scoring function (after the re-estimation process). The stepwise algorithm stops when none of the candidate operations improves the score.
stepwise(
  gmm,
  data,
  y = rownames(gmm$mu)[1],
  x_cand = setdiff(colnames(data), y),
  score = "bic",
  add = TRUE,
  remove = TRUE,
  min_x = 0,
  max_x = Inf,
  max_iter_step = 10,
  verbose = FALSE,
  ...
)
gmm | An initial object of class `gmm`. |
data | A data frame or numeric matrix containing the data used in the stepwise algorithm. Its columns must explicitly be named after the variables of `gmm` and the candidate explanatory variables. |
y | A character vector containing the dependent variables (by default the first variable of `gmm`). |
x_cand | A character vector containing the candidate explanatory variables for addition or removal (by default all the column names of `data` except the dependent variables). |
score | A character string (`"aic"`, `"bic"` or `"loglik"`) corresponding to the scoring function. |
add | A logical value indicating whether add operations are allowed (if `FALSE`, no add operation is performed). |
remove | A logical value indicating whether remove operations are allowed (if `FALSE`, no remove operation is performed). |
min_x | A non-negative integer corresponding to the minimum number of explanatory variables. |
max_x | A non-negative integer corresponding to the maximum number of explanatory variables. |
max_iter_step | A non-negative integer corresponding to the maximum number of iterations. |
verbose | A logical value indicating whether iterations in progress are displayed. |
... | Additional arguments passed to function `smem`. |
A list with elements:
gmm | The final `gmm` object. |
posterior | A numeric matrix containing the posterior probabilities for each observation. |
seq_score | A numeric vector containing the sequence of scores measured initially and after each iteration. |
seq_oper | A character vector containing the sequence of add and remove operations performed at each iteration. |
data(data_body)
gmm_1 <- add_var(NULL, "WAIST")
# select explanatory variables for WAIST; max_comp is passed on to smem
res_step <- stepwise(gmm_1, data_body, verbose = TRUE, max_comp = 3)
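As with smem, the returned sequences describe the selection path; a minimal sketch reusing res_step from the example above:

res_step$seq_oper          # add or remove operation chosen at each iteration
res_step$seq_score         # conditional score after each iteration
rownames(res_step$gmm$mu)  # variables of the final conditional model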
This function learns the structure and the parameters of a Gaussian mixture graphical model with incomplete data using the structural EM algorithm. At each iteration, the parametric EM algorithm is performed to complete the data and update the parameters (E step). The completed data are then used to update the structure (M step), and so on. Each iteration is guaranteed to increase the scoring function until convergence to a local maximum (Koller and Friedman, 2009). In practice, due to the sampling process inherent in particle-based inference, it may happen that the monotonic increase no longer occurs when approaching the local maximum, resulting in an earlier termination of the algorithm.
struct_em(
  gmgm,
  data,
  nodes = structure(gmgm)$nodes,
  arcs_cand = tibble(lag = 0),
  col_seq = NULL,
  score = "bic",
  n_part = 1000,
  max_part_sim = 1e+06,
  min_ess = 1,
  max_iter_sem = 5,
  max_iter_pem = 5,
  verbose = FALSE,
  ...
)
gmgm | An object of class `gmbn` or `gmdbn`. |
data | A data frame containing the data used for learning. Its columns must explicitly be named after nodes of `gmgm` and can contain missing values (as `NA`). |
nodes | A character vector containing the nodes whose local conditional models are learned (by default all the nodes of `gmgm`). |
arcs_cand | A data frame containing the candidate arcs for addition or removal (by default all possible non-temporal arcs). The column `from` describes the start node, the column `to` the end node and the column `lag` the time lag between them. Missing values in `from` or `to` are interpreted as "all the possible nodes", which allows several candidate arcs to be defined in a single row; missing values in `lag` are interpreted as 0. |
col_seq | A character vector containing the column names of `data` that describe the observation sequence. If `NULL` (the default), all the observations belong to a single sequence. |
score | A character string (`"aic"`, `"bic"` or `"loglik"`) corresponding to the scoring function. |
n_part | A positive integer corresponding to the number of particles generated for each observation (if `gmgm` is a `gmbn` object) or observation sequence (if `gmgm` is a `gmdbn` object). |
max_part_sim | An integer greater than or equal to `n_part` corresponding to the maximum number of particles simulated simultaneously. |
min_ess | A numeric value in [0, 1] corresponding to the minimum ESS (expressed as a proportion of `n_part`) under which the renewal step is performed. If 1 (the default), the renewal step is performed at each time slice. |
max_iter_sem | A non-negative integer corresponding to the maximum number of iterations of the structural EM algorithm. |
max_iter_pem | A non-negative integer corresponding to the maximum number of iterations of the parametric EM algorithm. |
verbose | A logical value indicating whether iterations in progress are displayed. |
... | Additional arguments passed to function `stepwise`. |
A list with elements:
gmgm | The final `gmbn` or `gmdbn` object. |
data | A data frame (tibble) containing the complete data used to learn the final `gmbn` or `gmdbn` object. |
seq_score | A numeric matrix containing the sequence of scores measured after the E and M steps of each iteration. |
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
param_em, param_learn, struct_learn
set.seed(0)
data(data_body)
# mask 20% of the values of each variable to create incomplete data
data_1 <- data_body
data_1$GENDER[sample.int(2148, 430)] <- NA
data_1$AGE[sample.int(2148, 430)] <- NA
data_1$HEIGHT[sample.int(2148, 430)] <- NA
data_1$WEIGHT[sample.int(2148, 430)] <- NA
data_1$FAT[sample.int(2148, 430)] <- NA
data_1$WAIST[sample.int(2148, 430)] <- NA
data_1$GLYCO[sample.int(2148, 430)] <- NA
gmbn_1 <- add_nodes(NULL,
                    c("AGE", "FAT", "GENDER", "GLYCO", "HEIGHT", "WAIST",
                      "WEIGHT"))
arcs_cand_1 <- data.frame(from = c("AGE", "GENDER", "HEIGHT", "WEIGHT", NA,
                                   "AGE", "GENDER", "AGE", "FAT", "GENDER",
                                   "HEIGHT", "WEIGHT", "AGE", "GENDER",
                                   "HEIGHT"),
                          to = c("FAT", "FAT", "FAT", "FAT", "GLYCO",
                                 "HEIGHT", "HEIGHT", "WAIST", "WAIST",
                                 "WAIST", "WAIST", "WAIST", "WEIGHT",
                                 "WEIGHT", "WEIGHT"))
res_learn_1 <- struct_em(gmbn_1, data_1, arcs_cand = arcs_cand_1,
                         verbose = TRUE, max_comp = 3)

set.seed(0)
data(data_air)
data_2 <- data_air
data_2$NO2[sample.int(7680, 1536)] <- NA
data_2$O3[sample.int(7680, 1536)] <- NA
data_2$TEMP[sample.int(7680, 1536)] <- NA
data_2$WIND[sample.int(7680, 1536)] <- NA
gmdbn_1 <- gmdbn(b_2 = add_nodes(NULL, c("NO2", "O3", "TEMP", "WIND")),
                 b_13 = add_nodes(NULL, c("NO2", "O3", "TEMP", "WIND")))
arcs_cand_2 <- data.frame(from = c("NO2", "NO2", "NO2", "O3", "TEMP",
                                   "TEMP", "WIND", "WIND"),
                          to = c("NO2", "O3", "O3", "O3", NA, NA, NA, NA),
                          lag = c(1, 0, 1, 1, 0, 1, 0, 1))
res_learn_2 <- struct_em(gmdbn_1, data_2, arcs_cand = arcs_cand_2,
                         col_seq = "DATE", verbose = TRUE, max_comp = 3)
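The learned graphical structures can then be extracted with structure (documented below); a minimal sketch reusing the results above:

structure(res_learn_1$gmgm)$arcs  # arcs retained in the Bayesian network
structure(res_learn_2$gmgm)$arcs  # arcs of each transition model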
This function learns the (graphical and mixture) structure and the parameters of a Gaussian mixture graphical model. Using the local decomposability of the scoring function, this task consists in learning each local conditional model independently with the stepwise algorithm (Koller and Friedman, 2009). Note that some candidate arcs may be discarded to prevent the global graphical structure from containing cycles. To limit recourse to this action, the learning process is performed sequentially: the more the arcs of a local model are likely to be part of cycles (considering the worst case where all the candidate arcs are selected), the later this local model is processed. By gradually taking into account the local structures learned over time, the number of possible cycles decreases and, with it, the number of candidate arcs to discard.
struct_learn(
  gmgm,
  data,
  nodes = structure(gmgm)$nodes,
  arcs_cand = tibble(lag = 0),
  col_seq = NULL,
  score = "bic",
  verbose = FALSE,
  ...
)
gmgm | An initial object of class `gmbn` or `gmdbn`. |
data | A data frame containing the data used for learning. Its columns must explicitly be named after the nodes of `gmgm`. |
nodes | A character vector containing the nodes whose local conditional models are learned (by default all the nodes of `gmgm`). |
arcs_cand | A data frame containing the candidate arcs for addition or removal (by default all possible non-temporal arcs). The column `from` describes the start node, the column `to` the end node and the column `lag` the time lag between them. Missing values in `from` or `to` are interpreted as "all the possible nodes", which allows several candidate arcs to be defined in a single row; missing values in `lag` are interpreted as 0. |
col_seq | A character vector containing the column names of `data` that describe the observation sequence. If `NULL` (the default), all the observations belong to a single sequence. |
score | A character string (`"aic"`, `"bic"` or `"loglik"`) corresponding to the scoring function. |
verbose | A logical value indicating whether learned nodes in progress are displayed. |
... | Additional arguments passed to function `stepwise`. |
A list with elements:
gmgm | The final `gmbn` or `gmdbn` object. |
evol_score | A list with elements `global` and `local` containing the global and local scores of the model before and after learning. |
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. The MIT Press.
param_em, param_learn, struct_em
data(data_body)
gmbn_1 <- add_nodes(NULL,
                    c("AGE", "FAT", "GENDER", "GLYCO", "HEIGHT", "WAIST",
                      "WEIGHT"))
arcs_cand_1 <- data.frame(from = c("AGE", "GENDER", "HEIGHT", "WEIGHT", NA,
                                   "AGE", "GENDER", "AGE", "FAT", "GENDER",
                                   "HEIGHT", "WEIGHT", "AGE", "GENDER",
                                   "HEIGHT"),
                          to = c("FAT", "FAT", "FAT", "FAT", "GLYCO",
                                 "HEIGHT", "HEIGHT", "WAIST", "WAIST",
                                 "WAIST", "WAIST", "WAIST", "WEIGHT",
                                 "WEIGHT", "WEIGHT"))
res_learn_1 <- struct_learn(gmbn_1, data_body, arcs_cand = arcs_cand_1,
                            verbose = TRUE, max_comp = 3)

data(data_air)
gmdbn_1 <- gmdbn(b_2 = add_nodes(NULL, c("NO2", "O3", "TEMP", "WIND")),
                 b_13 = add_nodes(NULL, c("NO2", "O3", "TEMP", "WIND")))
arcs_cand_2 <- data.frame(from = c("NO2", "NO2", "NO2", "O3", "TEMP",
                                   "TEMP", "WIND", "WIND"),
                          to = c("NO2", "O3", "O3", "O3", NA, NA, NA, NA),
                          lag = c(1, 0, 1, 1, 0, 1, 0, 1))
res_learn_2 <- struct_learn(gmdbn_1, data_air, arcs_cand = arcs_cand_2,
                            col_seq = "DATE", verbose = TRUE, max_comp = 3)
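A minimal sketch reusing res_learn_1 from the example above to inspect the outcome of learning:

res_learn_1$evol_score            # scores before and after learning
summary(res_learn_1$gmgm)$global  # global size of the learned network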
This function provides the graphical structure of a Gaussian mixture graphical model.
structure(gmgm)
gmgm | An object of class `gmbn` or `gmdbn`. |
A list with elements:
nodes | A character vector containing the nodes. |
arcs | For a `gmbn` object, a data frame (tibble) containing the arcs. For a `gmdbn` object, a list of data frames (tibbles) containing the arcs of each transition model. |
data(gmbn_body)
struct_1 <- structure(gmbn_body)
data(gmdbn_air)
struct_2 <- structure(gmdbn_air)
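A minimal sketch showing how the returned elements can be inspected (assuming, as elsewhere in this manual, that gmdbn_air contains a transition model b_2):

struct_1$nodes     # nodes of gmbn_body
struct_1$arcs      # arcs of gmbn_body, one row per arc
struct_2$arcs$b_2  # arcs of the transition model b_2 of gmdbn_air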
This function summarizes a Gaussian mixture model or graphical model.
## S3 method for class 'gmm'
summary(object, ...)

## S3 method for class 'gmbn'
summary(object, ...)

## S3 method for class 'gmdbn'
summary(object, ...)
object | An object of class `gmm`, `gmbn` or `gmdbn`. |
... | Unused arguments from the generic function. |
If `object` is a `gmm` object, an integer vector containing the number of variables, mixture components and free parameters.
If `object` is a `gmbn` or `gmdbn` object, a list with elements:
global | An integer vector containing the global number of nodes, arcs, mixture components and free parameters (for a `gmdbn` object, these numbers are aggregated over its transition models). |
local | For a `gmbn` object, an integer matrix containing the local numbers of arcs, mixture components and free parameters of each node. For a `gmdbn` object, a list of such matrices, one per transition model. |
data(gmm_body)
summ_1 <- summary(gmm_body)
data(gmbn_body)
summ_2 <- summary(gmbn_body)
data(gmdbn_air)
summ_3 <- summary(gmdbn_air)