Title: | Improving MrP with Ensemble Learning |
---|---|
Description: | A tool that improves the prediction performance of multilevel regression with post-stratification (MrP) by combining a number of machine learning methods. For information on the method, please refer to Broniecki, Wüest, Leemann (2020) ''Improving Multilevel Regression with Post-Stratification Through Machine Learning (autoMrP)'' in the 'Journal of Politics'. Final pre-print version: <https://lucasleemann.files.wordpress.com/2020/07/automrp-r2pa.pdf>. |
Authors: | Reto Wüest [aut] |
Maintainer: | Philipp Broniecki <[email protected]> |
License: | GPL-3 |
Version: | 1.1.0 |
Built: | 2025-01-29 04:24:38 UTC |
Source: | https://github.com/retowuest/automrp |
The census file is generated from the full 2008 Cooperative Congressional Election Studies item cc419_1 by disaggregating the 64 ideal type combinations of the individual-level variables L1x1, L1x2 and L1x3. A row is an ideal type in a given state.
data(absentee_census)
A data frame with 2934 rows and 13 variables:
U.S. state
U.S. state id
U.S. region (four categories: 1 = Northeast; 2 = Midwest; 3 = South; 4 = West)
Age group (four categories)
Education level (four categories)
Gender-race combination (six categories)
State-level proportion of respondents of that ideal type in the population
State-level share of votes for the Republican candidate in the previous presidential election
State-level percentage of Evangelical Protestant or Mormon respondents
State-level percentage of the population living in urban areas
State-level unemployment rate
State-level share of Hispanics
State-level share of Whites
The data set (excluding L2.x3, L2.x4, L2.x5, L2.x6) is taken from the article: Buttice, Matthew K, and Benjamin Highton. 2013. "How does multilevel regression and poststratification perform with conventional national surveys?" Political Analysis 21(4): 449-467. L2.x3, L2.x4, L2.x5 and L2.x6 are available at https://www.census.gov.
The Cooperative Congressional Election Studies (CCES) item (cc419_1) asked: "States have tried many new ways to run elections in recent years. Do you support or oppose any of the following ways of voting or conducting elections in your state? Election Reform - Allow absentee voting over the Internet?" The original 2008 CCES item contains 26,934 respondents. This sample mimics a typical national survey. It contains at least 5 respondents from each state but is otherwise a random sample.
data(absentee_voting)
A data frame with 1500 rows and 13 variables:
1 if individual supports allowing absentee voting over the Internet; 0 otherwise
Age group (four categories: 1 = 18-29; 2 = 30-44; 3 = 45-64; 4 = 65+)
Education level (four categories: 1 = < high school; 2 = high school graduate; 3 = some college; 4 = college graduate)
Gender-race combination (six categories: 1 = white male; 2 = black male; 3 = hispanic male; 4 = white female; 5 = black female; 6 = hispanic female)
U.S. state
U.S. state id
U.S. region (four categories: 1 = Northeast; 2 = Midwest; 3 = South; 4 = West)
State-level share of votes for the Republican candidate in the previous presidential election
State-level percentage of Evangelical Protestant or Mormon respondents
State-level percentage of the population living in urban areas
State-level unemployment rate
State-level share of Hispanics
State-level share of Whites
The data set (excluding L2.x3, L2.x4, L2.x5, L2.x6) is taken from the article: Buttice, Matthew K, and Benjamin Highton. 2013. "How does multilevel regression and poststratification perform with conventional national surveys?" Political Analysis 21(4): 449-467. It is a random sample with at least 5 respondents per state. L2.x3, L2.x4, L2.x5 and L2.x6 are available at https://www.census.gov.
This package improves the prediction performance of multilevel regression with post-stratification (MrP) by combining a number of machine learning methods through ensemble Bayesian model averaging (EBMA).
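As a rough illustration of the post-stratification step that MrP relies on, the base-R sketch below weights ideal-type predictions by their population proportions within each state. The data frame `cells`, its `pred` column, and all numbers are invented for illustration. Only the idea of weighting by a `proportion` column comes from the data descriptions above, and this is not the package's internal code.

```r
# Toy post-stratification frame: one row per ideal type within a state.
# All values are made up; `pred` stands in for a model's predicted probability.
cells <- data.frame(
  state      = c("A", "A", "A", "B", "B"),
  proportion = c(0.5, 0.3, 0.2, 0.6, 0.4),  # shares sum to 1 within each state
  pred       = c(0.40, 0.55, 0.70, 0.35, 0.60)
)

# Post-stratify: weight each ideal-type prediction by its population share
# and sum the weighted predictions within each state.
weighted  <- cells$proportion * cells$pred
state_est <- tapply(weighted, cells$state, sum)

print(round(state_est, 3))
```

The result is one estimate per state, which is the kind of context-level prediction the package reports.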
auto_MrP(
  y, L1.x, L2.x, L2.unit,
  L2.reg = NULL, L2.x.scale = TRUE, pcs = NULL, folds = NULL,
  bin.proportion = NULL, bin.size = NULL,
  survey, census,
  ebma.size = 1/3, cores = 1, k.folds = 5,
  cv.sampling = "L2 units",
  loss.unit = c("individuals", "L2 units"),
  loss.fun = c("msfe", "cross-entropy", "f1", "MSE"),
  best.subset = TRUE, lasso = TRUE, pca = TRUE, gb = TRUE, svm = TRUE,
  mrp = FALSE, deep.mrp = FALSE, oversampling = FALSE,
  best.subset.L2.x = NULL, lasso.L2.x = NULL, pca.L2.x = NULL,
  gb.L2.x = NULL, svm.L2.x = NULL, mrp.L2.x = NULL,
  gb.L2.unit = TRUE, gb.L2.reg = FALSE,
  svm.L2.unit = TRUE, svm.L2.reg = FALSE,
  deep.splines = TRUE,
  lasso.lambda = NULL, lasso.n.iter = 100,
  gb.interaction.depth = c(1, 2, 3),
  gb.shrinkage = c(0.04, 0.01, 0.008, 0.005, 0.001),
  gb.n.trees.init = 50, gb.n.trees.increase = 50, gb.n.trees.max = 1000,
  gb.n.minobsinnode = 20,
  svm.kernel = c("radial"), svm.gamma = NULL, svm.cost = NULL,
  ebma.n.draws = 100,
  ebma.tol = c(0.01, 0.005, 0.001, 5e-04, 1e-04, 5e-05, 1e-05),
  verbose = FALSE, uncertainty = FALSE, boot.iter = NULL
)
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
L2.x.scale |
Scale context-level covariates. A logical argument
indicating whether the context-level covariates should be normalized.
Default is |
pcs |
Principal components. A character vector containing the column
names of the principal components of the context-level variables in
|
folds |
EBMA and cross-validation folds. A character scalar containing
the column name of the variable in |
bin.proportion |
Proportion of ideal types. A character scalar
containing the column name of the variable in |
bin.size |
Bin size of ideal types. A character scalar containing the
column name of the variable in |
survey |
Survey data. A |
census |
Census data. A |
ebma.size |
EBMA fold size. A number in the open unit interval
indicating the proportion of respondents to be allocated to the EBMA fold.
Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
k.folds |
Number of cross-validation folds. An integer-valued scalar
indicating the number of folds to be used in cross-validation. Default is
|
cv.sampling |
Cross-validation sampling method. A character-valued
scalar indicating whether cross-validation folds should be created by
sampling individual respondents ( |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
best.subset |
Best subset classifier. A logical argument indicating
whether the best subset classifier should be used for predicting outcome
|
lasso |
Lasso classifier. A logical argument indicating whether the
lasso classifier should be used for predicting outcome |
pca |
PCA classifier. A logical argument indicating whether the PCA
classifier should be used for predicting outcome |
gb |
GB classifier. A logical argument indicating whether the GB
classifier should be used for predicting outcome |
svm |
SVM classifier. A logical argument indicating whether the SVM
classifier should be used for predicting outcome |
mrp |
MRP classifier. A logical argument indicating whether the standard
MRP classifier should be used for predicting outcome |
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for best subset prediction. Setting
|
oversampling |
Oversample to create balance on the dependent variable.
A logical argument. Default is |
best.subset.L2.x |
Best subset context-level covariates. A character
vector containing the column names of the context-level variables in
|
lasso.L2.x |
Lasso context-level covariates. A character vector
containing the column names of the context-level variables in
|
pca.L2.x |
PCA context-level covariates. A character vector containing
the column names of the context-level variables in |
gb.L2.x |
GB context-level covariates. A character vector containing the
column names of the context-level variables in |
svm.L2.x |
SVM context-level covariates. A character vector containing
the column names of the context-level variables in |
mrp.L2.x |
MRP context-level covariates. A character vector containing
the column names of the context-level variables in |
gb.L2.unit |
GB L2.unit. A logical argument indicating whether
|
gb.L2.reg |
GB L2.reg. A logical argument indicating whether
|
svm.L2.unit |
SVM L2.unit. A logical argument indicating whether
|
svm.L2.reg |
SVM L2.reg. A logical argument indicating whether
|
deep.splines |
Deep MRP splines. A logical argument indicating whether
splines should be used in the deep MRP classifier. Default is |
lasso.lambda |
Lasso penalty parameter. A numeric |
lasso.n.iter |
Lasso number of lambda values. An integer-valued scalar
specifying the number of lambda values to search over. Default is
|
gb.interaction.depth |
GB interaction depth. An integer-valued vector
whose values specify the interaction depth of GB. The interaction depth
defines the maximum depth of each tree grown (i.e., the maximum level of
variable interactions). Default is |
gb.shrinkage |
GB learning rate. A numeric vector whose values specify
the learning rate or step-size reduction of GB. Values between |
gb.n.trees.init |
GB initial total number of trees. An integer-valued
scalar specifying the initial number of total trees to fit by GB. Default
is |
gb.n.trees.increase |
GB increase in total number of trees. An
integer-valued scalar specifying by how many trees the total number of
trees to fit should be increased (until |
gb.n.trees.max |
GB maximum number of trees. An integer-valued scalar
specifying the maximum number of trees to fit by GB. Default is |
gb.n.minobsinnode |
GB minimum number of observations in the terminal
nodes. An integer-valued scalar specifying the minimum number of
observations that each terminal node of the trees must contain. Default is
|
svm.kernel |
SVM kernel. A character-valued scalar specifying the kernel
to be used by SVM. The possible values are |
svm.gamma |
SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log-scale. |
svm.cost |
SVM cost parameter. A numeric vector whose values specify the cost of constraints violation in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log-scale. |
ebma.n.draws |
EBMA number of samples. An integer-valued scalar
specifying the number of bootstrapped samples to be drawn from the EBMA
fold and used for tuning EBMA. Default is |
ebma.tol |
EBMA tolerance. A numeric vector containing the
tolerance values for improvements in the log-likelihood before the EM
algorithm stops optimization. Values should range at least from |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
uncertainty |
Uncertainty estimates. A logical argument indicating
whether uncertainty estimates should be computed. Default is |
boot.iter |
Number of bootstrap iterations. An integer argument
indicating the number of bootstrap iterations to be computed. Will be
ignored unless |
Bootstrapping samples the level two units, sometimes referred to as the cluster bootstrap. For the multilevel model, for example, when running MrP only, the bootstrapped median level two predictions will differ from the level two predictions without bootstrapping. We recommend assessing the difference by running autoMrP without bootstrapping alongside autoMrP with bootstrapping and then comparing level two predictions from the model without bootstrapping to the median level two predictions from the model with bootstrapping.
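The cluster bootstrap described above resamples whole level-two units rather than individual respondents. The base-R sketch below illustrates the idea on invented data. `cluster_boot` and `survey_toy` are hypothetical names, and the package's own resampling may differ in detail.

```r
# Toy survey: three states with different numbers of respondents (invented).
survey_toy <- data.frame(
  state = rep(c("A", "B", "C"), times = c(4, 3, 5)),
  y     = c(1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0)
)

# Hypothetical helper: draw level-two units with replacement and keep all
# respondents of each drawn unit (a unit drawn twice appears twice).
cluster_boot <- function(data, unit) {
  units  <- unique(data[[unit]])
  drawn  <- sample(units, size = length(units), replace = TRUE)
  pieces <- lapply(drawn, function(u) data[data[[unit]] == u, , drop = FALSE])
  do.call(rbind, pieces)
}

set.seed(123)  # fix the seed so the resample is reproducible
boot_sample <- cluster_boot(survey_toy, "state")
print(table(boot_sample$state))
```

Because units are drawn with replacement, some states can be missing from a given resample while others are duplicated. This is one reason the bootstrapped median predictions can differ from the predictions without bootstrapping.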
To ensure reproducibility of the results, use the set.seed() function to specify a seed.
The context-level predictions. A list with two elements. The first element, EBMA, contains the post-stratified ensemble Bayesian model averaging (EBMA) predictions. The second element, classifiers, contains the post-stratified predictions from all estimated classifiers.
# An MrP model without machine learning
set.seed(123)
m <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1"),
  L2.x = c("L2.x1", "L2.x2"),
  L2.unit = "state",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  ebma.size = 0,
  cores = 2,
  best.subset = FALSE,
  lasso = FALSE,
  pca = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE
)

# summarize and plot results
summary(m)
plot(m)

# An MrP model without context-level predictors
m <- auto_MrP(
  y = "YES",
  L1.x = "L1x1",
  L2.x = NULL,
  mrp.L2.x = "",
  L2.unit = "state",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  ebma.size = 0,
  cores = 1,
  best.subset = FALSE,
  lasso = FALSE,
  pca = FALSE,
  gb = FALSE,
  svm = FALSE,
  mrp = TRUE
)

# Predictions with machine learning

# detect number of available cores
max_cores <- parallelly::availableCores()

# autoMrP with machine learning
ml_out <- auto_MrP(
  y = "YES",
  L1.x = c("L1x1", "L1x2", "L1x3"),
  L2.x = c("L2.x1", "L2.x2", "L2.x3", "L2.x4", "L2.x5", "L2.x6"),
  L2.unit = "state",
  L2.reg = "region",
  bin.proportion = "proportion",
  survey = taxes_survey,
  census = taxes_census,
  gb.L2.reg = TRUE,
  svm.L2.reg = TRUE,
  cores = min(2, max_cores)
)
best_subset_classifier applies best subset classification to a data set.
best_subset_classifier( model, data.train, model.family, model.optimizer, n.iter, y, verbose = c(TRUE, FALSE) )
model |
Multilevel model. A model formula describing the multilevel model to be estimated on the basis of the provided training data. |
data.train |
Training data. A data.frame containing the training data used to train the model. |
model.family |
Model family. A variable indicating the model family to be used by glmer. Defaults to binomial(link = "probit"). |
model.optimizer |
Optimization method. A character-valued scalar describing the optimization method to be used by glmer. Defaults to "bobyqa". |
n.iter |
Iterations. An integer-valued scalar specifying the maximum number of function evaluations tried by the optimization method. |
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
verbose |
Verbose output. A logical vector indicating whether or not verbose output should be printed. |
The multilevel model. A glmer object.
binary_cross_entropy() estimates the inverse binary cross-entropy on the individual and state-level.
binary_cross_entropy( pred, data.valid, loss.unit = c("individuals", "L2 units"), y, L2.unit )
pred |
Predictions of outcome. A numeric vector of outcome predictions. |
data.valid |
Test data set. A tibble of data that was not used for prediction. |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
y |
Outcome variable. A character vector containing the column names of the outcome variable. |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
Returns a tibble containing two binary cross-entropy prediction errors. The first is measured at the level of individuals and the second is measured at the context level. The tibble dimensions are 2x3 with variables: measure, value and level.
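To make the two levels of evaluation concrete, the base-R sketch below computes a standard binary cross-entropy once per respondent and once after averaging within states. The helper `bce` and the data are hypothetical. The text above calls the measure an "inverse" binary cross-entropy, so the package's sign and aggregation conventions may differ from this plain version. A data.frame is used instead of a tibble to keep the example dependency-free.

```r
# Hypothetical helper: binary cross-entropy of predicted probabilities `p`
# against observed values `y` (both in [0, 1]); lower is better.
bce <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)  # guard against log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Invented validation data: state, outcome, and predictions.
valid <- data.frame(
  state = c("A", "A", "B", "B"),
  y     = c(1, 0, 1, 1),
  pred  = c(0.8, 0.3, 0.6, 0.9)
)

# Individual level: compare each respondent's outcome with its prediction.
loss_ind <- bce(valid$y, valid$pred)

# Context level: average outcome and prediction within state first.
agg     <- aggregate(cbind(y, pred) ~ state, data = valid, FUN = mean)
loss_l2 <- bce(agg$y, agg$pred)

# Arrange as measure / value / level, mirroring the 2x3 layout described above.
out <- data.frame(
  measure = "cross-entropy",
  value   = c(loss_ind, loss_l2),
  level   = c("individuals", "L2 units")
)
print(out)
```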
boot_auto_mrp estimates uncertainty for auto_MrP via bootstrapping.
boot_auto_mrp( y, L1.x, L2.x, mrp.L2.x, L2.unit, L2.reg, L2.x.scale, pcs, folds, bin.proportion, bin.size, survey, census, ebma.size, k.folds, cv.sampling, loss.unit, loss.fun, best.subset, lasso, pca, gb, svm, mrp, deep.mrp, best.subset.L2.x, lasso.L2.x, pca.L2.x, pc.names, gb.L2.x, svm.L2.x, svm.L2.unit, svm.L2.reg, gb.L2.unit, gb.L2.reg, deep.splines, lasso.lambda, lasso.n.iter, gb.interaction.depth, gb.shrinkage, gb.n.trees.init, gb.n.trees.increase, gb.n.trees.max, gb.n.minobsinnode, svm.kernel, svm.gamma, svm.cost, ebma.tol, boot.iter, cores )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
mrp.L2.x |
MRP context-level covariates. A character vector containing
the column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
L2.x.scale |
Scale context-level covariates. A logical argument
indicating whether the context-level covariates should be normalized.
Default is |
pcs |
Principal components. A character vector containing the column
names of the principal components of the context-level variables in
|
folds |
EBMA and cross-validation folds. A character scalar containing
the column name of the variable in |
bin.proportion |
Proportion of ideal types. A character scalar
containing the column name of the variable in |
bin.size |
Bin size of ideal types. A character scalar containing the
column name of the variable in |
survey |
Survey data. A |
census |
Census data. A |
ebma.size |
EBMA fold size. A number in the open unit interval
indicating the proportion of respondents to be allocated to the EBMA fold.
Default is |
k.folds |
Number of cross-validation folds. An integer-valued scalar
indicating the number of folds to be used in cross-validation. Default is
|
cv.sampling |
Cross-validation sampling method. A character-valued
scalar indicating whether cross-validation folds should be created by
sampling individual respondents ( |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
best.subset |
Best subset classifier. A logical argument indicating
whether the best subset classifier should be used for predicting outcome
|
lasso |
Lasso classifier. A logical argument indicating whether the
lasso classifier should be used for predicting outcome |
pca |
PCA classifier. A logical argument indicating whether the PCA
classifier should be used for predicting outcome |
gb |
GB classifier. A logical argument indicating whether the GB
classifier should be used for predicting outcome |
svm |
SVM classifier. A logical argument indicating whether the SVM
classifier should be used for predicting outcome |
mrp |
MRP classifier. A logical argument indicating whether the standard
MRP classifier should be used for predicting outcome |
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for best subset prediction. Setting
|
best.subset.L2.x |
Best subset context-level covariates. A character
vector containing the column names of the context-level variables in
|
lasso.L2.x |
Lasso context-level covariates. A character vector
containing the column names of the context-level variables in
|
pca.L2.x |
PCA context-level covariates. A character vector containing
the column names of the context-level variables in |
pc.names |
A character vector of the principal component variable names in the data. |
gb.L2.x |
GB context-level covariates. A character vector containing the
column names of the context-level variables in |
svm.L2.x |
SVM context-level covariates. A character vector containing
the column names of the context-level variables in |
svm.L2.unit |
SVM L2.unit. A logical argument indicating whether
|
svm.L2.reg |
SVM L2.reg. A logical argument indicating whether
|
gb.L2.unit |
GB L2.unit. A logical argument indicating whether
|
gb.L2.reg |
GB L2.reg. A logical argument indicating whether
|
deep.splines |
Deep MRP splines. A logical argument indicating whether
splines should be used in the deep MRP classifier. Default is |
lasso.lambda |
Lasso penalty parameter. A numeric |
lasso.n.iter |
Lasso number of lambda values. An integer-valued scalar
specifying the number of lambda values to search over. Default is
|
gb.interaction.depth |
GB interaction depth. An integer-valued vector
whose values specify the interaction depth of GB. The interaction depth
defines the maximum depth of each tree grown (i.e., the maximum level of
variable interactions). Default is |
gb.shrinkage |
GB learning rate. A numeric vector whose values specify
the learning rate or step-size reduction of GB. Values between |
gb.n.trees.init |
GB initial total number of trees. An integer-valued
scalar specifying the initial number of total trees to fit by GB. Default
is |
gb.n.trees.increase |
GB increase in total number of trees. An
integer-valued scalar specifying by how many trees the total number of
trees to fit should be increased (until |
gb.n.trees.max |
GB maximum number of trees. An integer-valued scalar
specifying the maximum number of trees to fit by GB. Default is |
gb.n.minobsinnode |
GB minimum number of observations in the terminal
nodes. An integer-valued scalar specifying the minimum number of
observations that each terminal node of the trees must contain. Default is
|
svm.kernel |
SVM kernel. A character-valued scalar specifying the kernel
to be used by SVM. The possible values are |
svm.gamma |
SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log-scale. |
svm.cost |
SVM cost parameter. A numeric vector whose values specify the cost of constraints violation in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log-scale. |
ebma.tol |
EBMA tolerance. A numeric vector containing the
tolerance values for improvements in the log-likelihood before the EM
algorithm stops optimization. Values should range at least from |
boot.iter |
Number of bootstrap iterations. An integer argument
indicating the number of bootstrap iterations to be computed. Will be
ignored unless |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
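The defaults described for svm.gamma and svm.cost are sequences that are equally spaced on the log scale. One way such grids could be built in base R is sketched below. `log_seq` is a hypothetical helper and this construction is an assumption, not the package's source. Only the minimum, maximum, and length values come from the argument descriptions.

```r
# Hypothetical helper: `n` values from `from` to `to`, equally spaced in logs.
log_seq <- function(from, to, n) {
  exp(seq(log(from), log(to), length.out = n))
}

# Grids matching the documented defaults (minimum, maximum, length).
svm_gamma_grid <- log_seq(1e-5, 1e-1, 20)
svm_cost_grid  <- log_seq(0.5, 10, 5)

# Equal spacing on the log scale means a constant ratio between neighbours.
print(signif(svm_cost_grid, 3))
```

A log-spaced grid puts as many candidate values between 1e-5 and 1e-4 as between 1e-2 and 1e-1, which suits parameters whose useful range spans several orders of magnitude.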
The census file is generated from the full 2008 Cooperative Congressional Election Studies item cc418_1 by disaggregating the 64 ideal type combinations of the individual-level variables L1x1, L1x2 and L1x3. A row is an ideal type in a given state.
census
A data frame with 2934 rows and 13 variables:
U.S. state
U.S. state id
U.S. region (four categories: 1 = Northeast; 2 = Midwest; 3 = South; 4 = West)
Age group (four categories)
Education level (four categories)
Gender-race combination (six categories)
State-level proportion of respondents of that ideal type in the population
State-level share of votes for the Republican candidate in the previous presidential election
State-level percentage of Evangelical Protestant or Mormon respondents
State-level percentage of the population living in urban areas
State-level unemployment rate
State-level share of Hispanics
State-level share of Whites
The data set (excluding L2.x3, L2.x4, L2.x5, L2.x6) is taken from the article: Buttice, Matthew K, and Benjamin Highton. 2013. "How does multilevel regression and poststratification perform with conventional national surveys?" Political Analysis 21(4): 449-467. L2.x3, L2.x4, L2.x5 and L2.x6 are available at https://www.census.gov.
cv_folding creates folds used in classifier training within the survey data.
cv_folding(data, L2.unit, k.folds, cv.sampling = c("individuals", "L2 units"))
data |
The survey data; must be a tibble. |
L2.unit |
The column name of the factor variable identifying the context-level unit |
k.folds |
An integer value indicating the number of folds to be generated. |
cv.sampling |
Cross-validation sampling method. A character-valued
scalar indicating whether cross-validation folds should be created by
sampling individual respondents ( |
Returns a list whose length is given by the k.folds argument. Each element is a tibble with a fold used in k-fold cross-validation.
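The two sampling modes can be sketched in base R as follows. `make_folds` and `survey_toy` are hypothetical, the folds are plain data frames rather than tibbles, and the package's own fold construction may differ. With "individuals", rows are assigned to folds at random. With "L2 units", whole units are assigned, so all respondents of a unit share one fold.

```r
# Invented survey data with a level-two identifier.
survey_toy <- data.frame(
  id    = 1:12,
  state = rep(c("A", "B", "C", "D", "E", "F"), each = 2)
)

# Hypothetical helper: split `data` into k folds, sampling either
# individual respondents or whole level-two units.
make_folds <- function(data, unit, k, sampling = c("individuals", "L2 units")) {
  sampling <- match.arg(sampling)
  if (sampling == "individuals") {
    # Randomly assign each row to one of k folds of near-equal size.
    fold_id <- sample(rep_len(seq_len(k), nrow(data)))
  } else {
    # Randomly assign each unit to a fold; rows inherit their unit's fold.
    units     <- unique(data[[unit]])
    unit_fold <- sample(rep_len(seq_len(k), length(units)))
    fold_id   <- unit_fold[match(data[[unit]], units)]
  }
  unname(split(data, fold_id))
}

set.seed(1)
folds_ind <- make_folds(survey_toy, "state", k = 3, sampling = "individuals")
folds_l2  <- make_folds(survey_toy, "state", k = 3, sampling = "L2 units")
print(lapply(folds_l2, function(f) unique(as.character(f$state))))
```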
deep_mrp_classifier applies Deep MrP implemented in the vglmer package to a data set.
deep_mrp_classifier(y, form, data, verbose)
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
form |
Model formula. A two-sided linear formula describing the model to be fit, with the outcome on the LHS and the covariates separated by + operators on the RHS. |
data |
Data. A data.frame containing the data used to train the model. |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
A Deep MrP model. A vglmer object.
ebma tunes EBMA and generates weights for classifier averaging.
ebma( ebma.fold, y, L1.x, L2.x, L2.unit, L2.reg, pc.names, post.strat, n.draws, tol, best.subset.opt, pca.opt, lasso.opt, gb.opt, svm.opt, deep.mrp, verbose, cores, preds_all )
ebma.fold |
New data for EBMA tuning. A list containing the data that must not have been used in classifier training. |
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
pc.names |
Principal Component Variable names. A character vector containing the names of the context-level principal components variables. |
post.strat |
Post-stratification results. A list containing the best models for each of the tuned classifiers, the individual-level predictions on the classifier training data, and the post-stratified context-level predictions. |
n.draws |
EBMA number of samples. An integer-valued scalar specifying
the number of bootstrapped samples to be drawn from the EBMA fold and used
for tuning EBMA. Default is |
tol |
EBMA tolerance. A numeric vector containing the tolerance values
for improvements in the log-likelihood before the EM algorithm stops
optimization. Values should range at least from |
best.subset.opt |
Tuned best subset parameters. A list returned from
|
pca.opt |
Tuned best subset with principal components parameters. A list
returned from |
lasso.opt |
Tuned lasso parameters. A list returned from
|
gb.opt |
Tuned gradient tree boosting parameters. A list returned from
|
svm.opt |
Tuned support vector machine parameters. A list returned from
|
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for best subset prediction. Setting
|
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
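The role of the weights produced by EBMA can be pictured as a weighted average of the classifiers' predictions. The sketch below shows only that final averaging step on invented numbers. The matrix `preds`, the weight vector `w`, and the classifier labels are assumptions, and the tuning that estimates the weights (the EM algorithm and tolerance search mentioned in the arguments) is not reproduced here.

```r
# Invented context-level predictions from three classifiers (rows = states).
preds <- cbind(
  best_subset = c(0.42, 0.55),
  lasso       = c(0.40, 0.60),
  gb          = c(0.48, 0.50)
)
rownames(preds) <- c("A", "B")

# Invented classifier weights; in model averaging they sum to 1.
w <- c(best_subset = 0.5, lasso = 0.3, gb = 0.2)

# Ensemble prediction: weighted average of the classifier predictions,
# with weights matched to columns by name.
ebma_pred <- drop(preds %*% w[colnames(preds)])
print(round(ebma_pred, 3))
```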
ebma_folding() generates a data fold that will not be used in classifier tuning. It is data that is needed to determine the optimal tolerance for EBMA.
ebma_folding(data, L2.unit, ebma.size)
data |
The full survey data. A tibble. |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
ebma.size |
EBMA fold size. A number in the open unit interval
indicating the proportion of respondents to be allocated to the EBMA fold.
Default is |
Returns a list with two elements, both of which are tibbles. List element
one is named ebma_fold
and contains the tibble used in Ensemble
Bayesian Model Averaging Tuning. List element two is named cv_data
and contains the tibble used for classifier tuning.
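The split into an EBMA fold and cross-validation data can be sketched in base R. This is a minimal illustration under stated assumptions, not the package implementation: the helper name `split_ebma_fold`, the use of simple random sampling over rows, and the toy data are all hypothetical, and the real function may sample with respect to `L2.unit`.

```r
# Hypothetical helper: hold out a share of respondents for EBMA tuning.
split_ebma_fold <- function(data, ebma.size) {
  # ebma.size must lie in the open unit interval
  stopifnot(ebma.size > 0, ebma.size < 1)
  n_ebma <- round(nrow(data) * ebma.size)
  stopifnot(n_ebma >= 1)
  # Assumption: simple random sampling of rows
  idx <- sample(seq_len(nrow(data)), size = n_ebma)
  list(
    ebma_fold = data[idx, , drop = FALSE],  # held out for EBMA tuning
    cv_data = data[-idx, , drop = FALSE]    # used for classifier tuning
  )
}

set.seed(1)
toy <- data.frame(id = 1:100, state = rep(letters[1:5], each = 20))
folds <- split_ebma_fold(toy, ebma.size = 1 / 3)
```

The two list elements are disjoint by construction, so no respondent in `ebma_fold` is seen during classifier tuning.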
ebma_mc_draws
is called from within ebma
. It tunes using
multiple cores.
ebma_mc_draws( train.preds, train.y, ebma.fold, y, L1.x, L2.x, L2.unit, L2.reg, pc.names, model.bs, model.pca, model.lasso, model.gb, model.svm, model.mrp, tol, n.draws, cores, preds_all, post.strat, dv_type, deep.mrp )
train.preds |
Predictions of classifiers on the classifier training data. A tibble. |
train.y |
Outcome variable of the classifier training data. A numeric vector. |
ebma.fold |
New data for EBMA tuning. A list containing the data that must not have been used in classifier training. |
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
pc.names |
Principal Component Variable names. A character vector containing the names of the context-level principal components variables. |
model.bs |
The tuned model from the multilevel regression with best
subset selection classifier. An |
model.pca |
The tuned model from the multilevel regression with
principal components as context-level predictors classifier. An
|
model.lasso |
The tuned model from the multilevel regression with L1
regularization classifier. A |
model.gb |
The tuned model from the gradient boosting classifier. A
|
model.svm |
The tuned model from the support vector machine classifier.
An |
model.mrp |
The standard MrP model. An |
tol |
EBMA tolerance. A numeric vector containing the tolerance values
for improvements in the log-likelihood before the EM algorithm stops
optimization. Values should range at least from |
n.draws |
EBMA number of samples. An integer-valued scalar specifying
the number of bootstrapped samples to be drawn from the EBMA fold and used
for tuning EBMA. Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
post.strat |
Post-stratification results. A list containing the best models for each of the tuned classifiers, the individual-level predictions on the classifier training data, and the post-stratified context-level predictions. |
dv_type |
The type of the dependent variable. A character string. Either "binary" or "linear". |
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for best subset prediction. Setting
|
The classifier weights. A numeric vector.
ebma_mc_tol
is called from within ebma
. It tunes using
multiple cores.
ebma_mc_tol( train.preds, train.y, ebma.fold, y, L1.x, L2.x, L2.unit, L2.reg, pc.names, model.bs, model.pca, model.lasso, model.gb, model.svm, model.mrp, tol, n.draws, cores, preds_all, post.strat, dv_type, deep.mrp )
train.preds |
Predictions of classifiers on the classifier training data. A tibble. |
train.y |
Outcome variable of the classifier training data. A numeric vector. |
ebma.fold |
The data used for EBMA tuning. A tibble. |
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
pc.names |
Principal Component Variable names. A character vector containing the names of the context-level principal components variables. |
model.bs |
The tuned model from the multilevel regression with best
subset selection classifier. An |
model.pca |
The tuned model from the multilevel regression with
principal components as context-level predictors classifier. An
|
model.lasso |
The tuned model from the multilevel regression with L1
regularization classifier. A |
model.gb |
The tuned model from the gradient boosting classifier. A
|
model.svm |
The tuned model from the support vector machine classifier.
An |
model.mrp |
The standard MrP model. An |
tol |
The tolerance values used for EBMA. A numeric vector. |
n.draws |
EBMA number of samples. An integer-valued scalar specifying
the number of bootstrapped samples to be drawn from the EBMA fold and used
for tuning EBMA. Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
post.strat |
Post-stratification results. A list containing the best models for each of the tuned classifiers, the individual-level predictions on the classifier training data, and the post-stratified context-level predictions. |
dv_type |
The type of the dependent variable. A character string. Either "binary" or "linear". |
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for best subset prediction. Setting
|
The classifier weights. A numeric vector.
error_checks()
checks for incorrect data entry in autoMrP()
call.
error_checks( y, L1.x, L2.x, L2.unit, L2.reg, L2.x.scale, pcs, folds, bin.proportion, bin.size, survey, census, ebma.size, k.folds, cv.sampling, loss.unit, loss.fun, best.subset, lasso, pca, gb, svm, mrp, best.subset.L2.x, lasso.L2.x, deep.mrp, gb.L2.x, svm.L2.x, mrp.L2.x, gb.L2.unit, gb.L2.reg, lasso.lambda, lasso.n.iter, deep.splines, uncertainty, boot.iter )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
L2.x.scale |
Scale context-level covariates. A logical argument
indicating whether the context-level covariates should be normalized.
Default is |
pcs |
Principal components. A character vector containing the column
names of the principal components of the context-level variables in
|
folds |
EBMA and cross-validation folds. A character scalar containing
the column name of the variable in |
bin.proportion |
Proportion of ideal types. A character scalar
containing the column name of the variable in |
bin.size |
Bin size of ideal types. A character scalar containing the
column name of the variable in |
survey |
Survey data. A |
census |
Census data. A |
ebma.size |
EBMA fold size. A number in the open unit interval
indicating the proportion of respondents to be allocated to the EBMA fold.
Default is |
k.folds |
Number of cross-validation folds. An integer-valued scalar
indicating the number of folds to be used in cross-validation. Default is
|
cv.sampling |
Cross-validation sampling method. A character-valued
scalar indicating whether cross-validation folds should be created by
sampling individual respondents ( |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
best.subset |
Best subset classifier. A logical argument indicating
whether the best subset classifier should be used for predicting outcome
|
lasso |
Lasso classifier. A logical argument indicating whether the
lasso classifier should be used for predicting outcome |
pca |
PCA classifier. A logical argument indicating whether the PCA
classifier should be used for predicting outcome |
gb |
GB classifier. A logical argument indicating whether the GB
classifier should be used for predicting outcome |
svm |
SVM classifier. A logical argument indicating whether the SVM
classifier should be used for predicting outcome |
mrp |
MRP classifier. A logical argument indicating whether the standard
MRP classifier should be used for predicting outcome |
best.subset.L2.x |
Best subset context-level covariates. A character
vector containing the column names of the context-level variables in
|
lasso.L2.x |
Lasso context-level covariates. A character vector
containing the column names of the context-level variables in
|
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for best subset prediction. Setting
|
gb.L2.x |
GB context-level covariates. A character vector containing the
column names of the context-level variables in |
svm.L2.x |
SVM context-level covariates. A character vector containing
the column names of the context-level variables in |
mrp.L2.x |
MRP context-level covariates. A character vector containing
the column names of the context-level variables in |
gb.L2.unit |
GB L2.unit. A logical argument indicating whether
|
gb.L2.reg |
GB L2.reg. A logical argument indicating whether
|
lasso.lambda |
Lasso penalty parameter. A numeric |
lasso.n.iter |
Lasso number of lambda values. An integer-valued scalar
specifying the number of lambda values to search over. Default is
|
deep.splines |
Deep MRP splines. A logical argument indicating whether
splines should be used in the deep MRP classifier. Default is |
uncertainty |
Uncertainty estimates. A logical argument indicating
whether uncertainty estimates should be computed. Default is |
boot.iter |
Number of bootstrap iterations. An integer argument
indicating the number of bootstrap iterations to be computed. Will be
ignored unless |
No return value, called for detection of errors in autoMrP() call.
f1_score()
estimates the inverse f1 scores on the individual and state
levels.
f1_score(pred, data.valid, y, L2.unit)
pred |
Predictions of outcome. A numeric vector of outcome predictions. |
data.valid |
Test data set. A tibble of data that was not used for prediction. |
y |
Outcome variable. A character vector containing the column names of the outcome variable. |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
Returns a tibble containing two f1 prediction errors. The first is measured at the level of individuals and the second is measured at the context level. The tibble dimensions are 2x3 with variables: measure, value and level.
gb_classifier
applies gradient boosting classification to a data set.
gb_classifier( y, form, distribution, data.train, n.trees, interaction.depth, n.minobsinnode, shrinkage, verbose = c(TRUE, FALSE) )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
form |
Model formula. A two-sided linear formula describing the model to be fit, with the outcome on the LHS and the covariates separated by + operators on the RHS. |
distribution |
Model distribution. A character string specifying the name of the distribution to be used. |
data.train |
Training data. A data.frame containing the training data used to train the model. |
n.trees |
Total number of trees. An integer-valued scalar specifying the total number of trees to be fit. |
interaction.depth |
Interaction depth. An integer-valued scalar specifying the maximum depth of each tree. |
n.minobsinnode |
Minimum number of observations in terminal nodes. An integer-valued scalar specifying the minimum number of observations in the terminal nodes of the trees. |
shrinkage |
Learning rate. A numeric scalar specifying the shrinkage or learning rate applied to each tree in the expansion. |
verbose |
Verbose output. A logical vector indicating whether or not verbose output should be printed. |
A gradient tree boosting model. A gbm
object.
gb_classifier_update()
grows additional trees in gradient tree
boosting ensemble.
gb_classifier_update(object, n.new.trees, verbose = c(TRUE, FALSE))
object |
Gradient tree boosting output. A gbm object. |
n.new.trees |
Number of additional trees to grow. A numeric scalar. |
verbose |
Verbose output. A logical vector indicating whether or not verbose output should be printed. |
An updated gradient tree boosting model.
A gbm.more
object.
lasso_classifier
applies lasso classification to a data set.
lasso_classifier( L2.fix, L1.re, data.train, lambda, model.family, y, verbose = c(TRUE, FALSE) )
L2.fix |
Fixed effects. A two-sided linear formula describing the fixed effects part of the model, with the outcome on the LHS and the fixed effects separated by + operators on the RHS. |
L1.re |
Random effects. A named list object, with the random effects providing the names of the list elements and ~ 1 being the list elements. |
data.train |
Training data. A data.frame containing the training data used to train the model. |
lambda |
Tuning parameter. Lambda is the penalty parameter that controls the shrinkage of fixed effects. |
model.family |
Model family. A variable indicating the model family to be used by glmmLasso. Defaults to binomial(link = "probit"). |
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
verbose |
Verbose output. A logical vector indicating whether or not verbose output should be printed. |
A multilevel lasso model. A glmmLasso
object.
Sequence that is equally spaced on the log scale
log_spaced(min, max, n)
min |
The minimum value of the sequence. A positive numeric scalar (min > 0). |
max |
The maximum value of the sequence. A positive numeric scalar (max > 0). |
n |
The length of the sequence. An integer-valued scalar. |
Returns a numeric vector with length specified in argument n
.
The vector elements are equally spaced on the log-scale.
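A sequence that is equally spaced on the log scale can be built by spacing the logarithms evenly and exponentiating. The sketch below uses only base R; the name `log_spaced_sketch` is hypothetical and the package's own implementation may differ in detail.

```r
# Hypothetical re-implementation: n values from min to max,
# equally spaced on the log scale.
log_spaced_sketch <- function(min, max, n) {
  stopifnot(min > 0, max > 0, n >= 1)
  # Space the logs evenly, then map back to the original scale
  exp(seq(from = log(min), to = log(max), length.out = n))
}

x <- log_spaced_sketch(0.1, 100, 4)
# Successive elements differ by a constant factor (here 10)
ratios <- x[-1] / x[-length(x)]
```

Such a grid is a common choice for penalty parameters, because it covers several orders of magnitude with few values.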
loss_function()
estimates the loss based on a loss function.
loss_function( pred, data.valid, loss.unit = c("individuals", "L2 units"), loss.fun = c("MSE", "MAE", "cross-entropy"), y, L2.unit )
pred |
Predictions of outcome. A numeric vector of outcome predictions. |
data.valid |
Test data set. A tibble of data that was not used for prediction. |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
y |
Outcome variable. A character vector containing the column names of the outcome variable. |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
Returns a tibble with number of rows equal to the number of loss functions tested (defaults to 4 for cross-entropy, f1, MSE, and msfe). The number of columns is 2 where the first is called measure and contains the names of the loss-functions and the second is called value and contains the loss-function scores.
loss_score_ranking()
ranks tuning parameters according to the scores
received in multiple loss functions.
loss_score_ranking(score, loss.fun)
score |
A data set containing loss function names, the loss function values, and the tuning parameter values. |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
Returns a tibble containing the parameter grid as well as a rank column that corresponds to the cross-validation rank of a parameter combination across all loss function scores.
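One way to combine several loss functions into a single ranking is to rank the parameter values within each loss function and then sum the ranks. The sketch below illustrates that idea in base R. The rank-sum rule, the column names (`lambda`, `measure`, `value`) and the toy scores are assumptions for illustration; the package may aggregate differently.

```r
# Toy scores: three candidate lambda values evaluated on two loss functions.
score <- data.frame(
  lambda = rep(c(0.1, 1, 10), times = 2),
  measure = rep(c("MSE", "MAE"), each = 3),
  value = c(0.20, 0.15, 0.30, 0.40, 0.35, 0.50)
)

# Rank within each loss function: the smallest loss gets rank 1.
score$rank <- ave(score$value, score$measure, FUN = rank)

# Assumed aggregation rule: sum the ranks per parameter value.
total <- aggregate(rank ~ lambda, data = score, FUN = sum)
total <- total[order(total$rank), ]

# The parameter value with the lowest rank sum wins.
best_lambda <- total$lambda[1]
```

Ranking before aggregating keeps loss functions on different scales from dominating one another.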
mean_absolute_error()
estimates the mean absolute error for the
desired loss unit.
mean_absolute_error(pred, data.valid, y, L2.unit)
pred |
Predictions of outcome. A numeric vector of outcome predictions. |
data.valid |
Test data set. A tibble of data that was not used for prediction. |
y |
Outcome variable. A character vector containing the column names of the outcome variable. |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
Returns a tibble containing two mean absolute prediction errors. The first is measured at the level of individuals and the second is measured at the context level. The tibble dimensions are 2x3 with variables: measure, value and level.
mean_squared_error()
estimates the mean squared error for the desired
loss unit.
mean_squared_error(pred, data.valid, y, L2.unit)
pred |
Predictions of outcome. A numeric vector of outcome predictions. |
data.valid |
Test data set. A tibble of data that was not used for prediction. |
y |
Outcome variable. A character vector containing the column names of the outcome variable. |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
Returns a tibble containing two mean squared prediction errors. The first is measured at the level of individuals and the second is measured at the context level. The tibble dimensions are 2x3 with variables: measure, value and level.
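The two-level return value shared by mean_absolute_error() and mean_squared_error() can be sketched as follows. This is an illustration, not the package code: the helper `two_level_error` is hypothetical, it returns a plain data.frame rather than a tibble, and the context-level definition (comparing unit means of the observed outcome and of the predictions) is an assumption.

```r
# Hypothetical helper: prediction error at the individual and context level.
two_level_error <- function(pred, data.valid, y, L2.unit,
                            err = function(e) e^2, measure = "mse") {
  truth <- data.valid[[y]]
  unit <- data.valid[[L2.unit]]
  # Individual level: average error over respondents
  individual <- mean(err(truth - pred))
  # Context level (assumed): compare unit means of truth and prediction
  unit_truth <- tapply(truth, unit, mean)
  unit_pred <- tapply(pred, unit, mean)
  context <- mean(err(unit_truth - unit_pred))
  data.frame(
    measure = measure,
    value = c(individual, context),
    level = c("individuals", "L2 units")
  )
}

valid <- data.frame(y = c(1, 0, 1, 0), state = c("a", "a", "b", "b"))
pred <- c(0.8, 0.2, 0.6, 0.0)
mse <- two_level_error(pred, valid, y = "y", L2.unit = "state")
mae <- two_level_error(pred, valid, y = "y", L2.unit = "state",
                       err = abs, measure = "mae")
```

Both results have two rows and three columns, matching the 2x3 shape described above.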
mean_squared_false_error()
estimates the mean squared false error on the individual and state
levels.
mean_squared_false_error(pred, data.valid, y, L2.unit)
pred |
Predictions of outcome. A numeric vector of outcome predictions. |
data.valid |
Test data set. A tibble of data that was not used for prediction. |
y |
Outcome variable. A character vector containing the column names of the outcome variable. |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
Returns a tibble containing two mean squared false prediction errors. The first is measured at the level of individuals and the second is measured at the context level. The tibble dimensions are 2x3 with variables: measure, value and level.
model_list()
generates an exhaustive list of lme4 model formulas from
the individual level and context level variables as well as geographic unit
variables to be iterated over in best subset selection.
model_list(y, L1.x, L2.x, L2.unit, L2.reg = NULL)
y |
Outcome variable. A character vector containing the column names of the outcome variable. |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
Returns a list with the number of elements equal to 2^k, where k is the number of context-level variables. Each element is of class formula.
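The 2^k count arises because every subset of the k context-level variables, including the empty one, yields one candidate model. The sketch below enumerates those subsets in base R. The helper name `model_list_sketch`, the random-intercept structure for the individual-level and geographic variables, and the variable names in the example are assumptions for illustration, not the package's exact formulas.

```r
# Hypothetical helper: one lme4-style formula per subset of L2.x.
model_list_sketch <- function(y, L1.x, L2.x, L2.unit, L2.reg = NULL) {
  # Assumption: random intercepts for individual-level and geographic variables
  re <- paste0("(1 | ", c(L1.x, L2.unit, L2.reg), ")")
  # Start with the empty subset, then add all subsets of size 1..k
  subsets <- list(character(0))
  for (m in seq_along(L2.x)) {
    subsets <- c(subsets, utils::combn(L2.x, m, simplify = FALSE))
  }
  lapply(subsets, function(fx) {
    stats::as.formula(paste(y, "~", paste(c(fx, re), collapse = " + ")))
  })
}

models <- model_list_sketch(
  y = "YES",
  L1.x = c("L1x1", "L1x2"),
  L2.x = c("L2.x1", "L2.x2", "L2.x3"),
  L2.unit = "state",
  L2.reg = "region"
)
length(models)  # 2^3 subsets of three context-level variables
```

Because the list grows exponentially in k, exhaustive best subset selection is only practical for a small number of context-level variables.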
model_list_pca()
generates an exhaustive list of lme4 model formulas
from the individual level and context level principal components as well as
geographic unit variables to be iterated over in best subset selection with
principal components.
model_list_pca(y, L1.x, L2.x, L2.unit, L2.reg = NULL)
y |
Outcome variable. A character vector containing the column names of the outcome variable. |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column name
of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
Returns a list with the number of elements k+1 where k is the number of context-level variables. Each element is of class formula. The first element is a model with context-level variables and the following models iteratively add the principal components as context-level variables.
multicore()
registers cores for parallel processing.
multicore(cores = 1, type, cl = NULL)
cores |
Number of cores to be used. An integer. Default is |
type |
Whether to start or end parallel processing. A character string.
The possible values are |
cl |
The registered cluster. Default is |
No return value, called to register or un-register clusters for parallel processing.
output_table()
...
output_table(object, col.names, format, digits)
object |
An |
col.names |
The column names of the table. A |
format |
The table format. A character string passed to
|
digits |
The number of digits to be displayed. An integer scalar.
Default is |
No return value, prints a table to the console.
plot.autoMrP()
plots unit-level preference estimates and error bars.
## S3 method for class 'autoMrP'
plot(x, algorithm = "ebma", ci.lvl = 0.95, ...)
x |
An |
algorithm |
The algorithm/classifier for which preference estimates are
desired. A character-valued scalar indicating either |
ci.lvl |
The level of the confidence intervals. A proportion. Default is
|
... |
Additional arguments affecting the summary produced. |
Returns a ggplot2
object of the preference estimates for the
selected classifier.
Apply post-stratification to classifiers.
post_stratification( y, L1.x, L2.x, L2.unit, L2.reg, best.subset.opt, lasso.opt, lasso.L2.x, pca.opt, gb.opt, svm.opt, svm.L2.reg, svm.L2.unit, svm.L2.x, mrp.include, n.minobsinnode, L2.unit.include, L2.reg.include, kernel, mrp.L2.x, data, ebma.fold, census, verbose, deep.mrp, deep.splines )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
best.subset.opt |
Optimal tuning parameters from best subset selection
classifier. A list returned by |
lasso.opt |
Optimal tuning parameters from the lasso classifier. A list
returned by |
lasso.L2.x |
Lasso context-level covariates. A character vector
containing the column names of the context-level variables in
|
pca.opt |
Optimal tuning parameters from best subset selection with
principal components classifier. A list returned by |
gb.opt |
Optimal tuning parameters from gradient tree boosting
classifier. A list returned by |
svm.opt |
Optimal tuning parameters from support vector machine
classifier. A list returned by |
svm.L2.reg |
SVM L2.reg. A logical argument indicating whether
|
svm.L2.unit |
SVM L2.unit. A logical argument indicating whether
|
svm.L2.x |
SVM context-level covariates. A character vector containing
the column names of the context-level variables in |
mrp.include |
Whether to run MRP classifier. A logical argument
indicating whether the standard MRP classifier should be used for
predicting outcome |
n.minobsinnode |
GB minimum number of observations in the terminal
nodes. An integer-valued scalar specifying the minimum number of
observations that each terminal node of the trees must contain. Passed from
|
L2.unit.include |
GB L2.unit. A logical argument indicating whether
|
L2.reg.include |
A logical argument indicating whether |
kernel |
SVM kernel. A character-valued scalar specifying the kernel to
be used by SVM. The possible values are |
mrp.L2.x |
MRP context-level covariates. A character vector containing
the column names of the context-level variables in |
data |
A data.frame containing the survey data used in classifier training. |
ebma.fold |
A data.frame containing the data not used in classifier training. |
census |
Census data. A |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for best subset prediction. Setting
|
deep.splines |
Deep MRP splines. A logical argument indicating whether
splines should be used in the deep MRP classifier. Default is |
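The core post-stratification step can be sketched independently of any classifier: predictions for each ideal type are averaged within a geographic unit, weighted by the type's population share. The example below is a base R illustration with made-up numbers; the column names `proportion` and `pred` are hypothetical and do not reflect the function's internal naming.

```r
# Toy census: two ideal types in each of two states, with
# population shares and classifier predictions per ideal type.
census <- data.frame(
  state = c("a", "a", "b", "b"),
  proportion = c(0.25, 0.75, 0.5, 0.5),
  pred = c(0.2, 0.6, 0.4, 0.8)
)

# Weight each prediction by the share of its ideal type
weighted <- census$pred * census$proportion

# Sum within states and normalise by the total share per state
numerator <- tapply(weighted, census$state, sum)
denominator <- tapply(census$proportion, census$state, sum)
estimates <- numerator / denominator
```

Dividing by the summed shares keeps the estimate a proper weighted mean even if the proportions within a unit do not sum exactly to one.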
predict_glmmLasso()
predicts on newdata objects from a glmmLasso object.
predict_glmmLasso( census, m, L1.x, lasso.L2.x, L2.unit, L2.reg, type = "response" )
census |
Census data. A |
m |
A |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
lasso.L2.x |
Lasso context-level covariates. A character vector
containing the column names of the context-level variables in
|
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
Returns a numeric vector of predictions from a glmmLasso()
object.
quiet()
suppresses cat output.
quiet(x)
x |
Input. It can be any kind. |
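A common base R idiom for suppressing `cat()` output is to divert the console to a temporary file while the expression is evaluated. The sketch below shows that idiom under the hypothetical name `quiet_sketch`; whether the package uses exactly this approach is an assumption.

```r
# Hypothetical helper: evaluate x while diverting console output.
quiet_sketch <- function(x) {
  sink(tempfile())     # redirect standard output to a throwaway file
  on.exit(sink())      # restore output even if evaluating x fails
  invisible(force(x))  # evaluate the lazy argument, return it silently
}

res <- quiet_sketch({
  cat("this line is not printed\n")
  42
})
```

Because R evaluates arguments lazily, the expression only runs at `force(x)`, after the sink is already in place.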
run_best_subset
is a wrapper function that applies the best subset
classifier to a list of models provided by the user, evaluates the models'
prediction performance, and chooses the best-performing model.
run_best_subset( y, L1.x, L2.x, L2.unit, L2.reg, loss.unit, loss.fun, data, verbose, cores )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
data |
Data for cross-validation. A |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
A model formula of the winning best subset classifier model.
run_best_subset_mc
is called from within run_best_subset
. It
tunes using multiple cores.
run_best_subset_mc( y, L1.x, L2.x, L2.unit, L2.reg, loss.unit, loss.fun, data, cores, models, verbose )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
data |
Data for cross-validation. A |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
models |
The models to perform best subset selection on. A list of model formulas. |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
The cross-validation errors for all models. A list.
run_classifiers
tunes classifiers, post-stratifies and carries out
EBMA.
run_classifiers( y, L1.x, L2.x, mrp.L2.x, L2.unit, L2.reg, L2.x.scale, pcs, pc.names, folds, bin.proportion, bin.size, cv.folds, cv.data, ebma.fold, census, ebma.size, ebma.n.draws, k.folds, cv.sampling, loss.unit, loss.fun, best.subset, lasso, pca, gb, svm, mrp, deep.mrp, best.subset.L2.x, lasso.L2.x, pca.L2.x, gb.L2.x, svm.L2.x, gb.L2.unit, gb.L2.reg, svm.L2.unit, svm.L2.reg, deep.splines, lasso.lambda, lasso.n.iter, gb.interaction.depth, gb.shrinkage, gb.n.trees.init, gb.n.trees.increase, gb.n.trees.max, gb.n.minobsinnode, svm.kernel, svm.gamma, svm.cost, ebma.tol, cores, verbose )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
mrp.L2.x |
MRP context-level covariates. A character vector containing
the column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
L2.x.scale |
Scale context-level covariates. A logical argument
indicating whether the context-level covariates should be normalized.
Default is |
pcs |
Principal components. A character vector containing the column
names of the principal components of the context-level variables in
|
pc.names |
A character vector of the principal component variable names in the data. |
folds |
EBMA and cross-validation folds. A character scalar containing
the column name of the variable in |
bin.proportion |
Proportion of ideal types. A character scalar
containing the column name of the variable in |
bin.size |
Bin size of ideal types. A character scalar containing the
column name of the variable in |
cv.folds |
Data for cross-validation. A |
cv.data |
A data.frame containing the survey data used in classifier training. |
ebma.fold |
A data.frame containing the data not used in classifier training. |
census |
Census data. A |
ebma.size |
EBMA fold size. A number in the open unit interval
indicating the proportion of respondents to be allocated to the EBMA fold.
Default is |
ebma.n.draws |
EBMA number of samples. An integer-valued scalar
specifying the number of bootstrapped samples to be drawn from the EBMA
fold and used for tuning EBMA. Default is |
k.folds |
Number of cross-validation folds. An integer-valued scalar
indicating the number of folds to be used in cross-validation. Default is
|
cv.sampling |
Cross-validation sampling method. A character-valued
scalar indicating whether cross-validation folds should be created by
sampling individual respondents ( |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
best.subset |
Best subset classifier. A logical argument indicating
whether the best subset classifier should be used for predicting outcome
|
lasso |
Lasso classifier. A logical argument indicating whether the
lasso classifier should be used for predicting outcome |
pca |
PCA classifier. A logical argument indicating whether the PCA
classifier should be used for predicting outcome |
gb |
GB classifier. A logical argument indicating whether the GB
classifier should be used for predicting outcome |
svm |
SVM classifier. A logical argument indicating whether the SVM
classifier should be used for predicting outcome |
mrp |
MRP classifier. A logical argument indicating whether the standard
MRP classifier should be used for predicting outcome |
deep.mrp |
Deep MRP classifier. A logical argument indicating whether
the deep MRP classifier should be used for best subset prediction. Setting
|
best.subset.L2.x |
Best subset context-level covariates. A character
vector containing the column names of the context-level variables in
|
lasso.L2.x |
Lasso context-level covariates. A character vector
containing the column names of the context-level variables in
|
pca.L2.x |
PCA context-level covariates. A character vector containing
the column names of the context-level variables in |
gb.L2.x |
GB context-level covariates. A character vector containing the
column names of the context-level variables in |
svm.L2.x |
SVM context-level covariates. A character vector containing
the column names of the context-level variables in |
gb.L2.unit |
GB L2.unit. A logical argument indicating whether
|
gb.L2.reg |
GB L2.reg. A logical argument indicating whether
|
svm.L2.unit |
SVM L2.unit. A logical argument indicating whether
|
svm.L2.reg |
SVM L2.reg. A logical argument indicating whether
|
deep.splines |
Deep MRP splines. A logical argument indicating whether
splines should be used in the deep MRP classifier. Default is |
lasso.lambda |
Lasso penalty parameter. A numeric |
lasso.n.iter |
Lasso number of lambda values. An integer-valued scalar
specifying the number of lambda values to search over. Default is
|
gb.interaction.depth |
GB interaction depth. An integer-valued vector
whose values specify the interaction depth of GB. The interaction depth
defines the maximum depth of each tree grown (i.e., the maximum level of
variable interactions). Default is |
gb.shrinkage |
GB learning rate. A numeric vector whose values specify
the learning rate or step-size reduction of GB. Values between |
gb.n.trees.init |
GB initial total number of trees. An integer-valued
scalar specifying the initial number of total trees to fit by GB. Default
is |
gb.n.trees.increase |
GB increase in total number of trees. An
integer-valued scalar specifying by how many trees the total number of
trees to fit should be increased (until |
gb.n.trees.max |
GB maximum number of trees. An integer-valued scalar
specifying the maximum number of trees to fit by GB. Default is |
gb.n.minobsinnode |
GB minimum number of observations in the terminal
nodes. An integer-valued scalar specifying the minimum number of
observations that each terminal node of the trees must contain. Default is
|
svm.kernel |
SVM kernel. A character-valued scalar specifying the kernel
to be used by SVM. The possible values are |
svm.gamma |
SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log-scale. |
svm.cost |
SVM cost parameter. A numeric vector whose values specify the cost of constraints violation in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log-scale. |
ebma.tol |
EBMA tolerance. A numeric vector containing the
tolerance values for improvements in the log-likelihood before the EM
algorithm stops optimization. Values should range at least from |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
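The `loss.unit` and `loss.fun` arguments together determine how prediction error is scored. The sketch below shows one plausible reading in base R: squared or absolute errors are averaged either over all respondents or first within each geographic unit and then across units. The helper `loss_score`, its argument names, and the toy vectors are hypothetical and not part of the package.

```r
# Hypothetical helper, not the package's internal loss function
loss_score <- function(y, pred, unit = NULL, fun = c("MSE", "MAE")) {
  fun <- match.arg(fun)
  err <- if (fun == "MSE") (y - pred)^2 else abs(y - pred)
  if (is.null(unit)) {
    return(mean(err))            # loss over individual respondents
  }
  mean(tapply(err, unit, mean))  # average within units, then across units
}

# Toy outcome, predictions, and geographic unit labels
y     <- c(1, 0, 1, 1, 0, 0)
pred  <- c(0.8, 0.3, 0.6, 0.9, 0.2, 0.5)
state <- c("A", "A", "A", "A", "B", "B")

mse_individual <- loss_score(y, pred, fun = "MSE")
mae_by_unit    <- loss_score(y, pred, unit = state, fun = "MAE")
```

Averaging within units first gives small units the same weight as large ones, which matters when the target of inference is the unit-level estimate.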
run_deep_bs
is a wrapper function that applies the best subset
classifier to a list of models provided by the user, evaluates the models'
prediction performance, and chooses the best-performing model. It differs
from run_best_subset
in that it includes L1.x interactions.
run_deep_bs( y, L1.x, L2.x, L2.unit, L2.reg, loss.unit, loss.fun, deep.splines, data, k.folds, verbose, cores )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
deep.splines |
Deep MRP splines. A logical argument indicating whether
splines should be used in the deep MRP classifier. Default is |
data |
Data for cross-validation. A |
k.folds |
Number of cross-validation folds. An integer-valued scalar
indicating the number of folds to be used in cross-validation. Default is
|
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
A model formula of the winning best subset classifier model.
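To make "L1.x interactions" concrete, the sketch below builds all pairwise interactions among a set of individual-level variables and places them in a model formula. The column names (`YES`, `L1x1`, `L1x2`, `L1x3`, `L2.x1`) are placeholders, and fixed-effects interaction terms are used only for simplicity; the package may specify these interactions differently, for example as random effects.

```r
# Hypothetical individual-level variable names
L1.x <- c("L1x1", "L1x2", "L1x3")

# All pairwise interactions, written in formula notation
pairs <- combn(L1.x, 2, FUN = paste, collapse = ":")

# Formula with main effects, pairwise interactions, and one context-level
# covariate (column names are placeholders)
form <- reformulate(c(L1.x, pairs, "L2.x1"), response = "YES")
form
```

With three individual-level variables this adds three interaction terms; the number grows quadratically with the number of variables.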
run_deep_pca
is a wrapper function that applies the PCA classifier to
data provided by the user, evaluates prediction performance, and chooses the
best-performing model. It differs from run_best_subset
in that it
includes L1.x interactions.
run_deep_pca( y, L1.x, L2.x, L2.unit, L2.reg, loss.unit, loss.fun, deep.splines, data, cores, verbose )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
deep.splines |
Deep MRP splines. A logical argument indicating whether
splines should be used in the deep MRP classifier. Default is |
data |
Data for cross-validation. A |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
A model formula of the winning best subset classifier model.
run_gb
is a wrapper function that applies the gradient boosting
classifier to data provided by the user, evaluates prediction performance,
and chooses the best-performing model.
run_gb( y, L1.x, L2.x, L2.eval.unit, L2.unit, L2.reg, loss.unit, loss.fun, interaction.depth, shrinkage, n.trees.init, n.trees.increase, n.trees.max, cores = cores, n.minobsinnode, data, verbose )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.eval.unit |
Geographic unit for the loss function. A character scalar
containing the column name of the geographic unit in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
interaction.depth |
GB interaction depth. An integer-valued vector
whose values specify the interaction depth of GB. The interaction depth
defines the maximum depth of each tree grown (i.e., the maximum level of
variable interactions). Default is |
shrinkage |
GB learning rate. A numeric vector whose values specify the
learning rate or step-size reduction of GB. Values between |
n.trees.init |
GB initial total number of trees. An integer-valued
scalar specifying the initial number of total trees to fit by GB. Default
is |
n.trees.increase |
GB increase in total number of trees. An
integer-valued scalar specifying by how many trees the total number of
trees to fit should be increased (until |
n.trees.max |
GB maximum number of trees. An integer-valued scalar
specifying the maximum number of trees to fit by GB or an integer-valued
vector of length |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
n.minobsinnode |
GB minimum number of observations in the terminal
nodes. An integer-valued scalar specifying the minimum number of
observations that each terminal node of the trees must contain. Default is
|
data |
Data for cross-validation. A |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
The tuned gradient boosting parameters. A list with three elements:
interaction_depth
contains the interaction depth parameter,
shrinkage
contains the learning rate, n_trees
the number of
trees to be grown.
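The three tree-count arguments (`n.trees.init`, `n.trees.increase`, `n.trees.max`) can be read as describing a sequence of candidate totals. A minimal sketch of that reading follows; the numeric values are illustrative assumptions, not the package defaults, and the variable names are hypothetical.

```r
# Illustrative values only (assumed, not taken from the package)
n_trees_init     <- 50
n_trees_increase <- 50
n_trees_max      <- 300

# Candidate totals: start at the initial value and step up to the maximum
n_trees_grid <- seq(from = n_trees_init, to = n_trees_max, by = n_trees_increase)
n_trees_grid
```

Each candidate total would then be evaluated by cross-validation alongside the interaction depth and learning rate.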
run_gb_mc
is called from within run_gb
. It tunes using
multiple cores.
run_gb_mc( y, L1.x, L2.eval.unit, L2.unit, L2.reg, form, gb.grid, n.minobsinnode, loss.unit, loss.fun, data, cores )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.eval.unit |
Geographic unit for the loss function. A character scalar
containing the column name of the geographic unit in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
form |
The model formula. A formula object. |
gb.grid |
The hyper-parameter search grid. A matrix of all hyper-parameter combinations. |
n.minobsinnode |
GB minimum number of observations in the terminal
nodes. An integer-valued scalar specifying the minimum number of
observations that each terminal node of the trees must contain. Default is
|
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
data |
Data for cross-validation. A |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
The tuning parameter combinations and their associated loss function scores. A list.
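A grid of "all hyper-parameter combinations" of the kind `gb.grid` describes can be produced with `expand.grid` from base R. The sketch below is only an illustration: the column names and values are assumptions, and `expand.grid` returns a data frame rather than a matrix.

```r
# Hypothetical search grid for gradient boosting
gb_grid <- expand.grid(
  interaction_depth = c(1, 2, 3),
  shrinkage         = c(0.04, 0.01),
  n_trees           = seq(50, 150, by = 50)
)

# One row per combination: 3 depths x 2 learning rates x 3 tree counts
nrow(gb_grid)

# Each row is one candidate to be scored by cross-validation
gb_grid[1, ]
```

Storing the loss next to each row of such a grid yields the "combinations and associated scores" structure described above.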
run_lasso
is a wrapper function that applies the lasso classifier to
data provided by the user, evaluates prediction performance, and chooses the
best-performing model.
run_lasso( y, L1.x, L2.x, L2.unit, L2.reg, n.iter, loss.unit, loss.fun, lambda, data, verbose, cores )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
n.iter |
Lasso number of lambda values. An integer-valued scalar
specifying the number of lambda values to search over. Default is
|
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
lambda |
Lasso penalty parameter. A numeric |
data |
Data for cross-validation. A |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
The tuned lambda value. A numeric scalar.
run_lasso_mc_lambda
is called from within run_lasso
. It
tunes using multiple cores.
run_lasso_mc_lambda( y, L1.x, L2.x, L2.unit, L2.reg, loss.unit, loss.fun, data, cores, L2.fe.form, L1.re, lambda )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
data |
Data for cross-validation. A |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
L2.fe.form |
The fixed effects part of the Lasso classifier formula. The
formula is inherited from |
L1.re |
A list of random effects for the Lasso classifier formula. The
formula is inherited from |
lambda |
Lasso penalty parameter. A numeric |
The cross-validation errors for all models. A list.
run_pca
is a wrapper function that applies the PCA classifier to data
provided by the user, evaluates prediction performance, and chooses the
best-performing model.
run_pca( y, L1.x, L2.x, L2.unit, L2.reg, loss.unit, loss.fun, data, cores, verbose )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
data |
Data for cross-validation. A |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
A model formula of the winning best subset classifier model.
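The idea behind a PCA classifier can be sketched in base R: compute principal components of the context-level variables with `prcomp`, then form nested candidate models that add one component at a time. The toy data, the outcome name `y`, and the nesting scheme are assumptions for illustration and may not match how the package builds its candidates.

```r
set.seed(42)

# Toy context-level covariates for 10 geographic units (hypothetical)
ctx <- data.frame(
  L2.x1 = rnorm(10),
  L2.x2 = rnorm(10),
  L2.x3 = rnorm(10)
)

# Principal components of the centered and scaled covariates
pca <- prcomp(ctx, center = TRUE, scale. = TRUE)
pcs <- as.data.frame(pca$x)  # columns PC1, PC2, PC3

# Nested candidate formulas: y ~ PC1, y ~ PC1 + PC2, ...
forms <- lapply(seq_along(pcs), function(i) {
  reformulate(names(pcs)[seq_len(i)], response = "y")
})
forms
```

Because principal components are uncorrelated, adding them sequentially gives a natural ordering of candidate models from simple to complex.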
run_svm
is a wrapper function that applies the support vector machine
classifier to data provided by the user, evaluates prediction performance,
and chooses the best-performing model.
run_svm( y, L1.x, L2.x, L2.eval.unit, L2.unit, L2.reg, kernel = "radial", loss.fun, loss.unit, gamma, cost, data, verbose, cores )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.eval.unit |
Geographic unit for the loss function. A character scalar
containing the column name of the geographic unit in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
kernel |
SVM kernel. A character-valued scalar specifying the kernel to
be used by SVM. The possible values are |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
gamma |
SVM kernel parameter. A numeric vector whose values specify the gamma parameter in the SVM kernel. This parameter is needed for all kernel types except linear. Default is a sequence with minimum = 1e-5, maximum = 1e-1, and length = 20 that is equally spaced on the log-scale. |
cost |
SVM cost parameter. A numeric vector whose values specify the cost of constraints violation in SVM. Default is a sequence with minimum = 0.5, maximum = 10, and length = 5 that is equally spaced on the log-scale. |
data |
Data for cross-validation. A |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
The support vector machine tuned parameters. A list.
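The default `gamma` and `cost` grids are described as sequences that are equally spaced on the log scale. One way to build such sequences in base R is sketched below; the helper name `log_spaced` is hypothetical, and the package may construct its defaults differently.

```r
# Hypothetical helper: n values from lo to hi, equally spaced on the log scale
log_spaced <- function(lo, hi, n) {
  exp(seq(log(lo), log(hi), length.out = n))
}

# Grids matching the documented minimum, maximum, and length
svm_gamma <- log_spaced(1e-5, 1e-1, 20)
svm_cost  <- log_spaced(0.5, 10, 5)

# Consecutive values differ by a constant ratio rather than a constant step
svm_cost[-1] / svm_cost[-length(svm_cost)]
```

Log spacing puts as many candidates between 1e-5 and 1e-4 as between 1e-2 and 1e-1, which suits parameters whose effect scales multiplicatively.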
run_svm_mc
is called from within run_svm
. It tunes using
multiple cores.
run_svm_mc( y, L1.x, L2.x, L2.eval.unit, L2.unit, L2.reg, form, loss.unit, loss.fun, data, cores, svm.grid, verbose )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
L1.x |
Individual-level covariates. A character vector containing the
column names of the individual-level variables in |
L2.x |
Context-level covariates. A character vector containing the
column names of the context-level variables in |
L2.eval.unit |
Geographic unit for the loss function. A character scalar
containing the column name of the geographic unit in |
L2.unit |
Geographic unit. A character scalar containing the column
name of the geographic unit in |
L2.reg |
Geographic region. A character scalar containing the column
name of the geographic region in |
form |
The model formula. A formula object. |
loss.unit |
Loss function unit. A character-valued scalar indicating
whether performance loss should be evaluated at the level of individual
respondents ( |
loss.fun |
Loss function. A character-valued scalar indicating whether
prediction loss should be measured by the mean squared error ( |
data |
Data for cross-validation. A |
cores |
The number of cores to be used. An integer indicating the number of processor cores used for parallel computing. Default is 1. |
svm.grid |
The hyper-parameter search grid. A matrix of all hyper-parameter combinations. |
verbose |
Verbose output. A logical argument indicating whether or not
verbose output should be printed. Default is |
The cross-validation errors for all models. A list.
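Tuning over multiple cores amounts to dividing the rows of the search grid among workers. The sketch below shows one simple way to do that in base R; the grid column names, the number of workers, and the round-robin assignment are assumptions and do not describe the package's actual scheduling.

```r
# Hypothetical SVM search grid: 20 gamma values x 5 cost values
svm_grid <- expand.grid(
  gamma = exp(seq(log(1e-5), log(1e-1), length.out = 20)),
  cost  = exp(seq(log(0.5), log(10), length.out = 5))
)

cores <- 4  # assumed number of workers

# Assign grid rows to workers in round-robin order
chunks <- split(
  seq_len(nrow(svm_grid)),
  rep_len(seq_len(cores), nrow(svm_grid))
)

# Number of parameter combinations each worker would evaluate
lengths(chunks)
```

Each worker would fit and score its share of the combinations, and the per-combination errors would be gathered afterwards.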
summary.autoMrP()
...
## S3 method for class 'autoMrP' summary( object, ci.lvl = 0.95, digits = 4, format = "simple", classifiers = NULL, n = 10, ... )
object |
An |
ci.lvl |
The level of the confidence intervals. A proportion. Default is
|
digits |
The number of digits to be displayed. An integer scalar.
Default is |
format |
The table format. A character string passed to
|
classifiers |
Summarize a single classifier. A character string. Must be
one of |
n |
Number of rows to be printed. An integer scalar. Default is
|
... |
Additional arguments affecting the summary produced. |
No return value; prints a summary of the context-level preference estimates to the console.
The Cooperative Congressional Election Studies (CCES) item (cc418_1) asked: "Would you approve of the use of U.S. military troops in order to ensure the supply of oil?" The original 2008 CCES item contains 36,832 respondents. This sample mimics a typical national survey. It contains at least 5 respondents from each state but is otherwise a random sample.
survey_item
A data frame with 1500 rows and 13 variables:
1 if individual supports use of troops; 0 otherwise
Age group (four categories: 1 = 18-29; 2 = 30-44; 3 = 45-64; 4 = 65+)
Education level (four categories: 1 = < high school; 2 = high school graduate; 3 = some college; 4 = college graduate)
Gender-race combination (six categories: 1 = white male; 2 = black male; 3 = hispanic male; 4 = white female; 5 = black female; 6 = hispanic female)
U.S. state
U.S. state id
U.S. region (four categories: 1 = Northeast; 2 = Midwest; 3 = South; 4 = West)
Normalized state-level share of votes for the Republican candidate in the previous presidential election
Normalized state-level percentage of Evangelical Protestant or Mormon respondents
Normalized state-level percentage of the population living in urban areas
Normalized state-level unemployment rate
Normalized state-level share of Hispanics
Normalized state-level share of Whites
The data set (excluding L2.x3, L2.x4, L2.x5, L2.x6) is taken from the article: Buttice, Matthew K, and Benjamin Highton. 2013. "How does multilevel regression and poststratification perform with conventional national surveys?" Political Analysis 21(4): 449-467. It is a random sample with at least 5 respondents per state. L2.x3, L2.x4, L2.x5 and L2.x6 are available at https://www.census.gov.
svm_classifier
applies support vector machine classification to a
data set.
svm_classifier( y, form, data, kernel, type, probability, svm.gamma, svm.cost, verbose = c(TRUE, FALSE) )
y |
Outcome variable. A character scalar containing the column name of
the outcome variable in |
form |
Model formula. A two-sided linear formula describing the model to be fit, with the outcome on the LHS and the covariates separated by + operators on the RHS. |
data |
Data. A data.frame containing the cross-validation data used to train and evaluate the model. |
kernel |
Kernel for SVM. A character string specifying the kernel to be used for SVM. The possible types are linear, polynomial, radial, and sigmoid. Default is radial. |
type |
svm can be used as a classification machine, as a regression machine, or for novelty detection. Depending on whether y is a factor or not, the default setting for type is C-classification or eps-regression, respectively, but may be overwritten by setting an explicit value. Valid options are:
|
probability |
Probability predictions. A logical argument indicating whether the model should allow for probability predictions |
svm.gamma |
Gamma parameter for SVM. This parameter is needed for all kernels except linear. |
svm.cost |
Cost parameter for SVM. This parameter specifies the cost of constraints violation. |
verbose |
Verbose output. A logical vector indicating whether or not verbose output should be printed. |
The support vector machine model. An svm
object.
The census file is generated from the full 2008 National Annenberg Election Studies item CBb01 by disaggregating the 64 ideal type combinations of the individual-level variables L1x1, L1x2 and L1x3. A row is an ideal type in a given state.
data(taxes_census)
A data frame with 2934 rows and 13 variables:
U.S. state
U.S. state id
U.S. region (four categories: 1 = Northeast; 2 = Midwest; 3 = South; 4 = West)
Age group (four categories)
Education level (four categories)
Gender-race combination (six categories)
State-level frequency of ideal type
State-level proportion of respondents of that ideal type in the population
State-level share of votes for the Republican candidate in the previous presidential election
State-level percentage of Evangelical Protestant or Mormon respondents
State-level percentage of the population living in urban areas
State-level unemployment rate
State-level share of Hispanics
State-level share of Whites
The data set (excluding L2.x3, L2.x4, L2.x5, L2.x6) is taken from the article: Buttice, Matthew K, and Benjamin Highton. 2013. "How does multilevel regression and poststratification perform with conventional national surveys?" Political Analysis 21(4): 449-467. L2.x3, L2.x4, L2.x5 and L2.x6 are available at https://www.census.gov.
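The state-level proportion of each ideal type is what makes post-stratification possible: predictions for ideal types are averaged within each state, weighted by those proportions. A minimal sketch with toy numbers follows; the data frame, its column names, and the predicted values are hypothetical and are not drawn from `taxes_census`.

```r
# Toy census rows: one row per ideal type within a state (hypothetical)
census <- data.frame(
  state      = c("A", "A", "A", "B", "B"),
  proportion = c(0.5, 0.3, 0.2, 0.6, 0.4),
  pred       = c(0.2, 0.4, 0.6, 0.5, 0.25)  # predicted support per ideal type
)

# Post-stratified estimate: proportion-weighted mean of predictions by state
weighted_sum <- tapply(census$pred * census$proportion, census$state, sum)
weight_total <- tapply(census$proportion, census$state, sum)
estimates    <- weighted_sum / weight_total
estimates
```

Dividing by the sum of the proportions keeps the estimate valid even when the proportions within a state do not sum exactly to one.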
The 2008 National Annenberg Election Studies (NAES) item (CBb01) asked: "I'm going to read you some options about federal income taxes. Please tell me which one comes closest to your view on what we should be doing about federal income taxes: (1) Cut taxes; (2) Keep taxes as they are; (3) Raise taxes if necessary; (4) None of these; (998) Don't know; (999) No answer." Category (3) was turned into a 'raise taxes' response; categories (1) and (2) were combined into a 'do not raise taxes' response. The original item from the phone and online surveys contains 50,483 respondents. This sample mimics a typical national survey. It contains at least 5 respondents from each state but is otherwise a random sample.
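The recoding described above can be sketched in base R. The vector `raw` holds made-up response codes, and treating categories (4), (998) and (999) as missing is an assumption, since the description does not say how they were handled.

```r
# Hypothetical raw response codes for the NAES item
raw <- c(1, 2, 3, 4, 998, 999, 3, 1)

# 1 = raise taxes (category 3); 0 = do not raise taxes (categories 1 and 2);
# all other codes are set to missing here (an assumption)
yes <- ifelse(raw == 3, 1L,
              ifelse(raw %in% c(1, 2), 0L, NA_integer_))
yes
```

The resulting binary variable corresponds to the outcome coding "1 if individual supports raising taxes; 0 otherwise".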
data(taxes_survey)
A data frame with 1500 rows and 13 variables:
1 if individual supports raising taxes; 0 otherwise
Age group (four categories: 1 = 18-29; 2 = 30-44; 3 = 45-64; 4 = 65+)
Education level (four categories: 1 = < high school; 2 = high school graduate; 3 = some college; 4 = college graduate)
Gender-race combination (six categories: 1 = white male; 2 = black male; 3 = hispanic male; 4 = white female; 5 = black female; 6 = hispanic female)
U.S. state
U.S. state id
U.S. region (four categories: 1 = Northeast; 2 = Midwest; 3 = South; 4 = West)
State-level share of votes for the Republican candidate in the previous presidential election
State-level percentage of Evangelical Protestant or Mormon respondents
State-level percentage of the population living in urban areas
State-level unemployment rate
State-level share of Hispanics
State-level share of Whites
The data set (excluding L2.x3, L2.x4, L2.x5, L2.x6) is taken from the article: Buttice, Matthew K, and Benjamin Highton. 2013. "How does multilevel regression and poststratification perform with conventional national surveys?" Political Analysis 21(4): 449-467. It is a random sample with at least 5 respondents per state. L2.x3, L2.x4, L2.x5 and L2.x6 are available at https://www.census.gov.
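The sampling property stated above (at least 5 respondents per state) can be checked with a short helper. The sketch below is a minimal illustration under stated assumptions: the function names `respondents_per_state` and `meets_minimum` are hypothetical, the column name `state` is assumed, and the toy data frame stands in for the survey so that the example runs without the package installed.

```r
# Hypothetical helpers: count respondents per state and compare to a minimum.
# Assumption: the state identifier is stored in a column named "state".
respondents_per_state <- function(survey, state_col = "state") {
  table(survey[[state_col]])
}

meets_minimum <- function(survey, minimum = 5, state_col = "state") {
  all(respondents_per_state(survey, state_col) >= minimum)
}

# Toy stand-in for the survey: 5 respondents in one state, 6 in another
toy_survey <- data.frame(
  state = rep(c("AL", "AK"), times = c(5, 6)),
  stringsAsFactors = FALSE
)

respondents_per_state(toy_survey)
meets_minimum(toy_survey)
```

With the package installed, the same check could be run on the real data after `data(taxes_survey)`, provided the state column carries the assumed name.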