15.879 Bringing Data into Dynamic Models, Fall 2022
Assignment 1: Categorizing model constructs and adding noise
Due: 9/16/2022
This week you need to familiarize yourself with the model you have chosen and make sure you understand how it works, both mechanically and conceptually. It is important to read the papers corresponding to each model, though some models are described in books, in which case you may only have time to skim the basic ideas. Beyond general familiarity with your model, you need to identify and categorize model constructs in ways that facilitate your estimation work in the coming weeks. Specifically, these models (or at least the versions you are given) were not developed for the purpose of estimation using quantitative data. Getting them ready for that task takes some work. Please follow the steps below to prepare your model.
Background on Data and Models: Educational and insight-oriented SD models are typically built to be "endogenous", that is, the majority of model variables are formulated as functions of other model variables. Once you specify the model "parameters" (i.e. numerical values which remain constant during a simulation), you can simulate such fully endogenous models for any period of time and generate behaviors for whatever constructs you care about. Bringing historical data into the picture complicates things. First, you need to decide which model variables correspond to data you have access to. Sometimes you will need to build such variables, because your existing variables may not map directly onto any data series you have and relate to the observables only through some functional transformation (e.g. a COVID-19 model may contain actual cases, while the data contain only reported cases). Second, full endogeneity is often unrealistic when you want to fit a model to historical data. Many aspects of the real world are not part of your model boundary: you may not care to endogenize demand in an inventory management model, or the impact of weather on transmission rates in an epidemic model. Identifying data series that should be used to "drive" model outcomes exogenously (at least over the historical time horizon) is another important task in connecting a model with data.
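As a rough illustration of a "driving" data series, the sketch below feeds an exogenous demand series into a one-stock inventory model using linear interpolation between observations. The function names, the data values, and the model structure are all hypothetical and only meant to show the idea; your own model and data will differ.

```python
from bisect import bisect_right

def lookup(times, values, t):
    """Piecewise-linear interpolation of a data series, held flat outside its range."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Hypothetical driving data: demand (units/day) observed once a week.
demand_times = [0.0, 7.0, 14.0, 21.0]
demand_values = [100.0, 120.0, 90.0, 110.0]

def simulate_inventory(days, dt=1.0, initial_inventory=500.0, production=105.0):
    """Euler integration of one stock whose outflow is driven by data."""
    inventory = initial_inventory
    path = [inventory]
    for k in range(int(days / dt)):
        t = k * dt
        demand = lookup(demand_times, demand_values, t)  # exogenous input
        inventory += dt * (production - demand)
        path.append(inventory)
    return path

path = simulate_inventory(days=21)
```

Here demand is read from data rather than computed from other model variables, so it sits outside the model boundary over the historical horizon.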
0. Before any conceptual work, create a new version of your model where everything is shown in a single view. Basically, you need to copy and paste the structures from the different views into one. This will be helpful later, when we use existing packages to translate Vensim models into Python code, because the packages we use still have some bugs in translating multi-view models.
1. For concreteness, choose an appropriate empirical context where your model may apply. For example, the epidemic model may be applied to different states in the USA, or to different countries across the world. Make sure you pick a context where the model may relate to multiple units of the same nature. For example, the World3 model would be applicable at a country level, and that would be the right choice (rather than keeping it at a global level, where there is only a single instance of the unit we are modeling).
2. Identify the variables in your model for which you may have data, if you were to go and collect relevant data for the empirical context you identified. You should not spend time looking for data; just use your intuition and judgment to decide which variables in the model, or important related constructs (that may not currently be part of the model), may have observed values should one seek to collect data on them. Prioritize them into those for which you expect data to exist, and those for which you may find data in some settings but are less confident about availability.
3. Classify the data variables you have identified: would each be used as an input to the simulation ("driving data variables") or be simulated and matched ("target data variables")? Build a table with both types of data variables, their designation (target vs. driving), and their availability guesstimate (from step 2).
4. Decide whether your model already includes variables corresponding to the data series identified above, or whether you need to add relevant variables to your model. Build a new, expanded version of the model which includes those missing variables and how they relate to the rest of the model. For each "target data variable" you need to define a new variable, similar in name to the corresponding model construct but with "data" in the variable name, e.g. "Reported infections data" could correspond to the variable "Reported Infections" which you have already built into your model. "Driving data variables" do not need such treatment. For the purpose of our exercise we do not want more than a handful of new variables added to your model, so focus on the most critical items if you have multiple candidate variables for expansion.
5. For each "target data variable", consider the distribution of the variable (or of its error relative to the expected value that may be simulated). Conceptually, what distribution (e.g. Normal, Poisson, etc.) would be a good fit for that variable? For example, Poisson is a conceptually good choice for the number of people exiting a department store in each unit of time. Include these candidate "measurement" distributions in your table.
Background on Parameters: Model parameters are those quantities assumed to take a constant numerical value during each simulation. Note that you may be uncertain about model parameters and therefore associate them with some underlying probability distribution. That is fine, and one way to reflect that uncertainty is to simulate the model multiple times with different values for the parameter; yet a parameter will not change over the course of a single simulation. When it comes to estimating models based on historical data, model parameters can be categorized into two general groups: "estimated parameters" and "assumed parameters". Finding reasonable values (or distributions) for the former set, such that simulated model variables match their corresponding target data variables, is the primary goal of the methods covered in this class. The assumed parameters are all the parameters we do not empirically estimate. They may include general constants (e.g. 24 hours/day), items for which you have reliable empirical values (e.g. the population of a country), or those you want to assume as part of your model structure (e.g. that the minimum time to finish a task is 2 hours).
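The point that a parameter may be uncertain across simulations yet constant within each one can be sketched as follows. This is a minimal illustration, assuming a hypothetical one-stock task backlog model and an assumed uniform distribution for the completion time; none of these choices come from the assignment models.

```python
import random

def simulate_backlog(completion_time, days=30.0, dt=0.25, arrivals=10.0):
    """Euler integration of a task backlog; completion_time is fixed for the whole run."""
    backlog = 0.0
    for _ in range(int(days / dt)):
        backlog += dt * (arrivals - backlog / completion_time)
    return backlog

rng = random.Random(0)  # seeded for reproducibility

# Assumed uncertainty: completion time uniformly distributed between 2 and 6 days.
draws = [rng.uniform(2.0, 6.0) for _ in range(200)]

# One simulation per draw; each run sees a single constant parameter value.
final_backlogs = [simulate_backlog(tau) for tau in draws]
```

The spread of `final_backlogs` reflects parameter uncertainty only; within any single run the model remains deterministic.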
6. Identify all the parameters in your model (the Vensim model documentation tool can simplify that). Partition these parameters into estimated and assumed groups and create a table listing all parameters and their categories.
7. Consider the meaning of each parameter and its role in the model. Based on your qualitative understanding and simple simulations of the model, set minimum and maximum levels for each parameter. Be generous in setting these bounds: they should cover not only the likely range but also the conceivable range for each parameter. However, make sure they exclude infeasible values: probabilities are between 0 and 1, many physical constants can't go below zero, etc. Add these ranges as columns in your parameters table.
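One possible way to keep the parameter table from steps 6 and 7 in machine-readable form is sketched below. The `Parameter` class, its field names, and the example entries are hypothetical; the only purpose is to show a category column, bound columns, and a simple consistency check.

```python
from dataclasses import dataclass

@dataclass
class Parameter:
    name: str
    category: str   # "estimated" or "assumed"
    value: float    # current value used in the model
    lower: float    # minimum conceivable value
    upper: float    # maximum conceivable value

    def problems(self):
        """Return a list of issues found in this table row (empty if none)."""
        issues = []
        if self.category not in ("estimated", "assumed"):
            issues.append("unknown category")
        if not self.lower <= self.value <= self.upper:
            issues.append("value outside bounds")
        return issues

# Hypothetical rows for an epidemic model.
parameters = [
    Parameter("infection probability per contact", "estimated", 0.05, 0.0, 1.0),
    Parameter("average illness duration (days)", "estimated", 10.0, 1.0, 60.0),
    Parameter("total population (people)", "assumed", 1.0e6, 1.0e6, 1.0e6),
]

report = {p.name: p.problems() for p in parameters}
```

An assumed parameter can simply carry identical lower and upper bounds, as in the population row.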
Background on Noise: You can think of "noise" as random factors that in reality may change the outcomes of a system through mechanisms not explicitly captured in your deterministic SD model. It is helpful to distinguish between two types of noise. "Process noise" includes random factors that change the dynamics of the system substantively. For example, in a COVID-19 model, factors such as holidays or day-to-day weather changes may affect infection rates beyond the mechanisms explicitly modeled. In a production-inventory model, the actual production rate may vary, due to randomness in production processes, around the expected values one may model deterministically. Such process noise streams may thus take the system's dynamics to regions of state space which the model would not normally visit in a deterministic setting. The second type is "measurement noise", which captures the stochastic nature of measured/observed variables. For example, the observed number of deaths in an epidemic model may differ from the true values due to randomness in the timing of death reporting, among other reasons. Similarly, in modeling organizational dynamics, results from survey instruments may inform some underlying model variable, but with noisy measurements. The primary difference between process and measurement noise is that the latter does not change the dynamics in the rest of the system, i.e. the existence and value of a measurement noise stream does not change the underlying stock variables in your model.
Both types of noise play a critical role in connecting models and data. Estimation methods typically start with the assumption that the model is "correct", and then assess the likelihood of observing the data given what the model predicts. But if a deterministic model is assumed to be "correct", then it makes a unique point prediction for each observed measure, with no room for observing a different outcome. Few models can claim such an impressive fit to data. Allowing for process and measurement noise enables one to reconcile an SD model's predictions with data and to offer probabilistic measures of fit quality. Much of the work in this class revolves around making those connections explicit and building estimation methods that rely on the existence of (at least) measurement noise and in some cases process noise.
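To make the idea of "likelihood of observing the data given what the model predicts" concrete, here is a minimal sketch that assumes independent Normal measurement noise with a known standard deviation `sigma`. The function name and all numbers are hypothetical, and Normal noise is only one of several possible choices.

```python
import math

def normal_log_likelihood(observed, predicted, sigma):
    """Sum of independent Normal log densities of the data around the predictions."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    constant = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(constant - (y - mu) ** 2 / (2.0 * sigma ** 2)
               for y, mu in zip(observed, predicted))

observed = [10.0, 12.0, 15.0, 19.0]        # hypothetical data series
prediction_a = [10.5, 12.5, 14.5, 18.0]    # model run with one parameter set
prediction_b = [8.0, 9.0, 10.0, 11.0]      # model run with another parameter set

ll_a = normal_log_likelihood(observed, prediction_a, sigma=1.0)
ll_b = normal_log_likelihood(observed, prediction_b, sigma=1.0)
```

With measurement noise, both predictions receive a finite log-likelihood and can be ranked (here `ll_a` exceeds `ll_b`). Without any noise term, every prediction that misses the data by any amount would be assigned zero likelihood.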
8. What process noise streams may be included in the model? To answer, consider which flow variables might behave significantly differently from the equations used in the current deterministic model, due to factors outside the model boundary. Pick one or two of those "process noise" items, and explicitly formulate a pink or white noise process that allows your model to simulate potential stochastic variations in those flows around the deterministic formulations. Implement those changes in your model so that you can use a switch variable (a variable taking the value 0 or 1) to turn the process noise streams on/off.
9. For each target data variable that you identified under task 3, implement the stochastic measurement process corresponding to the distribution you identified under task 5. Specifically, your model may be predicting deterministic expected values for those target data variables, and you will need to specify the relevant distribution, which uses the model prediction but may also require additional parameters, e.g. for specifying the standard deviation of the gap between true values and measured ones. Common distributions include Binomial, Normal, Poisson, Negative Binomial, and Multinomial, to name a few (in this list Poisson has a single parameter, determined by the expected value, as does Binomial once the number of trials is fixed, while the others need additional parameters). For the time being you can assume the measurement noise is not auto-correlated (i.e. use white noise). Make sure your implementation allows for switching the measurement noise on/off.
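The switchable pink noise asked for in step 8 could look roughly like the sketch below. It uses a common first-order autocorrelated formulation, in which white noise is scaled by the time step and smoothed over a correlation time, and applies it multiplicatively to a production flow. The model, the parameter values, and the function name are illustrative assumptions; in your Vensim model the same structure would be built from a stock, a random number function, and a switch variable.

```python
import math
import random

def simulate_inventory(noise_switch, days=200.0, dt=0.125,
                       correlation_time=4.0, noise_sd=0.1, seed=1):
    """Euler simulation of one stock whose inflow carries optional pink noise.

    noise_switch is 0 (deterministic) or 1 (process noise on).
    All numerical values are illustrative assumptions.
    """
    rng = random.Random(seed)
    ratio = dt / correlation_time
    # Scale the white noise so that the pink noise has a long-run standard
    # deviation of about noise_sd, whatever time step is used.
    white_sd = noise_sd * math.sqrt((2.0 - ratio) / ratio)

    inventory = 100.0   # stock, starts in equilibrium
    pink = 0.0          # first-order autocorrelated noise with mean zero
    path = [inventory]
    for _ in range(int(days / dt)):
        expected_production = 10.0
        shipments = inventory / 10.0
        # Multiplicative noise, clipped so production never goes negative.
        production = expected_production * max(0.0, 1.0 + noise_switch * pink)
        white = rng.gauss(0.0, white_sd)
        inventory += dt * (production - shipments)
        pink += dt * (white - pink) / correlation_time
        path.append(inventory)
    return path

deterministic = simulate_inventory(noise_switch=0)
noisy = simulate_inventory(noise_switch=1)
```

With the switch at 0 the inventory stays at its equilibrium value, which is the behavior of the original deterministic model; with the switch at 1 the stock wanders around that value because the noise enters a flow and therefore changes the state of the system.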
After completing items 8 and 9, your model has become a system of stochastic differential equations (rather than a deterministic one). Make sure your model simulates, and that it generates the same results as the initial model when you turn off process and measurement noise. You should now be able to generate "synthetic data" that better resembles true data generating processes, and offers a more realistic test-bed for assessing the quality of your model estimation frameworks.
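Generating synthetic data from a switchable measurement process might be sketched as follows. The `measure` function, the small Poisson sampler (Knuth's multiplication method, adequate for modest means), and the series of expected values are all hypothetical stand-ins for your own model output and distribution choices.

```python
import math
import random

def poisson_draw(rng, mean):
    """Draw one Poisson variate using Knuth's method (fine for modest means)."""
    if mean <= 0:
        return 0
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def measure(expected, noise_switch, rng, distribution="normal", sd=5.0):
    """Return a measured value for a model-predicted expected value.

    noise_switch = 0 returns the expected value unchanged; noise_switch = 1
    draws white (not auto-correlated) measurement noise.
    """
    if noise_switch == 0:
        return expected
    if distribution == "normal":
        return rng.gauss(expected, sd)        # needs the extra parameter sd
    if distribution == "poisson":
        return poisson_draw(rng, expected)    # fully set by the expected value
    raise ValueError("unsupported distribution: " + distribution)

# Hypothetical deterministic model output: expected daily reported deaths.
expected_deaths = [2.0, 3.5, 6.0, 9.0, 12.5, 15.0]

rng = random.Random(42)
clean = [measure(mu, 0, rng, "poisson") for mu in expected_deaths]
synthetic = [measure(mu, 1, rng, "poisson") for mu in expected_deaths]
```

The measured values are computed from the model output and never fed back into any stock, so switching measurement noise on or off leaves the underlying simulation untouched; `clean` reproduces the deterministic series exactly, while `synthetic` is one noisy realization of it.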
Presentation
For your class presentation, share your categorization of data variables, parameters, and noise streams. Also include simulations with and without measurement and process noise. Discuss your choice of distribution for each measured variable and bring up any points of uncertainty.