### Description

Week 5: Correlation or Causation?

Regression models are often used, especially in the social sciences, in an attempt to establish a causal relationship between variables using observational data. This is no easy task. Read Freedman (1991) and comment on your main takeaways as they relate to establishing causation using regression techniques with observational data. Initial post by Tuesday, continue the conversation with your peers by Friday of Week 5.

comment on this:

David Freedman’s *Statistical Models and Shoe Leather* paper emphasizes the limitations of using regression results to support claims about causality. My main takeaway is that regression results usually cannot establish causality on their own, because they may be misleading if any of the statistical assumptions the model requires fail to hold under real-life conditions. For a linear model, these assumptions include normally distributed errors, conditional independence of the observations, a linear relationship between the dependent and independent variables, and equal variance of the residuals, and in real-life scenarios they are difficult to verify. Because the assumptions are so hard to prove or disprove, and because predictions are often not tested against reality, Freedman argues that regression results, such as whether a coefficient is statistically significant, cannot be used alone to establish causality.

Freedman explores a few example studies to back up his claim, first citing John Snow’s work on cholera. Snow argued that the “active agent was a living organism that got into the alimentary canal with food or drink, multiplied in the body, and generated some poison that caused the body to expel water” and that the organism was then passed on to others when it got back into the water supply. Freedman describes how Snow used case studies and analysis of ecological data to investigate cholera; Snow found a natural experiment and collected the data he needed to perform his analysis. The paper presents Snow’s work as a success story of scientific reasoning based on nonexperimental data. The key is that Snow brought together many different sources of data and evidence to support a theory that did not rest on model assumptions.

In contrast, Freedman discusses the Kanarek et al. study, which argued that asbestos fibers in drinking water caused lung cancer. The main tool of the study was a log-linear regression, and causation was inferred when a coefficient was statistically significant after controlling for covariates. Freedman uses this study as an example of how modeling results can be misleading. Kanarek et al. did not discuss their stochastic assumptions, and their data did not include cigarette smoking. Freedman points out that imperfect control for smoking could easily account for the observed effect, and that the study ran upwards of 200 equations with only one of the p values falling below 0.001. This example shows how a statistically significant result may not hold up in reality, which reinforces that correlation is not the same as causation.
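To make the smoking point concrete, here is a minimal simulation sketch. It is not taken from Freedman's paper or from the Kanarek et al. data: the variable names, the effect sizes (0.8 and 1.0), and the sample size are all assumptions chosen for illustration. In this made-up world, smoking drives both the exposure and the risk, and the exposure itself has no effect at all.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """Part of y left over after removing its linear dependence on x."""
    n = len(x)
    slope = ols_slope(x, y)
    intercept = sum(y) / n - slope * sum(x) / n
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def simulate(n=5000, seed=0):
    """Hypothetical data: smoking drives both exposure and risk."""
    rng = random.Random(seed)
    smoking = [rng.gauss(0, 1) for _ in range(n)]
    # Exposure is correlated with smoking but has NO effect on risk.
    exposure = [0.8 * s + rng.gauss(0, 1) for s in smoking]
    risk = [1.0 * s + rng.gauss(0, 1) for s in smoking]

    # Naive regression of risk on exposure, with smoking left out.
    naive = ols_slope(exposure, risk)
    # Adjusting for smoking: regress what is left of risk on what is
    # left of exposure once smoking has been removed from both.
    adjusted = ols_slope(residuals(smoking, exposure), residuals(smoking, risk))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate()
    print(f"naive slope:    {naive:.3f}")
    print(f"adjusted slope: {adjusted:.3f}")
```

Under these assumptions the naive slope is about 0.49 in expectation (0.8 / 1.64), even though the exposure does nothing, while the slope adjusted for smoking is close to zero. A study that lacks smoking data can only ever compute the naive number.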

Overall, I found Freedman’s paper quite interesting. The main point of emphasis is that a statistical model cannot substitute for testing predictions against reality in a variety of settings. Moreover, it is difficult to confirm that a model’s conditions and assumptions hold in a real-life setting, so a statistically significant result should not be taken to imply causality. Correlation is not the same as causation, and it is important to keep this in mind when interpreting the results of a model.
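One reason a single significant result is weak evidence is the number of tests behind it, as with the roughly 200 equations mentioned above. The sketch below is an illustration only. It assumes the tests are independent and that every null hypothesis is true, so each p value is uniform on [0, 1]; neither assumption is claimed to describe the actual Kanarek et al. analysis, and the function names are hypothetical.

```python
import random


def prob_at_least_one(n_tests, threshold):
    """Chance that at least one of n independent null tests has p < threshold."""
    return 1 - (1 - threshold) ** n_tests


def simulate_at_least_one(n_tests=200, threshold=0.001, n_runs=5000, seed=1):
    """Monte Carlo check: draw uniform p values and count runs with a 'hit'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        # Under a true null hypothesis, a p value behaves like a uniform draw.
        if any(rng.random() < threshold for _ in range(n_tests)):
            hits += 1
    return hits / n_runs


if __name__ == "__main__":
    print(f"exact:     {prob_at_least_one(200, 0.001):.3f}")
    print(f"simulated: {simulate_at_least_one():.3f}")
```

With 200 tests and a 0.001 threshold, the exact probability under these assumptions is 1 - 0.999^200, or about 0.18. So even if nothing real were going on, one p value that small would turn up by chance in roughly one analysis out of five.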

Comment on this one:

Based on the reading by David Freedman, one of my main takeaways is that, in real-life scenarios, assumptions have to be made in order to build a model, such as normal errors or a particular relationship between the dependent and independent variables. However, these assumptions can be difficult to prove because they are not often tested against reality. This made me think: in statistics, we learn how to test certain things, for example using a chi-squared test to check whether two variables are independent. Sometimes the result is not statistically significant, so we fail to reject the null hypothesis of independence, even though a real association might exist. So the testing is not always as ideal as real life might require. In the Statistical Models and Shoe Leather paper, however, Freedman goes further and argues that even a statistically significant coefficient cannot by itself establish causality.

Freedman also stated that some assumptions are hard to prove or disprove because very little effort is spent on testing them. He argued that the technical complexity of a calculation cannot by itself justify belief in its results. We need to check the model against observable phenomena to have a relevant argument. However, as I mentioned above, we might not always know what a successful model looks like. Can we differentiate between successful and unsuccessful uses of a model? Maybe yes, if the result is statistically significant, but the answer can be no as well.


It is difficult to reproduce a model’s assumptions in real-life settings, so statistically significant results should not always be taken to imply causality. To emphasize, correlation is any statistical association, whether causal or not, between two random variables. Causation means that one event is the result of the occurrence of another; that is, there is a causal relationship between the two events. Thus, it’s very important to keep in mind that, as Chris mentioned above, correlation is not the same as causation.
