Why Do We Need to Know the Data Distribution?


Why do we need to know the data distribution before we run our analyses? Because how the data are distributed, and the outcome variable in particular, determines which statistical analyses are most appropriate.

  • When the outcome variable is continuous (interval/ratio), linear regression is a common method to use.
  • When the outcome variable is dichotomous (yes/no or 0/1), logistic regression is the likely choice, particularly if the outcome is not highly prevalent in the sample. If the outcome is highly prevalent, however, the odds ratio from logistic regression overstates the risk ratio, so log-binomial regression (or Poisson regression with robust standard errors) may be more appropriate if the final intention of the analysis is to estimate the likelihood (risk) of the outcome.
  • When the outcome variable has ordered categories (0/1/2/3), ordinal logistic regression (proportional odds regression) is potentially useful. This model also assumes proportional odds: the effect of each predictor is the same at every cut-point between adjacent outcome categories.
  • When the outcome variable has unordered categories, or ordered categories for which the proportional odds assumption does not hold, we may use multinomial logistic regression models.
  • When the outcome is a count of the number of events occurring from some large observed population, Poisson regression may be appropriate.
  • When the outcome is the time until the event occurs, and data have been collected over an extended period of time, survival analyses are used. Cox proportional hazards regression models are frequently used, but other statistical models are also available and may be more appropriate for your data.
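
The list above can be read as a lookup from outcome type to a starting-point model. The Python sketch below is one way to write that lookup down. The names `Outcome` and `suggest_model` are hypothetical, invented for this illustration, and the function leaves out the prevalence caveat for dichotomous outcomes. Treat the result as a first guess to check against your data, not as a rule.

```python
from enum import Enum


class Outcome(Enum):
    """Outcome types from the list above (names are illustrative)."""
    CONTINUOUS = "continuous"        # interval/ratio
    DICHOTOMOUS = "dichotomous"      # yes/no or 0/1
    ORDERED = "ordered"              # e.g. 0/1/2/3
    UNORDERED = "unordered"          # categories with no natural order
    COUNT = "count"                  # number of events
    TIME_TO_EVENT = "time_to_event"  # time until the event occurs


def suggest_model(outcome, proportional_odds=True):
    """Return a starting-point model family for an outcome type.

    `proportional_odds` only matters for ordered outcomes: when the
    assumption is doubtful, fall back to a multinomial model.
    """
    if outcome is Outcome.ORDERED:
        if proportional_odds:
            return "ordinal logistic regression"
        return "multinomial regression"
    defaults = {
        Outcome.CONTINUOUS: "linear regression",
        Outcome.DICHOTOMOUS: "logistic regression",
        Outcome.UNORDERED: "multinomial regression",
        Outcome.COUNT: "Poisson regression",
        Outcome.TIME_TO_EVENT: "Cox proportional hazards regression",
    }
    return defaults[outcome]


print(suggest_model(Outcome.COUNT))
print(suggest_model(Outcome.ORDERED, proportional_odds=False))
```

Keeping the ordered case separate from the dictionary makes the one conditional branch in the list explicit: the same outcome type leads to a different model depending on whether the proportional odds assumption is plausible.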

Keep in mind that the data distribution, and the model you choose to match it, determines the kinds of conclusions you can draw from your analysis, so choose carefully!

Ayla Myrick