Predictive Modelling Of Mortality Rates for Diabetes: An Analysis Of Risk Factors
Diabetes is a chronic disease affecting millions globally and a leading cause of death. Understanding mortality trends and predicting future mortality rates is crucial for public health planning and intervention strategies. Our objective is to identify critical factors that contribute to mortality from diabetes and develop a suitable model for predicting future mortality rates. The outcomes of this study could inform policies and interventions aimed at reducing the burden of diabetes-related mortality.
This project analyzes mortality rates for diabetes in the United States from 2015 to 2020 and employs various predictive methods in Statistics and Machine Learning to forecast future mortality rates.
Variables:
Resource: CDC WONDER
According to the age distribution, the older the age, the higher the mortality rate of diabetics. the greatest mortality rate was recorded in the 80-84 age group, reaching 5963 per 100,000. More surprisingly, infants younger than one year of age also have a high mortality rate, ranking seventh out of 18 age groups, at 570 per 100,000. The lowest mortality rate was in the 5-9 age group, with a rate of 13 per 100,000.
In terms of ethnic distribution, the death rate for non-Hispanics and Latinos is higher than that for Hispanics, almost twice as high. American Indians or Alaska Natives have the highest mortality rate, with a rate of approximately 1061 per 100,000. Asian and Pacific Islanders have the lowest mortality rate, with a rate of approximately 364 per 100,000. By gender distribution, males have a higher mortality rate than females.
The geographical distribution is clear. According to the regional map, the North Central and Southern regions have the highest mortality rates, even almost identical, at around 729, but when it comes to mortality rates per state, seven of the top ten states are from the Southern region, while the top six states are all from the South.
Since 2015, through 2019, the diabetes death rate has steadily increased in each year based on trends, but not by much. But the mortality rate in 2020 is clearly and significantly higher, and this reason is supposedly related to the outbreak of corona virus in March 2020.
Negative binomial regression is used to predict the count of deaths instead of Poisson regression. As checked, in this dataset, there is a huge gap between the mean and variance of the number of deaths.
Negative binomial regression assumes that the count data follows a negative binomial distribution, and the predictor variables have a linear relationship with the logarithm of the expected count.
Link Function:
\(\log{\mbox{deaths}} = \mbox{Intercept} + b_1x_1+b_2x_2+\cdot + b_kx_k\)
Models | Predictors |
---|---|
Model 1 | Sex |
Model 2 | Race |
Model 3 | Age Group |
Model 4 | Age Group, Race |
Model 5 | Race, Sex |
Model 6 | Age Group, Sex |
Model 7 | Age Group, Race, Sex |
Model 8 | Age Group, Race, Sex, Ethnicity |
Model 9 | Age Group, Race, Sex, Ethnicity, Year |
Model 10 | Age Group, Race, Sex, Ethnicity, State |
Model 11 | Age Group, Race, Sex, Ethnicity, Year, State |
d.f. | dev | disp | AIC | 2Llik | |
---|---|---|---|---|---|
Model 1 | 29655 | 38327.05 | 0.43 | 394593.2 | -394587.2 |
Model 2 | 29653 | 37035.48 | 0.54 | 385368.9 | -385358.9 |
Model 3 | 29639 | 36392.88 | 0.60 | 380666.2 | -380628.2 |
Model 4 | 29636 | 34917.13 | 0.80 | 368987.6 | -368943.6 |
Model 5 | 29652 | 37023.40 | 0.54 | 385282.5 | -385270.5 |
Model 6 | 29638 | 36332.84 | 0.60 | 380225.4 | -380185.4 |
Model 7 | 29635 | 34842.72 | 0.81 | 368383.2 | -368337.2 |
Model 8 | 29634 | 34187.59 | 0.94 | 362565.6 | -362517.6 |
Model 9 | 29633 | 34177.46 | 0.95 | 362472.7 | -362422.7 |
Model 10 | 29585 | 31961.29 | 1.87 | 338436.4 | -338290.4 |
Model 11 | 29584 | 31940.07 | 1.88 | 338160.2 | -338012.2 |
Estimate | Std. Error | z value | Pr(>|z|) | Relative Rates | |
---|---|---|---|---|---|
(Intercept) | -83.9575 | 5.0791 | -16.5301 | 0.0000 | 0.00 |
Age_Groups1-4 | -1.7116 | 0.0343 | -49.8432 | 0.0000 | 0.18 |
Age_Groups5-9 | -2.1792 | 0.0395 | -55.2057 | 0.0000 | 0.11 |
Age_Groups10-14 | -1.9433 | 0.0358 | -54.2072 | 0.0000 | 0.14 |
Age_Groups15-19 | -0.7250 | 0.0289 | -25.0662 | 0.0000 | 0.48 |
Age_Groups20-24 | -0.1570 | 0.0272 | -5.7687 | 0.0000 | 0.85 |
Age_Groups25-29 | 0.0695 | 0.0266 | 2.6083 | 0.0091 | 1.07 |
Age_Groups30-34 | 0.1861 | 0.0263 | 7.0690 | 0.0000 | 1.20 |
Age_Groups35-39 | 0.3233 | 0.0260 | 12.4187 | 0.0000 | 1.38 |
Age_Groups40-44 | 0.4668 | 0.0257 | 18.1496 | 0.0000 | 1.59 |
Age_Groups45-49 | 0.7794 | 0.0252 | 30.8859 | 0.0000 | 2.18 |
Age_Groups50-54 | 1.1258 | 0.0248 | 45.3247 | 0.0000 | 3.08 |
Age_Groups55-59 | 1.4469 | 0.0245 | 59.1772 | 0.0000 | 4.25 |
Age_Groups60-64 | 1.6606 | 0.0243 | 68.2836 | 0.0000 | 5.26 |
Age_Groups65-69 | 1.7824 | 0.0242 | 73.6662 | 0.0000 | 5.94 |
Age_Groups70-74 | 1.8682 | 0.0242 | 77.2892 | 0.0000 | 6.48 |
Age_Groups75-79 | 1.9507 | 0.0242 | 80.6815 | 0.0000 | 7.03 |
Age_Groups80-84 | 2.0577 | 0.0242 | 85.1017 | 0.0000 | 7.83 |
RaceAsian or Pacific Islander | -0.3769 | 0.0203 | -18.5608 | 0.0000 | 0.69 |
RaceBlack or African American | 1.4995 | 0.0187 | 80.3979 | 0.0000 | 4.48 |
RaceWhite | 2.8806 | 0.0178 | 161.9116 | 0.0000 | 17.82 |
SexMale | 0.3888 | 0.0087 | 44.8165 | 0.0000 | 1.48 |
EthnicityNot Hispanic or Latino | 2.2334 | 0.0119 | 187.2333 | 0.0000 | 9.33 |
Year | 0.0417 | 0.0025 | 16.5794 | 0.0000 | 1.04 |
StateAlaska | -0.9074 | 0.0507 | -17.8897 | 0.0000 | 0.40 |
StateArizona | 0.3625 | 0.0407 | 8.9172 | 0.0000 | 1.44 |
StateArkansas | -0.6982 | 0.0460 | -15.1911 | 0.0000 | 0.50 |
StateCalifornia | 1.9188 | 0.0388 | 49.4843 | 0.0000 | 6.81 |
StateColorado | -0.2995 | 0.0424 | -7.0615 | 0.0000 | 0.74 |
StateConnecticut | -0.8107 | 0.0446 | -18.1629 | 0.0000 | 0.44 |
StateDelaware | -1.7077 | 0.0508 | -33.6317 | 0.0000 | 0.18 |
StateFlorida | 1.0673 | 0.0398 | 26.7844 | 0.0000 | 2.91 |
StateGeorgia | 0.4867 | 0.0419 | 11.6120 | 0.0000 | 1.63 |
StateHawaii | 0.5450 | 0.0483 | 11.2953 | 0.0000 | 1.72 |
StateIdaho | -1.4138 | 0.0530 | -26.6877 | 0.0000 | 0.24 |
StateIllinois | 0.6257 | 0.0417 | 15.0209 | 0.0000 | 1.87 |
StateIndiana | -0.2253 | 0.0433 | -5.1987 | 0.0000 | 0.80 |
StateIowa | -1.1653 | 0.0477 | -24.4357 | 0.0000 | 0.31 |
StateKansas | -1.1252 | 0.0439 | -25.6156 | 0.0000 | 0.32 |
StateKentucky | -0.4378 | 0.0465 | -9.4204 | 0.0000 | 0.65 |
StateLouisiana | -0.0567 | 0.0435 | -1.3022 | 0.1928 | 0.94 |
StateMaine | -1.2720 | 0.0638 | -19.9253 | 0.0000 | 0.28 |
StateMaryland | 0.0025 | 0.0429 | 0.0574 | 0.9542 | 1.00 |
StateMassachusetts | -0.3581 | 0.0427 | -8.3789 | 0.0000 | 0.70 |
StateMichigan | 0.1221 | 0.0414 | 2.9471 | 0.0032 | 1.13 |
StateMinnesota | -0.7054 | 0.0421 | -16.7632 | 0.0000 | 0.49 |
StateMississippi | -0.1515 | 0.0473 | -3.2066 | 0.0013 | 0.86 |
StateMissouri | -0.2451 | 0.0437 | -5.6122 | 0.0000 | 0.78 |
StateMontana | -1.0990 | 0.0532 | -20.6553 | 0.0000 | 0.33 |
StateNebraska | -1.5997 | 0.0485 | -32.9612 | 0.0000 | 0.20 |
StateNevada | -0.5055 | 0.0427 | -11.8427 | 0.0000 | 0.60 |
StateNew Hampshire | -1.3846 | 0.0641 | -21.5898 | 0.0000 | 0.25 |
StateNew Jersey | 0.3224 | 0.0420 | 7.6727 | 0.0000 | 1.38 |
StateNew Mexico | 0.0127 | 0.0443 | 0.2871 | 0.7741 | 1.01 |
StateNew York | 0.9132 | 0.0398 | 22.9527 | 0.0000 | 2.49 |
StateNorth Carolina | 0.2324 | 0.0409 | 5.6881 | 0.0000 | 1.26 |
StateNorth Dakota | -1.6883 | 0.0566 | -29.8230 | 0.0000 | 0.18 |
StateOhio | 0.3377 | 0.0425 | 7.9365 | 0.0000 | 1.40 |
StateOklahoma | -0.2144 | 0.0418 | -5.1280 | 0.0000 | 0.81 |
StateOregon | -0.8456 | 0.0434 | -19.4801 | 0.0000 | 0.43 |
StatePennsylvania | 0.3950 | 0.0420 | 9.3992 | 0.0000 | 1.48 |
StateRhode Island | -2.0280 | 0.0523 | -38.7602 | 0.0000 | 0.13 |
StateSouth Carolina | -0.1325 | 0.0441 | -3.0073 | 0.0026 | 0.88 |
StateSouth Dakota | -0.9940 | 0.0534 | -18.6271 | 0.0000 | 0.37 |
StateTennessee | -0.0249 | 0.0433 | -0.5753 | 0.5651 | 0.98 |
StateTexas | 1.6531 | 0.0402 | 41.1271 | 0.0000 | 5.22 |
StateUtah | -0.9879 | 0.0466 | -21.1936 | 0.0000 | 0.37 |
StateVermont | -2.0920 | 0.0668 | -31.3332 | 0.0000 | 0.12 |
StateVirginia | 0.0695 | 0.0422 | 1.6491 | 0.0991 | 1.07 |
StateWashington | -0.1205 | 0.0411 | -2.9362 | 0.0033 | 0.89 |
StateWest Virginia | -1.0149 | 0.0531 | -19.1294 | 0.0000 | 0.36 |
StateWisconsin | -0.6356 | 0.0423 | -15.0323 | 0.0000 | 0.53 |
StateWyoming | -2.2503 | 0.0608 | -37.0395 | 0.0000 | 0.11 |
The baselines of each factor are :
Age group: <1
Race: American Indian or Alaska native
Ethnicity: Hispanic or Latino
Sex: Female
State: Alabama
Analysis:
Ethnicity: The risk of dying due to diabetes among Non-Hispanic or Latino is about 9.33 times higher than that for the Hispanic or Latino holding all other variables constant.
Race: Highest risk of death due to diabetes is reported for white Americans. A person of white is 17.82 times more likely to die due to diabetes than American Indian or Alaska Native on average holding all other variables constant.
Sex: Males have about 1.48 times higher rick of dying due to diabetes than females on average holding all other variables constant.
Not enough variables
Inability to optimize results
No breakdown of causes of death.
---
title: "Diabetes Mortality"
output:
flexdashboard::flex_dashboard:
theme:
bootswatch: zephyr
primary: "#96ACA3"
orientation: columns
vertical_layout: fill
social: menu
source_code: embed
---
<style type="text/css">
.chart-title { /* chart_title */
font-size: 18px;
}
body{ /* Normal */
font_size: 16px
}
</style>
```{r setup, include=FALSE}
pacman:: p_load(DT, flexdashboard, gridExtra, knitr, MASS, plotly, rpart, rpart.plot, tidyverse, usmap,dplyr,MASS, pacman, ggplot2, faraway)
```
```{r data cleaning, eval=FALSE}
#----------read file and combine them vertically by year.-----------
path = "../Data"
allpopfiles <- list.files(path, pattern = ".txt", full.names = TRUE)
diabetes <- data.frame()
for (i in 1:length(allpopfiles)){
temp <- read_tsv(allpopfiles[i])
temp <- temp %>%
mutate(Year = str_extract(allpopfiles[i], "20.."))
diabetes <- rbind.data.frame(diabetes, temp)
}
```
```{r read_data}
diabetes <- read_csv("../Data/diabetes.csv")
#------------create a new variables "region"------------------------
Regions <- cbind.data.frame(state.name, state.region)
Northeast <- Regions$state.name[Regions$state.region=="Northeast"]
South <- Regions$state.name[Regions$state.region=="South"]
North_Central <- Regions$state.name[Regions$state.region=="North Central"]
West <- Regions$state.name[Regions$state.region=="West"]
diabetes <- diabetes %>%
rename(Age_Groups = `Five-Year Age Groups`,
Ethnicity =`Hispanic Origin`,
Crude_Rate = `Crude Rate`,
Sex = Gender) %>%
dplyr::select(-c(Notes,`State Code`,`Gender Code`,`Race Code`,`Hispanic Origin Code`,`Five-Year Age Groups Code`)) %>%
mutate(Region = case_when(
State %in% Northeast ~ "Northeast",
State %in% South ~ "South",
State %in% North_Central ~ "North_Central",
State %in% West ~ "West"))
all_ages_groups <- c("< 1", "1-4", "5-9", "10-14", "15-19",
"20-24", "25-29", "30-34", "35-39", "40-44",
"45-49", "50-54", "55-59", "60-64", "65-69",
"70-74", "75-79", "80-84")
diabetes$Age_Groups <- factor(gsub(" years| year", "",
diabetes$Age_Groups),
levels = all_ages_groups)
#-------------create a new variable "Death_Rate" per 10,0000---------
diabetes <- diabetes %>% filter(Population != "Not Applicable")
diabetes$Population <- as.numeric(diabetes$Population)
diabetes <- mutate(diabetes,Mortality_rate = round(Deaths/Population * 100000,2)) %>%
dplyr::select(-Crude_Rate)
#--------------Data Cleaning------------------
diabetes[sapply(diabetes, is.character)] <- lapply(diabetes[sapply(diabetes, is.character)], as.factor)
diabetes <- diabetes %>% filter(State != "District of Columbia")
diabetes$Year <- as.numeric(diabetes$Year)
```
Introduction
======================================================================
Column {data-width=450}
----------------------------------------------------------------------
### Motivation
<font size=3>
**Predictive Modelling Of Mortality Rates for Diabetes: An Analysis Of Risk Factors**
</font>
Diabetes is a chronic disease affecting millions globally and a leading cause of death. Understanding mortality trends and predicting future mortality rates is crucial for public health planning and intervention strategies. Our objective is to identify critical factors that contribute to mortality from diabetes and develop a suitable model for predicting future mortality rates. The outcomes of this study could inform policies and interventions aimed at reducing the burden of diabetes-related mortality.
### An overlook of the dataset
This project analyzes mortality rates for diabetes in the United States from 2015 to 2020 and employs various predictive methods in Statistics and Machine Learning to forecast future mortality rates.
Variables:
- Locations: States, Regions
- Gender : Female, Male
- Age: Five-Year age group from age 0-84
- Race: White, Black or African American, American Indian or Alaska Native, Asian or Pacific Islander
- Ethnicity: Hispanic or Latino, Not Hispanic or Latino
- Year: From 2015 - 2020
Resource: [CDC WONDER](https://wonder.cdc.gov/ucd-icd10.html)
Column {data-width=550}
-----------------------------------------------------------------------
### Glimpse
```{r}
datatable(diabetes, rownames = FALSE,
colnames = c("State", "Sex", "Age Group", "Race", "Ethnicity", "Death", "Population", "Year", "Region", "Mortality (Per 100,000)"),
class = "hover",
options = list(
columnDefs = list(list(className = 'dt-center',
targets = 1:5)),
pageLength = 10,
initComplete = JS("function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#96ACA3', 'color': '#fff'});","}")
))
```
Mortality
======================================================================
```{r, align='center'}
diabetes_state <- diabetes %>%
group_by(State) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality_rate = round(Deaths/Population*100000,2)) %>%
right_join(Regions,by=c("State" = "state.name")) %>%
rename(Region = state.region)
US_map <- us_map("state") %>% filter(full != "District of Columbia")
rate_data <- diabetes_state %>%
left_join(US_map, by = c("State"="full"))
label <- rate_data %>%
group_by(abbr) %>%
summarise(x = mean(x), y = mean(y))
p <- ggplot(rate_data, aes(x = x, y = y,color = "black")) +
geom_polygon(aes(group = group, fill = Mortality_rate,
text = paste0(State,":\n", Mortality_rate, " deaths per 100,000")),
colour = "black") +
geom_text(aes(label = abbr),data = label,fontface = "bold",color="white") +
scale_fill_continuous(low = "#F6F5E9", high = "#6F6360",name = "Mortality rate") +
theme_minimal() +
theme(axis.title.x=element_blank(), axis.text.x=element_blank(),
axis.ticks.x=element_blank(), axis.title.y=element_blank(),
axis.text.y =element_blank(), axis.ticks.y=element_blank(),
panel.grid.major = element_blank(),
panel.background = element_blank())+
labs(title = "Mortality Rate Per 100,000 in the United States")
ggplotly(p, tooltip = "text", width = 1100, height = 700)
```
Data Exploration
===
Column {.tabset data-width=700}
-----------------------------------------------------------------------
### Age
```{r}
diabetes_age <- diabetes %>%
group_by(Age_Groups) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality = round(Deaths/Population*100000,2))
a <- ggplot(diabetes_age,aes(y=reorder(Age_Groups, Mortality),x = Mortality)) +
geom_col(fill = "#A28A8F",aes(text = paste0("Mortality: ", Mortality, " deaths per 100,000"))) +
xlab("Mortality Rate Per 100,000 People") +
ylab("Age Group (Years)") +
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(a,tooltip = "text")
```
### Ethnicity
```{r}
diabetes_Ethnicity <- diabetes %>%
group_by(Ethnicity) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality = round(Deaths/Population*100000,2))
h <- ggplot(diabetes_Ethnicity,aes(x=Ethnicity,y=Mortality)) +
geom_col(fill = "#73564C") +
ylab("Mortality Rate Per 100,000 People") +
xlab("Ethnicity")+
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(h)
```
### Race
```{r}
diabetes_race <- diabetes %>%
group_by(Race) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality= round(Deaths/Population*100000,2))
r <- ggplot(diabetes_race,aes(y=reorder(Race,Mortality),x=Mortality)) +
geom_col(fill = "#98A28A",aes(text = paste0("Mortality: ", Mortality, " deaths per 100,000")))+
xlab("Mortality Rate Per 100,000 People") +
ylab("Race") +
theme(axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 12))
ggplotly(r)
```
### Sex
```{r}
diabetes_sex <- diabetes %>%
group_by(Sex) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality = round(Deaths/Population*100000,2))
s <- ggplot(diabetes_sex,aes(x=Sex,y=Mortality)) +
geom_col(fill = "#BF9568")+
ylab("Mortatility Rate Per 100,000 People")+
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(s)
```
### Year
```{r}
diabetes_year <- diabetes %>%
group_by(Year) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality = round(Deaths/Population*100000,2))
y <- (ggplot(diabetes_year,aes(x=Year,y=Mortality))) +
geom_col(fill = "#316C6C")+
ylab("Mortatility Rate Per 100,000 People")+
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(y)
```
### Region
```{r}
## mortality rate by region
diabetes_region <- diabetes %>%
group_by(Region) %>%
summarise(Mortality = round(sum(Deaths)/sum(Population)*100000, 2))
r <- diabetes_region %>%
ggplot(aes(x = Region, y = Mortality)) +
geom_bar(stat="identity", fill="#637493") +
ylab("Mortatility Rate Per 100,000 People") +
geom_polygon(aes(text = paste0("Mortality rate: ", Mortality, " deaths per 100,000")), colour = "black")+
theme(axis.text.y = element_text(size = 12), axis.text.x = element_text(size = 12))
ggplotly(r)
```
### State
```{r}
diabetes_state <- diabetes %>%
group_by(State) %>%
summarise(Deaths = sum(Deaths), Population = sum(Population)) %>%
mutate(Mortality= round(Deaths/Population*100000, 2)) %>%
right_join(Regions,by=c("State" = "state.name")) %>%
rename(Region = state.region)
q <- ggplot(diabetes_state, aes(y=reorder(State,Mortality),
x=Mortality,fill = Region)) +
geom_col(aes(text = paste0("Mortality: ", Mortality, " deaths per 100,000"))) +
ggtitle("Mortality rate of 6 years in each state")+
ylab("States")+
xlab("Mortality Rate")+
geom_polygon(aes(text = paste0("Mortality: ", Mortality, " deaths per 100,000")), colour = "black")
ggplotly(q,tooltip = "text")
```
### Panel Graph
```{r}
diabetes_Year <- diabetes %>%
group_by(Year, Sex, Race, Ethnicity) %>%
summarise(Mortality = round(sum(Deaths)/sum(Population)*100000, 2))
p_year <- diabetes_Year %>%
ggplot(aes(x=Mortality, y=Race, fill=Sex)) +
geom_bar(stat = "identity",
aes(text = paste0(Sex, Race, ":\n",
"Mortality Rate: ", Mortality, " Per 100,000 People."))) +
facet_grid(Ethnicity~Year) +
xlab("Mortatility rate Per 100,000 People") +
scale_fill_manual(values = c("Male" = "#96ACA3",
"Female" = "#F5A997")) +
theme(text = element_text(size = 12), legend.position = "none") +
theme(axis.text.y = element_text(size = 8))
ggplotly(p_year, tooltip = "text")
```
### Summary
According to the age distribution, the older the age, the higher the mortality rate of diabetics. the greatest mortality rate was recorded in the 80-84 age group, reaching 5963 per 100,000. More surprisingly, infants younger than one year of age also have a high mortality rate, ranking seventh out of 18 age groups, at 570 per 100,000. The lowest mortality rate was in the 5-9 age group, with a rate of 13 per 100,000.
In terms of ethnic distribution, the death rate for non-Hispanics and Latinos is higher than that for Hispanics, almost twice as high. American Indians or Alaska Natives have the highest mortality rate, with a rate of approximately 1061 per 100,000. Asian and Pacific Islanders have the lowest mortality rate, with a rate of approximately 364 per 100,000. By gender distribution, males have a higher mortality rate than females.
The geographical distribution is clear. According to the regional map, the North Central and Southern regions have the highest mortality rates, even almost identical, at around 729, but when it comes to mortality rates per state, seven of the top ten states are from the Southern region, while the top six states are all from the South.
Since 2015, through 2019, the diabetes death rate has steadily increased in each year based on trends, but not by much. But the mortality rate in 2020 is clearly and significantly higher, and this reason is supposedly related to the outbreak of corona virus in March 2020.
Modeling
===
Column {.tabset data-width=350}
---------------------------------------------------------------------
### Negative Binomial Regression Models
Negative binomial regression is used to predict the count of deaths instead of Poisson regression. As checked, in this dataset, there is a huge gap between the mean and variance of the number of deaths.
Negative binomial regression assumes that the count data follows a negative binomial distribution, and the predictor variables have a linear relationship with the logarithm of the expected count.
**Link Function:**
$\log{\mbox{deaths}} = \mbox{Intercept} + b_1x_1+b_2x_2+\cdot + b_kx_k$
### Models Considered
```{r allmodels}
Models <- paste("Model", 1:11)
Predictors <- c("Sex", "Race", "Age Group", "Age Group, Race",
"Race, Sex", "Age Group, Sex", "Age Group, Race, Sex",
"Age Group, Race, Sex, Ethnicity",
"Age Group, Race, Sex, Ethnicity, Year",
"Age Group, Race, Sex, Ethnicity, State",
"Age Group, Race, Sex, Ethnicity, Year, State"
)
all_models <- data.frame(Models = Models, Predictors = Predictors)
kable(all_models)
```
Column {data-width=650}
----------------------------------------------------------------------
### Model Comparison
```{r,include=FALSE}
# + offset(log(Population/100000)), link=log,
#Model 1
fit.sex = glm.nb(Deaths ~ Sex, data = diabetes)
p.val1 = pchisq(deviance(fit.sex), df.residual(fit.sex), lower = F)
p.val1
m1 <- round(c(df.residual(fit.sex), deviance(fit.sex), fit.sex$theta,
fit.sex$aic, fit.sex$twologlik),2)
#Model 2
fit.race = glm.nb(Deaths ~ Race,data = diabetes)
p.val2 = pchisq(deviance(fit.race), df.residual(fit.race), lower = F)
p.val2
m2 <- round(c(df.residual(fit.race), deviance(fit.race), fit.race$theta,
fit.race$aic, fit.race$twologlik),2)
#Model 3
fit.age = glm.nb(Deaths ~ Age_Groups, data = diabetes )
p.val3 = pchisq(deviance(fit.age), df.residual(fit.age), lower = F)
p.val3
m3 <- round(c(df.residual(fit.age), deviance(fit.age), fit.age$theta,
fit.age$aic, fit.age$twologlik),2)
#Model 4
fit.age.race = glm.nb(Deaths ~ Age_Groups + Race, data = diabetes)
p.val4 = pchisq(deviance(fit.age.race), df.residual(fit.age.race), lower = F)
p.val4
m4 <- round(c(df.residual(fit.age.race), deviance(fit.age.race), fit.age.race$theta,
fit.age.race$aic, fit.age.race$twologlik),2)
#Model 5
fit.sex.race = glm.nb(Deaths ~ Race + Sex, data = diabetes)
p.val5 = pchisq(deviance(fit.sex.race), df.residual(fit.sex.race), lower = F)
p.val5
m5 <- round(c(df.residual(fit.sex.race), deviance(fit.sex.race), fit.sex.race$theta,
fit.sex.race$aic, fit.sex.race$twologlik),2)
#Model 6
fit.sex.age = glm.nb(Deaths ~ Age_Groups + Sex, data = diabetes)
p.val6 = pchisq(deviance(fit.sex.age), df.residual(fit.sex.age), lower = F)
p.val6
m6 <- round(c(df.residual(fit.sex.age), deviance(fit.sex.age), fit.sex.age$theta,
fit.sex.age$aic, fit.sex.age$twologlik),2)
#Model 7
fit.sex.age.race = glm.nb(Deaths ~ Age_Groups + Race + Sex, data = diabetes)
p.val7 = pchisq(deviance(fit.sex.age.race), df.residual(fit.sex.age.race), lower = F)
p.val7
m7 <- round(c(df.residual(fit.sex.age.race), deviance(fit.sex.age.race), fit.sex.age.race$theta,
fit.sex.age.race$aic, fit.sex.age.race$twologlik),2)
#Model 8
fit.sex.age.race.his = glm.nb(Deaths ~ Age_Groups + Race + Sex + Ethnicity,data = diabetes)
p.val8 = pchisq(deviance(fit.sex.age.race.his), df.residual(fit.sex.age.race.his), lower = F)
p.val8
m8 <- round(c(df.residual(fit.sex.age.race.his), deviance(fit.sex.age.race.his), fit.sex.age.race.his$theta,
fit.sex.age.race.his$aic, fit.sex.age.race.his$twologlik),2)
#Model 9
fit.sex.age.race.his.year = glm.nb(Deaths ~ Age_Groups + Race + Sex + Ethnicity + Year, data = diabetes)
p.val9= pchisq(deviance(fit.sex.age.race.his.year), df.residual(fit.sex.age.race.his.year), lower = F)
p.val9
m9 <- round(c(df.residual(fit.sex.age.race.his.year), deviance(fit.sex.age.race.his.year), fit.sex.age.race.his.year$theta,
fit.sex.age.race.his.year$aic, fit.sex.age.race.his.year$twologlik),2)
#Model 10
fit.sex.age.race.his.State = glm.nb(Deaths ~ Age_Groups+ Race + Sex+ Ethnicity + State, data = diabetes)
p.val10= pchisq(deviance(fit.sex.age.race.his.State), df.residual(fit.sex.age.race.his.State), lower = F)
p.val10
m10 <- round(c(df.residual(fit.sex.age.race.his.State), deviance(fit.sex.age.race.his.State), fit.sex.age.race.his.State$theta,
fit.sex.age.race.his.State$aic, fit.sex.age.race.his.State$twologlik),2)
#Model 11
fit.sex.age.race.his.year.State = glm.nb(Deaths ~ Age_Groups +
Race+ Sex + Ethnicity+ Year + State, data = diabetes)
p.val11= pchisq(deviance(fit.sex.age.race.his.year.State), df.residual(fit.sex.age.race.his.year.State), lower = F)
p.val11
m11 <- round(c(df.residual(fit.sex.age.race.his.year.State), deviance(fit.sex.age.race.his.year.State), fit.sex.age.race.his.year.State$theta,
fit.sex.age.race.his.year.State$aic, fit.sex.age.race.his.year.State$twologlik),2)
```
```{r}
################# SUMMARY Table-2 ####################
tab2 <- rbind(m1, m2, m3, m4, m5, m6, m7, m8, m9, m10,m11)
colnames(tab2) = c( "d.f.", "dev", "disp", "AIC", "2Llik")
rownames(tab2) = c("Model 1", "Model 2", "Model 3","Model 4","Model 5","Model 6","Model 7","Model 8","Model 9","Model 10","Model 11")
knitr::kable(tab2, digits=3)
```
Diagnosis
===
Column {.tabset data-wigth=1000}
----------------------------------------------------------------------
### Model 1
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r, fig.align = "center"}
halfnorm(residuals(fit.sex), main = " Model 1: Sex")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r, fig.align = "center"}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 1: Sex")
lines(density(predict(fit.sex, type="response")), col = "red")
```
:::
::::::
### Model 2
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.race),
main = "Model 2: Race")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 2: Race")
lines(density(predict(fit.race, type="response")), col = "red")
```
:::
::::::
### Model 3
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.age),
main = "Model 3: Age Group")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 3: Age Group")
lines(density(predict(fit.race, type="response")), col = "red")
```
:::
::::::
### Model 4
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.age.race),
main = "Model 4: Age Group + Race")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 4: Age Group + Race")
lines(density(predict(fit.age.race, type="response")), col = "red")
```
:::
::::::
### Model 5
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.race),
main = "Model 5: Race + Sex")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 5: Race + Sex")
lines(density(predict(fit.sex.race, type="response")), col = "red")
```
:::
::::::
### Model 6
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age),
main = "Model 66: Age Group + sex")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 6: Age Group + Sex")
lines(density(predict(fit.sex.age, type="response")), col = "red")
```
:::
::::::
### Model 7
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race),
main = "Model 7: Age Group + Race + Sex")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 7: Age Group + Race + Sex")
lines(density(predict(fit.sex.age.race, type="response")), col = "red")
```
:::
::::::
### Model 8
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race.his),
main = "Age Group + Race + Sex + Ethnicity")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 8: Age Group")
lines(density(predict(fit.sex.age.race.his, type="response")), col = "red")
```
:::
::::::
### Model 9
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race.his.year),
main = "Age Group + Race + Sex + Ethnicity + Year")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 9: Age Group + Race + Sex + Ethnicity + Year")
lines(density(predict(fit.sex.age.race.his.year, type="response")), col = "red")
```
:::
::::::
### Model 10
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race.his.State),
main = "Age Group + Race + Sex + Ethnicity + State")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 10: Age Group + Race + Sex + Ethnicity + State")
lines(density(predict(fit.sex.age.race.his.State, type="response")), col = "red")
```
:::
::::::
### Model 11
::::::{style="display: flex;"}
:::{.column width=50%}
#### Quantile - Quantile Plot
```{r}
halfnorm(residuals(fit.sex.age.race.his.year.State),
main = "Age Group + Race + Sex + Ethnicity + Year + State")
abline(a = 0, b = 1, lwd = 2, col = "red")
```
:::
:::{.column width=50%}
#### Probability Density Plot of Death due to Diabetes
```{r}
plot(density(diabetes$Deaths), main = "Diagnostics of Model 11: Age Group + Race + Sex + Ethnicity + Year + State")
lines(density(predict(fit.sex.age.race.his.year.State, type="response")), col = "red")
```
:::
::::::
Discussion
===
Column {data-width=700}
-----------------------------------------------------------------------
### Result
```{r}
m11_result <- summary(fit.sex.age.race.his.year.State)
m11_result_RR <- round(m11_result[[12]], 4)
m11_result_RR <- cbind(m11_result_RR, round(exp(m11_result_RR[,1]), 2))
colnames(m11_result_RR)[5] <- "Relative Rates"
#DT::datatable(m11_result_RR)
knitr::kable(m11_result_RR)
```
Column {.tabset}
-----------------------------------------------------------------------
### Analysis
The baselines of each factor are :
- Age group: <1
- Race: American Indian or Alaska native
- Ethnicity: Hispanic or Latino
- Sex: Female
- State: Alabama
<font size = 3>
**Analysis:**
</font>
- Ethnicity: The risk of dying due to diabetes among Non-Hispanic or Latino is about 9.33 times higher than that for the Hispanic or Latino holding all other variables constant.
- Race: Highest risk of death due to diabetes is reported for white Americans. A person of white is 17.82 times more likely to die due to diabetes than American Indian or Alaska Native on average holding all other variables constant.
- Sex: Males have about 1.48 times higher rick of dying due to diabetes than females on average holding all other variables constant.
### Limitation
- Not enough variables
- Inability to optimize results
- No breakdown of causes of death.
### About the author
This dashboard is for my capstone project, under the supervision of Dr. Ying-Ju Tessa Chen.
- Mathematics major
- Data analysis minor
- ENTP