Baseline Characteristics Table

A common task for us is putting together a baseline characteristics table, or “Table 1.” This is usually painful to do by hand, particularly when you often have to do it multiple times whenever a small change is made to a cohort. The following describes one way of automating much of this, using R and the table1() function in the table1 package.

Packages

Tidyverse

If you’re not already familiar with it, I think you’ll find tidyverse extremely helpful - it is actually a bundle of packages centered around ‘tidy’ data, which is just a fancy description for data that take the form of 3 rules:

Each variable has its own column
Each observation has its own row.
Each value has its own cell.

Outside of time-varying analyses, that’s pretty much exactly the type of analytic dataset we are creating for most of our work.

Table1

In addition to tidyverse, for the baseline characteristics table, the package table1 gets us pretty close to a final output. If you want to read more, see it’s vignette at the aforementioned link.

Haven

Haven is a tidyverse package that reads SAS datasets directly. Particularly helpful for us since most of our initial data wrangling has to take place in SAS. If you haven’t noticed already, base R is not particularly good at handling huge datasets (absent a lot of memory resources on a VM) because it completely stores them in memory.

Flextable

Flextable will be used to get the final output into a word document. Will explain more on this later.

# if you need to install, you would use:
# install.packages(c("tidyverse", "table1"))

# load the relevant packages
library(tidyverse)
library(table1)
library(haven)
library(flextable)

Load the data

Here’s where you load the SAS dataset, using the haven package. First, you assign the object tab1 (can be any name you want it to be) something using the <- assignment operator, and you assign it the dataset returned by the read_sas() function from haven. There’s lots more options with this function than I use below, but the basics usually work well.

Note

Note that R has case-sensitive variable/column names. Make your life a lot easier and rename variables to lowercase with the rename_with() function and tolower option, i.e., rename_with(tolower) (see below)

# note that the filepath needs forward slashes, or two backwards slashes,
# e.g., "C:/Users/.../...sas7bdat" or "C:\\Users\\...\\...sas7bdat"
#
# tab1 <- read_sas("E:/.../.../...sas7bdat") %>% 
#   rename_with(tolower)

Data wrangling for table1

I already have my data loaded in an object called aim1cohort, so I’m just going to assign it the new name tab1, but you would not need to run this if you run above to load the data directly from a SAS dataset.

Code

load(file = "/Users/ssmithm/UFL Dropbox/Steven Smith/R Projects/K01-Initial_Antihtn_Prescribing/data/aim1cohort.rda")

tab1 <- aim1cohort

A note on ’pipe’s

One thing that’s worth reading up on the %>% “pipe”. See this here or here. It’s a tidyverse thing originally coming from the magrittr package. Because it’s quite popular, Base R has now incorporated its own pipe now which does almost exactly the same thing, but looks like this: |\. You can use ctrl+shift+M as a short-cut, and in your R-studio preferences, you can tell R-Studio whether to use the native pipe |> or magrittr’s %>%. Basically, it’s a way to pipe an object forward from one function to the next, as opposed to having to nest a bunch of functions within one another.

Ok, here’s the data wrangling code:

tab1 <- tab1 %>%
  
  # select only variables needed; ends_with() function is a nifty short cut to 
  # grab all variables whose name ends with that string
  select(c("patid","source","age","age_cat", "hispanic","sex","race","index_year", 
           ends_with("indicator"), "combined_score_num")) %>%
  
  # mutate() is for creating new variables. Some of this should look pretty 
  # similar to what you're used to in SAS. 
  # c() function just combines multiple things and works similarly here to a
  # SAS parenthetical list, i.e., race in ("No Information", ...)
  mutate(race = if_else(race %in% c("No Information","Refuse to Answer","Unknown"), "Unknown", race),
         
         # factor() is a vector type for categorical data and it's important for
         # table1 package because it's how table1 figures out what is categorical
         # vs. continuous, and as we'll see below, how to order them in the output. 
         hispanic = factor(hispanic),
         sex = factor(sex),
         index_year = factor(index_year),
         statinindicator = factor(statinindicator),
         aspirinindicator = factor(aspirinindicator),
         smokingindicator = factor(smokingindicator),
         diabetesindicator = factor(diabetesindicator),
         ckdindicator = factor(ckdindicator),
         esrdindicator = factor(esrdindicator),
         hfejindicator = factor(hfejindicator),
         chdindicator = factor(chdindicator),
         pcrindicator = factor(pcrindicator),
         strokeindicator = factor(strokeindicator),
         padindicator = factor(padindicator),
         ascvdindicator = factor(ascvdindicator),
         afindicator = factor(afindicator),
         copdindicator = factor(copdindicator),
         asthmaindicator = factor(asthmaindicator),
         depressionindicator = factor(depressionindicator),
         goutindicator = factor(goutindicator),
         source = factor(source),
         anticogindicator = factor(anticogindicator),
         ktindicator = factor(ktindicator),
         osaindicator = factor(osaindicator)
  ) %>%
  # arrange() sorts.
  arrange(patid) %>%
  # distinct() picks out distinct values, here of patid.
  distinct(patid, .keep_all = TRUE)


# here we go in and work on specific columns of `tab1` dataset and 
# order the levels (values) of that column in the way we want it presented in
# the output table, using the factor() function. Basically, we're just taking 
# the column as is, and replacing it with the same data, but telling R what is 
# should be the intrinsic order of these values when R outputs anything with it. 
# (Not actually changing given values for a given observation)
tab1$race <- factor(tab1$race, levels = c("American Indian or Alaska Native",
                                          "Asian",
                                          "Black or African American",
                                          "Native Hawaiian or Other Pacific",
                                          "White",
                                          "Multiple Race",
                                          "Other",
                                          "Unknown"))
tab1$age_cat <- factor(tab1$age_cat, levels = c("<45 y", "45-64 y", ">65 y"))
tab1$hispanic <- factor(tab1$hispanic, levels = c("Hispanic", "Not Hispanic", "Unknown"))

# note that above I'm using base R coding, not tidyverse syntax. 
# I could have accomplished the above with tidyverse (dplyr) syntax also:
tab1 <- tab1 %>% 
  mutate(race = factor(race, levels = c("American Indian or Alaska Native",
                                        "Asian",
                                        "Black or African American",
                                        "Native Hawaiian or Other Pacific",
                                        "White",
                                        "Multiple Race",
                                        "Other",
                                        "Unknown")),
         age_cat = factor(age_cat, levels = c("<45 y", "45-64 y", ">65 y")),
         hispanic = factor(hispanic, levels = c("Hispanic", "Not Hispanic", "Unknown")),
         ckdindicator = as.integer(ckdindicator))

Labels and Units

Here we can apply label and unit attributes to each column. Labels will be printed (in the output table) as specified here, and will be appended with units, if they’re assigned. Note here I only assign one unit (to age), as most everything else is categorical. But, for example, BP would need units also if included.

# Labels
label(tab1$age) <- "Age"
label(tab1$age_cat) <- "Age Category"
label(tab1$sex) <- "Sex"
label(tab1$race) <- "Race"
label(tab1$hispanic) <-  "Ethnicity"
label(tab1$smokingindicator) <- "Current Smoker"
label(tab1$diabetesindicator) <- "Diabetes"
label(tab1$ckdindicator) <- "Chronic kidney disease"
label(tab1$esrdindicator) <- "End-stage renal disease"
label(tab1$hfejindicator) <- "Heart failure w/ reduced EF"
label(tab1$chdindicator) <- "Coronary heart disease"
label(tab1$pcrindicator) <- "Prior coronary revascularization"
label(tab1$strokeindicator) <- "Prior stroke or TIA"
label(tab1$padindicator) <- "Peripheral arterial disease"
label(tab1$ascvdindicator) <- "History of clinical ASCVD"
label(tab1$afindicator) <- "Atrial fibrillation"
label(tab1$copdindicator) <- "Chronic obstructive pulmonary disease"
label(tab1$asthmaindicator) <- "Asthma"
label(tab1$depressionindicator) <- "Depression"
label(tab1$combined_score_num) <- "Combined Comorbidity Score"
label(tab1$statinindicator) <- "Statin"
label(tab1$aspirinindicator) <- "Aspirin"
label(tab1$index_year) <- "Index Year"

# Units
units(tab1$age) <- "years"

Setting up the output

Here we tell table1 to use our label list, as well as what columns to give us. This is important for stratified columns (e.g., Medicaid and Medicare, or those with EHR and those without EHR data). I think you can do as many strata as you want, though obviously a lot will not look good in the table.

#### Render Table 1 ####
# Setup

# here, the first string is the column header, and following the = sign is 
# how to get just the patients that should be used for that column. 
# so for everyone, we use the entire tab1 dataset. For Medicaid column, we 
# subset() tab1 to get only those people who have source = "FLM"
strata_t1 <- c(list("Overall Cohort" = tab1),
               list("Medicaid-Insured" = subset(tab1, source == "FLM")),
               list("Medicare-Insured" = subset(tab1, source == "MED")))

# tell table1 where to get our labels
labels_t1 <- list(
  variables = list(age = render.varlabel(tab1$age),
                   age_cat = render.varlabel(tab1$age_cat),
                   sex = render.varlabel(tab1$sex),
                   race = render.varlabel(tab1$race),
                   hispanic = render.varlabel(tab1$hispanic),
                   smokingindicator = render.varlabel(tab1$smokingindicator),
                   diabetesindicator = render.varlabel(tab1$diabetesindicator),
                   ckdindicator = render.varlabel(tab1$ckdindicator),
                   esrdindicator = render.varlabel(tab1$esrdindicator),
                   hfejindicator = render.varlabel(tab1$hfejindicator),
                   chdindicator = render.varlabel(tab1$chdindicator),
                   pcrindicator = render.varlabel(tab1$pcrindicator),
                   strokeindicator = render.varlabel(tab1$strokeindicator),
                   padindicator = render.varlabel(tab1$padindicator),
                   ascvdindicator = render.varlabel(tab1$ascvdindicator),
                   afindicator = render.varlabel(tab1$afindicator),
                   copdindicator = render.varlabel(tab1$copdindicator),
                   asthmaindicator = render.varlabel(tab1$asthmaindicator),
                   depressionindicator = render.varlabel(tab1$depressionindicator),
                   combined_score_num = render.varlabel(tab1$combined_score_num),
                   statinindicator = render.varlabel(tab1$statinindicator),
                   aspirinindicator = render.varlabel(tab1$aspirinindicator),
                   index_year = render.varlabel(tab1$index_year)
  ))

Some functions to style the output

Basically here we’re just making some stylistic choices about what we want output to look like for categorical variables, continuous variables, and for the column headers. See the URL above for more of a description of these. But, you probably don’t need to edit this at all for any tables you create.

# add commas to Ns and cell counts
render.continuous <- function(x, ...) {
  with(stats.default(x, ...), c("", "Mean \u00B1 SD"  = sprintf("%s \u00B1 %s", signif_pad(MEAN, 3, big.mark=","), signif_pad(SD, 3, big.mark=","))))
}

render.categorical <- function(x, ...) {
  c("", sapply(stats.apply.rounding(stats.default(x)), function(y) with(y, sprintf("%s (%s%%)", prettyNum(FREQ, big.mark=","), PCT))))
}

render.strat <- function(label, n, ...) {
  sprintf("<span class='stratlabel'>%s<br><span class='stratn'>(N=%s)</span></span>", label, prettyNum(n, big.mark=","))
}

Render the Table

Now create the actual table.

# Render Table 1 -- Will need to Save the HTML for export off ResVault
table1(strata_t1, 
       labels_t1, 
       droplevels = TRUE, 
       # these next three lines are just applying the functions we created above
       # basically can be read as "for rendering continuous variables, use the 
       # function render.continuous()", etc. 
       render.continuous = render.continuous, 
       render.strat = render.strat, 
       render.categorical = render.categorical)

	Overall Cohort (N=143,054)	Medicaid-Insured (N= 71,774)	Medicare-Insured (N= 71,280)
Age (years)
Mean ± SD	57.4 ± 17.8	46.3 ± 13.6	68.6 ± 14.2
Missing	282 (0.2%)	282 (0.4%)	0 (0%)
Age Category
<45 y	38,213 (26.7%)	33,176 (46.2%)	5,037 (7.1%)
45-64 y	51,007 (35.7%)	34,274 (47.8%)	16,733 (23.5%)
>65 y	53,834 (37.6%)	4,324 (6.0%)	49,510 (69.5%)
Sex
Female	81,555 (57.0%)	43,011 (59.9%)	38,544 (54.1%)
Male	61,493 (43.0%)	28,760 (40.1%)	32,733 (45.9%)
Unknow	6 (0.0%)	3 (0.0%)	3 (0.0%)
Race
American Indian or Alaska Native	334 (0.2%)	165 (0.2%)	169 (0.2%)
Asian	1,447 (1.0%)	659 (0.9%)	788 (1.1%)
Black or African American	33,814 (23.6%)	22,041 (30.7%)	11,773 (16.5%)
Native Hawaiian or Other Pacific	33 (0.0%)	11 (0.0%)	22 (0.0%)
White	70,156 (49.0%)	25,304 (35.3%)	44,852 (62.9%)
Multiple Race	503 (0.4%)	167 (0.2%)	336 (0.5%)
Other	20,555 (14.4%)	11,610 (16.2%)	8,945 (12.5%)
Unknown	16,212 (11.3%)	11,817 (16.5%)	4,395 (6.2%)
Ethnicity
Hispanic	22,680 (15.9%)	13,016 (18.1%)	9,664 (13.6%)
Not Hispanic	99,483 (69.5%)	44,696 (62.3%)	54,787 (76.9%)
Unknown	16,203 (11.3%)	11,865 (16.5%)	4,338 (6.1%)
Missing	4688 (3.3%)	2197 (3.1%)	2491 (3.5%)
Current Smoker
0	110,974 (77.6%)	54,739 (76.3%)	56,235 (78.9%)
1	32,080 (22.4%)	17,035 (23.7%)	15,045 (21.1%)
Diabetes
0	113,421 (79.3%)	57,972 (80.8%)	55,449 (77.8%)
1	29,633 (20.7%)	13,802 (19.2%)	15,831 (22.2%)
Chronic kidney disease
Mean ± SD	1.12 ± 0.329	1.07 ± 0.257	1.18 ± 0.380
End-stage renal disease
0	141,982 (99.3%)	71,541 (99.7%)	70,441 (98.8%)
1	1,072 (0.7%)	233 (0.3%)	839 (1.2%)
Heart failure w/ reduced EF
0	140,196 (98.0%)	70,510 (98.2%)	69,686 (97.8%)
1	2,858 (2.0%)	1,264 (1.8%)	1,594 (2.2%)
Coronary heart disease
0	135,114 (94.4%)	69,032 (96.2%)	66,082 (92.7%)
1	7,940 (5.6%)	2,742 (3.8%)	5,198 (7.3%)
Prior coronary revascularization
0	142,326 (99.5%)	71,632 (99.8%)	70,694 (99.2%)
1	728 (0.5%)	142 (0.2%)	586 (0.8%)
Prior stroke or TIA
0	140,762 (98.4%)	71,465 (99.6%)	69,297 (97.2%)
1	2,292 (1.6%)	309 (0.4%)	1,983 (2.8%)
Peripheral arterial disease
0	133,666 (93.4%)	70,291 (97.9%)	63,375 (88.9%)
1	9,388 (6.6%)	1,483 (2.1%)	7,905 (11.1%)
History of clinical ASCVD
0	125,638 (87.8%)	67,545 (94.1%)	58,093 (81.5%)
1	17,416 (12.2%)	4,229 (5.9%)	13,187 (18.5%)
Atrial fibrillation
0	133,030 (93.0%)	70,399 (98.1%)	62,631 (87.9%)
1	10,024 (7.0%)	1,375 (1.9%)	8,649 (12.1%)
Chronic obstructive pulmonary disease
0	135,285 (94.6%)	67,186 (93.6%)	68,099 (95.5%)
1	7,769 (5.4%)	4,588 (6.4%)	3,181 (4.5%)
Asthma
0	136,988 (95.8%)	66,919 (93.2%)	70,069 (98.3%)
1	6,066 (4.2%)	4,855 (6.8%)	1,211 (1.7%)
Depression
0	117,364 (82.0%)	61,104 (85.1%)	56,260 (78.9%)
1	25,690 (18.0%)	10,670 (14.9%)	15,020 (21.1%)
Combined Comorbidity Score
Mean ± SD	2.54 ± 3.53	1.57 ± 2.82	3.51 ± 3.89
Statin
0	111,633 (78.0%)	58,573 (81.6%)	53,060 (74.4%)
1	31,421 (22.0%)	13,201 (18.4%)	18,220 (25.6%)
Aspirin
0	131,391 (91.8%)	65,659 (91.5%)	65,732 (92.2%)
1	11,663 (8.2%)	6,115 (8.5%)	5,548 (7.8%)
Index Year
2012	32 (0.0%)	1 (0.0%)	31 (0.0%)
2013	17,728 (12.4%)	485 (0.7%)	17,243 (24.2%)
2014	31,711 (22.2%)	11,457 (16.0%)	20,254 (28.4%)
2015	27,205 (19.0%)	11,039 (15.4%)	16,166 (22.7%)
2016	19,283 (13.5%)	10,664 (14.9%)	8,619 (12.1%)
2017	18,661 (13.0%)	9,694 (13.5%)	8,967 (12.6%)
2018	8,963 (6.3%)	8,963 (12.5%)	0 (0%)
2019	8,164 (5.7%)	8,164 (11.4%)	0 (0%)
2020	7,520 (5.3%)	7,520 (10.5%)	0 (0%)
2021	3,787 (2.6%)	3,787 (5.3%)	0 (0%)

# Can also save Table 1 for posterity, or to export as a CSV
# would just need to uncomment the following code
# t1 <- as.data.frame(table1(strata_t1, labels_t1, droplevels = TRUE,
#                            render.continuous = render.continuous, render.strat = render.strat, render.categorical = render.categorical))
# 
# write_csv(t1, file = "path_to_directory/Table 1.csv")

Note that this output still needs a bit of editing, e.g., copying the data in the ‘1’ row for the various indicators, up to the the row with the respective label, and then deleting the ‘0’ row and what was the ‘1’ row. I think a modification can probably be built into the package to do this routinely, or as an option, but I haven’t had time to fool with it. In any event, this gets us pretty close.

Getting this into Word

Ok, we have the table in HTML now, but we need it in Word. A straight copy + paste from R Studio viewer to word document does not work very well, because it either keeps all the HTML formatting (check by selecting all of the pasted table and adding all borders to the table), or it completely loses formatting if you do a Special Paste, i.e., no Word table structure any more.

Here’s a couple of ways around this (I’m sure there are others):

Export the output in the R Studio viewer to an HTML file on your computer. Then, open that file, hit CTRL+A (or CMD+A on Mac), CTRL + C, then go to Word Document and Paste Special -> HTML. It typically doesn’t look as pretty, but fixes both of the above issues.
Let Flextable do the heavy lifting for this.

Here, I use flextable. Instead of printing the table1() function results, as above, we can instead save them to an object, named tbl1 below. Then, convert it to a flextable, and have the flextable package export it as a word document.

# same as above, but instead of printing, saving as object. 
tbl1 <- table1(strata_t1, 
               labels_t1, 
               droplevels = TRUE, 
               # these next three lines are just applying the functions we created above
               # basically can be read as "for rendering continuous variables, use the 
               # function render.continuous()", etc. 
               render.continuous = render.continuous, 
               render.strat = render.strat, 
               render.categorical = render.categorical)

# convert to flextable, then save as doc.
# update path as needed. 
t1flex(tbl1) %>% 
  save_as_docx(path="./output/table1_output.docx")

Voila - here’s the output.