Comparing continuous distributions with R

In R we’ll generate similar continuous distributions for two groups and give a brief overview of statistical tests and visualizations to compare the groups. Though the fake data are normally distributed, we use methods for various kinds of continuous distributions. I put this together while working with data from an odd distribution involving money where I am focusing on the binning method.

> # load the ggplot graphing package
> require(ggplot2)
Loading required package: ggplot2
> 
> # generate fake data with group B having a higher outcome
> # the function rnorm() generates normally distributed random numbers
> set.seed(22136)
> df <- rbind(
+ data.frame(group='A', outcome=rnorm(n=200, mean=100, sd=20)),
+ data.frame(group='B', outcome=rnorm(n=200, mean=105, sd=20)))
> 
> # inspect the fake data
> summary(df)
 group      outcome      
 A:200   Min.   : 46.55  
 B:200   1st Qu.: 88.90  
         Median :101.88  
         Mean   :102.01  
         3rd Qu.:116.65  
         Max.   :160.17  
> 
> # generate five number summary and mean for each group
> by(df$outcome, df$group, summary)
df$group: A
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  59.68   86.62   97.12   97.75  109.80  145.30 
------------------------------------------------------------ 
df$group: B
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  46.55   93.54  107.10  106.30  122.40  160.20 
> 
> # perform t-test for the difference of means
> t.test(outcome ~ group, data=df)

        Welch Two Sample t-test

data:  outcome by group 
t = -4.4333, df = 387.358, p-value = 1.21e-05
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -12.304221  -4.743673 
sample estimates:
mean in group A mean in group B 
       97.74968       106.27363 

> 
> # perform two-sample Wilcoxon test
> wilcox.test(outcome ~ group, data=df)

        Wilcoxon rank sum test with continuity correction

data:  outcome by group 
W = 14770, p-value = 6.09e-06
alternative hypothesis: true location shift is not equal to 0 

> 
> # draw a box plot
> ggplot(df, aes(group, outcome)) + geom_boxplot() 

> # draw a histogram
> ggplot(df, aes(outcome)) + geom_histogram(binwidth=10) + 
+ facet_wrap(~group, nrow=2, ncol=1)

> # draw a density plot colored by group
> ggplot(df, aes(outcome,color=group)) + geom_density() 

> # draw a frequency plot colored by group
> ggplot(df, aes(outcome,color=group)) + geom_freqpoly(binwidth=10) 

> # bin/discretize (cut continuous variable into equally sized groups)
> (q<-quantile(df$outcome , seq(0, 1, .2)))
       0%       20%       40%       60%       80%      100% 
 46.55005  86.24026  96.58131 107.14224 120.55994 160.16613 
> df$outcome_bin <- cut(df$outcome, breaks=q, include.lowest=T)
> summary(df$outcome_bin)
[46.6,86.2] (86.2,96.6]  (96.6,107]   (107,121]   (121,160] 
         80          80          80          80          80 
> 
> # tabulate the binned data
> (tab<-with(df, table(outcome_bin, group))) # counts
             group
outcome_bin    A  B
  [46.6,86.2] 47 33
  (86.2,96.6] 50 30
  (96.6,107]  43 37
  (107,121]   36 44
  (121,160]   24 56
> print(prop.table(tab,2)) # proportions
             group
outcome_bin       A     B
  [46.6,86.2] 0.235 0.165
  (86.2,96.6] 0.250 0.150
  (96.6,107]  0.215 0.185
  (107,121]   0.180 0.220
  (121,160]   0.120 0.280
> 
> # perform chi square test on the binned groups
> print(chisq.test(tab))

        Pearson's Chi-squared test

data:  tab 
X-squared = 21.5, df = 4, p-value = 0.000252

> 
> # plot the binned group as a barchart with percentages of the relative frequencies
> # thanks to joran at http://stackoverflow.com/a/10888277/603799
> 
> require(reshape) # for melt()
> df1 <- melt(ddply(df,.(group),function(x){prop.table(table(x$outcome_bin))}),id.vars = 1)
> 
> require(scales) # for percent
> ggplot(df1, aes(x = variable,y = value)) +
+     facet_wrap(~group, nrow=2, ncol=1) +
+     scale_y_continuous(labels=percent) +
+     geom_bar(stat = "identity")

Tested with R 2.15.0 and ggplot2 0.9.1.

About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s