# Comparing continuous distributions with R

In R we’ll generate similar continuous distributions for two groups and give a brief overview of statistical tests and visualizations to compare the groups. Though the fake data are normally distributed, we use methods for various kinds of continuous distributions. I put this together while working with data from an odd distribution involving money where I am focusing on the binning method.

```> # load the ggplot graphing package
> require(ggplot2)
Loading required package: ggplot2
>
> # generate fake data with group B having a higher outcome
> # the function rnorm() generates normally distributed random numbers
> set.seed(22136)
> df <- rbind(
+ data.frame(group='A', outcome=rnorm(n=200, mean=100, sd=20)),
+ data.frame(group='B', outcome=rnorm(n=200, mean=105, sd=20)))
>
> # inspect the fake data
> summary(df)
group      outcome
A:200   Min.   : 46.55
B:200   1st Qu.: 88.90
Median :101.88
Mean   :102.01
3rd Qu.:116.65
Max.   :160.17
>
> # generate five number summary and mean for each group
> by(df\$outcome, df\$group, summary)
df\$group: A
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
59.68   86.62   97.12   97.75  109.80  145.30
------------------------------------------------------------
df\$group: B
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
46.55   93.54  107.10  106.30  122.40  160.20
>
> # perform t-test for the difference of means
> t.test(outcome ~ group, data=df)

Welch Two Sample t-test

data:  outcome by group
t = -4.4333, df = 387.358, p-value = 1.21e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-12.304221  -4.743673
sample estimates:
mean in group A mean in group B
97.74968       106.27363

>
> # perform two-sample Wilcoxon test
> wilcox.test(outcome ~ group, data=df)

Wilcoxon rank sum test with continuity correction

data:  outcome by group
W = 14770, p-value = 6.09e-06
alternative hypothesis: true location shift is not equal to 0

>
> # draw a box plot
> ggplot(df, aes(group, outcome)) + geom_boxplot()
``` ```> # draw a histogram
> ggplot(df, aes(outcome)) + geom_histogram(binwidth=10) +
+ facet_wrap(~group, nrow=2, ncol=1)
``` ```> # draw a density plot colored by group
> ggplot(df, aes(outcome,color=group)) + geom_density()
``` ```> # draw a frequency plot colored by group
> ggplot(df, aes(outcome,color=group)) + geom_freqpoly(binwidth=10)
``` ```> # bin/discretize (cut continuous variable into equally sized groups)
> (q<-quantile(df\$outcome , seq(0, 1, .2)))
0%       20%       40%       60%       80%      100%
46.55005  86.24026  96.58131 107.14224 120.55994 160.16613
> df\$outcome_bin <- cut(df\$outcome, breaks=q, include.lowest=T)
> summary(df\$outcome_bin)
[46.6,86.2] (86.2,96.6]  (96.6,107]   (107,121]   (121,160]
80          80          80          80          80
>
> # tabulate the binned data
> (tab<-with(df, table(outcome_bin, group))) # counts
group
outcome_bin    A  B
[46.6,86.2] 47 33
(86.2,96.6] 50 30
(96.6,107]  43 37
(107,121]   36 44
(121,160]   24 56
> print(prop.table(tab,2)) # proportions
group
outcome_bin       A     B
[46.6,86.2] 0.235 0.165
(86.2,96.6] 0.250 0.150
(96.6,107]  0.215 0.185
(107,121]   0.180 0.220
(121,160]   0.120 0.280
>
> # perform chi square test on the binned groups
> print(chisq.test(tab))

Pearson's Chi-squared test

data:  tab
X-squared = 21.5, df = 4, p-value = 0.000252

>
> # plot the binned group as a barchart with percentages of the relative frequencies
> # thanks to joran at http://stackoverflow.com/a/10888277/603799
>
> require(reshape) # for melt()
> df1 <- melt(ddply(df,.(group),function(x){prop.table(table(x\$outcome_bin))}),id.vars = 1)
>
> require(scales) # for percent
> ggplot(df1, aes(x = variable,y = value)) +
+     facet_wrap(~group, nrow=2, ncol=1) +
+     scale_y_continuous(labels=percent) +
+     geom_bar(stat = "identity")
``` Tested with R 2.15.0 and ggplot2 0.9.1.

Advertisements