In R we’ll generate similar continuous distributions for two groups and give a brief overview of statistical tests and visualizations to compare the groups. Though the fake data are normally distributed, we use methods for various kinds of continuous distributions. I put this together while working with data from an odd distribution involving money where I am focusing on the binning method.

> # load the ggplot graphing package > require(ggplot2) Loading required package: ggplot2 > > # generate fake data with group B having a higher outcome > # the function rnorm() generates normally distributed random numbers > set.seed(22136) > df <- rbind( + data.frame(group='A', outcome=rnorm(n=200, mean=100, sd=20)), + data.frame(group='B', outcome=rnorm(n=200, mean=105, sd=20))) > > # inspect the fake data > summary(df) group outcome A:200 Min. : 46.55 B:200 1st Qu.: 88.90 Median :101.88 Mean :102.01 3rd Qu.:116.65 Max. :160.17 > > # generate five number summary and mean for each group > by(df$outcome, df$group, summary) df$group: A Min. 1st Qu. Median Mean 3rd Qu. Max. 59.68 86.62 97.12 97.75 109.80 145.30 ------------------------------------------------------------ df$group: B Min. 1st Qu. Median Mean 3rd Qu. Max. 46.55 93.54 107.10 106.30 122.40 160.20 > > # perform t-test for the difference of means > t.test(outcome ~ group, data=df) Welch Two Sample t-test data: outcome by group t = -4.4333, df = 387.358, p-value = 1.21e-05 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -12.304221 -4.743673 sample estimates: mean in group A mean in group B 97.74968 106.27363 > > # perform two-sample Wilcoxon test > wilcox.test(outcome ~ group, data=df) Wilcoxon rank sum test with continuity correction data: outcome by group W = 14770, p-value = 6.09e-06 alternative hypothesis: true location shift is not equal to 0 > > # draw a box plot > ggplot(df, aes(group, outcome)) + geom_boxplot()

> # draw a histogram > ggplot(df, aes(outcome)) + geom_histogram(binwidth=10) + + facet_wrap(~group, nrow=2, ncol=1)

> # draw a density plot colored by group > ggplot(df, aes(outcome,color=group)) + geom_density()

> # draw a frequency plot colored by group > ggplot(df, aes(outcome,color=group)) + geom_freqpoly(binwidth=10)

> # bin/discretize (cut continuous variable into equally sized groups) > (q<-quantile(df$outcome , seq(0, 1, .2))) 0% 20% 40% 60% 80% 100% 46.55005 86.24026 96.58131 107.14224 120.55994 160.16613 > df$outcome_bin <- cut(df$outcome, breaks=q, include.lowest=T) > summary(df$outcome_bin) [46.6,86.2] (86.2,96.6] (96.6,107] (107,121] (121,160] 80 80 80 80 80 > > # tabulate the binned data > (tab<-with(df, table(outcome_bin, group))) # counts group outcome_bin A B [46.6,86.2] 47 33 (86.2,96.6] 50 30 (96.6,107] 43 37 (107,121] 36 44 (121,160] 24 56 > print(prop.table(tab,2)) # proportions group outcome_bin A B [46.6,86.2] 0.235 0.165 (86.2,96.6] 0.250 0.150 (96.6,107] 0.215 0.185 (107,121] 0.180 0.220 (121,160] 0.120 0.280 > > # perform chi square test on the binned groups > print(chisq.test(tab)) Pearson's Chi-squared test data: tab X-squared = 21.5, df = 4, p-value = 0.000252 > > # plot the binned group as a barchart with percentages of the relative frequencies > # thanks to joran at http://stackoverflow.com/a/10888277/603799 > > require(reshape) # for melt() > df1 <- melt(ddply(df,.(group),function(x){prop.table(table(x$outcome_bin))}),id.vars = 1) > > require(scales) # for percent > ggplot(df1, aes(x = variable,y = value)) + + facet_wrap(~group, nrow=2, ncol=1) + + scale_y_continuous(labels=percent) + + geom_bar(stat = "identity")

Tested with R 2.15.0 and ggplot2 0.9.1.

Pingback: Learning R: Hard Lessons - Home Of The Scary DBA