Learning R: 2009

Public opinion on health care reform

Monday, October 12, 2009 0 comments Posted by Jim 9:32 AM

hc_app<-read.table("n:/misc/healthcareapproval.txt", header = T, sep = "\t")
attach(hc_app) hc_fit.o1<-lowess(Oppose~as.Date(Dates),f=0.25)

hc_fit.f1<-lowess(Favor~as.Date(Dates),f=0.25)
hc_fit.o2<-lowess(Oppose~as.Date(Dates),f=0.75)
hc_fit.f2<-lowess(Favor~as.Date(Dates),f=0.75)

plot(as.Date(Dates),Oppose,
main="Public opinion and health care reform",
ylim=c(0,80),pch=16,
xlim=c(as.Date("2009-01-01"),as.Date("2009-11-01")),
cex.axis=.85, col="#E6ADD8",
xlab="",ylab="Percentage approving or opposing") points(as.Date(Dates),
Favor,pch=16,col="#ADD8E6")

lines(hc_fit.f1,col="blue",lwd=2)
lines(hc_fit.o1,col="red",lwd=2)
lines(hc_fit.f2,col="blue",lwd=2,lty=3)
lines(hc_fit.o2,col="red",lwd=2,lty=3)

abline(h=50,lwd=.5,lty=3,col="#555555")

Friday, August 28, 2009 0 comments Posted by Jim 12:11 PM

fy<-c(1990:2019)

deficit.1<-c(-3.9, -4.5, -4.7, -3.9, -2.9, -2.2, -1.4, -0.3, 0.8, 1.4, 2.4, 1.3, -1.5, -3.5, -3.6, -2.6, -1.9, -1.2, -3.2, -11.2, -9.6, NA,NA, NA, NA, NA, NA, NA, NA, NA)

projected<-c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -9.6, -6.1, -3.7, -3.2, -3.2, -3.1, -3.3, -3.2, -3.1, -3.4)

png("c:/data/deficit_color.png",height=480,width=480)

plot(deficit.1~fy,ylim=c(-12,5),type="n",lwd=2,col="red",
main="Federal budget deficit, 1990-2019",
cex.lab=1.1,cex.axis=.75,
xlab="Fiscal year",ylab="Deficit (% of GDP)")

rect(1988,-15,1994,6,col="#FF9999",border=NA)
rect(1994,-15,2002,6,col="#6699FF",border=NA)
rect(2002,-15,2010,6,col="#FF9999",border=NA)
rect(2010,-15,2014,6,col="#6699FF",border=NA)
rect(2014,-15,2021,6,col="#CCCCCC",border=NA)
abline(h=0,lwd=.5,lty=3,col="#555555")
lines(fy,deficit.1,lwd=2.5)
lines(fy,projected,lwd=2.5,lty=2)
rect(1990,-11.5,2000,-9.5,col="#dddddd",border="#555555")
text(1995,-10.5,"Source: CBO")

graphics.off()

Adding a legend to a plot

Monday, July 20, 2009 4 comments Posted by Jim 11:37 AM

It's pretty easy!

plot (c(1968,2010),c(0,10),type="n", # sets the x and y axes scales

xlab="Year",ylab="Expenditures/GDP (%)") # adds titles to the axes

lines(year,defense,col="red",lwd=2.5) # adds a line for defense expenditures

lines(year,health,col="blue",lwd=2.5) # adds a line for health expenditures

legend(2000,9.5, # places a legend at the appropriate place c("Health","Defense"), # puts text in the legend

lty=c(1,1), # gives the legend appropriate symbols (lines)

lwd=c(2.5,2.5),col=c("blue","red")) # gives the legend lines the correct color and width

Missing data, logistic regression, and a predicted values plot (or two)

Wednesday, July 15, 2009 0 comments Posted by Jim 6:55 PM

miss <- read.table ("/data/missing.txt", header = T, sep = "\t")

attach miss result1 <- glm(a~b, family=binomial(logit))

summary(result1)

Call: glm(formula = a ~ b, family = binomial(logit))

Deviance Residuals:
Min 1Q Median 3Q Max
-1.8864 -1.2036 0.7397 0.9425 1.4385

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.96130 1.40609 -4.240 2.24e-05 ***
b 0.10950 0.02404 4.555 5.24e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 279.97 on 203 degrees of freedom
Residual deviance: 236.37 on 202 degrees of freedom
(3 observations deleted due to missingness)
AIC: 240.37

Number of Fisher Scoring iterations: 5
detach (miss)
attach (miss2)

result2 <- glm(a~b, family=binomial(logit)) summary(result2) Call: glm(formula = a ~ b, family = binomial(logit)) Deviance Residuals: Min 1Q Median 3Q Max -1.884 -1.198 0.742 0.936 1.446 Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.0059 1.4162 -4.241 2.23e-05 ***
b 0.1101 0.0242 4.549 5.39e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 278.81 on 202 degrees of freedom
Residual deviance: 235.14 on 201 degrees of freedom
AIC: 239.14

Number of Fisher Scoring iterations: 5

plot(b, fitted(result1))

plot(b, fitted(result1), type="n")
curve(invlogit (coef(result1)[1]+coef(result1)[2]*x), add=TRUE)

Job grade plot

0 comments Posted by Jim 7:39 AM

This plot:

was created using the following R code:

plot (q9e~q8, type = "n",
xlim = c(1,13), ylim = c(1,13),
cex.lab=1.25,cex.axis=0.75,
col.lab = "#333333",
xlab = "Obama job grade",
ylab = "Congressional job grade",
xaxt ="n", yaxt="n",
main="Obama and Congressional 100 day job grades")

abline (or,lwd=2,col="red")
abline (od,lwd=2,col="blue")

axis(1,at=c(1,2,3,4,5,6,7,8,9,10,11,12,13),
lab=c("A+","A","A-","B+","B","B-","C+","C","C-","D+","D","D-","F"),
col.lab="#333333", col.axis="#333333", tck=-0.02, cex.axis=0.75)

axis(2,at=c(1,2,3,4,5,6,7,8,9,10,11,12,13),
lab=c("A+","A","A-","B+","B","B-","C+","C","C-","D+","D","D-","F"),
col.lab="#333333", col.axis="#333333", tck=-0.02, cex.axis=0.75)

text (5,6, "Democrats",col="blue")
text (4,9.5, "Republicans",col="red")
text(3,1,"Source: CNN/ORC Poll, April 2009",col="#777777",cex=0.75)

A recommended book

Monday, July 13, 2009 0 comments Posted by Jim 12:31 PM

I've been getting a lot of help from this book:

While written for S-Plus, nearly everything in it is applicable with R.

Some detail on the last plot

0 comments Posted by Jim 6:14 AM

First we plot approval (app) against date (daten). We also specify a few other things. ylim=c(40,80) specifies that the y axis extends from 40 to 80. xlim=c(-3,210) might seem odd, but we need extra space on the left. pch=16 plots dots, and col="gray" makes them, well, gray.

cex.lab=1.25 and cex.axis=0.75 make the axis labels 1.25 times larger than normal, and the axis value labels 0.75 times normal size. The col.lab = "#777777" makes use of R's ability to accept hexidecimal color specifications, such as those used by web designers. It's a very useful function.

I specified that the x-axis label say nothing, while the y-axis label be "Obama job approval (%)". xaxt="n", yaxt="n" tell R not to plot either the x or y axes.

plot (app~daten, ylim=c(40,80), xlim=c(-3,210), pch=16, col="gray", cex.lab=1.25,cex.axis=0.75, col.lab = "#777777", xlab="",ylab="Obama job approval (%)", xaxt="n", yaxt="n")

This yields a very sparse-looking scatterplot:

Next we add the axes:

axis(1,at=c(-3,28,56,87,117,148,178,209),lab=c("Jan 09","Feb 09","Mar 09","Apr 09","May 09","Jun 09","Jul 09", "Aug 09"),
col.lab="#777777", col.axis="#777777", tck=-0.02, cex.axis=0.75)

axis number 1 is the bottom, 2 is the left, etc. All I'm really doing is specifying where the value labels for each axis should be, and what the labels should say, i.e., month abbreviations - "Mar 09" instead of "03/01/09". The "tck" parameters control the length and direction of the tick marks.

axis(2, at=c(40,50,60,70,80),
col.axis="#777777",
las=2, tck=-0.02,
cex.axis=0.75)

"las=2" tells R to rotate the y-xis labels 90% clockwise so they are more legible.

Then we add our LOWESS fit lines:

lines(lfit1, col="red", lwd=3)
lines(lfit2, col="blue", lwd=3)

and we're done!

Obama approval

Sunday, July 12, 2009 0 comments Posted by Jim 9:06 PM

Working some more with time series data. Here we have a graph of Obama job approval numbers, with two LOWESS-fit lines added for trending:

Figure1. President Obama job approval, Jan 2009 - present.

There's actually some pretty fancy stuff going on there, as the following code shows.

polls <- read.table("/data/polls.txt", header=TRUE, sep = "\t") attach (polls)

lfit1 <- lowess(app~daten, f=0.25)
lfit2 <- lowess(app~daten, f=0.75)

plot (app~daten, ylim=c(40,80), xlim=c(-3,210),
pch=16, col="gray",
cex.lab=1.25,cex.axis=0.75,
col.lab = "#777777",
xlab="",ylab="Obama job approval (%)", xaxt="n", yaxt="n")
axis(1,at=c(-3,28,56,87,117,148,178,209),
lab=c("Jan 09","Feb 09","Mar 09","Apr 09","May 09","Jun 09","Jul 09", "Aug 09"),
col.lab="#777777", col.axis="#777777", tck=-0.02, cex.axis=0.75)
axis(2, at=c(40,50,60,70,80),
col.axis="#777777",
las=2, tck=-0.02,
cex.axis=0.75)
lines(lfit1, col="red", lwd=3)
lines(lfit2, col="blue", lwd=3)

Return

Monday, July 6, 2009 0 comments Posted by Jim 4:36 AM

I'm back from vacation, so I'll post something substantive later today.

Time series data

Sunday, June 28, 2009 0 comments Posted by Jim 8:02 PM

gdp <- read.table("c:/data/GDPC96.txt", header = TRUE)

attach(gdp)

as.Date(date)

plot(gdp~date, data=gdp,pch=16,xlab="",ylab="GDP (2000 dollars)")

Conservatism of Congressional delegation and %Bush vote

Saturday, June 27, 2009 0 comments Posted by Jim 8:37 PM

Busy day today, so I'll just post this:

plot(bush04 ~ cons_hr, type = "n",
xlab="Mean ACU rating",
ylab="2004 Bush vote",
xlim=c(0,100),
ylim=c(0,100),
cex.lab=1.25,cex.axis=0.75,
col.axis = "#777777",
col.lab = "#777777")
text(y=bush04,x=cons_hr, labels=stateid,cex=0.75)
abline(lm2, lty=2, col="red")
axis(side = 2, at = seq(0,100,by=5), cex=0.75)

Filtering cases

Friday, June 26, 2009 0 comments Posted by Jim 6:12 PM

Something that's very important to be able to do in data analysis and visualization is to filter out cases. Let's say you want to do identical analyses of two different groups, or of one group and then a subset of it. R can do this a little differently; instead of merely filtering out cases you can create an object that is a subset, and then call it when necessary.

Let's look at some data on the U.S. Congress. Keith Poole has developed a two-dimensional procedure that places members of Congress at specific points based on roll call votes. What we'll do now is compare Democrats and Republicans in the 110th Congress.

First, we load the data into R.

voteview <- read.csv ("C:/Data/HouseSmall.csv", header = TRUE) attach (voteview)

The voteview data frame contains data on all Congresses beginning with the 101st. That's more than we want to deal with, and also, we need a way to look at Democrats and Republicans separately. We'll create an object just for Democrats in the 110th Congress. and then one for Republicans.

dems110 <- subset(voteview, party == 100 & cong == 110)

reps110 <- subset(voteview, party == 200 & cong == 110)

Now let's create a graph to compare them.

plot (c (-1.5, 1.5), c(-1.5, 1.5), type = 'n',
xlab = "1st dimension",
ylab = "2nd dimension",
col.axis = "#777777",
col.lab = "#777777",
cex.axis = 0.75,
cex.lab = 1.25,
main = "DW-nominate scores, 110th Congress",
col.main = "#444444")
abline (v = 0, col = "#cccccc")
points (dwnom2 ~ dwnom1, data = dems110, pch = "D", col = "blue", cex = 0.75)
points (dwnom2 ~ dwnom1, data = reps110, pch = "R", col = "red", cex = 0.75)

That's all a bit complicated, so next time I'll talk about what all those things do. But for now I'll just show what it looks like.

Figure 1. The polarization of Congress.

A bit about linear models

0 comments Posted by Jim 5:23 AM

Before we delve into slightly more advanced plotting commands I want to talk a little about linear models, specifically, linear regression. In R this is very, very simple. For instance, in our 'states' data frame, we might want to look at median household income as a predictor of state education expenditures. The command lm calculates this for us. We'll call our first model, 'model1':

model1 <- lm (publicedexp~hincome)

OK, great, but where are our results? One of the things about R is that you can assign names to all sorts of things, even models. That way, you can continually refer to them when doing other things (as we'll see a bit later.) The way to look at our results is with this:

summary (model1)

Call:
lm(formula = publicedexp ~ hincome)

Residuals:
Min 1Q Median 3Q Max
-397.50 -127.43 -8.69 120.96 431.85

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.516e+02 1.735e+02 1.450 0.153
hincome 2.346e-02 3.869e-03 6.063 1.87e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 198.8 on 49 degrees of freedom
Multiple R-squared: 0.4287, Adjusted R-squared: 0.417
F-statistic: 36.76 on 1 and 49 DF, p-value: 1.87e-07

Which gives us a lot more information than if we'd just run the lm command without assigning a name to the model. Later we'll look at how we can integrate our linear model with our plots.

Reading data, and a graph

Thursday, June 25, 2009 0 comments Posted by Jim 11:59 AM

Using Microsoft Excel I'm collecting aggregate data, by state, of various social, political, and economic indicators. I export them into a tab-delimited file called 'states.txt' (pretty clever, I know.) I've got data on education expenditures, firearm deaths per capita, median household income, etc. I'd like to do some analysis and graphing of these data to see if there are any patterns of interest.

First, I import the data into R with the following command:

states <- read.table("C:/Data/states.txt", header=TRUE, sep = "\t")

This creates a data frame called 'states', and reads the text file into it. The command 'header = TRUE' tells R that the first row of data contains variable names, and 'sep = "\t"' tells the program that the file is tab-demimited.

Next I 'attach' the data frame with the following command:

attach (states)

OK, now that I've got my data into R, what can I do with it?

First, I'll run some correlations and see what's going on.

cor (read2children, publicedexp)
[1] 0.4211508

This tells me that the correlation between public expenditures on education and the percentage of children below the age of five who are read to daily is 0.42. It is unsurprising that there's a strong relationship between the two. It's also likely that a third variable, household income, might be related.

cor (hincome, publicedexp)
[1] 0.6547179
cor (read2children, hincome)
[1] 0.4094883

These relationships are even stronger.

Let's look at the data a different way, by using scatterplots to see the relationships.

plot is the R command for, well, plotting. It's very powerful and you can do lots of things with it. First I'll do a very simple scatterplot of education expenditures and children read to.

plot (publicedexp, read2children)

Figure 1. A very basic scatterplot

Next time we'll work on making it better looking.

A start

Wednesday, June 24, 2009 0 comments Posted by Jim 11:49 AM

I've decided that this summer I will finally break down and force myself to learn a little bit about using R. I currently use Stata, a very good program, but the idea of R is appealing since it's free under the GNU license. It has a large and active user community and is constantly growing and expanding.

I'll be using this blog to keep track of my progress and to record what I'm doing along the way.

Learning R