README.txt (nptools v1.01, 7/7/2008)
1. Overview
2. Command line programs
3. Tests and examples
****************************************************************
1. Overview:
****************************************************************
nptools is a C library and a set of command-line programs that compute
multivariate and univariate nonparametric densities and perform various
computations with them. One of the most interesting is the computation
of counterfactual densities of the sort reported in McKee and Todd
(2007); in fact, this software was originally written for that paper.
I've tried to make the C code relatively readable, and the external C
interface is documented in nptools.h. The command-line programs are
documented in detail below.
mvkdensity: Multivariate kernel density estimation
mvkdensity0: Estimates densities that have probability mass at zero
but are continuous elsewhere.
mvkreg: Multivariate kernel regression
cdensity: Estimates conditional densities
cfdensity: Computes the counterfactual distribution of y given a
set of observations of y and x and a counterfactual
distribution of x's
cfdensity0: Adds support for probability mass at zero
ftocdf: Converts an estimated density into a cumulative
distribution
fintegrate: Integrates an estimated density (Used mostly for testing)
fsumm: Computes several summary statistics for an estimated
density
fpoverty: Computes several poverty measures
fsummcol: Outputs statistics in a single column suitable for
copying and pasting into a spreadsheet
colcat: Combines files composed of tabular data (including single
columns) and appends them sideways for importation into a
spreadsheet
splotify: Massages bivariate density data to make it easier to
plot as a surface using the gnuplot software
Don't hesitate to contact me (mckeedm@pop.upenn.edu) if you have
questions, comments, or suggestions!
****************************************************************
2. Command line programs:
****************************************************************
** mvkdensity DATAFILE WEIGHTFILE EVALFILE OUTFILE KERN
** [BWSMETHOD | BW1...BWK]
DATAFILE: Ndata x K file of observations; Ndata is the number of
observations and K is the number of dimensions. The format is
one observation per line with variables delimited by
whitespace.
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
EVALFILE: Neval x K file of points where the density is evaluated
OUTFILE: Neval x (K+1) file of points and computed densities at each point
KERN: kernel
BWSMETHOD: automatic bandwidth selection method
BW1...BWK: A vector of bandwidths to be used
mvkdensity computes the multivariate kernel density. The kernel can be
either "gaussian" for a normal kernel or "epan" for an Epanechnikov.
The bandwidth for each dimension can be automatically chosen or manually
chosen, but not both. The automatic bandwidth selectors work separately
on each dimension of the data, and there are three choices:
silverman: Silverman's rule of thumb for multivariate bandwidth
selection from Silverman (1986) pp. 86-87.
silverman2: Silverman's rule of thumb for univariate densities. It is
a little more robust than the multivariate selector. Stata's
kdensity command uses this method.
sjpi: Sheather-Jones plugin. I call Fortran code that I believe was
written by Sheather himself to calculate the Sheather-Jones
"solve-the-equation" bandwidth for a Gaussian kernel estimate.
(Sheather and Jones, Journal of the Royal Statistical Society,
Series B, 1991, Vol. 53, pp. 683-690).
The weights are immediately normalized to sum to the number of
observations and just specify the relative weighting of the individual
observations. Note that the weights are used for computing silverman
and silverman2 bandwidths, but NOT used in the computation of sjpi
bandwidths.
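For example, to estimate an unweighted bivariate density with a Gaussian
kernel and Silverman bandwidths (the file names here are made up for
illustration; they are not shipped with nptools):

    mvkdensity obs.txt - grid.txt dens.out gaussian silverman

or, with weights and manually chosen bandwidths for each of the two
dimensions:

    mvkdensity obs.txt wts.txt grid.txt dens.out epan 0.40 0.25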
** mvkdensity0 DATAFILE WEIGHTFILE EVALFILE OUTFILE ZMOUTFILE
** KERN [BWSMETHOD | BW1...BWK]
DATAFILE: Ndata x K file of observations; Ndata is the number of
observations and K is the number of dimensions. The format is
one observation per line with variables delimited by
whitespace.
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
EVALFILE: Neval x K file of points where the density is evaluated
OUTFILE: Neval x (K+1) file of points and computed densities at each point
ZMOUTFILE: 1 x 1 file of probability mass at zero
KERN: kernel
BWSMETHOD: automatic bandwidth selection method
BW1...BWK: A vector of bandwidths to be used
Like mvkdensity, mvkdensity0 computes a multivariate kernel density.
The difference is that it allows for probability mass at [0,...,0]. The
continuous density it computes is conditional on x!=[0,...,0].
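A sketch of a typical invocation, again with made-up file names:

    mvkdensity0 obs.txt - grid.txt dens.out zmass.out gaussian silverman

dens.out then contains the continuous density conditional on
x!=[0,...,0] and zmass.out contains the estimated probability mass at
zero.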
** mvkreg DATAFILE WEIGHTFILE EVALFILE OUTFILE
** KERN [BWSMETHOD | BW1...BWK]
DATAFILE: Ndata x (K+1) file of observations (y x1 ... xK)
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
EVALFILE: Neval x K file of points (x's) where E[Y|x] is computed
OUTFILE: Neval x (K+1) file of x's and E[Y|x] at each point
KERN: kernel
BWSMETHOD: automatic bandwidth selection method
BW1...BWK: A vector of bandwidths to be used
mvkreg uses nonparametric kernel regression to compute E[Y|x] for the
set of points in EVALFILE. The conditional relationship between Y and x
is approximated using the observations in DATAFILE.
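For example, with a single regressor (K=1), a manually chosen bandwidth
of 0.5, and made-up file names:

    mvkreg yx.txt - xgrid.txt ey.out epan 0.5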
** cdensity DATAFILE WEIGHTFILE EVALFILE OUTFILE
** KERN [BWSMETHOD | BW1...BWK]
DATAFILE: Ndata x (K+1) file of observations (y x1 ... xK)
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
EVALFILE: Neval x (K+1) obs where the conditional density is evaluated
OUTFILE: Neval x (K+2) file of points and computed densities at each point
KERN: kernel
BWSMETHOD: automatic bandwidth selection method
BW1...BWK: A vector of bandwidths to be used
cdensity computes the density of y conditional on x at the points in
EVALFILE. Like mvkreg, the conditional relationship is approximated
using the observations in DATAFILE.
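A hypothetical invocation (file names are placeholders):

    cdensity yx.txt - evalpts.txt cdens.out gaussian silverman

where each line of evalpts.txt is a (y, x1, ..., xK) point at which the
conditional density is evaluated.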
** cfdensity DATAFILE WEIGHTFILE CFDATAFILE CFWEIGHTFILE
** EVALFILE OUTFILE KERN [BWSMETHOD | BWY BWX1...BWXK]
DATAFILE: Ndata x (K+1) file of observations from the original data.
The first variable is y, the rest are the x's. It's used to
compute conditional densities (y|x) as well as the density
for the original distribution of x's.
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
CFDATAFILE: Ncfdata x K file of x's that defines
the counterfactual distribution
CFWEIGHTFILE: Ncfdata x 1 weights or "-" if no weights are specified for the
counterfactual data
EVALFILE: Neval x 1 file of points (scalar y's) where the density is evaluated
OUTFILE: Neval x 2 file of y's and computed densities at each point
KERN: kernel
BWSMETHOD: automatic bandwidth selection method
BWY BWX1...BWXK: A vector of bandwidths to be used to estimate
density of y,x.
cfdensity computes the counterfactual distribution of y given a set of
observations of y and x and a counterfactual distribution of x's. The
method is described in detail in McKee and Todd (2007).
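A sketch of a typical run, with made-up file names:

    cfdensity yx.txt - cfx.txt - ygrid.txt cfdens.out gaussian silverman

This evaluates the counterfactual density of y at the points in
ygrid.txt, using the original (y,x) observations in yx.txt and the
counterfactual x's in cfx.txt.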
** cfdensity0 DATAFILE WEIGHTFILE CFDATAFILE CFWEIGHTFILE
** EVALFILE OUTFILE ZMOUTFILE KERN
** [BWSMETHOD | BWX1...BWXK BWYNZ BWXNZ1...BWXNZK]
DATAFILE: Ndata x (K+1) file of observations from the original data.
The first variable is y, the rest are the x's. It's used to
compute conditional densities (y|x) as well as the density
for the original distribution of x's.
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
CFDATAFILE: Ncfdata x K file of x's that defines
the counterfactual distribution
CFWEIGHTFILE: Ncfdata x 1 weights or "-" if no weights are specified for the
counterfactual data
EVALFILE: Neval x 1 file of points (scalar y's) where the density is evaluated
OUTFILE: Neval x 2 file of y's and computed densities at each point
ZMOUTFILE: 1 x 1 file of probability mass at zero
KERN: kernel
BWSMETHOD: automatic bandwidth selection method
BWX1...BWXK: A vector of manually chosen bandwidths to be used to estimate
density of the x's unconditional on y>0
BWYNZ BWXNZ1...BWXNZK: A vector of manually chosen bandwidths to be used
to estimate density of y,x conditional on y>0.
cfdensity0 extends cfdensity to account for distributions that have a
point mass at zero. The counterfactual density that is computed has two
parts. The first is the density of y conditional on y!=0 and the second
is the probability that y=0. Details of the algorithm can be found in
McKee and Todd (2007).
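A hypothetical invocation (file names are placeholders):

    cfdensity0 yx.txt - cfx.txt - ygrid.txt cfdens.out zmass.out \
        gaussian silverman

cfdens.out receives the counterfactual density of y conditional on
y!=0, and zmass.out receives the counterfactual probability that y=0.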
** ftocdf FFILE CDFFILE
FFILE: N x 2 file of values for y and a density evaluated at y
CDFFILE: N x 2 file of values for y and the corresponding cumulative
distribution function
Starting with the density, ftocdf computes the corresponding cdf.
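For example (file names are placeholders):

    ftocdf dens.out cdf.out

The first column of cdf.out repeats the y's from dens.out; the second
column is the cdf at each y.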
** fintegrate FFILE [START END]
FFILE: N x 2 file of values for y and a density evaluated at y
START, END: optional starting and ending points. Defaults to -inf, +inf.
Used mostly for testing, fintegrate integrates the density using a
version of Simpson's rule that's been generalized to deal with
potentially unequal spacing between the x's. Densities should integrate
to approximately 1.
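For example, to check that an estimated density integrates to roughly 1,
or to compute the mass on [0, 2] (dens.out is a placeholder name):

    fintegrate dens.out
    fintegrate dens.out 0 2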
** fsumm FFILE [ZMFILE]
FFILE: N x 2 file of values for y and a density evaluated at y
ZMFILE: optional 1 x 1 file that contains Pr[Y=0]
fsumm takes a univariate density f and computes various summary
statistics in human-readable form. If a zero-mass probability is
specified, then f is taken to be conditional on Y != 0. In particular,
it reports Pr[Y=0], the integral of f (should be approximately 1), E[Y],
StdDev[Y], quantiles (0.1, 0.5, 0.9), the interquartile range, the 0.1
to 0.9 range, the coefficient of variation, the Gini coefficient, and
the Theil coefficient of entropy.
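For example, with and without a zero-mass file (placeholder names):

    fsumm dens.out
    fsumm dens.out zmass.out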
** fpoverty POVERTYLINE FFILE [ZMFILE]
POVERTYLINE: Individuals with values below this line
are considered to be in poverty
FFILE: N x 2 file: x f(x)
ZMFILE: 1 x 1 file: Optional probability that x == 0
fpoverty takes a poverty line and univariate density (usually signifying
income) and computes 3 poverty measures: The Headcount Ratio is the
fraction of the population below the poverty line. The Average Poverty
Gap Ratio is the mean shortfall between an individual's income and the
poverty line (with those above the poverty line having no shortfall)
expressed as a fraction of the poverty line. The Foster-Greer-Thorbecke
Index is a version of the Average Poverty Gap Ratio that gives more
weight to poorer individuals.
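For example, with an arbitrary poverty line of 1.0 and placeholder file
names:

    fpoverty 1.0 dens.out zmass.out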
** fsummcol POVERTYLINE FFILE [ZMFILE]
POVERTYLINE: Individuals with values below this line
are considered to be in poverty
FFILE: N x 2 file: x f(x)
ZMFILE: 1 x 1 file: Optional probability that x == 0
fsummcol reports the same statistics as fsumm and fpoverty, but in a
more machine-readable way. Specifically, it prints out the following
statistics with no headings in a single column. colcat is useful for
pasting several such columns together to create tables that can be
imported into a spreadsheet:
Integral
Pr(X=0)
mean
sd
q(0.1)
q(0.5)
q(0.9)
IQR
0.1-0.9 range
coef of var
Gini
Theil
Headcount Ratio
Average Poverty Gap Ratio
Foster-Greer-Thorbecke Index
Users will probably want to customize this list of statistics (in
src/fsummcol.c) and recompile the program.
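For example, to produce one column of statistics for each of two
counterfactuals (the poverty line and file names are made up):

    fsummcol 1.0 dens_base.out zmass_base.out > col_base.txt
    fsummcol 1.0 dens_cf.out zmass_cf.out > col_cf.txt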
** colcat MATFILE1 ... MATFILEN
MATFILE's: The MATFILE's are tabular data where columns are separated by
whitespace. Each file must have the same number of rows.
colcat combines files of tabular data and appends them sideways for
importation into a spreadsheet. It's included in this suite because
it's particularly useful for appending output from several runs of
fsummcol into a single table. I do this when I want to report summary
statistics for several counterfactuals.
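Continuing the hypothetical fsummcol example above, and assuming the
combined table goes to standard output:

    colcat col_base.txt col_cf.txt > summary_table.txt

summary_table.txt then has one column per counterfactual and can be
imported into a spreadsheet.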
** splotify < INFILE > OUTFILE
When plotting surfaces, the gnuplot software likes blank lines
separating rows. This filter inserts these. It's particularly useful
for plotting either bivariate densities (from mvkdensity) or univariate
densities that are conditional on another x (from cdensity). I use
something like the following commands to plot the surface from gnuplot:
gnuplot> set style data lines
gnuplot> set hidden3d
gnuplot> splot 'OUTFILE'
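For example, a complete (if hypothetical) pipeline for plotting a
bivariate density estimated by mvkdensity:

    mvkdensity obs.txt - grid.txt dens.out gaussian silverman
    splotify < dens.out > dens.splot

followed by splot 'dens.splot' from the gnuplot prompt as above.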
****************************************************************
3. Tests and examples:
****************************************************************
The test directory contains some Stata code and shell scripts to test
and demonstrate some of the command line programs. Basically, the Stata
code simulates data from some known distribution and the nptools
programs try to recover the distribution. All five tests read from and
write to their own directories (t1,...,t5).
** test1.do
test1.do first simulates a joint distribution of 2 independent
variables. The first is a 50/50 mixture of a N(3,1) and a N(0,1) and
the second is N(0,1). Then it approximates the conditional density with
cdensity and the joint density with mvkdensity.
** test2.do
test2.do simulates a univariate standard normal and recovers its density
with mvkdensity and Stata's built-in kdensity.
** test3.do
test3.do simulates five univariate distributions, four of which have
probability mass at zero:
EXP1 ~ exp(1)
N11c ~ N(1,1) censored at 0 (Pr(=0) = 0.159)
N41c ~ N(4,1) censored at 0 (Pr(=0) = 3e-5)
MIX0131c ~ 0.5 N(0,1) + 0.5 N(3,1) censored at 0 (Pr(=0) = 0.251)
MIX041c ~ 0.5 (point mass at 0) + 0.5 N(4,1), censored at 0 (Pr(=0) = 0.50)
It then recovers the distributions not accounting for mass at zero
(using mvkdensity) and accounting for mass at zero (using mvkdensity0).
Summary statistics are computed using fintegrate and fsumm.
** test4.do
test4.do simulates a simple system:
Y = X + e
X ~ N(0,1)
e ~ N(0,1)
This implies that Y ~ N(0,2), Cov(X,Y) = 1, and Corr(X,Y) = sqrt(2)/2.
test4.do then computes univariate densities for X and Y and uses mvkreg
to compute E[Y|x] as a function of x. mvkdensity and cdensity are used to
compute the joint density of X and Y and the conditional density of y|x.
The test4*.sh scripts are compute-intensive. On my 3.8 GHz Pentium D,
each took about 35 minutes. The script test4cf_orig.sh uses cfdensity
to compute the density of y given the original distribution of x's.
This is a test of the algorithm more than anything else, as it simply
yields the original distribution of y. The test4cf_xplus0p5.sh and
test4cf_xplus1.sh scripts compute the counterfactual distribution of Y
conditional on X having its original distribution shifted up by 0.5 and
1.0 respectively. As expected, the counterfactual distributions are
also shifted.
** test5.do
test5.do simulates a slightly more complicated system:
X ~ N(0,0.25)
e1 ~ N(0,1), e2 ~ N(0,1)
Y = 0 if X + e1 < 0
Y = exp(X + e2) if X + e1 >= 0 and exp(X + e2)<=20
Y = 20 o.w.
This implies that half the observations of Y are zero and the other half
are top-coded at 20. The top-coding only affects 27 of 10,000
observations. Think of this as an employment equation and a wage
equation with uncorrelated error terms.
test5.do computes univariate densities for Y using both mvkdensity and
mvkdensity0. The test5*.sh scripts compute counterfactuals very similar
to those computed for test4, but there are 6 (instead of 3) scripts
because one set accounts for mass at zero (test5cf0_*.sh) and the other
does not (test5cf_*.sh). Like the counterfactual scripts in test 4,
these are compute-intensive. The scripts that don't account for zero
took about 3 hours each and the ones that do took about 1.5 hours each.