README.txt (nptools v1.01, 7/7/2008)

1. Overview
2. Command line programs
3. Tests and examples

****************************************************************
1. Overview:
****************************************************************

nptools is a C library and a set of command-line programs that compute
multivariate and univariate nonparametric densities and do various things
with them. One of the most interesting is the computation of counterfactual
densities of the sort reported in McKee and Todd (2007); in fact, this
software was originally written for that paper.

I've tried to make the C code relatively readable, and the external C
interface is documented in nptools.h. The command-line programs are
documented in detail below.

mvkdensity:  Multivariate kernel density estimation
mvkdensity0: Estimates densities that have probability mass at zero but
             are continuous elsewhere
mvkreg:      Multivariate kernel regression
cdensity:    Estimates conditional densities
cfdensity:   Computes the counterfactual distribution of y given a set of
             observations of y and x and a counterfactual distribution of x's
cfdensity0:  Adds support for probability mass at zero
ftocdf:      Converts an estimated density into a cumulative distribution
fintegrate:  Integrates an estimated density (used mostly for testing)
fsumm:       Computes several summary statistics for an estimated density
fpoverty:    Computes several poverty measures
fsummcol:    Outputs statistics in a single column suitable for copying and
             pasting into a spreadsheet
colcat:      Combines files of tabular data (including single columns) and
             appends them sideways for importation into a spreadsheet
splotify:    Massages bivariate density data to make it easier to plot as a
             surface using the gnuplot software

Don't hesitate to contact me (mckeedm@pop.upenn.edu) if you have questions,
comments, or suggestions!

****************************************************************
2. Command line programs:
****************************************************************

** mvkdensity DATAFILE WEIGHTFILE EVALFILE OUTFILE KERN
**            [BWSMETHOD | BW1...BWK]

DATAFILE:   Ndata x K file of observations; Ndata is the number of
            observations and K is the number of dimensions. The format is
            one observation per line with variables delimited by whitespace.
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
EVALFILE:   Neval points where the density is evaluated
OUTFILE:    Neval x (K+1) file of points and computed densities at each point
KERN:       kernel
BWSMETHOD:  automatic bandwidth selection method
BW1...BWK:  A vector of bandwidths to be used

mvkdensity computes the multivariate kernel density. The kernel can be
either "gaussian" for a normal kernel or "epan" for an Epanechnikov. The
bandwidth for each dimension can be chosen automatically or manually, but
not both. The automatic bandwidth selectors work separately on each
dimension of the data, and there are three choices:

silverman:  Silverman's rule of thumb for multivariate bandwidth selection
            from Silverman (1986), pp. 86-87.
silverman2: Silverman's rule of thumb for univariate densities. It is a
            little more robust than the multivariate selector. Stata's
            kdensity command uses this method.
sjpi:       Sheather-Jones plugin. I call Fortran code that I believe was
            written by Sheather himself to calculate the Sheather-Jones
            "solve-the-equation" bandwidth for a Gaussian kernel estimate
            (Journal of the Royal Statistical Society, Series B, 1991,
            Vol. 53, pp. 683-690).
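For example, assuming hypothetical input files data.txt and eval.txt, an
unweighted bivariate density with a Gaussian kernel could be computed with
either an automatic bandwidth selector or manually chosen bandwidths (one
per column of DATAFILE):

    mvkdensity data.txt - eval.txt fhat.txt gaussian silverman
    mvkdensity data.txt - eval.txt fhat.txt gaussian 0.25 0.40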
The weights are immediately normalized to sum to the number of observations
and just specify the relative weighting of the individual observations. Note
that the weights are used for computing silverman and silverman2 bandwidths,
but NOT for computing sjpi bandwidths.

** mvkdensity0 DATAFILE WEIGHTFILE EVALFILE OUTFILE ZMOUTFILE
**             KERN [BWSMETHOD | BW1...BWK]

DATAFILE:   Ndata x K file of observations; Ndata is the number of
            observations and K is the number of dimensions. The format is
            one observation per line with variables delimited by whitespace.
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
EVALFILE:   Neval points where the density is evaluated
OUTFILE:    Neval x (K+1) file of points and computed densities at each point
ZMOUTFILE:  1 x 1 file of probability mass at zero
KERN:       kernel
BWSMETHOD:  automatic bandwidth selection method
BW1...BWK:  A vector of bandwidths to be used

Like mvkdensity, mvkdensity0 computes a multivariate kernel density. The
difference is that it allows for probability mass at [0,...,0]. The
continuous density it computes is conditional on x != [0,...,0].

** mvkreg DATAFILE WEIGHTFILE EVALFILE OUTFILE
**        KERN [BWSMETHOD | BW1...BWK]

DATAFILE:   Ndata x (K+1) file of observations (y x1 ... xK)
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
EVALFILE:   Neval points (x's) where E[Y|x] is computed
OUTFILE:    Neval x (K+1) file of x's and E[Y|x] at each point
KERN:       kernel
BWSMETHOD:  automatic bandwidth selection method
BW1...BWK:  A vector of bandwidths to be used

mvkreg uses nonparametric kernel regression to compute E[Y|x] for the set of
points in EVALFILE. The conditional relationship between Y and x is
approximated using the observations in DATAFILE.

** cdensity DATAFILE WEIGHTFILE EVALFILE OUTFILE
**          KERN [BWSMETHOD | BW1...BWK]

DATAFILE:   Ndata x (K+1) file of observations (y x1 ... xK)
WEIGHTFILE: Ndata x 1 weights or "-" if no weights are specified
EVALFILE:   Neval x (K+1) observations where the conditional density is
            evaluated
OUTFILE:    Neval x (K+2) file of points and computed densities at each point
KERN:       kernel
BWSMETHOD:  automatic bandwidth selection method
BW1...BWK:  A vector of bandwidths to be used

cdensity computes the density of y conditional on x at the points in
EVALFILE. Like mvkreg, the conditional relationship is approximated using
the observations in DATAFILE.

** cfdensity DATAFILE WEIGHTFILE CFDATAFILE CFWEIGHTFILE
**           EVALFILE OUTFILE KERN [BWSMETHOD | BWY BWX1...BWXK]

DATAFILE:     Ndata x (K+1) file of observations from the original data.
              The first variable is y, the rest are the x's. It's used to
              compute conditional densities (y|x) as well as the density
              for the original distribution of x's.
WEIGHTFILE:   Ndata x 1 weights or "-" if no weights are specified
CFDATAFILE:   Ncfdata x K file of x's that defines the counterfactual
              distribution
CFWEIGHTFILE: Ncfdata x 1 weights or "-" if no weights are specified for
              the counterfactual data
EVALFILE:     The Neval points (scalar y's) where the density is evaluated
OUTFILE:      Neval x 2 file of y's and computed densities at each point
KERN:         kernel
BWSMETHOD:    automatic bandwidth selection method
BWY BWX1...BWXK: A vector of bandwidths to be used to estimate the density
              of (y,x)

cfdensity computes the counterfactual distribution of y given a set of
observations of y and x and a counterfactual distribution of x's. The method
is described in detail in McKee and Todd (2007).
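Roughly speaking (see the paper for the exact estimator), the counterfactual
density being targeted is

    f_cf(y) = integral of f(y|x) dF_cf(x),

i.e. the density of y conditional on x, estimated from DATAFILE, averaged
over the counterfactual distribution of the x's defined by CFDATAFILE (and
CFWEIGHTFILE).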
** cfdensity0 DATAFILE WEIGHTFILE CFDATAFILE CFWEIGHTFILE
**            EVALFILE OUTFILE ZMOUTFILE KERN
**            [BWSMETHOD | BWX1...BWXK BWYNZ BWXNZ1...BWXNZK]

DATAFILE:     Ndata x (K+1) file of observations from the original data.
              The first variable is y, the rest are the x's. It's used to
              compute conditional densities (y|x) as well as the density
              for the original distribution of x's.
WEIGHTFILE:   Ndata x 1 weights or "-" if no weights are specified
CFDATAFILE:   Ncfdata x K file of x's that defines the counterfactual
              distribution
CFWEIGHTFILE: Ncfdata x 1 weights or "-" if no weights are specified for
              the counterfactual data
EVALFILE:     The Neval points (scalar y's) where the density is evaluated
OUTFILE:      Neval x 2 file of y's and computed densities at each point
ZMOUTFILE:    1 x 1 file of probability mass at zero
KERN:         kernel
BWSMETHOD:    automatic bandwidth selection method
BWX1...BWXK:  A vector of manually chosen bandwidths to be used to estimate
              the density of the x's unconditional on y>0
BWYNZ BWXNZ1...BWXNZK: A vector of manually chosen bandwidths to be used to
              estimate the density of (y,x) conditional on y>0

cfdensity0 extends cfdensity to account for distributions that have a point
mass at zero. The counterfactual density that is computed has two parts: the
first is the density of y conditional on y != 0, and the second is the
probability that y = 0. Details of the algorithm can be found in McKee and
Todd (2007).

** ftocdf FFILE CDFFILE

FFILE:   N x 2 file of values for y and a density evaluated at y
CDFFILE: N x 2 file of values for y and the corresponding cumulative
         distribution function

Starting with the density, ftocdf computes the corresponding cdf.

** fintegrate FFILE [START END]

FFILE:      N x 2 file of values for y and a density evaluated at y
START, END: optional starting and ending points. Defaults to -inf, +inf.

Used mostly for testing, fintegrate integrates the density using a version
of Simpson's rule that has been generalized to handle potentially unequal
spacing between the x's. Densities should integrate to approximately 1.

** fsumm FFILE [ZMFILE]

FFILE:  N x 2 file of values for y and a density evaluated at y
ZMFILE: optional 1 x 1 file that contains Pr[Y=0]

fsumm takes a univariate density f and computes various summary statistics
in human-readable form. If a zero mass probability is specified, then f is
taken to be conditional on x != 0. In particular, it reports Pr[Y=0], the
integral of f (should be approximately 1), E[Y], StdDev[Y], quantiles (0.1,
0.5, 0.9), the interquartile range, the 0.1 to 0.9 range, the coefficient
of variation, the Gini coefficient, and the Theil coefficient of entropy.

** fpoverty POVERTYLINE FFILE [ZMFILE]

POVERTYLINE: Individuals with values below this line are considered to be
             in poverty
FFILE:       N x 2 file: x f(x)
ZMFILE:      1 x 1 file: optional probability that x == 0

fpoverty takes a poverty line and a univariate density (usually of income)
and computes three poverty measures. The Headcount Ratio is the fraction of
the population below the poverty line. The Average Poverty Gap Ratio is the
mean shortfall between an individual's income and the poverty line (with
those above the poverty line having no shortfall), expressed as a fraction
of the poverty line. The Foster-Greer-Thorbecke Index is a version of the
Average Poverty Gap Ratio that gives more weight to poorer individuals.
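For reference, these measures correspond to the standard definitions. With
poverty line z and density f (ignoring, for brevity, how the optional zero
mass is folded in):

    Headcount Ratio:           integral over x < z of f(x) dx
    Average Poverty Gap Ratio: integral over x < z of ((z - x)/z) f(x) dx
    Foster-Greer-Thorbecke:    integral over x < z of ((z - x)/z)^2 f(x) dx

The squared-gap (alpha = 2) form of the Foster-Greer-Thorbecke index shown
here is the usual "more weight to poorer individuals" member of the family;
it is an assumption about the exponent, so check the source for the exact
form used.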
** fsummcol POVERTYLINE FFILE [ZMFILE]

POVERTYLINE: Individuals with values below this line are considered to be
             in poverty
FFILE:       N x 2 file: x f(x)
ZMFILE:      1 x 1 file: optional probability that x == 0

fsummcol reports the same statistics as fsumm and fpoverty, but in a more
machine-readable way. Specifically, it prints the following statistics,
with no headings, in a single column. colcat is useful for pasting these
columns together to create tables that can be imported into a spreadsheet:

    Integral
    Pr(X=0)
    mean:
    sd:
    q(0.1):
    q(0.5):
    q(0.9):
    IQR:
    1090:
    coef of var:
    Gini:
    Theil:
    Headcount Ratio:
    Average Poverty Gap Ratio:
    Foster-Greer-Thorbecke Index:

Users will probably want to customize this list of statistics (in
src/fsummcol.c) and recompile the program.

** colcat MATFILE1 ... MATFILEN

MATFILE's: The MATFILE's are tabular data where columns are separated by
           whitespace. Each file must have the same number of rows.

colcat combines files of tabular data and appends them sideways for
importation into a spreadsheet. It's included in this suite because it's
particularly useful for appending output from several runs of fsummcol into
a single table. I do this when I want to report summary statistics for
several counterfactuals.

** splotify < INFILE > OUTFILE

When plotting surfaces, the gnuplot software likes blank lines separating
rows. This filter inserts these. It's particularly useful for plotting
either bivariate densities (from mvkdensity) or univariate densities that
are conditional on another x (from cdensity). I use something like the
following commands to plot the surface from gnuplot:

    gnuplot> set style data lines
    gnuplot> set hidden3d
    gnuplot> splot 'OUTFILE'

****************************************************************
3. Tests and examples:
****************************************************************

The test directory contains some Stata code and shell scripts to test and
demonstrate some of the command line programs. Basically, the Stata code
simulates data from a known distribution and the nptools programs try to
recover the distribution. All five tests read and write from their own
directories (t1,...,t5).

** test1.do

test1.do first simulates a joint distribution of 2 independent variables.
The first is a 50/50 mixture of a N(3,1) and a N(0,1), and the second is
N(0,1). Then it approximates the conditional density with cdensity and the
joint density with mvkdensity.

** test2.do

test2.do simulates a univariate standard normal and recovers its density
with mvkdensity and Stata's built-in kdensity.

** test3.do

test3.do simulates five univariate distributions, four of which have
probability mass at zero:

    EXP1     ~ exp(1)
    N11c     ~ N(1,1) censored at 0 (Pr(=0) = 0.159)
    N41c     ~ N(4,1) censored at 0 (Pr(=0) = 3e-5)
    MIX0131c ~ 0.5 N(0,1) + 0.5 N(3,1), censored at 0 (Pr(=0) = 0.251)
    MIX041c  ~ 0.5 (point mass at 0) + 0.5 N(4,1), censored at 0
               (Pr(=0) = 0.50)

It then recovers the distributions without accounting for mass at zero
(using mvkdensity) and accounting for mass at zero (using mvkdensity0).
Summary statistics are computed using fintegrate and fsumm.

** test4.do

test4.do simulates a simple system:

    Y = X + e
    X ~ N(0,1)
    e ~ N(0,1)

This implies that Y ~ N(0,2), Cov(X,Y) = 1, and Corr(X,Y) = sqrt(2)/2.
test4.do then computes univariate densities for X and Y and uses mvkreg to
compute E[Y] as a function of X. mvkdensity and cdensity are used to compute
the joint density of X and Y and the conditional density of y|x.

The test4*.sh scripts are compute-intensive. On my 3.8 GHz Pentium D, each
took about 35 minutes.
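For reference, the quantities the test4 scripts try to recover are known
exactly: since e is independent of X with mean zero, E[Y|X=x] = x and the
conditional density of y given X=x is N(x,1), so the mvkreg output should
lie close to the 45-degree line.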
The script test4cf_orig.sh uses cfdensity to compute the density of y given
the original distribution of x's. This is a test of the algorithm more than
anything else, as it should yield the original distribution of y. The
test4cf_xplus0p5.sh and test4cf_xplus1.sh scripts compute the counterfactual
distribution of Y conditional on X having its original distribution shifted
up by 0.5 and 1.0, respectively. As expected, the counterfactual
distributions are also shifted.

** test5.do

test5 simulates a slightly more complicated system:

    X  ~ N(0,0.25)
    e1 ~ N(0,1), e2 ~ N(0,1)
    Y = 0            if X + e1 < 0
    Y = exp(X + e2)  if X + e1 >= 0 and exp(X + e2) <= 20
    Y = 20           otherwise

This implies that half the observations of Y are zero; the rest are
continuous but top-coded at 20, although the top-coding only affects 27 of
10,000 observations. Think of this as an employment equation and a wage
equation with uncorrelated error terms.

test5.do computes univariate densities for Y using both mvkdensity and
mvkdensity0.

The test5*.sh scripts compute counterfactuals very similar to those computed
for test4, but there are 6 (instead of 3) scripts because one set accounts
for mass at zero (test5cf0_*.sh) and the other does not (test5cf_*.sh). Like
the counterfactual scripts in test 4, these are compute-intensive. The
scripts that don't account for zero took about 3 hours each and the ones
that do took about 1.5 hours each.
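Finally, here is a sketch of how the pieces fit together outside the test
scripts. The file names are hypothetical; the command syntax is as
documented above:

    # estimate a univariate income density that has mass at zero
    mvkdensity0 income.txt weights.txt eval.txt fhat.txt zmass.txt epan silverman2

    # human-readable summary statistics
    fsumm fhat.txt zmass.txt

    # machine-readable columns of statistics (poverty line = 1.0),
    # one per counterfactual, pasted side by side for a spreadsheet
    fsummcol 1.0 fhat.txt zmass.txt > col_orig.txt
    fsummcol 1.0 fhat_cf.txt zmass_cf.txt > col_cf.txt
    colcat col_orig.txt col_cf.txt > table.txt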