HORVITZ-THOMPSON ESTIMATOR
An Unordered Estimator
ANGEL LAL-2148121
1MSTAT - Christ (Deemed to be University)
INTRODUCTION
The Horvitz-Thompson estimator is an unbiased estimator of
the population total. It is categorized as an unordered
estimator because it does not depend on the order in which the
units are drawn into the sample. The estimator is named after Daniel G. Horvitz and Donovan J. Thompson,
who derived it in 1952. It is mostly applied in survey analysis, where it can
account for missing data as well as for many sources of unequal selection
probabilities. Unordered estimators of this kind can be shown to be at least as efficient
as the corresponding ordered estimators.
METHODOLOGY
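Suppose that the initial
probability of selection of the unit $U_i$ is $p_i$, where $p_i = x_i/x$ for i = 1,2,...,N, with $x_i$ a size measure of $U_i$ and $x = \sum_{i=1}^{N} x_i$. The probability that the
unit $U_i$ is included in the sample is the sum of the probabilities $P(s)$ of all samples $s$ that contain $U_i$:
$$\pi_i = \sum_{s \ni i} P(s)$$
Further, the probability that both the units $U_i$ and $U_j$ are included in the sample is:
$$\pi_{ij} = \sum_{s \ni i,j} P(s)$$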
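As a rough illustration of these two inclusion probabilities, the sketch below enumerates every possible sample for a tiny population and adds up the sample probabilities. The size measures in `x` are made-up values, and the Midzuno-Sen design (where the probability of a sample is proportional to the sum of its `p_i`) is assumed only because it is the scheme used later in this post.

```r
# Hypothetical size measures for N = 4 units (illustrative values only)
x <- c(10, 20, 30, 40)
p <- x / sum(x)                 # initial selection probabilities p_i = x_i / x
N <- length(p)
n <- 2

# All possible samples of size n; each column is one sample
samples <- combn(N, n)

# Assumed design (Midzuno-Sen): P(s) = sum of p_i over s / choose(N - 1, n - 1)
P_s <- apply(samples, 2, function(s) sum(p[s])) / choose(N - 1, n - 1)

# First-order inclusion probability: add P(s) over the samples containing unit i
pi_i <- sapply(1:N, function(i) sum(P_s[colSums(samples == i) > 0]))

# Second-order inclusion probability: add P(s) over samples containing both i and j
pi_ij <- matrix(0, N, N)
for (i in 1:N) for (j in 1:N) if (i != j) {
  both <- colSums(samples == i) > 0 & colSums(samples == j) > 0
  pi_ij[i, j] <- sum(P_s[both])
}
diag(pi_ij) <- pi_i             # convention: pi_ii = pi_i

print(pi_i)
print(pi_ij)
```

For a fixed sample size the first-order probabilities add up to n, which is a quick sanity check on any set of inclusion probabilities.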
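Let $y_i$ be the value of the i-th unit and $\pi_i$ its probability of inclusion in the sample. The H-T estimator of the population total is
then defined as
$$\hat{Y}_{HT} = \sum_{i=1}^{n} \frac{y_i}{\pi_i}$$
and the H-T estimate of the mean is given by
$$\hat{\bar{Y}}_{HT} = \frac{1}{N} \sum_{i=1}^{n} \frac{y_i}{\pi_i}$$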
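The estimator of the total and of the mean can be written as two short R functions. This is only a minimal sketch: the function names `ht_total` and `ht_mean` and the sample values are hypothetical and are not part of any package.

```r
# H-T estimate of the population total: sum of y_i / pi_i over the sampled units
ht_total <- function(y, pi_s) {
  stopifnot(length(y) == length(pi_s), all(pi_s > 0 & pi_s <= 1))
  sum(y / pi_s)
}

# H-T estimate of the population mean: estimated total divided by population size N
ht_mean <- function(y, pi_s, N) {
  ht_total(y, pi_s) / N
}

# Hypothetical sample of 3 units from a population of N = 8
y_s  <- c(12, 30, 45)           # observed values
pi_s <- c(0.2, 0.4, 0.6)        # their inclusion probabilities (made up)

total_hat <- ht_total(y_s, pi_s)      # 12/0.2 + 30/0.4 + 45/0.6
mean_hat  <- ht_mean(y_s, pi_s, N = 8)
print(c(total = total_hat, mean = mean_hat))
```

Units with a small inclusion probability receive a large weight 1/pi_i, since each of them stands in for more of the unsampled population.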
In a Bayesian probabilistic framework, $\pi_i$ is considered the
proportion of individuals in a target population belonging to the i-th stratum, so $y_i/\pi_i$ can be regarded as an estimate of the complete sample of
persons within the i-th stratum.
In post-stratified study designs, $\pi$ and the mean $\mu$ are estimated in separate
steps, and computing the variance of the resulting mean estimator $\hat{\mu}_{HT}$ is not straightforward.
Resampling techniques such as the bootstrap or the jackknife can be applied
to obtain consistent estimates of the variance of the H-T estimator.
H-T ESTIMATOR AS AN UNBIASED ESTIMATOR OF POPULATION TOTAL IN PPSWOR
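The H-T estimator can be shown to be
an unbiased estimator of the population total, provided that $\pi_i > 0$ for every unit. Let $a_i$ be 1 if the unit $U_i$ is included in the sample and 0 otherwise, so that $E(a_i) = \pi_i$. Consider the expectation of the
Horvitz-Thompson estimator,
$$E(\hat{Y}_{HT}) = E\left(\sum_{i=1}^{N} a_i \frac{y_i}{\pi_i}\right) = \sum_{i=1}^{N} \frac{y_i}{\pi_i} E(a_i) = \sum_{i=1}^{N} y_i = Y$$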
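The unbiasedness can also be checked numerically on a toy population by listing all possible samples. The sketch below assumes draw-by-draw PPSWOR with n = 2, where the unordered sample {i, j} has probability p_i p_j [1/(1 - p_i) + 1/(1 - p_j)]; the values in `y` and `x` are made up for illustration.

```r
# Hypothetical population (illustrative values only)
y <- c(5, 9, 14, 22)            # study variable
x <- c(10, 20, 30, 40)          # size measure
p <- x / sum(x)
N <- length(y)

# All unordered samples of size 2 and their probabilities under PPSWOR
samples <- combn(N, 2)
P_s <- apply(samples, 2, function(s) {
  i <- s[1]
  j <- s[2]
  p[i] * p[j] * (1 / (1 - p[i]) + 1 / (1 - p[j]))
})

# Inclusion probabilities obtained by adding P(s) over samples containing unit i
pi_i <- sapply(1:N, function(i) sum(P_s[colSums(samples == i) > 0]))

# H-T estimate for each possible sample, then its expectation over the design
est <- apply(samples, 2, function(s) sum(y[s] / pi_i[s]))
expected <- sum(P_s * est)

print(c(expected = expected, true_total = sum(y)))
```

Individual samples give estimates above or below the true total; it is only their probability-weighted average that matches it.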
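REDUCTION OF H-T ESTIMATOR TO THE ESTIMATOR OF POPULATION TOTAL UNDER SRSWOR
We know that in SRSWOR the
probability that the i-th unit is included in
the sample is n/N. So,
$$\hat{Y}_{HT} = \sum_{i=1}^{n} \frac{y_i}{n/N} = \frac{N}{n} \sum_{i=1}^{n} y_i = N\bar{y}$$
which is the usual estimator of the population total under SRSWOR.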
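A quick way to see this reduction is to compute both forms on the same simple random sample. The population values and the seed below are arbitrary choices made for illustration.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
y <- c(5, 9, 14, 22, 7, 11)      # hypothetical population values
N <- length(y)
n <- 3

s <- sample(N, n)                # SRSWOR: every unit has inclusion probability n / N
pi_srs <- rep(n / N, n)

ht_estimate <- sum(y[s] / pi_srs)   # H-T form
expansion   <- N * mean(y[s])       # usual SRSWOR estimator of the total

print(c(ht = ht_estimate, expansion = expansion))
```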
VARIANCE OF THE HORVITZ-THOMPSON ESTIMATOR
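A general form of linear estimator can be written as
$$\hat{Y} = \sum_{i=1}^{N} c_i a_i y_i$$
where $a_i$ is a random variate taking the value 1 if the i-th unit is drawn and 0 otherwise, and $c_i$ are constants attached to the units $U_i$ (i=1,2,...,N).
$a_i$ follows a binomial distribution for a sample of size 1 with probability $\pi_i$. So $E(a_i) = \pi_i$ and $V(a_i) = \pi_i(1-\pi_i)$. Since, for $i \neq j$, $a_i a_j$ is 1 only if both units appear in the sample,
$$E(a_i a_j) = \pi_{ij}, \qquad Cov(a_i, a_j) = \pi_{ij} - \pi_i \pi_j$$
Now,
$$E(\hat{Y}) = \sum_{i=1}^{N} c_i y_i E(a_i) = \sum_{i=1}^{N} c_i \pi_i y_i = \sum_{i=1}^{N} y_i$$
if it is unbiased.
Therefore,
$$c_i = \frac{1}{\pi_i}$$
So,
$$\hat{Y}_{HT} = \sum_{i=1}^{N} \frac{a_i y_i}{\pi_i} = \sum_{i=1}^{n} \frac{y_i}{\pi_i}$$
is an unbiased estimator. Now the sampling variance is given by
$$V(\hat{Y}_{HT}) = \sum_{i=1}^{N} \frac{1-\pi_i}{\pi_i} y_i^2 + \sum_{i=1}^{N} \sum_{j \neq i}^{N} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} y_i y_j$$
The variance of the estimator depends on the quantities $\pi_{ij}$ and $\pi_i$, which are calculated from the sampling procedure.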
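To make the role of the first- and second-order inclusion probabilities concrete, the sketch below evaluates the variance formula on a toy population and compares it with the variance obtained by brute force over all possible samples. The data are hypothetical and the Midzuno-Sen sample probabilities are an assumed design; the quadratic form used here is just a compact way of writing the two sums (the diagonal terms of the matrix give the first sum).

```r
# Hypothetical population (illustrative values only)
y <- c(5, 9, 14, 22)
x <- c(10, 20, 30, 40)
p <- x / sum(x)
N <- length(y)
n <- 2

samples <- combn(N, n)
# Assumed design (Midzuno-Sen): P(s) = sum of p_i over s / choose(N - 1, n - 1)
P_s <- apply(samples, 2, function(s) sum(p[s])) / choose(N - 1, n - 1)

# Indicator matrix: A[i, k] = 1 if unit i belongs to sample k
A <- sapply(seq_len(ncol(samples)), function(k) as.numeric(1:N %in% samples[, k]))

pi_i  <- as.vector(A %*% P_s)    # pi_i  = sum_k A[i, k] P_s[k]
pi_ij <- A %*% (P_s * t(A))      # pi_ij = sum_k A[i, k] A[j, k] P_s[k]; diagonal = pi_i

# Variance formula written as a quadratic form in z_i = y_i / pi_i
z <- y / pi_i
V_formula <- as.numeric(t(z) %*% (pi_ij - outer(pi_i, pi_i)) %*% z)

# Brute force: variance of the H-T estimate over the sampling distribution
est <- as.vector(t(A) %*% z)     # H-T estimate for each possible sample
V_direct <- sum(P_s * (est - sum(P_s * est))^2)

print(c(formula = V_formula, direct = V_direct))
```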
UNBIASED SAMPLE ESTIMATOR OF VARIANCE OF H-T ESTIMATOR
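An unbiased sample estimator of the variance of the H-T estimator is given by
$$v(\hat{Y}_{HT}) = \sum_{i=1}^{n} \frac{1-\pi_i}{\pi_i^2} y_i^2 + \sum_{i=1}^{n} \sum_{j \neq i}^{n} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j \pi_{ij}} y_i y_j$$
provided that none of the $\pi_{ij}$ in the population is 0.
One drawback of this variance estimator is that it does not necessarily reduce to 0 even when all the ratios $y_i/\pi_i$ are equal, a case in which the estimate itself is the same for every sample of fixed size. Another is that it may assume negative values for some samples.
For n=2, the H-T estimator of the total is
$$\hat{Y}_{HT} = \frac{y_1}{\pi_1} + \frac{y_2}{\pi_2}$$
and its estimated variance is
$$v(\hat{Y}_{HT}) = \frac{1-\pi_1}{\pi_1^2} y_1^2 + \frac{1-\pi_2}{\pi_2^2} y_2^2 + \frac{2(\pi_{12} - \pi_1 \pi_2)}{\pi_1 \pi_2 \pi_{12}} y_1 y_2$$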
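The sample variance estimator can be sketched as a small R function. The name `ht_var_est` is hypothetical, and the check uses SRSWOR with N = 4 and n = 2 purely as a convenient assumed design, because there the result should agree with the familiar formula N^2 (1 - n/N) s^2 / n.

```r
# Sketch of the H-T variance estimator for one sample.
# y_s : values of the sampled units
# pi_s: first-order inclusion probabilities of the sampled units
# PI  : n x n matrix of second-order inclusion probabilities (diagonal = pi_s)
ht_var_est <- function(y_s, pi_s, PI) {
  stopifnot(all(PI > 0))                    # the estimator needs every pi_ij > 0
  z <- y_s / pi_s
  D <- (PI - outer(pi_s, pi_s)) / PI        # diagonal entries equal 1 - pi_i
  as.numeric(t(z) %*% D %*% z)
}

# Assumed design for the check: SRSWOR with N = 4, n = 2
N <- 4
n <- 2
y_s  <- c(5, 9)                             # hypothetical sample values
pi_s <- rep(n / N, n)
PI   <- matrix(n * (n - 1) / (N * (N - 1)), n, n)
diag(PI) <- pi_s

v_ht  <- ht_var_est(y_s, pi_s, PI)
v_srs <- N^2 * (1 - n / N) * var(y_s) / n   # textbook SRSWOR variance estimate
print(c(ht = v_ht, srs = v_srs))
```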
HORVITZ-THOMPSON ESTIMATOR USING R PROGRAMMING
Let us discuss a scenario and see how to use the H-T estimator in R to estimate the population total and its variance.
In a village, there are 8 orchards with the following number of trees and corresponding yields.
Let us estimate the total production of the 8 orchards, and the variance of that estimate, using the H-T estimator on a sample of 3 orchards.
Here we first select a sample of size 3 from the population using the Midzuno-Sen sampling procedure and then compute the H-T estimate of the population total and its variance. The H-T estimator is included in the samplingbook package in R, so first install the package with install.packages() and load it with library().
The syntax of the H-T estimator code in R is
htestimate(y, N, PI, pk, pik, method = 'ht')
with the arguments,
y: vector of observations.
N: integer for the population size.
PI: square matrix of second-order inclusion probabilities with n rows and columns. It must be specified for variance estimation by methods 'ht' and 'yg'. Here, for the H-T estimator, 'ht' is used.
pk: vector of first-order inclusion probabilities of length n for the sample elements. It must be specified for variance estimation by methods 'hh' and 'ha'.
pik: an optional vector of first-order inclusion probabilities of length N for the population elements. It can be used for variance estimation by method 'ha'.
method: method to be used for variance estimation. Options are 'yg' (Yates and Grundy) and 'ht' (Horvitz-Thompson); approximate options are 'hh' (Hansen-Hurwitz) and 'ha' (Hajek).
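Before turning to the package, the two steps described above (Midzuno-Sen selection followed by H-T estimation) can be sketched in base R. The data below are hypothetical stand-ins and not the orchard table, the seed is arbitrary, and the closed-form inclusion probabilities are the standard ones for the Midzuno-Sen scheme. The objects `pi_s` and `PI` have the shapes that the argument list above describes for `pk` and `PI`.

```r
set.seed(2148)                   # arbitrary seed, only for reproducibility

# Hypothetical stand-in data: size measure x and study variable y for N = 8 units
x <- c(50, 30, 25, 40, 26, 44, 20, 35)
y <- c(60, 35, 30, 44, 30, 50, 22, 40)
N <- length(y)
n <- 3
p <- x / sum(x)

# Midzuno-Sen selection: first unit with probability p_i,
# the remaining n - 1 units by SRSWOR from the other N - 1 units
first <- sample(N, 1, prob = p)
rest  <- sample(setdiff(1:N, first), n - 1)
s <- c(first, rest)

# Closed-form inclusion probabilities under the Midzuno-Sen scheme
pi_s <- (N - n) / (N - 1) * p[s] + (n - 1) / (N - 1)
PI <- (n - 1) / (N - 1) *
  ((N - n) / (N - 2) * outer(p[s], p[s], "+") + (n - 2) / (N - 2))
diag(PI) <- pi_s

# H-T estimates of the total and the mean, and the H-T variance estimate of the total
z <- y[s] / pi_s
total_hat <- sum(z)
mean_hat  <- total_hat / N
var_hat   <- as.numeric(t(z) %*% ((PI - outer(pi_s, pi_s)) / PI) %*% z)

print(c(total = total_hat, mean = mean_hat, var_total = var_hat))
```

Because the H-T variance estimate can be negative for some samples, its square root should only be taken after checking the sign.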
Now let us see the R code for the above mentioned problem.
INTERPRETATION OF THE R CODE
As we can see, htestimate() is applied to the sample chosen from the population to obtain the mean estimate and its standard error. The mean production of the orchards estimated by the H-T estimator from the sample is 38.875, which is the same as the mean computed from the entire population. The standard error is 3.269, which measures how much the estimate would typically vary from one sample to another. The total is obtained from the mean estimate by multiplying by the population size: the estimated total production of the 8 orchards is 8 x 38.875 = 311.
CONCLUSION
The Horvitz-Thompson estimator is an unordered, unbiased estimator of the population total. The R command for computing it is htestimate() from the samplingbook package. The estimator is used when the order in which the units are drawn from the population is not taken into account, and it can be applied to any sampling design in which the sample contains only distinct units. It is observed that the H-T estimator under the Midzuno-Sen sampling scheme is more efficient than the estimator under PPS sampling with replacement.