PROBABILITY PROPORTIONAL TO SIZE WITHOUT REPLACEMENT AND HORVITZ THOMPSON ESTIMATOR

-Ishita Dasgupta

2148132

What is PPS sampling?

Probability proportional to size (PPS) sampling is a method of sampling from a finite population in which a size measure is available for each population unit before sampling, and the probability of selecting a unit is proportional to its size.

EXPLANATION

A probability proportional to size (PPS) sampling procedure is a variation on multi-stage sampling in which the probability of selecting a primary sampling unit (PSU) is proportional to its size, and an equal number of elements is sampled within each selected PSU. If one PSU has twice as large a population as another, it is given twice the chance of being selected.

If the same number of persons is then selected from each of the selected PSU’s, the overall probability of selection of any person will be the same. Exact PPS sampling of PSU’s thus achieves complete control over sample size.
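The equal-probability claim can be checked with a few lines of arithmetic. The sketch below assumes exact PPS selection of PSUs and uses made-up PSU sizes, so the numbers `M`, `n_psu` and `m` are illustrative assumptions rather than values from any survey.

```r
# Hypothetical PSU sizes (persons per PSU); assumed for illustration only.
M     <- c(50, 100, 200, 400, 250)
n_psu <- 2    # number of PSUs to select (assumed)
m     <- 10   # persons sampled within each selected PSU (assumed)

# Exact PPS: the inclusion probability of PSU i is n_psu * M_i / sum(M).
p_psu <- n_psu * M / sum(M)

# Within a selected PSU, each person is chosen with probability m / M_i.
p_within <- m / M

# Overall selection probability of a person in PSU i: the M_i cancels.
p_overall <- p_psu * p_within
print(p_overall)   # every entry equals n_psu * m / sum(M)
```

The PSU size cancels in the product, which is why every person ends up with the same chance of selection and the total sample size is fixed at `n_psu * m`.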

The PPS method of selection is useful when the PSU’s vary greatly in size.

Example of PPS Sampling

A population consists of 10 villages with a total of 212 households. The second column of the accompanying table shows the number of households corresponding to each village. A sample of 6 villages is to be selected by the PPS method. To do this, the following steps are followed:

Prepare a cumulative total column from the households in column 2. These totals appear in column 3.
Make a column displaying the range implied by the cumulative totals.
Read off random numbers from the Appendix. These random numbers are 173, 95, 210, ..., 32. (Ignore all random numbers lying outside the range 001-212.)
The villages whose ranges contain the selected random numbers are the sampled villages.
Table 5.9 shows the selected villages under sampling with and without replacement.

The procedure ensures that, at each draw, the probability of selecting a village is proportional to its size (number of households). If the numbers of households are not known, some other auxiliary variable that is highly correlated with the number of households (such as population size) could be used instead as a measure of size.
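The cumulative total method described in the steps above can be sketched in base R. The household counts below are hypothetical (the source table is not reproduced in this post); they are chosen only so that the 10 villages total 212 households. Only the random numbers actually written out in the text are used.

```r
# Hypothetical household counts for the 10 villages (assumed, sum = 212).
households <- c(15, 22, 18, 30, 12, 25, 20, 28, 16, 26)

# Column 3: cumulative totals; then the range of numbers assigned to each village.
cum_total <- cumsum(households)
ranges <- data.frame(
  village    = seq_along(households),
  households = households,
  cum_total  = cum_total,
  lower      = c(1, head(cum_total, -1) + 1),
  upper      = cum_total
)
print(ranges)

# Random numbers quoted in the text; discard any outside 1..212.
r <- c(173, 95, 210, 32)
r <- r[r >= 1 & r <= sum(households)]

# A village is selected when the random number falls inside its range,
# i.e. it is the first village whose cumulative total reaches the number.
selected <- sapply(r, function(x) which(cum_total >= x)[1])
print(selected)
```

For sampling with replacement a repeated village is simply kept; for sampling without replacement a random number that lands on an already selected village is discarded and the next one is read.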

PROBABILITY PROPORTIONAL TO SIZE WITHOUT REPLACEMENT

It is shown that sampling without replacement with probability proportional to size can be achieved if the units are grouped with reference to size. When the same unit is chosen a second time, it is substituted by another unit of the same size chosen at random. The estimate of the population total is formally the same as when sampling is done with replacement. The estimate of variance differs in that, from the sum of squares of deviations of the ratio, r, we subtract, for each group chosen t (> 1) times, the quantity t*S/N, where S is the sum of squares of deviations within the group and N is the number of units in the group.

THE METHOD

Sampling with probability proportional to size is usually done with replacement, for if it is done without replacement, the probability ceases to be strictly proportional to size, unless some special device is used, such as that proposed by Yates and Grundy, which is rather complicated for n=2 and hardly practicable for n>2.

Unless the probability of drawing the same unit twice is negligible, the method of sampling with replacement is inefficient, the loss of information being roughly equal to the proportion of duplicates. Although the loss is not usually very serious, it is worth while inquiring if any simple method can be found for avoiding it. It is suggested that such a method is available if the values of x, the variable measuring size, are or can be grouped.
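To get a feel for how many duplicates with-replacement PPS sampling can produce, the expected number of distinct units can be computed directly. The sketch below uses the village areas from the lime-tree example further down in this post as the size measure and n = 8 draws; treating the draws as independent with probability proportional to area is the assumption here.

```r
# Size measure: area under lime (acres) for the 22 villages in the example below.
area <- c(32.77, 7.97, 0.62, 15.61, 42.85, 40.03, 9.39, 6.33, 5.05, 94.55,
          53.71, 0.67, 0.82, 2.15, 0.43, 123.36, 0.29, 3.00, 4.00, 2.00,
          6.21, 45.85)
n <- 8                      # number of draws
p <- area / sum(area)       # single-draw selection probabilities

# P(unit i appears at least once in n independent draws) = 1 - (1 - p_i)^n,
# so the expected number of distinct units is the sum of these probabilities.
expected_distinct <- sum(1 - (1 - p)^n)

# Expected share of the n draws that merely repeat an earlier unit.
expected_duplicate_share <- 1 - expected_distinct / n

print(expected_distinct)
print(expected_duplicate_share)
```

When a few units carry most of the total size, as here, the duplicate share is far from negligible, which is the situation the grouping method is meant to address.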

If the values of x have been rather coarsely rounded, it will often be found that these groups already exist for all or much of the population. Where they do not exist, they can be formed by replacing groups of consecutive values (when listed in descending or ascending order) by a common central value. Thus, if it is desired to have no group smaller than five, the series

x = - 36 39 39 41 41 -

would be replaced by

x = - 39 39 39 39 39 -
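One way to form such groups is sketched below. The text only asks for "a common central value"; using the median of each block of consecutive sorted values, and letting the last block absorb any remainder, are assumptions made for this illustration. The function name `group_sizes` is hypothetical.

```r
# Replace consecutive sorted values by a common central value (here the median)
# so that no group has fewer than `min_group` members.
group_sizes <- function(x, min_group = 5) {
  ord <- order(x)
  xs  <- as.numeric(x[ord])
  n_groups <- max(1, floor(length(xs) / min_group))
  # Consecutive blocks of length min_group; the last block absorbs the remainder.
  block  <- pmin(ceiling(seq_along(xs) / min_group), n_groups)
  centre <- ave(xs, block, FUN = median)
  out <- numeric(length(x))
  out[ord] <- centre        # put grouped values back in the original order
  out
}

# The five values from the text collapse to a single group at 39.
print(group_sizes(c(36, 39, 39, 41, 41)))
```

A coarser or finer grouping trades a little information about x against the ability to substitute units of equal size.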

The technique of selecting the sample is then quite simple: if at any moment a unit is drawn a second time, it is replaced by another unit of the same size drawn at random from among the other units of the same size which have not yet been drawn.

In principle, it is therefore necessary that no group shall be smaller than n, the number of units in the sample. But this will not usually be necessary in practice. It will usually be found that when n=(say) 10, if the smallest group is of size (say) 3, the probability of drawing a group more times than it has members is extremely remote. If nevertheless it did happen, we would have to draw again. To the extent that this is likely to happen, the theory of the method fails, but as we are supposing that this is very improbable, we suppose also that the theoretical results are valid in the practical situation even when the smallest number in a group is less than n.
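How remote this event is can be checked with a binomial tail probability. Under the scheme, the number of times a given group is drawn in n draws is binomial with success probability equal to the group's share of the total size. The share `p_i` below is a hypothetical value chosen for illustration; `n = 10` and a group of 3 members are the figures mentioned in the text.

```r
n   <- 10     # sample size, as in the text
N_i <- 3      # number of units in the smallest group, as in the text
p_i <- 0.02   # hypothetical share of total size held by that group (assumed)

# P(group is drawn more than N_i times) = P(T > N_i), T ~ Binomial(n, p_i).
p_exceed <- pbinom(N_i, size = n, prob = p_i, lower.tail = FALSE)
print(p_exceed)
```

Small groups usually hold a small share of the total size, so this tail probability stays tiny unless the group's share is unusually large.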

It is also admitted that the process of grouping (if not already completed by the rounding of values of x) will entail some loss of information. We suppose that in practice this loss will be extremely small; at any rate less than the gain resulting from the elimination of multiple drawings.

The sampling plan may be formulated in a different manner. Let $X_{i} = N_{i} x_{i}$ represent the total size of group i. Then we select, with replacement, n groups with probability proportional to $X_{i}$. If group i is chosen $t_{i}$ times, we then select, without replacement, $t_{i}$ units with equal probability from this group.
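The two-step formulation above can be sketched in base R as follows. The function name `grouped_ppswor` and the example sizes are hypothetical; redrawing the group counts whenever some group is chosen more often than it has members follows the "draw again" remark in the text.

```r
# Grouped PPS sampling without replacement, assuming units with equal x
# already form the groups.
grouped_ppswor <- function(x, n) {
  groups <- split(seq_along(x), x)                  # unit indices by common size
  X   <- sapply(groups, function(idx) sum(x[idx]))  # X_i = N_i * x_i
  N_i <- lengths(groups)
  repeat {
    # Step 1: n groups with replacement, probability proportional to X_i.
    draws <- sample.int(length(groups), n, replace = TRUE, prob = X)
    t_i <- tabulate(draws, nbins = length(groups))
    if (all(t_i <= N_i)) break   # otherwise draw again
  }
  # Step 2: t_i units without replacement, equal probability, within group i.
  chosen <- unlist(lapply(which(t_i > 0), function(g) {
    idx <- groups[[g]]
    idx[sample.int(length(idx), t_i[g])]
  }))
  sort(unname(chosen))
}

set.seed(1)
x <- rep(c(10, 20, 40), times = c(6, 5, 4))  # hypothetical grouped sizes
s <- grouped_ppswor(x, n = 5)
print(s)
```

Indexing with `sample.int` avoids the base R pitfall where `sample(idx, k)` treats a single number as the range `1:idx`.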

3.3 - The Horvitz-Thompson Estimator

Horvitz and Thompson (1952) introduced an unbiased estimator of the population total $τ$ for any design, with or without replacement.


Let $π_{i}$, i = 1, ..., N, be given positive numbers that represent the probability that unit i is included in the sample under a given sampling scheme. The Horvitz-Thompson estimator is:

${\hat{τ}}_{π} = \sum_{i = 1}^{ν} \frac{y_{i}}{π_{i}}$

where $ν$ is the number of distinct units in the sample. The Horvitz-Thompson estimator does not depend on the number of times a unit may be selected. Each distinct unit of the sample is utilized only once.

Read section 6.5 in the text. The section reviews the proofs of how the following two formulas are derived. Note that:

$E({\hat{τ}}_{π}) = τ$

$\mathrm{Var}({\hat{τ}}_{π}) = \sum_{i = 1}^{N} \left(\frac{1 - π_{i}}{π_{i}}\right) y_{i}^{2} + \sum_{i = 1}^{N} \sum_{j \neq i} \left(\frac{π_{i j} - π_{i} π_{j}}{π_{i} π_{j}}\right) y_{i} y_{j}$

where $π_{i j} > 0$ denotes the probability that both unit i and unit j are included.

The estimated variance of the Horvitz-Thompson estimator is given by:

$\widehat{\mathrm{Var}}({\hat{τ}}_{π}) = \sum_{i = 1}^{ν} \left(\frac{1 - π_{i}}{π_{i}^{2}}\right) y_{i}^{2} + \sum_{i = 1}^{ν} \sum_{j \neq i} \left(\frac{π_{i j} - π_{i} π_{j}}{π_{i} π_{j}}\right) \frac{1}{π_{i j}} y_{i} y_{j}$

An approximate (1 - $α$) 100% CI for $τ$ is:

${\hat{τ}}_{π} \pm t_{α / 2} \sqrt{\widehat{\mathrm{Var}}({\hat{τ}}_{π})}$

where t has $ν$ - 1 df.
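The estimator, its variance estimator and the approximate confidence interval can be written out in a few lines of base R. This is a minimal sketch, not the `samplingbook` implementation: the function name `ht_estimate` and its arguments are hypothetical, `pik` holds the first-order inclusion probabilities of the sampled units and `pikl` their matrix of joint inclusion probabilities. It is checked on a toy simple random sample, for which the inclusion probabilities are known in closed form.

```r
# Horvitz-Thompson total, estimated variance and approximate CI.
# y: values of the nu distinct sampled units; pik: their inclusion probabilities;
# pikl: nu x nu matrix of joint inclusion probabilities (diagonal is ignored).
ht_estimate <- function(y, pik, pikl, alpha = 0.05) {
  nu      <- length(y)
  tau_hat <- sum(y / pik)
  pp      <- outer(pik, pik)                 # pi_i * pi_j
  term1   <- sum((1 - pik) / pik^2 * y^2)
  w       <- (pikl - pp) / (pp * pikl)
  diag(w) <- 0                               # the double sum runs over j != i
  term2   <- sum(w * outer(y, y))
  var_hat <- term1 + term2                   # can be negative for some designs
  half    <- qt(1 - alpha / 2, df = nu - 1) * sqrt(var_hat)
  list(total = tau_hat, var = var_hat, ci = c(tau_hat - half, tau_hat + half))
}

# Toy check: simple random sampling without replacement, n = 2 from N = 4.
N <- 4; n <- 2
y    <- c(1, 3)
pik  <- rep(n / N, n)
pikl <- matrix(n * (n - 1) / (N * (N - 1)), n, n)
diag(pikl) <- pik
res <- ht_estimate(y, pik, pikl)
print(res$total)   # equals N * mean(y)
print(res$var)     # equals N^2 * (1 - n/N) * var(y) / n
```

With equal inclusion probabilities the general formulas collapse to the familiar simple random sampling results, which makes this a convenient sanity check before applying them to an unequal probability design.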
EXAMPLE: The results of a sample survey on the number of bearing lime trees and the area reported under limes, in each of the 22 villages growing lime in one of the tehsils of Bangalore district, are given below:

S.No. of villages	Area Under lime(in acres)	No. of bearing lime trees
1	32.77	2328
2	7.97	754
3	0.62	105
4	15.61	949
5	42.85	3091
6	40.03	1736
7	9.39	840
8	6.33	311
9	5.05	0
10	94.55	3044
11	53.71	2483
12	0.67	128
13	0.82	102
14	2.15	60
15	0.43	0
16	123.36	11799
17	0.29	26
18	3.00	317
19	4.00	190
20	2.00	180
21	6.21	752
22	45.85	3091

When every unit has the same inclusion probability $π_{i} = n / N$, as in simple random sampling without replacement, the Horvitz-Thompson estimator reduces to the usual expansion estimator:

$\begin{aligned} {\hat{τ}}_{π} & = \sum_{i = 1}^{n} \frac{y_{i}}{π_{i}} \\ & = \sum_{i = 1}^{n} \frac{y_{i}}{n} \cdot N \\ & = N \bar{y} \end{aligned}$
library(readxl)
SSDdataset9 <- read_excel("SSDdataset9.xlsx")
View(SSDdataset9)
attach(SSDdataset9)
library(samplingbook)
## Loading required package: pps
## Loading required package: sampling
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survival'
## The following objects are masked from 'package:sampling':
##
##     cluster, strata
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
##     dotchart

# Choosing a sample of size 8 from the imported data, with area as the size measure
set.seed(08)
Sample_PPS = pps.sampling(SSDdataset9$`Area Under lime(in acres)`, n=8, method='midzuno')
Sample_PPS
##
## pps.sampling object: Sample with probabilities proportional to size
## Method of Midzuno:
##
## PPS sample:
## [1]  1  5  6 10 11 14 16 22
##
## Sample probabilities:
##            [,1]       [,2]       [,3]      [,4]      [,5]       [,6]      [,7]
## [1,] 0.72745435 0.67867251 0.61607192 0.7274544 0.7274544 0.03302071 0.7274544
## [2,] 0.67867251 0.95121816 0.83983573 0.9512182 0.9512182 0.04509511 0.9512182
## [3,] 0.61607192 0.83983573 0.88861757 0.8886176 0.8886176 0.04171715 0.8886176
## [4,] 0.72745435 0.95121816 0.88861757 1.0000000 1.0000000 0.04772740 1.0000000
## [5,] 0.72745435 0.95121816 0.88861757 1.0000000 1.0000000 0.04772740 1.0000000
## [6,] 0.03302071 0.04509511 0.04171715 0.0477274 0.0477274 0.04772740 0.0477274
## [7,] 0.72745435 0.95121816 0.88861757 1.0000000 1.0000000 0.04772740 1.0000000
## [8,] 0.72745435 0.95121816 0.88861757 1.0000000 1.0000000 0.04772740 1.0000000
##           [,8]
## [1,] 0.7274544
## [2,] 0.9512182
## [3,] 0.8886176
## [4,] 1.0000000
## [5,] 1.0000000
## [6,] 0.0477274
## [7,] 1.0000000
## [8,] 1.0000000

# Rows of the data set that were selected
sample = SSDdataset9[Sample_PPS$sample, ]
sample
## # A tibble: 8 x 3
##   `S.No. of villages` `Area Under lime(in acres)` `No. of bearing lime trees`
##                 <dbl>                       <dbl>                       <dbl>
## 1                   1                       32.8                         2328
## 2                   5                       42.8                         3091
## 3                   6                       40.0                         1736
## 4                  10                       94.6                         3044
## 5                  11                       53.7                         2483
## 6                  14                        2.15                          60
## 7                  16                      123.                         11799
## 8                  22                       45.8                         3091

# Joint inclusion probabilities of the sampled units and the population size
pi = Sample_PPS$PI
N = nrow(SSDdataset9)

# Estimating the mean and its standard error using the Horvitz-Thompson and Yates-Grundy variance estimators
htestimate(sample$`No. of bearing lime trees`, N = N, PI = pi, method = 'ht')
##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Horvitz-Thompson:
##
## Mean estimator:  1367.157
## Standard Error:  86.00294
htestimate(sample$`No. of bearing lime trees`, N = N, PI = pi, method = 'yg')
##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Yates and Grundy:
##
## Mean estimator:  1367.157
## Standard Error:  25.57435

# First-order inclusion probabilities of the sampled units
pk = Sample_PPS$pik[Sample_PPS$sample]
pk
## [1] 0.7274544 0.9512182 0.8886176 1.0000000 1.0000000 0.0477274 1.0000000
## [8] 1.0000000

# Hansen-Hurwitz approximate variance
Est_HH = htestimate(sample$`No. of bearing lime trees`, N = N, pk = pk, method = 'hh')
Est_HH
##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Hansen-Hurwitz (approximate variance):
##
## Mean estimator:  1367.157
## Standard Error:  427.2806
Est_Tot_HH = N * Est_HH$mean
Est_Tot_HH
## [1] 30077.45

# First-order inclusion probabilities of all 22 population units
pik = Sample_PPS$pik
pik
##  [1] 0.727454354 0.176924358 0.013763250 0.346523114 0.951218159 0.888617570
##  [7] 0.208446640 0.140518342 0.112103890 1.000000000 1.000000000 0.014873189
## [13] 0.018203008 0.047727399 0.009545480 1.000000000 0.006437649 0.066596370
## [19] 0.088795161 0.044397580 0.137854487 1.000000000

# Hajek approximate variance
Est_Ha = htestimate(sample$`No. of bearing lime trees`, N = N, pk = pk, pik = pik, method = 'ha')
Est_Ha
##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Hajek (approximate variance):
##
## Mean estimator:  1367.157
## Standard Error:  45.04632
Est_Tot_Ha = N * Est_Ha$mean
Est_Tot_Ha
## [1] 30077.45

# 95% CI for the estimated totals
lb_Est_HH = Est_HH$mean * N - qnorm(0.975) * N * Est_HH$se
ub_Est_HH = Est_HH$mean * N + qnorm(0.975) * N * Est_HH$se
lb_Est_HH
## [1] 11653.45
ub_Est_HH
## [1] 48501.45
lb_Est_Ha = Est_Ha$mean * N - qnorm(0.975) * N * Est_Ha$se
ub_Est_Ha = Est_Ha$mean * N + qnorm(0.975) * N * Est_Ha$se
lb_Est_Ha
## [1] 28135.09
ub_Est_Ha
## [1] 32019.82
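As a cross-check on the package output, the Horvitz-Thompson point estimate can be recomputed by hand from the sampled values and the inclusion probabilities printed above. The sketch below copies those printed (rounded) numbers into plain vectors, so it needs no packages; small rounding differences from the package output are to be expected.

```r
# Sampled numbers of bearing lime trees (villages 1, 5, 6, 10, 11, 14, 16, 22).
y  <- c(2328, 3091, 1736, 3044, 2483, 60, 11799, 3091)
# Their first-order inclusion probabilities, as printed for `pk` above.
pk <- c(0.7274544, 0.9512182, 0.8886176, 1, 1, 0.0477274, 1, 1)
N  <- 22   # number of villages in the population

# Horvitz-Thompson estimate of the total and the implied mean per village.
tau_hat  <- sum(y / pk)
mean_hat <- tau_hat / N

print(tau_hat)
print(mean_hat)
```

Village 14 contributes far more than its 60 trees because its inclusion probability is under 5 percent, which illustrates how the estimator up-weights units that were unlikely to be sampled.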

CONCLUSION:

We conclude that, for probability proportional to size sampling without replacement with the Horvitz-Thompson estimator, the estimated mean is 1367.157 bearing lime trees per village and the estimated total is 30077.45. Using the Hansen-Hurwitz approximate variance, the standard error of the mean is 427.2806, and the 95% confidence interval for the total has lower limit 11653.45 and upper limit 48501.45.

