PPSWOR AND HORVITZ THOMPSON ESTIMATOR

 PROBABILITY PROPORTIONAL TO SIZE WITHOUT REPLACEMENT AND HORVITZ THOMPSON ESTIMATOR

                                                           -Ishita Dasgupta

                                                              2148132

What is PPS sampling?

Probability proportional to size (PPS) sampling is a method of sampling from a finite population in which a size measure is available for each population unit before sampling and the probability of selecting a unit is proportional to its size.


EXPLANATION

A probability proportional to size (PPS) sampling procedure is a variation on multi-stage sampling in which the probability of selecting a primary sampling unit (PSU) is proportional to its size, and an equal number of elements is sampled within each selected PSU. If one PSU has twice as large a population as another, it is given twice the chance of being selected.

If the same number of persons is then selected from each of the selected PSUs, the overall probability of selection of any person will be the same. Exact PPS sampling of PSUs thus achieves complete control over sample size.
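A short worked equation may make the equal-probability claim concrete. The symbols are assumptions introduced here for illustration and are not part of the original text: $m$ PSUs are selected, PSU $i$ contains $M_i$ persons, the population holds $M = \sum_i M_i$ persons, and $k$ persons are drawn from each selected PSU. Taking $m M_i / M$ as the probability that PSU $i$ is selected (which requires that no $m M_i / M$ exceeds 1):

```latex
P(\text{person } j \text{ in PSU } i \text{ is selected})
  = \frac{m M_i}{M} \cdot \frac{k}{M_i}
  = \frac{m k}{M}
```

The PSU size $M_i$ cancels, so every person has the same chance of selection, and the total sample size is fixed at $mk$.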

The PPS method of selection is useful when the PSUs vary greatly in size.

Example of PPS Sampling

A population consists of 10 villages with a total of 212 households. The second column of the accompanying table shows the number of households in each village. A sample of 6 villages is to be selected by the PPS method. To do this, the following steps are followed:

  1. Prepare a cumulative total column from the household counts in column 2. These totals appear in column 3.
  2. Make a column displaying the range implied by the cumulative totals.
  3. Read off random numbers from the Appendix. These random numbers are 173, 95, 210, .., 32. (Ignore all random numbers lying outside the range 001-212.)
  4. The villages whose ranges contain the selected random numbers are the sampled villages.
  5. Table 5.9 shows the selected villages under sampling with and without replacement.
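The cumulative total steps above can be sketched in base R. This is only an illustration under stated assumptions: the household counts in `sizes` are hypothetical values chosen so that they sum to 212 (they are not the table from the text), `select_village` is a helper name introduced here, and only the random numbers actually quoted in the text (173, 95, 210, 32) are used.

```r
# Hypothetical household counts for 10 villages (assumption: NOT the
# original table; chosen only so that the total is 212).
sizes <- c(15, 32, 8, 27, 40, 12, 21, 19, 24, 14)

# Step 1: cumulative totals. Step 2: the range each village covers.
cum_upper <- cumsum(sizes)
cum_lower <- cum_upper - sizes + 1
ranges <- data.frame(village = seq_along(sizes), sizes, cum_lower, cum_upper)

# Steps 3-4: a random number r selects the village whose range contains r,
# i.e. the first village whose cumulative total is >= r.
select_village <- function(r, cum_upper) {
  which(cum_upper >= r)[1]
}

draws <- c(173, 95, 210, 32)                      # numbers quoted in the text
draws <- draws[draws >= 1 & draws <= sum(sizes)]  # ignore out-of-range numbers
selected_wr <- sapply(draws, select_village, cum_upper = cum_upper)

# Assumption: for sampling without replacement, a village that comes up
# again is simply skipped (further random numbers would then be read).
selected_wor <- unique(selected_wr)

print(ranges)
print(selected_wr)
```

With these hypothetical sizes, the four quoted random numbers select villages 8, 5, 10 and 2. A village covering a wider range is hit by more random numbers, which is what makes the selection probability proportional to size.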

The procedure ensures that, at each draw, the probability of selecting a village is proportional to its size (number of households). If the numbers of households are not known, some other auxiliary variable that is highly correlated with the number of households (such as population size) could be used instead as a measure of size.


PROBABILITY PROPORTIONAL TO SIZE WITHOUT REPLACEMENT

It is shown that sampling without replacement with probability proportional to size can be achieved if the units are grouped with reference to size. When the same unit is chosen a second time, it is replaced by another unit of the same size chosen at random. The estimate of the population total is formally the same as when sampling is done with replacement. The estimate of variance differs in that, from the sum of squares of deviations of the ratio r, we subtract, for each group chosen t (> 1) times, the quantity t*S/N, where S is the sum of squares of deviations within the group and N is the number of units in the group.

                                      THE METHOD

Sampling with probability proportional to size is usually done with replacement, for if it is done without replacement, the probability ceases to be strictly proportional to size, unless some special device is used, such as that proposed by Yates and Grundy, which is rather complicated for n=2 and hardly practicable for n>2.

Unless the probability of drawing the same unit twice is negligible, sampling with replacement is inefficient, the loss of information being roughly equal to the proportion of duplicates. Although the loss is not usually very serious, it is worthwhile to ask whether a simple method can be found for avoiding it. Such a method is available if the values of x, the variable measuring size, are or can be grouped.

If the values of x have been rather coarsely rounded, it will often be found that these groups already exist for all or much of the population. Where they do not exist, they can be formed by replacing groups of consecutive values (when listed in descending or ascending order) by a common central value. Thus, if it is desired to have no group smaller than five, the series

x = ...  36 39 39 41 41  ...

would be replaced by

x = ...  39 39 39 39 39  ...
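The grouping step can be sketched in base R as follows. This is a minimal illustration, assuming the median is taken as the "common central value" (the text does not say which central value is meant); `group_sizes` and `min_group` are names introduced here.

```r
# Replace runs of consecutive sorted values by a common central value so
# that no group has fewer than `min_group` members.
# Assumption: the median of each run is used as the central value.
group_sizes <- function(x, min_group = 5) {
  if (length(x) == 0) return(x)
  x <- as.numeric(x)
  ord <- order(x)
  # Cut the sorted values into blocks of `min_group` consecutive values.
  block <- ceiling(seq_along(x) / min_group)
  last <- max(block)
  # Fold a short trailing block into the previous one.
  if (last > 1 && sum(block == last) < min_group) {
    block[block == last] <- last - 1
  }
  grouped <- numeric(length(x))
  grouped[ord] <- ave(x[ord], block, FUN = median)
  grouped
}

x <- c(36, 39, 39, 41, 41)   # the five values shown in the text
x_grouped <- group_sizes(x)
print(x_grouped)             # 39 39 39 39 39
```

Any other central value (for example the mean) could be swapped in for `median`; the point is only that every unit in a group ends up with the same size measure.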

The technique of selecting the sample is then quite simple: if at any moment a unit is drawn a second time, it is replaced by another unit of the same size drawn at random from among the other units of the same size which have not yet been drawn.

In principle, it is therefore necessary that no group shall be smaller than n, the number of units in the sample. But this will not usually be necessary in practice. It will usually be found that when n=(say) 10, if the smallest group is of size (say) 3, the probability of drawing a group more times than it has members is extremely remote. If nevertheless it did happen, we would have to draw again. To the extent that this is likely to happen, the theory of the method fails, but as we are supposing that this is very improbable, we suppose also that the theoretical results are valid in the practical situation even when the smallest number in a group is less than n.

It is also admitted that the process of grouping (if not already completed by the rounding of values of x) will entail some loss of information. We suppose that in practice this loss will be extremely small; at any rate less than the gain resulting from the elimination of multiple drawings.

The sampling plan may be formulated in a different manner. Let X_i = N_i x_i represent the total size of group i, where N_i is the number of units in the group and x_i is their common size. Then we select, with replacement, n groups with probability proportional to X_i. If group i is chosen t_i times, we then select, without replacement, t_i units with equal probability from this group.
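The two-stage formulation can be sketched in base R. The population below is hypothetical, and `grouped_pps_sample` and `max_tries` are names introduced here for illustration. The redraw loop reflects the remark that the whole draw is repeated if some group is chosen more times than it has members.

```r
# Hypothetical grouped population: 15 units whose sizes are already
# rounded into three groups (assumption, for illustration only).
x <- c(rep(10, 6), rep(20, 5), rep(40, 4))
group <- match(x, unique(x))              # group label of each unit
N_i <- tabulate(group)                    # number of units per group
X_i <- as.numeric(tapply(x, group, sum))  # X_i = N_i * x_i

grouped_pps_sample <- function(n, group, X_i, N_i, max_tries = 1000) {
  ok <- FALSE
  for (attempt in seq_len(max_tries)) {
    # Stage 1: n groups, with replacement, probability proportional to X_i.
    g <- sample.int(length(X_i), size = n, replace = TRUE, prob = X_i)
    t_i <- tabulate(g, nbins = length(X_i))
    # If a group is chosen more often than it has members, draw again.
    if (all(t_i <= N_i)) {
      ok <- TRUE
      break
    }
  }
  if (!ok) stop("no valid draw found; groups may be too small for n")
  # Stage 2: t_i units, without replacement, equal probability within group.
  chosen <- lapply(which(t_i > 0), function(i) {
    members <- which(group == i)
    members[sample.int(length(members), size = t_i[i])]
  })
  sort(unlist(chosen))
}

set.seed(1)
s <- grouped_pps_sample(5, group, X_i, N_i)
print(s)      # indices of the sampled units (all distinct)
print(x[s])   # their size measures
```

Indexing `members` through `sample.int` avoids the base R pitfall where `sample()` on a single number samples from `1:number`.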

3.3 - The Horvitz-Thompson Estimator


Horvitz and Thompson (1952) introduced an unbiased estimator of the population total τ that applies to any design, with or without replacement.


Let $\pi_i$, $i = 1, \ldots, N$, be given positive numbers that represent the probability that unit $i$ is included in the sample under a given sampling scheme. The Horvitz-Thompson estimator is:

$$\hat{\tau}_\pi = \sum_{i=1}^{\nu} \frac{y_i}{\pi_i}$$

where $\nu$ is the number of distinct units in the sample. The Horvitz-Thompson estimator does not depend on the number of times a unit may be selected. Each distinct unit of the sample is used only once.

Read section 6.5 in the text, which reviews the proofs of the following two formulas.

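Note that:

$$E(\hat{\tau}_\pi) = \tau$$

$$\mathrm{Var}(\hat{\tau}_\pi) = \sum_{i=1}^{N} \left(\frac{1 - \pi_i}{\pi_i}\right) y_i^2 + \sum_{i=1}^{N} \sum_{j \neq i} \left(\frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j}\right) y_i y_j$$

where $\pi_{ij} > 0$ denotes the probability that both unit $i$ and unit $j$ are included in the sample.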

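The estimated variance of the Horvitz-Thompson estimator is given by:

$$\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = \sum_{i=1}^{\nu} \left(\frac{1 - \pi_i}{\pi_i^2}\right) y_i^2 + \sum_{i=1}^{\nu} \sum_{j \neq i} \left(\frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j}\right) \frac{1}{\pi_{ij}} \, y_i y_j$$

where, as before, $\pi_{ij} > 0$ is the probability that both units $i$ and $j$ are included.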

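An approximate $(1 - \alpha)100\%$ confidence interval for $\tau$ is:

$$\hat{\tau}_\pi \pm t_{\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau}_\pi)}$$

where $t$ has $\nu - 1$ degrees of freedom.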
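The estimator, its variance estimate and the interval can be written out directly in base R. This is a sketch rather than the samplingbook implementation: `ht_total`, `ht_var_est` and `ht_ci` are names introduced here, and the small example assumes simple random sampling without replacement of n = 3 units from N = 5, where the inclusion probabilities are known in closed form.

```r
# Horvitz-Thompson estimator of the total from the distinct sampled units.
ht_total <- function(y, pi_i) sum(y / pi_i)

# Estimated variance. pi_ij is the nu x nu matrix of joint inclusion
# probabilities of the sampled units; its diagonal is ignored.
# Note: this estimator can come out negative for some designs and samples.
ht_var_est <- function(y, pi_i, pi_ij) {
  own <- sum((1 - pi_i) / pi_i^2 * y^2)
  pp <- outer(pi_i, pi_i)
  w <- (pi_ij - pp) / (pp * pi_ij)
  diag(w) <- 0                      # the double sum runs over j != i only
  own + sum(w * outer(y, y))
}

# Approximate confidence interval with a t quantile on nu - 1 df.
ht_ci <- function(y, pi_i, pi_ij, level = 0.95) {
  est <- ht_total(y, pi_i)
  se <- sqrt(ht_var_est(y, pi_i, pi_ij))
  est + c(-1, 1) * qt(1 - (1 - level) / 2, df = length(y) - 1) * se
}

# Illustrative check (assumed design: SRS without replacement, n = 3, N = 5):
# pi_i = n/N and pi_ij = n(n-1) / (N(N-1)).
y <- c(2, 4, 6)
pi_i <- rep(3 / 5, 3)
pi_ij <- matrix(3 * 2 / (5 * 4), 3, 3)
diag(pi_ij) <- pi_i

print(ht_total(y, pi_i))           # 20, i.e. N * mean(y)
print(ht_var_est(y, pi_i, pi_ij))  # 13.33333, i.e. N^2 (1 - n/N) var(y) / n
print(ht_ci(y, pi_i, pi_ij))
```

Under this equal-probability design both quantities agree with the usual simple random sampling formulas, which is a convenient way to check the two sums term by term.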

EXAMPLE: The results of a sample survey on the number of bearing lime trees and the area reported under limes in each of the 22 villages growing lime in one of the tehsils of Bangalore district are given below:


S.No. of villages    Area Under lime (in acres)    No. of bearing lime trees
 1                     32.77                          2328
 2                      7.97                           754
 3                      0.62                           105
 4                     15.61                           949
 5                     42.85                          3091
 6                     40.03                          1736
 7                      9.39                           840
 8                      6.33                           311
 9                      5.05                             0
10                     94.55                          3044
11                     53.71                          2483
12                      0.67                           128
13                      0.82                           102
14                      2.15                            60
15                      0.43                             0
16                    123.36                         11799
17                      0.29                            26
18                      3.00                           317
19                      4.00                           190
20                      2.00                           180
21                      6.21                           752
22                     45.85                          3091

If every village had the same inclusion probability $\pi_i = n/N$, as in simple random sampling, the Horvitz-Thompson estimator would reduce to:

$$\hat{\tau}_\pi = \sum_{i=1}^{n} \frac{y_i}{\pi_i} = \sum_{i=1}^{n} \frac{y_i}{n/N} = N\bar{y}$$

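# Import the village data from the Excel workbook
library(readxl)
SSDdataset9 <- read_excel("SSDdataset9.xlsx")
View(SSDdataset9)

# samplingbook provides pps.sampling() and htestimate().
# attach() is not needed: columns are always accessed with `$` below.
library(samplingbook)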

## Loading required package: pps

## Loading required package: sampling

## Loading required package: survey

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

##
## Attaching package: 'survival'

## The following objects are masked from 'package:sampling':
##
##     cluster, strata

##
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
##
##     dotchart

#Choosing a sample of size 8 from imported data
set.seed(08)
Sample_PPS = pps.sampling(SSDdataset9$`Area Under lime(in acres)`, n=8,
method='midzuno')
Sample_PPS

##
## pps.sampling object: Sample with probabilities proportional to size
## Method of Midzuno:
##
## PPS sample:
## [1]  1  5  6 10 11 14 16 22
##
## Sample probabilities:
##            [,1]       [,2]       [,3]      [,4]      [,5]       [,6]      [,7]
## [1,] 0.72745435 0.67867251 0.61607192 0.7274544 0.7274544 0.03302071 0.7274544
## [2,] 0.67867251 0.95121816 0.83983573 0.9512182 0.9512182 0.04509511 0.9512182
## [3,] 0.61607192 0.83983573 0.88861757 0.8886176 0.8886176 0.04171715 0.8886176
## [4,] 0.72745435 0.95121816 0.88861757 1.0000000 1.0000000 0.04772740 1.0000000
## [5,] 0.72745435 0.95121816 0.88861757 1.0000000 1.0000000 0.04772740 1.0000000
## [6,] 0.03302071 0.04509511 0.04171715 0.0477274 0.0477274 0.04772740 0.0477274
## [7,] 0.72745435 0.95121816 0.88861757 1.0000000 1.0000000 0.04772740 1.0000000
## [8,] 0.72745435 0.95121816 0.88861757 1.0000000 1.0000000 0.04772740 1.0000000
##           [,8]
## [1,] 0.7274544
## [2,] 0.9512182
## [3,] 0.8886176
## [4,] 1.0000000
## [5,] 1.0000000
## [6,] 0.0477274
## [7,] 1.0000000
## [8,] 1.0000000

sample = SSDdataset9[Sample_PPS$sample, ]
sample

## # A tibble: 8 x 3
##   `S.No. of villages` `Area Under lime(in acres)` `No. of bearing lime trees`
##                 <dbl>                       <dbl>                       <dbl>
## 1                   1                       32.8                         2328
## 2                   5                       42.8                         3091
## 3                   6                       40.0                         1736
## 4                  10                       94.6                         3044
## 5                  11                       53.7                         2483
## 6                  14                        2.15                          60
## 7                  16                      123.                         11799
## 8                  22                       45.8                         3091

pi = Sample_PPS$PI
N = nrow(SSDdataset9)

# Estimating the mean and its standard error using the Horvitz-Thompson and
# Yates-Grundy variance estimators
htestimate(sample$`No. of bearing lime trees`, N = N, PI = pi, method = 'ht')

##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Horvitz-Thompson:
##
## Mean estimator: 1367.157
## Standard Error: 86.00294

htestimate(sample$`No. of bearing lime trees`, N = N, PI = pi, method = 'yg')

##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Yates and Grundy:
##
## Mean estimator: 1367.157
## Standard Error: 25.57435

pk = Sample_PPS$pik[Sample_PPS$sample]
pk

## [1] 0.7274544 0.9512182 0.8886176 1.0000000 1.0000000 0.0477274 1.0000000
## [8] 1.0000000

Est_HH = htestimate(sample$`No. of bearing lime trees`, N = N, pk = pk,
method = 'hh')
Est_HH

##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Hansen-Hurwitz (approximate variance):
##
## Mean estimator: 1367.157
## Standard Error: 427.2806

Est_Tot_HH = N * Est_HH$mean
Est_Tot_HH

## [1] 30077.45
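As a sanity check on the output above, the estimated total can be recomputed by hand from the Horvitz-Thompson formula, summing y_i / π_i over the eight sampled villages. The values below are copied from the printed sample and the printed inclusion probabilities, so agreement is only up to the rounding of those printed figures; `y_s`, `pk_printed` and `N_villages` are names introduced here to avoid overwriting the objects used in the analysis.

```r
# Sampled villages 1, 5, 6, 10, 11, 14, 16, 22: number of bearing lime trees
y_s <- c(2328, 3091, 1736, 3044, 2483, 60, 11799, 3091)

# First-order inclusion probabilities as printed for the sampled villages
pk_printed <- c(0.7274544, 0.9512182, 0.8886176, 1, 1, 0.0477274, 1, 1)

N_villages <- 22

total_ht <- sum(y_s / pk_printed)   # Horvitz-Thompson total
mean_ht <- total_ht / N_villages    # corresponding mean per village

print(total_ht)
print(mean_ht)
```

Up to rounding, this should reproduce the total of about 30077 and the mean of about 1367.16 reported by `htestimate`. It also shows why all the `method` options give the same point estimate: they differ only in how the variance is estimated.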

pik = Sample_PPS$pik
pik

##  [1] 0.727454354 0.176924358 0.013763250 0.346523114 0.951218159 0.888617570
##  [7] 0.208446640 0.140518342 0.112103890 1.000000000 1.000000000 0.014873189
## [13] 0.018203008 0.047727399 0.009545480 1.000000000 0.006437649 0.066596370
## [19] 0.088795161 0.044397580 0.137854487 1.000000000

Est_Ha = htestimate(sample$`No. of bearing lime trees`, N = N, pk = pk, pik =
pik, method = 'ha')
Est_Ha

##
## htestimate object: Estimator for samples with probabilities proportional to size
## Method of Hajek (approximate variance):
##
## Mean estimator: 1367.157
## Standard Error: 45.04632

Est_Tot_Ha = N * Est_Ha$mean
Est_Tot_Ha

## [1] 30077.45

# 95% CIs for the estimated totals (normal approximation, using qnorm)
lb_Est_HH = Est_HH$mean * N - qnorm(0.975) * N * Est_HH$se
ub_Est_HH = Est_HH$mean * N + qnorm(0.975) * N * Est_HH$se
lb_Est_HH

## [1] 11653.45

ub_Est_HH

## [1] 48501.45

lb_Est_Ha = Est_Ha$mean * N - qnorm(0.975) * N * Est_Ha$se
ub_Est_Ha = Est_Ha$mean * N + qnorm(0.975) * N * Est_Ha$se
lb_Est_Ha

## [1] 28135.09

ub_Est_Ha

## [1] 32019.82
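The intervals above use the normal quantile `qnorm(0.975)`, whereas the interval formula stated earlier uses a t quantile with ν - 1 degrees of freedom. A sketch of the t-based version follows, using the printed mean and standard errors and ν = 8 distinct sampled villages; `t_ci_total` is a helper name introduced here, and the inputs are the rounded values from the output rather than the fitted objects.

```r
N_villages <- 22
nu <- 8                    # number of distinct villages in the sample
mean_hat <- 1367.157       # printed mean estimator
se_hh <- 427.2806          # printed Hansen-Hurwitz standard error
se_ha <- 45.04632          # printed Hajek standard error

# t-based interval for the total: N * mean +/- t * N * se, t on nu - 1 df
t_ci_total <- function(mean_hat, se, N, nu, level = 0.95) {
  t_q <- qt(1 - (1 - level) / 2, df = nu - 1)
  N * mean_hat + c(-1, 1) * t_q * N * se
}

ci_hh <- t_ci_total(mean_hat, se_hh, N_villages, nu)
ci_ha <- t_ci_total(mean_hat, se_ha, N_villages, nu)
print(ci_hh)
print(ci_ha)
```

Because the t quantile on 7 degrees of freedom is larger than the normal quantile, these intervals are wider than the ones printed above. With only eight sampled villages, the t version is the more cautious choice.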


CONCLUSION:




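We conclude that, under probability proportional to size sampling without replacement, the Horvitz-Thompson estimate of the mean number of bearing lime trees per village is 1367.157 and the estimated total is 30077.45. With the Hansen-Hurwitz approximate variance, the standard error of the mean is 427.2806 and the 95% confidence interval for the total runs from 11653.45 to 48501.45. With the Hajek approximate variance, the standard error is 45.04632 and the interval runs from 28135.09 to 32019.82.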
