Demonstrating the Allocation of Sample Size using Delhi Property Prices Data
Determining the Sample size in Stratified Random Sampling and its Demonstration
-Sharothi Bose (2148114)
Introduction-
Before exploring how to allocate the sample size in stratified random sampling, we briefly review what stratified random sampling is and why it is used.
We know that in simple random sampling, the variance of the sample mean is inversely proportional to the sample size and directly proportional to the variability of the sampling units of the population. To get more precise estimates, we can therefore either take a larger sample (which might not always be possible) or reduce the variability between the units.
This is where stratified random sampling comes in. We divide the population into non-overlapping groups or "strata", such that the units within each stratum are homogeneous and the variability among them is small. A random sample is then drawn from each stratum separately.
The major advantages of dividing the population into strata are that the sample becomes more representative and the estimates more precise. When the individual strata are also of interest in their own right, stratification adds administrative convenience.
There are 3 main problems that need to be addressed while doing stratified random sampling-
1) Deciding the Stratifying factor-
What should be the characteristic used to differentiate and divide the data into different groups?
2) What should the number of strata be?
3) Given our total sample size n, how should it be allocated between different strata?
In this blog article, we focus on the third question and determine the sample sizes for the strata.
Allocating Sample Size-
When we derive the variance of the estimate in stratified random sampling, we see that it also depends on ni, the size of the sample drawn from the ith stratum. We therefore need to fix the ni's. We study 2 approaches for this-
1) Proportional Allocation-
Introduced by Bowley in 1926. The allocation of the ni's to the various strata is called proportional if the sampling fraction is constant for each stratum.
n1/N1 = n2/N2 = ... = nk/Nk = n/N = C (constant)
ni = n*Ni/N. Thus, the sample size from the ith stratum, ni, is directly proportional to the size of the stratum, Ni.
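As a minimal sketch of the rule ni = n*Ni/N, the R snippet below allocates a sample of 100 across three strata. The stratum sizes and the helper name prop_alloc are made up for illustration and are not taken from the Delhi data.

```r
# Hypothetical stratum sizes (illustrative only, not the Delhi data)
Ni <- c(500, 300, 200)
n <- 100  # total sample size to allocate

# Proportional allocation: every stratum gets the same sampling fraction n/N
prop_alloc <- function(n, Ni) {
  n * Ni / sum(Ni)
}

ni <- prop_alloc(n, Ni)
ni       # 50 30 20
ni / Ni  # the sampling fraction is 0.1 in every stratum
```

With real stratum sizes the values are rarely whole numbers and have to be rounded, so the allocated sizes may not add up to exactly n.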
2) Optimum Allocation-
The guiding principle in the determination of the ni's is to choose them so as to-
a) Minimise the variance of the estimate for i) fixed sample size n ii) fixed cost
b) Minimise the total cost for the fixed desired precision
The allocation of ni's according to the above principles is known as optimum allocation.
Just as the variance is a function of the sample sizes, so is the cost of a survey. How the cost varies with the size of the total sample and with its allocation among the different strata depends on the particular survey. In the simplest case, the cost is directly proportional to the number of units in the sample. In yield surveys in India, where the fieldwork is carried out by the local staff in the course of their normal duties and the major item in the cost of a survey consists of labour charges on the harvesting of produce, the cost of the survey is found to be approximately proportional to the number of crop-cutting experiments. Cost per experiment may, however, vary in the different strata depending upon the availability of labour. In this situation, the total cost may be appropriately represented by
C = c1*n1 + c2*n2 + ... + ck*nk     (8)
where ci is the cost per experiment in the ith stratum. When ci is the same from stratum to stratum, say c, the total cost of a survey is given by C = c*n.
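To determine the optimum values of ni when the cost function is the linear one in (8), C = c1*n1 + c2*n2 + ... + ck*nk, we consider the function
phi = V(Yw) + mu*C
where mu is a constant and V(Yw) = sum over i of pi^2*Si^2*(1/ni - 1/Ni) is the variance of the stratified mean, with pi = Ni/N the weight and Si^2 the mean square of the ith stratum. Completing the square gives
phi = sum over i of (pi*Si/sqrt(ni) - sqrt(mu*ci*ni))^2 + 2*sqrt(mu)*(sum over i of pi*Si*sqrt(ci)) - sum over i of pi^2*Si^2/Ni     (10)
Only the first sum depends on the ni's. Clearly, V(Yw) is minimum for fixed C, or the cost C of a survey is minimum for a fixed value of V(Yw), when each of the square terms on the right-hand side of (10) is zero, or in other words, when
ni = pi*Si/sqrt(mu*ci), that is, ni is proportional to Ni*Si/sqrt(ci).
Thus,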
(i) the larger the size of the stratum, the larger should be the size of the sample to be selected therefrom;
(ii) the larger the variability within a stratum, the larger should be the size of the sample from that stratum; and
(iii) the cheaper the labour in a stratum, the larger the sample from that stratum.
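The three conclusions above amount to taking ni proportional to Ni*Si/sqrt(ci). The following R sketch scales that rule to a fixed total sample size n. The stratum sizes, standard deviations, costs and the helper name opt_alloc are all hypothetical and chosen only to show the effect.

```r
# Hypothetical stratum sizes, standard deviations and per-unit costs
Ni <- c(500, 300, 200)
Si <- c(10, 20, 40)
ci <- c(1, 4, 4)
n <- 100  # total sample size to allocate

# Optimum allocation scaled to a total of n units:
# ni is proportional to Ni * Si / sqrt(ci)
opt_alloc <- function(n, Ni, Si, ci = rep(1, length(Ni))) {
  k <- Ni * Si / sqrt(ci)
  n * k / sum(k)
}

equal_cost <- opt_alloc(n, Ni, Si)      # all costs equal
with_cost <- opt_alloc(n, Ni, Si, ci)   # strata 2 and 3 cost four times as much
rbind(equal_cost, with_cost)
```

When the costs and the standard deviations are the same in every stratum, the rule reduces to proportional allocation.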
Study Conducted-
Data on house prices in Delhi have been taken and divided into strata based on the number of bedrooms in the house. The appropriate sample sizes under proportional and optimum allocation have been obtained, and a sample has been drawn from the data under each allocation. We then estimate the average house price and the total of the house prices in Delhi using both samples.
Data Description – We take data on house prices in Delhi. The data set has 4998 records and 30 columns providing details about each house. The variable of concern in this study is the price of the house, and the stratifying variable is the number of bedrooms in the house.
https://www.kaggle.com/ruchi798/housing-prices-in-metropolitan-areas-of-india?select=Delhi.csv
R package used -
samplingbook
Function - stratasamp. The function calculates the sample size for each stratum depending on the type of allocation.
Arguments
n - positive integer specifying sampling size.
Nh - vector of population sizes of each stratum.
Sh - vector of standard deviation in each stratum.
Ch - vector of cost for a sample in each stratum.
type - type of allocation. Default is type='prop' for proportional, alternatives are type='opt' for optimal and type='costopt' for cost-optimal.
Value
The function stratasamp returns a matrix, which lists the strata and the sizes of observation depending on type of allocation.
Code-
```{r}
# We read the data on house prices in Delhi
Delhi <- read.csv("C:/Users/Sharothi/Desktop/SSd/Delhi.csv")
# We divide the data into strata based on the number of bedrooms in the house
stratum1=Delhi[Delhi$No..of.Bedrooms==1,]
stratum2=Delhi[Delhi$No..of.Bedrooms==2,]
stratum3=Delhi[Delhi$No..of.Bedrooms==3,]
stratum4=Delhi[Delhi$No..of.Bedrooms==4,]
stratum5=Delhi[Delhi$No..of.Bedrooms==5,]
stratum6=Delhi[Delhi$No..of.Bedrooms==6,]
stratum7=Delhi[Delhi$No..of.Bedrooms==7,]
stratum8=Delhi[Delhi$No..of.Bedrooms==8,]
ls=list(stratum1,stratum2,stratum3,stratum4,stratum5,stratum6,stratum7,stratum8)
# We calculate the stratum size N and the stratum standard deviation S of the price for each stratum
N=numeric(8)
S=numeric(8)
for(i in 1:8){
N[i]=nrow(ls[[i]])
S[i]=sd(ls[[i]]$Price)
}
S
N
# We load the samplingbook library and use the stratasamp function to calculate the size of the sample to be drawn from each stratum using proportional and optimum allocation
# The required sample size is 400
library(samplingbook)
ni_prop=stratasamp(n=400,Nh=N,Sh=S,type="prop")
ni_prop
ni_opt=stratasamp(n=400,Nh=N,Sh=S,type="opt")
ni_opt
# A stratum that is allocated a single unit cannot supply an estimate of its within-stratum variance, so we ignore such strata. The sample sizes used below then total 399 under proportional allocation and 400 under optimum allocation.
# We now draw the samples using proportional allocation
set.seed(17)
psample1=stratum1[sample(1:nrow(stratum1),19),]
psample2=stratum2[sample(1:nrow(stratum2),155),]
psample3=stratum3[sample(1:nrow(stratum3),182),]
psample4=stratum4[sample(1:nrow(stratum4),37),]
psample5=stratum5[sample(1:nrow(stratum5),6),]
```
```{r}
# We draw the sample using optimum allocation
set.seed(1791)
osample1=stratum1[sample(1:nrow(stratum1),15),]
osample2=stratum2[sample(1:nrow(stratum2),106),]
osample3=stratum3[sample(1:nrow(stratum3),178),]
osample4=stratum4[sample(1:nrow(stratum4),70),]
osample5=stratum5[sample(1:nrow(stratum5),27),]
osample6=stratum6[sample(1:nrow(stratum6),4),]
# We combine the individual samples to get the combined total sample for proportional and optimum allocation.
prop_sample=rbind(psample1,psample2,psample3,psample4,psample5)
opt_sample=rbind(osample1,osample2,osample3,osample4,osample5,osample6)
# We calculate the sample size from each stratum in the form of a vector and calculate the weight of each stratum under proportional allocation, where the sample share ni/n equals the stratum weight Ni/N up to rounding
ni_p=as.vector(table(prop_sample$No..of.Bedrooms))
ni_p
wi_p=ni_p/sum(ni_p)
wi_p
# We obtain the estimated population mean under proportional allocation
stratamean(y=prop_sample$Price,h=as.vector(prop_sample$No..of.Bedrooms),wh=wi_p)
# The estimate of population total using proportional allocation
Pop_total_prop=4998*17831040
Pop_total_prop
# We calculate the sample size from each stratum in the form of a vector. Under optimum allocation the sample shares differ from the population shares, so the stratum weights are taken as Ni/N over the six strata that were sampled
ni_o=as.vector(table(opt_sample$No..of.Bedrooms))
ni_o
wi_o=N[1:6]/sum(N[1:6])
wi_o
# We obtain the estimated population mean under optimum allocation
stratamean(y=opt_sample$Price,h=as.vector(opt_sample$No..of.Bedrooms),wh=wi_o)
#The estimate of population total using optimum allocation
Pop_total_opt= 4998*20054610
Pop_total_opt
```
```{r}
# We now draw a sample of same size 399 using SRSWOR
set.seed(30)
samp_srs=Delhi[sample(1:nrow(Delhi),399),]
# We calculate the sample mean square and variance of sample mean using SRSWOR
samp_mean_sq=var(samp_srs$Price)
samp_var=((1/399)-(1/4998))*samp_mean_sq
samp_var
# We get the variance of the sample mean in case of optimum allocation by using the standard error we got using strata mean function and squaring it
var_opt= 2834121^2
var_opt
# We now calculate the estimated gain in precision due to stratification using optimum allocation over SRSWOR, based on the samples obtained
est_gain=(samp_var-var_opt)/var_opt
est_gain
# We now calculate the gain in precision due to stratification using optimum allocation over SRSWOR using the theoretical population formulas
# We get the variance of the sample mean under SRSWOR from the population mean square as v
v=var(Delhi$Price)*((1/399)-(1/4998))
v
# We define Sh, Nh and the stratum weights Wh=Nh/N, and use the formula for the variance of the sample mean in stratified random sampling, sum(Wh^2*Sh^2*(1/nh-1/Nh))
sh=S[1:6]
Nh=N[1:6]
Wh=Nh/sum(Nh)
var_o=sum((Wh^2)*(sh^2)*((1/ni_o)-(1/Nh)))
var_o
# Gain in precision of optimum allocation over SRSWOR
(v-var_o)/var_o
```
Result-
1) On the basis of proportional allocation, the sample size for each stratum is obtained. Strata allocated a sample size of 1 are ignored, since a single unit gives no estimate of the within-stratum variance.
2) The sample mean using proportional allocation is Rs 1,78,31,040 with standard error 26,39,378 and 95% confidence interval (Rs 1,26,57,954, Rs 2,30,04,126).
3) The estimate for population total using proportional allocation is Rs 89,11,95,37,920.
4) On the basis of optimum allocation, the sample size for each stratum is obtained. Strata allocated a sample size of 1 are again ignored for the same reason.
5) The sample mean using optimum allocation is Rs 2,00,54,610 with standard error 28,34,121 and 95% confidence interval (Rs 1,44,99,834, Rs 2,56,09,385).
6) The estimate for population total using optimum allocation is Rs 1,00,23,29,40,780.
7) The gain in precision for optimum allocation over SRSWOR using population values is 46003291.
8) The estimated gain in precision for optimum allocation over SRSWOR using sample values is 0.4157049.
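To make the gain-in-precision calculation concrete, here is a small self-contained sketch on a toy population with two strata. The values are invented for illustration. It uses the textbook formulas: V_srs = S^2*(1/n - 1/N) for SRSWOR, and V_st = sum of Wh^2*Sh^2*(1/nh - 1/Nh) with stratum weights Wh = Nh/N for stratified sampling. It says nothing about the Delhi figures themselves.

```r
# Toy population with two strata (illustrative values only)
y <- c(1, 2, 3, 4, 5, 6, 20, 22, 24, 26)
h <- c(rep("A", 6), rep("B", 4))
nh <- c(A = 3, B = 2)        # sample sizes per stratum, n = 5 in total

Nh <- tapply(y, h, length)   # stratum sizes
Sh2 <- tapply(y, h, var)     # stratum mean squares
Wh <- Nh / sum(Nh)           # stratum weights Nh/N

# Variance of the sample mean under SRSWOR with the same total sample size
v_srs <- var(y) * (1 / sum(nh) - 1 / length(y))

# Variance of the stratified mean: sum of Wh^2 * Sh^2 * (1/nh - 1/Nh)
v_st <- sum(Wh^2 * Sh2 * (1 / nh - 1 / Nh))

# Relative gain in precision of stratification over SRSWOR
gain <- (v_srs - v_st) / v_st
c(v_srs = v_srs, v_st = v_st, gain = gain)
```

The weights here are population shares Nh/N. The sample shares nh/n coincide with them only under proportional allocation, so under optimum allocation the two should not be interchanged.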
Conclusion and Interpretation-
Using proportional allocation, we get the estimated mean house price in Delhi as Rs 1,78,31,040. We are 95% confident that the population mean price lies in the interval (Rs 1,26,57,954, Rs 2,30,04,126); that is, intervals constructed in this way cover the true mean in about 95 out of 100 repeated samples.
The estimated total price of all the houses in the population is Rs 89,11,95,37,920.
Using optimum allocation, we get the estimated mean house price in Delhi as Rs 2,00,54,610.
The corresponding 95% confidence interval for the population mean price is (Rs 1,44,99,834, Rs 2,56,09,385). The estimated total price of all the houses in the population is Rs 1,00,23,29,40,780.
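As a quick check on how these figures relate to each other, the sketch below recomputes the intervals and totals from the reported means and standard errors. It assumes the intervals are normal-approximation intervals, mean ± qnorm(0.975)*se. That assumption matches the reported limits to within a rupee, but the source does not state it.

```r
# Reported estimates (Rs) from the results above
est <- c(prop = 17831040, opt = 20054610)  # estimated mean prices
se <- c(prop = 2639378, opt = 2834121)     # their standard errors
N <- 4998                                  # number of houses in the data

# Normal-approximation 95% confidence limits (assumed form of the interval)
z <- qnorm(0.975)
lower <- est - z * se
upper <- est + z * se
round(cbind(lower, upper))

# Estimated population totals: N times the estimated mean
N * est
```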