Proportional Sampling
~Sanchaita Bhattacharjee-2148150.
Christ University, Bengaluru.
In this paper I have tried to explain the concept as simply as possible, so that even readers without a statistical background can follow it.
The paper first gives the definition, then covers the differences between proportionate and disproportionate sampling with some examples, and concludes with a worked example in R.
DEFINITION: Proportional sampling is a method of sampling in which the investigator divides a finite population into subpopulations and then applies random sampling techniques to each subpopulation. The term is closely related to proportional allocation in finite population sampling, although in other contexts it also refers to other survey sampling situations. For a finite population with population size N, the population is divided into H strata (subpopulations) according to certain attributes. The size of the hth stratum is denoted as $N_h$, and

$\sum_{h=1}^{H} N_h = N$.

Proportional sampling refers to a design with total sample size n such that

$n_h = n \times \frac{N_h}{N}$

and

$\sum_{h=1}^{H} n_h = n$.

Following that, a simple random sample with sample size $n_h$ is selected within each stratum. This is essentially the same as the stratified random sampling design with proportional allocation.
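The design described above can be sketched in a few lines of base R. This is only an illustration and not part of the original analysis: the stratum labels and sizes are made up, the helper name proportional_allocation is hypothetical, and the sample sizes are simply rounded, which need not add up exactly to n for every choice of stratum sizes.

```r
# Hypothetical helper: proportional allocation n_h = n * N_h / N.
proportional_allocation <- function(stratum_sizes, n) {
  N <- sum(stratum_sizes)        # population size N is the sum of the N_h
  round(n * stratum_sizes / N)   # n_h, rounded to whole units
}

# Made-up population with H = 3 strata (the sizes are assumptions).
N_h <- c(A = 500, B = 300, C = 200)
n_h <- proportional_allocation(N_h, n = 100)
n_h

# Sampling frame: one row per unit, labelled with its stratum.
frame <- data.frame(id = seq_len(sum(N_h)),
                    stratum = rep(names(N_h), times = N_h))

# Simple random sample without replacement inside each stratum.
set.seed(1)
sampled_ids <- unlist(lapply(names(N_h), function(h) {
  ids <- frame$id[frame$stratum == h]
  ids[sample.int(length(ids), n_h[[h]])]
}))
stratified_sample <- frame[frame$id %in% sampled_ids, ]

# Number of sampled units per stratum.
table(stratified_sample$stratum)
```

Every stratum ends up with the same sampling fraction (here 10%), which is what makes the allocation proportional.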
Disproportional vs. Proportional Sampling
The main difference between the two sampling techniques is the sampling fraction applied to each stratum. In proportional sampling, every stratum has the same sampling fraction, while in disproportional sampling the sampling fraction varies from stratum to stratum.
Example of Disproportional Sample
Suppose, for example, a researcher wants to survey the students of a university with 10,000 students: 8,000 females and 2,000 males. The desired sample size is only 1,000. Since the 1,000 subjects needed for the survey are 10% of the entire population, proportional sampling suggests that 8/10 of the sample be female and 2/10 be male. This would result in a sample composed of 800 females and 200 males. In this case, the relatively small number of males in the sample might not provide adequate representation for drawing conclusions from the survey.
The disproportional sampling technique lets the researcher select a sample of adequate size from each of the two genders. For example, 500 males and 500 females can be selected to represent the population. The selection is then no longer made with equal probability, since each male had a better chance of being selected (500 out of 2,000) than each female (500 out of 8,000).
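The sampling fractions of the two designs can be checked with a short R sketch. The figures are those of the university example; the variable names and the final design-weight step (stratum size divided by stratum sample size, the usual inverse of the selection probability) are additions for illustration.

```r
# Figures taken from the university example.
N_h       <- c(female = 8000, male = 2000)   # stratum sizes
n_prop    <- c(female = 800,  male = 200)    # proportional design
n_disprop <- c(female = 500,  male = 500)    # disproportional design

# Sampling fraction f_h = n_h / N_h in each stratum.
f_prop    <- n_prop / N_h      # 0.10 in both strata
f_disprop <- n_disprop / N_h   # female 0.0625, male 0.25
f_prop
f_disprop

# Design weights for the disproportional sample: each sampled
# student stands for N_h / n_h students of the same stratum.
w_disprop <- N_h / n_disprop   # female 16, male 4
w_disprop
```

With these weights the 500 sampled females and 500 sampled males together represent all 10,000 students, which is how estimates for the whole population can still be obtained from a disproportional sample.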
Example of Proportionate Sampling
Proportionate sampling is a sampling strategy (a method for gathering participants for a study) used when the population is composed of several subgroups that differ greatly in size. The number of participants drawn from each subgroup is determined by the size of that subgroup relative to the entire population.
For example, imagine you want to create a council of 20 employees that will meet and recommend possible changes to the employee handbook. Let's say 40% of your employees are in Sales and Marketing, 30% in Customer Service, 20% of your employees are in IT, and 10% in Finance. You will randomly select 8 people from Sales and Marketing, 6 from Customer Service, 4 from IT, and 2 from Finance. As you can see, each number you pick is proportionate to the overall percentage of people in each category (e.g., 40% = 8 people).
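The arithmetic of the council example can be written out in R. The department shares are those given above; the variable names are chosen only for this illustration.

```r
council_size <- 20

# Share of employees in each department, as stated in the example.
share <- c("Sales and Marketing" = 0.40,
           "Customer Service"    = 0.30,
           "IT"                  = 0.20,
           "Finance"             = 0.10)

# Seats per department = council size times the department's share.
seats <- round(council_size * share)
seats

# The seats add up to the full council.
sum(seats)
```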
Proportional sampling using R
#DATA DESCRIPTION: #The data come from a dataset of 2017 songs with attributes from Spotify’s API. Each song is labeled “1” if the song is liked and “0” if it is not liked. The data are used to see what kind of songs are liked or disliked. In this study the variable of interest is mode. The population size of our data is 199.
#OBJECTIVE: #The objective of this study is to estimate the proportion of songs on Spotify that are liked. I have also found the 95% confidence limits, using simple random sampling without replacement.
library(readxl)
proportion_data <- read_excel("~/proportion data.xlsx")
## New names:
## * `` -> ...1
View(proportion_data)
population=proportion_data
population
## # A tibble: 199 x 12
## ...1 acousticness danceability duration_ms energy instrumentalness key
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.0102 0.833 204600 0.434 0.0219 2
## 2 1 0.199 0.743 326933 0.359 0.00611 1
## 3 2 0.0344 0.838 185707 0.412 0.000234 2
## 4 3 0.604 0.494 199413 0.338 0.51 5
## 5 4 0.18 0.678 392893 0.561 0.512 5
## 6 5 0.00479 0.804 251333 0.56 0 8
## 7 6 0.0145 0.739 241400 0.472 0.00000727 1
## 8 7 0.0202 0.266 349667 0.348 0.664 10
## 9 8 0.0481 0.603 202853 0.944 0 11
## 10 9 0.00208 0.836 226840 0.603 0 7
## # ... with 189 more rows, and 5 more variables: liveness <dbl>, loudness <dbl>,
## # mode <dbl>, song_title <chr>, artist <chr>
N=length(population$mode)
N
## [1] 199
set.seed(123)
srswor=population[sample(1:nrow(population),20,replace = FALSE),]
srswor
## # A tibble: 20 x 12
## ...1 acousticness danceability duration_ms energy instrumentalness key
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 158 0.19 0.735 275840 0.41 0 11
## 2 178 0.0108 0.787 273853 0.658 0.0152 8
## 3 13 0.366 0.762 243270 0.476 0 0
## 4 194 0.00818 0.539 224907 0.582 0.0717 7
## 5 169 0.228 0.628 260947 0.568 0.0183 10
## 6 49 0.183 0.716 576888 0.957 0.058 3
## 7 117 0.00257 0.896 267024 0.623 0.000258 2
## 8 42 0.00156 0.624 183000 0.792 0.00000154 10
## 9 196 0.00275 0.476 195200 0.818 0.757 2
## 10 192 0.0346 0.19 230693 0.908 0.0369 0
## 11 152 0.689 0.658 238560 0.179 0 8
## 12 89 0.0397 0.786 302840 0.547 0.272 1
## 13 90 0.0213 0.734 321933 0.548 0 11
## 14 186 0.0585 0.552 283636 0.878 0.892 1
## 15 184 0.0414 0.774 269867 0.901 0.00000484 9
## 16 91 0.0454 0.812 268421 0.676 0 1
## 17 136 0.103 0.772 191160 0.582 0.0000661 4
## 18 98 0.716 0.685 427227 0.423 0.518 11
## 19 71 0.783 0.596 196627 0.639 0.569 0
## 20 25 0.00219 0.781 205160 0.795 0.269 7
## # ... with 5 more variables: liveness <dbl>, loudness <dbl>, mode <dbl>,
## # song_title <chr>, artist <chr>
# Suppose we are interested in estimating the proportion of songs on Spotify that are liked (0 = song disliked, 1 = song liked).
song1=sum(srswor$mode==1)
song1
## [1] 14
library(samplingbook)
## Loading required package: pps
## Loading required package: sampling
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survival'
## The following objects are masked from 'package:sampling':
##
## cluster, strata
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
Sprop(y=srswor$mode,m=song1,n=length(srswor$mode),N=199,level = .95)
##
## Sprop object: Sample proportion estimate
## With finite population correction: N = 199
##
## Proportion estimate: 0.7
## Standard error: 0.0997
##
## 95% approximate confidence interval:
## proportion: [0.5046,0.8954]
## number in population: [101,178]
## 95% exact hypergeometric confidence interval:
## proportion: [0.4673,0.8744]
## number in population: [93,174]
pop_total=199*0.7
pop_total
## [1] 139.3
ul=0.7+1.96*0.0997
ul
## [1] 0.895412
ll=0.7-1.96*0.0997
ll
## [1] 0.504588
#INTERPRETATION: #The data have population size N=199, and using SRSWOR I have taken a sample of size 20. Since we are interested in the proportion of songs liked, we note that in the sample of size 20 there are 14 songs that are liked and 6 songs that are disliked. To estimate the proportion I have used the function Sprop, which gives a proportion estimate of 0.7. This means that an estimated 70% of the songs in the population are liked and 30% are disliked. The 95% approximate confidence interval for the proportion is [0.5046,0.8954], and the corresponding interval for the number of liked songs in the population is [101,178]; that is, with 95% confidence between 101 and 178 songs are liked. The estimated population total is 139.3, i.e. about 139 liked songs.
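The standard error and the approximate interval reported above can be reproduced by hand in base R. The sketch assumes the usual estimator for a proportion under simple random sampling without replacement, with the finite population correction (1 - n/N) and n - 1 in the denominator. That this is the formula behind the Sprop output is an assumption, made because it reproduces the reported standard error of 0.0997.

```r
N <- 199   # population size
n <- 20    # sample size
m <- 14    # liked songs in the sample

p_hat <- m / n   # sample proportion

# Standard error with finite population correction (assumed formula).
se <- sqrt((1 - n / N) * p_hat * (1 - p_hat) / (n - 1))

# 95% approximate (normal) confidence interval for the proportion.
z  <- qnorm(0.975)
ci <- p_hat + c(-1, 1) * z * se

round(se, 4)   # 0.0997
round(ci, 4)   # 0.5046 0.8954

# Estimated number of liked songs in the population.
N * p_hat      # 139.3
```

Without the correction factor (1 - n/N) the standard error would be slightly larger, because the sample covers about 10% of this small population.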