Stratified Sampling in R With Examples

Posted on February 19, 2022 by finnstats in R bloggers

[This article was first published on Data Analysis in R » Quick Guide for Statistics & R » finnstats, and kindly contributed to R-bloggers.]

Researchers frequently take samples from a population and use the data from the sample to make generalizations about the entire population.

A common sampling approach is stratified random sampling, which divides a population into groups (strata) and then randomly selects members from each group to be included in the sample.

This article shows you how to perform stratified random sampling in R.

Approach: Stratified Sampling in R

A corporation has 400 employees who are either freshers, juniors, mid-level employees, or senior employees.

Let’s say we want to obtain a stratified sample of 40 employees, with 10 employees from each level.

The following code shows how to create a sample data frame of 400 employees.

With the help of set.seed, we can make this example repeatable.

set.seed(1)
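As a quick illustration of why the seed matters, the sketch below (not part of the original example) resets the seed before two identical random draws and checks that the results match. The names first_draw and second_draw are chosen here purely for illustration.

```r
# Resetting the seed restarts R's random number generator from the same state
set.seed(1)
first_draw <- rnorm(3)

set.seed(1)
second_draw <- rnorm(3)

# TRUE: both draws started from the same seed, so the values are identical
identical(first_draw, second_draw)
```

Anyone who runs the same code with the same seed should therefore obtain the same sample.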

Now let’s create a data frame

data
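The full assignment that builds data is not shown above. As a minimal sketch, assuming 100 employees per level and normally distributed scores (the mean and standard deviation below are placeholders, not the original values), a comparable data frame could be built as follows. Because these parameters are assumed, the scores it produces will not match the output shown in this article.

```r
set.seed(1)

# Assumption: four levels with 100 employees each (400 in total)
levels_assumed <- c("freshers", "juniors", "mid-level", "senior")

# Assumption: scores come from a normal distribution; mean and sd are placeholders
data <- data.frame(
  Level = rep(levels_assumed, each = 100),
  Score = rnorm(400, mean = 45, sd = 3)
)

# Check the shape and the number of employees per level
dim(data)
table(data$Level)
```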

View the first six rows of the data frame

head(data)

     Level    Score
1 freshers 46.81129
2 freshers 45.61885
3 freshers 47.13777
4 freshers 45.54551
5 freshers 45.06891
6 freshers 45.68639

The following code demonstrates how to use the dplyr package’s group_by() and sample_n() functions to create a stratified random sample of 40 employees, with 10 employees from each Level.

library(dplyr)

Get a stratified sample of 10 employees per Level from the data frame.

stratified <- data %>% group_by(Level) %>% sample_n(size = 10)
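In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). A minimal sketch of the same stratified draw with the newer function follows. The example_data frame is a hypothetical stand-in, built under the assumption of 100 employees per level with placeholder score parameters, so its values will differ from the article's data.

```r
library(dplyr)

set.seed(1)

# Hypothetical stand-in data: 100 employees per level, placeholder score distribution
example_data <- data.frame(
  Level = rep(c("freshers", "juniors", "mid-level", "senior"), each = 100),
  Score = rnorm(400, mean = 45, sd = 3)
)

# Draw 10 rows at random within each Level, then drop the grouping
stratified_new <- example_data %>%
  group_by(Level) %>%
  slice_sample(n = 10) %>%
  ungroup()

# Number of sampled employees per level
count(stratified_new, Level)
```

Calling ungroup() at the end is optional, but it avoids surprises if later steps are not meant to run per level.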

To find the number of employees sampled from each Level, tabulate the Level column.

table(stratified$Level)

Each of the four levels appears exactly 10 times, for 40 employees in total. Tabulating Score instead only lists each of the 40 sampled scores with a count of 1:

table(stratified$Score)

40.6277541808117 41.8867328984806 42.1225665842419 42.5233762802742 42.5544884803451
               1                1                1                1                1
42.7536151417636 42.8846937474664 42.9742927968522 43.1218453854941 43.1558424722147
               1                1                1                1                1
43.6575315133425 43.7415578635583 43.7732881183767 44.6932550551858 44.8755449387381
               1                1                1                1                1
45.0020656995027 45.2668319456886 45.3899139820568 45.4797068293891 45.5017168903959
               1                1                1                1                1
45.5455064157118 46.1478255944327 46.3450739535307 46.3836008714994 46.5858975045594
               1                1                1                1                1
46.6546954492613 46.7620971328865 46.9493723718007 47.0418493618535 47.1284691388457
               1                1                1                1                1
47.1753773706728 47.2486845777309 47.3834597232738 47.4520743699156 47.6813717922399
               1                1                1                1                1
47.6916655311883 48.4030768433805 48.7269106424762 48.9858858605196 49.0114190243513
               1                1                1                1                1
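For readers who prefer to avoid a package dependency, the same idea can be sketched in base R: split the row indices by level, sample 10 indices within each group, and tabulate the result. The emp data frame below is a hypothetical stand-in (100 employees per level, placeholder score parameters), not the article's data.

```r
set.seed(1)

# Hypothetical stand-in data: 100 employees per level, placeholder score distribution
emp <- data.frame(
  Level = rep(c("freshers", "juniors", "mid-level", "senior"), each = 100),
  Score = rnorm(400, mean = 45, sd = 3)
)

# Split row indices by stratum, then sample 10 indices without replacement in each
rows_by_level <- split(seq_len(nrow(emp)), emp$Level)
picked <- unlist(lapply(rows_by_level, function(idx) sample(idx, size = 10)))

stratified_base <- emp[picked, ]

# Each level should appear exactly 10 times
table(stratified_base$Level)
```

Sampling indices rather than rows keeps the selection step separate from the subsetting step, which makes it easy to check which employees were picked.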

Conclusions

This article covered stratified random sampling, one of the most useful sampling techniques for a data scientist to know.

Remember that in machine learning, a well-designed sample can make all the difference because it lets us work with less data while keeping the sample representative of the population.

If you are interested in learning more about data science, you can find more articles on finnstats.