In sociology and statistics research, snowball sampling is a nonprobability sampling technique in which existing study subjects recruit future subjects from among their acquaintances. The sample group is thus said to grow like a rolling snowball: as the sample builds up, enough data are gathered to be useful for research. This technique is used in hidden populations, such as drug users or sex workers, which are difficult for researchers to access. Because sample members are not selected from a sampling frame, snowball samples are subject to numerous biases; for example, people who have many friends are more likely to be recruited into the sample. When virtual social networks are used, the technique is called virtual snowball sampling. It was long believed that unbiased estimates could not be made from snowball samples, but a variation of snowball sampling called respondent-driven sampling has been shown to allow researchers to make asymptotically unbiased estimates under certain conditions.
Snowball sampling and respondent-driven sampling also allow researchers to make estimates about the social network connecting the hidden population. Snowball sampling uses a small pool of initial informants to nominate, through their social networks, other participants who meet the eligibility criteria and could contribute to a specific study; the term "snowball sampling" reflects an analogy to a snowball increasing in size as it rolls downhill. The basic steps are: draft a participation program; approach stakeholders and ask for contacts; gain contacts and ask them to participate (community issues groups may emerge); continue the snowballing with new contacts to gain more stakeholders if necessary; and ensure a diversity of contacts by widening the profile of persons involved in the snowballing exercise. Participants are expected to know others who share the characteristics that make them eligible for inclusion in the study. Snowball sampling is well suited to cases where members of a population are hidden and difficult to locate but are connected to one another.
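The wave-by-wave recruitment described above can be sketched as a traversal of an acquaintance network. The following is a minimal illustration only: the names, the acquaintance graph, the eligibility set, and the wave and referral limits are all invented for the example.

```python
import random

# Hypothetical acquaintance network: person -> list of contacts (invented data).
network = {
    "ana": ["ben", "cara"], "ben": ["ana", "dev", "eli"],
    "cara": ["ana", "eli"], "dev": ["ben", "fay"],
    "eli": ["ben", "cara", "fay"], "fay": ["dev", "eli"],
}
eligible = {"ana", "ben", "cara", "dev", "eli", "fay"}

def snowball_sample(seeds, max_waves=3, referrals_per_person=2, rng=None):
    """Grow a sample wave by wave: each recruit nominates eligible contacts."""
    rng = rng or random.Random(0)
    sample, current = set(seeds), list(seeds)
    for _ in range(max_waves):
        next_wave = []
        for person in current:
            contacts = [c for c in network.get(person, [])
                        if c in eligible and c not in sample]
            for contact in rng.sample(contacts,
                                      min(referrals_per_person, len(contacts))):
                sample.add(contact)
                next_wave.append(contact)
        if not next_wave:      # the snowball stopped growing
            break
        current = next_wave
    return sample

print(sorted(snowball_sample(["ana"])))
```

Starting from a single seed, the sample reaches everyone connected to it within the wave limit, which is exactly why well-connected people are overrepresented in real snowball samples.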
1. Social computing: Snowball sampling can be used as an evaluation sampling method in the social computing field. For example, in the interview phase, snowball sampling can be used to reach hard-to-reach populations: participants or informants with whom contact has already been made can use their social networks to refer the researcher to other people who could participate in or contribute to the study. 2. Conflict environments: Conducting research in a conflict environment is challenging due to mistrust and suspicion. A conflict environment is one in which people or groups think their needs and goals are contradictory to those of other people or groups; such conflicts include disputes over territory, trade, and religious rights, which cause considerable misunderstanding, heighten disagreements, and foster a lack of trust. In a conflict environment, the entire population is marginalized to some extent rather than one specific group, which makes it hard for investigators to reach study subjects.
For example, a threatening political environment under an authoritarian regime creates obstacles for investigators. Snowball sampling has been demonstrated to be a second-best method for conducting research in conflict environments such as the Israeli-Arab conflict. It allows investigators to approach the marginalized population at a cognitive and emotional level and enroll them in a study, and it addresses the lack of trust arising from uncertainty about the future through a trace-linking methodology. 3. Expert information collection: Snowball sampling can be used to identify experts in a certain field, such as medicine, manufacturing processes, or customer-relations methods, and to gather professional and valuable knowledge. For instance, 3M used snowball sampling to call in specialists from all fields related to how a surgical drape could be applied to the body; each expert involved could suggest further specialists. Locating hidden populations: Through the use of social networks, surveyors can include people in the survey whom they would not otherwise have been able to reach.
Locating members of a specific population: There may be no lists or other obvious sources for locating members of the population. Using previous contact and communication with subjects, investigators are able to gain access and cooperation from new subjects; the key to gaining access and documenting the cooperation of subjects is trust, which is achieved when investigators act in good faith and establish good working relationships with the subjects. Methodology: Because subjects are used to locate the hidden population, the researcher invests less money and time in sampling. The snowball sampling method does not require complex planning, and the staffing required is smaller than for other sampling methods. Snowball sampling can be used both as an alternative methodology, when other research methods cannot be employed due to challenging circumstances or when random sampling is not possible, and as a complementary methodology alongside other research methods to boost the quality and efficiency of research.
Cluster sampling is a sampling plan used when mutually homogeneous yet internally heterogeneous groupings are evident in a statistical population; it is often used in marketing research. In this sampling plan, the total population is divided into these groups (clusters), a simple random sample of the groups is selected, and the elements in each selected cluster are then sampled. If all elements in each sampled cluster are sampled, this is referred to as a "one-stage" cluster sampling plan. If a simple random subsample of elements is selected within each of these groups, this is referred to as a "two-stage" cluster sampling plan. A common motivation for cluster sampling is to reduce the total number of interviews and costs given the desired accuracy. For a fixed sample size, the expected random error is smaller when most of the variation in the population is present internally within the groups, not between the groups; the population within a cluster should ideally be as heterogeneous as possible, while there should be homogeneity between clusters.
Each cluster should be a small-scale representation of the total population, and the clusters should be collectively exhaustive. A random sampling technique is then used on the relevant clusters to choose which clusters to include in the study. In single-stage cluster sampling, all the elements from each of the selected clusters are sampled; in two-stage cluster sampling, a random sampling technique is applied to the elements within each of the selected clusters. The main difference between cluster sampling and stratified sampling is that in cluster sampling the cluster is treated as the sampling unit, so sampling is done on a population of clusters, whereas in stratified sampling the sampling is done on elements within each stratum. In stratified sampling, a random sample is drawn from every stratum, while in cluster sampling only the selected clusters are sampled. A common motivation for cluster sampling is to reduce costs by increasing sampling efficiency; this contrasts with stratified sampling, where the motivation is to increase precision. There is also multistage cluster sampling, where at least two stages are taken in selecting elements from clusters.
Cluster sampling is unbiased, without modifying the estimated parameter, when the clusters are of equal size; in this case, the parameter is computed by combining all the selected clusters. When the clusters are of different sizes, probability-proportionate-to-size (PPS) sampling is used: the probability of selecting a cluster is proportional to its size, so a large cluster has a greater probability of selection than a small cluster. However, when clusters are selected with probability proportionate to size, the same number of interviews should be carried out in each sampled cluster so that each unit sampled has the same overall probability of selection. An example of cluster sampling is geographical cluster sampling, in which each cluster is a geographical area. Because a geographically dispersed population can be expensive to survey, greater economy than simple random sampling can be achieved by grouping several respondents within a local area into a cluster. It is usually necessary to increase the total sample size to achieve equivalent precision in the estimators, but the cost savings may make such an increase in sample size feasible.
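Probability-proportionate-to-size selection can be sketched as a weighted draw over clusters. The cluster names and sizes below are purely illustrative, and the sketch draws with replacement for simplicity (real PPS designs often draw without replacement or systematically).

```python
import random

# Hypothetical clusters (e.g. city blocks) with population sizes (invented data).
cluster_sizes = {"block_A": 500, "block_B": 1500, "block_C": 3000, "block_D": 1000}

def pps_draw(sizes, n_draws, rng=None):
    """Draw clusters with probability proportional to size (with replacement)."""
    rng = rng or random.Random(1)
    names = list(sizes)
    weights = [sizes[c] for c in names]
    return rng.choices(names, weights=weights, k=n_draws)

# block_C holds 3000 of the 6000 people, so it appears in roughly half of
# many repeated draws.
draws = pps_draw(cluster_sizes, 10_000)
print(draws.count("block_C") / len(draws))
```

The same-number-of-interviews rule in the text then equalizes each unit's overall selection probability: a large cluster is drawn more often, but each of its members is less likely to be picked within it.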
Cluster sampling is used to estimate high mortality in cases such as wars and natural disasters. Its advantages include the following. Cost: it can be cheaper than other sampling plans, e.g. through lower travel and administration costs. Feasibility: this sampling plan accommodates large populations; since the groups are so large, deploying any other sampling plan would be costly. Economy: the two regular major concerns of expenditure, traveling and listing, are reduced in this method. For example, compiling research information about every household in a city would be costly, whereas compiling information about various blocks of the city is more economical; both traveling and listing efforts are reduced. Reduced variability: compared with estimates from other methods, reduced variability in results may be observed, though this is not always desirable. Major use: when a sampling frame of all elements is not available, cluster sampling may be the only feasible approach. Its main disadvantage is a higher sampling error, which can be expressed in the so-called "design effect": the ratio between the number of subjects needed in the cluster study and the number needed in an equally reliable, randomly sampled unclustered study.
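A common approximation for the design effect of equal-sized clusters (an assumption not stated in the text above, but standard in survey methodology) is DEFF = 1 + (m − 1)·ρ, where m is the cluster size and ρ is the intraclass correlation. The numbers below are illustrative only.

```python
def design_effect(avg_cluster_size, icc):
    """DEFF = 1 + (m - 1) * ICC: variance inflation of a clustered sample
    relative to a simple random sample of the same total size."""
    return 1 + (avg_cluster_size - 1) * icc

# Illustrative values: clusters of 20 with modest within-cluster correlation.
deff = design_effect(20, 0.05)
print(deff)  # 1.95: the clustered design needs ~1.95x as many subjects
```

Even a small intraclass correlation inflates the required sample size substantially once clusters are moderately large, which is the "higher sampling error" referred to above.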
Biased samples: if the group chosen as a sample from the population has a biased opinion, the entire population is inferred to have the same opinion, which may not be the actual case. Errors: other probabilistic methods give fewer errors than this method; for this reason, it is discouraged for beginners. Two-stage cluster sampling, a simple case of multistage sampling, is obtained by selecting cluster samples in the first stage and then selecting a sample of elements from every sampled cluster. Consider a population of N clusters in total. In the first stage, n clusters are selected using an ordinary cluster sampling method. In the second stage, simple random sampling is applied separately within every selected cluster, and the numbers of elements selected from different clusters need not be equal. The total number of clusters N, the number of clusters selected n, and the numbers of elements taken from each selected cluster must be pre-determined by the survey designer. Two-stage cluster sampling aims to minimize survey costs while controlling the uncertainty of the estimates of interest.
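The two stages described above can be sketched directly: a simple random sample of clusters, then a simple random sample of elements within each chosen cluster. The cluster names and element counts below are invented for illustration.

```python
import random

# Hypothetical population of N = 6 clusters; element counts are invented.
population = {
    f"cluster_{i}": [f"c{i}_elem_{j}" for j in range(size)]
    for i, size in enumerate([8, 12, 5, 20, 9, 15])
}

def two_stage_sample(pop, n_clusters, n_per_cluster, rng=None):
    """Stage 1: SRS of clusters. Stage 2: SRS of elements in each chosen cluster."""
    rng = rng or random.Random(2)
    chosen = rng.sample(list(pop), n_clusters)          # first stage
    return {c: rng.sample(pop[c], min(n_per_cluster, len(pop[c])))
            for c in chosen}                            # second stage

sample = two_stage_sample(population, n_clusters=3, n_per_cluster=4)
print(sample)
```

Here the per-cluster sample size is fixed for simplicity; as the text notes, the survey designer is free to pre-determine different counts per cluster.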
This method can be used in health and social sciences. For instance, researchers used two-stage cluster sampling to generate a representative sample of
In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. It results in a biased sample: a non-random sample of a population in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling. Medical sources sometimes refer to sampling bias as ascertainment bias; ascertainment bias has the same definition but is still sometimes classified as a separate type of bias. Sampling bias is usually classified as a subtype of selection bias, sometimes termed sample selection bias, though some classify it as a separate type of bias. One distinction, albeit not universally accepted, is that sampling bias undermines the external validity of a test, while selection bias mainly addresses internal validity for differences or similarities found in the sample at hand. In this sense, errors occurring in the process of gathering the sample or cohort cause sampling bias, while errors in any process thereafter cause selection bias.
In practice, however, the terms selection bias and sampling bias are often used synonymously. Common causes include the following. Selection from a specific area: for example, a survey of high school students to measure teenage use of illegal drugs will be a biased sample because it does not include home-schooled students or dropouts. A sample is also biased if certain members are underrepresented or overrepresented relative to others in the population: for example, a "man on the street" interview which selects people who walk by a certain location will have an overrepresentation of healthy individuals, who are more likely to be out of the home than individuals with a chronic illness. This can be an extreme form of biased sampling, because certain members of the population are excluded from the sample entirely. Self-selection bias is possible whenever the group of people being studied has any form of control over whether to participate: participants' decision to participate may be correlated with traits that affect the study, making the participants a non-representative sample.
For example, people who have strong opinions or substantial knowledge may be more willing to spend time answering a survey than those who do not. Another example is online and phone-in polls, which are biased samples because the respondents are self-selected: individuals who are motivated to respond, typically those with strong opinions, are overrepresented, while individuals who are indifferent or apathetic are less likely to respond. This leads to a polarization of responses, with extreme perspectives given disproportionate weight in the summary; as a result, these types of polls are regarded as unscientific. Pre-screening of trial participants, or advertising for volunteers within particular groups, is another source of bias: for example, a study to "prove" that smoking does not affect fitness might recruit at the local fitness center, but advertise for smokers during the advanced aerobics class and for non-smokers during the weight-loss sessions. Exclusion bias results from the exclusion of particular groups from the sample, e.g. exclusion of subjects who have recently migrated into the study area.
Excluding subjects who move out of the study area during follow-up is instead roughly equivalent to dropout or nonresponse, a selection bias in that it affects the internal validity of the study. Healthy user bias occurs when the study population is healthier than the general population; for example, someone in poor health is unlikely to hold a job as a manual laborer. Berkson's fallacy arises when the study population is selected from a hospital and so is less healthy than the general population; this can result in a spurious negative correlation between diseases, since a hospital patient without diabetes is more likely to have some other given disease, such as cholecystitis, having had some reason to enter the hospital in the first place. Overmatching is matching for an apparent confounder that is actually a result of the exposure, so that the control group becomes more similar to the cases in regard to exposure than the general population is. Survivorship bias occurs when only "surviving" subjects are selected, ignoring those that fell out of view.
For example, using the record of current companies as an indicator of the business climate or economy ignores the businesses that failed and no longer exist. Malmquist bias is an effect in observational astronomy that leads to the preferential detection of intrinsically bright objects. The study of medical conditions often begins with anecdotal reports, which by their nature include only those referred for treatment. A child who cannot function in school is more likely to be diagnosed with dyslexia than a child who struggles but passes, and a child examined for one condition is more likely to be tested for and diagnosed with other conditions, skewing comorbidity statistics. As certain diagnoses become associated with behavior problems or intellectual disability, parents try to prevent their children from being stigmatized with those diagnoses, introducing further bias. Studies drawn from whole populations are showing that many conditions are much more common, and much milder, than previously believed. Geneticists are limited in.
As an example, consider a h
Statistics is a branch of mathematics dealing with data collection, analysis, and presentation. In applying statistics to, for example, a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse topics, such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments (see also the glossary of probability and statistics). When census data cannot be collected, statisticians collect data by designing specific experiments and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine whether the manipulation has modified the values of the measurements.
In contrast, an observational study does not involve experimental manipulation. Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation. Descriptive statistics are most often concerned with two sets of properties of a distribution: central tendency seeks to characterize the distribution's central or typical value, while dispersion characterizes the extent to which members of the distribution depart from its center and from each other. Inferences in mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena. A standard statistical procedure involves testing the relationship between two statistical data sets, or between a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between the two data sets.
Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false, given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (rejecting a true null hypothesis) and Type II errors (failing to reject a false null hypothesis). Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis. Measurement processes that generate statistical data are themselves subject to error; many such errors are classified as random or systematic, but other types can also be important. The presence of missing data or censoring may result in biased estimates, and specific techniques have been developed to address these problems. Statistics can be said to have begun in ancient civilization, going back at least to the 5th century BC, but it was not until the 18th century that it started to draw more heavily on calculus and probability theory. In more recent years, statistics has relied increasingly on statistical software.
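A null-hypothesis test and its Type I error rate can be illustrated with a small simulation. This is a sketch only: the fair-coin hypothesis, the normal-approximation z-test, and all the numbers are chosen for the example.

```python
import random

def z_test_fair_coin(heads, flips):
    """Reject H0 (p = 0.5) at the 5% level when |z| > 1.96 (normal approx.)."""
    p_hat = heads / flips
    se = (0.25 / flips) ** 0.5           # standard error under H0
    return abs(p_hat - 0.5) / se > 1.96

# Simulate many experiments where H0 is actually true (a fair coin):
# the fraction of rejections estimates the Type I error rate.
rng = random.Random(0)
trials = 2000
false_rejections = sum(
    z_test_fair_coin(sum(rng.random() < 0.5 for _ in range(100)), 100)
    for _ in range(trials)
)
print(false_rejections / trials)   # close to the nominal 0.05 level
```

A Type II error would be the converse: failing to reject when the coin is in fact biased, e.g. for a coin with p = 0.55 the test above frequently fails to reject at n = 100 flips.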
Some definitions are: the Merriam-Webster dictionary defines statistics as "a branch of mathematics dealing with the collection, analysis and presentation of masses of numerical data"; statistician Arthur Lyon Bowley defined statistics as "numerical statements of facts in any department of inquiry placed in relation to each other". Statistics is a mathematical body of science that pertains to the collection, interpretation or explanation, and presentation of data, or as a branch of mathematics; some consider statistics a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is specifically concerned with the use of data in the context of uncertainty and with decision making in the face of uncertainty. Mathematical statistics is the application of mathematics to statistics; the mathematical techniques used include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure-theoretic probability theory.
In applying statistics to a problem, it is common practice to start with a population or process to be studied. Ideally, statisticians compile data about the entire population; this may be organized by governmental statistical institutes. Descriptive statistics can be used to summarize the population data: numerical descriptors include the mean and standard deviation for continuous data types, while frequency and percentage are more useful for describing categorical data. When a census is not feasible, a chosen subset of the population, called a sample, is studied. Once a sample representative of the population is determined, data are collected for the sample members in an observational or experimental setting. Again, descriptive statistics can be used to summarize the sample data. However, because the drawing of the sample was subject to an element of randomness, the numerical descriptors established from the sample are themselves subject to uncertainty.
To still draw meaningful conclusions about the entire population, inferential statistics are needed.
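The numerical descriptors mentioned above can be computed directly. A minimal sketch with invented sample data, using Python's standard library: mean and standard deviation for a continuous variable, relative frequencies for a categorical one.

```python
from statistics import mean, stdev
from collections import Counter

# Illustrative sample data (invented): one continuous and one categorical variable.
heights_cm = [162, 170, 168, 181, 175, 169]
eye_color = ["brown", "blue", "brown", "green", "brown", "blue"]

# Numerical descriptors for continuous data.
print(mean(heights_cm), stdev(heights_cm))

# Frequencies and percentages for categorical data.
counts = Counter(eye_color)
print({k: v / len(eye_color) for k, v in counts.items()})
```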
Integrated Authority File
The Integrated Authority File, or GND, is an international authority file for the organisation of personal names, subject headings, and corporate bodies from catalogues. It is used for documentation in libraries and also by archives and museums. The GND is managed by the German National Library in cooperation with various regional library networks in German-speaking Europe and other partners, and falls under the Creative Commons Zero (CC0) licence. The GND specification provides a hierarchy of high-level entities and sub-classes, useful in library classification, and an approach to the unambiguous identification of single elements. It comprises an ontology intended for knowledge representation in the semantic web, available in RDF format. The Integrated Authority File became operational in April 2012 and integrates the content of the following authority files, which have since been discontinued: the Name Authority File, the Corporate Bodies Authority File, the Subject Headings Authority File, and the Uniform Title File of the Deutsches Musikarchiv. At the time of its introduction on 5 April 2012, the GND held 9,493,860 files, including 2,650,000 personalised names.
There are seven main types of GND entities.
In statistics, stratified sampling is a method of sampling from a population. In statistical surveys, when subpopulations within an overall population vary, it can be advantageous to sample each subpopulation independently. Stratification is the process of dividing members of the population into homogeneous subgroups before sampling. The strata should be mutually exclusive (every element in the population must be assigned to only one stratum) and collectively exhaustive (no population element can be excluded). Simple random sampling or systematic sampling is then applied within each stratum. The objective is to improve the precision of the sample by reducing sampling error; stratification can produce a weighted mean that has less variability than the arithmetic mean of a simple random sample of the population. In computational statistics, stratified sampling is a method of variance reduction when Monte Carlo methods are used to estimate population statistics from a known population. As an example, assume that we need to estimate the average number of votes for each candidate in an election.
Assume that the country has three towns: Town A has 1 million factory workers, Town B has 2 million office workers, and Town C has 3 million retirees. We could choose a random sample of size 60 over the entire population, but there is some chance that the sample turns out not to be well balanced across these towns and hence is biased, causing a significant error in estimation. Instead, if we choose to take random samples of 10, 20, and 30 from Towns A, B, and C respectively, we can produce a smaller estimation error for the same total sample size. Proportionate allocation uses a sampling fraction in each of the strata that is proportional to that of the total population: for instance, if the population consists of X total individuals, m of whom are male and f female, the relative sizes of the two samples should reflect this proportion. Optimum allocation makes the sampling fraction of each stratum proportional to both the proportion and the standard deviation of the distribution of the variable: larger samples are taken in the strata with the greatest variability to generate the least possible overall sampling variance.
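Proportionate allocation for the three-town example above can be computed directly. The town names are placeholders for the strata; the only inputs are the stratum sizes and the total sample size.

```python
# Strata from the example above: 1M factory workers, 2M office workers,
# 3M retirees, with a total sample of 60 to allocate.
strata_sizes = {"town_A": 1_000_000, "town_B": 2_000_000, "town_C": 3_000_000}

def proportionate_allocation(sizes, total_sample):
    """Each stratum's sample size is proportional to its population share."""
    population = sum(sizes.values())
    return {name: round(total_sample * size / population)
            for name, size in sizes.items()}

print(proportionate_allocation(strata_sizes, 60))
# {'town_A': 10, 'town_B': 20, 'town_C': 30}
```

Note that rounding can make the allocated sizes sum to slightly more or less than the target in general; here the shares divide evenly.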
A real-world example of using stratified sampling would be a political survey. If the respondents needed to reflect the diversity of the population, the researcher would seek to include participants of various minority groups, such as race or religion, in proportion to the total population as mentioned above. A stratified survey could thus claim to be more representative of the population than one using simple random sampling or systematic sampling. The reasons to use stratified sampling rather than simple random sampling include the following. If measurements within strata have a lower standard deviation, stratification gives a smaller error in estimation. For many applications, measurements become more manageable and/or cheaper when the population is grouped into strata. It is often desirable to have estimates of population parameters for groups within the population. If the population density varies within a region, stratified sampling will ensure that estimates can be made with equal accuracy in different parts of the region, and that comparisons of sub-regions can be made with equal statistical power.
For example, in Ontario a survey taken throughout the province might use a larger sampling fraction in the less populated north, since the disparity in population between north and south is so great that a sampling fraction based on the provincial sample as a whole might result in the collection of only a handful of data from the north. Stratified sampling is not useful when the population cannot be exhaustively partitioned into disjoint subgroups. It would be a misapplication of the technique to make subgroups' sample sizes proportional to the amount of data available from the subgroups, rather than scaling sample sizes to subgroup sizes. Data representing each subgroup are taken to be of equal importance if suspected variation among them warrants stratified sampling. If subgroup variances differ and the data need to be stratified by variance, it is not possible to make every subgroup sample size proportional to subgroup size within the total population. For an efficient way to partition sampling resources among groups that vary in their means and costs, see "optimum allocation".
The problem of stratified sampling in the case of unknown class priors can have a deleterious effect on the performance of any analysis of the dataset, e.g. classification. In that regard, a minimax sampling ratio can be used to make the dataset robust with respect to uncertainty in the underlying data-generating process. Combining sub-strata to ensure adequate numbers can lead to Simpson's paradox, where trends that exist in different groups of data disappear or reverse when the groups are combined. The mean and variance of stratified random sampling are given by

\bar{x} = \frac{1}{N} \sum_{h=1}^{L} N_h \bar{x}_h, \qquad s_{\bar{x}}^2 = \sum_{h=1}^{L} \left( \frac{N_h}{N} \right)^2 \frac{N_h - n_h}{N_h} \cdot \frac{s_h^2}{n_h},

where L is the number of strata, N_h and n_h are the population and sample sizes of stratum h, \bar{x}_h and s_h^2 are the sample mean and sample variance of stratum h, and N is the total population size.
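The stratified mean and its variance can be computed as a direct transcription of the formulas above. The two strata and their data values below are invented for illustration.

```python
from statistics import mean, variance

# Invented example: stratum name -> (stratum population size N_h, sampled values).
strata = {
    "north": (1000, [4.0, 5.0, 6.0, 5.5]),
    "south": (3000, [2.0, 2.5, 3.0, 2.5, 2.0]),
}

N = sum(N_h for N_h, _ in strata.values())   # total population size

def stratified_mean_variance(strata, N):
    """x_bar = (1/N) * sum(N_h * x_bar_h); variance includes the
    finite-population correction (N_h - n_h) / N_h for each stratum."""
    x_bar = sum(N_h * mean(xs) for N_h, xs in strata.values()) / N
    var = sum(
        (N_h / N) ** 2 * ((N_h - len(xs)) / N_h) * variance(xs) / len(xs)
        for N_h, xs in strata.values()
    )
    return x_bar, var

x_bar, var = stratified_mean_variance(strata, N)
print(x_bar, var)
```

Because each stratum contributes its own within-stratum variance, homogeneous strata (small s_h^2) directly shrink the variance of the overall estimate, which is the precision gain stratification aims for.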
In statistics, quality assurance, and survey methodology, sampling is the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt to collect samples that are representative of the population in question. Two advantages of sampling are lower cost and faster data collection than measuring the entire population. Each observation measures one or more properties of observable bodies distinguished as independent objects or individuals. In survey sampling, weights can be applied to the data to adjust for the sample design, particularly in stratified sampling. Results from probability theory and statistical theory are employed to guide the practice. In business and medical research, sampling is widely used for gathering information about a population; acceptance sampling is used to determine whether a production lot of material meets the governing specifications. Successful statistical practice is based on focused problem definition; in sampling, this includes defining the "population".
A population can be defined as including all people or items with the characteristic one wishes to understand. Because there is rarely enough time or money to gather information from everyone or everything in a population, the goal becomes finding a representative sample of that population. Sometimes what defines a population is obvious. For example, a manufacturer needs to decide whether a batch of material from production is of high enough quality to be released to the customer or should be scrapped or reworked due to poor quality; in this case, the batch is the population. Although the population of interest often consists of physical objects, it is sometimes necessary to sample over time, space, or some combination of these dimensions. For instance, an investigation of supermarket staffing could examine checkout line length at various times, or a study on endangered penguins might aim to understand their usage of various hunting grounds over time. For the time dimension, the focus may be on discrete occasions.
In other cases, the examined 'population' may be less tangible. For example, Joseph Jagger studied the behaviour of roulette wheels at a casino in Monte Carlo, and used this to identify a biased wheel. In this case, the 'population' Jagger wanted to investigate was the overall behaviour of the wheel, while his 'sample' was formed from observed results from that wheel. Similar considerations arise when taking repeated measurements of some physical characteristic such as the electrical conductivity of copper; this situation arises when seeking knowledge about the cause system of which the observed population is an outcome. In such cases, sampling theory may treat the observed population as a sample from a larger 'superpopulation'. For example, a researcher might study the success rate of a new 'quit smoking' program on a test group of 100 patients, in order to predict the effects of the program if it were made available nationwide. Here the superpopulation is "everybody in the country, given access to this treatment" – a group which does not yet exist, since the program isn't yet available to all.
Note that the population from which the sample is drawn may not be the same as the population about which information is desired. There is often large but not complete overlap between these two groups due to frame issues, etc. Sometimes they may be entirely separate: for instance, one might study rats in order to get a better understanding of human health, or one might study records from people born in 2008 in order to make predictions about people born in 2009. Time spent making the sampled population and the population of concern precise is well spent, because it raises many issues and questions that would otherwise have been overlooked at this stage. In the most straightforward case, such as the sampling of a batch of material from production, it would be most desirable to identify and measure every single item in the population and to include any one of them in our sample. However, in the more general case this is not possible or practical: there is no way to identify all rats in the set of all rats, and where voting is not compulsory, there is no way to identify which people will vote at a forthcoming election.
These imprecise populations are not amenable to sampling in any of the ways below to which we could apply statistical theory. As a remedy, we seek a sampling frame which has the property that we can identify every single element and include any of them in our sample; the most straightforward type of frame is a list of elements of the population, with appropriate contact information. For example, in an opinion poll, possible sampling frames include an electoral register or a telephone directory. A probability sample is a sample in which every unit in the population has a chance of being selected, and this probability can be accurately determined. The combination of these traits makes it possible to produce unbiased estimates of population totals by weighting sampled units according to their probability of selection. Example: we want to estimate the total income of adults living in a given street. We visit each household in that street, identify all adults living there, and randomly select one adult from each household.
We interview the selected person and find their income.
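In this street-income example, an adult living in a k-adult household has selection probability 1/k, so weighting each interviewed income by k (the inverse of the selection probability) gives an unbiased estimate of the street's total adult income. A minimal sketch with invented household data:

```python
# Invented data for the street: one adult was drawn at random per household.
households = [
    # (number of adults in household, income of the randomly selected adult)
    (1, 30_000),
    (2, 45_000),
    (3, 28_000),
    (2, 52_000),
]

# Inverse-probability weighting: each income is scaled by the number of adults,
# since that adult stood in for all adults in the household.
total_estimate = sum(k * income for k, income in households)
print(total_estimate)  # 308000
```

Without the weights, adults in large households would be underrepresented and the total would be biased downward, which is exactly the frame-and-probability point made above.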