1.
Hypergeometric distribution
–
In probability theory and statistics, the hypergeometric distribution is a discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing exactly K successes. In contrast, the binomial distribution describes the probability of k successes in n draws with replacement. In statistics, the hypergeometric test uses the hypergeometric distribution to calculate the significance of having drawn a specific number k of successes from the aforementioned population. The test is used to identify which sub-populations are over- or under-represented in a sample, and it has a range of applications. For example, a group could use the test to understand its customer base by testing a set of known customers for over-representation of various demographic subgroups.

The following conditions characterize the distribution: the result of each draw can be classified into one of two mutually exclusive categories, and the probability of a success changes on each draw, since each draw decreases the population. The probability mass function is

P(X = k) = C(K, k) C(N − K, n − k) / C(N, n),

where C(a, b) denotes the binomial coefficient "a choose b". The pmf is positive when max(0, n + K − N) ≤ k ≤ min(n, K). The pmf satisfies the recurrence relation

P(X = k + 1) = P(X = k) · (K − k)(n − k) / ((k + 1)(N − K − n + k + 1)),

with P(X = 0) = C(N − K, n) / C(N, n). As one would expect, the probabilities sum up to 1:

∑ (over 0 ≤ k ≤ n) C(K, k) C(N − K, n − k) / C(N, n) = 1.

This is essentially Vandermonde's identity from combinatorics. Also note that the following identity holds:

C(K, k) C(N − K, n − k) / C(N, n) = C(n, k) C(N − n, K − k) / C(N, K).

This follows from the symmetry of the problem, but it can also be shown by expressing the binomial coefficients in terms of factorials and rearranging the latter.

The classical application of the distribution is sampling without replacement. Think of an urn with two types of marbles, red ones and green ones, and define drawing a green marble as a success and drawing a red marble as a failure. Let the variable N describe the number of all marbles in the urn and K the number of green marbles, so that N − K is the number of red marbles. X is the random variable whose outcome is k, the number of green marbles actually drawn in the experiment. This situation is illustrated by the following contingency table:

           drawn     not drawn          total
  green    k         K − k              K
  red      n − k     N − K − (n − k)    N − K
  total    n         N − n              N

Now, suppose the urn contains N = 50 marbles, K = 5 of which are green. Standing next to the urn, you close your eyes and draw 10 marbles without replacement. What is the probability that exactly 4 of the 10 are green?
This problem is summarized by the contingency table above with N = 50, K = 5 and n = 10. The probability of drawing exactly k green marbles can be calculated by the formula

P(X = k) = f(k; N, K, n) = C(K, k) C(N − K, n − k) / C(N, n).

Hence, in this example,

P(X = 4) = f(4; 50, 5, 10) = C(5, 4) C(45, 6) / C(50, 10) = 5 · 8145060 / 10272278170 = 0.003964583…

Intuitively we would expect it to be even more unlikely for all 5 green marbles to be among the 10 drawn:

P(X = 5) = f(5; 50, 5, 10) = C(5, 5) C(45, 5) / C(50, 10) = 1 · 1221759 / 10272278170 = 0.0001189375…

As expected, this probability is smaller. In Hold'em poker, players make the best hand they can by combining the two cards in their hand with the 5 cards eventually turned up on the table. The deck has 52 cards and there are 13 of each suit. For this example, assume a player has 2 clubs in the hand and there are 3 cards showing on the table, 2 of which are also clubs.
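The urn calculation above can be checked numerically. The following sketch uses scipy.stats.hypergeom, whose parameter order (population size, number of successes, number of draws) differs from the (N, K, n) notation used here:

```python
from math import comb

from scipy.stats import hypergeom

# Urn from the example: 50 marbles total, 5 green, 10 drawn without replacement.
N_total, K_green, n_draws = 50, 5, 10

# SciPy's convention: hypergeom(M, n, N) = (population size, successes, draws).
rv = hypergeom(N_total, K_green, n_draws)

p4 = rv.pmf(4)                                        # P(exactly 4 green)
p4_manual = comb(5, 4) * comb(45, 6) / comb(50, 10)   # same value from the formula

p5 = rv.pmf(5)                                        # P(all 5 green drawn)
```

Both routes give the probabilities quoted in the text (about 0.0039646 and 0.0001189).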
2.
Gumbel distribution
–
In probability theory and statistics, the Gumbel distribution is used to model the distribution of the maximum (or the minimum) of a number of samples of various distributions. For example, this distribution might be used to represent the distribution of the maximum level of a river in a particular year if there was a list of maximum values for the past ten years. It is useful in predicting the chance that an extreme earthquake, flood or other natural disaster will occur. The rest of this article refers to the Gumbel distribution used to model the distribution of the maximum value; to model the minimum value, use the negative of the original values.

The Gumbel distribution is a particular case of the generalized extreme value distribution. It is also known as the log-Weibull distribution and the double exponential distribution. It is related to the Gompertz distribution: when its density is first reflected about the origin and then restricted to the positive half line, a Gompertz density is obtained.

In the latent variable formulation of the logit model, common in discrete choice theory, the errors of the latent variables follow a Gumbel distribution. This is useful because the difference of two Gumbel-distributed random variables has a logistic distribution. The Gumbel distribution is named after Emil Julius Gumbel, based on his original papers describing the distribution.

The cumulative distribution function of the Gumbel distribution is

F(x; μ, β) = exp(−exp(−(x − μ)/β)).

The mode is μ, while the median is μ − β ln(ln 2), and the mean is given by E[X] = μ + γβ, where γ ≈ 0.5772 is the Euler–Mascheroni constant. The standard deviation is βπ/√6.

The standard Gumbel distribution is the case where μ = 0 and β = 1, with cumulative distribution function

F(x) = exp(−exp(−x))

and probability density function

f(x) = exp(−(x + exp(−x))).

In this case the mode is 0, the median is −ln(ln 2) ≈ 0.3665, the mean is γ, and the cumulants, for n > 1, are given by κ_n = (n − 1)! ζ(n).

If X has a Gumbel distribution, consider Y = −X conditioned on Y being positive. The cdf G of Y is related to F, the cdf of X, by G(y) = (F(0) − F(−y))/F(0) for y > 0, and consequently the densities are related by g(y) = f(−y)/F(0): the Gompertz density is proportional to a reflected Gumbel density, restricted to the positive half-line.
If X is an exponentially distributed variable with mean 1, then −log(X) has a standard Gumbel distribution. Theory related to the generalized multivariate log-gamma distribution provides a multivariate version of the Gumbel distribution.

In pre-software times, probability paper was used to picture the Gumbel distribution. The paper is based on a linearization of the cumulative distribution function F:

−ln(−ln F) = (x − μ)/β.

In the paper the horizontal axis is constructed at a double log scale. By plotting F on the horizontal axis of the paper and the x-variable on the vertical axis, the distribution is represented by a straight line.
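The closed-form quantities above are easy to confirm numerically. A brief sketch with scipy.stats.gumbel_r, SciPy's right-skewed (maximum-value) Gumbel distribution:

```python
import math

from scipy.stats import gumbel_r

# Standard Gumbel: mu = 0, beta = 1.
median = gumbel_r.median()   # analytic value: -ln(ln 2), about 0.3665
mean = gumbel_r.mean()       # Euler-Mascheroni constant gamma, about 0.5772
std = gumbel_r.std()         # pi / sqrt(6)

# Check the CDF against F(x) = exp(-exp(-x)) at an arbitrary point.
x = 1.5
cdf_formula = math.exp(-math.exp(-x))
cdf_scipy = gumbel_r.cdf(x)
```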
3.
Diffusion of innovations
–
Diffusion of innovations is a theory that seeks to explain how, why, and at what rate new ideas and technology spread. Everett Rogers, a professor of communication studies, popularized the theory in his book Diffusion of Innovations, first published in 1962. Rogers argues that diffusion is the process by which an innovation is communicated over time among the participants in a social system. The origins of the diffusion of innovations theory are varied and span multiple disciplines.

Rogers proposes that four main elements influence the spread of a new idea: the innovation itself, communication channels, time, and a social system. This process relies heavily on human capital; the innovation must be widely adopted in order to self-sustain. Within the rate of adoption, there is a point at which an innovation reaches critical mass. The categories of adopters are innovators, early adopters, early majority, late majority, and laggards. Diffusion manifests itself in different ways and is subject to the type of adopters. The criterion for the categorization is innovativeness, defined as the degree to which an individual adopts a new idea.

The study of diffusion of innovations took off in the subfield of sociology in the midwestern United States in the 1920s and 1930s. Agricultural technology was advancing rapidly, and researchers started to examine how independent farmers were adopting hybrid seeds, equipment, and techniques. An influential study of the adoption of hybrid corn seed in Iowa was conducted by Ryan and Gross. In 1962, Everett Rogers, a professor of rural sociology, published his seminal work, Diffusion of Innovations. Using his synthesis of this earlier research, Rogers produced a theory of the adoption of innovations among individuals; Diffusion of Innovations and Rogers' later books are among the most often cited in diffusion research. Studies have explored many characteristics of innovations.
Meta-reviews have identified several characteristics that are common among most studies, and these are in line with the characteristics that Rogers initially cited in his reviews. These qualities interact and are judged as a whole. For example, an innovation might be extremely complex, reducing its likelihood of being adopted and diffused, but it might also be very compatible with current tools and carry a large relative advantage; even with the high learning curve, potential adopters might adopt the innovation anyway.

Studies also identify other characteristics of innovations, but these are not as common as the ones that Rogers lists above. The fuzziness of the boundaries of the innovation can impact its adoption; specifically, innovations with a small core and a large periphery are easier to adopt. Innovations that are less risky are easier to adopt, as the loss from failed integration is lower. Innovations that are disruptive to routine tasks, even when they bring a relative advantage, may not be adopted as readily.
4.
Gamma distribution
–
In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The common exponential distribution and chi-squared distribution are special cases of the gamma distribution. There are three different parametrizations in common use: with a shape parameter k and a scale parameter θ; with a shape parameter α = k and a rate parameter β = 1/θ; and with a shape parameter k and a mean parameter μ = k/β. In each of these three forms, both parameters are positive real numbers.

The gamma distribution is the maximum entropy probability distribution for a random variable X for which E[X] = kθ = α/β is fixed and greater than zero, and E[ln X] = ψ(k) + ln θ = ψ(α) − ln β is fixed, where ψ is the digamma function.

The parameterization with k and θ appears to be common in econometrics and certain other applied fields. For instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution. If k is an integer, then the distribution represents an Erlang distribution, i.e. the sum of k independent exponentially distributed random variables. The gamma distribution can also be parameterized in terms of a shape parameter α = k and an inverse scale (rate) parameter β = 1/θ; both parametrizations are common because either can be more convenient depending on the situation.

The cumulative distribution function is expressed in terms of the gamma function:

F(x; k, θ) = ∫ (0 to x) f(u; k, θ) du = γ(k, x/θ) / Γ(k),

where γ is the lower incomplete gamma function and Γ is the gamma function evaluated at k. If k is a positive integer, the cumulative distribution function has the following series expansion:

F(x; k, θ) = 1 − e^(−x/θ) ∑ (i = 0 to k − 1) (1/i!) (x/θ)^i = e^(−x/θ) ∑ (i = k to ∞) (1/i!) (x/θ)^i.

Equivalently, in the rate parametrization with integer α,

F(x; α, β) = 1 − e^(−βx) ∑ (i = 0 to α − 1) (βx)^i / i! = e^(−βx) ∑ (i = α to ∞) (βx)^i / i!.

The skewness is equal to 2/√k; it depends only on the shape parameter.
Unlike the mode and the mean, which have readily calculable formulas based on the parameters, the median has no closed form: for this distribution it is defined as the value ν such that

(1 / (Γ(k) θ^k)) ∫ (0 to ν) x^(k − 1) e^(−x/θ) dx = 1/2.

A formula for approximating the median of a gamma distribution, when the mean is known, has been derived based on the fact that the ratio μ/ν is approximately a linear function of k when k ≥ 1. The approximation formula, due to K. P. Choi, is

ν ≈ μ (3k − 0.8) / (3k + 0.2).

Later, it was shown that λ is a convex function of m.
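Choi's approximation can be compared with a numerically computed median. A small check with scipy.stats.gamma, taking shape k and scale θ = 1 (so the mean μ equals k; the particular value k = 5 is chosen for illustration):

```python
from scipy.stats import gamma

k = 5.0                                   # shape parameter; scale theta = 1, so mean mu = k
exact_median = gamma.ppf(0.5, k)          # numeric inversion of the gamma CDF
approx_median = k * (3 * k - 0.8) / (3 * k + 0.2)   # nu ~ mu(3k - 0.8)/(3k + 0.2)
```

For k = 5 the two values agree to about three decimal places, and the median falls between the mode (k − 1 = 4) and the mean (k = 5), as expected for a right-skewed distribution.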
5.
Zipf's law
–
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. The law is named after the American linguist George Kingsley Zipf, who popularized it and sought to explain it, though he did not claim to have originated it. The French stenographer Jean-Baptiste Estoup appears to have noticed the regularity before Zipf, and it was also noted in 1913 by the German physicist Felix Auerbach.

For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences. True to Zipf's law, the word "of" accounts for slightly over 3.5% of words, followed by "and". Only 135 vocabulary items are needed to account for half the Brown Corpus.

The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913. When Zipf's law is checked for cities, a better fit has been found with the exponent s = 1.07. While Zipf's law holds for the upper tail of the distribution, the entire distribution of cities is log-normal and follows Gibrat's law. Both laws are consistent because a log-normal tail can not be distinguished from a Pareto tail.

Zipf's law is most easily observed by plotting the data on a log-log graph; for example, the word "the" would appear at the point given by the logarithm of its rank and the logarithm of its frequency. It is also possible to plot reciprocal rank against frequency, or reciprocal frequency or inter-word interval against rank. The data conform to Zipf's law to the extent that the plot is linear.

Formally, let N be the number of elements, k be their rank, and s be the value of the exponent characterizing the distribution. It has been claimed that this representation of Zipf's law is more suitable for statistical testing, and in this way it has been analyzed in more than 30,000 English texts. The goodness-of-fit tests yield that only about 15% of the texts are statistically compatible with this form of Zipf's law; slight variations in the definition of Zipf's law can increase this percentage up to close to 50%.
In the example of the frequency of words in the English language, N is the number of words in the English language, and if we use the classic version of Zipf's law, the exponent s is 1. f(k; s, N) will then be the fraction of the time the kth most common word occurs. The law may be written

f(k; s, N) = 1 / (k^s H_{N,s}),

where H_{N,s} = ∑ (n = 1 to N) 1/n^s is the Nth generalized harmonic number.

The simplest case of Zipf's law is a "1/f" function. Given a set of Zipfian distributed frequencies, sorted from most common to least common, the second most common frequency will occur 1/2 as often as the first, the third most common frequency will occur 1/3 as often as the first, the fourth most common frequency will occur 1/4 as often as the first, and the nth most common frequency will occur 1/n as often as the first. However, this cannot hold exactly, because items must occur an integer number of times; there cannot be 2.5 occurrences of a word. Nevertheless, over fairly wide ranges, and to a fairly good approximation, many natural phenomena obey Zipf's law. Mathematically, the sum of all relative frequencies in a Zipf distribution is equal to the harmonic series.
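The normalized form above is straightforward to compute directly. A small numpy sketch of f(k; s, N) = 1/(k^s · H_{N,s}):

```python
import numpy as np

def zipf_pmf(s: float, N: int) -> np.ndarray:
    """Normalized Zipf frequencies f(k; s, N) for ranks k = 1..N."""
    ranks = np.arange(1, N + 1)
    weights = 1.0 / ranks ** s
    return weights / weights.sum()     # dividing by the generalized harmonic number H_{N,s}

f = zipf_pmf(s=1.0, N=1000)
```

With s = 1 this reproduces the "1/f" behaviour described above: the rank-2 frequency is half the rank-1 frequency, the rank-5 frequency a fifth of it, and so on.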
6.
Benford's law
–
Benford's law, also called the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. For example, in sets which obey the law, the number 1 appears as the most significant digit about 30% of the time; by contrast, if the digits were distributed uniformly, they would each occur about 11.1% of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on. It tends to be most accurate when values are distributed across multiple orders of magnitude.

The graph here shows Benford's law for base 10. There is a generalization of the law to numbers expressed in other bases, and also a generalization from leading 1 digit to leading n digits. The law is named after physicist Frank Benford, who stated it in 1938. Benford's law is a special case of Zipf's law.

A set of numbers is said to satisfy Benford's law if the leading digit d (d ∈ {1, …, 9}) occurs with probability

P(d) = log10(d + 1) − log10(d) = log10((d + 1)/d) = log10(1 + 1/d).

This is the distribution expected if the mantissae of the logarithms of the numbers are uniformly and randomly distributed. For example, a number x, constrained to lie between 1 and 10, starts with the digit 1 if 1 ≤ x < 2. Therefore, x starts with the digit 1 if log 1 ≤ log x < log 2. The probabilities are proportional to the interval widths, and this gives the equation above.

An extension of Benford's law predicts the distribution of first digits in other bases besides decimal; in fact, the general form for base b is

P(d) = log_b(d + 1) − log_b(d) = log_b(1 + 1/d).

For b = 2, Benford's law is true but trivial: all binary numbers (except for 0) start with the digit 1. The discovery of Benford's law goes back to 1881, when the American astronomer Simon Newcomb noticed that in logarithm tables the earlier pages (which contained numbers that started with 1) were much more worn than the other pages.
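The first-digit probabilities are easy to tabulate, and the powers of 2 are a classic example of a sequence whose leading digits follow the law closely. A short illustrative check:

```python
import math
from collections import Counter

# Benford probability for leading digit d in base 10: log10(1 + 1/d).
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Leading digits of the first 1000 powers of 2, a sequence known to obey the law.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 1001))
observed = {d: counts[d] / 1000 for d in range(1, 10)}
```

The digit 1 leads about 30.1% of the time in both the theoretical table and the observed counts, matching the figure quoted above.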
Newcomb's published result is the first known instance of this observation, and it also includes a distribution on the second digit. Newcomb proposed a law that the probability of a single number N being the first digit of a number was equal to log(N + 1) − log(N). The phenomenon was again noted in 1938 by the physicist Frank Benford, who tested it on data from many different domains; the total number of observations used in the paper was 20,229. This discovery was later named after Benford. In 1995, Ted Hill proved the result about mixed distributions mentioned below. Arno Berger and Ted Hill have stated that "The widely known phenomenon called Benford's law continues to defy attempts at an easy derivation."
7.
Beta-binomial distribution
–
The beta-binomial distribution is the binomial distribution in which the probability of success at each trial is not fixed but random, following the beta distribution. It is frequently used in Bayesian statistics and empirical Bayes methods. It reduces to the Bernoulli distribution as a special case when n = 1; for α = β = 1, it is the discrete uniform distribution from 0 to n. It also approximates the binomial distribution arbitrarily well for large α and β.

The beta distribution is a conjugate distribution of the binomial distribution. This fact leads to an analytically tractable compound distribution, where one can think of the p parameter in the binomial distribution as being randomly drawn from a beta distribution. Namely, if p ~ Beta(α, β) and, conditional on p, X ~ Bin(n, p), where Bin stands for the binomial distribution, then

P(X = k | p) = L(p; k) = C(n, k) p^k (1 − p)^(n − k),

and marginalizing over p gives

P(X = k) = C(n, k) B(k + α, n − k + β) / B(α, β),

where B is the beta function. Using the properties of the beta function, this can alternatively be written

f(k; n, α, β) = [Γ(n + 1) / (Γ(k + 1) Γ(n − k + 1))] · [Γ(k + α) Γ(n − k + β) / Γ(n + α + β)] · [Γ(α + β) / (Γ(α) Γ(β))].

The beta-binomial distribution can also be motivated via an urn model for positive integer values of α and β. Specifically, imagine an urn containing α red balls and β black balls, from which balls are drawn at random. If a red ball is observed, then two red balls are returned to the urn; likewise, if a black ball is drawn, then two black balls are returned to the urn. If this is repeated n times, then the probability of observing k red balls follows a beta-binomial distribution with parameters n, α and β.

The first raw moment (the mean) is μ1 = nα/(α + β), and the variance is nαβ(α + β + n) / ((α + β)² (α + β + 1)). The parameter ρ is known as the intra-class or intra-cluster correlation; it is this positive correlation which gives rise to overdispersion. Note that estimates of ρ can be nonsensically negative, which is evidence that the data are either undispersed or underdispersed relative to the binomial distribution. In this case, the binomial distribution and the hypergeometric distribution are alternative candidates, respectively.
While closed-form maximum likelihood estimates are impractical, given that the pdf consists of common functions (gamma and/or beta functions), maximum likelihood estimates from empirical data can be computed using general methods for fitting multinomial Pólya distributions, methods for which have been described elsewhere. The R package VGAM, through the function vglm, facilitates fitting via maximum likelihood. Note also that there is no requirement that n be fixed throughout the observations.

The following data give the number of male children among the first 12 children of families of size 13 in 6115 families taken from hospital records in 19th-century Saxony. The 13th child is ignored to assuage the effect of families non-randomly stopping when a desired gender is reached.
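The compound pmf above can be implemented directly from the beta function. A minimal sketch, using scipy.special.betaln for numerical stability:

```python
import math
from math import comb

from scipy.special import betaln

def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """P(X = k) = C(n, k) * B(k + a, n - k + b) / B(a, b), computed via log-beta."""
    return comb(n, k) * math.exp(betaln(k + a, n - k + b) - betaln(a, b))

# alpha = beta = 1 gives the discrete uniform distribution on {0, ..., n}.
n = 12
uniform_probs = [beta_binomial_pmf(k, n, 1.0, 1.0) for k in range(n + 1)]

mean = sum(k * p for k, p in enumerate(uniform_probs))   # n * a / (a + b) = 6
```

The α = β = 1 case checks both special cases mentioned above: every outcome has probability 1/13, and the mean equals nα/(α + β).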
8.
Exponential distribution
–
In probability theory and statistics, the exponential distribution is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson processes, it is found in various other contexts.

The probability density function of an exponential distribution is

f(x; λ) = λ e^(−λx) for x ≥ 0, and 0 for x < 0.

Alternatively, this can be defined using the right-continuous Heaviside step function H (where H(0) = 1):

f(x; λ) = λ e^(−λx) H(x).

Here λ > 0 is the rate parameter of the distribution, which is supported on the interval [0, ∞). If a random variable X has this distribution, we write X ~ Exp(λ). The exponential distribution exhibits infinite divisibility.

The cumulative distribution function is given by

F(x; λ) = 1 − e^(−λx) for x ≥ 0, and 0 for x < 0.

The distribution can alternatively be parameterized by the scale parameter β = 1/λ, which is both the mean and the standard deviation of the distribution; that is to say, the expected duration of survival of the system is β units of time. The parametrization involving the rate parameter arises in the context of events arriving at a rate λ. The alternative specification is sometimes more convenient than the one given above, and some authors will use it as a standard definition; it is not used here. Unfortunately this gives rise to a notational ambiguity; as an example of this switch, one reference uses λ for β.

The mean or expected value of an exponentially distributed random variable X with rate parameter λ is given by E[X] = 1/λ = β (see above). In light of the examples given above, this makes intuitive sense: if you receive phone calls at an average rate of 2 per hour, you can expect to wait half an hour for every call. The variance of X is given by Var[X] = 1/λ² = β², and the moments of X, for n = 1, 2, …, are given by E[X^n] = n!/λ^n. The median of X is given by m = ln(2)/λ < E[X], where ln refers to the natural logarithm. Thus the absolute difference between the mean and the median is

|E[X] − m| = (1 − ln 2)/λ < 1/λ = standard deviation.

An exponentially distributed random variable T obeys the memorylessness relation

Pr(T > s + t | T > s) = Pr(T > t), for all s, t ≥ 0.
The exponential distribution and the geometric distribution are the only memoryless probability distributions. The exponential distribution is also necessarily the only continuous probability distribution that has a constant failure rate. The quantile function for Exp(λ) is

F^(−1)(p; λ) = −ln(1 − p)/λ, 0 ≤ p < 1.

The quartiles are therefore: first quartile ln(4/3)/λ, median ln(2)/λ, third quartile ln(4)/λ, and as a consequence the interquartile range is ln(3)/λ.
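These quantile formulas can be checked against scipy.stats.expon, which is parameterized by the scale β = 1/λ rather than the rate; the rate λ = 2 below matches the phone-call example in the text:

```python
import math

from scipy.stats import expon

lam = 2.0                  # rate parameter, e.g. an average of 2 calls per hour
rv = expon(scale=1 / lam)  # SciPy uses the scale beta = 1/lambda

mean = rv.mean()           # 1/lambda: expect to wait half an hour per call
median = rv.median()       # ln(2)/lambda
q1 = rv.ppf(0.25)          # ln(4/3)/lambda
q3 = rv.ppf(0.75)          # ln(4)/lambda
iqr = q3 - q1              # ln(3)/lambda
```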
9.
Mixture model
–
Formally, a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, not all inference procedures involve such steps. Mixture models should not be confused with models for compositional data, i.e. data whose components are constrained to sum to a constant value. However, compositional models can be thought of as mixture models, and conversely, mixture models can be thought of as compositional models, where the total size of the population has been normalized to 1.

In many cases, each "parameter" is actually a set of parameters. For example, observations distributed according to a mixture of one-dimensional Gaussian distributions will have a mean and variance for each component, while observations distributed according to a mixture of V-dimensional categorical distributions will have a vector of V probabilities for each component. In addition, in a Bayesian setting, the mixture weights and parameters will themselves be random variables, and prior distributions will be placed over the variables. Typically H will be the conjugate prior of F; the two most common choices of F are Gaussian (aka "normal") and categorical. To incorporate this prior into a Bayesian estimation, the prior is multiplied with the distribution p(x | θ) of the data x conditioned on the parameters θ to be estimated. Although EM-based parameter updates are well established, providing the initial estimates for these parameters is currently an area of active research. Note that this formulation yields a closed-form solution to the complete posterior distribution. Estimations of the random variable θ may be obtained via one of several estimators, such as the mean or the maximum of the posterior distribution.

Such distributions are useful for assuming patch-wise shapes of images and clusters, for example. In the case of image representation, each Gaussian may be tilted or expanded according to the imaged structures, and one Gaussian distribution of the set is fit to each patch in the image.
A mixture model for return data seems reasonable; sometimes the model used is a jump-diffusion model, or a mixture of two normal distributions. See Financial economics#Challenges and criticism for further context. As another example, assume that we observe the prices of N different houses. Similarly, assume that a document is composed of N different words from a total vocabulary of size V. The distribution of words could be modelled as a mixture of K different V-dimensional categorical distributions; a model of this sort is commonly termed a topic model. Note that expectation maximization applied to such a model will typically fail to produce realistic results, due to the excessive number of parameters, so some sorts of additional assumptions are necessary to get good results. Typically, some sort of additional constraint is placed over the identities of the words; for example, a Markov chain could be placed on the topic identities. The following example is based on an example in Christopher M. Bishop, Pattern Recognition and Machine Learning.
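The EM updates for a Gaussian mixture can be sketched in a few lines of numpy. This is an illustrative sketch only: the synthetic data, the two-component choice, and the initial values are assumptions made for the example, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: mixture of N(-2, 1) and N(3, 0.5) with weights 0.4 / 0.6.
x = np.concatenate([rng.normal(-2.0, 1.0, 400), rng.normal(3.0, 0.5, 600)])

# EM for a two-component one-dimensional Gaussian mixture.
w = np.array([0.5, 0.5])          # mixture weights
mu = np.array([-1.0, 1.0])        # component means (initial guesses)
sigma = np.array([1.0, 1.0])      # component standard deviations

for _ in range(200):
    # E-step: responsibility of each component for each data point.
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and standard deviations.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
```

With well-separated components like these, the estimated means converge close to the true values of −2 and 3; as the text notes, initialization matters, and poorly chosen starting values can lead EM to an inferior local optimum.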