1.
Probability mass function
–
In probability theory and statistics, a probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value. Suppose that $X : S \to A$ is a discrete random variable defined on a sample space $S$. Then the probability mass function $f_X : A \to [0, 1]$ for $X$ is defined as $f_X(x) = \Pr(X = x) = \Pr(\{s \in S : X(s) = x\})$. The function $f_X$ may also be defined for all real numbers, with $f_X(x) = 0$ for all $x \notin X(S)$, as shown in the figure. Since the image of $X$ is countable, $f_X(x)$ is zero for all but a countable number of values of $x$. The discontinuity of probability mass functions is related to the fact that the cumulative distribution function of a discrete random variable is also discontinuous. Where the cumulative distribution function is differentiable, its derivative is zero, just as the probability mass function is zero at all such points. To make this more precise, suppose that $(A, \mathcal{A}, P)$ is a probability space and that $(B, \mathcal{B})$ is a measurable space whose underlying σ-algebra is discrete, so that in particular it contains the singleton sets of $B$. In this setting, a random variable $X : A \to B$ is discrete provided its image is countable. Now suppose that $(B, \mathcal{B}, \mu)$ is a measure space equipped with the counting measure $\mu$, and let $f$ be the density of the distribution of $X$ with respect to $\mu$. As a consequence, for any $b$ in $B$ we have $P(X = b) = P(X^{-1}(\{b\})) = \int_{X^{-1}(\{b\})} dP = \int_{\{b\}} f \, d\mu = f(b)$, demonstrating that $f$ is in fact a probability mass function. As an example, suppose that $S$ is the sample space of all outcomes of a single toss of a fair coin, and that $X$ assigns 0 to tails and 1 to heads. Since the coin is fair, the probability mass function is $f_X(x) = \tfrac{1}{2}$ for $x \in \{0, 1\}$ and $f_X(x) = 0$ for $x \notin \{0, 1\}$. This is a special case of the binomial distribution, the Bernoulli distribution. An example of a multivariate discrete distribution, and of its probability mass function, is provided by the multinomial distribution.
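The fair-coin example can be sketched in a few lines of Python. This is only an illustration: the encoding of tails as 0 and heads as 1, and the names `COIN_PMF`, `pmf` and `is_valid_pmf`, are assumptions made here and are not part of the text.

```python
from fractions import Fraction

# Assumed encoding for illustration: tails -> 0, heads -> 1.
COIN_PMF = {0: Fraction(1, 2), 1: Fraction(1, 2)}


def pmf(x, table=COIN_PMF):
    """Return Pr(X = x); any value outside the support has probability 0."""
    return table.get(x, Fraction(0))


def is_valid_pmf(table):
    """A PMF must be non-negative everywhere and sum to exactly 1."""
    return all(p >= 0 for p in table.values()) and sum(table.values()) == 1
```

Exact fractions are used so that the check that the total mass equals 1 is not affected by floating-point rounding.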
2.
Cumulative distribution function
–
The cumulative distribution function (CDF) of a real-valued random variable $X$, evaluated at $x$, is the probability that $X$ takes a value less than or equal to $x$: $F_X(x) = P(X \le x)$. In the case of a continuous distribution, it gives the area under the probability density function from minus infinity to $x$. Cumulative distribution functions are also used to specify the distribution of multivariate random variables. The probability that $X$ lies in the semi-closed interval $(a, b]$, where $a < b$, is therefore $P(a < X \le b) = F_X(b) - F_X(a)$. In the definition above, the "less than or equal to" sign, ≤, is a convention, not a universally used one, but it is important for discrete distributions. The proper use of tables of the binomial and Poisson distributions depends upon this convention. Moreover, important formulas like Paul Lévy's inversion formula for the characteristic function also rely on the "less than or equal" formulation. If several random variables $X$, $Y$, etc. are treated, the corresponding letters are used as subscripts, while if only one is treated, the subscript is usually omitted. It is conventional to use a capital $F$ for a cumulative distribution function, in contrast to the lower-case $f$ used for probability density functions. This applies when discussing general distributions; some specific distributions have their own conventional notation. The CDF of a continuous random variable $X$ can be expressed as the integral of its probability density function $f_X$ as follows: $F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$. If the distribution of $X$ has a discrete component at a value $b$, then $P(X = b) = F_X(b) - \lim_{x \to b^-} F_X(x)$. If $F_X$ is continuous at $b$, this equals zero and there is no discrete component at $b$. Every cumulative distribution function $F$ is non-decreasing and right-continuous, which makes it a càdlàg function. Furthermore, $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$. When $F$ is absolutely continuous, there is a function $f$ such that $F(b) - F(a) = \int_a^b f(x) \, dx$ for all real $a$ and $b$; this function $f$ is equal to the derivative of $F$ almost everywhere, and it is called the probability density function of the distribution of $X$. As an example, suppose $X$ is uniformly distributed on the unit interval $[0, 1]$. Then the CDF of $X$ is given by $F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \le x < 1 \\ 1, & x \ge 1. \end{cases}$
Suppose instead that $X$ takes only the discrete values 0 and 1, with equal probability. Then the CDF of $X$ is given by $F(x) = \begin{cases} 0, & x < 0 \\ 1/2, & 0 \le x < 1 \\ 1, & x \ge 1. \end{cases}$ Sometimes it is useful to study the opposite question and ask how often the random variable is above a particular level. This is called the complementary cumulative distribution function, or simply the tail distribution or exceedance, and is defined as $\bar{F}(x) = P(X > x) = 1 - F(x)$. This has applications in statistical hypothesis testing, for example, because the one-sided p-value is the probability of observing a test statistic at least as extreme as the one observed. Thus, provided that the test statistic $T$ has a continuous distribution, the one-sided p-value for an observed value $t$ is given by the tail distribution: $p = P(T \ge t) = P(T > t) = 1 - F_T(t)$. In survival analysis, $\bar{F}(x)$ is called the survival function and denoted $S(x)$, while the term reliability function is common in engineering. For a non-negative continuous random variable having an expectation, Markov's inequality states that $\bar{F}(x) \le \frac{E(X)}{x}$. As $x \to \infty$, $\bar{F}(x) \to 0$, and in fact $\bar{F}(x) = o(1/x)$ provided that $E(X)$ is finite. The folded cumulative distribution, or mountain plot, is a form of illustration that emphasises the median and dispersion of the distribution or of the empirical results. If the CDF $F$ is strictly increasing and continuous, then $F^{-1}(p)$, $p \in [0, 1]$, is the unique real number $x$ such that $F(x) = p$.
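As a rough sketch of the two example distributions in this entry (uniform on the unit interval, and a variable taking only the values 0 and 1), the following Python functions evaluate the CDF, the probability of a semi-closed interval, and the tail distribution. The function names are hypothetical, and the two-point distribution is assumed to put probability 1/2 on each value.

```python
def uniform_cdf(x):
    """CDF of a variable uniformly distributed on the unit interval."""
    if x < 0:
        return 0.0
    if x < 1:
        return float(x)
    return 1.0


def two_point_cdf(x):
    """CDF of a variable taking the values 0 and 1 (assumed equally likely)."""
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0


def interval_probability(cdf, a, b):
    """P(a < X <= b) = F(b) - F(a), using the 'less than or equal' convention."""
    return cdf(b) - cdf(a)


def tail(cdf, x):
    """Complementary CDF: P(X > x) = 1 - F(x)."""
    return 1.0 - cdf(x)
```

For the discrete case, `interval_probability(two_point_cdf, -1, 0)` counts the mass at 0 because the interval is closed on the right, which is where the choice of convention becomes visible.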
3.
Expected value
–
In probability theory, the expected value of a random variable is, intuitively, the long-run average value of repetitions of the experiment it represents. For example, the expected value in rolling a six-sided die is 3.5. More precisely, the law of large numbers states that the arithmetic mean of the values almost surely converges to the expected value as the number of repetitions approaches infinity. The expected value is also known as the expectation, mathematical expectation, EV, average, mean value, or mean. More practically, the expected value of a discrete random variable is the probability-weighted average of all possible values. In other words, each value the random variable can assume is multiplied by its probability of occurring, and the resulting products are summed. The same principle applies to a continuous random variable, except that an integral of the variable with respect to its probability density replaces the sum. The expected value does not exist for random variables having some distributions with large tails. For random variables such as these, the long tails of the distribution prevent the sum or integral from converging. The expected value is a key aspect of how one characterizes a probability distribution. By contrast, the variance is a measure of dispersion of the possible values of the random variable around the expected value. The variance itself is defined in terms of two expectations: it is the expected value of the squared deviation of the variable's value from the variable's expected value. The expected value plays important roles in a variety of contexts. In regression analysis, one desires a formula in terms of observed data that will give a good estimate of the parameter giving the effect of some explanatory variable upon a dependent variable. The formula will give different estimates using different samples of data. A formula is typically considered good in this context if it is an unbiased estimator, that is, if the expected value of the estimate can be shown to equal the true value of the desired parameter.
In decision theory, and in particular in choice under uncertainty, expected values are used in reaching optimal decisions; one example is the Gordon–Loeb model of information security investment. According to the model, one can conclude that the amount a firm spends to protect information should generally be only a fraction of the expected loss. Suppose random variable $X$ can take value $x_1$ with probability $p_1$, value $x_2$ with probability $p_2$, and so on, up to value $x_k$ with probability $p_k$. Then the expectation of this random variable $X$ is defined as $E[X] = x_1 p_1 + x_2 p_2 + \cdots + x_k p_k$. Since the probabilities $p_i$ add up to one, the expected value can be viewed as a weighted average, with the $p_i$ as the weights. If all outcomes $x_i$ are equally likely, then the weighted average turns into the simple average. This is intuitive: the expected value of a random variable is the average of all values it can take, so the expected value is what one expects to happen on average. If the outcomes $x_i$ are not equally probable, then the simple average must be replaced with the weighted average, which takes into account that some outcomes are more likely than others. The intuition, however, remains the same: the expected value of $X$ is what one expects to happen on average. Let $X$ represent the outcome of a roll of a fair six-sided die. More specifically, $X$ will be the number of pips showing on the top face of the die after the toss.
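The probability-weighted average can be sketched as follows for the fair die. The function name `expected_value` and the representation of a distribution as a list of (value, probability) pairs are illustrative assumptions, not something the text prescribes.

```python
from fractions import Fraction


def expected_value(outcomes):
    """Return the sum of x_i * p_i over (value, probability) pairs."""
    pairs = list(outcomes)
    if sum(p for _, p in pairs) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pairs)


# Fair six-sided die: each face from 1 to 6 has probability 1/6.
fair_die = [(face, Fraction(1, 6)) for face in range(1, 7)]
```

With these definitions, `expected_value(fair_die)` evaluates to `Fraction(7, 2)`, matching the value 3.5 quoted for a six-sided die.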
4.
Median
–
The median is the value separating the higher half of a data sample, a population, or a probability distribution, from the lower half. In simple terms, it may be thought of as the middle value of a data set. For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6. The median is a commonly used measure of the properties of a data set in statistics. The basic advantage of the median in describing data compared to the mean is that it is not skewed so much by extremely large or small values, and so it may give a better idea of a typical value. For example, in understanding statistics like household income or assets, which vary greatly, the mean may be skewed by a small number of extremely high or low values. Median income, for example, may be a better way to suggest what a typical income is. The median of a finite list of numbers can be found by arranging all the numbers from smallest to greatest. If there is an odd number of numbers, the middle one is picked. For example, consider the set of numbers 1, 3, 3, 6, 7, 8, 9. This set contains seven numbers, and the median is the fourth of them, which is 6. If there is an even number of observations, then there is no single middle value, and the median is usually defined to be the mean of the two middle values. For example, in the set 1, 2, 3, 4, 5, 6, 8, 9 the median is the mean of the middle two numbers: this is (4 + 5) ÷ 2, which is 4.5. The formula used to find the position of the middle number of a data set of n numbers is (n + 1) ÷ 2. This gives either the position of the middle number or the halfway point between the positions of the two middle values. For example, with 14 values, the formula gives 7.5, so the median is the average of the seventh and eighth values. The median can also be found using a stem-and-leaf plot. There is no widely accepted standard notation for the median, so any symbol used for the median needs to be explicitly defined when it is introduced. The median is used primarily for skewed distributions, which it summarizes differently from the arithmetic mean. Consider the multiset {1, 2, 2, 2, 3, 14}: the median is 2 in this case, and it might be seen as a better indication of central tendency than the arithmetic mean of 4.
The widely cited empirical relationship between the locations of the mean and the median for skewed distributions is, however, not generally true. There are various relationships for the absolute difference between them.
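The procedure for a finite list (sort, then take the middle value or the mean of the two middle values) can be sketched in Python as below. The function name `median` is chosen here for illustration; Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(values):
    """Median of a finite, non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value, at 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even count: mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

Applied to the two lists used in this entry, the function returns 6 for 1, 3, 3, 6, 7, 8, 9 and 4.5 for 1, 2, 3, 4, 5, 6, 8, 9.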
5.
Mode (statistics)
–
The mode is the value that appears most often in a set of data. The mode of a discrete probability distribution is the value $x$ at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled. The mode of a continuous probability distribution is the value $x$ at which its probability density function has its maximum value, so the mode is at the peak. Like the statistical mean and median, the mode is a way of expressing, in a single number, important information about a random variable or a population. The numerical value of the mode is the same as that of the mean and median in a normal distribution, and it may be very different in highly skewed distributions. The mode is not necessarily unique to a distribution, since the probability mass function or probability density function may take the same maximum value at several points $x_1$, $x_2$, etc. The most extreme case occurs in uniform distributions, where all values occur equally frequently. When a probability density function has multiple local maxima, it is common to refer to all of the local maxima as modes of the distribution. Such a continuous distribution is called multimodal. In symmetric unimodal distributions, such as the normal distribution, the mean, median and mode all coincide. For samples, if it is known that they are drawn from a symmetric distribution, the sample mean can be used as an estimate of the population mode. The mode of a sample is the element that occurs most often in the collection. For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6. Given the list of data [1, 1, 2, 4, 4], the mode is not unique: the dataset may be said to be bimodal, while a set with more than two modes may be described as multimodal. For a sample from a continuous distribution, the concept is unusable in its raw form, since each value will typically occur exactly once. The usual practice is to discretize the data into intervals, as when making a histogram; the mode is then the value where the histogram reaches its peak. One algorithm for computing the mode of a sample, originally presented as a MATLAB code example, first sorts the sample in ascending order. It then computes the discrete derivative of the sorted list.
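The sort-then-difference idea can be sketched in Python rather than MATLAB. The text only states the first two steps (sorting and taking the discrete derivative); the remaining steps below, which use the positions of positive differences to measure runs of equal values, are an assumption about how the algorithm is completed, and the function name `mode_by_sorted_runs` is hypothetical.

```python
def mode_by_sorted_runs(sample):
    """Mode of a non-empty numeric sample via sorting and differencing."""
    if not sample:
        raise ValueError("sample must be non-empty")
    ordered = sorted(sample)
    # Discrete derivative of the sorted list: zero inside a run of equal values.
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    # A run of equal values ends wherever the difference is positive,
    # and the final run ends at the last index.
    run_ends = [i for i, d in enumerate(diffs) if d > 0] + [len(ordered) - 1]
    best_value, best_length = None, 0
    previous_end = -1
    for end in run_ends:
        length = end - previous_end
        if length > best_length:  # ties keep the smallest value
            best_value, best_length = ordered[end], length
        previous_end = end
    return best_value
```

When several values tie for the highest count, this sketch returns the smallest of them; a caller that needs every mode would have to collect all runs of maximal length instead.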
Unlike mean and median, the concept of mode also makes sense for nominal data, that is, data that do not consist of numerical or even ordered values. For example, taking a sample of Korean family names, one might find that Kim occurs more often than any other name. Then Kim would be the mode of the sample. In any voting system where a plurality determines victory, a single modal value determines the victor. Unlike median, the concept of mode makes sense for any random variable assuming values from a vector space, including the real numbers. For example, a distribution of points in the plane will typically have a mean and a mode, but the concept of median does not apply. The median makes sense when there is a linear order on the possible values. Generalizations of the concept of median to higher-dimensional spaces include the geometric median. For the remainder, the assumption is that we have a real-valued random variable.
6.
Variance
–
The variance has a central role in statistics. It is used in descriptive statistics, statistical inference, hypothesis testing, and goodness of fit, among other areas. This makes it a central quantity in numerous fields such as physics, biology, chemistry, cryptography, and economics. The variance of a random variable $X$ is the expected value of the squared deviation from the mean of $X$, $\mu = E[X]$: $\operatorname{Var}(X) = E[(X - \mu)^2]$. This definition encompasses random variables that are generated by processes that are discrete, continuous, neither, or mixed. The variance can also be thought of as the covariance of a random variable with itself: $\operatorname{Var}(X) = \operatorname{Cov}(X, X)$. The variance is also equivalent to the second cumulant of a probability distribution that generates $X$. The variance is typically designated as $\operatorname{Var}(X)$, $\sigma_X^2$, or simply $\sigma^2$. The expression for the variance can be expanded as $\operatorname{Var}(X) = E[X^2] - (E[X])^2$. In floating-point arithmetic this equation should not be used, because it suffers from catastrophic cancellation when the two terms are similar in magnitude. If a continuous distribution does not have an expected value, as is the case for the Cauchy distribution, it does not have a variance either. Many other distributions for which the expected value does exist also do not have a finite variance, because the integral in the variance definition diverges. An example is a Pareto distribution whose index $k$ satisfies $1 < k \le 2$. The normal distribution with parameters $\mu$ and $\sigma$ is a continuous distribution whose probability density function is given by $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. In this distribution, $E[X] = \mu$ and the variance $\operatorname{Var}(X)$ is related to $\sigma$ via $\operatorname{Var}(X) = \int_{-\infty}^{\infty} \frac{(x-\mu)^2}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \sigma^2$. The role of the normal distribution in the central limit theorem is in part responsible for the prevalence of the variance in probability and statistics. The exponential distribution with parameter $\lambda$ is a continuous distribution whose support is the semi-infinite interval $[0, \infty)$. Its probability density function is given by $f(x) = \lambda e^{-\lambda x}$, and it has expected value $\mu = \lambda^{-1}$. The variance is equal to $\operatorname{Var}(X) = \int_0^{\infty} (x - \lambda^{-1})^2 \lambda e^{-\lambda x} \, dx = \lambda^{-2}$. So for an exponentially distributed random variable, $\sigma^2 = \mu^2$. The Poisson distribution with parameter $\lambda$ is a discrete distribution for $k = 0, 1, 2, \ldots$.
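Its probability mass function is given by $p(k) = \frac{\lambda^k}{k!} e^{-\lambda}$, and it has expected value $\mu = \lambda$. The variance is equal to $\operatorname{Var}(X) = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} (k - \lambda)^2 = \lambda$. So for a Poisson-distributed random variable, $\sigma^2 = \mu$. The binomial distribution with parameters $n$ and $p$ is a discrete distribution for $k = 0, 1, 2, \ldots, n$. Its probability mass function is given by $p(k) = \binom{n}{k} p^k (1-p)^{n-k}$, and it has expected value $\mu = np$. The variance is equal to $\operatorname{Var}(X) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} (k - np)^2 = np(1-p)$.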
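The binomial case can be checked numerically by summing the squared deviations against the probability mass function. The sketch below uses exact fractions so the comparison with $np(1-p)$ is free of rounding; the function names and the choice $n = 5$, $p = 1/3$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """Map each k in 0..n to C(n, k) * p^k * (1 - p)^(n - k)."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def mean(pmf):
    """Expected value of a discrete distribution given as {value: probability}."""
    return sum(k * w for k, w in pmf.items())


def variance(pmf):
    """Expected squared deviation from the mean."""
    mu = mean(pmf)
    return sum((k - mu) ** 2 * w for k, w in pmf.items())


example = binomial_pmf(5, Fraction(1, 3))
```

For this example the computed mean is 5/3 and the computed variance is 10/9, which agrees with $np$ and $np(1-p)$ respectively.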
7.
Skewness
–
In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined. The qualitative interpretation of the skew is complicated and unintuitive. Skew must not be thought to refer to the direction the curve appears to be leaning; in fact, the opposite is true. For a unimodal distribution, negative skew indicates that the tail on the left side of the probability density function is longer or fatter than the right side. Conversely, positive skew indicates that the tail on the right side is longer or fatter than the left side. In cases where one tail is long but the other tail is fat, skewness does not obey a simple rule. Further, in multimodal distributions and discrete distributions, skewness is also difficult to interpret. Importantly, the skewness does not determine the relationship of mean and median. In cases where it is necessary, data might be transformed to have a normal distribution. Consider two skewed distributions: within each, the values on the right side of the distribution taper differently from the values on the left side. These tapering sides are called tails, and they indicate which of the two kinds of skewness a distribution has. Negative skew: the left tail is longer, and the mass of the distribution is concentrated on the right. A left-skewed distribution usually appears as a right-leaning curve. Positive skew: the right tail is longer, and the mass of the distribution is concentrated on the left. A right-skewed distribution usually appears as a left-leaning curve. Skewness in a data series may sometimes be observed not only graphically but by simple inspection of the values. For instance, consider the sequence (49, 50, 51), whose values are evenly distributed around a central value of 50. If the distribution is symmetric, then the mean is equal to the median, and the distribution has zero skewness. If, in addition, the distribution is unimodal, then the mean = median = mode. This is the case of a coin toss or the series 1, 2, 3, 4, ... Note, however, that the converse is not true in general, i.e. zero skewness does not imply that the mean is equal to the median. Paul T. von Hippel points out that many textbooks teach a rule of thumb stating that the mean is right of the median under right skew, and left of the median under left skew, and that this rule fails with surprising frequency.
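It can fail in multimodal distributions, or in distributions where one tail is long but the other is heavy. Most commonly, though, the rule fails in discrete distributions where the areas to the left and right of the median are not equal. Such distributions not only contradict the textbook relationship between mean, median, and skew, they also contradict the textbook interpretation of the median. The skewness of a random variable $X$ is the third standardized moment $\gamma_1$, defined as $\gamma_1 = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] = \frac{\mu_3}{\sigma^3} = \frac{E[(X - \mu)^3]}{(E[(X - \mu)^2])^{3/2}} = \frac{\kappa_3}{\kappa_2^{3/2}}$, where $\mu$ is the mean, $\sigma$ is the standard deviation, and $\mu_3$ is the third central moment. It is sometimes referred to as Pearson's moment coefficient of skewness, or simply the moment coefficient of skewness. The last equality expresses skewness in terms of the ratio of the third cumulant $\kappa_3$ to the 1.5th power of the second cumulant $\kappa_2$. This is analogous to the definition of kurtosis as the fourth cumulant normalized by the square of the second cumulant. The skewness is also sometimes denoted $\operatorname{Skew}[X]$. Starting from a standard cumulant expansion around a normal distribution, one can show that skewness = 6 (mean − median) / (standard deviation × (1 + kurtosis / 8)) + O(skewness²).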
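The moment coefficient of skewness can be sketched for a finite list of values, treating the list as a complete population (dividing by n rather than applying a sample correction). The function name `skewness` and the example lists are assumptions for illustration.

```python
def skewness(values):
    """Third central moment divided by the 1.5th power of the second."""
    n = len(values)
    if n == 0:
        raise ValueError("values must be non-empty")
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5
```

A symmetric list such as 49, 50, 51 gives zero; pulling one value far to the left, as in 40, 49, 50, 51, gives a negative result, and a long right tail, as in 1, 1, 1, 10, gives a positive one.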
8.
Kurtosis
–
In probability theory and statistics, kurtosis is a measure of the tailedness of the probability distribution of a real-valued random variable. Depending on the measure of kurtosis that is used, there are various interpretations of kurtosis. The standard measure of kurtosis, originating with Karl Pearson, is based on a scaled version of the fourth moment of the data or population. This number is related to the tails of the distribution, not its peak. Hence, for this measure, higher kurtosis is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations. The kurtosis of any normal distribution is 3. It is common to compare the kurtosis of a distribution to this value. Distributions with kurtosis less than 3 are said to be platykurtic, although this does not imply the distribution is flat-topped, as sometimes reported. Rather, it means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers. Distributions with kurtosis greater than 3 are said to be leptokurtic. It is also common practice to use an adjusted version of Pearson's kurtosis, the excess kurtosis, which is the kurtosis minus 3, to provide the comparison to the normal distribution. Some authors use kurtosis by itself to refer to the excess kurtosis. For clarity and generality, however, this article follows the non-excess convention and explicitly indicates where excess kurtosis is meant. Alternative measures of kurtosis include the L-kurtosis, which is a scaled version of the fourth L-moment. These are analogous to the alternative measures of skewness that are not based on ordinary moments. The kurtosis is the fourth standardized moment, defined as $\operatorname{Kurt}[X] = \frac{\mu_4}{\sigma^4} = \frac{E[(X - \mu)^4]}{(E[(X - \mu)^2])^2}$, where $\mu_4$ is the fourth central moment and $\sigma$ is the standard deviation. Several letters are used in the literature to denote the kurtosis.
A very common choice is $\kappa$, which is fine as long as it is clear that it does not refer to a cumulant. Other choices include $\gamma_2$, to be similar to the notation for skewness, although sometimes this is instead reserved for the excess kurtosis. The kurtosis is bounded below by the squared skewness plus 1: $\frac{\mu_4}{\sigma^4} \ge \left(\frac{\mu_3}{\sigma^3}\right)^2 + 1$, where $\mu_3$ is the third central moment. The lower bound is realized by the Bernoulli distribution. There is no upper limit to the excess kurtosis of a general probability distribution. A reason why some authors favor the excess kurtosis is that cumulants are extensive. Formulas related to the extensive property are more naturally expressed in terms of the excess kurtosis. For example, let $X_1, \ldots, X_n$ be independent random variables for which the fourth moment exists, and let $Y$ be their sum. The excess kurtosis of $Y$ is $\operatorname{Kurt}[Y] - 3 = \frac{1}{\left(\sum_{j=1}^{n} \sigma_j^2\right)^2} \sum_{i=1}^{n} \sigma_i^4 \cdot (\operatorname{Kurt}[X_i] - 3)$, where $\sigma_i$ is the standard deviation of $X_i$. In particular, if all of the $X_i$ have the same variance, this simplifies to $\operatorname{Kurt}[Y] - 3 = \frac{1}{n^2} \sum_{i=1}^{n} (\operatorname{Kurt}[X_i] - 3)$. The reason not to subtract off 3 is that the bare fourth moment better generalizes to multivariate distributions, especially when independence is not assumed.
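The lower bound (kurtosis is at least squared skewness plus 1, with equality for the Bernoulli distribution) can be checked with exact arithmetic. The sketch below works with the squared skewness $\mu_3^2 / \mu_2^3$ so that no square root is needed; the function names and the parameter $p = 1/4$ are illustrative assumptions.

```python
from fractions import Fraction


def central_moment(pmf, order):
    """E[(X - mu)^order] for a distribution given as {value: probability}."""
    mu = sum(x * w for x, w in pmf.items())
    return sum((x - mu) ** order * w for x, w in pmf.items())


def kurtosis(pmf):
    """Fourth standardized moment (non-excess convention)."""
    return central_moment(pmf, 4) / central_moment(pmf, 2) ** 2


def squared_skewness(pmf):
    """Square of the third standardized moment, kept rational."""
    return central_moment(pmf, 3) ** 2 / central_moment(pmf, 2) ** 3


bernoulli = {0: Fraction(3, 4), 1: Fraction(1, 4)}
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
```

For the Bernoulli example the kurtosis equals the squared skewness plus 1 exactly, while the fair die, a discrete uniform distribution, has kurtosis below 3 and is therefore platykurtic.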
9.
Entropy (information theory)
–
In information theory, systems are modeled by a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modifies the message in some way. The receiver attempts to infer which message was sent. In this context, entropy is the expected value of the information contained in each message. Messages can be modeled by any flow of information. In a more technical sense, there are reasons to define information as the negative of the logarithm of the probability distribution of possible events or messages. The amount of information of every event forms a random variable whose expected value is the Shannon entropy. Units of entropy are the shannon, nat, or hartley, depending on the base of the logarithm used to define it, though the shannon is commonly referred to as a bit. The logarithm of the probability distribution is useful as a measure of entropy because it is additive for independent sources. For instance, the entropy of a fair coin toss is 1 shannon, whereas that of $m$ tosses is $m$ shannons. Generally, $\log_2(n)$ bits are needed to represent a variable that can take one of $n$ values if $n$ is a power of 2. If these values are equally probable, the entropy in shannons is equal to the number of bits. Equality between number of bits and shannons holds only while all outcomes are equally probable. If one of the events is more probable than others, observation of that event is less informative. Conversely, rarer events provide more information when observed. Since observation of less probable events occurs more rarely, the net effect is that the entropy received from non-uniformly distributed data is less than $\log_2(n)$. Entropy is zero when one outcome is certain. Shannon entropy quantifies all these considerations exactly when a probability distribution of the source is known. The meaning of the events observed does not matter in the definition of entropy. Generally, entropy refers to disorder or uncertainty. Shannon entropy was introduced by Claude E. Shannon in his 1948 paper "A Mathematical Theory of Communication".
Shannon entropy provides an absolute limit on the best possible average length of lossless encoding or compression of an information source. Entropy is a measure of unpredictability of the state, or equivalently, of its average information content. To get an intuitive understanding of these terms, consider the example of a political poll. Usually, such polls happen because the outcome of the poll is not already known. Now consider the case that the same poll is performed a second time shortly after the first poll. Since the result of the first poll is already known, the outcome of the second poll can be predicted well and carries little new information, so its entropy is small relative to that of the first. Now consider the example of a coin toss. Assuming the probability of heads is the same as the probability of tails, the entropy of the coin toss is as high as it could be. Such a coin toss has one shannon of entropy, since there are two possible outcomes that occur with equal probability, and learning the actual outcome contains one shannon of information. By contrast, a toss of a coin that has two heads and no tails has zero entropy, since the coin will always come up heads.
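The coin examples can be reproduced with a small function that computes entropy in shannons as the expected value of the negative base-2 logarithm of the probabilities. This is a minimal sketch: the function name `shannon_entropy`, the tolerance used for the sum check, and the convention of skipping zero-probability outcomes are choices made here.

```python
from math import log2


def shannon_entropy(probabilities):
    """Entropy in shannons (bits) of a finite probability distribution."""
    probs = list(probabilities)
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 contribute nothing to the expectation.
    return -sum(p * log2(p) for p in probs if p > 0)
```

A fair coin, `[0.5, 0.5]`, gives 1 shannon and a two-headed coin, `[1.0, 0.0]`, gives 0. Four equally likely outcomes give 2 shannons, in line with $\log_2(n)$ for $n = 4$, and a biased coin such as `[0.9, 0.1]` gives less than 1.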