1.
Statistics
–
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. In applying statistics to, e.g., a scientific, industrial, or social problem, populations can be diverse topics such as all people living in a country or every atom composing a crystal. Statistics deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments. The statistician Sir Arthur Lyon Bowley defined statistics as "numerical statements of facts in any department of inquiry placed in relation to each other." When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples; representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. In contrast, an observational study does not involve experimental manipulation. Inferences in mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena. A standard statistical procedure involves testing the relationship between two data sets, or between a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between the two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false given the data. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (falsely rejecting a true null hypothesis) and Type II errors (failing to reject a false null hypothesis). Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis. Measurement processes that generate statistical data are also subject to error. 
Many of these errors are classified as random or systematic; the presence of missing data or censoring may result in biased estimates, and specific techniques have been developed to address these problems. Statistics continues to be an area of active research, for example on the problem of how to analyze big data. Statistics is a body of science that pertains to the collection, analysis, and interpretation or explanation of data. Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty; mathematical techniques used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure-theoretic probability theory. In applying statistics to a problem, it is common practice to start with a population or process to be studied; ideally, statisticians compile data about the entire population, an operation that may be organized by governmental statistical institutes
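The Type I error guarantee described above can be checked empirically. The sketch below (all numbers hypothetical) simulates many experiments in which the null hypothesis really is true, runs a two-sided z-test on each, and confirms that the fraction of false rejections stays near the chosen significance level:

```python
import random
import math

def z_test_pvalue(sample, mu0, sigma):
    """Two-sided z-test p-value for H0: mean == mu0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value from the standard normal CDF via math.erf.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
alpha = 0.05
trials = 2000
# Data generated under the null (true mean is 0), so every rejection
# below is a Type I error.
false_rejections = sum(
    z_test_pvalue([random.gauss(0, 1) for _ in range(30)], 0, 1) < alpha
    for _ in range(trials)
)
print(false_rejections / trials)  # hovers near alpha = 0.05
```

A Type II error would be the mirror image: generating data with a genuinely nonzero mean and counting how often the test fails to reject.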
2.
Normal distribution
–
In probability theory, the normal (or Gaussian) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. The normal distribution is useful because of the central limit theorem: physical quantities that are expected to be the sum of many independent processes often have distributions that are nearly normal. Moreover, many results and methods can be derived analytically in explicit form when the relevant variables are normally distributed. The normal distribution is sometimes informally called the bell curve; however, many other distributions are bell-shaped. The probability density of the normal distribution is

f(x) = (1/√(2πσ²)) · e^(−(x − μ)²/(2σ²))

where μ is the mean or expectation of the distribution, σ is the standard deviation, and σ² is the variance. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. The simplest case, with μ = 0 and σ = 1, is known as the standard normal distribution. The factor 1/2 in the exponent ensures that the distribution has unit variance; the density is symmetric around x = 0, where it attains its maximum value 1/√(2π), and has inflection points at x = +1 and x = −1. Authors may differ on which normal distribution should be called the standard one; under any convention, the probability density must be scaled by 1/σ so that the integral is still 1. If Z is a standard normal deviate, then X = Zσ + μ will have a normal distribution with expected value μ and standard deviation σ. Conversely, if X is a normal deviate with parameters μ and σ, then Z = (X − μ)/σ will have a standard normal distribution. Every normal density is the exponential of a quadratic function, f(x) = e^(ax² + bx + c), where a is negative. In this form, the mean value is μ = −b/(2a) and the variance is σ² = −1/(2a); for the standard normal distribution, a is −1/2, b is zero, and c is −ln(2π)/2. The density of the standard Gaussian distribution is denoted with the Greek letter ϕ. 
The alternative form of the Greek phi letter, φ, is also used quite often. The normal distribution is often denoted by N(μ, σ²); thus, when a random variable X is distributed normally with mean μ and variance σ², one writes X ~ N(μ, σ²). Some authors advocate using the precision τ = 1/σ² as the parameter defining the width of the distribution, instead of the deviation σ or the variance σ²
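The density formula and its properties above can be verified directly with a few lines of code. This is a minimal sketch: `normal_pdf` is an illustrative helper, not a library function.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma²): (1/sqrt(2*pi*sigma²)) * exp(-(x-mu)² / (2*sigma²))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Maximum of the standard normal density, attained at x = 0, is 1/sqrt(2*pi) ≈ 0.3989.
print(normal_pdf(0.0))

# Symmetry about the mean: f(mu + d) == f(mu - d).
print(normal_pdf(1.0) == normal_pdf(-1.0))

# Scaling by 1/sigma: the density of N(3, 2²) at its own mean is half
# the standard normal's maximum, since sigma = 2.
print(abs(normal_pdf(3.0, mu=3.0, sigma=2.0) - normal_pdf(0.0) / 2.0) < 1e-12)
```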
3.
Analysis of variance
–
In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. ANOVAs are useful for comparing three or more means for statistical significance; ANOVA is conceptually similar to multiple two-sample t-tests, but is more conservative and is therefore suited to a wide range of practical problems. While the analysis of variance reached fruition in the 20th century, its antecedents extend centuries into the past; these include hypothesis testing, the partitioning of sums of squares, experimental techniques, and the additive model. Laplace was performing hypothesis testing in the 1770s. The development of least-squares methods by Laplace and Gauss circa 1800 provided an improved method of combining observations and initiated much study of the contributions to sums of squares; Laplace soon knew how to estimate a variance from a residual sum of squares. By 1827 Laplace was using least-squares methods to address ANOVA problems regarding measurements of atmospheric tides. Before 1800, astronomers had isolated observational errors resulting from reaction times and had developed methods of reducing the errors. An eloquent non-mathematical explanation of the additive effects model was available in 1885. Ronald Fisher introduced the term variance and proposed its formal analysis in a 1918 article, The Correlation Between Relatives on the Supposition of Mendelian Inheritance; his first application of the analysis of variance was published in 1921. Analysis of variance became widely known after being included in Fisher's 1925 book Statistical Methods for Research Workers. Randomization models were developed by several researchers; the first was published in Polish by Neyman in 1923. One of the attributes of ANOVA which ensured its early popularity was computational elegance. 
The structure of the additive model allows solution for the additive coefficients by simple algebra rather than by matrix calculations; in the era of mechanical calculators this simplicity was critical. The determination of statistical significance also required access to tables of the F function, which were supplied by early statistics texts. The analysis of variance can be used as an exploratory tool to explain observations. A dog show provides an example: a dog show is not a random sampling of the breed; it is typically limited to dogs that are adult, pure-bred, and exemplary. A histogram of dog weights from a show might plausibly be rather complex. Suppose we wanted to predict the weight of a dog based on a certain set of characteristics of each dog. Before we could do that, we would need to explain the distribution of weights by dividing the dog population into groups based on those characteristics. A successful grouping will split dogs such that each group has a low variance of dog weights and a distinct group mean; in the illustrations to the right, each group is identified as X1, X2, etc
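The partition of variance described above can be written out directly. This sketch (with hypothetical data) computes the between-group and within-group sums of squares by the simple algebra the passage mentions, and forms the F statistic from their mean squares:

```python
def one_way_anova(groups):
    """One-way ANOVA: partition the total sum of squares into between-group
    and within-group components, and form F = MS_between / MS_within."""
    all_vals = [x for g in groups for x in g]
    n, k = len(all_vals), len(groups)
    grand_mean = sum(all_vals) / n
    # Between-group sum of squares: group sizes times squared deviations
    # of group means from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Three hypothetical groups with clearly different means.
f, df1, df2 = one_way_anova([[4, 5, 6], [7, 8, 9], [12, 13, 14]])
print(f, df1, df2)  # F = 49.0 with (2, 6) degrees of freedom
```

In practice the F statistic would be compared against the F distribution with (df1, df2) degrees of freedom, the tables of which early statistics texts supplied.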
4.
Null hypothesis
–
In inferential statistics, the term null hypothesis refers to a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. The null hypothesis is generally assumed to be true until evidence indicates otherwise; in statistics, it is often denoted H0. The concept of a null hypothesis is used differently in the two main approaches to statistical inference. In the significance-testing approach of Ronald Fisher, a null hypothesis is rejected if the observed data are significantly unlikely to have occurred if the null hypothesis were true; in this case the null hypothesis is rejected and an alternative hypothesis is accepted in its place. If the data are consistent with the null hypothesis, then the null hypothesis is not rejected. In neither case is the null hypothesis or its alternative proven; the null hypothesis is tested with data. This is analogous to a criminal trial, in which the defendant is assumed to be innocent until proven guilty beyond a reasonable doubt. Proponents of each approach criticize the other; nowadays, though, a hybrid approach is widely practiced and presented in textbooks, and the hybrid is in turn criticized as incorrect and incoherent. Hypothesis testing requires constructing a statistical model of what the data would look like if chance or random processes alone were responsible for the results. The hypothesis that chance alone is responsible for the results is called the null hypothesis; the model of the result of the random process is called the distribution under the null hypothesis. The obtained results are then compared with the distribution under the null hypothesis. The null hypothesis assumes no relationship between variables in the population from which the sample is selected. If the data set of a randomly selected representative sample is very unlikely relative to the null hypothesis, the experimenter rejects the null hypothesis, concluding it is false. 
This class of data sets is usually specified via a test statistic, which is designed to measure the extent of apparent departure from the null hypothesis. If the data do not contradict the null hypothesis, then only a weak conclusion can be made: namely, that the observed data provide insufficient evidence against the null hypothesis. For instance, a certain drug may reduce the chance of having a heart attack. Possible null hypotheses are "this drug does not reduce the chances of having a heart attack" or "this drug has no effect on the chances of having a heart attack." The test of the hypothesis consists of administering the drug to half of the people in a study group as a controlled experiment
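The drug example can be sketched as a two-proportion z-test, one common test statistic for comparing event rates between a treated group and a control group. The counts below are entirely hypothetical:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test of H0: the two groups have equal event rates."""
    # Pooled rate under the null hypothesis of no difference.
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (successes_a / n_a - successes_b / n_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical trial: 30 heart attacks among 1000 people on the drug,
# versus 50 among 1000 on placebo.
z, p = two_proportion_z(30, 1000, 50, 1000)
print(z, p)  # negative z (fewer events on the drug), small p-value
```

A small p-value here would lead to rejecting the null hypothesis of no effect; a large one would yield only the weak conclusion that the data provide insufficient evidence against it.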
5.
Central limit theorem
–
If this procedure is performed many times, the central limit theorem says that the computed values of the average will be distributed according to the normal distribution. The central limit theorem has a number of variants. In its common form, the random variables must be independent and identically distributed (i.i.d.). In variants, convergence of the mean to the normal distribution also occurs for non-identical distributions or for non-independent observations, provided certain conditions hold. In more general usage, a central limit theorem is any of a set of weak-convergence theorems in probability theory. When the variance of the i.i.d. variables is finite, the attractor distribution is the normal distribution. In contrast, the sum of a number of i.i.d. random variables with power-law tail distributions decreasing as |x|^(−α−1), where 0 < α < 2, will tend to an alpha-stable distribution with stability parameter α as the number of variables grows. Suppose we are interested in the sample average

S_n = (X_1 + ⋯ + X_n)/n

of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value µ as n → ∞. The classical central limit theorem describes the size and the distributional form of the stochastic fluctuations around the deterministic number µ during this convergence. For large enough n, the distribution of S_n is close to the normal distribution with mean µ and variance σ²/n. The usefulness of the theorem is that the distribution of √n(S_n − µ) approaches normality regardless of the shape of the distribution of the individual X_i. Formally, the theorem can be stated as follows. Lindeberg–Lévy CLT: suppose {X_1, X_2, …} is a sequence of i.i.d. random variables with E[X_i] = µ and Var[X_i] = σ² < ∞. Then as n approaches infinity, the random variables √n(S_n − µ) converge in distribution to a normal N(0, σ²):

√n(S_n − µ) →d N(0, σ²).

The convergence is uniform in z in the sense that

lim_{n→∞} sup_{z∈R} |Pr[√n(S_n − µ) ≤ z] − Φ(z/σ)| = 0,

where Φ is the standard normal cumulative distribution function. The Lyapunov variant of the theorem is named after the Russian mathematician Aleksandr Lyapunov. 
In this variant of the central limit theorem the random variables X_i have to be independent, but not necessarily identically distributed. The theorem also requires that the random variables |X_i| have moments of some order 2 + δ. Suppose {X_1, X_2, …} is a sequence of independent random variables, each with finite expected value μ_i and variance σ_i², and let s_n² = ∑_{i=1}^n σ_i². In practice it is usually easiest to check Lyapunov's condition for δ = 1. If a sequence of random variables satisfies Lyapunov's condition, then it also satisfies Lindeberg's condition; the converse implication, however, does not hold. Lindeberg's condition, in the same setting and with the same notation as above: suppose that for every ε > 0,

lim_{n→∞} (1/s_n²) ∑_{i=1}^n E[(X_i − μ_i)² · 1{|X_i − μ_i| > ε s_n}] = 0,

where 1{…} is the indicator function
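The classical i.i.d. statement above is easy to see numerically. This sketch draws many sample means of Uniform(0, 1) variables (µ = 0.5, σ² = 1/12) and checks that they behave like draws from N(0.5, σ²/n), as the theorem predicts:

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    # Mean of n i.i.d. Uniform(0, 1) draws: mu = 0.5, sigma² = 1/12.
    return sum(random.random() for _ in range(n)) / n

n = 50
means = [sample_mean(n) for _ in range(5000)]

# The CLT predicts the sample mean is approximately N(0.5, (1/12)/n).
print(statistics.mean(means))   # near 0.5
print(statistics.stdev(means))  # near sqrt((1/12)/50) ≈ 0.0408

# Rough normality check: about 68% of the means should lie within one
# predicted standard deviation of 0.5.
sd = (1 / 12 / n) ** 0.5
frac = sum(abs(m - 0.5) < sd for m in means) / len(means)
print(frac)
```

The individual draws here are uniform, not normal, which is exactly the point: the shape of the distribution of the X_i does not matter for the limit.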
6.
P-value
–
Their misuse has been a matter of considerable controversy. The p-value is defined informally as the probability of obtaining a result equal to or more extreme than what was actually observed, assuming the null hypothesis is true. This informal definition ignores the distinction between two-tailed and one-tailed tests, which is discussed below. In frequentist inference, the p-value is widely used in statistical hypothesis testing, specifically in null hypothesis significance testing. If the p-value is less than or equal to the chosen significance level α, the null hypothesis is rejected; however, that does not prove that the alternative hypothesis is true. When the p-value is calculated correctly, this test guarantees that the Type I error rate is at most α. For typical analysis, using the standard α = 0.05 cutoff, the p-value does not, in itself, support reasoning about the probabilities of hypotheses but is only a tool for deciding whether to reject the null hypothesis. In statistics, a hypothesis refers to a probability distribution that is assumed to govern the observed data. However, if X is a continuous random variable and an instance x is observed, then Pr(X = x) = 0; thus, the naive definition is inadequate and needs to be changed so as to accommodate continuous random variables. The p-value is defined as the probability, under the assumption of hypothesis H, of obtaining a result equal to or more extreme than what was actually observed. Depending on how it is looked at, "more extreme than what was actually observed" can mean {X ≥ x} (right-tail event), {X ≤ x} (left-tail event), or the smaller of Pr(X ≤ x) and Pr(X ≥ x) (double-tailed event). Thus, the p-value is given by Pr(X ≥ x | H) for a right-tail event. The smaller the p-value, the larger the significance, because it tells the investigator that the hypothesis under consideration may not adequately explain the observation. The hypothesis H is rejected if any of these probabilities is less than or equal to a small, fixed but arbitrarily pre-defined threshold value α, which is referred to as the level of significance. Unlike the p-value, the α level is not derived from any observational data and does not depend on the underlying hypothesis; thus, while α is fixed in advance, the p-value is not. 
This implies that the p-value cannot be given a frequency-counting interpretation, since the probability has to be fixed for the frequency-counting interpretation to hold. In other words, if the same test is repeated independently bearing upon the same null hypothesis, it will yield a different p-value at each repetition. Nevertheless, these different p-values can be combined using Fisher's combined probability test. The fixed pre-defined α level can be interpreted as the rate of falsely rejecting the null hypothesis, since Pr(reject H | H) = Pr(p ≤ α | H) = α. Usually, instead of the raw observations, X is instead a test statistic
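The right-tail definition Pr(X ≥ x | H) can be computed exactly for a discrete test statistic. This sketch uses a hypothetical coin experiment: under H, the number of heads in 20 flips of a fair coin is Binomial(20, 0.5), and we observe 14 heads:

```python
import math

def right_tail_pvalue_binomial(k, n, p0):
    """p-value for the right-tail event Pr(X >= k) under H: X ~ Binomial(n, p0)."""
    return sum(math.comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))

# Hypothetical experiment: 14 heads observed in 20 flips of a supposedly fair coin.
p = right_tail_pvalue_binomial(14, 20, 0.5)
print(p)  # ≈ 0.0577, so H is not rejected at the conventional alpha = 0.05
```

The left-tail p-value would sum Pr(X = i) for i ≤ k instead, matching the other reading of "more extreme" described above.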
7.
Journal of the American Statistical Association
–
The Journal of the American Statistical Association is the primary journal published by the American Statistical Association, the main professional body for statisticians in the United States. It is published four times a year. It had an impact factor of 2.063 in 2010, tenth highest in the Statistics and Probability category of Journal Citation Reports. In a 2003 survey of statisticians, the Journal of the American Statistical Association was ranked first, among all journals, for applications of statistics. The predecessor of this journal started in 1888 with the name Publications of the American Statistical Association; it became Quarterly Publications of the American Statistical Association in 1912, and then the Journal of the American Statistical Association
8.
Robust statistics
–
Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers; another motivation is to provide methods with good performance when there are small departures from parametric distributions. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly. Robust statistics seek to provide methods that emulate popular statistical methods while remaining reliable under such departures. In statistics, classical estimation methods rely heavily on assumptions which are often not met in practice. In particular, it is often assumed that the data errors are normally distributed, at least approximately. Unfortunately, when there are outliers in the data, classical estimators often have poor performance, when judged using the breakdown point. For instance, one may use a mixture of 95% a normal distribution and 5% a distribution with the same mean but much larger spread, representing outliers. The median is a robust measure of central tendency, while the mean is not: the median has a breakdown point of 50%, while the mean has a breakdown point of 0%. The median absolute deviation and interquartile range are robust measures of statistical dispersion, while the standard deviation and range are not. Trimmed estimators and Winsorised estimators are general methods to make statistics more robust. There are various definitions of a robust statistic; strictly speaking, a robust statistic is resistant to errors in the results produced by deviations from assumptions. One of the most important cases is distributional robustness; classical statistical procedures are typically sensitive to longtailedness. 
Thus, in the context of robust statistics, distributionally robust and outlier-resistant are effectively synonymous. For one perspective on research in robust statistics up to 2000, see Portnoy & He. A related topic is that of resistant statistics, which are resistant to the effect of extreme scores. Gelman et al. in Bayesian Data Analysis consider a data set relating to speed-of-light measurements made by Simon Newcomb; the data sets for that book can be found via the Classic data sets page. Although the bulk of the data look to be more or less normally distributed, there are two obvious outliers. These outliers have a large effect on the mean, dragging it towards them. Thus, if the mean is intended as a measure of the location of the center of the data, it is, in a sense, misleading when outliers are present. Also, the distribution of the mean is known to be asymptotically normal due to the central limit theorem; however, outliers can make the distribution of the mean non-normal even for large data sets
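The contrast between the mean's 0% and the median's 50% breakdown point is easy to demonstrate. The values below are hypothetical, merely styled after the Newcomb example (a cluster of plausible measurements plus two gross outliers):

```python
import statistics

# Hypothetical measurements: a normal-looking bulk plus two gross outliers.
data = [24, 25, 26, 26, 27, 27, 28, 28, 29, 30, -44, -2]

print(statistics.mean(data))    # dragged toward the outliers (≈ 18.7)
print(statistics.median(data))  # stays near the bulk of the data (26.5)

# Breakdown intuition: corrupting a single value can move the mean
# arbitrarily far, while the median barely moves.
corrupted = data[:-2] + [10_000]
print(statistics.mean(corrupted))    # wildly off
print(statistics.median(corrupted))  # still near the bulk
```

The same comparison holds for dispersion: the standard deviation of `data` is inflated by the outliers, while the interquartile range is not.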
9.
Standard error
–
The standard error is the standard deviation of the sampling distribution of a statistic, most commonly of the mean. Different samples drawn from the same population would in general have different values of the sample mean. The relationship with the standard deviation is defined such that, for a given sample size, the standard error equals the standard deviation divided by the square root of the sample size: SE = σ/√n. As the sample size increases, the dispersion of the sample means clusters more closely around the population mean. The term may also be used to refer to an estimate of that standard deviation, computed from the sample of data being analyzed; the standard error of the mean is then estimated by the sample standard deviation divided by √n. This estimate may be compared with the formula for the true standard deviation of the sample mean. This formula may be derived from what we know about the variance of a sum of independent random variables. If X1, X2, …, Xn are n independent observations from a population that has mean μ and standard deviation σ, then the variance of the total T = X1 + X2 + ⋯ + Xn is nσ². The variance of T/n must be (1/n²)(nσ²) = σ²/n, and the standard deviation of T/n must be σ/√n. Of course, T/n is the sample mean x̄. Using the sample standard deviation underestimates the true standard error: with n = 2 the underestimate is about 25%, but for n = 6 the underestimate is only 5%. Gurland and Tripathi provide a correction and equation for this effect, and Sokal and Rohlf give an equation of the correction factor for small samples of n < 20; see unbiased estimation of standard deviation for further discussion. A practical result: decreasing the uncertainty in a mean-value estimate by a factor of two requires acquiring four times as many observations in the sample, and decreasing the standard error by a factor of ten requires a hundred times as many observations. In many practical applications, the true value of σ is unknown. As a result, we need to use a distribution that takes into account the spread of possible σ's. When the true underlying distribution is known to be Gaussian, although with unknown σ, the resulting estimated distribution follows the Student t-distribution, and the standard error is the standard deviation of the Student t-distribution. 
T-distributions are slightly different from the Gaussian, and vary depending on the size of the sample. To estimate the standard error of a Student t-distribution it is sufficient to use the sample standard deviation s instead of σ, and we could use this value to calculate confidence intervals. Note: the Student's probability distribution is approximated well by the Gaussian distribution when the sample size is over 100; for such samples one can use the Gaussian distribution, which is much simpler
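The identity SE = σ/√n can be checked by simulation: draw many samples of a fixed size, compute each sample's mean, and compare the spread of those means against the formula. All parameter values below are hypothetical:

```python
import math
import random
import statistics

random.seed(2)
sigma, n = 3.0, 25
theoretical_se = sigma / math.sqrt(n)  # sigma / sqrt(n) = 0.6

# Draw many independent samples of size n from N(10, sigma²) and record
# each sample's mean; the spread of these means is the standard error.
sample_means = [
    statistics.mean(random.gauss(10, sigma) for _ in range(n))
    for _ in range(4000)
]
print(theoretical_se)                  # 0.6
print(statistics.stdev(sample_means))  # close to 0.6
```

Quadrupling n to 100 would halve both numbers, illustrating the practical result stated above: halving the uncertainty costs four times the observations.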
10.
JSTOR
–
JSTOR is a digital library founded in 1995. Originally containing digitized back issues of academic journals, it now also includes books and primary sources. It provides full-text searches of almost 2,000 journals. More than 8,000 institutions in more than 160 countries have access to JSTOR; most access is by subscription, but some older public-domain content is freely available to anyone. JSTOR was conceived by William G. Bowen, president of Princeton University from 1972 to 1988, as a solution to one of the problems faced by libraries, especially research and university libraries, due to the increasing number of academic journals in existence. Most libraries found it prohibitively expensive in terms of cost and space to maintain a comprehensive collection of journals. By digitizing many journal titles, JSTOR allowed libraries to outsource the storage of journals with the confidence that they would remain available long-term; online access and full-text search ability improved access dramatically. Bowen initially considered using CD-ROMs for distribution. JSTOR was initiated in 1995 at seven different library sites, and originally encompassed ten economics and history journals. JSTOR access improved based on feedback from its sites, and special software was put in place to make pictures and graphs clear. With the success of this limited project, Bowen and Kevin Guthrie, then-president of JSTOR, wanted to expand the number of participating journals. They met with representatives of the Royal Society of London, and an agreement was made to digitize the Philosophical Transactions of the Royal Society dating from its beginning in 1665; the work of adding these volumes to JSTOR was completed by December 2000. The Andrew W. Mellon Foundation funded JSTOR initially. Until January 2009 JSTOR operated as an independent, self-sustaining nonprofit organization with offices in New York City and in Ann Arbor, Michigan. 
JSTOR content is provided by more than 900 publishers; the database contains more than 1,900 journal titles in more than 50 disciplines. Each object is identified by an integer value, starting at 1. In addition to the main site, the JSTOR Labs group operates an open service that allows access to the contents of the archives for the purposes of corpus analysis at its Data for Research service. This site offers a search facility with graphical indication of the article coverage. Users may create focused sets of articles and then request a dataset containing word and n-gram frequencies; they are notified when the dataset is ready and may download it in either XML or CSV formats. The service does not offer full-text access, although academics may request that from JSTOR. JSTOR Plant Science is available in addition to the main site; the materials on JSTOR Plant Science are contributed through the Global Plants Initiative and are available only through JSTOR
11.
Homoscedasticity
–
In statistics, a sequence or a vector of random variables is homoscedastic /ˌhoʊmoʊskəˈdæstɪk/ if all random variables in the sequence or vector have the same finite variance. This is also known as homogeneity of variance; the complementary notion is called heteroscedasticity. The spellings homoskedasticity and heteroskedasticity are also frequently used. The assumption of homoscedasticity simplifies mathematical and computational treatment; serious violations of it may result in overestimating the goodness of fit as measured by the Pearson coefficient. As used in describing simple linear regression analysis, one assumption of the fitted model is that the standard deviations of the error terms are constant and do not depend on the x-value; consequently, each probability distribution for y has the same standard deviation regardless of the x-value. In short, this assumption is homoscedasticity. Homoscedasticity is not required for the estimates to be unbiased, consistent, and asymptotically normal. Residuals can be tested for homoscedasticity using the Breusch–Pagan test, which performs an auxiliary regression of the squared residuals on the independent variables; the null hypothesis of this chi-squared test is homoscedasticity, and the alternative hypothesis would indicate heteroscedasticity. Since the Breusch–Pagan test is sensitive to departures from normality or small sample sizes, the Koenker–Bassett (generalized) version is often used instead. From the auxiliary regression, the test retains the R-squared value, which is then multiplied by the sample size to become the test statistic for a chi-squared distribution. Although it is not necessary for the Koenker–Bassett test, the original Breusch–Pagan test requires that the squared residuals also be divided by the residual sum of squares divided by the sample size. Testing for groupwise heteroscedasticity requires the Goldfeld–Quandt test. Two or more normal distributions, N(μ_i, Σ_i), are homoscedastic if they share a common covariance matrix, Σ_i = Σ_j, ∀ i, j. 
Homoscedastic distributions are useful for deriving statistical pattern recognition and machine learning algorithms; one popular example is Fisher's linear discriminant analysis. The concept of homoscedasticity can also be applied to distributions on spheres
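The Koenker–Bassett form of the Breusch–Pagan test described above (auxiliary regression of squared residuals, statistic = n · R²) can be sketched in a few dozen lines. This is an illustrative implementation on synthetic data, not a substitute for a statistics library:

```python
import random

def ols_slope_intercept(x, y):
    """Ordinary least squares fit y ≈ a + b*x; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return b, my - b * mx

def breusch_pagan_lm(x, y):
    """Koenker's studentized Breusch-Pagan statistic: regress squared OLS
    residuals on x and report LM = n * R² (chi-squared with 1 df here)."""
    b, a = ols_slope_intercept(x, y)
    resid2 = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression of the squared residuals on x.
    b2, a2 = ols_slope_intercept(x, resid2)
    mean_r2 = sum(resid2) / len(resid2)
    ss_tot = sum((r - mean_r2) ** 2 for r in resid2)
    ss_res = sum((r - (a2 + b2 * xi)) ** 2 for xi, r in zip(x, resid2))
    r_squared = 1 - ss_res / ss_tot
    return len(x) * r_squared

random.seed(3)
x = [i / 10 for i in range(1, 201)]
homo = [2 + 0.5 * xi + random.gauss(0, 1) for xi in x]           # constant error variance
hetero = [2 + 0.5 * xi + random.gauss(0, 0.2 * xi) for xi in x]  # variance grows with x

print(breusch_pagan_lm(x, homo))    # small: no evidence against homoscedasticity
print(breusch_pagan_lm(x, hetero))  # large: heteroscedasticity detected
```

Under the null hypothesis the LM statistic is approximately chi-squared with one degree of freedom here (one regressor in the auxiliary regression), so values well above 3.84 indicate heteroscedasticity at the 5% level.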