Bar chart

A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart. A bar graph shows comparisons among discrete categories. One axis of the chart shows the specific categories being compared, and the other axis represents a measured value. Some bar graphs present bars clustered in groups of more than one, showing the values of more than one measured variable. Many sources consider William Playfair to have invented the bar chart, and the graph "Exports and Imports of Scotland to and from different parts for one Year from Christmas 1780 to Christmas 1781" from his The Commercial and Political Atlas to be the first bar chart in history. Diagrams of the velocity of an accelerating object against time, published in The Latitude of Forms about 300 years before, can be interpreted as "proto bar charts". Bar charts have a discrete domain of categories and are scaled so that all the data can fit on the chart.

When there is no natural ordering of the categories being compared, bars on the chart may be arranged in any order. Bar charts arranged from highest to lowest incidence are called Pareto charts. Bar graphs/charts provide a visual presentation of categorical data. Categorical data is a grouping of data into discrete groups, such as months of the year, age group, shoe sizes, animals; these categories are qualitative. In a column bar chart, the categories appear along the horizontal axis. Bar graphs can be used for more complex comparisons of data with grouped bar charts and stacked bar charts. In a grouped bar chart, for each categorical group there are two or more bars; these bars are color-coded to represent a particular grouping. For example, a business owner with two stores might make a grouped bar chart with different colored bars to represent each store: the horizontal axis would show the months of the year and the vertical axis would show the revenue. Alternatively, a stacked bar chart could be used.
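The difference between the grouped and stacked layouts in the two-store example can be sketched in a few lines of Python. This is only an illustration: the revenue figures, the store names, and the helper functions `grouped_bar_positions` and `stacked_bar_segments` are assumptions made up for the sketch, and the code computes bar geometry rather than drawing anything.

```python
# Hypothetical monthly revenue for two stores (illustrative numbers only).
months = ["Jan", "Feb", "Mar"]
revenue = {
    "Store A": [120, 135, 150],
    "Store B": [100, 140, 130],
}


def grouped_bar_positions(n_categories, n_series, group_width=0.8):
    """Return x-positions so each category holds n_series side-by-side bars."""
    bar_width = group_width / n_series
    positions = []
    for s in range(n_series):
        # Center the cluster of bars on the integer category position.
        offset = (s - (n_series - 1) / 2) * bar_width
        positions.append([c + offset for c in range(n_categories)])
    return positions, bar_width


def stacked_bar_segments(series):
    """Return (bottom, height) pairs per series, plus the total bar heights."""
    n = len(next(iter(series.values())))
    running = [0] * n
    segments = {}
    for name, values in series.items():
        # Each segment starts where the previous series ended.
        segments[name] = [(running[i], values[i]) for i in range(n)]
        running = [running[i] + values[i] for i in range(n)]
    return segments, running


positions, bar_width = grouped_bar_positions(len(months), len(revenue))
segments, totals = stacked_bar_segments(revenue)
print(totals)  # combined revenue per month: [220, 275, 280]
```

In the grouped layout each store keeps its own baseline, so the stores can be compared directly; in the stacked layout the second store's segment starts at the top of the first, so the total height is what is easiest to read.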

The stacked bar chart stacks the bars that represent different groups on top of each other. The height of the resulting bar shows the combined result of the groups. However, stacked bar charts are not suited to datasets in which some groups have negative values. In such cases, grouped bar charts are preferable. Grouped bar graphs present the information in the same order in each grouping. Stacked bar graphs present the information in the same sequence on each bar.

Statistics

Statistics is a branch of mathematics dealing with data collection, analysis and presentation. In applying statistics to, for example, a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse topics such as "all people living in a country" or "every atom composing a crystal". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments. When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements.

In contrast, an observational study does not involve experimental manipulation. Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation. Descriptive statistics are most often concerned with two sets of properties of a distribution: central tendency seeks to characterize the distribution's central or typical value, while dispersion characterizes the extent to which members of the distribution depart from its center and each other. Inferences in mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena. A standard statistical procedure involves testing the relationship between two statistical data sets, or between a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between the two data sets.
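As a small illustration of the descriptive measures mentioned above, the following sketch computes measures of central tendency and dispersion with Python's standard `statistics` module. The sample values are made up for the example.

```python
import statistics

# Made-up sample of eight measurements (illustrative only).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Central tendency: the typical value of the distribution.
mean = statistics.mean(sample)
median = statistics.median(sample)

# Dispersion: how far members of the sample depart from the center.
sample_sd = statistics.stdev(sample)       # divides by n - 1
population_sd = statistics.pstdev(sample)  # divides by n

print(mean, median, population_sd)  # 5.0 4.5 2.0
```

The sample standard deviation (n - 1 in the denominator) is the usual choice when the data are a sample drawn from a larger population; the population form is shown only for comparison.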

Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false, given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors, in which a true null hypothesis is rejected (a "false positive"), and Type II errors, in which a false null hypothesis fails to be rejected (a "false negative"). Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis. Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as random or systematic, but other types of errors can also be important. The presence of missing data or censoring may result in biased estimates, and specific techniques have been developed to address these problems. Statistics can be said to have begun in ancient civilization, going back at least to the 5th century BC, but it was not until the 18th century that it started to draw more heavily from calculus and probability theory. In more recent years statistics has relied more on statistical software to produce tests such as descriptive analysis.

Some definitions are: the Merriam-Webster dictionary defines statistics as "a branch of mathematics dealing with the collection, analysis and presentation of masses of numerical data." Statistician Arthur Lyon Bowley defines statistics as "Numerical statements of facts in any department of inquiry placed in relation to each other." Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data, or a branch of mathematics. Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty and with decision making in the face of uncertainty. Mathematical statistics is the application of mathematics to statistics. Mathematical techniques used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure-theoretic probability theory.

In applying statistics to a problem, it is common practice to start with a population or process to be studied. Populations can be diverse topics such as "all people living in a country" or "every atom composing a crystal". Ideally, statisticians compile data about the entire population (an operation called a census); this may be organized by governmental statistical institutes. Descriptive statistics can be used to summarize the population data. Numerical descriptors include mean and standard deviation for continuous data types, while frequency and percentage are more useful for describing categorical data. When a census is not feasible, a chosen subset of the population, called a sample, is studied. Once a sample that is representative of the population is determined, data is collected for the sample members in an observational or experimental setting. Again, descriptive statistics can be used to summarize the sample data. However, the drawing of the sample has been subject to an element of randomness, hence the established numerical descriptors from the sample are also subject to uncertainty.

To still draw meaningful conclusions about the entire population, in

Scatter plot

A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. If the points are color-coded, one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. A scatter plot can be used either when one continuous variable is under the control of the experimenter and the other depends on it, or when both continuous variables are independent. If a parameter exists that is systematically incremented and/or decremented by the other, it is called the control parameter or independent variable and is customarily plotted along the horizontal axis. The measured or dependent variable is customarily plotted along the vertical axis. If no dependent variable exists, either type of variable can be plotted on either axis, and a scatter plot will illustrate only the degree of correlation between two variables.

A scatter plot can suggest various kinds of correlations between variables with a certain confidence interval. For example, for weight and height, weight would be on the y axis and height would be on the x axis. Correlations may be positive, negative, or null. If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation between the variables being studied. If the pattern of dots slopes from upper left to lower right, it indicates a negative correlation. A line of best fit can be drawn in order to study the relationship between the variables. An equation for the correlation between the variables can be determined by established best-fit procedures. For a linear correlation, the best-fit procedure is known as linear regression and is guaranteed to generate a correct solution in a finite time. No universal best-fit procedure is guaranteed to generate a correct solution for arbitrary relationships. A scatter plot is also very useful when we wish to see how two comparable data sets agree and to show nonlinear relationships between variables.
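The linear best-fit procedure described above has a closed-form solution, which is why it finishes in finite time. The sketch below fits a line by ordinary least squares and reports the Pearson correlation coefficient, whose sign matches the direction in which the dots slope. The height and weight values and the function name `fit_line` are invented for the illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # r > 0: dots rise from lower left to upper right
    return slope, intercept, r


# Hypothetical data: height in cm on the x axis, weight in kg on the y axis.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 67, 74, 84]

slope, intercept, r = fit_line(heights, weights)
print(round(slope, 3), round(intercept, 3), round(r, 3))
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 corresponds to the null case; it says nothing about nonlinear relationships, which is where inspecting the plot itself remains useful.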

The ability to do this can be enhanced by adding a smooth line such as LOESS. Furthermore, if the data are represented by a mixture model of simple relationships, these relationships will be visually evident as superimposed patterns. The scatter diagram is one of the seven basic tools of quality control. Scatter charts can be built in the form of marker charts, line charts, or both. For example, to display a link between a person's lung capacity and how long that person could hold their breath, a researcher would choose a group of people to study, then measure each one's lung capacity and how long that person could hold their breath. The researcher would then plot the data in a scatter plot, assigning "lung capacity" to the horizontal axis and "time holding breath" to the vertical axis. A person with a lung capacity of 400 cl who held their breath for 21.7 seconds would be represented by a single dot on the scatter plot at the point (400, 21.7) in the Cartesian coordinates. The scatter plot of all the people in the study would enable the researcher to obtain a visual comparison of the two variables in the data set, and will help to determine what kind of relationship there might be between the two variables.

For a set of data variables X1, X2, ..., Xk, the scatter plot matrix shows all the pairwise scatter plots of the variables in a single view, with multiple scatterplots in a matrix format. For k variables, the scatterplot matrix will contain k rows and k columns. A plot located at the intersection of the i-th row and j-th column is a plot of variables Xi versus Xj. This means that each row and column is one dimension, and each cell plots a scatterplot of two dimensions. A generalized scatterplot matrix offers a range of displays of paired combinations of categorical and quantitative variables. A mosaic plot, fluctuation diagram, or faceted bar chart may be used to display two categorical variables. Other plots are used for one categorical and one quantitative variable.

Kriging

In statistics, originally in geostatistics, kriging or Gaussian process regression is a method of interpolation for which the interpolated values are modeled by a Gaussian process governed by prior covariances. Under suitable assumptions on the priors, kriging gives the best linear unbiased prediction of the intermediate values. Interpolating methods based on other criteria such as smoothness need not yield the most likely intermediate values. The method is widely used in the domain of spatial analysis and computer experiments. The technique is also known as Wiener–Kolmogorov prediction, after Norbert Wiener and Andrey Kolmogorov. The theoretical basis for the method was developed by the French mathematician Georges Matheron in 1960, based on the Master's thesis of Danie G. Krige, the pioneering plotter of distance-weighted average gold grades at the Witwatersrand reef complex in South Africa. Krige sought to estimate the most likely distribution of gold based on samples from a few boreholes. The English verb is to krige and the most common noun is kriging.

The word is sometimes capitalized as Kriging in the literature. The basic idea of kriging is to predict the value of a function at a given point by computing a weighted average of the known values of the function in the neighborhood of the point. The method is mathematically closely related to regression analysis. Both theories derive a best linear unbiased estimator based on assumptions on covariances, make use of the Gauss–Markov theorem to prove independence of the estimate and error, and make use of very similar formulae. Even so, they are useful in different frameworks: kriging is made for estimation of a single realization of a random field, while regression models are based on multiple observations of a multivariate data set. The kriging estimation may also be seen as a spline in a reproducing kernel Hilbert space, with the reproducing kernel given by the covariance function. The difference from the classical kriging approach lies in the interpretation: while the spline is motivated by a minimum norm interpolation based on a Hilbert space structure, kriging is motivated by an expected squared prediction error based on a stochastic model.

Kriging with polynomial trend surfaces is mathematically identical to generalized least squares polynomial curve fitting. Kriging can also be understood as a form of Bayesian inference. Kriging starts with a prior distribution over functions. This prior takes the form of a Gaussian process: N samples from a function will be normally distributed, where the covariance between any two samples is the covariance function (or kernel) of the Gaussian process evaluated at the spatial location of two points. A set of values is then observed, each value associated with a spatial location. Now, a new value can be predicted at any new spatial location by combining the Gaussian prior with a Gaussian likelihood function for each of the observed values. The resulting posterior distribution is also Gaussian, with a mean and covariance that can be computed from the observed values, their variance, and the kernel matrix derived from the prior. In geostatistical models, sampled data is interpreted as the result of a random process. The fact that these models incorporate uncertainty in their conceptualization doesn't mean that the phenomenon (the forest, the aquifer, the mineral deposit) has resulted from a random process, but rather it allows one to build a methodological basis for the spatial inference of quantities in unobserved locations, and to quantify the uncertainty associated with the estimator.
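The Bayesian view described above can be made concrete with a minimal one-dimensional sketch. It assumes a zero-mean prior and a squared-exponential covariance function, which is the "simple kriging" special case rather than the general method; the function names, the covariance parameters, and the sample values are all assumptions chosen for illustration. The prediction is a weighted average of the observed values, with weights obtained by solving the linear system formed from the kernel matrix.

```python
import math


def sq_exp_cov(a, b, variance=1.0, length_scale=1.0):
    """Assumed covariance function: squared exponential in the distance |a - b|."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def simple_krige(xs, zs, x_new, noise=0.0):
    """Posterior mean and variance at x_new under a zero-mean Gaussian process prior."""
    # Kernel matrix of the observed locations (plus optional observation noise).
    K = [[sq_exp_cov(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    # Covariances between the observed locations and the new location.
    k_star = [sq_exp_cov(a, x_new) for a in xs]
    weights = solve(K, k_star)  # kriging weights
    mean = sum(w * z for w, z in zip(weights, zs))
    variance = sq_exp_cov(x_new, x_new) - sum(w * k for w, k in zip(weights, k_star))
    return mean, variance


# Hypothetical samples: locations along a line and the values observed there.
locations = [0.0, 1.0, 2.5]
observed = [1.0, 2.0, 0.5]

print(simple_krige(locations, observed, 1.7))
```

With no observation noise the predictor interpolates exactly: at a sampled location the mean equals the observed value and the variance is zero, while far from every sample the prediction falls back to the prior mean with the full prior variance.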

A stochastic process is, in the context of this model, simply a way to approach the set of data collected from the samples. The first step in geostatistical modeling is to create a random process that best describes the set of observed data. A value from location $x_1$ is interpreted as a realization $z(x_1)$ of the random variable $Z(x_1)$. In the space A, where the set of samples is dispersed, there are N realizations of the random variables $Z(x_1), Z(x_2), \ldots, Z(x_N)$, correlated between themselves. The set of random variables constitutes a random function, of which only one realization is known: $z(x_i)$, the set of observed data. With only one realization of each random variable, it is theoretically impossible to determine any statistical parameter of the individual variables or the function. The proposed solution in the geostatistical formalism consists in assuming various degrees of stationarity in the random function, in order to make possible the inference of some statistic values. For instance, if one assumes, based on the homogeneity of samples in area A where the variable is distributed, the hypothesis that the first moment is stationary, then one is assuming that the mean can be estimated by the arithmetic mean of sampled values.

Judging such a hypothesis as appropriate is equivalent to assuming the sample values are sufficiently homogeneous. The hypothesis of stationarity related to the second moment is defined i

Design of experiments

The design of experiments is the design of any task that aims to describe or explain the variation of information under conditions that are hypothesized to reflect the variation. The term is generally associated with experiments in which the design introduces conditions that directly affect the variation, but may also refer to the design of quasi-experiments, in which natural conditions that influence the variation are selected for observation. In its simplest form, an experiment aims at predicting the outcome by introducing a change of the preconditions, represented by one or more independent variables, also referred to as "input variables" or "predictor variables." The change in one or more independent variables is generally hypothesized to result in a change in one or more dependent variables, also referred to as "output variables" or "response variables." The experimental design may also identify control variables that must be held constant to prevent external factors from affecting the results. Experimental design involves not only the selection of suitable independent, dependent, and control variables, but also planning the delivery of the experiment under statistically optimal conditions given the constraints of available resources.

There are multiple approaches for determining the set of design points to be used in the experiment. Main concerns in experimental design include the establishment of validity and replicability. For example, these concerns can be partially addressed by carefully choosing the independent variable, reducing the risk of measurement error, and ensuring that the documentation of the method is sufficiently detailed. Related concerns include achieving appropriate levels of statistical sensitivity. Designed experiments advance knowledge in the natural and social sciences and engineering. Other applications include policy making. In 1747, while serving as surgeon on HMS Salisbury, James Lind carried out a systematic clinical trial to compare remedies for scurvy. This systematic clinical trial constitutes a type of DOE. Lind selected 12 men from the ship, all suffering from scurvy. Lind limited his subjects to men who "were as similar as I could have them," that is, he provided strict entry requirements to reduce extraneous variation. He divided them into six pairs, giving each pair different supplements to their basic diet for two weeks.

The treatments were all remedies that had been proposed: a quart of cider every day; twenty-five gutts (drops) of vitriol three times a day upon an empty stomach; one half-pint of seawater every day; a mixture of garlic and horseradish in a lump the size of a nutmeg; two spoonfuls of vinegar three times a day; and two oranges and one lemon every day. The citrus treatment stopped after six days when they ran out of fruit, but by that time one sailor was fit for duty while the other had almost recovered. Apart from that, only group one (cider) showed some effect of its treatment. The remainder of the crew presumably served as a control, but Lind did not report results from any control group. A theory of statistical inference was developed by Charles S. Peirce in "Illustrations of the Logic of Science" and "A Theory of Probable Inference", two publications that emphasized the importance of randomization-based inference in statistics. Peirce also randomly assigned volunteers to a blinded, repeated-measures design to evaluate their ability to discriminate weights.

Peirce's experiment inspired other researchers in psychology and education, which developed a research tradition of randomized experiments in laboratories and specialized textbooks in the 1800s. Peirce also contributed the first English-language publication on an optimal design for regression models in 1876. A pioneering optimal design for polynomial regression was suggested by Gergonne in 1815. In 1918, Kirstine Smith published optimal designs for polynomials of degree six. The use of a sequence of experiments, where the design of each may depend on the results of previous experiments, including the possible decision to stop experimenting, is within the scope of sequential analysis, a field that was pioneered by Abraham Wald in the context of sequential tests of statistical hypotheses. Herman Chernoff wrote an overview of optimal sequential designs, while adaptive designs have been surveyed by S. Zacks. One specific type of sequential design is the "two-armed bandit", generalized to the multi-armed bandit, on which early work was done by Herbert Robbins in 1952.

A methodology for designing experiments was proposed by Ronald Fisher in his innovative books The Arrangement of Field Experiments and The Design of Experiments. Much of his pioneering work dealt with agricultural applications of statistical methods. As a mundane example, he described how to test the lady tasting tea hypothesis, that a certain lady could distinguish by flavour alone whether the milk or the tea was first placed in the cup. These methods have been broadly adapted in the physical and social sciences, are still used in agricultural engineering, and differ from the design and analysis of computer experiments. Comparison: in some fields of study it is not possible to have independent measurements to a traceable metrology standard. Comparisons between treatments are much more valuable and are usually preferable, and are often made against a scientific control or traditional treatment that acts as baseline. Randomization: random assignment is the process of assigning individuals at random to groups or to different groups in an experiment, so that each individual of the population has the same chance of becoming a participant in the study.
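Random assignment as described above can be sketched with Python's standard `random` module. The example borrows the shape of Lind's trial (12 subjects, six treatments, two subjects per treatment) purely as an illustration; Lind's own allocation is not described as randomized, and the subject labels, treatment labels, and the function name `randomly_assign` are hypothetical.

```python
import random


def randomly_assign(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn."""
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        # Dealing round-robin keeps the group sizes balanced.
        groups[treatments[i % len(treatments)]].append(subject)
    return groups


subjects = [f"subject_{i}" for i in range(1, 13)]
treatments = ["cider", "vitriol", "seawater", "garlic paste", "vinegar", "citrus"]

groups = randomly_assign(subjects, treatments, seed=0)
for treatment, members in groups.items():
    print(treatment, members)
```

Because the shuffle is uniform, every subject has the same chance of ending up in any group, so systematic differences between groups cannot be attributed to how the experimenter chose the members.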

The random assignme

Spatial analysis

Spatial analysis or spatial statistics includes any of the formal techniques which study entities using their topological, geometric, or geographic properties. Spatial analysis includes a variety of techniques, many still in their early development, using different analytic approaches and applied in fields as diverse as astronomy, with its studies of the placement of galaxies in the cosmos, and chip fabrication engineering, with its use of "place and route" algorithms to build complex wiring structures. In a more restricted sense, spatial analysis is the technique applied to structures at the human scale, most notably in the analysis of geographic data. Complex issues arise in spatial analysis, many of which are neither defined nor resolved, but form the basis for current research. The most fundamental of these is the problem of defining the spatial location of the entities being studied. Classification of the techniques of spatial analysis is difficult because of the large number of different fields of research involved, the different fundamental approaches which can be chosen, and the many forms the data can take.

Spatial analysis can be considered to have arisen with early attempts at cartography and surveying but many fields have contributed to its rise in modern form. Biology contributed through botanical studies of global plant distributions and local plant locations, ethological studies of animal movement, landscape ecological studies of vegetation blocks, ecological studies of spatial population dynamics, the study of biogeography. Epidemiology contributed with early work on disease mapping, notably John Snow's work of mapping an outbreak of cholera, with research on mapping the spread of disease and with location studies for health care delivery. Statistics has contributed through work in spatial statistics. Economics has contributed notably through spatial econometrics. Geographic information system is a major contributor due to the importance of geographic software in the modern analytic toolbox. Remote sensing has contributed extensively in clustering analysis. Computer science has contributed extensively through the study of algorithms, notably in computational geometry.

Mathematics continues to provide the fundamental tools for analysis and to reveal the complexity of the spatial realm, for example, with recent work on fractals and scale invariance. Scientific modelling provides a useful framework for new approaches. Spatial analysis confronts many fundamental issues in the definition of its objects of study, in the construction of the analytic operations to be used, in the use of computers for analysis, in the limitations and particularities of the analyses which are known, and in the presentation of analytic results. Many of these issues are active subjects of modern research. Common errors often arise in spatial analysis, some due to the mathematics of space, some due to the particular ways data are presented spatially, and some due to the tools which are available. Census data, because it protects individual privacy by aggregating data into local units, raises a number of statistical issues. The fractal nature of coastlines makes precise measurements of their length difficult if not impossible.

Computer software fitting straight lines to the curve of a coastline can easily calculate the lengths of the lines which it defines. However, these straight lines may have no inherent meaning in the real world, as was shown for the coastline of Britain. These problems represent a challenge in spatial analysis because of the power of maps as media of presentation. When results are presented as maps, the presentation combines spatial data which are generally accurate with analytic results which may be inaccurate, leading to an impression that analytic results are more accurate than the data would indicate. The definition of the spatial presence of an entity constrains the possible analysis which can be applied to that entity and influences the final conclusions that can be reached. While this property is fundamentally true of all analysis, it is particularly important in spatial analysis because the tools to define and study entities favor specific characterizations of the entities being studied. Statistical techniques favor the spatial definition of objects as points because there are very few statistical techniques which operate directly on line, area, or volume elements.

Computer tools favor the spatial definition of objects as homogeneous and separate elements because of the limited number of database elements and computational structures available, and the ease with which these primitive structures can be created. Spatial dependency is the co-variation of properties within geographic space: characteristics at proximal locations appear to be correlated, either positively or negatively. Spatial dependency leads to the spatial autocorrelation problem in statistics since, like temporal autocorrelation, this violates standard statistical techniques that assume independence among observations. For example, regression analyses that do not compensate for spatial dependency can have unstable parameter estimates and yield unreliable significance tests. Spatial regression models do not suffer from these weaknesses. It is also appropriate to view spatial dependency as a source of information rather than something to be corrected. Locational effects also manifest as spatial heterogeneity, or the apparent variation in a process with respect to location in geographic space.
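One common way to quantify the spatial dependency described above is Moran's I, a standard measure of spatial autocorrelation that the text does not itself name; it is used here only as an illustrative choice. The sketch assumes a row of cells in which each cell neighbors the cells directly beside it, and the values and function names are made up for the example.

```python
def morans_i(values, weights):
    """Moran's I: weighted co-variation of neighbors relative to total variation."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    denom = sum(d * d for d in dev)
    return (n / total_weight) * (cross / denom)


def chain_weights(n):
    """Assumed neighborhood: cells in a row, each adjacent to its left and right cell."""
    w = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = 1
        w[i + 1][i] = 1
    return w


w = chain_weights(4)
smooth = morans_i([1, 2, 3, 4], w)       # neighbors are similar: positive
alternating = morans_i([1, 4, 1, 4], w)  # neighbors are dissimilar: negative
print(smooth, alternating)
```

Values well above zero indicate that nearby locations tend to be alike, values well below zero that they tend to differ, and values near zero are what independent observations would typically produce.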

Unless a space is uniform and boundless, every location will have some degree of uniqueness relative to the other locations. This affects the spatial dependency relations and therefore the spatial process. Spatial heterogeneity means that overall parameters estimated for the entire system may not adequately describe the

Biplot

Biplots are a type of exploratory graph used in statistics, a generalization of the simple two-variable scatterplot. A biplot allows information on both samples and variables of a data matrix to be displayed graphically. Samples are displayed as points while variables are displayed either as vectors, linear axes or nonlinear trajectories. In the case of categorical variables, category level points may be used to represent the levels of a categorical variable. A generalised biplot displays information on both continuous and categorical variables. The biplot was introduced by K. Ruben Gabriel. Gower and Hand wrote a monograph on biplots. Yan and Kang described various methods which can be used in order to visualize and interpret a biplot. The book by Greenacre is a practical user-oriented guide to biplots, along with scripts in the open-source R programming language, to generate biplots associated with principal component analysis, multidimensional scaling, log-ratio analysis (also known as spectral mapping), discriminant analysis and various forms of correspondence analysis: simple correspondence analysis, multiple correspondence analysis and canonical correspondence analysis.

The book by Gower, Lubbe and le Roux aims to popularize biplots as a useful and reliable method for the visualization of multivariate data when researchers want to consider, for example, principal component analysis, canonical variates analysis or various types of correspondence analysis. A biplot is constructed by using the singular value decomposition (SVD) to obtain a low-rank approximation to a transformed version of the data matrix X, whose n rows are the samples and whose p columns are the variables. The transformed data matrix Y is obtained from the original matrix X by centering and optionally standardizing the columns. Using the SVD, we can write $Y = \sum_{k=1}^{p} d_k u_k v_k^T$, where the $u_k$ are n-dimensional column vectors, the $v_k$ are p-dimensional column vectors, and the $d_k$ are a non-increasing sequence of non-negative scalars. The biplot is formed from two scatterplots that share a common set of axes and have a between-set scalar product interpretation. The first scatterplot is formed from the points $(d_1^{\alpha} u_{1i}, d_2^{\alpha} u_{2i})$, for i = 1...n. The second plot is formed from the points $(d_1^{1-\alpha} v_{1j}, d_2^{1-\alpha} v_{2j})$, for j = 1...p. This is the biplot formed by the dominant two terms of the SVD, which can be represented in a two-dimensional display.

Typical choices of α are 1 and 0, and in some rare cases α = 1/2 to obtain a symmetrically scaled biplot. The set of points depicting the variables can be drawn as arrows from the origin to reinforce the idea that they represent biplot axes onto which the samples can be projected to approximate the original data.

Greenacre, M. Biplots in Practice. BBVA Foundation, Spain. ISBN 978-84-923846-8-6. Available for free download, with materials.
Gabriel, K. R. "The biplot graphic display of matrices with application to principal component analysis". Biometrika. 58: 453–467. doi:10.1093/biomet/58.3.453.
Gower, J. C., Lubbe, S. and le Roux, N. Understanding Biplots. Wiley. ISBN 978-0-470-01255-0.
Gower, J. C. and Hand, D. J. Biplots. Chapman & Hall, London, UK. ISBN 0-412-71630-5.
Yan, W. and Kang, M. S. GGE Biplot Analysis. CRC Press, Boca Raton, Florida. ISBN 0-8493-1338-4.
Demey, J. R., Vicente-Villardón, J. L., Galindo-Villardón, M. P. and Zambrano, A. Y. "Identifying molecular markers associated with classification of genotypes by External Logistic Biplots". Bioinformatics. 24: 2832–2838.