1.
Biological neural network
–
In neuroscience, a biological neural network is a series of interconnected neurons whose activation defines a recognizable linear pathway. The interface through which neurons interact with their neighbors usually consists of axon terminals connected via synapses to dendrites on other neurons. If the sum of the input signals into one neuron surpasses a certain threshold, the neuron fires an action potential at the axon hillock and transmits this electrical signal along the axon.
Biological neural networks have inspired the design of artificial neural networks. The first rule of neuronal learning was described by Hebb in 1949; earlier, in 1943, the neuroscientists Warren Sturgis McCulloch and Walter Pitts had published the first works on the processing of neural networks. They showed theoretically that networks of artificial neurons could implement logical, arithmetic, and symbolic functions. Simplified models of biological neurons were set up, now usually called perceptrons or artificial neurons. These simple models accounted for neural summation; later models also provided for excitatory and inhibitory synaptic transmission.
The connections between neurons in the brain are much more complex than those implemented in neural computing architectures. The basic kinds of connections between neurons are chemical synapses and electrical gap junctions. One principle by which neurons work is neural summation, i.e. potentials at the postsynaptic membrane sum up in the cell body. If the depolarization of the neuron at the axon hillock goes above threshold, an action potential occurs that travels down the axon to the terminal endings to transmit a signal to other neurons. Excitatory and inhibitory synaptic transmission is realized mostly by excitatory postsynaptic potentials (EPSPs) and inhibitory postsynaptic potentials (IPSPs). On the electrophysiological level, there are various phenomena which alter the response characteristics of individual synapses and individual neurons. These are often divided into short-term plasticity and long-term plasticity; long-term synaptic plasticity is often contended to be the most likely memory substrate.
Usually the term neuroplasticity refers to changes in the brain that are caused by activity or experience. Connections display temporal and spatial characteristics. Temporal characteristics refer to the continuously modified activity-dependent efficacy of synaptic transmission; it has been observed in several studies that the synaptic efficacy of this transmission can undergo short-term increase or decrease according to the activity of the presynaptic neuron. LTP is induced by a series of action potentials which cause a variety of biochemical responses; eventually, the reactions cause the expression of new receptors on the cellular membranes of the postsynaptic neurons or increase the efficacy of the existing receptors through phosphorylation. In some cells, neural backpropagation does occur through the dendritic arbor and may have important effects on synaptic plasticity. A single signal from a motor neuron at a neuromuscular junction is enough to stimulate contraction of the postsynaptic muscle cell; in the spinal cord, however, at least 75 afferent neurons are required to produce firing. This picture is further complicated by variation in the time constant between neurons, as some cells can integrate their EPSPs over a wider period of time than others.
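The summation-and-threshold behaviour described above is the abstraction behind McCulloch–Pitts-style artificial neurons. A minimal sketch, with an illustrative function name and PSP values that are not taken from any particular biophysical model:

```python
def neuron_fires(psp_values, threshold=1.0):
    """Sum postsynaptic potentials; 'fire' if depolarization crosses threshold.

    Excitatory inputs contribute positive values (EPSPs),
    inhibitory inputs contribute negative values (IPSPs).
    """
    return sum(psp_values) >= threshold

# Two excitatory inputs together cross the threshold,
# but adding an inhibitory input suppresses firing.
fires = neuron_fires([0.6, 0.5])
suppressed = neuron_fires([0.6, 0.5, -0.4])
```

This is, of course, only the caricature that perceptrons formalize; real neurons integrate potentials over time and space, as the surrounding text notes.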
Biological neural network
–
From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal. The figure illustrates the diversity of neuronal morphologies in the auditory cortex.
2.
Feature engineering
–
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning. Feature engineering is an informal topic, but it is considered essential in applied machine learning. Coming up with features is difficult, time-consuming, and requires expert knowledge; applied machine learning is basically feature engineering. When working on a machine learning problem, feature engineering is manually designing what the input x's should be. A feature is a piece of information that might be useful for prediction; any attribute could be a feature, as long as it is useful to the model. Beyond being an attribute, a feature is a characteristic that might help when solving the problem. The features in your data are important to the models you use. The quality and quantity of the features will have great influence on whether the model is good or not. You could say the better the features are, the better the result is, but this isn't entirely true, because the results achieved also depend on the model and the data, not just the chosen features. That said, choosing the right features is very important. Better features can produce simpler and more flexible models, and they often yield better results. As one Kaggle competition winner put it: "The algorithms we used are standard for Kagglers. We spent most of our efforts in feature engineering and we were also very careful to discard features likely to expose us to the risk of over-fitting our model." And in the words of Pedro Domingos: some machine learning projects succeed and some fail, and easily the most important factor is the features used. Depending on the problem, a feature could be strongly relevant, relevant, weakly relevant, or irrelevant. It is important to create a lot of features, even if some of them turn out to be irrelevant; you can't afford to miss the rest.
Afterwards, feature selection can be used in order to prevent overfitting. Feature explosion can be caused by feature combination or feature templates, both leading to a quick growth in the total number of features.
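A small sketch of the kind of transformation feature engineering involves. The cyclical hour-of-day encoding below is an illustrative example chosen for this sketch, not one taken from the text: a raw hour value puts 23:00 and 00:00 far apart numerically, while an engineered sine/cosine pair keeps them close:

```python
import math

def engineer_hour(hour):
    """Encode an hour (0-23) as a point on the unit circle, so that times
    just before and just after midnight become nearby feature vectors."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))

f23, f0 = engineer_hour(23), engineer_hour(0)
# Distance in the engineered space is small for adjacent hours across midnight,
# whereas the raw values 23 and 0 differ by 23.
engineered_distance = math.dist(f23, f0)
raw_distance = abs(23 - 0)
```

Whether such a feature helps is exactly the model- and data-dependent question the paragraph above raises; feature selection would then keep or discard it empirically.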
3.
Feature learning
–
This obviates manual feature engineering, which is otherwise necessary, and allows a machine both to learn the features and to use them to perform a specific task. Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensor measurements are usually complex, redundant, and highly variable; thus, it is necessary to discover useful features or representations from raw data. Traditional hand-crafted features often require expensive human labor and often rely on expert knowledge; also, they normally do not generalize well. This motivates the design of efficient feature learning techniques to automate and generalize the process. Feature learning can be divided into two categories, supervised and unsupervised feature learning, analogous to these categories in machine learning generally. In supervised feature learning, features are learned with labeled input data; examples include supervised neural networks, multilayer perceptrons, and supervised dictionary learning. In unsupervised feature learning, features are learned with unlabeled input data; examples include dictionary learning, independent component analysis, autoencoders, matrix factorization, and various forms of clustering. Supervised feature learning is learning features from labeled data; several approaches are introduced in the following. Dictionary learning is to learn a set of representative elements from the input data such that each data point can be represented as a weighted sum of those elements. The dictionary elements and the weights may be found by minimizing the average representation error; supervised dictionary learning exploits both the structure underlying the input data and the labels for optimizing the dictionary elements. For example, a supervised dictionary learning technique was proposed by Mairal et al. in 2009.
Neural networks are used to illustrate a family of learning algorithms via a network consisting of multiple layers of inter-connected nodes. The design is inspired by the nervous system, where the nodes are viewed as neurons. Each edge has a weight, and the network defines computational rules for passing input data from the input layer to the output layer. A network function associated with a neural network characterizes the relationship between the input and output layers and is parameterized by the weights; with appropriately defined network functions, various learning tasks can be performed by minimizing a cost function over the network function. Unsupervised feature learning is learning features from unlabeled data. The goal of unsupervised feature learning is often to discover low-dimensional features that capture some structure underlying the high-dimensional input data. Several approaches are introduced in the following; k-means clustering, for example, is an approach for vector quantization.
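One concrete way clustering yields learned features is vector quantization: once centroids have been learned (here they are simply assumed given, as if produced by k-means), each point can be re-represented by a one-hot code saying which centroid is nearest. A minimal sketch:

```python
def quantize(point, centroids):
    """One-hot 'nearest centroid' code: a k-dimensional learned feature
    replacing the raw coordinates of the point."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    code = [0] * len(centroids)
    code[dists.index(min(dists))] = 1
    return code

# Hypothetical centroids, standing in for ones learned from unlabeled data.
centroids = [(0.0, 0.0), (5.0, 5.0)]
code_a = quantize((0.5, -0.2), centroids)  # near the first centroid
code_b = quantize((4.0, 6.0), centroids)   # near the second centroid
```

Softer encodings (distances or kernel responses to each centroid) are common variants of the same idea.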
4.
Online machine learning
–
Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, requiring out-of-core algorithms. Two general modelling strategies exist for online learning models: statistical learning models and adversarial models. In statistical learning models, the data samples are assumed to be independent and identically distributed random variables, and the algorithm just has limited access to the data. In adversarial models, the problem is viewed as a game between two players, and the goal is to minimize losses regardless of the move played by the other player. In this model, the opponent is allowed to adapt the data generated based on the output of the learning algorithm. Spam filtering falls in this category, as the adversary will dynamically generate new spam based on the current behavior of the spam detector. Examples of algorithms in this model include follow the leader and follow the regularized leader. In reality, the learner never knows the true distribution p over instances; instead, the learner usually has access to a set of examples. In this setting, the loss function is given as V : Y × Y → R, and the ideal goal is to select a function f ∈ H that minimizes some notion of total loss. Depending on the type of model, one can devise different notions of loss, which lead to different learning algorithms. In statistical learning models, the samples are assumed to have been drawn i.i.d. from the true distribution p. A common paradigm in this situation is to estimate a function f̂ through empirical risk minimization or regularized empirical risk minimization; the choice of loss function here gives rise to several well-known learning algorithms such as regularized least squares and support vector machines. For the case of online learning, the data is still assumed to be i.i.d., but without access to all the data at once.
A purely online model in this category would learn based on just the new input. Mini-batch techniques are used with repeated passing over the training data to obtain optimized out-of-core versions of machine learning algorithms, e.g. stochastic gradient descent. When combined with backpropagation, this is currently the de facto training method for training artificial neural networks. The simple example of linear least squares is used to explain a variety of ideas in online learning; the ideas are general enough to be applied to other settings, e.g. with other convex loss functions. Let X be the i × d data matrix and Y the i × 1 matrix of target values after the arrival of the first i data points. With n total points in the dataset, recomputing the solution after the arrival of every datapoint i = 1, …, n is expensive; the recursive least squares (RLS) algorithm instead takes an online approach to the least squares problem. The proof also shows that Γ_i = Σ_i⁻¹; one can also look at RLS in the context of adaptive filters.
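A purely online least-squares learner can be sketched in a few lines with stochastic gradient descent: each (x, y) pair is consumed once and immediately discarded, so the dataset never needs to fit in memory. The learning rate and the synthetic y = 2x stream are illustrative choices for this sketch:

```python
def sgd_online(stream, lr=0.1):
    """One pass over a data stream: update the weight after each (x, y) pair,
    never storing the data (a purely online learner)."""
    w = 0.0
    for x, y in stream:
        grad = 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)**2
        w -= lr * grad
    return w

# Stream generated by y = 2x, with inputs cycling through 0.1 .. 1.0;
# the weight approaches 2 without the dataset ever being held in memory.
stream = (((i % 10 + 1) / 10, 2 * (i % 10 + 1) / 10) for i in range(200))
w = sgd_online(stream)
```

Mini-batch variants simply average the gradient over a few pairs before each update; RLS achieves a similar effect with per-step matrix updates instead of a fixed learning rate.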
5.
Bootstrap aggregating
–
It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Bagging was proposed by Leo Breiman in 1994 to improve classification by combining classifications of randomly generated training sets. Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n′ = n, then for large n the set D_i is expected to contain the fraction (1 − 1/e ≈ 63.2%) of the unique examples of D, the rest being duplicates. This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification). Bagging leads to improvements for unstable procedures, which include, for example, artificial neural networks and classification and regression trees; an interesting application of bagging showing improvement in preimage learning has also been described. On the other hand, it can degrade the performance of stable methods such as k-nearest neighbors. To illustrate the basic principles of bagging, below is an analysis of the relationship between ozone and temperature. The relationship between temperature and ozone in this data set is apparently non-linear, based on the scatter plot. To mathematically describe this relationship, LOESS smoothers are used. Instead of building a single smoother from the complete data set, 100 bootstrap samples of the data were drawn. Each sample is different from the original data set, yet resembles it in distribution. For each bootstrap sample, a LOESS smoother was fit; predictions from these 100 smoothers were then made across the range of the data. The first 10 predicted smooth fits appear as lines in the figure below.
The lines are clearly very wiggly and they overfit the data, a result of the span being too low. By taking the average of 100 smoothers, each fitted to a subset of the original data set, we arrive at one bagged predictor. Clearly, the mean is more stable and there is less overfit.
See also: Boosting, Bootstrapping, Cross-validation, Random forest, Random subspace method.
References: Breiman, Leo. Alfaro, E., Gámez, M. and García, N., "adabag: An R package for classification with AdaBoost.M1, AdaBoost-SAMME and Bagging".
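The bootstrap-and-average recipe itself is short. The sketch below uses a deliberately unstable, hypothetical base learner (a one-nearest-neighbour regressor) in place of the article's LOESS smoothers, since the principle (resample with replacement, fit, average) is the same:

```python
import random

def bootstrap_sample(data, rng):
    """Draw n points from data uniformly *with replacement*."""
    return [rng.choice(data) for _ in data]

def bagged_predict(x, data, base_fit, m=100, seed=0):
    """Fit the base learner on m bootstrap samples; average their predictions."""
    rng = random.Random(seed)
    models = [base_fit(bootstrap_sample(data, rng)) for _ in range(m)]
    return sum(model(x) for model in models) / m

def fit_1nn(sample):
    """Unstable base learner: predict the y of the nearest x in the sample."""
    def predict(x):
        return min(sample, key=lambda p: abs(p[0] - x))[1]
    return predict

# Toy data on the line y = 2x; the bagged prediction at x = 4 stays near 8
# even though individual bootstrap models jump around.
data = [(float(i), 2.0 * i) for i in range(20)]
pred = bagged_predict(4.0, data, fit_1nn)
```

Each bootstrap sample omits roughly 1/e of the points, so individual models disagree; averaging smooths that variance away, which is the whole point of bagging.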
6.
Logistic regression
–
In statistics, logistic regression, or logit regression, or logit model is a regression model where the dependent variable is categorical. This article covers the case of a binary dependent variable. Cases where the dependent variable has more than two outcome categories may be analysed in multinomial logistic regression, or, if the multiple categories are ordered, in ordinal logistic regression. In the terminology of economics, logistic regression is an example of a qualitative response/discrete choice model. Logistic regression was developed by statistician David Cox in 1958. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor variables. It allows one to say that the presence of a risk factor increases the probability of a given outcome by a specific percentage. Logistic regression is used in various fields, including machine learning and most medical fields. For example, the Trauma and Injury Severity Score, which is used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess severity of a patient have been developed using logistic regression. Logistic regression may be used to predict whether a patient has a given disease, based on observed characteristics of the patient. Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, and it is also used in marketing applications such as prediction of a customer's propensity to purchase a product or halt a subscription. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
Suppose we wish to answer the following question: a group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that a student will pass the exam? The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, represented by 1 and 0, are not cardinal numbers. (If the problem were changed so that pass/fail was replaced with the grade 0–100, then simple regression analysis could be used.) The table shows the number of hours each student spent studying, and whether they passed or failed. The graph shows the probability of passing the exam versus the number of hours studying. The logistic regression analysis gives the following output. The output indicates that hours studying is significantly associated with the probability of passing the exam; the analysis gives a p-value of p = 0.0167, which is based on the Wald z-score. Rather than the Wald method, the recommended method to calculate the p-value for logistic regression is the likelihood ratio test. Logistic regression can be binomial, ordinal or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types, 0 and 1.
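The study-hours model can be fitted by gradient descent on the log-loss. The data below is illustrative, in the spirit of the example but not the article's actual 20-student table, and the learning rate and epoch count are arbitrary choices for this sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=3000):
    """Stochastic gradient descent on the log-loss for the model
    P(pass) = sigmoid(b0 + b1 * hours)."""
    b0 = b1 = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of log-loss w.r.t. logit
            b0 -= lr * err
            b1 -= lr * err * x
    return b0, b1

# Hypothetical data: more hours studied -> higher chance of passing.
hours  = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0,   0,   0,   0,   1,   0,   1,   1,   1,   1]
b0, b1 = fit_logistic(hours, passed)
```

The fitted slope b1 comes out positive, so each extra hour multiplies the odds of passing by exp(b1), which is the standard reading of a logistic coefficient.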
Logistic regression
–
Graph of a logistic regression curve showing probability of passing an exam versus hours studying
7.
Perceptron
–
In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. The perceptron algorithm dates back to the late 1950s; its first implementation was in custom hardware. The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, funded by the United States Office of Naval Research. This machine was designed for image recognition: it had an array of 400 photocells, weights were encoded in potentiometers, and weight updates during learning were performed by electric motors. Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. It is often believed that Minsky and Papert conjectured that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function; three years later Stephen Grossberg published a series of papers introducing networks capable of modelling differential, contrast-enhancing and XOR functions. Nevertheless, the often-miscited Minsky/Papert text caused a significant decline in interest, and it took ten more years until neural network research experienced a resurgence in the 1980s. This text was reprinted in 1987 as "Perceptrons - Expanded Edition", where some errors in the original text are shown and corrected. The kernel perceptron algorithm was introduced as early as 1964 by Aizerman et al. The bias shifts the decision boundary away from the origin and does not depend on any input value; the value of f is used to classify x as either a positive or a negative instance, in the case of a binary classification problem.
If b is negative, then the weighted combination of inputs must produce a positive value greater than |b| in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position of the decision boundary. The perceptron learning algorithm does not terminate if the learning set is not linearly separable: if the vectors are not linearly separable, learning will never reach a point where all vectors are classified properly. The most famous example of the perceptron's inability to solve problems with linearly nonseparable vectors is the Boolean exclusive-or problem. The solution spaces of decision boundaries for all functions and learning behaviors are studied in the reference. In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron.
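Rosenblatt's learning rule is simple enough to show in full. The sketch below trains on Boolean AND, which is linearly separable, so the rule converges; the bias is carried as the first weight:

```python
def train_perceptron(samples, epochs=20):
    """Perceptron rule: for each sample, add (target - prediction) * input
    to the weights. The bias is w[0], with an implicit constant input of 1."""
    w = [0.0, 0.0, 0.0]  # [bias, w1, w2]
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0  # Heaviside step
            err = target - pred
            w[0] += err
            w[1] += err * x1
            w[2] += err * x2
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(AND)

def classify(x1, x2):
    return 1 if w[0] + w[1] * x1 + w[2] * x2 > 0 else 0
```

Replacing AND with XOR in this sketch would make the loop cycle forever without settling, which is exactly the non-termination on linearly nonseparable data described above.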
Perceptron
–
The Mark I Perceptron machine was the first implementation of the perceptron algorithm. The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image. The main visible feature is a patchboard that allowed experimentation with different combinations of input features. To the right of that are arrays of potentiometers that implemented the adaptive weights.
8.
BIRCH
–
A birch is a thin-leaved deciduous hardwood tree of the genus Betula, in the family Betulaceae, which also includes alders, hazels, and hornbeams. It is closely related to the beech-oak family Fagaceae. The genus Betula contains 30 to 60 known taxa, of which 11 are on the IUCN 2011 Red List of Threatened Species. They are typically rather short-lived pioneer species widespread in the Northern Hemisphere, particularly in northern temperate and boreal climates. Birch species are generally small to medium-sized trees or shrubs, mostly of northern temperate and boreal climates. The simple leaves are alternate, singly or doubly serrate, and feather-veined; they often appear in pairs, but these pairs are really borne on spur-like, two-leaved, lateral branchlets. The fruit is a samara, although the wings may be obscure in some species. They differ from the alders in that the female catkins are not woody and disintegrate at maturity, falling apart to release the seeds, unlike the woody, cone-like female alder catkins. The bark of all birches is characteristically marked with long, horizontal lenticels, and its decided color gives the common names gray, white, black, silver and yellow birch to different species. The buds form early and are full grown by midsummer; all are lateral, and no terminal bud is formed. The wood of all the species is close-grained with a satiny texture. Staminate aments are pendulous, clustered or solitary in the axils of the last leaves of the branch of the year or near the ends of the short lateral branchlets of the year. They form in autumn and remain rigid during the winter. The scales of the staminate aments when mature are broadly ovate, rounded, and yellow or orange in color below the middle; each scale bears two bractlets and three sterile flowers, each flower consisting of a sessile, membranaceous, usually two-lobed calyx.
Each calyx bears four short filaments with one-celled anthers. The pistillate aments are erect or pendulous, solitary, and terminal on the two-leaved lateral spur-like branchlets of the year. The pistillate scales are oblong-ovate, three-lobed, and pale yellow-green often tinged with red; these scales bear two or three fertile flowers, each flower consisting of a naked ovary. The ovary is compressed, two-celled, and crowned with two styles; the ovule is solitary. Each scale bears a small, winged nut that is oval. Betula species are organised into five subgenera. B. pendula and B. pubescens are often confused, though they are distinct species with different chromosome numbers. The name derives from a root *bʰreh₁ǵ- 'to shine', in reference to the birch's white bark. The Proto-Germanic rune berkanan is named after the birch. The generic name betula is from Latin, and is a diminutive borrowed from Gaulish betua.
BIRCH
–
Birch
BIRCH
–
The front and rear sides of a piece of birch bark
BIRCH
–
Birch trees near a stream in Hankasalmi, Finland
BIRCH
–
A stand of birch trees
9.
Hierarchical clustering
–
In data mining and statistics, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types. Agglomerative: a bottom-up approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive: a top-down approach in which all observations start in one cluster, and splits are performed recursively moving down the hierarchy. In general, the merges and splits are determined in a greedy manner, and the results of hierarchical clustering are usually presented in a dendrogram. In the general case, the complexity of agglomerative clustering is O(n³), and divisive clustering with an exhaustive search is O(2ⁿ), which is even worse. However, for some special cases, optimal efficient agglomerative methods are known: SLINK for single-linkage clustering, for example. In order to decide which clusters should be combined, or where a cluster should be split, a measure of dissimilarity between sets of observations is required; this is usually achieved by a metric on pairs of observations together with a linkage criterion. The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Commonly used metrics for hierarchical clustering include the Euclidean and Manhattan distances; for text or other non-numeric data, metrics such as the Hamming or Levenshtein distance are often used. The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Commonly used linkage criteria between two sets of observations A and B include the maximum (complete), minimum (single), and mean (average) pairwise distances, where d is the chosen metric. Other linkage criteria include: the sum of all intra-cluster variance; the decrease in variance for the cluster being merged; the probability that candidate clusters spawn from the same distribution function; the product of in-degree and out-degree on a k-nearest-neighbour graph; and the increment of some cluster descriptor after merging two clusters. Hierarchical clustering has the advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric. Cutting the tree at a given height will give a partitioning clustering at a selected precision.
In this example, cutting after one row of the dendrogram will yield one set of clusters; cutting after a later row will yield fewer clusters, which is a coarser clustering. The agglomerative method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements, and the first step is to determine which elements to merge into a cluster. Usually, we want to take the two closest elements, according to the chosen distance. Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements.
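The merge loop just described can be sketched directly. This toy version works on points on a line and uses single linkage (the minimum pairwise distance between clusters) as its criterion; stopping when k clusters remain corresponds to cutting the dendrogram at some height:

```python
def agglomerate(points, k):
    """Greedy agglomerative clustering with single linkage:
    repeatedly merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return [sorted(c) for c in clusters]

# Two well-separated groups on the line end up as two clusters.
result = agglomerate([1.0, 1.2, 1.1, 9.0, 9.3], k=2)
```

Recording the distance at each merge, rather than stopping at k, would produce the full dendrogram; swapping the inner `min` for `max` or a mean gives complete or average linkage.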
10.
K-means clustering
–
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean; this results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via the iterative refinement approach employed by both algorithms. The algorithm has a loose relationship to the k-nearest neighbor classifier: one can apply the 1-nearest neighbor classifier on the cluster centers obtained by k-means to classify new data into the existing clusters. This is known as the nearest centroid classifier or Rocchio algorithm. The term "k-means" was first used by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1957. The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation; in 1965, E. W. Forgy published essentially the same method, which is why it is sometimes referred to as Lloyd-Forgy. The most common algorithm uses an iterative refinement technique; due to its ubiquity it is often called simply the k-means algorithm, and it is also referred to as Lloyd's algorithm. Assignment step: assign each observation to the cluster with the nearest mean. Since the sum of squares is the squared Euclidean distance, this is intuitively the nearest mean: S_i = { x_p : ‖x_p − m_i‖² ≤ ‖x_p − m_j‖² for all j, 1 ≤ j ≤ k }, where each x_p is assigned to exactly one S_i, even if it could be assigned to two or more of them. Update step: calculate the new means to be the centroids of the observations in the new clusters: m_i = (1 / |S_i|) ∑_{x_j ∈ S_i} x_j. Since the arithmetic mean is a least-squares estimator, this step also minimizes the within-cluster sum of squares (WCSS). The algorithm has converged when the assignments no longer change. Since both steps optimize the WCSS objective, and there exists only a finite number of such partitionings, the algorithm must converge. There is no guarantee that the global optimum is found using this algorithm.
The algorithm is often presented as assigning objects to the nearest cluster by distance. The standard algorithm aims at minimizing the WCSS objective, and thus assigns by least sum of squares, which is exactly equivalent to assigning by the smallest Euclidean distance. Using a different distance function other than the (squared) Euclidean distance may stop the algorithm from converging; various modifications of k-means such as spherical k-means and k-medoids have been proposed to allow using other distance measures. Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means.
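The assignment and update steps translate almost line for line into code. A one-dimensional sketch of Lloyd's algorithm, with initial means chosen by hand rather than by Forgy or Random Partition:

```python
def kmeans(points, means, iters=10):
    """Lloyd's algorithm in one dimension: alternate the assignment
    and update steps for a fixed number of iterations."""
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest mean
        # (least squared distance, per the WCSS objective).
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: (p - means[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: each mean becomes the centroid of its cluster
        # (kept unchanged if the cluster is empty).
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means

points = [1.0, 1.1, 0.9, 8.0, 8.2, 7.8]
final_means = kmeans(points, means=[0.0, 5.0])
```

With these two well-separated groups the means settle near 1.0 and 8.0 after the first iteration; in general, different initial means can lead to different local optima, which is why initialization matters.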
11.
DBSCAN
–
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature. In 2014, the algorithm was awarded the Test of Time Award at the leading data mining conference, KDD. Consider a set of points in some space to be clustered. A point p is a core point if at least a minimum number of points (minPts) lie within distance ε of it; those points are said to be directly reachable from p. By definition, no points are directly reachable from a non-core point. A point q is reachable from p if there is a path p1, …, pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. All points not reachable from any other point are outliers. Now if p is a core point, then it forms a cluster together with all points that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but only at its edge. Reachability is not a symmetric relation since, by definition, no point may be reachable from a non-core point; therefore a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN. Two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. A cluster then satisfies two properties: all points within the cluster are mutually density-connected, and if a point is density-reachable from any point of the cluster, it is part of the cluster as well. DBSCAN requires two parameters: ε and the minimum number of points required to form a dense region (minPts). It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-neighborhood of a different point and hence be made part of a cluster. If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster.
Hence, all points that are found within the ε-neighborhood are added, and this process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise. These simplifications have been omitted here in order to reflect the originally published version. Additionally, the regionQuery function need not return P in the list of points to be visited, as long as it is otherwise still counted in the local density estimate.
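The core-point, expansion, and noise rules above fit in a short routine. A one-dimensional sketch (the neighbourhood count includes the point itself, one common convention):

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:      # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster           # start a new cluster from this core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # noise reachable from a core -> border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:  # j is itself a core point: keep expanding
                queue.extend(more)
        cluster += 1
    return labels

points = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0]
labels = dbscan(points, eps=0.3, min_pts=2)
```

The isolated point at 20.0 has no ε-neighbours besides itself and stays labeled -1, illustrating how DBSCAN leaves outliers out of every cluster.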
12.
Mean-shift
–
Mean shift is a non-parametric feature-space analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing. The mean shift procedure was originally presented in 1975 by Fukunaga and Hostetler. Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function, and it is useful for detecting the modes of this density. This is an iterative method, and we start with an initial estimate x. Let a kernel function K be given; this function determines the weight of nearby points for re-estimation of the mean. Typically a Gaussian kernel on the distance to the current estimate is used. The difference m − x is called the mean shift in Fukunaga and Hostetler. The mean-shift algorithm now sets x ← m, and repeats the estimation until m converges. Although the mean shift algorithm has been widely used in many applications, a rigorous proof for the convergence of the algorithm using a general kernel in a high-dimensional space is still missing. Aliyari Ghassabeh showed the convergence of the mean shift algorithm in one dimension with a differentiable, convex kernel; however, the one-dimensional case has limited real-world applications. Also, the convergence of the algorithm in higher dimensions with a finite number of stationary points has been proved; however, sufficient conditions for a kernel function to have finitely many stationary points have not been provided. Let data be a finite set S embedded in the n-dimensional Euclidean space X. Let K be a flat kernel that is the characteristic function of the λ-ball in X. In each iteration of the algorithm, s ← m is performed for all s ∈ S simultaneously. The first question, then, is how to estimate the density function given a set of samples. One of the simplest approaches is to just smooth the data, e.g. by convolving it with a kernel of width h.
Here h is a parameter of the algorithm and is called the bandwidth. This approach is known as kernel density estimation or the Parzen window technique. Once we have computed the density estimate f from the equation above, we can find its local maxima using gradient ascent or some other optimization technique. The problem with this brute-force approach is that, for higher dimensions, it becomes computationally prohibitive to evaluate f over the complete search space. Instead, mean shift uses a variant of what is known in the literature as multiple restart gradient descent. Kernel definition: let X be the n-dimensional Euclidean space R^n, and denote the i-th component of x by x_i.
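As a concrete illustration, the iteration x ← m(x) can be sketched in a few lines of Python using a Gaussian kernel on one-dimensional data; the bandwidth, sample sizes, and starting points below are arbitrary choices for the demonstration, not part of the original presentation.

```python
import numpy as np

def mean_shift(x, data, bandwidth=1.0, tol=1e-6, max_iter=500):
    """Iterate x <- m(x), the kernel-weighted mean of the samples,
    until the mean shift m(x) - x becomes negligible (mode seeking)."""
    for _ in range(max_iter):
        w = np.exp(-0.5 * ((data - x) / bandwidth) ** 2)  # Gaussian kernel weights
        m = np.sum(w * data) / np.sum(w)                  # weighted mean m(x)
        if abs(m - x) < tol:
            break
        x = m
    return x

rng = np.random.default_rng(0)
# bimodal 1-D sample with modes near 0 and 10
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(10, 1, 300)])
mode = mean_shift(9.0, data, bandwidth=1.0)   # should climb to the mode near 10
```

Starting the iteration at different points recovers different modes, which is exactly the multiple-restart behaviour the text describes.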
13.
Independent component analysis
–
In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and that they are statistically independent of each other. ICA is a special case of blind source separation. A common example application is the cocktail party problem of listening in on one person's speech in a noisy room. Independent component analysis attempts to decompose a multivariate signal into independent non-Gaussian signals. As an example, sound is usually a signal that is composed of the numerical addition, at each time t, of signals from several sources. The question then is whether it is possible to separate these contributing sources from the observed total signal. When the statistical independence assumption is correct, blind ICA separation of a mixed signal gives very good results; it is also used, for analysis purposes, on signals that are not supposed to be generated by a mixing process. A simple application of ICA is the cocktail party problem, where the underlying speech signals are separated from sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. An important note to consider is that if N sources are present, at least N observations (e.g. microphones) are needed to recover the original signals; other, underdetermined and overdetermined, cases have been investigated. That the ICA separation of mixed signals gives very good results is based on two assumptions and three effects of mixing source signals. Two assumptions: the source signals are independent of each other, and the values in each source signal have non-Gaussian distributions. Three effects of mixing source signals: Independence: as per assumption 1, the source signals are independent; however, their signal mixtures are not. This is because the signal mixtures share the same source signals. Normality: according to the central limit theorem, the distribution of a sum of independent random variables with finite variance tends towards a Gaussian distribution.
Loosely speaking, a sum of two independent random variables usually has a distribution that is closer to Gaussian than either of the two original variables; here we consider the value of each signal as the random variable. Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal. Those principles contribute to the basic establishment of ICA. ICA finds the independent components by maximizing the statistical independence of the estimated components; we may choose one of many ways to define a proxy for independence, and this choice governs the form of the ICA algorithm. The non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis as such a proxy. Whitening and dimension reduction can be achieved with principal component analysis or singular value decomposition; whitening ensures that all dimensions are treated equally a priori before the algorithm is run.
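The normality effect can be checked numerically: a linear mixture of two independent uniform sources has excess kurtosis closer to zero (i.e., closer to Gaussian) than either source on its own. The mixing coefficients and sample size below are arbitrary illustrative choices.

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis: 0 for a Gaussian, about -1.2 for a uniform variable."""
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

rng = np.random.default_rng(1)
s1 = rng.uniform(-1, 1, 100_000)   # non-Gaussian source
s2 = rng.uniform(-1, 1, 100_000)   # independent non-Gaussian source
mix = 0.6 * s1 + 0.4 * s2          # their linear mixture is closer to Gaussian
```

This is why kurtosis-based ICA works: un-mixing is equivalent to finding the projections that are maximally non-Gaussian.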
Independent component analysis in EEGLAB
14.
Non-negative matrix factorization
–
Non-negative matrix factorization (NMF) approximately factorizes a matrix V into two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. NMF finds applications in such fields as computer vision, document clustering, chemometrics, audio signal processing and recommender systems. In chemometrics, non-negative matrix factorization has a history under the name self modeling curve resolution; in this framework the vectors in the matrix are continuous curves rather than discrete vectors. Early work on non-negative matrix factorizations was also performed by a Finnish group of researchers in the middle of the 1990s under the name positive matrix factorization. Each column of V can be computed as v_i = W h_i, where v_i is the i-th column of V and h_i the i-th column of H. When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix, and it is this property that forms the basis of NMF: NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix and W is an m × p matrix, then H is a p × n matrix, with p typically much smaller than both m and n. Here's an example based on a text-mining application: let the input matrix be V with 10000 rows and 500 columns, where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words, and it follows that a column vector v in V represents a document. Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns. The product of W and H is then a matrix with 10000 rows and 500 columns, the same shape as the input matrix V. This last point is the basis of NMF, because we can consider each original document in our example as being built from a small set of hidden features.
A column in the coefficients matrix H represents an original document, with a cell value defining the document's rank for a feature. This follows because each row in H represents a feature; it is this property that drives most applications of NMF. More specifically, the approximation of V by V ≈ W H is achieved by minimizing the error function min_{W,H} ||V − WH||_F subject to W ≥ 0, H ≥ 0. If we add an additional orthogonality constraint on H, i.e. H Hᵀ = I, then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed H gives the cluster indicator, i.e. if H_kj > 0, the j-th input datum belongs to the k-th cluster, and the computed W gives the cluster centroids, i.e. the k-th column gives the centroid of the k-th cluster. This centroid representation can be enhanced by convex NMF.
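A minimal sketch of the minimization using the classic Lee-Seung multiplicative update rules for the Frobenius error, which keep W and H non-negative by construction. The matrix sizes, rank p, and iteration count are arbitrary illustrative choices, not a tuned implementation.

```python
import numpy as np

def nmf(V, p, n_iter=200, seed=0):
    """Approximate V ~ W H with W >= 0, H >= 0 by minimizing ||V - WH||_F
    via Lee-Seung multiplicative updates (an illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, p)) + 0.1
    H = rng.random((p, n)) + 0.1
    eps = 1e-9  # guards against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H

rng = np.random.default_rng(2)
V = rng.random((20, 30))            # non-negative input matrix
W, H = nmf(V, p=5)
err = np.linalg.norm(V - W @ H)     # residual Frobenius error
```

Because the updates are elementwise multiplications of non-negative quantities, the non-negativity constraints never need explicit projection.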
NMF as a probabilistic graphical model: visible units (V) are connected to hidden units (H) through weights W, so that V is generated from a probability distribution with mean WH.
15.
Principal component analysis
–
The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations. The resulting vectors form an orthogonal basis set. PCA is sensitive to the relative scaling of the original variables. PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics. PCA is mostly used as a tool in exploratory data analysis. It can be done by eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix, usually after mean centering the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores, and loadings. PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data. Dimensionality reduction is achieved by using only the first few principal components. PCA is closely related to factor analysis; factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis. PCA can be thought of as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. To find the axes of the ellipsoid, we must first subtract the mean of each variable from the dataset to center the data around the origin. Then we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must orthogonalize the set of eigenvectors and normalize each to become unit vectors; once this is done, each of the mutually orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data.
The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. This procedure is sensitive to the scaling of the data. A standard result for a symmetric matrix such as XᵀX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w found, the first principal component of a data vector x can then be given as a score t1 = x ⋅ w in the transformed co-ordinates, or as the corresponding vector in the original variables. Thus the loading vectors are eigenvectors of XᵀX. Eigenvectors corresponding to distinct eigenvalues of a symmetric matrix are orthogonal, or can be orthogonalised when eigenvalues coincide. The product in the final line is therefore zero: there is no sample covariance between different principal components over the dataset. Another way to characterise the principal components transformation is therefore as the transformation to coordinates which diagonalise the empirical sample covariance matrix. However, not all the principal components need to be kept.
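The procedure just described (center, covariance, eigendecomposition, sort, project) can be sketched directly in NumPy; the toy two-variable dataset below is a made-up example in which the second coordinate is a noisy copy of the first, so the first component should capture nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500)
X = np.column_stack([x, x + 0.1 * rng.normal(size=500)])  # highly correlated pair

Xc = X - X.mean(axis=0)                 # center each variable
C = np.cov(Xc, rowvar=False)            # empirical covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigendecomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Xc @ eigvecs                   # component scores t = x . w
explained = eigvals / eigvals.sum()     # proportion of variance per component
```

Note that the score matrix has a diagonal sample covariance, which is exactly the "diagonalising transformation" characterisation above.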
PCA of a multivariate Gaussian distribution centered at (1,3) with a standard deviation of 3 in roughly the (0.878, 0.478) direction and of 1 in the orthogonal direction. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.
16.
T-distributed stochastic neighbor embedding
–
T-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. The t-SNE algorithm comprises two main stages: first, it constructs a probability distribution over pairs of high-dimensional objects such that similar objects have a high probability of being picked while dissimilar points have an extremely small probability; second, it defines a similar probability distribution over the points in the low-dimensional map and minimizes the Kullback-Leibler divergence between the two distributions with respect to the locations of the map points. Note that whilst the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this should be changed as appropriate for the data. t-SNE has been used in a range of applications, including computer security research, music analysis, cancer research and bioinformatics. The bandwidth of the kernels is adapted to the density of the data. t-SNE aims to learn a d-dimensional map y_1, …, y_N that reflects the similarities p_ij as well as possible. To this end, it measures similarities q_ij between two points in the map, y_i and y_j, using a similar approach. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs well. Implementations include ELKI, which contains tSNE with Barnes-Hut approximation (https://github.com/elki-project/elki/blob/master/elki/src/main/java/de/lmu/ifi/dbs/elki/algorithm/projection/TSNE.java), and the reference implementation at http://lvdmaaten.github.io/tsne/; see also the paper Visualizing Data Using t-SNE and the Google Tech Talk about t-SNE.
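As a sketch of the map-space similarities, q_ij in t-SNE uses a Student-t (Cauchy) kernel on pairwise map distances, normalised over all pairs i ≠ j; the heavy tail of this kernel is what gives t-SNE its name. The three map points below are arbitrary illustrative values.

```python
import numpy as np

def tsne_q(Y):
    """Low-dimensional similarities q_ij = (1 + ||y_i - y_j||^2)^-1,
    normalised over all pairs i != j, as used in the t-SNE map."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # squared distances
    inv = 1.0 / (1.0 + d2)       # Student-t kernel with one degree of freedom
    np.fill_diagonal(inv, 0.0)   # q_ii is defined to be 0
    return inv / inv.sum()

Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # toy 2-D map points
Q = tsne_q(Y)
```

During optimization the map points y_i are moved so that these q_ij match the high-dimensional similarities p_ij.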
17.
Graphical model
–
A graphical model or probabilistic graphical model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics (particularly Bayesian statistics) and machine learning. Two branches of graphical representations of distributions are commonly used, namely Bayesian networks and Markov random fields. Both families encompass the properties of factorization and independences, but they differ in the set of independences they can encode. If the network structure of the model is a directed acyclic graph, the model represents a factorization of the joint probability of all random variables. More precisely, if the variables are X_1, …, X_n, then the joint probability satisfies P(X_1, …, X_n) = ∏_{i=1}^n P(X_i | pa_i), where pa_i is the set of parents of node X_i. In other words, the joint distribution factors into a product of conditional distributions. In general, any two sets of nodes are conditionally independent given a third set if a criterion called d-separation holds in the graph. Local independences and global independences are equivalent in Bayesian networks. This type of graphical model is known as a directed graphical model, Bayesian network, or belief network; classic machine learning models like hidden Markov models and neural networks can be considered special cases of Bayesian networks. A Markov random field, also known as a Markov network, is a model over an undirected graph. A graphical model with many repeated subunits can be represented with plate notation. A factor graph is an undirected bipartite graph connecting variables and factors; each factor represents a function over the variables it is connected to, and this is a helpful representation for understanding and implementing belief propagation. A clique tree or junction tree is a tree of cliques. A chain graph is a graph which may have both directed and undirected edges, but without any directed cycles.
Both directed acyclic graphs and undirected graphs are special cases of chain graphs; an ancestral graph is a further extension, having directed, bidirected and undirected edges. A conditional random field is a model specified over an undirected graph, and a restricted Boltzmann machine is a generative model specified over an undirected graph.
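The DAG factorization can be made concrete with a tiny hypothetical chain A → B → C, where the joint is the product of one prior and two conditional tables; all numbers below are made up for illustration.

```python
import itertools

# Joint probability of the chain A -> B -> C via the factorization
# P(A, B, C) = P(A) * P(B | A) * P(C | B); each node conditions on its parents.
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2},   # p_b_given_a[a][b]
               False: {True: 0.1, False: 0.9}}
p_c_given_b = {True: {True: 0.5, False: 0.5},   # p_c_given_b[b][c]
               False: {True: 0.4, False: 0.6}}

def joint(a, b, c):
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

# the factorized product defines a proper distribution: it sums to 1
total = sum(joint(a, b, c)
            for a, b, c in itertools.product([True, False], repeat=3))
```

Specifying three small tables instead of a full 8-entry joint table is exactly the economy that the factorization provides.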
An example of a graphical model. Each arrow indicates a dependency. In this example: D depends on A, D depends on B, D depends on C, C depends on B, and C depends on D.
18.
Bayesian network
–
For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms: given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Formally, Bayesian networks are DAGs whose nodes represent random variables in the Bayesian sense; they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a set of values for the node's parent variables. Similar ideas may be applied to undirected, and possibly cyclic, graphs. Efficient algorithms exist that perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables are called dynamic Bayesian networks; generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. Suppose that there are two events which could cause grass to be wet: either the sprinkler is on or it's raining. Also, suppose that the rain has a direct effect on the use of the sprinkler. Then the situation can be modeled with a Bayesian network; all three variables have two possible values, T and F. The joint probability function is Pr(G, S, R) = Pr(G | S, R) Pr(S | R) Pr(R), where the names of the variables have been abbreviated to G = Grass wet, S = Sprinkler turned on, and R = Raining. The model can answer questions like "What is the probability that it is raining, given the grass is wet?" For example, Pr(G = T, S = T, R = T) = Pr(G = T | S = T, R = T) Pr(S = T | R = T) Pr(R = T) = 0.99 × 0.01 × 0.2 = 0.00198. The numerical results (subscripted by the associated variable values G, S, R) are Pr(R = T | G = T) = (0.00198_TTT + 0.1584_TFT) / (0.00198_TTT + 0.288_TTF + 0.1584_TFT + 0.0_TFF) = 891/2491 ≈ 35.77 %. If, on the other hand, we wish to answer an interventional question: "What is the probability that it would rain, given that we wet the grass?"
The answer would be governed by the post-intervention joint distribution Pr(S, R | do(G = T)) = Pr(S | R) Pr(R), obtained by removing the factor Pr(G | S, R) from the pre-intervention distribution. As expected, the probability of rain is unaffected by the action: Pr(R | do(G = T)) = Pr(R). If, moreover, we wish to predict the impact of turning the sprinkler on, we have Pr(G, R | do(S = T)) = Pr(G | S = T, R) Pr(R), with the term Pr(S = T | R) removed, showing that the action has an effect on the grass but not on the rain. These predictions may not be feasible when some of the variables are unobserved; the effect of the action do(x) can still be predicted, however, whenever a criterion called the back-door criterion is satisfied. It states that, if a set Z of nodes can be observed that d-separates all back-door paths from X to Y, then Pr(Y | do(X = x)) = Σ_z Pr(Y | X = x, Z = z) Pr(Z = z); a back-door path is one that ends with an arrow pointing into X. Sets that satisfy the back-door criterion are called sufficient or admissible; when no such set can be observed, we say that Pr(Y | do(X = x)) is not identified.
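The numbers in the sprinkler example can be reproduced by brute-force enumeration of the joint distribution. The conditional probability tables below are the ones commonly used with this example; only a few of their entries (0.2, 0.01, 0.99) are quoted in the text, so the rest are stated here as assumptions.

```python
# CPTs for the sprinkler network (standard illustrative values).
def p_r(r):
    return 0.2 if r else 0.8                      # Pr(R)

def p_s(s, r):                                    # Pr(S | R)
    p = 0.01 if r else 0.4                        # rain discourages sprinkler use
    return p if s else 1 - p

def p_g(g, s, r):                                 # Pr(G | S, R)
    p = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0}[(s, r)]
    return p if g else 1 - p

def joint(g, s, r):
    """Pr(G, S, R) = Pr(G | S, R) Pr(S | R) Pr(R)."""
    return p_g(g, s, r) * p_s(s, r) * p_r(r)

# Pr(R = T | G = T) by summing the joint over the unobserved variable S
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(True, s, r) for s in (True, False) for r in (True, False))
p_rain_given_wet = num / den       # 891/2491, about 35.77 %
```

The denominator's four terms are exactly the subscripted quantities 0.00198_TTT, 0.288_TTF, 0.1584_TFT and 0.0_TFF in the text.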
A simple Bayesian network. Rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet.
19.
Conditional random field
–
Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. CRFs fall into the sequence modeling family; they are a type of discriminative undirected probabilistic graphical model. A CRF is used to encode known relationships between observations and construct consistent interpretations, and it is often used for labeling or parsing of sequential data, such as natural language text or biological sequences, and in computer vision. In computer vision, CRFs are often used for object recognition and image segmentation. Lafferty, McCallum and Pereira define a CRF on observations X and random variables Y as follows: let G = (V, E) be a graph such that Y = (Y_v)_{v ∈ V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when the random variables Y_v, conditioned on X, obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G. For general graphs, the problem of exact inference in CRFs is intractable; the inference problem for a CRF is basically the same as for an MRF, and the same arguments hold. However, there exist special cases for which exact inference is feasible: if the graph is a chain or a tree, message-passing algorithms yield exact solutions, analogous to the forward-backward and Viterbi algorithms for the case of HMMs; and if the CRF only contains pair-wise potentials and the energy is submodular, combinatorial min cut/max flow algorithms yield exact solutions. If exact inference is impossible, several algorithms can be used to obtain approximate solutions. These include loopy belief propagation, alpha expansion, mean field inference and linear programming relaxations. Learning the parameters θ is usually done by maximum likelihood; if all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex.
It can be solved for example using gradient descent algorithms, or quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables; exact inference is intractable in general graphs, so approximations have to be used. In sequence modeling, the graph of interest is usually a chain graph. An input sequence of observed variables X represents a sequence of observations, and the model assigns each feature a numerical weight and combines them to determine the probability of a certain value for Y_i. Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models; an HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions. CRFs can be extended into higher-order models by making each Y_i dependent on a fixed number o of previous variables Y_{i−o}, …, Y_{i−1}; training and inference are only practical for small values of o, since their computational cost increases exponentially with o. Large-margin models for structured prediction, such as the structured support vector machine, can be seen as an alternative training procedure to CRFs. There exists another generalization of CRFs, the semi-Markov conditional random field, which models variable-length segmentations of the label sequence Y; this provides much of the power of higher-order CRFs to model long-range dependencies of the Y_i. Latent-dynamic conditional random fields, or discriminative probabilistic latent variable models, are a type of CRF for sequence tagging tasks. They are latent variable models that are trained discriminatively; these models find applications in computer vision, specifically gesture recognition from video streams, and in shallow parsing.
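For a linear chain small enough to enumerate, the conditional distribution p(y | x) ∝ exp(Σ unary scores + Σ pairwise transition scores) and its normalizer Z can be computed directly; the feature scores below are made-up numbers standing in for weighted feature functions of a fixed observation sequence x.

```python
import itertools
import math

labels = [0, 1]                                  # tiny label set
unary = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.4]]     # unary[t][y]: score of label y at position t
pairwise = [[0.5, -0.5], [-0.5, 0.5]]            # pairwise[y][y']: transition score

def score(y_seq):
    """Total potential of one labeling of the 3-position chain."""
    s = sum(unary[t][y] for t, y in enumerate(y_seq))
    s += sum(pairwise[a][b] for a, b in zip(y_seq, y_seq[1:]))
    return s

# brute-force partition function over all 2^3 labelings
Z = sum(math.exp(score(y)) for y in itertools.product(labels, repeat=3))

def prob(y_seq):
    return math.exp(score(y_seq)) / Z

probs = {y: prob(y) for y in itertools.product(labels, repeat=3)}
```

In practice Z is computed with forward-backward dynamic programming rather than enumeration, which is what makes linear chains tractable.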
20.
Hidden Markov model
–
A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be presented as the simplest dynamic Bayesian network. The mathematics behind the HMM were developed by L. E. Baum and coworkers. It is closely related to earlier work on the optimal nonlinear filtering problem by Ruslan L. Stratonovich. In simpler Markov models, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but output dependent on the state is; each state has a probability distribution over the possible output tokens. Therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states. In its discrete form, a hidden Markov process can be visualized as a generalization of the urn problem with replacement. Consider this example: in a room that is not visible to an observer there is a genie. The room contains urns X1, X2, X3, …, each of which contains a known mix of balls, each ball labeled y1, y2, y3, …. The genie chooses an urn in that room and randomly draws a ball from that urn. It then puts the ball onto a conveyor belt, where the observer can observe the sequence of the balls but not the sequence of urns from which they were drawn. The genie has some procedure to choose urns; the choice of the urn for the n-th ball depends only upon a random number and the urn chosen for the (n − 1)-th ball. The choice of urn does not directly depend on the urns chosen before this single previous urn; therefore, this is called a Markov process. It can be described by the upper part of Figure 1. The Markov process itself cannot be observed, only the sequence of labeled balls; this is illustrated by the lower part of the diagram shown in Figure 1, where one can see that balls y1, y2, y3, y4 can be drawn at each state. However, the observer can work out other information, such as the likelihood that a given ball came from each of the urns. The diagram below shows the general architecture of an instantiated HMM.
Each oval shape represents a random variable that can adopt any of a number of values. The random variable x(t) is the hidden state at time t; the random variable y(t) is the observation at time t. The arrows in the diagram denote conditional dependencies. The conditional probability distribution of the hidden variable x(t) depends only on the value of the hidden variable x(t − 1); the values at time t − 2 and before have no direct influence. This is called the Markov property. Similarly, the value of the observed variable y(t) only depends on the value of the hidden variable x(t) at the same time. In the standard type of hidden Markov model considered here, the state space of the hidden variables is discrete.
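The likelihood of an observation sequence under such a model is computed by the forward algorithm, which sums over all hidden state sequences in linear time; the two-state, three-symbol parameters below are hypothetical urn-style numbers, not taken from the text.

```python
import numpy as np

# Hypothetical HMM: 2 hidden states ("urns"), 3 output symbols ("ball labels").
A = np.array([[0.7, 0.3],       # A[i, j]: transition probability i -> j
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],  # B[i, o]: probability state i emits symbol o
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])       # initial state distribution

def forward(obs):
    """Return p(obs) via the forward recursion
    alpha_t(i) = p(o_1..o_t, x_t = i), summed out at the end."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

likelihood = forward([0, 1, 2])  # probability of observing symbols 0, 1, 2
```

The same recursion underlies Baum-Welch training, where these forward quantities are combined with backward ones to re-estimate A and B.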
21.
Local outlier factor
–
LOF shares some concepts with DBSCAN and OPTICS, such as the concepts of core distance and reachability distance, which are used for local density estimation. The local outlier factor is based on a concept of local density, where locality is given by the k nearest neighbors; points that have a substantially lower density than their neighbors are considered to be outliers. The local density is estimated by the typical distance at which a point can be reached from its neighbors. The definition of reachability distance used in LOF is an additional measure to produce more stable results within clusters. Let k-distance(A) be the distance of the object A to its k-th nearest neighbor; note that the set of the k nearest neighbors includes all objects at this distance, which can in the case of a tie be more than k objects. We denote the set of k nearest neighbors as N_k(A). The reachability distance of A from B is reachability-dist_k(A, B) = max{k-distance(B), d(A, B)}, so objects that belong to the k nearest neighbors of B are considered to be equally distant. The reason for this distance is to get more stable results; note that this is not a distance in the mathematical definition, since it is not symmetric. The local reachability density of an object A is defined by lrd(A) = 1 / (Σ_{B ∈ N_k(A)} reachability-dist_k(A, B) / |N_k(A)|); note that it is not the average reachability of the neighbors from A, but the typical distance at which A can be reached from its neighbors. With duplicate points, this value can become infinite. A LOF value of approximately 1 indicates that the object is comparable to its neighbors; a value below 1 indicates a denser region, while values significantly larger than 1 indicate outliers. Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a small distance to a very dense cluster is an outlier. While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context where a dissimilarity function can be defined. It has experimentally been shown to work very well in numerous setups, often outperforming the competitors, for example in network intrusion detection and on processed classification benchmark data.
The LOF family of methods can be generalized and then applied to various other problems, such as detecting outliers in geographic data. The resulting LOF values are quotient-values and hard to interpret: a value of 1 or even less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set, a value of 1.1 may already be an outlier; in another dataset, a value of 2 could still be an inlier. These differences can also occur within a dataset due to the locality of the method. Feature bagging, which applies LOF on multiple projections and combines the results, was the first ensemble learning approach to outlier detection; for other variants see the references. Local Outlier Probability (LoOP) is a method derived from LOF that uses inexpensive local statistics to become less sensitive to the choice of the parameter k; in addition, the resulting values are scaled to a value range of [0, 1].
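The definitions above (k-distance, reachability distance, local reachability density, LOF) can be sketched naively in O(n²) for a small dataset; the cluster-plus-single-outlier data below is an illustrative assumption.

```python
import numpy as np

def lof(X, k):
    """Local outlier factor for each row of X (naive all-pairs sketch)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)                           # exclude self
    knn = np.argsort(D, axis=1)[:, :k]                    # k nearest neighbours
    kdist = D[np.arange(n), knn[:, -1]]                   # k-distance per point
    # reachability-dist_k(A, B) = max(k-distance(B), d(A, B))
    reach = np.maximum(kdist[knn], D[np.arange(n)[:, None], knn])
    lrd = k / reach.sum(axis=1)                           # local reachability density
    return lrd[knn].mean(axis=1) / lrd                    # mean neighbour lrd / own lrd

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.5, (30, 2)),  # dense cluster of inliers
               [[5.0, 5.0]]])                # one point far from the cluster
scores = lof(X, k=5)                         # last score should be by far the largest
```

The cluster points score near 1 while the distant point scores much higher, matching the interpretation of LOF values described above.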
Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. A has a much lower density than its neighbors.
22.
Autoencoder
–
An autoencoder, autoassociator or Diabolo network is an artificial neural network used for unsupervised learning of efficient codings. The aim of an autoencoder is to learn a compressed representation (encoding) for a set of data. Recently, the autoencoder concept has become more widely used for learning generative models of data. Autoencoders are unsupervised learning models. Here, σ1 is an element-wise activation function such as a sigmoid function or a rectified linear unit. If the hidden layers are larger than the input layer, an autoencoder can potentially learn the identity function and become useless; however, experimental results have shown that autoencoders might still learn useful features in these cases. Denoising autoencoders take a partially corrupted input while training to recover the original undistorted input; this technique has been introduced with a specific approach to good representation: a good representation is one that can be obtained robustly from a corrupted input. By imposing sparsity on the hidden units during training, a sparse autoencoder can learn useful structures in the input data. This allows sparse representations of inputs, which are useful in pretraining for classification tasks. Sparsity may be achieved by additional penalty terms in the loss function during training, or by manually zeroing all but the strongest hidden unit activations. Variational autoencoder models inherit the autoencoder architecture, but make strong assumptions concerning the distribution of latent variables; they use a variational approach for latent representation learning, which results in an additional loss component and a specific training algorithm called Stochastic Gradient Variational Bayes. The objective of the variational autoencoder in this case has the form L(ϕ, θ, x) = D_KL(q_ϕ(z | x) ‖ p_θ(z)) − E_{q_ϕ(z | x)}[log p_θ(x | z)]. Contractive autoencoders add an explicit regularizer to the loss; this regularizer corresponds to the Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input. An autoencoder is often trained using one of the many variants of backpropagation. Though these are often effective, there are fundamental problems with the use of backpropagation to train networks with many hidden layers.
Once errors are backpropagated to the first few layers, they become minuscule. This means that the network will almost always learn to reconstruct the average of all the training data. Though more advanced backpropagation methods can solve this problem to an extent, they still result in a very slow learning process. This problem can be remedied by using initial weights that approximate the final solution; the process of finding these initial weights is often referred to as pretraining. Geoffrey Hinton developed such a pretraining technique for training many-layered deep autoencoders; the resulting model takes the name of deep belief network. See also: representation learning, restricted Boltzmann machine, sparse dictionary learning.
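A minimal sketch of a single-hidden-layer autoencoder trained by backpropagation: the data is made to lie in a two-dimensional subspace of R^4, so a two-unit bottleneck can encode it. The architecture, learning rate, and data are arbitrary illustrative choices, not any particular published setup.

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.normal(size=(200, 2))           # 2-D latent factors
X = Z @ rng.normal(size=(2, 4))         # data in a 2-D subspace of R^4

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# encoder h = sigmoid(x W1 + b1), linear decoder xhat = h W2 + b2
W1 = rng.normal(scale=0.5, size=(4, 2)); b1 = np.zeros(2)
W2 = rng.normal(scale=0.5, size=(2, 4)); b2 = np.zeros(4)
lr = 0.5
losses = []
for _ in range(1000):
    H = sigmoid(X @ W1 + b1)            # bottleneck code
    Xhat = H @ W2 + b2                  # reconstruction
    err = Xhat - X
    losses.append(np.mean(err ** 2))    # mean squared reconstruction error
    # backpropagate the reconstruction error through both layers
    dXhat = 2 * err / X.size
    dW2 = H.T @ dXhat; db2 = dXhat.sum(axis=0)
    dH = dXhat @ W2.T
    dA = dH * H * (1 - H)               # derivative of the sigmoid
    dW1 = X.T @ dA; db1 = dA.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

The reconstruction loss falls steadily, showing the network learning a compressed 2-D code for 4-D inputs rather than the identity map.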
23.
Multilayer perceptron
–
A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron with a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training the network. MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly separable. The activation function is modeled in several ways; here y_i is the output of the i-th node and v_i is the weighted sum of the input synapses. Alternative activation functions have been proposed, including the rectifier and softplus functions; more specialized activation functions include radial basis functions, which are used in another class of supervised neural network models. The multilayer perceptron consists of three or more layers of nonlinearly-activating nodes and is considered a deep neural network. Since an MLP is a fully connected network, each node in one layer connects with a weight w_ij to every node in the following layer. Some people do not include the input layer when counting the number of layers. Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron. We represent the error in output node j for the n-th data point by e_j(n) = d_j(n) − y_j(n), where d is the target value and y is the value produced by the perceptron. We then make corrections to the weights of the nodes based on those corrections which minimize the error in the entire output. The derivative to be calculated depends on the induced local field v_j; this in turn depends on the change in weights of the k-th nodes of the following layer. The term multilayer perceptron often causes confusion.
It is argued that the model is not a single perceptron that has multiple layers; rather, it contains many perceptrons that are organised into layers, leading some to believe that a more fitting term might therefore be multilayer perceptron network. Only when the units use a threshold activation function can the network be considered a binary classifier with multiple layers of true perceptrons. Furthermore, the term multilayer perceptron does not specify the nature of the layers; the layers are free to be composed of general artificial neurons with arbitrary activation functions. This interpretation of the term multilayer perceptron avoids the loosening of the definition of perceptron to mean an artificial neuron in general.
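A small sketch of an MLP trained by backpropagation on XOR, the classic example of data that is not linearly separable; the layer sizes, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# XOR targets: no single linear boundary separates the classes,
# but one hidden layer of nonlinear units suffices.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0.0], [1.0], [1.0], [0.0]])      # target values d_j

rng = np.random.default_rng(6)
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)  # hidden layer (tanh)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)  # output layer (sigmoid)
lr = 0.5
losses = []
for _ in range(5000):
    H = np.tanh(X @ W1 + b1)                    # hidden activations
    y = 1 / (1 + np.exp(-(H @ W2 + b2)))        # network output
    e = d - y                                   # error e_j = d_j - y_j
    losses.append(np.mean(e ** 2))
    # backpropagation: squared-error gradient through both layers
    delta2 = -e * y * (1 - y)
    delta1 = (delta2 @ W2.T) * (1 - H ** 2)     # tanh derivative
    W2 -= lr * H.T @ delta2; b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * X.T @ delta1; b1 -= lr * delta1.sum(axis=0)
```

A single-layer perceptron would stall at chance-level error (0.25) on this task; the hidden layer lets the network drive the error far lower.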
24.
Recurrent neural network
–
A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior. Unlike feedforward neural networks, RNNs can use their internal memory to process arbitrary sequences of inputs. This makes them applicable to tasks such as unsegmented connected handwriting recognition or speech recognition. The basic architecture was developed in the 1980s: a network of neuron-like units, each with a time-varying real-valued activation and each connection with a modifiable real-valued weight. Some of the nodes are called input nodes, some output nodes, and the rest hidden nodes. Most architectures below are special cases. For supervised learning in discrete time settings, training sequences of real-valued input vectors become sequences of activations of the input nodes, one input vector at a time. At any given time step, each non-input unit computes its current activation as a function of the weighted sum of the activations of all units from which it receives connections. There may be teacher-given target activations for some of the units at certain time steps. For example, if the input sequence is a speech signal corresponding to a spoken digit, the final target output at the end of the sequence may be a label classifying the digit. For each sequence, its error is the sum of the deviations of all target signals from the corresponding activations computed by the network. For a training set of sequences, the total error is the sum of the errors of all individual sequences. Algorithms for minimizing this error are mentioned in the section on training algorithms below. A recursive neural network is created by applying the same set of weights recursively over a differentiable graph-like structure; such networks are typically also trained by the reverse mode of automatic differentiation.
They were introduced to learn distributed representations of structure, such as logical terms. A special case of recursive neural networks is the RNN itself, whose structure corresponds to a linear chain. Recursive neural networks have been applied to natural language processing. The Recursive Neural Tensor Network uses a tensor-based composition function for all nodes in the tree. The Hopfield network is of historic interest although it is not a general RNN, as it is not designed to process sequences of patterns. It is an RNN in which all connections are symmetric. Invented by John Hopfield in 1982, it guarantees that its dynamics will converge.
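The unit update described above, where each non-input unit computes its activation from a weighted sum that includes its own previous activation, can be sketched for a single recurrent unit. The weights w_in and w_rec below are illustrative assumptions, not values from any particular trained network.

```python
import math

# Sketch of one recurrent unit processing a sequence: the hidden state h
# carries information forward in time, which feedforward networks cannot do.

def rnn_run(inputs, w_in=1.0, w_rec=0.5, b=0.0):
    h = 0.0                # initial hidden state (the network's "memory")
    states = []
    for x in inputs:
        # Weighted sum of the current input and the previous state,
        # passed through a nonlinearity.
        h = math.tanh(w_in * x + w_rec * h + b)
        states.append(h)
    return states

# A single input pulse at t = 0 keeps echoing through the recurrent weight,
# decaying over the following steps.
print(rnn_run([1.0, 0.0, 0.0]))
```

The nonzero activations after the pulse ends are exactly the "internal memory" the text refers to.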
25.
Self-organizing map
–
This makes SOMs useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. The artificial neural network introduced by the Finnish professor Teuvo Kohonen in the 1980s is sometimes called a Kohonen map or network. The Kohonen net is a convenient abstraction building on work on biological neural models from the 1970s. Like most artificial neural networks, SOMs operate in two modes, training and mapping: training builds the map using input examples, while mapping automatically classifies a new input vector. A self-organizing map consists of components called nodes or neurons. Associated with each node are a weight vector of the same dimension as the input data vectors, and a position in the map space. The usual arrangement of nodes is a regular spacing in a hexagonal or rectangular grid. The self-organizing map describes a mapping from a higher-dimensional input space to a lower-dimensional map space. The procedure for placing a vector from data space onto the map is to find the node with the closest weight vector to the data-space vector. Useful extensions include toroidal grids, where opposite edges are connected; it is also common to use the U-Matrix. The U-Matrix value of a node is the average distance between the node's weight vector and those of its closest neighbors. In a square grid, for instance, we consider the closest 4 or 8 nodes. In maps consisting of thousands of nodes, it is possible to perform cluster operations on the map itself. The goal of learning in the self-organizing map is to cause different parts of the network to respond similarly to certain input patterns. This is partly motivated by how visual, auditory or other sensory information is handled in separate parts of the cerebral cortex in the human brain.
The weights of the neurons are initialized either to small random values or sampled evenly from the subspace spanned by the two largest principal component eigenvectors. With the latter alternative, learning is much faster because the initial weights already give a good approximation of SOM weights. The network must be fed a large number of example vectors that represent, as closely as possible, the kinds of vectors expected during mapping. The examples are usually administered several times as iterations. When a training example is fed to the network, its Euclidean distance to all weight vectors is computed. The neuron whose weight vector is most similar to the input is called the best matching unit (BMU). The weights of the BMU and of neurons close to it in the SOM lattice are adjusted towards the input vector. The magnitude of the change decreases with time and with distance from the BMU. Depending on the implementation, the training index t can scan the training data set systematically, be randomly drawn from the data set, or implement some other sampling method.
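The training step just described (compute distances, find the BMU, pull the BMU and its lattice neighbors toward the input) can be sketched as follows. The one-dimensional lattice, the learning rate, and the Gaussian neighborhood function are illustrative assumptions for a toy example.

```python
import math

# Sketch of one SOM training step on a tiny 1-D lattice of three nodes,
# each holding a 2-D weight vector.

def bmu_index(weights, x):
    """Best matching unit: the node whose weight vector is closest (Euclidean)."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def som_step(weights, x, lr=0.5, radius=1.0):
    b = bmu_index(weights, x)
    for i, w in enumerate(weights):
        # Neighborhood factor: decays with lattice distance from the BMU,
        # so nearby nodes move strongly and distant nodes barely move.
        theta = math.exp(-((i - b) ** 2) / (2 * radius ** 2))
        weights[i] = [wi + lr * theta * (xi - wi) for wi, xi in zip(w, x)]
    return weights

w = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
som_step(w, [2.0, 2.0])   # node 2 is the BMU; node 1 is dragged toward the input
print(w)
```

In a full implementation both the learning rate and the neighborhood radius would shrink over the iterations, as the text notes.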
Self-organizing map
–
Self organizing maps (SOM) of three and eight colors with U-Matrix.
Self-organizing map
–
A self-organizing map showing U.S. Congress voting patterns visualized in Synapse. The first two boxes show clustering and distances while the remaining ones show the component planes. Red means a yes vote while blue means a no vote in the component planes (except the party component where red is Republican and blue is Democratic).
Self-organizing map
–
Self organizing map of Fisher's Iris flower data.
Self-organizing map
–
Cartographical representation of a self-organizing map (U-Matrix) based on Wikipedia featured article data (word frequency). Distance is inversely proportional to similarity. The "mountains" are edges between clusters. The red lines are links between articles.
26.
Convolutional neural network
–
Individual cortical neurons respond to stimuli in a restricted region of space known as the receptive field. The receptive fields of different neurons partially overlap such that they tile the visual field. The response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a convolution operation. Convolutional networks were inspired by biological processes and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing. They have wide applications in image and video recognition, recommender systems and natural language processing. Convolutional neural networks model animal visual perception, and can be applied to visual recognition tasks. Convolutional neural networks consist of multiple layers of receptive fields. These are small neuron collections which process portions of the input image. The outputs of these collections are then tiled so that their input regions overlap, to obtain a higher-resolution representation of the original image; this is repeated for every such layer. Tiling allows CNNs to tolerate translation of the input image. Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters. They also consist of various combinations of convolutional and fully connected layers. A convolution operation on small regions of input is introduced to reduce the number of free parameters. One major advantage of convolutional networks is the use of shared weights in convolutional layers, which means that the same filter is used at every position in the layer; this both reduces memory footprint and improves performance. Compared to other image classification algorithms, convolutional neural networks use relatively little pre-processing. This means that the network is responsible for learning the filters that in traditional algorithms were hand-engineered. The lack of dependence on prior knowledge and human effort in designing features is a major advantage for CNNs.
The design of convolutional neural networks follows visual mechanisms in living organisms. Work by Hubel and Wiesel in the 1950s and 1960s showed that cat visual cortex contains neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field. Neighboring cells have similar and overlapping receptive fields. Receptive field size and location varies systematically across the cortex to form a complete map of visual space, the cortex in each hemisphere representing the contralateral visual field. The neocognitron was introduced in 1980; the neocognitron does not require units located at several network positions to have the same trainable weights. The weight-sharing idea appears in 1986 in the book version of the original backpropagation paper. Convolutional networks for temporal signals were developed in 1988; their design was improved in 1998, generalized in 2003, and simplified in the same year. The ability to process higher-resolution images requires larger and more convolutional layers. Similarly, a shift-invariant neural network was proposed for image recognition in 1988.
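The shared-weight convolution described above can be sketched directly: one small filter is slid across the entire image, so the number of free parameters is the size of the filter, not of the image. The filter values below are hypothetical, chosen to respond to vertical edges.

```python
# Sketch of a "valid" 2-D convolution (no padding), the core operation of a
# convolutional layer: the same kernel weights are reused at every position.

def conv2d_valid(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1          # output height
    ow = len(image[0]) - kw + 1       # output width
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            # Weighted sum over the kernel-sized patch at (i, j).
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

# A tiny image with a vertical edge, and a hypothetical edge-detecting filter.
img = [[0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]]
edge = [[-1, 1],
        [-1, 1]]
print(conv2d_valid(img, edge))   # responds strongly where the edge sits
```

Because the same kernel is applied everywhere, a feature learned at one image position is automatically detected at every other position, which is the translation tolerance the text describes.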
27.
Probably approximately correct learning
–
In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function from a certain class of possible functions. The goal is that, with high probability, the selected function will have low generalization error. The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise. An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning; in particular, the learner is expected to find efficient functions, and the learner itself must implement an efficient procedure. In order to give the definition for something that is PAC-learnable, two examples will be used in the following definitions. The first is the problem of character recognition given an array of n bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and points outside of it as negative. Let X be a set called the instance space or the encoding of all the samples, and let each instance have a length assigned. In the character recognition problem, the instance space is X = {0,1}^n. In the interval problem the instance space is X = R. A concept is a subset c ⊂ X. One concept is the set of all patterns of bits in X = {0,1}^n that encode a picture of the letter P. An example concept from the second example is the set of all of the numbers between π/2 and 10. A concept class C is a set of concepts over X; this could be the set of all subsets of the array of bits that are skeletonized and 4-connected. Let EX(c, D) be a procedure that draws an example, x, using a probability distribution D and gives the correct label c(x), that is, 1 if x ∈ c and 0 otherwise.
Further, if the statement for algorithm A is true for every concept c ∈ C and for every distribution D over X, we say that A is a PAC learning algorithm for C. Under some regularity conditions these three conditions are equivalent: the concept class C is PAC learnable; the VC dimension of C is finite; C is a uniform Glivenko-Cantelli class.
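The interval example above admits a very simple learner: output the tightest interval containing the positively labeled sample points. The sketch below uses made-up sample points for illustration; the PAC guarantee is that, given enough samples drawn from D, the returned interval has generalization error below any chosen ε with probability at least 1 − δ.

```python
import math

# Sketch of a PAC learner for the interval concept class described above:
# the hypothesis is the tightest interval containing the positive examples.

def learn_interval(samples):
    """samples: list of (x, label), where label is 1 iff x lies in the target interval."""
    pos = [x for x, label in samples if label == 1]
    if not pos:
        return (0.0, 0.0)             # empty hypothesis: no positives seen yet
    return (min(pos), max(pos))

# The example concept from the text: the set of numbers between pi/2 and 10.
target = (math.pi / 2, 10.0)
# Hypothetical labeled sample points drawn for illustration.
data = [(x, 1 if target[0] <= x <= target[1] else 0)
        for x in [0.5, 1.0, 2.0, 3.0, 9.5, 11.0]]
print(learn_interval(data))           # a sub-interval nested inside the target
```

The hypothesis can only err on the two thin strips between its endpoints and the target's endpoints, and the probability mass of those strips shrinks as more samples are drawn, which is what makes this class PAC learnable.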
28.
Artificial neuron
–
An artificial neuron is a mathematical function conceived as a model of biological neurons. Artificial neurons are the elementary units in an artificial neural network. Depending on the specific model used they may be called a semi-linear unit, Nv neuron, binary neuron, or linear threshold function. The artificial neuron receives one or more inputs and sums them to produce an output. Usually the inputs are separately weighted, and the sum is passed through a non-linear function known as an activation function or transfer function. The transfer functions usually have a sigmoid shape, but they may also take the form of other non-linear functions, such as piecewise linear functions. They are also often monotonically increasing, continuous, differentiable and bounded. The thresholding function has inspired the building of logic gates referred to as threshold logic, with renewed interest in building logic circuits resembling brain processing. For example, new devices such as memristors have been used to develop such logic in recent times. The artificial neuron transfer function should not be confused with a linear system's transfer function. For a given neuron, let there be m+1 inputs with signals x_0 through x_m and weights w_k0 through w_km. Usually, the x_0 input is assigned the value +1, which makes it a bias input with w_k0 = b_k; this leaves only m actual inputs to the neuron, from x_1 to x_m. The output of the k-th neuron is y_k = φ(Σ_{j=0}^{m} w_kj x_j), where φ is the transfer function. The output is analogous to the axon of a biological neuron, and its value propagates to the input of the next layer, through a synapse. It may also exit the system, possibly as part of an output vector. It has no learning process as such: its transfer function weights are calculated and its threshold value is predetermined. Artificial neurons are designed to mimic aspects of their biological counterparts. Dendrites – In a biological neuron, the dendrites act as the input vector. These dendrites allow the cell to receive signals from a large number of neighboring neurons.
As in the mathematical treatment, each dendrite is able to perform multiplication by that dendrite's weight value. The multiplication is accomplished by increasing or decreasing the ratio of synaptic neurotransmitters to signal chemicals introduced into the dendrite in response to the synaptic neurotransmitter. A negative multiplication effect can be achieved by transmitting signal inhibitors along the dendrite in response to the reception of synaptic neurotransmitters. Soma – In a biological neuron, the soma acts as the summation function.
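The definition above translates almost line for line into code: prepend the bias input x_0 = +1, form the local field v_k as the weighted sum, then apply a transfer function φ. The weights below are hypothetical; with a step transfer function they happen to make the neuron act as a logical AND gate.

```python
import math

# Sketch of the artificial neuron y_k = phi(sum_j w_kj * x_j) with m+1 inputs,
# where x_0 is fixed to +1 so that w_k0 acts as the bias b_k.

def sigmoid(v):
    """A typical sigmoid-shaped transfer function."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron(weights, inputs, phi=sigmoid):
    x = [1.0] + list(inputs)                        # prepend the bias input x_0 = +1
    v = sum(w * xi for w, xi in zip(weights, x))    # local field v_k
    return phi(v)

# With a hard threshold and hypothetical weights [-1.5, 1, 1],
# the neuron fires only when both binary inputs are 1 (an AND gate).
step = lambda v: 1 if v > 0 else 0
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", neuron([-1.5, 1.0, 1.0], [a, b], phi=step))
```

Swapping the transfer function (step, sigmoid, rectifier) changes the neuron's character without touching the weighted-sum machinery, mirroring the range of models listed above.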
29.
Axon
–
An axon is a long, slender projection of a nerve cell, or neuron, that typically conducts electrical impulses away from the neuron's cell body. Axons are also known as nerve fibers. The function of the axon is to transmit information to different neurons, muscles and glands. Axon dysfunction has caused many inherited and acquired neurological disorders which can affect both the peripheral and central neurons. Nerve fibers are classed into three types – A delta fibers, B fibers, and C fibers. A and B are myelinated and C are unmyelinated. An axon is one of two types of protoplasmic protrusions that extrude from the cell body of a neuron, the other type being dendrites. Axons are distinguished from dendrites by several features, including shape and length; all of these rules have exceptions, however. Axons are covered by a membrane known as the axolemma; the cytoplasm of an axon is called axoplasm. Some types of neurons have no axon and transmit signals from their dendrites. No neuron ever has more than one axon; however, in invertebrates such as insects or leeches the axon sometimes consists of several regions that function more or less independently of each other. Most axons branch, in some cases very profusely. Axons make contact with other cells—usually other neurons but sometimes muscle or gland cells—at junctions called synapses. At a synapse, the membrane of the axon closely adjoins the membrane of the target cell, and special molecular structures serve to transmit electrical or electrochemical signals across the gap. Some synaptic junctions appear partway along an axon as it extends—these are called en passant synapses; other synapses appear as terminals at the ends of axonal branches. A single axon, with all its branches taken together, can innervate multiple parts of the brain. Axons are the transmission lines of the nervous system.
Some axons can extend up to one meter or more while others extend as little as one millimeter. The longest axons in the human body are those of the sciatic nerve, which run from the base of the spinal cord to the big toe of each foot. The diameter of axons is also variable; most individual axons are microscopic in diameter. The largest mammalian axons can reach a diameter of up to 20 µm. The squid giant axon, which is specialized to conduct signals very rapidly, is close to 1 millimetre in diameter, the size of a small pencil lead. Axonal arborization also differs from one nerve fiber to the next. Axons in the central nervous system typically show complex trees with many branch points. In comparison, the granule cell axon is characterized by a single T-shaped branch node from which two parallel fibers extend. Elaborate arborization allows for the transmission of messages to a large number of target neurons within a single region of the brain.
Axon
–
A dissected human brain, showing grey matter and white matter
Axon
–
Dendrite
Axon
–
(A) Pyramidal cell, interneuron, and short-duration waveform (axon). (B) Overlay of the three average waveforms. (C) Average and standard error of peak-trough time for pyramidal cells, interneurons, and putative axons. (D) Scatter plot of signal-to-noise ratios for individual units against peak-trough time for axons, pyramidal cells (PYR) and interneurons (INT).
Axon
–
Axon of 9 day old mouse with growth cone visible
30.
Backpropagation
–
Backpropagation, or the backward propagation of errors, is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The algorithm repeats a two-phase cycle: propagation and weight update. When an input vector is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer. The output of the network is then compared to the desired output, using a loss function. The error values are then propagated backwards, starting from the output. Backpropagation uses these error values to calculate the gradient of the loss function with respect to the weights in the network. In the second phase, this gradient is fed to the optimization method, which uses it to update the weights. Backpropagation is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Backpropagation requires that the activation function used by the artificial neurons be differentiable. The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to their correct outputs. An example would be a classification task, where the input is an image of an animal and the correct output is the name of the animal. The goal of backpropagation is to compute the derivative, or gradient, of the loss function with respect to the weights. For backpropagation, the loss function calculates the difference between the target output for a training example and the output actually produced after the example has been propagated through the network. For backpropagation to work, two assumptions are made about the form of the error function. The first is that it can be written as an average E = (1/n) Σ_x E_x over error functions E_x for individual training examples x. In practice, training examples are placed in batches, and the error is averaged at the end of the batch. The second assumption is that it can be written as a function of the outputs from the neural network. Let y, y′ be vectors in R^n, and select an error function E(y, y′) measuring the difference between the two outputs.
The standard choice is the squared Euclidean distance E(y, y′) = (1/2)‖y − y′‖²; the factor of 1/2 conveniently cancels the exponent when the error function is subsequently differentiated. The error function over n training examples can then be written as an average E = (1/2n) Σ_x ‖y(x) − y′(x)‖², and the partial derivative with respect to the outputs is ∂E/∂y′ = y′ − y. Let N be a network with e connections, m inputs, and n outputs. Below, x, x_1, x_2, … will denote vectors in R^m; y, y′, y_1, y_2, … vectors in R^n; and w, w_1, w_2, … vectors in R^e. These are called inputs, outputs and weights respectively. The neural network corresponds to a function y = f_N(w, x) which, given a weight vector w, maps an input x to an output y.
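The chain-rule computation above can be sketched on the smallest possible "network": a single sigmoid neuron with loss E = (1/2)(y − d)². The weight, input, and target values below are illustrative; the analytic gradient is checked against a numerical finite difference, a standard sanity test for backpropagation code.

```python
import math

# Sketch: backpropagation through one sigmoid neuron y = sigmoid(w * x)
# with loss E = 0.5 * (y - d)^2, verified against a finite difference.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(w, x):
    return sigmoid(w * x)

def grad_backprop(w, x, d):
    y = forward(w, x)
    # Chain rule: dE/dw = dE/dy * dy/dv * dv/dw
    #           = (y - d) * y * (1 - y) * x
    return (y - d) * y * (1.0 - y) * x

def grad_numeric(w, x, d, eps=1e-6):
    E = lambda w_: 0.5 * (forward(w_, x) - d) ** 2
    return (E(w + eps) - E(w - eps)) / (2 * eps)

w, x, d = 0.7, 2.0, 1.0
print(grad_backprop(w, x, d), grad_numeric(w, x, d))  # the two should agree
```

In a multi-layer network the same chain rule is applied layer by layer, with each layer's error term reusing the one computed for the layer after it; that reuse is what makes backpropagation efficient.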
Backpropagation
–
A simple neural network with two input units and one output unit
31.
Computer vision
–
Computer vision is an interdisciplinary field that deals with how computers can be made to gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do, e.g. in the form of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that can interface with other thought processes. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics and statistics. As a scientific discipline, computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences or views from multiple cameras. As a technological discipline, computer vision seeks to apply its theories and models to the construction of computer vision systems. Sub-domains of computer vision include scene reconstruction, event detection, video tracking, object recognition, object pose estimation, learning, indexing, motion estimation, and image restoration. Computer vision is concerned with the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images, and it involves the development of a theoretical and algorithmic basis to achieve automatic visual understanding. In the late 1960s, computer vision began at universities that were pioneering artificial intelligence.
It was meant to mimic the human visual system, as a stepping stone to endowing robots with intelligent behavior. In 1966, it was believed that this could be achieved through a summer project, by attaching a camera to a computer. The next decade saw studies based on more rigorous mathematical analysis. These include the concept of scale-space and the inference of shape from various cues such as shading, texture and focus. Researchers also realized that many of these mathematical concepts could be treated within the same optimization framework as regularization and Markov random fields. By the 1990s, some of these research topics became more active than the others. Research in projective 3-D reconstructions led to better understanding of camera calibration. With the advent of optimization methods for camera calibration, it was realized that a lot of the ideas were already explored in bundle adjustment theory from the field of photogrammetry.
Computer vision
–
Artist's concept of a rover on Mars, an example of an unmanned land-based vehicle. Notice the stereo cameras mounted on top of the rover.
Computer vision
–
Relation between computer vision and various other fields
32.
Machine translation
–
Current machine translation software often allows for customization by domain or profession, improving output by limiting the scope of allowable substitutions. This technique is particularly effective in domains where formal or formulaic language is used. It follows that machine translation of government and legal documents more readily produces usable output than conversation or less standardised text. With the assistance of these techniques, MT has proven useful as a tool to assist human translators and, in a very limited number of cases, can even produce output that can be used as is. The progress and potential of machine translation have been debated much throughout its history. Since the 1950s, a number of scholars have questioned the possibility of achieving fully automatic machine translation of high quality. Some critics claim that there are in-principle obstacles to automating the translation process. The idea of machine translation may be traced back to the 17th century: in 1629, René Descartes proposed a universal language, with equivalent ideas in different tongues sharing one symbol. The field of machine translation appeared in Warren Weaver's Memorandum on Translation. The first researcher in the field, Yehoshua Bar-Hillel, began his research at MIT. A Georgetown University MT research team followed with a demonstration of its Georgetown-IBM experiment system in 1954. MT research programs popped up in Japan and Russia, and the first MT conference was held in London. Real progress was much slower, however, and after the ALPAC report, which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced. Beginning in the late 1980s, as computational power increased and became less expensive, various MT companies were launched, including Trados, which was the first to develop and market translation memory technology.
The first commercial MT system for Russian/English/German-Ukrainian was developed at Kharkov State University. MT on the web started with SYSTRAN offering free translation of small texts, followed by AltaVista Babelfish, which racked up 500,000 requests a day. Franz-Josef Och won DARPA's speed MT competition. More recently, Google announced that Google Translate translates roughly enough text to fill 1 million books in one day. The idea of using digital computers for translation of natural languages was proposed as early as 1946 by A. D. Booth. Warren Weaver wrote an important memorandum, Translation, in 1949. Several papers on the topic were published at the time, and even articles in popular journals. A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer. The human translation process may be described as decoding the meaning of the source text and re-encoding this meaning in the target language. Behind this ostensibly simple procedure lies a complex cognitive operation, requiring in-depth knowledge of the grammar, semantics, syntax and idioms of the source language, as well as the culture of its speakers. The translator needs the same in-depth knowledge to re-encode the meaning in the target language. In its most general application, this is beyond current technology.
Machine translation
–
Bernard Vauquois' pyramid showing comparative depths of intermediary representation, with interlingual machine translation at the peak, followed by transfer-based, then direct translation.
33.
Artificial intelligence
–
Artificial intelligence (AI) is intelligence exhibited by machines. Colloquially, the term artificial intelligence is applied when a machine mimics cognitive functions that humans associate with other human minds, such as learning and problem solving. As machines become increasingly capable, mental facilities once thought to require intelligence are removed from the definition; for instance, optical character recognition is no longer perceived as an example of artificial intelligence, having become a routine technology. AI research is divided into subfields that focus on specific problems, on specific approaches, on the use of a particular tool, or towards satisfying particular applications. The central problems of AI research include reasoning, knowledge, planning, learning, natural language processing, and perception; general intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, logic, and methods based on probability and economics. The AI field draws upon computer science, mathematics, psychology, linguistics, philosophy and neuroscience. The field was founded on the claim that human intelligence can be so precisely described that a machine can be made to simulate it. Some people also consider AI a danger to humanity if it progresses unabatedly. While thought-capable artificial beings appeared as storytelling devices in antiquity, the idea of actually trying to build a machine to perform useful reasoning may have begun with Ramon Llull. With his Calculus ratiocinator, Gottfried Leibniz extended the concept of the calculating machine. Since the 19th century, artificial beings have been common in fiction, as in Mary Shelley's Frankenstein or Karel Čapek's R.U.R.
The study of mechanical or formal reasoning began with philosophers and mathematicians in antiquity. In the 19th century, George Boole refined those ideas into propositional logic and Gottlob Frege developed a notational system for mechanical reasoning. Around the 1940s, Alan Turing's theory of computation suggested that a machine, by shuffling symbols as simple as 0 and 1, could simulate any conceivable act of mathematical deduction. This insight, that digital computers can simulate any process of formal reasoning, is known as the Church–Turing thesis. Along with concurrent discoveries in neurology, information theory and cybernetics, it led researchers to consider building an electronic brain; the first work that is now generally recognized as AI was McCulloch and Pitts' 1943 formal design for Turing-complete artificial neurons. The field of AI research was born at a conference at Dartmouth College in 1956. Attendees Allen Newell, Herbert Simon, John McCarthy, Marvin Minsky and Arthur Samuel became the founders and leaders of AI research. At the conference, Newell and Simon, together with programmer J. C. Shaw, presented the first true artificial intelligence program, the Logic Theorist. This spurred tremendous research in the domain: computers were winning at checkers, solving problems in algebra, and proving logical theorems. By the middle of the 1960s, research in the U.S. was heavily funded by the Department of Defense and laboratories had been established around the world. AI's founders were optimistic about the future: Herbert Simon predicted that machines would be capable, within twenty years, of doing any work a man can do; Marvin Minsky agreed, writing that within a generation the problem of creating artificial intelligence would substantially be solved. They failed to recognize the difficulty of some of the remaining tasks.
Artificial intelligence
–
Kismet, a robot with rudimentary social skills
Artificial intelligence
–
An ontology represents knowledge as a set of concepts within a domain and the relationships between those concepts.
34.
Exclusive-or
–
Exclusive or, or exclusive disjunction, is a logical operation that outputs true only when its inputs differ. It is symbolized by the prefix operator J and by the infix operators XOR, EOR, EXOR, ⊻, ⊕ and ↮. The negation of XOR is the logical biconditional, which outputs true only when both inputs are the same. It gains the name exclusive or because the meaning of or is ambiguous when both operands are true; the exclusive or operator excludes that case. This is sometimes thought of as "one or the other but not both", which could be written as "A or B, but not, A and B". More generally, XOR is true only when an odd number of inputs are true: a chain of XORs—a XOR b XOR c XOR d—is true whenever an odd number of the inputs are true and is false whenever an even number of inputs are true. The truth table of A XOR B (with 0 for false and 1 for true) shows that it outputs true whenever the inputs differ:

A B  A XOR B
0 0  0
0 1  1
1 0  1
1 1  0

Exclusive disjunction essentially means "either one, but not both nor neither". In other words, the statement is true if and only if one operand is true and the other is false. For example, if two horses are racing, then one of the two will win the race, but not both of them. The exclusive or is equivalent to the negation of a logical biconditional, by the rules of material implication. This unfortunately prevents the combination of these two systems into larger structures, such as a mathematical ring. However, the system using exclusive or is an abelian group, and the combination of the operators ∧ and ⊕ over its elements produces the well-known field F2. This field can represent any logic obtainable with the system and has the benefit of the arsenal of algebraic analysis tools for fields. The Oxford English Dictionary explains either ... or as follows: the primary function of either, etc. is to emphasize the perfect indifference of the two things or courses, but a secondary function is to emphasize the mutual exclusiveness, i.e. either of the two, but not both. The exclusive or explicitly states "one or the other, but not neither nor both".
Following this kind of common-sense intuition about "or", it is sometimes argued that in many natural languages, English included, the word "or" has an exclusive sense. The exclusive disjunction of a pair of propositions p, q is supposed to mean that p is true or q is true, but not both. For example, it might be argued that the intention of a statement like "You may have coffee, or you may have tea" is to forbid taking both options. Certainly under some circumstances a sentence like this example should be taken as forbidding the possibility of accepting both options. Even so, there is reason to suppose that this sort of sentence is not disjunctive at all
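The parity property described above — a chain of XORs is true exactly when an odd number of inputs are true — can be checked directly in code. A minimal Python sketch (the helper name xor_chain is ours, chosen for illustration):

```python
from functools import reduce
from itertools import product

def xor_chain(*inputs):
    """XOR of any number of Boolean inputs.

    For two inputs, a != b is exactly exclusive or; folding it over a
    chain yields true iff an odd number of inputs are true."""
    return reduce(lambda a, b: a != b, inputs)

# Two inputs: true exactly when the inputs differ.
assert xor_chain(True, False) and not xor_chain(True, True)

# The negation of XOR is the biconditional: true iff both inputs agree.
assert all((not xor_chain(a, b)) == (a == b)
           for a, b in product([False, True], repeat=2))

# Parity: a XOR b XOR c is true iff an odd number of a, b, c are true.
for bits in product([False, True], repeat=3):
    assert xor_chain(*bits) == (sum(bits) % 2 == 1)
```

The fold over `!=` also makes the associativity of XOR visible: grouping the chain differently cannot change the parity of the true inputs.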
Exclusive-or
–
Venn diagram of A XOR B: one or the other, but not both
35.
Nobel laureate
–
They were established by the 1895 will of Alfred Nobel, which dictates that the awards should be administered by the Nobel Foundation; the Nobel Memorial Prize in Economic Sciences was established in 1968 by the Sveriges Riksbank. Each recipient, or laureate, receives a gold medal, a diploma, and a sum of money decided yearly by the Nobel Foundation, an amount that has varied throughout the years. In 1901, the recipients of the first Nobel Prizes were given 150,782 SEK; in 2008, the laureates were awarded a prize amount of 10,000,000 SEK. The awards are presented in Stockholm in a ceremony on December 10. In some years the Nobel Prize is not awarded, due to external events or a lack of suitable nominations; the Nobel Prize was not awarded between 1940 and 1942 due to the outbreak of World War II. Between 1901 and 2015, the Nobel Prizes and the Nobel Memorial Prize in Economic Sciences were awarded 573 times to 900 people and organizations; with some receiving the Nobel Prize more than once, this makes a total of 870 individuals and 23 organizations. Four Nobel laureates were not permitted by their governments to accept the Nobel Prize. Six laureates have received more than one prize; of the six, the UNHCR has been awarded the Nobel Peace Prize twice, and the Nobel Prize in Physics was awarded to John Bardeen twice. Two laureates have been awarded twice but not in the same field: Marie Curie and Linus Pauling. Among the 870 Nobel laureates, 48 have been women. The first woman to receive a Nobel Prize was Marie Curie, who was also the first person to be awarded two Nobel Prizes, the second award being the Nobel Prize in Chemistry, given in 1911. 
A In 1938 and 1939, the government of Germany did not allow three German Nobel nominees to accept their Nobel Prizes.
The three were Richard Kuhn, Nobel laureate in Chemistry in 1938; Adolf Butenandt, Nobel laureate in Chemistry in 1939; and Gerhard Domagk, Nobel laureate in Physiology or Medicine in 1939. They were later awarded the Nobel Prize diploma and medal, but not the money. B In 1948, the Nobel Peace Prize was not awarded; the Nobel Foundation's website suggests that it would have been awarded to Mohandas Karamchand Gandhi, but due to his assassination earlier that year, it was left unassigned in his honor. C In 1958, Russian-born Boris Pasternak, under pressure from the government of the Soviet Union, was forced to decline the Nobel Prize in Literature. D In 1964, Jean-Paul Sartre refused to accept the Nobel Prize in Literature. E In 1973, Lê Ðức Thọ declined the Nobel Peace Prize; his reason was that he felt he did not deserve it, because although he had helped negotiate the Paris Peace Accords, peace had not yet been established in Vietnam. F In 2010, Liu Xiaobo was unable to receive the Nobel Peace Prize, as he had been sentenced to 11 years of imprisonment by the Chinese authorities
Nobel laureate
–
Nobel laureates of 2012
Alvin E. Roth,
Brian Kobilka,
Robert J. Lefkowitz,
David J. Wineland, and
Serge Haroche during the ceremony
Nobel laureate
–
Nobel laureates receive a gold medal together with a diploma and (as of 2012) 8 million
SEK (roughly US$1.2 million, €0.93 million).
36.
David H. Hubel
–
David Hunter Hubel was a Canadian neurophysiologist noted for his studies of the structure and function of the visual cortex. He was co-recipient with Torsten Wiesel of the 1981 Nobel Prize in Physiology or Medicine. For much of his career, Hubel was the John Franklin Enders University Professor of Neurobiology at Harvard Medical School, and in 1978 Hubel and Wiesel were awarded the Louisa Gross Horwitz Prize from Columbia University. Hubel was born in Windsor, Ontario, Canada, to American parents in 1926. His paternal grandfather emigrated as a child to the United States from the Bavarian town of Nördlingen; in 1929, his family moved to Montreal, where he spent his formative years. His father was an engineer, and Hubel developed a keen interest in science from childhood, conducting many experiments in chemistry. He studied mathematics and physics at McGill University, and then entered medical school there. In 1954, he moved to the United States to work at Johns Hopkins University School of Medicine as an assistant resident in Neurology. He was later drafted by the army and served at the Walter Reed Army Institute of Research, where he began recording from the primary visual cortex of sleeping and awake cats. At WRAIR, he invented the modern metal microelectrode out of Stoner-Mudge lacquer and tungsten. In 1958, he moved to Johns Hopkins and began his collaborations with Wiesel, discovering orientation selectivity and columnar organization in the visual cortex. One year later, he joined the faculty of Harvard University. In 1981, Hubel became a founding member of the World Cultural Council, and from 1988 to 1989 he was the president of the Society for Neuroscience. The Hubel and Wiesel experiments greatly expanded the scientific knowledge of sensory processing. 
The partnership lasted over twenty years and became known as one of the most prominent research pairings in science. In one experiment, done in 1959, they inserted a microelectrode into the primary visual cortex of an anesthetized cat. They then projected patterns of light and dark on a screen in front of the cat, and found that some neurons fired rapidly when presented with lines at one angle, while others responded best to another angle. Some of these neurons responded to light patterns and dark patterns differently; Hubel and Wiesel called these neurons simple cells. These studies showed how the visual system constructs complex representations of visual information from simple stimulus features. This has important implications for the understanding of deprivation amblyopia, a type of vision loss due to unilateral visual deprivation during the so-called critical period. Kittens deprived in this way also did not develop areas receiving input from both eyes, a feature needed for binocular vision. Hubel and Wiesel's experiments showed that ocular dominance develops irreversibly early in childhood development; these studies opened the door for the understanding and treatment of childhood cataracts and strabismus. They were also important in the study of cortical plasticity
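The simple cells described above are often caricatured as linear filters followed by a firing threshold. A toy numpy sketch (the receptive-field weights below are hypothetical, chosen only to illustrate the idea, not taken from Hubel and Wiesel's data) shows how a cell tuned to a vertical bar responds strongly to its preferred orientation and not at all to the orthogonal one:

```python
import numpy as np

# A toy "simple cell": a 5x5 linear receptive field tuned to a vertical bar,
# with an excitatory center column and inhibitory flanks (illustrative values).
rf = np.array([
    [-1, 0, 2, 0, -1],
    [-1, 0, 2, 0, -1],
    [-1, 0, 2, 0, -1],
    [-1, 0, 2, 0, -1],
    [-1, 0, 2, 0, -1],
], dtype=float)

vertical_bar = np.zeros((5, 5)); vertical_bar[:, 2] = 1.0
horizontal_bar = np.zeros((5, 5)); horizontal_bar[2, :] = 1.0

def response(stimulus):
    # Linear summation over the receptive field, then rectification:
    # the cell stays silent when the summed drive is below zero.
    drive = float(np.sum(rf * stimulus))
    return max(drive, 0.0)

print(response(vertical_bar))    # strong response to the preferred orientation
print(response(horizontal_bar))  # no response to the orthogonal orientation
```

The excitatory column overlaps the vertical bar perfectly, while a horizontal bar drives excitatory and inhibitory regions equally and cancels out — a crude analogue of orientation selectivity.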
David H. Hubel
–
David Hubel
37.
Primary visual cortex
–
The visual cortex of the brain is the part of the cerebral cortex that plays an important role in processing visual information. It is located in the occipital lobe, at the back of the skull. Visual information coming from the eye goes through the lateral geniculate nucleus, which is located in the thalamus, before reaching the cortex. The part of the cortex that receives the sensory inputs from the thalamus is the primary visual cortex, also known as visual area one (V1); the extrastriate areas consist of visual areas two, three, four, and five. The primary visual cortex is located in and around the calcarine fissure in the occipital lobe. Each hemisphere's V1 receives information directly from its ipsilateral lateral geniculate nucleus, which receives signals from the contralateral visual hemifield. Neurons in the visual cortex fire action potentials when visual stimuli appear within their receptive field. By definition, the receptive field is the region within the entire visual field that elicits an action potential. But any given neuron may respond best to a subset of stimuli within its receptive field; this property is called neuronal tuning. In the earlier visual areas, neurons have simpler tuning: for example, a neuron in V1 may fire to any vertical stimulus in its receptive field. In the higher visual areas, neurons have complex tuning: for example, in the inferior temporal cortex, a neuron may fire only when a certain face appears in its receptive field. The visual cortex receives its blood supply primarily from the calcarine branch of the posterior cerebral artery. One recent discovery concerning human V1 is that signals measured by fMRI show very large attentional modulation. This result is consistent with another recent electrophysiology study. Other current work on V1 seeks to fully characterize its tuning properties, and to use it as a model area for the canonical cortical circuit. Lesions to primary visual cortex lead to a scotoma, or hole in the visual field. 
Note that patients with scotomas are often able to make use of visual information presented to their scotomas. Each V1 transmits information along two pathways, called the ventral stream and the dorsal stream. The ventral stream begins with V1 and goes through visual area V2, then through visual area V4. The ventral stream, sometimes called the "what pathway", is associated with form recognition and object representation
Primary visual cortex
–
Micrograph showing the visual cortex (pink). The
pia mater and
arachnoid mater including
blood vessels are seen at the top of the image. Subcortical
white matter (blue) is seen at the bottom of the image.
HE-LFB stain.
Primary visual cortex
–
View of the brain from behind. Red = Brodmann area 17 (primary visual cortex); orange = area 18; yellow = area 19
38.
CMOS
–
Complementary metal–oxide–semiconductor, abbreviated as CMOS /ˈsiːmɒs/, is a technology for constructing integrated circuits. CMOS technology is used in microprocessors, microcontrollers, and static RAM; it is also used for several analog circuits such as image sensors, data converters, and highly integrated transceivers for many types of communication. In 1963, while working for Fairchild Semiconductor, Frank Wanlass patented CMOS. CMOS is also sometimes referred to as complementary-symmetry metal–oxide–semiconductor. Two important characteristics of CMOS devices are high noise immunity and low static power consumption: since one transistor of the complementary pair is always off, the series combination draws significant power only momentarily during switching between on and off states. CMOS also allows a high density of logic functions on a chip, and it was primarily for this reason that CMOS became the most widely used technology implemented in VLSI chips. As for the gate material, aluminium was once used, but now the material is polysilicon. Other metal gates have made a comeback with the advent of high-k dielectric materials in the CMOS process, as announced by IBM and Intel for the 45 nanometer node and beyond. CMOS refers to both a style of digital circuitry design and the family of processes used to implement that circuitry on integrated circuits. CMOS circuitry dissipates less power than logic families with resistive loads; since this advantage has increased and grown more important over time, CMOS processes and their variants have come to dominate, and thus the vast majority of modern integrated circuit manufacturing is on CMOS processes. As of 2010, the CPUs with the best performance per watt each year have been CMOS static logic since 1976. CMOS circuits use a combination of p-type and n-type metal–oxide–semiconductor field-effect transistors (MOSFETs) to implement logic gates and other digital circuits. 
CMOS uses only enhancement-mode MOSFETs. CMOS circuits are constructed in such a way that all PMOS transistors must have an input from either the voltage source or another PMOS transistor; similarly, all NMOS transistors must have an input from either ground or another NMOS transistor. The composition of a PMOS transistor creates low resistance between its source and drain contacts when a low gate voltage is applied and high resistance when a high gate voltage is applied; conversely, the composition of an NMOS transistor creates high resistance between source and drain when a low gate voltage is applied and low resistance when a high gate voltage is applied. CMOS accomplishes current reduction by complementing every nMOSFET with a pMOSFET: a high voltage on the gates will cause the nMOSFET to conduct and the pMOSFET not to conduct, while a low voltage on the gates causes the reverse. This arrangement greatly reduces power consumption and heat generation. However, during the switching time, both MOSFETs conduct briefly as the gate voltage goes from one state to another; this induces a brief spike in power consumption and becomes a serious issue at high frequencies. The image on the right shows what happens when an input is connected to both a PMOS transistor and an NMOS transistor: when the voltage of input A is low, the NMOS transistor's channel is in a high-resistance state, limiting the current that can flow from Q to ground, while the PMOS transistor's channel is in a low-resistance state and much more current can flow from the supply to the output
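The complementary switching behavior described above can be modeled at the logic level. A minimal Python sketch (treating each transistor as an ideal switch, ignoring the brief switching overlap) shows why a CMOS inverter has no static current path from supply to ground:

```python
def cmos_inverter(a):
    """Model the complementary pair of an inverter as ideal switches.

    PMOS conducts on a low gate voltage and pulls the output to the supply;
    NMOS conducts on a high gate voltage and pulls the output to ground.
    In steady state exactly one of the two conducts."""
    pmos_on = not a   # PMOS: low gate -> low source-drain resistance
    nmos_on = a       # NMOS: high gate -> low source-drain resistance
    # Complementarity: never both on, so no static supply-to-ground path.
    assert pmos_on != nmos_on
    return pmos_on    # output is high iff the pull-up network conducts

print(cmos_inverter(False))  # True  (input low  -> output high)
print(cmos_inverter(True))   # False (input high -> output low)
```

The internal assertion is the point of the sketch: for any steady input, exactly one of the pull-up and pull-down networks conducts, which is why static power draw is negligible.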
CMOS
–
CMOS inverter (
NOT logic gate)
39.
Nanodevice
–
A molecular machine, or nanomachine, refers to any discrete number of molecular components that produce quasi-mechanical movements in response to specific stimuli. The expression is also more generally applied to molecules that simply mimic functions that occur at the macroscopic level. The term is common in nanotechnology, where a number of highly complex molecular machines have been proposed that are aimed at the goal of constructing a molecular assembler. Molecular machines can be divided into two categories: synthetic and biological. The 2016 Nobel Prize in Chemistry was awarded to Jean-Pierre Sauvage, Sir J. Fraser Stoddart, and Bernard L. Feringa for the design and synthesis of molecular machines. From a synthetic perspective, there are two important types of molecular machines: molecular switches and molecular motors. The major difference between the two systems is that a switch influences a system as a function of state, whereas a motor influences a system as a function of trajectory. A switch may appear to undergo translational motion, but returning a switch to its original position undoes any mechanical effect; furthermore, switches cannot use chemical energy to repetitively and progressively drive a system away from equilibrium, whereas a motor can. A wide variety of simple molecular machines have been synthesized by chemists. They can consist of a single molecule; however, they are often constructed from mechanically interlocked molecular architectures. Carbon nanotube nanomotors have also been produced. Molecular motors are molecules that are capable of unidirectional rotational motion powered by external energy input; a number of such machines have been synthesized, powered by light or by reaction with other molecules. A molecular propeller is a molecule that can propel fluids when rotated; it has several molecular-scale blades attached at a certain pitch angle around the circumference of a nanoscale shaft. A molecular switch is a molecule that can be shifted between two or more stable states. 
The molecules may be shifted between the states in response to changes in, e.g., pH, light, or temperature. A molecular shuttle is a molecule capable of shuttling molecules or ions from one location to another; a common molecular shuttle consists of a rotaxane where the macrocycle can move between two sites or stations along the dumbbell backbone. Molecular tweezers are host molecules capable of holding items between their two arms; examples of molecular tweezers have been reported that are constructed from DNA and are considered DNA machines. A molecular sensor is a molecule that interacts with an analyte to produce a detectable change. Molecular sensors combine molecular recognition with some form of reporter, so the presence of the analyte can be observed. A molecular logic gate is a molecule that performs a logical operation on one or more logic inputs and produces a single logic output; unlike a molecular sensor, the logic gate will only output when a particular combination of inputs is present
Nanodevice
–
Comparison of Nanomaterials Sizes
Nanodevice
–
Image of
reconstruction on a clean
Gold (
100) surface, as visualized using
scanning tunneling microscopy. The positions of the individual atoms composing the surface are visible.
Nanodevice
–
Graphical representation of a
rotaxane, useful as a
molecular switch.
Nanodevice
–
This device transfers energy from nano-thin layers of
quantum wells to
nanocrystals above them, causing the nanocrystals to emit visible light.
40.
Convolution
–
It has applications that include probability, statistics, computer vision, natural language processing, image and signal processing, engineering, and differential equations. The convolution can be defined for functions on spaces other than Euclidean space. For example, periodic functions, such as the discrete-time Fourier transform, can be defined on a circle and convolved by periodic convolution; a discrete convolution can be defined for functions on the set of integers. Computing the inverse of the convolution operation is known as deconvolution. The convolution of f and g is written f ∗ g, using an asterisk or star, and is defined as the integral of the product of the two functions after one is reversed and shifted:

(f ∗ g)(t) = ∫−∞∞ f(τ) g(t − τ) dτ.

As such, it is a kind of integral transform. While the symbol t is used above, it need not represent the time domain; but in that context, the convolution formula can be described as a weighted average of the function f(τ) at the moment t, where the weighting is given by g(−τ) simply shifted by amount t. As t changes, the weighting function emphasizes different parts of the input function. For the multi-dimensional formulation of convolution, see Domain of definition. A primarily engineering convention that one sees is

f(t) ∗ g(t) ≜ ∫−∞∞ f(τ) g(t − τ) dτ,

which has to be interpreted carefully. For instance, f(t) ∗ g(t − t0) is equivalent to (f ∗ g)(t − t0), but f(t − t0) ∗ g(t − t0) is in fact equivalent to (f ∗ g)(t − 2t0). Convolution describes the output of an important class of operations known as linear time-invariant (LTI) systems; see LTI system theory for a derivation of convolution as the result of LTI constraints. In terms of the Fourier transforms of the input and output of an LTI operation, no new frequency components are created; the existing ones are only modified. In other words, the output transform is the pointwise product of the input transform with a third transform. See the convolution theorem for a derivation of that property of convolution; conversely, convolution can be derived as the inverse Fourier transform of the pointwise product of two Fourier transforms. 
One of the earliest uses of the convolution integral appeared in d'Alembert's derivation of Taylor's theorem in Recherches sur différents points importants du système du monde; soon thereafter, convolution operations appear in the works of Pierre Simon Laplace, Jean-Baptiste Joseph Fourier, Siméon Denis Poisson, and others. The term itself did not come into use until the 1950s or 60s; prior to that it was known as Faltung, composition product, or superposition integral. Yet it appears as early as 1903, though the definition is rather unfamiliar in older uses. The operation

∫0t φ(s) ψ(t − s) ds,  0 ≤ t < ∞,

is a particular case of the composition products considered by the Italian mathematician Vito Volterra in 1913. In the context of circular convolution, the summation of shifted copies involved is called a periodic summation of the function f
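The discrete analogue of the integral above replaces ∫ f(τ) g(t − τ) dτ with a sum over integer shifts. A self-contained Python sketch of that definition (a naive double loop, not an optimized implementation):

```python
def convolve(f, g):
    """Discrete convolution: (f*g)[n] = sum over m of f[m] * g[n - m]."""
    n_out = len(f) + len(g) - 1   # full output length
    out = [0.0] * n_out
    for n in range(n_out):
        for m in range(len(f)):
            if 0 <= n - m < len(g):   # g is reversed and shifted by n
                out[n] += f[m] * g[n - m]
    return out

# Convolving with [1, 1, 1] sums each point with its neighbors (an unnormalized
# 3-point moving average); the result matches numpy.convolve in 'full' mode.
print(convolve([1, 2, 3], [1, 1, 1]))  # [1.0, 3.0, 6.0, 5.0, 3.0]
```

Because the sum ranges symmetrically over the shifts, `convolve(f, g)` and `convolve(g, f)` give the same result — the commutativity of convolution.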
Convolution
–
Gaussian blur can be used in order to obtain a smooth grayscale digital image of a
halftone print
Convolution
–
Visual comparison of convolution,
cross-correlation and
autocorrelation.
41.
Back-propagation
–
Backpropagation, short for the backward propagation of errors, is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The algorithm repeats a two-phase cycle: propagation and weight update. When an input vector is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer. The output of the network is then compared to the desired output using a loss function. The error values are propagated backwards, starting from the output; backpropagation uses these error values to calculate the gradient of the loss function with respect to the weights in the network. In the second phase, this gradient is fed to the optimization method. Backpropagation is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer; it requires that the activation function used by the artificial neurons be differentiable. The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to its correct output. An example would be a classification task, where the input is an image of an animal and the correct output is the kind of animal. The goal of backpropagation is to compute this derivative, or gradient, of the loss with respect to the weights. For backpropagation, the loss function calculates the difference between the network's output and the desired output, after a training example has been propagated through the network. For backpropagation to work, two assumptions are made about the form of the error function. The first is that it can be written as an average E = (1/n) Σx Ex over error functions Ex for individual training examples x. In practice, training examples are placed in batches, and the error is averaged at the end of the batch. The second assumption is that it can be written as a function of the outputs from the neural network. Let y, y′ be vectors in Rn, and select an error function E measuring the difference between two outputs. 
The standard choice is E = ½ ∥y − y′∥², the square of the Euclidean distance; the factor of ½ conveniently cancels the exponent when the error function is subsequently differentiated. The error function over n training examples can then be written as an average, E = (1/2n) Σx ∥y(x) − y′(x)∥², and its partial derivative with respect to the outputs is ∂E/∂y′ = y′ − y. Let N be a network with e connections, m inputs, and n outputs. Below, x, x1, x2, … will denote vectors in Rm; y, y′, y1, y2, … vectors in Rn; and w, w1, w2, … vectors in Re. These are called inputs, outputs, and weights respectively. The neural network corresponds to a function y = fN(w, x) which, given a weight vector w, maps an input x to an output y
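The two-phase cycle can be sketched end to end for a tiny network. The numpy example below (layer sizes, learning rate, and random seed are arbitrary illustrative choices; the hidden activation is assumed to be a sigmoid, and the output layer is kept linear so the squared-error gradient is simply y − target) trains one hidden layer by the chain-rule gradients described above:

```python
import numpy as np
np.random.seed(0)

# A minimal two-layer network trained by backpropagation with the squared
# error E = 1/2 * ||y - target||^2 discussed in the text.
x = np.array([0.5, -0.2]); target = np.array([0.3])
W1 = np.random.randn(3, 2) * 0.5   # hidden-layer weights
W2 = np.random.randn(1, 3) * 0.5   # output-layer weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    # Forward pass: propagate the input layer by layer.
    h = sigmoid(W1 @ x)
    y = W2 @ h
    err = y - target                       # dE/dy for E = 1/2 ||y - target||^2
    # Backward pass: the chain rule propagates the error layer by layer.
    grad_W2 = np.outer(err, h)             # dE/dW2
    delta_h = (W2.T @ err) * h * (1 - h)   # sigmoid' = h * (1 - h)
    grad_W1 = np.outer(delta_h, x)         # dE/dW1
    # Weight update: hand the gradient to plain gradient descent.
    W2 -= 0.5 * grad_W2
    W1 -= 0.5 * grad_W1

print(float(y[0]))  # close to the target 0.3 after training
```

The key structural point is that `delta_h` reuses the output error through `W2.T`: each layer's gradient is computed from the error signal of the layer above, which is exactly the generalization of the delta rule the text describes.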
Back-propagation
–
Machine learning and
data mining
42.
Pattern recognition
–
Pattern recognition systems are in many cases trained from labeled training data, but when no labeled data are available, other algorithms can be used to discover previously unknown patterns. The terms pattern recognition, machine learning, data mining, and knowledge discovery in databases are hard to separate; in pattern recognition, there may be a greater interest in formalizing, explaining, and visualizing the pattern, while machine learning traditionally focuses on maximizing recognition rates. In machine learning, pattern recognition is the assignment of a label to a given input value. In statistics, discriminant analysis was introduced for this purpose in 1936. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes. However, pattern recognition is a more general problem that encompasses other types of output as well. Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to perform "most likely" matching of the inputs. This is opposed to pattern matching algorithms, which look for exact matches in the input with pre-existing patterns. Pattern recognition is generally categorized according to the type of learning procedure used to generate the output value. Supervised learning assumes that a set of training data has been provided, consisting of a set of instances that have been properly labeled by hand with the correct output; unsupervised learning, in contrast, assumes training data that has not been hand-labeled. A combination of the two that has recently been explored is semi-supervised learning, which uses a combination of labeled and unlabeled data. Note that in cases of unsupervised learning, there may be no training data at all to speak of. Note also that sometimes different terms are used to describe the corresponding supervised and unsupervised learning procedures for the same type of output. 
Note also that in some fields, the terminology is different: in community ecology, for example, "classification" refers to what is commonly known as clustering. The piece of input data for which an output value is generated is formally termed an instance. The instance is formally described by a vector of features, which together constitute a description of all known characteristics of the instance. Typically, features are either categorical, ordinal, integer-valued, or real-valued; often, categorical and ordinal data are grouped together, and likewise integer-valued and real-valued data. Furthermore, many algorithms work only in terms of categorical data. Many common pattern recognition algorithms are probabilistic in nature, in that they use statistical inference to find the best label for a given instance. Unlike other algorithms, which simply output a "best" label, probabilistic algorithms often also output a probability of the instance being described by the given label. In addition, many probabilistic algorithms output a list of the N best labels with associated probabilities, for some value of N; when the number of possible labels is fairly small, N may be set so that the probability of all possible labels is output. Probabilistic algorithms have many advantages over non-probabilistic algorithms: they output a confidence value associated with their choice, and correspondingly they can abstain when the confidence of choosing any particular output is too low
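As a toy illustration of such a probabilistic classifier (the two class means, the shared spread, and the confidence threshold below are all made-up values for illustration), the sketch classifies a single real-valued feature, reports a probability per label, and abstains when no label is probable enough:

```python
import math

# Class-conditional Gaussian likelihoods over one real-valued feature,
# with equal priors. Means and spread are hypothetical.
classes = {"small": 1.0, "large": 5.0}
SIGMA = 1.0

def posteriors(x):
    """Probability of each label for instance x (normalized likelihoods)."""
    likelihood = {c: math.exp(-((x - mu) ** 2) / (2 * SIGMA ** 2))
                  for c, mu in classes.items()}
    total = sum(likelihood.values())
    return {c: v / total for c, v in likelihood.items()}

def classify(x, threshold=0.9):
    """Return the most probable label, or None to abstain on low confidence."""
    probs = posteriors(x)
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None

print(classify(1.2))  # "small", with high confidence
print(classify(3.0))  # None: abstains exactly midway between the classes
```

The abstention branch is the advantage the text highlights: because the classifier outputs a probability rather than only a "best" label, it can refuse to answer near the decision boundary.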
Pattern recognition
–
The face was automatically detected by special software.
43.
Multi-dimensional
–
In physics and mathematics, the dimension of a mathematical space is informally defined as the minimum number of coordinates needed to specify any point within it. Thus a line has a dimension of one because only one coordinate is needed to specify a point on it – for example, the point at 5 on a number line. The inside of a cube, a cylinder, or a sphere is three-dimensional because three coordinates are needed to locate a point within these spaces. In classical mechanics, space and time are different categories and refer to absolute space and time. That conception of the world is a four-dimensional space, but not the one that was found necessary to describe electromagnetism. The four dimensions of spacetime consist of events that are not absolutely defined spatially and temporally; Minkowski space first approximates the universe without gravity, while the pseudo-Riemannian manifolds of general relativity describe spacetime with matter and gravity. Ten dimensions are used to describe string theory, and the state space of quantum mechanics is an infinite-dimensional function space. The concept of dimension is not restricted to physical objects: high-dimensional spaces frequently occur in mathematics and the sciences. They may be parameter spaces or configuration spaces, such as in Lagrangian or Hamiltonian mechanics. In mathematics, the dimension of an object is an intrinsic property, independent of the space in which the object is embedded; this intrinsic notion of dimension is one of the chief ways the mathematical notion of dimension differs from its common usages. The dimension of Euclidean n-space En is n. When trying to generalize to other types of spaces, one is faced with the question of what makes En n-dimensional. One answer is that to cover a fixed ball in En by small balls of radius ε, one needs on the order of ε−n such small balls. This observation leads to the definition of the Minkowski dimension and its more sophisticated variant, the Hausdorff dimension. For example, the boundary of a ball in En looks locally like En−1, and this leads to the notion of the inductive dimension. 
While these notions agree on En, they turn out to be different when one looks at more general spaces. A tesseract is an example of a four-dimensional object. The rest of this section examines some of the more important mathematical definitions of dimension. A complex number has a real part x and an imaginary part y, so a single complex coordinate system may be applied to an object having two real dimensions. Complex dimensions appear in the study of complex manifolds and algebraic varieties. The dimension of a vector space is the number of vectors in any basis for the space. This notion of dimension is often referred to as the Hamel dimension or algebraic dimension, to distinguish it from other notions of dimension
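The ε-ball characterization above can be checked numerically by box counting, a discrete sketch of the Minkowski dimension: sample a set densely, count the grid cells of side 1/s that its points occupy at two scales, and read off the growth exponent. The grid resolutions and sample counts below are arbitrary choices:

```python
import math

def box_count(points, s):
    """Number of grid cells of side 1/s containing at least one sample point."""
    return len({(int(x * s), int(y * s)) for x, y in points})

# Dense samples of two sets inside the unit square:
n = 100_000
segment = [(i / n, 0.0) for i in range(n)]                              # a 1-d segment
square = [((i % 400) / 400, (i // 400) / 400) for i in range(160_000)]  # a 2-d region

for name, pts in [("segment", segment), ("square", square)]:
    n1, n2 = box_count(pts, 4), box_count(pts, 16)
    # N(eps) ~ eps^(-dim), so the dimension is the slope on log-log axes.
    dim = math.log(n2 / n1) / math.log(16 / 4)
    print(name, round(dim, 2))  # 1.0 for the segment, 2.0 for the square
```

Quadrupling the grid resolution multiplies the occupied cells by 4 for the segment and by 16 for the square, recovering dimensions 1 and 2; the same procedure applied to a fractal set would yield a non-integer Minkowski dimension.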
Multi-dimensional
–
From left to right: the
square, the
cube and the
tesseract. The
two-dimensional (2d) square is bounded by
one-dimensional (1d) lines; the
three-dimensional (3d) cube by two-dimensional areas; and the
four-dimensional (4d) tesseract by three-dimensional volumes. For display on a two-dimensional surface such as a screen, the 3d cube and 4d tesseract require
projection.
44.
Long short-term memory
–
Long short-term memory (LSTM) is a recurrent neural network architecture proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. Relative insensitivity to gap length gives an advantage to LSTM over alternative RNNs and hidden Markov models; among other successes, LSTM achieved the best known results in natural language text compression and unsegmented connected handwriting recognition, and in 2009 won the ICDAR handwriting competition. As of 2016, major companies including Google, Apple, and Microsoft were using LSTM as a fundamental component in new products. An LSTM network is a neural network that contains LSTM units instead of, or in addition to, other network units. An LSTM unit is a recurrent network unit that excels at remembering values for either long or short durations of time. The key to this ability is that it uses no activation function within its recurrent components; thus, the stored value is not iteratively squashed over time. LSTM units are often implemented in blocks containing several LSTM units; this design is typical of deep multi-layered neural networks and facilitates implementations with parallel hardware. In the equations below, each variable in lowercase italics represents a vector with a size equal to the number of LSTM units in the block. LSTM blocks contain three or four gates that they use to control the flow of information into or out of their memory. These gates are implemented using the logistic function to compute a value between 0 and 1; multiplication by this value partially allows or denies information to flow into or out of the memory. For example, an input gate controls the extent to which a new value flows into the memory; a forget gate controls the extent to which a value remains in memory; and an output gate controls the extent to which the value in memory is used to compute the output activation of the block. The only weights in an LSTM block are used to direct the operation of the gates; these weights occur between the values that feed into the block and each of the gates. 
Thus, the LSTM block determines how to maintain its memory as a function of those values; LSTM blocks are usually trained with backpropagation through time. Activation functions: σ_g, originally a sigmoid function; σ_c, originally a hyperbolic tangent; and σ_h, originally a hyperbolic tangent, though the peephole LSTM paper suggests σ_h(x) = x. In the peephole variant below, h_{t−1} is not used; c_{t−1} is used instead in most places:

f_t = σ_g(W_f x_t + U_f c_{t−1} + b_f)
i_t = σ_g(W_i x_t + U_i c_{t−1} + b_i)
o_t = σ_g(W_o x_t + U_o c_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ σ_c(W_c x_t + b_c)
h_t = o_t ∘ σ_h(c_t)

A convolutional LSTM replaces these matrix multiplications with convolution operations
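A single forward step of the peephole equations above can be sketched in numpy. The weight shapes and random initialization below are illustrative only (a sketch of the forward pass, not a trained block), with σ_h taken as the identity per the peephole suggestion:

```python
import numpy as np
rng = np.random.default_rng(0)

# One forward step of a peephole LSTM block: the gates read the input x_t
# and the previous cell state c_{t-1}. Sizes and weights are illustrative.
n_in, n_cell = 3, 4
Wf, Wi, Wo, Wc = (rng.normal(size=(n_cell, n_in)) * 0.1 for _ in range(4))
Uf, Ui, Uo = (rng.normal(size=(n_cell, n_cell)) * 0.1 for _ in range(3))
bf, bi, bo, bc = (np.zeros(n_cell) for _ in range(4))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))   # sigma_g: values in (0, 1)

def lstm_step(x, c_prev):
    f = sigmoid(Wf @ x + Uf @ c_prev + bf)       # forget gate: keep old memory?
    i = sigmoid(Wi @ x + Ui @ c_prev + bi)       # input gate: admit new value?
    o = sigmoid(Wo @ x + Uo @ c_prev + bo)       # output gate: expose memory?
    c = f * c_prev + i * np.tanh(Wc @ x + bc)    # cell update; c itself is never squashed
    h = o * c                                    # sigma_h = identity, per the peephole paper
    return h, c

x = rng.normal(size=n_in)
h, c = lstm_step(x, np.zeros(n_cell))
print(h.shape, c.shape)  # (4,) (4,)
```

Note that the cell state `c` is updated purely by gated addition and never passed through a recurrent activation, which is the "no iterative squashing" property the text emphasizes.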
Long short-term memory
–
A simple LSTM gate with only input, output, and forget gates. LSTM gates may have more gates.
45.
Directed graph
–
In mathematics, and more specifically in graph theory, a directed graph is a graph that is a set of vertices connected by edges, where the edges have a direction associated with them. It differs from an ordinary or undirected graph in that the latter is defined in terms of unordered pairs of vertices. When multiple arrows between the same pair of vertices are allowed, these entities are addressed as directed multigraphs. On the other hand, the definition allows a directed graph to have loops; directed graphs without loops are addressed as simple directed graphs. Symmetric directed graphs are directed graphs where all edges are bidirected. Simple directed graphs are directed graphs that have no loops and no multiple arrows with the same source and target nodes; as already introduced, in the case of multiple arrows the entity is usually addressed as a directed multigraph, and some authors describe digraphs with loops as loop-digraphs. Complete directed graphs are directed graphs where each pair of vertices is joined by a symmetric pair of directed arrows; it follows that a complete digraph is symmetric. Oriented graphs are directed graphs having no bidirected edges; it follows that a directed graph is an oriented graph if and only if it has no 2-cycle. Tournaments are oriented graphs obtained by choosing a direction for each edge in an undirected complete graph. Directed acyclic graphs (DAGs) are directed graphs with no directed cycles. Multitrees are DAGs in which no two directed paths from a single starting vertex meet back at the same ending vertex. Oriented trees or polytrees are DAGs formed by orienting the edges of undirected acyclic graphs. Rooted trees are oriented trees in which all edges of the underlying undirected tree are directed away from the root. Rooted directed graphs are digraphs in which a vertex has been distinguished as the root. Control flow graphs are rooted digraphs used in computer science as a representation of the paths that might be traversed through a program during its execution. 
Signal-flow graphs are directed graphs in which nodes represent system variables and branches represent functional connections between pairs of nodes. Flow graphs are digraphs associated with a set of linear algebraic or differential equations. State diagrams are directed multigraphs that represent finite state machines. Representations of a quiver label its vertices with vector spaces and its edges compatibly with linear transformations between them, and transform via natural transformations. If a path leads from x to y, then y is said to be a successor of x and reachable from x; the arrow (y, x) is called the inverted arrow of (x, y). The adjacency matrix of a directed graph is unique up to identical permutation of rows and columns. Another matrix representation for a directed graph is its incidence matrix. For a vertex, the number of head ends adjacent to it is called the indegree of the vertex, and the number of tail ends adjacent to it is its outdegree; the indegree of v is denoted deg−(v) and its outdegree is denoted deg+(v)
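The degree definitions above are easy to check on a small example. The following sketch (with made-up arcs) counts head ends and tail ends to obtain the indegree and outdegree of each vertex:

```python
# Sketch: a directed graph as an arc list; indegree counts head ends,
# outdegree counts tail ends. The vertex names and arcs are illustrative.
from collections import defaultdict

edges = [("a", "b"), ("b", "c"), ("a", "c")]  # arcs x -> y

indeg, outdeg = defaultdict(int), defaultdict(int)
for x, y in edges:
    outdeg[x] += 1   # tail end at x
    indeg[y] += 1    # head end at y

print(indeg["c"], outdeg["a"])  # -> 2 2
```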
Directed graph
–
A directed graph.
46.
Graph theory
–
In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices, nodes, or points which are connected by edges, arcs, or lines. Graphs are one of the prime objects of study in discrete mathematics. Refer to the glossary of graph theory for basic definitions in graph theory; the following are some of the more basic ways of defining graphs and related mathematical structures. To avoid ambiguity, this type of graph may be described precisely as undirected. Other senses of graph stem from different conceptions of the edge set. In one more generalized notion, V is a set together with a relation of incidence that associates an edge with each two vertices. In another generalized notion, E is a multiset of unordered pairs of vertices; many authors call this type of object a multigraph or pseudograph. All of these variants and others are described more fully below. The vertices belonging to an edge are called the ends or end vertices of the edge; a vertex may exist in a graph and not belong to an edge. V and E are usually taken to be finite, and many of the well-known results are not true for infinite graphs because many of the arguments fail in the infinite case. The order of a graph is |V|, its number of vertices; the size of a graph is |E|, its number of edges. The degree or valency of a vertex is the number of edges that connect to it. For an edge {x, y}, graph theorists usually use the somewhat shorter notation xy. Graphs can be used to model many types of relations and processes in physical, biological, and social systems, and many practical problems can be represented by graphs. Emphasizing their application to real-world systems, the term network is sometimes defined to mean a graph in which attributes are associated with the nodes and/or edges. In computer science, graphs are used to represent networks of communication, data organization, computational devices, the flow of computation, etc. 
For instance, the structure of a website can be represented by a directed graph, in which the vertices represent web pages and directed edges represent links from one page to another. A similar approach can be taken to problems in media, travel, biology, computer chip design, and many other fields. The development of algorithms to handle graphs is therefore of major interest in computer science; the transformation of graphs is often formalized and represented by graph rewrite systems. Graph-theoretic methods, in various forms, have proven particularly useful in linguistics. Traditionally, syntax and compositional semantics follow tree-based structures, whose expressive power lies in the principle of compositionality
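The order, size, and degree definitions above can be sketched directly as sets. The vertex and edge data below are toy values, not from the text:

```python
# Sketch: an undirected graph as (V, E); order = |V|, size = |E|,
# and the degree of a vertex is the number of edges incident to it.
V = {1, 2, 3, 4}
E = {frozenset({1, 2}), frozenset({2, 3}), frozenset({2, 4})}

order, size = len(V), len(E)
degree = {v: sum(v in e for e in E) for v in V}

print(order, size, degree[2])  # -> 4 3 3
```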
Graph theory
–
A
drawing of a graph.
47.
Identity function
–
In mathematics, an identity function, also called an identity relation, identity map, or identity transformation, is a function that always returns the same value that was used as its argument. In equations, the function is given by f(x) = x. Formally, if M is a set, the identity function f on M is defined to be that function with domain and codomain M which satisfies f(x) = x for all elements x in M. In other words, the function value f(x) in M is always the same input element x of M. The identity function on M is clearly an injective function as well as a surjective function, so it is bijective. The identity function f on M is often denoted by idM. In set theory, where a function is defined as a particular kind of binary relation, the identity function is given by the identity relation, or diagonal of M. If f : M → N is any function, then we have f ∘ idM = f = idN ∘ f; in particular, idM is the identity element of the monoid of all functions from M to M. Since the identity element of a monoid is unique, one can alternatively define the identity function on M to be this identity element. Such a definition generalizes to the concept of an identity morphism in category theory. The identity function is a linear operator when applied to vector spaces. The identity function on the positive integers is a completely multiplicative function. In an n-dimensional vector space the identity function is represented by the identity matrix In. In a metric space the identity is trivially an isometry; an object without any symmetry has as its symmetry group the trivial group containing only this isometry. In a topological space, the identity function is always continuous
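The monoid law f ∘ idM = f = idN ∘ f can be checked mechanically. A minimal sketch, where the squaring function stands in for an arbitrary f:

```python
# Sketch: the identity function and the law f ∘ id = f = id ∘ f.
def identity(x):
    return x

def compose(f, g):
    # (f ∘ g)(x) = f(g(x))
    return lambda x: f(g(x))

f = lambda x: x * x  # any function standing in for f : M -> N

# Composing with the identity on either side leaves f unchanged.
assert compose(f, identity)(7) == f(7) == compose(identity, f)(7)
```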
Identity function
–
Graph of the identity function on the real numbers
48.
Graphical models
–
A graphical model or probabilistic graphical model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics (particularly Bayesian statistics) and machine learning. Two branches of graphical representations of distributions are commonly used, namely Bayesian networks and Markov random fields. Both families encompass the properties of factorization and independences, but they differ in the set of independences they can encode. If the network structure of the model is a directed acyclic graph, the model represents a factorization of the joint probability of all random variables. More precisely, if the variables are X1, …, Xn, then the joint probability satisfies P(X1, …, Xn) = ∏i=1…n P(Xi ∣ pai), where pai is the set of parents of node Xi. In other words, the joint distribution factors into a product of conditional distributions. In general, any two sets of nodes are conditionally independent given a third set if a criterion called d-separation holds in the graph. Local independences and global independences are equivalent in Bayesian networks. This type of graphical model is known as a directed graphical model, Bayesian network, or belief network. Classic machine learning models like hidden Markov models and neural networks can be considered special cases of Bayesian networks. A Markov random field, also known as a Markov network, is a model over an undirected graph. A graphical model with many repeated subunits can be represented with plate notation. A factor graph is an undirected bipartite graph connecting variables and factors; each factor represents a function over the variables it is connected to. This is a helpful representation for understanding and implementing belief propagation. A clique tree or junction tree is a tree of cliques. A chain graph is a graph which may have both directed and undirected edges, but without any directed cycles. 
Both directed acyclic graphs and undirected graphs are special cases of chain graphs. An ancestral graph is a further extension, having directed, bidirected and undirected edges. A conditional random field is a discriminative model specified over an undirected graph. A restricted Boltzmann machine is a generative model specified over an undirected graph
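The DAG factorization P(X1, …, Xn) = ∏ P(Xi ∣ pai) can be sketched with a small chain-structured network A → B → C over binary variables. The conditional probability tables below are made-up numbers for illustration:

```python
# Sketch: a Bayesian network joint probability as a product of
# conditionals, P(a, b, c) = P(a) * P(b|a) * P(c|b). CPTs are made up.
P_a = {True: 0.6, False: 0.4}
P_b_given_a = {True: {True: 0.7, False: 0.3},
               False: {True: 0.2, False: 0.8}}
P_c_given_b = {True: {True: 0.9, False: 0.1},
               False: {True: 0.5, False: 0.5}}

def joint(a, b, c):
    # each factor conditions only on the node's parents in the DAG
    return P_a[a] * P_b_given_a[a][b] * P_c_given_b[b][c]

# Summing over all assignments recovers a valid distribution (total = 1).
total = sum(joint(a, b, c)
            for a in (True, False)
            for b in (True, False)
            for c in (True, False))
```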
Graphical models
–
An example of a graphical model. Each arrow indicates a dependency. In this example: D depends on A, D depends on B, D depends on C, C depends on B, and C depends on D.
49.
ReLU
–
In the context of artificial neural networks, the rectifier is an activation function defined as f(x) = max(0, x), where x is the input to a neuron. This is also known as a ramp function and is analogous to half-wave rectification in electrical engineering. This activation function was first introduced to a dynamical network by Hahnloser et al. in a 2000 paper in Nature with strong biological motivations. It has been used in convolutional networks more effectively than the widely used logistic sigmoid and its more practical counterpart, the hyperbolic tangent. The rectifier is, as of 2015, the most popular activation function for deep neural networks; a unit employing the rectifier is also called a rectified linear unit (ReLU). A smooth approximation to the rectifier is the function f(x) = ln(1 + e^x), called the softplus function. The derivative of softplus is f′(x) = e^x / (1 + e^x) = 1 / (1 + e^−x), i.e. the logistic function. Rectified linear units find applications in computer vision and speech recognition using deep neural nets. Leaky ReLUs allow a small, non-zero gradient when the unit is not active: f(x) = x if x > 0, and 0.01x otherwise. Parametric ReLUs take this further by making the coefficient of leakage into a parameter that is learned along with the other neural network parameters: f(x) = x if x > 0, and ax otherwise. Note that for a ≤ 1, this is equivalent to f(x) = max(x, ax). Exponential linear units try to make the mean activations closer to zero, which speeds up learning; it has been shown that ELUs can obtain higher classification accuracy than ReLUs: f(x) = x if x ≥ 0, and a(e^x − 1) otherwise, where a is a hyper-parameter to be tuned. Advantages include biological plausibility (one-sided, compared to the antisymmetry of tanh); sparse activation (for example, in a randomly initialized network, only about half of hidden units are activated); efficient gradient propagation (no vanishing or exploding gradient problems); and efficient computation (only comparison, addition and multiplication). For the first time, in 2011, the use of the rectifier as a non-linearity was shown to enable training deep supervised neural networks without requiring unsupervised pre-training. 
Rectified linear units, compared to the sigmoid function or similar activation functions, allow for faster and more effective training of deep neural architectures on large datasets. Potential problems include: non-differentiability at zero (however, it is differentiable anywhere else, including points arbitrarily close to zero); non-zero centered output; unboundedness (activations could potentially blow up); and the dying ReLU problem: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron. In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity
ReLU
–
Plot of the rectifier (blue) and softplus (green) functions near x = 0.
50.
Mathematical optimization
–
In mathematics, computer science and operations research, mathematical optimization, also spelled mathematical optimisation, is the selection of a best element from some set of available alternatives. The generalization of optimization theory and techniques to other formulations comprises a large area of applied mathematics. Such a formulation is called an optimization problem or a mathematical programming problem. Many real-world and theoretical problems may be modeled in this general framework. Typically, A is some subset of the Euclidean space Rn, often specified by a set of constraints, equalities or inequalities that the members of A have to satisfy. The domain A of f is called the search space or the choice set. The function f is called, variously, an objective function, a loss function or cost function, a utility function or fitness function, or, in certain fields, an energy function. A feasible solution that minimizes the objective function is called an optimal solution. In mathematics, conventional optimization problems are usually stated in terms of minimization. Generally, unless both the objective function and the feasible region are convex in a minimization problem, there may be several local minima. While a local minimum is at least as good as any nearby points, a global minimum is at least as good as every feasible point. In a convex problem, if there is a local minimum that is interior, it is also the global minimum. Optimization problems are often expressed with special notation. Consider the following notation: min x ∈ R (x² + 1). This denotes the minimum value of the objective function x² + 1 when choosing x from the set of real numbers R. The minimum value in this case is 1, occurring at x = 0. Similarly, the notation max x ∈ R 2x asks for the maximum value of the objective function 2x. In this case, there is no such maximum as the objective function is unbounded. The notation arg min (or arg max) similarly represents the value of the argument x, in a given interval, at which the objective function attains its minimum (or maximum)
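The worked example min over x ∈ R of x² + 1 can also be approached numerically. A minimal sketch using plain gradient descent (the step size and iteration count are arbitrary choices):

```python
# Sketch: minimizing the example objective f(x) = x^2 + 1 by gradient
# descent. The unique minimizer is x = 0, with minimum value f(0) = 1.
def f(x):
    return x * x + 1.0

def df(x):
    return 2.0 * x  # derivative of the objective

x = 3.0                      # arbitrary starting point
for _ in range(200):
    x -= 0.1 * df(x)         # step against the gradient

# x is now very close to 0 and f(x) very close to the minimum value 1
```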
Mathematical optimization
–
Graph of a
paraboloid given by f(x, y) = −(x ² + y ²) + 4. The global
maximum at (0, 0, 4) is indicated by a red dot.
51.
Random variable
–
In probability and statistics, a random variable, random quantity, aleatory variable, or stochastic variable is a variable whose value depends on outcomes of a random phenomenon. It is common that these outcomes depend on physical variables that are not well understood. For example, when you toss a coin, the final outcome of heads or tails depends on the uncertain physics; which outcome will be observed is not certain. Of course the coin could get caught in a crack in the floor, but such a possibility is excluded from consideration. The domain of a random variable is the set of possible outcomes. In the case of the coin, there are two possible outcomes, namely heads or tails. Since one of these outcomes must occur, either the event that the coin lands heads or the event that the coin lands tails must have non-zero probability. A random variable is defined as a function that maps outcomes to numerical quantities, typically real numbers. In this sense, it is a procedure for assigning a numerical quantity to each physical outcome and, contrary to its name, this procedure itself is neither random nor variable. What is random is the physics that describes how the coin lands. A random variable's possible values might represent the possible outcomes of a yet-to-be-performed experiment, and they may also conceptually represent either the results of an objectively random process or the subjective randomness that results from incomplete knowledge of a quantity. The mathematics works the same regardless of the interpretation in use. A random variable has a probability distribution, which specifies the probability that its value falls in any given interval. Two random variables with the same probability distribution can still differ in terms of their associations with, or independence from, other random variables. The realizations of a random variable, that is, the results of randomly choosing values according to the variable's probability distribution function, are called random variates. 
The formal mathematical treatment of random variables is a topic in probability theory. In that context, a random variable is understood as a function defined on a sample space whose outputs are numerical values. A random variable X : Ω → E is a function from a set of possible outcomes Ω to a measurable space E. The technical axiomatic definition requires Ω to be a probability space. A random variable does not return a probability; the probability of a set of outcomes is given by the probability measure P with which Ω is equipped. Rather, X returns a numerical quantity of outcomes in Ω, e.g. the number of heads in a collection of coin flips
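The "random variable as a function on a sample space" view can be sketched concretely with two coin flips, where X maps each outcome to its number of heads:

```python
# Sketch: a random variable as a function from outcomes to numbers.
# Sample space Ω: ordered results of two fair coin flips; X = number of heads.
from itertools import product

omega = list(product(["H", "T"], repeat=2))   # 4 equally likely outcomes

def X(outcome):
    return outcome.count("H")                 # maps each outcome to a number

# The induced distribution P(X = k), assuming a fair coin.
dist = {k: sum(X(w) == k for w in omega) / len(omega) for k in (0, 1, 2)}
# dist == {0: 0.25, 1: 0.5, 2: 0.25}
```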
Random variable
–
If the sample space is the set of possible numbers rolled on two dice, and the random variable of interest is the sum S of the numbers on the two dice, then S is a discrete random variable whose distribution is described by the
probability mass function plotted as the height of picture columns here.
52.
Feedforward neural network
–
A feedforward neural network is an artificial neural network wherein connections between the units do not form a cycle. As such, it is different from recurrent neural networks. The feedforward neural network was the first and simplest type of artificial neural network devised. In this network, information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), and to the output nodes. There are no cycles or loops in the network. The simplest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. In this way it can be considered the simplest kind of feed-forward network. The sum of the products of the weights and the inputs is calculated in each node, and if the value is above some threshold the neuron fires; neurons with this kind of activation function are also called artificial neurons or linear threshold units. In the literature the term perceptron often refers to networks consisting of just one of these units. A similar neuron was described by Warren McCulloch and Walter Pitts in the 1940s. A perceptron can be created using any values for the activated and deactivated states as long as the threshold value lies between the two. Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It calculates the errors between calculated output and sample output data, and uses this to create an adjustment to the weights. This result can be found in Peter Auer, Harald Burgsteiner and Wolfgang Maass. A multi-layer neural network can compute a continuous output instead of a step function. A common choice is the logistic function f(x) = 1 / (1 + e^−x). The logistic function is also known as the sigmoid function. It has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated. This class of networks consists of multiple layers of computational units, usually interconnected in a feed-forward way. 
Each neuron in one layer has directed connections to the neurons of the subsequent layer. In many applications the units of these networks apply a sigmoid function as an activation function. Multi-layer perceptrons with just one hidden layer can approximate continuous functions arbitrarily closely; this result holds for a wide range of activation functions, e.g. for the sigmoidal functions. Multi-layer networks use a variety of learning techniques, the most popular being back-propagation. Here, the output values are compared with the correct answer to compute the value of some predefined error-function. By various techniques, the error is then fed back through the network
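The weighted-sum-then-sigmoid computation described above is short enough to sketch directly. The weights and inputs below are illustrative values, not from the text:

```python
# Sketch: forward pass of a single neuron with the logistic activation
# f(x) = 1 / (1 + e^-x). Weights, bias and inputs are made-up values.
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, weights, bias):
    # weighted sum of the inputs, then the sigmoid non-linearity
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return logistic(s)

y = forward([1.0, 0.0], [0.5, -0.5], 0.0)   # = logistic(0.5)
```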
Feedforward neural network
–
In a feed forward network information always moves one direction; it never goes backwards.
53.
Directed acyclic graph
–
In mathematics and computer science, a directed acyclic graph (DAG) is a finite directed graph with no directed cycles. Equivalently, a DAG is a directed graph that has a topological ordering, a sequence of the vertices such that every edge is directed from earlier to later in the sequence. DAGs can model many different kinds of information. For example, a spreadsheet can be modeled as a DAG, with a vertex for each cell and an edge whenever the formula in one cell uses the value of another; a topological ordering of this DAG can be used to update cell values when the spreadsheet changes. Similarly, topological orderings of DAGs can be used to order the compilation operations in a makefile. The program evaluation and review technique uses DAGs to model the milestones and activities of large human projects, and schedule these projects to use as little total time as possible. DAGs can also represent combinational logic blocks in electronic circuit design, and the operations in dataflow programming languages. More abstractly, the reachability relation in a DAG forms a partial order. The corresponding concept for undirected graphs is a forest, an undirected graph without cycles. Choosing an orientation for a forest produces a kind of directed acyclic graph called a polytree. However, there are other kinds of directed acyclic graph that are not formed by orienting the edges of an undirected acyclic graph. Moreover, every undirected graph has an acyclic orientation, an assignment of a direction for its edges that makes it into a directed acyclic graph. To emphasize that DAGs are not the same thing as directed versions of undirected acyclic graphs, some authors call them acyclic directed graphs or acyclic digraphs. A graph is formed by a collection of vertices and edges; in the case of a directed graph, each edge has an orientation, from one vertex to another vertex. A directed acyclic graph is a directed graph that has no directed cycles. A vertex v of a directed graph is said to be reachable from another vertex u when there exists a path that starts at u and ends at v. As a special case, every vertex is considered to be reachable from itself. A graph that has a topological ordering cannot have any cycles, because the edge into the earliest vertex of a cycle would have to be oriented the wrong way. Therefore, every graph with a topological ordering is acyclic. 
Conversely, every directed acyclic graph has a topological ordering. Therefore, this property can be used as an alternative definition of the directed acyclic graphs: they are exactly the graphs that have topological orderings. The reachability relationship in any directed acyclic graph can be formalized as a partial order ≤ on the vertices of the DAG. For example, the DAG with two edges a → b and b → c has the same reachability relation as the graph with three edges a → b, b → c, and a → c
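A topological ordering of a DAG can be computed with Kahn's algorithm, which repeatedly removes a vertex with no remaining incoming edges. A sketch on the text's example graph a → b, b → c, a → c:

```python
# Sketch: Kahn's algorithm for a topological ordering of a DAG.
from collections import deque

graph = {"a": ["b", "c"], "b": ["c"], "c": []}   # the example DAG above

indeg = {v: 0 for v in graph}
for v in graph:
    for w in graph[v]:
        indeg[w] += 1

queue = deque(v for v in graph if indeg[v] == 0)
order = []
while queue:
    v = queue.popleft()
    order.append(v)
    for w in graph[v]:         # "remove" v's outgoing edges
        indeg[w] -= 1
        if indeg[w] == 0:
            queue.append(w)

# order == ["a", "b", "c"]; producing fewer than len(graph) vertices
# would indicate a directed cycle (no topological ordering exists).
```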
Directed acyclic graph
–
An example of a directed acyclic graph
54.
Path (graph theory)
–
In graph theory, a path in a graph is a finite or infinite sequence of edges which connect a sequence of vertices which, by most definitions, are all distinct from one another. In a directed graph, a directed path is again a sequence of edges which connect a sequence of vertices, with the added restriction that the edges all be directed in the same direction. Paths are fundamental concepts of graph theory, described in the introductory sections of most graph theory texts; see e.g. Bondy and Murty, Gibbons, or Diestel. Korte et al. cover more advanced algorithmic topics concerning paths in graphs. A path is a trail in which all vertices are distinct; a trail is a walk in which all edges are distinct. If the graph is undirected, then the endpoints of e i are v i and v i+1; if the graph is directed, then e i is an arc from v i to v i+1. An infinite path is a sequence of the same type described here, but with no first or last vertex, and a semi-infinite path has a first vertex, v 0, but no last vertex. Most authors require all of the edges and vertices of a path to be distinct from one another. However, some authors do not make this requirement, and instead use the term simple path to refer to a path which contains no repeated vertices. A weighted graph associates a value (weight) with every edge in the graph; the weight of a path in a weighted graph is the sum of the weights of the traversed edges. Sometimes the words cost or length are used instead of weight. A graph is connected if there are paths containing each pair of vertices. A directed graph is strongly connected if there are oppositely oriented directed paths containing each pair of vertices. A path such that no graph edges connect two nonconsecutive path vertices is called an induced path. A path that includes every vertex of the graph is known as a Hamiltonian path. Two paths are vertex-independent if they do not have any internal vertex in common. Similarly, two paths are edge-independent if they do not have any edge in common. 
Two internally vertex-disjoint paths are edge-disjoint, but the converse is not necessarily true. The distance between two vertices in a graph is the length of a shortest path between them, if one exists; otherwise the distance is infinity. The diameter of a graph is the largest distance between pairs of vertices of the graph. Several algorithms exist to find shortest and longest paths in graphs. The Floyd–Warshall algorithm can be used to find the shortest paths between all pairs of vertices in weighted directed graphs
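The Floyd–Warshall algorithm mentioned above fits in a few lines. A sketch on a tiny weighted directed graph (the weights are made up):

```python
# Sketch: Floyd–Warshall all-pairs shortest paths on a weighted digraph.
INF = float("inf")

# dist[i][j] = weight of arc i -> j (INF if absent), 0 on the diagonal
dist = [
    [0,   3,   INF],
    [INF, 0,   2],
    [7,   INF, 0],
]

n = len(dist)
for k in range(n):                       # allow k as an intermediate vertex
    for i in range(n):
        for j in range(n):
            if dist[i][k] + dist[k][j] < dist[i][j]:
                dist[i][j] = dist[i][k] + dist[k][j]

# dist[0][2] is now 5 via the path 0 -> 1 -> 2.
```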
Path (graph theory)
–
A
hypercube graph showing a
Hamiltonian path in red, and a
longest induced path in bold black.
55.
Statistic
–
A statistic or sample statistic is a single measure of some attribute of a sample. It is calculated by applying a function to the values of the items of the sample; the term statistic is used both for the function and for the value of the function on a given sample. A statistic is distinct from a statistical parameter, which is often not computable because the population is too large to examine. However, a statistic, when used to estimate a population parameter, is called an estimator. For instance, the sample mean is a statistic that estimates the population mean. A single statistic can be used for multiple purposes; for example the sample mean can be used to describe a data set or to estimate the population mean. In calculating the mean of a sample, for example, the algorithm works by summing the data values observed in the sample and then dividing this sum by the number of data items. Other examples of statistics include the sample mean discussed in the example above, the sample median, the sample variance and sample standard deviation, and sample quantiles besides the median, e.g. quartiles and percentiles. A parameter can only be computed if the entire population can be observed without error. For example, the parameter may be the average height of 25-year-old men in North America. The heights of the members of a sample of 100 such men are measured; the average of these 100 numbers is a statistic. The average of the heights of all members of the population is not a statistic unless that has somehow also been ascertained. The average height that would be calculated using all of the heights of all 25-year-old North American men is a parameter, not a statistic. Important potential properties of statistics include completeness, consistency, sufficiency, unbiasedness, minimum mean square error, low variance, and robustness. Information of a statistic on model parameters can be defined in several ways. The most common is the Fisher information, which is defined on the statistical model induced by the statistic; the Kullback information measure can also be used. 
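Two of the statistics named above, the sample mean and the (unbiased) sample variance, can be sketched on a toy sample:

```python
# Sketch: computing the sample mean and the unbiased sample variance
# (dividing by n - 1) for a made-up sample.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(sample)
mean = sum(sample) / n
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)

# mean == 5.0; variance == 32 / 7
```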
Statistic
–
A diagram that deals with the difference between statistic and parameter.
56.
Gradient
–
In mathematics, the gradient is a multi-variable generalization of the derivative. While a derivative can be defined on functions of a single variable, for functions of several variables the gradient takes its place. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued. If f is a differentiable, real-valued function of several variables, its gradient is the vector whose components are the partial derivatives of f. Like the derivative, the gradient represents the slope of the tangent of the graph of the function. More precisely, the gradient points in the direction of the greatest rate of increase of the function. The components of the gradient in coordinates are the coefficients of the variables in the equation of the tangent space to the graph. The Jacobian is the generalization of the gradient for vector-valued functions of several variables; a further generalization for a function between Banach spaces is the Fréchet derivative. Consider a room in which the temperature is given by a scalar field, T. At each point in the room, the gradient of T at that point will show the direction in which the temperature rises most quickly; the magnitude of the gradient will determine how fast the temperature rises in that direction. Consider a surface whose height above sea level at a point is H. The gradient of H at a point is a vector pointing in the direction of the steepest slope or grade at that point. The steepness of the slope at that point is given by the magnitude of the gradient vector. The gradient can also be used to measure how a scalar field changes in other directions, rather than just the direction of greatest change. Suppose that the steepest slope on a hill is 40%. If a road goes directly up the hill, then the steepest slope on the road will also be 40%; if, instead, the road goes around the hill at an angle, then it will have a shallower slope. This observation can be stated as follows: if the hill height function H is differentiable, then the gradient of H dotted with a unit vector gives the slope of the hill in the direction of the vector. 
More precisely, when H is differentiable, the dot product of the gradient of H with a given unit vector is equal to the directional derivative of H in the direction of that unit vector. The gradient of a function f is denoted ∇f or ∇→f, where ∇ denotes the vector differential operator. The notation grad f is also commonly used for the gradient. The gradient of f is defined as the unique vector field whose dot product with any vector v at each point x is the directional derivative of f along v. That is, (∇f(x)) · v = Dᵥf(x)
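The component-wise view of the gradient (partial derivatives along each axis) can be checked numerically. A sketch with a made-up function f(x, y) = x² + 3y, whose analytic gradient is (2x, 3):

```python
# Sketch: estimating a gradient by central finite differences and
# comparing with the analytic partial derivatives.
def f(x, y):
    return x * x + 3.0 * y   # analytic gradient: (2x, 3)

def grad(f, x, y, h=1e-6):
    gx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # ∂f/∂x
    gy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # ∂f/∂y
    return gx, gy

gx, gy = grad(f, 2.0, 1.0)   # close to (4.0, 3.0)
```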
Gradient
–
Gradient of the 2-d function f (x, y) = xe −(x 2 + y 2) is plotted as blue arrows over the pseudocolor plot of the function.
Gradient
–
In the above two images, the values of the function are represented in black and white, black representing higher values, and its corresponding gradient is represented by blue arrows.
57.
Control theory
–
Control theory is an interdisciplinary branch of engineering and mathematics that deals with the behavior of dynamical systems with inputs, and how their behavior is modified by feedback. The usual objective of control theory is to control a system, often called the plant, so its output follows a desired control signal, called the reference. To do this a controller is designed, which monitors the output; the difference between actual and desired output, called the error signal, is applied as feedback to the input of the system, to bring the actual output closer to the reference. Some topics studied in control theory are stability, controllability and observability. Extensive use is usually made of a diagrammatic style known as the block diagram. A major application of the theory is in control systems engineering. As the general theory of feedback systems, control theory is useful wherever feedback occurs. A few examples are in physiology, electronics, climate modeling, machine design, ecosystems, navigation, neural networks, predator–prey interaction, and gene expression. Control systems may be thought of as having four functions: measure, compare, compute and correct. These four functions are completed by five elements: detector, transducer, transmitter, controller and final control element. The measuring function is completed by the detector, transducer and transmitter; in practical applications these three elements are typically contained in one unit. A standard example of a measuring unit is a resistance thermometer. Older controller units have been mechanical, as in a centrifugal governor or a carburetor. The correct function is completed with a final control element, which changes an input or output in the control system that affects the manipulated or controlled variable. Fundamentally, there are two types of control loops: open loop control and closed loop control. In open loop control, the control action from the controller is independent of the process output. 
A good example of this is a central heating boiler controlled only by a timer, so heat is applied for a constant time regardless of the result. In closed loop control, the control action from the controller is dependent on the process output. A closed loop controller therefore has a feedback loop which ensures the controller exerts a control action to give a process output the same as the reference input or set point
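The closed-loop idea above, applying a control action proportional to the error signal, can be sketched on a toy discrete-time plant. The dynamics and the gain are made-up values for illustration:

```python
# Sketch: closed-loop (proportional) control of a toy first-order plant.
def plant_step(state, u):
    # made-up first-order dynamics: the state decays and responds to input u
    return 0.9 * state + u

reference, state, gain = 10.0, 0.0, 0.5
for _ in range(100):
    error = reference - state          # error signal fed back to the input
    state = plant_step(state, gain * error)

# The state settles at the closed-loop equilibrium (about 8.33 here);
# the remaining offset from the reference is the steady-state error
# typical of purely proportional control.
```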
Control theory
–
Centrifugal governor in a
Boulton & Watt engine of 1788
58.
Dynamic programming
–
The technique of storing solutions to subproblems instead of recomputing them is called memoization. Dynamic programming algorithms are often used for optimization. A dynamic programming algorithm will examine the previously solved subproblems and will combine their solutions to give the best solution for the given problem. In comparison, a greedy algorithm treats the solution as some sequence of steps and picks the locally optimal choice at each step. Using a greedy algorithm does not guarantee an optimal solution, because picking locally optimal choices may result in a bad global solution. Fortunately, some greedy algorithms are proven to lead to the optimal solution. For example, suppose change is to be made with the fewest coins: if the coin denominations are 1, 4, 5, 15, 20 and the amount is 23, a greedy algorithm chooses 20, 1, 1, 1 (four coins), whereas the optimal solution is 15, 4, 4 (three coins). Dynamic programming is both a mathematical optimization method and a computer programming method. In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. While some decision problems cannot be taken apart this way, decisions that span several points in time do often break apart recursively; in the optimization literature this relationship is called the Bellman equation. In terms of mathematical optimization, dynamic programming usually refers to simplifying a decision by breaking it down into a sequence of decision steps over time. This is done by defining a sequence of value functions V1, …, Vn, each taking an argument y representing the state of the system at times i from 1 to n. The definition of Vn is the value obtained in state y at the last time n. The values Vi at earlier times i = n−1, n−2, …, 2, 1 can be found by working backwards, using a recursive relationship called the Bellman equation. Since Vi has already been calculated for the needed states, the above operation yields Vi−1 for those states. Finally, V1 at the initial state of the system is the value of the optimal solution. The optimal values of the decision variables can be recovered, one by one, by tracking back the calculations already performed. 
Dynamic programming is used in bioinformatics for tasks such as sequence alignment, protein folding, and RNA structure prediction. The first dynamic programming algorithms for protein-DNA binding were developed in the 1970s, independently by Charles DeLisi in the USA and by Georgii Gurskii. Recently these algorithms have become very popular in bioinformatics and computational biology, particularly in studies of nucleosome positioning and transcription factor binding
Dynamic programming
–
A model set of the Towers of Hanoi (with 8 disks)
Dynamic programming
–
Figure 1. Finding the shortest path in a graph using optimal substructure; a straight line indicates a single edge; a wavy line indicates a shortest path between the two vertices it connects (other nodes on these paths are not shown); the bold line is the overall shortest path from start to goal.
59.
Yu-Chi Ho
–
Yu-Chi Larry Ho is a Chinese-American mathematician, control theorist, and a professor at the School of Engineering and Applied Sciences, Harvard University. He is the co-author of Applied Optimal Control, and a researcher in differential games and pattern recognition. Yu-Chi Ho was born on March 1, 1934 in Shanghai, China and left home at the age of 15 in 1949 to complete his high school education in Hong Kong. In 1950, Ho was accepted to the Massachusetts Institute of Technology and came to the U.S., where he received a B.S. in Electrical Engineering in the summer of 1953, at the age of 19. Ho later reported that his interests were in mechanical engineering; his application to MIT mechanical engineering, however, was routed to the electrical engineering department by mistake. Ho continued his studies at MIT, receiving an M.S. degree in Electrical Engineering in 1955. After working for Bendix Aviation for three years, Ho moved to Harvard University in 1958, where he completed a Ph.D. in Applied Mathematics in 1961. His thesis, A Study of the Optimal Control of Dynamic Systems, was advised by Arthur E. Bryson, Jr. At Harvard, Ho held the title of Gordon McKay Professor of Systems Engineering, Emeritus, as well as the T. Jefferson Coolidge Chair of Applied Mathematics, Emeritus. He was also appointed to the Cockrell Family Regents Chair in Engineering at the University of Texas. In 2001, Ho retired from teaching duties at Harvard after 40 years of service, and he was appointed the Chair Professor and Chief Scientist at the Center for Intelligent and Networked Systems, Department of Automation, Tsinghua University. In his career, he has supervised 50 Ph.D. students at Harvard and 3 students at Tsinghua. Yu-Chi Ho is a member of the U.S. National Academy of Engineering and a member of the Chinese Academy of Engineering. He is also an IEEE Life Fellow and an INFORMS Inaugural Fellow. Ho co-founded Network Dynamics, Inc., a software firm specializing in industrial automation.
He is on the boards of several international journals and is the founding editor-in-chief of the international Journal on Discrete Event Dynamic Systems. Yu-Chi Ho was the founder and the first chair of the annual United Asian American Dinner of Massachusetts. His honors include a Guggenheim Fellowship, the IEEE Control Systems Science and Engineering Award, the Chiang Technology Achievement Award, and the Richard E. Bellman Control Heritage Award. He and his wife Sophia have three children and six grandchildren, and they currently reside in Lexington, Massachusetts. Since April 25, 2007, Ho has blogged on ScienceNet.cn, a blog network sponsored by the Chinese Academy of Engineering
Yu-Chi Ho
–
Yu-Chi "Larry" Ho
Yu-Chi Ho
–
Yu-Chi Ho accepting the American Automatic Control Council
Richard E. Bellman Control Heritage Award in 1999.
60.
Differentiable function
–
In calculus, a differentiable function of one real variable is a function whose derivative exists at each point in its domain. As a result, the graph of a differentiable function must have a non-vertical tangent line at each point in its domain and be relatively smooth. More generally, if x0 is a point in the domain of a function f and f is differentiable at x0, this means that the graph of f has a non-vertical tangent line at the point (x0, f(x0)). The function f may also be called locally linear at x0. If f is differentiable at a point x0, then f must also be continuous at x0. In particular, any differentiable function must be continuous at every point in its domain. The converse does not hold: a continuous function need not be differentiable. For example, a function with a bend, cusp, or vertical tangent may be continuous but fails to be differentiable at the location of the anomaly. Most functions that occur in practice have derivatives at all points or at almost every point. However, a result of Stefan Banach states that the set of functions that have a derivative at some point is a meagre set in the space of all continuous functions. Informally, this means that differentiable functions are very atypical among continuous functions. The first known example of a function that is continuous everywhere but differentiable nowhere is the Weierstrass function. A function f is said to be continuously differentiable if the derivative f′(x) exists and is itself a continuous function. Though the derivative of a differentiable function never has a jump discontinuity, it is possible for the derivative to have an essential discontinuity. For example, the function f(x) = x² sin(1/x) for x ≠ 0, with f(0) = 0, is differentiable at 0, since f′(0) = lim_{ε→0} (ε² sin(1/ε) − 0)/ε = lim_{ε→0} ε sin(1/ε) = 0 exists. However, for x ≠ 0, f′(x) = 2x sin(1/x) − cos(1/x), which has no limit as x → 0. Nevertheless, Darboux's theorem implies that the derivative of any function satisfies the conclusion of the intermediate value theorem. Sometimes continuously differentiable functions are said to be of class C1; a function is of class C2 if the first and second derivatives of the function both exist and are continuous.
More generally, a function is said to be of class Ck if the first k derivatives f′(x), f″(x), ..., f⁽ᵏ⁾(x) all exist and are continuous. If derivatives f⁽ⁿ⁾ exist for all positive integers n, the function is called smooth, or equivalently of class C∞. If all the partial derivatives of a function exist and are continuous in a neighborhood of a point, then the function is differentiable at that point; conversely, if a function is differentiable at x0, then all of the partial derivatives exist at x0. A similar formulation of the higher-dimensional derivative is provided by the fundamental increment lemma found in single-variable calculus. Note that existence of the partial derivatives does not in general guarantee that a function is differentiable at a point
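The behavior of the example f(x) = x² sin(1/x) can be checked numerically. The following is an illustrative sketch written for this text, not code from the article:

```python
import math

def f(x):
    # f(x) = x^2 sin(1/x) for x != 0, and f(0) = 0
    return x * x * math.sin(1.0 / x) if x != 0 else 0.0

def fprime(x):
    # derivative for x != 0: 2x sin(1/x) - cos(1/x)
    return 2 * x * math.sin(1.0 / x) - math.cos(1.0 / x)

# the difference quotient at 0 is h*sin(1/h), bounded by |h|, so f'(0) = 0 exists
for h in [1e-2, 1e-4, 1e-6]:
    print(h, f(h) / h)

# yet f' has no limit at 0: at x = 1/(k*pi), f'(x) is approximately -cos(k*pi),
# which alternates between -1 and +1 as k grows
for k in [1000, 1001]:
    print(fprime(1.0 / (k * math.pi)))
```

The sampled derivative values oscillate between roughly −1 and +1 arbitrarily close to 0, illustrating the essential discontinuity of f′ even though f′(0) itself exists.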
Differentiable function
–
A differentiable function
61.
Minimum bounding box
–
In geometry, the minimum or smallest bounding or enclosing box for a point set in N dimensions is the box with the smallest measure (area, volume, or hypervolume in higher dimensions) within which all the points lie. When other kinds of measure are used, the minimum box is usually named accordingly, e.g. the minimum-perimeter bounding box. The minimum bounding box of a point set is the same as the minimum bounding box of its convex hull. The term box (or hyperrectangle) comes from its usage in the Cartesian coordinate system; in the two-dimensional case it is called the minimum bounding rectangle. The axis-aligned minimum bounding box for a point set is its minimum bounding box subject to the constraint that the edges of the box are parallel to the coordinate axes. Bounding boxes are used, for example, in geometry and its applications when it is required to find intersections in a set of objects: checking overlap of bounding boxes first is worthwhile, since it is usually a less expensive operation than the check of the actual intersection. The arbitrarily oriented minimum bounding box is the minimum bounding box computed subject to no constraint on its orientation. A three-dimensional rotating calipers algorithm can find the minimum-volume arbitrarily-oriented bounding box of a three-dimensional point set in cubic time. See also: bounding sphere, bounding volume, minimum bounding rectangle
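The axis-aligned case reduces to per-coordinate minima and maxima, and the cheap box-overlap pre-check mentioned above is a per-axis interval test. A minimal sketch (names chosen here for illustration):

```python
def aabb(points):
    # axis-aligned minimum bounding box of an N-dimensional point set:
    # per coordinate, simply the min and max over all points
    dims = range(len(points[0]))
    mins = tuple(min(p[d] for p in points) for d in dims)
    maxs = tuple(max(p[d] for p in points) for d in dims)
    return mins, maxs

def boxes_intersect(a, b):
    # boxes overlap iff their projections overlap on every axis;
    # used as a cheap pre-check before an exact intersection test
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[d] <= bmax[d] and bmin[d] <= amax[d] for d in range(len(amin)))

pts = [(1, 2), (4, -1), (2, 5)]
print(aabb(pts))  # ((1, -1), (4, 5))
```

Computing the box is a single O(n) pass, which is why it serves as an inexpensive filter before exact geometric tests.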
Minimum bounding box
–
A series of geometric shapes enclosed by its minimum bounding box (in 2 dimensions)
62.
Gradient descent
–
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function. Gradient descent is also known as steepest descent, or the method of steepest descent. Gradient descent should not be confused with the method of steepest descent for approximating integrals. It follows that, if a_{n+1} = a_n − γ ∇F(a_n) for γ small enough, then F(a_n) ≥ F(a_{n+1}). In other words, the term γ ∇F(a_n) is subtracted from a_n because we want to move against the gradient, and we have F(a_0) ≥ F(a_1) ≥ F(a_2) ≥ ⋯, so hopefully the sequence converges to the desired local minimum. Note that the value of the step size γ is allowed to change at every iteration. With certain assumptions on the function F and particular choices of γ, for instance the Barzilai-Borwein choice γ_n = (a_n − a_{n−1})ᵀ(∇F(a_n) − ∇F(a_{n−1})) / ||∇F(a_n) − ∇F(a_{n−1})||², convergence to a local minimum can be guaranteed. When the function F is convex, all local minima are also global minima, so gradient descent can converge to the global solution. This process is illustrated in the adjacent picture. Here F is assumed to be defined on the plane, and the blue curves are the contour lines, that is, the regions on which the value of F is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the gradient at a point is orthogonal to the contour line going through that point. We see that gradient descent leads us to the bottom of the bowl. Gradient descent has problems with pathological functions such as the Rosenbrock function shown here. The Rosenbrock function has a narrow curved valley which contains the minimum, and the bottom of the valley is very flat. Because of the curved flat valley the optimization zig-zags slowly with small step sizes towards the minimum. The zig-zagging nature of the method is also evident below, where the gradient descent method is applied to F(x, y) = sin((1/2)x² − (1/4)y² + 3)·cos(2x + 1 − eʸ). For some of the examples, gradient descent is relatively slow close to the minimum: technically, its asymptotic rate of convergence is inferior to that of many other methods. For poorly conditioned convex problems, gradient descent increasingly zigzags as the gradients point nearly orthogonally to the shortest direction to a minimum point
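The slow zig-zagging behavior on the Rosenbrock function can be reproduced with a few lines of Python. This is a sketch with illustratively chosen starting point, step size, and iteration count, not a reference implementation:

```python
def rosenbrock(p):
    # classic banana-valley test function, minimum at (1, 1)
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2

def grad(p):
    # analytic gradient of the Rosenbrock function
    x, y = p
    return (-2 * (1 - x) - 400 * x * (y - x * x), 200 * (y - x * x))

# fixed-step gradient descent: a_{n+1} = a_n - gamma * grad F(a_n)
p = (-1.0, 1.0)
gamma = 1e-3
for _ in range(200_000):
    g = grad(p)
    p = (p[0] - gamma * g[0], p[1] - gamma * g[1])

print(p)  # creeps slowly along the curved valley toward (1, 1)
```

Even after 200,000 small steps the iterate is only now settling near (1, 1); a second-order or conjugate-direction method reaches comparable accuracy in far fewer iterations.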
Gradient descent
–
Illustration of gradient descent.
63.
Data clustering
–
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as an optimization problem. The appropriate clustering algorithm and parameter settings depend on the data set. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis. The notion of a cluster cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties; understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include: connectivity models, for example hierarchical clustering, which builds models based on distance connectivity; and centroid models, for example the k-means algorithm, which represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm. Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space. Subspace models: in biclustering, clusters are modeled with both cluster members and relevant attributes. Group models: some algorithms do not provide a refined model for their results and just provide the grouping information. Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster. Relaxations of the connectivity requirement are known as quasi-cliques, as in the HCS clustering algorithm. A clustering is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms
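The centroid model mentioned above can be illustrated with a minimal k-means (Lloyd's algorithm) sketch. The initialization and the toy data are choices made here; practical implementations use smarter seeding such as k-means++:

```python
def kmeans(points, k, iters=100):
    # simple deterministic initialization for this sketch: the first k points
    centroids = list(points[:k])
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # update step: each centroid moves to the mean of its cluster
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]
centroids, clusters = kmeans(pts, 2)
print(sorted(centroids))  # one centroid near (0.1, 0.1), the other near (5.1, 5.0)
```

Each cluster is represented by its single mean vector, which is exactly the "centroid model" notion of a cluster.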
Data clustering
–
Machine learning and
data mining
64.
Statistical distributions
–
For instance, if the random variable X is used to denote the outcome of a coin toss, then the probability distribution of X would take the value 0.5 for X = heads, and 0.5 for X = tails (assuming the coin is fair). In more technical terms, the probability distribution is a description of a random phenomenon in terms of the probabilities of events. Examples of random phenomena can include the results of an experiment or survey. A probability distribution is defined in terms of an underlying sample space, which is the set of all possible outcomes of the random phenomenon being observed. The sample space may be the set of real numbers or a higher-dimensional vector space, or it may be a list of non-numerical values, for example the two outcomes of a coin toss. Probability distributions are divided into two classes. A discrete probability distribution can be encoded by a discrete list of the probabilities of the outcomes; on the other hand, a continuous probability distribution is typically described by a probability density function. The normal distribution represents a commonly encountered continuous probability distribution. More complex experiments, such as those involving stochastic processes defined in continuous time, may demand the use of more general probability measures. A probability distribution whose sample space is the set of real numbers is called univariate. Important and commonly encountered univariate probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution. The multivariate normal distribution is a commonly encountered multivariate distribution. To define probability distributions for the simplest cases, one needs to distinguish between discrete and continuous random variables. For a continuous random variable, the probability of any single exact value is zero; for example, the probability that an object weighs exactly 500 g is zero. Continuous probability distributions can be described in several ways; the cumulative distribution function is the antiderivative of the probability density function, provided that the latter function exists.
As probability theory is used in diverse applications, terminology is not uniform. The following terms are used for probability distribution functions. Probability distribution: a table that displays the probabilities of the various outcomes in a sample; it could be called a normalized frequency distribution table, where all occurrences of outcomes sum to 1. Distribution function: a functional form of a frequency distribution table. Probability distribution function: a functional form of a probability distribution table
Statistical distributions
–
The
probability mass function (pmf) p (S) specifies the probability distribution for the sum S of counts from two
dice. For example, the figure shows that p (11) = 1/18. The pmf allows the computation of probabilities of events such as P (S > 9) = 1/12 + 1/18 + 1/36 = 1/6, and all other probabilities in the distribution.
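The pmf of the sum of two dice described in the figure can be computed exhaustively; this short Python sketch (written for this text) reproduces the probabilities quoted in the caption:

```python
from fractions import Fraction
from collections import Counter

# enumerate all 36 equally likely ordered outcomes of two fair dice
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))

# pmf of the sum S, kept as exact fractions
pmf = {s: Fraction(c, 36) for s, c in counts.items()}

print(pmf[11])                                   # 1/18
print(sum(p for s, p in pmf.items() if s > 9))   # 1/6
print(sum(pmf.values()))                         # 1
```

Using exact rational arithmetic makes it easy to confirm that the probabilities sum to 1, as any pmf must.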
65.
Bayesian spam filtering
–
Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify spam e-mail. Naive Bayes classifiers work by correlating the use of tokens with spam and non-spam e-mails; it is one of the oldest ways of doing spam filtering, with roots in the 1990s. The first known mail-filtering program to use a naive Bayes classifier was Jason Rennie's ifile program; the program was used to sort mail into folders. The first scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998, and that work was soon thereafter deployed in commercial spam filters. However, in 2002 Paul Graham greatly decreased the false positive rate. Variants of the basic technique have been implemented in a number of research works and commercial software products. Many modern mail clients implement Bayesian spam filtering; users can also install separate email filtering programs. CRM114, oft cited as a Bayesian filter, is not intended to use a Bayes filter in production. Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word Viagra in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. After training, the probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email contributes to the email's spam probability; this contribution is called the posterior probability and is computed using Bayes' theorem.
Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold, the filter will mark the email as spam. As in any other spam filtering technique, email marked as spam can then be automatically moved to a Junk email folder, or even deleted outright. Some software implements quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision. The initial training can usually be refined when wrong judgements from the software are identified; that allows the software to dynamically adapt to the ever-evolving nature of spam. Some spam filters combine the results of both Bayesian spam filtering and other heuristics, resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness. Bayesian email filters utilize Bayes' theorem. Let's suppose the suspected message contains the word replica. Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not know such facts; all it can do is compute probabilities. Assuming equal prior probabilities of spam and legitimate mail permits simplifying the general formula to Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H)), where Pr(S|W) is the probability that a message is spam given that it contains the word replica, Pr(W|S) is the probability that the word appears in spam messages, and Pr(W|H) is the probability that it appears in legitimate (ham) messages. This is functionally equivalent to asking, what percentage of occurrences of the word replica appear in spam messages?
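The simplified single-word formula above is easy to sketch in Python. The corpora here are tiny hypothetical examples invented for illustration:

```python
def spamicity(word, spam_msgs, ham_msgs):
    # Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H)), assuming Pr(S) = Pr(H) = 0.5
    pr_w_s = sum(word in m for m in spam_msgs) / len(spam_msgs)
    pr_w_h = sum(word in m for m in ham_msgs) / len(ham_msgs)
    return pr_w_s / (pr_w_s + pr_w_h)

# toy corpora of tokenized messages (hypothetical training data)
spam = [{"replica", "watches"}, {"replica", "cheap", "offer"}, {"offer", "now"}]
ham = [{"meeting", "today"}, {"replica", "museum"}, {"lunch"}, {"status", "report"}]

# "replica" appears in 2/3 of spam and 1/4 of ham:
print(spamicity("replica", spam, ham))  # (2/3) / (2/3 + 1/4) = 8/11 ~ 0.727
```

A real filter would combine the spamicities of many words (and smooth the estimates to avoid zero probabilities), but the per-word computation is exactly this ratio.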
Bayesian spam filtering
–
Machine learning and
data mining
66.
Markov decision process
–
Markov decision processes provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming. MDPs were known at least as early as the 1950s; a core body of research on Markov decision processes resulted from Ronald A. Howard's book published in 1960, Dynamic Programming and Markov Processes. They are used in a wide area of disciplines, including robotics, automated control, and economics. More precisely, a Markov decision process is a discrete-time stochastic control process. At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s′. The probability that the process moves into its new state s′ is influenced by the chosen action; specifically, it is given by the state transition function P_a(s, s′). Thus, the next state s′ depends on the current state s and the decision maker's action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions satisfy the Markov property. Markov decision processes are an extension of Markov chains; the difference is the addition of actions and rewards. Conversely, if only one action exists for each state and all rewards are the same, a Markov decision process reduces to a Markov chain. The core problem of MDPs is to find a policy for the decision maker: a function π that specifies the action π(s) to choose when in state s. Note that once a Markov decision process is combined with a policy in this way, this fixes the action for each state and the resulting combination behaves like a Markov chain. The goal is to maximize some cumulative function of the rewards, typically the expected discounted sum with discount factor γ, where γ is typically close to 1. Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of s only, as assumed above. MDPs can be solved by linear programming or dynamic programming; in what follows we present the latter approach.
Suppose we know the state transition function P and the reward function R. The standard family of algorithms to calculate the optimal policy requires storage for two arrays indexed by state: value V, which contains real values, and policy π, which contains actions. At the end of the algorithm, π will contain the solution. The algorithm has the following two kinds of steps, which are repeated in some order for all the states until no further changes take place. As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution. In value iteration, which is also called backward induction, the π function is not used; instead, the value of π(s) is calculated within V(s) whenever it is needed. Lloyd Shapley's 1953 paper on stochastic games included as a special case the value iteration method for MDPs
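Value iteration as described above can be sketched in a few lines. The toy two-state MDP and all function names are illustrative choices, not part of the article:

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-9):
    # Repeatedly apply the Bellman backup
    #   V(s) <- max_a sum_{s'} P(s, a, s') * (R(s, a, s') + gamma * V(s'))
    # until the largest change in V falls below tol.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items())
                for a in actions(s)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # extract a greedy policy from the converged values
    pi = {
        s: max(
            actions(s),
            key=lambda a: sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P(s, a).items()),
        )
        for s in states
    }
    return V, pi

# toy MDP: from A, action "go" moves to B; B absorbs; reward 1 for arriving in B
states = ["A", "B"]
actions = lambda s: ["stay", "go"]
P = lambda s, a: {"B": 1.0} if (s == "B" or a == "go") else {"A": 1.0}
R = lambda s, a, s2: 1.0 if s2 == "B" else 0.0

V, pi = value_iteration(states, actions, P, R)
print(V["A"], pi["A"])  # both states are worth 1/(1 - 0.9) = 10; from A, "go" is optimal
```

As in the text, the policy array is not maintained during the sweeps; it is read off from V only once the values have converged.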
Markov decision process
–
Example of a simple MDP with three states and two actions.
67.
Dimitri Bertsekas
–
Bertsekas was born in Greece and lived his childhood there. He is known for his research work, and for his sixteen textbooks and monographs in theoretical and algorithmic optimization and control. He is featured among the top 100 most cited computer science authors in the CiteSeer search engine academic database and digital library. In 1995, he co-founded a publishing company, Athena Scientific, that among others publishes most of his books. In the late 1990s Bertsekas developed a strong interest in digital photography; his photographs have been exhibited on several occasions at M.I.T. and can also be accessed from his website, http://web.mit.edu/dimitrib/www/home.html. See also an article describing his career and views on mathematical research. His awards include the John R. Ragazzini Education Award for outstanding contributions to education, and the 2015 Dantzig Prize from SIAM and the Mathematical Optimization Society. Some of his books have been published in multiple editions, and have been translated into various foreign languages. He has also written several widely referenced research monographs, which contain most of his research. These include: Stochastic Optimal Control: The Discrete-Time Case, a complex work establishing the measure-theoretic foundations of dynamic programming; Parallel and Distributed Computation: Numerical Methods, which among others established the theoretical structures for the analysis of distributed asynchronous algorithms; and Neuro-Dynamic Programming, which laid the foundations for suboptimal approximations of highly complex sequential decision-making problems
Dimitri Bertsekas
–
Dimitri P. Bertsekas
68.
Natural resource management
–
Natural resource management deals with managing the way in which people and natural landscapes interact. It brings together land use planning, water management, biodiversity conservation, and related disciplines. Natural resource management specifically focuses on a scientific and technical understanding of resources and ecology and the life-supporting capacity of those resources. Environmental management is similar to natural resource management; in academic contexts, the sociology of natural resources is closely related to it. This type of analysis coalesced in the 20th century with recognition that preservationist conservation strategies had not been effective in halting the decline of natural resources. A more integrated approach was implemented, recognising the intertwined social, cultural, economic and political aspects of resource management. A more holistic, national and even global form evolved, from the Brundtland Commission. In 2005 the government of New South Wales established a Standard for Quality Natural Resource Management, to improve the consistency of practice, based on an adaptive management approach. In the United States, the most active areas of natural resource management are wildlife management, often associated with ecotourism. In Australia, water sharing, such as the Murray Darling Basin Plan, and catchment management are also significant. Natural resource regimes can be categorized by who holds rights to the resources. State property: individuals or groups may be able to make use of the resources, but only with the permission of the state; national forests, national parks and military reservations are some US examples. Private property: any property owned by a defined individual or corporate entity. Both the benefits and duties of the resources fall to the owner; private land is the most common example. Common property: it is a private property of a group. The group may vary in size, nature and internal structure, e.g. indigenous neighbours of a village; some examples of common property are community forests. Non-property: there is no owner of these resources. Each potential user has equal ability to use them as they wish, and these areas are the most exploited.
It is said that "everybody's property is nobody's property"; an example is a lake fishery. Common land may exist without ownership, in which case in the UK it is vested in a local authority. Stakeholder analysis originated from business management practices and has been incorporated into natural resource management with ever-growing popularity. Stakeholder analysis in the context of natural resource management identifies distinctive interest groups affected in the utilisation and conservation of natural resources. There is no definitive definition of a stakeholder, as illustrated in the table below
Natural resource management
–
The
Tongass National Forest in
Alaska is managed by the
United States Forest Service
Natural resource management
–
Air
69.
Medicine
–
Medicine is the science and practice of the diagnosis, treatment, and prevention of disease. The word medicine is derived from Latin medicus, meaning "a physician". Medicine encompasses a variety of health care practices evolved to maintain and restore health by the prevention and treatment of illness. Medicine has existed for thousands of years, during most of which it was an art frequently having connections to the religious and philosophical beliefs of local culture. For example, a medicine man would apply herbs and say prayers for healing, or an ancient philosopher and physician would apply bloodletting according to the theories of humorism. In recent centuries, since the advent of modern science, most medicine has become a combination of art and science. While stitching technique for sutures is an art learned through practice, the knowledge of what happens at the cellular and molecular level in the tissues being stitched arises through science. Prescientific forms of medicine are now known as traditional medicine and folk medicine. They remain commonly used with, or instead of, scientific medicine and are thus called alternative medicine. For example, evidence on the effectiveness of acupuncture is variable and inconsistent for any condition; in contrast, treatments outside the bounds of safety and efficacy are termed quackery. Medical availability and clinical practice varies across the world due to differences in culture. In modern clinical practice, physicians personally assess patients in order to diagnose, treat, and prevent disease. The doctor-patient relationship typically begins with an examination of the patient's medical history and medical record, followed by a medical interview and a physical examination. Basic diagnostic medical devices are typically used. After examination for signs and interviewing for symptoms, the doctor may order medical tests, take a biopsy, or prescribe pharmaceutical drugs or other therapies.
Differential diagnosis methods help to rule out conditions based on the information provided. During the encounter, properly informing the patient of all relevant facts is an important part of the relationship and the development of trust. The medical encounter is then documented in the medical record, which is a legal document in many jurisdictions. Follow-ups may be shorter but follow the same general procedure. The diagnosis and treatment may take only a few minutes or a few weeks, depending upon the complexity of the issue. The components of the medical interview and encounter are: Chief complaint: the reason for the current medical visit. These are recorded in the patient's own words along with the duration of each symptom; also called chief concern or presenting complaint. History of present illness: the chronological order of events of symptoms. Distinguishable from the history of previous illness, often called past medical history
Medicine
–
Early Medicine Bottles
Medicine
Medicine
–
The Doctor, by
Sir Luke Fildes (1891)
Medicine
–
The Hospital of
Santa Maria della Scala, fresco by
Domenico di Bartolo, 1441–1442
70.
Secant method
–
In numerical analysis, the secant method is a root-finding algorithm that uses a succession of roots of secant lines to better approximate a root of a function f. The secant method can be thought of as a finite-difference approximation of Newton's method; however, the method was developed independently of Newton's method, and predates it by over 3,000 years. Starting with initial values x0 and x1, we construct a line through the points (x0, f(x0)) and (x1, f(x1)) and take its x-intercept as the next approximation x2; we continue this process, solving for x3, x4, etc., until we reach a sufficiently high level of precision. The iterates satisfy the recurrence x_n = x_{n−1} − f(x_{n−1})·(x_{n−1} − x_{n−2}) / (f(x_{n−1}) − f(x_{n−2})). The order of convergence is α, where α = (1 + √5)/2 ≈ 1.618 is the golden ratio; in particular, the convergence is superlinear, but not quite quadratic. This result only holds under some technical conditions, namely that f be twice continuously differentiable and the root in question be simple. If the initial values are not close enough to the root, there is no guarantee that the method converges. There is no general definition of "close enough", but the criterion has to do with how "wiggly" the function is on the interval. For example, if f is differentiable on that interval and there is a point where f′ = 0 on the interval, then the algorithm may not converge. The secant method does not require that the root remain bracketed like the bisection method does, and hence it does not always converge. The false position method uses the same formula as the secant method but keeps the root bracketed; this means that the false position method always converges. The secant method can be interpreted as a method in which the derivative is replaced by an approximation and is thus a quasi-Newton method. If we compare Newton's method with the secant method, we see that Newton's method converges faster. However, Newton's method requires the evaluation of both f and its derivative f′ at every step, while the secant method only requires the evaluation of f; therefore, the secant method may occasionally be faster in practice. If, however, we consider parallel processing for the evaluation of the derivative, Newton's method proves its worth, being faster in time, though still spending more steps.
Broyden's method is a generalization of the secant method to more than one dimension. The following graph shows the function f in red and the last secant line in bold blue; in the graph, the x-intercept of the secant line seems to be a good approximation of the root of f. Below, the secant method is applied to find a root of the function f(x) = x² − 612.
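The f(x) = x² − 612 example can be run with a short Python sketch of the method (the starting values and tolerance are illustrative choices):

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    # x_n = x_{n-1} - f(x_{n-1}) * (x_{n-1} - x_{n-2}) / (f(x_{n-1}) - f(x_{n-2}))
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:          # guard against division by zero near convergence
            return x1
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
    return x1

root = secant(lambda x: x * x - 612, 10.0, 30.0)
print(root)  # ~24.7386, the positive square root of 612
```

Only one new function evaluation is needed per iteration, which is the practical advantage over Newton's method noted above.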
Secant method
–
The first two iterations of the secant method. The red curve shows the function f and the blue lines are the secants. For this particular case, the secant method will not converge.
71.
Conjugate gradient method
–
In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. Large sparse systems often arise when numerically solving partial differential equations or optimization problems; the conjugate gradient method can also be used to solve unconstrained optimization problems such as energy minimization. It was mainly developed by Magnus Hestenes and Eduard Stiefel. The biconjugate gradient method provides a generalization to non-symmetric matrices, and various nonlinear conjugate gradient methods seek minima of nonlinear equations. Suppose we want to solve the system of linear equations Ax = b for the vector x, where the known n × n matrix A is symmetric, positive-definite, and real. We denote the unique solution of this system by x∗. We say that two nonzero vectors u and v are conjugate (with respect to A) if uᵀAv = 0. Being conjugate is a symmetric relation: if u is conjugate to v, then v is conjugate to u. Suppose that P = {p1, …, pn} is a set of n mutually conjugate vectors. This gives the following method for solving the equation Ax = b: find a sequence of n conjugate directions, and then compute the coefficients αk. If we choose the conjugate vectors pk carefully, then we may not need all of them to obtain a good approximation to the solution x∗. So, we want to regard the conjugate gradient method as an iterative method; this also allows us to approximately solve systems where n is so large that the direct method would take too much time. We denote the initial guess for x∗ by x0, and we can assume without loss of generality that x0 = 0. Starting with x0 we search for the solution, and in each iteration we need a metric to tell us whether we are closer to the solution x∗. This metric comes from the fact that the solution x∗ is also the unique minimizer of the quadratic function f(x) = ½xᵀAx − xᵀb. This suggests taking the first basis vector p0 to be the negative of the gradient of f at x = x0; the gradient of f equals Ax − b.
Starting with a guessed solution x0, this means we take p0 = b − Ax0. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method. Let rk be the residual at the kth step: rk = b − Axk. Note that rk is the negative gradient of f at x = xk, so the gradient descent method would move in the direction rk. Here, we insist that the directions pk be conjugate to each other, and we also require that the next search direction be built out of the current residual and all previous search directions, which is reasonable enough in practice.
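The iteration sketched above, searching along mutually conjugate directions built from successive residuals, might be implemented as follows (the tolerance and naming are our own choices):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, starting from x0 = 0."""
    n = len(b)
    x = np.zeros(n)
    r = b - A @ x            # residual r_k = b - A x_k (negative gradient of f)
    p = r.copy()             # first search direction p_0 = r_0
    rs_old = r @ r
    for _ in range(n):       # at most n steps in exact arithmetic
        Ap = A @ p
        alpha = rs_old / (p @ Ap)          # step length along p_k
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p      # next direction, conjugate to the others
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])     # a small symmetric positive-definite system
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

For this 2 × 2 system the method terminates in at most two steps, matching the caption below about convergence in at most n steps under exact arithmetic.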
Conjugate gradient method
–
A comparison of the convergence of
gradient descent with optimal step size (in green) and conjugate vector (in red) for minimizing a quadratic function associated with a given linear system. Conjugate gradient, assuming exact arithmetic, converges in at most n steps, where n is the size of the matrix of the system (here n = 2).
72.
Simulated annealing
–
Simulated annealing is a probabilistic technique for approximating the global optimum of a given function. Specifically, it is a metaheuristic to approximate global optimization in a large search space, and it is often used when the search space is discrete. The name and inspiration come from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals; both are attributes of the material that depend on its thermodynamic free energy. Heating and cooling the material affects both the temperature and the thermodynamic free energy. The approach was described by Armen G. Khachaturyan, Svetlana V. Semenovskaya and Boris K. Vainshtein in 1979; these authors used computer simulation mimicking the annealing and cooling of such a system to find its global minimum. The notion of slow cooling implemented in the simulated annealing algorithm is interpreted as a slow decrease in the probability of accepting worse solutions as the solution space is explored. The method is an adaptation of the Metropolis–Hastings algorithm, a Monte Carlo method to sample states of a thermodynamic system, devised by Rosenbluth and published by N. Metropolis et al. in 1953. Each state of the search space is analogous to a state of some physical system, and the function E to be minimized is analogous to the internal energy of the system in that state. The goal is to bring the system from an initial state to a state with lower energy. At each step, the SA heuristic considers some neighbouring state s′ of the current state s and probabilistically decides whether to move to s′ or stay at s; these probabilities ultimately lead the system to move to states of lower energy. Typically this step is repeated until the system reaches a state that is good enough for the application. Optimization of a solution involves evaluating the neighbours of a state of the problem; the well-defined way in which the states are altered to produce neighbouring states is called a "move", and different moves give different sets of neighbouring states. These moves usually result in minimal alterations of the last state.
States with a smaller energy are better than those with a greater energy. The probability function P must be positive even when e′ is greater than e; this feature prevents the method from becoming stuck at a local minimum that is worse than the global one. When T tends to zero, the probability P must tend to zero if e′ > e; for sufficiently small values of T, the system will then increasingly favor moves that go downhill, and avoid those that go uphill. With T = 0 the procedure reduces to the greedy algorithm, which makes only the downhill transitions.
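The acceptance rule described here, always taking downhill moves and taking uphill moves with a probability that shrinks with T, can be sketched as follows (the geometric cooling schedule, parameter values, and toy problem are illustrative assumptions, not prescribed by the text):

```python
import math, random

def simulated_annealing(energy, neighbour, s0, t0=1.0, cooling=0.995, steps=10000):
    """Minimise `energy` with the acceptance rule described in the text;
    the geometric cooling schedule and parameters are illustrative choices."""
    s, e = s0, energy(s0)
    best_s, best_e = s, e
    t = t0
    for _ in range(steps):
        s_new = neighbour(s)
        e_new = energy(s_new)
        # always accept downhill moves; accept uphill moves with
        # probability exp(-(e' - e) / T), which tends to 0 as T -> 0
        if e_new < e or random.random() < math.exp(-(e_new - e) / t):
            s, e = s_new, e_new
            if e < best_e:
                best_s, best_e = s, e
        t *= cooling  # slow cooling: worse solutions are accepted ever more rarely
    return best_s, best_e

# toy example: minimise (x - 3)^2 by local moves on the real line
random.seed(0)
x, fx = simulated_annealing(lambda v: (v - 3.0) ** 2,
                            lambda v: v + random.uniform(-0.5, 0.5),
                            s0=0.0)
```

At T = 0 the `if` reduces to `e_new < e`, i.e. the greedy algorithm mentioned above.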
Simulated annealing
–
Fast
Simulated annealing
–
Slow
73.
Expectation-maximization
–
These parameter estimates are then used to determine the distribution of the latent variables in the next E step. The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird and Donald Rubin; they pointed out that the method had been proposed many times in special circumstances by earlier authors. The Dempster–Laird–Rubin paper of 1977 generalized the method and sketched a convergence analysis for a wider class of problems, establishing the EM method as an important tool of statistical analysis. However, the convergence analysis of the Dempster–Laird–Rubin paper was flawed, and a correct convergence analysis was published by C. F. Wu. Wu's proof established the EM method's convergence outside of the exponential family. The EM algorithm is used to find maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations: that is, either missing values exist among the data, or the model can be formulated more simply by assuming the existence of further unobserved data points. In statistical models with latent variables, solving the likelihood equations directly is usually impossible; the EM algorithm proceeds from the observation that the following is a way to solve these two sets of equations numerically. In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have singularities in them, i.e. nonsensical maxima. Associated with each data point may be a vector of observations. The missing values Z are discrete, drawn from a fixed number of values. The parameters are continuous, and are of two kinds: parameters that are associated with all data points, and those associated with a specific value of a latent variable.
However, it is possible to apply EM to other sorts of models. This suggests an iterative algorithm for the case where both θ and Z are unknown. First, initialize the parameters θ to some random values. Second, compute the best value for Z given these parameter values. Third, use the just-computed values of Z to compute a better estimate for the parameters θ; parameters associated with a particular value of Z will use only those data points whose associated latent variable has that value. Finally, iterate the second and third steps until convergence. The algorithm as just described monotonically approaches a local minimum of the cost function, and is commonly called hard EM. The k-means algorithm is an example of this class of algorithms. An alternative is to compute, in the E step, a probability distribution over the values of Z rather than a single best value; the resulting algorithm is called soft EM, and it is the type of algorithm normally associated with EM. The counts used to compute these weighted averages are called soft counts; the probabilities computed for Z are posterior probabilities and are what is computed in the E step.
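As a concrete instance of soft EM, a two-component one-dimensional Gaussian mixture alternates posterior (soft-count) computation with weighted-average parameter re-estimation. A minimal illustrative sketch (the initialisation strategy and synthetic data are our own choices, not from the text):

```python
import numpy as np

def em_gmm_1d(x, n_iter=200):
    """Soft EM for a two-component 1-D Gaussian mixture."""
    w = np.array([0.5, 0.5])                 # mixture weights
    mu = np.array([x.min(), x.max()])        # crude initialisation
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E step: posterior probability (soft count) of each component per point
        pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * pdf
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: weighted-average re-estimates from the soft counts
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-4.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])
w, mu, var = em_gmm_1d(data)
```

Replacing the soft counts with a hard argmax assignment in the E step would give the hard-EM variant described above.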
74.
Particle swarm optimization
–
In computer science, particle swarm optimization (PSO) is a computational method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. This is expected to move the swarm toward the best solutions. The algorithm was later simplified, and it was observed to be performing optimization. The book by Kennedy and Eberhart describes many aspects of PSO. An extensive survey of PSO applications was made by Poli; more recently, a comprehensive review of theoretical and experimental works on PSO has been published by Bonyadi and Michalewicz. PSO is a metaheuristic, as it makes few or no assumptions about the problem being optimized; however, metaheuristics such as PSO do not guarantee that an optimal solution is ever found. A basic variant of the PSO algorithm works by having a population (swarm) of candidate solutions (particles). These particles are moved around in the search-space according to a few simple formulae. The movements of the particles are guided by their own best known position in the search-space as well as the entire swarm's best known position; when improved positions are discovered, these then come to guide the movements of the swarm. The process is repeated, and by doing so it is hoped, but not guaranteed, that a satisfactory solution will eventually be discovered. Formally, let f: ℝn → ℝ be the cost function which must be minimized. The gradient of f is not known. The goal is to find a solution a for which f(a) ≤ f(b) for all b in the search-space, which would mean a is the global minimum. Maximization can be performed by considering the function h = −f instead. Let S be the number of particles in the swarm, each having a position xi ∈ ℝn in the search-space and a velocity vi ∈ ℝn. Let pi be the best known position of particle i and let g be the best known position of the entire swarm. A basic PSO algorithm then loops over each particle i = 1, …, S and each dimension d = 1, …, n, updating velocities and positions. The termination criterion can be the number of iterations performed, or the discovery of a solution with adequate objective function value.
The parameters ω, φp, and φg are selected by the practitioner and control the behaviour and efficacy of the PSO method. The choice of PSO parameters can have a large impact on optimization performance, so selecting PSO parameters that yield good performance has been the subject of much research. The PSO parameters can also be tuned by using another overlaying optimizer, a concept known as meta-optimization, or even fine-tuned during the optimization, e.g. by means of fuzzy logic. Parameters have also been tuned for various optimization scenarios. The topology of the swarm defines the subset of particles with which each particle can exchange information. The basic version of the algorithm uses the global topology as the swarm communication structure; this topology allows all particles to communicate with all the other particles.
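The update loop just outlined, with parameters ω, φp and φg and the global-best topology, might be sketched as follows (the parameter values are common defaults and the test function is our own choice, not prescribed by the text):

```python
import numpy as np

def pso(f, lo, hi, n_particles=30, n_iter=200, w=0.7, phi_p=1.5, phi_g=1.5, seed=0):
    """Global-best PSO for minimising f over the box [lo, hi]^n."""
    rng = np.random.default_rng(seed)
    dim = len(lo)
    x = rng.uniform(lo, hi, (n_particles, dim))               # positions x_i
    v = rng.uniform(-(hi - lo), hi - lo, (n_particles, dim))  # velocities v_i
    p = x.copy()                                              # personal bests p_i
    p_val = np.array([f(xi) for xi in x])
    g = p[p_val.argmin()].copy()                              # swarm best g
    for _ in range(n_iter):
        rp = rng.random((n_particles, dim))
        rg = rng.random((n_particles, dim))
        # velocity update: inertia + pull toward p_i + pull toward g
        v = w * v + phi_p * rp * (p - x) + phi_g * rg * (g - x)
        x = x + v
        vals = np.array([f(xi) for xi in x])
        improved = vals < p_val
        p[improved], p_val[improved] = x[improved], vals[improved]
        g = p[p_val.argmin()].copy()
    return g, p_val.min()

best, best_val = pso(lambda z: float((z ** 2).sum()),
                     np.array([-5.0, -5.0]), np.array([5.0, 5.0]))
```

Because every particle reads the single swarm-best g, this is the global topology; ring or other neighbourhood topologies would restrict which particles each particle can learn from.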
Particle swarm optimization
–
Performance landscape showing how a simple PSO variant performs in aggregate on several benchmark problems when varying two PSO parameters.
Particle swarm optimization
–
Biological swarming
75.
Group method of data handling
–
GMDH is used in such fields as data mining, knowledge discovery, prediction, complex systems modeling, optimization and pattern recognition. In order to find the best solution, GMDH algorithms consider various component subsets of the base function, called partial models. Coefficients of these models are estimated by the least squares method. GMDH algorithms gradually increase the number of partial model components and find a model structure with optimal complexity, indicated by the minimum value of an external criterion. This process is called self-organization of models. Jürgen Schmidhuber cites GMDH as one of the earliest deep learning methods, remarking that it was used to train eight-layer neural nets as early as 1971. The method was originated in 1968 by Prof. Alexey G. Ivakhnenko in the Institute of Cybernetics in Kiev. Thanks to the author's policy of open code sharing, the method quickly took root in a large number of scientific laboratories worldwide; at that time code sharing was quite a physical act, since the Internet is at least 5 years younger than GMDH. Despite this, the first investigation of GMDH outside the Soviet Union was made soon after by R. Shankar in 1972. Later on, different GMDH variants were published by Japanese and Polish scientists. The period 1968-1971 is characterized by the application of only the regularity criterion for solving problems of identification, pattern recognition and short-term forecasting. As reference functions, polynomials, logical nets and fuzzy Zadeh sets were used. The authors were stimulated by the very high accuracy of forecasting with the new approach. The problem of modeling noisy data and incomplete information bases was solved; multicriteria selection and the utilization of additional prior information for increasing noise immunity were proposed. The best experiments showed that, with an extended definition of the model by an additional criterion, the noise level can be ten times higher than the signal.
The method was then improved using Shannon's theorem of general communication theory, and the convergence of multilayered GMDH algorithms was investigated. It was shown that some multilayered algorithms have a "multilayerness error", analogous to the static error of control systems. In 1977 a solution of objective systems analysis problems by multilayered GMDH algorithms was proposed; it turned out that sorting-out by a criteria ensemble finds the only optimal system of equations and therefore can show the elements of a complex object. Many important theoretical results were obtained, and it became clear that full physical models cannot be used for long-term forecasting. It was proved that models of GMDH are more accurate for approximation. Two-level algorithms which use two different time scales for modeling were developed. Since 1989 new algorithms for non-parametric modeling of fuzzy objects and SLP for expert systems have been developed and investigated. The present stage of GMDH development can be described as a blossoming of twice-multilayered neuronets. The external criterion is one of the key features of GMDH: the criterion describes the requirements to the model, and it is always calculated with a separate part of the data sample that has not been used for estimation of coefficients.
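The core selection step, fitting polynomial partial models by least squares and ranking them by an external criterion computed on held-out data, can be sketched as follows (this is a deliberately simplified illustration, not Ivakhnenko's full multilayered algorithm; all names and the synthetic data are our own):

```python
import numpy as np
from itertools import combinations

def gmdh_layer(X_train, y_train, X_val, y_val):
    """Fit a quadratic partial model on every pair of input variables by least
    squares, then rank the models by mean squared error on a held-out sample
    (the external criterion)."""
    def design(M, i, j):
        a, b = M[:, i], M[:, j]
        # Ivakhnenko-style quadratic partial model in two variables
        return np.column_stack([np.ones(len(M)), a, b, a * b, a**2, b**2])
    ranked = []
    for i, j in combinations(range(X_train.shape[1]), 2):
        coef, *_ = np.linalg.lstsq(design(X_train, i, j), y_train, rcond=None)
        err = np.mean((design(X_val, i, j) @ coef - y_val) ** 2)  # external criterion
        ranked.append(((i, j), coef, err))
    return sorted(ranked, key=lambda r: r[2])  # best partial models first

rng = np.random.default_rng(1)
X = rng.standard_normal((150, 3))
y = X[:, 0] * X[:, 1]              # true structure uses variables 0 and 1
ranked = gmdh_layer(X[:100], y[:100], X[100:], y[100:])
best_pair = ranked[0][0]
```

Because the criterion is evaluated on data not used for coefficient estimation, overly complex partial models that merely fit noise are penalized, which is the self-organizing selection principle described above.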
Group method of data handling
–
GMDH author - Soviet scientist Prof. Alexey G. Ivakhnenko.
76.
Andrey Kolmogorov
–
Andrey Kolmogorov was born in Tambov, about 500 kilometers south-southeast of Moscow, in 1903. His mother, Maria Y. Kolmogorova, died giving birth to him. Andrey was raised by two of his aunts in Tunoshna at the estate of his grandfather, a well-to-do nobleman. Little is known about Andrey's father; he was supposedly named Nikolai Matveevich Kataev and had been an agronomist. Nikolai had been exiled from St. Petersburg to the Yaroslavl province after his participation in the movement against the czars. He disappeared in 1919 and was presumed to have been killed in the Russian Civil War. Andrey Kolmogorov was educated in his aunt Vera's village school, and his earliest literary efforts appeared in the school journal; Andrey was the editor of the mathematical section of this journal. In 1910, his aunt adopted him, and they moved to Moscow. In 1920, Kolmogorov began to study at the Moscow State University and at the same time at the Mendeleev Moscow Institute of Chemistry and Technology. Kolmogorov writes about this time: "I arrived at Moscow University with a fair knowledge of mathematics. I knew in particular the beginning of set theory. I studied many questions in articles in the Encyclopedia of Brockhaus and Efron, filling out for myself what was presented too concisely in these articles." Kolmogorov gained a reputation for his wide-ranging erudition. During the same period, Kolmogorov worked out and proved several results in set theory and in the theory of Fourier series. In 1922, Kolmogorov gained international recognition for constructing a Fourier series that diverges almost everywhere. Around this time, he decided to devote his life to mathematics. In 1925, Kolmogorov graduated from Moscow State University and began to study under the supervision of Nikolai Luzin; Kolmogorov also became interested in probability theory.
In 1929, Kolmogorov earned his Doctor of Philosophy degree from Moscow State University. In 1930, Kolmogorov went on his first long trip abroad, traveling to Göttingen and Munich, and then to Paris; he had various scientific contacts in Göttingen. His pioneering work, "About the Analytical Methods of Probability Theory", was published in 1931; also in 1931, he became a professor at Moscow State University. In 1935, Kolmogorov became the first chairman of the department of probability theory at Moscow State University. Around the same years, Kolmogorov contributed to the field of ecology and generalized the Lotka–Volterra model of predator-prey systems. In 1936, Kolmogorov and Alexandrov were involved in the persecution of their common teacher Nikolai Luzin, in the so-called Luzin affair. In a 1938 paper, Kolmogorov established the basic theorems for smoothing and predicting stationary stochastic processes, a paper that had military applications during the Cold War.
Andrey Kolmogorov
–
Andrey Kolmogorov
Andrey Kolmogorov
–
Kolmogorov (left) delivers a talk at a Soviet information theory symposium. (
Tallinn, 1973).
Andrey Kolmogorov
–
Kolmogorov works on his talk (
Tallinn, 1973).
77.
Baidu
–
Baidu, Inc., incorporated on January 18, 2000, is a Chinese web services company headquartered at the Baidu Campus in Beijing's Haidian District. It is one of the largest Internet companies in the world. Baidu offers many services, including a Chinese search engine for websites, audio files and images. Baidu offers 57 search and community services, including the encyclopedia Baidu Baike and a searchable, keyword-based discussion forum. Baidu was established in 2000 by Robin Li and Eric Xu. Both of the co-founders are Chinese nationals who studied and worked overseas before returning to China. In December 2016, Baidu ranked 4th overall in the Alexa Internet rankings. During Q4 of 2010, it is estimated there were 4.02 billion search queries in China, of which Baidu had a market share of 56.6%. Baidu's share of China's Internet-search revenue in the second quarter of 2011 was 76%. In December 2007, Baidu became the first Chinese company to be included in the NASDAQ-100 index. As of 2006, Baidu provided an index of over 740 million web pages and 80 million images. Baidu offers multimedia content including MP3 music and movies, and was the first in China to offer Wireless Application Protocol and personal digital assistant-based mobile search. Baidu Baike is an encyclopedia similar to Wikipedia. The company also hosts a music service called Baidu Music. On December 4, 2015, Baidu announced plans to merge with Taihe Entertainment Group to help the service compete with Apple Inc.'s Apple Music. The name Baidu literally means "countless times", or alternatively, "a hundred times". In 1994, Robin Li joined IDD Information Services, a New Jersey division of Dow Jones and Company; he also worked on developing better algorithms for search engines and remained at IDD Information Services from May 1994 to June 1997. In 1996, while at IDD, Li developed the RankDex site-scoring algorithm for search results page ranking.
He later used this technology for the Baidu search engine. In 2000, the company Baidu was launched in Beijing, China. The first office was located in a room near Peking University, from which Robin had graduated. In 2003, Baidu launched a news search engine and a picture search engine. On January 12, 2010, Baidu.com's DNS records in the United States were altered such that browsers attempting to reach baidu.com were redirected; Internet users were met with a page saying "This site has been attacked by Iranian Cyber Army". Chinese hackers later responded by attacking Iranian websites and leaving messages. On August 6, 2012, the BBC reported that three employees of Baidu were arrested on suspicion that they had accepted bribes; the bribes were paid for deleting posts from the forum service.
Baidu
–
Baidu headquarters,
Haidian District, Beijing
Baidu
–
Baidu, Inc.
78.
Natural Language Processing
–
The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence". The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem. Little further research in machine translation was conducted until the late 1980s. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction; when the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example responding to "My head hurts" with "Why do you say your head hurts?". During the 1970s many programmers began to write conceptual ontologies, which structured real-world information into computer-understandable data; examples are MARGIE, SAM, PAM, TaleSpin, QUALM, Politics, and Plot Units. During this time, many chatterbots were written, including PARRY and Racter. Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. The cache language models upon which many speech recognition systems now rely are examples of statistical models. Many of the early successes occurred in the field of machine translation, due especially to work at IBM Research. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems; as a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. Recent research has focused on unsupervised and semi-supervised learning algorithms.
Such algorithms are able to learn from data that has not been hand-annotated with the desired answers. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available. Since the so-called "statistical revolution" in the late 1980s and mid-1990s, much NLP research has relied heavily on machine learning. Formerly, many language-processing tasks typically involved the direct hand coding of rules, which is not in general robust to natural-language variation. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples. Many different classes of machine-learning algorithms have been applied to NLP tasks. These algorithms take as input a set of features that are generated from the input data. Some of the algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common.
Natural Language Processing
–
An
automated online assistant providing
customer service on a web page, an example of an application where natural language processing is a major component.
79.
Deep belief network
–
When trained on a set of examples in an unsupervised way, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors on inputs. After this learning step, a DBN can be further trained in a supervised way to perform classification. This also leads to a fast, layer-by-layer unsupervised training procedure. The observation, due to Yee-Whye Teh, that DBNs can be trained greedily, one layer at a time, led to one of the first effective deep learning algorithms. The training algorithm for DBNs proceeds as follows. Let X be a matrix of inputs, regarded as a set of feature vectors. Train a restricted Boltzmann machine (RBM) on X to obtain its weight matrix. Transform X by the RBM to produce new data X′, either by sampling or by computing the mean activation of the hidden units. Repeat this procedure with X ← X′ for the next pair of layers. Finally, fine-tune all the parameters of this architecture with respect to a proxy for the DBN log-likelihood. See also: Bayesian network, deep learning.
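The greedy procedure rests on training one RBM at a time and feeding its mean hidden activations to the next layer. A minimal sketch of one such step, using CD-1 contrastive divergence and omitting bias terms for brevity (function names, hyperparameters, and the toy data are illustrative assumptions, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hidden, n_epochs=50, lr=0.1, seed=0):
    """Train one RBM with CD-1 contrastive divergence (bias terms omitted)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], n_hidden))
    for _ in range(n_epochs):
        h_prob = sigmoid(X @ W)                    # positive phase
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_sample @ W.T)          # one Gibbs step back down
        h_recon = sigmoid(v_recon @ W)             # and up again
        W += lr * (X.T @ h_prob - v_recon.T @ h_recon) / len(X)
    return W

def transform(X, W):
    """Mean hidden activations: the data fed to the next RBM in the stack."""
    return sigmoid(X @ W)

rng = np.random.default_rng(1)
X = rng.integers(0, 2, (200, 6)).astype(float)     # toy binary data
W1 = train_rbm(X, n_hidden=4)                      # first layer
X2 = transform(X, W1)                              # input for the second RBM
```

Stacking further calls of `train_rbm` on `X2`, `X3`, and so on realises the layer-by-layer procedure described above; fine-tuning of the whole stack is a separate step.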
80.
Function composition
–
In mathematics, function composition is the pointwise application of one function to the result of another to produce a third function. The resulting composite function is denoted g ∘ f : X → Z. The notation g ∘ f is read as "g circle f", or "g round f", or "g composed with f", "g after f", "g following f", or "g of f", or "g on f". Intuitively, composing two functions is a chaining process in which the output of the inner function becomes the input of the outer function. The composition of functions is a special case of the composition of relations, but the composition of functions has some additional properties. For functions on a finite set, given as lists of input-output pairs, the composition can be computed pair by pair. The composition of functions is always associative, a property inherited from the composition of relations; since there is no distinction between the choices of placement of parentheses, they may be left off without causing any ambiguity. In a strict sense, the composition g ∘ f can be built only if f's codomain equals g's domain; in a wider sense it is sufficient that the former is a subset of the latter. The functions g and f are said to commute with each other if g ∘ f = f ∘ g. Commutativity is a special property, attained only by particular functions, and often in special circumstances; for example, |x| + 3 = |x + 3| only when x ≥ 0. The composition of one-to-one functions is always one-to-one; similarly, the composition of two onto functions is always onto. It follows that the composition of two bijections is also a bijection. The inverse function of a composition has the property that (g ∘ f)⁻¹ = f⁻¹ ∘ g⁻¹. Derivatives of compositions involving differentiable functions can be found using the chain rule; higher derivatives of such functions are given by Faà di Bruno's formula. Suppose one has two functions f : X → X and g : X → X, having the same domain and codomain. Then one can form chains of transformations composed together, such as f ∘ f ∘ g ∘ f. Such chains have the algebraic structure of a monoid, called a transformation monoid or composition monoid.
In general, transformation monoids can have remarkably complicated structure; one particularly notable example is the de Rham curve. The set of all functions f : X → X is called the full transformation semigroup or symmetric semigroup on X. If the transformations are bijective, then the set of all combinations of these functions forms a transformation group.
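In a programming language with first-class functions, composition is a one-liner, and folding it over a chain of transformations exhibits the monoid structure described above, with the identity function as the unit. A small Python sketch (all names are our own):

```python
from functools import reduce

def compose(g, f):
    """Return the composite g ∘ f, i.e. the function x ↦ g(f(x))."""
    return lambda x: g(f(x))

def compose_all(*fns):
    """Fold a chain of transformations into one; the identity function
    serves as the unit of the composition monoid."""
    return reduce(compose, fns, lambda x: x)

inc = lambda x: x + 1
double = lambda x: 2 * x
h = compose(double, inc)        # h(x) = double(inc(x)) = 2 * (x + 1)
```

Associativity is what makes `compose_all` well defined: the result does not depend on how the chain is parenthesized.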
Function composition
–
g ∘ f, the composition of f and g. For example, (g ∘ f)(c) = #.
81.
Non-stationary
–
In mathematics and statistics, a stationary process is a stochastic process whose joint probability distribution does not change when shifted in time. Consequently, parameters such as mean and variance, if they are present, also do not change over time. Since stationarity is an assumption underlying many statistical procedures used in time series analysis, non-stationary data is often transformed to become stationary. The most common cause of violation of stationarity are trends in mean, due either to the presence of a unit root or of a deterministic trend. In the former case of a unit root, stochastic shocks have permanent effects and the process is not mean-reverting. In the latter case of a deterministic trend, the process is called a trend stationary process. A trend stationary process is not strictly stationary, but can easily be transformed into a stationary process by removing the underlying trend. Similarly, processes with one or more unit roots can be made stationary through differencing. An important type of non-stationary process that does not include a trend-like behavior is a cyclostationary process, which is a stochastic process that varies cyclically with time. A stationary process is not the same thing as a process with a stationary distribution. Besides, all stationary Markov random processes are time-homogeneous. Formally, let {Xt} be a stochastic process and let F_X(x_{t1+τ}, …, x_{tk+τ}) represent the cumulative distribution function of the joint distribution of {Xt} at times t1 + τ, …, tk + τ. Then {Xt} is said to be stationary if, for all k, for all τ, and for all t1, …, tk, F_X(x_{t1+τ}, …, x_{tk+τ}) = F_X(x_{t1}, …, x_{tk}). Since τ does not affect F_X, F_X is not a function of time. As an example, white noise is stationary. The sound of a cymbal clashing, if hit once, is not stationary because the acoustic power of the clash diminishes with time. However, it would be possible to invent a stochastic process describing when the cymbal is hit: for example, if the cymbal were hit at moments in time corresponding to a homogeneous Poisson process, the overall response would be stationary.
An example of a stationary process where the sample space is also discrete is a Bernoulli scheme. Let Y be any scalar random variable, and define a time series {Xt} by Xt = Y for all t. Then {Xt} is a stationary time series, for which realisations consist of a series of constant values, with a different constant value for each realisation. A law of large numbers does not apply in this case, as the limiting value of an average from a single realisation takes the random value determined by Y. A weaker form of stationarity commonly employed in signal processing is known as weak-sense stationarity, wide-sense stationarity (WSS), or covariance stationarity. WSS random processes only require that the first moment and autocovariance do not vary with respect to time. Any strictly stationary process which has a mean and a covariance is also WSS. The first property implies that the mean function mX must be constant. The second property implies that the autocovariance function depends only on the difference between t1 and t2 and only needs to be indexed by one variable rather than two variables.
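The distinction between a stationary process and a unit-root process can be seen in simulation: across realisations, white noise keeps a constant variance over time, while a random walk's variance grows with t, and differencing the walk recovers a stationary series. A short sketch (the sample sizes are arbitrary choices of our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 2000, 400
noise = rng.standard_normal((n_paths, n_steps))  # white noise: stationary
walk = noise.cumsum(axis=1)                      # random walk: unit root

# Across realisations, white-noise variance is constant in time,
# while the random walk has Var(X_t) = t * sigma^2, growing without bound.
var_noise_early = noise[:, 10].var()
var_noise_late = noise[:, 390].var()
var_walk_early = walk[:, 10].var()
var_walk_late = walk[:, 390].var()

# Differencing the walk recovers the stationary noise increments.
increments = np.diff(walk, axis=1)
```

This is the behaviour the figure below illustrates: a unit-root test such as the Augmented Dickey-Fuller test would fail to reject non-stationarity for the walk but not for the noise.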
Non-stationary
–
Two simulated time series processes, one stationary the other non-stationary. The
Augmented Dickey–Fuller test is reported for each process and non-stationarity cannot be rejected for the second process.
82.
Convex optimization problem
–
Convex minimization, a subfield of optimization, studies the problem of minimizing convex functions over convex sets. The convexity property can make optimization in some sense easier than the general case; for example, the convexity of f makes the powerful tools of convex analysis applicable. With recent improvements in computing and in theory, convex minimization is nearly as straightforward as linear programming. Many optimization problems can be reformulated as convex minimization problems; for example, the problem of maximizing a concave function f can be re-formulated equivalently as the problem of minimizing the function −f, which is convex. The general form of the problem is to find some x∗ ∈ X such that f(x∗) = min {f(x) : x ∈ X}, for some feasible set X ⊂ Rⁿ. The optimization problem is called a convex optimization problem if X is a convex set and f is a convex function. The following statements are true about the convex minimization problem: if a local minimum exists, then it is a global minimum; the set of all minima is convex; for each strictly convex function, if the function has a minimum, then the minimum is unique. Standard form is the usual and most intuitive form of describing a convex minimization problem. In practice, the terms "linear" and "affine" are often used interchangeably. Affine constraints can be expressed in the form h_i(x) = a_iᵀx + b_i. A convex minimization problem is thus written as: minimize f(x) over x, subject to g_i(x) ≤ 0, i = 1, …, m, and h_i(x) = 0, i = 1, …, p. Note that every equality constraint h(x) = 0 can be replaced by a pair of inequality constraints h(x) ≤ 0 and −h(x) ≤ 0. Therefore, for theoretical purposes, equality constraints are redundant; however, following from this fact, it is easy to understand why h_i(x) = 0 has to be affine as opposed to merely being convex. If h_i is convex, h_i(x) ≤ 0 is convex; therefore, the only way for h_i(x) = 0 to be convex is for h_i to be affine. Then the feasible domain is X = {x : g_1(x), …, g_m(x) ≤ 0}. The Lagrangian function for the problem is L(λ0, λ1, …, λm, x) = λ0 f(x) + λ1 g_1(x) + ⋯ + λm g_m(x).
If there exists a strictly feasible point, that is, a point z satisfying g_1(z) < 0, …, g_m(z) < 0, then strong duality holds. Dual subgradient methods are subgradient methods applied to a dual problem; the drift-plus-penalty method is similar to the dual subgradient method. Problems with convex level sets can be efficiently minimized, in theory: Yurii Nesterov proved that quasi-convex minimization problems could be solved efficiently. However, such theoretically efficient methods use divergent-series stepsize rules, which were first developed for classical subgradient methods. Solving even close-to-convex but non-convex problems can be computationally intractable: minimizing a unimodal function is intractable, regardless of the smoothness of the function, according to results of Ivanov.
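As a small illustration of minimizing a convex function over a convex set, the following sketch runs projected gradient descent on f(x) = (x − 3)² over X = [0, 1]; the function, step size, and iteration count are illustrative choices, not from the text.

```python
# Projected gradient descent for a toy convex problem:
#   minimize f(x) = (x - 3)^2  subject to x in X = [0, 1]
# f is strictly convex and X is convex, so the minimum is unique (here x* = 1).

def project(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the convex feasible set X = [lo, hi]."""
    return max(lo, min(hi, x))

def projected_gradient_descent(grad, x0, step=0.1, iters=200):
    x = x0
    for _ in range(iters):
        x = project(x - step * grad(x))  # gradient step, then project back
    return x

grad_f = lambda x: 2.0 * (x - 3.0)   # f'(x) for f(x) = (x - 3)^2
x_star = projected_gradient_descent(grad_f, x0=0.0)
print(round(x_star, 6))  # converges to the boundary point x* = 1.0
```

Because the unconstrained minimizer x = 3 lies outside X, the constrained minimum sits on the boundary of the feasible set, which the projection step enforces.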
83.
Multi-layer perceptron
–
A multilayer perceptron (MLP) is a feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of layers of nodes in a directed graph; except for the input nodes, each node is a neuron with a nonlinear activation function. An MLP utilizes a supervised learning technique called backpropagation for training the network. The MLP is a modification of the standard linear perceptron and can distinguish data that are not linearly separable. The activation function is modeled in several ways; here y_i is the output of the i-th node and v_i is the weighted sum of the input synapses. Alternative activation functions have been proposed, including the rectifier and softplus functions; more specialized activation functions include radial basis functions, which are used in another class of supervised neural network models. The multilayer perceptron consists of three or more layers of nonlinearly-activating nodes and is considered a deep neural network. Since an MLP is a fully connected network, each node in one layer connects with a weight w_ij to every node in the following layer. Some people do not include the input layer when counting the number of layers. Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron. We represent the error in output node j for the n-th data point by e_j(n) = d_j(n) − y_j(n), where d is the target value and y is the value produced by the perceptron. We then make corrections to the weights of the nodes based on corrections which minimize the error in the entire output. The derivative to be calculated depends on the induced local field v_j, which itself varies; for a hidden node, this depends on the change in weights of the k-th nodes, which represent the output layer. The term multilayer perceptron often causes confusion.
It is argued that the model is not a single perceptron that has multiple layers; rather, it contains many perceptrons that are organised into layers, leading some to believe that a more fitting term might therefore be multilayer perceptron network. In this case, the network can indeed be considered a binary classifier with multiple layers. Furthermore, the term multilayer perceptron now does not specify the nature of the layers; the layers are free to be composed of arbitrary artificial neurons. This interpretation of the term multilayer perceptron avoids the loosening of the definition of perceptron to mean an artificial neuron in general.
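The error signal e_j = d_j − y_j and the weight-update rule described above can be sketched as a minimal one-hidden-layer network; the weights, data point, and learning rate below are illustrative assumptions.

```python
import math

# Minimal one-hidden-layer MLP with sigmoid activations, trained by
# backpropagation (gradient descent on the squared error 1/2 * e^2).
# Weights and data here are illustrative, not from the article.

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_out):
    # y_i = sigmoid(v_i), where v_i is the weighted sum of the node's inputs
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output

def train_step(x, target, w_hidden, w_out, lr=0.5):
    hidden, y = forward(x, w_hidden, w_out)
    e = target - y                       # e_j = d_j - y_j
    delta_out = e * y * (1.0 - y)        # local gradient at the output node
    # Hidden-layer gradients use the current output weights (backpropagation),
    # computed before the output weights themselves are updated.
    deltas_h = [delta_out * w_out[i] * h * (1.0 - h)
                for i, h in enumerate(hidden)]
    for i, h in enumerate(hidden):
        w_out[i] += lr * delta_out * h   # output-layer weight update
    for i, ws in enumerate(w_hidden):
        for j, xj in enumerate(x):
            ws[j] += lr * deltas_h[i] * xj

w_hidden = [[0.5, -0.5], [0.3, 0.8]]
w_out = [0.2, -0.4]
x, d = [1.0, 0.0], 1.0

_, y_before = forward(x, w_hidden, w_out)
for _ in range(100):
    train_step(x, d, w_hidden, w_out)
_, y_after = forward(x, w_hidden, w_out)
print(abs(d - y_before) > abs(d - y_after))  # error shrinks with training
```

Repeating the gradient step drives the output toward the target d, which is the sense in which the corrections "minimize the error in the entire output".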
85.
Bilinear map
–
In mathematics, a bilinear map is a function combining elements of two vector spaces to yield an element of a third vector space, and it is linear in each of its arguments. Let V, W and X be three vector spaces over the base field F; a bilinear map is a function B : V × W → X such that, when we hold the first entry of the bilinear map fixed while letting the second entry vary, the result is a linear operator, and similarly for when we hold the second entry fixed. If V = W and B(v, w) = B(w, v) for all v, w in V, then B is said to be symmetric. The case where X is the base field F, so that we have a bilinear form, is particularly useful. The definition works without any changes if instead of vector spaces over a field F we use modules over a commutative ring, and it generalizes to n-ary functions, where the proper term is multilinear. For a bimodule over non-commutative rings R and S, a bilinear map satisfies B(r ⋅ m, n) = r ⋅ B(m, n) and B(m, n ⋅ s) = B(m, n) ⋅ s for all m in M, n in N, r in R and s in S. A first immediate consequence of the definition is that B(v, w) = 0_X whenever v = 0_V or w = 0_W. This may be seen by writing the zero vector 0_V as 0 ⋅ 0_V and moving the scalar 0 outside, in front of B. The set L(V, W; X) of all bilinear maps is a linear subspace of the space of all maps from V × W into X. If V, W, X are finite-dimensional, then so is L(V, W; X); for X = F, i.e. bilinear forms, the dimension of this space is dim V × dim W. To see this, choose a basis for V and W; then each bilinear map can be represented by the matrix of values B(e_i, f_j). Now, if X is a space of higher dimension, we obviously have dim L(V, W; X) = dim V × dim W × dim X. Matrix multiplication is a bilinear map M(m, n) × M(n, p) → M(m, p). If a vector space V over the real numbers R carries an inner product, then the inner product is a bilinear map V × V → R; in general, for a vector space V over a field F, a bilinear form on V is the same as a bilinear map V × V → F. If V is a vector space with dual space V∗, then the application operator b(f, v) = f(v) is a bilinear map from V∗ × V to the base field. Let V and W be vector spaces over the same base field F; if f is a member of V∗ and g a member of W∗, then (v, w) ↦ f(v)g(w) defines a bilinear map V × W → F. The cross product in R3 is a bilinear map R3 × R3 → R3. Let B : V × W → X be a bilinear map, and L : U → W be a linear map; then (v, u) ↦ B(v, L(u)) is a bilinear map on V × U.
Bilinear map
–
A matrix M determines a bilinear map into the reals by means of a real bilinear form (v, w) ↦ v′Mw; then associates of this are taken to the other three possibilities using duality and the musical isomorphism.
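The defining property, linearity in each argument separately, can be checked numerically for a form B(v, w) = v′Mw; the matrix and vectors below are arbitrary illustrative values.

```python
# Check bilinearity of the form B(v, w) = v^T M w for a fixed 2x2 matrix M.
# Pure-Python sketch; M, v, w are arbitrary illustrative values.

def B(M, v, w):
    return sum(v[i] * M[i][j] * w[j] for i in range(2) for j in range(2))

M = [[1.0, 2.0], [3.0, 4.0]]
v, v2, w = [1.0, -1.0], [0.5, 2.0], [2.0, 3.0]
a = 7.0

# Linear in the first argument while the second is held fixed:
#   B(a*v + v2, w) == a*B(v, w) + B(v2, w)
lhs = B(M, [a * v[0] + v2[0], a * v[1] + v2[1]], w)
rhs = a * B(M, v, w) + B(M, v2, w)
print(abs(lhs - rhs) < 1e-9)  # True
```

The symmetric check, holding the first argument fixed and varying the second, works the same way, so B is linear in each argument but (unlike a linear map on V × W) not in both jointly.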
86.
Probability mass
–
In probability theory and statistics, a probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value. Suppose that X : S → A is a discrete random variable defined on a sample space S. Then the probability mass function fX : A → [0, 1] for X is defined as fX(x) = Pr(X = x) = Pr({s ∈ S : X(s) = x}). Note that fX may be defined for all real numbers, with fX(x) = 0 for all x outside the image of X, as shown in the figure. Since the image of X is countable, the probability mass function fX is zero for all but a countable number of values of x. The discontinuity of probability mass functions is related to the fact that the cumulative distribution function of a discrete random variable is also discontinuous; where it is differentiable, the derivative is zero, just as the probability mass function is zero at all such points. We make this more precise below. Suppose that (S, 𝒜, P) is a probability space and that (B, ℬ) is a measurable space whose underlying σ-algebra is discrete, so in particular it contains singleton sets of B. In this setting, a random variable X : S → B is discrete provided its image is countable. Now suppose that (B, ℬ, μ) is a measure space equipped with the counting measure μ, and let f be the density of X with respect to μ. As a consequence, for any b in B we have P(X = b) = P(X⁻¹(b)) = ∫_{X⁻¹(b)} dP = ∫_{{b}} f dμ = f(b), demonstrating that f is in fact a probability mass function. Suppose that S is the sample space of all outcomes of a single toss of a fair coin. Since the coin is fair, the probability mass function is fX(x) = 1/2 for x in {heads, tails}, and fX(x) = 0 otherwise. This is a special case of the binomial distribution, the Bernoulli distribution. An example of a multivariate discrete distribution, and of its probability mass function, is provided by the multinomial distribution.
Probability mass
–
The graph of a probability mass function. All the values of this function must be non-negative and sum up to 1.
87.
Probability density
–
In probability theory, a probability density function (PDF) of a continuous random variable is a function whose value at any given sample can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample. In a more precise sense, the PDF is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. The probability density function is nonnegative everywhere, and its integral over the entire space is equal to one. The terms probability distribution function and probability function have also sometimes been used to denote the probability density function; however, this use is not standard among probabilists and statisticians. Further confusion of terminology exists because density function has also been used for what is here called the probability mass function. In general though, the PMF is used in the context of discrete random variables, while the PDF is used in the context of continuous random variables. Suppose a species of bacteria typically lives 4 to 6 hours. What is the probability that a bacterium lives exactly 5 hours? A lot of bacteria live for approximately 5 hours, but there is no chance that any given bacterium dies at exactly 5.0000000000... hours. Instead we might ask: what is the probability that the bacterium dies between 5 hours and 5.01 hours? Let's say the answer is 0.02. Next: what is the probability that the bacterium dies between 5 hours and 5.001 hours? The answer is probably around 0.002, since this is 1/10th of the previous interval. The probability that the bacterium dies between 5 hours and 5.0001 hours is probably about 0.0002, and so on. In these three examples, the ratio (probability of dying during an interval) / (duration of the interval) is approximately constant, and equal to 2 per hour. For example, there is 0.02 probability of dying in the 0.01-hour interval between 5 and 5.01 hours, and (0.02 probability / 0.01 hours) = 2 hour−1. This quantity 2 hour−1 is called the probability density for dying at around 5 hours. Therefore, in response to the question "What is the probability that the bacterium dies at 5 hours?", a literally correct but unhelpful answer is 0, but a better answer can be written as (2 hour−1) dt. This is the probability that the bacterium dies within an infinitesimal window of time around 5 hours, where dt is the duration of this window.
For example, the probability that it lives longer than 5 hours, but shorter than (5 hours + 1 nanosecond), is (2 hour−1) × (1 nanosecond) ≈ 6 × 10−13. There is a probability density function f with f(5 hours) = 2 hour−1. The integral of f over any window of time, not only infinitesimal windows but also large windows, is the probability that the bacterium dies in that window. A probability density function is most commonly associated with absolutely continuous univariate distributions. A random variable X has density fX, where fX is a non-negative Lebesgue-integrable function, if Pr[a ≤ X ≤ b] = ∫_a^b fX(x) dx. That is, fX is any function with the property that the probability of X falling in a measurable set is the integral of fX over that set. In the continuous univariate case above, the reference measure is the Lebesgue measure.
Probability density
–
Boxplot and probability density function of a normal distribution N(0, σ²).
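The idea that probabilities come from integrating the density can be illustrated numerically; the standard normal density and the midpoint-rule integrator below are illustrative choices, not from the text.

```python
import math

# Numerical sketch: the probability that X falls in [a, b] is the integral of
# the density over [a, b]. Here f is the standard normal density N(0, 1).

def f(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def integrate(g, a, b, n=100000):
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h  # midpoint rule

total = integrate(f, -10.0, 10.0)   # ~1: the density integrates to one
p = integrate(f, -1.0, 1.0)         # ~0.6827: mass within one sigma of the mean
print(round(total, 3), round(p, 4))
```

The value at any single point, f(0) ≈ 0.3989, is not itself a probability; only integrals of f over intervals are.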
88.
Statistics
–
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. In applying statistics to, e.g., a scientific, industrial, or social problem, it is conventional to begin with a statistical population or process to be studied; populations can be diverse topics such as all people living in a country or every atom composing a crystal. Statistics deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments. Statistician Sir Arthur Lyon Bowley defines statistics as "numerical statements of facts in any department of inquiry placed in relation to each other". When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples; representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. In contrast, an observational study does not involve experimental manipulation. Inferences in mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena. A standard statistical procedure involves the test of the relationship between two data sets, or between a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between the two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (rejecting a true null hypothesis) and Type II errors (failing to reject a false null hypothesis). Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis. Measurement processes that generate statistical data are also subject to error.
Many of these errors are classified as random (noise) or systematic (bias). The presence of missing data or censoring may result in biased estimates, and specific techniques have been developed to address these problems. Statistics continues to be an area of active research, for example on the problem of how to analyze Big data. Statistics is a body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data. Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty. Mathematical techniques used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure-theoretic probability theory. In applying statistics to a problem, it is standard practice to start with a population or process to be studied. Populations can be diverse topics such as all people living in a country or every atom composing a crystal. Ideally, statisticians compile data about the entire population; this may be organized by governmental statistical institutes.
Statistics
–
Scatter plots are used in descriptive statistics to show the observed relationships between different variables.
Statistics
–
More probability density is found as one gets closer to the expected (mean) value in a normal distribution. Statistics used in standardized testing assessment are shown. The scales include standard deviations, cumulative percentages, percentile equivalents, Z-scores, T-scores, standard nines, and percentages in standard nines.
Statistics
–
Gerolamo Cardano, the earliest pioneer on the mathematics of probability.
Statistics
–
Karl Pearson, a founder of mathematical statistics.
89.
Greedy algorithm
–
A greedy algorithm is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum. For example, a greedy strategy for the traveling salesman problem is the following heuristic: "At each step of the journey, visit the nearest unvisited city." This heuristic need not find a best solution, but it terminates in a reasonable number of steps. In mathematical optimization, greedy algorithms solve combinatorial problems having the properties of matroids. Most problems for which they work will have two properties. Greedy choice property: we can make whatever choice seems best at the moment and then solve the subproblems that arise later. The choice made by a greedy algorithm may depend on choices made so far, but not on future choices. It iteratively makes one greedy choice after another, reducing each given problem into a smaller one; in other words, a greedy algorithm never reconsiders its choices. This is the main difference from dynamic programming, which is exhaustive and is guaranteed to find the solution. After every stage, dynamic programming makes decisions based on all the decisions made in the previous stage. Optimal substructure: a problem exhibits optimal substructure if an optimal solution to the problem contains optimal solutions to the sub-problems. For many other problems, greedy algorithms fail to produce the optimal solution. Greedy algorithms can be characterized as being "short sighted", and also as "non-recoverable"; they are ideal only for problems which have optimal substructure. Despite this, for many simple problems, the best-suited algorithms are greedy algorithms. It is important, however, to note that the greedy algorithm can be used as a selection algorithm to prioritize options within a search. Greedy algorithms can make commitments to certain choices too early which prevent them from finding the best overall solution later; for example, all known greedy coloring algorithms for the graph coloring problem, and indeed for all other NP-complete problems, do not consistently find optimum solutions.
Nevertheless, they are useful because they are quick to think up and often give good approximations to the optimum. Examples of such greedy algorithms are Kruskal's algorithm and Prim's algorithm for finding minimum spanning trees, and the algorithm for finding optimum Huffman trees. The theory of matroids, and the more general theory of greedoids, provide whole classes of such algorithms. Greedy algorithms appear in network routing as well: using greedy routing, a message is forwarded to the neighboring node which is closest to the destination. The notion of a node's location may be determined by its physical location; location may also be an entirely artificial construct, as in small world routing. The activity selection problem is characteristic of this class of problems. In the Macintosh computer game Crystal Quest, the objective is to collect crystals, in a fashion similar to the travelling salesman problem.
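A classic small illustration of the greedy choice property is coin change with canonical denominations, where repeatedly taking the largest coin that fits happens to be optimal; the denominations below are an illustrative assumption, and for arbitrary coin systems the same greedy rule can fail.

```python
# Greedy choice in action: make change for an amount using as few coins as the
# greedy rule yields, considering the largest denominations first.

def greedy_change(amount, coins=(25, 10, 5, 1)):
    result = []
    for coin in coins:              # consider largest coins first
        while amount >= coin:       # locally optimal choice, never reconsidered
            result.append(coin)
            amount -= coin
    return result

print(greedy_change(63))  # [25, 25, 10, 1, 1, 1]
```

Note that the algorithm never revisits a chosen coin, which is exactly the "non-recoverable" behavior described above; with denominations like (4, 3, 1), greedy change for 6 would give [4, 1, 1] even though [3, 3] uses fewer coins.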
90.
Hierarchical temporal memory
–
Hierarchical temporal memory (HTM) is a biologically constrained theory of machine intelligence originally described in the 2004 book On Intelligence by Jeff Hawkins with Sandra Blakeslee. HTM is based on neuroscience and the physiology and interaction of neurons in the neocortex of the human brain. The technology has been tested and implemented in software through example applications from Numenta. At the core of HTM are learning algorithms that can store, learn, infer and recall high-order sequences. Unlike most other machine learning methods, HTM learns time-based patterns in unlabeled data on a continuous basis. HTM is robust to noise and has high capacity, meaning that it can learn multiple patterns simultaneously. When applied to computers, HTM is well suited for prediction, anomaly detection, and classification. A typical HTM network is a tree-shaped hierarchy of levels that are composed of smaller elements called nodes or columns. A single level in the hierarchy is also called a region. Higher hierarchy levels often have fewer nodes and therefore less spatial resolvability; higher hierarchy levels can reuse patterns learned at the lower levels by combining them to memorize more complex patterns. Each HTM node has the same basic functionality. In learning and inference modes, sensory data comes into the bottom-level nodes; in generation mode, the bottom-level nodes output the generated pattern of a given category. When in inference mode, a node in each level interprets information coming in from its child nodes in the lower level as probabilities of the categories it has in memory. Each HTM region learns by identifying and memorizing spatial patterns: combinations of input bits that often occur at the same time. It then identifies temporal sequences of spatial patterns that are likely to occur one after another. During training, a node receives a temporal sequence of spatial patterns as its input.
The learning process consists of two stages. Spatial pooling identifies frequently observed patterns and memorizes them as "coincidences"; patterns that are significantly similar to each other are treated as the same coincidence, so a large number of input patterns are reduced to a manageable number of known coincidences. Temporal pooling partitions coincidences that are likely to follow each other in the training sequence into temporal groups; each group of patterns represents a "cause" of the input pattern. During inference, the node calculates the set of probabilities that a pattern belongs to each known coincidence, then calculates the probabilities that the input represents each temporal group. The set of probabilities assigned to the groups is called a node's "belief" about the input pattern.
Hierarchical temporal memory
–
An example of HTM hierarchy used for image recognition
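The spatial-pooling idea described above, treating sufficiently similar patterns as the same coincidence, can be sketched in a toy form; this is a deliberate simplification for illustration, not Numenta's actual algorithm, and the similarity threshold and patterns are arbitrary choices.

```python
# Toy sketch of the spatial-pooling idea: frequently seen binary patterns are
# memorized as "coincidences", and a new pattern similar enough to a stored one
# is treated as the same coincidence. (Not Numenta's real implementation.)

def similarity(a, b):
    """Fraction of matching bits between two equal-length binary patterns."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

class SpatialPooler:
    def __init__(self, threshold=0.75):
        self.coincidences = []
        self.threshold = threshold

    def learn(self, pattern):
        """Return the index of the matching coincidence, storing it if novel."""
        for i, c in enumerate(self.coincidences):
            if similarity(pattern, c) >= self.threshold:
                return i
        self.coincidences.append(pattern)
        return len(self.coincidences) - 1

sp = SpatialPooler()
a = sp.learn([1, 0, 1, 1, 0, 0, 1, 0])
b = sp.learn([1, 0, 1, 1, 0, 0, 0, 0])  # 7/8 bits agree -> same coincidence
c = sp.learn([0, 1, 0, 0, 1, 1, 0, 1])  # very different -> new coincidence
print(a, b, c)  # 0 0 1
```

In this miniature form, many noisy input patterns collapse onto a few stored coincidences, which is the dimensionality reduction the spatial-pooling stage is meant to provide before temporal grouping.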
91.
Binary decoder
–
In digital electronics, a binary decoder is a combinational logic circuit that converts a binary integer value to an associated pattern of output bits. Decoders are used in a variety of applications, including data demultiplexing and seven-segment displays. In addition to data inputs, some decoders also have one or more enable inputs; when the enable input is negated, all decoder outputs are forced to their inactive states. Depending on its function, a binary decoder will convert binary information from n input signals to as many as 2^n unique output signals. Some decoders have fewer than 2^n output lines; in such cases, at least one output pattern will be repeated for different input values. A binary decoder is implemented as either a stand-alone integrated circuit (IC) or as part of a more complex IC. In the latter case the decoder may be synthesized by means of a hardware description language such as VHDL or Verilog. Widely used decoders are often available in the form of standardized ICs. A 1-of-n binary decoder has n output bits. This type of decoder asserts exactly one of its n output bits, or none of them; the address (bit number) of the activated output is specified by the integer input value. For example, output bit number 0 is selected when the integer value 0 is applied to the inputs. Examples of this type of decoder include: a 3-to-8 line decoder, which activates one of eight output bits for each input value from 0 to 7, the range of integer values that can be expressed in three bits; similarly, a 4-to-16 line decoder activates one of 16 outputs for each 4-bit input in the integer range [0, 15]. A BCD-to-decimal decoder has ten output bits. It accepts an input consisting of a binary-coded decimal integer value and activates one specific output for each decimal value. All outputs are held inactive when a non-BCD value is applied to the inputs.
A demultiplexer is a 1-of-n binary decoder that is used to route a data bit to one of its n outputs while all other outputs remain inactive. Code translators differ from 1-of-n decoders in that multiple output bits may be active at the same time. An example of this is a seven-segment decoder, which converts an integer into the combination of segment control signals needed to display the value on a seven-segment display digit. One variant of the seven-segment decoder is the BCD-to-seven-segment decoder; this decoder function is available in standard ICs such as the CMOS 4511. See also: sum-addressed decoder, multiplexer, priority encoder.
Binary decoder
–
A 2-to-4 line decoder
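The 2-to-4 line decoder shown above can be described behaviorally in a few lines; the active-high enable handling is an illustrative convention (real parts also come in active-low variants).

```python
# Behavioral sketch of a 2-to-4 line decoder with an active-high enable:
# for input value n (0..3), output bit n is asserted; all outputs are
# forced inactive when enable is low.

def decode_2to4(a1, a0, enable=1):
    if not enable:
        return [0, 0, 0, 0]
    n = (a1 << 1) | a0                          # binary input value 0..3
    return [1 if i == n else 0 for i in range(4)]

print(decode_2to4(1, 0))            # [0, 0, 1, 0] -- output 2 selected
print(decode_2to4(1, 1, enable=0))  # [0, 0, 0, 0] -- disabled
```

Exactly one output is active for each of the 2^n input values, which is the defining behavior of a 1-of-n decoder.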
92.
Kernel principal component analysis
–
In the field of multivariate statistics, kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space. Recall that conventional PCA operates on zero-centered data; that is, (1/N) ∑_{i=1}^{N} x_i = 0. Given N points x_i, if we map them to an N-dimensional space with Φ(x_i), where Φ : R^d → R^N, it is easy to construct a hyperplane that divides the points into arbitrary clusters. Of course, this Φ creates linearly independent vectors, so there is no covariance matrix on which to perform eigendecomposition explicitly as we would in linear PCA. The N elements in each column of K represent the dot product of one point of the transformed data with respect to all the transformed points. Some well-known kernels are shown in the example below, and we use the centered kernel matrix K′ to perform the kernel PCA algorithm described above. One caveat of kernel PCA should be illustrated here. In linear PCA, we can use the eigenvalues to rank the eigenvectors based on how much of the variation of the data is captured by each principal component. This is useful for data dimensionality reduction, and it could also be applied to KPCA; however, in practice there are cases where all variations of the data are the same, which is typically caused by a poor choice of kernel scale. In practice, a large data set leads to a large K, and storing K may become a problem. One way to deal with this is to perform clustering on the dataset; since even this method may yield a relatively large K, it is common to compute only the top P eigenvalues and eigenvectors of K. Consider three concentric clouds of points; we wish to use kernel PCA to identify these groups. The color of the points is not part of the algorithm, but is only there to show how the data groups together before and after the transformation.
First, consider the polynomial kernel k(x, y) = (x^T y + 1)^2. Applying this to kernel PCA yields the next image. Kernel PCA has been demonstrated to be useful for novelty detection and image de-noising. See also: cluster analysis, kernel trick, multilinear PCA, multilinear subspace learning, nonlinear dimensionality reduction, spectral clustering.
Kernel principal component analysis
–
Input points before kernel PCA
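The kernel trick underlying kernel PCA, computing inner products in the feature space without ever forming Φ explicitly, can be checked numerically for a polynomial kernel; the feature map below is one standard choice for 2-D inputs, shown only for illustration.

```python
import math

# The kernel trick: k(x, y) = (x.y + 1)^2 equals an explicit dot product in a
# higher-dimensional feature space, so kernel PCA can work with the matrix K
# of pairwise kernel values without ever constructing Phi(x).
# One feature map realizing this kernel for 2-D inputs:
#   Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)

def kernel(x, y):
    return (x[0] * y[0] + x[1] * y[1] + 1.0) ** 2

def phi(x):
    r2 = math.sqrt(2.0)
    return [x[0]**2, x[1]**2, r2*x[0]*x[1], r2*x[0], r2*x[1], 1.0]

x, y = [1.0, 2.0], [3.0, -1.0]
lhs = kernel(x, y)
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))
print(abs(lhs - rhs) < 1e-9)  # True -- same inner product, computed cheaply
```

The kernel evaluation costs O(d) while the explicit feature space here has dimension 6 (and grows combinatorially for higher-degree kernels), which is why working with K is attractive.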
93.
K-nearest neighbor
–
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors; if k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object; this value is the average of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning. The k-NN algorithm is among the simplest of all machine learning algorithms. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class or the property value is known; this can be thought of as the training set for the algorithm. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data. The algorithm is not to be confused with k-means, another popular machine learning technique. Suppose we have pairs (X_1, Y_1), …, (X_n, Y_n) taking values in R^d × {1, 2}, where Y is the class label of X, so that X | Y = r ∼ P_r for r = 1, 2. Given some norm ∥⋅∥ on R^d and a point x ∈ R^d, let (X_(1), Y_(1)), …, (X_(n), Y_(n)) be a reordering of the training data such that ∥X_(1) − x∥ ≤ ⋯ ≤ ∥X_(n) − x∥. The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels. A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used; in the context of gene expression microarray data, for example, k-NN has also been employed with correlation coefficients such as Pearson and Spearman. A drawback of the basic majority voting classification occurs when the class distribution is skewed.
That is, examples of a more frequent class tend to dominate the prediction of the new example. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors: the class of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation; for example, in a self-organizing map, each node is a representative of a cluster of similar points, regardless of their density in the original training data.
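The majority-vote classification described above can be sketched in a few lines; the training set, query points, and choice of k below are illustrative.

```python
import math
from collections import Counter

# Minimal k-NN classifier with Euclidean distance and majority vote;
# the training set and query points are illustrative.

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # majority vote among the k nearest

train = [([0.0, 0.0], "A"), ([0.1, 0.2], "A"), ([0.2, 0.1], "A"),
         ([5.0, 5.0], "B"), ([5.1, 4.9], "B"), ([4.9, 5.2], "B")]
print(knn_classify(train, [0.3, 0.3]))  # A -- nearest neighbors are class A
print(knn_classify(train, [5.0, 5.1]))  # B
```

The "training phase" really is just storing the labeled vectors; all distance computation is deferred to query time, which is the lazy-learning behavior noted above.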
94.
Blind source separation
–
Blind source separation (BSS) is the recovery of a set of source signals from a set of mixed signals, without the aid of information (or with very little information) about the source signals or the mixing process. This problem is in general highly underdetermined, but useful solutions can be derived under a surprising variety of conditions. Much of the literature in this field focuses on the separation of temporal signals such as audio; however, blind signal separation is now also performed on multidimensional data, such as images and tensors. The set of source signals, s(t) = (s_1(t), …, s_n(t))^T, is mixed using a matrix, A = [a_ij] ∈ R^{m×n}, to produce a set of mixed signals, x(t) = (x_1(t), …, x_m(t))^T. Usually, n is equal to m. If m > n, then the system of equations is overdetermined and thus can be unmixed using a conventional linear method. If n > m, the system is underdetermined and a non-linear method must be employed to recover the unmixed signals. The signals themselves can be multidimensional. The mixing equation x(t) = A · s(t) is effectively inverted as follows: blind source separation separates the set of mixed signals, x(t), through the determination of an unmixing matrix, B = [b_ij] ∈ R^{n×m}, to recover an approximation of the original signals, y(t) = B · x(t). At a cocktail party, there is a group of people talking at the same time. You have multiple microphones picking up mixed signals, but you want to isolate the speech of a single person; BSS can be used to separate the individual sources from the mixed signals. Figure 2 shows the concept of BSS: the individual source signals are shown as well as the mixed signals, which are the received signals. BSS is used to separate the mixed signals with knowledge only of the mixed signals themselves. The separated signals are only approximations of the source signals. The separated images were produced using Python and the Shogun toolbox, using the Joint Approximation Diagonalization of Eigen-matrices (JADE) algorithm, which is based on independent component analysis (ICA). This toolbox method can be used with multiple dimensions, but for a visual aspect images were used. Brain imaging is another application for BSS.
In electroencephalography (EEG) and magnetoencephalography (MEG), the interference from muscle activity masks the desired signal from brain activity; BSS, however, can be used to separate the two so that an accurate representation of brain activity may be achieved. A second approach, exemplified by nonnegative matrix factorization, is to impose structural constraints on the source signals.
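For the determined case (m = n), the mixing and unmixing equations x = A·s and y = B·x can be illustrated directly by taking B = A⁻¹; a real BSS method must estimate B blindly from x alone, so this is only a sketch of the algebra with an illustrative mixing matrix.

```python
# Determined case (m = n): if the mixing matrix A were known, unmixing is just
# y = B*x with B = inverse(A). BSS proper must estimate B blindly; this sketch
# only illustrates the mixing/unmixing equations with toy 2-source samples.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[0.8, 0.3], [0.2, 0.7]]     # mixing matrix (illustrative)
s = [1.0, -2.0]                  # source samples at one time instant
x = matvec(A, s)                 # observed mixture: x = A*s
y = matvec(inverse_2x2(A), x)    # recovered sources: y = B*x, B = inverse(A)
print([round(v, 9) for v in y])  # [1.0, -2.0]
```

Algorithms such as ICA effectively search for a B that makes the components of y statistically independent, since A (and hence its inverse) is unknown in the blind setting.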
95.
Robotics
–
Robotics is the interdisciplinary branch of engineering and science that includes mechanical engineering, electrical engineering, computer science, and others. Robotics deals with the design, construction, operation, and use of robots, as well as systems for their control and sensory feedback. These technologies are used to develop machines that can substitute for humans. Robots can be used in any situation and for any purpose, but today many are used in dangerous environments, in manufacturing processes, or where humans cannot survive. Robots can take on any form, but some are made to resemble humans in appearance; this is said to help in the acceptance of a robot in certain replicative behaviors usually performed by people. Such robots attempt to replicate walking, lifting, speech, and cognition. Many of today's robots are inspired by nature, contributing to the field of bio-inspired robotics. Throughout history, it has been assumed that robots will one day be able to mimic human behavior. Many robots are built to do jobs that are hazardous to people, such as defusing bombs or finding survivors in unstable ruins. Robotics is also used in STEM education as a teaching aid. The word robotics was derived from the word robot, which was introduced to the public by Czech writer Karel Čapek in his play R.U.R., published in 1920. The word robot comes from the Slavic word robota, which means labour. The play begins in a factory that makes artificial people called robots, creatures who can be mistaken for humans, very similar to the modern ideas of androids. Karel Čapek himself did not coin the word; he wrote a short letter in reference to an etymology in the Oxford English Dictionary in which he named his brother Josef Čapek as its actual originator. According to the Oxford English Dictionary, the word robotics was first used in print by Isaac Asimov, in his short story "Liar!", published in May 1941 in Astounding Science Fiction.
Asimov was unaware that he was coining the term; since the science and technology of electrical devices is electronics, he assumed robotics already referred to the science and technology of robots. In some of Asimov's other works, he states that the first use of the word robotics was in his short story Runaround; however, the original publication of "Liar!" predates that of Runaround by ten months, so the former is generally cited as the word's origin. In 1942, the science fiction writer Isaac Asimov created his Three Laws of Robotics. In 1948, Norbert Wiener formulated the principles of cybernetics, the basis of practical robotics. Fully autonomous robots only appeared in the second half of the 20th century. The first digitally operated and programmable robot, the Unimate, was installed in 1961 to lift hot pieces of metal from a die casting machine. Commercial and industrial robots are widespread today and are used to perform jobs more cheaply, more accurately, and more reliably than humans. They are also employed in jobs which are too dirty or dangerous for humans. For example, a robot may be designed to travel across heavy dirt or mud
Robotics
–
The
Shadow robot hand system
Robotics
–
Robotic Construction
Robotics
–
Electrical Aspect
Robotics
–
A robotic leg powered by
air muscles
96.
Prosthesis
–
In medicine, a prosthesis is an artificial device that replaces a missing body part, which may be lost through trauma, disease, or congenital conditions. A person's prosthesis should be designed and assembled according to the patient's appearance and functional needs. For instance, a patient may need a transradial prosthesis but must choose between an aesthetic functional device, a myoelectric device, a body-powered device, or an activity-specific device. The patient's future goals and economic capabilities may help choose between one or more devices. Craniofacial prostheses include intra-oral and extra-oral prostheses; extra-oral prostheses are further divided into hemifacial, auricular, nasal, orbital and ocular. Intra-oral prostheses include dental prostheses such as dentures and obturators. Limb prostheses include both upper- and lower-extremity prostheses. A transradial prosthesis is an artificial limb that replaces an arm missing below the elbow. Two main types of prosthetics are available: cable-operated limbs work by attaching a harness and cable around the opposite shoulder of the damaged arm; the other form of prosthetic available is the myoelectric arm, which works by sensing, via electrodes, when the muscles in the upper arm move, causing an artificial hand to open or close. In the prosthetic industry, a trans-radial prosthetic arm is often referred to as a "BE" or below-elbow prosthesis. Lower-extremity prostheses provide replacements at varying levels of amputation; these include hip disarticulation, transfemoral prosthesis, knee disarticulation, transtibial prosthesis, Syme's amputation, foot, partial foot, and toe. The two main subcategories of lower-extremity prosthetic devices are trans-tibial and trans-femoral. A transfemoral prosthesis is an artificial limb that replaces a leg missing above the knee. Transfemoral amputees can have a difficult time regaining normal movement. 
In general, a transfemoral amputee must use approximately 80% more energy to walk than a person with two whole legs, due to the complexities of movement associated with the knee. In the prosthetic industry, a trans-femoral prosthetic leg is often referred to as an "AK" or above-the-knee prosthesis. A transtibial prosthesis is an artificial limb that replaces a leg missing below the knee. A transtibial amputee is usually able to regain normal movement more readily than someone with a transfemoral amputation, due in large part to retaining the knee. Lower-extremity prosthetics describes artificially replaced limbs located at the hip level or lower; in the prosthetic industry, a trans-tibial prosthetic leg is often referred to as a "BK" or below-the-knee prosthesis. Prostheses have been mentioned throughout history; the earliest recorded mention is the warrior queen Vishpala in the Rigveda. The Egyptians were early pioneers of the idea, as shown by the wooden toe found on a body from the New Kingdom. Roman bronze crowns have also been found, but their use could have been more aesthetic than medical
Prosthesis
–
A man with two prosthetic arms playing
table football
Prosthesis
–
A
United States Marine with bilateral prosthetic legs leads a formation run
Prosthesis
–
Prosthetic toe from ancient Egypt
Prosthesis
–
Iron prosthetic hand believed to have been owned by Götz von Berlichingen (1480–1562)
97.
Poker
–
Poker is a family of card games that combine gambling, strategy, and skill. Poker games vary in the number of cards dealt, the number of shared or community cards, the number of cards that remain hidden, and the betting procedures. In most modern games, the first round of betting begins with one or more of the players making some form of a forced bet. In standard poker, each player bets according to the rank they believe their hand is worth as compared to the other players. The action then proceeds clockwise as each player in turn must either match ("call") the previous bet or fold; a player who matches a bet may also "raise", or increase, the bet. The betting round ends when all players have either called the last bet or folded. If all but one player folds on any round, the remaining player collects the pot without being required to reveal their hand. If more than one player remains in contention after the final betting round, a showdown takes place where the hands are revealed and the best hand takes the pot. By the 1990s some gaming historians, including David Parlett, started to challenge the notion that poker is a derivative of As-Nas. There is evidence that a game called poque, a French game similar to poker, was played around the region where poker is said to have originated. The name of the game likely descended from the Irish poca or even the French poque, yet it is not clear whether the origins of poker itself lie with the games bearing those names. It is commonly regarded as sharing ancestry with the Renaissance game of primero; the English game brag clearly descended from brelan and incorporated bluffing. It is quite possible that all of these earlier games influenced the development of poker as it exists now. The unique features of poker have to do with the betting, and do not appear in any known older game. In this view, poker originated much later, in the early or mid-18th century, and it was played in a variety of forms, with 52 cards, and included both straight poker and stud. 
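The fold-out rule described above, where the last player left in the hand takes the pot without a showdown, can be sketched in a few lines. This is an illustrative sketch only; the player names and chip amounts are made up.

```python
# Sketch: settling a betting round. If every player but one has folded,
# the remaining player wins the pot without revealing a hand; otherwise
# a showdown among the live players is needed.

def settle(bets, folded):
    """bets: player name -> chips put into the pot; folded: set of names."""
    pot = sum(bets.values())
    live = [p for p in bets if p not in folded]
    if len(live) == 1:
        return live[0], pot   # wins without a showdown
    return None, pot          # showdown needed among the live players

winner, pot = settle({"A": 50, "B": 50, "C": 10}, folded={"B", "C"})
print(winner, pot)   # A 110
```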
Twenty-card poker was a variant for two players. The development of poker is linked to the historical movement that also saw the invention of commercial gambling. The English actor Joseph Crowell reported that the game was played in New Orleans in 1829, with a deck of 20 cards and four players betting on which player's hand was the most valuable. The game spread north along the Mississippi River and to the West during the gold rush. Soon after this spread, the full 52-card French deck was used and the flush was introduced. The draw was added prior to 1850. During the American Civil War, many additions were made, including stud poker and the straight
Poker
–
A game of
Texas hold 'em in progress. "Hold 'em" is a popular form of poker.
Poker
–
Officers of the
114th Pennsylvania Infantry playing cards in front of tents.
Petersburg, Virginia, August 1864
Poker
–
2006 WSOP Main Event table
98.
Algorithmic trading
–
They were developed so that traders do not need to constantly watch a stock and repeatedly send those slices out manually. Popular algos include Percentage of Volume, Pegged, VWAP, TWAP, and Implementation Shortfall. Algorithmic trading in this sense is not an attempt to make a trading profit; it is simply a way to minimise the cost and market impact of trading. The term is also used to mean automated trading systems, which do indeed have the goal of making a profit. Also known as black-box trading, these encompass trading strategies that are heavily reliant on complex mathematical formulas and high-speed computer programs. Much of the rest of this article should be moved to the page on automated trading systems. Such systems run strategies including market making, inter-market spreading, arbitrage, or pure speculation such as trend following. Many fall into the category of high-frequency trading (HFT), which is characterized by high turnover. As a result, in February 2012, the Commodity Futures Trading Commission (CFTC) formed a special working group that included academics and industry experts to advise the CFTC on how best to define HFT. Algorithmic trading and HFT have resulted in a change of the market microstructure. Profitability projections have been published by the TABB Group, a financial services industry research firm. A third of all European Union and United States stock trades in 2006 were driven by automatic programs. As of 2009, studies suggested HFT firms accounted for 60–73% of all US equity trading volume, with that number falling to approximately 50% in 2012. In 2006, at the London Stock Exchange, over 40% of all orders were entered by algorithmic traders. American and European markets generally have a higher proportion of algorithmic trades than other markets, and estimates for 2008 range as high as an 80% proportion in some markets. 
Foreign exchange markets also have active algorithmic trading. Futures markets are considered fairly easy to integrate into algorithmic trading, with about 20% of options volume expected to be computer-generated by 2010. Bond markets are moving toward more access to algorithmic traders. Algorithmic trading and HFT have been the subject of much public debate since U.S. regulators' reports on the 2010 Flash Crash. The same reports found that HFT strategies may have contributed to subsequent volatility by rapidly pulling liquidity from the market. As a result of these events, the Dow Jones Industrial Average suffered its second-largest intraday point swing ever to that date, though prices quickly recovered. However, other researchers have reached a different conclusion; one 2010 study found that HFT did not significantly alter trading inventory during the Flash Crash. Some algorithmic trading ahead of index fund rebalancing transfers profits from investors. The opening automated reporting system aided the specialist in determining the market-clearing opening price. Program trading is defined by the New York Stock Exchange as an order to buy or sell 15 or more stocks valued at over US$1 million total; in practice this means that all program trades are entered with the aid of a computer. In the 1980s, program trading became widely used in trading between the S&P 500 equity and futures markets. In stock index arbitrage, a trader buys (or sells) a stock index futures contract such as the S&P 500 futures while selling (or buying) the underlying stocks. Both strategies, often simply lumped together as "program trading", were blamed by many people for exacerbating or even starting the 1987 stock market crash
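The VWAP and TWAP benchmarks named above can be computed directly from a list of trade prints. The sketch below is illustrative only; the trade data is invented, and real execution algorithms work with streaming market data rather than a fixed list.

```python
# Sketch: computing the VWAP and TWAP benchmarks from trade prints.
# The (price, size) trade data below is made up for illustration.

def vwap(trades):
    """Volume-weighted average price: sum(price * size) / sum(size)."""
    total_value = sum(price * size for price, size in trades)
    total_size = sum(size for _, size in trades)
    return total_value / total_size

def twap(prices):
    """Time-weighted average price: the plain mean of prices sampled
    at regular time intervals."""
    return sum(prices) / len(prices)

trades = [(100.0, 200), (100.5, 50), (99.5, 100)]   # (price, shares)
print(round(vwap(trades), 4))                       # 99.9286, pulled toward the big print
print(round(twap([p for p, _ in trades]), 4))       # 100.0
```

A VWAP execution algorithm tries to slice a parent order so that its own average fill price tracks this benchmark.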
Algorithmic trading
–
Financial market participants
99.
E-mail spam
–
Email spam, also known as junk email, is a type of electronic spam where unsolicited messages are sent by email. Spam email may also include malware in the form of scripts or other executable file attachments. Spam is named after Spam luncheon meat, by way of a Monty Python sketch in which Spam is ubiquitous, unavoidable, and repetitive. Email spam has steadily grown since the early 1990s. Botnets, networks of virus-infected computers, are used to send about 80% of spam. The expense of spam is borne mostly by the recipient. The legal status of spam varies from one jurisdiction to another. In the United States, spam was declared to be legal by the CAN-SPAM Act of 2003, provided the message adheres to rules set by the Act and by the FTC. ISPs have attempted to recover the cost of spam through lawsuits against spammers. Spammers collect email addresses from chatrooms, websites, customer lists, newsgroups, and viruses that harvest users' address books. These collected email addresses are also sold to other spammers. The proportion of spam was around 80% of email messages sent. From the beginning of the Internet, the sending of junk email has been prohibited. Gary Thuerk sent the first email spam message in 1978 to 600 people; he was reprimanded and told not to do it again. The ban on spam is enforced by the Terms of Service/Acceptable Use Policy of internet service providers. It was estimated in 2009 that spam cost businesses around US$130 billion. As the scale of the problem has grown, ISPs and the public have turned to government for relief from spam. Spam has several definitions varying by source: unsolicited bulk email, that is, unsolicited email sent in large quantities; and unsolicited commercial email, a more restrictive definition used by regulators whose mandate is to regulate commerce. Many spam emails contain URLs to a website or websites. 
According to a Cyberoam report in 2014, an average of 54 billion spam messages are sent every day. Pharmaceutical products jumped up 45% from the previous quarter's analysis, leading that quarter's spam pack. Emails purporting to offer jobs with fast, easy cash come in at number two, accounting for approximately 15% of all spam email. Rounding off at number three are spam emails about diet products, accounting for approximately 1%. According to information compiled by Commtouch Software Ltd., email spam for the first quarter of 2010 can be broken down as follows. Advance-fee fraud spam includes schemes such as the Nigerian 419 scam. Organized spam gangs operate from sites set up by the Russian mafia, with turf battles and revenge killings sometimes resulting. Spam is also a medium for fraudsters to scam users into entering personal information on fake websites, using emails forged to look like they are from banks or other organizations such as PayPal
E-mail spam
–
An
email box folder filled with spam messages.
100.
Boolean algebra
–
In mathematics and mathematical logic, Boolean algebra is the branch of algebra in which the values of the variables are the truth values true and false, usually denoted 1 and 0 respectively. It is thus a formalism for describing logical relations in the same way that ordinary algebra describes numeric relations. Boolean algebra was introduced by George Boole in his first book, The Mathematical Analysis of Logic. According to Huntington, the term Boolean algebra was first suggested by Sheffer in 1913. Boolean algebra has been fundamental in the development of digital electronics, and it is also used in set theory and statistics. Boole's algebra predated the modern developments in abstract algebra and mathematical logic. In an abstract setting, Boolean algebra was perfected in the late 19th century by Jevons, Schröder, and Huntington; in fact, M. H. Stone proved in 1936 that every Boolean algebra is isomorphic to a field of sets. Shannon already had at his disposal the abstract mathematical apparatus, and thus he cast his switching algebra as the two-element Boolean algebra. In circuit-engineering settings today there is little need to consider other Boolean algebras, so switching algebra and Boolean algebra are often used interchangeably. Efficient implementation of Boolean functions is a fundamental problem in the design of combinational logic circuits. Logic sentences that can be expressed in classical propositional calculus have an equivalent expression in Boolean algebra; thus, Boolean logic is sometimes used to denote propositional calculus performed in this way. Boolean algebra is not sufficient to capture logic formulas using quantifiers. The closely related model of computation known as a Boolean circuit relates time complexity to circuit complexity. Whereas in elementary algebra expressions denote mainly numbers, in Boolean algebra they denote the truth values false and true; these values are represented with the bits 0 and 1. 
Addition and multiplication then play the Boolean roles of XOR (exclusive or) and AND (conjunction) respectively, with addition taken modulo 2. Boolean algebra also deals with functions which have their values in the set {0, 1}. A sequence of bits is a commonly used example of such a function. Another common example is the family of subsets of a set E: to a subset F of E is associated the indicator function that takes the value 1 on F and 0 outside F. The most general example is the elements of a Boolean algebra. As with elementary algebra, the purely equational part of the theory may be developed without considering explicit values for the variables. The basic operations of Boolean calculus are as follows. AND, denoted x∧y, satisfies x∧y = 1 if x = y = 1 and x∧y = 0 otherwise. OR, denoted x∨y, satisfies x∨y = 0 if x = y = 0 and x∨y = 1 otherwise. NOT, denoted ¬x, satisfies ¬x = 0 if x = 1 and ¬x = 1 if x = 0. Alternatively, the values of x∧y, x∨y, and ¬x can be expressed by tabulating their values with truth tables. A further operation, x → y, or Cxy, is called material implication: if x is true, then the value of x → y is taken to be that of y
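The basic operations and their arithmetic reading (XOR as addition mod 2, AND as multiplication) can be checked mechanically over the two-element algebra. This is a minimal sketch; the function names are ours, not standard library names.

```python
# Sketch: the basic Boolean operations over {0, 1}, plus the arithmetic
# reading in which AND is multiplication and XOR is addition mod 2.

AND = lambda x, y: x & y
OR  = lambda x, y: x | y
NOT = lambda x: 1 - x
XOR = lambda x, y: (x + y) % 2   # addition modulo 2

# Exhaustively check the identities over both values of each variable.
for x in (0, 1):
    for y in (0, 1):
        assert AND(x, y) == x * y                     # AND is multiplication
        assert OR(x, y) == NOT(AND(NOT(x), NOT(y)))   # De Morgan's law
print("identities hold on all of {0, 1} x {0, 1}")
```

Because the variables range over only two values, such exhaustive checking is itself a tiny instance of truth-table verification.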
Boolean algebra
–
Figure 2. Venn diagrams for conjunction, disjunction, and complement
101.
Universal Turing Machine
–
In computer science, a universal Turing machine (UTM) is a Turing machine that can simulate an arbitrary Turing machine on arbitrary input. The universal machine essentially achieves this by reading both the description of the machine to be simulated and the input thereof from its own tape. Alan Turing introduced this machine in 1936–1937. It is also known as a universal computing machine, universal machine, machine U, or simply U. In terms of computational complexity, a multi-tape universal Turing machine need only be slower by a logarithmic factor compared to the machines it simulates. Every Turing machine computes a certain fixed partial computable function from the strings over its alphabet. In that sense it behaves like a computer with a fixed program. However, we can encode the action table of any Turing machine in a string. Turing described such a construction in complete detail in his 1936 paper. Davis makes a case that the stored-program computer is an incarnation of a Turing machine, and that John von Neumann built on the work of Alan Turing. Davis argues that Turing's Automatic Computing Engine (ACE) computer anticipated the notions of microprogramming; Knuth cites Turing's work on the ACE computer as designing hardware to facilitate subroutine linkage, and Davis also references this work as Turing's use of a hardware stack. As the Turing machine was encouraging the construction of computers, the UTM was encouraging the development of the computer sciences. An early, if not the very first, assembler was proposed by a young hot-shot programmer for the EDVAC. Knuth observes that the subroutine return embedded in the program itself rather than in special registers is attributable to von Neumann and Goldstine. Knuth furthermore states that the first interpretive routine may be said to be the Universal Turing Machine, and that interpretive routines in the conventional sense were mentioned by John Mauchly in his lectures at the Moore School in 1946. 
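The idea that a machine's action table can be encoded as data and interpreted can be shown in miniature: a simulator that takes any transition table as input. This is an illustrative sketch, not Turing's own encoding; the example machine (which flips every bit and halts at the first blank) is invented.

```python
# Sketch: a Turing machine simulator whose transition table is ordinary
# data, illustrating the "machine description as input" idea behind the
# universal machine.

def run(table, tape, state="q0", pos=0, blank="_", max_steps=1000):
    """table maps (state, symbol) -> (write, move, next_state)."""
    cells = dict(enumerate(tape))
    for _ in range(max_steps):
        symbol = cells.get(pos, blank)
        if (state, symbol) not in table:
            break                              # no applicable rule: halt
        write, move, state = table[(state, symbol)]
        cells[pos] = write
        pos += {"R": 1, "L": -1}[move]
    return "".join(cells[i] for i in sorted(cells)).rstrip(blank)

# Example machine: flip every bit, moving right, halting at the blank.
flip = {
    ("q0", "0"): ("1", "R", "q0"),
    ("q0", "1"): ("0", "R", "q0"),
}
print(run(flip, "1011"))   # 0100
```

The simulator itself is fixed; changing only the `table` argument changes which machine is being run, which is precisely the universality idea.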
Turing took part in this development as well; interpretive systems for the Pilot ACE computer were written under his direction. Davis briefly mentions operating systems and compilers as outcomes of the notion of program-as-data. Some, however, might raise issues with this assessment: at the time, only a small cadre of researchers were intimately involved with the architecture of the new digital computers. These two aspects of theory and practice developed almost entirely independently of each other; the main reason is undoubtedly that logicians are interested in questions radically different from those with which the applied mathematicians and electrical engineers are primarily concerned. It cannot, however, fail to strike one as rather strange that often the same concepts are expressed by different terms in the two developments. Wang hoped that his paper would connect the two approaches. Indeed, Minsky confirms this: the first formulation of Turing-machine theory in computer-like models appears in Wang. Minsky goes on to demonstrate the Turing equivalence of a counter machine. With respect to the reduction of computers to simple Turing-equivalent models, Minsky's designation of Wang as having made the first formulation is open to debate. The names of the mathematicians Hermes and Kaphengst appear in the bibliographies of both Shepherdson–Sturgis and Elgot–Robinson. Two other names of importance are the Canadian researchers Melzak and Lambek
Universal Turing Machine
–
Turing machines
102.
Normal distribution
–
In probability theory, the normal distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. The normal distribution is useful because of the central limit theorem. Physical quantities that are expected to be the sum of many independent processes often have distributions that are nearly normal. Moreover, many results and methods can be derived analytically in explicit form when the relevant variables are normally distributed. The normal distribution is sometimes informally called the bell curve; however, many other distributions are bell-shaped. The probability density of the normal distribution is

f(x) = (1 / √(2πσ²)) e^(−(x − μ)² / (2σ²)),

where μ is the mean or expectation of the distribution, σ is the standard deviation, and σ² is the variance. A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate. The simplest case of a normal distribution is known as the standard normal distribution, with μ = 0 and σ = 1. The factor 1/2 in the exponent ensures that the distribution has unit variance; the density is symmetric around x = 0, where it attains its maximum value 1/√(2π), and has inflection points at x = +1 and x = −1. Authors may differ also on which normal distribution should be called the standard one; in any case the probability density must be scaled by 1/σ so that the integral is still 1. If Z is a standard normal deviate, then X = Zσ + μ will have a normal distribution with expected value μ and standard deviation σ. Conversely, if X is a normal deviate with parameters μ and σ², then Z = (X − μ)/σ will have a standard normal distribution. Every normal density is the exponential of a quadratic function: f(x) = e^(ax² + bx + c), where a is negative. In this form, the mean value is μ = −b/(2a); for the standard normal distribution, a is −1/2, b is zero, and c is −ln(2π)/2. The density of the standard Gaussian distribution is denoted with the Greek letter ϕ. 
The alternative form of the Greek phi letter, φ, is also used quite often. The normal distribution is often denoted by N(μ, σ²); thus when a random variable X is distributed normally with mean μ and variance σ², one writes X ∼ N(μ, σ²). Some authors advocate using the precision τ as the parameter defining the width of the distribution, instead of the standard deviation σ or the variance σ²
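The density and the standardizing relation X = Zσ + μ can be checked numerically with a few lines. This is an illustrative sketch; `normal_pdf` is our own helper, not a library function.

```python
# Sketch: the normal density written out directly, checked against the
# standard-normal facts: the peak is 1/sqrt(2*pi) at x = 0, and scaling
# by 1/sigma relates N(mu, sigma^2) to the standard normal.

import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Peak of the standard normal is 1/sqrt(2*pi).
print(round(normal_pdf(0.0), 4))   # 0.3989

# If Z ~ N(0, 1) then X = Z*sigma + mu ~ N(mu, sigma^2), and the density
# of X at z*sigma + mu equals the standard density at z, scaled by 1/sigma.
z, mu, sigma = 1.0, 5.0, 2.0
x = z * sigma + mu
print(round(normal_pdf(x, mu, sigma), 4), round(normal_pdf(z) / sigma, 4))   # both 0.121
```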
Normal distribution
–
The
bean machine, a device invented by
Francis Galton, can be called the first generator of normal random variables. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve.
Normal distribution
–
Probability density function
Normal distribution
–
Carl Friedrich Gauss discovered the normal distribution in 1809 as a way to rationalize the
method of least squares.
Normal distribution
–
Marquis de Laplace proved the
central limit theorem in 1810, consolidating the importance of the normal distribution in statistics.
103.
Eight queens puzzle
–
The eight queens puzzle is the problem of placing eight chess queens on an 8×8 chessboard so that no two queens threaten each other. Thus, a solution requires that no two queens share the same row, column, or diagonal. The chess composer Max Bezzel published the eight queens puzzle in 1848, and Franz Nauck published the first solutions in 1850. Nauck also extended the puzzle to the n-queens problem, with n queens on a chessboard of n × n squares. Since then, many mathematicians, including Carl Friedrich Gauss, have worked on both the eight queens puzzle and its generalized n-queens version. In 1874, S. Günther proposed a method using determinants to find solutions, and in 1972 Edsger Dijkstra used this problem to illustrate the power of what he called structured programming. It is possible to use shortcuts that reduce computational requirements, or rules of thumb that avoid brute-force computational techniques. Generating permutations further reduces the possibilities to just 40,320 (that is, 8!), which are then checked for diagonal attacks. Martin Richards published a program to count solutions to the problem using bitwise operations. The eight queens puzzle has 92 distinct solutions; if solutions that differ only by the symmetry operations of rotation and reflection of the board are counted as one, the puzzle has 12 solutions. These are called fundamental solutions; representatives of each are shown below. A fundamental solution usually has eight variants, obtained by rotating the board by 90°, 180°, or 270° and reflecting. However, should a solution be equivalent to its own 90° rotation, that fundamental solution will have only two variants; should a solution be equivalent to its own 180° rotation, it will have four variants. If n > 1, it is not possible for a solution to be equivalent to its own reflection, because that would require two queens to be facing each other. The fundamental solutions are presented below; solution 10 has the property that no three queens are in a straight line. 
These brute-force algorithms to count the number of solutions are computationally manageable for n = 8. If the goal is to find a single solution, then explicit solutions exist for all n ≥ 4, requiring no combinatorial search whatsoever. The explicit solutions exhibit stair-stepped patterns, as in the examples for n = 8, 9 and 10. Let (i, j) be the square in column i and row j on the n × n chessboard; the explicit constructions place queens at fixed squares according to whether n is even with n ≠ 6k + 2, even with n ≠ 6k, or odd, in which case one of the even patterns is used for n − 1 and one further queen is added. If the remainder is 2, swap 1 and 3 in the odd list; if the remainder is 3, move 2 to the end of the even list and 1 and 3 to the end of the odd list. Append the odd list to the even list and place queens in the rows given by these numbers. For n = 8 this results in fundamental solution 1 above
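The count of 92 distinct solutions quoted above can be reproduced by a standard backtracking search, placing one queen per column and pruning rows and both diagonals. This is a generic sketch of the technique, not Martin Richards's bitwise program.

```python
# Sketch: counting all n-queens solutions by backtracking. Queens are
# placed one per column; a row, a "/" diagonal (row + col) and a "\"
# diagonal (row - col) may each hold at most one queen.

def count_queens(n):
    def place(col, rows, diag1, diag2):
        if col == n:
            return 1                      # all columns filled: one solution
        total = 0
        for row in range(n):
            if row not in rows and row - col not in diag1 and row + col not in diag2:
                total += place(col + 1,
                               rows | {row},
                               diag1 | {row - col},
                               diag2 | {row + col})
        return total
    return place(0, set(), set(), set())

print(count_queens(8))   # 92
```

Because the search prunes as soon as two queens attack each other, it explores far fewer than the 8! = 40,320 permutations mentioned above.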
Eight queens puzzle
104.
Travelling salesman problem
–
The travelling salesman problem (TSP) is an NP-hard problem in combinatorial optimization, important in operations research and theoretical computer science. TSP is a special case of the travelling purchaser problem and of the vehicle routing problem. In the theory of computational complexity, the decision version of the TSP belongs to the class of NP-complete problems. Thus, it is possible that the worst-case running time for any algorithm for the TSP increases superpolynomially with the number of cities. The problem was first formulated in 1930 and is one of the most intensively studied problems in optimization; it is used as a benchmark for many optimization methods. The TSP has several applications even in its purest formulation, such as planning and logistics. Slightly modified, it appears as a sub-problem in many areas, such as DNA sequencing. The TSP also appears in astronomy, as astronomers observing many sources will want to minimize the time spent moving the telescope between the sources. In many applications, additional constraints such as limited resources or time windows may be imposed. The origins of the travelling salesman problem are unclear. A handbook for travelling salesmen from 1832 mentions the problem and includes example tours through Germany and Switzerland. The travelling salesman problem was mathematically formulated in the 1800s by the Irish mathematician W. R. Hamilton and by the British mathematician Thomas Kirkman. Hamilton's Icosian Game was a puzzle based on finding a Hamiltonian cycle. Hassler Whitney at Princeton University introduced the name "travelling salesman problem" soon after. Dantzig, Fulkerson and Johnson speculated that, given a near-optimal solution, one may be able to find optimality or prove optimality by adding a small number of extra inequalities. They used this idea to solve their initial 49-city problem using a string model, and they found they only needed 26 cuts to come to a solution. 
As well as cutting-plane methods, Dantzig, Fulkerson and Johnson used branch-and-bound. In the following decades, the problem was studied by many researchers from mathematics, computer science, chemistry, physics, and other sciences. Christofides made a big advance in this approach by giving an algorithm with a known worst-case guarantee: his algorithm, given in 1976, produces a tour that at worst is 1.5 times longer than the optimal solution. As the algorithm was so simple and quick, many hoped it would give way to a near-optimal solution method; however, the guarantee stood until 2011, when it was beaten by less than a billionth of a percent. Richard M. Karp showed in 1972 that the Hamiltonian cycle problem was NP-complete, which implies the NP-hardness of TSP; this supplied a mathematical explanation for the apparent computational difficulty of finding optimal tours. Great progress was made in the late 1970s and 1980s, when Grötschel, Padberg, Rinaldi and others managed to exactly solve instances with up to 2,392 cities, using cutting planes and branch-and-bound. In the 1990s, Applegate, Bixby, Chvátal, and Cook developed the program Concorde, which has been used in many recent record solutions
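For very small instances the problem can be solved exactly by trying every tour, which makes the superpolynomial growth mentioned above concrete: the number of permutations explodes with the city count. The distance matrix below is invented for illustration; this brute-force sketch is not one of the sophisticated methods (cutting planes, branch-and-bound) discussed in the text.

```python
# Sketch: exact brute-force TSP on a tiny symmetric instance by trying
# every permutation of the cities after city 0.

from itertools import permutations

def tour_length(dist, tour):
    """Total length of the closed tour, returning to the start."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def brute_force_tsp(dist):
    n = len(dist)
    best = min(permutations(range(1, n)),
               key=lambda rest: tour_length(dist, (0,) + rest))
    best_tour = (0,) + best
    return best_tour, tour_length(dist, best_tour)

dist = [                      # made-up symmetric distance matrix
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
tour, length = brute_force_tsp(dist)
print(tour, length)   # (0, 1, 3, 2) 18
```

With n cities this examines (n − 1)! tours, which is why exact solvers for thousands of cities need the cutting-plane machinery described above instead.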
Travelling salesman problem
–
William Rowan Hamilton
Travelling salesman problem
–
Solution of a travelling salesman problem
105.
Integer factorization
–
In number theory, integer factorization is the decomposition of a composite number into a product of smaller integers. If these integers are further restricted to prime numbers, the process is called prime factorization. When the numbers are large, no efficient, non-quantum integer factorization algorithm is known; however, it has not been proven that no efficient algorithm exists. The presumed difficulty of this problem is at the heart of widely used algorithms in cryptography such as RSA. Many areas of mathematics and computer science have been brought to bear on the problem, including elliptic curves and algebraic number theory. Not all numbers of a given length are equally hard to factor. The hardest instances of these problems are semiprimes, the product of two prime numbers. Many cryptographic protocols are based on the difficulty of factoring large composite integers or a related problem, for example, the RSA problem. An algorithm that efficiently factors an arbitrary integer would render RSA-based public-key cryptography insecure. By the fundamental theorem of arithmetic, every positive integer has a unique prime factorization. If the integer is prime, then it can be recognized as such in polynomial time. If it is composite, however, the theorem gives no insight into how to obtain the factors. Given a general algorithm for integer factorization, any integer can be factored down to its constituent prime factors simply by repeated application of this algorithm. The situation is more complicated with special-purpose factorization algorithms, whose benefits may not be realized as well, or even at all, with the factors produced during decomposition. For example, if N = 10 × p × q where p < q are very large primes, trial division will quickly produce the factors 2 and 5 but will take p divisions to find the next factor. 
Among the b-bit numbers, the most difficult to factor in practice using existing algorithms are those that are products of two primes of similar size; for this reason, these are the integers used in cryptographic applications. The largest such semiprime yet factored was RSA-768, a 768-bit number with 232 decimal digits. This factorization was a collaboration of several research institutions, spanning two years and taking the equivalent of almost 2000 years of computing on a single-core 2.2 GHz AMD Opteron. Like all recent factorization records, this factorization was completed with an optimized implementation of the general number field sieve run on hundreds of machines. No algorithm has been published that can factor all integers in polynomial time. Neither the existence nor the non-existence of such algorithms has been proved, but it is generally suspected that they do not exist and hence that the problem is not in class P. The problem is clearly in class NP, but it has not been proved to be, or not to be, NP-complete; it is generally suspected not to be NP-complete. There are published algorithms that are faster than O((1 + ε)^b) for all positive ε, i.e., sub-exponential. The best published asymptotic running time is for the general number field sieve (GNFS) algorithm, which, for a b-bit number n, is exp(((64/9)^(1/3) + o(1)) (ln n)^(1/3) (ln ln n)^(2/3)). For current computers, GNFS is the best published algorithm for large n. For a quantum computer, however, Peter Shor discovered an algorithm in 1994 that solves it in polynomial time
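Trial division, the simplest factorization method mentioned above, is easy to write down and works fine for small numbers, which makes the contrast with large semiprimes vivid: the same loop would need on the order of p divisions to split a large semiprime p × q. A minimal sketch:

```python
# Sketch: prime factorization by trial division. Adequate for small
# inputs; hopeless for the large semiprimes used in cryptography, since
# it needs about sqrt(n) divisions in the worst case.

def trial_division(n):
    factors = []
    d = 2
    while d * d <= n:          # only divisors up to sqrt(n) need checking
        while n % d == 0:
            factors.append(d)  # d divides n: record it and divide it out
            n //= d
        d += 1
    if n > 1:
        factors.append(n)      # whatever remains is prime
    return factors

print(trial_division(864))   # [2, 2, 2, 2, 2, 3, 3, 3], i.e. 2^5 * 3^3
```

Repeatedly dividing out each found factor is exactly the "repeated application" strategy described above for reducing an integer to its constituent primes.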
Integer factorization
–
This image demonstrates the prime decomposition of 864. A shorthand way of writing the resulting prime factors is 2⁵ × 3³
Central processing unit
–
The computer industry has used the term central processing unit at least since the early 1960s. The form, design, and implementation of CPUs have changed over the course of their history. Most modern CPUs are microprocessors, meaning they are contained on a single integrated circuit (IC) chip. An IC that contains a CPU may also contain memory, peripheral interfaces, and other components of a computer. Some computers employ a multi-core processor, which is a single chip containing two or more CPUs called cores; in that context, one can speak of such single chips as sockets. Array processors or vector processors have multiple processors that operate in parallel, with no unit considered central. There also exists the concept of virtual CPUs, which are an abstraction of dynamically aggregated computational resources. Early computers such as the ENIAC had to be physically rewired to perform different tasks. Since the term CPU is generally defined as a device for software execution, the earliest devices that could rightly be called CPUs came with the advent of the stored-program computer. The idea of a stored-program computer was already present in the design of J. Presper Eckert and John William Mauchly's ENIAC, but was initially omitted so that the machine could be finished sooner. On June 30, 1945, before ENIAC was made, mathematician John von Neumann distributed the paper entitled First Draft of a Report on the EDVAC. It was the outline of a stored-program computer that would eventually be completed in August 1949. EDVAC was designed to perform a certain number of instructions of various types. Significantly, the programs written for EDVAC were to be stored in high-speed computer memory rather than specified by the physical wiring of the computer. This overcame a severe limitation of ENIAC, which was the considerable time and effort required to reconfigure it for a new task. With von Neumann's design, the program that EDVAC ran could be changed simply by changing the contents of the memory. Early CPUs were custom designs used as part of a larger, and sometimes distinctive, computer. However, this method of designing custom CPUs for a particular application has largely given way to the development of multi-purpose processors produced in large quantities. 
This standardization began in the era of discrete transistor mainframes and minicomputers and has accelerated with the popularization of the integrated circuit. The IC has allowed increasingly complex CPUs to be designed and manufactured to tolerances on the order of nanometers. Both the miniaturization and standardization of CPUs have increased the presence of digital devices in modern life far beyond the limited application of dedicated computing machines. Modern microprocessors appear in electronic devices ranging from automobiles to cellphones. The so-called Harvard architecture of the Harvard Mark I, which was completed before EDVAC, also utilized a stored-program design, using punched paper tape rather than electronic memory. Relays and vacuum tubes were used as switching elements; a useful computer requires thousands or tens of thousands of switching devices. The overall speed of a system is dependent on the speed of the switches. Tube computers like EDVAC tended to average eight hours between failures, whereas relay computers like the Harvard Mark I failed very rarely. In the end, tube-based CPUs became dominant because the significant speed advantages they afforded generally outweighed the reliability problems. Most of these early synchronous CPUs ran at low clock rates compared to modern microelectronic designs. Clock signal frequencies ranging from 100 kHz to 4 MHz were very common at this time. The design complexity of CPUs increased as various technologies facilitated building smaller and more reliable electronic devices.
An Intel 80486DX2 CPU, as seen from above
Bottom side of an Intel 80486DX2
EDVAC, one of the first stored-program computers
CPU, core memory, and external bus interface of a DEC PDP-8/I, made of medium-scale integrated circuits