1.
Neutral network (evolution)
–
A neutral network is a set of genes, all related by point mutations, that have equivalent function or fitness. Each node represents a gene sequence and each line represents the mutation connecting two sequences. Neutral networks can be thought of as high, flat plateaus in a fitness landscape. During neutral evolution, genes can randomly move through neutral networks and traverse regions of sequence space, which may have consequences for robustness and evolvability. Neutral networks exist in fitness landscapes because proteins are robust to mutations; this leads to extended networks of genes of equivalent function, linked by neutral mutations. Proteins are resistant to mutations because many sequences can fold into highly similar structural folds. A protein adopts a limited ensemble of native conformations because those conformers have lower energy than unfolded and mis-folded states; this is achieved by a distributed, internal network of cooperative interactions. Protein structural robustness results from few single mutations being sufficiently disruptive to compromise function. Proteins have also evolved to avoid aggregation, since partially folded proteins can combine to form large, repeating, insoluble protein fibrils; there is evidence that proteins show negative design features to reduce the exposure of aggregation-prone beta-sheet motifs in their structures. Additionally, there is evidence that the genetic code itself may be optimised such that most point mutations lead to similar amino acids. Together these factors create a distribution of fitness effects of mutations that contains a proportion of neutral and nearly-neutral mutations. Neutral networks are a subset of the sequences in sequence space that have equivalent function; neutral evolution can therefore be visualised as a population diffusing from one cluster of sequence nodes, through the neutral network, to another cluster of sequence nodes.
Since the majority of evolution is thought to be neutral, a proportion of gene change is movement through expansive neutral networks. The more neutral neighbours a sequence has, the more robust it is to mutations, since mutations are likely simply to convert it neutrally into an equally functional sequence. Indeed, if there are differences in the number of neutral neighbours between sequences within a neutral network, the population is predicted to evolve towards the more robust sequences. This is sometimes called circum-neutrality and represents the movement of populations away from cliffs in the fitness landscape. In addition to in silico models, these processes are beginning to be confirmed by experimental evolution of cytochrome P450s and beta-lactamase. This, however, would only be the case when the distance between activities is smaller than the distance that a neutrally evolving population can cover; the degree of interpenetration of the two networks will determine how common cryptic variation for the promiscuous activity is in sequence space.
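The drift described above can be sketched with a toy simulation. Everything here is a made-up illustration: sequences are short bit strings, a single arbitrary neutrality criterion stands in for "equivalent function", and point mutations are single bit flips.

```python
import random

def neighbours(seq):
    """All sequences one point mutation away (single bit flip)."""
    return [seq[:i] + (1 - seq[i],) + seq[i + 1:] for i in range(len(seq))]

def is_neutral(seq):
    """Toy neutrality criterion: function is preserved if the sequence has >= 2 ones."""
    return sum(seq) >= 2

def robustness(seq):
    """Number of neutral neighbours: mutations that preserve function."""
    return sum(1 for n in neighbours(seq) if is_neutral(n))

def neutral_walk(start, steps, rng):
    """Random drift: accept a mutation only if the mutant stays on the network."""
    seq = start
    for _ in range(steps):
        mutant = rng.choice(neighbours(seq))
        if is_neutral(mutant):
            seq = mutant
    return seq

rng = random.Random(0)
start = (1, 1, 0, 0, 0)
end = neutral_walk(start, 1000, rng)
print(robustness(start), robustness(end), is_neutral(end))
```

A sequence's robustness is simply its count of neutral neighbours, and the walk only ever accepts mutations that stay on the network, so the final sequence is still functional.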
2.
Machine learning
–
Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives computers the ability to learn without being explicitly programmed. Machine learning is related to computational statistics, which also focuses on prediction-making through the use of computers, and it has strong ties to mathematical optimization, which delivers methods and theory to the field. Machine learning is sometimes conflated with data mining, although the latter subfield focuses more on exploratory data analysis. Machine learning can also be unsupervised, and can be used to learn and establish baseline behavioral profiles for various entities. Tom M. Mitchell provided a more formal, widely quoted definition of the field; it echoes Alan Turing's proposal that the question "Can machines think?" be replaced with the question "Can machines do what we can do?". In the proposal he explores the various characteristics that could be possessed by a thinking machine. Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning signal or feedback available to a learning system. These are: supervised learning, where the computer is presented with example inputs and their desired outputs, given by a teacher; unsupervised learning, where no labels are given to the learning algorithm, and unsupervised learning can be a goal in itself or a means towards an end; and reinforcement learning, where a computer program interacts with an environment in which it must perform a certain goal and is provided feedback in terms of rewards and punishments as it navigates its problem space. Between supervised and unsupervised learning is semi-supervised learning, where the teacher gives an incomplete training signal: a training set with some of the target outputs missing. Transduction is a special case of this principle, where the entire set of problem instances is known at learning time. Among other categories of machine learning problems, learning to learn learns its own inductive bias based on previous experience; this is typically tackled in a supervised way.
Spam filtering is an example of classification, where the inputs are email messages and the classes are "spam" and "not spam". In regression, also a supervised problem, the outputs are continuous rather than discrete. In clustering, a set of inputs is to be divided into groups; unlike in classification, the groups are not known beforehand, making this typically an unsupervised task. Density estimation finds the distribution of inputs in some space. Dimensionality reduction simplifies inputs by mapping them into a lower-dimensional space. Topic modeling is a related problem, where a program is given a list of human-language documents and is tasked to find out which documents cover similar topics. As a scientific endeavour, machine learning grew out of the quest for artificial intelligence; already in the early days of AI as an academic discipline, some researchers were interested in having machines learn from data.
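As a toy illustration of the supervised classification task just described, here is a minimal spam filter using a 1-nearest-neighbour rule over bag-of-words vectors. The vocabulary, messages and labels are invented for the example.

```python
def bag_of_words(text, vocabulary):
    """Represent a message as word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [words.count(w) for w in vocabulary]

def predict(x, examples):
    """Return the label of the training example closest to x (1-NN)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(examples, key=lambda ex: dist(ex[0], x))[1]

vocab = ["free", "win", "meeting", "report"]
training = [
    (bag_of_words("win free money free", vocab), "spam"),
    (bag_of_words("meeting report attached", vocab), "not spam"),
]
print(predict(bag_of_words("free win now", vocab), training))  # "spam"
```

The labeled training pairs are what make this supervised; a clustering algorithm given the same vectors without labels would have to discover the two groups on its own.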
Machine learning and data mining
3.
Statistical classification
–
An example would be assigning a given email to the "spam" or "non-spam" class, or assigning a diagnosis to a given patient based on observed characteristics of the patient. Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance. Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may variously be categorical, ordinal, integer-valued or real-valued; other classifiers work by comparing observations to previous observations by means of a similarity or distance function. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category. Terminology across fields is quite varied: in machine learning, the observations are often known as instances, the explanatory variables are termed features, and the possible categories to be predicted are classes. Classification and clustering are examples of the more general problem of pattern recognition. A common subclass of classification is probabilistic classification; algorithms of this nature use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a best class, probabilistic algorithms output a probability of the instance being a member of each of the possible classes. The best class is then selected as the one with the highest probability.
Such an algorithm has numerous advantages over non-probabilistic classifiers: it can output a confidence value associated with its choice and, correspondingly, it can abstain when its confidence in choosing any particular output is too low. Early work on statistical classification assumed that data values within each of the two groups had a multivariate normal distribution. The extension of this context to more than two groups has also been considered, with a restriction imposed that the classification rule should be linear. Bayesian procedures tend to be computationally expensive and, in the days before Markov chain Monte Carlo computations were developed, approximations for Bayesian clustering rules were devised. Classification can be thought of as two separate problems: binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers. Most algorithms describe an individual instance whose category is to be predicted using a feature vector of individual, measurable properties of the instance.
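The probabilistic decision rule described above, including the ability to abstain, can be sketched in a few lines. The class probabilities are assumed to come from some already-fitted model; the labels and numbers are invented.

```python
def decide(class_probs, threshold=0.0):
    """Pick the most probable class; abstain (return None) below the threshold.

    class_probs: dict mapping class label -> probability.
    """
    best = max(class_probs, key=class_probs.get)
    if class_probs[best] < threshold:
        return None  # abstain: confidence too low
    return best

print(decide({"spam": 0.9, "not spam": 0.1}))                   # "spam"
print(decide({"spam": 0.55, "not spam": 0.45}, threshold=0.7))  # None
```

A non-probabilistic classifier would only ever return a label; the per-class probabilities are what make abstention possible.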
4.
Cluster analysis
–
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster; popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis. The notion of a cluster cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties; understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include: connectivity models, for example hierarchical clustering, which builds models based on distance connectivity; and centroid models, for example the k-means algorithm, which represents each cluster by a single mean vector.
Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm. Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space. Subspace models: in biclustering, clusters are modeled with both cluster members and relevant attributes. Group models: some algorithms do not provide a refined model for their results and just provide the grouping information. Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster; relaxations of the complete connectivity requirement are known as quasi-cliques, as in the HCS clustering algorithm. A clustering is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms.
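A minimal sketch of the centroid model mentioned above: a toy one-dimensional k-means, with made-up data and a fixed number of iterations rather than a proper convergence test.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Toy 1-D k-means: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    means = rng.sample(points, k)            # initialize with k data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign each point to the nearest mean
            nearest = min(range(k), key=lambda j: (p - means[j]) ** 2)
            clusters[nearest].append(p)
        means = [sum(c) / len(c) if c else means[i]   # recompute cluster means
                 for i, c in enumerate(clusters)]
    return sorted(means)

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(kmeans(data, 2))
```

Each cluster is represented purely by its mean vector, which is exactly why k-means finds compact, roughly spherical groups and struggles with the density-based or graph-based cluster notions listed above.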
The result of a cluster analysis shown as the coloring of the squares into three clusters.
5.
Structured prediction
–
Structured prediction or structured learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values. Probabilistic graphical models form a large class of structured prediction models; other algorithms and models for structured prediction include inductive logic programming, case-based reasoning, structured SVMs and Markov logic networks. Sequence tagging is a class of problems prevalent in natural language processing, where input data are often sequences. The sequence tagging problem appears in several guises, e.g. part-of-speech (POS) tagging. In POS tagging, for example, each word in a sequence must receive a tag that expresses its type of word: "This/DT is/VBZ a/DT tagged/JJ sentence/NN". The main challenge of this problem is to resolve ambiguity: the word "sentence" can also be a verb in English, for example. One of the easiest ways to understand algorithms for general structured prediction is the structured perceptron of Collins. This algorithm combines the perceptron algorithm for learning linear classifiers with an inference algorithm. First define a joint feature function Φ(x, y) that maps a training sample x and a candidate prediction y to a vector of length n, and let GEN be a function that generates candidate predictions; the idea of learning is similar to the multiclass perceptron. See also: conditional random field; structured support vector machine; recurrent neural network, in particular Elman networks. References: Noah Smith, "Linguistic Structure Prediction", 2011; Michael Collins, "Discriminative Training Methods for Hidden Markov Models", 2002. External links: an implementation of Collins' structured perceptron; a structured perceptron with a hashtag prediction system.
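A toy version of Collins' structured perceptron for the tagging problem above can be sketched as follows. Here GEN simply enumerates every tag sequence (feasible only for tiny examples; real implementations use Viterbi-style inference), Φ counts (word, tag) pairs, and the tags, words and training pairs are invented.

```python
from itertools import product

TAGS = ["DT", "NN", "VBZ"]

def phi(words, tags):
    """Joint feature map Φ(x, y): counts of (word, tag) co-occurrences."""
    feats = {}
    for w, t in zip(words, tags):
        feats[(w, t)] = feats.get((w, t), 0) + 1
    return feats

def score(weights, feats):
    return sum(weights.get(f, 0) * v for f, v in feats.items())

def predict(weights, words):
    gen = product(TAGS, repeat=len(words))   # GEN: all candidate tag sequences
    return max(gen, key=lambda tags: score(weights, phi(words, tags)))

def train(data, epochs=5):
    weights = {}
    for _ in range(epochs):
        for words, gold in data:
            guess = predict(weights, words)
            if guess != tuple(gold):         # perceptron update on mistakes
                for f, v in phi(words, gold).items():
                    weights[f] = weights.get(f, 0) + v
                for f, v in phi(words, guess).items():
                    weights[f] = weights.get(f, 0) - v
    return weights

data = [(["this", "is"], ["DT", "VBZ"]), (["a", "sentence"], ["DT", "NN"])]
w = train(data)
print(predict(w, ["this", "sentence"]))  # ('DT', 'NN')
```

The update is exactly the multiclass perceptron rule, with the argmax taken over structured outputs rather than a fixed label set.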
6.
Semi-supervised learning
–
Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent or a physical experiment. The cost associated with the labeling process thus may render a fully labeled training set infeasible, whereas acquisition of unlabeled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning. As in the supervised learning framework, we are given a set of l independently identically distributed examples x_1, …, x_l ∈ X with corresponding labels y_1, …, y_l ∈ Y. Additionally, we are given u unlabeled examples x_{l+1}, …, x_{l+u} ∈ X. Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data x_{l+1}, …, x_{l+u} only; the goal of inductive learning is to infer the correct mapping from X to Y. Intuitively, we can think of the learning problem as an exam, with the labeled data as the sample problems that the teacher solves in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular; in the inductive setting, these are practice problems of the sort you will encounter on the in-class exam. In order to make any use of unlabeled data, we must assume some structure to the underlying distribution of the data. Semi-supervised learning algorithms make use of at least one of the following assumptions. Smoothness assumption: points which are close to each other are more likely to share a label; this is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. Cluster assumption: the data tend to form discrete clusters, and points in the same cluster are more likely to share a label; this is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms.
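One simple semi-supervised strategy that leans on the cluster assumption is self-training. The sketch below, with made-up 1-D data and a nearest-centroid base classifier (neither of which comes from the text above), labels the unlabeled pool with the current model and then refits on the enlarged labeled set.

```python
def centroids(labeled):
    """Mean position of each class among the labeled points."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(cents, x):
    return min(cents, key=lambda y: abs(x - cents[y]))

def self_train(labeled, unlabeled, rounds=3):
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = centroids(labeled)
        # label every pooled point with the current model, then refit
        labeled += [(x, predict(cents, x)) for x in pool]
        pool = []
    return centroids(labeled)

cents = self_train([(0.0, "a"), (10.0, "b")], [1.0, 2.0, 8.0, 9.0])
print(predict(cents, 4.0), predict(cents, 6.0))  # a b
```

The two labeled points alone would put the decision boundary at 5.0 regardless; the unlabeled points pull each centroid toward its cluster, which only helps if the cluster assumption actually holds.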
7.
Learning to rank
–
Learning to rank, or machine-learned ranking (MLR), is the application of machine learning in the construction of ranking models for information retrieval systems. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item. The ranking model's purpose is to rank, i.e. produce a permutation of items in new, unseen lists in a way that is similar to rankings in the training data. Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, and online advertising. A possible architecture of a machine-learned search engine is shown in the figure to the right. Training data consists of queries and documents matching them, together with the relevance degree of each match. It may be prepared manually by human assessors, who check results for some queries and determine the relevance of each result. It is not feasible to check the relevance of all documents, and so typically a technique called pooling is used: only the top few documents retrieved by some existing ranking models are checked. Alternatively, training data may be derived automatically by analyzing clickthrough logs, query chains, or such search engine features as Google's SearchWiki. Training data is used by a learning algorithm to produce a ranking model which computes the relevance of documents for actual queries. Typically, users expect a search query to complete in a short time, which makes it impossible to evaluate a complex ranking model on each document in the corpus; a two-phase scheme is therefore used. First, a small number of potentially relevant documents is identified using simpler retrieval models which permit fast query evaluation. This phase is called top-k document retrieval, and many heuristics were proposed in the literature to accelerate it, such as using a document's static quality score. In the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these documents. Learning to rank is also used in recommender systems, for identifying a ranked list of related articles to recommend to a user after he or she has read a current news article. For the convenience of MLR algorithms, query-document pairs are represented by numerical vectors.
Such an approach is sometimes called "bag of features" and is analogous to the bag-of-words model. Components of such vectors are called features, factors or ranking signals. They may be divided into three groups. Query-independent or static features: those features which depend only on the document, but not on the query; for example, PageRank or a document's length. Such features can be precomputed in off-line mode during indexing. They may be used to compute a document's static quality score, which is often used to speed up search query evaluation. Query-dependent or dynamic features: those features which depend both on the contents of the document and on the query, such as TF-IDF score or other non-machine-learned ranking functions. Query-level features or query features: those which depend only on the query; for example, the number of words in a query. Selecting and designing good features is an important area in machine learning, which is called feature engineering. There are several measures which are used to judge how well an algorithm is doing on training data and to compare the performance of different MLR algorithms.
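A toy pairwise sketch of learning a ranking model from such feature vectors: a linear model scores each query-document vector, and training nudges the weights so that more relevant documents outscore less relevant ones. The features, judgments and documents are invented.

```python
def score(w, x):
    """Linear ranking model: score of one query-document feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, n_features, epochs=10, lr=0.1):
    """pairs: list of (better_doc_features, worse_doc_features)."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            if score(w, better) <= score(w, worse):   # mis-ordered pair
                w = [wi + lr * (b - c) for wi, b, c in zip(w, better, worse)]
    return w

# hypothetical feature vector: [query-match score, document length in 1000s]
pairs = [([0.9, 0.2], [0.1, 0.8]), ([0.7, 0.5], [0.2, 0.9])]
w = train_pairwise(pairs, 2)
docs = {"d1": [0.8, 0.3], "d2": [0.1, 0.7]}
ranking = sorted(docs, key=lambda d: score(w, docs[d]), reverse=True)
print(ranking)  # ['d1', 'd2']
```

Only the relative order of scores matters here, which is what distinguishes the pairwise framing from ordinary regression on relevance labels.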
8.
Supervised learning
–
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples; in supervised learning, each example is a pair consisting of an input object and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function. An optimal scenario will allow the algorithm to correctly determine the class labels for unseen instances; this requires the learning algorithm to generalize from the training data to unseen situations in a reasonable way. The parallel task in human and animal psychology is often referred to as concept learning. In order to solve a given problem of supervised learning, one has to perform the following steps. Determine the type of training examples: before doing anything else, the user should decide what kind of data is to be used as a training set; in the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting. Gather a training set: the training set needs to be representative of the real-world use of the function; thus, a set of input objects is gathered and corresponding outputs are also gathered. Determine the input feature representation of the learned function: the accuracy of the learned function depends strongly on how the input object is represented; typically, the input object is transformed into a feature vector, and the number of features should not be too large, because of the curse of dimensionality. Determine the structure of the learned function and corresponding learning algorithm: for example, the engineer may choose to use support vector machines or decision trees. Run the learning algorithm on the gathered training set: some supervised learning algorithms require the user to determine certain control parameters, and these parameters may be adjusted by optimizing performance on a subset of the training set, or via cross-validation.
Evaluate the accuracy of the learned function: after parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set. A wide range of supervised learning algorithms are available, each with its strengths and weaknesses; there is no single learning algorithm that works best on all supervised learning problems. There are four major issues to consider in supervised learning. A first issue is the tradeoff between bias and variance: imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for an input x if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for x.
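The workflow above, gather a training set, fit a function, then measure accuracy on a separate test set, can be sketched with a deliberately simple one-feature threshold classifier on made-up data.

```python
def fit_threshold(train):
    """Pick the threshold on a single feature that maximizes training accuracy."""
    candidates = sorted(x for x, _ in train)
    def acc(t, data):
        return sum((x > t) == y for x, y in data) / len(data)
    return max(candidates, key=lambda t: acc(t, train))

# made-up (feature, label) pairs, split into train and held-out test sets
train = [(0.1, False), (0.4, False), (0.6, True), (0.9, True)]
test = [(0.2, False), (0.8, True)]

t = fit_threshold(train)                                  # learn on train only
test_accuracy = sum((x > t) == y for x, y in test) / len(test)
print(t, test_accuracy)
```

Measuring accuracy on points the learner never saw is the whole point of the final step: training accuracy alone cannot reveal a failure to generalize.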
9.
Ensemble learning
–
Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner; the broader term multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner. Fast algorithms such as decision trees are commonly used in ensemble methods, although slower algorithms can benefit from ensemble techniques as well. By analogy, ensemble techniques have been used also in unsupervised learning scenarios. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis; this hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. Empirically, ensembles tend to yield better results when there is a significant diversity among the models, and many ensemble methods therefore seek to promote diversity among the models they combine. Although perhaps non-intuitive, more random algorithms can be used to produce a stronger ensemble than very deliberate algorithms; using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity. The number of component classifiers of an ensemble has a great impact on the accuracy of prediction. A priori determination of ensemble size, and the volume and velocity of big data streams, make this even more crucial for online ensemble classifiers; mostly, statistical tests were used for determining the proper number of components.
More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble, such that having more or fewer classifiers would deteriorate the accuracy. It is called the law of diminishing returns in ensemble construction, and this theoretical framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy. The Bayes optimal classifier is a classification technique; it is an ensemble of all the hypotheses in the hypothesis space, and on average no other ensemble can outperform it. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in H; the hypothesis represented by the Bayes optimal classifier, however, is the optimal hypothesis in ensemble space. Unfortunately, the Bayes optimal classifier cannot be practically implemented for any but the simplest of problems. There are several reasons why the Bayes optimal classifier cannot be implemented: most interesting hypothesis spaces are too large to iterate over.
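A minimal sketch of combining classifiers by majority vote, using deliberately weak, randomized decision stumps on made-up data. This illustrates voting and diversity among component models, not the Bayes optimal classifier.

```python
import random

def make_weak_classifier(rng):
    """A decision stump with a randomly jittered threshold (source of diversity)."""
    t = 0.5 + rng.uniform(-0.3, 0.3)
    return lambda x: x > t

def majority_vote(classifiers, x):
    """The ensemble's prediction is the most common individual prediction."""
    votes = sum(clf(x) for clf in classifiers)
    return votes > len(classifiers) / 2

rng = random.Random(42)
ensemble = [make_weak_classifier(rng) for _ in range(25)]
print(majority_vote(ensemble, 0.9), majority_vote(ensemble, 0.1))  # True False
```

No single stump is reliable near its own threshold, but because the thresholds disagree, the vote of the ensemble is more stable than any one member, which is the diversity effect described above.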
10.
Bootstrap aggregating
–
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Bagging was proposed by Leo Breiman in 1994 to improve classification by combining classifications of randomly generated training sets. Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n′ = n, then for large n the set D_i is expected to contain the fraction (1 - 1/e) ≈ 63.2% of the unique examples of D, the rest being duplicates. This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification). Bagging leads to improvements for unstable procedures, which include, for example, artificial neural networks and classification and regression trees; an interesting application of bagging showing improvement in preimage learning is provided here. On the other hand, it can mildly degrade the performance of stable methods such as k-nearest neighbors. To illustrate the basic principles of bagging, below is an analysis of the relationship between ozone and temperature. The relationship between temperature and ozone in this data set is apparently non-linear, based on the scatter plot. To mathematically describe this relationship, LOESS smoothers are used. Instead of building a single smoother from the complete data set, 100 bootstrap samples of the data were drawn. Each sample is different from the original data set, yet resembles it in distribution and variability. For each bootstrap sample, a LOESS smoother was fit; predictions from these 100 smoothers were then made across the range of the data. The first 10 predicted smooth fits appear as lines in the figure below.
The lines are clearly very wiggly and they overfit the data, a result of the span being too low. By taking the average of 100 smoothers, each fitted to a subset of the original data set, we arrive at one bagged predictor. Clearly, the mean is more stable and there is less overfit. See also: boosting, bootstrapping, cross-validation, random forest, random subspace method. References: Breiman, Leo. Alfaro, E., Gámez, M. and García, N., "adabag: An R package for classification with AdaBoost.M1, AdaBoost-SAMME and Bagging".
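The bootstrap-and-average recipe can be sketched on a tiny made-up data set. The base model here is an arbitrary "nearest-mean" regressor, chosen only to stand in for an unstable procedure like the LOESS smoothers above.

```python
import random

def bootstrap_sample(data, rng):
    """Sample of size n, drawn uniformly with replacement."""
    return [rng.choice(data) for _ in data]

def fit_nearest_mean(sample, k=3):
    """A crude regressor: predict the mean y among the k nearest x values."""
    def predict(x):
        nearest = sorted(sample, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def bagged_predictor(data, m=50, seed=0):
    """Fit one model per bootstrap sample; average the m predictions."""
    rng = random.Random(seed)
    models = [fit_nearest_mean(bootstrap_sample(data, rng)) for _ in range(m)]
    return lambda x: sum(model(x) for model in models) / m

data = [(0, 0.0), (1, 1.1), (2, 1.9), (3, 3.2), (4, 4.0)]
predict = bagged_predictor(data)
print(round(predict(2), 2))
```

Each bootstrap model sees a perturbed copy of the data, so its individual predictions wobble; the average over all m models is the bagged predictor, whose variance is lower than any single fit's.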
11.
Naive Bayes classifier
–
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes has been studied extensively since the 1950s and remains a popular baseline method for text categorization; with appropriate pre-processing, it is competitive in this domain with more advanced methods including support vector machines. It also finds application in automatic medical diagnosis. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time. In the statistics and computer science literature, naive Bayes models are known under a variety of names, including simple Bayes and independence Bayes. All these names reference the use of Bayes' theorem in the classifier's decision rule, but naive Bayes is not (necessarily) a Bayesian method. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter; a naive Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an apple. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches. An advantage of naive Bayes is that it only requires a small number of training data to estimate the parameters necessary for classification. The problem with this formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable: under the naive independence assumption, p(x_i | C_k, x_1, …, x_{i−1}) = p(x_i | C_k). Thus, the joint model can be expressed as p(C_k | x_1, …, x_n) ∝ p(C_k, x_1, …, x_n) ∝ p(C_k) p(x_1 | C_k) p(x_2 | C_k) p(x_3 | C_k) ⋯ ∝ p(C_k) ∏_{i=1}^{n} p(x_i | C_k).
The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label ŷ = C_k for some k as follows: ŷ = argmax_{k ∈ {1, …, K}} p(C_k) ∏_{i=1}^{n} p(x_i | C_k). A class's prior may be calculated by assuming equiprobable classes, or by calculating an estimate for the class probability from the training set. To estimate the parameters for a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set. The assumptions on distributions of features are called the event model of the naive Bayes classifier. For discrete features like the ones encountered in document classification, multinomial and Bernoulli distributions are popular.
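The multinomial event model with the MAP decision rule above can be sketched as follows, using add-one (Laplace) smoothing to avoid zero probabilities and a tiny invented corpus.

```python
from collections import Counter, defaultdict
import math

def train(docs):
    """docs: list of (list_of_words, label). Returns (priors, word_prob)."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    priors = {c: class_counts[c] / len(docs) for c in class_counts}
    def word_prob(w, c):
        """Laplace-smoothed estimate of p(w | c)."""
        total = sum(word_counts[c].values())
        return (word_counts[c][w] + 1) / (total + len(vocab))
    return priors, word_prob

def classify(words, priors, word_prob):
    """MAP rule: argmax over classes of log p(C_k) + sum_i log p(x_i | C_k)."""
    def log_posterior(c):
        return math.log(priors[c]) + sum(math.log(word_prob(w, c)) for w in words)
    return max(priors, key=log_posterior)

docs = [("win free prize".split(), "spam"),
        ("free cash win".split(), "spam"),
        ("project meeting notes".split(), "ham"),
        ("meeting agenda notes".split(), "ham")]
priors, word_prob = train(docs)
print(classify("free prize".split(), priors, word_prob))  # spam
```

Working in log space turns the product of per-feature probabilities into a sum, which avoids numerical underflow when documents are long.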
12.
Logistic regression
–
In statistics, logistic regression, or logit regression, or logit model is a regression model where the dependent variable is categorical. This article covers the case of a binary dependent variable, that is, one that can take only two values; cases where the dependent variable has more than two outcome categories may be analysed with multinomial logistic regression, or, if the multiple categories are ordered, with ordinal logistic regression. In the terminology of economics, logistic regression is an example of a qualitative response/discrete choice model. Logistic regression was developed by statistician David Cox in 1958. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor variables. It allows one to say that the presence of a risk factor increases the odds of a given outcome by a specific factor. Logistic regression is used in various fields, including machine learning and most medical fields. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess the severity of a patient have been developed using logistic regression. Logistic regression may be used to predict whether a patient has a given disease, based on observed characteristics of the patient. Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process, and it is also used in marketing applications such as prediction of a customer's propensity to purchase a product or halt a subscription. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
Suppose we wish to answer the following question: a group of 20 students spends between 0 and 6 hours studying for an exam; how does the number of hours spent studying affect the probability that a student will pass the exam? The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, represented by 1 and 0, are not cardinal numbers. If the problem were changed so that pass/fail were replaced with the grade 0–100 (cardinal numbers), then simple regression analysis could be used. The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0). The graph shows the probability of passing the exam versus the number of hours studying, with the logistic regression curve fitted to the data. The logistic regression analysis gives the following output: it indicates that hours studying is significantly associated with the probability of passing the exam, with a p-value of p = 0.0167 based on the Wald z-score. Rather than the Wald method, the recommended method to calculate the p-value for logistic regression is the likelihood-ratio test. Logistic regression can be binomial, ordinal or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types, 0 and 1.
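As a toy illustration of fitting such a model, below is a binary logistic regression fitted by stochastic gradient ascent on made-up pass/fail data. This is not the 20-student data set described above; the hours, outcomes, learning rate and epoch count are all invented for the sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(pass | x) = sigmoid(b0 + b1 * x) by per-sample gradient ascent."""
    b0, b1 = 0.0, 0.0                  # intercept and slope
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            b0 += lr * (y - p)         # gradient of the log-likelihood
            b1 += lr * (y - p) * x
    return b0, b1

hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(hours, passed)
prob_4h = sigmoid(b0 + b1 * 4.0)
print(round(prob_4h, 2))
```

The fitted slope b1 is directly interpretable on the odds scale: each additional hour of study multiplies the odds of passing by exp(b1), which is the "specific factor" phrasing used above.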
Logistic regression
–
Graph of a logistic regression curve showing probability of passing an exam versus hours studying
13.
Perceptron
–
In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. The perceptron algorithm dates back to the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced. The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, funded by the United States Office of Naval Research. This machine was designed for image recognition: it had an array of 400 photocells, weights were encoded in potentiometers, and weight updates during learning were performed by electric motors. Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns; in particular, Marvin Minsky and Seymour Papert showed in their 1969 book Perceptrons that a single-layer perceptron cannot learn an XOR function. It is often believed that they also conjectured that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. Three years later Stephen Grossberg published a series of papers introducing networks capable of modelling differential, contrast-enhancing and XOR functions. Nevertheless, the often-miscited Minsky/Papert text caused a significant decline in interest, and it took ten more years until neural network research experienced a resurgence in the 1980s. This text was reprinted in 1987 as Perceptrons - Expanded Edition, where some errors in the original text are shown and corrected. The kernel perceptron algorithm was introduced in 1964 by Aizerman et al. In the perceptron's decision function, the bias shifts the decision boundary away from the origin and does not depend on any input value; the value of f(x) is used to classify x as either a positive or a negative instance, in the case of a binary classification problem. 
If b is negative, then the weighted combination of inputs must produce a positive value greater than | b | in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position (though not the orientation) of the decision boundary. The perceptron learning algorithm does not terminate if the learning set is not linearly separable: if the vectors are not linearly separable, learning will never reach a point where all vectors are classified properly. The most famous example of the perceptron's inability to solve problems with linearly nonseparable vectors is the Boolean exclusive-or problem. The solution spaces of decision boundaries for all binary functions and learning behaviors are studied in the reference. In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron
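The update rule described above can be sketched in a few lines. Here it learns the Boolean AND function, which is linearly separable (XOR, by contrast, would never converge); the learning rate and epoch count are illustrative choices.

```python
# Minimal perceptron training loop: Heaviside step activation on w.x + b,
# with the classic error-driven weight update.
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]                      # Boolean AND: linearly separable

w, b = [0.0, 0.0], 0.0
for _ in range(20):                   # enough epochs to converge on AND
    for x, target in zip(X, y):
        err = target - predict(w, b, x)
        w = [wi + 0.1 * err * xi for wi, xi in zip(w, x)]
        b += 0.1 * err

print([predict(w, b, x) for x in X])  # -> [0, 0, 0, 1]
```

Replacing the targets with XOR's truth table makes the inner loop cycle forever, which is exactly the nonseparability failure discussed above.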
Perceptron
–
The Mark I Perceptron machine was the first implementation of the perceptron algorithm. The machine was connected to a camera that used 20×20 cadmium sulfide photocells to produce a 400-pixel image. The main visible feature is a patchboard that allowed experimentation with different combinations of input features. To the right of that are arrays of potentiometers that implemented the adaptive weights.
14.
Support vector machine
–
In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector, and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane; this is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes, so we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters α i of images of feature vectors x i that occur in the data base. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation Σ i α i k(x i, x) = constant. 
Note that if the kernel k(x, y) becomes small as y grows further away from x, each term in the sum measures the closeness of the test point to a data base point; in this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Classification of images can also be performed using SVMs; experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback. This is also true of image segmentation systems, including those using a modified version of SVM that applies the privileged approach as suggested by Vapnik. Hand-written characters can be recognized using SVM, and the SVM algorithm has been widely applied in the biological and other sciences. SVMs have been used to classify proteins with up to 90% of the compounds classified correctly. Permutation tests based on SVM weights have been suggested as a mechanism for interpretation of SVM models, and support vector machine weights have also been used to interpret SVM models in the past. The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes
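The maximum-margin idea can be sketched with a linear soft-margin SVM trained in the primal by sub-gradient steps on the regularized hinge loss. This is not the classical dual/kernel formulation described above, and the data, step size and regularization strength are hypothetical; it is only meant to show the margin-driven update.

```python
# Linear soft-margin SVM sketch: sub-gradient descent on
# lam/2 * ||w||^2 + mean(hinge loss), no bias term (data centered at origin).
X = [(1.0, 1.0), (1.5, 0.5), (2.0, 1.5), (-1.0, -1.0), (-1.5, -0.5), (-2.0, -1.5)]
y = [1, 1, 1, -1, -1, -1]             # labels in {-1, +1}

w = [0.0, 0.0]
eta, lam = 0.1, 0.01                  # step size and regularization strength
for _ in range(200):                  # cyclic passes over the data
    for x, label in zip(X, y):
        margin = label * sum(wj * xj for wj, xj in zip(w, x))
        if margin < 1:                # point violates the margin: pull w toward it
            w = [wj + eta * (label * xj - lam * wj) for wj, xj in zip(w, x)]
        else:                         # only the regularizer shrinks w
            w = [wj * (1 - eta * lam) for wj in w]

preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x in X]
print(preds == y)
```

Points with margin below 1 act like support vectors here: only they pull on w, which mirrors the fact that the separating hyperplane is determined by the nearest points on each side.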
Support vector machine
–
An example for a result of soft-margin SVM
Support vector machine
–
Machine learning and data mining
15.
Hierarchical clustering
–
In data mining and statistics, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types. Agglomerative: a bottom-up approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive: a top-down approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In general, the merges and splits are determined in a greedy manner, and the results of hierarchical clustering are usually presented in a dendrogram. In the general case, the complexity of agglomerative clustering is O(n³), and divisive clustering with an exhaustive search is O(2ⁿ), which is even worse. However, for some special cases, optimal efficient agglomerative methods of complexity O(n²) are known: SLINK for single-linkage and CLINK for complete-linkage clustering. In order to decide which clusters should be combined, or where a cluster should be split, a measure of dissimilarity between sets of observations is required. The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Commonly used metrics for hierarchical clustering include the Euclidean, squared Euclidean and Manhattan distances; for text or other non-numeric data, metrics such as the Hamming or Levenshtein distance are often used. The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Commonly used linkage criteria between two sets of observations A and B, where d is the chosen metric, include the maximum (complete-linkage), minimum (single-linkage) and mean (average-linkage) pairwise distances. Other linkage criteria include: the sum of all intra-cluster variance; the decrease in variance for the cluster being merged (Ward's criterion); the probability that candidate clusters spawn from the same distribution function; the product of in-degree and out-degree on a k-nearest-neighbour graph; and the increment of some cluster descriptor after merging two clusters. Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. For example, suppose the data is to be clustered and the Euclidean distance is the distance metric. Cutting the tree at a given height will give a partitioning clustering at a selected precision. 
In this example, cutting the dendrogram after its second row yields a fine partitioning, while cutting after a later row yields a coarser clustering with fewer but larger clusters. The agglomerative method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a}, {b}, {c}, {d}, {e} and {f}. The first step is to determine which elements to merge into a cluster. Usually, we want to take the two closest elements, according to the chosen distance. Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements
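The merge loop described above can be sketched directly: repeatedly find the two closest clusters and merge them until the desired number of clusters remains. This toy version uses hypothetical 1-D points and single linkage (distance between the closest members of two clusters).

```python
# Agglomerative clustering sketch with single linkage on 1-D points.
def single_linkage(points, k):
    clusters = [[p] for p in points]          # every point starts alone
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # greedy merge of the closest pair
    return [sorted(c) for c in clusters]

print(single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], 3))
# -> [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Stopping the loop at different values of k corresponds to cutting the dendrogram at different heights.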
Hierarchical clustering
–
Machine learning and data mining
16.
K-means clustering
–
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean; this results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult; however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. The algorithm has a loose relationship to the k-nearest neighbor classifier: one can apply the 1-nearest neighbor classifier on the cluster centers obtained by k-means to classify new data into the existing clusters. This is known as the nearest centroid classifier or Rocchio algorithm. The term "k-means" was first used by James MacQueen in 1967, though the idea goes back to Hugo Steinhaus in 1957. The standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation. In 1965, E. W. Forgy published essentially the same method, which is why it is sometimes referred to as Lloyd-Forgy. The most common algorithm uses an iterative refinement technique; due to its ubiquity it is often called simply the k-means algorithm, and it is also referred to as Lloyd's algorithm. Given an initial set of k means m_1(t), …, m_k(t), the algorithm alternates between two steps. Assignment step: assign each observation to the cluster whose mean yields the least within-cluster sum of squares (WCSS); since the sum of squares is the squared Euclidean distance, this is intuitively the nearest mean: S_i(t) = { x_p : ‖x_p − m_i(t)‖² ≤ ‖x_p − m_j(t)‖² for all 1 ≤ j ≤ k }, where each x_p is assigned to exactly one S_i(t), even if it could be assigned to two or more of them. Update step: calculate the new means to be the centroids of the observations in the new clusters: m_i(t+1) = (1 / |S_i(t)|) Σ_{x_j ∈ S_i(t)} x_j. Since the arithmetic mean is a least-squares estimator, this step also minimizes the WCSS objective. The algorithm has converged when the assignments no longer change. Since both steps optimize the WCSS objective, and there exists only a finite number of such partitionings, the algorithm must converge to a (local) optimum. There is no guarantee that the global optimum is found using this algorithm. 
The algorithm is often presented as assigning objects to the nearest cluster by distance. The standard algorithm aims at minimizing the WCSS objective, and thus assigns by "least sum of squares", which is exactly equivalent to assigning by the smallest Euclidean distance. Using a different distance function other than the (squared) Euclidean distance may stop the algorithm from converging; various modifications of k-means such as spherical k-means and k-medoids have been proposed to allow using other distance measures. Commonly used initialization methods are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means
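The two alternating steps can be sketched directly. This toy version runs Lloyd's algorithm on hypothetical 1-D data with fixed initial means (rather than Forgy or Random Partition initialization), stopping when the means no longer change.

```python
# Lloyd's algorithm sketch: alternate assignment and update steps
# until convergence, on 1-D points with k = 2.
def kmeans(points, means, iters=100):
    clusters = [[] for _ in means]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of the nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)), key=lambda i: (p - means[i]) ** 2)
            clusters[i].append(p)
        # Update step: each mean becomes the centroid of its cluster.
        new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
        if new_means == means:        # assignments stabilized: converged
            break
        means = new_means
    return means, clusters

means, clusters = kmeans([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(means)   # -> [2.0, 11.0]
```

Note the empty-cluster guard in the update step: the plain algorithm is undefined when a cluster loses all its points, so this sketch simply keeps the old mean.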
K-means clustering
17.
Expectation-maximization algorithm
–
These parameter estimates are then used to determine the distribution of the latent variables in the next E step. The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird and Donald Rubin. They pointed out that the method had been proposed many times in special circumstances by earlier authors. The Dempster-Laird-Rubin paper generalized the method and sketched a convergence analysis for a wider class of problems, and it established the EM method as an important tool of statistical analysis. However, the convergence analysis of the Dempster-Laird-Rubin paper was flawed, and a correct convergence analysis was published by C. F. Jeff Wu; Wu's proof established the EM method's convergence outside of the exponential family. The EM algorithm is used to find maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations; that is, either missing values exist among the data, or the model can be formulated more simply by assuming the existence of further unobserved data points. In statistical models with latent variables, solving the likelihood equations directly is usually impossible. The EM algorithm proceeds from the observation that the following is a way to solve these two sets of equations numerically: pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and keep alternating until the resulting values both converge to fixed points. In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have singularities in them, i.e. nonsensical maxima. Associated with each data point may be a vector of observations. The missing values Z are discrete, drawn from a fixed number of values. The parameters are continuous and are of two kinds: parameters that are associated with all data points, and those associated with a specific value of a latent variable. 
However, it is possible to apply EM to other sorts of models. The argument above suggests an iterative algorithm for the case where both θ and Z are unknown: first, initialize the parameters θ to some random values; compute the best value for Z given these parameter values; then use the just-computed values of Z to compute a better estimate for the parameters θ (parameters associated with a particular value of Z will use only those data points whose associated latent variable has that value); and iterate the last two steps until convergence. The algorithm as just described monotonically approaches a local minimum of the cost function, and is commonly called hard EM. The k-means algorithm is an example of this class of algorithms. Instead of making a hard choice for Z given the current parameters, one can compute the probability of each possible value of Z and use these posterior probabilities as weights. The resulting algorithm is called soft EM, and is the type of algorithm normally associated with EM. The counts used to compute these weighted averages are called soft counts; the probabilities computed for Z are posterior probabilities and are what is computed in the E step
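The soft-EM loop can be sketched for a simple concrete model: a 1-D mixture of two Gaussians with unit variance (a simplifying assumption; a full implementation would also re-estimate the variances). The E step computes the posterior responsibilities, and the M step re-estimates the means and mixing weights from those soft counts. The data here are hypothetical.

```python
import math

# Soft EM for a 1-D two-component Gaussian mixture with fixed unit variance.
def em_gmm(data, mu, iters=50):
    pi = [0.5, 0.5]                   # mixing weights
    for _ in range(iters):
        # E step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M step: weighted means and mixing weights from the soft counts
        n = [sum(r[k] for r in resp) for k in range(2)]
        mu = [sum(r[k] * x for r, x in zip(resp, data)) / n[k] for k in range(2)]
        pi = [n[k] / len(data) for k in range(2)]
    return mu, pi

data = [0.0, 0.2, -0.1, 5.0, 5.2, 4.9]
mu, pi = em_gmm(data, [0.0, 1.0])
print(mu)   # means converge near the two cluster centers, 0 and 5
```

Replacing the responsibilities with a hard 0/1 assignment to the nearest mean turns this into the hard-EM variant, which is essentially k-means.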
Expectation-maximization algorithm
–
Machine learning and data mining
18.
DBSCAN
–
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. DBSCAN is one of the most common clustering algorithms and also one of the most cited in scientific literature. In 2014, the algorithm was awarded the test of time award at the leading data mining conference, KDD. Consider a set of points in some space to be clustered. A point p is a core point if at least minPts points are within distance ε of it; those points are said to be directly reachable from p. By definition, no points are directly reachable from a non-core point. A point q is reachable from p if there is a path p1, …, pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. All points not reachable from any other point are outliers. Now if p is a core point, then it forms a cluster together with all points that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its edge. Reachability is not a symmetric relation since, by definition, no point may be reachable from a non-core point; therefore a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN. Two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. A cluster then satisfies two properties: all points within the cluster are mutually density-connected, and if a point is density-reachable from any point of the cluster, it is part of the cluster as well. DBSCAN requires two parameters, ε and the minimum number of points required to form a dense region (minPts). It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-environment of a different point and hence be made part of a cluster. If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. 
Hence, all points that are found within the ε-neighborhood are added, and this process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise. These simplifications have been omitted from the above pseudocode in order to reflect the originally published version. Additionally, the regionQuery function need not return P in the list of points to be visited, as long as it is otherwise still counted in the local density estimate
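The expansion procedure can be sketched compactly. This toy version works on hypothetical 1-D points; the ε-neighborhood includes the point itself, noise is labeled -1, and a point first marked as noise can later be claimed as a border point of a cluster, exactly as described above.

```python
# Minimal DBSCAN sketch: core-point test against min_pts, then cluster
# expansion from core points; label -1 marks noise/outliers.
def dbscan(points, eps, min_pts):
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if abs(points[i] - points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # noise (may become a border point later)
            continue
        cluster += 1                    # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = [j for j in nbrs if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster     # border point: claimed, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:      # j is itself a core point: keep expanding
                seeds.extend(jn)
    return labels

pts = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0]
print(dbscan(pts, eps=0.3, min_pts=2))  # -> [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at 9.0 has no dense neighborhood and stays labeled -1, illustrating the outlier handling that gives DBSCAN its name.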
DBSCAN
–
Machine learning and data mining
19.
Factor analysis
–
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observed variables mainly reflect the variations in two unobserved variables; factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modelled as linear combinations of the potential factors, plus error terms; factor analysis aims to find independent latent variables. Followers of factor analytic methods believe that the information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset. Users of factor analysis believe that it helps to deal with data sets where there are large numbers of observed variables that are thought to reflect a smaller number of underlying/latent variables. Factor analysis is related to principal component analysis (PCA), but the two are not identical, and there has been significant controversy in the field over the differences between the two techniques. PCA can be seen as a more basic version of exploratory factor analysis that was developed in the early days prior to the advent of high-speed computers. From the point of view of exploratory analysis, the eigenvalues of PCA are inflated component loadings, i.e. contaminated with error variance. Suppose we have a set of p observable random variables x_1, …, x_p with means μ_1, …, μ_p, modelled as x_i − μ_i = l_i1 F_1 + ⋯ + l_ik F_k + ε_i, where the F_j are the k unobserved common factors and the l_ij are the loadings. Here, the ε_i are unobserved stochastic error terms with zero mean and finite variance. In matrix terms, we have x − μ = L F + ε. If we have n observations, then we will have the dimensions x: p × n, L: p × k, and F: k × n. Each column of x and F denotes values for one particular observation, and matrix L does not vary across observations. Also we will impose the following assumptions on F: F and ε are independent, E(F) = 0, and Cov(F) = I. Any solution of the above set of equations following these constraints for F is defined as the factors, and L as the loading matrix. 
Then note that from the conditions just imposed on F, we have Cov(x − μ) = Cov(L F + ε), or Σ = L Cov(F) Lᵀ + Cov(ε), or Σ = L Lᵀ + Ψ. Note that for any orthogonal matrix Q, if we set L′ = L Q and F′ = Qᵀ F, the criteria for being factors and factor loadings still hold; hence a set of factors and factor loadings is unique only up to an orthogonal transformation. Suppose a psychologist has the hypothesis that there are two kinds of intelligence, verbal intelligence and mathematical intelligence, neither of which is directly observed. Evidence for the hypothesis is sought in the examination scores, from each of 10 different academic fields, of 1000 students. If each student is chosen randomly from a large population, then each student's 10 scores are random variables. The hypothesis may say that each average score is a linear combination of the two factors. For example, it may hold that the average student's aptitude in the field of astronomy is {10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}; the numbers 10 and 6 are the factor loadings associated with astronomy
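The covariance identity Σ = L Lᵀ + Ψ can be checked numerically: simulate x = L F + ε with independent standard-normal factors (so Cov(F) = I) and independent noise, then compare the sample covariance to L Lᵀ + Ψ. The dimensions and random loadings below are arbitrary illustration choices.

```python
import numpy as np

# Numerical check of the factor-model identity Sigma = L L^T + Psi.
rng = np.random.default_rng(0)
p, k, n = 4, 2, 200_000
L = rng.normal(size=(p, k))                    # loading matrix (p x k)
psi = np.diag(rng.uniform(0.1, 0.5, size=p))   # diagonal noise covariance

F = rng.normal(size=(k, n))                    # factors with Cov(F) = I
eps = np.sqrt(np.diag(psi))[:, None] * rng.normal(size=(p, n))
X = L @ F + eps                                # p x n observations (mu = 0)

sample_cov = np.cov(X)
model_cov = L @ L.T + psi
print(np.max(np.abs(sample_cov - model_cov)))  # small sampling error
```

Repeating the experiment with L replaced by L @ Q for any orthogonal Q leaves the model covariance unchanged, which is exactly the rotational non-uniqueness noted above.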
Factor analysis
20.
Independent component analysis
–
In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and statistically independent of each other. ICA is a special case of blind source separation. A common example application is the "cocktail party problem" of listening in on one person's speech in a noisy room. Independent component analysis attempts to decompose a multivariate signal into independent non-Gaussian signals. As an example, sound is usually a signal that is composed of the numerical addition, at each time t, of signals from several sources. The question then is whether it is possible to separate these contributing sources from the observed total signal. When the statistical independence assumption is correct, blind ICA separation of a mixed signal gives very good results; ICA is also used, for analysis purposes, on signals that are not supposed to be generated by mixing. A simple application of ICA is the cocktail party problem, where the underlying speech signals are separated from sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. An important note to consider is that if N sources are present, at least N observations (e.g. microphones) are needed to recover the original signals; other cases of underdetermined and overdetermined mixing have also been investigated. That the ICA separation of mixed signals gives very good results is based on two assumptions and three effects of mixing source signals. Two assumptions: the source signals are independent of each other, and the values in each source signal have non-Gaussian distributions. Three effects of mixing source signals: Independence: as per assumption 1, the source signals are independent; however, their signal mixtures are not, because the mixtures share the same source signals. Normality: according to the central limit theorem, the distribution of a sum of independent random variables with finite variance tends towards a Gaussian distribution. 
Loosely speaking, a sum of two independent random variables usually has a distribution that is closer to Gaussian than either of the two original variables; here we consider the value of each signal as a random variable. Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal. These principles contribute to the basic establishment of ICA. ICA finds the independent components by maximizing the statistical independence of the estimated components. We may choose one of many ways to define a proxy for independence, and this choice governs the form of the ICA algorithm. The non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis and negentropy. Whitening and dimension reduction can be achieved with principal component analysis or singular value decomposition; whitening ensures that all dimensions are treated equally a priori before the algorithm is run
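The whiten-then-maximize-non-Gaussianity recipe can be sketched with a one-unit FastICA-style fixed-point iteration (using the tanh contrast function, a common choice). The two sources, the 2×2 mixing matrix and all tuning constants below are hypothetical illustration values.

```python
import numpy as np

# One-unit FastICA sketch: whiten the mixtures, then iterate the
# fixed-point update to find a maximally non-Gaussian projection.
rng = np.random.default_rng(1)
n = 20_000
s1 = rng.uniform(-1, 1, n)                    # sub-Gaussian source
s2 = np.sign(rng.normal(size=n)) * rng.exponential(1.0, n)  # super-Gaussian
S = np.vstack([s1, s2])
A = np.array([[1.0, 0.5], [0.5, 1.0]])        # hypothetical mixing matrix
X = A @ S                                     # observed mixtures

# Whitening via eigendecomposition of the covariance: Cov(Z) = I
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
Z = E @ np.diag(d ** -0.5) @ E.T @ X

w = rng.normal(size=2)
w /= np.linalg.norm(w)
for _ in range(200):                          # fixed-point iteration (tanh contrast)
    g = np.tanh(Z.T @ w)
    w_new = Z @ g / n - (1 - g ** 2).mean() * w
    w = w_new / np.linalg.norm(w_new)

y = w @ Z                                     # one recovered component
corr = max(abs(np.corrcoef(y, s1)[0, 1]), abs(np.corrcoef(y, s2)[0, 1]))
print(corr)                                   # strongly correlated with one source
```

Because ICA recovers components only up to sign and scale, the check uses the absolute correlation with the true sources rather than an exact comparison.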
Independent component analysis
–
Independent component analysis in EEGLAB
21.
Non-negative matrix factorization
–
This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as the processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. NMF finds applications in such fields as computer vision, document clustering, chemometrics, audio signal processing and recommender systems. In chemometrics, non-negative matrix factorization has a long history under the name "self modeling curve resolution"; in this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Early work on non-negative matrix factorizations was also performed by a Finnish group of researchers in the middle of the 1990s under the name positive matrix factorization. In NMF, a matrix V is factorized into two matrices W and H, V ≈ W H; that is, each column of V can be computed as v_i = W h_i, where h_i is the corresponding column of H. When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix, and it is this property that forms the basis of NMF: NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix, then p can be significantly less than both m and n. Here's an example based on a text-mining application. Let the input matrix be V with 10000 rows and 500 columns, where words are in rows and documents are in columns; that is, we have 500 documents indexed by 10000 words, and it follows that a column vector v in V represents a document. Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns. The product of W and H is then a matrix with 10000 rows and 500 columns, the same shape as the input matrix V. This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. 
A column in the coefficients matrix H represents an original document, with a cell value defining the document's rank for a feature. This follows because each row in H represents a feature, and it is this property that drives most applications of NMF. More specifically, the approximation of V by V ≈ W H is achieved by minimizing the error function min_{W,H} ‖V − W H‖_F subject to W ≥ 0, H ≥ 0. If we add an additional orthogonality constraint on H, i.e. H Hᵀ = I, then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed H gives the cluster indicator, i.e. if H_kj > 0, the data point v_j belongs to the k-th cluster; and the computed W gives the cluster centroids, i.e. the k-th column of W gives the centroid of the k-th cluster. This centroid representation can be significantly enhanced by convex NMF
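The Frobenius-error minimization above is commonly attacked with the Lee-Seung multiplicative update rules, one widely used NMF algorithm (the source does not name a specific solver). The small rank-2 matrix and the iteration count below are illustrative choices; the updates preserve non-negativity because every factor in them is non-negative.

```python
import numpy as np

# Lee-Seung multiplicative updates for min ||V - W H||_F, W, H >= 0.
rng = np.random.default_rng(0)
V = np.array([[1.0, 0.0, 2.0, 0.0],
              [2.0, 0.0, 4.0, 0.0],
              [0.0, 3.0, 0.0, 1.0],
              [0.0, 6.0, 0.0, 2.0]])   # non-negative, rank 2

p = 2
W = rng.uniform(0.1, 1.0, (V.shape[0], p))
H = rng.uniform(0.1, 1.0, (p, V.shape[1]))
for _ in range(2000):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)   # update H; stays non-negative
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)   # update W; stays non-negative

err = np.linalg.norm(V - W @ H)
print(err)                              # near zero: V is recovered as W H
```

Because V here has an exact non-negative rank-2 factorization, the reconstruction error shrinks toward zero; on real data with p much smaller than the effective rank, a residual error remains and W, H act as the learned features and coefficients.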
Non-negative matrix factorization
–
NMF as a probabilistic graphical model: visible units (V) are connected to hidden units (H) through weights W, so that V is generated from a probability distribution with mean.
22.
Hidden Markov model
–
A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. An HMM can be presented as the simplest dynamic Bayesian network. The mathematics behind the HMM were developed by L. E. Baum and coworkers; it is closely related to earlier work on the optimal nonlinear filtering problem by Ruslan L. Stratonovich. In simpler Markov models, the state is directly visible to the observer. In a hidden Markov model, the state is not directly visible; each state has a probability distribution over the possible output tokens. Therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states. In its discrete form, a hidden Markov process can be visualized as a generalization of the Urn problem with replacement. Consider this example: in a room that is not visible to an observer there is a genie. The room contains urns X1, X2, X3, … each of which contains a known mix of balls, each ball labeled y1, y2, y3, …. The genie chooses an urn in that room and randomly draws a ball from that urn. It then puts the ball onto a conveyor belt, where the observer can observe the sequence of the balls but not the sequence of urns from which they were drawn. The genie has some procedure to choose urns: the choice of the urn for the n-th ball depends only upon a random number and the choice of the urn for the previous ball. Because the choice of urn does not directly depend on the urns chosen before this single previous urn, this is called a Markov process; it can be described by the upper part of Figure 1. The Markov process itself cannot be observed, only the sequence of labeled balls; this is illustrated by the lower part of the diagram shown in Figure 1, where one can see that balls y1, y2, y3, y4 can be drawn at each state. However, the observer can work out other information, such as the likelihood that a given ball came from each of the urns. The diagram below shows the general architecture of an instantiated HMM. 
Each oval shape represents a random variable that can adopt any of a number of values. The random variable x(t) is the hidden state at time t, and the random variable y(t) is the observation at time t; the arrows in the diagram denote conditional dependencies. The conditional probability distribution of the hidden variable x(t), given the values of the hidden variable at all earlier times, depends only on the value of x(t − 1); this is called the Markov property. Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) at the same time. In the standard type of hidden Markov model considered here, the state space of the hidden variables is discrete
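The urn story above can be made concrete with the forward algorithm, which computes the likelihood of an observed ball sequence by summing over all hidden urn sequences. The two-urn transition, emission and start probabilities here are hypothetical numbers chosen for illustration.

```python
# Forward algorithm sketch for a 2-state, 2-symbol HMM.
T = [[0.7, 0.3],        # P(next urn | current urn)
     [0.4, 0.6]]
E = [[0.9, 0.1],        # P(ball label | urn): urn 0 mostly emits label 0
     [0.2, 0.8]]
start = [0.5, 0.5]      # initial urn distribution

def likelihood(obs):
    # alpha[i] = P(observations so far, current hidden state = i)
    alpha = [start[i] * E[i][obs[0]] for i in range(2)]
    for o in obs[1:]:
        alpha = [sum(alpha[j] * T[j][i] for j in range(2)) * E[i][o]
                 for i in range(2)]
    return sum(alpha)

print(likelihood([0, 0, 1]))   # probability of seeing labels 0, 0, 1
```

Summing the likelihood over all eight possible length-3 label sequences gives exactly 1, a quick sanity check that the recursion really marginalizes over the hidden urn sequence.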
Hidden Markov model
–
Machine learning and data mining
23.
K-nearest neighbors classification
–
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space; the output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors. If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object; this value is the average of the values of its k nearest neighbors. K-NN is a type of instance-based learning, or lazy learning, and the k-NN algorithm is among the simplest of all machine learning algorithms. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more than distant ones; for example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class or the property value is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. A shortcoming of the algorithm is that it is sensitive to the local structure of the data. The algorithm is not to be confused with k-means, another popular machine learning technique. Suppose we have pairs (X_1, Y_1), (X_2, Y_2), …, (X_n, Y_n) taking values in R^d × {1, 2}, where Y is the class label of X, so that X | Y = r ∼ P_r for r = 1, 2. Given some norm ‖·‖ on R^d and a point x ∈ R^d, let (X_(1), Y_(1)), …, (X_(n), Y_(n)) be a reordering of the training data such that ‖X_(1) − x‖ ≤ ⋯ ≤ ‖X_(n) − x‖. The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors. A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has also been employed with correlation coefficients such as Pearson and Spearman. A drawback of the majority voting classification occurs when the class distribution is skewed. 
That is, examples of a more frequent class tend to dominate the prediction of the new example. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors: the class of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in the data representation. For example, in a self-organizing map, each node is a representative of a cluster of similar points, regardless of their density in the original training data
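The basic majority-vote classifier can be sketched in a few lines. The tiny 2-D training set and the choice k = 3 here are hypothetical; note that, as described above, "training" is nothing more than storing the labeled points.

```python
from collections import Counter

# k-NN classification sketch: majority vote among the k closest
# training points under (squared) Euclidean distance.
train = [((1.0, 1.0), 'a'), ((1.2, 0.8), 'a'), ((0.9, 1.1), 'a'),
         ((5.0, 5.0), 'b'), ((5.2, 4.9), 'b'), ((4.8, 5.1), 'b')]

def knn_predict(x, k=3):
    nearest = sorted(train,
                     key=lambda t: (t[0][0] - x[0]) ** 2 + (t[0][1] - x[1]) ** 2)
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 1.0)), knn_predict((5.1, 5.0)))  # -> a b
```

Distance weighting, as discussed above, would replace the raw vote count with a sum of 1/d weights per class, reducing the dominance of frequent classes.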
K-nearest neighbors classification
–
Machine learning and data mining
24.
Local outlier factor
–
LOF shares some concepts with DBSCAN and OPTICS, such as the concepts of "core distance" and "reachability distance", which are used for local density estimation. The local outlier factor is based on the concept of a local density, where locality is given by the k nearest neighbors; by comparing the local density of an object to the local densities of its neighbors, one can identify points that have a substantially lower density than their neighbors, and these are considered to be outliers. The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters. Let k-distance(A) be the distance of the object A to its k-th nearest neighbor; note that the set of the k nearest neighbors includes all objects at this distance, which can in the case of a tie be more than k objects. We denote the set of k nearest neighbors of A as N_k(A). The reachability distance of an object A from an object B is defined as reachability-distance_k(A, B) = max{ k-distance(B), d(A, B) }; in words, objects that belong to the k nearest neighbors of B are considered to be equally distant from B. The reason for this distance is to get more stable results; note that this is not a distance in the mathematical definition, since it is not symmetric. The local reachability density of an object A is defined by lrd(A) = 1 / ( Σ_{B ∈ N_k(A)} reachability-distance_k(A, B) / |N_k(A)| ), the inverse of the average reachability distance of A from its neighbors. Note that it is not the average reachability of the neighbors from A, but the distance at which A can be reached from its neighbors. With duplicate points, this value can become infinite. The local reachability densities are then compared with those of the neighbors: LOF_k(A) = ( Σ_{B ∈ N_k(A)} lrd(B) / lrd(A) ) / |N_k(A)|, the average local reachability density of the neighbors divided by the object's own local reachability density. A value of approximately 1 indicates that the object is comparable to its neighbors; a value below 1 indicates a denser region, while values significantly larger than 1 indicate outliers. Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance to a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors. While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, it has experimentally been shown to work very well in numerous setups, often outperforming the competitors, for example in network intrusion detection and on processed classification benchmark data. 
The LOF family of methods can be generalized and then applied to various other problems, such as detecting outliers in geographic data. The resulting values are quotients and hard to interpret: a value of 1 or even less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set a value of 1.1 may already indicate an outlier, while in another data set much larger values may still describe inliers; these differences can also occur within a single data set due to the locality of the method. Several extensions address this: among them, Local Outlier Probability (LoOP) is a method derived from LOF that uses inexpensive local statistics to become less sensitive to the choice of the parameter k; in addition, its resulting values are scaled to the value range [0, 1]. Ensemble learning approaches to outlier detection have also been built on these ideas.
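The quantities defined above (k-distance, reachability distance, local reachability density, and the LOF score itself) can be made concrete with a short NumPy sketch. This is a brute-force illustration only; the function name lof_scores and the O(n²) distance matrix are choices made for clarity, not a reference implementation:

```python
import numpy as np

def lof_scores(X, k):
    """Local Outlier Factor for each row of X, using brute-force distances."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # k nearest neighbours of each point (column 0 of argsort is the point itself)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]
    # k-distance: distance to the k-th nearest neighbour
    k_dist = D[np.arange(n), knn[:, -1]]
    # local reachability density: inverse of the mean reachability distance,
    # where reach(A <- B) = max(k_dist(B), d(A, B))
    lrd = np.empty(n)
    for a in range(n):
        reach = np.maximum(k_dist[knn[a]], D[a, knn[a]])
        lrd[a] = 1.0 / reach.mean()
    # LOF: average ratio of the neighbours' densities to the point's own density
    return np.array([lrd[knn[a]].mean() / lrd[a] for a in range(n)])
```

On a tight cluster plus one distant point, the cluster members score near 1 while the isolated point scores far above 1, matching the interpretation described above.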
Local outlier factor
–
Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. A has a much lower density than its neighbors.
25.
Restricted Boltzmann machine
–
A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. RBMs have found applications in dimensionality reduction, classification, collaborative filtering and feature learning, and they can be trained in either supervised or unsupervised ways, depending on the task. As the name implies, an RBM is restricted to a bipartite connectivity pattern: visible units connect only to hidden units, with no connections within a layer. By contrast, unrestricted Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm. Restricted Boltzmann machines can also be used in deep learning networks; in particular, deep belief networks can be formed by stacking RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation. The individual activation probabilities are given by P(h_j = 1 | v) = σ(b_j + Σ_i w_ij v_i) and P(v_i = 1 | h) = σ(a_i + Σ_j w_ij h_j), where σ denotes the logistic sigmoid. The visible units of an RBM can also be multinomial, while the hidden units remain Bernoulli; in this case, the activation function for the visible units is replaced by the softmax function P(v_i^k = 1 | h) = exp(a_i^k + Σ_j W_ij^k h_j) / Σ_{k′=1}^K exp(a_i^{k′} + Σ_j W_ij^{k′} h_j), where K is the number of discrete values that the visible units can take. Such models are applied in topic modeling and recommender systems. Restricted Boltzmann machines are a special case of Boltzmann machines and of Markov random fields, and their graphical model corresponds to that of factor analysis. The training algorithm (contrastive divergence) performs Gibbs sampling inside a gradient descent procedure to compute the weight updates: take a training sample v and compute the hidden unit probabilities h; compute the outer product of v and h and call this the positive gradient; from h, sample a reconstruction v′ of the visible units, then resample the hidden activations h′ from this reconstruction; compute the outer product of v′ and h′ and call this the negative gradient; finally, let the update to the weight matrix W be the positive gradient minus the negative gradient, times some learning rate: ΔW = ε(v hᵀ − v′ h′ᵀ).
Update the biases a and b analogously: Δa = ε(v − v′), Δb = ε(h − h′). A Practical Guide to Training RBMs, written by Hinton, can be found on his homepage.
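The contrastive divergence step described above can be sketched as a minimal CD-1 update in NumPy. This is an illustrative sketch, not a reference implementation; the function name cd1_update and the row-per-sample batch layout are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, rng, lr=0.1):
    """One contrastive-divergence (CD-1) update on a batch of visible vectors.

    v0: (m, nv) batch; W: (nv, nh) weights; a: (nv,) visible bias; b: (nh,) hidden bias.
    """
    # positive phase: hidden probabilities and samples given the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: reconstruct the visibles, then resample the hiddens
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # positive gradient minus negative gradient, averaged over the batch
    m = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / m
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```

Repeated over many batches, this pushes the model distribution toward the data distribution; in practice the hidden probabilities (rather than binary samples) are often used in the gradient terms for lower variance.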
Restricted Boltzmann machine
–
Diagram of a restricted Boltzmann machine with three visible units and four hidden units (no bias units).
26.
Convolutional neural network
–
Individual cortical neurons respond to stimuli in a restricted region of space known as the receptive field. The receptive fields of different neurons partially overlap such that they tile the visual field, and the response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a convolution operation. Convolutional networks were inspired by these biological processes and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing; they have wide applications in image and video recognition, recommender systems and natural language processing. Convolutional neural networks model animal visual perception and can be applied to visual recognition tasks. They consist of multiple layers of receptive fields: small neuron collections which process portions of the input image. The outputs of these collections are then tiled so that their input regions overlap, to obtain a higher-resolution representation of the original image; this is repeated for every such layer. Tiling allows CNNs to tolerate translation of the input image. Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters, and they also consist of combinations of convolutional and fully connected layers. A convolution operation on small regions of input is introduced to reduce the number of free parameters. One major advantage of convolutional networks is the use of shared weights in convolutional layers, which means that the same filter is used at every position in the input; this both reduces memory footprint and improves performance. Compared to other image classification algorithms, convolutional neural networks use relatively little pre-processing; this means that the network is responsible for learning the filters that in traditional algorithms were hand-engineered. The lack of dependence on prior knowledge and human effort in designing features is a major advantage for CNNs.
The design of convolutional neural networks follows visual mechanisms in living organisms. Work by Hubel and Wiesel in the 1950s and 1960s showed that cat and monkey visual cortices contain neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field. Neighboring cells have similar and overlapping receptive fields, and receptive field size and location vary systematically across the cortex to form a complete map of visual space, the cortex in each hemisphere representing the contralateral visual field. The neocognitron, introduced in 1980, does not require units located at several network positions to have the same trainable weights; that idea appears in 1986 in the book version of the original backpropagation paper. Convolutional networks were developed in 1988 for temporal signals, and their design was improved in 1998, generalized in 2003, and simplified in the same year. Processing higher-resolution images requires larger and more numerous convolutional layers. Similarly, a shift-invariant neural network was proposed for image recognition in 1988.
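The shared-weight convolution and pooling operations described above can be illustrated with a toy NumPy sketch. This is purely illustrative; the function names conv2d and max_pool and the edge-detecting kernel are assumptions, not part of any particular library:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation: one shared kernel slides over the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling, discarding any ragged border."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# a 4x4 image whose right half is bright: the single shared [1, -1] kernel
# responds only at the vertical edge, wherever in the image it occurs
image = np.zeros((4, 4))
image[:, 2:] = 1.0
edges = conv2d(image, np.array([[1.0, -1.0]]))
```

Because the same two weights are reused at every position, the edge detector needs 2 parameters instead of one per location, which is the memory and performance advantage of weight sharing noted above.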
Convolutional neural network
–
Machine learning and data mining
27.
Probably approximately correct learning
–
In computational learning theory, probably approximately correct (PAC) learning is a framework for the mathematical analysis of machine learning, proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function from a certain class of possible functions. The goal is that, with high probability, the selected function will have low generalization error. The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples; the model was later extended to treat noise. An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning: in particular, the learner is expected to find efficient functions, and the learner itself must implement an efficient procedure. To give the definition of what it means to be PAC-learnable, two running examples will be used. The first is the problem of character recognition given an array of n bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and points outside it as negative. Let X be a set called the instance space or the encoding of all the samples, with each instance assigned a length. In the character recognition problem, the instance space is X = {0,1}^n; in the interval problem, the instance space is X = ℝ. A concept is a subset c ⊆ X. One concept is the set of all patterns of bits in X = {0,1}^n that encode a picture of the letter P; an example concept from the second example is the set of all numbers between π/2 and 10. A concept class C is a set of concepts over X; this could be, for example, the set of all subsets of the array of bits that are skeletonized and 4-connected. Let EX(c, D) be a procedure that draws an example x using a probability distribution D and gives the correct label c(x), that is, 1 if x ∈ c and 0 otherwise.
Further, if the statement holds for algorithm A for every concept c ∈ C and for every distribution D over X, then we say that A is a PAC learning algorithm for C. Under some regularity conditions, the following three conditions are equivalent: the concept class C is PAC learnable; the VC dimension of C is finite; C is a uniform Glivenko–Cantelli class.
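The interval example above admits one of the simplest PAC learners: output the tightest interval containing the positive samples. A quick simulation (the target concept [2, 5], the uniform distribution on [0, 10], and the function name learn_interval are illustrative assumptions) shows that the error shrinks as more samples are drawn:

```python
import numpy as np

def learn_interval(xs, ys):
    """Tightest-fit hypothesis: the smallest interval covering every positive example."""
    pos = xs[ys == 1]
    if len(pos) == 0:
        return (0.0, 0.0)  # degenerate hypothesis when no positives are seen
    return (float(pos.min()), float(pos.max()))

# target concept c = [2, 5]; samples drawn from D = Uniform(0, 10)
rng = np.random.default_rng(0)
xs = rng.uniform(0, 10, size=1000)
ys = ((xs >= 2) & (xs <= 5)).astype(int)
lo, hi = learn_interval(xs, ys)
# the hypothesis always sits inside the target, so its error under D
# is just the probability mass of the two uncovered gaps
error = ((lo - 2) + (5 - hi)) / 10
```

With 1000 samples the two gaps are tiny, so with high probability the error is far below any fixed ε; this is the sense in which the learner is "probably approximately correct".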
Probably approximately correct learning
–
Machine learning and data mining
28.
Statistical learning theory
–
Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. It deals with the problem of finding a predictive function based on data, and has led to applications in fields such as computer vision, speech recognition and bioinformatics. The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood. Supervised learning involves learning from a training set of data; every point in the training set is an input–output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to predict the output from future input. Depending on the type of output, supervised learning problems are either problems of regression or problems of classification: if the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as output. Classification is very common for machine learning applications. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name; the input would be represented by a large multidimensional vector whose elements represent pixels in the picture. After learning a function based on the training set data, that function is validated on a test set of data, data that did not appear in the training set. Take X to be the space of all possible inputs and Y the space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown probability distribution over the product space Z = X × Y, i.e. there exists some unknown p(z) = p(x, y).
The training set is made up of n samples drawn from this probability distribution and is notated S = {(x_1, y_1), …, (x_n, y_n)} = {z_1, …, z_n}; every x_i is an input vector from the training data and y_i is the output that corresponds to it. In this formalism, the inference problem consists of finding a function f : X → Y such that f(x) ∼ y. Let H be the space of functions f : X → Y called the hypothesis space; the hypothesis space is the space of functions the algorithm will search through. Let V(f(x), y) be the loss functional, a metric for the difference between the predicted value f(x) and the actual value y. Since the full distribution p is unknown, the expected risk cannot be computed directly; instead a proxy measure is used that is based on the training set, a sample from this unknown probability distribution.
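In this notation the learning problem has a standard compact form (a conventional textbook formulation consistent with the definitions above; the symbols I[f] for the expected risk and I_S[f] for its empirical proxy are the usual choices):

```latex
% expected risk: the average loss under the unknown distribution p(x, y)
I[f] = \int_{X \times Y} V\bigl(f(x), y\bigr)\, p(x, y)\, dx\, dy

% empirical risk: the computable proxy based on the training sample S
I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V\bigl(f(x_i), y_i\bigr)

% the learner returns the hypothesis in H minimizing the empirical risk
f_S = \operatorname*{arg\,min}_{f \in H} I_S[f]
```

Minimizing I_S[f] over a hypothesis space H that is too rich relative to n is exactly what produces the overfitting illustrated in the figure accompanying this section.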
Statistical learning theory
–
This image represents an example of overfitting in machine learning. The red dots represent training set data. The green line represents the true functional relationship, while the blue line shows the learned function, which has fallen victim to overfitting.
Statistical learning theory
–
Machine learning and data mining
29.
Axon
–
An axon is a long, slender projection of a nerve cell, or neuron, that typically conducts electrical impulses away from the neuron's cell body. Axons are also known as nerve fibers; the function of the axon is to transmit information to different neurons, muscles and glands. Axon dysfunction causes many inherited and acquired neurological disorders, which can affect both peripheral and central neurons. Nerve fibers are classed into three types: A delta fibers, B fibers, and C fibers; A and B fibers are myelinated while C fibers are unmyelinated. An axon is one of two types of protoplasmic protrusions that extrude from the cell body of a neuron, the other type being dendrites. Axons are distinguished from dendrites by several features, including shape and length, though all of these rules have exceptions. Axons are covered by a membrane known as the axolemma, and the cytoplasm of an axon is called axoplasm. Some types of neurons have no axon and transmit signals from their dendrites. No neuron ever has more than one axon; however, in invertebrates such as insects or leeches the axon sometimes consists of several regions that function more or less independently of each other. Most axons branch, in some cases very profusely. Axons make contact with other cells (usually other neurons but sometimes muscle or gland cells) at junctions called synapses; at a synapse, the membrane of the axon closely adjoins the membrane of the target cell, and special molecular structures serve to transmit electrical or electrochemical signals across the gap. Some synaptic junctions appear partway along an axon as it extends; these are called en passant synapses. Other synapses appear as terminals at the ends of axonal branches. A single axon, with all its branches taken together, can innervate multiple parts of the brain. Axons are the transmission lines of the nervous system.
Some axons can extend up to one meter or more, while others extend as little as one millimeter. The longest axons in the human body are those of the sciatic nerve, which run from the base of the spinal cord to the big toe of each foot. The diameter of axons is also variable: most individual axons are microscopic in diameter. The largest mammalian axons can reach a diameter of up to 20 µm; the squid giant axon, which is specialized to conduct signals very rapidly, is close to 1 millimetre in diameter, the size of a small pencil lead. Axonal arborization also differs from one nerve fiber to the next: axons in the central nervous system typically show complex trees with many branch points. In comparison, the granule cell axon is characterized by a single T-shaped branch node from which two parallel fibers extend. Elaborate arborization allows for the transmission of messages to a large number of target neurons within a single region of the brain.
Axon
–
A dissected human brain, showing grey matter and white matter
Axon
–
Dendrite
Axon
–
(A) Pyramidal cell, interneuron, and short-duration waveform (axon). (B) Overlay of the three average waveforms. (C) Average and standard error of peak-trough time for pyramidal cells, interneurons, and putative axons. (D) Scatter plot of signal-to-noise ratios for individual units against peak-trough time for axons, pyramidal cells (PYR) and interneurons (INT).
Axon
–
Axon of a 9-day-old mouse with growth cone visible
30.
Real numbers
–
In mathematics, a real number is a value that represents a quantity along a continuous line. The adjective "real" in this context was introduced in the 17th century by René Descartes. The real numbers include all the rational numbers, such as the integer −5 and the fraction 4/3, and all the irrational numbers, such as √2; included within the irrationals are the transcendental numbers, such as π. Real numbers can be thought of as points on an infinitely long line called the number line or real line. Any real number can be determined by a possibly infinite decimal representation, such as that of 8.632. The real line can be thought of as a part of the complex plane, and the complex numbers include the real numbers. These descriptions of the real numbers are not sufficiently rigorous by the modern standards of pure mathematics; several rigorous definitions exist, and all of them satisfy the axiomatic definition and are thus equivalent. The statement that there is no subset of the reals with cardinality strictly greater than ℵ0 and strictly smaller than that of the continuum is known as the continuum hypothesis. Simple fractions were used by the Egyptians around 1000 BC, and the Vedic Shulba Sutras (c. 600 BC) contain early treatments of irrational quantities. Around 500 BC, the Greek mathematicians led by Pythagoras realized the need for irrational numbers, in particular the irrationality of the square root of 2. Arabic mathematicians merged the concepts of number and magnitude into a more general idea of real numbers. In the 16th century, Simon Stevin created the basis for modern decimal notation; in the 17th century, Descartes introduced the term "real" to describe roots of a polynomial, distinguishing them from "imaginary" ones. In the 18th and 19th centuries, there was much work on irrational and transcendental numbers. Johann Heinrich Lambert gave the first flawed proof that π cannot be rational, and Adrien-Marie Legendre completed the proof. Évariste Galois developed techniques for determining whether a given equation could be solved by radicals, which gave rise to the field of Galois theory.
Charles Hermite first proved that e is transcendental, and Ferdinand von Lindemann showed that π is transcendental. Lindemann's proof was much simplified by Weierstrass, still further by David Hilbert, and was finally made elementary by Adolf Hurwitz and Paul Gordan. The development of calculus in the 18th century used the entire set of real numbers without having defined them cleanly. The first rigorous definition was given by Georg Cantor in 1871; in 1874, he showed that the set of all real numbers is uncountably infinite but the set of all algebraic numbers is countably infinite. Contrary to widely held belief, his first method was not his famous diagonal argument, which he published in 1891. The real number system can be defined axiomatically up to an isomorphism, which is described hereafter. Another possibility is to start from some rigorous axiomatization of Euclidean geometry; from the structuralist point of view, all these constructions are on equal footing.
Real numbers
–
A symbol of the set of real numbers (ℝ)
31.
Computer vision
–
Computer vision is an interdisciplinary field that deals with how computers can be made to gain high-level understanding from digital images or videos. From the perspective of engineering, it seeks to automate tasks that the human visual system can do, e.g. in the form of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that can interface with other thought processes. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics and learning theory. As a scientific discipline, computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences or views from multiple cameras. As a technological discipline, computer vision seeks to apply its theories and models to the construction of computer vision systems. Sub-domains of computer vision include scene reconstruction, event detection, video tracking, object recognition, object pose estimation, learning, indexing, motion estimation, and image restoration. Computer vision is also concerned with the extraction, analysis and understanding of useful information from a single image or a sequence of images; it involves the development of a theoretical and algorithmic basis to achieve automatic visual understanding. In the late 1960s, computer vision began at universities that were pioneering artificial intelligence.
It was meant to mimic the human visual system, as a stepping stone to endowing robots with intelligent behavior. In 1966, it was believed that this could be achieved through a summer project, by attaching a camera to a computer and having it describe what it saw. The next decade saw studies based on more rigorous mathematical analysis. These include the concept of scale-space and the inference of shape from various cues such as shading, texture and focus; researchers also realized that many of these mathematical concepts could be treated within the same optimization framework as regularization and Markov random fields. By the 1990s, some of the research topics became more active than the others. Research in projective 3-D reconstructions led to a better understanding of camera calibration; with the advent of optimization methods for camera calibration, it was realized that a lot of the ideas were already explored in bundle adjustment theory from the field of photogrammetry.
Computer vision
–
Artist's Concept of Rover on Mars, an example of an unmanned land-based vehicle. Notice the stereo cameras mounted on top of the Rover.
Computer vision
–
Relation between computer vision and various other fields
32.
Social network
–
A social network is a social structure made up of a set of social actors (such as individuals or organizations), sets of dyadic ties, and other social interactions between actors. The social network perspective provides a set of methods for analyzing the structure of whole social entities as well as a variety of theories explaining the patterns observed in these structures. The study of these structures uses social network analysis to identify local and global patterns and locate influential entities. Social networks and the analysis of them form an inherently interdisciplinary academic field which emerged from social psychology, sociology and statistics. Georg Simmel authored early structural theories in sociology emphasizing the dynamics of triads and the "web of group affiliations". Jacob Moreno is credited with developing the first sociograms in the 1930s to study interpersonal relationships; these approaches were mathematically formalized in the 1950s, and theories and methods of social networks became pervasive in the social and behavioral sciences by the 1980s. Social network analysis is now one of the major paradigms in contemporary sociology; together with other complex networks, it forms part of the nascent field of network science. The social network is a theoretical construct useful in the social sciences to study relationships between individuals, groups, organizations, or even entire societies. The term is used to describe a social structure determined by such interactions. The ties through which any given social unit connects represent the convergence of the various social contacts of that unit. This theoretical approach is, necessarily, relational; thus, one common criticism of social network theory is that individual agency is often ignored, although this may not be the case in practice.
Precisely because many different types of relations, singular or in combination, form these network configurations, network analytics are useful to a broad range of research enterprises. In the late 1890s, both Émile Durkheim and Ferdinand Tönnies foreshadowed the idea of social networks in their theories and research of social groups. Tönnies argued that social groups can exist as personal and direct social ties that either link individuals who share values and belief, or as impersonal, formal and instrumental social links. Major developments in the field can be seen in the 1930s by several groups in psychology and anthropology. In psychology, in the 1930s, Jacob L. Moreno began systematic recording and analysis of social interaction in small groups, especially classrooms and work groups. In anthropology, the foundation for social network theory is the theoretical and ethnographic work of Bronislaw Malinowski and Alfred Radcliffe-Brown. In sociology, the work of Talcott Parsons set the stage for taking a relational approach to understanding social structure. Later, drawing upon Parsons' theory, the work of sociologist Peter Blau provided a strong impetus for analyzing the relational ties of social units with his work on social exchange theory. By the 1970s, a growing number of scholars worked to combine the different tracks and traditions. In general, social networks are self-organizing, emergent, and complex, and these patterns become more apparent as network size increases. However, a network analysis of, for example, all interpersonal relationships in the world is not feasible and is likely to contain so much information as to be uninformative.
Social network
–
Social network diagram, meso-level
Social network
–
Diagram: section of a large-scale social network
33.
Mathematics
–
Mathematics is the study of topics such as quantity, structure, space, and change. There is a range of views among mathematicians and philosophers as to the exact scope and definition of mathematics. Mathematicians seek out patterns and use them to formulate new conjectures, and they resolve the truth or falsity of conjectures by mathematical proof. When mathematical structures are good models of real phenomena, mathematical reasoning can provide insight or predictions about nature. Through the use of abstraction and logic, mathematics developed from counting, calculation and measurement; practical mathematics has been a human activity from as far back as written records exist. The research required to solve mathematical problems can take years or even centuries of sustained inquiry. Rigorous arguments first appeared in Greek mathematics, most notably in Euclid's Elements. Galileo Galilei said: "The universe cannot be read until we have learned the language and become familiar with the characters in which it is written. It is written in mathematical language, and the letters are triangles, circles and other geometrical figures, without which means it is humanly impossible to comprehend a single word. Without these, one is wandering about in a dark labyrinth." Carl Friedrich Gauss referred to mathematics as "the Queen of the Sciences". Benjamin Peirce called mathematics "the science that draws necessary conclusions". David Hilbert said of mathematics: "We are not speaking here of arbitrariness in any sense. Mathematics is not like a game whose tasks are determined by arbitrarily stipulated rules. Rather, it is a conceptual system possessing internal necessity that can only be so and by no means otherwise." Albert Einstein stated that "as far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." Mathematics is essential in many fields, including natural science, engineering, medicine, finance and the social sciences.
Applied mathematics has led to entirely new mathematical disciplines, such as statistics. Mathematicians also engage in pure mathematics, or mathematics for its own sake, without having any application in mind; there is no clear line separating pure and applied mathematics. The history of mathematics can be seen as an ever-increasing series of abstractions. The earliest uses of mathematics were in trading, land measurement, painting and weaving patterns; in Babylonian mathematics, elementary arithmetic first appears in the archaeological record. Numeracy pre-dated writing, and numeral systems have been many and diverse. Between 600 and 300 BC the Ancient Greeks began a study of mathematics in its own right with Greek mathematics. Mathematics has since been greatly extended, and there has been a fruitful interaction between mathematics and science, to the benefit of both. Mathematical discoveries continue to be made today; the overwhelming majority of works in this ocean contain new mathematical theorems and their proofs. The word máthēma is derived from μανθάνω (manthano), while the modern Greek equivalent is μαθαίνω (mathaino), both of which mean "to learn". In Greece, the word for mathematics came to have the narrower and more technical meaning "mathematical study", even in Classical times.
Mathematics
–
Euclid (holding calipers), Greek mathematician, 3rd century BC, as imagined by Raphael in this detail from The School of Athens.
Mathematics
–
Greek mathematician Pythagoras (c. 570 – c. 495 BC), commonly credited with discovering the Pythagorean theorem
Mathematics
–
Leonardo Fibonacci, the Italian mathematician who introduced the Hindu–Arabic numeral system to the Western World
Mathematics
–
Carl Friedrich Gauss, known as the prince of mathematicians
34.
Long term potentiation
–
In neuroscience, long-term potentiation (LTP) is a persistent strengthening of synapses based on recent patterns of activity: patterns of synaptic activity that produce a long-lasting increase in signal transmission between two neurons. The opposite of LTP is long-term depression, which produces a long-lasting decrease in synaptic strength. LTP is one of several phenomena underlying synaptic plasticity, the ability of synapses to change their strength. As memories are thought to be encoded by modification of synaptic strength, LTP is widely considered one of the cellular mechanisms that underlies learning. LTP was discovered in the hippocampus by Terje Lømo in 1966 and has remained a popular subject of research since. Many modern LTP studies seek to understand its basic biology; still others try to develop methods, pharmacologic or otherwise, of enhancing LTP to improve learning. LTP is also a subject of clinical research, for example in the areas of Alzheimer's disease and addiction medicine. By the end of the 19th century it was generally recognized that the number of neurons in the adult brain does not increase significantly with age, and with this realization came the need to explain how memories could form in the absence of new neurons. The Spanish neuroanatomist Santiago Ramón y Cajal was among the first to suggest a mechanism of learning that did not require the formation of new neurons: in his 1894 Croonian Lecture, he proposed that memories might instead be formed by strengthening the connections between existing neurons to improve the effectiveness of their communication. The experimental skills needed to test this idea would not come until the second half of the 20th century. LTP was first observed by Terje Lømo in 1966 in the Oslo, Norway, laboratory of Per Andersen, where Lømo conducted a series of neurophysiological experiments on anesthetized rabbits to explore the role of the hippocampus in short-term memory.
Lømo's experiments focused on connections, or synapses, from the perforant pathway to the dentate gyrus. These experiments were carried out by stimulating presynaptic fibers of the perforant pathway and recording responses from a collection of postsynaptic cells of the dentate gyrus. As expected, a single pulse of electrical stimulation to fibers of the perforant pathway caused excitatory postsynaptic potentials in cells of the dentate gyrus. When a high-frequency train of stimuli was applied, however, subsequent single-pulse stimuli elicited stronger, prolonged responses. Timothy Bliss, who joined the Andersen laboratory in 1968, collaborated with Lømo, and in 1973 the two published the first characterization of long-lasting potentiation in the rabbit hippocampus. Bliss and Tony Gardner-Medwin published a report of long-lasting potentiation in the awake animal, which appeared in the same issue as the Bliss and Lømo report. In 1975, Douglas and Goddard proposed "long-term potentiation" as a new name for the phenomenon of long-lasting potentiation; Andersen suggested that the authors chose the name perhaps because of its easily pronounced acronym, LTP. The physical and biological mechanism of LTP is still not fully understood, and some researchers have proposed re-arranging or synchronizing the relationship between receptor regulation, LTP, and synaptic strength.
Long term potentiation
–
The 19th century neuroanatomist Santiago Ramón y Cajal proposed that memories might be stored across synapses, the junctions between neurons that allow for their communication.
Long term potentiation
–
The Morris water maze task has been used to demonstrate the necessity of NMDA receptors in establishing spatial memories.
35.
Nobel laureate
–
The Nobel Prizes were established by the 1895 will of Alfred Nobel, which dictates that the awards should be administered by the Nobel Foundation; the Nobel Memorial Prize in Economic Sciences was established in 1968 by the Sveriges Riksbank. Each recipient, or laureate, receives a gold medal, a diploma, and a sum of money, which is decided by the Nobel Foundation yearly and has varied throughout the years. In 1901, the recipients of the first Nobel Prizes were given 150,782 SEK; in 2008, the laureates were awarded a prize amount of 10,000,000 SEK. The awards are presented in Stockholm in an annual ceremony on December 10. In years in which the Nobel Prize is not awarded due to external events or a lack of nominations, the prize money is returned to the funds delegated to the relevant prize. The Nobel Prize was not awarded between 1940 and 1942 due to the outbreak of World War II. Between 1901 and 2015, the Nobel Prizes and the Nobel Memorial Prize in Economic Sciences were awarded 573 times to 900 people and organizations; with some receiving the Nobel Prize more than once, this makes a total of 870 individuals and 23 organizations. Four Nobel laureates were not permitted by their governments to accept the Nobel Prize. Six laureates have received more than one prize; of the six, the UNHCR has been awarded the Nobel Peace Prize twice, and the Nobel Prize in Physics was awarded to John Bardeen twice. Two laureates have been awarded twice but not in the same field: Marie Curie and Linus Pauling. Among the 870 Nobel laureates, 48 have been women; the first woman to receive a Nobel Prize was Marie Curie, who was also the first person to be awarded two Nobel Prizes, the second award being the Nobel Prize in Chemistry, given in 1911. A In 1938 and 1939, the government of Germany did not allow three German Nobel nominees to accept their Nobel Prizes.
The three were Richard Kuhn, Nobel laureate in Chemistry in 1938; Adolf Butenandt, Nobel laureate in Chemistry in 1939; and Gerhard Domagk, Nobel laureate in Physiology or Medicine in 1939. They were later awarded the Nobel Prize diploma and medal, but not the money. B. In 1948, the Nobel Peace Prize was not awarded; the Nobel Foundation's website suggests that it would have been awarded to Mohandas Karamchand Gandhi, but due to his assassination earlier that year, it was left unassigned in his honor. C. In 1958, Russian-born Boris Pasternak, under pressure from the government of the Soviet Union, was forced to decline the Nobel Prize in Literature. D. In 1964, Jean-Paul Sartre refused to accept the Nobel Prize in Literature. E. In 1973, Lê Đức Thọ declined the Nobel Peace Prize; he felt he did not deserve it because, although he had helped negotiate the Paris Peace Accords, peace had not yet been achieved in Vietnam. F. In 2010, Liu Xiaobo was unable to receive the Nobel Peace Prize, as he had been sentenced to 11 years of imprisonment by the Chinese authorities
Nobel laureate
–
Nobel laureates of 2012 Alvin E. Roth, Brian Kobilka, Robert J. Lefkowitz, David J. Wineland, and Serge Haroche during the ceremony
Nobel laureate
–
Nobel laureates receive a gold medal together with a diploma and (as of 2012) 8 million SEK (roughly US$1.2 million, €0.93 million).
36.
Primary visual cortex
–
The visual cortex of the brain is the part of the cerebral cortex that plays an important role in processing visual information. It is located in the occipital lobe, at the back of the skull. Visual information coming from the eye passes through the lateral geniculate nucleus, which is located in the thalamus. The part of the cortex that receives the sensory inputs from the thalamus is the primary visual cortex, also known as visual area 1 (V1). The extrastriate areas consist of visual areas 2, 3, 4, and 5. The primary visual cortex is located in and around the calcarine fissure in the occipital lobe. Each hemisphere's V1 receives information directly from its ipsilateral lateral geniculate nucleus, which receives signals from the contralateral visual hemifield. Neurons in the visual cortex fire action potentials when visual stimuli appear within their receptive field. By definition, the receptive field is the region within the entire visual field that elicits an action potential; however, a given neuron may respond best to a subset of stimuli within its receptive field, a property called neuronal tuning. In the earlier areas, neurons have simpler tuning: for example, a neuron in V1 may fire to any vertical stimulus in its receptive field. In the higher visual areas, neurons have complex tuning: for example, in the temporal cortex, a neuron may fire only when a certain face appears in its receptive field. The visual cortex receives its blood supply primarily from the calcarine branch of the posterior cerebral artery. One recent discovery concerning human V1 is that signals measured by fMRI show very large attentional modulation; this result is consistent with a recent electrophysiology study. Other current work on V1 seeks to fully characterize its tuning properties and to use it as a model area for the canonical cortical circuit. Lesions to the primary visual cortex lead to a scotoma, or hole in the visual field. 
Note that patients with scotomas are often able to make use of visual information presented to their scotomas. Each V1 transmits information along two pathways, called the ventral stream and the dorsal stream. The ventral stream begins with V1 and goes through visual area V2 and then visual area V4; sometimes called the "what pathway", it is associated with form recognition and object representation
Primary visual cortex
–
Micrograph showing the visual cortex (pink). The pia mater and arachnoid mater including blood vessels are seen at the top of the image. Subcortical white matter (blue) is seen at the bottom of the image. HE-LFB stain.
Primary visual cortex
–
View of the brain from behind. Red = Brodmann area 17 (primary visual cortex); orange = area 18; yellow = area 19
37.
Seymour Papert
–
Seymour Aubrey Papert was a South African-born American mathematician, computer scientist, and educator, who spent most of his career teaching and researching at MIT. He was one of the pioneers of artificial intelligence and of the constructionist movement in education, and he was co-inventor, with Wally Feurzeig and Cynthia Solomon, of the Logo programming language. Papert attended the University of the Witwatersrand, receiving a Bachelor of Arts degree in philosophy in 1949 followed by a PhD in mathematics in 1952. He then went on to receive a second doctorate, also in mathematics, at the University of Cambridge, supervised by Frank Smithies. At MIT, Papert created the Epistemology and Learning Research Group at the MIT Architecture Machine Group, which later became the MIT Media Lab. Here, he developed a theory of learning called constructionism. Papert had worked with Piaget at the University of Geneva from 1958 to 1963 and was one of Piaget's protégés; Piaget himself once said that "no one understands my ideas as well as Papert". Papert rethought how schools should work, based on these theories of learning. Papert used Piaget's work in his development of the Logo programming language while at MIT; he created Logo as a tool to improve the way children think and solve problems. A small mobile robot called the Logo Turtle was developed, and a main purpose of the Logo Foundation research group is to strengthen the ability to learn knowledge. Papert insisted that a simple language or program that children can learn, like Logo, can also have advanced functionality for expert users. His works include Counter-Free Automata (1971, ISBN 0-262-13076-9); Perceptrons (MIT Press, 1969, ISBN 0-262-63111-3); Mindstorms: Children, Computers, and Powerful Ideas; and Constructionism: research reports and essays 1985–1990 by the Epistemology and Learning Research Group, the Media Lab, Massachusetts Institute of Technology (Ablex Pub.). 
He was one of the principals of the One Laptop Per Child initiative to manufacture inexpensive laptops for children in the developing world. Papert also collaborated with the construction toy manufacturer Lego on their Logo-programmable Lego Mindstorms robotics kits, which were named after his groundbreaking 1980 book. He was a figure in the revolutionary socialist circle around Socialist Review while living in London in the 1950s, and was also a prominent activist against South African apartheid policies during his university education. Papert was married to Dona Strauss, and later to Androula Christofides Henriques. Papert's third wife was MIT professor Sherry Turkle, and together they wrote the influential paper "Epistemological Pluralism". In his final 24 years, Papert was married to Suzanne Massie. He was moved to a hospital closer to his home in January 2007, but then contracted septicemia, which damaged a heart valve that was later replaced. By 2008 he had returned home, could think and communicate clearly and walk almost unaided, and his rehabilitation team used some of the very principles of experiential, hands-on learning that he had pioneered. Papert died at his home in Blue Hill, Maine, on July 31, 2016. Papert's work has been used by other researchers in the fields of education and computer science. In 1981, Papert, along with others in the Logo group at MIT, founded Logo Computer Systems Inc.
Seymour Papert
–
Seymour Papert (May 2006)
38.
Vanishing gradient problem
–
In machine learning, the vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the gradient of the error function with respect to the current weight in each iteration of training. Traditional activation functions such as the hyperbolic tangent function have gradients in the range (0, 1], and backpropagation computes gradients by the chain rule; this multiplies one such small factor per layer, so the gradient of the "front" layers decreases exponentially with depth and those layers train very slowly. With the advent of the backpropagation algorithm in the 1970s, many researchers tried to train supervised deep artificial neural networks from scratch, initially with little success. The problem also affects recurrent networks, which are trained by unfolding them into very deep feedforward networks. Conversely, when activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem. To overcome this problem, several methods were proposed. One is Jürgen Schmidhuber's multi-level hierarchy of networks, pre-trained one level at a time through unsupervised learning and fine-tuned through backpropagation. Here each level learns a representation of the observations that is fed to the next level. Similar ideas have been used in feedforward neural networks for unsupervised pre-training to structure a neural network; the network is then trained further by supervised back-propagation to classify labeled data. The deep belief network model by Hinton et al. involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables. It uses a restricted Boltzmann machine to model each new layer of higher-level features. Each new layer guarantees an increase in the lower bound of the log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model from the top-level feature activations. 
Hinton reports that his models are effective feature extractors over high-dimensional, structured data. Another method, particularly used for recurrent neural networks, is the long short-term memory (LSTM) network of 1997 by Hochreiter & Schmidhuber. Schmidhuber notes that this is "basically what is winning many of the image recognition competitions now". One of the newest and most effective ways to resolve the vanishing gradient problem is with residual neural networks (ResNets). It was noted prior to ResNets that a deeper network would actually have higher training error than a shallower network. ResNets yielded lower training error than their shallower counterparts simply by reintroducing outputs from shallower layers in the network to compensate for the vanishing gradient; no extra parameters or changes to the algorithm were needed. Sven Behnke relied only on the sign of the gradient when training his Neural Abstraction Pyramid to solve problems like image reconstruction and face localization. Neural networks can also be optimized by using a search algorithm on the space of the neural network's weights
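As a rough numerical sketch (illustrative only, not drawn from any particular paper), the core of the problem can be seen by multiplying one tanh-derivative factor per layer, as backpropagation does through a chain of tanh units; the hypothetical pre-activation value of 1.0 is an arbitrary choice for the example:

```python
import math

def tanh_grad(x):
    # Derivative of tanh: 1 - tanh(x)^2, which always lies in (0, 1].
    return 1.0 - math.tanh(x) ** 2

def gradient_scale(depth, pre_activation=1.0):
    # Backpropagation through `depth` tanh layers multiplies one
    # derivative factor per layer (weights ignored for simplicity),
    # so the overall scale decays geometrically with depth.
    scale = 1.0
    for _ in range(depth):
        scale *= tanh_grad(pre_activation)
    return scale

shallow = gradient_scale(2)   # still a usable gradient magnitude
deep = gradient_scale(20)     # vanishingly small: the front layers barely learn
```

Each factor here is about 0.42, so twenty layers shrink the gradient by roughly eight orders of magnitude, which is why the front layers of a deep tanh network train so slowly.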
39.
YouTube
–
YouTube is an American video-sharing website headquartered in San Bruno, California. The service was created by three former PayPal employees, Chad Hurley, Steve Chen, and Jawed Karim, in February 2005. Google bought the site in November 2006 for US$1.65 billion; YouTube now operates as one of Google's subsidiaries. Unregistered users can watch videos on the site, while registered users are permitted to upload an unlimited number of videos. Videos deemed potentially offensive are available only to registered users affirming themselves to be at least 18 years old. YouTube earns advertising revenue from Google AdSense, a program which targets ads according to site content and audience. As of February 2017, more than 400 hours of content were uploaded to YouTube each minute, and as of April 2017, the website is ranked as the second most popular site in the world by Alexa Internet, a web traffic analysis company. YouTube was founded by Chad Hurley, Steve Chen, and Jawed Karim. Hurley had studied design at Indiana University of Pennsylvania, and Chen and Karim studied computer science together at the University of Illinois at Urbana-Champaign. Karim said the inspiration for YouTube came from Janet Jackson's involvement in the 2004 Super Bowl incident and from the 2004 Indian Ocean tsunami; he could not easily find video clips of either event online. Hurley and Chen said that the original idea for YouTube was a video version of an online dating service, and had been influenced by the website Hot or Not. YouTube began as a venture capital-funded technology startup, primarily from an $11.5 million investment by Sequoia Capital between November 2005 and April 2006. YouTube's early headquarters were situated above a pizzeria and Japanese restaurant in San Mateo, California. The domain name www.youtube.com was activated on February 14, 2005. The first YouTube video, titled "Me at the zoo", shows co-founder Jawed Karim at the San Diego Zoo; it was uploaded on April 23, 2005, and can still be viewed on the site. YouTube offered the public a beta test of the site in May 2005. 
The first video to reach one million views was a Nike advertisement featuring Ronaldinho, in November 2005. Following a $3.5 million investment from Sequoia Capital in November, the site grew rapidly, and in July 2006 the company announced that more than 65,000 new videos were being uploaded every day and that the site was receiving 100 million video views per day. The site has 800 million unique users a month, and it is estimated that in 2007 YouTube consumed as much bandwidth as the entire Internet in 2000. The choice of the name www.youtube.com led to problems for a similarly named website; that site's owner, Universal Tube & Rollform Equipment, filed a lawsuit against YouTube in November 2006 after being regularly overloaded by people looking for YouTube, and has since changed the name of its website to www.utubeonline.com. In October 2006, Google Inc. announced that it had acquired YouTube for $1.65 billion in Google stock, and the deal was finalized on November 13, 2006. In March 2010, YouTube began free streaming of certain content; according to YouTube, this was the first worldwide free online broadcast of a major sporting event. On March 31, 2010, the YouTube website launched a new design with the aim of simplifying the interface; Google product manager Shiva Rajaraman commented, "We really felt like we needed to step back and remove the clutter." In May 2010, YouTube videos were watched more than two billion times per day
YouTube
–
From left to right: Chad Hurley, Steve Chen, and Jawed Karim
YouTube
–
Screenshot of YouTube's homepage
YouTube
–
YouTube's headquarters as of 2010 in San Bruno, California.
40.
CMOS
–
Complementary metal–oxide–semiconductor, abbreviated as CMOS /ˈsiːmɒs/, is a technology for constructing integrated circuits. CMOS technology is used in microprocessors, microcontrollers, static RAM, and other digital logic circuits. CMOS technology is also used for several analog circuits such as image sensors, data converters, and highly integrated transceivers for many types of communication. In 1963, while working for Fairchild Semiconductor, Frank Wanlass patented CMOS. CMOS is also sometimes referred to as complementary-symmetry metal–oxide–semiconductor. Two important characteristics of CMOS devices are high noise immunity and low static power consumption. Since one transistor of the pair is always off, the series combination draws significant power only momentarily during switching between on and off states. CMOS also allows a high density of logic functions on a chip, and it was primarily for this reason that CMOS became the most used technology to be implemented in VLSI chips. Aluminium was once used as the gate material, but now the material is polysilicon. Other metal gates have made a comeback with the advent of high-k dielectric materials in the CMOS process, as announced by IBM and Intel for the 45 nanometer node and beyond. CMOS refers to both a style of digital circuitry design and the family of processes used to implement that circuitry on integrated circuits. CMOS circuitry dissipates less power than logic families with resistive loads; since this advantage has increased and grown more important, CMOS processes and variants have come to dominate, and thus the vast majority of modern integrated circuit manufacturing is on CMOS processes. As of 2010, CPUs with the best performance per watt each year have been CMOS static logic since 1976. CMOS circuits use a combination of p-type and n-type metal–oxide–semiconductor field-effect transistors (MOSFETs) to implement logic gates and other digital circuits. 
CMOS always uses enhancement-mode MOSFETs. CMOS circuits are constructed in such a way that all PMOS transistors must have an input from either the voltage source or another PMOS transistor; similarly, all NMOS transistors must have an input from either ground or another NMOS transistor. The composition of a PMOS transistor creates low resistance between its source and drain contacts when a low gate voltage is applied, and high resistance when a high gate voltage is applied; conversely, the composition of an NMOS transistor creates high resistance between source and drain when a low gate voltage is applied, and low resistance when a high gate voltage is applied. CMOS accomplishes current reduction by complementing every nMOSFET with a pMOSFET: a high voltage on the gates will cause the nMOSFET to conduct and the pMOSFET not to conduct, while a low voltage on the gates causes the reverse. This arrangement greatly reduces power consumption and heat generation. However, during the switching time, both MOSFETs conduct briefly as the gate voltage goes from one state to another; this induces a brief spike in power consumption and becomes a serious issue at high frequencies. The accompanying image shows what happens when an input is connected to both a PMOS transistor and an NMOS transistor: when the voltage of input A is low, the NMOS transistor's channel is in a high-resistance state, which limits the current that can flow from Q to ground, while the PMOS transistor's channel is in a low-resistance state and much more current can flow from the supply to the output
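The complementary pull-up/pull-down behaviour described above can be sketched as a toy Boolean model (an illustration of the logic, not a circuit simulator; the function names are invented for the example). A PMOS conducts on a low gate, an NMOS on a high gate, so exactly one network conducts at steady state:

```python
# Boolean sketch of CMOS logic: True = high voltage, False = low voltage.
def pmos_conducts(gate):
    return not gate   # PMOS: low resistance when the gate voltage is low

def nmos_conducts(gate):
    return gate       # NMOS: low resistance when the gate voltage is high

def cmos_inverter(a):
    pull_up = pmos_conducts(a)    # PMOS connects the output to the supply
    pull_down = nmos_conducts(a)  # NMOS connects the output to ground
    # Exactly one network conducts at steady state -> low static power.
    assert pull_up != pull_down
    return pull_up  # output is high iff the pull-up network conducts

def cmos_nand(a, b):
    pull_up = pmos_conducts(a) or pmos_conducts(b)     # PMOS in parallel
    pull_down = nmos_conducts(a) and nmos_conducts(b)  # NMOS in series
    assert pull_up != pull_down
    return pull_up
```

The `assert pull_up != pull_down` line encodes the key CMOS property: the two networks are duals, so they never both conduct in a settled state, and the momentary spike discussed above happens only during transitions, which this static model does not capture.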
CMOS
–
CMOS inverter (NOT logic gate)
41.
Convolution
–
In mathematics, convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other. It has applications that include probability, statistics, computer vision, natural language processing, image and signal processing, engineering, and differential equations. The convolution can be defined for functions on spaces other than Euclidean space. For example, periodic functions, such as the discrete-time Fourier transform, can be defined on a circle and convolved by periodic convolution, and a discrete convolution can be defined for functions on the set of integers. Computing the inverse of the convolution operation is known as deconvolution. The convolution of f and g is written f∗g, using an asterisk or star, and is defined as the integral of the product of the two functions after one is reversed and shifted:

(f ∗ g)(t) = ∫₋∞^∞ f(τ) g(t − τ) dτ

As such, it is a kind of integral transform. While the symbol t is used above, it need not represent the time domain, but in that context the convolution formula can be described as a weighted average of the function f(τ) at the moment t, where the weighting is given by g(−τ) simply shifted by amount t. As t changes, the weighting function emphasizes different parts of the input function. For the multi-dimensional formulation of convolution, see Domain of definition. A primarily engineering convention that one sees is

f(t) ∗ g(t) ≝ ∫₋∞^∞ f(τ) g(t − τ) dτ

which has to be interpreted carefully. For instance, f(t)∗g(t) is equivalent to (f∗g)(t), but f(t−t₀)∗g(t−t₀) is in fact equivalent to (f∗g)(t−2t₀). Convolution describes the output of an important class of operations known as linear time-invariant (LTI). See LTI system theory for a derivation of convolution as the result of LTI constraints. In terms of the Fourier transforms of the input and output of an LTI operation, no new frequency components are created; the existing ones are only modified. In other words, the output transform is the pointwise product of the input transform with a third transform. See Convolution theorem for a derivation of that property of convolution; conversely, convolution can be derived as the inverse Fourier transform of the pointwise product of two Fourier transforms. 
One of the earliest uses of the convolution integral appeared in d'Alembert's derivation of Taylor's theorem in Recherches sur différents points importants du système du monde. Soon thereafter, convolution operations appear in the works of Pierre Simon Laplace, Jean-Baptiste Joseph Fourier, Siméon Denis Poisson, and others. The term itself did not come into use until the 1950s or 60s; prior to that it was known as Faltung, composition product, or superposition integral. Yet it appears as early as 1903, though the definition is rather unfamiliar in older uses. The operation

∫₀ᵗ φ(s) ψ(t − s) ds,  0 ≤ t < ∞,

is a particular case of composition products considered by the Italian mathematician Vito Volterra in 1913. The summation is called a periodic summation of the function f
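The discrete analog of the integral above can be computed directly from the definition (f∗g)[n] = ∑ₖ f[k] g[n−k]. The following is a minimal pure-Python sketch for illustration; practical code would use an FFT-based routine such as the one the convolution theorem suggests:

```python
def convolve(f, g):
    # Discrete convolution from the definition:
    # (f*g)[n] = sum over k of f[k] * g[n - k]
    n = len(f) + len(g) - 1
    out = [0.0] * n
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj  # each pair contributes at shift i + j
    return out

result = convolve([1, 2, 3], [0, 1, 0.5])
```

Because the definition is symmetric in f and g (substitute k → n − k), the operation is commutative: `convolve(f, g)` and `convolve(g, f)` give the same sequence.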
Convolution
–
Gaussian blur can be used in order to obtain a smooth grayscale digital image of a halftone print
Convolution
–
Visual comparison of convolution, cross-correlation and autocorrelation.
42.
Back-propagation
–
Backpropagation, or the backward propagation of errors, is a common method of training artificial neural networks, used in conjunction with an optimization method such as gradient descent. The algorithm repeats a two-phase cycle: propagation and weight update. When an input vector is presented to the network, it is propagated forward through the network, layer by layer, until it reaches the output layer. The output of the network is then compared to the desired output using a loss function. The error values are then propagated backwards, starting from the output; backpropagation uses these error values to calculate the gradient of the loss function with respect to the weights in the network. In the second phase, this gradient is fed to the optimization method, which uses it to update the weights. Backpropagation is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. It requires that the activation function used by the artificial neurons be differentiable. The goal of any supervised learning algorithm is to find a function that best maps a set of inputs to their correct output. An example would be a classification task, where the input is an image of an animal and the correct output is the name of the animal. The goal of backpropagation is to compute this derivative, or gradient, of the loss with respect to the weights. For backpropagation, the loss function calculates the difference between the desired output for a training example and the network's actual output, after the example has been propagated through the network. For backpropagation to work, two assumptions are made about the form of the error function. The first is that it can be written as an average E = (1/n) ∑ₓ Eₓ over error functions Eₓ for individual training examples x. In practice, training examples are placed in batches, and the error is averaged at the end of the batch. The second assumption is that it can be written as a function of the outputs from the neural network. Let y, y′ be vectors in Rⁿ, and select an error function E(y, y′) measuring the difference between the two outputs. 
The standard choice is E(y, y′) = (1/2) ‖y − y′‖²; the factor of 1/2 conveniently cancels the exponent when the error function is subsequently differentiated. The error function over n training examples can then be written as an average of the per-example errors, and the partial derivative with respect to the outputs is ∂E/∂y′ = y′ − y. Let N be a network with e connections, m inputs, and n outputs. Below, x, x₁, x₂, … will denote vectors in Rᵐ; y, y′, y₁, y₂, … vectors in Rⁿ; and w, w₀, w₁, … vectors in Rᵉ. These are called inputs, outputs, and weights respectively. The neural network corresponds to a function y = f_N(w, x) which, given a weight vector w, maps an input x to an output y
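The two-phase cycle and the chain rule can be sketched for a hypothetical network with one input, one sigmoid hidden unit, and one linear output (the network and weights here are invented for the example; the analytic gradient is checked against a finite-difference approximation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = sigmoid(w1 * x)   # forward pass: hidden activation
    y = w2 * h            # linear output
    return h, y

def loss(y, target):
    # E = 1/2 (y - y')^2; the 1/2 cancels on differentiation
    return 0.5 * (y - target) ** 2

def backprop(x, target, w1, w2):
    h, y = forward(x, w1, w2)
    dE_dy = y - target                # derivative of the loss at the output
    dE_dw2 = dE_dy * h                # gradient for the output weight
    dE_dh = dE_dy * w2                # error propagated backwards to the hidden unit
    dE_dw1 = dE_dh * h * (1 - h) * x  # chain rule through the sigmoid (sigma' = h(1-h))
    return dE_dw1, dE_dw2

# Verify the backward pass against a numerical gradient of the loss:
x, target, w1, w2 = 0.5, 1.0, 0.3, -0.8
g1, g2 = backprop(x, target, w1, w2)
eps = 1e-6
num_g1 = (loss(forward(x, w1 + eps, w2)[1], target)
          - loss(forward(x, w1 - eps, w2)[1], target)) / (2 * eps)
```

The finite-difference check is the standard way to test a hand-written backward pass: if the chain-rule gradient and the numerical gradient disagree, the backward pass has a bug.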
43.
Multi-dimensional
–
In physics and mathematics, the dimension of a mathematical space is informally defined as the minimum number of coordinates needed to specify any point within it. Thus a line has a dimension of one because only one coordinate is needed to specify a point on it – for example, the point at 5 on a number line. The inside of a cube, a cylinder or a sphere is three-dimensional because three coordinates are needed to locate a point within these spaces. In classical mechanics, space and time are different categories and refer to absolute space and time. That conception of the world is a four-dimensional space, but not the one that was found necessary to describe electromagnetism. The four dimensions of spacetime consist of events that are not absolutely defined spatially and temporally; Minkowski space first approximates the universe without gravity, while the pseudo-Riemannian manifolds of general relativity describe spacetime with matter and gravity. Ten dimensions are used to describe string theory, and the state space of quantum mechanics is an infinite-dimensional function space. The concept of dimension is not restricted to physical objects; high-dimensional spaces frequently occur in mathematics and the sciences. They may be parameter spaces or configuration spaces, such as in Lagrangian or Hamiltonian mechanics. In mathematics, the dimension of an object is an intrinsic property, independent of the space in which the object is embedded. This intrinsic notion of dimension is one of the ways the mathematical notion of dimension differs from its common usages. The dimension of Euclidean n-space Eⁿ is n. When trying to generalize to other types of spaces, one is faced with the question of what makes Eⁿ n-dimensional. One answer is that to cover a fixed ball in Eⁿ by small balls of radius ε, one needs on the order of ε⁻ⁿ such small balls. This observation leads to the definition of the Minkowski dimension and its more sophisticated variant, the Hausdorff dimension. For example, the boundary of a ball in Eⁿ looks locally like Eⁿ⁻¹, and this leads to the notion of the inductive dimension. 
While these notions agree on Eⁿ, they turn out to be different when one looks at more general spaces. A tesseract is an example of a four-dimensional object. The rest of this section examines some of the more important mathematical definitions of dimension. A complex number (x + iy) has a real part x and an imaginary part y, so a single complex coordinate system may be applied to an object having two real dimensions. For example, an ordinary two-dimensional spherical surface, when given a complex metric, becomes a Riemann sphere of one complex dimension. Complex dimensions appear in the study of complex manifolds and algebraic varieties. The dimension of a vector space is the number of vectors in any basis for the space. This notion of dimension is referred to as the Hamel dimension or algebraic dimension to distinguish it from other notions of dimension
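The box-counting idea behind the Minkowski dimension can be sketched numerically (an illustration under simplified assumptions: we cover the unit square with axis-aligned boxes of side ε, so the count is exactly ⌈1/ε⌉² rather than an empirical measurement):

```python
import math

def boxes_to_cover_square(eps):
    # Number of axis-aligned boxes of side eps needed to cover the unit square
    per_side = math.ceil(1.0 / eps)
    return per_side ** 2

def dimension_estimate(eps):
    # Minkowski (box-counting) dimension: d ~ log N(eps) / log(1 / eps)
    return math.log(boxes_to_cover_square(eps)) / math.log(1.0 / eps)

dim = dimension_estimate(1e-3)  # approaches 2 for a square as eps shrinks
```

For a line segment the count would scale as ε⁻¹ and the same ratio would approach 1, which is exactly the sense in which the covering behaviour "makes Eⁿ n-dimensional".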
Multi-dimensional
–
From left to right: the square, the cube and the tesseract. The two-dimensional (2d) square is bounded by one-dimensional (1d) lines; the three-dimensional (3d) cube by two-dimensional areas; and the four-dimensional (4d) tesseract by three-dimensional volumes. For display on a two-dimensional surface such as a screen, the 3d cube and 4d tesseract require projection.
44.
Long short-term memory
–
Long short-term memory (LSTM) is a recurrent neural network architecture proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. Relative insensitivity to gap length gives an advantage to LSTM over alternative RNNs and hidden Markov models; among other successes, LSTM achieved the best known results in natural language text compression and unsegmented connected handwriting recognition, and in 2009 won the ICDAR handwriting competition. As of 2016, major companies including Google, Apple, and Microsoft were using LSTM in new products. An LSTM network is a neural network that contains LSTM units instead of, or in addition to, other network units. An LSTM unit is a recurrent network unit that excels at remembering values for either long or short durations of time. The key to this ability is that it uses no activation function within its recurrent components; thus, the stored value is not iteratively squashed over time. LSTM units are often implemented in blocks containing several LSTM units. This design is typical with deep multi-layered neural networks and facilitates implementations with parallel hardware. In the equations below, each variable in lowercase italics represents a vector with a size equal to the number of LSTM units in the block. LSTM blocks contain three or four gates that are used to control the flow of information into or out of their memory. These gates are implemented using the logistic function to compute a value between 0 and 1; multiplication by this value is used to partially allow or deny information to flow into or out of the memory. For example, an input gate controls the extent to which a new value flows into the memory, a forget gate controls the extent to which a value remains in memory, and an output gate controls the extent to which the value in memory is used to compute the output activation of the block. The only weights in an LSTM block are used to direct the operation of the gates; these weights occur between the values that feed into the block and each of the gates. 
Thus, the LSTM block determines how to maintain its memory as a function of those values. LSTM blocks are usually trained with backpropagation through time. The activation functions are: σ_g, originally a sigmoid function; σ_c, originally a hyperbolic tangent; and σ_h, originally a hyperbolic tangent, although the peephole LSTM paper suggests σ_h(x) = x. In the peephole variant, h_{t−1} is not used; c_{t−1} is used instead in most places. The equations for a traditional LSTM block are:

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ σ_c(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ∘ σ_h(c_t)

where ∘ denotes element-wise multiplication, x_t is the input vector, h_t the output vector, c_t the cell state, and W, U, and b are the weight matrices and bias vectors.
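As an illustrative sketch (a single scalar LSTM unit, not the published multi-unit formulation; the weight dictionaries W, U, b are invented for the example), one step of the gated update can be written directly, and choosing a saturated forget gate with a closed input gate shows the "remembering" behaviour, where the cell value passes through unchanged:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # One step of a single scalar LSTM unit.
    # W, U, b: dicts keyed by 'f' (forget), 'i' (input), 'o' (output), 'c' (candidate).
    f = sigmoid(W['f'] * x + U['f'] * h_prev + b['f'])  # forget gate in (0, 1)
    i = sigmoid(W['i'] * x + U['i'] * h_prev + b['i'])  # input gate in (0, 1)
    o = sigmoid(W['o'] * x + U['o'] * h_prev + b['o'])  # output gate in (0, 1)
    c = f * c_prev + i * math.tanh(W['c'] * x + U['c'] * h_prev + b['c'])
    h = o * math.tanh(c)  # output activation of the block
    return h, c

# Forget gate saturated on, input gate off: the cell keeps its value,
# which is how the gradient avoids being iteratively squashed over time.
W = {k: 0.0 for k in 'fioc'}
U = {k: 0.0 for k in 'fioc'}
b = {'f': 100.0, 'i': -100.0, 'o': 100.0, 'c': 0.0}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, W=W, U=U, b=b)
```

With these (contrived) gate settings, `c` stays at 0.7 no matter how many steps are taken, and `h` exposes tanh of the stored value through the open output gate.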
Long short-term memory
–
A simple LSTM block with only input, output, and forget gates. LSTM blocks may have additional gates.
45.
Graphical models
–
A graphical model, or probabilistic graphical model, is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics (particularly Bayesian statistics), and machine learning. Two branches of graphical representations of distributions are commonly used, namely Bayesian networks and Markov random fields. Both families encompass the properties of factorization and independences, but they differ in the set of independences they can encode. If the network structure of the model is a directed acyclic graph, the model represents a factorization of the joint probability of all random variables. More precisely, if the variables are X₁, …, Xₙ, then the joint probability satisfies

P(X₁, …, Xₙ) = ∏ᵢ₌₁ⁿ P(Xᵢ | paᵢ)

where paᵢ is the set of parents of node Xᵢ. In other words, the joint distribution factors into a product of conditional distributions. In general, any two sets of nodes are conditionally independent given a third set if a criterion called d-separation holds in the graph; local independences and global independences are equivalent in Bayesian networks. This type of graphical model is known as a directed graphical model, Bayesian network, or belief network; classic machine learning models like hidden Markov models and neural networks can be regarded as special cases of Bayesian networks. A Markov random field, also known as a Markov network, is a model over an undirected graph. A graphical model with many repeated subunits can be represented with plate notation. A factor graph is an undirected bipartite graph connecting variables and factors; each factor represents a function over the variables it is connected to, and this is a helpful representation for understanding and implementing belief propagation. A clique tree or junction tree is a tree of cliques. A chain graph is a graph which may have both directed and undirected edges, but without any directed cycles. 
Both directed acyclic graphs and undirected graphs are special cases of chain graphs. An ancestral graph is a further extension, having directed, bidirected and undirected edges. A conditional random field is a discriminative model specified over an undirected graph. A restricted Boltzmann machine is a generative model specified over an undirected graph.
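The directed factorization can be sketched for a hypothetical three-node network, Rain → Sprinkler and (Rain, Sprinkler) → WetGrass (the conditional probability tables below are invented for the example): the joint distribution is just the product of each node's conditional given its parents, and summing the product over all assignments recovers 1.

```python
# Hypothetical CPTs for Rain -> Sprinkler, (Rain, Sprinkler) -> WetGrass.
P_rain = {True: 0.2, False: 0.8}                       # P(R)
P_sprinkler = {True: {True: 0.01, False: 0.99},        # P(S | R), indexed [rain][sprinkler]
               False: {True: 0.40, False: 0.60}}
P_wet_true = {(True, True): 0.99, (True, False): 0.80, # P(W = True | R, S)
              (False, True): 0.90, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    # P(R, S, W) = P(R) * P(S | R) * P(W | R, S): the DAG factorization.
    p_w = P_wet_true[(rain, sprinkler)]
    return (P_rain[rain]
            * P_sprinkler[rain][sprinkler]
            * (p_w if wet else 1.0 - p_w))

total = sum(joint(r, s, w)
            for r in (True, False)
            for s in (True, False)
            for w in (True, False))
```

Any marginal or conditional query on the network can be answered by summing such products, which is exactly what exact inference algorithms like the junction tree algorithm organize efficiently.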
Graphical models
–
An example of a graphical model. Each arrow indicates a dependency. In this example: D depends on A, D depends on B, D depends on C, C depends on B, and C depends on D.
46.
Hyperbolic function
–
In mathematics, hyperbolic functions are analogs of the ordinary trigonometric, or circular, functions. The inverse hyperbolic functions are the inverse hyperbolic sine arsinh and so on. Just as the points (cos t, sin t) form a circle with a unit radius, the points (cosh t, sinh t) form the right half of the equilateral hyperbola. The hyperbolic functions take a real argument called a hyperbolic angle; the size of a hyperbolic angle is twice the area of its hyperbolic sector. The hyperbolic functions may be defined in terms of the legs of a right triangle covering this sector. Hyperbolic functions occur in the solutions of Laplace's equations, which are important in many areas of physics, including electromagnetic theory, heat transfer, and fluid dynamics. In complex analysis, the hyperbolic functions arise as the imaginary parts of sine and cosine. When considered defined by a complex variable, the hyperbolic functions are rational functions of exponentials. Hyperbolic functions were introduced in the 1760s independently by Vincenzo Riccati and Johann Heinrich Lambert. Riccati used Sc. and Cc. to refer to circular functions and Sh. and Ch. to refer to hyperbolic functions; Lambert adopted the names but altered the abbreviations to what they are today. The abbreviations sh and ch are still used in some other languages, like French and Russian. The hyperbolic functions are: Hyperbolic sine, sinh x = (eˣ − e⁻ˣ)/2 = (e²ˣ − 1)/(2eˣ) = (1 − e⁻²ˣ)/(2e⁻ˣ). Hyperbolic cosine, cosh x = (eˣ + e⁻ˣ)/2 = (e²ˣ + 1)/(2eˣ) = (1 + e⁻²ˣ)/(2e⁻ˣ). The complex forms in the definitions above derive from Euler's formula. One also has sech² x = 1 − tanh² x and csch² x = coth² x − 1 for the other functions. For half arguments, sinh(x/2) = sgn(x) √((cosh x − 1)/2), where sgn is the sign function. The hyperbolic sine and cosine are solutions f of the differential equation f″ = f; all functions with this property are linear combinations of sinh and cosh, in particular the exponential functions eˣ and e⁻ˣ. It is possible to express the above functions as Taylor series: sinh x = x + x³/3! + x⁵/5! + ⋯ = ∑ₙ₌₀^∞ x²ⁿ⁺¹/(2n+1)!. The function sinh x has a Taylor series expression with only odd exponents for x. 
Thus it is an odd function, that is, sinh(−x) = −sinh x. The function cosh x has a Taylor series expression with only even exponents for x. Thus it is an even function, that is, symmetric with respect to the y-axis. The sum of the sinh and cosh series is the series expression of the exponential function.
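The exponential definitions above are easy to check numerically. The sketch below implements sinh and cosh directly from (eˣ ∓ e⁻ˣ)/2 (rather than using the library versions) and verifies the fundamental identity cosh²x − sinh²x = 1, which is exactly the equation of the unit hyperbola:

```python
import math

# sinh and cosh written straight from their exponential definitions.
def sinh(x):
    return (math.exp(x) - math.exp(-x)) / 2

def cosh(x):
    return (math.exp(x) + math.exp(-x)) / 2

x = 1.5  # an arbitrary sample point
# Points (cosh t, sinh t) lie on the unit hyperbola: cosh^2 t - sinh^2 t = 1.
identity = cosh(x) ** 2 - sinh(x) ** 2
```

The same definitions also agree with the standard library's `math.sinh` and `math.cosh` to floating-point accuracy.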
Hyperbolic function
–
Hyperbolic functions in the complex plane
Hyperbolic function
–
A ray through the unit hyperbola x² − y² = 1 in the point (cosh a, sinh a), where a is twice the area between the ray, the hyperbola, and the x-axis. For points on the hyperbola below the x-axis, the area is considered negative (see animated version with comparison with the trigonometric (circular) functions).
Hyperbolic function
Hyperbolic function
47.
Mathematical optimization
–
In mathematics, computer science and operations research, mathematical optimization, also spelled mathematical optimisation, is the selection of a best element, with regard to some criterion, from some set of available alternatives. The generalization of optimization theory and techniques to other formulations comprises a large area of applied mathematics. Such a formulation is called an optimization problem or a mathematical programming problem. Many real-world and theoretical problems may be modeled in this general framework. Typically, A is some subset of the Euclidean space Rⁿ, often specified by a set of constraints, equalities or inequalities that the members of A have to satisfy. The domain A of f is called the search space or the choice set. The function f is called, variously, an objective function, a loss function or cost function (minimization), a utility function or fitness function (maximization), or, in certain fields, an energy function or energy functional. A feasible solution that minimizes the objective function is called an optimal solution. In mathematics, conventional optimization problems are usually stated in terms of minimization. Generally, unless both the objective function and the feasible region are convex in a minimization problem, there may be several local minima. While a local minimum is at least as good as any nearby points, a global minimum is at least as good as every feasible point. In a convex problem, if there is a local minimum that is interior, it is also the global minimum. Optimization problems are often expressed with special notation. Consider the notation min_{x ∈ ℝ} (x² + 1); this denotes the minimum value of the objective function x² + 1 when choosing x from the set of real numbers ℝ. The minimum value in this case is 1, occurring at x = 0. Similarly, the notation max_{x ∈ ℝ} 2x asks for the maximum value of the objective function 2x. In this case, there is no such maximum as the function is unbounded. The related arg min and arg max notations ask instead for the value of the argument x at which the objective function attains its minimum or maximum. 
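The min_{x ∈ ℝ} (x² + 1) example can be illustrated with a crude numerical sketch: scan a grid of candidate points and keep the best one. The grid bounds and resolution below are illustrative choices, and a grid scan is of course not a practical optimization algorithm, only a demonstration of the min/argmin notation:

```python
# Objective function from the article's example: f(x) = x^2 + 1.
def objective(x):
    return x ** 2 + 1

# Scan an evenly spaced grid on [-2, 2] (illustrative bounds/resolution).
candidates = [i / 1000 for i in range(-2000, 2001)]
best_x = min(candidates, key=objective)   # argmin over the grid
best_value = objective(best_x)            # the corresponding minimum value
```

On this grid the scan recovers the analytic answer: the minimum value 1 is attained at x = 0.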
Mathematical optimization
–
Graph of a paraboloid given by f(x, y) = −(x² + y²) + 4. The global maximum at (0, 0, 4) is indicated by a red dot.
48.
Random variable
–
In probability and statistics, a random variable, random quantity, aleatory variable, or stochastic variable is a variable quantity whose value depends on possible outcomes. It is common that these outcomes depend on physical variables that are not well understood. For example, when you toss a coin, the outcome of heads or tails depends on the uncertain physics; which outcome will be observed is not certain. Of course the coin could get caught in a crack in the floor, but such a possibility is excluded from consideration. The domain of a random variable is the set of possible outcomes. In the case of the coin, there are two possible outcomes, namely heads or tails. Since one of these outcomes must occur, either the event that the coin lands heads or the event that the coin lands tails must have non-zero probability. A random variable is defined as a function that maps outcomes to numerical quantities, typically real numbers. In this sense, it is a procedure for assigning a numerical quantity to each outcome, and, contrary to its name, this procedure itself is neither random nor variable. What is random is the physics that describes how the coin lands. A random variable's possible values might represent the possible outcomes of a yet-to-be-performed experiment, and they may also conceptually represent either the results of an objectively random process or the subjective randomness that results from incomplete knowledge of a quantity. The mathematics works the same regardless of the interpretation in use. A random variable has a probability distribution, which specifies the probability that its value falls in any given interval. Two random variables with the same probability distribution can still differ in terms of their associations with, or independence from, other random variables. The realizations of a random variable, that is, the results of randomly choosing values according to the variable's probability distribution function, are called random variates. 
The formal mathematical treatment of random variables is a topic in probability theory. In that context, a random variable is understood as a function defined on a sample space whose outputs are numerical values. A random variable X: Ω → E is a function from a set of possible outcomes Ω to a measurable space E. The technical axiomatic definition requires Ω to be a probability space. A random variable does not return a probability; the probability of a set of outcomes is given by the probability measure P with which Ω is equipped. Rather, X returns a numerical quantity of outcomes in Ω, e.g. the number of heads in a collection of coin flips.
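The "random variable as a function" view can be made concrete with the dice example from the figure caption: the sample space Ω is the set of 36 ordered rolls of two dice, and the random variable S maps each roll to the sum of its faces. Under the uniform measure on Ω this induces the distribution of S:

```python
from itertools import product
from fractions import Fraction

# Sample space: all 36 equally likely ordered rolls of two dice.
omega = list(product(range(1, 7), repeat=2))

def S(outcome):
    """The random variable: maps an outcome (a, b) to the sum a + b."""
    return sum(outcome)

# Induced probability mass function: P(S = s) under the uniform measure.
pmf = {}
for outcome in omega:
    s = S(outcome)
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)
```

Exact fractions make the familiar values visible directly, e.g. P(S = 7) = 1/6 and P(S = 11) = 1/18.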
Random variable
–
If the sample space is the set of possible numbers rolled on two dice, and the random variable of interest is the sum S of the numbers on the two dice, then S is a discrete random variable whose distribution is described by the probability mass function plotted as the height of picture columns here.
49.
Feedforward neural network
–
A feedforward neural network is an artificial neural network wherein connections between the units do not form a cycle. As such, it is different from recurrent neural networks. The feedforward neural network was the first and simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes, to the output nodes. There are no cycles or loops in the network. The simplest kind of neural network is a single-layer perceptron network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. In this way it can be considered the simplest kind of feed-forward network. Neurons with this kind of activation function are also called artificial neurons or linear threshold units; in the literature the term perceptron often refers to networks consisting of just one of these units. A similar neuron was described by Warren McCulloch and Walter Pitts in the 1940s. A perceptron can be created using any values for the activated and deactivated states as long as the threshold value lies between the two. Perceptrons can be trained by a simple learning algorithm that is usually called the delta rule. It calculates the errors between calculated output and sample output data, and uses this to create an adjustment to the weights. This result can be found in Peter Auer, Harald Burgsteiner and Wolfgang Maass. A multi-layer neural network can compute a continuous output instead of a step function. A common choice is the logistic function, f(x) = 1/(1 + e⁻ˣ). With this choice, the logistic function is also known as the sigmoid function. It has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated. This class of networks consists of multiple layers of computational units, usually interconnected in a feed-forward way. 
Each neuron in one layer has directed connections to the neurons of the subsequent layer, in many applications the units of these networks apply a sigmoid function as an activation function. This result holds for a range of activation functions, e. g. for the sigmoidal functions. Multi-layer networks use a variety of learning techniques, the most popular being back-propagation, here, the output values are compared with the correct answer to compute the value of some predefined error-function. By various techniques, the error is fed back through the network
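The single-layer perceptron and the delta rule can be sketched in a few lines. The example below learns the AND function with a step activation; the learning rate, epoch count, and dataset are illustrative choices, not part of any particular reference implementation:

```python
# Step activation: the linear threshold unit described above.
def step(z):
    return 1 if z >= 0 else 0

weights = [0.0, 0.0]
bias = 0.0
rate = 0.1  # illustrative learning rate
# Training data for the (linearly separable) AND function.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

for _ in range(20):  # repeated passes over the samples
    for (x1, x2), target in data:
        out = step(weights[0] * x1 + weights[1] * x2 + bias)
        error = target - out          # delta rule: adjust by the error
        weights[0] += rate * error * x1
        weights[1] += rate * error * x2
        bias += rate * error

predictions = [step(weights[0] * x1 + weights[1] * x2 + bias)
               for (x1, x2), _ in data]
```

Because AND is linearly separable, the perceptron convergence theorem guarantees this loop settles on correct weights; on XOR, which is not separable, the same loop would never converge.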
Feedforward neural network
Feedforward neural network
–
In a feed-forward network information always moves in one direction; it never goes backwards.
50.
Directed acyclic graph
–
In mathematics and computer science, a directed acyclic graph (DAG) is a finite directed graph with no directed cycles. Equivalently, a DAG is a directed graph that has a topological ordering. DAGs can model many different kinds of information. For example, topological orderings of DAGs can be used to order the compilation operations in a makefile, and the program evaluation and review technique uses DAGs to model the milestones and activities of large human projects, and schedule these projects to use as little total time as possible. DAGs also represent combinational logic blocks in electronic circuit design, and the operations in dataflow programming languages. More abstractly, the reachability relation in a DAG forms a partial order. The corresponding concept for undirected graphs is a forest, an undirected graph without cycles. Choosing an orientation for a forest produces a kind of directed acyclic graph called a polytree. However, there are other kinds of directed acyclic graph that are not formed by orienting the edges of an undirected acyclic graph. Moreover, every undirected graph has an acyclic orientation, an assignment of a direction for its edges that makes it into a directed acyclic graph. To emphasize this, DAGs are not the same thing as directed versions of undirected acyclic graphs. A graph is formed by a collection of vertices and edges; in the case of a directed graph, each edge has an orientation, from one vertex to another vertex. A directed acyclic graph is a directed graph that has no directed cycles. A vertex v of a graph is said to be reachable from another vertex u when there exists a path that starts at u and ends at v. As a special case, every vertex is considered to be reachable from itself. A graph that has a topological ordering cannot have any cycles, because the edge into the earliest vertex of a cycle would have to be oriented the wrong way. Therefore, every graph with a topological ordering is acyclic. 
Conversely, every directed acyclic graph has a topological ordering. Therefore, this property can be used as an alternative definition of the directed acyclic graphs: they are exactly the graphs that have topological orderings. The reachability relationship in any directed acyclic graph can be formalized as a partial order ≤ on the vertices of the DAG. Different DAGs may give rise to the same reachability relation; for example, the DAG with two edges a → b and b → c has the same reachability relation as the graph with three edges a → b, b → c, and a → c.
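The equivalence between "acyclic" and "has a topological ordering" is also how topological orderings are computed in practice. The sketch below implements Kahn's algorithm on the three-vertex DAG a → b → c from the example: repeatedly remove a vertex with no remaining incoming edges; succeeding for every vertex proves the graph is acyclic.

```python
# The example DAG a -> b -> c, as an adjacency mapping.
edges = {"a": ["b"], "b": ["c"], "c": []}

def topological_order(graph):
    """Kahn's algorithm: returns a topological ordering, or raises on a cycle."""
    indegree = {v: 0 for v in graph}
    for targets in graph.values():
        for t in targets:
            indegree[t] += 1
    ready = [v for v, d in indegree.items() if d == 0]  # no incoming edges
    order = []
    while ready:
        v = ready.pop()
        order.append(v)
        for t in graph[v]:      # removing v frees its successors
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    if len(order) != len(graph):
        raise ValueError("graph has a directed cycle")
    return order

order = topological_order(edges)
```

Every edge of the input ends up directed from earlier to later in `order`, which is exactly the defining property of a topological ordering.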
Directed acyclic graph
–
An example of a directed acyclic graph
51.
Automatic differentiation
–
AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations and elementary functions. Automatic differentiation is not symbolic differentiation, nor numerical differentiation; both classical methods have problems with calculating higher derivatives, where the complexity and errors increase. Finally, both methods are slow at computing the partial derivatives of a function with respect to many inputs. Automatic differentiation solves all of these problems, at the expense of introducing more software dependencies. Fundamental to AD is the decomposition of differentials provided by the chain rule. For the simple composition y = g(w) with w = f(x), the chain rule gives dy/dx = (dy/dw)(dw/dx). Usually, two distinct modes of AD are presented, forward accumulation and reverse accumulation. Generally, both forward and reverse accumulation are specific manifestations of applying the operator of program composition, with one of the two mappings being fixed. In forward accumulation AD, one first fixes the independent variable with respect to which differentiation is performed. Compared to reverse accumulation, forward accumulation is very natural and easy to implement as the flow of derivative information coincides with the order of evaluation. One simply augments each variable w with its derivative ẇ = ∂w/∂x, as denoted by the dot; the derivatives are then computed in sync with the evaluation steps and combined with other derivatives via the chain rule. The choice of the independent variable with respect to which differentiation is performed affects the seed values ẇ: if one is interested in the derivative of a function with respect to x₁, the seed value for x₁ is 1 and the seeds for all other independent variables are 0. Figure 2 shows a depiction of this process as a computational graph. The computational complexity of one sweep of forward accumulation is proportional to the complexity of the original code. 
Forward accumulation is more efficient than reverse accumulation for functions f: ℝⁿ → ℝᵐ with m ≫ n, as only n sweeps are necessary. In reverse accumulation AD, one first fixes the dependent variable to be differentiated and computes the derivative with respect to each sub-expression recursively. The example function is real-valued, and thus there is only one seed for the derivative computation. This is done by adding an adjoint node for each primal node, connected by adjoint edges which parallel the primal edges but flow in the opposite direction. The nodes in the adjoint graph represent multiplication by the derivatives of the functions calculated by the nodes in the primal. For instance, addition in the primal causes fanout in the adjoint; fanout in the primal causes addition in the adjoint; a unary function y = f(x) in the primal causes x̄ = ȳ f′(x) in the adjoint; etc. Reverse accumulation is more efficient than forward accumulation for functions f: ℝⁿ → ℝᵐ with m ≪ n, as only m sweeps are necessary. Reverse mode AD was first published in 1970 by Seppo Linnainmaa in his master's thesis. Backpropagation of errors in multilayer perceptrons, as used in machine learning, is a special case of reverse mode AD.
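Forward accumulation is often implemented with dual numbers: each variable w carries its derivative ẇ, and every elementary operation updates both in lockstep via the chain rule. The minimal sketch below (a toy `Dual` class, not any particular AD library) differentiates the hypothetical function f(x) = x² + x at x = 3:

```python
class Dual:
    """A value paired with its derivative: (w, dw/dx)."""
    def __init__(self, value, dot):
        self.value, self.dot = value, dot

    def __add__(self, other):
        # sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.dot + other.dot)

    def __mul__(self, other):
        # product rule: (u * v)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.dot * other.value + self.value * other.dot)

def f(x):
    return x * x + x   # f(x) = x^2 + x, so f'(x) = 2x + 1

x = Dual(3.0, 1.0)     # seed: dx/dx = 1
y = f(x)               # y.value = f(3), y.dot = f'(3)
```

One sweep propagates both the primal value and the derivative, mirroring the statement that the derivatives are "computed in sync with the evaluation steps."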
Automatic differentiation
–
Figure 2: Example of forward accumulation with computational graph
52.
Minimum bounding box
–
In geometry, the minimum or smallest bounding or enclosing box for a point set in N dimensions is the box with the smallest measure within which all the points lie. When other kinds of measure are used, the box is usually called accordingly, e.g. the minimum-perimeter bounding box. The minimum bounding box of a point set is the same as the minimum bounding box of its convex hull. The term box/hyperrectangle comes from its usage in the Cartesian coordinate system; in the two-dimensional case it is called the minimum bounding rectangle. The axis-aligned minimum bounding box for a point set is its minimum bounding box subject to the constraint that the edges of the box are parallel to the coordinate axes. Axis-aligned boxes are used, for example, in geometry and its applications when it is required to find intersections in a set of objects: checking two boxes for overlap is usually a less expensive operation than the check of the actual intersection, so it serves as a quick rejection test. The arbitrarily oriented minimum bounding box is the minimum bounding box computed subject to no constraint on orientation; a three-dimensional rotating calipers algorithm can find the minimum-volume arbitrarily-oriented bounding box of a three-dimensional point set in cubic time.
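For axis-aligned boxes the computation is trivial: the box is just the per-coordinate minima and maxima of the points. The sketch below (with made-up sample points) also includes the cheap overlap pre-check described above:

```python
# Illustrative 2-D point set.
points = [(1.0, 4.0), (3.0, -2.0), (-1.0, 0.5)]

def bounding_box(pts):
    """Axis-aligned minimum bounding box: ((min_x, min_y), (max_x, max_y))."""
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys)), (max(xs), max(ys))

def boxes_intersect(a, b):
    """Cheap rejection test run before any exact intersection check."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

box = bounding_box(points)
```

If `boxes_intersect` returns False, the enclosed objects certainly do not intersect; only when it returns True does the (more expensive) exact test need to run.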
Minimum bounding box
–
A series of geometric shapes enclosed by its minimum bounding box (in 2 dimensions)
53.
Gradient descent
–
Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function. Gradient descent is also known as steepest descent, or the method of steepest descent. Gradient descent should not be confused with the method of steepest descent for approximating integrals. It follows that, if aₙ₊₁ = aₙ − γ∇F(aₙ) for γ small enough, then F(aₙ) ≥ F(aₙ₊₁). In other words, the term γ∇F(aₙ) is subtracted from aₙ because we want to move against the gradient, and we have F(a₀) ≥ F(a₁) ≥ F(a₂) ≥ ⋯, so hopefully the sequence converges to the desired local minimum. Note that the value of the step size γ is allowed to change at every iteration. With certain assumptions on the function F and particular choices of γ, e.g. the Barzilai–Borwein step γₙ = ((xₙ − xₙ₋₁)ᵀ(∇F(xₙ) − ∇F(xₙ₋₁))) / ‖∇F(xₙ) − ∇F(xₙ₋₁)‖², convergence to a local minimum can be guaranteed. When the function F is convex, all local minima are also global minima. This process is illustrated in the adjacent picture. Here F is assumed to be defined on the plane; the blue curves are the contour lines, that is, the regions on which the value of F is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the gradient at a point is orthogonal to the contour line going through that point. We see that gradient descent leads us to the bottom of the bowl. Gradient descent has problems with pathological functions such as the Rosenbrock function shown here. The Rosenbrock function has a narrow curved valley which contains the minimum; the bottom of the valley is very flat. Because of the curved flat valley the optimization zig-zags slowly with small step sizes towards the minimum. The zig-zagging nature of the method is also evident below, where the gradient descent method is applied to a function of the form F(x, y) = sin(…)·cos(…). For some of the examples, gradient descent is relatively slow close to the minimum; technically, its asymptotic rate of convergence is inferior to many other methods. 
For poorly conditioned convex problems, gradient descent increasingly zigzags as the gradients point nearly orthogonally to the shortest direction to a minimum point.
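The basic update aₙ₊₁ = aₙ − γ∇F(aₙ) can be sketched on a well-behaved function. Below, F(x, y) = x² + y² (an illustrative choice whose gradient is (2x, 2y)) is minimized with a fixed step size γ = 0.1; both coordinates shrink by a factor 1 − 2γ = 0.8 per step, so the iterates converge to the minimum at the origin:

```python
# Illustrative objective F(x, y) = x^2 + y^2, with gradient (2x, 2y).
def grad_F(p):
    x, y = p
    return (2 * x, 2 * y)

point = (3.0, 4.0)   # initial guess a_0
gamma = 0.1          # fixed step size (illustrative choice)

for _ in range(100):
    gx, gy = grad_F(point)
    # a_{n+1} = a_n - gamma * grad F(a_n): step against the gradient.
    point = (point[0] - gamma * gx, point[1] - gamma * gy)
```

With a step size that is too large (here, γ ≥ 1) the same loop would diverge instead, which is why the text insists on γ being "small enough."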
Gradient descent
–
Illustration of gradient descent.
54.
A priori and a posteriori
–
The Latin phrases a priori and a posteriori are philosophical terms of art popularized by Immanuel Kant's Critique of Pure Reason, one of the most influential works in the history of philosophy. These terms are used with respect to reasoning to distinguish necessary conclusions from first premises from conclusions based on sense observation. A posteriori knowledge or justification is dependent on experience or empirical evidence, as with most aspects of science and personal knowledge. There are many points of view on these two types of knowledge, and their relationship gives rise to one of the oldest problems in modern philosophy. The terms a priori and a posteriori are primarily used as adjectives to modify the noun knowledge; however, a priori is sometimes used to modify other nouns, such as truth. Philosophers also may use apriority and aprioricity as nouns to refer to the quality of being a priori. Although definitions and use of the terms have varied in the history of philosophy, they have consistently labeled two separate epistemological notions. See also the related distinctions: deductive/inductive, analytic/synthetic, necessary/contingent. The intuitive distinction between a priori and a posteriori knowledge is best seen in examples. A priori: Consider the proposition, "If George V reigned at least four days, then he reigned more than three days." This is something that one knows a priori, because it expresses a statement that one can derive by reason alone. A posteriori: Compare this with the proposition expressed by the sentence "George V reigned from 1910 to 1936." This is something that one must come to know a posteriori, because it expresses an empirical fact unknowable by reason alone. Several philosophers reacting to Kant sought to explain a priori knowledge without appealing to what Paul Boghossian describes as a special faculty that has never been described in satisfactory terms. One theory, popular among the logical positivists of the early 20th century, is what Boghossian calls the analytic explanation of the a priori. 
The distinction between analytic and synthetic propositions was first introduced by Kant. In short, proponents of this explanation claimed to have reduced a dubious metaphysical faculty of pure reason to a legitimate linguistic notion of analyticity. However, the analytic explanation of a priori knowledge has undergone several criticisms. Most notably, Quine argued that the analytic/synthetic distinction is illegitimate. Quine states: "But for all its a priori reasonableness, a boundary between analytic and synthetic statements simply has not been drawn. That there is such a distinction to be drawn at all is an unempirical dogma of empiricists, a metaphysical article of faith." While the soundness of Quine's critique is highly disputed, it had a powerful effect on the project of explaining the a priori in terms of the analytic. The metaphysical distinction between necessary and contingent truths has also been related to a priori and a posteriori knowledge. A proposition that is necessarily true is one whose negation is self-contradictory. Consider the proposition that all bachelors are unmarried; its negation, the proposition that some bachelors are married, is incoherent, because the concept of being unmarried is part of the concept of being a bachelor.
A priori and a posteriori
–
55.
Data clustering
–
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis. The notion of a cluster cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties; understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include: Connectivity models, for example, hierarchical clustering builds models based on distance connectivity; Centroid models, for example, the k-means algorithm represents each cluster by a single mean vector. 
Distribution models: clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm. Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space. Subspace models: in biclustering, clusters are modeled with both cluster members and relevant attributes. Group models: some algorithms do not provide a refined model for their results. Graph-based models: a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster. Relaxations of the connectivity requirement are known as quasi-cliques, as in the HCS clustering algorithm. A clustering is essentially a set of clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of clusters embedded in each other. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms.
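The centroid cluster model can be sketched with a tiny k-means loop: alternate between assigning each point to the nearest mean and recomputing the means. The one-dimensional data and initial centroids below are illustrative, and a real implementation would also handle empty clusters and random restarts:

```python
# Illustrative 1-D data with two obvious groups, and two initial centroids.
data = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
means = [0.0, 10.0]

for _ in range(10):  # Lloyd's iteration (fixed count, for simplicity)
    # Assignment step: each point joins the cluster of its nearest mean.
    clusters = [[], []]
    for x in data:
        nearest = min(range(2), key=lambda i: abs(x - means[i]))
        clusters[nearest].append(x)
    # Update step: each mean becomes the centroid of its cluster.
    means = [sum(c) / len(c) for c in clusters]
```

On this data the loop converges immediately to means near 1.0 and 8.0; with less separated data the result depends on initialization, which is one reason k-means is usually run several times.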
Data clustering
–
Machine learning and data mining
56.
Statistical distributions
–
For instance, if the random variable X is used to denote the outcome of a coin toss, then the probability distribution of X would take the value 0.5 for X = heads, and 0.5 for X = tails. In more technical terms, the probability distribution is a description of a random phenomenon in terms of the probabilities of events. Examples of random phenomena can include the results of an experiment or survey. A probability distribution is defined in terms of an underlying sample space, which is the set of all possible outcomes of the random phenomenon being observed. The sample space may be the set of real numbers or a higher-dimensional vector space, or it may be a list of non-numerical values, for example the two outcomes of a coin toss. Probability distributions are generally divided into two classes. A discrete probability distribution can be encoded by a discrete list of the probabilities of the outcomes; on the other hand, a continuous probability distribution is typically described by probability density functions. The normal distribution represents a commonly encountered continuous probability distribution. More complex experiments, such as those involving stochastic processes defined in continuous time, may demand the use of more general probability measures. A probability distribution whose sample space is the set of real numbers is called univariate. Important and commonly encountered univariate probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution. The multivariate normal distribution is a commonly encountered multivariate distribution. To define probability distributions for the simplest cases, one needs to distinguish between discrete and continuous random variables. In the continuous case, the probability of any individual value is zero; for example, the probability that an object weighs exactly 500 g is zero. Continuous probability distributions can be described in several ways; the cumulative distribution function is the antiderivative of the probability density function, provided that the latter function exists. 
As probability theory is used in diverse applications, terminology is not uniform. The following terms are used for probability distribution functions: Probability distribution: a table that displays the probabilities of outcomes in a sample; it could be called a frequency distribution table, where all occurrences of outcomes sum to 1. Distribution function: a functional form of a frequency distribution table. Probability distribution function: a functional form of a probability distribution table.
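The statement that the cumulative distribution function is the antiderivative of the density can be checked numerically. The sketch below integrates the standard normal density with a simple trapezoidal rule and compares the result to the closed form Φ(x) = (1 + erf(x/√2))/2; the integration bounds and step count are illustrative accuracy choices:

```python
import math

def pdf(x):
    """Standard normal probability density function."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf_closed_form(x):
    """Standard normal CDF via the error function."""
    return (1 + math.erf(x / math.sqrt(2))) / 2

def cdf_numeric(x, lo=-10.0, steps=20000):
    """Trapezoidal integral of the pdf from lo (effectively -inf) to x."""
    h = (x - lo) / steps
    total = (pdf(lo) + pdf(x)) / 2
    for i in range(1, steps):
        total += pdf(lo + i * h)
    return total * h

approx = cdf_numeric(1.0)  # should match Phi(1) ~ 0.8413
```

The agreement of the two values illustrates the antiderivative relationship: the CDF accumulates the density from the left.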
Statistical distributions
–
The probability mass function (pmf) p (S) specifies the probability distribution for the sum S of counts from two dice. For example, the figure shows that p (11) = 1/18. The pmf allows the computation of probabilities of events such as P (S > 9) = 1/12 + 1/18 + 1/36 = 1/6, and all other probabilities in the distribution.
57.
Bayesian spam filtering
–
Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify spam e-mail. Naive Bayes classifiers work by correlating the use of tokens with spam and non-spam e-mails; it is one of the oldest ways of doing spam filtering, with roots in the 1990s. The first known mail-filtering program to use a naive Bayes classifier was Jason Rennie's ifile program; the program was used to sort mail into folders. The first scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998, and that work was soon thereafter deployed in commercial spam filters. However, in 2002 Paul Graham greatly decreased the false positive rate. Variants of the basic technique have been implemented in a number of research works and commercial software products. Many modern mail clients implement Bayesian spam filtering; users can also install separate email filtering programs. CRM114, oft cited as a Bayesian filter, is not intended to use a Bayes filter in production. Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word Viagra in spam email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. After training, the probabilities are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email contributes to the email's spam probability; this contribution is called the posterior probability and is computed using Bayes' theorem. 
Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold, the filter marks the email as spam. As in any other spam filtering technique, email marked as spam can then be automatically moved to a Junk email folder, or even deleted outright. Some software implement quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision. The initial training can usually be refined when wrong judgements from the software are identified, and that allows the software to dynamically adapt to the ever-evolving nature of spam. Some spam filters combine the results of both Bayesian spam filtering and other heuristics, resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness. Bayesian email filters utilize Bayes' theorem. Let's suppose the suspected message contains the word replica. Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not know such facts. Assuming that incoming mail is a priori equally likely to be spam or legitimate, this assumption permits simplifying the general formula to Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H)), where S is the event that the message is spam, H that it is legitimate (ham), and W that it contains the word. This is functionally equivalent to asking, what percentage of occurrences of the word replica appear in spam messages?
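Combining the per-word probabilities into an overall score can be sketched as below. The word spamicities in the table are made-up illustrative values, and the combining rule shown, p = ∏pᵢ / (∏pᵢ + ∏(1 − pᵢ)), is the naive-independence combination popularized by Graham-style filters; production filters add refinements such as rare-word smoothing:

```python
# Hypothetical trained spamicities: Pr(spam | word) for a few words.
p_spam_given_word = {"replica": 0.95, "meeting": 0.05, "watches": 0.9}

def spam_probability(words):
    """Combine per-word probabilities assuming word independence:
    p = (prod p_i) / (prod p_i + prod (1 - p_i))."""
    num = 1.0
    den = 1.0
    for w in words:
        p = p_spam_given_word[w]
        num *= p
        den *= 1 - p
    return num / (num + den)

score = spam_probability(["replica", "watches"])  # close to 1: likely spam
```

A message containing several high-spamicity words gets a combined score far above any single word's probability, which is why the threshold test operates on the combined value rather than on individual words.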
Bayesian spam filtering
–
Machine learning and data mining
58.
Dimitri Bertsekas
–
Bertsekas was born in Greece and lived his childhood there. He is known for his research and for his sixteen textbooks and monographs in theoretical and algorithmic optimization and control. He is featured among the top 100 most cited computer science authors in the CiteSeer search engine academic database and digital library. In 1995 he co-founded a publishing company, Athena Scientific, which, among other titles, publishes most of his books. In the late 1990s Bertsekas developed a strong interest in digital photography; his photographs have been exhibited on several occasions at M.I.T. and can also be accessed from his web site, http://web.mit.edu/dimitrib/www/home.html, which also includes an article describing his career and views on mathematical research. His honors include the John R. Ragazzini Education Award for outstanding contributions to education and the 2015 Dantzig Prize from SIAM and the Mathematical Optimization Society. Some of his books have been published in multiple editions and have been translated into various foreign languages. He has also written several widely referenced research monographs, which contain most of his research. These include Stochastic Optimal Control: The Discrete-Time Case, a complex work establishing the measure-theoretic foundations of dynamic programming; Parallel and Distributed Computation: Numerical Methods, which among other things established the theoretical structures for the analysis of distributed asynchronous algorithms; and Neuro-Dynamic Programming, which laid the foundations for suboptimal approximations of highly complex sequential decision-making problems
Dimitri Bertsekas
–
Dimitri P. Bertsekas
59.
Natural resource management
–
Natural resource management deals with managing the way in which people and natural landscapes interact. It brings together land use planning, water management and biodiversity conservation. Natural resource management specifically focuses on a scientific and technical understanding of resources and ecology and the life-supporting capacity of those resources. Environmental management is similar to natural resource management, and in academic contexts the sociology of natural resources is closely related to both. This type of analysis coalesced in the 20th century with the recognition that preservationist conservation strategies had not been effective in halting the decline of natural resources. A more integrated approach was implemented, recognising its social, cultural and economic dimensions, and a more holistic, national and even global form evolved from the Brundtland Commission. In 2005 the government of New South Wales established a Standard for Quality Natural Resource Management to improve the consistency of practice, based on an adaptive management approach. In the United States, the most active areas of natural resource management are wildlife management, often associated with ecotourism; in Australia, water sharing, such as the Murray Darling Basin Plan, and catchment management are also significant. 
The ownership of resources can be categorised. State property: individuals or groups may be able to make use of the resources, but only with permission; national forests, national parks and military reservations are some US examples. Private property: any property owned by a defined individual or corporate entity; both the benefits and duties of the resources fall to the owner, and private land is the most common example. Common property: a private property of a group; the group may vary in size, nature and internal structure, e.g. the indigenous neighbours of a village, and some examples of common property are community forests. Non-property: there is no owner of these properties; each potential user has equal ability to use them as they wish, and these areas are the most exploited. 
It is said that "everybody's property is nobody's property"; an example is a lake fishery. Common land may exist without ownership, in which case in the UK it is vested in a local authority. Stakeholder analysis originated in business management practices and has been incorporated into natural resource management with ever-growing popularity. Stakeholder analysis in the context of natural resource management identifies the distinctive interest groups affected by the utilisation and conservation of natural resources. There is no definitive definition of a stakeholder, as illustrated in the table below
Natural resource management
–
The Tongass National Forest in Alaska is managed by the United States Forest Service
Natural resource management
–
Air
60.
Conjugate gradient method
–
In mathematics, the conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite. Large sparse systems often arise when numerically solving partial differential equations or optimization problems; the conjugate gradient method can also be used to solve unconstrained optimization problems such as energy minimization. It was mainly developed by Magnus Hestenes and Eduard Stiefel. The biconjugate gradient method provides a generalization to non-symmetric matrices, and various nonlinear conjugate gradient methods seek minima of nonlinear equations. Suppose we want to solve the system of linear equations Ax = b for the vector x, where the known n × n matrix A is symmetric, positive-definite, and real. We denote the solution of this system by x∗. We say that two nonzero vectors u and v are conjugate (with respect to A) if uᵀAv = 0. Being conjugate is a symmetric relation: if u is conjugate to v, then v is conjugate to u. Suppose that P = {p1, …, pn} is a set of n mutually conjugate vectors. This gives the following method for solving the equation Ax = b: find a sequence of n conjugate directions, and then compute the coefficients αk. If we choose the conjugate vectors pk carefully, then we may not need all of them to obtain a good approximation to the solution x∗. So, we want to regard the conjugate gradient method as an iterative method; this also allows us to approximately solve systems where n is so large that the direct method would take too much time. We denote the initial guess for x∗ by x0, and we can assume without loss of generality that x0 = 0. Starting with x0 we search for the solution, and in each iteration we need a metric to tell us whether we are closer to the solution x∗. This metric comes from the fact that the solution x∗ is also the unique minimizer of the quadratic function f(x) = ½xᵀAx − xᵀb. This suggests taking the first basis vector p0 to be the negative of the gradient of f at x = x0; the gradient of f equals Ax − b. 
Starting with a guessed solution x0, this means we take p0 = b − Ax0; the other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method. Let rk be the residual at the kth step: rk = b − Axk. Note that rk is the negative gradient of f at x = xk, so the gradient descent method would move in the direction rk. Here, we insist that the directions pk be conjugate to each other, and we also require that the next search direction be built out of the current residual and all previous search directions, which is reasonable enough in practice
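The iteration described above can be written out as a short pure-Python sketch for small dense matrices (an illustration of the method, not an optimized implementation; the test matrix is a made-up example):

```python
# Minimal conjugate gradient for Ax = b with A symmetric positive-definite.

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(A, b, tol=1e-10):
    n = len(b)
    x = [0.0] * n                      # initial guess x0 = 0
    r = b[:]                           # residual r0 = b - A x0 = b
    p = r[:]                           # first search direction p0 = r0
    rs_old = dot(r, r)
    for _ in range(n):                 # at most n steps in exact arithmetic
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)    # step length along direction p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        beta = rs_new / rs_old         # keeps the directions A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]           # symmetric positive-definite example
b = [1.0, 2.0]
print(conjugate_gradient(A, b))        # ≈ [0.0909, 0.6364], i.e. [1/11, 7/11]
```

Each iteration builds the next search direction from the current residual and the previous direction, exactly as the text requires.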
Conjugate gradient method
–
A comparison of the convergence of gradient descent with optimal step size (in green) and conjugate vector (in red) for minimizing a quadratic function associated with a given linear system. Conjugate gradient, assuming exact arithmetic, converges in at most n steps, where n is the size of the matrix of the system (here n = 2).
61.
Gene expression programming
–
In computer programming, gene expression programming (GEP) is an evolutionary algorithm that creates computer programs or models. These computer programs are complex structures that learn and adapt by changing their sizes, shapes, and composition. And like living organisms, the programs of GEP are also encoded in simple linear chromosomes of fixed length. Thus, GEP is a genotype–phenotype system, benefiting from a simple genome to keep and transmit the genetic information. Evolutionary algorithms use populations of individuals and select individuals according to fitness; their use in artificial computational systems dates back to the 1950s, when they were used to solve optimization problems. But it was with the introduction of evolution strategies by Rechenberg in 1965 that evolutionary algorithms gained popularity. A good overview text on evolutionary algorithms is the book "An Introduction to Genetic Algorithms" by Mitchell. Gene expression programming belongs to this family of algorithms and is closely related to genetic algorithms. From genetic algorithms it inherited the linear chromosomes of fixed length; in gene expression programming the linear chromosomes work as the genotype and the parse trees as the phenotype, creating a genotype/phenotype system. This genotype/phenotype system is multigenic, thus encoding multiple parse trees in each chromosome; this means that the computer programs created by GEP are composed of multiple parse trees. Because these parse trees are the result of gene expression, in GEP they are called expression trees. The genome of gene expression programming consists of a linear, symbolic string or chromosome of fixed length composed of one or more genes of equal size, and these genes, despite their fixed length, code for expression trees of different sizes and shapes. As shown above, the genes of gene expression programming all have the same size; however, these fixed-length strings code for expression trees of different sizes. 
This means that the size of the coding regions varies from gene to gene. For example, the mathematical expression sqrt((a − b) × (c + d)) can also be represented as an expression tree, where "Q" represents the square root function. This kind of expression tree is the phenotypic expression of GEP genes. For this particular example, the corresponding string is Q*-+abcd (positions 0–7), which is the straightforward reading of the expression tree from top to bottom and from left to right. These linear strings are called k-expressions. Going from k-expressions to expression trees is also very simple. For example, the k-expression Q*b**+baQba (positions 0–10) is composed of two different terminals (a and b), two different functions of two arguments (* and +), and a function of one argument (Q), and its expression gives the corresponding tree. The k-expressions of gene expression programming correspond to the region of the genes that gets expressed. This means that there might be sequences in the genes that are not expressed; the reason for these noncoding regions is to provide a buffer of terminals so that all k-expressions encoded in GEP genes always correspond to valid programs or expressions
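The top-to-bottom, left-to-right reading described above is a breadth-first (level-by-level) decoding, and it can be sketched directly in code. The function set and arities below are assumptions chosen to match the examples in the text; a valid GEP gene guarantees the string contains enough symbols.

```python
# Decode a GEP k-expression into an expression tree, level by level.
# "Q" is the square-root function; terminals have arity 0.
import math
from collections import deque

ARITY = {"Q": 1, "+": 2, "-": 2, "*": 2, "/": 2}

def decode(kexpr):
    """Return the expression tree as nested lists [symbol, child, ...]."""
    nodes = [[s] for s in kexpr]       # one node per symbol
    queue = deque([nodes[0]])          # root is the first symbol
    i = 1                              # index of the next unused symbol
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):
            child = nodes[i]           # symbols past the last one needed
            i += 1                     # simply stay in the noncoding region
            node.append(child)
            queue.append(child)
    return nodes[0]

def evaluate(node, env):
    sym, *children = node
    vals = [evaluate(c, env) for c in children]
    if sym == "Q": return math.sqrt(vals[0])
    if sym == "+": return vals[0] + vals[1]
    if sym == "-": return vals[0] - vals[1]
    if sym == "*": return vals[0] * vals[1]
    if sym == "/": return vals[0] / vals[1]
    return env[sym]                    # terminal: look up the variable

tree = decode("Q*-+abcd")              # sqrt((a - b) * (c + d))
print(evaluate(tree, {"a": 5, "b": 1, "c": 2, "d": 2}))  # sqrt(4 * 4) = 4.0
```

Symbols beyond those consumed by the breadth-first fill are simply never attached to the tree, which is exactly the role of the noncoding buffer described above.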
62.
Expectation-maximization
–
These parameter estimates are then used to determine the distribution of the latent variables in the next E step. The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird and Donald Rubin. They pointed out that the method had been proposed many times in special circumstances by earlier authors. The Dempster–Laird–Rubin paper in 1977 generalized the method and sketched a convergence analysis for a class of problems, and it established the EM method as an important tool of statistical analysis. However, the convergence analysis of the Dempster–Laird–Rubin paper was flawed, and a correct convergence analysis was published by C. F. Wu. Wu's proof established the EM method's convergence outside of the exponential family. The EM algorithm is used to find maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations; that is, either missing values exist among the data, or the model can be formulated more simply by assuming the existence of further unobserved data points. In statistical models with latent variables, solving the likelihood equations directly is usually impossible. Instead, the EM algorithm proceeds from the observation that there is a way to solve these two sets of equations numerically: pick arbitrary values for one of the two sets of unknowns, use them to estimate the second set, then use these new values to find a better estimate of the first set, and keep alternating until the resulting values both converge. In general, multiple maxima may occur, with no guarantee that the global maximum will be found. Some likelihoods also have singularities in them, i.e. nonsensical maxima. Associated with each data point may be a vector of observations. The missing values Z are discrete, drawn from a fixed number of values. The parameters are continuous and are of two kinds: parameters that are associated with all data points, and those associated with a specific value of a latent variable. 
However, it is possible to apply EM to other sorts of models. This suggests an iterative algorithm for the case where both θ and Z are unknown. First, initialize the parameters θ to some random values. Second, compute the best value for Z given these parameter values. Third, use the just-computed values of Z to compute a better estimate for the parameters θ; parameters associated with a particular value of Z will use only those data points whose associated latent variable has that value. Iterate steps 2 and 3 until convergence. The algorithm as just described monotonically approaches a local minimum of the cost function, and is commonly called hard EM. The k-means algorithm is an example of this class of algorithms. Instead of making a hard choice for Z, each possible value can be weighted by its computed probability; the resulting algorithm is called soft EM, and is the type of algorithm normally associated with EM. The counts used to compute these weighted averages are called soft counts; the probabilities computed for Z are posterior probabilities, and are what is computed in the E step
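The alternation between soft counts (E step) and parameter re-estimation (M step) can be illustrated with a classic toy problem: estimating the biases of two coins when the coin used for each batch of flips is the latent variable Z. The data and starting values below are illustrative assumptions.

```python
# Minimal soft-EM sketch for a two-coin mixture.
from math import comb

# Each record: (heads, tails) out of 10 flips; which coin produced it is hidden.
data = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]

def likelihood(theta, h, t):
    """Binomial likelihood of h heads and t tails for a coin with bias theta."""
    return comb(h + t, h) * theta**h * (1 - theta)**t

theta_a, theta_b = 0.6, 0.5            # initial guesses for the two biases
for _ in range(100):
    ha = ta = hb = tb = 0.0            # soft counts of heads/tails per coin
    for h, t in data:
        la = likelihood(theta_a, h, t)
        lb = likelihood(theta_b, h, t)
        w = la / (la + lb)             # E step: posterior Pr(Z = coin A | record)
        ha += w * h; ta += w * t
        hb += (1 - w) * h; tb += (1 - w) * t
    theta_a = ha / (ha + ta)           # M step: weighted maximum likelihood
    theta_b = hb / (hb + tb)

print(round(theta_a, 2), round(theta_b, 2))
```

The head-heavy records are softly attributed mostly to coin A, so the estimates separate: one bias converges near 0.8 and the other near 0.5. Replacing the weight w by a hard 0/1 choice turns this into the hard-EM variant described above.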
Expectation-maximization
–
Machine learning and data mining
63.
Group method of data handling
–
GMDH is used in such fields as data mining, knowledge discovery, prediction, complex systems modeling, optimization and pattern recognition. In order to find the best solution, GMDH algorithms consider various component subsets of the base function, called partial models. Coefficients of these models are estimated by the least squares method. GMDH algorithms gradually increase the number of partial model components and find a model structure with optimal complexity, indicated by the minimum value of an external criterion. This process is called self-organization of models. Jürgen Schmidhuber cites GMDH as one of the earliest deep learning methods, remarking that it was used to train eight-layer neural nets as early as 1971. The method was originated in 1968 by Prof. Alexey G. Ivakhnenko at the Institute of Cybernetics in Kiev. Thanks to the author's policy of open code sharing, the method quickly became established in a large number of scientific laboratories worldwide. At that time code sharing was quite a physical action, since the Internet is at least 5 years younger than GMDH. Despite this fact, the first investigation of GMDH outside the Soviet Union was made soon after, by R. Shankar in 1972. Later on, different GMDH variants were published by Japanese and Polish scientists. The period 1968–1971 is characterized by application of the regularity criterion alone to the problems of identification, pattern recognition and short-term forecasting. Polynomials, logical nets and fuzzy Zadeh sets were used as reference functions. Authors were stimulated by the very high accuracy of forecasting with the new approach. The problem of modeling of noised data and incomplete information basis was solved; multicriteria selection and utilization of additional a priori information for increasing noise immunity were proposed. Best experiments showed that with an extended definition of the optimal model by an additional criterion, the noise level can be ten times greater than the signal. 
Then it was improved using Shannon's theorem of general communication theory, and the convergence of multilayered GMDH algorithms was investigated. It was shown that some multilayered algorithms have a "multilayerness error", analogous to the static error of control systems. In 1977 a solution of objective systems analysis problems by multilayered GMDH algorithms was proposed. It turned out that sorting-out by a criteria ensemble finds the one optimal system of equations, and can therefore reveal the elements of a complex object. Many important theoretical results were obtained, and it became clear that full physical models cannot be used for long-term forecasting. It was proved that non-physical models of GMDH are more accurate for approximation and forecasting than physical models. Two-level algorithms which use two different time scales for modeling were developed. Since 1989 new algorithms for non-parametric modeling of fuzzy objects and SLP for expert systems have been developed and investigated. The present stage of GMDH development can be described as a blossoming of twice-multilayered neuronets. The external criterion is one of the key features of GMDH. The criterion describes requirements to the model, and it is always calculated with a separate part of the data sample that has not been used for estimation of coefficients
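The key mechanism above — fit candidate partial models by least squares on one part of the sample, then select model complexity by an external criterion computed on held-out data — can be illustrated with a toy example. The data-generating function, noise level, split, and candidate set are all assumptions for illustration, not part of any standard GMDH algorithm.

```python
# Toy external-criterion selection: polynomial models of growing complexity.
import random

def solve(A, b):
    """Gaussian elimination with partial pivoting (for the normal equations)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_poly(xs, ys, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    X = [[x**d for d in range(degree + 1)] for x in xs]
    A = [[sum(r[i] * r[j] for r in X) for j in range(degree + 1)]
         for i in range(degree + 1)]
    b = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(degree + 1)]
    return solve(A, b)

def mse(coefs, xs, ys):
    return sum((sum(c * x**d for d, c in enumerate(coefs)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
xs = [i / 19 for i in range(20)]
ys = [3 * x * x - 2 * x + 1 + rng.gauss(0, 0.05) for x in xs]  # noisy quadratic
train_x, train_y = xs[0::2], ys[0::2]   # part used to estimate coefficients
check_x, check_y = xs[1::2], ys[1::2]   # separate part: the external criterion

errors = {}
for degree in (1, 2, 6):                # partial models of growing complexity
    coefs = fit_poly(train_x, train_y, degree)
    errors[degree] = mse(coefs, check_x, check_y)

best = min(errors, key=errors.get)
print(best, {d: round(e, 4) for d, e in errors.items()})
```

The linear model underfits the quadratic data, so the external criterion rejects it in favor of a more adequate complexity — the selection is driven entirely by data not used for fitting, as the text describes.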
Group method of data handling
–
GMDH author - Soviet scientist Prof. Alexey G. Ivakhnenko.
64.
Recursion
–
Recursion occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics and computer science, where a function being defined is applied within its own definition. While this apparently defines an infinite number of instances, it is often done in such a way that no loop or infinite chain of references can occur. For example, one's parents are one's ancestors (base case), and the ancestors of one's ancestors are also one's ancestors (recursion step). The Fibonacci sequence is a classic example of recursion: Fib(0) = 0 as base case 1, Fib(1) = 1 as base case 2, and for all integers n > 1, Fib(n) = Fib(n − 1) + Fib(n − 2). Many mathematical axioms are based upon recursive rules. For example, the formal definition of the natural numbers by the Peano axioms can be described as: 0 is a natural number, and each natural number has a successor, which is also a natural number. By this base case and recursive rule, one can generate the set of all natural numbers. Recursively defined mathematical objects include functions, sets, and especially fractals. There are various more tongue-in-cheek definitions of recursion; see recursive humor. Recursion is the process a procedure goes through when one of the steps of the procedure involves invoking the procedure itself. A procedure that goes through recursion is said to be recursive. To understand recursion, one must recognize the distinction between a procedure and the running of a procedure. A procedure is a set of steps based on a set of rules, while the running of a procedure involves actually following the rules and performing the steps. An analogy: a procedure is like a recipe; running a procedure is like actually preparing the meal. Recursion is related to, but not the same as, a reference within the specification of a procedure to the execution of some other procedure. 
For instance, a recipe might refer to cooking vegetables, which is another procedure that in turn requires heating water, and so forth; for this reason recursive definitions are very rare in everyday situations. An example could be the following procedure to find a way through a maze: proceed forward until reaching either an exit or a branching point. If the point reached is an exit, terminate. Otherwise try each branch in turn, using the procedure recursively; if every trial fails by reaching only dead ends, return on the path that led to this branching point. Whether this actually defines a terminating procedure depends on the nature of the maze; in any case, executing the procedure requires carefully recording all currently explored branching points, and which of their branches have already been exhaustively tried. Recursion also occurs in natural language, and this can be understood in terms of a recursive definition of a syntactic category, such as a sentence. A sentence can have a structure in which what follows the verb is another sentence: Dorothy thinks witches are dangerous. So a sentence can be defined recursively as something with a structure that includes a noun phrase, a verb, and optionally another sentence
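The Fibonacci definition above translates directly into a recursive function: two base cases and one recursive rule, with each call being a separate "running" of the same procedure. A minimal sketch:

```python
def fib(n):
    if n == 0:                           # base case 1
        return 0
    if n == 1:                           # base case 2
        return 1
    return fib(n - 1) + fib(n - 2)       # recursive rule, for n > 1

print([fib(n) for n in range(8)])        # [0, 1, 1, 2, 3, 5, 8, 13]
```

Because the argument strictly decreases toward the base cases, no infinite chain of calls can occur, mirroring the point made above about well-formed recursive definitions.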
Recursion
–
A visual form of recursion known as the Droste effect. The woman in this image holds an object that contains a smaller image of her holding an identical object, which in turn contains a smaller image of herself holding an identical object, and so forth. Advertisement for Droste cocoa, c. 1900
Recursion
–
Ouroboros, an ancient symbol depicting a serpent or dragon eating its own tail.
Recursion
–
Recently refreshed sourdough, bubbling through fermentation: the recipe calls for some sourdough left over from the last time the same recipe was made.
Recursion
–
Recursive dolls: the original set of Matryoshka dolls by Zvyozdochkin and Malyutin, 1892
65.
Topological sort
–
A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time. The canonical application of topological sorting is in scheduling a sequence of jobs or tasks based on their dependencies. The jobs are represented by vertices, and there is an edge from x to y if job x must be completed before job y can be started; then, a topological sort gives an order in which to perform the jobs. It is also used to decide in which order to load tables with foreign keys in databases. The usual algorithms for topological sorting have running time linear in the number of nodes plus the number of edges. One of these algorithms, first described by Kahn, works by choosing vertices in the same order as the eventual topological sort. First, find a list of start nodes which have no incoming edges and insert them into a set S. Then repeatedly remove a node n from S, append it to the output ordering, and delete its outgoing edges, adding any node whose last incoming edge was just removed to S. If edges remain in the graph once S is empty, the graph must have at least one cycle and therefore a topological sorting is impossible. Reflecting the non-uniqueness of the resulting sort, the structure S can be simply a set or a queue or a stack; depending on the order in which nodes n are removed from S, a different solution is created. A variation of Kahn's algorithm that breaks ties lexicographically forms a key component of the Coffman–Graham algorithm for parallel scheduling and layered graph drawing. An alternative algorithm for topological sorting is based on depth-first search. Since each edge and node is visited once, the algorithm runs in linear time. This depth-first-search-based algorithm is the one described by Cormen et al.; it seems to have been first described in print by Tarjan. On a parallel random-access machine, an ordering can be constructed in O(log² n) time using a polynomial number of processors. One method for doing this is to repeatedly square the adjacency matrix of the given graph, logarithmically many times. 
The resulting matrix describes the longest path distances in the graph; sorting the vertices by the lengths of their longest incoming paths produces a topological ordering. A topological ordering can also be used to compute shortest paths through a weighted directed acyclic graph. Let V be the list of vertices in such a graph, in topological order. Then the following algorithm computes the shortest path from some source vertex s to all other vertices: process each vertex in topological order, relaxing its outgoing edges. On a graph of n vertices and m edges, this takes linear O(n + m) time
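Kahn's algorithm and the topological-order shortest-path computation described above can be sketched together; the small graph at the end is a made-up example.

```python
from collections import deque

def topological_sort(n, edges):
    """Kahn's algorithm: repeatedly remove vertices with no incoming edges."""
    indegree = [0] * n
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        indegree[v] += 1
    S = deque(u for u in range(n) if indegree[u] == 0)  # start nodes
    order = []
    while S:                           # S could equally be a set or a stack
        u = S.popleft()
        order.append(u)
        for v in adj[u]:
            indegree[v] -= 1           # "delete" edge u -> v
            if indegree[v] == 0:
                S.append(v)
    if len(order) < n:                 # leftover vertices lie on a cycle
        raise ValueError("graph has at least one cycle")
    return order

def dag_shortest_paths(n, wedges, source):
    """Relax outgoing edges in topological order; O(n + m) overall."""
    order = topological_sort(n, [(u, v) for u, v, _ in wedges])
    adj = [[] for _ in range(n)]
    for u, v, w in wedges:
        adj[u].append((v, w))
    dist = [float("inf")] * n
    dist[source] = 0
    for u in order:
        for v, w in adj[u]:
            dist[v] = min(dist[v], dist[u] + w)
    return dist

wedges = [(0, 1, 2), (0, 2, 5), (1, 2, 1), (2, 3, 4)]
print(dag_shortest_paths(4, wedges, 0))   # [0, 2, 3, 7]
```

Because every vertex is finalized before any of its successors is processed, a single pass over the edges suffices, which is why the DAG case avoids Dijkstra's priority queue entirely.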
Topological sort
–
5, 7, 3, 11, 8, 2, 9, 10 (visual left-to-right, top-to-bottom)
66.
Natural language processing
–
The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence". The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solved problem; however, little further research in machine translation was conducted until the late 1980s. ELIZA, using almost no information about human thought or emotion, sometimes provided a startlingly human-like interaction; when the "patient" exceeded the very small knowledge base, ELIZA might provide a generic response, for example, responding to "My head hurts" with "Why do you say your head hurts?". During the 1970s many programmers began to write conceptual ontologies, which structured real-world information into computer-understandable data; examples are MARGIE, SAM, PAM, TaleSpin, QUALM, Politics, and Plot Units. During this time, many chatterbots were written, including PARRY and Racter. Up to the 1980s, most NLP systems were based on complex sets of hand-written rules. Starting in the late 1980s, however, there was a revolution in NLP with the introduction of machine learning algorithms for language processing. Some of the earliest-used machine learning algorithms, such as decision trees, produced systems of hard if-then rules similar to existing hand-written rules. The cache language models upon which many speech recognition systems now rely are examples of statistical models. Many of the early successes occurred in the field of machine translation, due especially to work at IBM Research. However, most other systems depended on corpora specifically developed for the tasks implemented by these systems; as a result, a great deal of research has gone into methods of more effectively learning from limited amounts of data. Recent research has focused on unsupervised and semi-supervised learning algorithms. 
Such algorithms are able to learn from data that has not been hand-annotated with the desired answers. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data; however, there is an enormous amount of non-annotated data available. Since the so-called statistical revolution in the late 1980s and mid 1990s, much natural language processing research has relied heavily on machine learning. Formerly, many language-processing tasks typically involved the direct hand coding of rules, which is not in general robust to natural language variation. The machine-learning paradigm calls instead for using statistical inference to automatically learn such rules through the analysis of large corpora of typical real-world examples. Many different classes of machine learning algorithms have been applied to NLP tasks. These algorithms take as input a set of features that are generated from the input data. Some of the algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common
Natural language processing
–
An automated online assistant providing customer service on a web page, an example of an application where natural language processing is a major component.
67.
Baidu
–
Baidu, Inc., incorporated on January 18, 2000, is a Chinese web services company headquartered at the Baidu Campus in Beijing's Haidian District. It is one of the largest Internet companies in the world. Baidu offers many services, including a Chinese search engine for websites, audio files and images, and 57 search and community services including Baidu Baike (an online encyclopedia) and a searchable, keyword-based discussion forum. Baidu was established in 2000 by Robin Li and Eric Xu. Both of the co-founders are Chinese nationals who studied and worked overseas before returning to China. In December 2016, Baidu ranked 4th overall in the Alexa Internet rankings. During Q4 of 2010, it is estimated there were 4.02 billion search queries in China, of which Baidu had a market share of 56.6%. Baidu's share of China's Internet-search revenue in the second quarter of 2011 was 76%. In December 2007, Baidu became the first Chinese company to be included in the NASDAQ-100 index. As of 2006, Baidu provided an index of over 740 million web pages and 80 million images. Baidu offers multimedia content including MP3 music and movies, and was the first in China to offer Wireless Application Protocol and personal digital assistant-based mobile search. Baidu Baike is similar to Wikipedia as an encyclopedia. The company also hosts a music service called Baidu Music. On December 4, 2015, Baidu announced plans to merge the service with Taihe Entertainment Group to help it compete with Apple Inc.'s Apple Music. The name Baidu literally means "countless times", or alternatively, "a hundred times". In 1994, Robin Li joined IDD Information Services, a New Jersey division of Dow Jones and Company, where he worked on developing better algorithms for search engines; he remained at IDD Information Services from May 1994 to June 1997. In 1996, while at IDD, Li developed the RankDex site-scoring algorithm for search results page ranking. 
He later used this technology for the Baidu search engine. In 2000, the company Baidu launched in Beijing, China. The first office was located in a room near Peking University, from where Robin Li graduated. In 2003, Baidu launched a news search engine and a picture search engine. On January 12, 2010, Baidu.com's DNS records in the United States were altered such that Internet users visiting baidu.com were met with a page saying "This site has been attacked by Iranian Cyber Army". Chinese hackers later responded by attacking Iranian websites and leaving messages. On August 6, 2012, the BBC reported that three employees of Baidu were arrested on suspicion that they accepted bribes; the bribes were paid for deleting posts from the forum service
Baidu
–
Baidu headquarters, Haidian District, Beijing
Baidu
–
Baidu, Inc.
69.
Deep belief network
–
When trained on a set of examples in an unsupervised way, a DBN can learn to probabilistically reconstruct its inputs; the layers then act as feature detectors on inputs. After this learning step, a DBN can be further trained in a supervised way to perform classification. The observation, due to Yee-Whye Teh, that DBNs can be trained greedily, one layer at a time, led to one of the first effective deep learning algorithms; it also leads to a fast, layer-by-layer unsupervised training procedure. The training algorithm for DBNs proceeds as follows. Let X be a matrix of inputs, regarded as a set of feature vectors. Train a restricted Boltzmann machine on X to obtain its weight matrix, then transform X by the RBM to produce new data X′, either by sampling or by computing the mean activation of the hidden units. Repeat this procedure with X ← X′ for the next pair of layers. Finally, fine-tune all the parameters of this architecture with respect to a proxy for the DBN log-likelihood. See also: Bayesian network, deep learning.
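The greedy layer-by-layer procedure described above can be sketched in NumPy. This is a minimal sketch under simplifying assumptions: toy random data, a crude mean-field CD-1 update inside train_rbm, and illustrative layer sizes; it is not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hidden, epochs=5, lr=0.1):
    """Crude mean-field CD-1 training of one RBM layer (illustrative only)."""
    n_visible = X.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden bias
    a = np.zeros(n_visible)  # visible bias
    for _ in range(epochs):
        h = sigmoid(X @ W + b)              # up-pass: mean hidden activations
        v_recon = sigmoid(h @ W.T + a)      # reconstruction of the visibles
        h_recon = sigmoid(v_recon @ W + b)
        W += lr * (X.T @ h - v_recon.T @ h_recon) / len(X)
        b += lr * (h - h_recon).mean(axis=0)
        a += lr * (X - v_recon).mean(axis=0)
    return W, b

# Greedy procedure: train an RBM on X, transform X by its mean hidden
# activations, and repeat with X <- X' for the next pair of layers.
X = rng.random((100, 20))        # toy input matrix (rows = feature vectors)
layers = []
for n_hidden in [16, 8]:         # illustrative layer sizes
    W, b = train_rbm(X, n_hidden)
    layers.append((W, b))
    X = sigmoid(X @ W + b)       # X <- X'

print([W.shape for W, _ in layers])
```

The supervised fine-tuning step with respect to a log-likelihood proxy is omitted here; the sketch only shows the unsupervised stacking loop.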
Deep belief network
–
Machine learning and data mining
70.
Contrastive divergence
–
A restricted Boltzmann machine is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. RBMs have found applications in dimensionality reduction, classification, collaborative filtering and feature learning, and they can be trained in either supervised or unsupervised ways, depending on the task. As their name implies, the network is restricted to a bipartite graph; by contrast, unrestricted Boltzmann machines may have connections between hidden units. This restriction allows for more efficient training algorithms than are available for the general class of Boltzmann machines, in particular the gradient-based contrastive divergence algorithm. Restricted Boltzmann machines can also be used in deep learning networks; in particular, deep belief networks can be formed by stacking RBMs and optionally fine-tuning the resulting deep network with gradient descent and backpropagation. The individual activation probabilities are given by P(h_j = 1 | v) = σ(b_j + Σ_i w_ij v_i) and P(v_i = 1 | h) = σ(a_i + Σ_j w_ij h_j), where σ denotes the logistic sigmoid. The visible units of an RBM can be multinomial, although the hidden units are Bernoulli. In this case, the logistic function for visible units is replaced by the softmax function P(v_i^k = 1 | h) = exp(a_i^k + Σ_j w_ij^k h_j) / Σ_{k′=1}^K exp(a_i^{k′} + Σ_j w_ij^{k′} h_j), where K is the number of discrete values that the visible values have. They are applied in topic modeling and recommender systems. Restricted Boltzmann machines are a special case of Boltzmann machines and Markov random fields, and their graphical model corresponds to that of factor analysis. The contrastive divergence algorithm performs Gibbs sampling and is used inside a gradient descent procedure to compute weight updates. The single-step (CD-1) procedure for a training sample v is as follows. Sample the hidden activations h from v, then compute the outer product of v and h and call this the positive gradient. From h, sample a reconstruction v′ of the visible units, then resample the hidden activations h′ from this. Compute the outer product of v′ and h′ and call this the negative gradient. Let the update to the weight matrix W be the positive gradient minus the negative gradient, times some learning rate: ΔW = ε(vh^T − v′h′^T). 
Update the biases a and b analogously: Δa = ε(v − v′), Δb = ε(h − h′). A Practical Guide to Training RBMs, written by Hinton, can be found on his homepage. See also: Autoencoder, Deep learning, Helmholtz machine, Hopfield network; "Introduction to Restricted Boltzmann Machines", Edwin Chen's blog, July 18, 2011; "A Beginner's Guide to Restricted Boltzmann Machines"; a Python implementation of Bernoulli RBM and tutorial.
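The update steps listed above can be written out as a single CD-1 step. This is a minimal NumPy sketch: the toy RBM sizes, random weights, data vector and learning rate are illustrative, and mean hidden activations are used in the negative phase, which is one common choice.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy RBM: 6 visible and 4 hidden Bernoulli units (W, a, b assumed given).
n_v, n_h = 6, 4
W = 0.1 * rng.standard_normal((n_v, n_h))
a, b = np.zeros(n_v), np.zeros(n_h)
eps = 0.05                                       # learning rate
v = rng.integers(0, 2, size=n_v).astype(float)   # one training vector

# Sample hidden activations h from the data v.
p_h = sigmoid(v @ W + b)
h = (rng.random(n_h) < p_h).astype(float)

# Positive gradient: outer product of v and h.
pos = np.outer(v, h)

# From h, sample a reconstruction v' of the visible units,
# then recompute the hidden activations h' from it.
p_v = sigmoid(h @ W.T + a)
v_prime = (rng.random(n_v) < p_v).astype(float)
h_prime = sigmoid(v_prime @ W + b)               # mean activations

# Negative gradient: outer product of v' and h'.
neg = np.outer(v_prime, h_prime)

# Updates: ΔW = ε(pos − neg); biases analogously.
W += eps * (pos - neg)
a += eps * (v - v_prime)
b += eps * (h - h_prime)
```

In practice these updates are averaged over mini-batches, and k > 1 Gibbs steps (CD-k) can be used for the negative phase.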
Contrastive divergence
–
Diagram of a restricted Boltzmann machine with three visible units and four hidden units (no bias units).
71.
Maximum likelihood
–
The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, one may be interested in the heights of adult female penguins but be unable to measure the height of every penguin in a population; MLE would accomplish this by taking the mean and variance as parameters and finding the particular parameter values that make the observed results the most probable given the model. In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of parameter values that maximizes the likelihood function. Maximum likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems. Maximum-likelihood estimation was recommended, analyzed and widely popularized by Ronald Fisher between 1912 and 1922. Maximum-likelihood estimation finally transcended heuristic justification in a proof published by Samuel S. Wilks in 1938; ironically, the only difficult part of the proof depends on the expected value of the Fisher information matrix, which is provided by a theorem proven by Fisher. Wilks continued to improve on the generality of the theorem throughout his life. Some of the theory behind maximum likelihood estimation was developed for Bayesian statistics, and reviews of the development of maximum likelihood estimation have been provided by a number of authors. Suppose there is a sample x1, x2, …, xn of n independent and identically distributed observations, coming from a distribution with an unknown probability density function f0. It is however surmised that the function f0 belongs to a certain family of distributions {f(·; θ), θ ∈ Θ}, called the parametric model. The value θ0 is unknown and is referred to as the true value of the parameter vector. It is desirable to find an estimator θ̂ which would be as close to the true value θ0 as possible; either or both the observed variables xi and the parameter θ can be vectors. To use the method of maximum likelihood, one first specifies the joint density function for all observations; for an independent and identically distributed sample, this joint density function is f(x1, x2, …, xn; θ) = f(x1; θ) × f(x2; θ) × ⋯ × f(xn; θ). 
Note that the semicolon denotes a separation between the two categories of arguments: the parameters θ and the observations x1, …, xn. The hat over ℓ indicates that it is akin to some estimator; indeed, ℓ̂ estimates the expected log-likelihood of a single observation in the model. The method of maximum likelihood estimates θ0 by finding a value of θ that maximizes ℓ̂, and this method of estimation defines a maximum likelihood estimator of θ0, if a maximum exists. An MLE estimate is the same regardless of whether we maximize the likelihood or the log-likelihood function, since log is monotonically increasing. For many models, a maximum likelihood estimator can be found as an explicit function of the observed data x1, …, xn. For many other models, however, no closed-form solution to the maximization problem is known or available, and an MLE must be found numerically. For some problems, there may be multiple estimates that maximize the likelihood. In the exposition above, it is assumed that the data are independent and identically distributed. In a simpler extension, an allowance can be made for data heterogeneity; put another way, we are now assuming that each observation xi comes from a random variable that has its own distribution function fi.
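For the normal model mentioned above, the MLE is available in closed form: the sample mean and the 1/n sample variance. The sketch below checks numerically that perturbing these values does not increase the log-likelihood; the data are simulated, and the "penguin heights" interpretation is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=165.0, scale=7.0, size=500)  # simulated heights, cm

def log_likelihood(mu, sigma2, x):
    """ℓ(μ, σ²) = Σᵢ log f(xᵢ; μ, σ²) for the normal model."""
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

# Closed-form MLE for the normal model: sample mean and 1/n variance.
mu_hat = x.mean()
sigma2_hat = x.var()   # ddof=0 by default, i.e. the 1/n variance

# Perturbing the MLE should never increase the log-likelihood.
best = log_likelihood(mu_hat, sigma2_hat, x)
assert best >= log_likelihood(mu_hat + 0.5, sigma2_hat, x)
assert best >= log_likelihood(mu_hat, sigma2_hat * 1.1, x)
```

Note that the MLE uses the 1/n variance rather than the unbiased 1/(n−1) estimator; maximizing the likelihood and removing bias are different criteria.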
Maximum likelihood
–
Ronald Fisher in 1913
72.
ReLU
–
In the context of artificial neural networks, the rectifier is an activation function defined as f(x) = max(0, x), where x is the input to a neuron. This is also known as a ramp function and is analogous to half-wave rectification in electrical engineering. This activation function was first introduced to a dynamical network by Hahnloser et al. in a 2000 paper in Nature with strong biological motivations. It has been used in convolutional networks more effectively than the widely used logistic sigmoid and its more practical counterpart, the hyperbolic tangent. The rectifier is, as of 2015, the most popular activation function for deep neural networks; a unit employing the rectifier is also called a rectified linear unit (ReLU). A smooth approximation to the rectifier is the softplus function f(x) = ln(1 + e^x). The derivative of softplus is f′(x) = e^x / (1 + e^x) = 1 / (1 + e^(−x)), i.e. the logistic function. Rectified linear units find applications in computer vision and speech recognition using deep neural nets. Leaky ReLUs allow a small, non-zero gradient when the unit is not active: f(x) = x if x > 0, and 0.01x otherwise. Parametric ReLUs take this further by making the coefficient of leakage into a parameter that is learned along with the other neural network parameters: f(x) = x if x > 0, and ax otherwise. Note that for a ≤ 1, this is equivalent to f(x) = max(x, ax). Exponential linear units try to make the mean activations closer to zero, which speeds up learning, and it has been shown that ELUs can obtain higher classification accuracy than ReLUs: f(x) = x if x ≥ 0, and a(e^x − 1) otherwise, where a is a hyper-parameter to be tuned. Advantages of the rectifier include biological plausibility (one-sided, compared to the antisymmetry of tanh); sparse activation (for example, in a randomly initialized network, only about half of hidden units are activated); efficient gradient propagation (no vanishing or exploding gradient problems); and efficient computation (only comparison, addition and multiplication). For the first time in 2011, the use of the rectifier as a non-linearity was shown to enable training deep supervised neural networks without requiring unsupervised pre-training. 
Rectified linear units, compared to the sigmoid function or similar activation functions, allow for faster and more effective training of deep neural architectures on large and complex datasets. Potential problems include non-differentiability at zero (however, the rectifier is differentiable anywhere else, including at points arbitrarily close to zero); non-zero centered output; and unboundedness (activations could potentially blow up). A further issue is the dying ReLU problem: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron; in some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity.
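The rectifier and the variants defined above can be written directly with NumPy. This is a minimal sketch; the function names and test inputs are illustrative.

```python
import numpy as np

def relu(x):
    """Rectifier: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def softplus(x):
    """Smooth approximation f(x) = ln(1 + e^x); its derivative is the logistic."""
    return np.log1p(np.exp(x))

def leaky_relu(x, slope=0.01):
    """Small non-zero gradient (slope) when the unit is not active."""
    return np.where(x > 0, x, slope * x)

def elu(x, a=1.0):
    """Exponential linear unit: x for x >= 0, a(e^x − 1) otherwise."""
    return np.where(x >= 0, x, a * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 3.0])
for f in (relu, softplus, leaky_relu, elu):
    print(f.__name__, f(x))
```

For a parametric ReLU, slope would be a learned parameter rather than the fixed 0.01 used by the leaky variant.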
ReLU
–
Plot of the rectifier (blue) and softplus (green) functions near x = 0.
73.
Binary variable
–
Binary data is data whose unit can take on only two possible states, traditionally termed 0 and 1 in accordance with the binary numeral system and Boolean algebra. Forms and interpretations of binary data come in different technical and scientific fields: such a two-valued unit can be termed a bit in computer science, a truth value in mathematical logic and related domains, and a binary variable in statistics. A discrete variable that can take only one state contains zero information; that is why the bit, a variable with only two possible values, is a standard primary unit of information. A collection of n bits may have 2^n states (see binary number for details). The number of states of a collection of discrete variables depends exponentially on the number of variables, and only as a power law on the number of states of each variable. Ten bits have more states (1024) than three decimal digits (1000), so the use of any other small base than 2 does not provide a substantial advantage. Moreover, Boolean algebra provides a convenient mathematical structure for collections of bits; Boolean algebra operations are known as bitwise operations in computer science. Boolean functions are also well-studied theoretically and easily implementable, either with computer programs or by so-named logic gates in digital electronics, and this contributes to the use of bits to represent different data, even those originally not binary. In statistics, binary data is a data type described by binary variables. Binary data represents the outcomes of Bernoulli trials, that is, statistical experiments with two possible outcomes. It is a special case of categorical data, which more generally represents experiments with a fixed number of possible outcomes; in this respect also, binary data is similar to categorical data. Often, binary data is used to represent one of two conceptually opposed values, e.g. true/false or on/off. For example, binary data is used to represent the party choices of voters in elections in the United States. 
In this case, there is no inherent reason why only two parties should exist, and indeed, other parties do exist in the U.S. Modeling continuous data as a binary variable for analysis purposes is called dichotomization. Like all discretization, it involves discretization error, but the goal is to learn something valuable despite the error. Binary variables that are random variables are distributed according to a Bernoulli distribution. Regression analysis on predicted outcomes that are binary variables is accomplished through logistic regression, probit regression or a related type of discrete choice model. In modern computers, binary data refers to any data represented in binary form rather than interpreted on a higher level or converted into some other form. At the lowest level, bits are stored in a bistable device such as a flip-flop.
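The state-counting argument and the bitwise view of Boolean operations above can be illustrated directly; the particular bit patterns below are arbitrary examples.

```python
# n binary variables have 2**n joint states; ten bits exceed three decimal digits.
n_bits = 10
print(2 ** n_bits)   # 1024 states, more than 10**3 = 1000

# Boolean-algebra operations appear as bitwise operators on integers.
a, b = 0b1100, 0b1010
print(bin(a & b))    # AND
print(bin(a | b))    # OR
print(bin(a ^ b))    # XOR
```

Because each added bit doubles the number of joint states, capacity grows exponentially in the number of variables, which is the power-law-versus-exponential contrast made in the text.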
Binary variable
–
Binary tree, a conceptual metaphor (and a data structure) for sequences of bits
74.
Markov chain Monte Carlo
–
The state of the chain after a number of steps is then used as a sample of the desired distribution; the quality of the sample improves as a function of the number of steps. Random walk Monte Carlo methods make up a large subclass of MCMC methods. They are also used for generating samples that gradually populate the rare failure region in rare event sampling. When an MCMC method is used for approximating a multi-dimensional integral, an ensemble of walkers moves around randomly. At each point where a walker steps, the integrand value at that point is counted towards the integral; the walker then may make a number of tentative steps around the area, looking for a place with a reasonably high contribution to the integral to move into next. Random walk Monte Carlo methods are a kind of random simulation or Monte Carlo method. However, whereas the random samples of the integrand used in a conventional Monte Carlo integration are statistically independent, those used in MCMC methods are correlated; a Markov chain is constructed in such a way as to have the integrand as its equilibrium distribution. Examples of random walk Monte Carlo methods include the following. Gibbs sampling: this method requires all the conditional distributions of the target distribution to be sampled exactly; when drawing from the full-conditional distributions is not straightforward, other samplers-within-Gibbs are used. Gibbs sampling is popular partly because it does not require any tuning. Slice sampling: this method depends on the principle that one can sample from a distribution by sampling uniformly from the region under the plot of its density function. It alternates uniform sampling in the vertical direction with uniform sampling from the horizontal slice defined by the current vertical position. Multiple-try Metropolis: this method is a variation of the Metropolis–Hastings algorithm that allows multiple trials at each point; by making it possible to take larger steps at each iteration, it helps address the curse of dimensionality. 
Reversible-jump: this method is a variant of the Metropolis–Hastings algorithm that allows proposals that change the dimensionality of the space. MCMC methods that change dimensionality have long been used in statistical physics applications, where for some problems a distribution that is a grand canonical ensemble is used; in such cases the dimensionality is automatically inferred from the data. Unlike most current MCMC methods, which ignore the previous trials, training-based algorithms are able to use the previous steps to generate the next candidate; such a training-based algorithm can speed up the MCMC sampler by an order of magnitude. Interacting MCMC methodologies are a class of mean field particle methods for obtaining random samples from a sequence of probability distributions with an increasing level of sampling complexity. These probabilistic models include path space state models with increasing time horizon, sequences of partial observations, increasing constraint level sets for conditional distributions, decreasing temperature schedules associated with some Boltzmann-Gibbs distributions, and many others. In principle, any MCMC sampler can be turned into an interacting MCMC sampler, and these interacting MCMC samplers can be interpreted as a way to run in parallel a sequence of MCMC samplers. For instance, interacting simulated annealing algorithms are based on independent Metropolis-Hastings moves interacting sequentially with a selection-resampling type mechanism. In contrast to traditional MCMC methods, the precision parameter of this class of interacting MCMC samplers is only related to the number of interacting MCMC samplers.
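The random-walk Metropolis–Hastings idea, constructing a correlated chain whose equilibrium distribution is the target, can be sketched in a few lines. This is a minimal illustration targeting a standard normal; the step size, chain length and burn-in are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    """Unnormalized density of the desired distribution (standard normal here)."""
    return np.exp(-0.5 * x * x)

def metropolis(n_steps=50_000, step=1.0, burn_in=1_000):
    """Random-walk Metropolis: propose a step, accept with prob min(1, ratio)."""
    x = 0.0
    samples = []
    for _ in range(n_steps):
        proposal = x + step * rng.standard_normal()
        if rng.random() < target(proposal) / target(x):
            x = proposal          # accept; otherwise stay (samples are correlated)
        samples.append(x)
    return np.array(samples[burn_in:])   # discard the early, unconverged part

s = metropolis()
print(round(s.mean(), 2), round(s.std(), 2))   # close to 0 and 1
```

Because consecutive samples are correlated, many more MCMC draws are needed to reach a given accuracy than with independent Monte Carlo samples; the burn-in discard reflects the "state of the chain after a number of steps" point made above.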
Markov chain Monte Carlo
–
Convergence of the Metropolis-Hastings algorithm. MCMC attempts to approximate the blue distribution with the orange distribution
75.
Affine transformation
–
In geometry, an affine transformation, affine map or affinity is a function between affine spaces which preserves points, straight lines and planes; also, sets of parallel lines remain parallel after an affine transformation. An affine transformation does not necessarily preserve angles between lines or distances between points, though it does preserve ratios of distances between points lying on a straight line. Examples of affine transformations include translation, scaling, homothety, similarity transformation, reflection, rotation, shear mapping, and compositions of them in any combination and sequence. If X and Y are affine spaces, then every affine transformation f: X → Y is of the form x ↦ Mx + b, where M is a linear transformation and b is a vector. Unlike a purely linear transformation, an affine map need not preserve the zero point in a linear space; thus, every linear transformation is affine, but not every affine transformation is linear. All Euclidean spaces are affine, but there are affine spaces that are non-Euclidean. In affine coordinates, which include Cartesian coordinates in Euclidean spaces, each output coordinate of an affine map is a linear function of all input coordinates plus a constant. Another way to deal with affine transformations systematically is to select a point as the origin; then, any affine transformation is equivalent to a linear transformation followed by a translation. An affine map f: A → B between two affine spaces is a map on the points that acts linearly on the vectors. In symbols, f determines a linear transformation φ such that f(P) − f(Q) = φ(P − Q) for any pair of points P, Q, and we can interpret this definition in a few other ways, as follows. If an origin O ∈ A is chosen, and B denotes its image f(O) ∈ B, then for any vector x, f: O + x ↦ B + φ(x); the conclusion is that, intuitively, f consists of a translation and a linear map. In other words, f preserves barycenters. As shown above, an affine map is the composition of two functions, a translation and a linear map. 
Ordinary vector algebra uses matrix multiplication to represent linear maps and vector addition to represent translations. Using an augmented matrix and an augmented vector, it is possible to represent both the translation and the linear map using a single matrix multiplication: if A is a matrix, then [y; 1] = [A, b; 0, …, 0, 1] [x; 1] is equivalent to y = Ax + b. The above-mentioned augmented matrix is called an affine transformation matrix, or projective transformation matrix. This representation exhibits the set of all invertible affine transformations as the semidirect product of K^n and GL(n, K); this is a group under the operation of composition of functions. Ordinary matrix-vector multiplication always maps the origin to the origin, and could therefore never represent a translation, in which the origin must necessarily be mapped to some other point. By appending the additional coordinate 1 to every vector, one considers the space to be mapped as a subset of a space with an additional dimension. In that space, the original space occupies the subset in which the additional coordinate is 1; thus the origin of the original space can be found at (0, 0, …, 0, 1). A translation within the original space by means of a linear transformation of the higher-dimensional space is then possible.
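The augmented-matrix representation can be checked numerically. This is a minimal NumPy sketch; the 90° rotation and the translation vector are arbitrary example values.

```python
import numpy as np

# Affine map y = A x + b expressed as one augmented-matrix multiplication.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])        # linear part: 90-degree rotation
b = np.array([2.0, 3.0])           # translation part

M = np.eye(3)                      # augmented (affine transformation) matrix
M[:2, :2] = A
M[:2, 2] = b                       # last row stays [0, 0, 1]

x = np.array([1.0, 0.0])
x_aug = np.append(x, 1.0)          # embed x in the subset where the extra coord is 1

y_aug = M @ x_aug
assert np.allclose(y_aug[:2], A @ x + b)   # same result as y = A x + b
print(y_aug)                       # last entry remains 1
```

Because the extra coordinate stays 1, composing affine maps reduces to multiplying their augmented matrices, which is why this representation is convenient in graphics pipelines.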
Affine transformation
–
An image of a fern-like fractal that exhibits affine self-similarity. Each of the leaves of the fern is related to each other leaf by an affine transformation. For instance, the red leaf can be transformed into both the small dark blue leaf and the large light blue leaf by a combination of reflection, rotation, scaling, and translation.
76.
Convex optimization problem
–
Convex minimization, a subfield of optimization, studies the problem of minimizing convex functions over convex sets. The convexity property can make optimization in some sense easier than the general case; for example, the convexity of f makes the powerful tools of convex analysis applicable. With recent improvements in computing and in optimization theory, convex minimization is nearly as straightforward as linear programming. Many optimization problems can be reformulated as convex minimization problems; for example, the problem of maximizing a concave function f can be re-formulated equivalently as the problem of minimizing the function −f, which is convex. The general form of the problem is to find some x* ∈ X such that f(x*) = min{f(x) : x ∈ X}, for some feasible set X ⊂ ℝ^n. The optimization problem is called a convex optimization problem if X is a convex set and f is convex. The following statements are true about the convex minimization problem: if a local minimum exists, it is a global minimum; the set of all minima is convex; and for each strictly convex function, if the function has a minimum, then the minimum is unique. Standard form is the usual and most intuitive form of describing a convex minimization problem: a convex objective f(x), convex inequality constraints g_i(x) ≤ 0, and affine equality constraints h_i(x) = 0. In practice, the terms linear and affine are often used interchangeably; such constraints can be expressed in the form h_i(x) = a_i^T x + b_i. A convex minimization problem is thus written as: minimize f(x) over x, subject to g_i(x) ≤ 0, i = 1, …, m, and h_i(x) = 0, i = 1, …, p. Note that every equality constraint h(x) = 0 can be replaced by a pair of inequality constraints h(x) ≤ 0 and −h(x) ≤ 0; therefore, for theoretical purposes, equality constraints are redundant. However, following from this fact, it is easy to understand why h_i(x) = 0 has to be affine as opposed to merely being convex: if h_i is convex, then h_i(x) ≤ 0 is a convex constraint but −h_i(x) ≤ 0 generally is not, so the only way for h_i(x) = 0 to define a convex set is for h_i to be affine. The domain X is then X = {x : g_i(x) ≤ 0, i = 1, …, m; h_i(x) = 0, i = 1, …, p}, and the Lagrangian function for the problem is L(x, λ) = λ0 f(x) + λ1 g1(x) + ⋯ + λm gm(x). 
If there exists a strictly feasible point, that is, a point z satisfying g1(z), …, gm(z) < 0, then strong duality holds. Dual subgradient methods are subgradient methods applied to a dual problem; the drift-plus-penalty method is similar to the dual subgradient method, but takes a time average of the primal variables. Problems with convex level sets can be efficiently minimized, in theory: Yurii Nesterov proved that quasi-convex minimization problems could be solved efficiently. However, such theoretically efficient methods use divergent-series stepsize rules, which were first developed for classical subgradient methods. Solving even close-to-convex but non-convex problems can be computationally intractable: minimizing a unimodal function is intractable, regardless of the smoothness of the function, according to results of Ivanov.
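As a small illustration of minimizing a convex function over a convex feasible set, the sketch below applies projected gradient descent to a two-variable problem. The objective, the single inequality constraint and the step size are illustrative choices, not part of the article.

```python
import numpy as np

# Minimize the convex f(x) = ||x - c||^2 subject to a^T x <= 1.
c = np.array([1.0, 2.0])
a = np.array([1.0, 1.0])

def project(x):
    """Project onto the convex feasible set {x : a^T x <= 1} (closed form)."""
    slack = a @ x - 1.0
    return x if slack <= 0 else x - slack * a / (a @ a)

x = np.zeros(2)
for _ in range(500):
    grad = 2.0 * (x - c)            # gradient of f
    x = project(x - 0.1 * grad)     # gradient step, then project back

print(x)   # converges to [0, 1], the closest feasible point to c
```

Here the unconstrained minimizer c = (1, 2) violates the constraint, and convexity guarantees that the point the iteration settles on, (0, 1), is the global constrained minimum rather than merely a local one.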
78.
Bilinear map
–
In mathematics, a bilinear map is a function combining elements of two vector spaces to yield an element of a third vector space, and it is linear in each of its arguments. Let V, W and X be three vector spaces over the same base field F. A bilinear map is a function B: V × W → X such that, when we hold the first entry of the bilinear map fixed while letting the second entry vary, the result is a linear operator, and similarly for when we hold the second entry fixed. If V = W and we have B(v, w) = B(w, v) for all v, w in V, then B is symmetric. The case where X is the base field F, so that we have a bilinear form, is particularly useful. The definition works without any changes if instead of vector spaces over a field F we use modules over a commutative ring, and it generalizes to n-ary functions, where the proper term is multilinear. For modules over non-commutative rings R and S, a bilinear map satisfies B(rm, n) = r ⋅ B(m, n) and B(m, ns) = B(m, n) ⋅ s for all m in M, n in N, r in R and s in S. A first immediate consequence of the definition is that B(v, w) = 0_X whenever v = 0_V or w = 0_W; this may be seen by writing the zero vector 0_V as 0 ⋅ 0_V (and similarly for 0_W) and moving the scalar 0 outside, in front of B. The set L(V, W; X) of all bilinear maps is a linear subspace of the space of all maps from V × W into X. If V, W, X are finite-dimensional, then so is L(V, W; X); for X = F, i.e. bilinear forms, the dimension of this space is dim V × dim W. To see this, choose a basis for V and W; then each bilinear map can be represented by the matrix of values B(e_i, f_j). Now, if X is a space of higher dimension, we obviously have dim L(V, W; X) = dim V × dim W × dim X. Matrix multiplication is a bilinear map M(m, n) × M(n, p) → M(m, p). If a vector space V over the real numbers R carries an inner product, then the inner product is a bilinear map V × V → R. In general, for a vector space V over a field F, a bilinear form on V is the same as a bilinear map V × V → F. If V is a vector space with dual space V∗, then the application (evaluation) operator b(f, v) = f(v) is a bilinear map from V∗ × V to the base field. Let V and W be vector spaces over the same base field F; if f is a member of V∗ and g a member of W∗, then b(v, w) = f(v) g(w) defines a bilinear map V × W → F. The cross product in R3 is a bilinear map R3 × R3 → R3. Let B: V × W → X be a bilinear map, and L: U → W be a linear map; then (v, u) ↦ B(v, L(u)) is a bilinear map on V × U.
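A bilinear form given by a matrix, as in the examples above, can be checked numerically: B(v, w) = vᵀ M w is linear in each slot while the other is held fixed. This is a minimal NumPy sketch with arbitrary example vectors.

```python
import numpy as np

# A matrix M determines a bilinear form B(v, w) = v^T M w.
M = np.array([[1.0, 2.0],
              [0.0, 3.0]])

def B(v, w):
    return v @ M @ w

v = np.array([1.0, -1.0])
w = np.array([2.0, 0.5])
u = np.array([0.0, 4.0])
r = 2.5

# Linearity in each argument separately, the other held fixed.
assert np.isclose(B(r * v, w), r * B(v, w))
assert np.isclose(B(v, w + u), B(v, w) + B(v, u))

# B vanishes whenever either argument is the zero vector.
assert B(np.zeros(2), w) == 0.0
```

Note that B is not linear as a map of the pair (v, w): B(r·v, r·w) = r²·B(v, w), which is exactly what distinguishes bilinear from linear.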
Bilinear map
–
A matrix M determines a bilinear map into the reals by means of a real bilinear form (v, w) ↦ v′Mw; the associates of this map are then taken to the other three possibilities using duality and the musical isomorphism.
79.
Real number
–
In mathematics, a real number is a value that represents a quantity along a continuous line. The adjective real in this context was introduced in the 17th century by René Descartes; the real numbers include all the rational numbers, such as the integer −5 and the fraction 4/3, and all the irrational numbers, such as √2. Included within the irrationals are the transcendental numbers, such as π. Real numbers can be thought of as points on an infinitely long line called the number line or real line. Any real number can be determined by a possibly infinite decimal representation, such as that of 8.632. The real line can be thought of as a part of the complex plane, and complex numbers include real numbers. These descriptions of the real numbers are not sufficiently rigorous by the modern standards of pure mathematics. All these definitions satisfy the axiomatic definition and are thus equivalent. The statement that there is no subset of the reals with cardinality strictly greater than ℵ0 and strictly smaller than that of the continuum is known as the continuum hypothesis. Simple fractions were used by the Egyptians around 1000 BC; the Vedic Sulba Sutras, c. 600 BC, include what may be the first use of irrational numbers. Around 500 BC, the Greek mathematicians led by Pythagoras realized the need for irrational numbers, in particular the irrationality of the square root of 2. Arabic mathematicians merged the concepts of number and magnitude into a more general idea of real numbers. In the 16th century, Simon Stevin created the basis for modern decimal notation; in the 17th century, Descartes introduced the term real to describe roots of a polynomial, distinguishing them from imaginary ones. In the 18th and 19th centuries, there was much work on irrational and transcendental numbers. Johann Heinrich Lambert gave the first flawed proof that π cannot be rational; Adrien-Marie Legendre completed the proof. Évariste Galois developed techniques for determining whether a given equation could be solved by radicals, which gave rise to the field of Galois theory. 
Charles Hermite first proved that e is transcendental, and Ferdinand von Lindemann showed that π is transcendental. Lindemann's proof was much simplified by Weierstrass, still further by David Hilbert, and has finally been made elementary by Adolf Hurwitz and Paul Gordan. The development of calculus in the 18th century used the set of real numbers without having defined them cleanly. The first rigorous definition was given by Georg Cantor in 1871; in 1874, he showed that the set of all real numbers is uncountably infinite but the set of all algebraic numbers is countably infinite. Contrary to widely held beliefs, his first method was not his famous diagonal argument, which he published in 1891. The real number system can be defined axiomatically up to an isomorphism, which is described hereafter. Another possibility is to start from some rigorous axiomatization of Euclidean geometry; from the structuralist point of view all these constructions are on equal footing.
Real number
–
A symbol of the set of real numbers (ℝ)
80.
Probability density
–
In a more precise sense, the PDF is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. The probability density function is nonnegative everywhere, and its integral over the entire space is equal to one. The terms probability distribution function and probability function have also sometimes been used to denote the probability density function; however, this use is not standard among probabilists and statisticians. Further confusion of terminology exists because density function has also been used for what is here called the probability mass function; in general though, the PMF is used in the context of discrete random variables. Suppose a species of bacteria typically lives 4 to 6 hours. What is the probability that a bacterium lives exactly 5 hours? A lot of bacteria live for approximately 5 hours, but there is no chance that any given bacterium dies at exactly 5.0000000000... hours. Instead we might ask: what is the probability that the bacterium dies between 5 hours and 5.01 hours? Let's say the answer is 0.02. Next: what is the probability that the bacterium dies between 5 hours and 5.001 hours? The answer is probably around 0.002, since this is 1/10th of the previous interval; the probability that the bacterium dies between 5 hours and 5.0001 hours is probably about 0.0002, and so on. In these three examples, the ratio (probability of dying during an interval) / (duration of the interval) is approximately constant, and equal to 2 per hour: for example, there is 0.02 probability of dying in the 0.01-hour interval between 5 and 5.01 hours, and (0.02 probability / 0.01 hours) = 2 hour−1. This quantity 2 hour−1 is called the probability density for dying at around 5 hours. Therefore, in response to the question "What is the probability that the bacterium dies at 5 hours?", a literally correct but unhelpful answer is 0, but a better answer can be written as (2 hour−1) dt. This is the probability that the bacterium dies within an infinitesimal window of time around 5 hours, where dt is the duration of this window. 
For example, the probability that it lives longer than 5 hours, but shorter than 5.01 hours, is approximately 0.02. There is a probability density function f with f(5 hours) = 2 hour⁻¹. The integral of f over any window of time is the probability that the bacterium dies in that window. A probability density function is most commonly associated with absolutely continuous univariate distributions. A random variable X has density fX, where fX is a non-negative Lebesgue-integrable function, if Pr[a ≤ X ≤ b] = ∫_a^b fX(x) dx. That is, f is any function whose integral over any interval gives the probability of X falling in that interval. In the continuous univariate case above, the reference measure is the Lebesgue measure.
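The window-probability reasoning above can be checked numerically. The sketch below uses a made-up normal lifetime density centred at 5 hours (not real bacterial data) and confirms that the probability of a tiny window is approximately the density times the window's duration.

```python
import math

def lifetime_pdf(t, mu=5.0, sigma=0.5):
    """Hypothetical lifetime density (hours): a normal PDF, for illustration only."""
    return math.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def prob(a, b, pdf, steps=10_000):
    """Pr(a <= T <= b) = integral of the density over [a, b] (midpoint rule)."""
    h = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * h) for i in range(steps)) * h

total = prob(0, 10, lifetime_pdf)        # integral over (essentially) the whole space: 1
p_window = prob(5, 5.01, lifetime_pdf)   # probability of dying in a 0.01-hour window
# For a small window, Pr(5 <= T <= 5 + dt) is approximately f(5) * dt.
```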
Probability density
–
Boxplot and probability density function of a normal distribution N(0, σ²).
81.
Greedy algorithm
–
A greedy algorithm is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum. For example, a greedy strategy for the traveling salesman problem is the following heuristic: at each step of the journey, visit the nearest unvisited city. This heuristic need not find a best solution, but it terminates in a reasonable number of steps. In mathematical optimization, greedy algorithms solve combinatorial problems having the properties of matroids. Most problems for which they work will have two properties. Greedy choice property: we can make whatever choice seems best at the moment and then solve the subproblems that arise later. The choice made by a greedy algorithm may depend on choices made so far, but not on future choices. It iteratively makes one greedy choice after another, reducing each given problem into a smaller one; in other words, a greedy algorithm never reconsiders its choices. This is the main difference from dynamic programming, which is exhaustive and is guaranteed to find the optimal solution. After every stage, dynamic programming makes decisions based on all the decisions made in the previous stage. Optimal substructure: a problem exhibits optimal substructure if an optimal solution to the problem contains optimal solutions to the sub-problems. For many other problems, greedy algorithms fail to produce the optimal solution. Greedy algorithms can be characterized as being short-sighted, and also as non-recoverable. They are ideal only for problems which have optimal substructure; despite this, for many simple problems, the best suited algorithms are greedy algorithms. It is important, however, to note that the greedy algorithm can be used as a selection algorithm to prioritize options within a search. Greedy algorithms can make commitments to certain choices too early which prevent them from finding the best overall solution later; for example, all known greedy coloring algorithms for the graph coloring problem and all other NP-complete problems do not consistently find optimum solutions.
Nevertheless, they are useful because they are quick to think up and often give good approximations to the optimum. Examples of such greedy algorithms are Kruskal's algorithm and Prim's algorithm for finding minimum spanning trees, and the algorithm for finding optimum Huffman trees. The theory of matroids, and the more general theory of greedoids, provide whole classes of such algorithms. Greedy algorithms appear in network routing as well: using greedy routing, a message is forwarded to the neighboring node which is closest to the destination. The notion of a node's location may be determined by its physical location. Location may also be an artificial construct as in small world routing. The activity selection problem is characteristic of this class of problems. In the Macintosh computer game Crystal Quest the objective is to collect crystals, in a fashion similar to the travelling salesman problem.
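The short-sightedness described above can be illustrated with coin change, a standard greedy example; the coin systems below are illustrative.

```python
def greedy_change(amount, coins):
    """Repeatedly take the largest coin that fits - a locally optimal choice."""
    used = []
    for c in sorted(coins, reverse=True):
        while amount >= c:
            amount -= c
            used.append(c)
    return used

# Canonical coin systems: the greedy choice happens to be globally optimal.
us = greedy_change(63, [25, 10, 5, 1])   # [25, 25, 10, 1, 1, 1]
# Non-canonical system: greedy is short-sighted. For amount 6 with coins
# {4, 3, 1}, greedy commits to 4 first and needs 3 coins, but 3 + 3 uses 2.
odd = greedy_change(6, [4, 3, 1])        # [4, 1, 1]
```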
Greedy algorithm
82.
Atari 2600
–
The Atari 2600 is a home video game console by Atari, Inc. It popularized the use of microprocessor-based hardware and games stored in swappable ROM cartridges; this format contrasts with the model of having non-microprocessor dedicated hardware that could only play the games physically built into the unit. The console was originally sold as the Atari VCS, an abbreviation for Video Computer System. Following the release of the Atari 5200 in 1982, the VCS was renamed to the Atari 2600, after the unit's Atari part number, CX2600. The 2600 was typically bundled with two joystick controllers, a conjoined pair of paddle controllers, and a game cartridge, initially Combat. Ted Dabney and Nolan Bushnell developed the Atari gaming system in the 1970s. Originally operating under the name Syzygy, Bushnell and Dabney changed the name of their company to Atari in 1972. In 1973, Atari, Inc. purchased an engineering think tank called Cyan Engineering to research next-generation video game systems, and had been working on a prototype known as Stella for some time. Unlike prior generations of machines that use custom logic to play a small number of games, its core is a complete CPU. It was combined with a RAM-and-I/O chip, the MOS Technology 6532. The first two versions of the machine contain a fourth chip, a standard CMOS logic buffer IC, making Stella cost-effective. Some later versions of the console eliminated the buffer chip. Programs for small computers of the time were generally stored on cassette tapes, floppy disks, or paper tape. In 1976, Fairchild Semiconductor released their own CPU-based system, the Video Entertainment System. Stella was still not ready for production, but it was clear that it needed to be released before the market filled up with "me too" products. Atari, Inc. didn't have the cash flow to complete the system quickly. Nolan Bushnell eventually turned to Warner Communications, and sold the company to them in 1976 for US$28 million on the promise that Stella would be produced as soon as possible. Key to the success of the machine was the hiring of Jay Miner.
Once that was completed and debugged, the system was ready for shipping. The unit was originally priced at US$199, and shipped with two joysticks and a Combat cartridge. In a move to compete directly with the Channel F, Atari, Inc. named the machine the Video Computer System, as the Channel F was at that point known as the VES, for Video Entertainment System. The VCS was also rebadged as the Sears Video Arcade and sold through Sears, Roebuck. Another breakthrough for gaming systems was Atari's invention of a computer-controlled opponent, rather than the usual two-player or asymmetric challenges of the past.
Atari 2600
–
Atari 2600 four-switch "wood veneer" version, dating from 1980–82
Atari 2600
–
The second 2600 model is the "Light Sixer" which has lighter plastic molding and shielding than the 1977 launch model.
Atari 2600
–
Later 2600 models only used four front switches.
Atari 2600
–
The all black "Darth Vader" 4-switch model from 1982.
83.
Sparse distributed memory
–
Sparse distributed memory is a mathematical model of human long-term memory introduced by Pentti Kanerva in 1988 while he was at NASA Ames Research Center. It is a generalized random-access memory for long binary words, and these words serve as both addresses to and data for the memory. SDM implements a transformation from logical space to physical space using distributed data representation and storage. A value corresponding to a logical address is stored into many physical addresses. This way of storing is robust and not deterministic; a memory cell is not addressed directly. Even if input data are partially damaged, we can still get correct output data. The theory of the memory is mathematically complete and has been verified by computer simulation. It arose from the observation that the distances between points of a high-dimensional space resemble the proximity relations between concepts in human memory. The theory is also practical in that memories based on it can be implemented with conventional RAM-memory elements. Human memory has a tendency to congregate memories based on similarities between them, such as "firetrucks are red" and "apples are red". Sparse distributed memory is a mathematical representation of human memory. An important property of high-dimensional spaces is that two randomly chosen vectors are relatively far away from each other, meaning that they are uncorrelated. SDM can be considered a realization of locality-sensitive hashing. The underlying idea behind an SDM is the mapping of a huge binary memory onto a smaller set of physical locations, so-called hard locations. As a general guideline, those hard locations should be uniformly distributed in the virtual space. Every datum is stored distributed over a set of hard locations; therefore, recall may not be perfect, with accuracy depending on the saturation of the memory. Kanerva's proposal is based on four ideas:
1. The Boolean space {0, 1}^n, with 2^n points, where 100 < n < 10^5. This means that it makes sense to store data as points of the mentioned space, where each memory item is stored as an n-bit vector. 2. Neurons with n inputs can be used as address decoders of a random-access memory. 3. Unifying principle: data stored into the memory can be used as addresses to the same memory. Distance between two points is a measure of similarity between two memory items; the closer the points, the more similar the stored vectors. 4. Time can be traced in the memory as a function of where the data are stored. Depending on the context, the vectors are called points, patterns, addresses, words, memory items, data, or events.
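The claim that two randomly chosen high-dimensional binary vectors are relatively far apart can be checked empirically. The sketch below (dimension and sample count chosen arbitrarily) measures Hamming distances between random 1000-bit vectors and finds them concentrated around n/2.

```python
import random

def hamming(u, v):
    """Number of bit positions in which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))

random.seed(0)
n = 1000  # dimensionality of the binary space
pairs = [([random.getrandbits(1) for _ in range(n)],
          [random.getrandbits(1) for _ in range(n)]) for _ in range(50)]
dists = [hamming(u, v) for u, v in pairs]
# Random points concentrate around n/2 = 500 bits apart: "relatively far",
# and hence mutually uncorrelated, as the SDM theory relies on.
```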
Sparse distributed memory
–
The exponential decay function
84.
Hierarchical temporal memory
–
Hierarchical temporal memory is a biologically constrained theory of machine intelligence originally described in the 2004 book On Intelligence by Jeff Hawkins with Sandra Blakeslee. HTM is based on neuroscience and the physiology and interaction of neurons in the neocortex of the human brain. The technology has been tested and implemented in software through example applications from Numenta. At the core of HTM are learning algorithms that can store, learn, infer and recall high-order sequences. Unlike most other machine learning methods, HTM learns time-based patterns in unlabeled data on a continuous basis. HTM is robust to noise and has high capacity, meaning that it can learn multiple patterns simultaneously. When applied to computers, HTM is well suited for prediction, anomaly detection, and classification. A typical HTM network is a tree-shaped hierarchy of levels that are composed of smaller elements called nodes or columns. A single level in the hierarchy is called a region. Higher hierarchy levels often have fewer nodes and therefore less spatial resolvability; higher hierarchy levels can reuse patterns learned at the lower levels by combining them to memorize more complex patterns. Each HTM node has the same basic functionality. In learning and inference modes, sensory data comes into the bottom-level nodes; in generation mode, the bottom-level nodes output the generated pattern of a given category. When in inference mode, a node in each level interprets information coming in from its child nodes in the lower level as probabilities of the categories it has in memory. Each HTM region learns by identifying and memorizing spatial patterns - combinations of input bits that often occur at the same time. It then identifies temporal sequences of spatial patterns that are likely to occur one after another. During training, a node receives a temporal sequence of spatial patterns as its input.
The learning process consists of two stages. Spatial pooling identifies frequently observed patterns and memorizes them as coincidences; patterns that are significantly similar to each other are treated as the same coincidence. A large number of possible input patterns are thereby reduced to a manageable number of known coincidences. Temporal pooling partitions coincidences that are likely to follow each other in the training sequence into temporal groups. Each group of patterns represents a cause of the input pattern. During inference, the node calculates the set of probabilities that a pattern belongs to each known coincidence. Then it calculates the probabilities that the input represents each temporal group; the set of probabilities assigned to the groups is called a node's belief about the input pattern.
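As a rough illustration of spatial pooling only (not Numenta's actual algorithm), the toy sketch below treats input bit-patterns within a small Hamming distance of a stored coincidence as the same coincidence, and memorizes sufficiently different patterns as new ones; the threshold and patterns are made up.

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def spatial_pool(pattern, coincidences, threshold=2):
    """Toy spatial pooling: map the input to an already-memorized coincidence
    if it is within `threshold` bits of one; otherwise memorize it as new."""
    for i, c in enumerate(coincidences):
        if hamming(pattern, c) <= threshold:
            return i                      # significantly similar -> same coincidence
    coincidences.append(pattern)
    return len(coincidences) - 1          # a newly memorized coincidence

memory = []
a = spatial_pool([1, 0, 1, 1, 0, 0, 1, 0], memory)   # new pattern -> index 0
b = spatial_pool([1, 0, 1, 1, 0, 1, 1, 0], memory)   # 1 bit away  -> index 0
c = spatial_pool([0, 1, 0, 0, 1, 1, 0, 1], memory)   # far away    -> index 1
```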
Hierarchical temporal memory
–
An example of HTM hierarchy used for image recognition
85.
Pointer (computer programming)
–
In computer science, a pointer is a programming language object whose value refers to another value stored elsewhere in the computer memory using its memory address. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer. As an analogy, a page number in a book's index could be considered a pointer to the corresponding page. Pointers to data significantly improve performance for repetitive operations such as traversing strings, lookup tables, and control tables. In particular, it is often much cheaper in time and space to copy and dereference pointers than it is to copy and access the data to which the pointers point. Pointers are also used to hold the addresses of entry points for called subroutines in procedural programming; in object-oriented programming, pointers to functions are used for binding methods, often using what are called virtual method tables. A pointer is a simple, more concrete implementation of the more abstract reference data type. Several languages support some type of pointer, although some have more restrictions on their use than others. Because pointers allow both protected and unprotected access to memory addresses, there are risks associated with using them, particularly in the latter case. Other measures may also be taken. Harold Lawson is credited with the 1964 invention of the pointer. According to the Oxford English Dictionary, the word pointer first appeared in print as a pointer in a technical memorandum by the System Development Corporation. In computer science, a pointer is a kind of reference. A data primitive is any datum that can be read from or written to computer memory using one memory access. A data aggregate is a group of primitives that are contiguous in memory. In the context of these definitions, a byte is the smallest primitive; the memory address of the initial byte of a datum is considered the memory address of the entire datum.
A memory pointer is a primitive, the value of which is intended to be used as a memory address; it is also said that a pointer points to a datum when the pointer's value is the datum's memory address. More generally, a pointer is a kind of reference, and it is said that a pointer references a datum stored somewhere in memory; to obtain that datum is to dereference the pointer. The feature that separates pointers from other kinds of reference is that a pointer's value is meant to be interpreted as a memory address. References serve as a level of indirection: a pointer's value determines which memory address is to be used in a calculation. When setting up data structures like lists, queues and trees, it is necessary to have pointers to help manage how the structure is implemented and controlled. Typical examples of pointers are start pointers and end pointers; these pointers can either be absolute or relative.
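The point-to/dereference vocabulary can be made concrete with Python's ctypes module, which exposes C-style pointers; the names a and b mirror the pointer-to-variable example in the accompanying figure.

```python
import ctypes

b = ctypes.c_int(42)           # a C-style integer datum
a = ctypes.pointer(b)          # pointer a "points to" b

address = ctypes.addressof(b)  # the datum's memory address: the pointer's value
value = a.contents.value       # dereferencing the pointer yields the stored 42

a.contents.value = 99          # writing through the pointer mutates b itself
```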
Pointer (computer programming)
–
Pointer a pointing to the memory address associated with variable b. Note that in this particular diagram, the computing architecture uses the same address space and data primitive for both pointers and non-pointers; this need not be the case.
86.
Probability distribution
–
For instance, if the random variable X is used to denote the outcome of a coin toss, then the probability distribution of X would take the value 0.5 for X = heads, and 0.5 for X = tails. In more technical terms, the probability distribution is a description of a random phenomenon in terms of the probabilities of events. Examples of random phenomena can include the results of an experiment or survey. A probability distribution is defined in terms of an underlying sample space, which is the set of all possible outcomes of the random phenomenon being observed. The sample space may be the set of real numbers or a higher-dimensional vector space, or it may be a list of non-numerical values, for example. Probability distributions are generally divided into two classes. A discrete probability distribution can be encoded by a discrete list of the probabilities of the outcomes; on the other hand, a continuous probability distribution is typically described by probability density functions. The normal distribution represents a commonly encountered continuous probability distribution; more complex experiments, such as those involving stochastic processes defined in continuous time, may demand the use of more general probability measures. A probability distribution whose sample space is the set of real numbers is called univariate. Important and commonly encountered univariate probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution. The multivariate normal distribution is a commonly encountered multivariate distribution. To define probability distributions for the simplest cases, one needs to distinguish between discrete and continuous random variables. In the continuous case, the probability of any individual outcome is zero; for example, the probability that an object weighs exactly 500 g is zero. Continuous probability distributions can be described in several ways; the cumulative distribution function is the antiderivative of the probability density function provided that the latter function exists.
As probability theory is used in diverse applications, terminology is not uniform. The following terms are used for probability distribution functions. Probability distribution: a table that displays the probabilities of outcomes in a sample; it could be called a frequency distribution table, where all occurrences of outcomes sum to 1. Distribution function: a functional form of a frequency distribution table. Probability distribution function: a functional form of a probability distribution table.
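As a concrete discrete example, the probability mass function of the sum of two fair dice can be tabulated exhaustively:

```python
from fractions import Fraction
from collections import Counter

# Discrete probability distribution of the sum S of two fair dice.
counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))
pmf = {s: Fraction(c, 36) for s, c in counts.items()}

p11 = pmf[11]                                   # 2/36 = 1/18
p_gt_9 = sum(pmf[s] for s in (10, 11, 12))      # 1/12 + 1/18 + 1/36 = 1/6
total = sum(pmf.values())                       # all probabilities sum to 1
```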
Probability distribution
–
The probability mass function (pmf) p (S) specifies the probability distribution for the sum S of counts from two dice. For example, the figure shows that p (11) = 1/18. The pmf allows the computation of probabilities of events such as P (S > 9) = 1/12 + 1/18 + 1/36 = 1/6, and all other probabilities in the distribution.
87.
Kernel principal component analysis
–
In the field of multivariate statistics, kernel principal component analysis is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space. Recall that conventional PCA operates on zero-centered data; that is, (1/N) ∑_{i=1}^N x_i = 0. That is, given N points, x_i, if we map them to an N-dimensional space with Φ: R^d → R^N, it is easy to construct a hyperplane that divides the points into arbitrary clusters. Of course, this Φ creates linearly independent vectors, so there is no covariance on which to perform eigendecomposition explicitly as we would in linear PCA. The N elements in each column of K represent the dot product of one point of the transformed data with respect to all the transformed points. Some well-known kernels are shown in the example below, and we use K′ to perform the kernel PCA algorithm described above. One caveat of kernel PCA should be illustrated here. In linear PCA, we can use the eigenvalues to rank the eigenvectors based on how much of the variation of the data is captured by each principal component. This is useful for data dimensionality reduction and it could also be applied to KPCA; however, in practice there are cases in which all variations of the data are the same. This is typically caused by a wrong choice of kernel scale. In practice, a large data set leads to a large K. One way to deal with this is to perform clustering on the dataset; since even this method may yield a relatively large K, it is common to compute only the top P eigenvalues and eigenvectors of K. Consider three concentric clouds of points; we wish to use kernel PCA to identify these groups. The color of the points is not part of the algorithm, but only there to show how the data groups together before and after the transformation.
First, consider the kernel k(x, y) = (x^T y + 1)^2. Applying this to kernel PCA yields the next image. Kernel PCA has been demonstrated to be useful for novelty detection and image de-noising. See also: cluster analysis, kernel trick, multilinear PCA, multilinear subspace learning, nonlinear dimensionality reduction, spectral clustering.
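A minimal pure-Python sketch of the kernel PCA steps above, assuming an RBF kernel on two concentric rings and using power iteration in place of a full eigendecomposition; the data, kernel width, and iteration count are all illustrative choices.

```python
import math
import random

def rbf(x, y, gamma=0.5):
    """RBF kernel: an implicit dot product in a high-dimensional feature space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Two concentric rings in 2-D (stand-ins for the concentric clouds in the text).
random.seed(1)
pts = []
for i in range(30):
    r = 1.0 if i % 2 == 0 else 3.0
    t = random.uniform(0, 2 * math.pi)
    pts.append((r * math.cos(t), r * math.sin(t)))

N = len(pts)
K = [[rbf(p, q) for q in pts] for p in pts]

# Center the kernel matrix in feature space:
# K'_ij = K_ij - row_mean_i - row_mean_j + overall_mean
row_mean = [sum(K[i]) / N for i in range(N)]
overall = sum(row_mean) / N
Kc = [[K[i][j] - row_mean[i] - row_mean[j] + overall for j in range(N)]
      for i in range(N)]

# Leading eigenvector of K' by power iteration (a stand-in for the full
# eigendecomposition); its entries give the first kernel principal component.
v = [random.uniform(-1, 1) for _ in range(N)]
for _ in range(300):
    w = [sum(Kc[i][j] * v[j] for j in range(N)) for i in range(N)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# Rayleigh quotient: the variance captured by the first component.
lam = sum(v[i] * sum(Kc[i][j] * v[j] for j in range(N)) for i in range(N))
```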
Kernel principal component analysis
–
Input points before kernel PCA
88.
Validation set
–
In many areas of information science, finding predictive relationships from data is a very important task. Initial discovery of relationships is usually done with a training set, while a test set is used to evaluate them. More formally, a training set is a set of data used to discover potentially predictive relationships. A test set is a set of data used to assess the strength and utility of a predictive relationship. Test and training sets are used in intelligent systems, machine learning and genetic programming; regression analysis was one of the earliest such approaches to be developed. The data used to construct or discover a relationship are called the training data set. A test set is a set of data that is independent of the training data; if a model fit to the training set also fits the test set well, minimal overfitting has taken place. A better fitting of the training set as opposed to the test set usually points to overfitting. In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation set in addition to the training and test sets. The validation set functions as a hybrid: it is training data used for testing. Most simply, part of the training set can be set aside and used as a validation set; this is known as the holdout method, and common proportions are 70%/30% training/validation. Alternatively, this process can be repeated, repeatedly partitioning the original training set into a training set and a validation set. These can be defined as follows. Training set: a set of examples used for learning. Validation set: a set of examples used to tune the hyperparameters of a classifier. Test set: a set of examples used only to assess the performance of a fully-specified classifier. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. This approach is called the hold out method. Sometimes the training set and validation set are referred to collectively as the design set: the first part of the design set is the training set, the second part is the validation set.
Another example of parameter adjustment is hierarchical classification, which splits a complete multi-class problem into a set of smaller classification problems. It serves for learning more accurate concepts due to simpler classification boundaries in subtasks and individual feature selection procedures for subtasks. When doing classification decomposition, the central choice is the order of combination of smaller classification steps. Depending on the application, it can be derived from the confusion matrix, uncovering the reasons for typical errors.
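The holdout method described above can be sketched as follows; the 70/30 proportions match the text, and the data are dummy values.

```python
import random

def holdout_split(data, train_frac=0.7, seed=0):
    """Holdout method: shuffle, then reserve part of the data for validation."""
    rng = random.Random(seed)
    data = data[:]              # copy so the caller's list is untouched
    rng.shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]       # training set, validation set

examples = list(range(100))
train, val = holdout_split(examples)    # common 70%/30% proportions
```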
Validation set
89.
Time series
–
A time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time; thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average. Time series are very frequently plotted via line charts. Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. Time series data have a natural temporal ordering. This makes time series analysis distinct from cross-sectional studies, in which there is no natural ordering of the observations. Time series analysis is also distinct from spatial data analysis, where the observations typically relate to geographical locations. A stochastic model for a time series will generally reflect the fact that observations close together in time will be more closely related than observations further apart. Methods for time series analysis may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral analysis and wavelet analysis; the latter include auto-correlation and cross-correlation analysis. In the time domain, correlation and analysis can be made in a filter-like manner using scaled correlation. Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters. In these approaches, the task is to estimate the parameters of the model that describes the stochastic process.
By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure. Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate. A time series is one type of panel data. Panel data is the general class, a multidimensional data set, whereas a time series data set is a one-dimensional panel. A data set may exhibit characteristics of both panel data and time series data. One way to tell is to ask what makes one data record unique from the other records. If the answer is the time data field, then this is a time series data set candidate. If determining a unique record requires a time data field and an additional identifier which is unrelated to time, then the data set is a panel data candidate. If the differentiation lies on the non-time identifier, then the data set is a cross-sectional data set candidate.
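As a minimal time-domain filtering example (a simple moving average, with made-up data):

```python
def moving_average(series, window=3):
    """Simple time-domain filter: each output is the mean of the last `window` points."""
    out = []
    for i in range(window - 1, len(series)):
        out.append(sum(series[i - window + 1 : i + 1]) / window)
    return out

# A noisy upward trend: the filter smooths the noise while keeping the trend.
data = [1, 3, 2, 4, 3, 5, 4, 6]
smoothed = moving_average(data)   # [2.0, 3.0, 3.0, 4.0, 4.0, 5.0]
```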
Time series
–
Time series: random data plus trend, with best-fit line and different applied filters
90.
Blind source separation
–
This problem is in general highly underdetermined, but useful solutions can be derived under a surprising variety of conditions. Much of the literature in this field focuses on the separation of temporal signals such as audio. However, blind signal separation is now also performed on multidimensional data, such as images and tensors. The set of source signals, s = (s_1, ..., s_n)^T, is mixed using a matrix, A ∈ R^{m×n}, to produce a set of mixed signals, x = (x_1, ..., x_m)^T. Usually, n is equal to m. If m > n, then the system of equations is overdetermined and thus can be unmixed using a conventional linear method. If n > m, the system is underdetermined and a non-linear method must be employed to recover the unmixed signals. The signals themselves can be multidimensional. The mixing equation x = A · s is effectively inverted as follows: blind source separation separates the set of mixed signals, x, through the determination of an unmixing matrix, B ∈ R^{n×m}, to recover an approximation of the original signals, y = B · x. At a cocktail party, there is a group of people talking at the same time. You have multiple microphones picking up mixed signals, but you want to isolate the speech of a single person. BSS can be used to separate the individual sources from the mixed signals. Figure 2 shows the concept of BSS. The individual source signals are shown as well as the mixed signals which are received. BSS is used to separate the mixed signals knowing only the mixed signals. The separated signals are only approximations of the source signals. The separated images were produced using Python and the Shogun toolbox, using the Joint Approximation Diagonalization of Eigen-matrices (JADE) algorithm, which is based on independent component analysis (ICA). This toolbox method can be used with multiple dimensions, but for a visual aspect images were used. Brain imaging is another application for BSS.
In electroencephalography and magnetoencephalography, the interference from muscle activity masks the desired signal from brain activity. BSS, however, can be used to separate the two, so an accurate representation of brain activity may be achieved. A second approach, exemplified by nonnegative matrix factorization, is to impose constraints on the source signals.
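The mixing/unmixing algebra x = A·s and y = B·x can be illustrated for the exactly determined case m = n = 2. Note this is not blind separation: here A is known and B is simply its inverse, whereas a real BSS method such as ICA must estimate B from x alone. All values are made up.

```python
# Two source signals over four time steps.
s = [[1.0, 2.0, 3.0, 4.0],
     [4.0, 3.0, 2.0, 1.0]]
A = [[0.8, 0.3],            # known 2x2 mixing matrix
     [0.2, 0.7]]

# Mixing: x = A . s
x = [[A[i][0] * s[0][t] + A[i][1] * s[1][t] for t in range(4)] for i in range(2)]

# Unmixing matrix: B = A^-1 (exactly determined, so inversion suffices).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
B = [[ A[1][1] / det, -A[0][1] / det],
     [-A[1][0] / det,  A[0][0] / det]]

# Recovery: y = B . x reproduces the sources.
y = [[B[i][0] * x[0][t] + B[i][1] * x[1][t] for t in range(4)] for i in range(2)]
```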
91.
Control engineering
–
Control engineering or control systems engineering is the engineering discipline that applies control theory to design systems with desired behaviors. When a device is designed to perform without the need of human inputs for correction it is called automatic control. Multi-disciplinary in nature, control systems engineering activities focus on the implementation of control systems, mainly derived by mathematical modeling of a diverse range of systems. Modern day control engineering is a relatively new field of study that gained significant attention during the 20th century with the advancement of technology. It can be defined or classified as the practical application of control theory. Control engineering has an essential role in a wide range of control systems. Automatic control systems were first developed over two thousand years ago. The first feedback control device on record is thought to be the ancient Ktesibios's water clock in Alexandria; it kept time by regulating the water level in a vessel and, therefore, the water flow from that vessel. This certainly was a successful device, as water clocks of similar design were still being made in Baghdad when the Mongols captured the city in 1258 A.D. A variety of automatic devices have been used over the centuries to accomplish useful tasks or simply to entertain. In his 1868 paper On Governors, James Clerk Maxwell was able to explain instabilities exhibited by the flyball governor using differential equations to describe the control system. This demonstrated the importance and usefulness of mathematical models and methods in understanding complex phenomena. Elements of control theory had appeared earlier but not as dramatically and convincingly as in Maxwell's analysis. Control theory made significant strides over the next century. In the very first control relationships, a current output was represented by a control input.
However, not having adequate technology to implement electrical control systems, designers were left with the option of less efficient and slowly responding mechanical systems. A very effective mechanical controller that is still widely used in some hydro plants is the governor. There are two major divisions in control theory, namely, classical and modern, which have direct implications for the control engineering applications. The scope of classical control theory is limited to single-input and single-output system design. Many systems may be assumed to have a second-order and single-variable system response in the time domain. A controller designed using classical theory often requires on-site tuning due to incorrect design approximations. The most common controllers designed using classical control theory are PID controllers.
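A minimal sketch of the PID controller mentioned above, driving a trivial first-order plant toward a set point; all gains and the plant model are arbitrary illustration values.

```python
def pid_step(error, state, kp=1.0, ki=0.2, kd=0.05, dt=0.1):
    """One update of a textbook PID controller; the gains here are arbitrary."""
    integral, prev_error = state
    integral += error * dt                     # accumulated (integral) term
    derivative = (error - prev_error) / dt     # rate-of-change (derivative) term
    output = kp * error + ki * integral + kd * derivative
    return output, (integral, error)

# Drive a toy first-order plant toward the set point 1.0.
set_point, value, state = 1.0, 0.0, (0.0, 0.0)
for _ in range(1000):
    u, state = pid_step(set_point - value, state)
    value += u * 0.05          # toy plant: the control input nudges the value
```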
Control engineering
–
Control systems play a critical role in space flight
92.
Computer numerical control
–
Computer numerical control is the automation of machine tools by means of computers executing pre-programmed sequences of machine control commands. This is in contrast to machines that are manually controlled by hand wheels or levers. In modern CNC systems, the design of a mechanical part and its manufacturing program are highly automated. The part's mechanical dimensions are defined using computer-aided design software, and then translated into manufacturing directives by computer-aided manufacturing software. The resulting directives are transformed into the specific commands necessary for a particular machine to produce the component. Since any particular component might require the use of a number of different tools – drills, saws, and so on – modern machines often combine multiple tools into a single cell. In other installations, a number of different machines are used with an external controller and human or robotic operators that move the component from machine to machine. In either case, the series of steps needed to produce any part is highly automated and produces a part that closely matches the original CAD design. The first NC machines were built in the 1940s and 1950s; these early servomechanisms were rapidly augmented with analog and digital computers, creating the modern CNC machine tools that have revolutionized the machining processes. Motion is controlled along multiple axes, normally at least two, and a tool spindle that moves in the Z axis. The position of the tool is driven by direct-drive stepper motors or servo motors in order to provide highly accurate movements, or in older designs, by motors through a series of step-down gears. Open-loop control works as long as the forces are kept small enough and speeds are not too great. On commercial metalworking machines, closed-loop controls are standard and required in order to provide the accuracy, speed, and repeatability demanded. As the controller hardware evolved, the mills themselves also evolved. Most new CNC systems built today are 100% electronically controlled. CNC-like systems are now used for any process that can be described as a series of movements and operations.
CNC mills use computer controls to cut different materials. They are able to translate programs consisting of specific numbers and letters to move the spindle to various locations and depths. Many use G-code, a standardized programming language that many CNC machines understand, while others use proprietary languages; these proprietary languages, while often simpler than G-code, are not transferable to other machines. CNC mills have many functions including face milling, shoulder milling, tapping, and drilling, and some even offer turning. Standard linear CNC mills are limited to three axes, but others may also have one or more rotational axes; today, CNC mills can have four to six axes. Lathes are machines that cut workpieces while they are rotated. CNC lathes are able to make fast, precision cuts, generally using indexable tools and drills. They are particularly effective for complicated programs designed to make parts that would be infeasible to make on manual lathes. CNC lathes have similar control specifications to CNC mills and can often read G-code as well as the manufacturer's proprietary programming language. CNC lathes generally have two axes, but newer models have more axes, allowing for more advanced jobs to be machined. Plasma cutting involves cutting a material using a plasma torch.
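The idea of a machining program as "a series of movements and operations" can be illustrated with a toy interpreter. The sketch below handles only simple G0/G1 motion words and tracks the resulting tool position; a real CNC controller handles far more (feed rates, arcs, tool changes, offsets), and the sample program here is invented for illustration.

```python
# Toy G-code motion parser: tracks the tool position for simple G0 (rapid)
# and G1 (linear feed) moves. This is a sketch of the "series of movements"
# idea, not a real controller.

def run_gcode(program):
    x = y = z = 0.0
    positions = []
    for line in program.strip().splitlines():
        words = line.split()
        if not words or words[0] not in ("G0", "G1"):
            continue  # ignore commands this toy parser does not support
        for word in words[1:]:
            axis, value = word[0], float(word[1:])
            if axis == "X":
                x = value
            elif axis == "Y":
                y = value
            elif axis == "Z":
                z = value
        positions.append((x, y, z))
    return positions

program = """
G0 X0 Y0 Z5
G1 Z-1
G1 X10 Y5
"""
print(run_gcode(program))
# [(0.0, 0.0, 5.0), (0.0, 0.0, -1.0), (10.0, 5.0, -1.0)]
```

Each parsed line updates only the axes it names, mirroring how modal G-code words carry previous axis values forward.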
Computer numerical control
–
A CNC turning center
Computer numerical control
–
A Tsugami multifunction turn mill machine used for short runs of complex parts.
Computer numerical control
–
Main articles
93.
Process control
–
Process control is an engineering discipline that deals with architectures, mechanisms, and algorithms for maintaining the output of a specific process within a desired range. For instance, the temperature of a chemical reactor may be controlled to maintain a consistent product output. Process control enables automation, by which a small staff of operating personnel can operate a complex process from a central control room. Process control may either use feedback or be open loop. Control may also be continuous or cause a sequence of discrete events, such as a timer on a lawn sprinkler or controls on an elevator. A thermostat on a heater is an example of control that is on or off: a temperature sensor turns the heat source on if the temperature falls below the set point and turns the heat source off when the set point is reached. There is no measurement of the difference between the set point and the measured temperature, and no adjustment to the rate at which heat is added other than all or none. A familiar example of feedback control is cruise control on an automobile. Here speed is the measured variable: the operator adjusts the desired speed set point, and the controller monitors the speed sensor and compares the measured speed to the set point. The controller makes adjustments having information only about the error, although settings known as tuning are used to achieve stable control; the operation of such controllers is the subject of control theory. In a programmable logic controller (PLC) implementation, an input reads the sensor and a PLC output then calculates an incremental amount of change in, for example, a valve position. Larger, more complex systems can be controlled by process control systems such as a Distributed Control System (DCS) or SCADA. The accompanying diagram is a model which shows functional manufacturing levels in a large process using computerised control. Level 2 contains the supervisory computers, which collate information from processor nodes on the system. Level 3 is the production control level, which does not directly control the process but is concerned with monitoring production and targets. 
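The contrast drawn above between on/off (thermostat-style) control and error-proportional (cruise-control-style) control can be sketched in a few lines. The heater model, set points, and gain below are illustrative values, not taken from any real plant.

```python
# On/off (bang-bang) control vs. proportional control, as described above.
# Outputs are normalized actuator commands in the range 0..1.

def bang_bang(temp, set_point):
    """Heater fully on below the set point, fully off at or above it."""
    return 1.0 if temp < set_point else 0.0

def proportional(temp, set_point, gain=0.5):
    """Output scales with the error (set point minus measurement),
    clamped to the actuator range 0..1. The gain is a tuning setting."""
    error = set_point - temp
    return max(0.0, min(1.0, gain * error))

print(bang_bang(18.0, 20.0))     # 1.0: heat on, regardless of how far off
print(proportional(18.0, 20.0))  # 1.0: large error saturates the actuator
print(proportional(19.5, 20.0))  # 0.25: small error, gentle correction
```

The bang-bang controller has no notion of how large the error is; the proportional controller does, which is what "adjustments having information only about the error" refers to.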
Processes can be characterized as one or more of the following forms. Discrete: found in many manufacturing, motion, and packaging applications. Robotic assembly, such as that found in automotive production, can be characterized as discrete process control; most discrete manufacturing involves the production of discrete pieces of product. Batch: some applications require that specific quantities of raw materials be combined in specific ways for particular durations to produce an intermediate or end result. One example is the production of adhesives and glues, which requires the mixing of raw materials in a heated vessel for a period of time to form a quantity of end product. Other important examples are the production of food, beverages, and medicine. Batch processes are generally used to produce a relatively low to intermediate quantity of product per year.
Process control
–
Control panel of a nuclear reactor.
Process control
–
Example of control system of a continuous stirred-tank reactor.
94.
Poker
–
Poker is a family of card games that combine gambling, strategy, and skill. Poker games vary in the number of cards dealt, the number of shared or community cards, the number of cards that remain hidden, and the betting procedures. In most modern games, the first round of betting begins with one or more of the players making some form of a forced bet. In standard poker, each player bets according to the rank they believe their hand is worth as compared to the other players. The action then proceeds clockwise as each player in turn must either match (call) the maximum previous bet or fold; a player who matches a bet may also raise, or increase, the bet. The betting round ends when all players have either called the last bet or folded. If all but one player folds on any round, the remaining player collects the pot without being required to reveal their hand. If more than one player remains in contention after the final betting round, a showdown takes place where the hands are revealed. By the 1990s some gaming historians, including David Parlett, started to challenge the notion that poker is a derivative of As-Nas. There is evidence that a game called poque, a French game similar to poker, was played around the region where poker is said to have originated. The name of the game likely descended from the Irish poca or even the French poque, yet it is not clear whether the origins of poker itself lie with the games bearing those names. It is commonly regarded as sharing ancestry with the Renaissance game of primero, and the English game brag clearly descended from brelan and incorporated bluffing. It is quite possible that all of these earlier games influenced the development of poker as it exists now, but the unique features of poker have to do with the betting, and do not appear in any known older game. In this view poker originated much later, in the early or mid-18th century, and it was played in a variety of forms, with 52 cards, and included both straight poker and stud. 
Twenty-card poker was a variant for two players; the development of poker is linked to the historical movement that also saw the invention of commercial gambling. English actor Joseph Crowell reported that the game was played in New Orleans in 1829, with a deck of 20 cards, and four players betting on which player's hand was the most valuable. The game then spread north along the Mississippi River and to the West during the gold rush. Soon after this spread, the full 52-card French deck was used and the flush was introduced, and the draw was added prior to 1850. During the American Civil War, many additions were made, including stud poker and the straight.
Poker
–
A game of Texas hold 'em in progress. "Hold 'em" is a popular form of poker.
Poker
–
Officers of the 114th Pennsylvania Infantry playing cards in front of tents. Petersburg, Virginia, August 1864
Poker
–
2006 WSOP Main Event table
95.
Medical diagnosis
–
Medical diagnosis is the process of determining which disease or condition explains a person's symptoms and signs. It is most often referred to simply as diagnosis, with the medical context being implicit. The information required for diagnosis is typically collected from a history and physical examination of the person seeking medical care; often, one or more diagnostic procedures, such as diagnostic tests, are also done during the process. Sometimes posthumous diagnosis is considered a kind of medical diagnosis. Diagnosis is often challenging because many signs and symptoms are nonspecific. For example, redness of the skin, by itself, is a sign of many disorders; thus differential diagnosis, in which several possible explanations are compared and contrasted, must be performed. This involves the correlation of various pieces of information followed by the recognition and differentiation of patterns; occasionally the process is made easy by a sign or symptom that is pathognomonic. Diagnosis is a major component of the procedure of a doctor's visit. From the point of view of statistics, the diagnostic procedure involves classification tests. The first recorded examples of medical diagnosis are found in the writings of Imhotep in ancient Egypt, and a Babylonian medical textbook, the Diagnostic Handbook written by Esagil-kin-apli, introduced the use of empiricism, logic, and rationality in the diagnosis of an illness or disease. Traditional Chinese medicine, as described in the Yellow Emperor's Inner Canon or Huangdi Neijing, specified four diagnostic methods: inspection, auscultation-olfaction, interrogation, and palpation. Hippocrates was known to make diagnoses by tasting his patients' urine and smelling their sweat. A diagnosis may be performed by various healthcare professionals; this article uses diagnostician for any of these person categories. A diagnostic procedure does not necessarily involve elucidation of the etiology of the diseases or conditions of interest, that is, what caused the disease or condition. 
Such elucidation can be useful to optimize treatment, further specify the prognosis, or prevent recurrence of the disease or condition in the future. The initial task is to detect a medical indication to perform a diagnostic procedure. Indications include detection of any deviation from what is known to be normal, such as can be described in terms of, for example, anatomy, physiology, pathology, or psychology, as well as a complaint expressed by a patient. The fact that a patient has sought a diagnostician can itself be an indication to perform a diagnostic procedure. Even during an already ongoing diagnostic procedure, there can be an indication to perform another, separate, diagnostic procedure for another, potentially concomitant, disease or condition. A diagnostic test is any kind of medical test performed to aid in the diagnosis or detection of disease; diagnostic tests can also be used to provide prognostic information on people with established disease. Processing of the answers, findings, or other results follows, and consultations with other providers and specialists in the field may be sought. There are a number of methods or techniques that can be used in a diagnostic procedure; in reality, a diagnostic procedure may involve components of multiple methods. The final result may also remain a list of possible conditions; the resultant diagnostic opinion by this method can be regarded more or less as a diagnosis of exclusion.
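Viewing a diagnostic test as a classification test, as the statistics remark above suggests, leads naturally to Bayes' theorem: the probability of disease after a positive result depends on the test's sensitivity and specificity and on the disease's prevalence. The numbers below are illustrative, not from any real test.

```python
# Positive predictive value (PPV) of a diagnostic test via Bayes' theorem.
# sensitivity = P(test+ | disease), specificity = P(test- | no disease),
# prevalence = P(disease) in the tested population.

def positive_predictive_value(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# A fairly accurate test for a rare condition still produces many false
# alarms: here only about 1 in 6 positive results reflects true disease.
ppv = positive_predictive_value(0.99, 0.95, 0.01)
print(round(ppv, 3))  # 0.167
```

This is one reason differential diagnosis weighs test results against prior likelihood rather than taking a single positive result at face value.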
Medical diagnosis
–
Radiography is an important tool in diagnosis of certain disorders.
96.
E-mail spam
–
Email spam, also known as junk email, is a type of electronic spam where unsolicited messages are sent by email. Spam email may also include malware as scripts or other executable file attachments. Spam is named after Spam luncheon meat, by way of a Monty Python sketch in which Spam is ubiquitous, unavoidable, and repetitive. Email spam has steadily grown since the early 1990s. Botnets, networks of virus-infected computers, are used to send about 80% of spam. Since the expense of the spam is borne mostly by the recipient, it is effectively a form of postage-due advertising. The legal status of spam varies from one jurisdiction to another. In the United States, spam was declared to be legal by the CAN-SPAM Act of 2003, provided the message adheres to rules set by the Act and by the FTC. ISPs have attempted to recover the cost of spam through lawsuits against spammers. Spammers collect email addresses from chatrooms, websites, customer lists, newsgroups, and viruses that harvest users' address books; these collected email addresses are also sold to other spammers. The proportion of spam has been estimated at around 80% of email messages sent. From the beginning of the Internet, the sending of junk email has been prohibited. Gary Thuerk sent the first email spam message in 1978 to 600 people; he was reprimanded and told not to do it again. The ban on spam is enforced by the Terms of Service/Acceptable Use Policy of internet service providers, and it was estimated in 2009 that spam cost businesses around US$130 billion. As the scale of the problem has grown, ISPs and the public have turned to government for relief from spam. Spam has several definitions varying by source: unsolicited bulk email is unsolicited email sent in large quantities, while unsolicited commercial email is a more restrictive definition used by regulators whose mandate is to regulate commerce. Many spam emails contain URLs to a website or websites. 
According to a Cyberoam report in 2014, there are an average of 54 billion spam messages sent every day. Pharmaceutical products jumped up 45% from the previous quarter's analysis, leading that quarter's spam pack. Emails purporting to offer jobs with fast, easy cash come in at number two, accounting for approximately 15% of all spam email. Rounding off at number three are spam emails about diet products, accounting for approximately 1%, according to information compiled by Commtouch Software Ltd. for the first quarter of 2010. Categories include advance-fee fraud spam, such as the Nigerian 419 scam. Organized spam gangs operate from sites set up by the Russian mafia, with turf battles and revenge killings sometimes resulting. Spam is also a medium for fraudsters to scam users into entering personal information on fake websites, using emails forged to look like they are from banks or other organizations such as PayPal.
E-mail spam
–
An email box folder filled with spam messages.
97.
Information theory
–
Information theory studies the quantification, storage, and communication of information. A key measure in information theory is entropy, which quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. For example, identifying the outcome of a fair coin flip (two equally likely outcomes) provides less information than specifying the outcome of a roll of a die (six equally likely outcomes). Some other important measures in information theory are mutual information, channel capacity, and error exponents. Applications of fundamental topics of information theory include lossless data compression, lossy data compression, and channel coding. The field is at the intersection of mathematics, statistics, computer science, physics, neurobiology, and electrical engineering. Information theory studies the transmission, processing, utilization, and extraction of information; abstractly, information can be thought of as the resolution of uncertainty. Information theory is a broad and deep mathematical theory, with equally broad and deep applications, amongst which is the vital field of coding theory. These codes can be subdivided into data compression and error-correction techniques; in the latter case, it took many years to find the methods Shannon's work proved were possible. A third class of information theory codes are cryptographic algorithms; concepts, methods, and results from coding theory and information theory are widely used in cryptography and cryptanalysis (see the article on the ban for a historical application). Information theory is also used in information retrieval, intelligence gathering, gambling, statistics, and even musical composition. The landmark event was the publication of Claude Shannon's 1948 paper "A Mathematical Theory of Communication". Prior to this paper, limited information-theoretic ideas had been developed at Bell Labs, notably by Harry Nyquist and Ralph Hartley; in Hartley's formulation the unit of information was the decimal digit, much later renamed the hartley in his honour as a unit, scale, or measure of information. Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German second world war Enigma ciphers. 
Much of the mathematics behind information theory with events of different probabilities was developed for the field of thermodynamics by Ludwig Boltzmann. Information theory is based on probability theory and statistics, and it often concerns itself with measures of information of the distributions associated with random variables. Important quantities of information are entropy, a measure of information in a single random variable, and mutual information, a measure of information in common between two random variables. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. A common unit of information is the bit, based on the binary logarithm; other units include the nat, which is based on the natural logarithm, and the hartley, which is based on the common logarithm. In what follows, an expression of the form p log p is considered by convention to be equal to zero whenever p = 0; this is justified because lim_{p→0+} p log p = 0 for any logarithmic base.
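The entropy definition and the p log p = 0 convention above can be computed directly. The sketch below uses base 2, so the result is in bits; the coin and die examples match the comparison made earlier in this entry.

```python
import math

# Shannon entropy of a discrete probability distribution. Terms with p = 0
# are skipped, implementing the convention p * log(p) = 0 when p = 0.
# Base 2 gives bits; base e would give nats, base 10 hartleys.

def entropy(probs, base=2):
    return sum(-p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # 1.0 bit: a fair coin flip
print(round(entropy([1/6] * 6), 3))  # 2.585 bits: a fair six-sided die
print(entropy([1.0, 0.0]))           # 0.0 bits: a certain outcome
```

The die carries more information per outcome than the coin, exactly as the text states, because it resolves more uncertainty.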
Information theory
–
A picture showing scratches on the readable surface of a CD-R. Music and data CDs are coded using error correcting codes and thus can still be read even if they have minor scratches using error detection and correction.
Information theory
–
Entropy of a Bernoulli trial as a function of success probability, often called the binary entropy function. The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.
98.
Boolean algebra
–
In mathematics and mathematical logic, Boolean algebra is the branch of algebra in which the values of the variables are the truth values true and false, usually denoted 1 and 0 respectively. It is thus a formalism for describing logical relations in the way that ordinary algebra describes numeric relations. Boolean algebra was introduced by George Boole in his first book, The Mathematical Analysis of Logic; according to Huntington, the term Boolean algebra was first suggested by Sheffer in 1913. Boolean algebra has been fundamental in the development of digital electronics, and it is also used in set theory and statistics. Boole's algebra predated the modern developments in abstract algebra and mathematical logic. In an abstract setting, Boolean algebra was perfected in the late 19th century by Jevons, Schröder, Huntington, and others; in fact, M. H. Stone proved in 1936 that every Boolean algebra is isomorphic to a field of sets. In the 1930s, while studying switching circuits, Claude Shannon observed that the rules of this algebra applied; Shannon already had at his disposal the abstract mathematical apparatus, and thus he cast his switching algebra as the two-element Boolean algebra. In circuit engineering settings today, there is little need to consider other Boolean algebras, thus switching algebra and Boolean algebra are often used interchangeably. Efficient implementation of Boolean functions is a fundamental problem in the design of combinational logic circuits. Logic sentences that can be expressed in classical propositional calculus have an equivalent expression in Boolean algebra; thus, Boolean logic is sometimes used to denote propositional calculus performed in this way. However, Boolean algebra is not sufficient to capture logic formulas using quantifiers. The closely related model of computation known as a Boolean circuit relates time complexity to circuit complexity. Whereas in elementary algebra expressions denote mainly numbers, in Boolean algebra they denote the truth values false and true; these values are represented with the bits 0 and 1. 
Addition and multiplication then play the Boolean roles of XOR (exclusive or) and AND respectively. Boolean algebra also deals with functions which have their values in the set {0, 1}; a sequence of bits is a commonly used example of such a function. Another common example is the subsets of a set E: to a subset F of E is associated the indicator function that takes the value 1 on F and 0 outside F. The most general example is the elements of a Boolean algebra. As with elementary algebra, the purely equational part of the theory may be developed without considering explicit values for the variables. The basic operations of Boolean calculus are as follows. AND, denoted x∧y, satisfies x∧y = 1 if x = y = 1 and x∧y = 0 otherwise. OR, denoted x∨y, satisfies x∨y = 0 if x = y = 0 and x∨y = 1 otherwise. NOT, denoted ¬x, satisfies ¬x = 0 if x = 1 and ¬x = 1 if x = 0. Alternatively the values of x∧y, x∨y, and ¬x can be expressed by tabulating their values with truth tables. A further operation, x → y, also written Cxy, is called material implication: if x is true, then the value of x → y is taken to be that of y; if x is false, then the value of x → y is true.
Boolean algebra
–
Figure 2. Venn diagrams for conjunction, disjunction, and complement
99.
Universal Turing Machine
–
In computer science, a universal Turing machine (UTM) is a Turing machine that can simulate an arbitrary Turing machine on arbitrary input. The universal machine essentially achieves this by reading both the description of the machine to be simulated and the input thereof from its own tape. Alan Turing introduced this machine in 1936–1937. It is also known as a universal computing machine, universal machine, machine U, or simply U. In terms of computational complexity, a multi-tape universal Turing machine need only be slower by a logarithmic factor compared to the machines it simulates. Every Turing machine computes a certain fixed partial computable function from the strings over its alphabet; in that sense it behaves like a computer with a fixed program. However, we can encode the action table of any Turing machine in a string. Turing described such a construction in complete detail in his 1936 paper. It has been argued that the modern stored-program computer is an incarnation of the universal Turing machine, and that John von Neumann based the central concept of the modern computer on the work of Alan Turing. Davis makes a case that Turing's Automatic Computing Engine (ACE) computer anticipated the notions of microprogramming; Knuth cites Turing's work on the ACE computer as designing hardware to facilitate subroutine linkage, and Davis also references this work as Turing's use of a hardware stack. As the Turing machine was encouraging the construction of computers, the UTM was encouraging the development of the computer sciences. An early, if not the very first, assembler was proposed by a young hot-shot programmer for the EDVAC. Knuth observes that the subroutine return embedded in the program itself rather than in special registers is attributable to von Neumann and Goldstine. Knuth furthermore states that the first interpretive routine may be said to be the Universal Turing Machine, and that interpretive routines in the conventional sense were mentioned by John Mauchly in his lectures at the Moore School in 1946. 
Turing took part in this development as well; interpretive systems for the Pilot ACE computer were written under his direction. Davis briefly mentions operating systems and compilers as outcomes of the notion of program-as-data. Some, however, might raise issues with this assessment. At the time, a small cadre of researchers were intimately involved with the architecture of the new digital computers, yet these two aspects of theory and practice were developed almost entirely independently of each other. The main reason is undoubtedly that logicians are interested in questions radically different from those with which the applied mathematicians and electrical engineers are primarily concerned; it cannot, however, fail to strike one as rather strange that often the same concepts are expressed by different terms in the two developments. Wang hoped that his paper would connect the two approaches. Indeed, Minsky confirms this: the first formulation of Turing-machine theory in computer-like models appears in Wang. Minsky goes on to demonstrate the Turing equivalence of a counter machine. With respect to the reduction of computers to simple Turing-equivalent models, Minsky's designation of Wang as having made the first formulation is open to debate: the names of mathematicians Hermes and Kaphengst appear in the bibliographies of both Shepherdson–Sturgis and Elgot–Robinson. Two other names of importance are the Canadian researchers Melzak and Lambek.
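The key idea above, that a machine's action table is just data, can be made concrete with a small simulator: the same interpreter runs any machine whose table it is handed. The example machine below, a unary incrementer, is an illustration invented here, not one from Turing's paper.

```python
# A minimal Turing machine simulator. The action table is ordinary data:
# (state, symbol) -> (symbol to write, head move, next state). Feeding a
# different table to the same simulator runs a different machine, which is
# the essence of the universal machine's program-as-data idea.

def run_tm(table, tape, state="start", blank="_", max_steps=1000):
    cells = dict(enumerate(tape))  # sparse tape, blank elsewhere
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        write, move, state = table[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# Example machine: scan right over 1s, append a 1 at the first blank, halt.
increment = {
    ("start", "1"): ("1", "R", "start"),
    ("start", "_"): ("1", "R", "halt"),
}
print(run_tm(increment, "111"))  # 1111
```

A universal machine goes one step further: it stores a string encoding of a table like `increment` on its own tape and decodes it as it runs.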
Universal Turing Machine
–
Turing machines
100.
Eight queens puzzle
–
The eight queens puzzle is the problem of placing eight chess queens on an 8×8 chessboard so that no two queens threaten each other. Thus, a solution requires that no two queens share the same row, column, or diagonal. Chess composer Max Bezzel published the eight queens puzzle in 1848, and Franz Nauck published the first solutions in 1850. Nauck also extended the puzzle to the n queens problem, with n queens on a chessboard of n × n squares. Since then, many mathematicians, including Carl Friedrich Gauss, have worked on both the eight queens puzzle and its generalized n-queens version. In 1874, S. Gunther proposed a method using determinants to find solutions, and in 1972, Edsger Dijkstra used this problem to illustrate the power of what he called structured programming. It is possible to use shortcuts that reduce computational requirements or rules of thumb that avoid brute-force computational techniques; generating permutations (one queen per row and per column) reduces the possibilities to just 40,320, which are then checked for diagonal attacks. Martin Richards published a program to count solutions to the problem using bitwise operations. The eight queens puzzle has 92 distinct solutions; if solutions that differ only by the symmetry operations of rotation and reflection of the board are counted as one, the puzzle has 12 solutions. These are called fundamental solutions; representatives of each are shown below. A fundamental solution usually has eight variants, obtained by rotating 90°, 180°, or 270° and then reflecting each of the four rotational variants. However, should a solution be equivalent to its own 90° rotation, that fundamental solution will have only two variants; should a solution be equivalent to its own 180° rotation, it will have four variants. If n > 1, it is not possible for a solution to be equivalent to its own reflection, because that would require two queens to be facing each other. The different fundamental solutions are presented below; solution 10 has the property that no three queens are in a straight line. 
These brute-force algorithms to count the number of solutions are computationally manageable for n = 8. If the goal is to find a single solution, explicit solutions exist for all n ≥ 4, requiring no combinatorial search whatsoever. The explicit solutions exhibit stair-stepped patterns, as in the examples for n = 8, 9, and 10. Let (i, j) be the square in column i and row j on the n × n chessboard; which stair-step placement formula applies depends on whether n is even or odd and on the remainder of n on division by 6, and when n is odd, one of the even patterns is used for n − 1 and an extra queen is added in the last column and row. One version of the construction writes a list of the even numbers from 2 to n followed by the odd numbers from 1 to n, adjusted according to the remainder of n divided by 12: if the remainder is 2, swap 1 and 3 in the odd list and move 5 to the end; if the remainder is 3, move 2 to the end of the even list and 1 and 3 to the end of the odd list (other remainders call for similar adjustments). Append the odd list to the even list and place queens in the rows given by these numbers; for n = 8 this results in fundamental solution 1 above.
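The permutation-based count described above is easy to carry out: a permutation already guarantees one queen per row and column, so only the 40,320 candidates' diagonals need checking.

```python
from itertools import permutations

# Count n-queens solutions by brute force over permutations, as described:
# perm[i] is the row of the queen in column i, so rows and columns are
# automatically distinct and only the two diagonal directions remain.

def count_solutions(n=8):
    count = 0
    for perm in permutations(range(n)):
        if (len({perm[i] + i for i in range(n)}) == n and
                len({perm[i] - i for i in range(n)}) == n):
            count += 1  # all diagonal sums/differences distinct: no attacks
    return count

print(count_solutions(8))  # 92
```

The two set comprehensions check the "/" and "\" diagonals: two queens attack diagonally exactly when their row+column sums or row−column differences coincide.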
Eight queens puzzle
101.
Travelling salesman problem
–
The travelling salesman problem (TSP) asks the following question: given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city? It is an NP-hard problem in combinatorial optimization, important in operations research and theoretical computer science. TSP is a special case of the travelling purchaser problem and of the vehicle routing problem. In the theory of computational complexity, the decision version of the TSP belongs to the class of NP-complete problems; thus, it is possible that the worst-case running time for any algorithm for the TSP increases superpolynomially with the number of cities. The problem was first formulated in 1930 and is one of the most intensively studied problems in optimization, and it is used as a benchmark for many optimization methods. The TSP has several applications even in its purest formulation, such as planning and logistics; slightly modified, it appears as a sub-problem in many areas, such as DNA sequencing. The TSP also appears in astronomy, as astronomers observing many sources will want to minimize the time spent moving the telescope between the sources. In many applications, additional constraints such as limited resources or time windows may be imposed. The origins of the travelling salesman problem are unclear. A handbook for travelling salesmen from 1832 mentions the problem and includes example tours through Germany and Switzerland. The travelling salesman problem was mathematically formulated in the 1800s by the Irish mathematician W. R. Hamilton and by the British mathematician Thomas Kirkman; Hamilton's Icosian Game was a puzzle based on finding a Hamiltonian cycle. Hassler Whitney at Princeton University introduced the name travelling salesman problem soon after. Dantzig, Fulkerson, and Johnson speculated that, given a near-optimal solution, one may be able to find optimality or prove optimality by adding a small number of extra inequalities (cuts). They used this idea to solve their initial 49-city problem using a string model, and they found they only needed 26 cuts to come to a solution for their 49-city problem. 
As well as cutting-plane methods, Dantzig, Fulkerson, and Johnson used branch-and-bound algorithms, perhaps for the first time. In the following decades, the problem was studied by many researchers from mathematics, computer science, chemistry, physics, and other sciences. Christofides made a big advance in the approach of giving algorithms with a known worst-case guarantee: his algorithm, given in 1976, produces a tour at worst 1.5 times longer than the optimal solution. As the algorithm was so simple and quick, many hoped it would give way to a near-optimal solution method; however, its guarantee stood until 2011, when it was beaten by less than a billionth of a percent. Richard M. Karp showed in 1972 that the Hamiltonian cycle problem was NP-complete, which implies the NP-hardness of TSP; this supplied a mathematical explanation for the apparent computational difficulty of finding optimal tours. Great progress was made in the late 1970s and 1980, when Grötschel, Padberg, Rinaldi, and others managed to exactly solve instances with up to 2,392 cities, using cutting planes and branch-and-bound. In the 1990s, Applegate, Bixby, Chvátal, and Cook developed the program Concorde, which has been used in many recent record solutions.
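The superpolynomial growth mentioned above is easy to feel with the most naive exact method, exhaustive search over all tours. Checking (n−1)! tours is feasible only for tiny n, which is why the cutting-plane and branch-and-bound machinery described in this entry matters; the distance matrix below is an arbitrary example.

```python
from itertools import permutations

# Exhaustive search for a tiny symmetric TSP instance. Fixing city 0 as the
# start removes rotations of the same tour from the search space.

def tour_length(tour, dist):
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
               for i in range(len(tour)))

def brute_force_tsp(dist):
    n = len(dist)
    best = min(permutations(range(1, n)),
               key=lambda p: tour_length((0,) + p, dist))
    tour = (0,) + best
    return tour, tour_length(tour, dist)

dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
tour, length = brute_force_tsp(dist)
print(tour, length)  # an optimal tour of length 18
```

For this 4-city instance only 3! = 6 tours are examined; at 20 cities the same approach would face about 1.2 × 10^17 tours.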
Travelling salesman problem
–
William Rowan Hamilton
Travelling salesman problem
–
Solution of a travelling salesman problem
102.
Integer factorization
–
In number theory, integer factorization is the decomposition of a composite number into a product of smaller integers. If these integers are further restricted to prime numbers, the process is called prime factorization. When the numbers are sufficiently large, no efficient, non-quantum integer factorization algorithm is known; however, it has not been proven that no efficient algorithm exists. The presumed difficulty of this problem is at the heart of widely used algorithms in cryptography such as RSA. Many areas of mathematics and computer science have been brought to bear on the problem, including elliptic curves and algebraic number theory. Not all numbers of a given length are equally hard to factor: the hardest instances of these problems are semiprimes, the product of two prime numbers. Many cryptographic protocols are based on the difficulty of factoring large composite integers or a related problem, for example the RSA problem. An algorithm that efficiently factors an arbitrary integer would render RSA-based public-key cryptography insecure. By the fundamental theorem of arithmetic, every positive integer has a unique prime factorization. If the integer is prime, then it can be recognized as such in polynomial time; if composite, however, the theorem gives no insight into how to obtain the factors. Given a general algorithm for integer factorization, any integer can be factored down to its constituent prime factors simply by repeated application of this algorithm. The situation is more complicated with special-purpose factorization algorithms, whose benefits may not be realized as well or even at all with the factors produced during decomposition. For example, if N = 10 × p × q where p < q are very large primes, trial division will quickly produce the factors 2 and 5 but will take p divisions to find the next factor. 
Among the b-bit numbers, the most difficult to factor in practice using existing algorithms are those that are products of two primes of similar size; for this reason, these are the integers used in cryptographic applications. The largest such semiprime yet factored was RSA-768, a 768-bit number with 232 decimal digits. This factorization was a collaboration of several research institutions, spanning two years and taking the equivalent of almost 2000 years of computing on a single-core 2.2 GHz AMD Opteron. Like all recent factorization records, this factorization was completed with an optimized implementation of the general number field sieve run on hundreds of machines. No algorithm has been published that can factor all integers in polynomial time; neither the existence nor non-existence of such algorithms has been proved, but it is generally suspected that they do not exist and hence that the problem is not in class P. The problem is clearly in class NP, but it has not been proved to be, or not to be, NP-complete; it is generally suspected not to be NP-complete. There are published algorithms that are faster than O((1 + ε)^b) for all positive ε, i.e. sub-exponential. The best published asymptotic running time is for the general number field sieve (GNFS) algorithm, which, for a number n, is exp(((64/9)^(1/3) + o(1)) (ln n)^(1/3) (ln ln n)^(2/3)). For current computers, GNFS is the best published algorithm for large n. For a quantum computer, however, Peter Shor discovered an algorithm in 1994 that solves it in polynomial time.
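Trial division, mentioned above as the method that finds small factors quickly but stalls on large ones, is short enough to show in full. Its cost grows with the smallest remaining prime factor, which is exactly why semiprimes built from two large primes of similar size are the hard cases.

```python
# Trial division, the simplest factorization algorithm. Small prime factors
# are found almost immediately; a factor p costs on the order of p trial
# divisions, which is why large semiprimes defeat this method in practice.

def trial_division(n):
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains above sqrt of the original is prime
    return factors

print(trial_division(864))  # [2, 2, 2, 2, 2, 3, 3, 3], i.e. 2^5 x 3^3
```

The example reproduces the decomposition of 864 shown in the accompanying figure caption.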
Integer factorization
–
This image demonstrates the prime decomposition of 864. A shorthand way of writing the resulting prime factors is 2^5 × 3^3
103.
Von Neumann architecture
–
The meaning has evolved to be any stored-program computer in which an instruction fetch and a data operation cannot occur at the same time because they share a common bus. This is referred to as the von Neumann bottleneck and often limits the performance of the system. A stored-program digital computer is one that keeps its program instructions, as well as its data, in read-write, random-access memory. The earliest computing machines had fixed programs, and some very simple computers still use this design, either for simplicity or training purposes. For example, a desk calculator is a fixed-program computer: it can do mathematics, but it cannot be used as a word processor or a gaming console. Changing the program of a fixed-program machine requires rewiring, restructuring, or redesigning the machine. The earliest computers were not so much programmed as they were designed; it could take three weeks to set up a program on ENIAC and get it working. With the proposal of the stored-program computer, this changed. A stored-program computer includes, by design, an instruction set and can store in memory a set of instructions (a program) that details the computation. A stored-program design also allows for self-modifying code. One early motivation for such a facility was the need for a program to increment or otherwise modify the address portion of instructions; this became less important when index registers and indirect addressing became usual features of machine architecture. Another use was to embed frequently used data in the instruction stream using immediate addressing. Self-modifying code has largely fallen out of favor, since it is hard to understand and debug. On a large scale, the ability to treat instructions as data is what makes assemblers, compilers, linkers, and loaders possible: one can write programs which write programs. This has allowed a sophisticated self-hosting computing ecosystem to flourish around von Neumann architecture machines. On a smaller scale, some repetitive operations such as BITBLT or pixel and vertex shaders could be accelerated on general-purpose processors with just-in-time compilation techniques.
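The defining property, a single memory holding both instructions and data, with every step fetching over one shared path, can be sketched with a toy machine. The instruction names (LOAD, ADD, STORE, HALT) and the one-accumulator design are invented for this illustration, not taken from any real instruction set:

```python
def run(memory):
    """Toy stored-program machine: instructions and data share one memory
    list, and each step performs a single fetch from that shared memory,
    modeling the common instruction/data bus of a von Neumann machine."""
    acc, pc = 0, 0                 # accumulator and program counter
    while True:
        op, arg = memory[pc]       # fetch instruction from shared memory
        pc += 1
        if op == "LOAD":
            acc = memory[arg]      # data access uses the same memory
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "HALT":
            return memory

# Program occupies cells 0-3; its data lives in cells 4-6 of the SAME memory.
memory = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0), 2, 3, 0]
result = run(memory)
print(result[6])   # 5
```

Because the program is just memory contents, "changing the program" means nothing more than writing different values into cells 0-3, which is exactly the advance over fixed-program machines described above.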
This is one use of self-modifying code that has remained popular. In his 1936 paper, Alan Turing described a hypothetical machine which he called a universal computing machine, and which is now known as the Universal Turing machine. The hypothetical machine had a store that contained both instructions and data. Whether von Neumann knew of Turing's 1936 paper at that time is not clear. In 1936, Konrad Zuse also anticipated, in two patent applications, that machine instructions could be stored in the same storage used for data. In planning a new machine, EDVAC, Eckert wrote in January 1944 that they would store data and programs in a new addressable memory device. This was the first time the construction of a practical stored-program machine was proposed
Von Neumann architecture
–
Von Neumann architecture scheme.
104.
Database
–
A database is an organized collection of data: a collection of schemas, tables, queries, reports, views, and other objects. A database management system (DBMS) is a computer software application that interacts with the user, other applications, and the database itself to capture and analyze data. A general-purpose DBMS is designed to allow the definition, creation, querying, update, and administration of databases. Well-known DBMSs include MySQL, PostgreSQL, MongoDB, MariaDB, Microsoft SQL Server, Oracle, Sybase, SAP HANA, MemSQL and IBM DB2. Sometimes a DBMS is loosely referred to as a database. Formally, a database refers to a set of related data and the way it is organized. The DBMS provides various functions that allow entry, storage and retrieval of large quantities of information; because of the close relationship between them, the term database is often used casually to refer to both a database and the DBMS used to manipulate it. Outside the world of information technology, the term database is often used to refer to any collection of related data. This article is concerned only with databases where the size and usage requirements necessitate use of a database management system. Among the functions a DBMS provides are update (insertion, modification, and deletion of the actual data) and retrieval (providing information in a form directly usable or for further processing by other applications). The retrieved data may be available in a form basically the same as it is stored in the database or in a new form obtained by altering or combining existing data from the database. Both a database and its DBMS conform to the principles of a particular database model. Database system refers collectively to the database model, the database management system, and the database. Physically, database servers are dedicated computers that hold the actual databases and run only the DBMS and related software. Database servers are usually multiprocessor computers, with generous memory and RAID disk arrays used for stable storage.
RAID is used for recovery of data if any of the disks fail. Hardware database accelerators, connected to one or more servers via a high-speed channel, are also used in large-volume transaction processing environments. DBMSs are found at the heart of most database applications. DBMSs may be built around a custom multitasking kernel with built-in networking support, but modern DBMSs typically rely on a standard operating system to provide these functions. Since DBMSs comprise a significant market, computer and storage vendors often take DBMS requirements into account in their own development plans. Databases are used to support internal operations of organizations and to underpin online interactions with customers and suppliers. Databases are used to hold administrative information and more specialized data. A DBMS has evolved into a complex software system, and its development typically requires thousands of human years of development effort. Some general-purpose DBMSs such as Adabas, Oracle and DB2 have been undergoing upgrades since the 1970s. General-purpose DBMSs aim to meet the needs of as many applications as possible, which adds to the complexity
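The core DBMS functions named above (definition, insertion, modification, deletion, retrieval) can be illustrated with Python's built-in SQLite bindings. The table name and rows are invented for the example:

```python
import sqlite3

# Throwaway in-memory database; sqlite3 ships with the Python stdlib.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Definition: declare the schema.
cur.execute("CREATE TABLE parts (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)")
# Insertion.
cur.executemany("INSERT INTO parts (name, qty) VALUES (?, ?)",
                [("bolt", 100), ("nut", 250)])
# Modification.
cur.execute("UPDATE parts SET qty = qty - 10 WHERE name = 'bolt'")
# Deletion.
cur.execute("DELETE FROM parts WHERE qty = 0")
# Retrieval: the result is a new form derived from the stored data.
rows = cur.execute("SELECT name, qty FROM parts ORDER BY name").fetchall()
print(rows)   # [('bolt', 90), ('nut', 250)]
con.close()
```

The same four verbs map directly onto the SQL statements of any general-purpose relational DBMS; only the connection setup differs between systems.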
Database
–
Collage of five types of database models
Database
–
Basic structure of navigational CODASYL database model
105.
Central processing unit
–
The computer industry has used the term central processing unit at least since the early 1960s. The form, design and implementation of CPUs have changed over the course of their history. Most modern CPUs are microprocessors, meaning they are contained on a single integrated circuit chip. An IC that contains a CPU may also contain memory, peripheral interfaces, and other components of a computer. Some computers employ a multi-core processor, which is a single chip containing two or more CPUs called cores; in that context, one can speak of such single chips as sockets. Array processors or vector processors have multiple processors that operate in parallel. There also exists the concept of virtual CPUs, which are an abstraction of dynamically aggregated computational resources. Early computers such as the ENIAC had to be physically rewired to perform different tasks. Since the term CPU is generally defined as a device for software execution, the earliest devices that could rightly be called CPUs came with the advent of the stored-program computer. The idea of a stored-program computer was already present in the design of J. Presper Eckert and John William Mauchly's ENIAC, but was initially omitted so that it could be finished sooner. On June 30, 1945, before ENIAC was made, mathematician John von Neumann distributed the paper entitled First Draft of a Report on the EDVAC; it was the outline of a stored-program computer that would eventually be completed in August 1949. EDVAC was designed to perform a certain number of instructions of various types. Significantly, the programs written for EDVAC were to be stored in high-speed computer memory rather than specified by the physical wiring of the computer. This overcame a severe limitation of ENIAC, which was the considerable time and effort required to reconfigure the computer to perform a new task. With von Neumann's design, the program that EDVAC ran could be changed simply by changing the contents of the memory. Early CPUs were custom designs used as part of a larger and sometimes distinctive computer. However, this method of designing custom CPUs for a particular application has largely given way to the development of multi-purpose processors produced in large quantities.
This standardization began in the era of discrete transistor mainframes and minicomputers and has accelerated with the popularization of the integrated circuit. The IC has allowed increasingly complex CPUs to be designed and manufactured to tolerances on the order of nanometers. Both the miniaturization and standardization of CPUs have increased the presence of digital devices in modern life far beyond the limited application of dedicated computing machines; modern microprocessors appear in electronic devices ranging from automobiles to cellphones. The so-called Harvard architecture of the Harvard Mark I, which was completed before EDVAC, also utilized a stored-program design using punched paper tape rather than electronic memory. Relays and vacuum tubes were used as switching elements; a useful computer requires thousands or tens of thousands of switching devices. The overall speed of a system is dependent on the speed of the switches. Tube computers like EDVAC tended to average eight hours between failures, whereas relay computers like the Harvard Mark I failed very rarely. In the end, tube-based CPUs became dominant because the significant speed advantages afforded generally outweighed the reliability problems. Most of these early synchronous CPUs ran at low clock rates compared to modern microelectronic designs; clock signal frequencies ranging from 100 kHz to 4 MHz were very common at this time. The design complexity of CPUs increased as various technologies facilitated building smaller and more reliable electronic devices
Central processing unit
–
An Intel 80486DX2 CPU, as seen from above
Central processing unit
–
Bottom side of an Intel 80486DX2
Central processing unit
–
EDVAC, one of the first stored-program computers
Central processing unit
–
CPU, core memory, and external bus interface of a DEC PDP-8/I. Made of medium-scale integrated circuits.
106.
Graphics processing unit
–
GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. In a personal computer, a GPU can be present on a video card or embedded on the motherboard. The term GPU was popularized by Nvidia in 1999, who marketed the GeForce 256 as the world's first GPU, or Graphics Processing Unit. It was presented as a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines. Rival ATI Technologies coined the term visual processing unit, or VPU, with the release of the Radeon 9700 in 2002. Arcade system boards have been using specialized graphics chips since the 1970s. In early video game hardware, the RAM for frame buffers was expensive, so video chips composited data together as the display was being scanned out on the monitor. Fujitsu's MB14241 video shifter was used to accelerate the drawing of sprite graphics for various 1970s arcade games from Taito and Midway, such as Gun Fight and Sea Wolf. The Namco Galaxian arcade system in 1979 used specialized graphics hardware supporting RGB color, multi-colored sprites and tilemap backgrounds. The Galaxian hardware was widely used during the golden age of arcade video games by game companies such as Namco, Centuri, Gremlin, Irem, Konami, Midway, Nichibutsu, and Sega. In the home market, the Atari 2600 in 1977 used a video shifter called the Television Interface Adaptor, while the Atari 8-bit computers featured the ANTIC video processor: 6502 machine code subroutines could be triggered on scan lines by setting a bit on a display list instruction. ANTIC also supported smooth vertical and horizontal scrolling independent of the CPU, and it became one of the best known of what were known as graphics processing units in the 1980s. The Williams Electronics arcade games Robotron: 2084, Joust, and Sinistar used custom blitter chips. In 1985, the Commodore Amiga featured a custom graphics chip, with a blitter unit accelerating bitmap manipulation, line draw, and area fill functions. Also included was a coprocessor with its own simple instruction set, capable of manipulating graphics hardware registers in sync with the video beam.
In 1986, Texas Instruments released the TMS34010, the first microprocessor with on-chip graphics capabilities; it could run general-purpose code, but it had a very graphics-oriented instruction set. In 1990–1992, this chip became the basis of the Texas Instruments Graphics Architecture Windows accelerator cards. In 1987, the IBM 8514 graphics system was released as one of the first video cards for IBM PC compatibles to implement fixed-function 2D primitives in electronic hardware. Fujitsu later competed with the FM Towns computer, released in 1989 with support for a full 16,777,216-color palette. In 1988, the first dedicated polygonal 3D graphics boards were introduced in arcades with the Namco System 21 and Taito Air System. In 1991, S3 Graphics introduced the S3 86C911, which its designers named after the Porsche 911 as an indication of the performance increase it promised. The 86C911 spawned a host of imitators: by 1995, all major PC graphics chip makers had added 2D acceleration support to their chips. By this time, fixed-function Windows accelerators had surpassed expensive general-purpose graphics coprocessors in Windows performance. Throughout the 1990s, 2D GUI acceleration continued to evolve; as manufacturing capabilities improved, so did the level of integration of graphics chips. Arcade systems such as the Sega Model 2 and Namco Magic Edge Hornet Simulator in 1993 were capable of hardware T&L years before appearing in consumer graphics cards
Graphics processing unit
–
GeForce 6600GT (NV43) GPU
Graphics processing unit
–
Tseng Labs ET4000/W32p
Graphics processing unit
–
S3 Graphics ViRGE
Graphics processing unit
–
Voodoo3 2000 AGP card
107.
Go (game)
–
Go is an abstract strategy board game for two players, in which the aim is to surround more territory than the opponent. The game was invented in ancient China more than 2,500 years ago, and it was considered one of the four essential arts of the cultured aristocratic Chinese scholar caste in antiquity. The earliest written reference to the game is generally recognized as the historical annal Zuo Zhuan. The modern game of Go as we know it was formalized in Japan in the 15th century CE. Despite its relatively simple rules, Go is very complex, even more so than chess, and possesses more possibilities than the total number of atoms in the visible universe. Compared to chess, Go has both a larger board with more scope for play and longer games, and, on average, many more alternatives to consider per move. The playing pieces are called stones; one player uses the white stones and the other, black. The players take turns placing the stones on the vacant intersections of a board with a 19×19 grid of lines. Beginners often play on smaller 9×9 and 13×13 boards, and archaeological evidence shows that the game was played in earlier centuries on a board with a 17×17 grid. However, boards with a 19×19 grid had become standard by the time the game had reached Korea in the 5th century CE. The objective of Go, as the translation of its name implies, is to fully surround a larger total area of the board than the opponent. Once placed on the board, stones may not be moved. Capture happens when a stone or group of stones is surrounded by opposing stones on all orthogonally adjacent points. The game proceeds until neither player wishes to make another move; when a game concludes, the territory is counted along with captured stones and komi to determine the winner. Games may also be terminated by resignation. As of mid-2008, there were well over 40 million Go players worldwide, the overwhelming majority of them living in East Asia.
As of December 2015, the International Go Federation has a total of 75 member countries. Go is an adversarial game with the objective of surrounding a larger total area of the board with one's stones than the opponent. As the game progresses, the players position stones on the board to map out formations and potential territories. Contests between opposing formations are often extremely complex and may result in the expansion, reduction, or wholesale capture and loss of formation stones. A basic principle of Go is that a group of stones must have at least one liberty to remain on the board; a liberty is an open point bordering the group. An enclosed liberty is called an eye, and a group of stones with two or more eyes is said to be unconditionally alive. Such groups cannot be captured, even if surrounded. A group with one eye or no eyes is dead and cannot resist eventual capture. The general strategy is to expand one's territory, attack the opponent's weak groups, and always stay mindful of the life status of one's own groups. The liberties of groups are countable. Situations where mutually opposing groups must capture each other or die are called capturing races, or semeai. In a capturing race, the group with more liberties will ultimately be able to capture the opponent's stones. Capturing races and the elements of life or death are the primary challenges of Go
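The liberty rule above can be made concrete with a small flood-fill: collect the orthogonally connected group containing a stone, then count the distinct empty points bordering it. The board encoding here ('b' black, 'w' white, '.' empty) is invented for the illustration:

```python
def liberties(board, r, c):
    """Count the liberties of the group containing the stone at (r, c).
    board is a list of equal-length strings: 'b', 'w', or '.'."""
    color = board[r][c]
    rows, cols = len(board), len(board[0])
    group, libs, stack = {(r, c)}, set(), [(r, c)]
    while stack:                          # flood-fill the connected group
        y, x = stack.pop()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < rows and 0 <= nx < cols:
                if board[ny][nx] == '.':
                    libs.add((ny, nx))    # empty neighbor = liberty
                elif board[ny][nx] == color and (ny, nx) not in group:
                    group.add((ny, nx))
                    stack.append((ny, nx))
    return len(libs)

board = ["....",
         ".bb.",
         ".bw.",
         "...."]
print(liberties(board, 1, 1))   # black group of three stones: 6 liberties
print(liberties(board, 2, 2))   # lone white stone: 2 liberties
```

A capture in the semeai sense corresponds to this count reaching zero for one group before the other; a full rules engine would remove the group from the board at that point.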
Go (game)
–
Go is played on a grid of black lines (usually 19×19). Game pieces, called stones, are played on the line intersections.
Go (game)
–
Woman Playing Go (Tang Dynasty c. 744), discovered at the Astana Graves
Go (game)
–
Korean couple, in traditional dress, play in a photograph dated between 1910 and 1920.
108.
Adaptive resonance theory
–
Adaptive resonance theory (ART) is a theory developed by Stephen Grossberg and Gail Carpenter on aspects of how the brain processes information. It describes a number of neural network models which use supervised and unsupervised learning methods. The model postulates that top-down expectations take the form of a template or prototype that is then compared with the actual features of an object as detected by the senses. This comparison gives rise to a measure of category belongingness. As long as the difference between sensation and expectation does not exceed a set threshold called the vigilance parameter, the sensed object will be considered a member of the expected class. The system thus offers a solution to the plasticity/stability problem, i.e. the problem of acquiring new knowledge without disrupting existing knowledge. The basic ART system is an unsupervised learning model. It typically consists of a comparison field and a recognition field composed of neurons, a vigilance parameter, and a reset module. The comparison field takes an input vector and transfers it to its best match in the recognition field; the best match is the neuron whose set of weights most closely matches the input vector. Each recognition field neuron outputs an inhibitory signal to each of the other recognition field neurons. In this way the recognition field exhibits lateral inhibition, allowing each neuron in it to represent a category to which input vectors are classified. After the input vector is classified, the reset module compares the strength of the recognition match to the vigilance parameter. If the vigilance parameter is overcome, training commences: the weights of the winning recognition neuron are adjusted towards the features of the input vector. Otherwise, if the match level is below the vigilance parameter, the winning recognition neuron is inhibited and a search procedure is carried out. In this search procedure, recognition neurons are disabled one by one by the reset function until the vigilance parameter is overcome by a recognition match.
In particular, at each cycle of the search procedure the most active recognition neuron is selected and then switched off if its activation is below the vigilance parameter. If no committed recognition neuron's match overcomes the vigilance parameter, then an uncommitted neuron is committed. The vigilance parameter has considerable influence on the system: higher vigilance produces highly detailed memories (many, fine-grained categories), while lower vigilance results in more general memories (fewer, more general categories). There are two basic methods of training ART-based neural networks: slow and fast. With fast learning, algebraic equations are used to calculate the degree of weight adjustments to be made. While fast learning is effective and efficient for a variety of tasks, the slow learning method is more biologically plausible and can be used with continuous-time networks. ART1 is the simplest variety of ART networks, accepting only binary inputs; ART2 extends network capabilities to support continuous inputs
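The search/vigilance cycle described above can be sketched for ART1 (binary inputs, fast learning). This is a simplified illustration, not Carpenter and Grossberg's full formulation; the choice function and the parameter names beta and rho follow common ART1 presentations:

```python
import numpy as np

def art1(inputs, rho=0.6, beta=1.0):
    """Simplified ART1 sketch: rho is the vigilance parameter, beta the
    choice parameter. Returns category labels and learned prototypes."""
    prototypes = []                       # committed recognition neurons
    labels = []
    for x in inputs:
        x = np.asarray(x)
        disabled = set()                  # neurons reset during the search
        while True:
            # Choice: most active committed neuron not yet disabled.
            best, best_score = None, -1.0
            for j, w in enumerate(prototypes):
                if j in disabled:
                    continue
                score = np.sum(x & w) / (beta + np.sum(w))
                if score > best_score:
                    best, best_score = j, score
            if best is None:              # search exhausted: commit new neuron
                prototypes.append(x.copy())
                labels.append(len(prototypes) - 1)
                break
            w = prototypes[best]
            if np.sum(x & w) / np.sum(x) >= rho:   # vigilance test passes
                prototypes[best] = x & w           # fast learning: intersect
                labels.append(best)
                break
            disabled.add(best)            # reset: inhibit and keep searching

    return labels, prototypes

data = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
labels, protos = art1(data, rho=0.6)
print(labels)   # [0, 0, 1]: first two inputs share a category
```

Raising rho toward 1.0 forces more inputs to fail the vigilance test and commit their own neurons, reproducing the detailed-versus-general memory trade-off described above.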
Adaptive resonance theory
–
Basic ART Structure
109.
Artificial life
–
The discipline was named by Christopher Langton, an American theoretical biologist, in 1986. There are three kinds of alife, named for their approaches: soft, from software; hard, from hardware; and wet, from biochemistry. Artificial life researchers study traditional biology by trying to recreate aspects of biological phenomena. Artificial life studies the fundamental processes of living systems in artificial environments in order to gain a deeper understanding of the complex information processing that defines such systems. The modeling philosophy of artificial life strongly differs from traditional modeling by studying not only life-as-we-know-it but also life-as-it-could-be. A traditional model of a biological system will focus on capturing its most important parameters. In contrast, an alife modeling approach will generally seek to decipher the most simple and general principles underlying life. The simulation then offers the possibility to analyse new and different lifelike systems. Vladimir Georgievich Redko proposed to generalize this distinction to the modeling of any process, leading to the more general distinction of processes-as-we-know-them and processes-as-they-could-be