1.
Factorial experiment
–
A full factorial design may also be called a fully crossed design. Such an experiment allows the investigator to study the effect of each factor on the response variable, as well as the effects of interactions between factors on the response variable. For the vast majority of factorial experiments, each factor has only two levels. For example, with two factors each taking two levels, a factorial experiment would have four treatment combinations in total, and is usually called a 2×2 factorial design. If the number of combinations in a full factorial design is too high to be logistically feasible, a fractional factorial design may be done, in which some of the possible combinations are omitted. Factorial designs were used in the 19th century by John Bennet Lawes and Joseph Henry Gilbert of the Rothamsted Experimental Station. Ronald Fisher argued in 1926 that complex designs were more efficient than studying one factor at a time. Fisher wrote, "No aphorism is more frequently repeated in connection with field trials, than that we must ask Nature few questions, or, ideally, one question, at a time. The writer is convinced that this view is wholly mistaken." Nature, he suggests, will best respond to a logical and carefully thought out questionnaire. Frank Yates made significant contributions, particularly in the analysis of designs, by the Yates analysis. The term "factorial" may not have been used in print before 1935, when Fisher used it in his book The Design of Experiments. The simplest factorial experiment contains two levels for each of two factors. Suppose an engineer wishes to study the total power used by each of two different motors, A and B, running at each of two different speeds, 2000 or 3000 RPM. The factorial experiment would consist of four experimental units: motor A at 2000 RPM, motor B at 2000 RPM, motor A at 3000 RPM, and motor B at 3000 RPM. Each combination of a single level selected from every factor is present once. This experiment is an example of a 2² (or 2×2) factorial experiment, so named because it considers two levels for each of two factors, or (number of levels)^(number of factors), producing 2² = 4 factorial points. Designs can involve many independent variables. As a further example, the effects of three input variables can be evaluated in eight experimental conditions shown as the corners of a cube.
This can be conducted with or without replication, depending on its intended purpose and available resources. It will provide the effects of the three independent variables on the dependent variable and possible interactions. The notation used to denote factorial experiments conveys a lot of information. When a design is denoted a 2³ factorial, this identifies the number of factors (3), how many levels each factor has (2), and how many experimental conditions there are in the design (2³ = 8). Similarly, a 2⁵ design has five factors, each with two levels, and 2⁵ = 32 experimental conditions. Factorial experiments can involve factors with different numbers of levels: a 2⁴×3 design has five factors, four with two levels and one with three levels, and has 16 × 3 = 48 experimental conditions. To save space, the points in a factorial experiment are often abbreviated with strings of plus and minus signs
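The counting rule above (number of conditions = product of the factors' level counts) can be sketched in a few lines. This is an illustrative enumeration, not taken from the text; the function name is invented:

```python
from itertools import product

def factorial_points(levels_per_factor):
    """Enumerate every treatment combination of a full factorial design.

    levels_per_factor: list giving the number of levels for each factor.
    Returns a list of tuples, one tuple per experimental condition.
    """
    return list(product(*(range(n) for n in levels_per_factor)))

# A 2x2 design: two factors, two levels each -> 4 treatment combinations.
print(len(factorial_points([2, 2])))           # 4

# The mixed-level 2^4 x 3 design from the text: 16 x 3 = 48 conditions.
print(len(factorial_points([2, 2, 2, 2, 3])))  # 48
```

Each tuple returned is one corner of the design's hypercube, matching the "corners of a cube" picture for three two-level factors.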
2.
Machine learning
–
Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives "computers the ability to learn without being explicitly programmed". Machine learning is closely related to computational statistics, which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is sometimes conflated with data mining, where the latter subfield focuses more on exploratory data analysis and is known as unsupervised learning. Machine learning can also be unsupervised and be used to learn and establish baseline behavioral profiles for various entities. Tom M. Mitchell provided a widely quoted, more formal definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." This follows Alan Turing's proposal that the question "Can machines think?" be replaced with the question "Can machines do what we (as thinking entities) can do?" In the proposal he explores the characteristics that could be possessed by a thinking machine. Machine learning tasks are typically classified into three broad categories, depending on the nature of the learning signal or feedback available to a learning system. These are: supervised learning, where the computer is presented with example inputs and their desired outputs, given by a teacher; unsupervised learning, where no labels are given to the learning algorithm, and unsupervised learning can be a goal in itself or a means towards an end; and reinforcement learning, where a computer program interacts with an environment in which it must perform a certain goal, and the program is provided feedback in terms of rewards and punishments as it navigates its problem space. Between supervised and unsupervised learning is semi-supervised learning, where the teacher gives an incomplete training signal: a training set with some of the target outputs missing. Transduction is a special case of this principle where the entire set of problem instances is known at learning time. Among other categories of machine learning problems, learning to learn learns its own inductive bias based on previous experience. In classification, inputs are divided into two or more classes and the learner must assign unseen inputs to one or more of these classes; this is typically tackled in a supervised way.
Spam filtering is an example of classification, where the inputs are email messages and the classes are "spam" and "not spam". In regression, also a supervised problem, the outputs are continuous rather than discrete. In clustering, a set of inputs is to be divided into groups; unlike in classification, the groups are not known beforehand, making this typically an unsupervised task. Density estimation finds the distribution of inputs in some space. Dimensionality reduction simplifies inputs by mapping them into a lower-dimensional space. Topic modeling is a related problem, where a program is given a list of human-language documents and is tasked with finding out which documents cover similar topics. As a scientific endeavour, machine learning grew out of the quest for artificial intelligence. Already in the early days of AI as an academic discipline, some researchers were interested in having machines learn from data
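The spam-filtering example of supervised classification can be sketched with a 1-nearest-neighbour classifier. Everything below (the feature choices, the toy data, the labels) is invented for illustration, not drawn from any real corpus:

```python
# Minimal sketch of supervised classification: a 1-nearest-neighbour
# classifier over toy "email" feature vectors (all data invented here).

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbour_classify(train, labels, query):
    """Return the label of the training point closest to `query`."""
    best = min(range(len(train)), key=lambda i: euclidean(train[i], query))
    return labels[best]

# Features: (number of exclamation marks, fraction of capitalised words).
train = [(0, 0.05), (1, 0.10), (8, 0.70), (6, 0.55)]
labels = ["ham", "ham", "spam", "spam"]

print(nearest_neighbour_classify(train, labels, (7, 0.60)))  # spam
print(nearest_neighbour_classify(train, labels, (0, 0.08)))  # ham
```

Replacing the discrete labels with continuous target values and predicting the neighbour's value instead would turn this same setup into the regression task described above.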
3.
Data mining
–
Data mining is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" (KDD) process. The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of data itself. Often the more general terms data analysis and analytics (or, when referring to actual methods, artificial intelligence and machine learning) are more appropriate. The actual data mining task usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but they do belong to the overall KDD process as additional steps. These methods can, however, be used in creating new hypotheses to test against the larger data populations. In the 1960s, statisticians used terms like "data fishing" or "data dredging" to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term "data mining" appeared around 1990 in the database community; however, the term became more popular in the business and press communities. Currently, the terms data mining and knowledge discovery are used interchangeably. In the academic community, the major forums for research started in 1995 when the First International Conference on Data Mining and Knowledge Discovery was started in Montreal under AAAI sponsorship. It was co-chaired by Usama Fayyad and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the journal Data Mining and Knowledge Discovery, published by Kluwer, as its founding editor-in-chief.
Later he started the SIGKDD newsletter SIGKDD Explorations. The KDD International conference became the primary highest-quality conference in data mining, with an acceptance rate of research paper submissions below 18%. The journal Data Mining and Knowledge Discovery is the primary research journal of the field. The manual extraction of patterns from data has occurred for centuries; early methods of identifying patterns in data include Bayes' theorem and regression analysis. The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets. The knowledge discovery in databases (KDD) process is commonly defined with the stages selection, pre-processing, transformation, data mining, and interpretation/evaluation. Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA; however, 3–4 times as many people reported using CRISP-DM
4.
Statistical classification
–
Statistical classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known. An example would be assigning a given email to the "spam" or "non-spam" class, or assigning a diagnosis to a given patient as described by observed characteristics of the patient. Classification is an example of pattern recognition. In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available. The corresponding unsupervised procedure is known as clustering, and involves grouping data into categories based on some measure of inherent similarity or distance. Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or features. These properties may variously be categorical, ordinal, integer-valued or real-valued. Other classifiers work by comparing observations to previous observations by means of a similarity or distance function. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category. Terminology across fields is quite varied: in machine learning, the observations are often known as instances, the explanatory variables are termed features, and the possible categories to be predicted are classes. Classification and clustering are examples of the more general problem of pattern recognition. A common subclass of classification is probabilistic classification; algorithms of this nature use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, probabilistic algorithms output a probability of the instance being a member of each of the possible classes. The best class is then normally selected as the one with the highest probability.
Such an algorithm has numerous advantages over non-probabilistic classifiers: for example, it can quantify its confidence in a prediction and, correspondingly, it can abstain when its confidence in choosing any particular output is too low. Early work on statistical classification assumed that data values within each of the two groups had a normal distribution. The extension of this context to more than two groups has also been considered, with a restriction imposed that the classification rule should be linear. Bayesian procedures tend to be computationally expensive, and in the days before Markov chain Monte Carlo computations were developed, approximations for Bayesian classification rules were devised. Classification can be thought of as two separate problems: binary classification and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers. Most algorithms describe an individual instance whose category is to be predicted using a feature vector of individual, measurable properties of the instance.
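The probabilistic classification with abstention described above can be sketched as follows. The softmax conversion of scores to probabilities and the 0.7 threshold are illustrative choices, not prescribed by the text:

```python
import math

def softmax(scores):
    """Convert arbitrary real-valued scores to probabilities summing to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_abstention(scores, classes, threshold=0.7):
    """Pick the most probable class, or abstain when confidence is low."""
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return None  # abstain: no class is confident enough
    return classes[best]

classes = ["spam", "ham"]
print(classify_with_abstention([3.0, 0.5], classes))  # spam (confident)
print(classify_with_abstention([1.0, 0.9], classes))  # None (abstains)
```

A non-probabilistic classifier would be forced to emit "spam" in the second call as well; the probability output is what makes abstention possible.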
5.
Cluster analysis
–
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis. The notion of a "cluster" cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties; understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include connectivity models (for example, hierarchical clustering builds models based on distance connectivity) and centroid models (for example, the k-means algorithm represents each cluster by a single mean vector).
Further models include distribution models, in which clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm; density models, in which, for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space; subspace models, as in biclustering, where clusters are modeled with both cluster members and relevant attributes; group models, where some algorithms do not provide a refined model for their results and just provide the grouping information; and graph-based models, where a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement are known as quasi-cliques, as in the HCS clustering algorithm. A "clustering" is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of clusters embedded in each other. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms
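The centroid model above can be illustrated with a minimal k-means sketch: alternate between assigning points to their nearest centroid and moving each centroid to the mean of its cluster. The toy points and random seed are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means sketch for small toy data (no convergence test)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The same assignment/update skeleton reappears in the expectation-maximization algorithm of the distribution models, with soft probabilistic assignments in place of the hard nearest-centroid rule.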
6.
Association rule learning
–
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. For example, the rule {onions, potatoes} ⇒ {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions. Following the original definition by Agrawal, Imieliński and Swami, the problem of association rule mining is defined as follows: let I = {i1, i2, …, in} be a set of n binary attributes called items, and let D = {t1, t2, …, tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I. In Agrawal, Imieliński and Swami a rule is defined only between a set and a single item, X ⇒ ij for ij ∈ I. Every rule is composed of two different sets of items, also known as itemsets, X and Y, where X is called the antecedent or left-hand side and Y the consequent or right-hand side. To illustrate the concepts, we use a small example from the supermarket domain. An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if butter and bread are bought, customers also buy milk. Note: this example is extremely small. In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence. Let X be an itemset, X ⇒ Y an association rule and T a set of transactions of a given database. Support is an indication of how frequently the itemset appears in the database. The argument of supp() is a set of preconditions, and thus becomes more restrictive as it grows.
Confidence is an indication of how often the rule has been found to be true. The confidence value of a rule, X ⇒ Y, with respect to a set of transactions T, is the proportion of the transactions containing X which also contain Y. Confidence is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct. Note that supp(X ∪ Y) means the support of the union of the items in X and Y. This is somewhat confusing since we normally think in terms of probabilities of events rather than sets of items. We can rewrite supp(X ∪ Y) as the probability P(E_X ∩ E_Y), where E_X and E_Y are the events that a transaction contains itemset X or Y, respectively. The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)). For example, the rule {milk, bread} ⇒ {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25 in the database. If the rule had a lift of 1, it would imply that the antecedent and the consequent occur independently of each other; when two events are independent of each other, no rule can be drawn involving those two events. The value of lift is that it considers both the confidence of the rule and the overall data set
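The three measures can be computed directly from their definitions. The five-transaction toy database below is invented for illustration, chosen so that the numbers match the worked examples in the text (confidence 0.2 / 0.2 = 1.0 and lift 0.2 / (0.4 × 0.4) = 1.25):

```python
# Support, confidence and lift on a toy transaction database.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(X, Y):
    """conf(X => Y) = supp(X u Y) / supp(X)."""
    return supp(X | Y) / supp(X)

def lift(X, Y):
    """lift(X => Y) = supp(X u Y) / (supp(X) * supp(Y))."""
    return supp(X | Y) / (supp(X) * supp(Y))

print(conf({"butter", "bread"}, {"milk"}))  # 1.0
print(lift({"milk", "bread"}, {"butter"}))  # approximately 1.25
```

Note how supp takes a single itemset while conf and lift take the antecedent and consequent separately, mirroring the definitions above.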
7.
Reinforcement learning
–
In the operations research and control literature, the field where reinforcement learning methods are studied is called approximate dynamic programming. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality. In machine learning, the environment is typically formulated as a Markov decision process, as many reinforcement learning algorithms for this context utilize dynamic programming techniques. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented. Further, there is a focus on on-line performance, which involves finding a balance between exploration and exploitation. The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem. The observation typically involves the scalar immediate reward associated with the last transition. In many works, the agent is assumed to observe the current environmental state, in which case we talk about full observability. Sometimes the set of actions available to the agent is restricted. A reinforcement learning agent interacts with its environment in discrete time steps. At each time t, the agent receives an observation o_t, which typically includes the reward r_t. It then chooses an action a_t from the set of available actions. The environment moves to a new state s_{t+1} and the reward r_{t+1} associated with the transition is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can choose any action as a function of the history, and it can even randomize its action selection. When the agent's performance is compared to that of an agent which acts optimally from the beginning, the difference in performance gives rise to the notion of regret. Reinforcement learning is thus particularly well-suited to problems which include a long-term versus short-term reward trade-off.
It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon and checkers. Two components make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Reinforcement learning can be used when a model of the environment is known but an analytic solution is not available, when only a simulation model of the environment is given, or when the only way to collect information about the environment is to interact with it. The first two of these problems could be considered planning problems, while the last one could be considered a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to machine learning problems. The reinforcement learning problem as described requires clever exploration mechanisms. Randomly selecting actions, without reference to an estimated probability distribution, is known to give rise to very poor performance. The case of finite Markov decision processes is relatively well understood by now. However, due to the lack of algorithms that would provably scale well with the number of states, in practice people resort to simple exploration methods. One such method is ϵ-greedy, where the agent chooses the action that it believes has the best long-term effect with probability 1 − ϵ, and chooses an action uniformly at random otherwise. Here, 0 < ϵ < 1 is a tuning parameter. Even if the issue of exploration is disregarded and even if the state was observable, the problem remains to find out which actions are good based on past experience. For simplicity, assume for a moment that the problem studied is episodic, and assume further that no matter what course of actions the agent takes, termination is inevitable
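The ϵ-greedy method can be sketched on the multi-armed bandit problem mentioned earlier. The arm means, noise model, step count and learning-rate scheme below are all invented for illustration:

```python
import random

def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """epsilon-greedy on a multi-armed bandit: exploit the empirically best
    arm with probability 1 - epsilon, explore uniformly otherwise."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k        # pulls per arm
    estimates = [0.0] * k   # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)             # noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = epsilon_greedy_bandit([0.1, 0.5, 0.9])
print(counts.index(max(counts)))  # 2: the best arm is pulled most often
```

The trade-off is visible in the counts: roughly an ϵ fraction of pulls is spread over all arms (exploration), while the remainder concentrates on the arm with the highest estimated reward (exploitation).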
8.
Structured prediction
–
Structured prediction or structured learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values. Probabilistic graphical models form a large class of structured prediction models. Other algorithms and models for structured prediction include inductive logic programming, case-based reasoning, structured SVMs and Markov logic networks. Sequence tagging is a class of problems prevalent in natural language processing, where input data are often sequences. The sequence tagging problem appears in several guises, e.g. part-of-speech (POS) tagging. In POS tagging, for example, each word in a sequence must receive a tag that expresses its type of word: This/DT is/VBZ a/DT tagged/JJ sentence/NN. The main challenge of this problem is to resolve ambiguity: the word "sentence" can also be a verb in English. One of the easiest ways to understand algorithms for structured prediction is the structured perceptron of Collins. This algorithm combines the perceptron algorithm for learning linear classifiers with an inference algorithm. First define a joint feature function Φ(x, y) that maps a training sample x and a candidate prediction y to a vector of length n. Let GEN be a function that generates candidate predictions. The idea of learning is similar to the multiclass perceptron. See also: conditional random fields, structured support vector machines, and recurrent neural networks, in particular Elman networks. References: Noah Smith, Linguistic Structure Prediction, 2011; Michael Collins, Discriminative Training Methods for Hidden Markov Models, 2002.
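The structured perceptron described above can be sketched for POS tagging. Everything here is a toy illustration: the tag set and sentences are invented, and GEN brute-forces all tag sequences, whereas real implementations use Viterbi-style inference:

```python
from itertools import product

# Structured perceptron sketch: Phi counts word/tag and tag/tag
# co-occurrences, GEN enumerates candidates, and training applies the
# perceptron update w += Phi(x, y_gold) - Phi(x, y_predicted).

TAGS = ["DT", "NN", "VBZ"]

def phi(words, tags):
    """Joint feature map: sparse counts of emission/transition features."""
    feats = {}
    for wd, t in zip(words, tags):
        feats[("emit", wd, t)] = feats.get(("emit", wd, t), 0) + 1
    for a, b in zip(tags, tags[1:]):
        feats[("trans", a, b)] = feats.get(("trans", a, b), 0) + 1
    return feats

def score(w, feats):
    return sum(w.get(f, 0.0) * v for f, v in feats.items())

def predict(w, words):
    candidates = product(TAGS, repeat=len(words))  # GEN: all tag sequences
    return max(candidates, key=lambda tags: score(w, phi(words, tags)))

def train(examples, epochs=5):
    w = {}
    for _ in range(epochs):
        for words, gold in examples:
            pred = predict(w, words)
            if list(pred) != list(gold):
                for f, v in phi(words, gold).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in phi(words, pred).items():
                    w[f] = w.get(f, 0.0) - v
    return w

examples = [
    (["the", "dog", "barks"], ["DT", "NN", "VBZ"]),
    (["a", "cat", "sleeps"], ["DT", "NN", "VBZ"]),
]
w = train(examples)
print(predict(w, ["the", "cat", "barks"]))  # ('DT', 'NN', 'VBZ')
```

Note that the learned transition weights let the model tag the unseen word "cat" correctly, which is exactly what distinguishes structured prediction from classifying each word in isolation.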
9.
Feature engineering
–
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning. Feature engineering is an informal topic, but it is considered essential in applied machine learning. As Andrew Ng puts it: "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." When working on a machine learning problem, feature engineering is manually designing what the input x's should be. A feature is a piece of information that might be useful for prediction. Any attribute could be a feature, as long as it is useful to the model. The purpose of a feature, other than being an attribute, is easier to understand in the context of a problem: a feature is a characteristic that might help when solving the problem. The features in your data are important to the models you use. The quality and quantity of the features will have great influence on whether the model is good or not. You could say that the better the features are, the better the result is, but this isn't entirely true, because the results achieved also depend on the model and the data, not just the chosen features. That said, choosing the right features is very important. Better features can produce simpler and more flexible models, and they often yield better results. As one Kaggle competition winner put it: "The algorithms we used are standard for Kagglers. We spent most of our efforts in feature engineering. We were also very careful to discard features likely to expose us to the risk of over-fitting our model." And: "…some machine learning projects succeed and some fail. Easily the most important factor is the features used." Depending on the feature, it could be strongly relevant, relevant, weakly relevant, or irrelevant. It is important to create a lot of features, even if some of them are irrelevant: you can't afford to miss the rest.
Afterwards, feature selection can be used to prevent overfitting. Feature explosion can be caused by feature combination or by feature templates, both leading to a quick growth in the total number of features
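The feature-explosion effect of feature combination can be made concrete: forming all pairwise products of d base features adds d(d−1)/2 new features, so the feature count grows quadratically. This is a generic illustration, not a recipe from the text:

```python
from itertools import combinations

def with_pairwise_combinations(x):
    """Augment a feature vector with all pairwise products of its entries."""
    return list(x) + [a * b for a, b in combinations(x, 2)]

for d in (4, 10, 100):
    x = [1.0] * d
    print(d, "->", len(with_pairwise_combinations(x)))
# 4 -> 10, 10 -> 55, 100 -> 5050
```

This quadratic blow-up is exactly why the feature selection mentioned above is usually applied afterwards, to prune the combinations that only add overfitting risk.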
10.
Feature learning
–
Feature learning or representation learning is a set of techniques that allows a machine to automatically discover the representations needed for a specific task from raw data. This obviates manual feature engineering, which is otherwise necessary, and allows a machine to both learn the features and use them to perform the task. Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensor measurements are usually complex and redundant; thus, it is necessary to discover useful features or representations from raw data. Traditional hand-crafted features often require expensive human labor and often rely on expert knowledge. Also, they normally do not generalize well. This motivates the design of efficient feature learning techniques that automate this process. Feature learning can be divided into two categories, supervised and unsupervised feature learning, analogous to these categories in machine learning generally. In supervised feature learning, features are learned with labeled input data; examples include supervised neural networks, multilayer perceptrons, and supervised dictionary learning. In unsupervised feature learning, features are learned with unlabeled input data; examples include dictionary learning, independent component analysis, autoencoders, matrix factorization, and various forms of clustering. Supervised feature learning is learning features from labeled data. Several approaches are introduced in the following. Dictionary learning is to learn a set (dictionary) of representative elements from the input data such that each data point can be represented as a weighted sum of the representative elements. The dictionary elements and the weights may be found by minimizing the average representation error. Supervised dictionary learning exploits both the structure underlying the input data and the labels for optimizing the dictionary elements. For example, a supervised dictionary learning technique was proposed by Mairal et al. in 2009.
Neural networks are a family of learning algorithms that use a network consisting of multiple layers of inter-connected nodes. The design is inspired by the nervous system, where the nodes are viewed as neurons. Each edge has a weight, and the network defines computational rules that pass input data from the input layer to the output layer. A network function associated with a neural network characterizes the relationship between the input and output layers, and is parameterized by the weights. With appropriately defined network functions, various learning tasks can be performed by minimizing a cost function over the network function. Unsupervised feature learning is learning features from unlabeled data. The goal of unsupervised feature learning is often to discover low-dimensional features that capture some structure underlying the high-dimensional input data. Several approaches are introduced in the following. K-means clustering is an approach for vector quantization
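K-means-based feature learning via vector quantization can be sketched as follows: given centroids (in practice produced by k-means on unlabeled data; hard-coded here for illustration), each raw input is re-represented by a one-hot code over its nearest centroid:

```python
# Unsupervised feature learning sketch: vector quantization re-encodes
# each input as a one-hot vector over learned centroids. The centroids
# below are invented stand-ins for the output of k-means.

def nearest(centroids, x):
    """Index of the centroid closest to x (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(centroids[i], x)))

def one_hot_features(centroids, x):
    """Learned representation of x: one-hot over the nearest centroid."""
    code = [0.0] * len(centroids)
    code[nearest(centroids, x)] = 1.0
    return code

centroids = [(0.0, 0.0), (5.0, 5.0)]
print(one_hot_features(centroids, (0.2, 0.1)))  # [1.0, 0.0]
print(one_hot_features(centroids, (4.8, 5.3)))  # [0.0, 1.0]
```

The one-hot codes are the learned features: a downstream classifier sees a representation shaped by the structure of the unlabeled data rather than the raw coordinates.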
11.
Semi-supervised learning
–
Semi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data). Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy. The acquisition of labeled data for a learning problem often requires a skilled human agent or a physical experiment. The cost associated with the labeling process thus may render a fully labeled training set infeasible. In such situations, semi-supervised learning can be of great practical value. Semi-supervised learning is also of theoretical interest in machine learning and as a model for human learning. As in the supervised learning framework, we are given a set of l independently identically distributed examples x_1, …, x_l ∈ X with corresponding labels y_1, …, y_l ∈ Y. Additionally, we are given u unlabeled examples x_{l+1}, …, x_{l+u} ∈ X. Semi-supervised learning may refer to either transductive learning or inductive learning. The goal of transductive learning is to infer the correct labels for the given unlabeled data x_{l+1}, …, x_{l+u} only. The goal of inductive learning is to infer the correct mapping from X to Y. Intuitively, we can think of the learning problem as an exam and of the labeled data as the example problems the teacher solved in class. The teacher also provides a set of unsolved problems. In the transductive setting, these unsolved problems are a take-home exam and you want to do well on them in particular. In the inductive setting, these are practice problems of the sort you will encounter on the in-class exam. In order to make any use of unlabeled data, we must assume some structure to the underlying distribution of data. Semi-supervised learning algorithms make use of at least one of the following assumptions. Smoothness assumption: points which are close to each other are more likely to share a label. This is also generally assumed in supervised learning and yields a preference for geometrically simple decision boundaries. Cluster assumption: the data tend to form clusters, and points in the same cluster are more likely to share a label. This is a special case of the smoothness assumption and gives rise to feature learning with clustering algorithms
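One common way the cluster assumption is exploited is self-training: fit a model on the few labeled points, pseudo-label the unlabeled points, and refit on everything. The sketch below uses a nearest-centroid model; all data, labels and names are invented for illustration, and self-training is only one of several semi-supervised strategies:

```python
# Self-training sketch with a nearest-centroid classifier.

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest_label(cents, x):
    """Label whose centroid is closest to x."""
    return min(cents, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(cents[lab], x)))

labeled = {(-1.0, 0.0): "A", (1.0, 0.0): "B"}          # tiny labeled set
unlabeled = [(-1.2, 0.1), (-0.8, -0.1), (1.1, 0.2), (0.9, -0.2)]

# Step 1: centroids from the few labeled points only.
cents = {lab: centroid([p for p, l in labeled.items() if l == lab])
         for lab in {"A", "B"}}
# Step 2: pseudo-label the unlabeled points with the current model.
pseudo = {x: nearest_label(cents, x) for x in unlabeled}
# Step 3: refit the centroids on labeled + pseudo-labeled data.
everything = {**labeled, **pseudo}
cents = {lab: centroid([p for p, l in everything.items() if l == lab])
         for lab in {"A", "B"}}
print(nearest_label(cents, (-0.9, 0.0)))  # A
```

The unlabeled points sharpen the centroid estimates without any extra labeling cost, which is exactly the improvement in learning accuracy the paragraph above describes.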
12.
Unsupervised learning
–
Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data. Approaches include, e.g., principal component analysis, independent component analysis and non-negative matrix factorization. The classical example of unsupervised learning in the study of both natural and artificial neural networks is subsumed by Donald Hebb's principle, that is, neurons that fire together wire together. In Hebbian learning, the connection is reinforced irrespective of an error. A similar version that modifies synaptic weights takes into account the time between the action potentials. Hebbian learning has been hypothesized to underlie a range of cognitive functions, such as pattern recognition. Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties. ART networks are used for many pattern recognition tasks, such as automatic target recognition. The first version of ART was ART1, developed by Carpenter and Grossberg. One of the statistical approaches for unsupervised learning is the method of moments. In the method of moments, the unknown parameters of the model are related to the moments of one or more random variables, and thus these unknown parameters can be estimated given the moments. The moments are usually estimated from samples empirically. The basic moments are first- and second-order moments. For a random vector, the first-order moment is the mean vector. Higher-order moments are usually represented using tensors, which are the generalization of matrices to higher orders as multi-dimensional arrays. In particular, the method of moments has been shown to be effective in learning the parameters of latent variable models. Latent variable models are statistical models where, in addition to the observed variables, a set of latent variables also exists which is not observed. A practical example is topic modeling: the words in a document are generated according to different statistical parameters when the topic of the document is changed.
It has been shown that the method of moments consistently recovers the parameters of a large class of latent variable models under some assumptions. The expectation–maximization (EM) algorithm is one of the most practical methods for learning latent variable models. However, it can get stuck in local optima, and it is not guaranteed that the algorithm will converge to the true unknown parameters of the model. For the method of moments, by contrast, convergence is guaranteed under some conditions. Behavioral-based detection in network security has become a good application area for a combination of supervised and unsupervised machine learning
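Hebb's "fire together, wire together" principle, mentioned above, can be sketched as a weight update with no error signal: the connection strength changes in proportion to the product of the two units' activities. The learning rate and activity values are invented for illustration:

```python
# Minimal Hebbian learning sketch: weight change is driven purely by the
# coincidence of pre- and post-synaptic activity, with no error term.

def hebbian_update(w, pre, post, lr=0.1):
    """One Hebbian step: w += lr * pre * post (scalar units here)."""
    return w + lr * pre * post

w = 0.0
# Correlated activity strengthens the connection...
for pre, post in [(1.0, 1.0)] * 5:
    w = hebbian_update(w, pre, post)
print(round(w, 2))  # 0.5

# ...while silent (zero) pre-synaptic activity leaves it unchanged.
w2 = hebbian_update(w, 0.0, 1.0)
print(w2 == w)  # True
```

Contrast this with supervised rules such as the perceptron update, where the change depends on the difference between the desired and actual output; here no such error signal exists, which is what makes the rule unsupervised.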
13.
Learning to rank
–
Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The ranking model's purpose is to rank, i.e. produce a permutation of, items in new, unseen lists in a similar way. Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, and online advertising. A possible architecture of a machine-learned search engine is shown in the figure to the right. Training data consists of queries and documents matching them, together with the relevance degree of each match. It may be prepared manually by human assessors, who check results for some queries and determine the relevance of each result. It is not feasible to check the relevance of all documents, and so typically a technique called pooling is used: only the top few documents retrieved by some existing ranking models are checked. Alternatively, training data may be derived automatically by analyzing clickthrough logs, query chains, or such search engine features as Google's SearchWiki. Training data is used by a learning algorithm to produce a ranking model which computes the relevance of documents for actual queries. Typically, users expect a search query to complete in a short time, which makes it impossible to evaluate a complex ranking model on each document in the corpus, so a two-phase scheme is used: first, a small number of potentially relevant documents are identified using simpler retrieval models which permit fast query evaluation. This phase is called top-k document retrieval, and many heuristics were proposed in the literature to accelerate it, such as using a document's static quality score. In the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these documents. Learning to rank is also used in recommender systems, for identifying a ranked list of related articles to recommend to a user after he or she has read a current news article. For the convenience of MLR algorithms, query-document pairs are represented by numerical vectors.
Such an approach is sometimes called bag of features and is analogous to the bag of words model; components of such vectors are called features, factors, or ranking signals. They may be divided into three groups. Query-independent or static features: those features which depend only on the document, but not on the query. For example, PageRank or a document's length. Such features can be precomputed in off-line mode during indexing. They may be used to compute a document's static quality score, which is used to speed up search query evaluation. Query-dependent or dynamic features: those features which depend both on the contents of the document and the query, such as TF-IDF score or other non-machine-learned ranking functions. Query-level features or query features: those which depend only on the query, for example, the number of words in a query. Selecting and designing good features is an important area in machine learning, which is called feature engineering. There are several measures which are used to judge how well an algorithm is doing on training data.
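One widely used evaluation measure of the kind mentioned above is normalized discounted cumulative gain (NDCG). The sketch below uses the standard DCG formulation; the relevance grades are made up for illustration:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: graded relevance discounted by log of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalised by the ideal (descending-sorted) ordering; 1.0 is perfect."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance judgments for two ranked result lists.
print(ndcg([3, 2, 3, 0, 1]))  # below 1.0: the ranking is not ideal
print(ndcg([3, 3, 2, 1, 0]))  # 1.0: already in ideal order
```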
14.
Supervised learning
–
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a reasonable way. The parallel task in human and animal psychology is often referred to as concept learning. In order to solve a given problem of supervised learning, one has to perform the following steps. Determine the type of training examples: before doing anything else, the user should decide what kind of data is to be used as a training set; in the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting. Gather a training set: the training set needs to be representative of the real-world use of the function; thus, a set of input objects is gathered and corresponding outputs are also gathered. Determine the input feature representation of the learned function: the accuracy of the learned function depends strongly on how the input object is represented; typically, the input object is transformed into a feature vector. The number of features should not be too large, because of the curse of dimensionality. Determine the structure of the learned function and corresponding learning algorithm: for example, the engineer may choose to use support vector machines or decision trees. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters, and these parameters may be adjusted by optimizing performance on a subset of the training set, or via cross-validation.
Evaluate the accuracy of the learned function: after parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set. A wide range of supervised learning algorithms are available, each with its strengths and weaknesses; there is no single learning algorithm that works best on all supervised learning problems. There are four major issues to consider in supervised learning. A first issue is the tradeoff between bias and variance. Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input x if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for x.
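The cross-validation step mentioned above can be sketched as a simple k-fold split (the fold count and data here are illustrative, and the split assumes the data length is divisible by k):

```python
def k_fold_splits(data, k):
    """Yield (train, validation) pairs for k-fold cross-validation."""
    fold_size = len(data) // k
    for i in range(k):
        # The i-th contiguous fold is held out; the rest is the training set.
        val = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, val

examples = list(range(10))
for train, val in k_fold_splits(examples, 5):
    print(val, "held out, trained on", len(train), "examples")
```

Control parameters are then chosen to maximize average performance across the held-out folds.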
15.
Decision tree learning
–
Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data; this page deals with decision trees in data mining. Decision tree learning is a method commonly used in data mining. The goal is to create a model that predicts the value of a target variable based on several input variables. An example is shown in the diagram at right. Each interior node corresponds to one of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to the leaf. A decision tree is a simple representation for classifying examples. For this section, assume all of the input features have finite discrete domains. Each element of the domain of the classification is called a class. A decision tree or a classification tree is a tree in which each internal node is labeled with an input feature. Each leaf of the tree is labeled with a class or a probability distribution over the classes. A tree can be learned by splitting the source set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. See the examples illustrated in the figure for spaces that have and have not been partitioned using recursive partitioning. The recursion is completed when the subset at a node has all the same value of the target variable, or when splitting no longer adds value to the predictions. This process of induction of decision trees is an example of a greedy algorithm.
In data mining, decision trees can be described also as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data. Data comes in records of the form (x, Y) = (x1, x2, …, xk, Y). The dependent variable, Y, is the target variable that we are trying to understand. The vector x is composed of the input variables, x1, x2, x3 etc., that are used for that task.
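The greedy splitting step at the heart of recursive partitioning can be sketched as follows: compute the entropy of the target labels, then pick the attribute whose value test yields the largest information gain. The tiny weather-style dataset below is made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(records, target):
    """Pick the attribute whose value test yields the largest information gain."""
    base = entropy([r[target] for r in records])
    best_attr, best_gain = None, 0.0
    for attr in (a for a in records[0] if a != target):
        # Weighted entropy of the subsets produced by splitting on attr.
        remainder = 0.0
        for value in {r[attr] for r in records}:
            subset = [r[target] for r in records if r[attr] == value]
            remainder += len(subset) / len(records) * entropy(subset)
        gain = base - remainder
        if gain > best_gain:
            best_attr, best_gain = attr, gain
    return best_attr

data = [
    {"outlook": "sunny",    "windy": False, "play": "no"},
    {"outlook": "sunny",    "windy": True,  "play": "no"},
    {"outlook": "rainy",    "windy": False, "play": "yes"},
    {"outlook": "rainy",    "windy": True,  "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True,  "play": "yes"},
]
print(best_split(data, "play"))  # outlook
```

A full tree learner would split on the chosen attribute and recurse on each subset until the stopping conditions above are met.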
16.
Ensemble learning
–
Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner; the broader term of multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner. Fast algorithms such as decision trees are commonly used in ensemble methods, although slower algorithms can benefit from ensemble techniques as well. By analogy, ensemble techniques have been used also in unsupervised learning scenarios. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have flexibility in the functions they can represent. Empirically, ensembles tend to yield better results when there is a significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine. Although perhaps non-intuitive, more random algorithms can be used to produce a stronger ensemble than very deliberate algorithms. Using a variety of learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity. The number of component classifiers of an ensemble has an impact on the accuracy of prediction. Determining the proper ensemble size a priori is difficult, and the volume and velocity of big data streams make this even more crucial for online ensemble classifiers; statistical tests have mostly been used for determining the proper number of components.
This is called the law of diminishing returns in ensemble construction, and their theoretical framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy. The Bayes optimal classifier is a classification technique; it is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in H; the hypothesis represented by the Bayes optimal classifier, however, is the optimal hypothesis in ensemble space. Unfortunately, the Bayes optimal classifier cannot be practically implemented for any but the simplest problems. There are several reasons why the Bayes optimal classifier cannot be practically implemented: most interesting hypothesis spaces are too large to iterate over
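The basic mechanism of combining hypotheses can be sketched as an unweighted majority vote (the three base classifiers here are deliberately simple, hypothetical stand-ins for trained models):

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine base hypotheses by unweighted majority vote."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical (and individually weak) threshold classifiers.
clf_a = lambda x: "pos" if x > 0 else "neg"
clf_b = lambda x: "pos" if x > -1 else "neg"
clf_c = lambda x: "pos" if x > 1 else "neg"

print(majority_vote([clf_a, clf_b, clf_c], 0.5))   # pos (2 of 3 agree)
print(majority_vote([clf_a, clf_b, clf_c], -0.5))  # neg (2 of 3 agree)
```

Weighted schemes, such as the likelihood-proportional votes of the Bayes optimal classifier, replace the raw count with a weighted sum.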
17.
Bootstrap aggregating
–
Bootstrap aggregating, or bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach. Bagging was proposed by Leo Breiman in 1994 to improve classification by combining classifications of randomly generated training sets. Given a standard training set D of size n, bagging generates m new training sets Di, each of size n′, by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each Di. If n′ = n, then for large n the set Di is expected to have the fraction (1 − 1/e) ≈ 63.2% of the unique examples of D, the rest being duplicates. This kind of sample is known as a bootstrap sample. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification). Bagging leads to improvements for unstable procedures, which include, for example, artificial neural networks and classification and regression trees. An interesting application of bagging showing improvement in preimage learning is provided here. On the other hand, it can degrade the performance of stable methods such as k-nearest neighbors. To illustrate the basic principles of bagging, below is an analysis of the relationship between ozone and temperature. The relationship between temperature and ozone in this data set is apparently non-linear, based on the scatter plot. To mathematically describe this relationship, LOESS smoothers are used. Instead of building a single smoother from the complete data set, 100 bootstrap samples of the data were drawn. Each sample is different from the original data set, yet resembles it in distribution. For each bootstrap sample, a LOESS smoother was fit; predictions from these 100 smoothers were then made across the range of the data. The first 10 predicted smooth fits appear as lines in the figure below.
The lines are clearly very wiggly and they overfit the data, a result of the span being too low. By taking the average of 100 smoothers, each fitted to a subset of the original data set, we arrive at one bagged predictor. Clearly, the mean is more stable and there is less overfit.
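The bootstrap sampling step can be sketched as follows; for large n, the fraction of unique examples in each sample approaches 1 − 1/e ≈ 63.2%, as stated above (the data here is synthetic, and the exact fraction varies with the random seed):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly *with replacement*."""
    return [rng.choice(data) for _ in data]

rng = random.Random(0)
data = list(range(1000))
sample = bootstrap_sample(data, rng)

# Duplicates mean the sample covers only part of the original set.
unique_fraction = len(set(sample)) / len(data)
print(round(unique_fraction, 3))  # close to 1 - 1/e for large n
```

Bagging fits one model per such sample and then averages (regression) or votes (classification) across them.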
18.
Random forest
–
Random decision forests correct for decision trees' habit of overfitting to their training set. An extension of the algorithm was developed by Leo Breiman and Adele Cutler. A subsequent work along the same lines concluded that other splitting methods, as long as they are randomly forced to be insensitive to some feature dimensions, behave similarly. An explanation of the forest method's resistance to overtraining can be found in Kleinberg's theory of stochastic discrimination. The idea of random subspace selection from Ho was also influential in the design of random forests: in this method a forest of trees is grown, and variation among the trees is introduced by projecting the training data into a randomly chosen subspace before fitting each tree or each node. Finally, the idea of randomized node optimization, where the decision at each node is selected by a randomized procedure rather than a deterministic optimization, was also influential. The introduction of random forests proper was first made in a paper by Leo Breiman. This paper describes a method of building a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging. Decision trees are a popular method for various machine learning tasks. In particular, trees that are grown very deep tend to learn highly irregular patterns: they overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability. The training algorithm for random forests applies the technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly (B times) selects a random sample with replacement of the training set and fits trees to these samples. For b = 1, …, B: sample, with replacement, n training examples from X, Y; call these Xb, Yb; then train a classification or regression tree fb on Xb, Yb.
This bootstrapping procedure leads to better model performance because it decreases the variance of the model. This means that while the predictions of a single tree are highly sensitive to noise in its training set, the average of many trees is not, as long as the trees are not correlated. Simply training many trees on a single training set would give strongly correlated trees; bootstrap sampling is a way of de-correlating the trees by showing them different training sets. The number of samples/trees, B, is a free parameter. Typically, a few hundred to several thousand trees are used, depending on the size and nature of the training set. The training and test error tend to level off after some number of trees have been fit. The above procedure describes the original bagging algorithm for trees.
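The b = 1, …, B loop above can be sketched as follows. Since the tree learner itself is not shown here, a hypothetical `train_tree` callable stands in for it (the stand-in below just predicts the mean response of its bootstrap sample), and the combination rule shown is averaging, as used for regression:

```python
import random

def bagged_predictor(X, Y, train_tree, B, rng):
    """Fit B learners on bootstrap samples and average their predictions."""
    n = len(X)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n indices with replacement
        Xb = [X[i] for i in idx]
        Yb = [Y[i] for i in idx]
        models.append(train_tree(Xb, Yb))
    return lambda x: sum(m(x) for m in models) / B  # average for regression

# Hypothetical stand-in "tree": predicts the mean response of its sample.
mean_learner = lambda Xb, Yb: (lambda x, m=sum(Yb) / len(Yb): m)

rng = random.Random(1)
X = [1, 2, 3, 4, 5]
Y = [2.0, 4.0, 6.0, 8.0, 10.0]
predict = bagged_predictor(X, Y, mean_learner, B=50, rng=rng)
print(round(predict(3), 1))  # close to the overall mean of Y
```

A real random forest additionally restricts each split to a random subset of features, further de-correlating the trees.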
19.
K-nearest neighbors algorithm
–
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression. In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors. If k = 1, then the object is simply assigned to the class of that single nearest neighbor. In k-NN regression, the output is the property value for the object; this value is the average of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning. The k-NN algorithm is among the simplest of all machine learning algorithms. It can be useful to weight the contributions of the neighbors, so that nearer neighbors contribute more than distant ones; for example, a common weighting scheme consists of giving each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class or the property value is known. This can be thought of as the training set for the algorithm. A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data. The algorithm is not to be confused with k-means, another popular machine learning technique. Suppose we have pairs (X1, Y1), (X2, Y2), …, (Xn, Yn) taking values in R^d × {1, 2}, where Y is the class label of X, so that X | Y = r ∼ P_r for r = 1, 2. Given some norm ‖·‖ on R^d and a point x ∈ R^d, let (X(1), Y(1)), …, (X(n), Y(n)) be a reordering of the training data such that ‖X(1) − x‖ ≤ ⋯ ≤ ‖X(n) − x‖. The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used. In the context of gene expression microarray data, for example, k-NN has also been employed with correlation coefficients such as Pearson and Spearman. A drawback of the basic majority voting classification occurs when the class distribution is skewed.
That is, examples of a more frequent class tend to dominate the prediction of the new example. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map, each node is a representative of a cluster of similar points, regardless of their density in the original training data.
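The basic classification rule described above (Euclidean distance, unweighted majority vote) can be sketched in a few lines; the 2-D training points and labels here are made up:

```python
import math
from collections import Counter

def knn_classify(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    # "Training" was only storing the points; all work happens at query time.
    neighbors = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical 2-D training data with two class labels.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((6, 7), "B"), ((7, 6), "B")]
print(knn_classify(train, (2, 2), k=3))  # A
print(knn_classify(train, (6, 5), k=3))  # B
```

Distance-weighted voting would replace the raw count with weights such as 1/d per neighbor.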
20.
Naive Bayes classifier
–
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes has been studied extensively since the 1950s; with appropriate pre-processing, it is competitive in the text-classification domain with more advanced methods including support vector machines. It also finds application in medical diagnosis. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time. In the statistics and computer science literature, naive Bayes models are known under a variety of names, including simple Bayes. All these names reference the use of Bayes' theorem in the classifier's decision rule, but naive Bayes is not (necessarily) a Bayesian method. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter; a naive Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an apple. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches. An advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification. The problem with this formulation is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using the naive conditional independence assumption, p(x_i | C_k, x_1, …, x_{i−1}) = p(x_i | C_k). Thus, the joint model can be expressed as p(C_k | x_1, …, x_n) ∝ p(C_k, x_1, …, x_n) ∝ p(C_k) p(x_1 | C_k) p(x_2 | C_k) ⋯ ∝ p(C_k) ∏_{i=1}^{n} p(x_i | C_k).
The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the maximum a posteriori (MAP) decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label ŷ = C_k for some k as follows: ŷ = argmax_k p(C_k) ∏_{i=1}^{n} p(x_i | C_k). A class's prior may be calculated by assuming equiprobable classes, or by calculating an estimate for the class probability from the training set. To estimate the parameters for a feature's distribution, one must assume a distribution or generate nonparametric models for the features from the training set. The assumptions on distributions of features are called the event model of the naive Bayes classifier. For discrete features like the ones encountered in document classification, multinomial and Bernoulli distributions are popular.
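The multinomial event model with the MAP decision rule above can be sketched as follows. The toy spam/ham documents are made up, log-probabilities are used to avoid underflow, and Laplace (add-one) smoothing is an added assumption to handle words unseen in a class:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Fit a multinomial naive Bayes model: class counts and per-class word counts."""
    priors = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    return priors, word_counts, vocab

def classify_nb(model, words):
    """MAP decision rule: argmax_k log p(C_k) + sum_i log p(x_i | C_k)."""
    priors, word_counts, vocab = model
    total_docs = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in priors:
        score = math.log(priors[label] / total_docs)
        total = sum(word_counts[label].values())
        for w in words:
            # Laplace smoothing avoids zero probabilities for unseen words.
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [(["cheap", "pills", "buy"], "spam"),
        (["buy", "now", "cheap"], "spam"),
        (["meeting", "agenda", "notes"], "ham"),
        (["project", "meeting", "notes"], "ham")]
model = train_nb(docs)
print(classify_nb(model, ["cheap", "buy"]))      # spam
print(classify_nb(model, ["meeting", "notes"]))  # ham
```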
21.
Artificial neural network
–
Each neural unit is connected with many others, and links can enhance or inhibit the activation state of adjoining neural units. Each individual neural unit computes using a summation function. There may be a threshold function or limiting function on each connection and on the unit itself, such that the signal must surpass the limit before propagating to other neurons. These systems are self-learning and trained, rather than explicitly programmed. Neural networks typically consist of multiple layers or a cube design, and the signal path traverses from the first to the last layer of neural units. Back propagation is the use of stimulation to reset weights on the "front" neural units. More modern networks are a bit more free-flowing in terms of stimulation and inhibition, with connections interacting in a more chaotic and complex fashion. Dynamic neural networks are the most advanced, in that they dynamically can, based on rules, form new connections. The goal of the neural network is to solve problems in the same way that the human brain would, although several neural networks are more abstract. New brain research often stimulates new patterns in neural networks; one new approach is using connections which span much further and link processing layers rather than always being localized to adjacent neurons. Neural networks are based on real numbers, with the value of the core typically being a representation between 0.0 and 1. An interesting facet of these systems is that they are unpredictable in their success with self-learning: after training, some become great problem solvers and others don't perform as well. In order to train them, several thousand cycles of interaction typically occur. Warren McCulloch and Walter Pitts created a computational model for neural networks based on mathematics. This model paved the way for neural network research to split into two distinct approaches: one approach focused on biological processes in the brain, and the other focused on the application of neural networks to artificial intelligence.
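The summation-then-threshold behaviour of a single neural unit described above can be sketched as follows (the weights and threshold are made-up illustrative values):

```python
def neural_unit(inputs, weights, threshold):
    """A single neural unit: weighted summation followed by a threshold."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    # The signal propagates (fires) only if it surpasses the threshold.
    return 1 if activation > threshold else 0

# Hypothetical unit that fires only when both inputs are strongly active.
print(neural_unit([1, 1], [0.6, 0.6], threshold=1.0))  # 1 (activation 1.2 > 1.0)
print(neural_unit([1, 0], [0.6, 0.6], threshold=1.0))  # 0 (activation 0.6 <= 1.0)
```

A network wires many such units into layers, and learning adjusts the weights rather than the program.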
This work led to the paper by Kleene on nerve networks: "Representation of events in nerve nets and finite automata", in Automata Studies, ed. by C. E. Shannon, Annals of Mathematics Studies, no. 34, Princeton University Press, Princeton, N.J., 1956. In the late 1940s, psychologist Donald Hebb created a hypothesis of learning based on the mechanism of neural plasticity that is now known as Hebbian learning. Hebbian learning is considered to be a typical unsupervised learning rule. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines. Farley and Wesley A. Clark first used computational machines, then called calculators, to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit, and Duda. Frank Rosenblatt created the perceptron, an algorithm for pattern recognition based on a computer learning network using simple addition and subtraction.
22.
Logistic regression
–
In statistics, logistic regression, or logit regression, or logit model is a regression model where the dependent variable is categorical. This article covers the case of a binary dependent variable, that is, one where the output can take only two values, such as pass/fail or win/lose. Cases where the dependent variable has more than two outcome categories may be analysed in multinomial logistic regression, or, if the multiple categories are ordered, in ordinal logistic regression. In the terminology of economics, logistic regression is an example of a qualitative response/discrete choice model. Logistic regression was developed by statistician David Cox in 1958. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor variables. It allows one to say that the presence of a risk factor increases the probability of a given outcome by a specific percentage. Logistic regression is used in various fields, including machine learning and most medical fields. For example, the Trauma and Injury Severity Score (TRISS), which is used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess severity of a patient have been developed using logistic regression. Logistic regression may be used to predict whether a patient has a given disease, based on observed characteristics of the patient. Another example might be to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections, etc. The technique can also be used in engineering, especially for predicting the probability of failure of a given process. It is also used in marketing applications such as prediction of a customer's propensity to purchase a product or halt a subscription, etc. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.
Suppose we wish to answer the following question: a group of 20 students spend between 0 and 6 hours studying for an exam; how does the number of hours spent studying affect the probability that a student will pass the exam? The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by 1 and 0, are not cardinal numbers. If the problem were changed so that pass/fail was replaced with the grade 0–100, then simple regression analysis could be used. The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0). The graph shows the probability of passing the exam versus the number of hours studying, with the fitted logistic regression curve. The logistic regression analysis gives the following output. The output indicates that hours studying is significantly associated with the probability of passing the exam. The output from the logistic regression analysis gives a p-value of p = 0.0167, which is based on the Wald z-score. Rather than the Wald method, the recommended method to calculate the p-value for logistic regression is the likelihood-ratio test. Logistic regression can be binomial, ordinal or multinomial. Binomial or binary logistic regression deals with situations in which the observed outcome for a dependent variable can have only two possible types, "0" and "1".
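A fit of this kind can be sketched with the logistic (sigmoid) function and stochastic gradient ascent on the log-likelihood. The study-hours data below is made up for illustration (it is not the 20-student table from the text), and the learning rate and epoch count are arbitrary choices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(hours, passed, lr=0.1, epochs=5000):
    """Fit p(pass) = sigmoid(b0 + b1*hours) by stochastic gradient ascent."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(hours, passed):
            err = y - sigmoid(b0 + b1 * x)  # gradient of the log-likelihood
            b0 += lr * err
            b1 += lr * err * x
    return b0, b1

# Hypothetical data: more study hours, higher chance of passing.
hours  = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0,   0,   0,   0,   1,   0,   1,   1,   1,   1]
b0, b1 = fit_logistic(hours, passed)
print(round(sigmoid(b0 + b1 * 4.0), 2))  # high probability of passing after 4 hours
```

Statistical packages additionally report standard errors and p-values for b0 and b1, which this sketch omits.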
23.
Perceptron
–
In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. The perceptron algorithm dates back to the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced. The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt, funded by the United States Office of Naval Research. This machine was designed for image recognition: it had an array of 400 photocells, weights were encoded in potentiometers, and weight updates during learning were performed by electric motors. Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. It is often believed that Minsky and Papert conjectured that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. Three years later, Stephen Grossberg published a series of papers introducing networks capable of modelling differential, contrast-enhancing and XOR functions. Nevertheless, the often-miscited Minsky/Papert text caused a significant decline in interest, and it took ten more years until neural network research experienced a resurgence in the 1980s. This text was reprinted in 1987 as Perceptrons - Expanded Edition, where some errors in the original text are shown and corrected. The kernel perceptron algorithm was introduced in 1964 by Aizerman et al. The bias shifts the decision boundary away from the origin and does not depend on any input value. The value of f(x) is used to classify x as either a positive or a negative instance, in the case of a binary classification problem.
If b is negative, then the weighted combination of inputs must produce a positive value greater than |b| in order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the position of the decision boundary. The perceptron learning algorithm does not terminate if the learning set is not linearly separable: if the vectors are not linearly separable, learning will never reach a point where all vectors are classified properly. The most famous example of the perceptron's inability to solve problems with linearly nonseparable vectors is the Boolean exclusive-or (XOR) problem. The solution spaces of decision boundaries for all binary functions and learning behaviors are studied in the reference. In the context of neural networks, a perceptron is an artificial neuron using the Heaviside step function as the activation function. The perceptron algorithm is also termed the single-layer perceptron, to distinguish it from a multilayer perceptron.
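The perceptron learning rule can be sketched as follows. Boolean AND is linearly separable, so the algorithm converges on it; on XOR it would cycle forever, as noted above. The learning rate and epoch count are illustrative choices:

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule: w <- w + lr * (target - output) * x."""
    w = [0.0, 0.0]
    b = 0.0  # the bias shifts the decision boundary away from the origin
    for _ in range(epochs):
        for x, target in samples:
            output = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = target - output
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Boolean AND: linearly separable, so the rule converges.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
print([predict(w, b, x) for x, _ in and_data])  # [0, 0, 0, 1]
```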
24.
Support vector machine
–
In machine learning, support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new point will be in. In the case of support vector machines, a data point is viewed as a p-dimensional vector, and we want to know whether we can separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes, so we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. More formally, a support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space. Whereas the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. For this reason, it was proposed that the original finite-dimensional space be mapped into a much higher-dimensional space, presumably making the separation easier in that space. The hyperplanes in the higher-dimensional space are defined as the set of points whose dot product with a vector in that space is constant. The vectors defining the hyperplanes can be chosen to be linear combinations with parameters α_i of images of feature vectors x_i that occur in the database. With this choice of a hyperplane, the points x in the feature space that are mapped into the hyperplane are defined by the relation ∑_i α_i k(x_i, x) = constant.
Note that if the kernel k(x, y) becomes small as y grows farther away from x, each term in the sum measures the degree of closeness of the test point x to the corresponding data point xi. In this way, the sum of kernels above can be used to measure the relative nearness of each test point to the data points originating in one or the other of the sets to be discriminated. Classification of images can also be performed using SVMs: experimental results show that SVMs achieve significantly higher search accuracy than traditional query refinement schemes after just three to four rounds of relevance feedback. This is also true of image segmentation systems, including those using a modified version of SVM that applies the privileged approach suggested by Vapnik. Hand-written characters can be recognized using SVMs, and the SVM algorithm has been widely applied in the biological and other sciences. SVMs have been used to classify proteins, with up to 90% of the compounds classified correctly. Permutation tests based on SVM weights have been suggested as a mechanism for interpretation of SVM models, and support vector machine weights have also been used to interpret SVM models in the past. The original SVM algorithm was invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes
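The "relative nearness" idea at the start of this passage can be sketched directly: a Gaussian (RBF) kernel shrinks as y moves away from x, so summing kernel values against each labelled set scores which set a test point is closer to. The bandwidth gamma and the toy 1-D points below are illustrative assumptions, not from the text:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * |x - y|^2): large when y is near x, small otherwise."""
    return math.exp(-gamma * (x - y) ** 2)

def nearness(test_point, points, gamma=1.0):
    """Sum of kernel values between the test point and one set's points."""
    return sum(rbf_kernel(test_point, p, gamma) for p in points)

set_a = [0.0, 0.5, 1.0]   # training points of one class
set_b = [4.0, 4.5, 5.0]   # training points of the other class

# Assign the test point to whichever set accumulates the larger kernel sum.
score = lambda x: "A" if nearness(x, set_a) > nearness(x, set_b) else "B"
```

A full SVM would weight each kernel term by a learned coefficient αi; the unweighted sum here only illustrates the nearness measure itself.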
25.
Hierarchical clustering
–
In data mining and statistics, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types. Agglomerative: a bottom-up approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy. Divisive: a top-down approach in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. In general, the merges and splits are determined in a greedy manner, and the results of hierarchical clustering are usually presented in a dendrogram. In the general case, the complexity of agglomerative clustering is O(n³); divisive clustering with an exhaustive search is O(2ⁿ), which is even worse. However, for some special cases, optimal efficient agglomerative methods are known, such as SLINK for single-linkage clustering. In order to decide which clusters should be combined, or where a cluster should be split, a measure of dissimilarity between sets of observations is required. The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Several metrics are commonly used for hierarchical clustering; for text or other non-numeric data, non-Euclidean metrics are often used instead. The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations; commonly used linkage criteria between two sets of observations A and B are defined in terms of the chosen metric d. Other linkage criteria include: the sum of all intra-cluster variance; the decrease in variance for the cluster being merged; the probability that candidate clusters spawn from the same distribution function; the product of in-degree and out-degree on a k-nearest-neighbour graph; and the increment of some cluster descriptor after merging two clusters. Hierarchical clustering has the advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. For example, suppose this data is to be clustered, and the Euclidean distance is the distance metric. Cutting the tree at a given height will give a partitioning clustering at a selected precision. 
In this example, cutting the dendrogram after a given row yields one set of clusters; cutting after a later row yields fewer clusters, which is a coarser clustering. The hierarchical clustering dendrogram would be as such. This method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements, and the first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance. Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements
26.
K-means clustering
–
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean; this results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult; however, there are efficient heuristic algorithms that are commonly employed. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, via an iterative refinement approach employed by both algorithms. The algorithm also has a relationship to the k-nearest neighbor classifier: one can apply the 1-nearest neighbor classifier to the cluster centers obtained by k-means to classify new data into the existing clusters. This is known as the nearest centroid classifier or Rocchio algorithm. In other words, its objective is to find argmin_S ∑_{i=1}^{k} ∑_{x∈S_i} ‖x − μ_i‖², where μ_i is the mean of the points in S_i. The term "k-means" was first used by James MacQueen in 1967; the standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it wasn't published outside of Bell Labs until 1982. In 1965, E. W. Forgy published essentially the same method. The most common algorithm uses an iterative refinement technique; due to its ubiquity it is called the k-means algorithm, and it is also referred to as Lloyd's algorithm. Assignment step: assign each observation x_p to the cluster S_i whose mean is nearest; since the sum of squares is the squared Euclidean distance, this is intuitively the nearest mean. Each x_p is assigned to exactly one S_i, even if it could be assigned to two or more of them. Update step: calculate the new means to be the centroids of the observations in the new clusters, m_i = (1/|S_i|) ∑_{x_j ∈ S_i} x_j. Since the arithmetic mean is a least-squares estimator, this step also minimizes the within-cluster objective. The algorithm has converged when the assignments no longer change. 
Since both steps optimize the WCSS objective and there exist only a finite number of such partitionings, the algorithm must converge to a (local) optimum. There is no guarantee that the global optimum is found using this algorithm. The algorithm is often presented as assigning objects to the nearest cluster by distance; the standard algorithm aims at minimizing the WCSS objective, and thus assigns by least sum of squares, which is exactly equivalent to assigning by the smallest Euclidean distance. Using a distance function other than (squared) Euclidean distance may stop the algorithm from converging; various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow using other distance measures
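The two alternating steps can be sketched on 1-D data. The data and initial means below are illustrative assumptions; the loop is the standard Lloyd iteration, terminating when assignments stop changing:

```python
# Minimal sketch of the standard (Lloyd's) k-means algorithm: alternate an
# assignment step (nearest mean by squared Euclidean distance) and an update
# step (centroid of each cluster) until the assignments no longer change.

def kmeans(xs, means, max_iter=100):
    assign = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_assign = [min(range(len(means)), key=lambda i: (x - means[i]) ** 2)
                      for x in xs]
        if new_assign == assign:        # converged: assignments unchanged
            break
        assign = new_assign
        # Update step: the new mean is the centroid (least-squares estimator).
        for i in range(len(means)):
            members = [x for x, a in zip(xs, assign) if a == i]
            if members:
                means[i] = sum(members) / len(members)
    return means, assign

data = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
means, labels = kmeans(data, [1.0, 10.0])
```

With a different initialization the loop may settle in a different local optimum, which is the lack-of-global-guarantee noted above.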
27.
DBSCAN
–
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature; in 2014, the algorithm was awarded the test of time award at the leading data mining conference, KDD. Consider a set of points in some space to be clustered. A point p is a core point if at least a minimum number of points are within distance ε of it; those points are said to be directly reachable from p. By definition, no points are directly reachable from a non-core point. A point q is reachable from p if there is a path p1, …, pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi. All points not reachable from any other point are outliers. Now if p is a core point, then it forms a cluster together with all points that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its edge, since they cannot be used to reach more points. Reachability is not a symmetric relation since, by definition, no point may be reachable from a non-core point; therefore a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN. Two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. A cluster then satisfies two properties: all points within the cluster are mutually density-connected, and if a point is density-reachable from any point of the cluster, it is part of the cluster as well. DBSCAN requires two parameters: ε and the minimum number of points required to form a dense region. It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-environment of a different point and hence become part of a cluster. If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. 
Hence, all points that are found within the ε-neighborhood are added, as is their own ε-neighborhood when they are also dense, and this process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise. These simplifications have been omitted from the above pseudocode in order to reflect the originally published version; additionally, the regionQuery function need not return P in the list of points to be visited, as long as it is otherwise still counted in the local density estimate
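The cluster-growing procedure can be sketched compactly. This is an illustrative 1-D version, with eps and min_pts values chosen for the toy data: core points (at least min_pts neighbours within eps, counting themselves) grow clusters through their neighbourhoods, border points are absorbed without expanding, and the rest is labelled noise:

```python
# Compact sketch of DBSCAN on 1-D points; labels are cluster ids, -1 = noise.

def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    region = lambda i: [j for j in range(len(points))
                        if abs(points[i] - points[j]) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:      # not a core point (for now): noise
            labels[i] = NOISE
            continue
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:                       # expand the density-connected cluster
            j = seeds.pop()
            if labels[j] == NOISE:         # border point: reclaim from noise
                labels[j] = cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:  # j is core: extend the frontier
                seeds.extend(j_neighbours)
        cluster += 1
    return labels

labels = dbscan([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0], eps=0.5, min_pts=3)
```

The isolated point at 20.0 never accumulates min_pts neighbours and stays noise, while each dense group becomes its own cluster.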
28.
OPTICS algorithm
–
Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander. Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. In order to do so, the points of the database are ordered such that points which are spatially closest become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that needs to be accepted for a cluster in order to have both points belong to the same cluster. This is represented as a dendrogram. Like DBSCAN, OPTICS requires two parameters: ε, which describes the maximum distance (radius) to consider, and MinPts, describing the number of points required to form a cluster. A point p is a core point if at least MinPts points are found within its ε-neighborhood Nε(p). Both the core-distance and the reachability-distance are undefined if no sufficiently dense cluster is available. Given a sufficiently large ε, this will never happen, but then every ε-neighborhood query will return the entire database, resulting in O(n²) runtime. Hence, the ε parameter is required to cut off the density of clusters that are no longer considered to be interesting. The parameter ε is, strictly speaking, not necessary: it can simply be set to the maximum possible value. When a spatial index is available, however, it plays a practical role with regard to complexity. OPTICS abstracts from DBSCAN by removing this parameter, at least to the extent of only having to give a maximum value. The basic approach of OPTICS is similar to DBSCAN, but instead of maintaining a set of known, but so far unprocessed cluster members, a priority queue is used. Using a reachability-plot, the hierarchical structure of the clusters can be obtained easily. 
It is a 2D plot, with the ordering of the points as processed by OPTICS on the x-axis and the reachability distance on the y-axis. Since points belonging to a cluster have a low reachability distance to their nearest neighbor, the clusters show up as valleys in the reachability plot; the deeper the valley, the denser the cluster. The image above illustrates this concept. In its upper left area, an example data set is shown. The upper right part visualizes the spanning tree produced by OPTICS. Colors in this plot are labels, not computed by the algorithm, but it is clearly visible how the valleys in the plot correspond to the clusters in the above data set. The yellow points in this image are considered noise: no valley is found in their reachability plot, and they will usually not be assigned to clusters, except the omnipresent "all data" cluster in a hierarchical result. Clusterings obtained this way usually are hierarchical, and cannot be achieved by a single DBSCAN run. Like DBSCAN, OPTICS processes each point once, and performs one ε-neighborhood query during this processing
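The core-distance described above can be sketched directly: it is the smallest radius within which a point already has MinPts neighbours, or undefined if even the full ε-neighbourhood holds fewer than MinPts points. The toy 1-D data is an illustrative assumption, and the point is counted in its own neighbourhood here:

```python
# Sketch of the OPTICS core-distance for a 1-D point set.

def core_distance(points, i, eps, min_pts):
    """Distance to the min_pts-th nearest neighbour within eps, else None."""
    dists = sorted(abs(points[i] - p) for p in points
                   if abs(points[i] - p) <= eps)   # includes the point itself
    if len(dists) < min_pts:
        return None          # undefined: no sufficiently dense neighbourhood
    return dists[min_pts - 1]

points = [1.0, 1.2, 1.5, 8.0]
# The dense group around 1.0 has a small core-distance; the isolated point
# at 8.0 has an undefined one for the same eps and min_pts.
```

The reachability-distance then builds on this value: it is the maximum of the core-distance of a point and its actual distance to the point being reached.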
29.
Mean-shift
–
Mean shift is a non-parametric feature-space analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing. The mean shift procedure was originally presented in 1975 by Fukunaga and Hostetler. Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function, and it is useful for detecting the modes of this density. This is an iterative method, and we start with an initial estimate x. Let a kernel function K be given; this function determines the weight of nearby points for re-estimation of the mean. Typically a Gaussian kernel on the distance to the current estimate is used. The difference m − x is called the mean shift in Fukunaga and Hostetler. The mean-shift algorithm now sets x ← m and repeats the estimation until m converges. Although the mean shift algorithm has been widely used in many applications, a rigorous proof for the convergence of the algorithm using a general kernel in a high-dimensional space is still missing. Aliyari Ghassabeh showed the convergence of the mean shift algorithm in one dimension with a differentiable, convex kernel profile; however, the one-dimensional case has limited real-world applications. Also, the convergence of the algorithm in higher dimensions with a finite number of stationary points has been proved. However, sufficient conditions for a kernel function to have finite stationary points have not been provided. Let data be a finite set S embedded in the n-dimensional Euclidean space X. Let K be a flat kernel that is the characteristic function of the λ-ball in X. In each iteration of the algorithm, s ← m(s) is performed for all s ∈ S simultaneously. The first question, then, is how to estimate the density function given a set of samples. One of the simplest approaches is to just smooth the data, e.g. by convolving it with a kernel of width h. 
Here h is the parameter in the algorithm and is called the bandwidth. This approach is known as kernel density estimation or the Parzen window technique. Once we have computed f from the equation above, we can find its local maxima using gradient ascent or some other optimization technique. The problem with this brute-force approach is that, for higher dimensions, it becomes computationally prohibitive to evaluate f over the complete search space. Instead, mean shift uses a variant of what is known in the optimization literature as multiple restart gradient descent. Kernel definition: let X be the n-dimensional Euclidean space, R^n, and denote the i-th component of x ∈ X by x_i
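One mean-shift trajectory with the flat kernel described above (the characteristic function of the λ-ball) can be sketched in a few lines: the new estimate m is the mean of the sample points inside the λ-ball around the current estimate, repeated until the shift m − x vanishes. The 1-D sample set and starting point are illustrative assumptions:

```python
# Sketch of mean shift with a flat kernel of bandwidth lam; assumes the
# starting point is close enough to the samples that its window is nonempty.

def mean_shift(x, samples, lam, tol=1e-9, max_iter=100):
    for _ in range(max_iter):
        window = [s for s in samples if abs(s - x) <= lam]  # flat kernel
        m = sum(window) / len(window)   # mean of points in the lambda-ball
        if abs(m - x) < tol:            # the mean shift m - x has converged
            return m
        x = m
    return x

samples = [1.0, 1.2, 1.4, 6.0]
mode = mean_shift(0.9, samples, lam=1.0)  # climbs to the dense group's mean
```

A Gaussian kernel would instead weight every sample by its distance to the current estimate, but the fixed-point structure of the iteration is the same.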
30.
Independent component analysis
–
In signal processing, independent component analysis (ICA) is a computational method for separating a multivariate signal into additive subcomponents. This is done by assuming that the subcomponents are non-Gaussian signals and that they are statistically independent of each other. ICA is a special case of blind source separation. A common example application is the cocktail party problem of listening in on one person's speech in a noisy room. Independent component analysis attempts to decompose a multivariate signal into independent non-Gaussian signals. As an example, sound is usually a signal that is composed of the numerical addition, at each time t, of signals from several sources. The question then is whether it is possible to separate these contributing sources from the observed total signal. When the statistical independence assumption is correct, blind ICA separation of a mixed signal gives very good results; it is also applied, for analysis purposes, to signals that are not supposed to be generated by mixing. A simple application of ICA is the cocktail party problem, where the underlying speech signals are separated from sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. An important note to consider is that if N sources are present, at least N observations (e.g. microphones) are needed to recover the original signals; other cases of underdetermined and overdetermined mixing have also been investigated. That the ICA separation of mixed signals gives very good results is based on two assumptions and three effects of mixing source signals. Two assumptions: the source signals are independent of each other, and the values in each source signal have non-Gaussian distributions. Three effects of mixing source signals: Independence: as per assumption 1, the source signals are independent; however, their signal mixtures are not. This is because the signal mixtures share the same source signals. Normality: according to the central limit theorem, the distribution of a sum of independent random variables with finite variance tends towards a Gaussian distribution. 
Loosely speaking, a sum of two independent random variables usually has a distribution that is closer to Gaussian than either of the two original variables; here we consider the value of each signal as a random variable. Complexity: the temporal complexity of any signal mixture is greater than that of its simplest constituent source signal. Those principles contribute to the basic establishment of ICA. ICA finds the independent components by maximizing the statistical independence of the estimated components. We may choose one of many ways to define a proxy for independence, and this choice governs the form of the ICA algorithm. The non-Gaussianity family of ICA algorithms, motivated by the central limit theorem, uses kurtosis. Whitening and dimension reduction can be achieved with principal component analysis or singular value decomposition; whitening ensures that all dimensions are treated equally a priori before the algorithm is run
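The Normality effect and the kurtosis proxy can be checked empirically: a uniform source is strongly non-Gaussian (negative excess kurtosis), while a sum of two independent uniforms is closer to Gaussian (excess kurtosis nearer 0). The sample size and seed below are illustrative assumptions:

```python
import random

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (which is 0 for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / var ** 2 - 3

random.seed(0)
source = [random.random() for _ in range(50_000)]              # one uniform source
mixture = [random.random() + random.random() for _ in source]  # sum of two sources

k_source = excess_kurtosis(source)    # near -1.2 for a uniform distribution
k_mixture = excess_kurtosis(mixture)  # nearer 0: the mixture is "more Gaussian"
```

This is exactly why maximizing non-Gaussianity of an estimated component pulls it away from a mixture and toward a single source.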
31.
Linear discriminant analysis
–
Linear discriminant analysis (LDA) is a method used to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. LDA is closely related to analysis of variance (ANOVA) and regression analysis; however, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable. Logistic regression and probit regression are more similar to LDA than ANOVA is, and these other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method. LDA is also related to principal component analysis (PCA) and factor analysis in that they all look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data; PCA, on the other hand, does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique. LDA works when the measurements made on independent variables for each observation are continuous quantities; when dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis. Consider a set of observations x→ for each sample of an object or event with known class y; this set of samples is called the training set. The classification problem is then to find a good predictor for the class y of any sample of the same distribution given only an observation x→. LDA approaches the problem by assuming that the conditional probability density functions p(x→ | y = 0) and p(x→ | y = 1) are both normally distributed with mean and covariance parameters (μ→0, Σ0) and (μ→1, Σ1), respectively. 
LDA instead makes the additional simplifying homoscedasticity assumption (i.e. that the class covariances are identical) and that the covariances have full rank. In other words, the observation belongs to class y if the corresponding x→ is located on a certain side of a hyperplane perpendicular to w→. The location of the plane is defined by the threshold c. Canonical discriminant analysis finds axes that best separate the categories. These linear functions are uncorrelated and define, in effect, an optimal (k − 1)-dimensional space through the cloud of data that best separates the k groups. See "Multiclass LDA" below for details. Suppose two classes of observations have means μ→0, μ→1 and covariances Σ0, Σ1. Then the linear combination of features w→ ⋅ x→ will have means w→ ⋅ μ→i for i = 0, 1, and it can be shown that the maximum separation occurs when w→ ∝ (Σ0 + Σ1)⁻¹ (μ→1 − μ→0). When the assumptions of LDA are satisfied, the above equation is equivalent to LDA. Be sure to note that the vector w→ is the normal to the discriminant hyperplane; as an example, in a two-dimensional problem, the line that best divides the two groups is perpendicular to w→. Generally, the data points to be discriminated are projected onto w→. There is no general rule for the threshold
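The two-class direction w ∝ (Σ0 + Σ1)⁻¹(μ1 − μ0) can be computed in closed form for 2-D data, since a 2×2 matrix inverts analytically. The class statistics below are illustrative assumptions:

```python
# Sketch of the two-class discriminant direction for 2-D features.

def lda_direction(mu0, mu1, sigma0, sigma1):
    """Return w = (Sigma0 + Sigma1)^{-1} (mu1 - mu0) for 2x2 covariances."""
    # Entries of the summed covariance matrix.
    a, b = sigma0[0][0] + sigma1[0][0], sigma0[0][1] + sigma1[0][1]
    c, d = sigma0[1][0] + sigma1[1][0], sigma0[1][1] + sigma1[1][1]
    det = a * d - b * c                      # 2x2 determinant (full rank assumed)
    dx, dy = mu1[0] - mu0[0], mu1[1] - mu0[1]
    # Closed-form 2x2 inverse applied to the mean difference.
    return ((d * dx - b * dy) / det, (-c * dx + a * dy) / det)

identity = [[1.0, 0.0], [0.0, 1.0]]
w = lda_direction((0.0, 0.0), (2.0, 1.0), identity, identity)
# With identical spherical covariances, w is parallel to the mean difference.
```

Projecting the data onto w and thresholding then gives the linear classifier described in the text.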
32.
Non-negative matrix factorization
–
Non-negative matrix factorization (NMF) factorizes a matrix V into two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. NMF finds applications in such fields as computer vision, document clustering, chemometrics, audio signal processing and recommender systems. In chemometrics, non-negative matrix factorization has a history under the name self-modeling curve resolution; in this framework the vectors in the matrix are continuous curves rather than discrete vectors. Early work on non-negative matrix factorizations was also performed by a Finnish group of researchers in the middle of the 1990s under the name positive matrix factorization. Each column of V can be computed as v_i = W h_i, where v_i is the i-th column of V and h_i is the i-th column of H. When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix, and it is this property that forms the basis of NMF: NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix, then p can be significantly less than both m and n. Here's an example based on a text-mining application. Let the input matrix be V with 10000 rows and 500 columns, where words are in rows and documents are in columns; that is, we have 500 documents indexed by 10000 words, and it follows that a column vector v in V represents a document. Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns. The product of W and H is then a matrix with 10000 rows and 500 columns, the same shape as the input matrix V. This last point is the basis of NMF, because we can consider each original document in our example as being built from a small set of hidden features. 
A column in the coefficients matrix H represents an original document, with a cell value defining the document's rank for a feature; this follows because each row in H represents a feature. It is this property that drives most applications of NMF. More specifically, the approximation of V by V ≈ W H is achieved by minimizing the error function min_{W,H} ||V − W H||_F, subject to W ≥ 0, H ≥ 0. If we add an additional orthogonality constraint on H, i.e. H Hᵀ = I, then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed H gives the cluster indicator (H_{kj} > 0 indicates that the j-th datum belongs to the k-th cluster), and the computed W gives the cluster centroids, i.e. the k-th column gives the cluster centroid of the k-th cluster. This centroid representation can be enhanced by convex NMF
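The minimization of ||V − WH||_F under non-negativity can be sketched with the standard Lee–Seung multiplicative update rules, which keep W and H non-negative while decreasing the Frobenius error. The toy matrix, rank, initialization, and iteration count below are illustrative assumptions:

```python
# Sketch of NMF via multiplicative updates on small list-of-lists matrices.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(V, p, iters=200, eps=1e-9):
    m, n = len(V), len(V[0])
    W = [[1.0 + 0.1 * (i + j) for j in range(p)] for i in range(m)]  # positive init
    H = [[1.0 + 0.1 * (i * j + 1) for j in range(n)] for i in range(p)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), elementwise; eps avoids division by zero.
        WtV, WtWH = matmul(transpose(W), V), matmul(transpose(W), matmul(W, H))
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(n)]
             for i in range(p)]
        # W <- W * (V H^T) / (W H H^T), elementwise.
        VHt, WHHt = matmul(V, transpose(H)), matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(p)]
             for i in range(m)]
    return W, H

def frob_error(V, W, H):
    WH = matmul(W, H)
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0]))) ** 0.5

V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]  # rank-1: exactly recoverable with p = 1
W, H = nmf(V, p=1)
```

Because V here is exactly rank 1, the factorization drives the error essentially to zero; for real data the error plateaus at the best rank-p approximation under the non-negativity constraints.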
33.
Principal component analysis
–
The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations. The resulting vectors form an orthogonal basis set. PCA is sensitive to the relative scaling of the original variables. PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics. PCA is mostly used as a tool in exploratory data analysis. PCA can be done by eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix, usually after mean centering the data matrix for each attribute. The results of a PCA are usually discussed in terms of component scores, sometimes called factor scores. PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data. This is done by using only the first few principal components, so that the dimensionality of the data is reduced. PCA is closely related to factor analysis: factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis. PCA can be thought of as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. To find the axes of the ellipsoid, we must first subtract the mean of each variable from the dataset to center the data around the origin. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then, we must orthogonalize the set of eigenvectors and normalize each to become unit vectors. Once this is done, each of the mutually orthogonal, unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. 
The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. This procedure is sensitive to the scaling of the data. A standard result for a symmetric matrix such as XᵀX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w found, the first component of a data vector x can then be given as a score t1 = x ⋅ w in the transformed co-ordinates, or as the corresponding vector in the original variables. Thus the loading vectors are eigenvectors of XᵀX. However, eigenvectors corresponding to distinct eigenvalues of a symmetric matrix are orthogonal, or can be orthogonalised. The product in the final line is therefore zero: there is no sample covariance between different principal components over the dataset. Another way to characterise the principal components transformation is therefore as the transformation to coordinates which diagonalise the empirical sample covariance matrix. However, not all the principal components need to be kept
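The steps above (center, form the covariance matrix, find its leading eigenvector) can be sketched with power iteration, which converges to the largest-eigenvalue eigenvector by repeated multiplication. The 2-D data and iteration count are illustrative assumptions:

```python
# Sketch of the first principal component of 2-D data via power iteration
# on the covariance matrix of the mean-centred data.

def first_component(data, iters=100):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centred = [(x - mx, y - my) for x, y in data]   # center at the origin
    # Covariance matrix entries of the centred data.
    cxx = sum(x * x for x, _ in centred) / n
    cxy = sum(x * y for x, y in centred) / n
    cyy = sum(y * y for _, y in centred) / n
    w = (1.0, 0.0)
    for _ in range(iters):                          # power iteration
        wx = cxx * w[0] + cxy * w[1]
        wy = cxy * w[0] + cyy * w[1]
        norm = (wx * wx + wy * wy) ** 0.5
        w = (wx / norm, wy / norm)                  # keep w a unit vector
    return w

# Points lying near the line y = x: the first component points along (1, 1)/sqrt(2).
data = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)]
w = first_component(data)
```

The score of a point is then its dot product with w, matching the t1 = x ⋅ w formula in the text.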
34.
T-distributed stochastic neighbor embedding
–
T-distributed stochastic neighbor embedding (t-SNE) is a machine learning algorithm for dimensionality reduction developed by Geoffrey Hinton and Laurens van der Maaten. The t-SNE algorithm comprises two main stages: first, it constructs a probability distribution over pairs of high-dimensional objects such that similar objects have a high probability of being picked; second, it defines a similar probability distribution over the points in the low-dimensional map and minimizes the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map. Note that whilst the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate. t-SNE has been used in a range of applications, including computer security research, music analysis, cancer research and bioinformatics. The bandwidth is adapted to the density of the data. t-SNE aims to learn a d-dimensional map y1, …, yN that reflects the similarities p_ij as well as possible. To this end, it measures similarities q_ij between two points in the map, y_i and y_j, using a similar approach. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs well. Further resources: "Visualizing Data Using t-SNE" (Google Tech Talk about t-SNE); ELKI contains tSNE, also with Barnes–Hut approximation: https://github.com/elki-project/elki/blob/master/elki/src/main/java/de/lmu/ifi/dbs/elki/algorithm/projection/TSNE.java; t-Distributed Stochastic Neighbor Embedding: http://lvdmaaten.github.io/tsne/
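The low-dimensional similarities q_ij that t-SNE optimizes against use a Student t-distribution with one degree of freedom (a Cauchy kernel) on the map distances, normalized over all pairs. A minimal sketch, with assumed 1-D map coordinates:

```python
# Sketch of t-SNE's low-dimensional pairwise similarities:
# q_ij proportional to (1 + |y_i - y_j|^2)^(-1), summing to 1 over all pairs.

def t_similarities(ys):
    n = len(ys)
    weights = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[(i, j)] = 1.0 / (1.0 + (ys[i] - ys[j]) ** 2)
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

q = t_similarities([0.0, 0.1, 5.0])
# Nearby map points (0.0 and 0.1) get a much larger similarity than far ones.
```

The heavy tail of the t-distribution is what lets dissimilar points sit far apart in the map without incurring a large penalty.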
35.
Graphical model
–
A graphical model or probabilistic graphical model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics (particularly Bayesian statistics) and machine learning. Two branches of graphical representations of distributions are commonly used, namely Bayesian networks and Markov random fields. Both families encompass the properties of factorization and independences, but they differ in the set of independences they can encode. If the network structure of the model is a directed acyclic graph, the model represents a factorization of the joint probability of all random variables. More precisely, if the variables are X1, …, Xn, then the joint probability satisfies P(X1, …, Xn) = ∏_{i=1}^{n} P(Xi | pa_i), where pa_i is the set of parents of node Xi. In other words, the joint distribution factors into a product of conditional distributions. In general, any two sets of nodes are conditionally independent given a third set if a criterion called d-separation holds in the graph; local independences and global independences are equivalent in Bayesian networks. This type of graphical model is known as a directed graphical model, Bayesian network, or belief network; classic machine learning models like hidden Markov models and neural networks are related special cases. A Markov random field, also known as a Markov network, is a model over an undirected graph. A graphical model with many repeated subunits can be represented with plate notation. A factor graph is an undirected bipartite graph connecting variables and factors; each factor represents a function over the variables it is connected to, and this is a helpful representation for understanding and implementing belief propagation. A clique tree or junction tree is a tree of cliques. A chain graph is a graph which may have both directed and undirected edges, but without any directed cycles. 
Both directed acyclic graphs and undirected graphs are special cases of chain graphs; an ancestral graph is a further extension, having directed, bidirected and undirected edges. A conditional random field is a model specified over an undirected graph, and a restricted Boltzmann machine is a generative model specified over an undirected graph. See also: belief propagation; structural equation model. Further reading: Graphical Models and Conditional Random Fields; Probabilistic Graphical Models, taught by Eric Xing at CMU; Bishop; Cowell, Robert G., Dawid, A. Philip, Lauritzen, Steffen L., Spiegelhalter, David J., Probabilistic Networks and Expert Systems (a more advanced and statistically oriented book); Jensen, Finn; Koller, D., Friedman, N., Probabilistic Graphical Models (a computational reasoning approach, where the relationships between graphs and probabilities were formally introduced); Getting Started in Probabilistic Graphical Models; Heckerman's Bayes Net Learning Tutorial; A Brief Introduction to Graphical Models and Bayesian Networks; Sargur Srihari's lecture slides on probabilistic graphical models
36.
Bayesian network
–
For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms: given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Formally, Bayesian networks are DAGs whose nodes represent random variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Edges represent conditional dependencies; nodes that are not connected represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables. Similar ideas may be applied to undirected, and possibly cyclic, graphs. Efficient algorithms exist that perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables are called dynamic Bayesian networks; generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. Suppose that there are two events which could cause grass to be wet: either the sprinkler is on or it's raining. Also, suppose that the rain has a direct effect on the use of the sprinkler. Then the situation can be modeled with a Bayesian network; all three variables have two possible values, T and F. The joint probability function is Pr(G, S, R) = Pr(G | S, R) Pr(S | R) Pr(R), where the names of the variables have been abbreviated to G = Grass wet, S = Sprinkler turned on, and R = Raining. The model can answer questions like "What is the probability that it is raining, given that the grass is wet?" For example, Pr(G = T, S = T, R = T) = Pr(G = T | S = T, R = T) Pr(S = T | R = T) Pr(R = T) = 0.99 × 0.01 × 0.2 = 0.00198. Then the numerical result is Pr(R = T | G = T) = (0.00198 + 0.1584) / (0.00198 + 0.288 + 0.1584 + 0.0) = 891/2491 ≈ 35.77 %, where the terms in the numerator and denominator are the joint probabilities of the (G, S, R) assignments TTT and TFT, and TTT, TTF, TFT and TFF, respectively. If, on the other hand, we wish to answer an interventional question: "What is the probability that it would rain, given that we wet the grass?" 
The answer would be governed by the post-intervention joint distribution Pr(S, R | do(G = T)) = Pr(S | R) Pr(R), obtained by removing the factor Pr(G | S, R) from the pre-intervention distribution. As expected, the probability of rain is unaffected by the action: Pr(R | do(G = T)) = Pr(R). If, moreover, we wish to predict the impact of turning the sprinkler on, we have Pr(G, R | do(S = T)) = Pr(R) Pr(G | R, S = T), with the term Pr(S = T | R) removed, showing that the action has an effect on the grass but not on the rain. These predictions may not be feasible when some of the variables are unobserved; the effect of the action do(x) can still be predicted, however, whenever a criterion called back-door is satisfied. It states that, if a set Z of nodes can be observed that d-separates all back-door paths from X to Y, then Pr(Y, Z | do(x)) = Pr(Y, Z, X = x) / Pr(X = x | Z). A back-door path is one that ends with an arrow into X. Sets that satisfy the back-door criterion are called sufficient or admissible; when no such set can be observed, Pr(Y | do(x)) is said to be not identified
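The truncated-factorization rule described above can be sketched numerically, reusing the same conditional probability tables implied by the observational example (delete the factor of the intervened variable, then sum out the rest):

```python
# CPTs implied by the worked numbers for the sprinkler network.
P_R = {True: 0.2, False: 0.8}                              # Pr(R)
P_G_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.0}   # Pr(G=T | S, R)

# Pr(R=T | do(G=T)): removing Pr(G | S, R) leaves Pr(S | R) Pr(R),
# so forcing the grass to be wet leaves the probability of rain unchanged.
p_rain_do_grass = P_R[True]

# Pr(G=T | do(S=T)): remove Pr(S | R), keep Pr(G | R, S=T) Pr(R) and sum out R.
p_grass_do_sprinkler = sum(P_G_given_SR[(True, r)] * P_R[r] for r in (True, False))

print(p_rain_do_grass, p_grass_do_sprinkler)
```

The contrast between the two results is the point of the example: intervening on the grass does not change the rain, while intervening on the sprinkler does change the grass.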
37.
Conditional random field
–
Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. CRFs fall into the sequence modeling family and are a type of discriminative undirected probabilistic graphical model. A CRF is used to encode known relationships between observations and construct consistent interpretations, and it is often used for labeling or parsing of sequential data, such as natural language text or biological sequences, as well as in computer vision. In computer vision, CRFs are often used for object recognition and image segmentation. Lafferty, McCallum and Pereira define a CRF on observations X and random variables Y as follows. Let G = (V, E) be a graph such that Y = (Y_v)_{v ∈ V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field when the random variables Y_v, conditioned on X, obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbors in G. For general graphs, the problem of exact inference in CRFs is intractable; the inference problem for a CRF is basically the same as for an MRF, and the same arguments hold. However, there exist special cases for which exact inference is feasible: if the graph is a chain or a tree, message-passing algorithms yield exact solutions, analogous to the forward-backward and Viterbi algorithms for HMMs, and likewise if the CRF only contains pair-wise potentials and the energy is submodular. If exact inference is impossible, several algorithms can be used to obtain approximate solutions, including loopy belief propagation, alpha expansion, mean field inference, and linear programming relaxations. Learning the parameters θ is usually done by maximum likelihood. If all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex. 
It can be solved, for example, using gradient descent algorithms or quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables; exact inference is intractable in general graphs, so approximations have to be used. In sequence modeling, the graph of interest is usually a chain graph, and an input sequence of observed variables X represents a sequence of observations. The model assigns each feature a numerical weight and combines them to determine the probability of a certain value for Y_i. Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models; an HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions. CRFs can be extended into higher-order models by making each Y_i dependent on a fixed number o of previous variables Y_{i−o}, …, Y_{i−1}; training and inference are only practical for small values of o, since their computational cost increases exponentially with o. Large-margin models for structured prediction, such as the structured support vector machine, can be seen as an alternative training procedure to CRFs. There exists another generalization of CRFs, the semi-Markov conditional random field, which models variable-length segmentations of the label sequence Y; this provides much of the power of higher-order CRFs to model long-range dependencies of the Y_i. Latent-dynamic conditional random fields, or discriminative probabilistic latent variable models, are a type of CRF for sequence tagging tasks. They are latent variable models that are trained discriminatively, and these models find applications in computer vision, specifically gesture recognition from video streams, and in shallow parsing