1.
Software developer
–
A software developer is a person concerned with facets of the software development process, including the research, design, programming, and testing of computer software. Other job titles used with similar meanings are programmer and software analyst. According to developer Eric Sink, the differences between system design, software development, and programming have become more apparent, to the point that some developers become systems architects: those who design the multi-level architecture or component interactions of a large software system. In a large company, there may be employees whose sole responsibility consists of only one of the phases above; in smaller development environments, a few people or even a single individual might handle the complete process. The word "software" was coined as a prank as early as 1953. Before this time, computers were programmed either by customers or by the few commercial computer vendors of the time, such as UNIVAC and IBM. The first company founded to provide software products and services was Computer Usage Company, in 1955. The software industry expanded in the early 1960s, almost immediately after computers were first sold in mass-produced quantities; universities, government, and business customers created a demand for software. Many of these programs were written in-house by full-time staff programmers; some were distributed freely between users of a particular machine for no charge, while others were developed on a contract basis, and firms such as Computer Sciences Corporation started to grow. The computer and hardware makers began bundling operating systems, systems software, and programming environments with their machines. New software was built for microcomputers, and other manufacturers, including IBM, quickly followed DEC's example, resulting in the IBM AS/400 amongst others. The industry expanded greatly with the rise of the personal computer in the mid-1970s, which in the following years created a growing market for games and applications. DOS, Microsoft's first operating system product, was the dominant operating system at the time. By 2014 the role of cloud developer had been defined, and in this context one general definition of a developer was published: "Developers make software for the world to use. The job of a developer is to crank out code -- fresh code for new products, code fixes for maintenance, code for business logic."
2.
IBM
–
International Business Machines Corporation (IBM) is an American multinational technology company headquartered in Armonk, New York, United States, with operations in over 170 countries. The company originated in 1911 as the Computing-Tabulating-Recording Company and was renamed International Business Machines in 1924. IBM manufactures and markets computer hardware, middleware and software, and offers hosting and consulting services in areas ranging from mainframe computers to nanotechnology. IBM is also a major research organization, holding the record for most patents generated by a business for 24 consecutive years. IBM has continually shifted its business mix by exiting commoditizing markets and focusing on higher-value markets; in 2014, IBM announced that it would go fabless, continuing to design semiconductors but offloading manufacturing to GlobalFoundries. Nicknamed Big Blue, IBM is one of 30 companies included in the Dow Jones Industrial Average and one of the world's largest employers, with nearly 380,000 employees. Known as IBMers, IBM employees have been awarded five Nobel Prizes, six Turing Awards, and ten National Medals of Technology. In the 1880s, technologies emerged that would ultimately form the core of what would become International Business Machines. On June 16, 1911, four companies were amalgamated in New York State by Charles Ranlett Flint, forming a fifth company, the Computing-Tabulating-Recording Company, based in Endicott, New York. The five companies had 1,300 employees and offices and plants in Endicott and Binghamton, New York; Dayton, Ohio; Detroit, Michigan; Washington, D.C.; and Toronto. They manufactured machinery for sale and lease, ranging from commercial scales and industrial time recorders to meat and cheese slicers, tabulators, and punched cards. Thomas J. Watson, Sr., fired from the National Cash Register Company by John Henry Patterson, called on Flint and joined CTR as General Manager; 11 months later he was made President, when court cases relating to his time at NCR were resolved. Having learned Patterson's pioneering business practices, Watson proceeded to put the stamp of NCR onto CTR's companies, and his favorite slogan, THINK, became a mantra for each company's employees. During Watson's first four years, revenues more than doubled to $9 million. Watson had never liked the clumsy hyphenated name Computing-Tabulating-Recording Company and in 1924 chose to replace it with the more expansive title International Business Machines. By 1933 most of the subsidiaries had been merged into one company. In 1937, IBM's tabulating equipment enabled organizations to process unprecedented amounts of data, its clients including the U.S. government. During the Second World War the company produced small arms for the American war effort. In 1949, Thomas Watson, Sr. created IBM World Trade Corporation, a subsidiary of IBM focused on foreign operations, and in 1952 he stepped down after almost 40 years at the company helm. In 1957, the FORTRAN scientific programming language was developed. In 1961, IBM developed the SABRE reservation system for American Airlines, and in 1963 IBM employees and computers helped NASA track the orbital flights of the Mercury astronauts. A year later the company moved its headquarters from New York City to Armonk. The latter half of the 1960s saw IBM continue its support of space exploration. On April 7, 1964, IBM announced the first computer system family, the IBM System/360.
3.
Software release life cycle
–
Usage of the alpha/beta test terminology originated at IBM. As long ago as the 1950s, IBM used similar terminology for its hardware development: "A" test was the verification of a new product before public announcement, "B" test was the verification before releasing the product to be manufactured, and "C" test was the final test before general availability of the product. Martin Belsky, a manager on some of IBM's earlier software projects, claimed to have invented the terminology. IBM dropped the alpha/beta terminology during the 1960s, but by then it had received fairly wide notice. The usage of "beta test" to refer to testing done by customers was not done at IBM; rather, IBM used the term "field test". Pre-alpha refers to all activities performed during the project before formal testing. These activities can include requirements analysis, software design, software development, and unit testing. In typical open-source development, there are several types of pre-alpha versions; milestone versions include specific sets of functions and are released as soon as the functionality is complete. The alpha phase of the release life cycle is the first phase to begin software testing. In this phase, developers generally test the software using white-box techniques; additional validation is then performed using black-box or gray-box techniques, by another testing team. Moving to black-box testing inside the organization is known as alpha release. Alpha software can be unstable and could cause crashes or data loss, and it may not contain all of the features that are planned for the final version. In general, external availability of alpha software is uncommon for proprietary software, while open-source software often has publicly available alpha versions. The alpha phase usually ends with a feature freeze, indicating that no more features will be added to the software; at this time, the software is said to be feature complete. Beta, named after the second letter of the Greek alphabet, is the software development phase following alpha. Software in the beta stage is also known as betaware. The beta phase generally begins when the software is feature complete but likely to contain a number of known or unknown bugs. Software in the beta phase will generally have many more bugs in it than completed software, as well as speed and performance issues. The focus of beta testing is reducing impacts on users, often incorporating usability testing. The process of delivering a beta version to the users is called beta release, and this is typically the first time that the software is available outside of the organization that developed it. Beta version software is often useful for demonstrations and previews within an organization.
4.
Operating system
–
An operating system is system software that manages computer hardware and software resources and provides common services for computer programs. All computer programs, excluding firmware, require an operating system to function. Operating systems are found on many devices that contain a computer, from cellular phones to supercomputers. The dominant desktop operating system is Microsoft Windows with a market share of around 83.3%; macOS by Apple Inc. is in second place, and the varieties of Linux are collectively in third place. Linux distributions are dominant in the server and supercomputing sectors. Other specialized classes of operating systems, such as embedded and real-time systems, exist for many applications. A single-tasking system can run only one program at a time, while multi-tasking may be characterized as preemptive or cooperative. In preemptive multitasking, the operating system slices the CPU time and dedicates a slot to each of the programs; Unix-like operating systems such as Solaris and Linux support preemptive multitasking. Cooperative multitasking is achieved by relying on each process to provide time to the other processes in a defined manner; 16-bit versions of Microsoft Windows used cooperative multitasking, while 32-bit versions of both Windows NT and Win9x used preemptive multitasking. Single-user operating systems have no facilities to distinguish users but may allow multiple programs to run in tandem. A distributed operating system manages a group of distinct computers and makes them appear to be a single computer. The development of networked computers that could be linked and communicate with each other gave rise to distributed computing; distributed computations are carried out on more than one machine, and when computers in a group work in cooperation, they form a distributed system. The technique is used both in virtualization and in cloud computing management, and is common in large server warehouses. Embedded operating systems are designed to be used in embedded computer systems. They are designed to operate on small machines like PDAs with less autonomy, they are able to operate with a limited number of resources, and they are very compact and extremely efficient by design; Windows CE and Minix 3 are examples of embedded operating systems. A real-time operating system is an operating system that guarantees to process events or data by a specific moment in time. A real-time operating system may be single- or multi-tasking, but when multitasking it uses specialized scheduling algorithms so that deterministic behavior is achieved. Early computers were built to perform a series of single tasks, like a calculator. Basic operating system features were developed in the 1950s, such as resident monitor functions that could run different programs in succession to speed up processing.
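As a rough, purely conceptual illustration of the cooperative multitasking idea described above (not how a real OS scheduler is implemented), the sketch below uses Python generators as "tasks" that voluntarily yield control back to a round-robin scheduler. All names here are invented for the example.

```python
# Toy sketch of cooperative multitasking: each "task" is a generator that
# voluntarily hands control back to a simple round-robin scheduler.
# A real operating system schedules processes/threads with hardware support;
# this only illustrates the "each process yields in a defined manner" idea.

def task(name, steps):
    for i in range(steps):
        print(f"{name}: step {i}")
        yield  # cooperatively give control back to the scheduler

def run(tasks):
    # Round-robin over the tasks until all of them have finished.
    while tasks:
        current = tasks.pop(0)
        try:
            next(current)
            tasks.append(current)   # re-queue the task for its next turn
        except StopIteration:
            pass                    # task finished; drop it from the queue

run([task("A", 3), task("B", 2)])
```

A preemptive scheduler, by contrast, would interrupt a task even if it never yielded, which is why a single misbehaving program cannot monopolize the CPU under preemptive multitasking.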
5.
Microsoft Windows
–
Microsoft Windows is a metafamily of graphical operating systems developed, marketed, and sold by Microsoft. It consists of several families of operating systems, each of which caters to a certain sector of the computing industry, with the OS typically associated with IBM PC compatible architecture. Active Windows families include Windows NT, Windows Embedded and Windows Phone; defunct Windows families include Windows 9x and Windows Mobile. Windows 10 Mobile is an active product, unrelated to the defunct Windows Mobile family. Microsoft introduced an operating environment named Windows on November 20, 1985. Microsoft Windows came to dominate the world's personal computer market with over 90% market share, overtaking Mac OS, which had been introduced in 1984. Apple came to see Windows as an encroachment on their innovation in GUI development as implemented on products such as the Lisa. On PCs, Windows is still the most popular operating system; however, in 2014, Microsoft admitted losing the majority of the overall operating system market to Android because of the massive growth in sales of Android smartphones. In 2014, the number of Windows devices sold was less than 25% that of Android devices sold. This comparison, however, may not be fully relevant, as the two operating systems traditionally target different platforms. As of September 2016, the most recent version of Windows for PCs, tablets and smartphones is Windows 10, and the most recent version for server computers is Windows Server 2016. A specialized version of Windows runs on the Xbox One game console. Microsoft, the developer of Windows, has registered several trademarks, each of which denotes a family of Windows operating systems that target a specific sector of the computing industry. It now consists of three operating system subfamilies that are released at almost the same time and share the same kernel. Windows: the operating system for personal computers and tablets; the latest version is Windows 10, and the main competitors of this family are macOS by Apple Inc. for personal computers and Android for mobile devices. Windows Server: the operating system for server computers; the latest version is Windows Server 2016, and unlike its client sibling it has adopted a strong naming scheme, with Linux as the main competitor of this family. Windows PE: a lightweight version of its Windows sibling, meant to operate as a live operating system, used for installing Windows on bare-metal computers; the latest version is Windows PE 10.0.10586.0. Windows Embedded: initially, Microsoft developed Windows CE as a general-purpose operating system for every device that was too resource-limited to be called a full-fledged computer. The following Windows families are no longer being developed: Windows 9x (Microsoft now caters to the consumer market with Windows NT) and Windows Mobile, the predecessor to Windows Phone, which was a mobile operating system.
6.
MacOS
–
Within the market of desktop, laptop and home computers, and by web usage, macOS is the second most widely used desktop OS after Microsoft Windows. Launched in 2001 as Mac OS X, the series is the latest in the family of Macintosh operating systems. Mac OS X succeeded the classic Mac OS, which was introduced in 1984 and whose final release was Mac OS 9 in 1999. An initial, early version of the system, Mac OS X Server 1.0, was released in 1999; the first desktop version, Mac OS X 10.0, followed in March 2001. In 2012, Apple rebranded Mac OS X to OS X. Releases were code-named after big cats from the original release up until OS X 10.8 Mountain Lion; beginning in 2013 with OS X 10.9 Mavericks, releases have been named after landmarks in California. In 2016, Apple rebranded OS X to macOS, adopting the nomenclature that it uses for its other operating systems, iOS, watchOS, and tvOS. The latest version of macOS is macOS 10.12 Sierra. macOS is based on technologies developed at NeXT between 1985 and 1997, when Apple acquired the company. The X in Mac OS X and OS X is pronounced "ten". macOS shares its Unix-based core, named Darwin, and many of its frameworks with iOS, tvOS and watchOS. A heavily modified version of Mac OS X 10.4 Tiger was used for the first-generation Apple TV. Apple also used to have a separate line of releases of Mac OS X designed for servers; beginning with Mac OS X 10.7 Lion, the server functions were made available as a separate package on the Mac App Store. Releases of Mac OS X from 1999 to 2005 can run only on the PowerPC-based Macs from that time period. Mac OS X 10.5 Leopard was released as a Universal binary, meaning the installer disc supported both Intel and PowerPC processors. In 2009, Apple released Mac OS X 10.6 Snow Leopard, and in 2011 Apple released Mac OS X 10.7 Lion, which no longer supported 32-bit Intel processors and also did not include Rosetta. All versions of the system released since then run exclusively on 64-bit Intel CPUs. The heritage of what would become macOS originated at NeXT, a company founded by Steve Jobs following his departure from Apple in 1985. There, the Unix-like NeXTSTEP operating system was developed and then launched in 1989; its graphical user interface was built on top of an object-oriented GUI toolkit using the Objective-C programming language. This led Apple to purchase NeXT in 1996, allowing NeXTSTEP, then called OPENSTEP, to serve as the basis for Apple's next operating system. Previous Macintosh operating systems were named using Arabic numerals, e.g. Mac OS 8 and Mac OS 9. The letter X in Mac OS X's name refers to the number 10, and it is therefore correctly pronounced "ten" /ˈtɛn/ in this context; a common mispronunciation, however, is "X" /ˈɛks/. Consumer releases of Mac OS X included more backward compatibility, and Mac OS applications could be rewritten to run natively via the Carbon API. The consumer version of Mac OS X was launched in 2001 with Mac OS X 10.0. Reviews were variable, with praise for its sophisticated, glossy Aqua interface.
7.
Linux
–
Linux is a Unix-like computer operating system assembled under the model of free and open-source software development and distribution. The defining component of Linux is the Linux kernel, an operating system kernel first released on September 17, 1991 by Linus Torvalds. The Free Software Foundation uses the name GNU/Linux to describe the operating system, which has led to some controversy. Linux was originally developed for computers based on the Intel x86 architecture. Because of the dominance of Android on smartphones, Linux has the largest installed base of all operating systems. Linux is also the leading operating system on servers and other big-iron systems such as mainframe computers, and it is used by around 2.3% of desktop computers. The Chromebook, which runs on Chrome OS, dominates the US K–12 education market and represents nearly 20% of sub-$300 notebook sales in the US. Linux also runs on embedded systems, devices whose operating system is built into the firmware and is highly tailored to the system; this includes TiVo and similar DVR devices, network routers, facility automation controls, and televisions, while many smartphones and tablet computers run Android and other Linux derivatives. The development of Linux is one of the most prominent examples of free and open-source software collaboration: the underlying source code may be used, modified and distributed, commercially or non-commercially, by anyone under the terms of its respective licenses, such as the GNU General Public License. Typically, Linux is packaged in a form known as a Linux distribution for both desktop and server use. Distributions intended to run on servers may omit all graphical environments from the standard install, and because Linux is freely redistributable, anyone may create a distribution for any intended use. The Unix operating system was conceived and implemented in 1969 at AT&T's Bell Laboratories in the United States by Ken Thompson, Dennis Ritchie, Douglas McIlroy, and others. First released in 1971, Unix was written entirely in assembly language, as was common practice at the time. Later, in a key pioneering approach in 1973, it was rewritten in the C programming language by Dennis Ritchie; the availability of a high-level language implementation of Unix made its porting to different computer platforms easier. Due to an earlier antitrust case forbidding it from entering the computer business, AT&T licensed the operating system's source code widely; as a result, Unix grew quickly and became broadly adopted by academic institutions and businesses. In 1984, AT&T divested itself of Bell Labs; freed of the legal obligation requiring free licensing, Bell Labs began selling Unix as a proprietary product. The GNU Project, started in 1983 by Richard Stallman, has the goal of creating a complete Unix-compatible software system composed entirely of free software; later, in 1985, Stallman started the Free Software Foundation. By the early 1990s, many of the programs required in an operating system were completed, although low-level elements such as device drivers, daemons, and the kernel were stalled and incomplete. Linus Torvalds has stated that if the GNU kernel had been available at the time, he probably would not have decided to write his own. Although not released until 1992 due to legal complications, development of 386BSD, from which NetBSD, OpenBSD and FreeBSD descended, predated that of Linux, and Torvalds has also stated that if 386BSD had been available at the time, he probably would not have created Linux. Although the complete source code of MINIX was freely available, its licensing terms prevented it from being free software until the licensing changed in April 2000.
8.
Unix
–
Among these is Apple's macOS, which is the Unix version with the largest installed base as of 2014. Many Unix-like operating systems have arisen over the years, of which Linux is the most popular. Unix was originally meant to be a convenient platform for programmers developing software to be run on it and on other systems, rather than for non-programmer users. The system grew larger as it started spreading in academic circles and as users added their own tools to it. Unix was designed to be portable, multi-tasking and multi-user in a time-sharing configuration; these concepts are collectively known as the Unix philosophy. By the early 1980s users began seeing Unix as a potential universal operating system. Under Unix, the operating system consists of many utilities along with the master control program, the kernel. To mediate access to shared resources, the kernel has special rights, reflected in the division between user space and kernel space. The microkernel concept was introduced in an effort to reverse the trend towards larger kernels and return to a system in which most tasks were completed by smaller utilities. In an era when a standard computer consisted of a disk for storage and a data terminal for input and output, the Unix file model worked well. However, modern systems include networking and other new devices, and as graphical user interfaces developed, the file model proved inadequate to the task of handling asynchronous events such as those generated by a mouse. In the 1980s, non-blocking I/O and the set of inter-process communication mechanisms were augmented with Unix domain sockets, shared memory, message queues, and semaphores, and in microkernel implementations, functions such as network protocols could be moved out of the kernel. Multics introduced many innovations but had many problems. Frustrated by the size and complexity of Multics but not by its aims, the last researchers to leave Multics, Ken Thompson, Dennis Ritchie, M. D. McIlroy, and J. F. Ossanna, decided to redo the work on a much smaller scale. The name Unics, a pun on Multics, was suggested for the project in 1970; Peter H. Salus credits Peter Neumann with the pun, while Brian Kernighan claims the coining for himself. In 1972, Unix was rewritten in the C programming language. Bell Labs produced several versions of Unix that are collectively referred to as Research Unix. In 1975, the first source license for UNIX was sold to faculty at the University of Illinois Department of Computer Science; UIUC graduate student Greg Chesson was instrumental in negotiating the terms of this license. During the late 1970s and early 1980s, the influence of Unix in academic circles led to its adoption by commercial startups, including Sequent, HP-UX, Solaris, and AIX. In the late 1980s, AT&T Unix System Laboratories and Sun Microsystems developed System V Release 4, and in the 1990s Unix-like systems grew in popularity as Linux and BSD distributions were developed through collaboration by a worldwide network of programmers.
9.
Computing platform
–
A computing platform, in the most general sense, is the environment in which a piece of software is executed. It may be the hardware or the operating system, or even a web browser or other application. The term can refer to different abstraction levels, including a hardware architecture and an operating system; in short, it is the stage on which programs can run. For example, an OS may be a platform that abstracts the underlying differences in hardware. Platforms may also include: hardware alone, in the case of small embedded systems (embedded systems can access hardware directly, without an OS, which is referred to as running on bare metal); a browser, in the case of web-based software (the browser itself runs on a platform, but this is not relevant to software running within the browser); an application, such as a spreadsheet or word processor, which hosts software written in a scripting language (this can be extended to writing fully fledged applications with the Microsoft Office suite as a platform); software frameworks that provide ready-made functionality; cloud computing and Platform as a Service (the social networking sites Twitter and Facebook are also considered development platforms); a virtual machine such as the Java virtual machine, where applications are compiled into a format similar to machine code, known as bytecode, which is then executed by the VM; and a virtualized version of a complete system, including virtualized hardware, OS and software (these allow, for instance, a typical Windows program to run on what is physically a Mac). Some architectures have multiple layers, with each layer acting as a platform for the one above it. In general, a component only has to be adapted to the layer immediately beneath it; however, the JVM, the layer beneath the application, does have to be built separately for each OS.
10.
Megabyte
–
The megabyte is a multiple of the unit byte for digital information. Its recommended unit symbol is MB, but sometimes MByte is used. The unit prefix mega is a multiplier of 1,000,000 (10^6) in the International System of Units (SI); therefore, one megabyte is one million bytes of information, and this definition has been incorporated into the International System of Quantities. However, in the computer and information technology fields, several other definitions are used that arose for historical reasons of convenience. A common usage has been to designate one megabyte as 1,048,576 bytes (2^20), although most standards bodies have deprecated this usage in favor of a set of binary prefixes. Less common is a convention that uses the megabyte to mean 1000 × 1024 bytes. The megabyte is thus commonly used to mean either 1000^2 bytes or 1024^2 bytes. The interpretation using base 1024 originated as compromise technical jargon for byte multiples that needed to be expressed in powers of 2 but lacked a convenient name: since 1024 approximates 1000, it roughly corresponds to the SI prefix kilo-. In 1998 the International Electrotechnical Commission proposed standards for binary prefixes requiring the use of megabyte to strictly denote 1000^2 bytes and mebibyte to denote 1024^2 bytes. By the end of 2009, the IEC standard had been adopted by the IEEE, EU, and ISO. The Mac OS X 10.6 file manager is a notable example of this usage in software: since Snow Leopard, file sizes are reported in decimal units. Base 2: 1 MB = 1,048,576 bytes is the definition used by Microsoft Windows in reference to computer memory, such as RAM; this definition is synonymous with the binary prefix mebibyte. Mixed: 1 MB = 1,024,000 bytes is the definition used to describe the formatted capacity of the 1.44 MB 3.5-inch HD floppy disk. Semiconductor memory doubles in size for each address lane added to an integrated circuit package, which favors capacities that are powers of two. The capacity of a disk drive, by contrast, is the product of the sector size, number of sectors per track, number of tracks per side, and the number of disk platters in the drive; changes in any of these factors would not usually double the size. Sector sizes were set as powers of two for convenience in processing, and it was an extension of this to give the capacity of a disk drive in multiples of the sector size, giving a mix of decimal and binary multiples. Depending on compression methods and file format, a megabyte of data can roughly be: a 4 megapixel JPEG image with normal compression; approximately 1 minute of 128 kbit/s MP3 compressed music; 6 seconds of uncompressed CD audio; or a typical English book volume in plain text format. The human genome consists of DNA representing about 800 MB of data.
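The arithmetic behind the three conventions above is easy to check directly. The sketch below (a minimal illustration, using the standard 80-track, double-sided, 18-sectors-per-track, 512-bytes-per-sector geometry of the HD floppy) shows why the same disk can be described as roughly 1.47 MB, 1.41 MiB, or exactly 1.44 "mixed" MB.

```python
# Quick arithmetic for the three "megabyte" conventions discussed above,
# applied to the capacity of a 3.5-inch HD floppy disk.
DECIMAL_MB = 1000 ** 2    # SI definition: 1,000,000 bytes
BINARY_MB  = 1024 ** 2    # mebibyte:      1,048,576 bytes
MIXED_MB   = 1000 * 1024  # floppy-disk convention: 1,024,000 bytes

# tracks x sides x sectors per track x bytes per sector = 1,474,560 bytes
floppy_bytes = 80 * 2 * 18 * 512

print(floppy_bytes / DECIMAL_MB)  # ~1.47 in decimal megabytes
print(floppy_bytes / BINARY_MB)   # ~1.41 in binary megabytes (MiB)
print(floppy_bytes / MIXED_MB)    # exactly 1.44 in the mixed convention
```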
11.
Statistics
–
Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, presentation, and organization of data. In applying statistics to, e.g., a scientific, industrial, or social problem, it is conventional to begin with a statistical population or process to be studied; populations can be diverse topics such as "all people living in a country" or "every atom composing a crystal". Statistics deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments. Statistician Sir Arthur Lyon Bowley defined statistics as "numerical statements of facts in any department of inquiry placed in relation to each other". When census data cannot be collected, statisticians collect data by developing specific experiment designs; representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. In contrast, an observational study does not involve experimental manipulation. Inferences in mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena. A standard statistical procedure involves the test of the relationship between two data sets, or between a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between the two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (rejecting a true null hypothesis) and Type II errors (failing to reject a false null hypothesis). Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis. Measurement processes that generate statistical data are also subject to error; many of these errors are classified as random or systematic, and the presence of missing data or censoring may result in biased estimates, so specific techniques have been developed to address these problems. Statistics continues to be an area of active research, for example on the problem of how to analyze big data. Statistics is a body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data. Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty. Mathematical techniques used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure-theoretic probability theory. In applying statistics to a problem, it is common practice to start with a population or process to be studied. Populations can be diverse topics such as all people living in a country or every atom composing a crystal. Ideally, statisticians compile data about the entire population; this may be organized by governmental statistical institutes.
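As a minimal sketch of the "standard statistical procedure" just described, the example below tests a null hypothesis of no difference between two small data sets with a two-sample t-test. The sample values, group names, and the 0.05 significance level are illustrative assumptions, not taken from the article.

```python
# Sketch of a two-sample hypothesis test: null hypothesis of equal means.
from scipy import stats

group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]  # invented measurements
group_b = [5.6, 5.4, 5.8, 5.5, 5.7, 5.3]

# Null hypothesis: the two groups have the same population mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Rejecting the null when p < 0.05 fixes the Type I error rate at 5%;
# failing to reject when a real difference exists is a Type II error.
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis of equal means.")
else:
    print("Fail to reject the null hypothesis.")
```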
12.
Data mining
–
Data mining is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is the analysis step of the knowledge discovery in databases (KDD) process. The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of the data itself; often the more general terms data analysis and analytics, or, when referring to actual methods, artificial intelligence and machine learning, are more appropriate. Data mining usually involves database techniques such as spatial indices. The patterns found can be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics; the data mining step might, for instance, identify multiple groups in the data. Neither the data collection and data preparation nor the result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps. These methods can, however, be used in creating new hypotheses to test against larger data populations. In the 1960s, statisticians used terms like "data fishing" or "data dredging" to refer to what they considered the bad practice of analyzing data without an a priori hypothesis. The term "data mining" appeared around 1990 in the database community; the term later became more popular in the business and press communities, and currently data mining and knowledge discovery are used interchangeably. In the academic community, the major forums for research started in 1995, when the First International Conference on Data Mining and Knowledge Discovery was held in Montreal under AAAI sponsorship, co-chaired by Usama Fayyad and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the Kluwer journal Data Mining and Knowledge Discovery as its founding editor-in-chief, and later started the SIGKDD newsletter SIGKDD Explorations. The KDD International conference became the primary highest-quality conference in data mining, with an acceptance rate of research paper submissions below 18%, and the journal Data Mining and Knowledge Discovery is the primary research journal of the field. The manual extraction of patterns from data has occurred for centuries; early methods of identifying patterns in data include Bayes' theorem and regression analysis. The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection and storage. Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets. The knowledge discovery in databases process is commonly defined with the stages of selection, pre-processing, transformation, data mining, and interpretation/evaluation. Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA; however, 3–4 times as many people reported using CRISP-DM.
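To make the "identify multiple groups in the data" step concrete, here is a small hedged sketch using k-means clustering on synthetic two-dimensional points. The data, the choice of two clusters, and the use of scikit-learn are illustrative assumptions only; the article does not prescribe any particular algorithm or library.

```python
# Minimal sketch of a data mining step that identifies groups in data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic clusters of 2-D points, invented for illustration.
data = np.vstack([rng.normal(0, 0.5, (50, 2)),
                  rng.normal(5, 0.5, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(model.cluster_centers_)   # the "patterns" summarizing the input data
print(model.labels_[:10])       # group assignments for the first few points
```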
13.
Text mining
–
Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. "High quality" in text mining usually refers to some combination of relevance, novelty, and interest. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing and analytical methods. The term text analytics is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of text mining in 2004 to describe text analytics. The term text analytics also describes the application of these techniques to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text. These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing. Labor-intensive manual text mining approaches first surfaced in the mid-1980s. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information is stored as text, text mining is believed to have a high commercial potential value, and increasing interest is being paid to multilingual data mining: the ability to gain information across languages. The challenge of exploiting the large proportion of enterprise information that originates in unstructured form has been recognized for decades. It is recognized in the earliest definition of business intelligence, in an October 1958 IBM Journal article by H. P. Luhn, which describes a system in which "both incoming and internally generated documents are automatically abstracted" and "characterized by a word pattern". Yet for decades the emphasis in business computing remained on numerical data, which is not surprising: text in unstructured documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst, who wrote, "In this paper, I have attempted to suggest a new emphasis." Hearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later. Typical subtasks include recognition of pattern-identified entities (features such as telephone numbers) and coreference resolution (identification of noun phrases and other terms that refer to the same object). Text analytics techniques are also helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object. The technology is now broadly applied for a wide variety of government, research, and business needs. Applications can be sorted into a number of categories by analysis type or by business function, and text mining is also involved in the study of text encryption and decryption. A range of text mining applications has been described in the literature.
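A very small sketch of "turning text into data for analysis" is shown below: raw documents are structured as a TF-IDF term matrix that downstream statistical or machine learning methods can consume. The documents are invented, and the use of scikit-learn's vectorizer is an assumption for illustration; the article does not name a specific structuring technique.

```python
# Sketch: structuring raw text as numeric features (TF-IDF weights).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Text mining derives high-quality information from text.",
    "Data mining extracts patterns from large data sets.",
    "Statistical pattern learning finds trends in structured data.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)    # each document becomes a numeric row
print(vectorizer.get_feature_names_out())  # the derived vocabulary
print(matrix.toarray().round(2))           # weighted term frequencies
```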
14.
Data collection
–
While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same. Regardless of the field of study or preference for defining data, both the selection of appropriate data collection instruments and clearly delineated instructions for their correct use reduce the likelihood of errors occurring. A formal data collection process is necessary, as it ensures that the data gathered are both defined and accurate and that subsequent decisions based on arguments embodied in the findings are valid. The process provides both a baseline from which to measure and, in certain cases, a target for what to improve. Generally there are three types of data collection, including: 1. Surveys: standardized paper-and-pencil or phone questionnaires that ask predetermined questions. 2. Interviews: structured or unstructured one-on-one directed conversations with key individuals or leaders in a community. Consequences of improperly collected data include the inability to answer research questions accurately and the inability to repeat and validate the study. Distorted findings result in wasted resources and can mislead other researchers into pursuing fruitless avenues of investigation, and this compromises decisions for public policy.
15.
Software license
–
A software license is a legal instrument governing the use or redistribution of software. Under United States copyright law, all software is copyright protected, in source code as well as object code form; the only exception is software in the public domain. Most distributed software can be categorized according to its license type. Two common categories for software under copyright law, and therefore with licenses which grant the licensee specific rights, are proprietary software and free and open-source software. Unlicensed software outside of copyright protection is either public domain software or software which is non-distributed, non-licensed and handled as an internal business trade secret. Contrary to popular belief, distributed unlicensed software is copyright protected; examples of this are unauthorized software leaks and software projects which are placed on public software repositories like GitHub without a specified license. As voluntarily handing software into the public domain is problematic in some international law domains, there are also licenses granting PD-like rights. The owner of a copy of software is legally entitled to use that copy of software; hence, if the end-user of software is the owner of the respective copy, the end-user may use it, as many proprietary licenses only enumerate the rights that the user already has under 17 U.S.C. § 117, and yet proclaim to take rights away from the user. Proprietary software licenses often proclaim to give software publishers more control over the way their software is used by keeping ownership of each copy of the software with the software publisher. The form of the relationship determines whether it is a lease or a purchase, as examined, for example, in UMG v. Augusto and Vernor v. Autodesk. The ownership of goods such as software applications and video games is challenged by "licensed, not sold" arrangements; the Swiss-based company UsedSoft, for instance, innovated the resale of business software. This feature of proprietary software licenses means that certain rights regarding the software are reserved by the software publisher. Therefore, it is typical of EULAs to include terms which define the uses of the software. The most significant effect of this form of licensing is that, if ownership of the software remains with the software publisher, then the end-user must accept the software license; in other words, without acceptance of the license, the end-user may not use the software at all. One example of such a proprietary software license is the license for Microsoft Windows. The most common licensing models are per single user or per user in the appropriate volume discount level. Licensing per concurrent/floating user also occurs, where all users in a network have access to the program but only a specific number at the same time. Another license model is licensing per dongle, which allows the owner of the dongle to use the program on any computer. Licensing per server, CPU or points, regardless of the number of users, is common practice, as are site or company licenses.
16.
Computer program
–
A computer program is a collection of instructions that performs a specific task when executed by a computer. A computer requires programs to function, and typically executes a program's instructions in a central processing unit. A computer program is written by a computer programmer in a programming language. From the program in its human-readable form of source code, a compiler can derive machine code, a form consisting of instructions that the computer can directly execute. Alternatively, a program may be executed with the aid of an interpreter. A part of a program that performs a well-defined task is known as an algorithm. A collection of computer programs, libraries and related data is referred to as software. Computer programs may be categorized along functional lines, such as application software or system software. The earliest programmable machines preceded the invention of the digital computer. In 1801, Joseph-Marie Jacquard devised a loom that would weave a pattern by following a series of perforated cards; patterns could be woven and repeated by arranging the cards. In 1837, Charles Babbage was inspired by Jacquard's loom to attempt to build the Analytical Engine. The names of the components of the device were borrowed from the textile industry, in which yarn was brought from the store to be milled. The device would have had a "store", memory to hold 1,000 numbers of 40 decimal digits each, and numbers from the store would then have been transferred to the "mill" for processing. It was to be programmed using two sets of perforated cards, one to direct the operation and the other for the input variables; however, after more than 17,000 pounds of the British government's money, the thousands of cogged wheels and gears never fully worked together. During a nine-month period in 1842–43, Ada Lovelace translated the memoir of Italian mathematician Luigi Menabrea covering the Analytical Engine. The translation contained Note G, which completely detailed a method for calculating Bernoulli numbers using the Analytical Engine; this note is recognized by some historians as the world's first written computer program. In 1936, Alan Turing introduced the Universal Turing machine, a theoretical device that can model every computation that can be performed on a Turing-complete computing machine. It is a finite-state machine that has an infinitely long read/write tape; the machine can move the tape back and forth, changing its contents as it performs an algorithm.
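For a modern flavor of the calculation that Note G described, the sketch below computes Bernoulli numbers in a few lines of Python. This uses a standard textbook recurrence with exact rational arithmetic, not Lovelace's actual procedure or numbering; it is included only to show what such a program looks like today.

```python
# Bernoulli numbers via the recurrence B_m = -(1/(m+1)) * sum_{j<m} C(m+1, j) * B_j,
# computed with exact rational arithmetic. (Illustrative only; not Note G itself.)
from fractions import Fraction
from math import comb

def bernoulli(n):
    """Return B_0 .. B_n using the B_1 = -1/2 convention."""
    B = [Fraction(1)]
    for m in range(1, n + 1):
        s = sum(comb(m + 1, j) * B[j] for j in range(m))
        B.append(-s / (m + 1))
    return B

# B_0..B_8 = 1, -1/2, 1/6, 0, -1/30, 0, 1/42, 0, -1/30
print(bernoulli(8))
```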
17.
SPSS Modeler
–
IBM SPSS Modeler is a data mining and text analytics software application from IBM. It is used to build predictive models and conduct other analytic tasks, and it has a visual interface which allows users to leverage statistical and data mining algorithms without programming. One of its aims from the outset was to get rid of unnecessary complexity in data transformations. The first version incorporated decision trees and neural networks, which could both be trained without underlying knowledge of how those techniques worked. IBM SPSS Modeler was originally named Clementine by its creators, Integral Solutions Limited (ISL). This name continued for a while after SPSS's acquisition of the product; SPSS later changed the name to SPSS Clementine, and then to PASW Modeler. Following IBM's 2009 acquisition of SPSS, the product was renamed IBM SPSS Modeler. Example applications include predicting movie box office receipts. IBM sells the current version of SPSS Modeler in two separate bundles of features. The first version was released on June 9, 1994, after beta testing at six customer sites. Clementine was originally developed by ISL, a UK company, in collaboration with artificial intelligence researchers at Sussex University. The original Clementine was implemented in Poplog, which ISL marketed for Sussex University; Clementine mainly used the Poplog language Pop-11, with some parts written in C for speed, along with additional tools provided as part of Solaris, VMS and various versions of Unix. The tool quickly garnered the attention of the data mining community. In order to reach a wider market, ISL then ported Poplog to Microsoft Windows using the NutCracker package. In 1998 ISL was acquired by SPSS Inc., who saw the potential for extended development as a data mining tool. In SPSS Clementine version 7.0, the client front end runs under Windows, while the server back end runs on Unix variants, Linux, and Windows; the graphical user interface is written in Java. IBM SPSS Modeler 14.0 was the first release of Modeler by IBM, and IBM SPSS Modeler 15, released in June 2012, introduced significant new functionality for social network analysis and entity analytics.
18.
Social science
–
Social science is a major category of academic disciplines concerned with society and the relationships among individuals within a society. It in turn has many branches, each of which is considered a social science; the social sciences include economics, political science, human geography, demography, psychology, sociology, anthropology, archaeology, jurisprudence, history, and linguistics. The term is also sometimes used to refer specifically to the field of sociology. A more detailed list of sub-disciplines within the social sciences can be found at Outline of social science. Positivist social scientists use methods resembling those of the natural sciences as tools for understanding society. In modern academic practice, researchers are often eclectic, using multiple methodologies, and the term social research has also acquired a degree of autonomy as practitioners from various disciplines share its aims and methods. The social sciences came forth from the philosophy of the time and were influenced by the Age of Revolutions, such as the Industrial Revolution. They developed from the sciences, or the systematic knowledge bases or prescriptive practices, of the period; the beginnings of the social sciences in the 18th century are reflected in the grand encyclopedia of Diderot, with articles from Jean-Jacques Rousseau and other pioneers. The growth of the social sciences is also reflected in other specialized encyclopedias. The modern period saw "social science" first used as a distinct conceptual field. Social science was influenced by positivism, focusing on knowledge based on actual positive sense experience and avoiding the negative; metaphysical speculation was avoided. Auguste Comte used the term science sociale to describe the field, taken from the ideas of Charles Fourier. Following this period, five paths of development sprang forth in the social sciences, influenced by Comte. One route was the rise of social research: large statistical surveys were undertaken in various parts of the United States and Europe. Another route was initiated by Émile Durkheim, studying "social facts". A third means developed, arising from the methodological dichotomy present, in which social phenomena were identified with and understood; this was championed by figures such as Max Weber. The fourth route, based in economics, developed and furthered economic knowledge as a hard science. The last path was the correlation of knowledge and social values; the antipositivism and verstehen sociology of Max Weber firmly demanded this distinction, and in this route, theory and prescription were non-overlapping formal discussions of a subject. Around the start of the 20th century, Enlightenment philosophy was challenged in various quarters. The development of social science subfields became very quantitative in methodology; examples of boundary blurring include emerging disciplines like social research of medicine, sociobiology, neuropsychology, bioeconomics and the history and sociology of science. Increasingly, quantitative research and qualitative methods are being integrated in the study of action and its implications. In the first half of the 20th century, statistics became a free-standing discipline of applied mathematics.
19.
Frequency (statistics)
–
In statistics, the frequency of an event i is the number n_i of times the event occurred in an experiment or study. These frequencies are often depicted graphically in histograms. The cumulative frequency is the total of the frequencies of all events at or below a certain point in an ordered list of events. The relative frequency of an event is its absolute frequency normalized by the total number of events: f_i = n_i / N, where N is the sum of all the n_i. The values of f_i for all events i can be plotted to produce a frequency distribution; in the case when n_i = 0 for certain i, pseudocounts can be added. In a histogram, the height of a rectangle is equal to the frequency density of its interval, and the total area of the histogram is equal to the number of data points. A histogram may also be normalized to display relative frequencies; it then shows the proportion of cases that fall into each of several categories. The categories are usually specified as consecutive, non-overlapping intervals of a variable; the categories must be adjacent and are often chosen to be of the same size. The rectangles of a histogram are drawn so that they touch each other, to indicate that the underlying variable is continuous. A bar chart or bar graph is a chart with rectangular bars with lengths proportional to the values that they represent; the bars can be plotted vertically or horizontally, and a vertical bar chart is called a column bar chart. A frequency distribution table is an arrangement of the values that one or more variables take in a sample. The frequentist interpretation of probability is often contrasted with Bayesian probability; in fact, the term "frequentist" was first used by M. G. Kendall in 1949, to contrast with Bayesians, whom he called "non-frequentists". He observed that "we may distinguish two main attitudes", and that while it might be thought that the differences between the frequentists and the non-frequentists are largely due to the differences of the domains which they purport to cover, "I assert that this is not so."
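The definitions above are simple enough to compute directly. The sketch below, using an invented sample of events, derives the absolute frequencies n_i, the relative frequencies f_i = n_i / N, and the cumulative frequencies over an ordered list of events.

```python
# Absolute, relative, and cumulative frequencies for a small sample of events.
from collections import Counter
from itertools import accumulate

events = ["a", "b", "a", "c", "b", "a", "b", "b"]   # invented observations

counts = Counter(events)                              # absolute frequencies n_i
N = sum(counts.values())
relative = {e: n / N for e, n in counts.items()}      # f_i = n_i / N

ordered = sorted(counts.items())                      # events in order
cumulative = list(accumulate(n for _, n in ordered))  # running totals

print(counts)      # Counter({'b': 4, 'a': 3, 'c': 1})
print(relative)    # {'a': 0.375, 'b': 0.5, 'c': 0.125}
print(cumulative)  # [3, 7, 8]
```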
20.
Mean
–
In mathematics, mean has several different definitions depending on the context. For a discrete probability distribution, the mean is the sum of each possible value weighted by its probability; an analogous formula applies to the case of a continuous probability distribution. Not every probability distribution has a mean (see the Cauchy distribution for an example), and for some distributions the mean is infinite. The arithmetic mean of a set of numbers x_1, x_2, ..., x_n is typically denoted by x̄, pronounced "x bar". If the data set is based on a series of observations obtained by sampling from a statistical population, the arithmetic mean is termed the sample mean, to distinguish it from the population mean. For a finite population, the population mean of a property is equal to the arithmetic mean of the given property over every member of the population; for example, the population mean height is equal to the sum of the heights of every individual divided by the total number of individuals. The sample mean may differ from the population mean, especially for small samples; the law of large numbers dictates that the larger the size of the sample, the more likely it is that the sample mean will be close to the population mean. Outside of probability and statistics, a wide range of other notions of mean are often used in geometry and analysis; examples are given below. The geometric mean is an average that is useful for sets of numbers that are interpreted according to their product: x̄ = (x_1 · x_2 · ... · x_n)^(1/n). For example, the geometric mean of the five values 4, 36, 45, 50, 75 is (4 · 36 · 45 · 50 · 75)^(1/5) = 24300000^(1/5) = 30. The harmonic mean is an average which is useful for sets of numbers which are defined in relation to some unit, for example speed (distance per unit of time). The arithmetic, geometric, and harmonic means satisfy the inequalities AM ≥ GM ≥ HM, with equality holding if and only if all the elements of the given sample are equal. In descriptive statistics, the mean may be confused with the median, mode or mid-range, as any of these may be called an "average". The mean of a set of observations is the arithmetic average of the values; however, for skewed distributions the mean is not necessarily the same as the middle value (the median) or the most likely value (the mode). For example, mean income is typically skewed upwards by a small number of people with very large incomes. By contrast, the median income is the level at which half the population is below and half is above, and the mode income is the most likely income, which favors the larger number of people with lower incomes. The mean of a probability distribution is the long-run arithmetic average value of a random variable having that distribution; in this context, it is also known as the expected value.
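The worked example above (values 4, 36, 45, 50, 75) can be checked directly with Python's standard library, which also confirms the AM ≥ GM ≥ HM inequality for this sample. This is only a verification sketch; it assumes Python 3.8+ for statistics.geometric_mean.

```python
# The three Pythagorean means for the example values, checking AM >= GM >= HM.
import statistics  # geometric_mean requires Python 3.8+

values = [4, 36, 45, 50, 75]

am = statistics.mean(values)            # arithmetic mean: 42
gm = statistics.geometric_mean(values)  # geometric mean:  30.0
hm = statistics.harmonic_mean(values)   # harmonic mean:   15.0

print(am, gm, hm)
assert am >= gm >= hm
```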
21.
Analysis of variance
–
In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal. ANOVAs are useful for comparing three or more means for statistical significance; this is conceptually similar to performing multiple two-sample t-tests, but is more conservative and is therefore suited to a wide range of practical problems. While the analysis of variance reached fruition in the 20th century, its antecedents extend centuries into the past; these include hypothesis testing, the partitioning of sums of squares, experimental techniques and the additive model. Laplace was performing hypothesis testing in the 1770s, and the development of least-squares methods by Laplace and Gauss circa 1800 provided an improved method of combining observations. It also initiated much study of the contributions to sums of squares; Laplace soon knew how to estimate a variance from a residual sum of squares, and by 1827 he was using least-squares methods to address ANOVA problems regarding measurements of atmospheric tides. Before 1800, astronomers had isolated observational errors resulting from reaction times and had developed methods of reducing the errors. An eloquent non-mathematical explanation of the additive effects model was available in 1885. Ronald Fisher introduced the term variance and proposed its formal analysis in a 1918 article, "The Correlation Between Relatives on the Supposition of Mendelian Inheritance"; his first application of the analysis of variance was published in 1921. Analysis of variance became widely known after being included in Fisher's 1925 book Statistical Methods for Research Workers. Randomization models were developed by several researchers; the first was published in Polish by Neyman in 1923. One of the attributes of ANOVA which ensured its early popularity was computational elegance: the structure of the additive model allows solution for the additive coefficients by simple algebra rather than by matrix calculations. In the era of mechanical calculators this simplicity was critical. The determination of statistical significance also required access to tables of the F function, which were supplied by early statistics texts. The analysis of variance can be used as an exploratory tool to explain observations. A dog show provides an example: a dog show is not a random sampling of the breed; it is typically limited to dogs that are adult, pure-bred, and exemplary, so a histogram of dog weights from a show might plausibly be rather complex. Suppose we wanted to predict the weight of a dog based on a certain set of characteristics of each dog. Before we could do that, we would need to explain the distribution of weights by dividing the dog population into groups based on those characteristics. A successful grouping will split dogs such that each group has a low variance of dog weights; in the illustrations, each group is identified as X1, X2, etc.
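Continuing the dog-show idea in a minimal, hedged sketch: the example below runs a one-way ANOVA on three invented groups of dog weights. The weights, group sizes, and use of SciPy's f_oneway are illustrative assumptions only.

```python
# One-way ANOVA: are the mean weights of three (invented) groups of dogs equal?
from scipy import stats

group1 = [22, 24, 23, 25, 21]   # hypothetical weights, in kg
group2 = [30, 32, 31, 29, 33]
group3 = [23, 22, 24, 25, 23]

# The F statistic compares variation between group means to variation within groups.
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the group means are unlikely to all be equal,
# i.e. the grouping explains a meaningful part of the observed variance.
```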
22.
Correlation and dependence
–
In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice; for example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling. However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship. Formally, random variables are dependent if they do not satisfy a mathematical property of probabilistic independence. In informal parlance, correlation is synonymous with dependence; however, when used in a technical sense, correlation refers to any of several specific types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables. Other correlation coefficients have been developed to be more robust than the Pearson correlation – that is, more sensitive to nonlinear relationships; mutual information can also be applied to measure dependence between two variables. The Pearson coefficient is obtained by dividing the covariance of the two variables by the product of their standard deviations; Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton. The Pearson correlation is defined only if both of the standard deviations are finite and nonzero. It is a corollary of the Cauchy–Schwarz inequality that the correlation cannot exceed 1 in absolute value, and the correlation coefficient is symmetric: corr(X, Y) = corr(Y, X). As it approaches zero there is less of a relationship; the closer the coefficient is to either −1 or 1, the stronger the correlation between the variables. If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true, because the correlation coefficient detects only linear dependencies between two variables. For example, suppose the random variable X is symmetrically distributed about zero, and Y = X². Then Y is completely determined by X, so that X and Y are perfectly dependent, yet their correlation is zero; they are uncorrelated. However, in the special case when X and Y are jointly normal, uncorrelatedness is equivalent to independence. If we have a series of n measurements of X and Y written as xi and yi for i = 1, …, n, then the sample correlation coefficient can be used to estimate the population Pearson correlation r between X and Y. If x and y are results of measurements that contain measurement error, the realistic limits on the correlation coefficient are narrower than −1 to +1. For the case of a linear model with a single independent variable, the coefficient of determination is the square of r, Pearson's product-moment coefficient. Rank correlation coefficients behave differently: if, as the one variable increases, the other decreases, the rank correlation coefficients will be negative. To illustrate the nature of rank correlation, and its difference from linear correlation, one can consider a sequence of pairs of numbers in which, as we go from each pair to the next pair, x increases, and so does y.
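The following brief NumPy sketch (on simulated data, not data from the article) computes Pearson's r as the covariance divided by the product of the standard deviations, and reproduces the Y = X² example in which two perfectly dependent variables are nonetheless essentially uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pearson's r: covariance divided by the product of the standard deviations.
x = rng.normal(size=10_000)   # symmetrically distributed about zero
y = x ** 2                    # completely determined by x

cov_xy = np.cov(x, y, bias=True)[0, 1]
r = cov_xy / (x.std() * y.std())
print(round(r, 3))            # close to 0: uncorrelated, yet fully dependent

# For comparison, a strongly linear relationship gives |r| close to 1.
z = 3 * x + rng.normal(scale=0.1, size=x.size)
print(round(np.corrcoef(x, z)[0, 1], 3))
```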
23.
Factor analysis
–
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observed variables mainly reflect the variations in two unobserved variables; factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modelled as linear combinations of the potential factors plus error terms, and factor analysis aims to find independent latent variables. Followers of factor analytic methods believe that the information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset. Users of factor analysis believe that it helps to deal with data sets where there are large numbers of observed variables that are thought to reflect a smaller number of underlying/latent variables. Factor analysis is related to principal component analysis, but the two are not identical, and there has been significant controversy in the field over differences between the two techniques. PCA is a basic version of exploratory factor analysis that was developed in the early days prior to the advent of high-speed computers; from the point of view of exploratory analysis, the eigenvalues of PCA are inflated component loadings. Suppose we have a set of p observable random variables x1, …, xp with means μ1, …, μp, and suppose that, for some unknown constants lij and k unobserved random variables Fj (with k < p), xi − μi = li1 F1 + … + lik Fk + εi. Here, the εi are unobserved stochastic error terms with zero mean and finite variance; in matrix terms, we have x − μ = L F + ε. If we have n observations, then the dimensions are x: p × n, L: p × k, and F: k × n; each column of x and F denotes values for one particular observation, and the matrix L does not vary across observations. We also impose the following assumptions on F: F and ε are independent, E(F) = 0, and Cov(F) = I. Any solution of the above set of equations following these constraints for F is defined as the factors, and L as the loading matrix. From the conditions just imposed on F, we have Cov(x − μ) = Cov(L F + ε), or Σ = L Cov(F) Lᵀ + Cov(ε), or Σ = L Lᵀ + Ψ. Note that for any orthogonal matrix Q, if we set L′ = L Q and F′ = Qᵀ F, the criteria for being factors and factor loadings still hold; hence a set of factors and factor loadings is unique only up to an orthogonal transformation. Suppose a psychologist has the hypothesis that there are two kinds of intelligence, verbal intelligence and mathematical intelligence, neither of which is directly observed. Evidence for the hypothesis is sought in the examination scores from each of 10 different academic fields of 1000 students. If each student is chosen randomly from a large population, then each student's 10 scores are random variables, and the hypothesis holds that each score is, on average, a linear combination of the two factors. For example, the hypothesis may hold that the average student's aptitude in the field of astronomy is {10 × the student's verbal intelligence} + {6 × the student's mathematical intelligence}; the numbers 10 and 6 are the factor loadings associated with astronomy.
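Here is a minimal simulation sketch of the model above, assuming scikit-learn is available; the loading matrix, the number of factors, and the noise level are made up for illustration. It generates data from x − μ = L F + ε and checks that the fitted model approximately reproduces the covariance structure Σ ≈ L Lᵀ + Ψ.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(42)

# Simulate x - mu = L F + eps with p = 6 observed variables and k = 2 factors.
n, p, k = 2_000, 6, 2
L_true = rng.normal(size=(p, k))            # hypothetical loading matrix
F = rng.normal(size=(n, k))                 # unobserved factors, Cov(F) = I
eps = rng.normal(scale=0.3, size=(n, p))    # independent, zero-mean error terms
X = F @ L_true.T + eps

fa = FactorAnalysis(n_components=k).fit(X)
L_hat = fa.components_.T                    # estimated loadings (p x k), unique only up to rotation
Psi_hat = np.diag(fa.noise_variance_)       # estimated diagonal noise covariance

# The model implies Cov(x) ≈ L L^T + Psi; the residual should be small.
residual = np.abs(np.cov(X, rowvar=False) - (L_hat @ L_hat.T + Psi_hat)).max()
print(round(residual, 3))
```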
24.
Cluster analysis
–
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster; popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings depend on the individual data set. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error; it is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties. Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology and typological analysis. The notion of a cluster cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms. There is a common denominator, a group of data objects; however, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties, and understanding these cluster models is key to understanding the differences between the various algorithms. Typical cluster models include: connectivity models (for example, hierarchical clustering builds models based on distance connectivity); centroid models (for example, the k-means algorithm represents each cluster by a single mean vector); distribution models (clusters are modeled using statistical distributions, such as the multivariate normal distributions used by the expectation-maximization algorithm); density models (for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space); subspace models (in biclustering, clusters are modeled with both cluster members and relevant attributes); group models (some algorithms do not provide a refined model for their results and just provide the grouping information); and graph-based models (a clique, that is, a subset of nodes in a graph such that every two nodes in the subset are connected by an edge, can be considered as a prototypical form of cluster; relaxations of the connectivity requirement are known as quasi-cliques, as in the HCS clustering algorithm). A clustering is essentially a set of such clusters, usually containing all objects in the data set; additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms.
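To make the contrast between cluster models concrete, here is a small sketch (assuming scikit-learn is available; the data are synthetic) that runs a centroid-model algorithm (k-means), a connectivity-model algorithm (agglomerative clustering), and a density-model algorithm (DBSCAN) on the same toy data set.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Toy data with three compact groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

# Centroid model: each cluster is represented by a single mean vector.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Connectivity model: hierarchical merging based on pairwise distances.
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Density model: clusters are connected dense regions; sparse points get label -1 (noise).
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

for name, labels in [("k-means", km_labels),
                     ("agglomerative", agg_labels),
                     ("DBSCAN", db_labels)]:
    print(name, "->", len(set(labels) - {-1}), "clusters")
```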
25.
K-means clustering
–
K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean; this results in a partitioning of the data space into Voronoi cells. The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. The algorithm has a loose relationship to the k-nearest neighbor classifier: one can apply the 1-nearest neighbor classifier on the cluster centers obtained by k-means to classify new data into the existing clusters, which is known as the nearest centroid classifier or Rocchio algorithm. In other words, the objective is to find arg min_S ∑_{i=1..k} ∑_{x ∈ Si} ‖x − μi‖², where μi is the mean of the points in Si; this quantity is the within-cluster sum of squares (WCSS). The term k-means was first used by James MacQueen in 1967; the standard algorithm was first proposed by Stuart Lloyd in 1957 as a technique for pulse-code modulation, though it wasn't published outside of Bell Labs until 1982. In 1965, E. W. Forgy published essentially the same method. The most common algorithm uses an iterative refinement technique; due to its ubiquity it is called the k-means algorithm, and it is also referred to as Lloyd's algorithm. Assignment step: assign each observation to the cluster whose mean has the least squared Euclidean distance, which is intuitively the nearest mean: Si = { xp : ‖xp − mi‖² ≤ ‖xp − mj‖² for all 1 ≤ j ≤ k }, where each xp is assigned to exactly one Si, even if it could be assigned to two or more of them. Update step: calculate the new means to be the centroids of the observations in the new clusters, mi = (1/|Si|) ∑_{xj ∈ Si} xj. Since the arithmetic mean is a least-squares estimator, this step also minimizes the WCSS objective. The algorithm has converged when the assignments no longer change. Since both steps optimize the WCSS objective, and there exists only a finite number of such partitionings, the algorithm must converge to a (local) optimum; there is no guarantee that the global optimum is found using this algorithm. The algorithm is often presented as assigning objects to the nearest cluster by distance; the standard algorithm aims at minimizing the WCSS objective, and thus assigns by least sum of squares, which is exactly equivalent to assigning by the smallest Euclidean distance. Using a different distance function other than (squared) Euclidean distance may stop the algorithm from converging; various modifications of k-means such as spherical k-means and k-medoids have been proposed to allow using other distance measures.
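The two alternating steps above translate almost line for line into code. Below is a minimal NumPy sketch of Lloyd's algorithm under simplifying assumptions (well-separated toy data, no handling of empty clusters); the function name and data are illustrative only.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize the means with k randomly chosen observations (Forgy-style initialization).
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to the cluster whose mean is nearest
        # (equivalently, the smallest squared Euclidean distance).
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned points.
        # (No guard for empty clusters; fine for this toy example.)
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):   # assignments stable -> converged
            break
        means = new_means
    return labels, means

# Tiny usage example with two obvious groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, means = k_means(X, k=2)
print(labels, means)
```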
26.
Hierarchical clustering
–
In data mining and statistics, hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies generally fall into two types: agglomerative, a bottom-up approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy; and divisive, a top-down approach in which all observations start in one cluster and splits are performed recursively as one moves down the hierarchy. In general, the merges and splits are determined in a greedy manner, and the results of hierarchical clustering are usually presented in a dendrogram. In the general case, the complexity of agglomerative clustering is O(n³), which makes it slow for large data sets; divisive clustering with an exhaustive search is O(2ⁿ), which is even worse. However, for special cases, optimal efficient agglomerative methods of complexity O(n²) are known, such as SLINK for single-linkage clustering. In order to decide which clusters should be combined, or where a cluster should be split, a measure of dissimilarity between sets of observations is required; in most methods this is achieved by the use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion. The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Commonly used metrics for hierarchical clustering include the Euclidean, squared Euclidean and Manhattan distances; for text or other non-numeric data, metrics such as the Hamming or Levenshtein distance are often used. The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Commonly used linkage criteria between two sets of observations A and B, where d is the chosen metric, include single linkage (the minimum pairwise distance between A and B), complete linkage (the maximum pairwise distance) and average linkage (the mean pairwise distance). Other linkage criteria include: the sum of all intra-cluster variance; the decrease in variance for the cluster being merged (Ward's criterion); the probability that candidate clusters spawn from the same distribution function; the product of in-degree and out-degree on a k-nearest-neighbour graph; and the increment of some cluster descriptor after merging two clusters. Hierarchical clustering has the advantage that any valid measure of distance can be used; in fact, the observations themselves are not required, as all that is used is a matrix of distances. For example, suppose a small set of elements is to be clustered and the Euclidean distance is the distance metric. Cutting the tree at a given height will give a partitioning clustering at a selected precision: cutting lower in the dendrogram yields more, smaller clusters, while cutting higher yields fewer, coarser clusters. The agglomerative method builds the hierarchy from the individual elements by progressively merging clusters. With six elements, for instance, the first step is to determine which elements to merge into a cluster; usually, we want to take the two closest elements, according to the chosen distance. Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row, j-th column is the distance between the i-th and j-th elements.
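A compact sketch of this procedure using SciPy (the six two-dimensional points are invented for illustration): it builds a single-linkage hierarchy from the pairwise distance matrix and then cuts the resulting tree at two different heights.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Six two-dimensional points to be clustered.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [4.0, 4.0], [4.1, 4.2], [12.0, 0.0]])

# Pairwise Euclidean distances in condensed form.
D = pdist(X, metric="euclidean")

# Agglomerative clustering with single linkage (SLINK-style merging of nearest clusters).
Z = linkage(D, method="single")

# "Cutting the tree" at different heights gives finer or coarser partitions.
print(fcluster(Z, t=1.0, criterion="distance"))  # low cut  -> more, smaller clusters
print(fcluster(Z, t=6.0, criterion="distance"))  # high cut -> fewer, coarser clusters
```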
27.
Linear discriminant analysis
–
Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification. LDA is closely related to analysis of variance and regression analysis; however, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable. Logistic regression and probit regression are more similar to LDA than ANOVA is, and these other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method. LDA is also related to principal component analysis and factor analysis in that they all look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data; PCA, on the other hand, does not take into account any difference in class, and factor analysis builds the feature combinations based on differences rather than similarities. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique. LDA works when the measurements made on independent variables for each observation are continuous quantities; when dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis. Consider a set of observations x⃗ for each sample of an object or event with known class y; this set of samples is called the training set. The classification problem is then to find a good predictor for the class y of any sample of the same distribution given only an observation x⃗. LDA approaches the problem by assuming that the conditional probability density functions p(x⃗ | y = 0) and p(x⃗ | y = 1) are both normally distributed with mean and covariance parameters (μ⃗0, Σ0) and (μ⃗1, Σ1), respectively. LDA makes the additional simplifying homoscedasticity assumption (that the class covariances are identical) and assumes that the covariances have full rank. In other words, the observation belongs to a class y if the corresponding x⃗ is located on a certain side of a hyperplane perpendicular to w⃗; the location of the plane is defined by the threshold c. Canonical discriminant analysis finds axes that best separate the categories; these linear functions are uncorrelated and define, in effect, an optimal k − 1 dimensional space through the cloud of data that best separates the k groups (see Multiclass LDA below for details). Suppose two classes of observations have means μ⃗0, μ⃗1 and covariances Σ0, Σ1. Then the linear combination of features w⃗ ⋅ x⃗ will have mean w⃗ ⋅ μ⃗i and variance w⃗ᵀ Σi w⃗ for i = 0, 1, and it can be shown that the maximum separation occurs when w⃗ ∝ (Σ0 + Σ1)⁻¹ (μ⃗1 − μ⃗0). When the assumptions of LDA are satisfied, the above equation is equivalent to LDA. Be sure to note that the vector w⃗ is the normal to the discriminant hyperplane; as an example, in a two-dimensional problem, the line that best divides the two groups is perpendicular to w⃗. Generally, the data points to be discriminated are projected onto w⃗; there is no general rule for the threshold.
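A small NumPy sketch of the two-class Fisher direction above, using simulated Gaussian data; the means, shared covariance, and the simple midpoint threshold are illustrative assumptions rather than anything prescribed by the article. It computes w ∝ (Σ0 + Σ1)⁻¹ (μ1 − μ0) and classifies by projecting onto w.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two classes with different means and, for illustration, a shared covariance.
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
X0 = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)
X1 = rng.multivariate_normal(mean=[2.0, 1.0], cov=cov, size=500)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)

# Fisher's discriminant direction: w ∝ (Σ0 + Σ1)^(-1) (μ1 − μ0).
w = np.linalg.solve(S0 + S1, mu1 - mu0)

# Project onto w and classify with a simple midpoint threshold (one common heuristic).
c = w @ (mu0 + mu1) / 2
correct = (X1 @ w > c).sum() + (X0 @ w <= c).sum()
print(w, correct / 1000)   # direction vector and training accuracy
```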
28.
Menu (computing)
–
In computing and telecommunications, a menu or menu bar is a graphical control element. It is a list of options or commands presented to an operator by a computer or communications system. In simple text interfaces, entering the appropriate shortcut selects a menu item; a more sophisticated solution offers navigation using the cursor keys or the mouse, where the current selection is highlighted and can be activated by pressing the enter key. A computer using a graphical user interface presents menus with a combination of text and symbols to represent choices; by clicking on one of the symbols or text, the operator is selecting the instruction that the symbol represents. A context menu is a menu in which the choices presented to the operator are automatically modified according to the current context in which the operator is working. A common use of menus is to provide convenient access to various operations such as saving or opening a file, or quitting a program. Most widget toolkits provide some form of pull-down or pop-up menu. According to traditional human interface guidelines, menu names were always supposed to be verbs, such as "file", "edit" and so on; this has been ignored in subsequent user interface developments. A single-word verb, however, is sometimes unclear, which led to allowing multiple-word menu names. Menus are now also seen in consumer electronics, starting with TV sets and VCRs that gained on-screen displays in the early 1990s; such menus allow the control of settings like tint, brightness, contrast, bass and treble. Other more recent electronics in the 2000s also have menus, such as digital media players. Menus are sometimes hierarchically organized, allowing navigation through different levels of the menu structure; selecting a menu entry with an arrow will expand it, showing a second menu with options related to the selected entry. The usability of sub-menus has been criticized as difficult, because of the narrow height that must be crossed by the pointer; the steering law predicts that this movement will be slow.
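As a concrete, hypothetical illustration of a pull-down menu bar and a context menu, the following minimal Python/Tkinter sketch builds a "File" menu with commands and a separator, plus a right-click pop-up menu; the labels and callbacks are placeholders, and running it opens a small window.

```python
import tkinter as tk

root = tk.Tk()

# Pull-down menu bar: a "File" cascade with commands and a separator.
menubar = tk.Menu(root)
filemenu = tk.Menu(menubar, tearoff=0)
filemenu.add_command(label="Open...", command=lambda: print("open"))
filemenu.add_command(label="Save", command=lambda: print("save"))
filemenu.add_separator()
filemenu.add_command(label="Quit", command=root.destroy)
menubar.add_cascade(label="File", menu=filemenu)
root.config(menu=menubar)

# Context (pop-up) menu shown at the pointer on right-click; in a real application
# its choices would be adjusted according to what was clicked.
context = tk.Menu(root, tearoff=0)
context.add_command(label="Copy", command=lambda: print("copy"))
context.add_command(label="Paste", command=lambda: print("paste"))
root.bind("<Button-3>", lambda e: context.tk_popup(e.x_root, e.y_root))

root.mainloop()
```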
29.
Python (programming language)
–
Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991. The language provides constructs intended to enable writing clear programs on both a small and large scale, and it has a large and comprehensive standard library. Python interpreters are available for many operating systems, allowing Python code to run on a wide variety of systems. CPython, the reference implementation of Python, is open-source software and has a community-based development model; CPython is managed by the non-profit Python Software Foundation. About the origin of Python, Van Rossum wrote in 1996: "Over six years ago, in December 1989, I was looking for a 'hobby' programming project that would keep me occupied during the week around Christmas. My office would be closed, but I had a home computer, and not much else on my hands. I decided to write an interpreter for the new scripting language I had been thinking about lately. I chose Python as a working title for the project, being in a slightly irreverent mood." Python 2.0 was released on 16 October 2000 and had major new features, including a cycle-detecting garbage collector; with this release the development process was changed and became more transparent. Python 3.0, a major, backwards-incompatible release, was released on 3 December 2008 after a long period of testing; many of its features have been backported to the backwards-compatible Python 2.6.x and 2.7.x version series. The End Of Life date for Python 2.7 was initially set at 2015 and later postponed to 2020. Python is a multi-paradigm language, and many paradigms beyond the core object-oriented, structured and functional styles are supported via extensions, including design by contract and logic programming. Python uses dynamic typing and a mix of reference counting and a cycle-detecting garbage collector for memory management. An important feature of Python is dynamic name resolution, which binds method and variable names during program execution. The design of Python offers some support for functional programming in the Lisp tradition: the language has map, reduce and filter functions, list comprehensions, dictionaries, and sets, and the standard library has two modules (itertools and functools) that implement functional tools borrowed from Haskell and Standard ML. Python can also be embedded in existing applications that need a programmable interface. While offering choice in coding methodology, the Python philosophy rejects exuberant syntax, such as in Perl, in favor of a sparser, less-cluttered grammar; as Alex Martelli put it, "To describe something as clever is not considered a compliment in the Python culture." Python's philosophy rejects the Perl "there is more than one way to do it" approach to language design in favor of "there should be one-- and preferably only one --obvious way to do it." Python's developers strive to avoid premature optimization, and moreover reject patches to non-critical parts of CPython that would offer a marginal increase in speed at the cost of clarity.
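A brief sketch of the functional-style constructs mentioned above: list comprehensions, the built-in map and filter functions, and tools from functools and itertools (in Python 3, reduce lives in functools). The sample data are arbitrary.

```python
from functools import reduce
from itertools import accumulate

numbers = [1, 2, 3, 4, 5]

# List comprehension and the built-in map/filter functions.
squares = [n * n for n in numbers]
evens = list(filter(lambda n: n % 2 == 0, numbers))
doubled = list(map(lambda n: 2 * n, numbers))

# functools and itertools supply further functional tools.
total = reduce(lambda a, b: a + b, numbers)   # 15
running = list(accumulate(numbers))           # [1, 3, 6, 10, 15]

print(squares, evens, doubled, total, running)
```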
30.
Visual Basic
–
Microsoft intended Visual Basic to be relatively easy to learn and use. A programmer can create an application using the components provided by the Visual Basic program itself, and over time the community of programmers developed third-party components. Programs written in Visual Basic can also use the Windows API. The final release was version 6 in 1998, and on April 8, 2008, Microsoft stopped supporting the Visual Basic 6.0 IDE. In 2014, some software developers still preferred Visual Basic 6.0 over its successor, Visual Basic .NET, and some lobbied for a new version of Visual Basic 6.0. In 2016, Visual Basic 6.0 won the technical impact award at The 19th Annual D.I.C.E. Awards. A dialect of Visual Basic, Visual Basic for Applications, is used as a macro or scripting language within several Microsoft applications. Like the BASIC programming language, Visual Basic was designed to accommodate a steep learning curve, and programmers can create both simple and complex GUI applications. Since VB defines default attributes and actions for the components, a programmer can develop a program without writing much code. Programs built with earlier versions suffered performance problems, but faster computers and native-code compilation have made this less of an issue; though VB programs can be compiled into native code executables from version 5 on, they still require the presence of around 1 MB of runtime libraries. Core runtime libraries are included by default in Windows 2000 and later; earlier versions of Windows require that the runtime libraries be distributed with the executable. Forms are created using drag-and-drop techniques: a tool is used to place controls on the form. Controls have attributes and event handlers associated with them; default values are provided when the control is created, but may be changed by the programmer. Many attribute values can be modified during run time based on user actions or changes in the environment. For example, code can be inserted into the form resize event handler to reposition a control so that it remains centered on the form, expands to fill up the form, etc. Visual Basic can create executables, ActiveX controls, or DLL files; dialog boxes with less functionality can be used to provide pop-up capabilities. Controls provide the basic functionality of the application, while programmers can insert additional logic within the appropriate event handlers. For example, a drop-down combination box automatically displays its list; when the user selects an element, an event handler is called that executes code that the programmer created to perform the action for that list item. Alternatively, a Visual Basic component can have no user interface, which allows for server-side processing or an add-in module.
31.
Free software
–
The right to study and modify software entails availability of the software's source code to its users. This right is conditional on the person actually having a copy of the software. Richard Stallman used the existing term free software when he launched the GNU Project (a collaborative effort to create a freedom-respecting operating system) and the Free Software Foundation. The FSF's Free Software Definition states that users of free software are free because they do not need to ask for permission to use the software. Free software thus differs from proprietary software, such as Microsoft Office, Google Docs, Sheets, and Slides or iWork from Apple, whose source code users cannot study, change, or share, and from freeware, which is a category of freedom-restricting proprietary software that does not require payment for use. For computer programs that are covered by copyright law, software freedom is achieved with a software license; software that is not covered by copyright law, such as software in the public domain, is free if the source code is also in the public domain. Proprietary software, including freeware, uses restrictive software licences or EULAs; users are thus prevented from changing the software, and this results in the user relying on the publisher to provide updates, help, and support. This situation is called vendor lock-in. Users often may not reverse engineer, modify, or redistribute proprietary software, and other legal and technical aspects, such as patents and digital rights management, may further restrict users in exercising their rights. Free software may be developed collaboratively by volunteer computer programmers or by corporations as part of a commercial activity. From the 1950s up until the early 1970s, it was normal for computer users to have the software freedoms associated with free software, which was typically public-domain software. Software was commonly shared by individuals who used computers and by manufacturers who welcomed the fact that people were making software that made their hardware useful; organizations of users and suppliers, for example SHARE, were formed to facilitate the exchange of software. As software was often written in an interpreted language such as BASIC, distributing a program generally meant distributing its source code; software was also shared and distributed as printed source code in computer magazines and books. In United States vs. IBM, filed January 17, 1969, the government charged that bundled software was anti-competitive. While some software might always be free, there would henceforth be a growing amount of software produced primarily for sale. In the 1970s and early 1980s, the software industry began using technical measures to prevent computer users from being able to study or adapt the software as they saw fit. In 1980, copyright law was extended to computer programs. Software development for the GNU operating system began in January 1984, and the Free Software Foundation was founded in October 1985.