Regular expression

A regular expression, regex or regexp is a sequence of characters that define a search pattern. This pattern is used by string searching algorithms for "find" or "find and replace" operations on strings, or for input validation, it is a technique developed in formal language theory. The concept arose in the 1950s when the American mathematician Stephen Cole Kleene formalized the description of a regular language; the concept came into common use with Unix text-processing utilities. Since the 1980s, different syntaxes for writing regular expressions exist, one being the POSIX standard and another used, being the Perl syntax. Regular expressions are used in search engines and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK and in lexical analysis. Many programming languages provide regex capabilities, built-in or via libraries; the phrase regular expressions, regexes, is used to mean the specific, standard textual syntax for representing patterns for matching text.

Each character in a regular expression is either a metacharacter, having a special meaning, or a regular character that has a literal meaning. For example, in the regex a. A is a literal character which matches just'a', while'.' is a meta character that matches every character except a newline. Therefore, this regex matches, for example,'a', or'ax', or'a0'. Together and literal characters can be used to identify text of a given pattern, or process a number of instances of it. Pattern matches may vary from a precise equality to a general similarity, as controlled by the metacharacters. For example. Is a general pattern, is less general and a is a precise pattern; the metacharacter syntax is designed to represent prescribed targets in a concise and flexible way to direct the automation of text processing of a variety of input data, in a form easy to type using a standard ASCII keyboard. A simple case of a regular expression in this syntax is to locate a word spelled two different ways in a text editor, the regular expression serialie matches both "serialise" and "serialize".

Wildcards achieve this, but are more limited in what they can pattern, as they have fewer metacharacters and a simple language-base. The usual context of wildcard characters is in globbing similar names in a list of files, whereas regexes are employed in applications that pattern-match text strings in general. For example, the regex ^+|+$ matches excess whitespace at the beginning or end of a line. An advanced regular expression that matches any numeral is??. A regex processor translates a regular expression in the above syntax into an internal representation which can be executed and matched against a string representing the text being searched in. One possible approach is the Thompson's construction algorithm to construct a nondeterministic finite automaton, made deterministic and the resulting deterministic finite automaton is run on the target text string to recognize substrings that match the regular expression; the picture shows the NFA scheme N obtained from the regular expression s*, where s denotes a simpler regular expression in turn, recursively translated to the NFA N.

Regular expressions originated in 1951, when mathematician Stephen Cole Kleene described regular languages using his mathematical notation called regular sets. These arose in theoretical computer science, in the subfields of automata theory and the description and classification of formal languages. Other early implementations of pattern matching include the SNOBOL language, which did not use regular expressions, but instead its own pattern matching constructs. Regular expressions entered popular use from 1968 in two uses: pattern matching in a text editor and lexical analysis in a compiler. Among the first appearances of regular expressions in program form was when Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. For speed, Thompson implemented regular expression matching by just-in-time compilation to IBM 7094 code on the Compatible Time-Sharing System, an important early example of JIT compilation, he added this capability to the Unix editor ed, which led to the popular search tool grep's use of regular expressions.

Around the same time when Thompson developed QED, a group of researchers including Douglas T. Ross implemented a tool based on regular expressions, used for lexical analysis in compiler design. Many variations of these original forms of regular expressions were used in Unix programs at Bell Labs in the 1970s, including vi, sed, AWK, expr, in other programs such as Emacs. Regexes were subsequently adopted by a wide range of programs, with these early forms standardized in the POSIX.2 standard in 1992. In the 1980s the more complicated regexes arose in Perl, which derived from a regex library written by Henry Spencer, who wrote an implementation of Advanced Regular Expressions for Tcl; the Tcl library is a hybrid NFA/DFA implementation with improved performance characteristics. Software projects that have adopted Spencer's Tcl regular expression implementation include PostgreSQL. Perl expanded on Spencer's original library

Computer science

Computer science is the study of processes that interact with data and that can be represented as data in the form of programs. It enables the use of algorithms to manipulate and communicate digital information. A computer scientist studies the theory of computation and the practice of designing software systems, its fields can be divided into practical disciplines. Computational complexity theory is abstract, while computer graphics emphasizes real-world applications. Programming language theory considers approaches to the description of computational processes, while computer programming itself involves the use of programming languages and complex systems. Human–computer interaction considers the challenges in making computers useful and accessible; the earliest foundations of what would become computer science predate the invention of the modern digital computer. Machines for calculating fixed numerical tasks such as the abacus have existed since antiquity, aiding in computations such as multiplication and division.

Algorithms for performing computations have existed since antiquity before the development of sophisticated computing equipment. Wilhelm Schickard designed and constructed the first working mechanical calculator in 1623. In 1673, Gottfried Leibniz demonstrated a digital mechanical calculator, called the Stepped Reckoner, he may be considered the first computer scientist and information theorist, among other reasons, documenting the binary number system. In 1820, Thomas de Colmar launched the mechanical calculator industry when he released his simplified arithmometer, the first calculating machine strong enough and reliable enough to be used daily in an office environment. Charles Babbage started the design of the first automatic mechanical calculator, his Difference Engine, in 1822, which gave him the idea of the first programmable mechanical calculator, his Analytical Engine, he started developing this machine in 1834, "in less than two years, he had sketched out many of the salient features of the modern computer".

"A crucial step was the adoption of a punched card system derived from the Jacquard loom" making it infinitely programmable. In 1843, during the translation of a French article on the Analytical Engine, Ada Lovelace wrote, in one of the many notes she included, an algorithm to compute the Bernoulli numbers, considered to be the first computer program. Around 1885, Herman Hollerith invented the tabulator, which used punched cards to process statistical information. In 1937, one hundred years after Babbage's impossible dream, Howard Aiken convinced IBM, making all kinds of punched card equipment and was in the calculator business to develop his giant programmable calculator, the ASCC/Harvard Mark I, based on Babbage's Analytical Engine, which itself used cards and a central computing unit; when the machine was finished, some hailed it as "Babbage's dream come true". During the 1940s, as new and more powerful computing machines were developed, the term computer came to refer to the machines rather than their human predecessors.

As it became clear that computers could be used for more than just mathematical calculations, the field of computer science broadened to study computation in general. In 1945, IBM founded the Watson Scientific Computing Laboratory at Columbia University in New York City; the renovated fraternity house on Manhattan's West Side was IBM's first laboratory devoted to pure science. The lab is the forerunner of IBM's Research Division, which today operates research facilities around the world; the close relationship between IBM and the university was instrumental in the emergence of a new scientific discipline, with Columbia offering one of the first academic-credit courses in computer science in 1946. Computer science began to be established as a distinct academic discipline in the 1950s and early 1960s; the world's first computer science degree program, the Cambridge Diploma in Computer Science, began at the University of Cambridge Computer Laboratory in 1953. The first computer science degree program in the United States was formed at Purdue University in 1962.

Since practical computers became available, many applications of computing have become distinct areas of study in their own rights. Although many believed it was impossible that computers themselves could be a scientific field of study, in the late fifties it became accepted among the greater academic population, it is the now well-known IBM brand that formed part of the computer science revolution during this time. IBM released the IBM 704 and the IBM 709 computers, which were used during the exploration period of such devices. "Still, working with the IBM was frustrating if you had misplaced as much as one letter in one instruction, the program would crash, you would have to start the whole process over again". During the late 1950s, the computer science discipline was much in its developmental stages, such issues were commonplace. Time has seen significant improvements in the effectiveness of computing technology. Modern society has seen a significant shift in the users of computer technology, from usage only by experts and professionals, to a near-ubiquitous user base.

Computers were quite costly, some degree of humanitarian aid was needed for efficient use—in part from professional computer operators. As computer adoption became more widespread and affordable, less human assistance was needed for common usage. Despite its short history as a formal academic discipline, computer science has made a number of fundamental contributions to science and society—in fact, along with electronics, it is

Approximate string matching

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately; the closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the pattern; the usual primitive operations are: insertion: cot → coat deletion: coat → cot substitution: coat → costThese three operations may be generalized as forms of substitution by adding a NULL character wherever a character has been deleted or inserted: insertion: co*t → coat deletion: coat → co*t substitution: coat → costSome approximate matchers treat transposition, in which the positions of two letters in the string are swapped, to be a primitive operation. Transposition: cost → cotsDifferent approximate matchers impose different constraints.

Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern. For example, if the pattern is coil, foil differs by one substitution, coils by one insertion, oil by one deletion, foal by two substitutions. If all operations count as a single unit of cost and the limit is set to one, foil and oil will count as matches while foal will not. Other matchers specify the number of operations of each type separately, while still others set a total cost but allow different weights to be assigned to different operations; some matchers permit separate assignments of weights to individual groups in the pattern. One possible definition of the approximate string matching problem is the following: Given a pattern string P = p 1 p 2... P m and a text string T = t 1 t 2 … t n, find a substring T j ′, j = t j ′ … t j in T, which, of all substrings of T, has the smallest edit distance to the pattern P. A brute-force approach would be to compute the edit distance to P for all substrings of T, choose the substring with the minimum distance.

However, this algorithm would have the running time O. A better solution, proposed by Sellers, relies on dynamic programming, it uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the i first characters of the pattern, P i, any substring T j ′, j of T that ends at position j. For each position j in the text T, each position i in the pattern P, go through all substrings of T ending at position j, determine which one of them has the minimal edit distance to the i first characters of the pattern P. Write this minimal distance as E. After computing E for all i and j, we can find a solution to the original problem: it is the substring for which E is minimal Computing E is similar to computing the edit distance between two strings. In fact, we can use the Levenshtein distance computing algorithm for E, the only difference being that we must initialize the first row with zeros, save the path of computation, that is, whether we used E, E or E in computing E.

In the array containing the E values, we choose the minimal value in the last row, let it be E, follow the path of computation backwards, back to the row number 0. If the field we arrived at was E T... T is a substring of T with the minimal edit distance to the pattern P. Computing the E array takes O time with the dynamic programming algorithm, while the backwards-working phase takes O time. Another new idea recent years is the similarity join; when matching database relates to a large scale of data, the O time with the dynamic programming algorithm cannot work within a limited time. So, the idea is, instead of computing the similarity of all pairs of strings, to reduce the number of candidate pairs. Used algorithms are based on filter-verification, Locally Sensitive Hashing and other greedy and approximation algorithms. Most of them are designed to fit some framework to compute concurrently. Traditionally, approximate string matching algorithms are classified into two categories: on-line and off-line.

With on-line algorithms the pattern can be processed before searching. In other words, on-line techniques do searching without an index. Early algorithms for on-line approximate matching were suggested by Wagner and Fisher and by Sellers. Both algorithms solve different problems. Sellers' algorithm searches for a substring in a text while the algorithm of Wagner and Fisher calculates Levenshtein distance, being appropriate for dictionary fuzzy search only. On-line searching techniques have been improved; the most famous impro

String (computer science)

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed. A string is considered a data type and is implemented as an array data structure of bytes that stores a sequence of elements characters, using some character encoding. String may denote more general arrays or other sequence data types and structures. Depending on programming language and precise data type used, a variable declared to be a string may either cause storage in memory to be statically allocated for a predetermined maximum length or employ dynamic allocation to allow it to hold a variable number of elements; when a string appears in source code, it is known as a string literal or an anonymous string. In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set called an alphabet. Let Σ be a non-empty finite set of symbols, called the alphabet.

No assumption is made about the nature of the symbols. A string over Σ is any finite sequence of symbols from Σ. For example, if Σ = 01011 is a string over Σ; the length of a string s can be any non-negative integer. The empty string is the unique string over Σ of length 0, is denoted ε or λ; the set of all strings over Σ of length n is denoted Σn. For example, if Σ = Σ2 =. Note that Σ0 = for any alphabet Σ; the set of all strings over Σ of any length is the Kleene closure of Σ and is denoted Σ*. In terms of Σn, Σ ∗ = ⋃ n ∈ N ∪ Σ n For example, if Σ = Σ* =. Although the set Σ* itself is countably infinite, each element of Σ* is a string of finite length. A set of strings over Σ is called a formal language over Σ. For example, if Σ =, the set of strings with an number of zeros, is a formal language over Σ. Concatenation is an important binary operation on Σ*. For any two strings s and t in Σ*, their concatenation is defined as the sequence of symbols in s followed by the sequence of characters in t, is denoted st.

For example, if Σ =, s = bear, t = hug st = bearhug and ts = hugbear. String concatenation is an non-commutative operation; the empty string ε serves as the identity element. Therefore, the set Σ* and the concatenation operation form a monoid, the free monoid generated by Σ. In addition, the length function defines a monoid homomorphism from Σ* to the non-negative integers. A string s is said to be a substring or factor of t if there exist strings u and v such that t = usv; the relation "is a substring of" defines a partial order on Σ*, the least element of, the empty string. A string s is said to be a prefix of t if there exists a string u such that t = su. If u is nonempty, s is said to be a proper prefix of t. Symmetrically, a string s is said to be a suffix of t if there exists a string u such that t = us. If u is nonempty, s is said to be a proper suffix of t. Suffixes and prefixes are substrings of t. Both the relations "is a prefix of" and "is a suffix of" are prefix orders. A string s = uv.

For example, if Σ = the string 0011001 is a rotation of 0100110, where u = 00110 and v = 01. The reverse of a string is a string in reverse order. For example, if s = abc the reverse of s is cba. A string, the reverse of itself is called a palindrome, which includes the empty string and all strings of length 1, it is useful to define an ordering on a set of strings. If the alphabet Σ has a total order one can define a total order on Σ* called lexicographical order. For example, if Σ = and 0 < 1 the lexicographical order on Σ* includes the relationships ε < 0 < 00 < 000 <... < 0001 < 001 < 01 < 010 < 011 < 0110 < 01111 < 1 < 10 < 100 < 101 < 111 < 1111 < 11111... The lexicographical order is total if the alphabetical order is, but isn't well-founded for any nontrivial alphabet if the alphabetical order is. See Shortlex for an alternative string ordering that preserves well-foundedness. A number of additional operations on strings occur in the formal theory; these are given in the article on string operations.

Strings admit the following interpretation as nodes on a graph: Fixed-length strings can be viewed as nodes on a hypercube Variable-length strings can be viewed as nodes on the k-ary tree, where k is the number of symbols in Σ Infinite strings can be viewed as i