1.
Data compression
–
In signal processing, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost in the process. Lossy compression reduces bits by removing unnecessary or less important information. The process of reducing the size of a data file is often referred to simply as data compression; in the context of data transmission, it is called source coding, in opposition to channel coding.

Compression is useful because it reduces the resources required to store and transmit data. Computational resources are consumed in the compression process and, usually, in the reversal of the process (decompression), so data compression is subject to a space–time complexity trade-off.

Lossless data compression algorithms usually exploit statistical redundancy to represent data without losing any information, so that the process is reversible. Lossless compression is possible because most real-world data exhibits statistical redundancy. For example, an image may have areas of color that do not change over several pixels; instead of coding "red pixel, red pixel, ...", the data may be encoded as "279 red pixels". This is a basic example of run-length encoding; there are many schemes that reduce file size by eliminating redundancy.

The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, though compression can be slow. DEFLATE is used in PKZIP, Gzip, and PNG; LZW is used in GIF images. LZ methods use a table-based compression model in which table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input, and the table itself is often Huffman encoded. Current LZ-based coding schemes that perform well include Brotli and LZX.
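The run-length idea above can be sketched in a few lines of Python. This is only an illustration of the principle, not any particular codec's format; the function names and the pixel-string example are invented for the sketch:

```python
def rle_encode(data):
    """Run-length encode a sequence into (count, value) pairs."""
    runs = []
    for value in data:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, value])   # start a new run
    return [tuple(run) for run in runs]

def rle_decode(runs):
    """Invert rle_encode by expanding each (count, value) pair."""
    return [value for count, value in runs for _ in range(count)]
```

For instance, `rle_encode(["red"] * 279 + ["blue"])` returns `[(279, "red"), (1, "blue")]`, mirroring the "279 red pixels" example; decoding reverses the process exactly, which is what makes the scheme lossless.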
LZX is used in Microsoft's CAB format. The best modern lossless compressors use probabilistic models, such as prediction by partial matching; the Burrows–Wheeler transform can also be viewed as a form of statistical modelling. The basic task of grammar-based codes is constructing a context-free grammar deriving a single string; Sequitur and Re-Pair are practical grammar compression algorithms for which software is publicly available. In a further refinement of the use of probabilistic modelling, arithmetic coding is a more modern technique that uses the mathematical calculations of a machine to produce a string of encoded bits from a series of input data symbols.
2.
Algorithm
–
In mathematics and computer science, an algorithm is a self-contained sequence of actions to be performed. Algorithms can perform calculation, data processing, and automated reasoning tasks. An algorithm is an effective method that can be expressed within a finite amount of space and time, and in a well-defined formal language, for calculating a function. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate random input. Giving a formal definition of algorithms, corresponding to the intuitive notion, remains a challenging problem.

In English, the word was first used in about 1230 and then by Chaucer in 1391. English adopted the French term, but it wasn't until the late 19th century that "algorithm" took on the meaning that it has in modern English. Another early use of the word is from 1240, in a manual titled Carmen de Algorismo composed by Alexandre de Villedieu, which begins thus: Haec algorismus ars praesens dicitur, in qua / Talibus Indorum fruimur bis quinque figuris, which translates as: "Algorism is the art by which at present we use those Indian figures." The poem is a few hundred lines long and summarizes the art of calculating with the new style of Indian dice, or Talibus Indorum, or Hindu numerals.

An informal definition could be "a set of rules that precisely defines a sequence of operations", which would include all computer programs, including programs that do not perform numeric calculations. Generally, a program is only an algorithm if it stops eventually. Humans can do something equally useful in the case of certain enumerably infinite sets: they can give explicit instructions for determining the nth member of the set, for arbitrary finite n. An enumerably infinite set is one whose elements can be put into one-to-one correspondence with the integers. The concept of algorithm is also used to define the notion of decidability.
That notion is central for explaining how formal systems come into being, starting from a set of axioms. In logic, the time that an algorithm requires to complete cannot be measured; from such uncertainties, which characterize ongoing work, stems the unavailability of a definition of algorithm that suits both concrete and abstract usage of the term.

Algorithms are essential to the way computers process data; an algorithm can be considered to be any sequence of operations that can be simulated by a Turing-complete system. Although this may seem extreme, the arguments in its favor are hard to refute: Gurevich argued that Turing's informal argument in favor of his thesis justifies a stronger thesis, and according to Savage, an algorithm is a computational process defined by a Turing machine.

Typically, when an algorithm is associated with processing information, data can be read from an input source and written to an output device. Stored data are regarded as part of the state of the entity performing the algorithm; in practice, the state is stored in one or more data structures. For such a computational process, the algorithm must be rigorously defined: specified in the way it applies in all possible circumstances that could arise. That is, any conditional steps must be dealt with, case by case.
3.
Adaptive Huffman coding
–
Adaptive Huffman coding is an adaptive coding technique based on Huffman coding. It permits building the code as the symbols are being transmitted, with no prior knowledge of the source distribution. The benefit of a one-pass procedure is that the source can be encoded in real time, though it becomes more sensitive to transmission errors. There are a number of implementations of this method; the most notable are FGK and Vitter.

FGK is an online coding technique based on Huffman coding. Having no initial knowledge of occurrence frequencies, it permits dynamically adjusting the Huffman tree as data are being transmitted. In an FGK Huffman tree, a special external node, called the 0-node, is used to identify a newly arriving character: whenever new data are encountered, output the path to the 0-node followed by the data. For a character seen before, just output the path to the data in the current Huffman tree. Most importantly, the FGK Huffman tree must be adjusted when necessary: as the frequency of a datum is increased, the sibling property of the Huffman tree may be broken. The adjustment is triggered for this reason, and is accomplished by consecutive swappings of nodes, subtrees, or both. The data node is swapped with the node of the same frequency in the Huffman tree, and all ancestor nodes of the node are processed in the same manner.

Since the FGK algorithm has some drawbacks concerning the node-or-subtree swapping, Vitter proposed another algorithm to improve it. Some important terminology and constraints:
- Implicit numbering: nodes are numbered in increasing order by level and from left to right; i.e., nodes at the bottom level have lower implicit numbers than upper-level nodes, and nodes on the same level are numbered in increasing order from left to right.
- Invariant: for each weight w, all leaves of weight w precede all internal nodes of weight w.
- Blocks: nodes of the same weight.
- Leader: the highest-numbered node in a block.
Blocks are interlinked in increasing order of their weights, and a leaf block always precedes an internal block of the same weight, thus maintaining the invariant. NYT (not yet transferred) is a special node used to represent symbols that have not yet been transferred. Encoder and decoder start with only the root node, which has the maximum number; in the beginning it is the initial NYT node. When we transmit a symbol not yet in the tree, we have to transmit the code for the NYT node, then the symbol's generic code. For every symbol that is already in the tree, we only have to transmit the code for its leaf node. For example, encoding "abb" gives 0110000100110001011.
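As a sketch of the adaptive idea only, the following Python rebuilds a static Huffman table from the counts of the symbols seen so far after every symbol. This is not the incremental FGK/Vitter update, and it omits the NYT mechanism: seeding every alphabet symbol with a count of 1 (an assumption of the sketch) sidesteps the need to transmit new symbols' raw bits:

```python
import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code table from {symbol: frequency}."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def adaptive_encode(message):
    """Naive adaptive coding: rebuild the code table after every symbol
    from the counts seen so far.  A real FGK or Vitter coder achieves
    the same adaptivity by incrementally swapping nodes instead of
    rebuilding the whole tree."""
    counts = Counter({s: 1 for s in set(message)})
    out = []
    for sym in message:
        out.append(huffman_code(counts)[sym])
        counts[sym] += 1              # decoder performs the same update
    return "".join(out)
```

Because the decoder applies the same count update after decoding each symbol, both sides stay in sync without transmitting a frequency table. Under this simplified scheme `adaptive_encode("abb")` yields `001`; the 19-bit FGK output quoted above is longer because FGK also transmits the raw bits of each new symbol via the NYT node.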
4.
Range encoding
–
Range encoding is an entropy coding method defined by G. Nigel N. Martin in a 1979 paper, which effectively rediscovered the FIFO arithmetic code first introduced by Richard Clark Pasco in 1976. After the expiration of the first arithmetic coding patent, range encoding appeared to be clearly free of patent encumbrances, which particularly drove interest in the technique in the open-source community. Since that time, patents on various well-known arithmetic coding techniques have also expired.

Range encoding conceptually encodes all the symbols of the message into one number, unlike Huffman coding, which assigns each symbol a bit-pattern and concatenates all the bit-patterns together. Each symbol of the message is encoded in turn. The decoder must use the same probability estimation the encoder used, which can either be sent in advance, derived from already transferred data, or be built into the compressor and decompressor. When all symbols have been encoded, merely identifying the final sub-range is enough to communicate the entire message.

Suppose we want to encode the message AABA<EOM>, where <EOM> is the end-of-message symbol. Because all five-digit integers starting with 251 fall within our final range, 251 is one of the three-digit prefixes we could transmit that would unambiguously convey our original message. In principle this seems to require unbounded precision, but in practice it is not a problem: instead of starting with a very large range and gradually narrowing it down, the encoder works with a limited number of digits at a time, because after some number of digits have been encoded, the leftmost digits will not change. In the example, after encoding just three symbols we already knew that our final result would start with 2. More digits are shifted in on the right as digits on the left are sent off; this is what an implementation's Encode function does. To finish off we may need to emit a few extra digits: the top digit of low is probably too small, so we need to increment it, but first we need to make sure range is large enough.
One problem that can occur with such an Encode function is that range might become very small while low and low+range still differ in their leading digits; the interval may then have insufficient precision to distinguish between all of the symbols in the alphabet. When this happens we need to fudge a little: output the first couple of digits even though we might be off by one. The decoder will be following the same steps, so it will know when it needs to do this to keep in sync.

Base 10 was used in the example, but a real implementation would use binary. Instead of 10000 and 1000 you would likely use hexadecimal constants such as 0x1000000 and 0x10000, and instead of emitting a digit at a time you would emit a byte at a time, using a byte-shift operation instead of multiplying by 10.

Decoding uses exactly the same algorithm, with the addition of keeping track of the current code value, consisting of the digits read from the compressor. Instead of emitting the top digit of low, you just throw it away, using an AppendDigit step in place of EmitDigit. To determine which probability interval to apply, the decoder needs to look at the current value of code within the interval [low, low+range). For the AABA<EOM> example above, this would return a value in the range 0 to 9: values 0 through 5 would represent A, 6 and 7 would represent B, and 8 and 9 would represent <EOM>.
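As a concrete check of the AABA<EOM> walkthrough, here is a minimal whole-message range encoder in Python. It is conceptual only: it keeps the full interval instead of renormalizing and emitting digits as a real coder would, and the frequency table (A:6, B:2, <EOM>:2 out of 10) is an assumption read off from the decoder's digit intervals above:

```python
def range_encode(message, freqs, cum, total, top=100_000):
    """Conceptual whole-message range encoder, base 10, no renormalization.

    freqs: symbol frequency out of `total`; cum: cumulative frequency.
    Returns (low, rng): the message is any integer in [low, low + rng).
    """
    low, rng = 0, top
    for sym in message:
        rng //= total              # split the range into `total` slots
        low += rng * cum[sym]      # advance to the symbol's sub-range
        rng *= freqs[sym]          # shrink to the symbol's width
    return low, rng

freqs = {"A": 6, "B": 2, "<EOM>": 2}   # out of a total of 10
cum   = {"A": 0, "B": 6, "<EOM>": 8}   # cumulative frequencies
low, rng = range_encode(["A", "A", "B", "A", "<EOM>"], freqs, cum, 10)
```

With this model the final interval is [25056, 25920); every five-digit integer with prefix 251 lies inside it, matching the claim above that the three digits 251 suffice to identify the message.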
5.
Tunstall coding
–
In computer science and information theory, Tunstall coding is a form of entropy coding used for lossless data compression. Tunstall coding was the subject of Brian Parker Tunstall's PhD thesis in 1967; the subject of that thesis was "Synthesis of noiseless compression codes". Its design is a precursor to Lempel–Ziv.

Unlike variable-length codes, which include Huffman and Lempel–Ziv coding, Tunstall coding maps source symbols to a fixed number of bits. Both Tunstall codes and Lempel–Ziv codes represent variable-length words by fixed-length codes. Unlike typical set encoding, Tunstall coding parses a stochastic source with codewords of variable length. It can be shown that, for a large enough dictionary, the number of bits per source letter can be arbitrarily close to the entropy H.

The algorithm requires as input an alphabet U, along with a probability distribution over its letters. It also requires an arbitrary constant C, which is an upper bound on the size of the dictionary that it will compute. The dictionary in question, D, is constructed as a tree of probabilities. The algorithm goes like this:
D := tree of |U| leaves, one for each letter in U.
While |D| < C: convert the most probable leaf to a tree with |U| leaves.

Let us imagine that we wish to encode the string "hello, world". Let us further assume that the input alphabet U contains only the characters appearing in "hello, world"; we can therefore compute the probability of each character based on its statistical appearance in the input string. For instance, the letter l appears three times in a string of 12 characters. We initialize the tree, starting with a tree of |U| = 9 leaves; each word is therefore directly associated with a letter of the alphabet. The 9 words that we thus obtain can be encoded into a fixed-sized output of ⌈log₂ 9⌉ = 4 bits. We then take the leaf of highest probability and convert it to yet another tree of |U| = 9 leaves, and we re-compute the probabilities of those leaves.
For instance, the sequence of two letters "ll" happens once; given that there are three occurrences of letters followed by an l, the resulting probability is 1/3 · 3/12 = 1/12. We obtain 17 words, which can each be encoded into a fixed-sized output of ⌈log₂ 17⌉ = 5 bits. Note that we could iterate further, increasing the number of words by |U| − 1 = 8 every time.

Tunstall coding requires the algorithm to know, prior to the parsing operation, what the distribution of probabilities for each letter of the alphabet is. This issue is shared with Huffman coding, and its fixed-length block output makes it weaker than Lempel–Ziv, which has a similar dictionary-based design but a variable-sized block output.
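The leaf-expansion loop can be sketched as follows. For simplicity this version assumes an i.i.d. model with the per-letter frequencies of "hello, world" (a simplification of the occurrence-based conditional probability computed above for "ll"):

```python
import heapq
from collections import Counter
from math import ceil, log2

def tunstall(probs, max_size):
    """Build a Tunstall dictionary over alphabet probs ({letter: p}):
    repeatedly expand the most probable word into |U| longer words,
    stopping before the dictionary would exceed max_size entries."""
    heap = [(-p, w) for w, p in probs.items()]   # max-heap via negation
    heapq.heapify(heap)
    n = len(probs)
    size = n
    while size + n - 1 <= max_size:
        neg_p, word = heapq.heappop(heap)        # most probable leaf...
        for letter, p in probs.items():          # ...becomes n new leaves
            heapq.heappush(heap, (neg_p * p, word + letter))
        size += n - 1
    return sorted(word for _, word in heap)

# Per-letter frequencies of "hello, world" (9 distinct characters)
text = "hello, world"
probs = {ch: k / len(text) for ch, k in Counter(text).items()}
words = tunstall(probs, 17)    # one expansion: 9 + 8 = 17 words
```

As in the walkthrough, the most probable leaf (l, at 3/12) is expanded once, giving 17 words that fit in ⌈log₂ 17⌉ = 5 bits each; every two-letter word in the resulting dictionary begins with l.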
6.
Huffman coding
–
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding and/or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a student at MIT and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes". The output from Huffman's algorithm can be viewed as a code table for encoding a source symbol. The algorithm derives this table from the probability or frequency of occurrence of each possible value of the source symbol. As in other entropy encoding methods, more common symbols are represented using fewer bits than less common symbols. Huffman's method can be implemented so as to find a code in time linear in the number of input weights, if these weights are sorted.

However, although optimal among methods encoding symbols separately, Huffman coding is not always optimal among all compression methods; specifically, it is optimal only if the probabilities of the symbols are natural powers of 1/2. This is usually not the case, and this sub-optimality is repaired in arithmetic coding and in the recent, faster Asymmetric Numeral Systems family of entropy codings.

In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. In doing so, Huffman outdid Fano, who had worked with information theory inventor Claude Shannon to develop a similar code, by building the tree from the bottom up instead of the top down. Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code.

Given a set of symbols and their weights, the problem is to find a prefix-free binary code with minimum expected codeword length. Input: alphabet A = (a₁, a₂, …, aₙ), which is the symbol alphabet of size n.
Set W = (w₁, w₂, …, wₙ), which is the set of the symbol weights, and code C = (c₁, c₂, …, cₙ), which is the tuple of codewords, where cᵢ is the codeword for aᵢ, 1 ≤ i ≤ n.

Goal: let L(C) = ∑ᵢ₌₁ⁿ wᵢ × length(cᵢ) be the weighted path length of code C. The condition is L(C) ≤ L(T) for any code T.

We give an example of the result of Huffman coding for a code with five characters and given weights. We will not verify that it minimizes L over all codes, but we will compute L and compare it to the Shannon entropy H of the set of weights. For any code that is biunique, meaning that the code is uniquely decodable, the sum ∑ᵢ 2^(−length(cᵢ)) is at most one. In this example the sum is strictly equal to one; as a result, the code is termed a complete code.
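A compact sketch of the bottom-up construction in Python. The five example weights below are hypothetical, since the extract does not reproduce the original example's weights:

```python
import heapq

def huffman(weights):
    """Build a Huffman code table from {symbol: weight} by repeatedly
    merging the two lowest-weight subtrees; codes in the first subtree
    get a leading 0, codes in the second a leading 1."""
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(weights.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Hypothetical five-symbol example; integer weights summing to 100
weights = {"a": 10, "b": 15, "c": 30, "d": 16, "e": 29}
codes = huffman(weights)
L = sum(w * len(codes[s]) for s, w in weights.items()) / 100
```

For these weights the three most common symbols get 2-bit codes and the two rarest get 3-bit codes, giving L = 2.25 bits per symbol; the sum ∑ 2^(−length) equals 1, so this code is complete in the sense described above.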
7.
Golomb coding
–
Golomb coding is a lossless data compression method using a family of data compression codes invented by Solomon W. Golomb in the 1960s. Rice coding denotes using a subset of the family of Golomb codes to produce a simpler prefix code. Rice used this set of codes in an adaptive coding scheme, so "Rice coding" can refer either to that adaptive scheme or to using that subset of Golomb codes. Whereas a Golomb code has a tunable parameter that can be any positive integer value, in a Rice code the parameter is a power of two. This makes Rice codes convenient for use on a computer, since multiplication and division by 2 can be implemented efficiently in binary arithmetic. Rice coding is used as the entropy encoding stage in a number of lossless image compression methods.

Golomb coding uses a tunable parameter M to divide an input value N into two parts: q, the result of a division by M, and r, the remainder. The quotient is sent in unary coding, followed by the remainder in truncated binary encoding. When M = 1, Golomb coding is equivalent to unary coding. Golomb–Rice codes can be thought of as codes that indicate a number by the position of the bin, and the offset within the bin. The final result looks like the quotient code followed by r; note that r can be of a varying number of bits. Specifically, r is always b bits for a Rice code, and switches between b−1 and b bits for a Golomb code: with b = ⌈log₂ M⌉, if 0 ≤ r < 2^b − M, then use b−1 bits to encode r; if 2^b − M ≤ r < M, then use b bits to encode r + 2^b − M. Clearly, b = log₂ M if M is a power of 2, and then all values of r can be encoded with b bits.

The parameter M is a function of the corresponding Bernoulli process: M is either the median of the distribution or the median ±1, and the Golomb code for this distribution is equivalent to the Huffman code for the same probabilities. Golomb's scheme was designed to encode sequences of non-negative numbers, but it can be extended to signed numbers by interleaving: the sequence begins 0, −1, 1, −2, 2, −3, 3, …, so that the nth negative value is mapped to the nth odd number and the mth positive value is mapped to the mth even number.
This may be expressed mathematically as follows: a non-negative value x is mapped to 2x, and a negative value x is mapped to 2|x| − 1. The resulting code is optimal only if both the positive values and the magnitudes of the negative values follow a geometric distribution with the same parameter.

Note that this is the Rice–Golomb encoding, where the remainder code uses simple truncated binary encoding. The algorithm: fix the parameter M to an integer value; for each value N, compute q = ⌊N/M⌋ and r = N mod M, and send q in unary. If M is a power of 2, code r as plain binary using log₂ M bits. If M is not a power of 2, set b = ⌈log₂ M⌉; if r < 2^b − M, code r as plain binary using b−1 bits, otherwise code r + 2^b − M as plain binary using b bits.
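The steps above can be sketched directly in Python. The unary convention here (q ones terminated by a zero) is one common choice, not mandated by the text:

```python
def golomb_encode(n, m):
    """Golomb-code non-negative integer n with parameter m:
    quotient in unary (q ones then a zero), remainder in
    truncated binary."""
    q, r = divmod(n, m)
    out = "1" * q + "0"
    if m == 1:                        # M = 1 reduces to pure unary
        return out
    # b = ceil(log2(m)); for powers of two, bit_length() - 1 is exact
    b = m.bit_length() - 1 if m & (m - 1) == 0 else m.bit_length()
    cutoff = (1 << b) - m             # first 2**b - m remainders use b-1 bits
    if r < cutoff:
        return out + format(r, f"0{b - 1}b")
    return out + format(r + cutoff, f"0{b}b")
```

For M = 10 (a true Golomb code), `golomb_encode(42, 10)` gives `11110010`: quotient 4 in unary, then remainder 2 in b−1 = 3 bits since 2 < 2⁴ − 10 = 6. For the Rice case M = 4, every remainder takes exactly b = 2 bits, so `golomb_encode(9, 4)` gives `11001`.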
8.
Wayback Machine
–
The Internet Archive launched the Wayback Machine in October 2001. It was set up by Brewster Kahle and Bruce Gilliat, and is maintained with content from Alexa Internet. The service enables users to see archived versions of web pages across time, which the archive calls a "three dimensional index". Since 1996, the Wayback Machine has been archiving cached pages of websites onto its large cluster of Linux nodes. It revisits sites every few weeks or months and archives a new version; sites can also be captured on the fly by visitors who enter a site's URL into a search box. The intent is to capture and archive content that otherwise would be lost whenever a site is changed or closed down. The overall vision of the machine's creators is to archive the entire Internet.

The name Wayback Machine was chosen as a reference to the WABAC machine, a time-traveling device used by the characters Mr. Peabody and Sherman in The Rocky and Bullwinkle Show, an animated cartoon. The crawlers respect the robots exclusion standard for websites whose owners opt for them not to appear in search results or be cached. Archive-It was developed to overcome inconsistencies in partially cached websites.

Information had been kept on digital tape for five years, with Kahle occasionally allowing researchers access. When the archive reached its fifth anniversary, it was unveiled and opened to the public in a ceremony at the University of California, Berkeley. Snapshots usually become available more than six months after they are archived or, in some cases, even later. The frequency of snapshots is variable, so not all tracked website updates are recorded; sometimes there are intervals of several weeks or years between snapshots. After August 2008, sites had to be listed on the Open Directory in order to be included.
As of 2009, the Wayback Machine contained approximately three petabytes of data and was growing at a rate of 100 terabytes each month; the growth rate reported in 2003 was 12 terabytes/month. The data is stored on PetaBox rack systems manufactured by Capricorn Technologies. In 2009, the Internet Archive migrated its customized storage architecture to Sun Open Storage. In 2011, a new, improved version of the Wayback Machine, with an updated interface and a fresher index of archived content, was made available for public testing; the index driving the classic Wayback Machine has only a little material past 2008.

In January 2013, the company announced a ground-breaking milestone of 240 billion archived URLs. In October 2013, it announced the "Save a Page" feature, which allows any Internet user to archive the contents of a URL; this created a potential for abuse of the service to host malicious binaries. As of December 2014, the Wayback Machine contained almost nine petabytes of data and was growing at a rate of about 20 terabytes each week. Between October 2013 and March 2015, the website's global Alexa rank changed from 162 to 208.

In a 2009 case, Netbula, LLC v. Chordiant Software Inc., defendant Chordiant filed a motion to compel Netbula to disable the robots.txt file on its website. Netbula objected to the motion on the ground that defendants were asking to alter Netbula's website. In an October 2004 case, Telewizja Polska USA, Inc. v. Echostar Satellite, No. 02 C 3293, 65 Fed. 673, a litigant attempted to use the Wayback Machine archives as a source of admissible evidence; Telewizja Polska is the provider of TVP Polonia, and EchoStar operates the Dish Network.