1.
Data compression
–
In signal processing, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. The process of reducing the size of a data file is referred to as data compression; in the context of data transmission, it is called source coding, in opposition to channel coding. Compression is useful because it reduces the resources required to store and transmit data. Computational resources are consumed in the compression process and, usually, in the reversal of the process (decompression); data compression is thus subject to a space–time complexity trade-off. Lossless data compression algorithms usually exploit statistical redundancy to represent data without losing any information, so that the process is reversible. Lossless compression is possible because most real-world data exhibits statistical redundancy. For example, an image may have areas of color that do not change over several pixels; instead of coding "red pixel, red pixel, ...", the data may be encoded as "279 red pixels". This is a basic example of run-length encoding; there are many schemes to reduce file size by eliminating redundancy. The Lempel–Ziv (LZ) compression methods are among the most popular algorithms for lossless storage. DEFLATE is a variation on LZ optimized for decompression speed and compression ratio, but compression can be slow. DEFLATE is used in PKZIP, Gzip, and PNG; LZW is used in GIF images. LZ methods use a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often Huffman encoded. Current LZ-based coding schemes that perform well are Brotli and LZX.
LZX is used in Microsoft's CAB format. The best modern lossless compressors use probabilistic models, such as prediction by partial matching. The Burrows–Wheeler transform can also be viewed as a form of statistical modelling. The basic task of grammar-based codes is constructing a context-free grammar deriving a single string; Sequitur and Re-Pair are practical grammar compression algorithms for which software is publicly available. In a further refinement of the use of probabilistic modelling, arithmetic coding is a more modern coding technique that uses the mathematical calculations of a finite-state machine to produce a string of encoded bits from a series of input data symbols.
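The run-length encoding idea mentioned above is simple enough to sketch directly. The following illustrative Python (function names are my own, not from any particular codec) collapses runs of identical symbols into (symbol, count) pairs and expands them back:

```python
def rle_encode(data):
    """Collapse runs of identical symbols into (symbol, count) pairs."""
    runs = []
    for symbol in data:
        if runs and runs[-1][0] == symbol:
            runs[-1][1] += 1
        else:
            runs.append([symbol, 1])
    return [tuple(run) for run in runs]

def rle_decode(runs):
    """Expand (symbol, count) pairs back into the original sequence."""
    return [symbol for symbol, count in runs for _ in range(count)]

# 279 identical "red" pixels compress to a single pair, as in the text.
pixels = ["red"] * 279
assert rle_encode(pixels) == [("red", 279)]
assert rle_decode(rle_encode(pixels)) == pixels
```

The scheme only wins when runs are common; on data without repeated symbols the (symbol, count) pairs are larger than the input, which is why RLE is combined with other techniques in practice.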

2.
DivX
–
DivX is a brand of video codec products developed by DivX, LLC. The DivX codec is notable for its ability to compress lengthy video segments into small sizes while maintaining relatively high visual quality. There are three DivX codecs: the original MPEG-4 Part 2 DivX codec, the H.264/MPEG-4 AVC DivX Plus HD codec, and the High Efficiency Video Coding DivX HEVC Ultra HD codec. The most recent version of the codec itself is version 6.9.2, which is several years old; new version numbers on the packages now reflect updates to the player and converter. The DivX brand is distinct from DIVX, which is a video rental system developed by Circuit City Stores that used custom DVD-like discs. The winking emoticon in the early "DivX ;-)" codec name was a reference to the DIVX system. Although not created by them, the DivX company adopted the name of the popular DivX ;-) codec; the company dropped the smiley and released DivX 4.0, which was actually the first DivX version, trademarking the word "DivX". DivX ;-) 3.11 Alpha and later 3.xx versions refer to a version of the Microsoft MPEG-4 Version 3 video codec from the Windows Media Tools 4 codecs. The video codec, which was actually not MPEG-4 compliant, was extracted around 1998 by French hacker Jerome Rota at Montpellier. The Microsoft codec originally required that the compressed output be put in an ASF file; it was altered to allow other containers such as Audio Video Interleave (AVI). Rota hacked the Microsoft codec because newer versions of Windows Media Player would not play his video portfolio and résumé that were encoded with it. Instead of re-encoding his portfolio, Rota and German hacker Max Morice decided to reverse engineer the codec. In early 2000, Jordan Greenhall recruited Rota to form a company to develop an MPEG-4 codec from scratch, and this effort resulted first in the release of the OpenDivX codec and source code on 15 January 2001. OpenDivX was hosted as a project on the Project Mayo web site at projectmayo.com.
The company's internal developers and some external developers worked jointly on OpenDivX for the next several months. In early 2001, DivX employee "Sparky" wrote a new and improved version of the codec's encoding algorithm, known as encore2. This code was included in the OpenDivX public source repository for a brief time, but then was abruptly removed. The explanation from DivX at the time was that "the community really wants a Winamp", and it was at this point that the project forked. That summer, Rota left the French Riviera and moved to San Diego with nothing but a pack of cigarettes, where he and Greenhall founded what would eventually become DivX, Inc. DivX took the encore2 code and developed it into DivX 4.0, initially released in July 2001. Other developers who had participated in OpenDivX took encore2 and started a new project—Xvid—that started with the same encoding core.

3.
VirtualDub
–
VirtualDub is a free and open-source video capture and video processing utility for Microsoft Windows written by Avery Lee. It is designed to process linear video streams, including filtering, and it uses the AVI container format to store captured video. The first version of VirtualDub, written for Windows 95, to be released on SourceForge was uploaded on August 20, 2000. In 2009, the third-party software print guide Learning VirtualDub referred to VirtualDub as "the leading free Open Source video capture and processing tool". Several hundred third-party plug-ins for VirtualDub exist, including some by professional software companies; furthermore, Debugmode Wax allows use of VirtualDub plug-ins in professional video editing software such as Adobe Premiere Pro and Vegas Pro. VirtualDub is designed for Microsoft Windows but may run on Linux; however, native support for such systems is not available. VirtualDub was made to operate exclusively on AVI files; however, appropriate video and audio codecs need to be installed. VirtualDub supports both DirectShow and Video for Windows for video capture. VirtualDub can help overcome problems with digital cameras that also record video: many models, especially Canon, record in an M-JPEG format incompatible with Sony Vegas 6.0 and 7.0, and saving the AVI files as old-style AVI files allows them to appear in Vegas. VirtualDub supports DV capture from Type 2 FireWire controllers only; there is no DV batch capture, still image capture, or DV device control capability. VirtualDub can create a video file from a series of image files in Truevision TGA or Windows Bitmap file formats. Individual frames must be given file names numbered in order without any gaps. From those, the frame rate can be adjusted, and other modifications such as the addition of a sound track can be made.
VirtualDub can also disassemble a video by extracting its sound tracks and saving its frames into Truevision TGA or Windows Bitmap files. VirtualDub can delete segments of a video file, append new segments, or reorder existing segments. Appended segments must have similar audio and video formats, dimensions, number of channels, and frame rates; otherwise, VirtualDub is incapable of mixing dissimilar video files or adding transition effects between segments. VirtualDub comes with a number of video editing components known as filters. They can perform tasks such as arbitrary resizing, converting the video to grayscale, arbitrary rotation, cropping, or changing simple values like brightness. Filters may be combined during the assembly as well. Filter plug-ins further extend VirtualDub's capabilities; a plug-in SDK is available for developers to create their own video and audio filters.

4.
Xvid
–
Xvid is a video codec library following the MPEG-4 video coding standard, specifically MPEG-4 Part 2 Advanced Simple Profile (ASP). It uses ASP features such as B-frames, global and quarter-pixel motion compensation, lumi masking, and trellis quantization. Xvid is a primary competitor of the DivX Pro codec. In contrast with the DivX codec, which is proprietary software developed by DivX, Xvid is free software distributed under the terms of the GNU General Public License. In January 2001, DivXNetworks founded OpenDivX as part of Project Mayo, which was intended to be a home for open source multimedia projects. OpenDivX was an open-source MPEG-4 video codec based on a stripped-down version of the MoMuSys reference MPEG-4 encoder. The source code, however, was placed under a restrictive license. In early 2001, DARC member "Sparky" wrote an improved version of the encoding core called encore2. This was updated several times before, in April, it was removed from CVS without warning; the explanation given by Sparky was "We decided that we are not ready to have it in public yet." Soon after, DARC released a version of their closed-source commercial DivX 4 codec, and it was after this that a fork of OpenDivX was created. Since then, all the OpenDivX code has been replaced and Xvid has been published under the GNU General Public License. As an implementation of MPEG-4 Part 2, Xvid uses many patented technologies; for this reason, Xvid 0.9.x versions were not licensed in countries where these software patents are recognized. With the 1.0.x releases, a GNU GPL v2 license is used with no explicit geographical restriction; however, the legal usage of Xvid may still be restricted by local laws. In July 2002, Sigma Designs released an MPEG-4 video codec called the REALmagic MPEG-4 Video Codec. Before long, people testing this new codec found that it contained considerable portions of Xvid code.
Sigma Designs was contacted and confirmed that a programmer had based REALmagic on Xvid. The Xvid developers decided to stop work and go public to force Sigma Designs to respect the terms of the GPL. After articles were published on Slashdot and in The Inquirer, Sigma Designs agreed in August 2002 to publish their source code. Xvid is not a video format; it is a program for compressing to and decompressing from the MPEG-4 ASP format. Since Xvid uses MPEG-4 Advanced Simple Profile compression, video encoded with Xvid is MPEG-4 ASP video and can be decoded by ASP-capable software; this includes a large number of media players and decoders based on libavcodec. As of 2016, xvid.com carries binaries for using the codec. Xvid-encoded files can be written to a CD or DVD and played in some DivX-compatible DVD players and media players. However, Xvid can optionally encode video with advanced MPEG-4 features that most DivX Certified set-top players do not support; for example, Xvid specifies three warp points for its implementation of global motion compensation, as opposed to the single warp point implementation of DivX. Enabling some of the more advanced encoding features can compromise player compatibility. Some issues also exist with the custom quantization matrices used in tools such as AutoGK that automate encoding with Xvid.

5.
International Organization for Standardization
–
The International Organization for Standardization (ISO) is an international standard-setting body composed of representatives from various national standards organizations. Founded on 23 February 1947, the organization promotes worldwide proprietary, industrial, and commercial standards. It is headquartered in Geneva, Switzerland, and as of March 2017 works in 162 countries. It was one of the first organizations granted general consultative status with the United Nations Economic and Social Council. ISO, the International Organization for Standardization, is an independent, non-governmental organization, the members of which are the standards organizations of the 162 member countries. It is the world's largest developer of international standards and facilitates world trade by providing common standards between nations. Nearly twenty thousand standards have been set, covering everything from manufactured products and technology to food safety. Use of the standards aids in the creation of products and services that are safe, reliable, and of good quality. The standards help businesses increase productivity while minimizing errors and waste. By enabling products from different markets to be directly compared, they facilitate companies in entering new markets and assist in the development of global trade on a fair basis. The standards also serve to safeguard consumers and the end-users of products and services. The three official languages of the ISO are English, French, and Russian. The name of the organization in French is Organisation internationale de normalisation. According to the ISO, as its name in different languages would have different abbreviations, the organization adopted ISO as its abbreviated name in reference to the Greek word isos (meaning "equal"). However, during the meetings of the new organization, this Greek word was not invoked. Both the name ISO and the logo are registered trademarks. The organization today known as ISO began in 1926 as the International Federation of the National Standardizing Associations.
ISO is an organization whose members are recognized authorities on standards. Members meet annually at a General Assembly to discuss ISO's strategic objectives. The organization is coordinated by a Central Secretariat based in Geneva. A Council with a membership of 20 member bodies provides guidance and governance. The Technical Management Board is responsible for over 250 technical committees. ISO has formed joint committees with the International Electrotechnical Commission (IEC) to develop standards and terminology in the areas of electrical and electronic related technologies. ISO/IEC Joint Technical Committee 1 ("Information technology") was created in 1987 to develop and maintain standards in the field of information technology. ISO has three membership categories. Member bodies are national bodies considered the most representative standards body in each country; these are the members of ISO that have voting rights. Correspondent members are countries that do not have their own standards organization; these members are informed about ISO's work, but do not participate in standards promulgation. Subscriber members are countries with small economies; they pay reduced membership fees, but can follow the development of standards.

6.
Video compression picture types
–
In the field of video compression, a video frame is compressed using different algorithms with different advantages and disadvantages, centered mainly around the amount of data compression. These different algorithms for video frames are called picture types or frame types. The three major picture types used in the different video algorithms are I, P, and B. They differ in the following characteristics: I‑frames are the least compressible but do not require other video frames to decode; P‑frames can use data from previous frames to decompress and are more compressible than I‑frames; B‑frames can use both previous and forward frames for data reference to get the highest amount of data compression. There are thus three types of pictures used in compression: I‑frames, P‑frames, and B‑frames. An I‑frame is an Intra-coded picture, in effect a fully specified picture. P‑frames and B‑frames hold only part of the image information, so they need less space to store than an I‑frame and thus improve video compression rates. A P‑frame (Predicted picture) holds only the changes in the image from the previous frame. For example, in a scene where a car moves across a stationary background, only the car's movements need to be encoded; the encoder does not need to store the unchanging background pixels in the P‑frame. P‑frames are also known as delta‑frames. A B‑frame (Bidirectional predicted picture) saves even more space by using differences between the current frame and both the preceding and following frames to specify its content. While the terms frame and picture are often used interchangeably, strictly speaking, a frame is a complete image captured during a known time interval, and a field is the set of odd-numbered or even-numbered scanning lines composing a partial image. For example, in 1080 full HD mode, there are 1080 lines of pixels: an odd field consists of pixel information for lines 1, 3, and so on, and an even field has pixel information for lines 2, 4, and so on. Frames that are used as a reference for predicting other frames are referred to as reference frames.
In the latest international standard, known as H.264/MPEG-4 AVC, a slice is a spatially distinct region of a frame that is encoded separately from any other region in the same frame. In that standard, instead of I-frames, P-frames, and B-frames, there are I-slices, P-slices, and B-slices. Also found in H.264 are several additional types of frames/slices: SI‑frames/slices facilitate switching between coded streams and contain SI-macroblocks. SI- and SP‑frames allow for increases in error resistance; when such frames are used along with a smart decoder, it is possible to recover the broadcast streams of damaged DVDs. I-frames are coded without reference to any frame except themselves; they may be generated by an encoder to create a random access point, may also be generated when differing image details prohibit generation of effective P- or B-frames, and typically require more bits to encode than other frame types. Often, I‑frames are used for random access and are used as references for the decoding of other pictures.
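The P-frame idea described above can be made concrete with a toy sketch. This illustrative Python (names and data are my own; real codecs work on motion-compensated blocks, not individual pixels) stores only the pixels that changed relative to the previous frame:

```python
# Toy illustration of P-frame-style delta coding on one-dimensional
# "frames" of pixel values.
def delta_encode(prev_frame, frame):
    """Record only the pixels that differ from the previous frame."""
    return {i: v for i, (p, v) in enumerate(zip(prev_frame, frame)) if p != v}

def delta_decode(prev_frame, delta):
    """Rebuild a frame from the previous frame plus the recorded changes."""
    return [delta.get(i, p) for i, p in enumerate(prev_frame)]

i_frame = [7, 7, 7, 7, 3]            # fully specified picture
p_frame = [7, 7, 9, 7, 3]            # only one pixel changed
delta = delta_encode(i_frame, p_frame)
assert delta == {2: 9}               # far smaller than the full frame
assert delta_decode(i_frame, delta) == p_frame
```

The sketch also shows why reference frames matter: if the previous frame is lost or corrupted, every delta-coded frame that depends on it decodes incorrectly until the next fully specified I-frame.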

7.
Open-source model
–
Open-source software may be developed in a collaborative public manner. According to scientists who studied it, open-source software is a prominent example of open collaboration. A 2008 report by the Standish Group states that adoption of open-source software models has resulted in savings of about $60 billion per year to consumers. In the early days of computing, programmers and developers shared software in order to learn from each other; eventually the open-source notion moved to the wayside with the commercialization of software in the years 1970–1980. In 1997, Eric Raymond published The Cathedral and the Bazaar, which helped motivate Netscape to release the source code of its browser suite; this source code subsequently became the basis behind SeaMonkey, Mozilla Firefox, Thunderbird, and KompoZer. Netscape's act prompted Raymond and others to look into how to bring the Free Software Foundation's free software ideas to the commercial software industry. The new term they chose was "open source", which was soon adopted by Bruce Perens, publisher Tim O'Reilly, Linus Torvalds, and others. The Open Source Initiative was founded in February 1998 to encourage use of the new term. A Microsoft executive publicly stated in 2001 that "open source is an intellectual property destroyer. I can't imagine something that could be worse than this for the software business." IBM, Oracle, Google, and State Farm are just a few of the companies with a serious public stake in today's competitive open-source market, and there has been a significant shift in the corporate philosophy concerning the development of FOSS. The free software movement was launched in 1983. In 1998, a group of individuals advocated that the term free software should be replaced by open-source software as an expression which is less ambiguous. Software developers may want to publish their software with an open-source license. The Open Source Definition, notably, presents an open-source philosophy, and further defines the terms of usage, modification, and redistribution of open-source software.
Software licenses grant rights to users which would otherwise be reserved by law to the copyright holder. Several open-source software licenses have qualified within the boundaries of the Open Source Definition. The open source label came out of a strategy session held on April 7, 1998 in Palo Alto, in reaction to Netscape's January 1998 announcement of a source code release for Navigator. They used the opportunity before the release of Navigator's source code to clarify a potential confusion caused by the ambiguity of the word "free" in English. Many people claimed that the birth of the Internet, since 1969, started the open source movement. The Free Software Foundation, started in 1985, intended the word "free" to mean freedom to distribute and not freedom from cost. Since a great deal of free software already was free of charge, such software became associated with zero cost. The Open Source Initiative was formed in February 1998 by Eric Raymond and Bruce Perens. They sought to bring a higher profile to the practical benefits of freely available source code, and they wanted to bring major software businesses and other high-tech industries into open source. Perens attempted to register "open source" as a service mark for the OSI. The Open Source Initiative's definition is recognized by governments internationally as the standard or de facto definition, and OSI uses The Open Source Definition to determine whether it considers a software license open source.

8.
Arithmetic coding
–
Arithmetic coding is a form of entropy encoding used in lossless data compression. Normally, a string of characters such as the words "hello there" is represented using a fixed number of bits per character. Arithmetic coding instead represents the current information as a range, defined by two numbers. The recent Asymmetric Numeral Systems family of entropy coders allows for faster implementations thanks to directly operating on a single natural number representing the current information. In the simplest case, the probability of each symbol occurring is equal. For example, consider a set of three symbols, A, B, and C, each equally likely to occur. Simple block encoding would require 2 bits per symbol, which is wasteful: that is to say, A=00, B=01, and C=10, but 11 is unused. A more efficient solution is to represent a sequence of these three symbols as a rational number in base 3, where each digit represents a symbol. For example, the sequence "ABBCAB" could become 0.011201 in base 3. This is feasible for long sequences because there are efficient, in-place algorithms for converting the base of arbitrarily precise numbers. To decode the value, knowing the original string had length 6, one can convert back to base 3 and round to 6 digits. In general, arithmetic coders can produce near-optimal output for any given set of symbols. Compression algorithms that use arithmetic coding start by determining a model of the data – basically a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimal the output will be. Models can also handle alphabets other than the simple four-symbol set chosen for this example; models can even be adaptive, so that they continually change their prediction of the data based on what the stream actually contains. The decoder must have the same model as the encoder. At each step the current interval is divided into sub-intervals, one per symbol, proportional to the symbols' probabilities; whichever sub-interval corresponds to the symbol that is next to be encoded becomes the interval used in the next step.
When all symbols have been encoded, the resulting interval unambiguously identifies the sequence of symbols that produced it. Anyone who has the same final interval and model that is being used can reconstruct the symbol sequence that must have entered the encoder to result in that final interval. It is not necessary to transmit the final interval, however; it is only necessary to transmit one fraction that lies within it. Consider the process for decoding a message encoded with the given four-symbol model, where the message is represented by the fraction 0.538. The fraction 0.538 falls into the sub-interval for NEUTRAL, [0, 0.6), so the first symbol must have been NEUTRAL. Next, divide the interval [0, 0.6) into sub-intervals: the interval for NEUTRAL would be [0, 0.36), 60% of [0, 0.6); the interval for POSITIVE would be [0.36, 0.48), 20% of [0, 0.6); the interval for NEGATIVE would be [0.48, 0.54), 10% of [0, 0.6); and the interval for END-OF-DATA would be [0.54, 0.6), 10% of [0, 0.6). Since 0.538 is within the interval [0.48, 0.54), the second symbol of the message must have been NEGATIVE.

9.
Golomb coding
–
Golomb coding is a lossless data compression method using a family of data compression codes invented by Solomon W. Golomb in the 1960s. Rice coding denotes using a subset of the family of Golomb codes to produce a simpler prefix code; Rice used this set of codes in an adaptive coding scheme, and "Rice coding" can refer either to that adaptive scheme or to using that subset of Golomb codes. Whereas a Golomb code has a tunable parameter that can be any positive integer value, Rice codes are those in which the tunable parameter is a power of two. This makes Rice codes convenient for use on a computer, since multiplication and division by 2 can be implemented efficiently in binary arithmetic. Rice coding is used as the entropy encoding stage in a number of lossless image compression methods. Golomb coding uses a tunable parameter M to divide an input value N into two parts: q, the result of a division by M, and r, the remainder. The quotient is sent in unary coding, followed by the remainder in truncated binary encoding. When M = 1, Golomb coding is equivalent to unary coding. Golomb–Rice codes can be thought of as codes that indicate a number by the position of the bin (q) and the offset within the bin (r). The final result looks like the unary code for q followed by the code for r; note that r can be of a varying number of bits. Specifically, r is always b bits for a Rice code and switches between b−1 and b bits for a Golomb code: if 0 ≤ r < 2^b − M, then use b−1 bits to encode r; if 2^b − M ≤ r < M, then use b bits to encode r. Clearly, b = log₂(M) if M is a power of 2, and we can encode all values of r with b bits. The parameter M is a function of the corresponding Bernoulli process parameter; M is either the median of the distribution or the median ±1. The Golomb code for this distribution is equivalent to the Huffman code for the same probabilities. Golomb's scheme was designed to encode sequences of non-negative numbers, but it can handle signed values by interleaving: the sequence begins 0, −1, 1, −2, 2, −3, 3, where the nth negative value is mapped to the nth odd number, and the mth positive value is mapped to the mth even number.
This may be expressed mathematically as follows: a value x ≥ 0 is mapped to 2x, and a value x < 0 is mapped to −2x − 1. This is an optimal prefix code only if both the positive values and the magnitudes of the negative values follow a geometric distribution with the same parameter. Note that this is the Rice–Golomb encoding, where the remainder code uses simple truncated binary encoding. To encode: fix the parameter M to an integer value. If M is a power of 2, set b = log₂(M), so b bits are needed to encode every remainder r in plain binary. If M is not a power of 2, set b = ⌈log₂(M)⌉; if r < 2^b − M, code r as plain binary using b−1 bits, otherwise code the number r + 2^b − M in plain binary using b bits.
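The encoding rules above can be sketched directly; this illustrative Python (bitstrings for readability; a real coder would pack bits) implements the unary quotient, the truncated-binary remainder, and the signed-value interleaving:

```python
def golomb_encode(n, m):
    """Golomb-encode non-negative integer n with parameter m:
    unary-coded quotient, then truncated-binary remainder."""
    q, r = divmod(n, m)
    out = "1" * q + "0"                      # unary quotient, 0-terminated
    is_pow2 = m & (m - 1) == 0
    b = m.bit_length() - 1 if is_pow2 else m.bit_length()  # ceil(log2(m))
    cutoff = (1 << b) - m                    # count of shorter codewords
    if b > 0:
        if r < cutoff:                       # b-1 bits (non-power-of-2 only)
            if b > 1:
                out += format(r, "0{}b".format(b - 1))
        else:                                # b bits
            out += format(r + cutoff, "0{}b".format(b))
    return out

def zigzag(x):
    """Map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... so signed values
    can be fed to the non-negative Golomb coder."""
    return -2 * x - 1 if x < 0 else 2 * x

# With M = 4 (a power of two) this is a Rice code: remainder always 2 bits.
assert golomb_encode(6, 4) == "1010"         # q=1 -> "10", r=2 -> "10"
# With M = 3, remainders use 1 or 2 bits (truncated binary).
assert [golomb_encode(n, 3) for n in range(4)] == ["00", "010", "011", "100"]
```

With M = 1 the remainder part vanishes and the code degenerates to plain unary, matching the statement in the text.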

10.
Huffman coding
–
In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding and/or using such a code proceeds by means of Huffman coding, an algorithm developed by David A. Huffman while he was a graduate student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes". The output from Huffman's algorithm can be viewed as a code table for encoding a source symbol. The algorithm derives this table from the probability or frequency of occurrence for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are represented using fewer bits than less common symbols. Huffman's method can be efficiently implemented, finding a code in time linear to the number of input weights if these weights are sorted. However, although optimal among methods encoding symbols separately, Huffman coding is not always optimal among all compression methods; specifically, Huffman coding is optimal only if the probabilities of symbols are natural powers of 1/2. This is usually not the case, and this sub-optimality is repaired in arithmetic coding and in the recent, faster Asymmetric Numeral Systems family of entropy codings. In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code; Huffman hit upon the idea of using a frequency-sorted binary tree and proved this method the most efficient. In doing so, Huffman outdid Fano, who had worked with information theory inventor Claude Shannon to develop a similar code, by building the tree from the bottom up instead of the top down. Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code. The problem: given a set of symbols and their weights, find a prefix-free binary code with minimum expected codeword length. Alphabet A = (a₁, a₂, ..., aₙ), which is the symbol alphabet of size n.
Set W = (w₁, w₂, ..., wₙ), which is the set of the symbol weights. Code C = (c₁, c₂, ..., cₙ), which is the tuple of codewords, where cᵢ is the codeword for aᵢ, 1 ≤ i ≤ n. Goal: let L(C) = ∑ᵢ₌₁ⁿ wᵢ × length(cᵢ) be the weighted path length of code C; the condition is L(C) ≤ L(T) for any code T. Below we give an example of the result of Huffman coding for a code with five characters and given weights. We will not verify that it minimizes L over all codes, but we will compute L and compare it to the Shannon entropy H of the set of weights. For any code that is biunique, meaning that the code is uniquely decodeable, the sum of the probability budgets across all codewords is always less than or equal to one. In this example, the sum is strictly equal to one; as a result, the code is termed a complete code.
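The bottom-up tree construction can be sketched with a binary heap. The five weights below are illustrative (chosen by me, not necessarily those of the example the text alludes to), and codewords are built by prepending 0/1 as subtrees merge:

```python
# Huffman code construction: repeatedly merge the two lowest-weight
# subtrees, prepending "0" to one side's codewords and "1" to the other's.
import heapq
from itertools import count

def huffman_code(weights):
    """Return {symbol: bitstring} for the given {symbol: weight} map."""
    tiebreak = count()                       # keeps heap tuples comparable
    heap = [(w, next(tiebreak), {s: ""}) for s, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + v for s, v in c1.items()}
        merged.update({s: "1" + v for s, v in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

code = huffman_code({"a": 0.10, "b": 0.15, "c": 0.30, "d": 0.16, "e": 0.29})
# The two rarest symbols receive the longest codewords.
assert sorted(len(code[s]) for s in "abcde") == [2, 2, 2, 3, 3]
```

Because each symbol's codeword lives at a leaf of the merge tree, no codeword is a prefix of another, which is exactly the prefix-free property the problem statement requires.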

11.
Adaptive Huffman coding
–
Adaptive Huffman coding is an adaptive coding technique based on Huffman coding. It permits building the code as the symbols are being transmitted, having no initial knowledge of the source distribution. The benefit of the one-pass procedure is that the source can be encoded in real time, though it becomes more sensitive to transmission errors. There are a number of implementations of this method; the most notable are FGK and Vitter. FGK is an online coding technique based on Huffman coding. Having no initial knowledge of occurrence frequencies, it permits dynamically adjusting the Huffman tree as data are being transmitted. In an FGK Huffman tree, a special external node, called the 0-node, is used to identify a newly arriving character. That is, whenever new data is encountered, output the path to the 0-node followed by the data; for a previously seen character, just output the path of the data in the current Huffman tree. Most importantly, we have to adjust the FGK Huffman tree if necessary: as the frequency of a datum is increased, the sibling property of the Huffman tree may be broken. The adjustment is triggered for this reason, and it is accomplished by consecutive swappings of nodes, subtrees, or both. The data node is swapped with the node of the same frequency in the Huffman tree; all ancestor nodes of the node should also be processed in the same manner. Since the FGK algorithm has some drawbacks concerning the node-or-subtree swapping, Vitter proposed another algorithm to improve it. Some important terminologies and constraints: Implicit numbering: nodes are numbered in increasing order by level and from left to right; i.e., nodes at the bottom level have low implicit numbers compared to upper-level nodes, and nodes on the same level are numbered in increasing order from left to right. Invariant: for each weight w, all leaves of weight w precede all internal nodes having weight w. Blocks: nodes of the same weight. Leader: highest-numbered node in a block.
Blocks are interlinked by increasing order of their weights; a leaf block always precedes an internal block of the same weight, thus maintaining the invariant. NYT (Not Yet Transferred) is a special node used to represent symbols which are not yet transferred. Encoder and decoder start with only the root node, which has the maximum number; in the beginning, it is our initial NYT node. When we transmit a new symbol, we have to transmit the code for the NYT node, then its generic code. For every symbol that is already in the tree, we only have to transmit the code for its leaf node. For example, encoding "abb" gives 0110000100110001011.
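The FGK and Vitter procedures above maintain the tree with careful node swaps. As a deliberately naive contrast that still shows the one-pass property, the sketch below (my own illustration, not FGK or Vitter) has encoder and decoder keep identical symbol counts and rebuild a static Huffman code after every symbol, so no frequency table is sent in advance:

```python
# NOT the FGK/Vitter tree update: a naive one-pass adaptive scheme.
# Encoder and decoder start from identical flat counts and make the
# same count update after each symbol, so their codes stay in sync.
import heapq
from itertools import count

def build_code(freqs):
    """Static Huffman code for the current counts (deterministic ties)."""
    tie = count()
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + v for s, v in c1.items()}
        merged.update({s: "1" + v for s, v in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

def adaptive_encode(message, alphabet):
    freqs = {s: 1 for s in alphabet}        # flat initial model
    out = ""
    for sym in message:
        out += build_code(freqs)[sym]
        freqs[sym] += 1                     # decoder mirrors this update
    return out

def adaptive_decode(bits, n_symbols, alphabet):
    freqs = {s: 1 for s in alphabet}
    out, i = "", 0
    for _ in range(n_symbols):
        inverse = {v: s for s, v in build_code(freqs).items()}
        j = i + 1
        while bits[i:j] not in inverse:     # prefix-free: unique match
            j += 1
        sym = inverse[bits[i:j]]
        out += sym
        freqs[sym] += 1
        i = j
    return out

message, alphabet = "abracadabra", "abrcd"
bits = adaptive_encode(message, alphabet)
assert adaptive_decode(bits, len(message), alphabet) == message
```

Rebuilding the whole code per symbol is wasteful; the point of FGK and Vitter is to obtain the same adaptivity with incremental tree updates instead.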

12.
Range encoding
–
Range encoding is an entropy coding method defined by G. Nigel N. Martin in a 1979 paper, which effectively rediscovered the FIFO arithmetic code first introduced by Richard Clark Pasco in 1976. After the expiration of the first arithmetic coding patent, range encoding appeared to be clearly free of patent encumbrances, and this particularly drove interest in the technique in the open source community. Since that time, patents on various well-known arithmetic coding techniques have also expired. Range encoding conceptually encodes all the symbols of the message into one number, unlike Huffman coding, which assigns each symbol a bit-pattern and concatenates all the bit-patterns together. Each symbol of the message can be encoded in turn by narrowing the current range to the sub-range assigned to that symbol. The decoder must have the same probability estimation the encoder used, which can either be sent in advance, derived from already transferred data, or be part of the compressor and decompressor. When all symbols have been encoded, merely identifying the sub-range is enough to communicate the entire message. Suppose we want to encode the message "AABA<EOM>", where <EOM> is the end-of-message symbol. Because all five-digit integers starting with 251 fall within our final range, "251" is one of the three-digit prefixes we could transmit that would unambiguously convey our original message. In practice, however, working with the full range is not necessary, because instead of starting with a very large range and gradually narrowing it down, the encoder works with a limited range of digits at any one time. After some number of digits have been encoded, the leftmost digits will not change; in the example, after encoding just three symbols, we already knew that our final result would start with "2". More digits are shifted in on the right as digits on the left are sent off. To finish off, we may need to emit a few extra digits: the top digit of low is probably too small, so we need to increment it, but first we need to make sure the range is large enough.
One problem that can occur with the encode step above is that range might become very small while low and low+range still differ in their leading digits; the interval would then have insufficient precision to distinguish between all of the symbols in the alphabet. When this happens we need to fudge a little: output the first couple of digits even though we might be off by one. The decoder follows the same steps, so it knows when it needs to do this to keep in sync. Base 10 was used in the example, but a real implementation would just use binary. Instead of 10000 and 1000 you would likely use hexadecimal constants such as 0x1000000 and 0x10000, and instead of emitting a digit at a time you would emit a byte at a time, using a byte-shift operation instead of multiplying by 10. Decoding uses exactly the same algorithm, with the addition of keeping track of the current code value, consisting of the digits read from the compressor. Instead of emitting the top digit of low you simply throw it away, using an AppendDigit operation in place of EmitDigit. To determine which probability interval to apply, the decoder needs to look at where the current code value falls within the interval [low, low+range). For the AABA<EOM> example above, this yields a value in the range 0 to 9: values 0 through 5 represent A, 6 and 7 represent B, and 8 and 9 represent <EOM>.
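Decoding can be sketched the same way. This minimal sketch (hypothetical helper name, same 6/2/2 digit counts and five digits of precision as before, and no digit renormalization) narrows the interval exactly as the encoder does, choosing at each step the symbol whose sub-range contains the code value:

```python
PROBS = {"A": (0, 6), "B": (6, 8), "EOM": (8, 10)}

def decode(code):
    low, rng = 0, 100000
    out = []
    while not out or out[-1] != "EOM":
        for sym, (start, end) in PROBS.items():
            # does code fall inside this symbol's sub-range of [low, low+rng)?
            if low + rng * start // 10 <= code < low + rng * end // 10:
                out.append(sym)
                low += rng * start // 10
                rng = rng * (end - start) // 10
                break
    return out
```

Feeding it any code value from the final interval of the example, such as 25100, recovers AABA<EOM>.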

13.
Tunstall coding
–
In computer science and information theory, Tunstall coding is a form of entropy coding used for lossless data compression. Tunstall coding was the subject of Brian Parker Tunstall's 1967 PhD thesis, "Synthesis of noiseless compression codes"; its design is a precursor to Lempel–Ziv. Unlike variable-length codes such as Huffman and Lempel–Ziv coding, Tunstall coding maps a variable number of source symbols to a fixed number of bits. Both Tunstall codes and Lempel–Ziv codes represent variable-length words by fixed-length codes; unlike typical set encoding, Tunstall coding parses a stochastic source with codewords of variable length. It can be shown that, for a large enough dictionary, the number of bits per source letter can be arbitrarily close to H, the entropy of the source. The algorithm requires as input an alphabet U, along with a probability distribution over its letters, and an arbitrary constant C, which is an upper bound on the size of the dictionary that it will compute. The dictionary in question, D, is constructed as a tree of probabilities. The algorithm goes like this: start with D as a tree of |U| leaves, one for each letter in U; while |D| < C, convert the most probable leaf to a tree with |U| leaves. Let's imagine that we wish to encode the string "hello, world". Let's further assume that the input alphabet U contains only the nine characters that appear in "hello, world", so we can compute the probability of each character from its frequency in the input string. For instance, the letter "l" appears three times in a string of 12 characters, giving it probability 3/12. We initialize the tree, starting with a tree of |U| = 9 leaves; each word is therefore directly associated with a letter of the alphabet, and the 9 words that we thus obtain can be encoded into a fixed-size output of ⌈log2(9)⌉ = 4 bits. We then take the leaf of highest probability and convert it to yet another tree of |U| = 9 leaves, and we re-compute the probabilities of those leaves.
For instance, the sequence of two letters "ll" happens once; given that there are three occurrences of letters followed by an "l", the resulting probability is (1/3) · (3/12) = 1/12. We obtain 17 words, which can each be encoded into a fixed-size output of ⌈log2(17)⌉ = 5 bits. Note that we could iterate further, increasing the number of words by |U| − 1 = 8 every time. Tunstall coding requires the algorithm to know, prior to the parsing operation, the distribution of probabilities for each letter of the alphabet. This issue is shared with Huffman coding, and the requirement of a fixed-length block output makes it inferior to Lempel–Ziv, which has a similar dictionary-based design but a variable-sized block output.
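The dictionary construction on the "hello, world" example can be sketched as follows. This is a minimal illustration assuming a memoryless source, so a child word gets probability p(word)·p(letter); the worked example above instead re-estimates some probabilities from adjacent letters in the string. The bound C is chosen here so that exactly one expansion occurs, matching the 9-word and 17-word counts above.

```python
from math import ceil, log2
from collections import Counter

text = "hello, world"
counts = Counter(text)                      # 9 distinct symbols
U = len(counts)                             # |U| = 9
probs = {c: n / len(text) for c, n in counts.items()}

# dictionary D starts as a tree of |U| leaves, one word per letter
words = dict(probs)
assert ceil(log2(len(words))) == 4          # 9 words -> 4-bit fixed codes

C = 17   # hypothetical dictionary-size bound: allows exactly one expansion
while len(words) + (U - 1) <= C:
    best = max(words, key=words.get)        # most probable leaf ("l", p = 3/12)
    p = words.pop(best)
    for c, pc in probs.items():             # split it into |U| child leaves
        words[best + c] = p * pc
assert ceil(log2(len(words))) == 5          # 17 words -> 5-bit fixed codes
```

Each expansion removes one word and adds |U| children, growing the dictionary by |U| − 1 = 8 words per step, as noted above.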

14.
Fibonacci coding
–
In mathematics and computing, Fibonacci coding is a universal code which encodes positive integers into binary code words. It is one example of a representation of integers based on Fibonacci numbers. Each code word ends with "11" and contains no other instances of "11" before the end. The Fibonacci code word for an integer is exactly the integer's Zeckendorf representation with the order of its digits reversed. For a number N, if d(0), d(1), …, d(k−1), d(k) represent the digits of the code word representing N, then we have N = ∑_{i=0}^{k−1} d(i)·F(i+2), with d(k−1) = d(k) = 1, where F(i) is the ith Fibonacci number, so that F(i+2) runs over the distinct Fibonacci numbers 1, 2, 3, 5, 8, 13, …. The last bit d(k) is always an appended 1 and does not carry place value. It can be shown that such a coding is unique. Note that the penultimate bit is the most significant bit and the first bit is the least significant bit; note also that leading zeros cannot be omitted as they can in, e.g., decimal numbers. The first few Fibonacci codes also exhibit a so-called implied probability distribution: the distribution of values for which Fibonacci coding gives a minimum-size code. To encode an integer N: find the largest Fibonacci number equal to or less than N and subtract it from N; if the number subtracted was the ith Fibonacci number F(i), put a 1 in place i − 2 of the code word. Repeat the previous steps, substituting the remainder for N, until a remainder of 0 is reached, then place an additional 1 after the rightmost digit in the code word. To decode a code word, remove the final 1, assign the remaining bits the place values 1, 2, 3, 5, 8, 13, …, and sum the place values of the 1 bits. Fibonacci coding tolerates errors well: with most other codes, if a single bit is altered, none of the data that follows can be read correctly, whereas Fibonacci coding limits the damage to the vicinity of the error. The only stream that has no 0 in it is a stream of "11" tokens. This approach, encoding using a sequence of symbols in which some patterns are forbidden, can be freely generalized.
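The encoding and decoding steps above can be sketched as follows (a minimal illustration with hypothetical helper names; the greedy largest-Fibonacci-first subtraction produces the Zeckendorf representation):

```python
def fib_encode(n):
    # Fibonacci place values 1, 2, 3, 5, 8, 13, ... (i.e. F(2), F(3), ...)
    fibs = [1, 2]
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    if fibs[-1] > n:
        fibs.pop()
    bits = ["0"] * len(fibs)
    r = n
    for i in range(len(fibs) - 1, -1, -1):   # greedy: largest fit first
        if fibs[i] <= r:
            bits[i] = "1"
            r -= fibs[i]
    return "".join(bits) + "1"               # append the terminating 1

def fib_decode(code):
    fibs = [1, 2]
    while len(fibs) < len(code) - 1:
        fibs.append(fibs[-1] + fibs[-2])
    # drop the final 1, then sum the place values of the remaining 1 bits
    return sum(f for f, b in zip(fibs, code[:-1]) if b == "1")
```

For example, fib_encode(65) yields "0100100011", since 65 = 55 + 8 + 2.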
As an example, the number 65 is represented in Fibonacci coding as 0100100011: the first two Fibonacci numbers are not used as place values, and an additional 1 is always appended. The Fibonacci encodings for the positive integers are binary strings that end with "11". This can be generalized to binary strings that end with N consecutive 1s; for instance, for N = 3 the positive integers are encoded as 111, 0111, 00111, 10111, 000111, 100111, 010111, 110111, 0000111, 1000111, 0100111, …. In this case, the number of encodings as a function of string length is given by the sequence of Tribonacci numbers.

15.
LZ77 and LZ78
–
LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978, respectively. They are also known as LZ1 and LZ2, and these two algorithms form the basis for many variations, including LZW, LZSS, LZMA and others. Besides their academic influence, they formed the basis of several ubiquitous compression schemes, including GIF. Both are, in theoretical terms, dictionary coders. LZ77 maintains a sliding window during compression; this was later shown to be equivalent to the explicit dictionary constructed by LZ78. However, LZ78 decompression allows random access to the input as long as the entire dictionary is available, while LZ77 decompression must always start at the beginning of the input. The algorithms were named an IEEE Milestone in 2004. In the second of the two papers that introduced these algorithms, they are analyzed as encoders defined by finite-state machines. A measure analogous to information entropy is developed for individual sequences, and this measure gives a bound on the compression ratio that can be achieved. It is then shown that there exist finite lossless encoders for every sequence that achieve this bound as the length of the sequence grows to infinity; in this sense, an algorithm based on this scheme produces asymptotically optimal encodings. This result can be proved more directly, as for example in notes by Peter Shor. LZ77 algorithms achieve compression by replacing repeated occurrences of data with references to a single copy of that data existing earlier in the uncompressed data stream. To spot matches, the encoder must keep track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a sliding window. The encoder needs to keep this data to look for matches; the larger the sliding window is, the farther back the encoder may search when creating references.
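The sliding-window search can be sketched as follows. This is a simplified illustration, not a production encoder: it does a greedy longest-match scan, emits bare literals and (distance, length) pairs with no entropy coding, and the window size parameter is a small placeholder.

```python
def lz77_compress(data, window=32):
    # emit literals and (distance, length) pairs using a greedy longest match
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # matches may run past position i, giving length > distance pairs
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len > 1:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out
```

For instance, "abcabcabc" compresses to the literals "a", "b", "c" followed by the pair (3, 6): go back three characters and copy six, an overlapping match of the kind discussed below.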
It is not only acceptable but frequently useful to allow length-distance pairs to specify a length that actually exceeds the distance. As a copy command, this is puzzling: "go back four characters and copy ten characters from that position". How can ten characters be copied over when only four of them are actually in the buffer? Tackling one byte at a time, there is no problem serving this request, because as each byte is copied over, it may itself be fed back as input to the copy command. When the copy-from position makes it to the initial destination position, it is consequently fed data that was pasted from the beginning of the copy. The operation is thus equivalent to the statement "copy the data you were given and repetitively paste it until it fits". As this type of pair repeats a single copy of data multiple times, it can be used to incorporate a flexible and easy form of run-length encoding: L characters have been matched in total, with L > D, and the pair encodes a repeated run. When the first LR characters are read to the output, this corresponds to a single run unit appended to the output buffer. The pseudocode is a reproduction of the LZ77 compression algorithm's sliding window.
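The byte-at-a-time copy described above can be sketched directly (a minimal illustration with a hypothetical lz77_copy helper):

```python
def lz77_copy(out, distance, length):
    # copy byte by byte, so the source region may overlap the destination:
    # bytes appended earlier in this very copy become inputs to later steps
    start = len(out) - distance
    for i in range(length):
        out.append(out[start + i])

buf = list(b"abcd")
lz77_copy(buf, 4, 10)   # "go back four characters, copy ten"
# buf now holds b"abcdabcdabcdab": the 4-byte unit pasted repeatedly until it fits
```

Because each appended byte is immediately available as a source for the next step, the four-byte unit is effectively repeated until all ten requested bytes have been produced.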