In computer science and information theory, Tunstall coding is a form of entropy coding used for lossless data compression. Tunstall coding was the subject of Brian Parker Tunstall's 1967 PhD thesis at the Georgia Institute of Technology, titled "Synthesis of noiseless compression codes". Its design is a precursor to Lempel–Ziv coding. Unlike variable-length codes such as Huffman and Lempel–Ziv coding, Tunstall coding maps source symbols to a fixed number of bits. Both Tunstall codes and Lempel–Ziv codes represent variable-length words by fixed-length codes. Unlike typical set encoding, Tunstall coding parses a stochastic source into codewords of variable length; it can be shown that, for a large enough dictionary, the number of bits per source letter can be arbitrarily close to H, the entropy of the source. The algorithm takes as input an alphabet U together with the probability distribution of its letters, and an arbitrary constant C, which is an upper bound on the size of the dictionary it will compute.
The dictionary in question, D, is constructed as a tree of probabilities, in which each edge is associated with a letter from the input alphabet. The algorithm proceeds as follows: initialize D as a tree of |U| leaves, one for each letter in U; then, while |D| < C, convert the most probable leaf into a subtree with |U| leaves. As an example, suppose we wish to encode the string "hello, world". Assume further that the input alphabet U contains only the characters occurring in "hello, world", that is, 'h', 'e', 'l', 'o', ',', ' ', 'w', 'r', 'd'. We can therefore compute the probability of each character from its frequency in the input string; for instance, the letter 'l' appears three times in a string of 12 characters, so its probability is 3/12. We initialize the tree with |U| = 9 leaves, so that each word is directly associated with a letter of the alphabet. The 9 words thus obtained can be encoded into a fixed-size output of ⌈log2(9)⌉ = 4 bits. We then take the leaf of highest probability and convert it into another subtree of |U| = 9 leaves, one for each character.
We then recompute the probabilities of those leaves. For instance, the two-letter sequence 'll' occurs once; given that there are three occurrences of a letter followed by an 'l', the resulting probability is 1/3 · 3/12 = 1/12. This yields a dictionary of 9 + 8 = 17 words. Note that we could iterate further, increasing the number of words by |U| − 1 = 8 each time. Tunstall coding requires the algorithm to know, prior to the parsing operation, the probability distribution of the letters of the alphabet; this issue is shared with Huffman coding. Its fixed-length block output makes it less flexible than Lempel–Ziv coding, which has a similar dictionary-based design but a variable-sized block output.
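A minimal sketch of this dictionary construction (the heap-based approach and the function name tunstall_dictionary are my own illustration, not Tunstall's original formulation):

import heapq
from math import ceil, log2

def tunstall_dictionary(probs, max_size):
    # Start with one single-letter word per alphabet symbol, then repeatedly
    # expand the most probable word into |U| longer words while the dictionary
    # stays within max_size. A max-heap (negated probabilities) tracks the leaves.
    heap = [(-p, sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) + len(probs) - 1 <= max_size:
        neg_p, word = heapq.heappop(heap)          # most probable leaf
        for sym, p in probs.items():               # convert it into |U| new leaves
            heapq.heappush(heap, (neg_p * p, word + sym))
    return sorted(word for _, word in heap)

text = "hello, world"
probs = {c: text.count(c) / len(text) for c in set(text)}   # 9 distinct characters

words = tunstall_dictionary(probs, max_size=17)
print(len(words), "words, each encoded in", ceil(log2(len(words))), "bits")

With max_size = 17 this reproduces the single expansion step of the example; raising max_size keeps adding |U| − 1 = 8 words per step.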
JPEG is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality. JPEG compression is used in a number of image file formats. JPEG/Exif is the most common image format used by digital cameras and other photographic image capture devices; these format variations are often not distinguished and are simply called JPEG. The term "JPEG" is an initialism/acronym for the Joint Photographic Experts Group, which created the standard. The MIME media type for JPEG is image/jpeg, except in older Internet Explorer versions, which provide a MIME type of image/pjpeg when uploading JPEG images. JPEG files usually have a filename extension of .jpg or .jpeg. JPEG/JFIF supports a maximum image size of 65,535×65,535 pixels, hence up to 4 gigapixels for an aspect ratio of 1:1. "JPEG" stands for Joint Photographic Experts Group, the name of the committee that created the JPEG standard and other still picture coding standards.
The "Joint" stood for ISO TC97 WG8 and CCITT SGVIII. In 1987, ISO TC 97 became ISO/IEC JTC1 and, in 1992, CCITT became ITU-T. On the JTC1 side, JPEG is one of two sub-groups of ISO/IEC Joint Technical Committee 1, Subcommittee 29, Working Group 1 – titled as Coding of still pictures. On the ITU-T side, ITU-T SG16 is the respective body; the original JPEG Group was organized in 1986, issuing the first JPEG standard in 1992, approved in September 1992 as ITU-T Recommendation T.81 and, in 1994, as ISO/IEC 10918-1. The JPEG standard specifies the codec, which defines how an image is compressed into a stream of bytes and decompressed back into an image, but not the file format used to contain that stream; the Exif and JFIF standards define the used file formats for interchange of JPEG-compressed images. JPEG standards are formally named as Information technology – Digital compression and coding of continuous-tone still images. ISO/IEC 10918 consists of the following parts: Ecma International TR/98 specifies the JPEG File Interchange Format.
The JPEG compression algorithm operates at its best on photographs and paintings of realistic scenes with smooth variations of tone and color. For web usage, where reducing the amount of data used for an image is important for responsive presentation, JPEG's compression benefits make it popular. JPEG/Exif is also the most common format saved by digital cameras. However, JPEG is not well suited for line drawings and other textual or iconic graphics, where the sharp contrasts between adjacent pixels can cause noticeable artifacts; such images are better saved in a lossless graphics format such as TIFF, GIF, PNG, or a raw image format. The JPEG standard includes a lossless coding mode, but since the typical use of JPEG is as a lossy compression method, which reduces image fidelity, it is inappropriate for exact reproduction of imaging data. JPEG is also not well suited to files that will undergo multiple edits, as some image quality is lost each time the image is recompressed, particularly if the image is cropped or shifted or if encoding parameters are changed – see digital generation loss for details.
To prevent image information loss during sequential and repetitive editing, the first edit can be saved in a lossless format, subsequently edited in that format, and finally published as JPEG for distribution. JPEG uses a lossy form of compression based on the discrete cosine transform (DCT); this mathematical operation converts each frame/field of the video source from the spatial domain into the frequency domain. A perceptual model based loosely on the human psychovisual system discards high-frequency information, i.e. sharp transitions in intensity and color hue. In the transform domain, the process of reducing information is called quantization. In simpler terms, quantization is a method for optimally reducing a large number scale into a smaller one, and the transform domain is a convenient representation of the image because the high-frequency coefficients, which contribute less to the overall picture than other coefficients, are characteristically small values with high compressibility. The quantized coefficients are then sequenced and losslessly packed into the output bitstream.
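As a rough illustration of the transform-and-quantize step (a sketch only: the 8×8 block is random, and the quantization table shown is the commonly cited example luminance table, not a mandated one):

import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix; JPEG applies this transform to 8x8 blocks.
    k = np.arange(n)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

# Example luminance quantization table: larger steps for higher frequencies,
# which is what discards most of the high-frequency detail.
Q = np.array([[16, 11, 10, 16, 24, 40, 51, 61],
              [12, 12, 14, 19, 26, 58, 60, 55],
              [14, 13, 16, 24, 40, 57, 69, 56],
              [14, 17, 22, 29, 51, 87, 80, 62],
              [18, 22, 37, 56, 68, 109, 103, 77],
              [24, 35, 55, 64, 81, 104, 113, 92],
              [49, 64, 78, 87, 103, 121, 120, 101],
              [72, 92, 95, 98, 112, 100, 103, 99]])

block = np.random.randint(0, 256, (8, 8)).astype(float) - 128   # level-shifted pixel block
D = dct_matrix()
coeffs = D @ block @ D.T                 # spatial domain -> frequency domain
quantized = np.round(coeffs / Q)         # most high-frequency coefficients become 0
print(quantized.astype(int))

Decompression reverses the steps: multiply the quantized coefficients by Q and apply the inverse transform D.T @ coeffs @ D to recover an approximation of the block.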
Nearly all software implementations of JPEG permit user control over the compression ratio, allowing the user to trade off picture quality for smaller file size. In embedded applications, the parameters are fixed for the application. The compression method is lossy, meaning that some original image information is lost and cannot be restored, possibly affecting image quality. There is an optional lossless mode defined in the JPEG standard; however, this mode is not widely supported in products. There is also an interlaced progressive JPEG format, in which data is compressed in multiple passes of progressively higher detail; this is ideal for large images that will be displayed while downloading over a slow connection, allowing a reasonable preview after receiving only a portion of the data. However, support for progressive JPEGs is not universal; when progressive JPEGs are received by programs that do not support them (such
Range encoding is an entropy coding method defined by G. Nigel N. Martin in a 1979 paper, which rediscovered the FIFO arithmetic code first introduced by Richard Clark Pasco in 1976. Given a stream of symbols and their probabilities, a range coder produces a space-efficient stream of bits to represent these symbols and, given the stream and the probabilities, a range decoder reverses the process. Range coding is similar to arithmetic encoding, except that encoding is done with digits in any base, instead of with bits, so it is faster when using larger bases at a small cost in compression efficiency. After the expiration of the first arithmetic coding patent, range encoding appeared to be free of patent encumbrances; this drove interest in the technique in the open source community. Since that time, patents on various well-known arithmetic coding techniques have expired. Range encoding conceptually encodes all the symbols of the message into one number, unlike Huffman coding, which assigns each symbol a bit-pattern and concatenates all the bit-patterns together.
Thus range encoding can achieve greater compression ratios than the one-bit-per-symbol lower bound on Huffman encoding, and it does not suffer the inefficiencies that Huffman does when dealing with probabilities that are not exact powers of two. The central concept behind range encoding is this: given a large enough range of integers and a probability estimation for the symbols, the initial range can be divided into sub-ranges whose sizes are proportional to the probability of the symbol they represent; each symbol of the message can then be encoded in turn, by reducing the current range down to just that sub-range which corresponds to the next symbol to be encoded. The decoder must have the same probability estimation the encoder used, which can either be sent in advance, derived from data already transferred, or be built into the compressor and decompressor. When all symbols have been encoded, merely identifying the final sub-range is enough to communicate the entire message. A single integer within that sub-range is sufficient to identify it, and it may not even be necessary to transmit the entire integer.
Suppose we want to encode the message "AABA<EOM>", where <EOM> is the end-of-message symbol. For this example it is assumed that the decoder knows that we intend to encode exactly five symbols using the base 10 number system with five digits of precision (so the initial range is [0, 100000)), and that the probability distribution is A: 0.60, B: 0.20, <EOM>: 0.20. The encoder breaks down the range [0, 100000) into three subranges:

A: [0, 60000)
B: [60000, 80000)
<EOM>: [80000, 100000)

Since our first symbol is an A, it reduces our initial range down to [0, 60000). The second symbol choice leaves us with three sub-ranges of this range, which we show following the already-encoded 'A':

AA: [0, 36000)
AB: [36000, 48000)
A<EOM>: [48000, 60000)

With two symbols encoded, our range is now [0, 36000) and our third symbol leads to the following choices:

AAA: [0, 21600)
AAB: [21600, 28800)
AA<EOM>: [28800, 36000)

This time it is the second of our three choices that represents the message we want to encode, and our range becomes [21600, 28800). It may look harder to determine our sub-ranges in this case, but it is not: we can subtract the lower bound from the upper bound to determine that there are 7200 numbers in our range; the first 60% of them (4320 numbers) represent A, the next 20% (1440 numbers) represent B, and the remaining 20% represent <EOM>.
Adding back the lower bound gives us our ranges:

AABA: [21600, 25920)
AABB: [25920, 27360)
AAB<EOM>: [27360, 28800)

Finally, with our range narrowed down to [21600, 25920), we have just one more symbol to encode. Using the same technique as before for dividing up the range between the lower and upper bound, we find that the three sub-ranges are:

AABAA: [21600, 24192)
AABAB: [24192, 25056)
AABA<EOM>: [25056, 25920)

Since <EOM> is our final symbol, our final range is [25056, 25920). Because all five-digit integers starting with "251" fall within our final range, "251" is one of the three-digit prefixes we could transmit that would unambiguously convey our original message. The central problem may appear to be selecting an initial range large enough that, no matter how many symbols we have to encode, we will always have a current range large enough to divide into non-zero sub-ranges. In practice, however, this is not a problem, because instead of starting with a very large range and gradually narrowing it down, the encoder works with a smaller range of numbers at any given time.
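The successive sub-ranges above can be reproduced with a few lines of code; this sketch uses integer frequencies 60/20/20 out of 100 so the arithmetic is exact:

freqs = [("A", 60), ("B", 20), ("<EOM>", 20)]   # probabilities 0.60 / 0.20 / 0.20
total = 100
message = ["A", "A", "B", "A", "<EOM>"]

low, size = 0, 100000
for symbol in message:
    cumulative = 0
    for name, f in freqs:
        if name == symbol:
            low += size * cumulative // total    # move to the chosen sub-range
            size = size * f // total
            break
        cumulative += f
    print(f"after '{symbol}': [{low}, {low + size})")

The final line prints [25056, 25920), matching the range derived above; any integer in it identifies the message.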
After some number of digits have been encoded, the leftmost digits of the result will not change any further. In the example above, after encoding just three symbols we already knew that our final result would start with "2". Such digits can be emitted as soon as they are fixed, and more digits are then shifted in on the right; this is illustrated in the sketch below. To finish off, we may need to emit a few extra digits: the top digit of low is probably too small, so we need to increment it, but we have to make sure we don't increment it past low + range, so first we need to make sure that range is large enough. One problem that can occur with such an encode loop is that range might become very small while low and low + range still have differing leading digits.
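The article's original code listing was not preserved here; the following is a minimal decimal-digit sketch of the idea (my own illustration, reusing the integer frequencies 60/20/20 from the example). It emits the leading digit of low whenever the range drops below one digit's worth of precision, and deliberately ignores the carry problem just described:

TOP = 100000   # five decimal digits of working precision (an arbitrary choice here)
freqs = [("A", 60), ("B", 20), ("<EOM>", 20)]
total = 100

def range_encode(message):
    low, rng, out = 0, TOP, ""
    for symbol in message:
        cumulative = 0
        for name, f in freqs:
            if name == symbol:
                low += rng * cumulative // total   # narrow to the symbol's sub-range
                rng = rng * f // total
                break
            cumulative += f
        while rng <= TOP // 10:                    # range is down to one digit's precision
            out += str(low * 10 // TOP)            # emit the top digit of low (carry ignored)
            low = low * 10 % TOP                   # shift low left, dropping that digit
            rng *= 10                              # widen the working range again
    return out + str(low).zfill(len(str(TOP)) - 1) # flush the remaining digits

print(range_encode(["A", "A", "B", "A", "<EOM>"]))  # prints 2505600

The output 2505600 corresponds to the value 25056, which lies inside the final range [25056, 25920) of the full-precision example.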
Asymmetric numeral systems
Asymmetric numeral systems (ANS) is a family of entropy encoding methods introduced by Jarosław Duda from Jagiellonian University, used in data compression since 2014 due to improved performance compared to previously used methods, being up to 30 times faster. ANS combines the compression ratio of arithmetic coding with a processing cost similar to that of Huffman coding. In the tabled ANS variant, this is achieved by constructing a finite-state machine to operate on a large alphabet without using multiplication. Among others, ANS is used in the Facebook Zstandard compressor, the Apple LZFSE compressor, the Google Draco 3D compressor and PIK image compressor, the CRAM DNA compressor from the SAMtools utilities, and the Dropbox DivANS compressor, and it has been considered for the AV1 open video coding format from the Alliance for Open Media. The basic idea is to encode information into a single natural number x. In the standard binary number system, we can add a bit s ∈ {0, 1} of information to x by appending s at the end of x, which gives us x′ = 2x + s.
For an entropy coder, this is optimal if Pr(0) = Pr(1) = 1/2. ANS generalizes this process for an arbitrary set of symbols s ∈ S with an accompanying probability distribution (p_s). In ANS, if x′ is the result of appending the information from s to x, then x′ ≈ x · (1/p_s). Equivalently, log2(x′) ≈ log2(x) + log2(1/p_s), where log2(x) is the number of bits of information stored in the number x and log2(1/p_s) is the number of bits contained in the symbol s. For the encoding rule, the set of natural numbers is split into disjoint subsets corresponding to the different symbols – like into even and odd numbers in the binary case, but with densities corresponding to the probability distribution of the symbols to encode. To add the information from symbol s to the information already stored in the current number x, we go to the number x′ = C(s, x) ≈ x/p_s, which is the position of the x-th appearance in the s-th subset. There are alternative ways to apply this in practice – direct mathematical formulas for the encoding and decoding steps, or the entire behavior can be put into a table.
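A hedged sketch of the encoding rule x′ = C(s, x) ≈ x/p_s in its range-variant (rANS) form, using integer frequencies f_s summing to M so that p_s = f_s/M; the two-symbol alphabet and the function names are illustrative only, and renormalization is omitted:

freqs = {"a": 3, "b": 1}              # integer frequencies f_s; p_a = 3/4, p_b = 1/4
M = sum(freqs.values())
starts, c = {}, 0
for s, f in freqs.items():            # each symbol owns a sub-range of [0, M)
    starts[s] = c
    c += f

def encode_symbol(x, s):
    f = freqs[s]
    return (x // f) * M + starts[s] + (x % f)    # x' is roughly x * M / f_s = x / p_s

def decode_symbol(x):
    slot = x % M                                  # which sub-range of [0, M) we landed in
    for s, f in freqs.items():
        if starts[s] <= slot < starts[s] + f:
            return s, f * (x // M) + slot - starts[s]   # recover the previous state

x = 1
for s in "abba":                       # encode; the frequent 'a' grows the state slowly
    x = encode_symbol(x, s)
print("state:", x)                     # 41

decoded = []
for _ in range(4):                     # decode; ANS is last-in, first-out
    s, x = decode_symbol(x)
    decoded.append(s)
print("".join(reversed(decoded)))      # abba

In the tabled variant, these encode and decode steps are precomputed into lookup tables over a renormalized state range, which is what removes the multiplications and divisions.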
Renormalization is used to prevent x from growing to infinity – transferring accumulated bits to or from the bitstream. To see why entropy coding pays off, imagine you want to encode a sequence of 1,000 zeros and ones, which would take 1000 bits to store directly. However, if it is somehow known that the sequence contains only 1 zero and 999 ones, it would be sufficient to encode the zero's position, which requires only ⌈log2(1000)⌉ = 10 bits here instead of the original 1000 bits. More generally, length-n sequences containing pn zeros and (1 − p)n ones, for some probability p ∈ (0, 1), are called combinations. Using Stirling's approximation, their asymptotic number is ≈ 2^(n·h(p)) for large n, where h(p) = −p·log2(p) − (1 − p)·log2(1 − p) is the Shannon entropy. Hence, to choose one such sequence we need approximately n·h(p) bits.
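A quick numeric check of these counts (using Python's exact binomial) against the Shannon-entropy estimate n·h(p):

from math import comb, log2

n, k = 1000, 1
print(log2(comb(n, k)))          # ~9.97 bits: matches ceil(log2(1000)) = 10

p = 0.3
k = int(p * n)
h = -p * log2(p) - (1 - p) * log2(1 - p)
print(log2(comb(n, k)), n * h)   # roughly 876 vs 881: n*h(p) is close for large n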
In information theory, Shannon–Fano–Elias coding is a precursor to arithmetic coding, in which probabilities are used to determine codewords. Given a discrete random variable X of ordered values to be encoded, let p(x) be the probability for any x in X. Define

F̄(x) = Σ_{x_i < x} p(x_i) + (1/2) p(x)

Algorithm: for each x in X, let Z be the binary expansion of F̄(x); choose the length of the encoding of x, L(x), to be the integer ⌈log2(1/p(x))⌉ + 1; and choose the encoding of x, code(x), to be the first L(x) most significant bits after the decimal point of Z.

Example: let X = {A, B, C, D}, with probabilities p = {1/3, 1/4, 1/6, 1/4}.

For A: F̄(A) = (1/2) p(A) = (1/2)(1/3) = 0.1666...; in binary, Z = 0.0010101010...; L(A) = ⌈log2(1/(1/3))⌉ + 1 = 3; the code is 001.
For B: F̄(B) = p(A) + (1/2) p(B) = 1/3 + (1/2)(1/4) = 0.4583333...; in binary, Z = 0.01110101010101...; L(B) = ⌈log2(1/(1/4))⌉ + 1 = 3; the code is 011.
For C: F̄(C) = p(A) + p(B) + (1/2) p(C) = 1/3 + 1/4 + (1/2)(1/6) = 0.66666...; in binary, Z = 0.101010101010...; L(C) = ⌈log2(1/(1/6))⌉ + 1 = 4; the code is 1010.
For D: F̄(D) = p(A) + p(B) + p(C) + (1/2) p(D) = 1/3 + 1/4 + 1/6 + (1/2)(1/4) = 0.875; in binary, Z = 0.111; L(D) = ⌈log2(1/(1/4))⌉ + 1 = 3; the code is 111.

Shannon–Fano–Elias coding produces a binary prefix code, allowing for direct decoding.
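A small sketch (my own illustration, not part of the original text) that reproduces the codes of this example using exact rational arithmetic:

from fractions import Fraction
from math import ceil, log2

def sfe_codes(probs):
    # probs: list of (symbol, probability) in the fixed symbol order.
    codes, cumulative = {}, Fraction(0)
    for symbol, p in probs:
        fbar = cumulative + p / 2              # F-bar(x): earlier probabilities plus p(x)/2
        length = ceil(log2(1 / p)) + 1         # L(x) = ceil(log2(1/p(x))) + 1
        bits, frac = "", fbar
        for _ in range(length):                # first L(x) bits of the binary expansion
            frac *= 2
            bits += "1" if frac >= 1 else "0"
            frac -= int(frac)
        codes[symbol] = bits
        cumulative += p
    return codes

probs = [("A", Fraction(1, 3)), ("B", Fraction(1, 4)),
         ("C", Fraction(1, 6)), ("D", Fraction(1, 4))]
print(sfe_codes(probs))   # {'A': '001', 'B': '011', 'C': '1010', 'D': '111'}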
Let bcode(x) be the rational number formed by adding a decimal point before a binary code; for example, if code(x) = 1010 then bcode(x) = 0.1010 in binary. If, for all x, no y exists such that bcode(x) ≤ bcode(y) < bcode(x) + 2^(−L(x)), then all the codes form a prefix code. By comparing F̄ to the cumulative distribution function of X, this property may be demonstrated graphically for Shannon–Fano–Elias coding. By the definition of L it follows that 2^(−L(x)) ≤ (1/2) p(x). And because the bits after position L(x) are truncated from F̄(x) to form code(x), it follows that F̄(x) − bcode(x) < 2^(−L(x)).
International Standard Serial Number
An International Standard Serial Number (ISSN) is an eight-digit serial number used to uniquely identify a serial publication, such as a magazine. The ISSN is helpful in distinguishing between serials with the same title. ISSNs are used in ordering, interlibrary loans, and other practices in connection with serial literature. The ISSN system was first drafted as an International Organization for Standardization (ISO) international standard in 1971 and published as ISO 3297 in 1975. ISO subcommittee TC 46/SC 9 is responsible for maintaining the standard. When a serial with the same content is published in more than one media type, a different ISSN is assigned to each media type. For example, many serials are published both in print and in electronic media; the ISSN system refers to these types as print ISSN and electronic ISSN, respectively. Conversely, as defined in ISO 3297:2007, every serial in the ISSN system is also assigned a linking ISSN (ISSN-L), typically the same as the ISSN assigned to the serial in its first published medium, which links together all ISSNs assigned to the serial in every medium.
The format of the ISSN is an eight-digit code, divided by a hyphen into two four-digit numbers. As an integer number, it can be represented by the first seven digits; the last code digit, which may be 0–9 or an X, is a check digit. Formally, the general form of the ISSN code can be expressed as NNNN-NNNC, where N is a digit character and C is a digit character or X. The ISSN of the journal Hearing Research, for example, is 0378-5955, where the final 5 is the check digit, that is, C = 5. To calculate the check digit, the following algorithm may be used: calculate the sum of the first seven digits of the ISSN multiplied by their position in the number, counting from the right, that is, by 8, 7, 6, 5, 4, 3, 2, respectively: 0·8 + 3·7 + 7·6 + 8·5 + 5·4 + 9·3 + 5·2 = 0 + 21 + 42 + 40 + 20 + 27 + 10 = 160. The modulus 11 of this sum is then calculated: 160 mod 11 = 6. If the result is 0, the check digit is 0; otherwise it is subtracted from 11 to give the check digit: 11 − 6 = 5, matching the final digit of the ISSN. For calculations, an upper case X in the check digit position indicates a check digit of 10. To confirm the check digit, calculate the sum of all eight digits of the ISSN multiplied by their position in the number, counting from the right.
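A short sketch of the check-digit computation and verification just described, using the Hearing Research ISSN as the test value (the function names are illustrative):

def issn_check_digit(first7):
    # Weighted sum of the first seven digits with weights 8..2, then mod 11.
    total = sum(int(d) * w for d, w in zip(first7, range(8, 1, -1)))
    remainder = total % 11
    if remainder == 0:
        return "0"
    return "X" if remainder == 1 else str(11 - remainder)

def issn_is_valid(issn):
    # The weighted sum of all eight digits (weights 8..1) must be 0 modulo 11.
    values = [10 if d == "X" else int(d) for d in issn.replace("-", "")]
    return sum(v * w for v, w in zip(values, range(8, 0, -1))) % 11 == 0

print(issn_check_digit("0378595"))   # 5
print(issn_is_valid("0378-5955"))    # True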
The modulus 11 of this sum must be 0. There is an online ISSN checker. ISSN codes are assigned by a network of ISSN National Centres located at national libraries and coordinated by the ISSN International Centre based in Paris. The International Centre is an intergovernmental organization created in 1974 through an agreement between UNESCO and the French government. The International Centre maintains a database of all ISSNs assigned worldwide, the ISDS Register, otherwise known as the ISSN Register. At the end of 2016, the ISSN Register contained records for 1,943,572 items. ISSN and ISBN codes are similar in concept. An ISBN might be assigned for particular issues of a serial, in addition to the ISSN code for the serial as a whole. An ISSN, unlike the ISBN code, is an anonymous identifier associated with a serial title, containing no information as to the publisher or its location. For this reason a new ISSN is assigned to a serial each time it undergoes a major title change. Since the ISSN applies to an entire serial, a new identifier, the Serial Item and Contribution Identifier, was built on top of it to allow references to specific volumes, articles, or other identifiable components.
Separate ISSNs are needed for serials in different media. Thus, the print and electronic media versions of a serial need separate ISSNs, and a CD-ROM version and a web version of a serial require different ISSNs, since two different media are involved. However, the same ISSN can be used for different file formats of the same online serial. This "media-oriented identification" of serials made sense in the 1970s. In the 1990s and onward, with personal computers, better screens, and the Web, it makes sense to consider only content, independent of media. This "content-oriented identification" of serials was a repressed demand for a decade, but no ISSN update or initiative occurred. A natural extension of the ISSN, the unique identification of the articles in serials, was the main demanded application. An alternative model for serials' contents arrived with the indecs Content Model and its application, the digital object identifier, as an ISSN-independent initiative, consolidated in the 2000s. Only in 2007 was ISSN-L defined in the
Adaptive Huffman coding
Adaptive Huffman coding is an adaptive coding technique based on Huffman coding. It permits building the code as the symbols are being transmitted, having no initial knowledge of the source distribution, which allows one-pass encoding and adaptation to changing conditions in the data. The benefit of a one-pass procedure is that the source can be encoded in real time, though it becomes more sensitive to transmission errors, since just a single loss ruins the whole code. There are a number of implementations of this method, the most notable being the FGK algorithm and the Vitter algorithm. The FGK algorithm is an online coding technique based on Huffman coding: having no initial knowledge of occurrence frequencies, it permits dynamically adjusting the Huffman tree as data are being transmitted. In an FGK Huffman tree, a special external node, called the 0-node, is used to identify a newly arriving character; that is, whenever new data is encountered, output the path to the 0-node followed by the data. For a previously seen character, just output the path of the data in the current Huffman tree.
We may then have to adjust the FGK Huffman tree and update the frequencies of the related nodes. As the frequency of a datum is increased, the sibling property of the Huffman tree may be broken; the adjustment is triggered for this reason. It is accomplished by consecutive swappings of nodes, subtrees, or both: the data node is swapped with the highest-ordered node of the same frequency in the Huffman tree, and all ancestor nodes of that node are processed in the same manner. Since the FGK algorithm has some drawbacks concerning this node-or-subtree swapping, Vitter proposed another algorithm to improve it. Some important terminology and constraints follow. Implicit numbering: nodes are numbered in increasing order by level and from left to right; that is, nodes at the bottom level have a lower implicit number than upper-level nodes, and nodes on the same level are numbered in increasing order from left to right. Invariant: for each weight w, all leaves of weight w precede all internal nodes having weight w. Blocks: nodes of the same weight and the same type (leaf or internal) form a block.
Leader: the highest numbered node in a block. Blocks are interlinked by increasing order of their weights. A leaf block always precedes an internal block of the same weight, thus maintaining the invariant. NYT (not yet transferred) is a special node used to represent symbols which are not yet transferred. The update procedure for a newly read symbol proceeds as follows:

    leaf_to_increment := NULL
    p := pointer to the leaf node containing the next symbol
    if p is the NYT node then
        extend p by adding two children;
        the left child becomes the new NYT node and the right child is the new symbol leaf node
        p := parent of the new symbol leaf node
        leaf_to_increment := right child of p
    else
        swap p with the leader of its block
        if the new p is the sibling of the NYT node then
            leaf_to_increment := p
            p := parent of p
    while p is not NULL do
        Slide_And_Increment(p)
    if leaf_to_increment is not NULL then
        Slide_And_Increment(leaf_to_increment)

    function Slide_And_Increment(p):
        previous_p := parent of p
        if p is an internal node then
            slide p in the tree higher than the leaf nodes of weight wt + 1
            increase the weight of p by 1
            p := previous_p
        else
            slide p in the tree higher than the internal nodes of weight wt
            increase the weight of p by 1
            p := new parent of p

Encoder and decoder start with only the root node, which has the maximum number.
At the beginning it is our initial NYT node. When we transmit a symbol that is not yet in the tree, we have to transmit the code for the NYT node followed by the symbol's generic (fixed) code. For every symbol already in the tree, we only have to transmit the code for its leaf node. Encoding "abb" gives 01100001 001100010 11. Step 1: Start with an empty tree. For "a", transmit its binary code. Step 2: NYT spawns two child nodes, 254 and 255, both with weight 0. Increase the weight for the root and 255. The code for "a", associated with node 255, is 1. For "b", transmit 0 (the code for the NYT node) followed by its binary code. Step 3: NYT spawns two child nodes, 252 for NYT and 253 for the new leaf node, both with weight 0. Increase the weights for 253, 254, and the root. To maintain Vitter's invariant that all leaves of weight w precede all internal nodes of weight w, the branch starting with node 254 should be swapped with node 255. The code for "b" is 11. For the second "b", transmit 11. For the convenience of explanation this step doesn't follow Vitter's algorithm, but the effects are equivalent. Step 4: Go to leaf node 253. Notice we have two blocks with weight 1.
Nodes 253 and 254 form one block; node 255 is in another block. For node 253, the highest number in its block is 254, so swap the weights and symbols of nodes 253 and 254. Now node 254 and the branch starting from node 255 satisfy the Slide_And_Increment condition and hence must be swapped. Finally, increase the weights of nodes 255 and 256. The future code for "b" is 1, and the code for "a" is now 01, which reflects their frequencies. References: J. S. Vitter, "Design and Analysis of Dynamic Huffman Codes", Journal of the ACM, 34, October 1987, pp. 825–845. J. S. Vitter, "ALGORITHM 673: Dynamic Huffman Coding", ACM Transactions on Mathematical Software, 15, June 1989, pp. 158–167 (also appears in Collected Algorithms of the ACM). Donald E. Knuth, "Dynamic Huffman Coding", Journal of Algorithms, 6, 1985, pp. 163–180. This article incorporates public domain material from the NIST document: Black, Paul E., "adaptive Huffman coding", Dictionary of Algorithms and Data Structures.