1.
Plane (Unicode)
–
In the Unicode standard, a plane is a continuous group of 65,536 code points. There are 17 planes, identified by the numbers 0 to 16. Plane 0 is the Basic Multilingual Plane (BMP), which contains most commonly used characters. The higher planes 1 through 16 are called supplementary planes or, humorously, astral planes. As of Unicode version 9.0, six of the planes have assigned code points and four are named. The limit of 17 planes is due to the design of UTF-16, which can encode the BMP and 16 supplementary planes, up to a maximum value of 0x10FFFF. The encoding scheme originally used by UTF-8 was designed with a larger limit of 2^31 code points, but since Unicode restricts code points to the 17 planes that UTF-16 can encode, values above 0x10FFFF are invalid in UTF-8. The 17 planes can accommodate 1,114,112 code points; of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 available for public assignment. Planes are further subdivided into Unicode blocks, which, unlike planes, do not have a fixed size. The 273 blocks defined in Unicode 9.0 cover 24% of the possible code point space and range in size from a minimum of 16 code points to a maximum of 65,536 code points. Ranges of code points have been mapped out for most known current and ancient writing systems for future use. The first plane, Plane 0, the Basic Multilingual Plane, contains characters for almost all modern languages. A primary objective of the BMP is to support the unification of prior character sets as well as characters for writing systems in current use; most of the assigned code points in the BMP are used to encode Chinese, Japanese, and Korean (CJK) characters. The High Surrogate and Low Surrogate code ranges are reserved for encoding supplementary characters in UTF-16, which uses a pair of 16-bit code units: one high surrogate followed by one low surrogate. A single surrogate code point will never be assigned a character. 65,408 of the 65,536 code points in this plane have been allocated to a Unicode block, leaving just 128 code points in unallocated ranges. As of Unicode 9.0, the BMP comprises 161 blocks. Plane 1, the Supplementary Multilingual Plane, contains historic scripts such as Linear B, Egyptian hieroglyphs, and cuneiform, as well as reformed orthographies like Shavian and Deseret. Its symbols and notations include historic and modern musical notation, mathematical alphanumerics, emoji and other symbol sets, and game symbols for playing cards and Mah Jongg. Plane 3 is tentatively named the Tertiary Ideographic Plane, but as of version 9.0 no characters are assigned to it; it is reserved for Oracle Bone script, Bronze script, Small Seal script, additional CJK unified ideographs, and other historic ideographic scripts. It is not anticipated that all of these planes will be used in the foreseeable future, although the number of possible symbol characters that could arise outside the context of writing systems is potentially huge. At present, 11 of the 17 planes are unused. Plane 14, the Supplementary Special-purpose Plane, currently contains non-graphical characters.
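As a rough illustration of the arithmetic above, here is a minimal Python sketch (the helper names plane_of and utf16_surrogate_pair are illustrative, not part of any standard API) that derives the plane of a code point and its UTF-16 surrogate pair:

    def plane_of(code_point: int) -> int:
        """Each plane holds 0x10000 (65,536) code points, so the plane is
        simply the code point divided by 0x10000."""
        if not 0 <= code_point <= 0x10FFFF:
            raise ValueError("outside the Unicode codespace")
        return code_point >> 16

    def utf16_surrogate_pair(code_point: int) -> tuple[int, int]:
        """Supplementary code points (planes 1-16) are encoded in UTF-16 as a
        high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF)."""
        if code_point <= 0xFFFF:
            raise ValueError("BMP code points need no surrogate pair")
        offset = code_point - 0x10000           # 20 bits remain
        high = 0xD800 + (offset >> 10)          # top 10 bits
        low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
        return high, low

    if __name__ == "__main__":
        print(plane_of(0x1F600))                                 # 1 (an emoji in the SMP)
        print([hex(u) for u in utf16_surrogate_pair(0x1F600)])   # ['0xd83d', '0xde00']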
2.
Script (Unicode)
–
In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems. Some scripts support one and only one writing system and language; other scripts support many different writing systems. For example, the Latin script supports English, French, German, Italian, Vietnamese, Latin itself, and several other languages. Some languages make use of multiple alternate writing systems and thus also use several scripts; in Turkish, the Arabic script was used before the 20th century, and the language transitioned to the Latin script in the early part of the 20th century. For a list of languages supported by each script, see the list of languages by writing system. More or less complementary to scripts are symbols and Unicode control characters. The unified diacritical characters and unified punctuation characters frequently have the Common or Inherited script property. Unicode 9.0 defines 135 separate scripts, including 84 modern scripts and 51 ancient or historic scripts. More scripts are in the process of being encoded or have been allocated for encoding in roadmaps. When multiple languages make use of the same script, there are frequently some differences, particularly in diacritics. For example, Swedish and English both use the Latin script, but Swedish includes the character 'å' while English has no such character; nor does English make use of the combining ring above diacritic for any character. In general, languages sharing the same script share many of the same characters. Despite these peripheral differences in the Swedish and English writing systems, they are said to use the same Latin script, so the Unicode abstraction of scripts is a basic organizing technique. The differences between alphabets or writing systems remain and are supported through Unicode's flexible scripts and combining marks. Writing system is sometimes treated as a synonym for script; however, it can also refer to a specific writing system supported by a script. For example, the Vietnamese writing system is supported by the Latin script. A writing system may also cover more than one script; for example, the Japanese writing system makes use of the Han, Hiragana, and Katakana scripts. The term complex system is used to describe those where the admixture makes classification problematic. Unicode supports all of these types of writing systems through its numerous scripts. Unicode also adds further properties to characters to help differentiate the various characters and the ways they behave within Unicode text-processing algorithms. In addition to explicit or specific script properties, Unicode uses three special values: Common, Inherited, and Unknown. Unicode can assign a character in the UCS to a single script only; however, many characters, namely those that are not part of a natural-language writing system or that are unified across many writing systems, may be used in more than one script, for example currency signs, symbols, numerals, and punctuation marks. In these cases Unicode defines them as belonging to the Common script.
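A small sketch of querying script properties, assuming the third-party regex module (installable with pip install regex), which, unlike the standard re module, supports \p{Script=...} classes:

    import regex  # third-party; assumed available via `pip install regex`

    samples = ["å", "A", "٣", "1", "。"]
    for ch in samples:
        latin = bool(regex.match(r"\p{Script=Latin}", ch))
        common = bool(regex.match(r"\p{Script=Common}", ch))
        print(f"U+{ord(ch):04X} {ch!r}: Latin={latin} Common={common}")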
3.
Unicode
–
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. As of June 2016, the most recent version is Unicode 9.0; the standard is maintained by the Unicode Consortium. Unicode's success at unifying character sets has led to its widespread adoption; the standard has been implemented in many recent technologies, including modern operating systems, XML, Java, and the .NET Framework. Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-16, and the now-obsolete UCS-2. UTF-8 uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters. UCS-2 uses a 16-bit code unit for each character but cannot encode every character in the current Unicode standard. UTF-16 extends UCS-2, using one 16-bit unit for the characters that were representable in UCS-2 and two 16-bit units to handle each of the additional characters. Many traditional character encodings share a common problem in that they allow bilingual computer processing but not multilingual computer processing. Unicode, in intent, encodes the underlying characters (graphemes and grapheme-like units) rather than the variant glyphs for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs. In text processing, Unicode takes the role of providing a unique code point, a number, for each character; in other words, Unicode represents a character in an abstract way and leaves the visual rendering to other software, such as a web browser or word processor. This simple aim becomes complicated, however, because of concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode: the first 256 code points were made identical to the content of ISO-8859-1 so as to make it trivial to convert existing Western text. For other examples, see duplicate characters in Unicode. The name Unicode was coined by Joe Becker, who explained that it is intended to suggest a unique, unified, universal encoding. In his 1988 document, entitled Unicode 88, Becker outlined a 16-bit character model: Unicode could be roughly described as "wide-body ASCII" that has been stretched to 16 bits to encompass the characters of all the world's living languages. In a properly engineered design, 16 bits per character are more than sufficient for this purpose, since Unicode aims in the first instance at the characters published in modern text, whose number is undoubtedly far below 2^14 = 16,384. By the end of 1990, most of the work on mapping existing character encoding standards had been completed. The Unicode Consortium was incorporated in California on January 3, 1991, and in October 1991 the first volume of the Unicode standard was published. The second volume, covering Han ideographs, was published in June 1992. In 1996, a surrogate character mechanism was implemented in Unicode 2.0, so that Unicode was no longer restricted to 16 bits. The Microsoft TrueType specification version 1.0 from 1992 used the name "Apple Unicode" instead of "Unicode" for the Platform ID in the naming table. Unicode defines a codespace of 1,114,112 code points in the range 0 to 10FFFF (hexadecimal). Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number; for code points in the Basic Multilingual Plane, four digits are used, while for code points outside the BMP, five or six digits are used, as required.
Code points in Planes 1 through 16 are accessed as surrogate pairs in UTF-16. Within each plane, characters are allocated within named blocks of related characters.
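A brief Python sketch of the notation and encodings described above; the helper u_plus is a hypothetical convenience, and the printed byte sequences simply restate the UTF-8 and UTF-16 behaviour already covered:

    def u_plus(ch: str) -> str:
        """Four hex digits for BMP code points, five or six outside the BMP."""
        cp = ord(ch)
        return f"U+{cp:04X}" if cp <= 0xFFFF else f"U+{cp:X}"

    for ch in ("A", "€", "𝄞"):   # ASCII, another BMP character, and a Plane 1 character
        print(u_plus(ch),
              ch.encode("utf-8").hex(" "),
              ch.encode("utf-16-be").hex(" "))
    # 'A'  -> one UTF-8 byte, one UTF-16 code unit
    # '€'  -> three UTF-8 bytes, one UTF-16 code unit
    # '𝄞' -> four UTF-8 bytes, a UTF-16 surrogate pair (two code units)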
4.
Interlinear gloss
–
In linguistics and pedagogy, an interlinear gloss is a gloss placed between lines, such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more lines of transcription and translation, known as an interlinear text or interlinear glossed text (interlinear for short). Such glosses help the reader follow the relationship between the text and its translation, and the structure of the original language. In its simplest form, an interlinear gloss is simply a literal, word-for-word translation. Interlinear glosses have been used for a variety of purposes over a long period of time. One common usage has been to annotate bilingual textbooks for language education; this sort of interlinearization serves to help make the meaning of a source text explicit without attempting to formally model the structural characteristics of the source language. Such annotations have occasionally been expressed not through interlinear layout, but rather through enumeration of words in the object and target languages; even so, this approach requires the reader to re-align the correspondences between source and target forms. Finally, modern linguists have adopted the practice of using abbreviated grammatical category labels. In computing, special text markers are provided in the Specials Unicode block to indicate the start and end of interlinear glosses. A semi-standardized set of parsing conventions and grammatical abbreviations is explained in the Leipzig Glossing Rules. A Taiwanese example, with two tone-numbered transcriptions, a gloss line, and a free translation:
goá iáu-boē koat-tēng tang-sî boeh tńg-khì.
goa1 iau1-boe3 koat2-teng3 tang7-si5 boeh2 tng1-khi3.
goa2 iau2-boe7 koat4-teng7 tang1-si5 boeh4 tng2-khi3.
I not-yet decide when want return.
"I have not yet decided when I shall return."
In linguistics, it has become standard to align the words one-to-one with their glosses; that is, koat-tēng in the lines above would either require a hyphenated two-word gloss or be transcribed without a hyphen, for example as koattēng. Grammatical terms are commonly abbreviated and printed in SMALL CAPITALS to keep them distinct from translations, and varying levels of analysis may be detailed. For example, in a Lezgian text using standard romanization, every morpheme may be set off with hyphens; since many of these are difficult to gloss in English, the roots are translated, while the grammatical suffixes are glossed with three-letter grammatical abbreviations. In interlinear morphological glosses, various forms of punctuation separate the glosses. Typically, the words are aligned with their glosses; that is, there should be the same number of words separated with spaces in the text and its gloss, as well as the same number of hyphenated morphemes within a word and its gloss. This is the basic system and can be applied universally, for example to the Turkish sentence Odadan hızla çıktım ("I left the room quickly"). A gloss such as FEM.PL.DAT ("to the houses") joins several category labels with periods when a single morpheme marks several categories; however, sometimes finer distinctions may be made.
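For the computing detail mentioned above, here is a minimal sketch of the interlinear annotation characters in the Specials block (U+FFF9 anchor, U+FFFA separator, U+FFFB terminator); the annotate helper is illustrative only, and most renderers simply hide these characters:

    import unicodedata

    ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

    def annotate(base: str, gloss: str) -> str:
        """Wrap a base text and its gloss with the interlinear annotation marks."""
        return f"{ANCHOR}{base}{SEPARATOR}{gloss}{TERMINATOR}"

    s = annotate("koat-tēng", "decide")
    print([unicodedata.name(c, "?") for c in s[:1] + s[-1:]])
    # ['INTERLINEAR ANNOTATION ANCHOR', 'INTERLINEAR ANNOTATION TERMINATOR']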
5.
Universal Character Set characters
–
The Unicode Consortium and the International Organisation for Standardisation (ISO) collaborate on the Universal Character Set (UCS). The UCS is a standard that maps characters used in natural language, mathematics, music, and other domains. By creating this mapping, the UCS enables computer software vendors to interoperate, and because it is a universal map, it can be used to represent multiple languages at the same time. The UCS has the capacity to encode over 1 million characters. Each UCS character is represented by a code point, an integer between 0 and 1,114,111, used to represent each character within the internal logic of text-processing software. The number of encoded characters includes 128,019 graphical characters and 218 special-purpose characters for control and formatting. ISO maintains the basic mapping of characters from character name to code point. Often the terms character and code point are used interchangeably; however, when a distinction is made, a code point refers to the integer assigned to a character, what one might think of as its address. Input methods can be through a keyboard or a character palette. The UCS can be divided in various ways, such as by plane, block, or character category. In XML and HTML, a numeric character reference of the form &#nnnn; or &#xhhhh; refers to a character by its decimal or hexadecimal code point; the x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros, and the hhhh may mix uppercase and lowercase, though uppercase is the usual style. In contrast, an entity reference refers to a character by the name of an entity which has the desired character as its replacement text. The entity must either be predefined or explicitly declared in a Document Type Definition; the format is the same as for any entity reference, &name;, where name is the case-sensitive name of the entity. Unicode and ISO divide the set of code points into 17 planes. As of 2016, ISO and the Unicode Consortium have allocated characters to only some of the planes; the others remain empty and reserved for future use. Most characters are assigned to the first plane, the Basic Multilingual Plane. This is to ease the transition for legacy software, since the Basic Multilingual Plane is addressable with just two octets. The characters outside the first plane usually have very specialized or rare use. Each plane corresponds to the value of the one or two hexadecimal digits preceding the final four; hence U+24321 is in Plane 2, and U+4321 is in Plane 0.
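A small Python sketch of the numeric character reference formats described above; ncr_decimal and ncr_hex are hypothetical helper names, and html.unescape is used only to show the round trip for HTML-style references:

    import html

    def ncr_decimal(ch: str) -> str:
        return f"&#{ord(ch)};"

    def ncr_hex(ch: str) -> str:
        return f"&#x{ord(ch):X};"

    print(ncr_decimal("€"), ncr_hex("€"))       # &#8364; &#x20AC;
    print(html.unescape("&#x20AC; &euro;"))     # € €  (an NCR and a predefined HTML entity)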
6.
Endianness
–
Endianness refers to the sequential order used to numerically interpret a range of bytes in computer memory as a larger, composed word value. It also describes the order of byte transmission over a digital link. In big-endian format, the most significant byte is stored first, at the lowest address; little-endian format reverses the order of the sequence and stores the least significant byte at the first location, with the most significant byte being stored last. The order of bits within a byte can also have endianness; however, both the big and little forms of byte endianness are widely used in digital electronics. As examples, the IBM z/Architecture mainframes use big-endian while the Intel x86 processors use little-endian; their designers chose these orderings in the 1960s and 1970s respectively. Big-endian is the most common format in data networking; fields in the protocols of the Internet protocol suite, such as IPv4, IPv6, and TCP, are transmitted most significant byte first, and for this reason big-endian byte order is also referred to as network byte order. Little-endian storage is popular for microprocessors, in part due to the significant influence of Intel Corporation on microprocessor designs. Mixed forms also exist; for instance, the ordering of bytes within a 16-bit word may differ from the ordering of 16-bit words within a 32-bit word. Such cases are sometimes referred to as mixed-endian or middle-endian. There are also some bi-endian processors that can operate in either little-endian or big-endian mode. Big-endianness may be demonstrated by writing a decimal number, say one hundred twenty-three, on paper in the usual positional notation understood by a numerate reader: 123. The digits are written starting from the left and proceeding to the right, with the most significant digit, 1, written first; this is analogous to the lowest address of memory being used first, and is an example of a big-endian convention taken from daily life. The little-endian way of writing the same number would place the hundreds digit 1 in the right-most position: 321. A person following conventional big-endian place-value order, who is not aware of this special ordering, would read it as a different number. Endianness in computing is similar, but it applies to the ordering of bytes rather than digits. Danny Cohen introduced the terms Little-Endian and Big-Endian for byte ordering in an article from 1980. Computer memory consists of a sequence of storage cells, and each cell is identified in hardware and software by its memory address; if the total number of cells in memory is n, addresses are enumerated from 0 to n-1. Computer programs often use data structures or fields that consist of more data than can be stored in one memory cell. For the purposes of this article, such a field is relevant where it is used as an operand of an instruction; in addition, it has to be of a numeric type in some positional number system.
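A minimal sketch of byte order using Python's standard struct module; the packed values are arbitrary examples:

    import struct

    value = 0x0A0B0C0D
    print(struct.pack(">I", value).hex(" "))   # 0a 0b 0c 0d  (big-endian: most significant byte first)
    print(struct.pack("<I", value).hex(" "))   # 0d 0c 0b 0a  (little-endian: least significant byte first)

    # Network byte order is big-endian; "!" is struct's explicit network-order marker.
    print(struct.pack("!H", 80).hex(" "))      # 00 50  (e.g. a 16-bit TCP port field)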
7.
Text file
–
A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists within a file system. The end of a text file was often denoted by placing one or more special end-of-file characters after the last line; such markers were required under the CP/M and MS-DOS operating systems. On modern operating systems such as Windows and Unix-like systems, text files do not contain any special EOF character. "Text file" refers to a type of container, while "plain text" refers to a type of content; text files can contain plain text, but they are not limited to such. At a generic level of description, there are two kinds of computer files: text files and binary files. Because of their simplicity, text files are commonly used for the storage of information. They avoid some of the problems encountered with other file formats, such as endianness and padding bytes. Further, when data corruption occurs in a text file, it is often easier to recover the remaining contents. A disadvantage of text files is that they usually have low entropy, meaning the information occupies more storage than is strictly necessary. A simple text file needs no additional metadata to assist the reader in interpretation, and therefore may contain no other data at all. The ASCII character set is the most common format for English-language text files and is generally assumed to be the default file format in many situations. For accented and other non-ASCII characters, it is necessary to choose a character encoding. In many systems, this is chosen on the basis of the locale setting on the computer the file is read on. Common character encodings include ISO 8859-1 for many European languages. Because many encodings have only a limited repertoire of characters, they are often only usable to represent text in a limited subset of human languages. Unicode is an attempt to create a common standard for representing all known languages. On most operating systems the name text file refers to a file format that allows only plain text content with very little formatting. Such files can be viewed and edited on text terminals or in text editors. Text files usually have the MIME type text/plain, usually with additional information indicating an encoding. MS-DOS and Windows use a common text file format, with each line of text separated by a two-character combination: carriage return (CR) and line feed (LF).
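As a small illustration of the line-ending convention mentioned above, a sketch that writes a file with CRLF endings and reads it back; the file name dos.txt is arbitrary:

    lines = ["first line", "second line"]

    # newline="\r\n" makes Python translate each written "\n" into CR+LF.
    with open("dos.txt", "w", newline="\r\n", encoding="ascii") as f:
        f.write("\n".join(lines) + "\n")

    with open("dos.txt", "rb") as f:
        print(f.read())                       # b'first line\r\nsecond line\r\n'

    # Universal newlines on read accept CRLF and LF alike.
    with open("dos.txt", "r", encoding="ascii") as f:
        print(f.read().splitlines())          # ['first line', 'second line']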
8.
UTF-8
–
UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode. It was originally designed by Ken Thompson and Rob Pike. The encoding is variable-length and uses 8-bit code units; it was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks found in the alternative UTF-16 and UTF-32 encodings. The name is derived from Unicode Transformation Format – 8-bit. UTF-8 is the dominant character encoding for the World Wide Web, accounting for 88.9% of all Web pages in April 2017. The Internet Mail Consortium recommends that all e-mail programs be able to display and create mail using UTF-8. UTF-8 encodes each of the 1,112,064 valid code points in Unicode using one to four 8-bit bytes; code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. The structure of the encoding is as follows, where the x positions are replaced by the bits of the code point:
U+0000 to U+007F (7 bits): 0xxxxxxx
U+0080 to U+07FF (11 bits): 110xxxxx 10xxxxxx
U+0800 to U+FFFF (16 bits): 1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF (21 bits): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
If the number of significant bits is no more than 7, the first line applies; if no more than 11 bits, the second line applies, and so on. The first 128 characters need one byte. Three bytes are needed for characters in the rest of the Basic Multilingual Plane; four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji. The salient features of this scheme are as follows. Backward compatibility: one-byte codes are used for the ASCII values 0 through 127. Clear indication of byte-sequence length: the first byte indicates the number of bytes in the sequence, and the length of a multi-byte sequence is simply the number of high-order 1s in its leading byte. Self-synchronization: the leading bytes and the continuation bytes do not share values, so a search will not accidentally find the sequence for one character starting in the middle of another character, and the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. Consider the encoding of the Euro sign, €. The Unicode code point for € is U+20AC; since this lies between U+0800 and U+FFFF, it takes three bytes to encode. Hexadecimal 20AC is binary 0010 0000 1010 1100; the two leading zeros are added because, as the scheme above shows, a three-byte encoding needs exactly sixteen bits from the code point. The four most significant bits are stored in the low-order four bits of the leading byte, whose high-order bits 1110 mark it as the start of a three-byte sequence. All continuation bytes contain exactly six bits from the code point, so the next six bits of the code point are stored in the low-order six bits of the next byte, with 10 stored in the high-order two bits to mark it as a continuation byte. Finally, the last six bits of the code point are stored in the low-order six bits of the final byte. The three bytes 11100010 10000010 10101100 can be written more concisely in hexadecimal as E2 82 AC.
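A sketch of the three-byte case walked through above; utf8_three_bytes is an illustrative helper restricted to code points between U+0800 and U+FFFF, not a general encoder:

    def utf8_three_bytes(cp: int) -> bytes:
        assert 0x0800 <= cp <= 0xFFFF
        return bytes([
            0b11100000 | (cp >> 12),          # leading byte: 1110 + top 4 bits
            0b10000000 | ((cp >> 6) & 0x3F),  # continuation: 10 + next 6 bits
            0b10000000 | (cp & 0x3F),         # continuation: 10 + last 6 bits
        ])

    assert utf8_three_bytes(0x20AC) == "€".encode("utf-8")
    print(utf8_three_bytes(0x20AC).hex(" "))  # e2 82 ac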
9.
Mojibake
–
Mojibake, from the Japanese 文字 (character) + 化け (transform), is the garbled text that results from text being decoded using an unintended character encoding. The result is a systematic replacement of symbols with completely unrelated ones. The display may include the replacement character � in places where the binary representation is considered invalid. A replacement can also involve multiple consecutive symbols as viewed in one encoding; this happens either because the encodings use different constant lengths per character, or because one of them is a variable-length encoding. Failed rendering of glyphs due to either missing fonts or missing glyphs in a font is a different issue that is not to be confused with mojibake; symptoms of this failed rendering include blocks with the code point displayed in hexadecimal or the use of the generic replacement character �. Importantly, these replacements are valid and are the result of correct error handling by the software. To correctly reproduce the text that was encoded, the correspondence between the encoded data and the notion of its encoding must be preserved. As mojibake is an instance of non-compliance between these, correcting it can be achieved by manipulating the data itself or just relabeling its encoding. Mojibake is often seen with text data that have been tagged with a wrong encoding; the data may not even be tagged at all. A major source of trouble are communication protocols that rely on settings on each computer rather than sending or storing metadata together with the data. Whereas Linux distributions mostly switched to UTF-8 for all uses of text, Microsoft Windows still uses codepages for text files, which differ between languages. For some writing systems, an example being Japanese, several encodings have historically been employed, causing users to see mojibake relatively often. If the encoding is not specified, it is up to the software to decide it by other means; depending on the type of software, the typical solution is either configuration or charset-detection heuristics, both of which are prone to mispredictions in not-so-uncommon scenarios. The encoding of text files is usually governed by an OS-level setting, which depends on the brand of operating system and possibly the user's language; therefore, the assumed encoding is systematically wrong for files that come from a computer with a different setting. One solution is to use a byte order mark, but many parsers of source code and other machine-readable text do not tolerate one. Another is storing the encoding as metadata in the filesystem: filesystems that support extended file attributes can store it as user.charset. This also requires support in software that wants to take advantage of it, but does not disturb other software. While a few encodings are easy to detect, in particular UTF-8, many are hard to distinguish reliably. Mojibake also occurs when the encoding is wrongly specified. This often happens between encodings that are similar; for example, the Eudora email client for Windows was known to send emails labelled as ISO-8859-1 that were in reality Windows-1252.
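A tiny sketch of mojibake in Python: UTF-8 bytes decoded with a wrong single-byte encoding, then correctly relabelled; the sample string is arbitrary:

    text = "café naïve"
    data = text.encode("utf-8")

    print(data.decode("windows-1252"))                  # cafÃ© naÃ¯ve  (the classic mojibake pattern)
    print(data.decode("utf-8"))                         # café naïve    (relabeled with the right encoding)
    print(b"caf\xe9".decode("utf-8", errors="replace")) # caf�          (invalid UTF-8 byte replaced with U+FFFD)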
10.
Windows-1252
–
Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and some other Western languages. It is one version within the group of Windows code pages. Notable additional characters include curly quotation marks and all the printable characters that are in ISO 8859-15. It is known to Windows by the code page number 1252. It is very common to mislabel Windows-1252 text with the charset label ISO-8859-1; a common result was that all the curly quotes and apostrophes were replaced with question marks or boxes on non-Windows operating systems, making text difficult to read. Most modern web browsers and e-mail clients treat the media type charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling; this is now standard behavior in the HTML5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding. Historically, the phrase "ANSI Code Page" has been used in Windows to refer to the various code pages considered native; the intention was that most of these would be ANSI standards such as ISO-8859-1. Even though Windows-1252 was the first and by far the most popular code page so named in Microsoft Windows parlance, the code page was never standardized by ANSI. Microsoft explains: "The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community." In LaTeX packages, CP-1252 is referred to as ansinew. In the code page layout, each character is shown with its Unicode equivalent and its decimal code; conversions to Unicode are based on the Unicode.org mapping of Windows-1252 with "best fit", and the best-fit mapping documents this behavior, too. The first version of code page 1252, used in Microsoft Windows 1.0, did not have positions D7 and F7 defined, and all the characters in the range 80-9F were undefined too. The second version, used in Microsoft Windows 2.0, had positions D7, F7, and 91 (among others) defined. The third version, used since Microsoft Windows 3.1, had all the present-day positions defined, except for the Euro sign and the Z with caron character pair. The final version debuted in Microsoft Windows 98 and was ported to older versions of Windows with the Euro symbol update.
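A short sketch of why the mislabeling matters: the byte values 0x80-0x9F are C1 control codes in ISO-8859-1 but printable characters (curly quotes, dashes, the Euro sign) in Windows-1252; the byte sequence below is an arbitrary example:

    data = bytes([0x93, 0x48, 0x69, 0x94, 0x20, 0x96, 0x20, 0x80])

    print(data.decode("windows-1252"))                       # “Hi” – €
    print([f"U+{ord(c):04X}" for c in data.decode("iso-8859-1")])
    # ['U+0093', 'U+0048', 'U+0069', 'U+0094', 'U+0020', 'U+0096', 'U+0020', 'U+0080']
    print("€".encode("windows-1252").hex())                  # 80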
11.
Jon Bosak
–
Jon Bosak led the creation of the XML specification at the W3C. From 1996 to 2008, he worked for Sun Microsystems. Jon's stewardship of the XML process has been marked by a combination of deft political maneuvering and steadfast insistence on doing things based on principle, not expediency. In 2001, Bosak organized the OASIS Universal Business Language (UBL) Technical Committee to create standard formats for electronic business documents. He led the UBL TC through the completion of UBL 2.1 in November 2013. Bosak is the author of the book The Old Measure: An Inquiry into the Origins of the U.S. Customary System of Weights and Measures and the article "Canonical grain weights as a key to ancient systems of weights and measures". Jon Bosak's father, Robert Bosak, began a long involvement in the computer industry in 1947, when he went to work on the first computer on the west coast of the USA. He joined RAND in 1948 to work on the analysis and programming of scientific problems, and in 1951 he joined Lockheed Aircraft Corporation, where he organized and directed the Mathematical Analysis Group. For a short time after his divorce in the 1950s, he shared an apartment with Bob Bemer. He was one of the designers of JOVIAL and a principal author of the seminal paper "An Information Algebra".
12.
Zero-width non-joiner
–
The zero-width non-joiner (ZWNJ) is a non-printing character used in the computerization of writing systems that make use of ligatures. When placed between two characters that would otherwise be connected into a ligature, a ZWNJ causes them to be printed in their final and initial forms, respectively. This is also an effect of a space character, but a ZWNJ is used when it is desirable to keep the words closer together or to connect a word with its morpheme. The ZWNJ is encoded in Unicode as U+200C ZERO WIDTH NON-JOINER. In certain languages, the ZWNJ is necessary for unambiguously specifying the correct typographic form of a character sequence. The ASCII control code unit separator was formerly used for this purpose. In the Biblical Hebrew example, the placement of the holam dot to the left of the letter vav ⟨ו⟩ is correct; if the dot were placed over the ⟨ו⟩, it would resemble ⟨עוֹנוֹת⟩. In Modern Hebrew, there is no reason to use the holam after the vav, and the ligature is not needed, so the ZWNJ is rarely used in Modern Hebrew typesetting. In German typography, ligatures may not cross the constituent boundaries within compounds; thus, in the first German example, the prefix Auf- is separated from the rest of the word to prohibit the ligature fl. Similarly in English, ligatures should not cross morpheme boundaries: in some words fly and fish are morphemes but in others they are not, so words like deafly and selfish should not have ligatures, while dayfly and catfish should have them. Persian uses this character extensively for certain prefixes, suffixes, and compound words. In Kannada, the characters ನ್ and ನ combine to form the conjunct ನ್ನ, but when a ZWNJ is inserted between them the conjunct is not formed and the two are rendered separately. In Bengali, the characters র and য combine into র্য, a conjunct glyph of র and য. Because অ্যা is not a single character but is written with three letters, the ZWNJ is used to fix the rendering: words such as র্যাব and র্যান্ডম are corrected by inserting a ZWNJ, without which they are displayed with the unwanted conjunct. Writing words such as উদ্ঘাটন and ইক্রা also requires a ZWNJ.
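A minimal sketch of inserting the ZWNJ in Python; the Persian word used (می‌روم, "I go") is a common illustration, and the character itself is invisible in rendered output:

    import unicodedata

    ZWNJ = "\u200c"
    word = "می" + ZWNJ + "روم"   # keeps the prefix unjoined from the stem when rendered

    print(unicodedata.name(ZWNJ))                      # ZERO WIDTH NON-JOINER
    print(len(word), [f"U+{ord(c):04X}" for c in word])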
13.
CJK Unified Ideographs
–
The Chinese, Japanese, and Korean scripts share a common background and their characters are collectively known as CJK characters. In the process called Han unification, the shared characters were identified and named CJK Unified Ideographs. As of Unicode 9.0, Unicode defines a total of 80,388 CJK Unified Ideographs. The terms ideographs or ideograms may be misleading, since the Chinese script is not strictly a picture-writing system. Historically, Vietnam used Chinese ideographs too, so sometimes the abbreviation CJKV is used; that system was replaced by the Latin-based Vietnamese alphabet in the 1920s. The basic block named CJK Unified Ideographs contains 20,950 basic Chinese characters in the range U+4E00 through U+9FD5. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are used in only one or two of the three. Chinese characters were also used in Vietnam's Nôm script. The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals; in this system the characters written with the fewest strokes are listed first. The remaining characters were added later and so are not in radical order. The block is the result of Han unification, which was somewhat controversial in East Asia because Chinese, Japanese, and Korean variants of the same character are coded at the same location; however, the source separation rule states that characters encoded separately in an earlier character set remain separate in the Unicode encoding. Using variation selectors, it is possible to specify certain variant CJK ideographs within Unicode; the Adobe-Japan1 character set proposal, which calls for 14,679 ideographic variation sequences, is an extreme example of the use of variation selectors. In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB18030 characters were assigned to code points between U+9FA6 and U+9FBB. The block named CJK Unified Ideographs Extension A contains 6,582 additional characters in the range U+3400 through U+4DB5 that were added in Unicode 3.0. The block named CJK Unified Ideographs Extension B contains 42,711 characters in the range U+20000 through U+2A6D6 that were added in Unicode 3.1; these include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block. Note that most characters appear in more than one source, making the sum of individual character counts far greater than the number of encoded characters.
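A minimal sketch that classifies code points against the block ranges quoted above; the CJK_BLOCKS table and cjk_block helper are illustrative only and cover just the three blocks discussed:

    CJK_BLOCKS = {
        "CJK Unified Ideographs":             (0x4E00, 0x9FD5),
        "CJK Unified Ideographs Extension A": (0x3400, 0x4DB5),
        "CJK Unified Ideographs Extension B": (0x20000, 0x2A6D6),
    }

    def cjk_block(ch):
        cp = ord(ch)
        for name, (lo, hi) in CJK_BLOCKS.items():
            if lo <= cp <= hi:
                return name
        return None

    for ch in "漢㐀𠀀A":
        print(f"U+{ord(ch):05X}", cjk_block(ch))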
14.
Combining character
–
In digital typography, combining characters are characters that are intended to modify other characters. The most common combining characters in the Latin script are the combining diacritical marks. Unicode also contains many precomposed characters, so that in many cases it is possible to use either combining diacritics or precomposed characters, at the user's or application's choice. In Unicode, the main block of combining diacritics for European languages and the International Phonetic Alphabet is U+0300 through U+036F; combining diacritical marks are also present in many other blocks of Unicode characters. OpenType has the ccmp feature tag to define glyphs that are compositions or decompositions involving combining characters. Code points U+035C through U+0362 are double diacritics, diacritic signs placed across two letters. For example, U+0364 is an e written above the preceding letter, to be used for Early Modern High German umlaut notation, such as uͤ for Modern German ü.
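A short sketch of combining versus precomposed forms using Python's standard unicodedata module:

    import unicodedata

    precomposed = "\u00e9"          # é as a single precomposed character
    combining = "e\u0301"           # e followed by U+0301 COMBINING ACUTE ACCENT

    print(precomposed == combining)                                # False (different code point sequences)
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True  (canonical composition)
    print(unicodedata.normalize("NFD", precomposed))               # decomposes back to e + U+0301
    print(unicodedata.combining("\u0301"))                         # non-zero combining class: a combining mark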
15.
Numerals in Unicode
–
Numerals are characters or sequences of characters that denote a number. The same Arabic-Indic numerals are used widely in various writing systems throughout the world; however, the graphemes representing these numerals differ widely from one writing system to another. To support these grapheme differences, Unicode includes encodings of these numerals within many of the script blocks: the decimal digits are repeated in 23 separate blocks (twice in Arabic). Six additional blocks contain the digits again as rich text, primarily to serve as a palette of graphemes for specialized mathematical use. Numerals frequently involve the composition of glyphs, as a limited number of characters are composed to make other numerals. For example, the sequence 9-9-0 in Arabic-Indic numerals composes the numeral for nine hundred ninety; in Roman numerals, the same number is expressed by the composed numeral Ⅹↀ or ⅩⅯ. Each of these is a numeral for representing the same abstract number. The semantics of the numerals differ in particular in their composition: the Arabic-Indic decimal digits are positional-value compositions, while the Roman numerals are sign-value, being additive or subtractive depending on their composition. Grouped by their numeric property as used in text, Unicode has four values for Numeric Type: None, Decimal, Digit, and Numeric. First there is the "not a number" type; not included is a numbering like A. B. C. for chapter numbering. Hexadecimal digits in Unicode are not separate characters; existing letters and numbers are used, and these characters carry the character properties Hex_Digit=Yes and ASCII_Hex_Digit=Yes where appropriate. The Arabic-Indic numerals involve ten digits and a separator that can be combined into composite numerals representing any rational number. Unicode includes these ten digits in the Basic Latin block. Unicode has no decimal separator for common unified use: the Arabic script includes an Arabic-specific decimal separator, while other writing systems are to use whatever punctuation produces the appropriate glyph for the locale, for example Full Stop in United States usage and Comma in many other locales. Unicode includes a numeric value property for each digit to assist in collation; however, there is no mapping between the various related Arabic-Indic digits. The fraction slash character allows authors using Unicode to compose any arbitrary fraction along with the decimal digits; Unicode also includes a handful of vulgar fractions as compatibility characters, but discourages their use. Several characters in Unicode can serve as a decimal separator depending on the locale. Decimal fractions are represented in text as a sequence of decimal digit numerals with a decimal separator separating the whole-number portion from the fractional portion. For example, the decimal fraction for "¼" is expressed as zero-point-two-five (0.25); Unicode has no dedicated general decimal separator but unifies the decimal separator function with other punctuation characters, so the full stop used in "0.25" is the same character used to end a sentence.
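A short sketch of the numeric properties discussed above, using the standard unicodedata module; the sample characters are arbitrary:

    import unicodedata

    for ch in ("7", "٧", "⁷", "¼", "Ⅿ"):
        print(f"U+{ord(ch):04X} {ch}",
              unicodedata.decimal(ch, None),   # decimal digit value, if any
              unicodedata.digit(ch, None),     # digit value (covers superscripts, etc.)
              unicodedata.numeric(ch, None))   # general numeric value (fractions, Roman numerals)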
16.
Halfwidth and fullwidth forms
–
In CJK computing, graphic characters are traditionally classed into fullwidth and halfwidth characters. With fixed-width fonts, a halfwidth character occupies half the width of a fullwidth character. In the days of computer terminals and text-mode computing, characters were normally laid out in a grid, often 80 columns by 24 or 25 lines, and each character was displayed as a small dot matrix, often about 8 pixels wide. For a number of practical and aesthetic reasons, Han characters needed to be twice as wide as these fixed-width SBCS (single-byte character set) characters; these fullwidth characters were typically encoded in a DBCS (double-byte character set), although less common systems used other variable-width character sets that used more bytes per character. In Unicode, if a certain grapheme can be represented as either a fullwidth character or a halfwidth character, it is said to have both a fullwidth form and a halfwidth form. Halfwidth and Fullwidth Forms is also the name of the Unicode block U+FF00 through U+FFEF. The range U+FF01 through U+FF5E reproduces the characters of ASCII 21 to 7E as fullwidth forms, that is, a fixed-width form used in CJK computing; this is useful for typesetting Latin characters in a CJK environment. U+FF00 does not correspond to a fullwidth ASCII 20 (space), since that role is already fulfilled by U+3000 IDEOGRAPHIC SPACE. The range U+FF65 through U+FFDC encodes halfwidth forms of Katakana and Hangul characters (see half-width kana), and the range U+FFE0 through U+FFEE includes fullwidth and halfwidth symbols. OpenType has the fwid and hwid feature tags for providing the fullwidth or halfwidth form of a character.
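A small sketch of the fixed 0xFEE0 offset between ASCII and the fullwidth forms; to_fullwidth is an illustrative helper, and NFKC normalization folds the fullwidth characters back:

    import unicodedata

    def to_fullwidth(s: str) -> str:
        out = []
        for ch in s:
            cp = ord(ch)
            if 0x21 <= cp <= 0x7E:
                out.append(chr(cp + 0xFEE0))   # e.g. 'A' (U+0041) -> 'Ａ' (U+FF21)
            elif cp == 0x20:
                out.append("\u3000")           # ideographic space, since U+FF00 is unassigned
            else:
                out.append(ch)
        return "".join(out)

    wide = to_fullwidth("Unicode 9.0!")
    print(wide)                                 # Ｕｎｉｃｏｄｅ　９．０！
    print(unicodedata.normalize("NFKC", wide))  # Unicode 9.0!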
17.
Bi-directional text
–
Bi-directional text is text containing text in both text directionalities, right-to-left (RTL) and left-to-right (LTR). It generally involves text containing different types of alphabets, but may also refer to boustrophedon. Right-to-left is different from the direction used by most writing systems in the world. When LTR text is mixed with RTL text in the same paragraph, each type of text is written in its own direction, which can get complex when multiple levels of quotation are used. Many computer programs fail to display bi-directional text correctly. For example, the Hebrew name Sarah (שרה) is spelled sin, then resh, and finally heh, written from right to left. Note: some web browsers may display the Hebrew text in this article in the opposite direction. Bidirectional script support is the capability of a computer system to correctly display bi-directional text; the term is often shortened to BiDi or bidi. Early computer installations were designed only to support a single writing system. Right-to-left scripts were introduced through encodings like ISO/IEC 8859-6 and ISO/IEC 8859-8, storing the letters in writing and reading order, which makes it possible to simply flip the left-to-right display order to a right-to-left display order. With bidirectional script support, it is possible to mix characters from different scripts on the same page, regardless of writing direction. In particular, the Unicode standard provides foundations for complete BiDi support, with detailed rules as to how mixtures of left-to-right and right-to-left scripts are to be encoded and displayed. The Unicode standard calls for characters to be ordered logically, i.e. in the sequence they are intended to be interpreted, as opposed to visually, the sequence in which they appear. This distinction is relevant for bidi support because at any bidi transition the logical and visual orders diverge; thus, in order to offer bidi support, Unicode prescribes an algorithm for how to convert the logical sequence of characters into the correct visual presentation. For this purpose, the Unicode encoding standard divides all its characters into one of four types: strong, weak, neutral, and explicit formatting. Strong characters are those with definite directionality. Weak characters are those with vague directionality; examples of this type include European digits, Eastern Arabic-Indic digits, arithmetic symbols, and currency symbols, as well as punctuation symbols that are common to many scripts, such as the colon, comma, and full stop. Unless a directional override is present, numbers are always encoded big-endian (most significant digit first) and the numerals rendered LTR; the weak directionality only applies to the placement of the number in its entirety. Neutral characters have directionality indeterminable without context; examples include paragraph separators, tabs, and most other whitespace characters. Explicit formatting characters, also referred to as directional formatting characters, are special Unicode sequences that direct the bidi algorithm to modify its default behavior.
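A brief sketch of the four character types, using the bidirectional categories exposed by Python's standard unicodedata module; the sample characters are arbitrary:

    import unicodedata

    # Bidirectional categories: L/R/AL are strong, EN/AN/ES/CS are weak,
    # WS/B/ON are neutral, and LRE/RLE/LRO/RLO/PDF are explicit formatting codes.
    samples = {
        "A": "Latin letter",
        "א": "Hebrew letter",
        "1": "European digit",
        "٣": "Arabic-Indic digit",
        " ": "space",
        "\u202b": "RIGHT-TO-LEFT EMBEDDING",
    }

    for ch, label in samples.items():
        print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), "-", label)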