Pages in category "Statistical laws"
The following 27 pages are in this category, out of 27 total. This list may not reflect recent changes.
1. Zipf's law – The law is named after the American linguist George Kingsley Zipf, who popularized it and sought to explain it, though he did not claim to have originated it. The French stenographer Jean-Baptiste Estoup appears to have noticed the regularity before Zipf, and it was also noted in 1913 by the German physicist Felix Auerbach. Zipf's law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences. True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words, followed by "and". Only 135 vocabulary items are needed to account for half the Brown Corpus. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913. When Zipf's law is checked for cities, a better fit has been found with exponent s = 1.07. While Zipf's law holds for the upper tail of the distribution, the entire distribution of cities is log-normal and follows Gibrat's law. Both laws are consistent because a log-normal tail cannot be distinguished from a Pareto tail. Zipf's law is most easily observed by plotting the data on a log-log graph, with log(rank) on the x axis and log(frequency) on the y axis. For example, the word "the", at rank 1, would appear at x = log(1) = 0, with y equal to the logarithm of its number of occurrences. It is also possible to plot reciprocal rank against frequency, or reciprocal frequency or interword interval against rank. The data conform to Zipf's law to the extent that the plot is linear. Formally, let N be the number of elements, k be their rank, and s be the value of the exponent characterizing the distribution. Zipf's law then predicts that, out of a population of N elements, the normalized frequency of the element of rank k is $f(k; s, N) = \frac{1/k^s}{\sum_{n=1}^{N} 1/n^s}$. In an alternative formulation, Zipf's law holds if the number of elements with a given frequency is a random variable with a power-law distribution. It has been claimed that this representation of Zipf's law is more suitable for statistical testing, and in this way it has been analyzed in more than 30,000 English texts. The goodness-of-fit tests yield that only about 15% of the texts are statistically compatible with this form of Zipf's law. Slight variations in the definition of Zipf's law can increase this percentage up to close to 50%.
In the example of the frequency of words in the English language, N is the number of words in the English language and, if we use the classic version of Zipf's law, the exponent s is 1. Then f(k; s, N), the normalized frequency of the element of rank k, is the fraction of the time the kth most common word occurs. The law may also be written $f(k; s, N) = \frac{1}{k^s H_{N,s}}$, where $H_{N,s}$ is the Nth generalized harmonic number. The simplest case of Zipf's law is a 1/f function. Given a set of Zipfian distributed frequencies, sorted from most common to least common, the second most common frequency will occur ½ as often as the first, the third most common frequency will occur ⅓ as often as the first, and the fourth most common frequency will occur ¼ as often as the first. In general, the nth most common frequency will occur 1/n as often as the first. However, this cannot hold exactly: because items must occur an integer number of times, there cannot be 2.5 occurrences of a word. Nevertheless, over fairly wide ranges, and to a good approximation, many natural phenomena obey Zipf's law. Mathematically, the sum of all frequencies relative to the first is the harmonic series, which diverges as the number of items grows.
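The normalized form of the law can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `generalized_harmonic` and `zipf_frequency` and the values N = 1000, s = 1 are hypothetical choices, not part of the source.

```python
def generalized_harmonic(n: int, s: float) -> float:
    """Return H_{N,s}, the sum of 1/k**s for k = 1..N."""
    return sum(1.0 / k**s for k in range(1, n + 1))


def zipf_frequency(k: int, s: float, n: int) -> float:
    """Normalized Zipf frequency f(k; s, N) = 1 / (k**s * H_{N,s})."""
    if not 1 <= k <= n:
        raise ValueError("rank k must satisfy 1 <= k <= n")
    return 1.0 / (k**s * generalized_harmonic(n, s))


# Illustrative values only (assumed, not from the source).
N, S = 1000, 1.0
for rank in (1, 2, 3, 4):
    # With s = 1, rank k occurs 1/k as often as rank 1.
    ratio = zipf_frequency(rank, S, N) / zipf_frequency(1, S, N)
    print(rank, round(ratio, 4))
```

Because the frequencies are divided by the generalized harmonic number, they sum to 1 over all N ranks, which is what makes f a proper relative frequency.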
2. Benford's law – Benford's law, also called the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in naturally occurring collections of numbers, the leading significant digit is likely to be small. For example, in sets which obey the law, the number 1 appears as the most significant digit about 30% of the time. By contrast, if the digits were distributed uniformly, they would each occur about 11.1% of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on. It tends to be most accurate when values are distributed across multiple orders of magnitude. There is a generalization of the law from base 10 to numbers expressed in other bases, and also a generalization from the leading 1 digit to the leading n digits. It is named after physicist Frank Benford, who stated it in 1938. Benford's law is a special case of Zipf's law. A set of numbers is said to satisfy Benford's law if the leading digit d (d ∈ {1, ..., 9}) occurs with probability $P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left(\frac{d+1}{d}\right) = \log_{10}\left(1 + \frac{1}{d}\right)$. This is the distribution expected if the mantissae of the logarithms of the numbers are uniformly and randomly distributed. For example, a number x, constrained to lie between 1 and 10, starts with the digit 1 if 1 ≤ x < 2. Therefore, x starts with the digit 1 if log 1 ≤ log x < log 2. The probabilities are proportional to the interval widths, and this gives the equation above. An extension of Benford's law predicts the distribution of first digits in other bases besides decimal. The general form is $P(d) = \log_b(d+1) - \log_b(d) = \log_b\left(1 + \frac{1}{d}\right)$. For b = 2, Benford's law is true but trivial, since every nonzero binary number starts with the digit 1. The discovery of Benford's law goes back to 1881, when the American astronomer Simon Newcomb noticed that in logarithm tables the earlier pages were much more worn than the other pages.
Newcomb's published result is the first known instance of this observation and includes a distribution on the second digit. Newcomb proposed a law that the probability of a single number N being the first digit of a number was equal to log(N + 1) − log(N). The phenomenon was noted again in 1938 by the physicist Frank Benford. The total number of observations used in the paper was 20,229. This discovery was named after Benford. In 1995, Ted Hill proved a result about mixed distributions. Arno Berger and Ted Hill have stated that "the widely known phenomenon called Benford's law continues to defy attempts at an easy derivation".
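The first-digit probabilities can be computed directly from the logarithmic formula and compared with data. The sketch below is a minimal illustration, not a reference implementation; the helper names and the choice of the first 1000 powers of 2 as a test data set are assumptions made here for the example.

```python
import math


def benford_probability(d: int, base: int = 10) -> float:
    """P(d) = log_b(1 + 1/d) for a leading digit d in the given base."""
    if not 1 <= d < base:
        raise ValueError("d must be a nonzero digit in the given base")
    return math.log(1 + 1 / d, base)


def leading_digit(n: int) -> int:
    """Most significant decimal digit of a positive integer."""
    return int(str(n)[0])


def leading_digit_frequency(values, d: int) -> float:
    """Fraction of the values whose leading decimal digit equals d."""
    values = list(values)
    return sum(1 for v in values if leading_digit(v) == d) / len(values)


# Assumed example data: powers of 2 span many orders of magnitude.
powers = [2**n for n in range(1000)]
for digit in range(1, 10):
    predicted = benford_probability(digit)
    observed = leading_digit_frequency(powers, digit)
    print(digit, round(predicted, 3), round(observed, 3))
```

The nine decimal probabilities sum to 1 because the logarithms telescope to log10(10).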
3. Rank-size distribution – Rank-size distribution is the distribution of size by rank, in decreasing order of size. For example, if a data set consists of items of sizes 5, 100, 5, and 8, the rank-size distribution is 100, 8, 5, 5 (ranks 1 through 4). This is also known as the rank-frequency distribution when the source data are from a frequency distribution. These are particularly of interest when the data vary significantly in scale. A rank-size distribution is not a probability distribution or cumulative distribution function. Rather, it is a form of a quantile function in reverse order. In the case of city populations, this results in a few large cities and a much larger number of cities orders of magnitude smaller. For example, under the rank-size rule a rank 3 city would have one-third the population of a country's largest city, a rank 4 city would have one-fourth the population of the largest city, and so on. When any log-linear factor is ranked, the ranks follow the Lucas numbers (1, 3, 4, 7, 11, 18, ...). As in the more famous Fibonacci sequence, each number is approximately 1.618 times the preceding number. For example, the third term in the sequence, 4, is approximately $1.618^3$, or 4.236; the fourth term, 7, is approximately $1.618^4$, or 6.854. With higher values, the figures converge. An equiangular spiral is sometimes used to visualize such sequences. A rank-size distribution is often segmented into ranges. This is frequently done somewhat arbitrarily or due to external factors, particularly for market segmentation, but can also be due to distinct behavior as rank varies. Most simply and commonly, a distribution may be split in two, termed the head and tail. If a distribution is broken into three pieces, the third piece has several names: generically middle, also belly, torso, and body. These frequently have some adjectives added, most significantly long tail, also fat belly and chunky middle. In more traditional terms, these may be called top-tier, mid-tier, and bottom-tier. The relative sizes and weights of these segments qualitatively characterize a distribution, analogously to the skewness or kurtosis of a probability distribution.
Namely, is it dominated by a few top members, or is it dominated by many small members? Practically, this determines strategy: where should attention be focused? These distinctions may be made for various reasons. The exact cutoff depends on the distribution (each distribution has a single such cutoff point) and for power laws can be computed from the Pareto index. Segments may arise due to actual changes in behavior of the distribution as rank varies.
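As a small illustration of the definitions in this entry, the following sketch ranks a data set by size and evaluates the rank-size rule for cities. The function names are hypothetical, and the city population used in the demo is an invented figure chosen only to make the arithmetic easy to follow.

```python
def rank_size(sizes):
    """Return (rank, size) pairs in decreasing order of size, ranks from 1."""
    return list(enumerate(sorted(sizes, reverse=True), start=1))


def rank_size_rule(largest: float, rank: int) -> float:
    """Size predicted for a given rank when size is largest / rank."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return largest / rank


# The example data set from the text: sizes 5, 100, 5, and 8.
print(rank_size([5, 100, 5, 8]))

# Assumed largest-city population, for illustration only.
for r in (1, 2, 3, 4):
    print(r, rank_size_rule(9_000_000, r))
```

Ties, such as the two items of size 5, simply occupy consecutive ranks in this sketch; other tie-breaking conventions are possible.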
4. Heaps' law – In linguistics, Heaps' law is an empirical law which describes the number of distinct words in a document as a function of the document length. It can be formulated as $V_R(n) = K n^{\beta}$, where $V_R$ is the number of distinct words in an instance text of size n. K and β are free parameters determined empirically. With English text corpora, typically K is between 10 and 100, and β is between 0.4 and 0.6. The law is attributed to Harold Stanley Heaps, who proposed it in Section 7.5 of Information Retrieval: Computational and Theoretical Aspects (Academic Press), but it was originally discovered by Gustav Herdan. Under mild assumptions, the Herdan–Heaps law is equivalent to Zipf's law concerning the frequencies of individual words within a text. This is a consequence of the fact that the type-token relation of a homogeneous text can be derived from the distribution of its types. Heaps' law means that as more text is gathered, there will be diminishing returns in terms of discovery of the full vocabulary from which the distinct terms are drawn. Heaps' law also applies to situations in which the vocabulary is just some set of types which are attributes of some collection of objects. For example, the objects could be people, and the types could be country of origin of the person.
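The formula and the diminishing-returns behavior can be sketched as follows. This is a hedged example: the parameter values K = 50 and β = 0.5 are assumptions picked from within the typical ranges quoted above, and both function names are hypothetical.

```python
def heaps_vocabulary(n: int, k: float, beta: float) -> float:
    """Predicted number of distinct words V_R(n) = K * n**beta."""
    return k * n**beta


def vocabulary_growth(tokens):
    """Observed number of distinct tokens after each position in the text."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Assumed parameters within the typical English ranges (illustration only).
K, BETA = 50, 0.5
for size in (10_000, 40_000, 160_000):
    # Quadrupling the text length only doubles the predicted vocabulary.
    print(size, heaps_vocabulary(size, K, BETA))

print(vocabulary_growth("a b a c b d".split()))
```

Fitting K and β to a real corpus would typically be done by regressing log(vocabulary size) on log(text length) over the curve produced by `vocabulary_growth`.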
This article incorporates material from Heaps' law on PlanetMath, which is licensed under the Creative Commons Attribution/Share-Alike License.
5. Lotka's law – Lotka's law, named after Alfred J. Lotka, is one of a variety of special applications of Zipf's law. It describes the frequency of publication by authors in any given field. The general formula is $X^n Y = C$, or equivalently $Y = C / X^n$, where X is the number of publications, Y is the relative frequency of authors with X publications, and n and C are constants depending on the specific field (n ≈ 2). That is, it is an approximate inverse-square law, where the number of authors publishing a certain number of articles is a fixed ratio to the number of authors publishing a single article. As the number of articles published increases, authors producing that many publications become less frequent. Though the law itself covers many disciplines, the actual ratios involved are discipline-specific. This law is believed to have applications in other fields, for example in the military for fighter pilot kills. This is an empirical observation rather than a necessary result. This form of the law is as originally published and is sometimes referred to as the discrete Lotka power function.
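The approximate inverse-square relation described in this entry can be turned into a short calculation. The sketch below assumes an exponent of exactly 2 and a hypothetical field with 100 single-article authors; both are illustrative assumptions, since the entry notes that the actual ratios are discipline-specific.

```python
def lotka_authors(single_article_authors: float, articles: int,
                  exponent: float = 2.0) -> float:
    """Expected number of authors with the given number of articles.

    Uses the power-law form: authors(x) = authors(1) / x**exponent.
    """
    if articles < 1:
        raise ValueError("articles must be at least 1")
    return single_article_authors / articles**exponent


# Assumed field with 100 authors who published exactly one article.
for x in (1, 2, 3, 10):
    print(x, round(lotka_authors(100, x), 2))
```

With these assumed numbers, about a quarter as many authors publish two articles as publish one, and about one hundredth as many publish ten.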
6. Power law – In statistics, a power law is a functional relationship between two quantities in which one quantity varies as a power of the other. For instance, considering the area of a square in terms of the length of its side, if the length is doubled, the area is multiplied by a factor of four. Few empirical distributions fit a power law for all their values; more often, a power law holds only in the tail. Acoustic attenuation follows frequency power laws within wide frequency bands for many complex media. Allometric scaling laws for relationships between biological variables are among the best known power-law functions in nature. One attribute of power laws is their scale invariance. Given a relation $f(x) = a x^{-k}$, scaling the argument x by a constant factor c causes only a proportionate scaling of the function itself. That is, $f(cx) = a (cx)^{-k} = c^{-k} f(x) \propto f(x)$: scaling by a constant c simply multiplies the original power-law relation by the constant $c^{-k}$. Thus, it follows that all power laws with a particular scaling exponent are equivalent up to constant factors, since each is simply a scaled version of the others. This behavior is what produces the linear relationship when logarithms are taken of both f(x) and x, and the straight line on the log-log plot is often called the signature of a power law. With real data, such straightness is a necessary, but not sufficient, condition for the data to follow a power-law relation. In fact, there are many ways to generate finite amounts of data that mimic this signature behavior but, in their asymptotic limit, are not true power laws. Thus, accurately fitting and validating power-law models is an active area of research in statistics. A power-law distribution may also lack a well-defined mean or variance. This can be seen in the following thought experiment: imagine a room with your friends. Now imagine the world's richest person entering the room, with an income of about 1 billion US$. What happens to the average income in the room? Income is distributed according to a power law known as the Pareto distribution. On the one hand, this makes it incorrect to apply traditional statistics that are based on variance. On the other hand, this also allows for cost-efficient interventions.
For example, given that car exhaust is distributed according to a power law among cars, with very few cars contributing most of the emissions, it would be sufficient to eliminate those very few cars from the road to reduce total exhaust substantially. The equivalence of power laws with a particular scaling exponent can also have a deeper origin in the dynamical processes that generate the power-law relation. For instance, the behavior of water and CO2 at their boiling points falls in the same universality class because they have identical critical exponents. In fact, almost all material phase transitions are described by a small set of universality classes. Similar observations have been made, though not as comprehensively, for various self-organized critical systems. Formally, this sharing of dynamics is referred to as universality. Scientific interest in power-law relations stems partly from the ease with which certain general classes of mechanisms generate them.
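The scale invariance and the straight-line log-log signature described in this entry can be checked numerically. The following sketch is an illustration under assumed values (a = 5, k = 1.5, c = 3); the function names are hypothetical.

```python
import math


def power_law(x: float, a: float, k: float) -> float:
    """Evaluate f(x) = a * x**(-k)."""
    return a * x ** (-k)


def log_log_slope(x1: float, x2: float, a: float, k: float) -> float:
    """Slope of log f(x) against log x between two positive points."""
    rise = math.log(power_law(x2, a, k)) - math.log(power_law(x1, a, k))
    run = math.log(x2) - math.log(x1)
    return rise / run


# Assumed parameters, for illustration only.
A, K, C = 5.0, 1.5, 3.0
x = 2.0

# Scale invariance: f(c*x) equals c**(-k) * f(x).
print(power_law(C * x, A, K), C ** (-K) * power_law(x, A, K))

# On a log-log plot the slope is -k regardless of the interval chosen.
print(log_log_slope(1.0, 100.0, A, K), log_log_slope(7.0, 9.0, A, K))
```

The slope check works for an exact power law; as the entry cautions, a straight log-log line in real data is necessary but not sufficient evidence of one.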
7. Long tail – In statistics and business, a long tail of some distributions of numbers is the portion of the distribution having a large number of occurrences far from the head or central part of the distribution. The distribution could involve popularities, random numbers of occurrences of events with various probabilities, and so on. The term is often used loosely, with no definition or an arbitrary definition, but precise definitions are possible. In statistics, the term long-tailed distribution has a technical meaning. Note that statistically, there is no sense of "the long tail" of a distribution, only the property of a distribution being long-tailed. In business, the term long tail is applied to rank-size distributions or rank-frequency distributions. Sometimes an intermediate category is also included, variously called the body, belly, torso, or middle. The specific cutoff of what part of a distribution is the long tail is often arbitrary, but in some cases may be specified objectively; see segmentation of rank-size distributions. The long tail concept has found some ground for application and research. It is a term used in online business, mass media, micro-finance, user-driven innovation, social network mechanisms, economic models, and marketing. A frequency distribution with a long tail has been studied by statisticians since at least 1946. The term has also been used in the finance and insurance business for many years. The work of Benoît Mandelbrot in the 1950s and later has led to him being referred to as the father of long tails. The long tail was popularized by Chris Anderson in an October 2004 Wired magazine article, in which he mentioned Amazon.com, Apple, and Yahoo! as examples of businesses applying this strategy. Anderson elaborated the concept in his book The Long Tail: Why the Future of Business Is Selling Less of More. The total sales of a large number of non-hit items is called the long tail. It is important to understand why some distributions are normal and others are long-tailed. The long tail is the name for a long-known feature of some statistical distributions.
In long-tailed distributions a high-frequency or high-amplitude population is followed by a low-frequency or low-amplitude population which gradually tails off asymptotically. The events at the far end of the tail have a very low probability of occurrence. As a rule of thumb, for such population distributions the majority of occurrences are accounted for by the first 20% of items in the distribution. Power law distributions or functions characterize an important number of behaviors from nature and human endeavor. This fact has given rise to a keen scientific and social interest in such distributions, and the relationships that create them. The observation of such a distribution often points to specific kinds of mechanisms. Examples of behaviors that exhibit long-tailed distribution are the occurrence of certain words in a given language, the income distribution of a business, or the intensity of earthquakes. Chris Anderson's and Clay Shirky's articles highlight special cases in which we are able to modify the underlying relationships and evaluate the impact on the frequency of events.
8. 1% rule (Internet culture) – The 1% rule states that the number of people who create content on the Internet represents approximately 1% of the people who view that content. For example, for every person who posts on a forum, generally about 99 other people view that forum without posting. The term was coined by authors and bloggers Ben McConnell and Jackie Huba. The related term lurking, meaning observing without participating, was coined by an early chat-room user who faced repeated inquiries about her identity and her refusal to engage in chat. The etiquette was, apparently, to introduce oneself to other users upon entry into the chat rooms or sites. In some instances, she needed to explain her coinage of the term lurking, as the term was new to the online community, but others quickly understood her meaning. To her knowledge, the terms had not been used prior to that period. The actual percentage is likely to vary depending upon the subject matter. The 1% rule is often misunderstood to apply to the Internet in general, whereas it applies to any given Internet community. It is for this reason that one can see evidence for the 1% principle on many websites, but aggregated together one can see a different distribution. This latter distribution is unknown and likely to shift, but various researchers have attempted to characterize it. Research in late 2012 suggested that only 23% of the population could properly be classified as lurkers. Several years prior, results were reported on a sample of students from Chicago where 60 percent of the sample created content in some form. A similar concept was introduced by Will Hill of AT&T Laboratories and later cited by Jakob Nielsen. The term regained public attention in 2006 when it was used in a strictly quantitative context within a blog entry on the topic of marketing.