ZIPF’S AND HEAPS’ LAWS FOR THE NATURAL AND SOME RELATED RANDOM TEXTS

Oleh Kushnir, V. Buryi, S. Grydzhan, L. Ivanitskyi, Serhiy Rykhlyuk

Abstract


We have generated Chomsky's randomized texts and Miller's monkey random texts (RTs) based on a source natural text (NT), and clarified their rank–frequency dependences, Pareto distributions, word-frequency probability distributions, and vocabularies as functions of text length. Here the Chomsky RT is an NT randomized so that its 'words' represent arbitrary sequences of letters and blanks between the nearest occurrences of some preset letter (e.g., the letter i). We have compared the exponents appearing in the different power laws that describe the word statistics of the NTs and RTs, and have analyzed how well the theoretical relationships among those exponents are fulfilled in practice. We have shown empirically that the exponents α and β of Zipf's law and the word-probability distribution for the Chomsky RTs obey the inequalities α < 1 and β > 1, while their Heaps exponent should be equal to η ≈ 1. We have also compared our results with those obtained for the monkey texts. We have shown that the vocabulary of the Chomsky texts is richer than that of the monkey texts. Heaps' law is valid to an extraordinarily good approximation for the Chomsky RTs, similarly to the RTs generated by the intermittent-silence process and unlike sufficiently long NTs, which reveal slightly convex vocabulary-versus-text-length dependences when plotted on a double logarithmic scale.

Key words: random texts, randomized texts, Miller's monkey texts, Chomsky's randomization, power laws, Zipf's law, Pareto distribution, word-frequency probability distribution, Heaps' law
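
The two randomization schemes described in the abstract can be illustrated with a short sketch. The Python code below is not the authors' implementation; it only sketches, under assumed conventions, how a Chomsky RT (re-segmenting an NT at occurrences of a preset letter such as i) and a Miller monkey RT (uniformly random characters with blanks as word delimiters) could be generated, together with the rank–frequency (Zipf) and vocabulary-growth (Heaps) statistics mentioned above. The helper names, the blank probability, and the toy source string are assumptions for illustration only.

    # Illustrative sketch (not the authors' code) of the Chomsky and monkey
    # randomizations and of the Zipf and Heaps statistics discussed above.
    import random
    import re
    from collections import Counter

    def chomsky_words(natural_text, preset_letter="i"):
        """Re-segment the text: a 'word' is any run of letters and blanks
        between successive occurrences of preset_letter (abstract's example: i).
        Trimming the outer blanks of each 'word' is a simplifying assumption."""
        cleaned = re.sub(r"[^a-z ]", "", natural_text.lower())
        pieces = cleaned.split(preset_letter)
        return [p.strip() for p in pieces if p.strip()]

    def monkey_words(n_chars, alphabet="abcdefghijklmnopqrstuvwxyz",
                     blank_prob=0.2, seed=0):
        """Miller's monkey text: draw characters at random, a blank with the
        assumed probability blank_prob; blanks delimit the 'words'."""
        rng = random.Random(seed)
        chars = [" " if rng.random() < blank_prob else rng.choice(alphabet)
                 for _ in range(n_chars)]
        return "".join(chars).split()

    def rank_frequency(words):
        """(rank, frequency) pairs sorted by decreasing frequency (Zipf plot)."""
        counts = Counter(words)
        ordered = sorted(counts.items(), key=lambda kv: -kv[1])
        return [(rank, freq) for rank, (_, freq) in enumerate(ordered, start=1)]

    def vocabulary_growth(words, step=1000):
        """Vocabulary size V versus text length N (Heaps plot)."""
        seen, points = set(), []
        for n, w in enumerate(words, start=1):
            seen.add(w)
            if n % step == 0:
                points.append((n, len(seen)))
        return points

    if __name__ == "__main__":
        nt = "this is a toy natural text used only to illustrate the procedure " * 50
        ct = chomsky_words(nt)        # Chomsky-randomized 'words'
        mt = monkey_words(len(nt))    # monkey 'words' of comparable length
        print(rank_frequency(ct)[:5])
        print(vocabulary_growth(mt, step=200)[:5])

Fitting straight lines to the double-logarithmic rank–frequency and vocabulary-growth data produced by such a sketch would yield estimates of the exponents α and η referred to in the abstract.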


DOI: http://dx.doi.org/10.30970/eli.9.94
