PROPERTIES OF LEXICAL NETWORKS BUILT ON NATURAL AND RANDOM TEXTS

Oleh Kushnir, A. Drebot, D. Ostrikov, O. Kravchuk

Abstract


We study experimentally and analyze phenomenologically the properties of lexical co-occurrence networks. Our linguistic networks are built on a natural text (NT) and a random text (RT) obtained by randomizing the NT at the lexical level. A further subject of our interest is a random-graph-type (RG) network that has the same number of nodes as the NT network but randomized links among those nodes. We consider unweighted, unrestricted networks in which words are not filtered by their frequency and stop words are not removed.
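The construction described above can be sketched in a few lines of Python. This is a toy illustration, not the authors' actual pipeline: the sample sentence and the adjacency-based edge rule are assumptions (co-occurrence is taken here as immediate adjacency of word tokens), and the RT is produced by shuffling the NT tokens, which preserves word frequencies while destroying word order.

```python
import random

def cooccurrence_network(tokens):
    """Build an unweighted co-occurrence network: nodes are word types,
    edges link words that appear next to each other in the text."""
    graph = {}
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

# A toy 'natural text' (NT); any real corpus would be tokenized similarly.
nt_tokens = "the quick brown fox jumps over the lazy dog the fox sleeps".split()

# The 'random text' (RT): the same tokens shuffled on the lexical level.
rt_tokens = nt_tokens[:]
random.shuffle(rt_tokens)

nt_graph = cooccurrence_network(nt_tokens)
rt_graph = cooccurrence_network(rt_tokens)

print(sorted(nt_graph["the"]))  # neighbours of 'the' in the NT network
```

Because shuffling permutes the same tokens, the NT and RT networks share the same vocabulary (node set); only the edge structure differs between them.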

The main parameters of the above networks are calculated and compared with the lexical statistics obtained for the NT, RT and RG texts. The latter statistics are usually expressed by the so-called Zipf, Pareto and Heaps laws. For this purpose, an additional ‘RG text’ has been built following from the RG network. Both the NT and the RT reveal the well-known power-law word-frequency distributions with heavy tails. On the contrary, the lexical statistics for the RG text are characterized by a nearly logarithmic rank–frequency dependence and a thin-tailed (approximately exponential) frequency distribution.
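The Zipf (rank–frequency) and Heaps (vocabulary growth) statistics mentioned above can be computed with a minimal stdlib sketch. The token sequence below is illustrative only; note that lexical shuffling leaves the word-frequency counts unchanged, so the Zipf statistics of an NT and its RT coincide exactly.

```python
from collections import Counter

def rank_frequency(tokens):
    """Zipf data: word frequencies sorted in decreasing order,
    so that rank 1 corresponds to the most frequent word."""
    return sorted(Counter(tokens).values(), reverse=True)

def vocabulary_growth(tokens):
    """Heaps data: vocabulary size V(t) after the first t tokens."""
    seen, growth = set(), []
    for w in tokens:
        seen.add(w)
        growth.append(len(seen))
    return growth

tokens = "a a a b b c a b c d a a".split()
freqs = rank_frequency(tokens)
growth = vocabulary_growth(tokens)
print(freqs, growth[-1])
```

For a real corpus one would plot `freqs` against rank on log–log axes (a straight line indicates a Zipf power law) and `growth` against token count (a sublinear power law indicates Heaps' law).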

The probability distributions of the node degrees found for the NT and RT networks are close to each other. The degree distributions for the NT and RT networks have heavy (power-law) tails, whereas that for the RG network has a thin (nearly exponential) tail. In other words, the NT and RT networks are scale-free, unlike our RG network. This also implies that a heavy (thin) tail of the degree distribution is a consequence of a heavy (thin) tail of the frequency distribution.
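The empirical degree distribution whose tail is compared above can be obtained directly from a network given as an adjacency mapping. The star-like toy graph below is an assumption made for illustration; in a real lexical network the hub role is played by high-frequency words.

```python
from collections import Counter

def degree_distribution(graph):
    """Empirical degree distribution P(k) of an undirected graph
    represented as {node: set(neighbours)}."""
    degrees = Counter(len(nbrs) for nbrs in graph.values())
    n = len(graph)
    return {k: cnt / n for k, cnt in sorted(degrees.items())}

# A toy hub-dominated graph (symmetric adjacency sets).
g = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub", "d"},
    "d": {"hub", "c"},
}
pk = degree_distribution(g)
print(pk)
```

Plotting `P(k)` versus `k` on log–log axes distinguishes the two regimes discussed above: a straight line signals a power-law (scale-free) tail, while a rapidly decaying curve signals a thin, approximately exponential tail.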

The average clustering coefficients and the average path lengths for the networks built upon the NT and RT are very close to each other. Contrary to the RG network, the NT and RT networks are small worlds, and their Walsh parameters, which quantify the small-worldliness, differ by only 2 per cent. Finally, we analyze a number of consequences of our empirical results and some data known from the literature. In particular, since the RT lacks any semantics or syntax, it would not be proper to associate the scale-free and small-world properties of lexical networks with either semantics or syntax.
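The two small-world ingredients, the average clustering coefficient C and the average shortest-path length L, can be computed with plain breadth-first search; Walsh's small-worldliness measure is then, in one common formulation, the proximity ratio comparing C/L of the network with that of a random graph of the same size. The sketch below computes only C and L for a toy graph (a triangle with a pendant node, an assumed example); the random-graph baseline is omitted for brevity.

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(graph):
    """Average local clustering coefficient C of an undirected graph
    given as {node: set(neighbours)}."""
    coeffs = []
    for node, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        # Count edges among the neighbours of this node.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

def average_path_length(graph):
    """Average shortest-path length L over all connected node pairs (BFS)."""
    total, pairs = 0, 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(d for n, d in dist.items() if n != src)
        pairs += len(dist) - 1
    return total / pairs

# Toy graph: triangle a-b-c plus a pendant node d attached to c.
g = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
C = clustering_coefficient(g)
L = average_path_length(g)
print(C, L)
```

A network is a small world when C greatly exceeds the clustering of a comparable random graph while L remains of the same order as the random-graph path length.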

Key words: natural language processing, linguistic networks, analysis and classification of texts, semantics recognition, random models, keyword detection.


References


  1. Watts D. J., Strogatz S. H. Collective dynamics of ‘small-world’ networks // Nature. – 1998. – Vol. 393. – P. 440–442.
  2. Albert R., Jeong H., Barabási A.-L. Diameter of the world-wide web // Nature. – 1999. – Vol. 401. – P. 130–131.
  3. Barabási A.-L., Albert R. Emergence of scaling in random networks // Science. – 1999. – Vol. 286. – P. 509–512.
  4. Albert R., Barabási A.-L. Statistical mechanics of complex networks // Rev. Mod. Phys. – 2002. – Vol. 74. – P. 47–97.
  5. Newman M. E. J. The structure and function of complex networks // SIAM Rev. – 2003. – Vol. 45. – P. 167–256.
  6. Holovatch Yu., Olemskoi O., von Ferber C., Holovatch T., Mryglod O., Olemskoi I., Palchykov V. Complex networks // J. Phys. Stud. – 2006. – Vol. 10, No. 4. – P. 247–289.
  7. Ferrer i Cancho R., Solé R. V. The small world of human language // Proc. Roy. Soc. Lond. B. – 2001. – Vol. 268. – P. 2261–2265.
  8. Dorogovtsev S. N., Mendes J. F. F. Language as an evolving word web // Proc. Roy. Soc. Lond. B. – 2001. – Vol. 268. – P. 2603–2606.
  9. Matsuo Y., Ohsawa Y., Ishizuka M. KeyWorld: Extracting keywords from a document as a small world / In: Jantke K. P., Shinohara A. (Eds.). Discovery Science, 2001. Lecture Notes in Computer Science, Vol. 2226. – Berlin, Heidelberg: Springer.
  10. Solé R. V., Corominas-Murtra B., Valverde S., Steels L. Language networks: Their structure, function, and evolution // Complexity. – 2010. – Vol. 15. – P. 20–26.
  11. Ferrer i Cancho R., Solé R. V., Köhler R. Patterns in syntactic dependency networks // Phys. Rev. E. – 2004. – Vol. 69. – P. 051915 (8 pp.).
  12. Antiqueira L., Nunes M. G. V., Oliveira Jr. O. N., Costa L. da F. Strong correlations between text quality and complex networks features // Physica A. – 2007. – Vol. 373. – P. 811–820.
  13. Grabska-Gradzińska I., Kulig A., Kwapień J., Drożdż S. Complex network analysis of literary and scientific texts // Int. J. Mod. Phys. C. – 2012. – Vol. 23. – P. 1250051 (15 pp.).
  14. Liu H., Cong J. Language clustering with word co-occurrence networks based on parallel texts // Chin. Sci. Bull. – 2013. – Vol. 58. – P. 1139–1144.
  15. Cong J., Liu H. Approaching human language with complex networks // Phys. Life Rev. – 2014. – Vol. 11. – P. 598–618.
  16. Espitia D., Larralde H. Universal and non-universal text statistics: Clustering coefficient for language identification // Physica A. – 2020. – Vol. 553. – P. 123905 (25 pp.).
  17. Masucci A. P., Rodgers G. J. Network properties of written human language // Phys. Rev. E. – 2006. – Vol. 74. – 026102 (8 pp.).
  18. Caldeira S. M. G., Petit Lobão T. C., Andrade R. F. S., Neme A., Miranda J. G. V. The network of concepts in written texts // Eur. Phys. J. B. – 2006. – Vol. 49. – P. 523–529.
  19. Masucci A. P., Rodgers G. J. Differences between normal and shuffled texts: Structural properties of weighted networks // Adv. Complex Syst. – 2009. – Vol. 12. – P. 113–129.
  20. Amancio D. R., Altmann E. G., Rybski D., Oliveira Jr. O. N., Costa L. da F. Probing the statistical properties of unknown texts: Application to the Voynich manuscript // PLOS ONE. – 2013. – Vol. 8. – e67310 (10 pp.).
  21. Buk S., Krynytskyi Y., Rovenchak A. Properties of autosemantic word networks in Ukrainian texts // Adv. Complex Syst. – 2019. – Vol. 22. – 1950016 (22 pp.).
  22. Ohsawa Y., Benson N. E., Yachida M. KeyGraph: Automatic indexing by co-occurrence graph based on building construction metaphor / In: Proc. IEEE International Forum on Research and Technology Advances in Digital Libraries ADL’98. – Santa Barbara, USA, 1998. – pp. 12–18.
  23. Matsumura N., Ohsawa Y., Ishizuka M. PAI: Automatic indexing for extracting asserted keywords from a document // New Gener. Comput. – 2003. – Vol. 21. – P. 37–47.
  24. Matsuo Y., Ishizuka M. Keyword extraction from a single document using word co-occurrence statistical information // Int. J. Artif. Intell. Tools. – 2004. – Vol. 13. – P. 157–169.
  25. Mihalcea R., Tarau P. TextRank: bringing order into texts / In: Proc. 2004 Conference on Empirical Methods in Natural Language Processing. – pp. 404–411.
  26. Palshikar G. K. Keyword extraction from a single document using centrality measures / In: Pattern Recognition and Machine Intelligence Lecture Notes in Computer Science Ed. by Ghosh A., De R. K., Pal S. K. – Vol. 4815. – Berlin, Heidelberg: Springer-Verlag, 2007. – pp. 503–510.
  27. Rossi R. G., Marcacini R. M., Rezende S. O. Analysis of domain independent statistical keyword extraction methods for incremental clustering // Learning and Nonlinear Models. – 2014. – Vol. 12. – P. 17–37.
  28. Motter A. E., de Moura A. P. S., Lai Ying-Cheng, Dasgupta P. Topology of the conceptual network of language // Phys. Rev. E. – 2002. – Vol. 65. – 065102(R) (4 pp.).
  29. Zanette D. H. Statistical patterns in written language / Centro Atomico Bariloche, 2012. – 87 p. – URL: http://fisica.cab.cnea.gov.ar/estadistica/2te/
  30. Altmann E. G., Gerlach M. Statistical laws in linguistics / In: Proc. Flow Machines Workshop: Creativity and Universality in Language. – Paris, 2014. – arXiv:1502.03296 (2015).
  31. Kushnir O. S., Buryi V. O., Grydzhan S. V., Ivanitskyi L. B., Rykhlyuk S. V. Zipf’s and Heaps’ laws for the natural and some related random texts // Electronics and Information Technologies. – 2018. – Iss. 9. – P. 94–105.
  32. van Leijenhorst D. C., van der Weide Th. P. A formal derivation of Heaps’ law // Inf. Sci. – 2005. – Vol. 170. – P. 263–272.
  33. Lü L., Zhang Z.-K., Zhou T. Zipf’s law leads to Heaps’ law: Analyzing their relation in finite-size systems // PLOS ONE. – 2010. – Vol. 5. – e14139 (11 pp.).
  34. Tria F., Loreto V., Servedio V. D. P. Zipf’s, Heaps’ and Taylor’s laws are determined by the expansion into the adjacent possible // Entropy. – 2018. – Vol. 20. – 752 (19 pp.).
  35. Gerlach M., Altmann E. G. Stochastic model for the vocabulary growth in natural languages // Phys. Rev. X. – 2013. – Vol. 3. – 021006 (10 pp.).
  36. Font-Clos F., Corral A. Log-log convexity of type-token growth in Zipf’s systems // Phys. Rev. Lett. – 2015. – Vol. 114. – 238701 (5 pp.).
  37. Lü L., Zhang Z.-K., Zhou T. Deviation of Zipf’s and Heaps’ laws in human languages with limited dictionary sizes // Sci. Rep. – 2013. – Vol. 3. – 1082 (7 pp.).
  38. Li W. Random texts exhibit Zipf’s-law-like word frequency distribution // IEEE Trans. Inform. Theory. – 1992. – Vol. 38. – P. 1842–1845.
  39. Ferrer-i-Cancho R., Elvevåg B. Random texts do not exhibit the real Zipf’s law-like rank distribution // PLOS ONE. – 2010. – Vol. 5. – e9411 (10 pp.).
  40. Walsh T. Search in a small world / In: Proc. 16th Int. Joint Conf. on Artificial Intelligence IJCAI’99. – 1999. – Vol. 2. – pp. 1172–1177.
  41. Luhn H. P. The automatic creation of literature abstracts // IBM J. Res. Development. – 1958. – Vol. 2. – P. 159–165.
  42. Rovenchak A., Rovenchak O. Quantifying comprehensibility of Christmas and Easter addresses from the Ukrainian Greek Catholic Church hierarchs // Glottometrics. – 2018. – Vol. 41. – P. 57–66.




DOI: http://dx.doi.org/10.30970/eli.28.3
