ANALOGIES BETWEEN IMAGES AND TEXTS: 'CLUSTERIZATION' PHENOMENON IN TEXTS AND DIGITAL IMAGES
Abstract
We develop a number of analogies between texts and images, in which different pixel-brightness levels (gray values) in an image correspond to different linguistic elements of a given level (letters, symbols, words, etc.) in a text. This correspondence is possible only for discrete structures such as digital images. We discuss several practical recipes for converting two-dimensional images into one-dimensional linear chains of pixels (i.e., ‘texts’). In particular, one can consider separate rows (or columns) of an image, or a series of sequential rows (or columns), so that the sequential number of a pixel in the ‘one-dimensional image’ is the analogue of the position of a linguistic element in a text. The advantages and shortcomings of these recipes are analyzed. In addition, we introduce analogues of the statistical-linguistic notions of ‘vocabulary’ and ‘rank dependence’ for images.
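The row-wise and column-wise conversions described above can be sketched as follows. This is only an illustrative sketch: the function names `image_to_text` and `rank_frequency` and the tiny 2×3 test image are assumptions made for the example, not taken from the paper.

```python
from collections import Counter

def image_to_text(image, by="rows"):
    """Flatten a 2-D grid of gray values into a 1-D chain of 'words'.

    by="rows" concatenates sequential rows; by="columns" concatenates
    sequential columns (hypothetical option names).
    """
    if by == "columns":
        image = [list(col) for col in zip(*image)]  # transpose the grid
    return [pixel for line in image for pixel in line]

def rank_frequency(text):
    """(word, count) pairs sorted by decreasing count: the rank dependence."""
    return Counter(text).most_common()

# Toy image with three brightness levels (made-up data).
image = [
    [0, 0, 255],
    [0, 128, 255],
]
text_rows = image_to_text(image, by="rows")
text_cols = image_to_text(image, by="columns")
vocabulary = sorted(set(text_rows))   # distinct brightness levels
ranks = rank_frequency(text_rows)     # most frequent level comes first
```

The two scans produce different ‘texts’ from the same image, which is one reason the choice of recipe matters for any position-based statistics computed afterwards.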
We then clarify how standard statistical-linguistic techniques for detecting keywords in texts can be applied to digital images. In particular, we recall the phenomenon of clustering of words in a text, which is very weak or absent for so-called function words and pronounced for keywords. We introduce a simple clustering parameter R, which is related to the first statistical moments of the probability distribution of the waiting times of a given word in a text and quantifies the strength of clustering for this word. We then analyze this phenomenon in the ‘texts’ that correspond to two different digital images.
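The abstract does not state the formula for R. As a minimal sketch, the code below assumes one plausible choice: the standard deviation of the waiting times (gaps between successive occurrences of a word) divided by their mean. The function names and the two toy sequences are likewise assumptions for illustration.

```python
from statistics import mean, pstdev

def waiting_times(text, word):
    """Distances between successive occurrences of `word` in `text`."""
    positions = [i for i, w in enumerate(text) if w == word]
    return [b - a for a, b in zip(positions, positions[1:])]

def clustering_parameter(text, word):
    """Assumed definition: R = std(waiting times) / mean(waiting times).

    Returns None when the word occurs too rarely to give at least two gaps.
    """
    gaps = waiting_times(text, word)
    if len(gaps) < 2:
        return None
    return pstdev(gaps) / mean(gaps)

# Strictly periodic occurrences: every gap equals 3, so the spread is zero.
periodic = [1, 0, 0] * 10
r_periodic = clustering_parameter(periodic, 1)

# Two tight bursts separated by a long pause: strongly clustered.
clustered = [1] * 5 + [0] * 20 + [1] * 5
r_clustered = clustering_parameter(clustered, 1)
```

Under this assumed definition, evenly spaced occurrences give R = 0, while bursts separated by long pauses push R above 1; randomly placed occurrences are expected to fall near 1.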
Our main result is that the ‘texts’ corresponding to pure white noise and to a simple informative image differ notably in the R-parameters associated with different ‘words’ (i.e., brightness levels). We discuss a possible meaning of ‘keywords’ in an image, which can be associated with some ‘semantic load’ of the corresponding brightness levels. We also suggest checking for long-range correlations among identical brightness levels in an image. This can be done with one of the standard techniques for studying correlations (e.g., fluctuation analysis or detrended fluctuation analysis).
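A correlation check of this kind might look like the sketch below, which implements plain (non-detrended) fluctuation analysis only. It assumes that a brightness level is mapped to a binary indicator sequence (1 where the level occurs, 0 elsewhere); that mapping, the function names, the window sizes and the synthetic four-level noise ‘text’ are all illustrative assumptions.

```python
import math
import random

def fluctuation(indicator, window):
    """RMS displacement of the cumulative profile over `window` steps."""
    mean_x = sum(indicator) / len(indicator)
    profile = [0.0]
    for x in indicator:
        profile.append(profile[-1] + x - mean_x)  # mean-subtracted running sum
    diffs = [profile[i + window] - profile[i]
             for i in range(len(profile) - window)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def scaling_exponent(indicator, windows):
    """Least-squares slope of log F(l) versus log l."""
    xs = [math.log(l) for l in windows]
    ys = [math.log(fluctuation(indicator, l)) for l in windows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic white-noise 'text' with four brightness levels (seeded for repeatability).
rng = random.Random(0)
noise_text = [rng.randrange(4) for _ in range(20000)]
indicator = [1 if w == 0 else 0 for w in noise_text]
alpha = scaling_exponent(indicator, [4, 8, 16, 32, 64])
```

For uncorrelated noise the exponent is expected to lie close to 0.5; values clearly above 0.5 would point to long-range correlations in the positions of the chosen brightness level.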
Key words: statistical linguistics, natural language processing, clustering, keywords, digital image processing.
DOI: http://dx.doi.org/10.30970/eli.17.1