significant difference between the results of the previous runs of the test block and the
current one, he must carry out an expert analysis of the results and confirm or reject the
changes.
NLP. Natural language processing is a field at the intersection of machine learning and
mathematical linguistics that studies methods for the analysis and synthesis of
natural language. Today NLP is used in many areas, including voice assistants,
machine translation and text filtering. The three main areas are Speech Recognition,
Natural Language Understanding and Natural Language Generation.
NLP solves a large set of tasks that can be grouped by level (given in brackets).
Among these tasks are the following: text recognition, speech recognition and
speech synthesis (signal); morphological analysis and normalization (word); POS tagging,
named entity recognition and collocation extraction (phrase); syntactic
parsing and sentence segmentation (sentence); relation extraction, language
identification and sentiment analysis (paragraph); document summarization,
translation and topic analysis (document); deduplication and information retrieval
(corpus).
Text preprocessing converts natural-language text into a format
convenient for further work. Preprocessing consists of several stages, which may
vary depending on the task and implementation. One possible set of
steps is: converting all letters in the text to lower or upper case; removing digits
(numbers) or replacing them with a text equivalent (regular expressions are usually used);
removing punctuation, usually implemented as the removal of characters from a
predefined set; removing extra whitespace; tokenization (usually implemented with
regular expressions); removing stop words; stemming; lemmatization;
vectorization.
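The steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function names `preprocess` and `vectorize`, the tiny English stop-word list and the bag-of-words vectorization are all hypothetical choices, and stemming and lemmatization are left out of the sketch.

```python
import re
import string
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}


def preprocess(text):
    """Return a list of cleaned tokens for the given raw text."""
    # 1. Convert all letters to lower case.
    text = text.lower()
    # 2. Remove digits with a regular expression.
    text = re.sub(r"\d+", "", text)
    # 3. Remove punctuation: delete every character of a predefined set.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 4. Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # 5. Tokenize with a regular expression (runs of word characters).
    tokens = re.findall(r"\w+", text)
    # 6. Remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]


def vectorize(tokens, vocabulary):
    """Bag-of-words vector: count of each vocabulary word in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


tokens = preprocess("The 3 cats, and 2 DOGS!  ")
vector = vectorize(tokens, ["cats", "dogs", "fish"])
```

Here `tokens` holds the two words that survive cleaning, and `vector` counts them against a fixed vocabulary. In practice the order and the set of steps depend on the task, as noted above.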
The number of valid word forms that have similar meanings but differ in
spelling through suffixes, prefixes, endings and so on is very large, which complicates
the creation of dictionaries and further processing. Stemming reduces a
word to its base form. The essence of the approach is to find the stem of the word
by successively cutting parts off its end and its beginning. The
stripping rules for the stemmer are created in advance and are most often regular
expressions. This makes the approach labor-intensive, since adding
another language requires new linguistic research. The second disadvantage of the
approach is the potential loss of information when parts are cut off; for example, we can
lose information about the part of speech.
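A rule-based stemmer of the kind described above can be sketched as follows. The rules here are a hypothetical, deliberately tiny set for English, chosen only for readability; a real stemmer for any language needs a much larger, linguistically motivated rule set, and the names `SUFFIX_RULES`, `PREFIX_RULES`, `MIN_STEM` and `stem` are assumptions of this sketch.

```python
import re

# Hypothetical stripping rules, written as regular expressions and
# applied in order: suffixes first, then prefixes.
SUFFIX_RULES = [
    re.compile(r"(ing|ed|ly)$"),
    re.compile(r"(es|s)$"),
]
PREFIX_RULES = [re.compile(r"^(un|re)")]

# Do not strip if the remaining stem would be shorter than this.
MIN_STEM = 3


def stem(word):
    """Cut known suffixes and prefixes off a word to get its stem."""
    word = word.lower()
    for rule in SUFFIX_RULES + PREFIX_RULES:
        candidate = rule.sub("", word)
        # Accept the cut only if enough of the word is left.
        if len(candidate) >= MIN_STEM:
            word = candidate
    return word


forms = ["jumping", "jumped", "jumps"]
stems = [stem(w) for w in forms]
```

All three forms in `forms` reduce to the same stem, which is the purpose of the method. The same example shows the information loss mentioned above: once "jumping" and "jumps" are both reduced to "jump", the stem no longer tells which grammatical form the original word had.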
Lemmatization is an alternative to stemming. The basic idea is to
reduce the word to its dictionary form, the lemma. For example, for the Ukrainian
language: for nouns - the nominative case, singular; for adjectives - nominative