case, singular number, masculine gender; for verbs, participles, and gerunds, the verb in
the imperfective infinitive.

Most mathematical models operate in high-dimensional vector spaces, so the text must
be mapped into a vector space. The main approach is the bag of words: for each document
a vector is created whose dimension equals the size of the dictionary, with one
dimension assigned to each word; the value written in that dimension is a feature
describing how often the word occurs in the document. The most common method for
calculating this feature is TF-IDF (TF is term frequency, IDF is inverse document
frequency). TF can be computed, for example, as a simple count of the word's occurrences
in the document. IDF is usually calculated as the logarithm of the number of documents
in the corpus divided by the number of documents that contain the word. Thus, if a word
occurs in every document of the corpus, its IDF is zero and it contributes nothing to
any vector. The advantage of the bag of words is its simple implementation, but the
method loses some information, for example, the word order. To reduce this loss, you can
use a bag of N-grams (adding not only words but also word combinations), or use word
vector representation methods, which, for example, can reduce errors on words with the
same spelling but different meanings.
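
The bag-of-words and TF-IDF description above can be sketched in a few lines of Python.
This is a minimal illustration rather than a reference implementation: it assumes
documents are already tokenized, uses the raw word count as TF and log(N / df) as IDF,
and the function names `tfidf_vectors` and `with_bigrams` and the sample documents are
hypothetical.

```python
import math
from collections import Counter


def with_bigrams(tokens):
    """Extend a token list with its word bigrams (a simple bag of N-grams, N <= 2)."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents."""
    # One dimension per distinct word in the corpus.
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    # IDF = log(N / df); it is zero for a word found in every document.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw occurrence counts serve as TF (an assumption)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors


# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Here the word "the" occurs in all three documents, so its weight is zero in every
vector, while a word unique to one document receives the largest IDF. Passing
`with_bigrams(doc)` instead of `doc` would add word pairs as extra dimensions and keep
some of the word-order information.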

Since the number of similar documents in a large corpus can be huge, it is necessary
to get rid of duplicates. Since each document can be represented as a vector, we can
determine how close two documents are by computing the cosine similarity or another
metric. The drawback of this deduplication method is that for large corpora an
exhaustive pairwise comparison of all documents becomes infeasible. To optimize it, you
can use locality-sensitive hashing, which places similar objects close together.
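
A minimal sketch of this idea, assuming random-hyperplane hashing as the
locality-sensitive scheme (the text does not say which scheme is meant): vectors
pointing in similar directions tend to receive the same bit signature, so cosine
similarity is computed only within a bucket. All function names, the number of bits,
and the threshold are illustrative assumptions.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(sum(a * b for a, b in zip(vec, p)) >= 0 for p in planes)


def near_duplicates(vectors, planes, threshold=0.9):
    """Compare only vectors sharing a signature; return index pairs above threshold."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, planes)].append(idx)
    pairs = []
    for ids in buckets.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                i, j = ids[a], ids[b]
                if cosine(vectors[i], vectors[j]) >= threshold:
                    pairs.append((i, j))
    return pairs
```

With a single hash table some true near-duplicates can land in different buckets, so
practical systems usually repeat the procedure with several independent sets of
hyperplanes and merge the candidate pairs.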

Semantic analysis of text is the extraction of semantic relations and the construction
of a semantic representation. In the general case, the semantic representation is a
graph, a semantic network that reflects binary relations between pairs of nodes, which
are the semantic units of the text. The depth of semantic analysis can vary, and in real
systems most often only a syntactic-semantic representation of the text or of individual
sentences is constructed. Semantic analysis is used in sentiment analysis tasks, for
example, to determine automatically whether a review is positive.

Detection of named entities and extraction of relationships. Named entities are
objects from the text that can be assigned to one of the predefined categories (e.g.,
organization, person, address). Identifying mentions of such entities in the text is the
task of named entity recognition. For example, given the sentence
“Stankevich Andrey Sergeevich - winner of the special award of the corporation IBM.”,
we can determine that “Stankevich Andrey Sergeevich” is a person and “IBM” is a company.
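
As a toy illustration of the task's input and output, the example sentence can be
tagged with a simple dictionary (gazetteer) lookup. This is only a sketch under strong
assumptions: real recognizers are typically statistical or rule-based models rather
than exact string matching, and the `GAZETTEER` contents, the label names, and the
function `tag_entities` are hypothetical.

```python
# Hypothetical gazetteer: known entity names mapped to predefined categories.
GAZETTEER = {
    "Stankevich Andrey Sergeevich": "PERSON",
    "IBM": "ORGANIZATION",
}


def tag_entities(text, gazetteer):
    """Return (start, end, surface form, label) for every exact gazetteer match."""
    found = []
    for name, label in gazetteer.items():
        start = text.find(name)
        while start != -1:
            found.append((start, start + len(name), name, label))
            start = text.find(name, start + 1)
    # Sort by position so that entities come out in reading order.
    return sorted(found)


sentence = (
    "Stankevich Andrey Sergeevich - winner of the special award "
    "of the corporation IBM."
)
entities = tag_entities(sentence, GAZETTEER)
```

Each returned tuple gives the character span of a mention together with its category,
which is the usual shape of a named entity recognizer's output.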