Keyword Extraction

Words are the basic building blocks of a sentence. To understand the topics of a conversation, one can look at its most important words. A good keyword extraction model helps solve problems such as:

  • Text summarization, tagging;
  • Indexing and searching;
  • Text categorization.

  • If a text is relevant to several topics, these algorithms extract keywords word by word, whereas it is desirable to represent the keywords of each topic separately.

To address this, ArKeywordExtractor uses a hybrid model that combines unsupervised learning with TF/IDF scores.

    ArKeywordExtractor process steps:

  • Stopwords, HTML code and inappropriate words are filtered out of the given text.
  • Each remaining word in the text is reduced to its root.
  • From the word groups produced by unsupervised clustering, the groups closest to the text are selected dynamically.
  • The words of the text that belong to the selected groups are extracted according to the TF/IDF algorithm.
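
The steps above can be sketched roughly as follows. This is a minimal illustration and not ArKeywordExtractor's actual implementation: the toy word vectors, the word groups, the document frequencies and all function names are hypothetical. A naive plural-stripping rule stands in for real root extraction, only the single closest group is selected, and the score tf * ln(N / df) is one common TF/IDF variant.

```python
import math
import re
from collections import Counter

# Everything below is toy data for illustration (assumption, not from the product).
STOPWORDS = {"the", "a", "is", "and", "of", "to", "in", "was", "by"}

# Toy 2-d word vectors standing in for real embeddings.
VECTORS = {
    "goal": (0.9, 0.1), "match": (0.8, 0.2), "team": (0.85, 0.15),
    "bank": (0.1, 0.9), "loan": (0.2, 0.8), "rate": (0.15, 0.85),
}

# Word groups assumed to have been produced offline by unsupervised clustering.
CLUSTERS = {
    "sports": {"goal", "match", "team"},
    "finance": {"bank", "loan", "rate"},
}

# Document frequencies from a hypothetical background corpus of N_DOCS documents.
N_DOCS = 100
DOC_FREQ = {"goal": 10, "match": 20, "team": 40, "bank": 15, "loan": 5, "rate": 30}


def clean(text):
    """Step 1: strip HTML tags, lowercase, tokenize and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Step 2: naive plural stripping; a real stemmer would replace this."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def mean_vector(words):
    """Average the vectors of the words that have one; None if there are none."""
    vecs = [VECTORS[w] for w in words if w in VECTORS]
    if not vecs:
        return None
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_cluster(words):
    """Step 3: pick the word group whose centroid is closest to the text."""
    text_vec = mean_vector(words)
    if text_vec is None:
        return None
    return max(CLUSTERS, key=lambda name: cosine(text_vec, mean_vector(CLUSTERS[name])))


def extract_keywords(text, top_k=3):
    """Step 4: rank the text's words from the selected group by TF/IDF."""
    words = [stem(t) for t in clean(text)]
    cluster = select_cluster(words)
    if cluster is None:
        return []
    counts = Counter(w for w in words if w in CLUSTERS[cluster])
    scores = {
        w: (c / len(words)) * math.log(N_DOCS / DOC_FREQ.get(w, 1))
        for w, c in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Calling `extract_keywords("<p>The team scored two goals in the match.</p>")` would first select the "sports" group and then rank only the words of the text that fall inside it, so words from unrelated topics never compete for the same keyword slots.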

  • Word embedding, or word vectorization, is an NLP technique that represents words as vectors of real numbers, which makes it possible to measure how similar words are to each other.
    ArKeywordExtractor uses FastText to obtain the word vectors of a text.
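
To make the idea of vector similarity concrete, here is a small sketch that ranks words by cosine similarity, a common similarity measure for word vectors. The three-dimensional vectors and the helper names are made up for illustration. With the `fasttext` Python package, real vectors could instead be obtained from a loaded model via `get_word_vector`; which model and similarity measure ArKeywordExtractor uses internally is not stated here.

```python
import math

# Toy 3-d vectors (assumption); real FastText vectors have far more dimensions.
WORD_VECTORS = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return (other_word, similarity) pairs, most similar first."""
    target = vectors[word]
    pairs = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

With these toy vectors, `most_similar("king", WORD_VECTORS)` places "queen" ahead of "apple", because the first two vectors point in nearly the same direction.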