
capturing the topical sense of similarity in IT-related terms. LSA is definitely not dead and can still work very well despite being a relatively old technique.
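For illustration, the core of such an LSA pipeline can be reproduced in a few lines with gensim (Řehůřek, 2011). The following is only a minimal sketch under our own assumptions: one whitespace-separated tag list per Stack Overflow question, a hypothetical file name, and an arbitrary number of dimensions.

    # Minimal LSA-over-tags sketch with gensim; the file name and
    # dimensionality are illustrative assumptions.
    from gensim.corpora import Dictionary
    from gensim.models import LsiModel

    # One "document" per question: its list of tags,
    # e.g. ["python", "pandas", "dataframe"].
    tag_docs = [line.split() for line in open("question_tags.txt")]

    dictionary = Dictionary(tag_docs)                   # tag <-> id mapping
    corpus = [dictionary.doc2bow(d) for d in tag_docs]  # bag-of-tags vectors

    # Truncated SVD of the tag-document matrix.
    lsi = LsiModel(corpus, id2word=dictionary, num_topics=200)

    # Rows of the left singular vectors serve as tag embeddings.
    term_vectors = lsi.projection.u                     # (num_tags, num_topics)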
The presented embeddings have proven useful in the Query Expansion module of the Teamy.ai candidate retrieval pipeline. However, there are still many places where the method can be improved.
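To make the module's role concrete, its core operation can be sketched as a nearest-neighbour search in the embedding space. The function below is a hedged illustration that reuses the names from the previous sketch; k = 5 is an arbitrary choice, not the production setting.

    import numpy as np

    def expand_term(term, term_vectors, dictionary, k=5):
        # Cosine similarity between the query term and every other tag.
        idx = dictionary.token2id[term]
        vecs = term_vectors / np.linalg.norm(term_vectors, axis=1, keepdims=True)
        sims = vecs @ vecs[idx]
        neighbours = np.argsort(-sims)[1:k + 1]   # skip the term itself
        return [dictionary[int(i)] for i in neighbours]

For example, expand_term("django", term_vectors, dictionary) might add related tags such as "python" to the query; the concrete output depends on the corpus.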
Firstly, we plan to try alternative algorithms for building latent representations of our terms, such as PLSA (Hofmann, 1999), LDA (Blei et al., 2003), or Paragraph2vec (Le and Mikolov, 2014). Paragraph2vec seems particularly promising, since it is designed to reflect the topical notion of similarity better than the traditional word2vec.
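As a rough illustration of this direction, gensim already ships a Paragraph2vec implementation (Doc2Vec). The sketch below reuses the per-question tag lists from the earlier sketch; all hyperparameters are assumptions rather than tuned values.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Reuse the per-question tag lists as "paragraphs".
    documents = [TaggedDocument(words=d, tags=[i])
                 for i, d in enumerate(tag_docs)]

    model = Doc2Vec(documents, vector_size=200, window=5,
                    min_count=2, epochs=20)

    # Tag vectors are trained jointly with the question vectors.
    vec = model.wv["python"]  # assumes the tag occurs at least min_count times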
Another idea is to use more information from Stack Overflow than the question tags alone. Eventually, we want to develop a method for automatically building a Knowledge Graph (KG) of IT-related concepts. The KG could then be used to better capture the dependencies between various entities, which may improve the results of Query Expansion and of candidate retrieval in general.
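One conceivable starting point, given below purely as a sketch, is a tag co-occurrence graph in which edge weights count how often two tags appear on the same question; the networkx-based code is an illustrative assumption, not the method we commit to.

    import itertools
    import networkx as nx

    kg = nx.Graph()
    for d in tag_docs:  # tag lists from the earlier sketches
        for a, b in itertools.combinations(sorted(set(d)), 2):
            w = kg[a][b]["weight"] + 1 if kg.has_edge(a, b) else 1
            kg.add_edge(a, b, weight=w)

    def kg_neighbours(term, k=5):
        # Expansion candidates: the term's strongest neighbours.
        edges = sorted(kg[term].items(), key=lambda e: -e[1]["weight"])
        return [t for t, _ in edges[:k]]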
Finally, we are aware that our gold standard dataset is not very large, but since the Teamy.ai system is still at the development stage, it is the best we can obtain for now. In the future, we want to find a way to extend it. On the one hand, this will give us more reliable accuracy measurements; on the other, it may allow us to try learning-to-rank techniques that require training data.
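To make the latter point concrete, the sketch below shows the kind of learning-to-rank setup such data would enable, here with LightGBM's LambdaMART-style ranker; the features, labels, and query groups are random placeholders standing in for real training data.

    import numpy as np
    from lightgbm import LGBMRanker

    X = np.random.rand(100, 8)              # features of (query, candidate) pairs
    y = np.random.randint(0, 3, size=100)   # graded relevance labels
    group = [10] * 10                       # 10 queries, 10 candidates each

    ranker = LGBMRanker(objective="lambdarank", n_estimators=50)
    ranker.fit(X, y, group=group)

    scores = ranker.predict(X[:10])         # scores for one query's candidates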
ACKNOWLEDGEMENTS
The research has been funded by The National Centre for Research and Development within the project POIR.01.01.01-00-076120 "System sztucznej inteligencji korelujący zespoły pracowników z projektami IT" (an artificial intelligence system correlating teams of employees with IT projects).
REFERENCES
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
Chiche, A. and Yitagesu, B. (2022). Part of speech tagging: a systematic review of deep learning and machine learning approaches. Journal of Big Data, 9(1):1–25.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Deng, L. and Liu, Y. (2018). Deep Learning in Natural Language Processing. Springer.
Guo, J., Fan, Y., Ai, Q., and Croft, W. B. (2016). A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 55–64.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57.
Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338.
Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196. PMLR.
Li, J., Sun, A., Han, J., and Li, C. (2020). A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34(1):50–70.
Liu, T.-Y. et al. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Mitra, B., Craswell, N., et al. (2018). An Introduction to Neural Information Retrieval. Now Publishers, Boston, MA.
Mitra, B., Diaz, F., and Craswell, N. (2017). Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pages 1291–1299.
Mitra, B., Nalisnick, E., Craswell, N., and Caruana, R. (2016). A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Roy, D., Paul, D., Mitra, M., and Garain, U. (2016). Using word embeddings for automatic query expansion. arXiv preprint arXiv:1606.07608.
Sahlgren, M. (2006). The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. PhD thesis, Institutionen för lingvistik.
Řehůřek, R. (2011). Fast and faster: A comparison of two streamed matrix decomposition algorithms.