Recently, a team of third-year students from the Department of Applied Mathematics (PMA) of the Faculty of Applied Mathematics (FPM) at Igor Sikorsky Kyiv Polytechnic Institute achieved strong results at Make Data Count – Find Data References, an international competition in machine learning and natural language processing. The team, called NeoNa and consisting of Mykyta Barkalov, Mykola Bovan, Illia Palienko, and Bohdan Tkach, took 27th place out of 1,282 participating teams and won a silver medal. Danylo Tavrov, head of the Department of Applied Mathematics at the FPM, told the Kyiv Polytechnic correspondent about the students' participation in the competition.

✨✨✨

"Modern applied mathematics, as we understand it at the Department of Applied Mathematics, is not only about formulas and proofs, but also about actively immersing oneself in the world of modern technologies. The educational program for bachelor's degrees at our department is called “Machine Learning and Mathematical Modeling,” which emphasizes that one of our key priorities is the development of machine learning and artificial intelligence. We are convinced that a fundamental mathematical education is the basis for successful work with data, model building, and the creation of new algorithms. An important part of our work is supporting student initiatives. We encourage students to participate in international hackathons and competitions, because it is in such conditions that they gain practical experience, learn teamwork, and acquire the skills to combine academic knowledge with the real challenges of modern science and industry. It is gratifying to note that our students are confidently making their mark on the international stage, demonstrating that a combination of deep mathematical knowledge and an interest in new technologies paves the way to significant results.

– What was the format of the competition and how did it work?

– Anyone could participate in the competition, from students and young researchers to experienced data science professionals. The competition combined academic goals (increasing the transparency and reproducibility of research) with practical machine learning challenges: working with text, PDF, and XML documents, processing “dirty” data, and building classification models. Its goal was not only to test the technical skills of the participants, but also to create open tools that help the scientific community make data more visible and traceable. The competition took place online over three months (June–September 2025). The organizers provided the data and technical instructions; teams uploaded their solutions to the platform, where the accuracy of the models was evaluated automatically.
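
To give a sense of what "finding data references" involves at the most basic level, here is a minimal illustrative sketch in Python. The regular expressions and the sample text are simplifying assumptions made for this article, not the competition's or the team's actual code; real solutions combine such candidate extraction with context-aware classification models.

```python
import re

# Illustrative patterns only: real dataset citations come in many more formats
# (repository accession numbers, handles, URLs), and matching them is just the
# first step before classifying how the dataset is actually used in the paper.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
ACCESSION_PATTERN = re.compile(r"\b(?:GSE|SRR|PRJNA)\d{3,}\b")  # GEO/SRA-style IDs

def find_candidate_references(text: str) -> list[str]:
    """Return raw candidate dataset references found in a chunk of article text."""
    candidates = DOI_PATTERN.findall(text) + ACCESSION_PATTERN.findall(text)
    return list(dict.fromkeys(candidates))  # deduplicate, keep original order

sample = ("The sequencing data are available at https://doi.org/10.5281/zenodo.1234567 "
          "and in GEO under accession GSE12345.")
print(find_candidate_references(sample))
# ['10.5281/zenodo.1234567', 'GSE12345']
```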

Below are the impressions of Mykyta Barkalov, Mykola Bovan, Illia Palienko, and Bohdan Tkach, who formed their own team to participate in the competition:

– We have been interested in machine learning since our first year, have taken part in various hackathons, and achieved good results there. While monitoring competition announcements, we came across Make Data Count and decided to join to gain experience with NLP on scientific data. To coordinate our work, we used Notion for planning and documentation, Discord for real-time communication, and a version control system (Kaggle/repository) to store code and experiments. We held two kinds of online meetings: planning meetings and practical working sessions. Our studies at the KPI Department of Applied Mathematics gave us a strong mathematical foundation, which proved an important advantage in working with models and analyzing results. Thanks to this, we were able to work well as a team and achieve a strong result.

During the competition, we encountered several key problems. First, the data was of poor quality and partly unlabelled: the training set contained many errors, and about 60% of the records had no value for the target variable.
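
A toy illustration of the kind of check this involves is shown below; the column names and values are made up for this article, not taken from the actual competition files.

```python
import pandas as pd

# Hypothetical columns for illustration; the real competition data is organized
# differently and is much larger.
train = pd.DataFrame({
    "article_id": ["a1", "a2", "a3", "a4", "a5"],
    "dataset_id": ["10.5281/zenodo.1", None, None, "GSE100", None],  # target variable
})

missing_share = train["dataset_id"].isna().mean()
print(f"Records without a target value: {missing_share:.0%}")  # 60% in this toy example

# One straightforward option: train supervised models only on the labelled rows
labelled = train.dropna(subset=["dataset_id"])
print(len(labelled), "labelled records remain")
```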

Second, when training and ensembling large language models, we ran into GPU memory limits even on the powerful graphics cards available in the Kaggle cloud environment. And third, many of the teams that ranked above us relied heavily on open data without deeper analysis of the context in which a reference appears. Although this yielded good metrics on the public data, in our view such an approach does not generalize well to new data and does not fully meet the goals of the competition.
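
For context on the GPU memory issue, here is a generic sketch of two standard workarounds, gradient accumulation and mixed-precision training, written against a tiny stand-in model. It illustrates common techniques under assumed settings and is not the team's actual training code.

```python
import torch
from torch import nn

# Two standard ways to fit training into limited GPU memory: small micro-batches
# with gradient accumulation, and mixed-precision autocast. The tiny model below
# is a stand-in; competition solutions used much larger language models.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

accumulation_steps = 8  # effective batch = micro-batch size * accumulation_steps

for step in range(accumulation_steps):
    features = torch.randn(4, 768, device=device)   # micro-batch of 4 examples
    labels = torch.randint(0, 2, (4,), device=device)
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = loss_fn(model(features), labels) / accumulation_steps
    scaler.scale(loss).backward()                    # gradients accumulate across steps

scaler.step(optimizer)   # one optimizer update for the whole effective batch
scaler.update()
optimizer.zero_grad()
```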

Participating in the competition was interesting and informative, although not easy. We gained valuable experience in working with “dirty” scientific data, building NLP pipelines, and organizing teamwork. We plan to continue participating in similar competitions to improve our skills and methods. We would like to thank the organizers of the Make Data Count initiative and everyone who supported us during the competition.

Prepared by Volodymyr Shkolny
based on information from Danylo Tavrov
