Group projects for the Data Engineering course taught by Professor Paolo Merialdo at Roma Tre University.
-
Project 1 - Scraping & Data extraction:
Downloading scientific papers from arXiv (in HTML format) and extracting table information from them using XPath expressions.
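
A minimal sketch of the XPath-based table extraction, not the project's actual code. It assumes well-formed markup and a hypothetical `<figure>`/`<figcaption>`/`<table>` layout; real arXiv HTML is likely to need an HTML-tolerant parser (for example lxml) and different paths. `SAMPLE` and `extract_tables` are illustrative names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded paper.
SAMPLE = """
<html><body>
  <figure id="T1">
    <figcaption>Table 1: Results</figcaption>
    <table>
      <tr><th>Model</th><th>F1</th></tr>
      <tr><td>A</td><td>0.81</td></tr>
    </table>
  </figure>
</body></html>
"""


def extract_tables(markup):
    """Return one dict per table with its id, caption and cell text."""
    root = ET.fromstring(markup.strip())
    tables = []
    # ElementTree supports a subset of XPath; ".//figure" finds all descendants.
    for figure in root.findall(".//figure"):
        table = figure.find("./table")
        if table is None:
            continue  # a figure without a table (e.g. an image)
        caption = figure.findtext("./figcaption", default="").strip()
        rows = [
            ["".join(cell.itertext()).strip() for cell in row.findall("./*")]
            for row in table.findall(".//tr")
        ]
        tables.append({"id": figure.get("id"), "caption": caption, "rows": rows})
    return tables


if __name__ == "__main__":
    for t in extract_tables(SAMPLE):
        print(t["id"], t["caption"], t["rows"])
```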
-
Project 2 - Paper Search Engine:
Search engine for the scientific papers collected in the previous project.
Server built with Apache Lucene and Spring Boot (Java).
Client built with TypeScript and Bootstrap.
-
Project 3 - Table Search Engine + Semantic Search:
Continuation of Project 2 that adds a table search engine. Semantic search is evaluated with different models (e.g., BERT, All MiniLM v2) and different embedding methods.
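
The ranking step of a semantic search can be sketched as below, assuming the query and each table have already been embedded by some model. The three-dimensional vectors and the names `cosine` and `rank` are toy placeholders, not the project's embeddings or API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, best match first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy embeddings; a real system would obtain these from a model.
    docs = {"t1": [1.0, 0.0, 0.0], "t2": [0.9, 0.1, 0.0], "t3": [0.0, 1.0, 0.0]}
    print(rank([1.0, 0.0, 0.0], docs))
```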
-
Project 4 - Table data extraction and understanding:
Knowledge extraction from HTML tables (from Project 1) using LLMs.
-
Project 5 - Data Integration of Companies:
Integration of company data from 16 different sources. The mediated schema is built by using LLMs to generate field descriptions, computing embeddings of those descriptions, and clustering them with HDBSCAN. Blocking uses Locality Sensitive Hashing (LSH) over word and bi-gram vectors. The final pairwise matching step is implemented with three different approaches: DITTO, DeepMatcher and Jaro-Winkler.
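
The blocking and the simplest of the three matchers can be sketched as follows. This is an illustration under assumptions, not the project's implementation: it uses MinHash with banding over character bi-grams as the LSH scheme (the project's exact LSH family and vectorization are not stated here), and every function name and parameter value is hypothetical. DITTO and DeepMatcher are not covered.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def bigrams(name):
    """Character bi-grams of a normalized company name."""
    s = name.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)} or {s}


def minhash(tokens, num_hashes=32):
    """MinHash signature: for each seed, the smallest hash over all tokens."""
    return [
        min(int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(records, num_hashes=32, bands=16):
    """Records sharing at least one signature band become candidate pairs."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rec_id, name in records.items():
        sig = minhash(bigrams(name), num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(rec_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


if __name__ == "__main__":
    records = {"a": "Acme Corp", "b": "acme corp", "c": "Zebra Ltd"}
    for x, y in sorted(lsh_candidate_pairs(records)):
        score = jaro_winkler(records[x].lower(), records[y].lower())
        print(x, y, round(score, 3))
```

Only the pairs that survive blocking are scored, which is the point of the blocking step: it avoids comparing every record against every other one. The number of hashes and bands trades recall against the number of candidates and would need tuning on real data.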