
Projects for the course Data Engineering held by professor Paolo Merialdo at Roma Tre University.


Xhst/data-engineering-projects


Data Engineering Projects

Group projects for the course Data Engineering held by professor Paolo Merialdo at Roma Tre University.

  • Project 1 - Scraping & Data extraction:

    Downloading scientific papers from arXiv (in HTML format) and extracting information about their tables using XPath expressions.

  • Project 2 - Paper Search Engine:

    Search engine over the scientific papers extracted in the previous project.

    Server built with Apache Lucene and Spring Boot (Java).

    Client built with TypeScript and Bootstrap.

  • Project 3 - Table Search Engine + Semantic Search:

    Continuation of Project 2 that adds a table search engine, plus semantic search with an evaluation of different models (e.g., BERT, All MiniLM v2) and different embedding methods.

  • Project 4 - Table data extraction and understanding:

    Knowledge extraction from HTML tables (from Project 1) using LLMs.

  • Project 5 - Data Integration of Companies:

    Integration of company data from 16 different sources. The mediated schema is built by using LLMs to generate field descriptions, computing embeddings of those descriptions, and clustering the embeddings with HDBSCAN. Blocking uses Locality-Sensitive Hashing (LSH) over word and bi-gram vectors. The final pairwise-matching step is implemented with three different approaches: DITTO, DeepMatcher, and Jaro-Winkler.
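
The XPath-based table extraction in Project 1 can be illustrated with a minimal sketch. It is not the project's actual code: the sample markup, the `ltx_table` class name, and the `extract_tables` helper are assumptions made for illustration. It uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which only works on well-formed markup; real arXiv pages would likely need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; real paper pages are far larger.
SAMPLE_XHTML = """<html><body>
<figure class="ltx_figure" id="S1.F1">
  <figcaption>Figure 1: Overview</figcaption>
</figure>
<figure class="ltx_table" id="S1.T1">
  <figcaption>Table 1: Results</figcaption>
  <table>
    <tr><th>Model</th><th>F1</th></tr>
    <tr><td>A</td><td>0.9</td></tr>
  </table>
</figure>
</body></html>"""


def extract_tables(xhtml):
    """Return id, caption, and cell text for every table figure."""
    root = ET.fromstring(xhtml)
    tables = []
    # Assumption: table figures carry exactly class="ltx_table".
    # The predicate is an exact attribute match, not a class-list test.
    for fig in root.findall(".//figure[@class='ltx_table']"):
        caption = fig.find("./figcaption")
        caption_text = "" if caption is None else "".join(caption.itertext()).strip()
        rows = []
        for tr in fig.findall(".//tr"):
            cells = [
                "".join(cell.itertext()).strip()
                for cell in tr
                if cell.tag in ("th", "td")
            ]
            rows.append(cells)
        tables.append({"id": fig.get("id"), "caption": caption_text, "rows": rows})
    return tables


tables = extract_tables(SAMPLE_XHTML)
```

Here `tables` holds one dictionary per table figure, so the caption and cell contents can be stored alongside the paper for later indexing.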
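
The LSH blocking step in Project 5 can be sketched roughly as follows. This is a generic MinHash-with-banding illustration, not the project's implementation: the choice of MinHash, the use of character bi-grams, the normalization, and the `num_perm` and `bands` parameters are all assumptions.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def tokens(name):
    """Word tokens plus character bi-grams of a normalized company name."""
    norm = re.sub(r"[^a-z0-9 ]", "", name.lower())
    words = norm.split()
    joined = " ".join(words)
    bigrams = {joined[i:i + 2] for i in range(len(joined) - 1)}
    return set(words) | bigrams


def _hash(seed, token):
    # Seeded 64-bit hash; stands in for one random permutation.
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(token_set, num_perm=32):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return [min(_hash(seed, t) for t in token_set) for seed in range(num_perm)]


def candidate_pairs(names, num_perm=32, bands=8):
    """Return index pairs whose signatures agree on at least one band."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for idx, name in enumerate(names):
        toks = tokens(name)
        if not toks:  # nothing to hash for empty names
            continue
        sig = minhash(toks, num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


names = ["Apple Inc.", "apple inc", "Zebra Technologies"]
pairs = candidate_pairs(names)
```

Only the pairs returned by blocking need to be passed to the more expensive pairwise matchers, which is the point of the step. More bands with fewer rows each make the blocking more permissive, fewer bands make it stricter.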
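
Of the three pairwise-matching approaches in Project 5, Jaro-Winkler is the only one that is a plain string similarity rather than a learned model, so it fits in a short sketch. The version below follows the standard definition (prefix scale 0.1, prefix capped at 4 characters); how the project applies it to company records, and with what threshold, is not stated in the source.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


score = jaro_winkler("MARTHA", "MARHTA")  # textbook pair, about 0.961
```

In a matching pipeline, a candidate pair would typically be accepted when the score exceeds a chosen threshold; that threshold is a tuning decision and is not given here.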