The Quora Question Pairs dataset is used to identify whether two questions asked on Quora are duplicates. This is a classic natural language processing (NLP) problem: detecting the same intent behind different wordings, which helps improve question-answering systems.
The dataset contains the following files:
File Name | Description |
---|---|
train.csv.zip | Training dataset (contains question pairs and labels) |
test.csv.zip | Test dataset (without labels, used for evaluation) |
sample_submission.csv.zip | Sample format for submission |
Each row in the dataset represents a pair of questions with the following columns:
Column Name | Description |
---|---|
id | Unique identifier for the row |
qid1 | Unique ID for question 1 |
qid2 | Unique ID for question 2 |
question1 | First question in the pair |
question2 | Second question in the pair |
is_duplicate | Label (target variable): 1 if the questions are duplicates, 0 otherwise |
- Total Rows: 404,290
- Duplicate Pairs: ~37% (rows with is_duplicate = 1)
- Unique Questions: 537,933
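
The figures above can be recomputed from the training file. The following is a minimal sketch using only the standard library; the helper name `summarize_pairs` and the three-row sample are hypothetical, and counting unique questions by distinct `qid1`/`qid2` values is an assumption about how that figure was derived.

```python
import csv
import io

def summarize_pairs(csv_file):
    """Return (total rows, duplicate share, distinct question IDs) for a pairs CSV."""
    total = 0
    duplicates = 0
    question_ids = set()
    for row in csv.DictReader(csv_file):
        total += 1
        duplicates += int(row["is_duplicate"])
        # Assumption: a question is identified by its qid, not by its text.
        question_ids.update((row["qid1"], row["qid2"]))
    share = duplicates / total if total else 0.0
    return total, share, len(question_ids)

# Hypothetical three-row sample in the same column layout as train.csv.
sample = io.StringIO(
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    "0,1,2,How do I learn Python?,What is the best way to learn Python?,1\n"
    "1,3,4,What is TF-IDF?,How tall is Mount Everest?,0\n"
    "2,1,5,How do I learn Python?,How do I learn Java?,0\n"
)
total, share, unique = summarize_pairs(sample)
print(f"rows={total} duplicate_share={share:.2f} unique_questions={unique}")
```

For the real data, the same function could be given `open("train.csv", newline="", encoding="utf-8")` once the archive has been unzipped.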
The dataset is part of the Quora Question Pairs competition on Kaggle.
The TF-IDF (Term Frequency-Inverse Document Frequency) score for a word W in a document D is computed as:

$$\text{TF-IDF}(W, D) = \text{TF}(W, D) \times \text{IDF}(W)$$

Where:
- TF (Term Frequency) = How often word W appears in D.
- IDF (Inverse Document Frequency) = Measures how rare W is across all documents.

If a word appears in almost every document, its IDF score is low → Less Important

If a word is unique to a few documents, its IDF score is high → More Important
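
The source does not specify how IDF is calculated. A common choice, assumed here rather than taken from the source, is the logarithm of the total number of documents N (a symbol introduced for illustration) divided by df(W), the number of documents that contain W:

```latex
\mathrm{IDF}(W) = \log\frac{N}{\mathrm{df}(W)}
```

With N = 3 and the natural logarithm, a word found in all three documents gets log(3/3) = 0, while a word found in only one document gets log(3/1) ≈ 1.10, which matches the low and high cases described above.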
Consider three example documents:

1. "The movie was amazing and had great cinematography."
2. "The cinematography and plot twist were Oscar-worthy!"
3. "I love this movie, but the ending was bad."

Word | TF-IDF Score | Importance |
---|---|---|
cinematography | High | Important (rare, specific to some documents) |
plot twist | High | Important (key phrase in only one document) |
movie | Low | Less important (common; appears in most documents) |
the, was, and | Very Low | Stopwords, common in all text |
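
To make the table concrete, the sketch below scores a few words from the three example documents. It assumes one particular variant (relative term frequency multiplied by the natural log of N / df); the `tokenize` and `tf_idf` helpers are illustrative names, and other libraries may weight terms differently.

```python
import math
import re
from collections import Counter

docs = [
    "The movie was amazing and had great cinematography.",
    "The cinematography and plot twist were Oscar-worthy!",
    "I love this movie, but the ending was bad.",
]

def tokenize(text):
    # Lowercase and keep runs of letters, apostrophes, and hyphens.
    return re.findall(r"[a-z'-]+", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    # Assumed variant: relative term frequency times log(N / df).
    if df[word] == 0:
        return 0.0  # word never seen in the corpus
    tf = tokens.count(word) / len(tokens)
    return tf * math.log(n_docs / df[word])

for word in ("the", "movie", "cinematography", "twist"):
    scores = [round(tf_idf(word, tokens), 3) for tokens in tokenized]
    print(f"{word:15s} df={df[word]} scores={scores}")
```

In a corpus this small, "movie" and "cinematography" each occur in two of the three documents, so this variant gives them the same IDF. The qualitative ranking in the table presumably reflects a larger collection of reviews in which "movie" is far more common than "cinematography".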
TF-IDF improves text representation by reducing the impact of common words and giving more weight to distinctive ones.
This matters in NLP tasks such as text classification, document similarity, and search.
- Question Deduplication: Helps in reducing redundant questions in Q&A platforms.
- Semantic Text Similarity: Improves chatbot and search engine performance.
- NLP Model Training: Can be used to train models for text similarity tasks.
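
As an illustration of the deduplication use case, the sketch below compares question pairs by the cosine similarity of their TF-IDF vectors. It is a simple lexical baseline under stated assumptions, not the method used in the competition: the example questions are hypothetical, the helper names are illustrative, and the IDF is the smoothed form log((1 + N) / (1 + df)) + 1 so that words shared by every question still carry some weight.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_idf(texts):
    """Smoothed IDF per word: log((1 + N) / (1 + df)) + 1."""
    n = len(texts)
    df = Counter(w for t in texts for w in set(tokenize(t)))
    return {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}

def vectorize(text, idf):
    # Sparse vector: raw term count times IDF, ignoring unseen words.
    counts = Counter(tokenize(text))
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical question pairs, not taken from the dataset.
pairs = [
    ("How can I learn Python quickly?", "What is the fastest way to learn Python?"),
    ("How can I learn Python quickly?", "Why is the sky blue?"),
]
idf = fit_idf([q for pair in pairs for q in pair])
for q1, q2 in pairs:
    sim = cosine(vectorize(q1, idf), vectorize(q2, idf))
    print(f"{sim:.2f}  {q1!r} vs {q2!r}")
```

A threshold on this similarity could serve as a first duplicate detector, although word overlap alone will miss paraphrases that share intent but few words.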
Note: This dataset is provided by Quora and is publicly available for research and learning purposes.