Commit 8f4a8e2

committed
added code
1 parent 5a327af commit 8f4a8e2

10 files changed, +242 -21 lines changed

.DS_Store

0 Bytes
Binary file not shown.

README.md

Lines changed: 44 additions & 21 deletions
@@ -4,35 +4,58 @@ rr# oneAPI-GenAI-Hackathon-2023 - Hack2Skill
 #### Problem Statement - AI-Enhanced Legal Practice Platform
 #### Team Leader Email - [email protected]
 
-
-
-Welcome to the official repository for the oneAPI-GenAI-Hackathon-2023 organized by Hack2Skill!
-
-## Getting Started
-
-To get started with the oneAPI-GenAI-Hackathon-2023 repository, follow these steps:
-
-### Submission Instruction:
-1. Fork this repository
-2. Create a folder with your Team Name
-3. Upload all the code and necessary files in the created folder
-4. Upload a **README.md** file in your folder with the below mentioned informations.
-5. Generate a Pull Request with your Team Name. (Example: submission-XYZ_team)
-
-### README.md must consist of the following information:
-
-#### Team Name -
-#### Problem Statement -
-#### Team Leader Email -
+### Overview
+This project uses government digital land record data to streamline the search for a property's ownership trail. By combining Intel's oneAPI with artificial intelligence and asynchronous programming, we improve the speed, accuracy, and relevance of the data needed to trace property ownership, which makes ownership verification simpler, more efficient, and more reliable.
+
+The primary objective of this project is to replace traditional methods of property verification. We download year-wise property data, check its relevance and accuracy, and use Intel's oneAPI and artificial intelligence to present the data in the formats required for property verification.
 
 ### A Brief of the Prototype:
-This section must include UML Diagrams and prototype description
+The app is available at https://app.bhume.in/
+
+Lawyers use this tool to automatically download property registry data from the government website and then filter for the property of interest based on the property schedule, which contains the khasra no., survey no., plot no., and other fields.
 
+Valuers use this tool to extract sale instances of properties near their area of interest.
+
+
 ### Tech Stack:
 List Down all technologies used to Build the prototype
+We use a mix of React, Python, Django, and PostgreSQL, along with libraries such as Selenium and scikit-learn and fine-tuned LLMs from OpenAI, to build and run the app. Data is scraped and stored for each request at runtime. The downloaded data is filtered by document type and then fed into the LLMs one row at a time to extract the information that identifies a property precisely. For each row in the dataset, we check whether the entry might be relevant to the property of interest. All relevant rows are then shown to the user.
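The filter-then-extract flow described in the Tech Stack paragraph can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the record keys (`doc_type`, `schedule`), the `find_relevant_rows` helper, and the `toy_extractor` that stands in for the LLM call are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def normalize(value: str) -> str:
    """Lowercase and drop whitespace so '12 / 3' and '12/3' compare equal."""
    return "".join(value.lower().split())


def find_relevant_rows(
    rows: Iterable[Record],
    allowed_doc_types: set,
    extract_fields: Callable[[str], Record],
    target: Record,
) -> List[Record]:
    """Filter rows by document type, then keep rows whose extracted
    schedule fields match every field given for the target property."""
    relevant = []
    for row in rows:
        if row["doc_type"] not in allowed_doc_types:
            continue  # step 1: cheap filter before any extraction
        fields = extract_fields(row["schedule"])  # step 2: stand-in for the LLM call
        if all(
            normalize(fields.get(key, "")) == normalize(wanted)
            for key, wanted in target.items()
        ):
            relevant.append(row)  # step 3: keep rows that might be relevant
    return relevant


def toy_extractor(schedule: str) -> Record:
    """Hypothetical extractor: parses 'key: value; key: value' text."""
    pairs = (part.split(":", 1) for part in schedule.split(";") if ":" in part)
    return {key.strip(): value.strip() for key, value in pairs}


rows = [
    {"doc_type": "sale deed", "schedule": "khasra_no: 12/3; plot_no: 45"},
    {"doc_type": "lease", "schedule": "khasra_no: 12/3; plot_no: 45"},
    {"doc_type": "sale deed", "schedule": "khasra_no: 99; plot_no: 7"},
]
matches = find_relevant_rows(rows, {"sale deed"}, toy_extractor, {"khasra_no": "12 / 3"})
print(len(matches))
```

Filtering on document type first keeps the more expensive extraction step (an LLM call in the real app) off rows that could never match.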
 
 ### Step-by-Step Code Execution Instructions:
 This Section must contain a set of instructions required to clone and run the prototype so that it can be tested and deeply analyzed
+Go to https://app.bhume.in/ and use the app as the prototype.
+
+### Step-by-Step Finetuning
+Use the following commands to set up and activate the conda environment:
+```bash
+conda create -n venv python==3.8.10
+conda activate venv
+pip install -r requirements.txt
+```
+
+Set the environment variable to select the Intel AMX ISA:
+```bash
+export ONEDNN_MAX_CPU_ISA="AVX512_CORE_AMX"
+```
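Before exporting `ONEDNN_MAX_CPU_ISA`, it may help to confirm that the machine actually reports AMX. The sketch below is an assumption-laden illustration, not part of this commit: it parses `/proc/cpuinfo`-style text (Linux only) for the `amx_tile` and `amx_bf16` flags and sets the variable only when both are present. The `amx_available` helper and the sample text are hypothetical.

```python
import os


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def amx_available(cpuinfo_text: str) -> bool:
    """True when the CPU reports both AMX tile and AMX bf16 support."""
    return {"amx_tile", "amx_bf16"} <= cpu_flags(cpuinfo_text)


# Sample text for illustration; on Linux, read open("/proc/cpuinfo").read() instead.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx512f amx_bf16 amx_tile amx_int8\n"

if amx_available(sample):
    # Set this before the framework that uses oneDNN is imported or first used.
    os.environ["ONEDNN_MAX_CPU_ISA"] = "AVX512_CORE_AMX"

print(os.environ.get("ONEDNN_MAX_CPU_ISA"))
```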
+
+Preprocessing
+Prepare the dataset using preprocess.py:
+```python preprocess.py```
+
+Finetuning
+Run the command:
+```python falcon-tune.py --bf16 True --use_ipex True --max_seq_length 512```
+
+Inference
+```python falcon-tuned-inference.py --checkpoints <PATH-TO-CHECKPOINT> --max_length 200 --top_k 10```
+
 
 ### Future Scope:
 Write about the scalability and futuristic aspects of the prototype developed
+
+Property disputes account for 70% of all civil court cases in India. Proper due diligence before a transaction can help a person avoid legal issues. The current barrier to due diligence is the need to clean, process, and analyze the data for each property in the short time while a deal is being negotiated.
+
+Future scope of work:
+1. Integrate other legal documents pertaining to property ownership (depth)
+2. Expand to other states
+3. Simplify legal documents so that a layman can understand them

falcon-tune.py

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
# falcon-tune.py
import argparse
import time

from datasets import load_dataset
from trl import SFTTrainer
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments)


def str2bool(value):
    # argparse's type=bool treats any non-empty string (even "False") as True,
    # so the flag text is parsed explicitly instead.
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def main(FLAGS):

    # dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
    # The original called load_and_process_json(), which is not defined in this file.
    # Loading the file with the datasets JSON loader is an assumption; each record
    # needs a "text" field to match dataset_text_field below.
    dataset = load_dataset("json", data_files="path_to_your_file.json", split="train")

    model_name = "tiiuae/falcon-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

    print('setting training arguments')

    training_arguments = TrainingArguments(
        output_dir="./results",
        bf16=FLAGS.bf16,  # change for CPU
        use_ipex=FLAGS.use_ipex,  # change for CPU IPEX
        no_cuda=True,  # train on CPU only
        fp16_full_eval=False,
    )

    print('Creating SFTTrainer')

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=FLAGS.max_seq_length,
        tokenizer=tokenizer,
        args=training_arguments,
        packing=True,
    )

    print('Starting Training')
    start = time.time()

    trainer.train()

    total = time.time() - start
    print(f'Time to tune {total}')


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('-bf16',
                        '--bf16',
                        type=str2bool,
                        default=True,
                        help="activate mixed precision training with bf16")
    parser.add_argument('-ipex',
                        '--use_ipex',
                        type=str2bool,
                        default=True,
                        help="use Intel Extension for PyTorch (IPEX) when available")
    parser.add_argument('-msq',
                        '--max_seq_length',
                        type=int,
                        default=512,
                        help="maximum sequence length used when packing the dataset")

    FLAGS = parser.parse_args()
    main(FLAGS)
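The trainer above reads training text from a `"text"` field (`dataset_text_field="text"`), so the JSON file needs one such field per record. `preprocess.py` is not shown in this diff, so the following is only a sketch of what that step might produce. The `prompt` and `response` input keys, the Human/Assistant template, and the `write_jsonl` helper are assumptions made for illustration.

```python
import json
import os
import tempfile


def to_text_record(example: dict) -> dict:
    """Join a hypothetical prompt/response pair into one 'text' field."""
    text = f"### Human: {example['prompt']}### Assistant: {example['response']}"
    return {"text": text}


def write_jsonl(examples, path):
    """Write one JSON object per line, each with a single 'text' field."""
    with open(path, "w", encoding="utf-8") as handle:
        for example in examples:
            handle.write(json.dumps(to_text_record(example), ensure_ascii=False) + "\n")


examples = [
    {"prompt": "Extract the khasra no. from: Khasra 12/3, Plot 45", "response": "12/3"},
    {"prompt": "Extract the plot no. from: Khasra 12/3, Plot 45", "response": "45"},
]

# Write to a temporary file and read it back to check the shape of each record.
out_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(examples, out_path)
with open(out_path, encoding="utf-8") as handle:
    records = [json.loads(line) for line in handle]
print(records[0]["text"])
```

A JSON Lines file of this shape is one format the `datasets` JSON loader accepts.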

falcon-tuned-inference.py

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# falcon-tuned-inference.py

import argparse
import time

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer


def main(FLAGS):

    model = AutoModelForCausalLM.from_pretrained(FLAGS.checkpoints, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(FLAGS.checkpoints, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token

    generator = transformers.pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )

    # Read prompts until the user types "stop".
    while True:
        user_input = input("Provide Input to tuned falcon: ")
        if user_input == "stop":
            break

        start = time.time()

        sequences = generator(
            f" {user_input}",
            max_length=FLAGS.max_length,
            do_sample=False,  # greedy decoding; top_k only takes effect when sampling
            top_k=FLAGS.top_k,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,
        )

        inference_time = time.time() - start

        for seq in sequences:
            print(f"Result: {seq['generated_text']}")

        print(f'Total Inference Time: {inference_time} seconds')


if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument('-c',
                        '--checkpoints',
                        type=str,
                        required=True,  # from_pretrained cannot load from None
                        help="path to model checkpoint files")
    parser.add_argument('-ml',
                        '--max_length',
                        type=int,
                        default=200,
                        help="used to control the maximum length of the generated text in text generation tasks")
    parser.add_argument('-tk',
                        '--top_k',
                        type=int,
                        default=10,
                        help="specifies the number of highest probability tokens to consider at each step")

    FLAGS = parser.parse_args()
    main(FLAGS)
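The `--top_k` flag restricts sampling to the k most likely next tokens. Because the script passes `do_sample=False`, decoding is greedy and the flag has no effect unless sampling is turned on. The idea behind top-k filtering can be sketched in plain Python as below. This illustrates the technique only and is not the library's implementation; `top_k_probs` is a hypothetical helper.

```python
import math


def top_k_probs(logits, k):
    """Keep the k largest logits and renormalize them with a softmax."""
    if k <= 0 or k >= len(logits):
        kept = list(range(len(logits)))  # no filtering
    else:
        kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in kept)  # subtract the max for numerical stability
    weights = {i: math.exp(logits[i] - peak) for i in kept}
    total = sum(weights.values())
    return {i: weight / total for i, weight in weights.items()}


# Four candidate tokens; only the two most likely survive with k=2.
probs = top_k_probs([2.0, 1.0, 0.5, -1.0], k=2)
print(sorted(probs))
```

A sampler would then draw the next token from `probs`, while greedy decoding always takes the single highest-scoring token.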
