Aegis 6406 asm2vec pytorch magics #20

Open
wants to merge 70 commits into
base: master
Commits
9180e19
TRIVIAL - added req files and fix bug
jamienutter Oct 2, 2023
76cf8d8
Merge pull request #1 from wandera/TRIVIAL
jamienutter Oct 2, 2023
9b8decb
Create SECURITY.md
jamienutter Oct 2, 2023
7e659f6
TRIVIAL - init
jamienutter Oct 2, 2023
20df9cc
[Jenkins] Set version to 1.0.1
Oct 2, 2023
5be2ef8
AEGIS-6405 datatype PEP8
ilektragiassa Oct 2, 2023
2833c60
AEGIS-6405 PEP8 model.py
ilektragiassa Oct 2, 2023
5d03534
AEGIS-6405 PEP8 utils.py
ilektragiassa Oct 2, 2023
62bafc3
AEGIS-6405 Create binary_to_assembly.py
ilektragiassa Oct 2, 2023
5dea443
AEGIS-6405 Delete scripts/bin2asm.py
ilektragiassa Oct 2, 2023
988f430
AEGIS-6405 Rename disassembling.py to binary_to_asm.py
ilektragiassa Oct 2, 2023
ec58db1
AEGIS-6405 Update __init__.py
ilektragiassa Oct 2, 2023
18cd90c
AEGIS-6405 Update asm2vec/utils.py - JN review
ilektragiassa Oct 2, 2023
6632b19
AEGIS-6405 remove class
ilektragiassa Oct 2, 2023
5497a20
Merge pull request #2 from wandera/AEGIS-6405-asm2vec-pytorch-edits
ilektragiassa Oct 2, 2023
98f9868
AEGIS-6406 Create train.py
ilektragiassa Oct 2, 2023
74be99e
AEGIS-6406 Update __init__.py
ilektragiassa Oct 2, 2023
2755c99
AEGIS-6406 Update __init__.py
ilektragiassa Oct 2, 2023
26d492c
AEGIS-6406 Create tensors.py
ilektragiassa Oct 2, 2023
36d29bb
AEGIS-6405 pass magic bytes as variable
ilektragiassa Oct 3, 2023
f47abd0
AEGIS-6405 fixing logging
ilektragiassa Oct 3, 2023
9c3cf83
AEGIS-6406 Update - JN review
ilektragiassa Oct 3, 2023
9d0ea0f
AEGIS-6406 fix package import, args types
ilektragiassa Oct 3, 2023
045ea32
AEGIS-6406 args types, function return
ilektragiassa Oct 3, 2023
70acd34
AEGIS-6406 remove import
ilektragiassa Oct 3, 2023
8453b40
AEGIS-6405 magic bytes as list of strings
ilektragiassa Oct 3, 2023
36eda60
AEGIS-6405 add more magic bytes for MacOS
ilektragiassa Oct 3, 2023
249db63
Merge pull request #4 from wandera/AEGIS-6405-trivial-fix
ilektragiassa Oct 3, 2023
96a8a00
AEGIS-6406 migrate utils.py to train.py
ilektragiassa Oct 3, 2023
a3bc3d0
AEGIS-6406 remove utils
ilektragiassa Oct 3, 2023
b73b939
AEGIS-6406 fix imports to account for moving utils.py to train.py
ilektragiassa Oct 3, 2023
102c392
Merge branch 'master' into AEGIS-6406-asm2vec-pytorch-retrain
jamienutter Oct 3, 2023
d878ce9
Merge pull request #3 from wandera/AEGIS-6406-asm2vec-pytorch-retrain
ilektragiassa Oct 3, 2023
bd8bcd7
AEGIS-6406 Delete asm2vec/utils.py
ilektragiassa Oct 3, 2023
91bfc90
[Jenkins] Set version to 1.0.2
Oct 3, 2023
8751b19
AEGIS-6405 Create test_binary_to_asm.py
ilektragiassa Oct 3, 2023
0a990f9
AEGIS-6405 Create __init__.py
ilektragiassa Oct 3, 2023
991f9ce
AEGIS-6405 Create sample_binary
ilektragiassa Oct 3, 2023
85a8b95
AEGIS-6405 upload test binary
ilektragiassa Oct 3, 2023
38cf710
AEGIS-6405 Delete asm2vec/data/sample_binary
ilektragiassa Oct 3, 2023
6f73e51
AEGIS-6405 Delete asm2vec/data directory
ilektragiassa Oct 3, 2023
d9f3f99
AEGIS-6405 Create sample_binary
ilektragiassa Oct 3, 2023
383dfed
Add files via upload
ilektragiassa Oct 3, 2023
b057394
AEGIS-6405 Delete data/5cca32eb8f9c2a024a57ce12e3fb66070662de80
ilektragiassa Oct 3, 2023
047e418
AEGIS-6405 add sample binary
ilektragiassa Oct 3, 2023
d4af35d
AEGIS-6405 fix path
ilektragiassa Oct 3, 2023
fb3b507
AEGIS-6405 Delete data/sample_binary
ilektragiassa Oct 3, 2023
df44a2f
AEGIS-6405 - test fix
jamienutter Oct 4, 2023
7beb705
AEGIS-6405 - r2env
jamienutter Oct 4, 2023
ba5f408
AEGIS-6405 - radar2 install
jamienutter Oct 4, 2023
9775f0c
AEGIS-6405 - radar2 test
jamienutter Oct 4, 2023
e8735fe
AEGIS-6405 - radare2 test 2
jamienutter Oct 4, 2023
3a2db77
AEGIS-6405 - radare2 test 3
jamienutter Oct 4, 2023
d36eced
AEGIS-6405 - setup arch
jamienutter Oct 4, 2023
852ce92
Merge pull request #5 from wandera/AEGIS-6405--test-binary-to-asm
jamienutter Oct 4, 2023
4ef137e
Merge branch 'master' into AEGIS-6406-fix-scripts
jamienutter Oct 5, 2023
7ab939b
AEGIS-6406 - moved scripts
jamienutter Oct 5, 2023
50fe6fc
Merge pull request #6 from wandera/AEGIS-6406-fix-scripts
jamienutter Oct 5, 2023
8f572c3
TRIVIAL - doc strings
jamienutter Oct 5, 2023
1cf86db
[Jenkins] Set version to 1.0.3
Oct 5, 2023
9d794e2
AEGIS-6406 rename "test" mode to "update"
ilektragiassa Oct 23, 2023
2fec9d1
AEGIS-6406 add docstring, change "test" mode to "update" mode
ilektragiassa Oct 23, 2023
3c9833c
AEGIS-6406 add docstring, set mode to "update"
ilektragiassa Oct 23, 2023
df23bc6
Merge pull request #7 from wandera/AEGIS-6406-asm2vec-pytorch-minor-f…
ilektragiassa Oct 23, 2023
2a8433a
AEGIS-6406 change mode from "test" to "update"
ilektragiassa Oct 23, 2023
208340c
Merge pull request #8 from wandera/AEGIS-6406-asm2vec-pytorch-minor
ilektragiassa Oct 23, 2023
45e10f7
AEGIS-6406 add identation
ilektragiassa Oct 25, 2023
8d1b419
AEGIS-6406 fix function_count
ilektragiassa Oct 25, 2023
53e5a6e
Merge pull request #9 from wandera/AEGIS-6406-asm2vec-pytorch-fix
ilektragiassa Oct 25, 2023
90c9f99
AEGIS-6406 add magic bytes
ilektragiassa Oct 25, 2023
8 changes: 8 additions & 0 deletions .idea/.gitignore

Some generated files are not rendered by default.

32 changes: 32 additions & 0 deletions CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# This is a comment.
# Each line is a file pattern followed by one or more owners.

# These owners will be the default owners for everything in
# the repo. Unless a later match takes precedence,
# @global-owner1 and @global-owner2 will be requested for
# review when someone opens a pull request.
* @wandera/datascience

# Order is important; the last matching pattern takes the most
# precedence. When someone opens a pull request that only
# modifies JS files, only @js-owner and not the global
# owner(s) will be requested for a review.
# *.js @js-owner

# You can also use email addresses if you prefer. They'll be
# used to look up users just like we do for commit author
# emails.
#*.go docs@example.com

# The `docs/*` pattern will match files like
# `docs/getting-started.md` but not further nested files like
# `docs/build-app/troubleshooting.md`.
# docs/* docs@example.com

# In this example, @octocat owns any file in an apps directory
# anywhere in your repository.
# apps/ @octocat

# In this example, @doctocat owns any file in the `/docs`
# directory in the root of your repository.
# /docs/ @doctocat
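
The "last matching pattern takes the most precedence" rule described in the CODEOWNERS comments can be illustrated with a short sketch. This is a simplified model, not GitHub's implementation: it uses Python's `fnmatchcase` as a stand-in for the real gitignore-style matching, so it only behaves sensibly for simple patterns such as `*` and `*.js`. The `RULES` list and the `owners_for` helper are hypothetical names introduced for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical rules in file order: (pattern, owners).
# Assumption: fnmatch-style globbing approximates CODEOWNERS matching
# only for simple patterns like the two below.
RULES = [
    ("*", ["@wandera/datascience"]),
    ("*.js", ["@js-owner"]),
]


def owners_for(path, rules=RULES):
    """Return the owners of the last rule whose pattern matches `path`."""
    owners = []
    for pattern, rule_owners in rules:
        if fnmatchcase(path, pattern):
            owners = rule_owners  # a later match overrides an earlier one
    return owners


print(owners_for("src/app.js"))
print(owners_for("README.md"))
```

In this PR only the `* @wandera/datascience` rule is active (the `*.js` line is commented out in the file), so every path would resolve to the data science team.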
13 changes: 13 additions & 0 deletions Dockerfile
@@ -0,0 +1,13 @@
FROM python:3.10.11-slim

ADD . /asm2vec-pytorch
WORKDIR asm2vec-pytorch

RUN apt-get update && apt-get install -y --no-install-recommends \
    unixodbc-dev \
    unixodbc \
    libpq-dev && \
    pip install -r requirements.txt && \
    python setup.py install

CMD ["/bin/sh"]
166 changes: 16 additions & 150 deletions README.md
@@ -1,6 +1,6 @@
# asm2vec-pytorch

<a><img alt="release 1.0.0" src="https://img.shields.io/badge/release-v1.0.0-yellow?style=for-the-badge"></a>
<a><img alt="release 1.0.3" src="https://img.shields.io/badge/release-v1.0.0-yellow?style=for-the-badge"></a>
<a><img alt="mit" src="https://img.shields.io/badge/license-MIT-brightgreen?style=for-the-badge"></a>
<a><img alt="python" src="https://img.shields.io/badge/-python-9cf?style=for-the-badge&logo=python"></a>

@@ -9,30 +9,17 @@ The details of the model can be found in the original paper: [(sp'19) Asm2Vec: B

## Requirements

python >= 3.6

| packages | for |
| --- | --- |
| r2pipe | `scripts/bin2asm.py` |
| click | `scripts/*` |
| torch | almost all code need it |

You also need to install `radare2` to run `scripts/bin2asm.py`. `r2pipe` is just the python interface to `radare2`

If you only want to use the library code, you just need to install `torch`
* python >= 3.10
* radare2
* Packages listed in `requirements.txt`

## Install

```
pip install -r requirements.txt &&
python setup.py install
```

or

```
pip install git+https://github.com/oalieno/asm2vec-pytorch.git
```

## Benchmark

An implementation already exists here: [Lancern/asm2vec](https://github.com/Lancern/asm2vec)
@@ -46,141 +33,20 @@ Following is the benchmark of training 1000 functions in 1 epoch.

## Get Started

```bash
python scripts/bin2asm.py -i /bin/ -o asm/
```

First generate asm files from binarys under `/bin/`.
You can hit `Ctrl+C` anytime when there is enough data.

```bash
python scripts/train.py -i asm/ -l 100 -o model.pt --epochs 100
```

Try to train the model using only 100 functions and 100 epochs for a taste.
Then you can use more data if you want.

```bash
python scripts/test.py -i asm/123456 -m model.pt
```

After you train your model, try to grab an assembly function and see the result.
This script will show you how the model perform.
Once you satisfied, you can take out the embedding vector of the function and do whatever you want with it.
### TODO - update this with description about to how use etc

## Usage
## Tests

### bin2asm.py
### Run test suite

```
Usage: bin2asm.py [OPTIONS]
* Run all tests: ``python -m unittest discover -v``
* Run a certain module's tests: ``python -m unittest -v test.test_binary_to_asm``
* Run a certain test class: ``python -m unittest -v test.test_binary_to_asm.TestBinaryToAsm``
* Run a certain test method:

Extract assembly functions from binary executable
``python -m unittest -v test.test_binary_to_asm.TestBinaryToAsm.test_sha3``

Options:
-i, --input TEXT input directory / file [required]
-o, --output TEXT output directory
-l, --len INTEGER ignore assembly code with instructions amount smaller
than minlen
### Coverage

--help Show this message and exit.
```

```bash
# Example
python bin2asm.py -i /bin/ -o asm/
```

### train.py

```
Usage: train.py [OPTIONS]

Options:
-i, --input TEXT training data folder [required]
-o, --output TEXT output model path [default: model.pt]
-m, --model TEXT load previous trained model path
-l, --limit INTEGER limit the number of functions to be loaded
-d, --ebedding-dimension INTEGER
embedding dimension [default: 100]
-b, --batch-size INTEGER batch size [default: 1024]
-e, --epochs INTEGER training epochs [default: 10]
-n, --neg-sample-num INTEGER negative sampling amount [default: 25]
-a, --calculate-accuracy whether calculate accuracy ( will be
significantly slower )

-c, --device TEXT hardware device to be used: cpu / cuda /
auto [default: auto]

-lr, --learning-rate FLOAT learning rate [default: 0.02]
--help Show this message and exit.
```

```bash
# Example
python train.py -i asm/ -o model.pt --epochs 100
```

### test.py

```
Usage: test.py [OPTIONS]

Options:
-i, --input TEXT target function [required]
-m, --model TEXT model path [required]
-e, --epochs INTEGER training epochs [default: 10]
-n, --neg-sample-num INTEGER negative sampling amount [default: 25]
-l, --limit INTEGER limit the amount of output probability result
-c, --device TEXT hardware device to be used: cpu / cuda / auto
[default: auto]

-lr, --learning-rate FLOAT learning rate [default: 0.02]
-p, --pretty pretty print table [default: False]
--help Show this message and exit.
```

```bash
# Example
python test.py -i asm/123456 -m model.pt
```

```
┌──────────────────────────────────────────┐
│ endbr64 │
│ ➔ push r15 │
│ push r14 │
├────────┬─────────────────────────────────┤
│ 34.68% │ [rdx + rsi*CONST + CONST] │
│ 20.29% │ push │
│ 16.22% │ r15 │
│ 04.36% │ r14 │
│ 03.55% │ r11d │
└────────┴─────────────────────────────────┘
```

### compare.py

```
Usage: compare.py [OPTIONS]

Options:
-i1, --input1 TEXT target function 1 [required]
-i2, --input2 TEXT target function 2 [required]
-m, --model TEXT model path [required]
-e, --epochs INTEGER training epochs [default: 10]
-c, --device TEXT hardware device to be used: cpu / cuda / auto
[default: auto]

-lr, --learning-rate FLOAT learning rate [default: 0.02]
--help Show this message and exit.
```

```bash
# Example
python compare.py -i1 asm/123456 -i2 asm/654321 -m model.pt -e 30
```

```
cosine similarity : 0.873684
```
* Create report: ``coverage run -m unittest discover -v``
* Read report: ``coverage report -m``
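
For readers unfamiliar with `unittest`, the "run a certain test method" command above selects one method of one `TestCase` class. The following is a minimal sketch of the same idea done programmatically. The `TestDigest` class and the `run_single` helper are hypothetical and are not taken from this repository's `test` package; the test body simply checks a standard-library SHA3 digest length.

```python
import hashlib
import unittest


class TestDigest(unittest.TestCase):
    """Hypothetical stand-in for a class such as TestBinaryToAsm."""

    def test_digest_length(self):
        # SHA3-256 always produces a 32-byte digest.
        self.assertEqual(len(hashlib.sha3_256(b"sample").digest()), 32)


def run_single(test_case_cls, method_name):
    """Run exactly one test method and return the unittest result object."""
    suite = unittest.TestSuite([test_case_cls(method_name)])
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    outcome = run_single(TestDigest, "test_digest_length")
    print("passed:", outcome.wasSuccessful())
```

Running the same selection under `coverage run -m unittest ...` would also record coverage data for it, assuming the `coverage` package is installed.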
26 changes: 26 additions & 0 deletions SECURITY.md
@@ -0,0 +1,26 @@
Thanks for helping make GitHub safe for everyone.

# Security

Jamf takes the security of our software products and services seriously, including all of the open source code repositories managed through our GitHub organizations, such as asm2vec-pytorch.

We will ensure that your finding gets passed along to the appropriate maintainers for remediation.

# Reporting Security Issues

If you believe you have found a security vulnerability in any Jamf-owned repository, please report it to us through coordinated disclosure.

Please do not report security vulnerabilities through public GitHub issues, discussions, or pull requests.

Instead, please send an email to info[@]jamf.com.

Please include as much of the information listed below as you can to help us better understand and resolve the issue:
- The type of issue (e.g., buffer overflow, SQL injection, or cross-site scripting)
- Full paths of source file(s) related to the manifestation of the issue
- The location of the affected source code (tag/branch/commit or direct URL)
- Any special configuration required to reproduce the issue
- Step-by-step instructions to reproduce the issue
- Proof-of-concept or exploit code (if possible)
- Impact of the issue, including how an attacker might exploit the issue

This information will help us triage your report more quickly.
11 changes: 7 additions & 4 deletions asm2vec/__init__.py
@@ -1,6 +1,9 @@
import importlib
import os

__all__ = ['model', 'datatype', 'utils']
__home__ = os.path.dirname(os.path.abspath(__path__[0]))
__data__ = os.path.join(__home__, "data")

for module in __all__:
    importlib.import_module(f'.{module}', 'asm2vec')
__all__ = [
    "__data__", "__home__", "binary_to_asm", "data", "datatype", "model", "similarity", "tensors", "test", "train",
    "utilities", "version"
]
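
To make the new `__home__` and `__data__` values concrete, here is a small sketch of the same path arithmetic applied to an explicit directory instead of `__path__[0]`. The `package_dirs` helper and the `/opt/repo/asm2vec` path are hypothetical and used only for illustration.

```python
import os


def package_dirs(package_path):
    """Mirror the __home__/__data__ computation for a given package directory."""
    # __home__ is the parent of the package directory.
    home = os.path.dirname(os.path.abspath(package_path))
    # __data__ is a "data" directory that sits next to the package.
    data = os.path.join(home, "data")
    return home, data


home, data = package_dirs("/opt/repo/asm2vec")
print(home)
print(data)
```

One consequence worth checking in review: `__data__` resolves relative to wherever the package directory lives, so for an installed copy it would point at a `data` directory beside the installed package rather than at the repository checkout.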