MemorySim is an RTL-native, timing-accurate memory simulator designed for the Chisel/Chipyard ecosystem. It provides cycle-accurate profiling of memory subsystems, enabling hardware designers to evaluate bandwidth, latency, and power-performance trade-offs in next-generation AI accelerators.
- RTL-Level Fidelity: Implements bank-level finite-state machines (FSMs) and a comprehensive DRAM timing model entirely in hardware for bit-true data correctness.
- Seamless Integration: Compatible with Chisel and Verilog-based designs; easily embedded into Chipyard and FireSim flows for FPGA-accelerated emulation.
- Cycle-Accurate DRAM Model: Supports key JEDEC timing parameters (e.g., tRCD, tRP, tRFC) with closed-page policy and self-refresh modes.
- Backpressure Analysis: Centralized request queue with multi-dequeue support to study the impact of queue depth on latency and throughput.
- Trace-Driven and Standalone Modes: Run isolated trace-based experiments or co-simulate with full-system benchmarks.
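The backpressure behavior of a bounded request queue with multi-dequeue can be sketched in software. This is a minimal Python model for illustration only, not MemorySim's Chisel implementation; the class name `RequestQueue` and its methods are hypothetical.

```python
from collections import deque


class RequestQueue:
    """Bounded FIFO with ready/valid-style backpressure (illustrative model)."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def ready(self):
        # The producer may enqueue only while there is free space.
        return len(self.items) < self.depth

    def enqueue(self, req):
        # Signal backpressure (False) instead of dropping or overwriting.
        if not self.ready():
            return False
        self.items.append(req)
        return True

    def dequeue_up_to(self, n):
        # Multi-dequeue: hand out up to n requests in a single step.
        out = []
        while self.items and len(out) < n:
            out.append(self.items.popleft())
        return out


q = RequestQueue(depth=2)
accepted = [q.enqueue(r) for r in ("r0", "r1", "r2")]  # third request is refused
drained = q.dequeue_up_to(4)                            # drains what is available
print(accepted, drained)
```

With a depth of 2, the third enqueue is refused until the consumer drains the queue, which is the stall a trace frontend would observe.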
- Top-Level Interface
  - Frontend accepts memory trace requests (`addr`, `cycle`) and enqueues into `reqQueue`.
- Memory Controller
  - Splits requests by rank and bank, dispatches to bank schedulers, and aggregates responses in `respQueue`.
- Bank Scheduler
  - Enforces closed-page policy, manages ACTIVATE–READ/WRITE–PRECHARGE handshakes, and handles refresh/self-refresh states.
- DRAM Timing Model
  - Tracks timing constraints (e.g., tRCD, tRP, tRFC) and issues acknowledgments after parameterized delays.
- Physical Channel Hierarchy
  - Models channels, ranks, bank groups, and banks with round-robin arbitration for responses.
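The flow above (split by rank and bank, then ACTIVATE, READ/WRITE, PRECHARGE under a closed-page policy) can be approximated with a small behavioral model. The sketch below is in Python rather than Chisel and makes several assumptions: the address map in `decode`, the `t_cas` parameter, and all cycle counts are placeholders, and PRECHARGE is simplified to start once data is ready. Refresh and self-refresh are omitted.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    # Placeholder cycle counts for illustration; not taken from a JEDEC
    # specification or from MemorySim's configuration.
    t_rcd: int = 14  # ACTIVATE -> READ/WRITE
    t_cas: int = 14  # READ/WRITE -> data ready (assumed parameter)
    t_rp: int = 14   # PRECHARGE -> bank idle again


def decode(addr, num_ranks=2, num_banks=8):
    """Hypothetical address map: low bits pick the bank, next bits the rank."""
    bank = addr % num_banks
    rank = (addr // num_banks) % num_ranks
    return rank, bank


class Bank:
    """Closed-page bank: every access is ACTIVATE, READ/WRITE, PRECHARGE."""

    def __init__(self, timing):
        self.timing = timing
        self.free_at = 0  # first cycle at which a new ACTIVATE is allowed

    def access(self, issue_cycle):
        t = self.timing
        activate = max(issue_cycle, self.free_at)  # wait for the bank to idle
        column_cmd = activate + t.t_rcd            # row is open after tRCD
        data_ready = column_cmd + t.t_cas          # response can be acknowledged
        self.free_at = data_ready + t.t_rp         # row is closed after tRP
        return data_ready


timing = Timing()
banks = {}
results = []
for addr, cycle in [(0, 0), (8, 1), (16, 2)]:
    key = decode(addr)
    if key not in banks:
        banks[key] = Bank(timing)
    results.append(banks[key].access(cycle))
print(results)
```

In this trace, addresses 0 and 16 map to the same rank and bank, so the third request waits for the first one's precharge to finish, while address 8 lands in a different rank and proceeds in parallel.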
We evaluated MemorySim against DRAMSim3 using four microbenchmarks: `conv2d.c`, `multihead_attention.c`, `trace_example.c`, and `vector_similarity.c`. Key findings:
- Read/Write Overhead
  - Average read penalty: ~111 cycles
  - Average write penalty: ~125 cycles
- Latency vs. Queue Depth
  - Exponential latency growth with larger `reqQueue` sizes; sub-80 cycles at queue size 2, >250 cycles at size 1024.
- Throughput–Latency Trade-off
  - Smaller queues reduce latency but can starve bank schedulers, lowering overall requests served.
For detailed metrics, refer to the results section of the paper.
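One way to build intuition for the throughput–latency trade-off is a back-of-envelope estimate based on Little's law (average latency = occupancy / throughput). The Python sketch below assumes a saturating request stream, a fixed per-request bank service time, and a queue that stays full; the function name and all numbers are illustrative, and the model is not calibrated to the measurements reported above.

```python
def saturated_estimate(queue_depth, num_banks, service_cycles):
    """Rough throughput and latency under a saturating request stream."""
    # At most one in-flight request per bank, and no more than the queue holds.
    busy_banks = min(queue_depth, num_banks)
    throughput = busy_banks / service_cycles  # requests per cycle
    # Little's law with a full queue: W = L / lambda.
    latency = queue_depth * service_cycles / busy_banks
    return throughput, latency


estimates = {}
for depth in (2, 8, 128):
    estimates[depth] = saturated_estimate(depth, num_banks=8, service_cycles=40)
    print(depth, estimates[depth])
```

In this toy model, a queue shallower than the number of banks keeps latency at the service time but leaves banks idle, while a deeper queue saturates the banks and adds waiting time for every extra entry.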
git clone https://github.com/AnshKetchum/hbm-controller.git
cd hbm-controller
# Build Chisel project
sbt compile
# Trace-driven standalone mode
sbt "runMain memsim.TraceRunner --trace-file traces/conv2d.trace --queue-size 128"
# Co-simulation within Chipyard/FireSim
cd chipyard
./scripts/generate-configs.sh
make hbm-sim-project
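To repeat the trace-driven run across several queue sizes, a small driver script can build the same `sbt` invocation shown above. This is a hypothetical helper, not part of the repository; it reuses only the `--trace-file` and `--queue-size` flags from the command above, and the sweep points are arbitrary. By default it only prints the commands.

```python
import subprocess


def runner_command(trace_file, queue_size):
    # Mirrors the trace-driven invocation above; sbt receives the whole
    # runMain line as a single argument.
    main = (
        "runMain memsim.TraceRunner "
        f"--trace-file {trace_file} --queue-size {queue_size}"
    )
    return ["sbt", main]


def sweep(trace_file, queue_sizes, dry_run=True):
    """Build (and optionally run) one TraceRunner command per queue size."""
    commands = [runner_command(trace_file, q) for q in queue_sizes]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Must be run from the repository root, after `sbt compile`.
            subprocess.run(cmd, check=True)
    return commands


cmds = sweep("traces/conv2d.trace", [2, 128, 1024])
```

Passing `dry_run=False` executes each command in turn and stops at the first failure.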
- `conv2d.c`: 2D convolution kernel
- `multihead_attention.c`: Toy attention workload
- `trace_example.c`: Basic sequence validation
- `vector_similarity.c`: Cosine similarity search

Trace generators and scripts are located in `benchmarks/`.
If you use MemorySim in your research, please cite:
@inproceedings{chaurasia2025mem,
title={MemorySim: An RTL-level, timing accurate simulator model for the Chisel ecosystem},
author={Chaurasia, Ansh},
booktitle={},
year={2025},
pages={1--8}
}
Thanks to Professor Christopher Fletcher, Professor Sagar Karandikar, and Ph.D Tianrui Wei for their invaluable guidance through the process.