
Alert Email Module

Costas Yiallourides edited this page Apr 15, 2025 · 1 revision

Overview

The alert_email.py module in the frequenz-lib-notebooks repository builds HTML-based email alerts from tabular data. It is designed for notebooks or backend services that programmatically generate alert summaries, visualisations, and configuration links and send them to stakeholders.

Its flexible formatting, interactive plotting, and export capabilities make it well suited to monitoring microgrid component behavior, notifying teams about abnormal states, and providing detailed breakdowns of state transitions and errors/warnings per microgrid and/or microgrid component.

Features

  • 📬 HTML Email Generation: Build full email bodies from raw alert data, including summary and detail views.
  • 🧠 Severity Sorting and Filtering: Sort and prioritise alerts based on type (error, warning, state), and exclude empty groups.
  • 🎨 Color-Coded HTML Tables: Alert rows are styled by severity (e.g., red for errors, orange for warnings) to aid quick scanning.
  • 📊 Interactive Plots: Visualise alert frequency and state transitions over time with Plotly, and export to various formats (HTML, PNG, etc.).
  • ⚙️ Configurable Output: Through the AlertEmailConfig and ExportOptions dataclasses, developers can easily customise grouping, table size, sorting, and output paths.
  • 🧾 Structured JSON Output: Use generate_alert_json() to convert alert data into a grouped dictionary format, suitable for APIs, logging pipelines, dashboard ingestion, or integration testing.
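
The severity sorting and color-coded rows described above can be illustrated with a small standard-library sketch. This is not the module's implementation: the names SEVERITY_RANK, ROW_COLORS, sort_by_severity and render_rows are hypothetical, and the hex values and the color used for "state" rows are assumptions.

```python
from html import escape

# Assumed ordering: a lower rank means a more severe alert. The type names come
# from the feature list; the numeric ranks are illustrative.
SEVERITY_RANK = {"error": 0, "warning": 1, "state": 2}

# Red for errors and orange for warnings follow the description above; the
# exact hex values and the white background for "state" rows are assumptions.
ROW_COLORS = {"error": "#f8d7da", "warning": "#ffe5b4", "state": "#ffffff"}


def sort_by_severity(records):
    """Return records ordered error -> warning -> state (stable sort)."""
    unknown_rank = len(SEVERITY_RANK)  # unknown types sort last
    return sorted(
        records, key=lambda r: SEVERITY_RANK.get(r["state_type"], unknown_rank)
    )


def render_rows(records):
    """Render each record as a <tr> whose background reflects its severity."""
    rows = []
    for rec in records:
        color = ROW_COLORS.get(rec["state_type"], "#ffffff")
        cells = "".join(f"<td>{escape(str(v))}</td>" for v in rec.values())
        rows.append(f'<tr style="background-color: {color};">{cells}</tr>')
    return "\n".join(rows)


alerts = [
    {"microgrid_id": 2, "component_id": 1, "state_type": "state", "state_value": "DISCHARGING"},
    {"microgrid_id": 1, "component_id": 1, "state_type": "error", "state_value": "UNDERVOLTAGE"},
]
ordered = sort_by_severity(alerts)
html_rows = render_rows(ordered)
```

Inline styles are used here because many email clients ignore external or embedded stylesheets; whether the module takes the same approach is not stated on this page.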

Example Usage

Generate and preview an HTML-formatted alert email

In this example, the table contents are also sorted by severity (sort_by_severity=True).

import pandas as pd
from frequenz.lib.notebooks.alerts.alert_email import AlertEmailConfig, generate_alert_email

def example():
    # Example alert records dataframe
    alert_records = pd.DataFrame(
        [
            {
                "microgrid_id": 1,
                "component_id": 1,
                "state_type": "error",
                "state_value": "UNDERVOLTAGE",
                "start_time": "2025-03-14 15:06:30",
                "end_time": "2025-03-14 17:00:00",
            },
            {
                "microgrid_id": 2,
                "component_id": 1,
                "state_type": "state",
                "state_value": "DISCHARGING",
                "start_time": "2025-03-14 15:06:30",
                "end_time": None,
            },
        ]
    )

    # Configuration for email generation
    alert_email_config = AlertEmailConfig(
        notebook_url="http://alerts.example.com",
        displayed_rows=10,
        sort_by_severity=True,
    )

    # Generate the HTML body and return it so the caller can preview or send it
    html_email = generate_alert_email(alert_records=alert_records, config=alert_email_config)
    return html_email
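
One way to preview the result outside a notebook is to write the HTML to a temporary file and open it in a browser. The sketch below assumes that generate_alert_email returns the HTML body as a string; save_preview is a hypothetical helper and is not part of the module.

```python
import tempfile
import webbrowser
from pathlib import Path


def save_preview(html_body: str, open_in_browser: bool = False) -> Path:
    """Write the HTML body to a temporary .html file and optionally open it."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(html_body)
    path = Path(handle.name)
    if open_in_browser:
        # Opens the file in the system's default browser.
        webbrowser.open(path.as_uri())
    return path


# Stand-in for the string assumed to be returned by generate_alert_email().
preview_path = save_preview("<html><body><p>Alert preview</p></body></html>")
```

Inside Jupyter, passing the same string to IPython.display.HTML renders it inline instead.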

Visualise alert summaries and export them to a file

from frequenz.lib.notebooks.alerts.alert_email import plot_alerts, ExportOptions

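# `alerts` is assumed to be a DataFrame of alert records with the same schema
# as `alert_records` in the previous example.
export_config = ExportOptions(format=["png", "html"], output_dir="./exports", show=False)
plot_alerts(alerts, plot_type="all", export_options=export_config)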

Considerations

  • The module assumes alerts follow a consistent schema (microgrid_id, component_id, state_type, state_value, start_time, etc.).
  • Charts use Plotly and require appropriate renderers for interactive environments (like Jupyter).
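
Because the module assumes a consistent schema, a quick check before generating an email can catch missing columns early. The following is a minimal sketch, not something the module is known to provide: REQUIRED_COLUMNS contains only the column names listed above, and missing_columns and check_schema are hypothetical helpers.

```python
# Column names taken from the schema listed above. The "etc." in that list is
# left open, so any further required columns are not covered here.
REQUIRED_COLUMNS = (
    "microgrid_id",
    "component_id",
    "state_type",
    "state_value",
    "start_time",
)


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in declared order."""
    present = set(columns)
    return [name for name in required if name not in present]


def check_schema(columns):
    """Raise ValueError if any required column is missing."""
    missing = missing_columns(columns)
    if missing:
        raise ValueError(f"alert records are missing columns: {', '.join(missing)}")


# A complete schema passes silently.
check_schema(
    ["microgrid_id", "component_id", "state_type", "state_value", "start_time", "end_time"]
)
# An incomplete one reports exactly what is absent.
incomplete = missing_columns(["microgrid_id", "state_type"])
```

With a pandas DataFrame, the column names can be passed directly, for example check_schema(alert_records.columns).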

Future Integrations/Features

  1. Markdown/Plaintext Email Fallback: Add support for generating fallback plaintext or markdown emails for systems that do not support HTML rendering (e.g., some alerting platforms or command-line tools).
  2. Alert Severity Thresholds and Escalation Logic: Allow defining custom severity thresholds (e.g., 5+ errors = critical) and generate alerts accordingly. Integrate with escalation policies or Slack in future iterations. The goal is to avoid notifying everyone all the time and to escalate only when an issue genuinely needs attention.
  3. State Duration Analysis and Ranking: Add a new section or visualization showing which components have remained in an error/warning state the longest. This could help prioritize maintenance or alert triage.
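
To make items 2 and 3 more concrete, here is a rough sketch of how threshold-based escalation and duration ranking could work on plain records. Nothing here exists in the module: escalation_levels, rank_by_duration, the "critical"/"normal" labels, and the HIGH_TEMPERATURE state value are all hypothetical, and the threshold of 5 simply mirrors the example in item 2.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the timestamps used in Example Usage


def escalation_levels(records, critical_threshold=5):
    """Map each microgrid_id to "critical" or "normal" based on its error count."""
    errors = Counter(r["microgrid_id"] for r in records if r["state_type"] == "error")
    microgrids = {r["microgrid_id"] for r in records}
    return {
        mid: "critical" if errors[mid] >= critical_threshold else "normal"
        for mid in sorted(microgrids)
    }


def rank_by_duration(records, now):
    """Return (microgrid_id, component_id, duration) tuples, longest first.

    Only error and warning records are ranked. Records without an end_time
    are treated as still active and measured up to `now`.
    """
    ranked = []
    for r in records:
        if r["state_type"] not in ("error", "warning"):
            continue
        start = datetime.strptime(r["start_time"], TIME_FORMAT)
        end = datetime.strptime(r["end_time"], TIME_FORMAT) if r["end_time"] else now
        ranked.append((r["microgrid_id"], r["component_id"], end - start))
    return sorted(ranked, key=lambda item: item[2], reverse=True)


# Five 30-minute errors on microgrid 1 and one open-ended warning on microgrid 2.
records = [
    {"microgrid_id": 1, "component_id": c, "state_type": "error",
     "state_value": "UNDERVOLTAGE", "start_time": "2025-03-14 15:00:00",
     "end_time": "2025-03-14 15:30:00"}
    for c in range(1, 6)
]
records.append(
    {"microgrid_id": 2, "component_id": 1, "state_type": "warning",
     "state_value": "HIGH_TEMPERATURE",  # hypothetical state value
     "start_time": "2025-03-14 14:00:00", "end_time": None}
)

levels = escalation_levels(records)
ranking = rank_by_duration(records, now=datetime(2025, 3, 14, 17, 0, 0))
```

Passing `now` explicitly keeps the ranking deterministic, which makes the logic easier to test than calling datetime.now() inside the function.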