
Baffle Hub - Rule Architecture

Overview

Baffle Hub uses a distributed rule system where the Hub generates and manages rules, and Agents download and enforce them locally using optimized SQLite queries. This architecture provides sub-millisecond rule evaluation while maintaining centralized intelligence and control.

Core Principles

  1. Hub-side Intelligence: Pattern detection and rule generation happen on the Hub
  2. Agent-side Enforcement: Rule evaluation happens locally on Agents for speed
  3. Incremental Sync: Agents poll for rule updates using timestamp-based cursors
  4. Dynamic Backpressure: Hub controls event sampling based on load
  5. Temporal Rules: Rules can expire automatically (e.g., 24-hour bans)
  6. Soft Deletes: Rules are disabled, not deleted, for proper sync and audit trail

Rule Types

1. Network Rules (network_v4, network_v6)

Block or allow traffic based on IP address or CIDR ranges.

Use Cases:

  • Block scanner IPs (temporary or permanent)
  • Block datacenter/VPN/proxy ranges
  • Allow trusted IP ranges
  • Geographic blocking via IP ranges

Evaluation:

  • Most specific CIDR wins (longest prefix, i.e. the smallest network)
  • /32 beats /24 beats /16 beats /8
  • Agent uses optimized range queries on ipv4_ranges/ipv6_ranges tables

Example:

{
  "id": 12341,
  "rule_type": "network_v4",
  "action": "deny",
  "conditions": { "cidr": "185.220.100.0/22" },
  "priority": 22,
  "expires_at": "2024-11-04T12:00:00Z",
  "enabled": true,
  "source": "auto:scanner_detected",
  "metadata": {
    "reason": "Tor exit node hitting /.env",
    "auto_generated": true
  }
}

2. Rate Limit Rules (rate_limit)

Control request rate per IP or per CIDR range.

Scopes (Phase 1):

  • Global per-IP: Limit requests per IP across all paths
  • Per-CIDR: Different limits for different network ranges

Scopes (Phase 2+):

  • Per-path per-IP: Different limits for /api/*, /login, etc.

Evaluation:

  • Agent maintains in-memory counters per IP
  • Finds most specific CIDR rule for the IP
  • Applies that rule's rate limit configuration
  • Optional: Persist counters to SQLite for restart resilience

Example (Phase 1):

{
  "id": 12342,
  "rule_type": "rate_limit",
  "action": "rate_limit",
  "conditions": {
    "cidr": "0.0.0.0/0",
    "scope": "global"
  },
  "priority": 0,
  "enabled": true,
  "source": "manual",
  "metadata": {
    "limit": 100,
    "window": 60,
    "per_ip": true
  }
}

Example (Phase 2+):

{
  "id": 12343,
  "rule_type": "rate_limit",
  "action": "rate_limit",
  "conditions": {
    "cidr": "0.0.0.0/0",
    "scope": "per_path",
    "path_pattern": "/api/login"
  },
  "metadata": {
    "limit": 5,
    "window": 60,
    "per_ip": true
  }
}

3. Path Pattern Rules (path_pattern)

Detect suspicious path access patterns (mainly for Hub analytics).

Use Cases:

  • Detect scanners hitting /.env, /.git, /wp-admin
  • Identify bots with suspicious path traversal
  • Trigger automatic IP bans when patterns match

Evaluation:

  • Agent does lightweight pattern matching
  • When matched, sends event to Hub with matched_pattern: true
  • Hub analyzes and creates IP block rules if needed
  • Agent picks up new IP block rule in next sync (~10s)
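The lightweight Agent-side matching step can be sketched as follows. This assumes the rule's patterns are shell-style globs matched with Ruby's `File.fnmatch`, which is an assumption: the Agent code later in this document uses regexes instead. `matched_patterns` and `path_pattern_event` are hypothetical names.

```ruby
# Glob patterns copied from the path_pattern rule example.
SCANNER_PATTERNS = ["/.env", "/.git/*", "/wp-admin/*", "/.aws/*", "/phpMyAdmin/*"].freeze

# Return every pattern that matches the path (patterns are not exclusive).
# FNM_DOTMATCH lets "*" match segments that start with a dot; without
# FNM_PATHNAME, "*" also matches "/", so "/.git/*" covers nested paths.
def matched_patterns(path, patterns = SCANNER_PATTERNS)
  patterns.select { |pattern| File.fnmatch(pattern, path, File::FNM_DOTMATCH) }
end

# Hypothetical event payload: the request is not blocked here,
# it is only flagged so the Hub can decide whether to ban the IP.
def path_pattern_event(path)
  hits = matched_patterns(path)
  return nil if hits.empty?

  { path: path, matched_pattern: true, patterns: hits, waf_action: "log" }
end
```

Returning the list of matched patterns, not just a boolean, gives the Hub the "multiple matches = stronger signal" information described under Path Pattern Rules.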

Example:

{
  "id": 12344,
  "rule_type": "path_pattern",
  "action": "log",
  "conditions": {
    "patterns": ["/.env", "/.git/*", "/wp-admin/*", "/.aws/*", "/phpMyAdmin/*"]
  },
  "enabled": true,
  "source": "default:scanner_detection",
  "metadata": {
    "auto_ban_ip": true,
    "ban_duration_hours": 24,
    "description": "Common scanner paths"
  }
}

Rule Actions

Action      | Description              | HTTP Response
------------|--------------------------|---------------------------
allow       | Pass request through     | Continue to app
deny        | Block request            | 403 Forbidden
rate_limit  | Enforce rate limit       | 429 Too Many Requests
redirect    | Redirect to URL          | 301/302 + Location header
challenge   | Show CAPTCHA (Phase 2+)  | 403 with challenge
log         | Log only, don't block    | Continue to app
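The action table maps directly to a dispatch on the Agent. Below is a minimal sketch of that mapping. The method name `response_for` and the returned hash shape are hypothetical, and `nil` stands for "continue to the app".

```ruby
# Hypothetical mapping from a rule action to the Agent's HTTP outcome.
# Returns nil when the request should continue to the app.
def response_for(action, redirect_url: nil, redirect_status: 302)
  case action
  when "allow", "log"
    nil # pass through; "log" additionally records an event
  when "deny"
    { status: 403, headers: {} }
  when "rate_limit"
    { status: 429, headers: {} }
  when "redirect"
    raise ArgumentError, "redirect requires a URL" if redirect_url.nil?
    { status: redirect_status, headers: { "Location" => redirect_url } }
  when "challenge"
    { status: 403, headers: {}, challenge: true } # Phase 2+
  else
    raise ArgumentError, "unknown action: #{action.inspect}"
  end
end
```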

Rule Priority & Specificity

Network Rules

  • Priority is determined by CIDR prefix length
  • Longer prefix (more specific) = higher priority
  • /32 (single IP) beats /24 (256 IPs) beats /8 (16M IPs)
  • Example: Block 10.0.0.0/8 but allow 10.0.1.0/24
    • Request from 10.0.1.5 → matches /24 → allowed
    • Request from 10.0.2.5 → matches /8 only → blocked
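The longest-prefix rule above is the same ordering the Agent's SQL query applies (ORDER BY network_prefix DESC LIMIT 1). Here it is as an in-memory sketch using Ruby's standard `ipaddr` library. `NetworkRule` and `evaluate_network_rules` are hypothetical names, and rules are simplified to a CIDR plus an action.

```ruby
require "ipaddr"

# Simplified, hypothetical rule shape: just the CIDR and its action.
NetworkRule = Struct.new(:cidr, :action)

# Return the action of the longest-prefix rule that contains ip, or nil.
def evaluate_network_rules(rules, ip)
  addr = IPAddr.new(ip)
  matches = rules.select do |rule|
    net = IPAddr.new(rule.cidr)
    net.family == addr.family && net.include?(addr)
  end
  matches.max_by { |rule| IPAddr.new(rule.cidr).prefix }&.action
end

rules = [
  NetworkRule.new("10.0.0.0/8", "deny"),
  NetworkRule.new("10.0.1.0/24", "allow")
]
evaluate_network_rules(rules, "10.0.1.5") # => "allow" (matches /8 and /24; /24 wins)
evaluate_network_rules(rules, "10.0.2.5") # => "deny" (matches /8 only)
```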

Rate Limit Rules

  • Most specific CIDR match wins
  • Per-path rules take precedence over global (Phase 2+)
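The Agent code under Agent Rule Processing calls a `find_most_specific_rate_limit_rule` helper without defining it. Below is one possible sketch of that helper. It takes the rules and request path as extra arguments, which is an assumption. It also assumes Phase 2+ `path_pattern` values are globs matched with `File.fnmatch`. `RateLimitRule` is a hypothetical struct.

```ruby
require "ipaddr"

RateLimitRule = Struct.new(:cidr, :scope, :path_pattern, :limit, :window, keyword_init: true)

# Pick the rule that applies to (ip, path):
#   1. per-path rules whose pattern matches the path beat global rules (Phase 2+)
#   2. within the same scope, the longest matching CIDR prefix wins
# Returns nil when no rule covers the IP.
def find_most_specific_rate_limit_rule(rules, ip, path)
  addr = IPAddr.new(ip)
  candidates = rules.select do |rule|
    net = IPAddr.new(rule.cidr)
    next false unless net.family == addr.family && net.include?(addr)

    rule.scope == "global" ||
      (rule.scope == "per_path" && File.fnmatch(rule.path_pattern, path))
  end
  candidates.max_by do |rule|
    [rule.scope == "per_path" ? 1 : 0, IPAddr.new(rule.cidr).prefix]
  end
end
```

Ranking by the pair [scope, prefix] keeps both precedence rules in one comparison: scope decides first, and prefix length only breaks ties within the same scope.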

Path Pattern Rules

  • All patterns are evaluated (not exclusive)
  • Used for detection, not blocking
  • Multiple pattern matches = stronger signal for ban

Rule Synchronization

Timestamp-Based Cursor

Agents use updated_at timestamps as sync cursors to handle rule updates and deletions.

Why updated_at instead of id?

  • Handles rule updates (e.g., disabling a rule updates updated_at)
  • Handles rule deletions via enabled=false flag
  • Simple for agents: "give me everything that changed since X"

Agent Sync Flow:

1. Agent starts: last_sync = nil
2. GET /api/:key/rules → Full sync, store latest updated_at
3. Every 10s or 1000 events: GET /api/:key/rules?since=<last_sync>
4. Process rules: add new, update existing, remove disabled
5. Update last_sync to latest updated_at from response

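Query Overlap: The Hub queries updated_at >= since - 0.5s to handle clock skew and rules that share a millisecond timestamp. Because of this overlap, an Agent can receive rules it has already applied, so rule processing must be idempotent.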
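The sync flow can be sketched with an in-memory store keyed by rule id, without the HTTP layer. `RuleStore`, `apply` and `next_query` are hypothetical names. Responses are assumed to be parsed JSON hashes shaped like the /rules examples. Keying by id makes re-delivered rules harmless, and the cursor only moves forward.

```ruby
require "time"

# Hypothetical Agent-side rule store keyed by rule id.
class RuleStore
  attr_reader :rules, :last_sync

  def initialize
    @rules = {}
    @last_sync = nil
  end

  # Apply one /rules response. Overlapping responses are safe:
  # re-applying the same rule state leaves the store unchanged.
  def apply(response)
    response["rules"].each do |rule|
      if rule["enabled"]
        @rules[rule["id"]] = rule   # add new or update existing
      else
        @rules.delete(rule["id"])   # soft-deleted (disabled) on the Hub
      end
    end

    # Advance the cursor to the newest updated_at seen, never backwards.
    newest = response["rules"].map { |r| Time.iso8601(r["updated_at"]) }.max
    @last_sync = newest if newest && (@last_sync.nil? || newest > @last_sync)
  end

  # Query params for the next request: full sync first, then ?since=...
  def next_query
    @last_sync ? { "since" => @last_sync.iso8601(3) } : {}
  end
end
```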

API Endpoints

1. Version Check (Lightweight)

GET /api/:public_key/rules/version

Response:
{
  "version": "2024-11-03T12:30:45.123Z",
  "count": 150,
  "sampling": {
    "allowed_requests": 0.5,
    "blocked_requests": 1.0,
    "rate_limited_requests": 1.0,
    "effective_until": "2024-11-03T12:30:55.123Z"
  }
}

2. Incremental Sync

GET /api/:public_key/rules?since=2024-11-03T12:00:00.000Z

Response:
{
  "version": "2024-11-03T12:30:45.123Z",
  "sampling": { ... },
  "rules": [
    {
      "id": 12341,
      "rule_type": "network_v4",
      "action": "deny",
      "conditions": { "cidr": "1.2.3.4/32" },
      "priority": 32,
      "expires_at": "2024-11-04T12:00:00Z",
      "enabled": true,
      "source": "auto:scanner_detected",
      "metadata": { "reason": "Hitting /.env" },
      "created_at": "2024-11-03T12:00:00Z",
      "updated_at": "2024-11-03T12:00:00Z"
    },
    {
      "id": 12340,
      "rule_type": "network_v4",
      "action": "deny",
      "conditions": { "cidr": "5.6.7.8/32" },
      "priority": 32,
      "enabled": false,
      "source": "manual",
      "metadata": { "reason": "False positive" },
      "created_at": "2024-11-02T10:00:00Z",
      "updated_at": "2024-11-03T12:25:00Z"
    }
  ]
}

3. Full Sync

GET /api/:public_key/rules

Response:
{
  "version": "2024-11-03T12:30:45.123Z",
  "sampling": { ... },
  "rules": [ ...all enabled rules... ]
}

Dynamic Event Sampling

Hub controls how many events Agents send based on load.

Sampling Strategy

Hub monitors:

  • SolidQueue job depth
  • Events/second rate
  • Database write latency

Sampling rates:

Queue Depth     | Allowed | Blocked | Rate Limited
----------------|---------|---------|-------------
0-1,000         | 100%    | 100%    | 100%
1,001-5,000     | 50%     | 100%    | 100%
5,001-10,000    | 20%     | 100%    | 100%
10,001+         | 5%      | 100%    | 100%
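The table can be encoded as a tier lookup on the Hub. The sketch below uses only queue depth, matching the table. The other monitored signals (events/second, write latency) and `effective_until` are left out, and `sampling_for` is a hypothetical name.

```ruby
# Tiers from the table: [max queue depth, allowed-request sampling rate].
ALLOWED_SAMPLING_TIERS = [
  [1_000, 1.0],
  [5_000, 0.5],
  [10_000, 0.2]
].freeze

def sampling_for(queue_depth)
  tier = ALLOWED_SAMPLING_TIERS.find { |max_depth, _| queue_depth <= max_depth }
  {
    "allowed_requests" => tier ? tier[1] : 0.05, # 10,001+ falls through to 5%
    "blocked_requests" => 1.0,                   # always sent
    "rate_limited_requests" => 1.0               # always sent
  }
end
```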

Phase 2+: Path-based sampling:

{
  "sampling": {
    "allowed_requests": 0.1,
    "blocked_requests": 1.0,
    "paths": {
      "block": ["/.env", "/.git/*"],
      "allow": ["/health", "/metrics"]
    }
  }
}

Agent respects sampling:

  • Always sends blocked/rate-limited events
  • Samples allowed events based on rate
  • Can prioritize suspicious paths over routine traffic
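On the Agent side, the decision above reduces to one check per event. In this sketch the random draw is injected as `roll` so the logic can be tested. `send_event?` is a hypothetical name. Always sending events flagged by a path_pattern rule is one assumed way to "prioritize suspicious paths".

```ruby
# Hypothetical Agent-side sampling check.
# roll: uniform random number in [0, 1); defaults to Kernel#rand.
def send_event?(waf_action, sampling, matched_pattern: false, roll: rand)
  # Blocked and rate-limited events are always sent.
  return true if %w[deny rate_limit].include?(waf_action)

  # Assumption: events flagged by a path_pattern rule count as suspicious.
  return true if matched_pattern

  roll < sampling.fetch("allowed_requests", 1.0)
end
```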

Temporal Rules (Expiration)

Rules can have an expires_at timestamp for automatic expiration.

Use Cases:

  • 24-hour scanner bans
  • Temporary rate limit adjustments
  • Time-boxed maintenance blocks

Cleanup:

  • ExpiredRulesCleanupJob runs hourly
  • Disables rules where expires_at < now
  • Agent picks up disabled rules in next sync

Example:

# Hub auto-generates rule when scanner detected:
Rule.create!(
  rule_type: "network_v4",
  action: "deny",
  conditions: { cidr: "1.2.3.4/32" },
  expires_at: 24.hours.from_now,
  source: "auto:scanner_detected",
  metadata: { reason: "Hit /.env 5 times in 10 seconds" }
)

# 24 hours later: ExpiredRulesCleanupJob disables it
# Agent syncs and removes from ipv4_ranges table
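The cleanup step has to change `updated_at` as well as `enabled`. Otherwise the disabled rule never appears in an Agent's `since` query. Below is a plain-Ruby sketch of that logic. `SyncedRule` and `disable_expired` are hypothetical stand-ins for the Rule model and ExpiredRulesCleanupJob.

```ruby
SyncedRule = Struct.new(:id, :enabled, :expires_at, :updated_at, keyword_init: true)

# Disable every enabled rule whose expires_at has passed and bump updated_at
# so agents pick up the change in their next incremental sync.
# Returns the ids of the rules that were disabled.
def disable_expired(rules, now: Time.now.utc)
  expired = rules.select { |r| r.enabled && r.expires_at && r.expires_at < now }
  expired.each do |rule|
    rule.enabled = false
    rule.updated_at = now
  end
  expired.map(&:id)
end
```

If the real job disables rules in bulk with ActiveRecord's `update_all`, note that `update_all` does not set `updated_at` on its own, so it has to be passed explicitly alongside `enabled: false`.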

Rule Sources

The source field tracks rule origin for audit and filtering.

Source Formats:

  • manual - Created by user via UI
  • auto:scanner_detected - Auto-generated from scanner pattern
  • auto:rate_limit_exceeded - Auto-generated from rate limit abuse
  • auto:bot_detected - Auto-generated from bot behavior
  • imported:fail2ban - Imported from external source
  • imported:crowdsec - Imported from CrowdSec
  • default:scanner_detection - Default rule set (e.g. common scanner paths)

Database Schema

Hub Schema

create_table "rules" do |t|
  # Identification
  # id: integer primary key, added automatically by create_table
  t.string :source, limit: 100

  # Rule definition
  t.string :rule_type, null: false
  t.string :action, null: false
  t.json :conditions, null: false
  t.json :metadata

  # Priority & lifecycle
  t.integer :priority
  t.datetime :expires_at
  t.boolean :enabled, default: true, null: false

  # Timestamps (updated_at is sync cursor!)
  t.timestamps

  # Indexes
  t.index [:updated_at, :id]  # Primary sync query
  t.index :enabled
  t.index :expires_at
  t.index :source
  t.index :rule_type
end

Agent Schema (Existing)

create_table "ipv4_ranges" do |t|
  t.integer :network_start, limit: 8, null: false
  t.integer :network_end, limit: 8, null: false
  t.integer :network_prefix, null: false
  t.integer :waf_action, default: 0, null: false
  t.integer :priority, default: 100
  t.string :redirect_url, limit: 500
  t.integer :redirect_status
  t.string :source, limit: 50
  t.timestamps

  t.index [:network_start, :network_end, :network_prefix]
  t.index :waf_action
end

create_table "ipv6_ranges" do |t|
  t.binary :network_start, limit: 16, null: false
  t.binary :network_end, limit: 16, null: false
  t.integer :network_prefix, null: false
  t.integer :waf_action, default: 0, null: false
  t.integer :priority, default: 100
  t.string :redirect_url, limit: 500
  t.integer :redirect_status
  t.string :source, limit: 50
  t.timestamps

  t.index [:network_start, :network_end, :network_prefix]
  t.index :waf_action
end

Agent Rule Processing

Network Rules

# Agent receives network rule from Hub:
require "ipaddr"

rule = {
  id: 12341,
  rule_type: "network_v4",
  action: "deny",
  conditions: { cidr: "10.0.0.0/8" },
  priority: 8,
  enabled: true
}

# Agent converts to ipv4_ranges entry:
cidr = IPAddr.new(rule[:conditions][:cidr])
Ipv4Range.upsert({
  source: "hub:#{rule[:id]}",           # one row per Hub rule id
  network_start: cidr.to_i,
  network_end: cidr.to_range.end.to_i,
  network_prefix: cidr.prefix,          # 8
  waf_action: 1,                        # deny
  priority: rule[:priority]
}, unique_by: :source)                  # requires a unique index on ipv4_ranges.source

# Agent evaluates request:
# SELECT * FROM ipv4_ranges
# WHERE ? BETWEEN network_start AND network_end
# ORDER BY network_prefix DESC
# LIMIT 1
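The same conversion has to produce 16-byte values for the ipv6_ranges table. Below is a sketch using `IPAddr#hton`, which returns the address as big-endian bytes. `range_columns` is a hypothetical helper. SQLite compares BLOBs with memcmp(), so big-endian bytes sort in numeric order. The lookup address must also be bound as a BLOB, because SQLite always orders TEXT before BLOB.

```ruby
require "ipaddr"

# Convert a CIDR string into the column values used by ipv4_ranges / ipv6_ranges.
# IPv4 bounds are integers; IPv6 bounds are 16-byte big-endian binary strings.
def range_columns(cidr)
  net = IPAddr.new(cidr)
  last = net.to_range.last
  if net.ipv4?
    { network_start: net.to_i, network_end: last.to_i, network_prefix: net.prefix }
  else
    { network_start: net.hton, network_end: last.hton, network_prefix: net.prefix }
  end
end
```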

Rate Limit Rules

# Agent stores in memory:
@rate_limit_rules = {
  "global" => { limit: 100, window: 60, cidr: "0.0.0.0/0" }
}

@rate_counters = {
  "1.2.3.4" => { count: 50, window_start: Time.now }
}

# On each request:
def check_rate_limit(ip)
  rule = find_most_specific_rate_limit_rule(ip)
  return { action: "allow" } unless rule  # no rate limit rule covers this IP

  now = Time.now
  counter = @rate_counters[ip] ||= { count: 0, window_start: now }

  # Reset window if expired (store the new counter, not just a local copy)
  if now - counter[:window_start] > rule[:window]
    counter = @rate_counters[ip] = { count: 0, window_start: now }
  end

  counter[:count] += 1

  if counter[:count] > rule[:limit]
    { action: "rate_limit", status: 429 }
  else
    { action: "allow" }
  end
end

Path Pattern Rules

# Agent evaluates patterns:
# Dots are escaped so "." matches a literal dot, not any character
PATH_PATTERNS = [%r{/\.env\z}, %r{/\.git(/|\z)}, %r{/wp-admin(/|\z)}]

def check_path_patterns(path)
  matched = PATH_PATTERNS.any? { |pattern| path.match?(pattern) }

  if matched
    # Send event to Hub with flag
    send_event_to_hub(
      path: path,
      matched_pattern: true,
      waf_action: "log"  # Don't block yet
    )

    # Hub will analyze and create IP block rule if needed
  end
end

Hub Intelligence (Auto-Generation)

Scanner Detection

# PathScannerDetectorJob
require "ipaddr"

class PathScannerDetectorJob < ApplicationJob
  SCANNER_PATHS = %w[/.env /.git /wp-admin /phpMyAdmin /.aws]

  def perform
    # Find IPs hitting scanner paths (exact match) 3+ times in 5 minutes
    scanner_ips = Event
      .where(request_path: SCANNER_PATHS)
      .where("timestamp > ?", 5.minutes.ago)
      .group(:ip_address)
      .having("COUNT(*) >= 3")
      .pluck(:ip_address)

    # Note: consecutive runs inside the 5-minute window re-detect the same IPs;
    # skip IPs that already have an active ban to avoid duplicate rules.
    scanner_ips.each do |ip|
      ipv6 = IPAddr.new(ip).ipv6?
      prefix = ipv6 ? 128 : 32

      # Create 24h ban rule for this single address
      Rule.create!(
        rule_type: ipv6 ? "network_v6" : "network_v4",
        action: "deny",
        conditions: { cidr: "#{ip}/#{prefix}" },
        priority: prefix,
        expires_at: 24.hours.from_now,
        source: "auto:scanner_detected",
        metadata: {
          reason: "Hit scanner paths 3+ times in 5 minutes",
          auto_generated: true
        }
      )
    end
  end
end

Rate Limit Abuse Detection

# RateLimitAnomalyJob
require "ipaddr"

class RateLimitAnomalyJob < ApplicationJob
  def perform
    # Find IPs exceeding normal rate (>200 events in the last minute).
    # Allowed requests may be sampled under load (see Dynamic Event Sampling),
    # so stored event counts can understate the real request rate.
    abusive_ips = Event
      .where("timestamp > ?", 1.minute.ago)
      .group(:ip_address)
      .having("COUNT(*) > 200")
      .pluck(:ip_address)

    abusive_ips.each do |ip|
      prefix = IPAddr.new(ip).ipv6? ? 128 : 32

      # Create aggressive 1-hour rate limit for this single address
      Rule.create!(
        rule_type: "rate_limit",
        action: "rate_limit",
        conditions: { cidr: "#{ip}/#{prefix}", scope: "global" },
        priority: prefix,
        expires_at: 1.hour.from_now,
        source: "auto:rate_limit_exceeded",
        metadata: {
          limit: 10,
          window: 60,
          per_ip: true
        }
      )
    end
  end
end

Performance Characteristics

Hub

  • Rule query: O(log n + k) with the (updated_at, id) index, where k is the number of rules returned
  • Version check: Single index lookup
  • Rule generation: Background jobs, no request impact

Agent

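  • Network rule lookup: B-tree range scan on the (network_start, network_end, network_prefix) index; only the network_start bound narrows the scan (network_end is checked per entry), so cost is O(log n) plus the ranges starting at or below the IP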
  • Rate limit check: O(1) hash lookup in memory
  • Path pattern check: O(n) regex match (n = number of patterns)
  • Overall request evaluation: <1ms for typical case

Sync Efficiency

  • Incremental sync: Only changed rules since last sync
  • Typical sync payload: <10 KB for 50 rules
  • Sync frequency: Every 10s or 1000 events
  • Version check: <1 KB response

Future Enhancements (Phase 2+)

Per-Path Rate Limiting

  • Different limits for /api/*, /login, /admin
  • Agent tracks multiple counters per IP

Path-Based Event Sampling

  • Send all /admin requests
  • Skip /health, /metrics
  • Sample 10% of regular traffic

Challenge Actions

  • CAPTCHA challenges for suspicious IPs
  • JavaScript challenges for bot detection

Scheduled Rules

  • Block during maintenance windows
  • Time-of-day rate limits

Multi-Project Rules (Phase 10+)

  • Global rules across all projects
  • Per-project rule overrides

Summary

The Baffle Hub rule system provides:

  • Fast local enforcement (sub-millisecond)
  • Centralized intelligence (Hub analytics)
  • Efficient synchronization (timestamp-based incremental sync)
  • Dynamic adaptation (backpressure control via sampling)
  • Temporal flexibility (auto-expiring rules)
  • Audit trail (soft deletes, source tracking)

This architecture scales from single-server deployments to distributed multi-agent installations while maintaining simplicity and pragmatic design choices focused on the "low-hanging fruit" of WAF functionality.