Dan Milne 3f274c842c Fix some blocked/allow laggards after migrating. Add DuckDB for outstanding analyitcs performance. Start adding an import for all bot networks

2025-11-18 16:40:05 +11:00

Baffle Hub - Rule Architecture

Overview

Baffle Hub uses a distributed rule system where the Hub generates and manages rules, and Agents download and enforce them locally using optimized SQLite queries. This architecture provides sub-millisecond rule evaluation while maintaining centralized intelligence and control.

Core Principles

Hub-side Intelligence: Pattern detection and rule generation happens on the Hub
Agent-side Enforcement: Rule evaluation happens locally on Agents for speed
Incremental Sync: Agents poll for rule updates using timestamp-based cursors
Dynamic Backpressure: Hub controls event sampling based on load
Temporal Rules: Rules can expire automatically (e.g., 24-hour bans)
Soft Deletes: Rules are disabled, not deleted, for proper sync and audit trail

Rule Types

1. Network Rules (`network_v4`, `network_v6`)

Block or allow traffic based on IP address or CIDR ranges.

Use Cases:

Block scanner IPs (temporary or permanent)
Block datacenter/VPN/proxy ranges
Allow trusted IP ranges
Geographic blocking via IP ranges

Evaluation:

Most specific CIDR wins (longest prefix)
/32 beats /24 beats /16 beats /8
Agent uses optimized range queries on ipv4_ranges/ipv6_ranges tables

Example:

{
  "id": 12341,
  "rule_type": "network_v4",
  "action": "deny",
  "conditions": { "cidr": "185.220.100.0/22" },
  "priority": 22,
  "expires_at": "2024-11-04T12:00:00Z",
  "enabled": true,
  "source": "auto:scanner_detected",
  "metadata": {
    "reason": "Tor exit node hitting /.env",
    "auto_generated": true
  }
}

2. Rate Limit Rules (`rate_limit`)

Control request rate per IP or per CIDR range.

Scopes (Phase 1):

Global per-IP: Limit requests per IP across all paths
Per-CIDR: Different limits for different network ranges

Scopes (Phase 2+):

Per-path per-IP: Different limits for /api/*, /login, etc.

Evaluation:

Agent maintains in-memory counters per IP
Finds most specific CIDR rule for the IP
Applies that rule's rate limit configuration
Optional: Persist counters to SQLite for restart resilience

Example (Phase 1):

{
  "id": 12342,
  "rule_type": "rate_limit",
  "action": "rate_limit",
  "conditions": {
    "cidr": "0.0.0.0/0",
    "scope": "global"
  },
  "priority": 0,
  "enabled": true,
  "source": "manual",
  "metadata": {
    "limit": 100,
    "window": 60,
    "per_ip": true
  }
}

Example (Phase 2+):

{
  "id": 12343,
  "rule_type": "rate_limit",
  "action": "rate_limit",
  "conditions": {
    "cidr": "0.0.0.0/0",
    "scope": "per_path",
    "path_pattern": "/api/login"
  },
  "metadata": {
    "limit": 5,
    "window": 60,
    "per_ip": true
  }
}

3. Path Pattern Rules (`path_pattern`)

Detect suspicious path access patterns (mainly for Hub analytics).

Use Cases:

Detect scanners hitting /.env, /.git, /wp-admin
Identify bots with suspicious path traversal
Trigger automatic IP bans when patterns match

Evaluation:

Agent does lightweight pattern matching
When matched, sends event to Hub with matched_pattern: true
Hub analyzes and creates IP block rules if needed
Agent picks up new IP block rule in next sync (~10s)

Example:

{
  "id": 12344,
  "rule_type": "path_pattern",
  "action": "log",
  "conditions": {
    "patterns": ["/.env", "/.git/*", "/wp-admin/*", "/.aws/*", "/phpMyAdmin/*"]
  },
  "enabled": true,
  "source": "default:scanner_detection",
  "metadata": {
    "auto_ban_ip": true,
    "ban_duration_hours": 24,
    "description": "Common scanner paths"
  }
}
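
The `patterns` array above reads like shell-style globs. The document does not spell out the matching semantics, so the following is only a sketch under that assumption, using Ruby's `File.fnmatch`. The helper name `matching_patterns` is hypothetical.

```ruby
# Assumption: patterns are shell-style globs where `*` may span slashes.
SCANNER_PATTERNS = ["/.env", "/.git/*", "/wp-admin/*", "/.aws/*", "/phpMyAdmin/*"].freeze

# Returns every pattern that matches, since path pattern rules are not exclusive.
def matching_patterns(path, patterns = SCANNER_PATTERNS)
  patterns.select { |pattern| File.fnmatch(pattern, path, File::FNM_DOTMATCH) }
end

matching_patterns("/.git/config")  # matches the "/.git/*" pattern
matching_patterns("/index.html")   # matches nothing
```

Under these semantics `/wp-admin/*` does not match a bare `/wp-admin` request, so a real rule set may want both forms.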

Rule Actions

Action       | Description             | HTTP Response
-------------|-------------------------|--------------------------
`allow`      | Pass request through    | Continue to app
`deny`       | Block request           | 403 Forbidden
`rate_limit` | Enforce rate limit      | 429 Too Many Requests
`redirect`   | Redirect to URL         | 301/302 + Location header
`challenge`  | Show CAPTCHA (Phase 2+) | 403 with challenge
`log`        | Log only, don't block   | Continue to app

Rule Priority & Specificity

Network Rules

Priority is determined by CIDR prefix length
Longer prefix (more specific) = higher priority
/32 (single IP) beats /24 (256 IPs) beats /8 (16M IPs)
Example: Block 10.0.0.0/8 but allow 10.0.1.0/24
- Request from 10.0.1.5 → matches /24 → allowed
- Request from 10.0.2.5 → matches /8 only → blocked
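
The 10.0.0.0/8 example can be sketched in plain Ruby as a longest-prefix match. This is an in-memory illustration only: the `RULES` list and `resolve_action` helper are hypothetical, and the Agent itself is described as doing this lookup in SQLite.

```ruby
require "ipaddr"

# Hypothetical in-memory rule list mirroring the example above.
RULES = [
  { cidr: IPAddr.new("10.0.0.0/8"),  action: "deny"  },
  { cidr: IPAddr.new("10.0.1.0/24"), action: "allow" }
].freeze

# Returns the action of the matching rule with the longest prefix,
# or nil when no rule covers the address.
def resolve_action(ip, rules = RULES)
  addr = IPAddr.new(ip)
  best = rules.select { |rule| rule[:cidr].include?(addr) }
              .max_by { |rule| rule[:cidr].prefix }
  best && best[:action]
end

resolve_action("10.0.1.5")  # the /24 wins over the /8
resolve_action("10.0.2.5")  # only the /8 matches
```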

Rate Limit Rules

Most specific CIDR match wins
Per-path rules take precedence over global (Phase 2+)
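
A minimal sketch of how these two precedence rules might combine. Field names follow the JSON examples in this document; the `RATE_RULES` data, the `pick_rate_limit_rule` helper, and the exact-match treatment of `path_pattern` are assumptions for illustration.

```ruby
require "ipaddr"

# Hypothetical rule records shaped like the rate_limit examples.
RATE_RULES = [
  { cidr: "0.0.0.0/0",  scope: "global",   limit: 100 },
  { cidr: "10.0.0.0/8", scope: "global",   limit: 20 },
  { cidr: "0.0.0.0/0",  scope: "per_path", path_pattern: "/api/login", limit: 5 }
].freeze

def pick_rate_limit_rule(ip, path, rules = RATE_RULES)
  addr = IPAddr.new(ip)
  candidates = rules.select do |rule|
    next false unless IPAddr.new(rule[:cidr]).include?(addr)
    # Assumption: a per-path rule applies only on an exact path match.
    rule[:scope] == "global" || rule[:path_pattern] == path
  end
  # Per-path rules outrank global ones; within a scope the longest prefix wins.
  candidates.max_by do |rule|
    [rule[:scope] == "per_path" ? 1 : 0, IPAddr.new(rule[:cidr]).prefix]
  end
end
```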

Path Pattern Rules

All patterns are evaluated (not exclusive)
Used for detection, not blocking
Multiple pattern matches = stronger signal for ban

Rule Synchronization

Timestamp-Based Cursor

Agents use updated_at timestamps as sync cursors to handle rule updates and deletions.

Why updated_at instead of id?

Handles rule updates (e.g., disabling a rule updates updated_at)
Handles rule deletions via enabled=false flag
Simple for agents: "give me everything that changed since X"

Agent Sync Flow:

1. Agent starts: last_sync = nil
2. GET /api/:key/rules → Full sync, store latest updated_at
3. Every 10s or 1000 events: GET /api/:key/rules?since=<last_sync>
4. Process rules: add new, update existing, remove disabled
5. Update last_sync to latest updated_at from response

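Query Overlap: The Hub queries updated_at >= since - 0.5s to tolerate clock skew and rules that share the same timestamp. Agents may therefore receive a rule they have already applied, so rule processing must be idempotent.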
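
Steps 4 and 5 of the sync flow could look roughly like this on the Agent. It is a sketch over a plain Hash rather than the real SQLite tables, and it assumes the response's `version` field can serve as the next cursor; `apply_sync` is a hypothetical name.

```ruby
# Hypothetical in-memory rule store keyed by rule id.
def apply_sync(store, response)
  response[:rules].each do |rule|
    if rule[:enabled]
      store[rule[:id]] = rule   # add or overwrite: safe to repeat
    else
      store.delete(rule[:id])   # disabled on the Hub: drop locally
    end
  end
  response[:version]            # assumed to be usable as the next `since`
end
```

Because each rule is keyed by id, applying the same response twice leaves the store unchanged, which is what the overlapping query window requires.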

API Endpoints

1. Version Check (Lightweight)

GET /api/:public_key/rules/version

Response:
{
  "version": 1730646645123000,
  "count": 150,
  "sampling": {
    "allowed_requests": 0.5,
    "blocked_requests": 1.0,
    "rate_limited_requests": 1.0,
    "effective_until": "2024-11-03T12:30:55.123Z"
  }
}

Timestamp Format: The version field is a Unix timestamp in microseconds (e.g., 1730646645123000), which is cheap to compare as an integer. For backward compatibility, the since parameter also accepts ISO8601 timestamps.
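
A sketch of how the two `since` formats might be normalised. The helper names `to_version` and `parse_since` are hypothetical, and the Hub's actual parsing is not shown in this document.

```ruby
require "time"

# Convert a Time to the microsecond integer used by the `version` field.
def to_version(time)
  time.to_i * 1_000_000 + time.usec
end

# Parse a `since` value that is either a microsecond integer or ISO8601.
def parse_since(value)
  if value.to_s.match?(/\A\d+\z/)
    micros = value.to_i
    Time.at(micros / 1_000_000, micros % 1_000_000, :usec).utc
  else
    Time.iso8601(value).utc
  end
end

parse_since("1730646645123000")          # integer form
parse_since("2024-11-03T15:10:45.123Z")  # ISO8601 form of the same instant
```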

2. Incremental Sync

GET /api/:public_key/rules?since=1730646000000000

Response:
{
  "version": 1730646645123000,
  "sampling": { ... },
  "rules": [
    {
      "id": 12341,
      "rule_type": "network_v4",
      "action": "deny",
      "conditions": { "cidr": "1.2.3.4/32" },
      "priority": 32,
      "expires_at": "2024-11-04T12:00:00Z",
      "enabled": true,
      "source": "auto:scanner_detected",
      "metadata": { "reason": "Hitting /.env" },
      "created_at": "2024-11-03T12:00:00Z",
      "updated_at": "2024-11-03T12:00:00Z"
    },
    {
      "id": 12340,
      "rule_type": "network_v4",
      "action": "deny",
      "conditions": { "cidr": "5.6.7.8/32" },
      "priority": 32,
      "enabled": false,
      "source": "manual",
      "metadata": { "reason": "False positive" },
      "created_at": "2024-11-02T10:00:00Z",
      "updated_at": "2024-11-03T12:25:00Z"
    }
  ]
}

3. Full Sync

GET /api/:public_key/rules

Response:
{
  "version": 1730646645123000,
  "sampling": { ... },
  "rules": [ ...all enabled rules... ]
}

Dynamic Event Sampling

Hub controls how many events Agents send based on load.

Sampling Strategy

Hub monitors:

SolidQueue job depth
Events/second rate
Database write latency

Sampling rates:

Queue Depth     | Allowed | Blocked | Rate Limited
----------------|---------|---------|-------------
0-1,000         | 100%    | 100%    | 100%
1,001-5,000     | 50%     | 100%    | 100%
5,001-10,000    | 20%     | 100%    | 100%
10,001+         | 5%      | 100%    | 100%
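
The table maps directly onto a small lookup. This sketch covers only the queue-depth signal (not events/second or write latency), and `SAMPLING_TIERS` and `sampling_for` are hypothetical names.

```ruby
# Tiers copied from the table above: [max queue depth, allowed-request rate].
SAMPLING_TIERS = [
  [1_000, 1.0],
  [5_000, 0.5],
  [10_000, 0.2],
  [Float::INFINITY, 0.05]
].freeze

# Blocked and rate-limited events are always sent in full.
def sampling_for(queue_depth)
  _, allowed = SAMPLING_TIERS.find { |max_depth, _| queue_depth <= max_depth }
  { allowed_requests: allowed, blocked_requests: 1.0, rate_limited_requests: 1.0 }
end
```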

Phase 2+: Path-based sampling:

{
  "sampling": {
    "allowed_requests": 0.1,
    "blocked_requests": 1.0,
    "paths": {
      "block": ["/.env", "/.git/*"],
      "allow": ["/health", "/metrics"]
    }
  }
}

Agent respects sampling:

Always sends blocked/rate-limited events
Samples allowed events based on rate
Can prioritize suspicious paths over routine traffic
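
On the Agent side, the per-event decision might be sketched as follows. The `send_event?` helper is hypothetical, and it assumes the `sampling` hash uses the same keys as the API response.

```ruby
# `random` is injectable so the decision can be tested deterministically.
def send_event?(waf_action, sampling, random: Random.new)
  rate = case waf_action
         when "deny"       then sampling.fetch(:blocked_requests, 1.0)
         when "rate_limit" then sampling.fetch(:rate_limited_requests, 1.0)
         else                   sampling.fetch(:allowed_requests, 1.0)
         end
  # rand returns a float in [0, 1), so a rate of 1.0 always sends
  # and a rate of 0.0 never does.
  random.rand < rate
end
```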

Temporal Rules (Expiration)

Rules can have an expires_at timestamp for automatic expiration.

Use Cases:

24-hour scanner bans
Temporary rate limit adjustments
Time-boxed maintenance blocks

Cleanup:

ExpiredRulesCleanupJob runs hourly
Disables rules where expires_at < now
Agent picks up disabled rules in next sync

Example:

# Hub auto-generates rule when scanner detected:
Rule.create!(
  rule_type: "network_v4",
  action: "deny",
  conditions: { cidr: "1.2.3.4/32" },
  expires_at: 24.hours.from_now,
  source: "auto:scanner_detected",
  metadata: { reason: "Hit /.env 5 times in 10 seconds" }
)

# 24 hours later: ExpiredRulesCleanupJob disables it
# Agent syncs and removes from ipv4_ranges table
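
For the cleanup to reach Agents, disabling a rule also has to advance its updated_at, since that is the sync cursor. A sketch of that logic over plain objects follows; `RuleRecord` and `disable_expired` are hypothetical stand-ins, not the actual ExpiredRulesCleanupJob.

```ruby
# Hypothetical stand-in for the Hub's Rule model.
RuleRecord = Struct.new(:id, :enabled, :expires_at, :updated_at, keyword_init: true)

# Disables expired rules and returns the ones that changed.
def disable_expired(rules, now: Time.now)
  expired = rules.select { |r| r.enabled && r.expires_at && r.expires_at < now }
  expired.each do |r|
    r.enabled = false
    r.updated_at = now  # moves the rule past every agent's sync cursor
  end
  expired
end
```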

Rule Sources

The source field tracks rule origin for audit and filtering.

Source Formats:

manual - Created by user via UI
auto:scanner_detected - Auto-generated from scanner pattern
auto:rate_limit_exceeded - Auto-generated from rate limit abuse
auto:bot_detected - Auto-generated from bot behavior
imported:fail2ban - Imported from fail2ban
imported:crowdsec - Imported from CrowdSec
default:scanner_paths - Default rule set
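
Since every source follows a `category:detail` shape (with `manual` as the bare exception), filtering can be sketched with a simple split. The helper names here are hypothetical.

```ruby
# Splits "category:detail" into its parts; "manual" has no detail.
def parse_source(source)
  category, detail = source.split(":", 2)
  { category: category, detail: detail }
end

def auto_generated?(source)
  parse_source(source)[:category] == "auto"
end
```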

Database Schema

Hub Schema

create_table "rules" do |t|
  # Identification
  t.integer :id, primary_key: true
  t.string :source, limit: 100

  # Rule definition
  t.string :rule_type, null: false
  t.string :action, null: false
  t.json :conditions, null: false
  t.json :metadata

  # Priority & lifecycle
  t.integer :priority
  t.datetime :expires_at
  t.boolean :enabled, default: true, null: false

  # Timestamps (updated_at is sync cursor!)
  t.timestamps

  # Indexes
  t.index [:updated_at, :id]  # Primary sync query
  t.index :enabled
  t.index :expires_at
  t.index :source
  t.index :rule_type
end

Agent Schema (Existing)

create_table "ipv4_ranges" do |t|
  t.integer :network_start, limit: 8, null: false
  t.integer :network_end, limit: 8, null: false
  t.integer :network_prefix, null: false
  t.integer :waf_action, default: 0, null: false
  t.integer :priority, default: 100
  t.string :redirect_url, limit: 500
  t.integer :redirect_status
  t.string :source, limit: 50
  t.timestamps

  t.index [:network_start, :network_end, :network_prefix]
  t.index :waf_action
end

create_table "ipv6_ranges" do |t|
  t.binary :network_start, limit: 16, null: false
  t.binary :network_end, limit: 16, null: false
  t.integer :network_prefix, null: false
  t.integer :waf_action, default: 0, null: false
  t.integer :priority, default: 100
  t.string :redirect_url, limit: 500
  t.integer :redirect_status
  t.string :source, limit: 50
  t.timestamps

  t.index [:network_start, :network_end, :network_prefix]
  t.index :waf_action
end

Agent Rule Processing

Network Rules

# Agent receives network rule from Hub:
rule = {
  id: 12341,
  rule_type: "network_v4",
  action: "deny",
  conditions: { cidr: "10.0.0.0/8" },
  priority: 8,
  enabled: true
}

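# Agent converts to ipv4_ranges entry.
# Note: upsert with unique_by: :source requires a unique index on source,
# which the schema above does not yet define.
require "ipaddr"
cidr = IPAddr.new("10.0.0.0/8")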
Ipv4Range.upsert({
  source: "hub:12341",
  network_start: cidr.to_i,
  network_end: cidr.to_range.end.to_i,
  network_prefix: 8,
  waf_action: 0,  # deny
  priority: 8
}, unique_by: :source)

# Agent evaluates request:
# SELECT * FROM ipv4_ranges
# WHERE ? BETWEEN network_start AND network_end
# ORDER BY network_prefix DESC
# LIMIT 1
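
The same conversion can be sketched for both address families. For IPv6 this assumes the 16-byte binary columns hold network-order bytes, which `IPAddr#hton` produces; the `range_bounds` helper is hypothetical.

```ruby
require "ipaddr"

# Returns start/end/prefix values for a CIDR. IPv4 bounds are integers;
# IPv6 bounds are 16-byte network-order strings (assumed column format).
def range_bounds(cidr)
  net = IPAddr.new(cidr)
  first, last = net.to_range.first, net.to_range.last
  if net.ipv4?
    { network_start: first.to_i, network_end: last.to_i, network_prefix: net.prefix }
  else
    { network_start: first.hton, network_end: last.hton, network_prefix: net.prefix }
  end
end
```

Network-order byte strings sort in the same order as the addresses they encode, so a BETWEEN comparison on the binary columns behaves like the integer comparison used for IPv4.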

Rate Limit Rules

# Agent stores in memory:
@rate_limit_rules = {
  "global" => { limit: 100, window: 60, cidr: "0.0.0.0/0" }
}

@rate_counters = {
  "1.2.3.4" => { count: 50, window_start: Time.now }
}

# On each request:
def check_rate_limit(ip)
  rule = find_most_specific_rate_limit_rule(ip)
  counter = @rate_counters[ip] ||= { count: 0, window_start: Time.now }

  # Reset window if expired (mutate in place so @rate_counters keeps the reset)
  if Time.now - counter[:window_start] > rule[:window]
    counter[:count] = 0
    counter[:window_start] = Time.now
  end

  counter[:count] += 1

  if counter[:count] > rule[:limit]
    { action: "rate_limit", status: 429 }
  else
    { action: "allow" }
  end
end

Path Pattern Rules

# Agent evaluates patterns (dots are escaped so they match literally):
PATH_PATTERNS = [/\.env\z/, /\.git/, /wp-admin/]

def check_path_patterns(path)
  matched = PATH_PATTERNS.any? { |pattern| path.match?(pattern) }

  if matched
    # Send event to Hub with flag
    send_event_to_hub(
      path: path,
      matched_pattern: true,
      waf_action: "log"  # Don't block yet
    )

    # Hub will analyze and create IP block rule if needed
  end
end

Hub Intelligence (Auto-Generation)

Scanner Detection

# PathScannerDetectorJob
class PathScannerDetectorJob < ApplicationJob
  SCANNER_PATHS = %w[/.env /.git /wp-admin /phpMyAdmin /.aws]

  def perform
    # Find IPs hitting scanner paths
    scanner_ips = Event
      .where("request_path IN (?)", SCANNER_PATHS)
      .where("timestamp > ?", 5.minutes.ago)
      .group(:ip_address)
      .having("COUNT(*) >= 3")
      .pluck(:ip_address)

    scanner_ips.each do |ip|
      # Create 24h ban rule
      Rule.create!(
        rule_type: "network_v4",
        action: "deny",
        conditions: { cidr: "#{ip}/32" },
        priority: 32,
        expires_at: 24.hours.from_now,
        source: "auto:scanner_detected",
        metadata: {
          reason: "Hit #{SCANNER_PATHS.join(', ')}",
          auto_generated: true
        }
      )
    end
  end
end

Rate Limit Abuse Detection

# RateLimitAnomalyJob
class RateLimitAnomalyJob < ApplicationJob
  def perform
    # Find IPs exceeding normal rate
    abusive_ips = Event
      .where("timestamp > ?", 1.minute.ago)
      .group(:ip_address)
      .having("COUNT(*) > 200")  # >200 req/min
      .pluck(:ip_address)

    abusive_ips.each do |ip|
      # Create aggressive rate limit or block
      Rule.create!(
        rule_type: "rate_limit",
        action: "rate_limit",
        conditions: { cidr: "#{ip}/32", scope: "global" },
        priority: 32,
        expires_at: 1.hour.from_now,
        source: "auto:rate_limit_exceeded",
        metadata: {
          limit: 10,
          window: 60,
          per_ip: true
        }
      )
    end
  end
end

Performance Characteristics

Hub

Rule query: O(log n) with (updated_at, id) index
Version check: Single index lookup
Rule generation: Background jobs, no request impact

Agent

Network rule lookup: index-assisted range query on (network_start, network_end, network_prefix), roughly O(log n) when few ranges overlap the address
Rate limit check: O(1) hash lookup in memory
Path pattern check: O(n) regex match (n = number of patterns)
Overall request evaluation: <1ms for typical case

Sync Efficiency

Incremental sync: Only changed rules since last sync
Typical sync payload: <10 KB for 50 rules
Sync frequency: Every 10s or 1000 events
Version check: <1 KB response

Future Enhancements (Phase 2+)

Per-Path Rate Limiting

Different limits for /api/*, /login, /admin
Agent tracks multiple counters per IP

Path-Based Event Sampling

Send all /admin requests
Skip /health, /metrics
Sample 10% of regular traffic

Challenge Actions

CAPTCHA challenges for suspicious IPs
JavaScript challenges for bot detection

Scheduled Rules

Block during maintenance windows
Time-of-day rate limits

Multi-Project Rules (Phase 10+)

Global rules across all projects
Per-project rule overrides

Summary

The Baffle Hub rule system provides:

Fast local enforcement (sub-millisecond)
Centralized intelligence (Hub analytics)
Efficient synchronization (timestamp-based incremental sync)
Dynamic adaptation (backpressure control via sampling)
Temporal flexibility (auto-expiring rules)
Audit trail (soft deletes, source tracking)

This architecture scales from single-server deployments to distributed multi-agent installations while maintaining simplicity and pragmatic design choices focused on the "low-hanging fruit" of WAF functionality.
