First commit!

Commit 429d41eead by Dan Milne, 2025-11-03 17:37:28 +11:00
141 changed files with 5890 additions and 0 deletions

# Path Segment Architecture
## Overview
Baffle Hub uses a path segment decomposition strategy to efficiently store and query URL paths in WAF event logs. This architecture provides significant storage compression while enabling fast prefix-based path searches using SQLite's B-tree indexes.
## The Problem
WAF systems generate millions of request events. Storing full URL paths like `/api/v1/users/123/posts` repeatedly wastes storage and makes pattern-based queries inefficient.
Traditional approaches:
- **Full path storage**: High redundancy, large database size
- **String pattern matching with LIKE**: No index support, slow queries
- **Full-Text Search (FTS)**: Complex setup, overkill for structured paths
## Our Solution: Path Segment Normalization
### Architecture Components
```
Request: /api/v1/users/123/posts
Decompose into segments: ["api", "v1", "users", "123", "posts"]
Normalize to IDs: [1, 2, 3, 4, 5]
Store as JSON array: "[1,2,3,4,5]"
```
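The decomposition step is plain string splitting. A standalone sketch (outside Rails, so `reject(&:empty?)` stands in for ActiveSupport's `blank?`):

```ruby
# Split a URL path into its non-empty segments.
def path_segments(path)
  path.split('/').reject(&:empty?)
end

path_segments('/api/v1/users/123/posts')
# => ["api", "v1", "users", "123", "posts"]
path_segments('/')  # => []
```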
### Database Schema
```ruby
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
t.string :segment, null: false, index: { unique: true }
t.integer :usage_count, default: 1, null: false
t.datetime :first_seen_at, null: false
t.timestamps
end
# events table - references segments by ID
create_table :events do |t|
t.string :request_segment_ids # JSON array: "[1,2,3]"
t.string :request_path # Original path for display
# ... other fields
end
# Critical index for fast lookups
add_index :events, :request_segment_ids
```
### Models
**PathSegment** - The segment dictionary:
```ruby
class PathSegment < ApplicationRecord
validates :segment, presence: true, uniqueness: true
validates :usage_count, presence: true, numericality: { greater_than: 0 }
def self.find_or_create_segment(segment)
find_or_create_by(segment: segment) do |path_segment|
path_segment.usage_count = 1
path_segment.first_seen_at = Time.current
end
end
def increment_usage!
increment!(:usage_count)
end
end
```
**Event** - Stores segment IDs as JSON array:
```ruby
class Event < ApplicationRecord
serialize :request_segment_ids, type: Array, coder: JSON
# Path reconstruction helper
def reconstructed_path
return request_path if request_segment_ids.blank?
segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
'/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
end
def path_depth
request_segment_ids&.length || 0
end
end
```
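Outside Rails, reconstruction is just a dictionary lookup. A minimal standalone sketch, with a plain Hash standing in for the `path_segments` table:

```ruby
# Hypothetical in-memory dictionary: id => segment text.
SEGMENTS = { 1 => 'api', 2 => 'v1', 3 => 'users' }

def reconstruct(segment_ids, dictionary = SEGMENTS)
  # Unknown IDs map to nil and are dropped by compact,
  # mirroring reconstructed_path above.
  '/' + segment_ids.map { |id| dictionary[id] }.compact.join('/')
end

reconstruct([1, 2, 3])  # => "/api/v1/users"
reconstruct([])         # => "/"
```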
## The Indexing Strategy
### Why Standard LIKE Doesn't Work
SQLite's LIKE optimization converts a pattern with a literal prefix into an indexed range scan only under narrow conditions: the index's collation must match the LIKE's case sensitivity (for the default BINARY index, that means `PRAGMA case_sensitive_like = ON`). We don't want to depend on those settings:
```sql
-- May use the index, but only with the right collation settings
WHERE column LIKE 'api%'
-- Full table scan in the default configuration
WHERE request_segment_ids LIKE '[1,2,%'
```
### The Solution: Range Queries on Lexicographic Sort
Stored as TEXT, the JSON arrays sort byte-wise in SQLite:
```
"[1,2,3]"   (prefix match - starts with "[1,2,")
"[1,2,4]"   (prefix match)
"[1,2,99]"  (prefix match)
"[1,2]"     (exact match - ']' is a later byte than ',')
"[1,3]"     (different prefix)
```
Note that the exact match sorts *after* all of its prefix matches, which is why the query needs an equality branch in addition to the range.
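This ordering can be verified in plain Ruby, since Ruby's `String#<=>` is also byte-wise:

```ruby
strings = ['[1,2]', '[1,3]', '[1,2,99]', '[1,2,3]', '[1,2,4]']
strings.sort
# => ["[1,2,3]", "[1,2,4]", "[1,2,99]", "[1,2]", "[1,3]"]

# ']' (0x5D) is a later byte than ',' (0x2C), so the exact match
# "[1,2]" sorts after every longer array that extends it.
'[1,2]' > '[1,2,99]'  # => true
```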
To find all paths starting with `[1,2]`:
```sql
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
   OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,2-')
```
Every true prefix match begins with the bytes `[1,2,`, so the range `>= '[1,2,' AND < '[1,2-'` (where `-` is the byte immediately after `,`) captures exactly those rows. A looser upper bound such as `'[1,3]'` would also admit false positives like `[1,20,...]`, whose second segment ID merely starts with the digit 2.
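The bound arithmetic can be sanity-checked with byte-wise string comparison in plain Ruby (`prefix_match?` is a throwaway helper for this sketch, not part of the app):

```ruby
require 'json'

# Bounds used by the range query, for the ID prefix [1, 2].
prefix = [1, 2].to_json   # "[1,2]"  exact-match value
base   = prefix[0..-2]    # "[1,2"
lower  = "#{base},"       # "[1,2,"  inclusive lower bound
upper  = "#{base}-"       # "[1,2-"  exclusive upper bound ('-' = ',' + 1)

def prefix_match?(candidate, exact, lower, upper)
  candidate == exact || (candidate >= lower && candidate < upper)
end

prefix_match?('[1,2,3]', prefix, lower, upper)  # => true
prefix_match?('[1,2]',   prefix, lower, upper)  # => true  (equality branch)
prefix_match?('[1,20]',  prefix, lower, upper)  # => false (ID 20, not a prefix)
prefix_match?('[1,3]',   prefix, lower, upper)  # => false
```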
### Query Performance
```
EXPLAIN QUERY PLAN:
MULTI-INDEX OR
├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
└─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
```
Both branches use the B-tree index, so each lookup is O(log n).
### Implementation: with_path_prefix Scope
```ruby
scope :with_path_prefix, ->(prefix_segment_ids) {
  return none if prefix_segment_ids.blank?

  # Convert [1, 2] to the stored JSON string "[1,2]"
  prefix_str = prefix_segment_ids.to_json

  # All prefix matches start with the bytes "[1,2," -- swap the
  # closing bracket for a comma to get the inclusive lower bound
  lower_str = "#{prefix_str[0..-2]},"

  # '-' is the byte after ',', so "[1,2-" is a tight exclusive upper
  # bound that excludes false positives such as "[1,20,...]"
  upper_str = "#{prefix_str[0..-2]}-"

  # Exact match OR range query; both branches use the B-tree index
  where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
        prefix_str, lower_str, upper_str)
}
```
## Usage Examples
### Basic Prefix Search
```ruby
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
```
### Combined with Other Filters
```ruby
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')
Event.where(ip_address: '192.168.1.100')
.where(waf_action: :deny)
.with_path_prefix([admin_seg.id])
```
### Using Composite Index
```ruby
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')
Event.where(request_host_id: host.id, request_method: :post)
.with_path_prefix([api_seg.id])
```
### Exact Path Match
```ruby
# Find the exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
# Pass the raw array: the JSON coder serializes it to "[1,2]" for the
# comparison. Passing a pre-encoded string would be double-encoded.
Event.where(request_segment_ids: [api_seg.id, v1_seg.id])
```
### Path Reconstruction for Display
```ruby
events = Event.with_path_prefix([api_seg.id]).limit(10)
events.each do |event|
puts "#{event.reconstructed_path} - #{event.waf_action}"
# => /api/v1/users - allow
# => /api/v1/posts - deny
end
```
## Performance Characteristics
| Operation | Index Used | Complexity | Notes |
|-----------|-----------|------------|-------|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |
### Indexes in Schema
```ruby
# Single-column index for path queries
add_index :events, :request_segment_ids
# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
name: 'idx_events_host_method_path'
```
## Storage Efficiency
### Compression Benefits
Example: `/api/v1/users` (13 bytes) appears in 100,000 events
**Without normalization:**
```
100,000 events × 13 bytes = 1,300,000 bytes (~1.3 MB)
```
**With normalization:**
```
Dictionary: "api" + "v1" + "users" = 10 bytes (plus row overhead)
100,000 events × 7 bytes ("[1,2,3]") = 700,000 bytes (~700 KB)
Savings: ~46% reduction (more for longer or deeper paths)
```
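The arithmetic checks out in plain Ruby (byte counts are for this exact example; real rows add per-row and index overhead):

```ruby
require 'json'

path   = '/api/v1/users'
events = 100_000

naive = path.bytesize * events   # full path stored in every row

dictionary = path.split('/').reject(&:empty?).sum(&:bytesize)  # "api"+"v1"+"users"
per_event  = [1, 2, 3].to_json.bytesize                        # "[1,2,3]" -> 7 bytes
normalized = per_event * events + dictionary

savings = 1.0 - normalized.fdiv(naive)   # ~0.46
```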
Plus benefits:
- **Usage tracking**: `usage_count` shows hot paths
- **Analytics**: Easy to identify common path patterns
- **Flexibility**: Can query at segment level
## Normalization Process
### Event Creation Flow
```ruby
# 1. Event arrives with full path
payload = {
"request" => { "path" => "/api/v1/users/123" }
}
# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"
# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)
# 4. Path is decomposed into segments
segments = "/api/v1/users/123".split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]
# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
  path_segment = PathSegment.find_or_create_segment(segment)
  # find_or_create_by returns a persisted record either way, so
  # new_record? is always false here; previously_new_record? tells us
  # whether this call created the row (usage_count already starts at 1)
  path_segment.increment_usage! unless path_segment.previously_new_record?
  path_segment.id
end
# => [1, 2, 3, 4]
# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
```
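End to end, the flow above reduces to a find-or-create over a dictionary. A standalone sketch with an in-memory Hash in place of `PathSegment` (IDs are assigned in arrival order, unlike real database IDs):

```ruby
require 'json'

# In-memory stand-in for the path_segments table: segment text => id.
DICT = {}

def normalize_path(path)
  path.split('/').reject(&:empty?).map do |seg|
    DICT[seg] ||= DICT.size + 1   # find-or-create
  end
end

normalize_path('/api/v1/users/123')          # => [1, 2, 3, 4]
normalize_path('/api/v1/posts')              # => [1, 2, 5]  ("api", "v1" reused)
normalize_path('/api/v1/users/123').to_json  # => "[1,2,3,4]"
```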
### EventNormalizer Service
```ruby
class EventNormalizer
def normalize_path_segments
segments = @event.path_segments_array
return if segments.empty?
    segment_ids = segments.map do |segment|
      path_segment = PathSegment.find_or_create_segment(segment)
      # previously_new_record? (Rails 6.1+) avoids double-counting a
      # segment this call just created with usage_count = 1
      path_segment.increment_usage! unless path_segment.previously_new_record?
      path_segment.id
    end
# Store as array - serialize will handle JSON encoding
@event.request_segment_ids = segment_ids
end
end
```
## Important: JSON Functions and Performance
### ❌ Avoid in WHERE Clauses
JSON functions like `json_array_length()` cannot use indexes:
```ruby
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)
# ✅ FAST - Filter in Ruby after indexed query
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
```
### ✅ Use for Analytics (Async)
JSON functions are fine for analytics queries run in background jobs:
```ruby
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
def perform(project_id)
# This is OK in async context
stats = Event.where(project_id: project_id)
.select("json_array_length(request_segment_ids) as depth, COUNT(*) as count")
.group("depth")
.order(:depth)
# Store results for dashboard
PathDepthStats.create!(project_id: project_id, data: stats)
end
end
```
## Edge Cases and Considerations
### Empty Paths
```ruby
request_path = "/"
segments = [] # Empty after split and reject
request_segment_ids = [] # Empty array
# Stored as: "[]"
```
### Trailing Slashes
```ruby
"/api/v1/".split('/').reject(&:blank?)  # => ["api", "v1"]
"/api/v1".split('/').reject(&:blank?)   # => ["api", "v1"]
# Both produce the same segment list
```
### Special Characters in Segments
```ruby
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
```
Consider normalizing query params separately if needed.
### Very Deep Paths
Paths with 10+ segments work fine but consider:
- Are they legitimate? (Could indicate attack)
- Impact on JSON array size
- Consider truncating for analytics
## Analytics Use Cases
### Most Common Paths
```ruby
# Top 10 most accessed paths
Event.group(:request_segment_ids)
     .order('COUNT(*) DESC')
     .limit(10)
     .count
     .map do |seg_ids, count|
       ids  = JSON.parse(seg_ids)
       # where(id: ids) does not preserve order, so look up via a hash
       dict = PathSegment.where(id: ids).index_by(&:id)
       ["/#{ids.map { |id| dict[id]&.segment }.compact.join('/')}", count]
     end
```
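The same aggregation over plain Ruby data (a standalone sketch; `EVENTS` and `DICT` are made-up sample values):

```ruby
require 'json'

# Sample stored request_segment_ids values and a segment dictionary.
EVENTS = ['[1,2]', '[1,2]', '[1,2,3]', '[4]', '[1,2]']
DICT   = { 1 => 'api', 2 => 'v1', 3 => 'users', 4 => 'health' }

# Count identical JSON strings, then decode the most common one.
top_json, count = EVENTS.tally.max_by { |_, n| n }
top_path = '/' + JSON.parse(top_json).map { |id| DICT[id] }.join('/')
# top_path => "/api/v1", count => 3
```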
### Hot Path Segments
```ruby
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
```
### Attack Pattern Detection
```ruby
# Paths with unusual depth (possible directory traversal)
Event.where(waf_action: :deny)
.select { |e| e.path_depth > 10 }
.group_by { |e| e.request_segment_ids.first }
```
### Path-Based Rule Generation
```ruby
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
.where('created_at > ?', 1.hour.ago)
.group(:request_segment_ids)
.having('COUNT(*) > ?', 100)
.pluck(:request_segment_ids)
suspicious_paths.each do |seg_ids|
RuleSet.global.block_path_segments(seg_ids)
end
```
## Future Optimizations
### Phase 2 Considerations
If performance becomes critical:
1. **Materialized Path Column**: Pre-compute common prefix patterns
2. **Trie Data Structure**: In-memory trie for ultra-fast prefix matching
3. **Redis Cache**: Cache hot path lookups
4. **Partial Indexes**: Index only blocked/challenged events
```ruby
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
where: "waf_action IN ('deny', 'challenge')",
name: 'idx_events_blocked_paths'
```
### Storage Considerations
For very large deployments (100M+ events):
- **Archive old events**: Move to separate table
- **Aggregate path stats**: Pre-compute daily/hourly summaries
- **Compact JSON**: store `request_segment_ids` in SQLite's binary JSONB format (3.45+) for a smaller on-disk encoding
## Testing
### Test Index Usage
```ruby
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")
# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
```
### Benchmark Queries
```ruby
require 'benchmark'
prefix_ids = [1, 2]
# Test indexed range query
Benchmark.bm do |x|
x.report("Indexed range:") {
Event.with_path_prefix(prefix_ids).count
}
x.report("LIKE query:") {
Event.where("request_segment_ids LIKE ?", "[1,2,%").count
}
end
# Range query should be 10-100x faster
```
## Conclusion
Path segment normalization with JSON array storage provides:
- **Significant storage savings** (roughly 50% compression in the example above)
- **Fast prefix queries** using standard B-tree indexes
- **Analytics-friendly** usage tracking and pattern detection
- **Rails-native** storage via built-in serialization
- **Scalable** to millions of events with O(log n) lookups
The key insight: **Range queries on lexicographically-sorted JSON strings use B-tree indexes efficiently**, avoiding the need for complex full-text search or custom indexing strategies.
---
**Related Documentation:**
- [Event Ingestion](./event-ingestion.md) (TODO)
- [WAF Rule Engine](./rule-engine.md) (TODO)
- [Analytics Architecture](./analytics.md) (TODO)