First commit!

Commit 429d41eead by Dan Milne, 2025-11-03 17:37:28 +11:00
141 changed files with 5890 additions and 0 deletions

# Path Segment Architecture
## Overview
Baffle Hub uses a path segment decomposition strategy to efficiently store and query URL paths in WAF event logs. This architecture provides significant storage compression while enabling fast prefix-based path searches using SQLite's B-tree indexes.
## The Problem
WAF systems generate millions of request events. Storing full URL paths like `/api/v1/users/123/posts` repeatedly wastes storage and makes pattern-based queries inefficient.
Traditional approaches:
- **Full path storage**: High redundancy, large database size
- **String pattern matching with LIKE**: No index support, slow queries
- **Full-Text Search (FTS)**: Complex setup, overkill for structured paths
## Our Solution: Path Segment Normalization
### Architecture Components
```
Request: /api/v1/users/123/posts
Decompose into segments: ["api", "v1", "users", "123", "posts"]
Normalize to IDs: [1, 2, 3, 4, 5]
Store as JSON array: "[1,2,3,4,5]"
```
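The decomposition step is plain string splitting. A standalone sketch (outside Rails, so `reject(&:empty?)` stands in for ActiveSupport's `blank?`):

```ruby
# Split a URL path into its non-empty segments.
def path_segments(path)
  path.split('/').reject(&:empty?)
end

path_segments('/api/v1/users/123/posts')
# => ["api", "v1", "users", "123", "posts"]
path_segments('/')  # => []
```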
### Database Schema
```ruby
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
t.string :segment, null: false, index: { unique: true }
t.integer :usage_count, default: 1, null: false
t.datetime :first_seen_at, null: false
t.timestamps
end
# events table - references segments by ID
create_table :events do |t|
t.string :request_segment_ids # JSON array: "[1,2,3]"
t.string :request_path # Original path for display
# ... other fields
end
# Critical index for fast lookups
add_index :events, :request_segment_ids
```
### Models
**PathSegment** - The segment dictionary:
```ruby
class PathSegment < ApplicationRecord
validates :segment, presence: true, uniqueness: true
validates :usage_count, presence: true, numericality: { greater_than: 0 }
def self.find_or_create_segment(segment)
find_or_create_by(segment: segment) do |path_segment|
path_segment.usage_count = 1
path_segment.first_seen_at = Time.current
end
end
def increment_usage!
increment!(:usage_count)
end
end
```
**Event** - Stores segment IDs as JSON array:
```ruby
class Event < ApplicationRecord
serialize :request_segment_ids, type: Array, coder: JSON
# Path reconstruction helper
def reconstructed_path
return request_path if request_segment_ids.blank?
segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
'/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
end
def path_depth
request_segment_ids&.length || 0
end
end
```
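Outside Rails, reconstruction is just a dictionary lookup. A minimal standalone sketch, with a plain Hash standing in for the `path_segments` table:

```ruby
# Hypothetical in-memory dictionary: id => segment text.
SEGMENTS = { 1 => 'api', 2 => 'v1', 3 => 'users' }

def reconstruct(segment_ids, dictionary = SEGMENTS)
  # Unknown IDs map to nil and are dropped by compact,
  # mirroring reconstructed_path above.
  '/' + segment_ids.map { |id| dictionary[id] }.compact.join('/')
end

reconstruct([1, 2, 3])  # => "/api/v1/users"
reconstruct([])         # => "/"
```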
## The Indexing Strategy
### Why Standard LIKE Doesn't Work
SQLite's LIKE optimization converts a pattern with a literal prefix into an indexed range scan only under narrow conditions: the index's collation must match the LIKE's case sensitivity (for the default BINARY index, that means `PRAGMA case_sensitive_like = ON`). We don't want to depend on those settings:
```sql
-- May use the index, but only with the right collation settings
WHERE column LIKE 'api%'
-- Full table scan in the default configuration
WHERE request_segment_ids LIKE '[1,2,%'
```
### The Solution: Range Queries on Lexicographic Sort
Stored as TEXT, the JSON arrays sort byte-wise in SQLite:
```
"[1,2,3]"   (prefix match - starts with "[1,2,")
"[1,2,4]"   (prefix match)
"[1,2,99]"  (prefix match)
"[1,2]"     (exact match - ']' is a later byte than ',')
"[1,3]"     (different prefix)
```
Note that the exact match sorts *after* all of its prefix matches, which is why the query needs an equality branch in addition to the range.
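This ordering can be verified in plain Ruby, since Ruby's `String#<=>` is also byte-wise:

```ruby
strings = ['[1,2]', '[1,3]', '[1,2,99]', '[1,2,3]', '[1,2,4]']
strings.sort
# => ["[1,2,3]", "[1,2,4]", "[1,2,99]", "[1,2]", "[1,3]"]

# ']' (0x5D) is a later byte than ',' (0x2C), so the exact match
# "[1,2]" sorts after every longer array that extends it.
'[1,2]' > '[1,2,99]'  # => true
```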
To find all paths starting with `[1,2]`:
```sql
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
   OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,2-')
```
Every true prefix match begins with the bytes `[1,2,`, so the range `>= '[1,2,' AND < '[1,2-'` (where `-` is the byte immediately after `,`) captures exactly those rows. A looser upper bound such as `'[1,3]'` would also admit false positives like `[1,20,...]`, whose second segment ID merely starts with the digit 2.
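The bound arithmetic can be sanity-checked with byte-wise string comparison in plain Ruby (`prefix_match?` is a throwaway helper for this sketch, not part of the app):

```ruby
require 'json'

# Bounds used by the range query, for the ID prefix [1, 2].
prefix = [1, 2].to_json   # "[1,2]"  exact-match value
base   = prefix[0..-2]    # "[1,2"
lower  = "#{base},"       # "[1,2,"  inclusive lower bound
upper  = "#{base}-"       # "[1,2-"  exclusive upper bound ('-' = ',' + 1)

def prefix_match?(candidate, exact, lower, upper)
  candidate == exact || (candidate >= lower && candidate < upper)
end

prefix_match?('[1,2,3]', prefix, lower, upper)  # => true
prefix_match?('[1,2]',   prefix, lower, upper)  # => true  (equality branch)
prefix_match?('[1,20]',  prefix, lower, upper)  # => false (ID 20, not a prefix)
prefix_match?('[1,3]',   prefix, lower, upper)  # => false
```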
### Query Performance
```
EXPLAIN QUERY PLAN:
MULTI-INDEX OR
├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
└─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
```
Both branches use the B-tree index, so each lookup is O(log n).
### Implementation: with_path_prefix Scope
```ruby
scope :with_path_prefix, ->(prefix_segment_ids) {
  return none if prefix_segment_ids.blank?

  # Convert [1, 2] to the stored JSON string "[1,2]"
  prefix_str = prefix_segment_ids.to_json

  # All prefix matches start with the bytes "[1,2," -- swap the
  # closing bracket for a comma to get the inclusive lower bound
  lower_str = "#{prefix_str[0..-2]},"

  # '-' is the byte after ',', so "[1,2-" is a tight exclusive upper
  # bound that excludes false positives such as "[1,20,...]"
  upper_str = "#{prefix_str[0..-2]}-"

  # Exact match OR range query; both branches use the B-tree index
  where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
        prefix_str, lower_str, upper_str)
}
```
## Usage Examples
### Basic Prefix Search
```ruby
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
```
### Combined with Other Filters
```ruby
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')
Event.where(ip_address: '192.168.1.100')
.where(waf_action: :deny)
.with_path_prefix([admin_seg.id])
```
### Using Composite Index
```ruby
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')
Event.where(request_host_id: host.id, request_method: :post)
.with_path_prefix([api_seg.id])
```
### Exact Path Match
```ruby
# Find the exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')
# Pass the raw array: the JSON coder serializes it to "[1,2]" for the
# comparison. Passing a pre-encoded string would be double-encoded.
Event.where(request_segment_ids: [api_seg.id, v1_seg.id])
```
### Path Reconstruction for Display
```ruby
events = Event.with_path_prefix([api_seg.id]).limit(10)
events.each do |event|
puts "#{event.reconstructed_path} - #{event.waf_action}"
# => /api/v1/users - allow
# => /api/v1/posts - deny
end
```
## Performance Characteristics
| Operation | Index Used | Complexity | Notes |
|-----------|-----------|------------|-------|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |
### Indexes in Schema
```ruby
# Single-column index for path queries
add_index :events, :request_segment_ids
# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
name: 'idx_events_host_method_path'
```
## Storage Efficiency
### Compression Benefits
Example: `/api/v1/users` (13 bytes) appears in 100,000 events
**Without normalization:**
```
100,000 events × 13 bytes = 1,300,000 bytes (~1.3 MB)
```
**With normalization:**
```
Dictionary: "api" + "v1" + "users" = 10 bytes (plus row overhead)
100,000 events × 7 bytes ("[1,2,3]") = 700,000 bytes (~700 KB)
Savings: ~46% reduction (more for longer or deeper paths)
```
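The arithmetic checks out in plain Ruby (byte counts are for this exact example; real rows add per-row and index overhead):

```ruby
require 'json'

path   = '/api/v1/users'
events = 100_000

naive = path.bytesize * events   # full path stored in every row

dictionary = path.split('/').reject(&:empty?).sum(&:bytesize)  # "api"+"v1"+"users"
per_event  = [1, 2, 3].to_json.bytesize                        # "[1,2,3]" -> 7 bytes
normalized = per_event * events + dictionary

savings = 1.0 - normalized.fdiv(naive)   # ~0.46
```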
Plus benefits:
- **Usage tracking**: `usage_count` shows hot paths
- **Analytics**: Easy to identify common path patterns
- **Flexibility**: Can query at segment level
## Normalization Process
### Event Creation Flow
```ruby
# 1. Event arrives with full path
payload = {
"request" => { "path" => "/api/v1/users/123" }
}
# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"
# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)
# 4. Path is decomposed into segments
segments = "/api/v1/users/123".split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]
# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
  path_segment = PathSegment.find_or_create_segment(segment)
  # find_or_create_by returns a persisted record either way, so
  # new_record? is always false here; previously_new_record? tells us
  # whether this call created the row (usage_count already starts at 1)
  path_segment.increment_usage! unless path_segment.previously_new_record?
  path_segment.id
end
# => [1, 2, 3, 4]
# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
```
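End to end, the flow above reduces to a find-or-create over a dictionary. A standalone sketch with an in-memory Hash in place of `PathSegment` (IDs are assigned in arrival order, unlike real database IDs):

```ruby
require 'json'

# In-memory stand-in for the path_segments table: segment text => id.
DICT = {}

def normalize_path(path)
  path.split('/').reject(&:empty?).map do |seg|
    DICT[seg] ||= DICT.size + 1   # find-or-create
  end
end

normalize_path('/api/v1/users/123')          # => [1, 2, 3, 4]
normalize_path('/api/v1/posts')              # => [1, 2, 5]  ("api", "v1" reused)
normalize_path('/api/v1/users/123').to_json  # => "[1,2,3,4]"
```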
### EventNormalizer Service
```ruby
class EventNormalizer
def normalize_path_segments
segments = @event.path_segments_array
return if segments.empty?
    segment_ids = segments.map do |segment|
      path_segment = PathSegment.find_or_create_segment(segment)
      # previously_new_record? (Rails 6.1+) avoids double-counting a
      # segment this call just created with usage_count = 1
      path_segment.increment_usage! unless path_segment.previously_new_record?
      path_segment.id
    end
# Store as array - serialize will handle JSON encoding
@event.request_segment_ids = segment_ids
end
end
```
## Important: JSON Functions and Performance
### ❌ Avoid in WHERE Clauses
JSON functions like `json_array_length()` cannot use indexes:
```ruby
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)
# ✅ FAST - Filter in Ruby after indexed query
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
```
### ✅ Use for Analytics (Async)
JSON functions are fine for analytics queries run in background jobs:
```ruby
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
def perform(project_id)
# This is OK in async context
stats = Event.where(project_id: project_id)
.select("json_array_length(request_segment_ids) as depth, COUNT(*) as count")
.group("depth")
.order(:depth)
# Store results for dashboard
PathDepthStats.create!(project_id: project_id, data: stats)
end
end
```
## Edge Cases and Considerations
### Empty Paths
```ruby
request_path = "/"
segments = [] # Empty after split and reject
request_segment_ids = [] # Empty array
# Stored as: "[]"
```
### Trailing Slashes
```ruby
"/api/v1/".split('/').reject(&:blank?)  # => ["api", "v1"]
"/api/v1".split('/').reject(&:blank?)   # => ["api", "v1"]
# Both produce the same segment list
```
### Special Characters in Segments
```ruby
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
```
Consider normalizing query params separately if needed.
### Very Deep Paths
Paths with 10+ segments work fine but consider:
- Are they legitimate? (Could indicate attack)
- Impact on JSON array size
- Consider truncating for analytics
## Analytics Use Cases
### Most Common Paths
```ruby
# Top 10 most accessed paths
Event.group(:request_segment_ids)
     .order('COUNT(*) DESC')
     .limit(10)
     .count
     .map do |seg_ids, count|
       ids  = JSON.parse(seg_ids)
       # where(id: ids) does not preserve order, so look up via a hash
       dict = PathSegment.where(id: ids).index_by(&:id)
       ["/#{ids.map { |id| dict[id]&.segment }.compact.join('/')}", count]
     end
```
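The same aggregation over plain Ruby data (a standalone sketch; `EVENTS` and `DICT` are made-up sample values):

```ruby
require 'json'

# Sample stored request_segment_ids values and a segment dictionary.
EVENTS = ['[1,2]', '[1,2]', '[1,2,3]', '[4]', '[1,2]']
DICT   = { 1 => 'api', 2 => 'v1', 3 => 'users', 4 => 'health' }

# Count identical JSON strings, then decode the most common one.
top_json, count = EVENTS.tally.max_by { |_, n| n }
top_path = '/' + JSON.parse(top_json).map { |id| DICT[id] }.join('/')
# top_path => "/api/v1", count => 3
```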
### Hot Path Segments
```ruby
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
```
### Attack Pattern Detection
```ruby
# Paths with unusual depth (possible directory traversal)
Event.where(waf_action: :deny)
.select { |e| e.path_depth > 10 }
.group_by { |e| e.request_segment_ids.first }
```
### Path-Based Rule Generation
```ruby
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
.where('created_at > ?', 1.hour.ago)
.group(:request_segment_ids)
.having('COUNT(*) > ?', 100)
.pluck(:request_segment_ids)
suspicious_paths.each do |seg_ids|
RuleSet.global.block_path_segments(seg_ids)
end
```
## Future Optimizations
### Phase 2 Considerations
If performance becomes critical:
1. **Materialized Path Column**: Pre-compute common prefix patterns
2. **Trie Data Structure**: In-memory trie for ultra-fast prefix matching
3. **Redis Cache**: Cache hot path lookups
4. **Partial Indexes**: Index only blocked/challenged events
```ruby
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
where: "waf_action IN ('deny', 'challenge')",
name: 'idx_events_blocked_paths'
```
### Storage Considerations
For very large deployments (100M+ events):
- **Archive old events**: Move to separate table
- **Aggregate path stats**: Pre-compute daily/hourly summaries
- **Compact JSON**: store `request_segment_ids` in SQLite's binary JSONB format (3.45+) for a smaller on-disk encoding
## Testing
### Test Index Usage
```ruby
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")
# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
```
### Benchmark Queries
```ruby
require 'benchmark'
prefix_ids = [1, 2]
# Test indexed range query
Benchmark.bm do |x|
x.report("Indexed range:") {
Event.with_path_prefix(prefix_ids).count
}
x.report("LIKE query:") {
Event.where("request_segment_ids LIKE ?", "[1,2,%").count
}
end
# Range query should be 10-100x faster
```
## Conclusion
Path segment normalization with JSON array storage provides:
- **Significant storage savings** (roughly 50% compression in the example above)
- **Fast prefix queries** using standard B-tree indexes
- **Analytics-friendly** usage tracking and pattern detection
- **Rails-native** storage via built-in serialization
- **Scalable** to millions of events with O(log n) lookups
The key insight: **Range queries on lexicographically-sorted JSON strings use B-tree indexes efficiently**, avoiding the need for complex full-text search or custom indexing strategies.
---
**Related Documentation:**
- [Event Ingestion](./event-ingestion.md) (TODO)
- [WAF Rule Engine](./rule-engine.md) (TODO)
- [Analytics Architecture](./analytics.md) (TODO)