First commit!
512
docs/path-segment-architecture.md
Normal file
@@ -0,0 +1,512 @@
# Path Segment Architecture

## Overview

Baffle Hub uses a path segment decomposition strategy to store and query URL paths in WAF event logs. Each path is split into segments, each distinct segment is stored once in a dictionary table, and events reference segments by ID. This reduces storage for the path column while enabling fast prefix-based path searches using SQLite's B-tree indexes.

## The Problem

WAF systems generate millions of request events. Storing full URL paths like `/api/v1/users/123/posts` repeatedly wastes storage and makes pattern-based queries inefficient.

Traditional approaches:
- **Full path storage**: High redundancy, large database size
- **String pattern matching with LIKE**: Frequently falls back to a full table scan in SQLite (see "Why Standard LIKE Doesn't Work" below), so queries are slow
- **Full-Text Search (FTS)**: Complex setup, overkill for structured paths

## Our Solution: Path Segment Normalization

### Architecture Components

```
Request: /api/v1/users/123/posts
    ↓
Decompose into segments: ["api", "v1", "users", "123", "posts"]
    ↓
Normalize to IDs: [1, 2, 3, 4, 5]
    ↓
Store as JSON array: "[1,2,3,4,5]"
```
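
The flow above can be sketched in plain Ruby. This is a minimal illustration rather than the project's implementation: `SegmentDictionary` and `decompose` are hypothetical names, and the dictionary is an in-memory stand-in for the `path_segments` table that assigns IDs in first-seen order.

```ruby
require "json"

# Hypothetical in-memory stand-in for the path_segments table.
class SegmentDictionary
  def initialize
    @ids = {}
  end

  # Returns the existing ID for a segment, or assigns the next one.
  def id_for(segment)
    @ids[segment] ||= @ids.size + 1
  end
end

# Splits a path into its non-empty segments.
def decompose(path)
  path.split("/").reject(&:empty?)
end

dictionary = SegmentDictionary.new
segments   = decompose("/api/v1/users/123/posts")
ids        = segments.map { |segment| dictionary.id_for(segment) }
stored     = JSON.generate(ids)

puts segments.inspect # ["api", "v1", "users", "123", "posts"]
puts stored           # [1,2,3,4,5]
```

A segment that has already been seen keeps its ID, so a second path under `/api/v1` would reuse IDs 1 and 2.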

### Database Schema

```ruby
# path_segments table - deduplicated segment dictionary
create_table :path_segments do |t|
  t.string :segment, null: false, index: { unique: true }
  t.integer :usage_count, default: 1, null: false
  t.datetime :first_seen_at, null: false
  t.timestamps
end

# events table - references segments by ID
create_table :events do |t|
  t.string :request_segment_ids  # JSON array: "[1,2,3]"
  t.string :request_path         # Original path for display
  # ... other fields
end

# Critical index for fast lookups
add_index :events, :request_segment_ids
```

### Models

**PathSegment** - The segment dictionary:
```ruby
class PathSegment < ApplicationRecord
  validates :segment, presence: true, uniqueness: true
  validates :usage_count, presence: true, numericality: { greater_than: 0 }

  def self.find_or_create_segment(segment)
    find_or_create_by(segment: segment) do |path_segment|
      path_segment.usage_count = 1
      path_segment.first_seen_at = Time.current
    end
  end

  def increment_usage!
    increment!(:usage_count)
  end
end
```

**Event** - Stores segment IDs as JSON array:
```ruby
class Event < ApplicationRecord
  serialize :request_segment_ids, type: Array, coder: JSON

  # Path reconstruction helper
  def reconstructed_path
    return request_path if request_segment_ids.blank?

    segments = PathSegment.where(id: request_segment_ids).index_by(&:id)
    '/' + request_segment_ids.map { |id| segments[id]&.segment }.compact.join('/')
  end

  def path_depth
    request_segment_ids&.length || 0
  end
end
```
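
Reconstruction has to follow the order of `request_segment_ids`, not the order in which rows come back from the lookup, and a segment may occur more than once in a path. The sketch below shows the same lookup-then-map step in plain Ruby; the `SEGMENTS` hash and `reconstruct_path` helper are hypothetical stand-ins for the `PathSegment` query.

```ruby
# Hypothetical dictionary standing in for PathSegment rows (id => segment).
SEGMENTS = { 1 => "api", 2 => "v1", 3 => "users" }.freeze

# Rebuilds a path from segment IDs, preserving order and repeats.
# Unknown IDs are skipped, mirroring the `&.segment` + `compact` step.
def reconstruct_path(segment_ids, dictionary = SEGMENTS)
  "/" + segment_ids.map { |id| dictionary[id] }.compact.join("/")
end

puts reconstruct_path([1, 2, 3]) # /api/v1/users
puts reconstruct_path([3, 1, 3]) # /users/api/users
puts reconstruct_path([1, 99])   # /api
puts reconstruct_path([])        # /
```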
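
## The Indexing Strategy

### Why Standard LIKE Doesn't Work

SQLite can turn `LIKE 'prefix%'` into an index range scan only under specific conditions. In particular, `LIKE` is case-insensitive by default, so the optimization applies only when the index uses the `NOCASE` collation or when `PRAGMA case_sensitive_like` is enabled. An index created with a plain `add_index` uses the default `BINARY` collation, so a `LIKE` on `request_segment_ids` falls back to a full table scan:

```sql
-- ✅ Can use an index, but only when the index collation matches LIKE's case sensitivity
WHERE column LIKE 'api%'

-- ❌ Full table scan with a default (BINARY) index and case-insensitive LIKE
WHERE request_segment_ids LIKE '[1,2,%'
```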

### The Solution: Range Queries on Lexicographic Sort

SQLite compares the stored JSON strings byte by byte (the default `BINARY` collation), so every array that extends a given prefix shares the same leading bytes and forms one contiguous run in the index. Because `,` (0x2C) sorts below the digits (0x30-0x39), which in turn sort below `]` (0x5D), the actual order is:

```
"[1,2,3]"   (prefix match - starts with "[1,2,")
"[1,2,4]"   (prefix match - starts with "[1,2,")
"[1,2,99]"  (prefix match - starts with "[1,2,")
"[1,20]"    (no match - the second ID is 20, not 2)
"[1,2]"     (exact match)
"[1,3]"     (no match - different prefix)
```

To find all paths starting with `[1,2]`:
```sql
-- Exact match OR prefix range
WHERE request_segment_ids = '[1,2]'
   OR (request_segment_ids >= '[1,2,' AND request_segment_ids < '[1,2-')
```

The range `>= '[1,2,' AND < '[1,2-'` captures exactly the strings that begin with `[1,2,`: the upper bound replaces the trailing `,` with `-` (0x2D), the next character in byte order. Incrementing the last ID instead (an upper bound of `'[1,3]'`) is not safe once IDs have more than one digit: `"[1,20]"` would fall inside the range, and for a prefix such as `[1,9]` the bound `"[1,10]"` sorts below the lower bound `"[1,9,"`, leaving the range empty.

### Query Performance

```
EXPLAIN QUERY PLAN:
MULTI-INDEX OR
├─ INDEX 1: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids=?)
└─ INDEX 2: SEARCH events USING INDEX index_events_on_request_segment_ids (request_segment_ids>? AND request_segment_ids<?)
```

Both branches search the B-tree index: locating the start of each range is O(log n), followed by a scan over the k matching rows.

### Implementation: with_path_prefix Scope

```ruby
scope :with_path_prefix, ->(prefix_segment_ids) {
  return none if prefix_segment_ids.blank?

  # Convert [1, 2] to JSON string "[1,2]"
  prefix_str = prefix_segment_ids.to_json

  # Lower bound for prefix matches: "[1,2,"
  lower_prefix_str = "#{prefix_str[0..-2]},"

  # Upper bound: the same string with the trailing "," replaced by "-",
  # the next byte after "," in ASCII: "[1,2-"
  upper_str = "#{prefix_str[0..-2]}-"

  # Range query that uses B-tree index
  where("request_segment_ids = ? OR (request_segment_ids >= ? AND request_segment_ids < ?)",
        prefix_str, lower_prefix_str, upper_str)
}
```
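
The bounds can be checked without a database, since Ruby compares strings byte by byte just as SQLite's default `BINARY` collation does. The sketch below forms the upper bound by swapping the trailing comma of the lower bound for `-`, the next ASCII character, which keeps multi-digit IDs such as 20 out of a `[1,2]` prefix range. `prefix_bounds` and `prefix_match?` are illustrative helper names, not part of the application.

```ruby
require "json"

# Builds the three bound strings for a non-empty prefix of segment IDs.
def prefix_bounds(prefix_ids)
  base = JSON.generate(prefix_ids)[0..-2] # "[1,2]" -> "[1,2"
  { exact: "#{base}]", lower: "#{base},", upper: "#{base}-" }
end

# Pure-Ruby equivalent of the WHERE clause, using byte-wise string comparison.
def prefix_match?(stored, prefix_ids)
  bounds = prefix_bounds(prefix_ids)
  stored == bounds[:exact] ||
    (stored >= bounds[:lower] && stored < bounds[:upper])
end

rows    = ["[1,2]", "[1,2,3]", "[1,2,99]", "[1,20]", "[1,3]", "[12,2]"]
matches = rows.select { |row| prefix_match?(row, [1, 2]) }
puts matches.inspect # ["[1,2]", "[1,2,3]", "[1,2,99]"]

# A prefix whose last ID ends in 9 still yields a usable range.
puts prefix_match?("[1,9,4]", [1, 9]) # true
```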

## Usage Examples

### Basic Prefix Search

```ruby
# Find all /api/v1/* paths
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')

events = Event.with_path_prefix([api_seg.id, v1_seg.id])
# Matches: /api/v1, /api/v1/users, /api/v1/users/123, etc.
```

### Combined with Other Filters

```ruby
# Blocked requests to /admin/* from specific IP
admin_seg = PathSegment.find_by(segment: 'admin')

Event.where(ip_address: '192.168.1.100')
     .where(waf_action: :deny)
     .with_path_prefix([admin_seg.id])
```

### Using Composite Index

```ruby
# POST requests to /api/* on specific host
# Uses: idx_events_host_method_path
host = RequestHost.find_by(hostname: 'api.example.com')
api_seg = PathSegment.find_by(segment: 'api')

Event.where(request_host_id: host.id, request_method: :post)
     .with_path_prefix([api_seg.id])
```

### Exact Path Match

```ruby
# Find exact path /api/v1 (not /api/v1/users)
api_seg = PathSegment.find_by(segment: 'api')
v1_seg = PathSegment.find_by(segment: 'v1')

# Compare against the stored JSON text directly, as the with_path_prefix scope
# does; a hash condition would pass the value through the attribute's serializer
Event.where("request_segment_ids = ?", [api_seg.id, v1_seg.id].to_json)
```

### Path Reconstruction for Display

```ruby
events = Event.with_path_prefix([api_seg.id]).limit(10)

events.each do |event|
  puts "#{event.reconstructed_path} - #{event.waf_action}"
  # => /api/v1/users - allow
  # => /api/v1/posts - deny
end
```

## Performance Characteristics

| Operation | Index Used | Complexity | Notes |
|-----------|-----------|------------|-------|
| Exact path match | ✅ B-tree | O(log n) | Single index lookup |
| Prefix path match | ✅ B-tree range | O(log n + k) | k = number of matches |
| Path depth filter | ❌ None | O(n) | Full table scan - use sparingly |
| Host+method+path | ✅ Composite | O(log n + k) | Optimal for WAF queries |

### Indexes in Schema

```ruby
# Single-column index for path queries
add_index :events, :request_segment_ids

# Composite index for common WAF query patterns
add_index :events, [:request_host_id, :request_method, :request_segment_ids],
          name: 'idx_events_host_method_path'
```

## Storage Efficiency

### Compression Benefits

Example: `/api/v1/users` (13 bytes) appears in 100,000 events

**Without normalization:**
```
100,000 events × 13 bytes = 1,300,000 bytes (1.3 MB)
```

**With normalization:**
```
3 segments × 10 bytes (avg) = 30 bytes
100,000 events × 7 bytes ("[1,2,3]") = 700,000 bytes (700 KB)
Total: 700,030 bytes (700 KB)

Savings: 46% reduction
```

These figures cover the path data only. The ID strings grow as segment IDs gain digits, and the schema above also keeps `request_path` for display, so the saving applies to the indexed column rather than to the whole row.
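
The arithmetic in this example can be reproduced with a few lines of Ruby. This is a back-of-the-envelope sketch: it measures the example path with `bytesize` (13 bytes), takes the 10-byte average dictionary entry from the figures above as an assumption, and ignores SQLite's per-row and index overhead.

```ruby
path        = "/api/v1/users"
event_count = 100_000

# Without normalization: the full path text is stored once per event.
raw_bytes = event_count * path.bytesize

# With normalization: one dictionary entry per segment plus an ID string per event.
segment_count     = path.split("/").reject(&:empty?).size
avg_segment_bytes = 10 # assumed average size of a dictionary entry
id_string         = "[1,2,3]"
normalized_bytes  = segment_count * avg_segment_bytes +
                    event_count * id_string.bytesize

savings_percent = ((1.0 - normalized_bytes.to_f / raw_bytes) * 100).round

puts raw_bytes        # 1300000
puts normalized_bytes # 700030
puts savings_percent  # 46
```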

Plus benefits:
- **Usage tracking**: `usage_count` shows hot paths
- **Analytics**: Easy to identify common path patterns
- **Flexibility**: Can query at segment level

## Normalization Process

### Event Creation Flow

```ruby
# 1. Event arrives with full path
payload = {
  "request" => { "path" => "/api/v1/users/123" }
}

# 2. Event model extracts path
event = Event.create_from_waf_payload!(event_id, payload, project)
# Sets: request_path = "/api/v1/users/123"

# 3. After validation, EventNormalizer runs
EventNormalizer.normalize_event!(event)

# 4. Path is decomposed into segments
segments = "/api/v1/users/123".split('/').reject(&:blank?)
# => ["api", "v1", "users", "123"]

# 5. Each segment is normalized to ID
segment_ids = segments.map do |segment|
  path_segment = PathSegment.find_or_create_segment(segment)
  # find_or_create_by returns a persisted record, so new_record? is always
  # false here; previously_new_record? is true only if this call created the row
  path_segment.increment_usage! unless path_segment.previously_new_record?
  path_segment.id
end
# => [1, 2, 3, 4]

# 6. IDs stored as JSON array
event.request_segment_ids = segment_ids
# Stored in DB as: "[1,2,3,4]"
```

### EventNormalizer Service

```ruby
class EventNormalizer
  def normalize_path_segments
    segments = @event.path_segments_array
    return if segments.empty?

    segment_ids = segments.map do |segment|
      path_segment = PathSegment.find_or_create_segment(segment)
      # Skip the increment only when this call created the row (it starts at 1)
      path_segment.increment_usage! unless path_segment.previously_new_record?
      path_segment.id
    end

    # Store as array - serialize will handle JSON encoding
    @event.request_segment_ids = segment_ids
  end
end
```
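
The intended counting semantics (a new segment starts at 1, a segment seen again is incremented) can be illustrated without a database. `SegmentStore` below is a hypothetical in-memory stand-in for `PathSegment`, and `normalize` mirrors the shape of `normalize_path_segments`; it is a sketch, not the service itself.

```ruby
# Hypothetical in-memory stand-in for the PathSegment model.
class SegmentStore
  Entry = Struct.new(:id, :segment, :usage_count)

  def initialize
    @entries = {}
  end

  # Returns [entry, created], where created is true the first time a segment is seen.
  def find_or_create(segment)
    existing = @entries[segment]
    return [existing, false] if existing

    entry = Entry.new(@entries.size + 1, segment, 1)
    @entries[segment] = entry
    [entry, true]
  end

  def usage_count(segment)
    @entries.fetch(segment).usage_count
  end
end

# Maps a path to segment IDs, incrementing usage only for pre-existing segments.
def normalize(path, store)
  path.split("/").reject(&:empty?).map do |segment|
    entry, created = store.find_or_create(segment)
    entry.usage_count += 1 unless created
    entry.id
  end
end

store = SegmentStore.new
puts normalize("/api/v1/users", store).inspect # [1, 2, 3]
puts normalize("/api/v1/posts", store).inspect # [1, 2, 4]
puts store.usage_count("api")                  # 2
puts store.usage_count("posts")                # 1
```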

## Important: JSON Functions and Performance

### ❌ Avoid in WHERE Clauses

JSON functions like `json_array_length()` cannot use the indexes defined above, because those indexes store the raw column value rather than the function result:

```ruby
# ❌ SLOW - Full table scan
Event.where("json_array_length(request_segment_ids) = ?", 3)

# ✅ FAST - Narrow with the indexed query, then filter the loaded rows in Ruby
Event.with_path_prefix([api_id]).select { |e| e.path_depth == 3 }
```

### ✅ Use for Analytics (Async)

JSON functions are fine for analytics queries run in background jobs:

```ruby
# Background job for analytics
class PathDepthAnalysisJob < ApplicationJob
  def perform(project_id)
    # This is OK in async context
    # Grouped count returns a hash of { depth => number of events }
    stats = Event.where(project_id: project_id)
                 .group("json_array_length(request_segment_ids)")
                 .count

    # Store results for dashboard, ordered by depth
    PathDepthStats.create!(project_id: project_id, data: stats.sort.to_h)
  end
end
```
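
For small result sets, the same depth histogram can be computed in Ruby from the stored JSON strings. A minimal sketch, assuming the raw column values are available as an array of strings; the `depth_histogram` helper is illustrative only.

```ruby
require "json"

# Counts how many stored ID arrays have each depth, ordered by depth.
def depth_histogram(stored_values)
  stored_values
    .map { |json| JSON.parse(json).length }
    .tally
    .sort
    .to_h
end

stored    = ["[1,2,3]", "[1,2]", "[4]", "[1,2,5]", "[]"]
histogram = depth_histogram(stored)

# Depths 0, 1 and 2 each occur once; depth 3 occurs twice.
histogram.each { |depth, count| puts "depth #{depth}: #{count}" }
```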

## Edge Cases and Considerations

### Empty Paths

```ruby
request_path = "/"
segments = []             # Empty after split and reject
request_segment_ids = []  # Empty array
# Stored as: "[]"
```

### Trailing Slashes

```ruby
# Both forms normalize to the same segments
"/api/v1/".split('/').reject(&:blank?)  # => ["api", "v1"]
"/api/v1".split('/').reject(&:blank?)   # => ["api", "v1"]
```

### Special Characters in Segments

```ruby
# URL-encoded segments are stored as-is
"/search?q=hello%20world"
# Segments: ["search?q=hello%20world"]
```

If the path handed to the normalizer still carries a query string, the query ends up inside the last segment. Consider splitting off and normalizing query params separately if needed.
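
One way to keep query strings out of the segment dictionary is to split them off before decomposing the path. The sketch below uses Ruby's standard `uri` library; `split_request_target` is a hypothetical helper, not part of the code shown in this document.

```ruby
require "uri"

# Separates the path segments from the decoded query parameters.
def split_request_target(target)
  uri      = URI.parse(target)
  segments = uri.path.split("/").reject(&:empty?)
  params   = uri.query ? URI.decode_www_form(uri.query) : []
  [segments, params]
end

segments, params = split_request_target("/search?q=hello%20world")
puts segments.inspect # ["search"]
puts params.inspect   # [["q", "hello world"]]
```

`URI.parse` raises `URI::InvalidURIError` on malformed input, which WAF traffic may well contain, so a real implementation would need a fallback for such requests.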

### Very Deep Paths

Paths with 10+ segments work fine but consider:
- Are they legitimate? (Could indicate attack)
- Impact on JSON array size
- Consider truncating for analytics

## Analytics Use Cases

### Most Common Paths

```ruby
# Top 10 most accessed paths
Event.group(:request_segment_ids)
     .order(Arel.sql('COUNT(*) DESC'))
     .limit(10)
     .count
     .map { |seg_ids, count|
       # Group keys may come back as JSON text or as already-decoded arrays
       ids = seg_ids.is_a?(String) ? JSON.parse(seg_ids) : Array(seg_ids)
       # Look segments up by ID so order and repeated segments are preserved
       segments = PathSegment.where(id: ids).index_by(&:id)
       path = ids.map { |id| segments[id]&.segment }.compact.join('/')
       ["/#{path}", count]
     }
```

### Hot Path Segments

```ruby
# Most frequently used segments (indicates common endpoints)
PathSegment.order(usage_count: :desc).limit(20)
```

### Attack Pattern Detection

```ruby
# Paths with unusual depth (possible directory traversal)
# Note: select with a block loads every denied event into memory first
Event.where(waf_action: :deny)
     .select { |e| e.path_depth > 10 }
     .group_by { |e| e.request_segment_ids.first }
```

### Path-Based Rule Generation

```ruby
# Auto-block paths that are frequently denied
suspicious_paths = Event.where(waf_action: :deny)
                        .where('created_at > ?', 1.hour.ago)
                        .group(:request_segment_ids)
                        .having('COUNT(*) > ?', 100)
                        .pluck(:request_segment_ids)

suspicious_paths.each do |seg_ids|
  RuleSet.global.block_path_segments(seg_ids)
end
```

## Future Optimizations

### Phase 2 Considerations

If performance becomes critical:

1. **Materialized Path Column**: Pre-compute common prefix patterns
2. **Trie Data Structure**: In-memory trie for ultra-fast prefix matching
3. **Redis Cache**: Cache hot path lookups
4. **Partial Indexes**: Index only blocked/challenged events

```ruby
# Example: Partial index for security-relevant events
add_index :events, :request_segment_ids,
          where: "waf_action IN ('deny', 'challenge')",
          name: 'idx_events_blocked_paths'
```
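
To make the trie option in the list above more concrete, here is a minimal in-memory sketch keyed by segment ID. It is purely illustrative: `SegmentTrie` and its methods are hypothetical names, and the sketch ignores persistence, eviction, and concurrency.

```ruby
# Minimal in-memory trie keyed by segment ID (illustrative only).
class SegmentTrie
  Node = Struct.new(:children, :count)

  def initialize
    @root = Node.new({}, 0)
  end

  # Records one event path, bumping the count on every node along it.
  def insert(segment_ids)
    node = @root
    node.count += 1
    segment_ids.each do |id|
      node = (node.children[id] ||= Node.new({}, 0))
      node.count += 1
    end
  end

  # Number of inserted paths that start with the given prefix.
  def count_with_prefix(prefix_ids)
    node = @root
    prefix_ids.each do |id|
      node = node.children[id]
      return 0 unless node
    end
    node.count
  end
end

trie = SegmentTrie.new
[[1, 2], [1, 2, 3], [1, 20], [1, 3]].each { |ids| trie.insert(ids) }

puts trie.count_with_prefix([1, 2]) # 2
puts trie.count_with_prefix([1])    # 4
puts trie.count_with_prefix([9])    # 0
```

Because the trie compares whole IDs rather than characters, a prefix of `[1, 2]` never matches a path whose second ID is 20.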

### Storage Considerations

For very large deployments (100M+ events):

- **Archive old events**: Move to separate table
- **Aggregate path stats**: Pre-compute daily/hourly summaries
- **Compact encoding**: SQLite's JSON functions do not compress data, so any more compact encoding of the ID list must preserve the byte ordering that the prefix range query depends on

## Testing

### Test Index Usage

```ruby
# Verify B-tree index is being used
sql = Event.with_path_prefix([1, 2]).to_sql
plan = ActiveRecord::Base.connection.execute("EXPLAIN QUERY PLAN #{sql}")

# Should see: "SEARCH events USING INDEX index_events_on_request_segment_ids"
puts plan.to_a
```
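
In an automated test, the plan rows can be checked rather than printed. The helper below is a sketch that assumes each row is a hash with a `"detail"` key, matching the `detail` column that SQLite returns for `EXPLAIN QUERY PLAN`. The sample rows are hand-written for illustration, not captured output, and `plan_uses_index?` is a hypothetical name.

```ruby
# True if every table access in the plan is an index SEARCH on the named index.
def plan_uses_index?(plan_rows, index_name)
  pattern  = /USING (?:COVERING )?INDEX #{Regexp.escape(index_name)}\b/
  details  = plan_rows.map { |row| row["detail"].to_s }
  accesses = details.select { |detail| detail.start_with?("SEARCH", "SCAN") }
  return false if accesses.empty?

  accesses.all? { |detail| detail.start_with?("SEARCH") && detail.match?(pattern) }
end

index = "index_events_on_request_segment_ids"

# Hand-written sample rows for illustration.
indexed_plan = [
  { "detail" => "MULTI-INDEX OR" },
  { "detail" => "INDEX 1" },
  { "detail" => "SEARCH events USING INDEX #{index} (request_segment_ids=?)" },
  { "detail" => "INDEX 2" },
  { "detail" => "SEARCH events USING INDEX #{index} (request_segment_ids>? AND request_segment_ids<?)" }
]
scan_plan = [{ "detail" => "SCAN events" }]

puts plan_uses_index?(indexed_plan, index) # true
puts plan_uses_index?(scan_plan, index)    # false
```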

### Benchmark Queries

```ruby
require 'benchmark'

prefix_ids = [1, 2]

# Compare the indexed range query with a LIKE query on the same prefix
Benchmark.bm do |x|
  x.report("Indexed range:") {
    Event.with_path_prefix(prefix_ids).count
  }

  x.report("LIKE query:") {
    Event.where("request_segment_ids LIKE ?", "[1,2,%").count
  }
end

# The range query is expected to be much faster on large tables (the estimate
# here is 10-100x); measure against production-sized data to confirm
```

## Conclusion

Path segment normalization with JSON array storage provides:

- ✅ **Significant storage savings** for the path column (roughly half in the worked example)
- ✅ **Fast prefix queries** using standard B-tree indexes
- ✅ **Analytics-friendly** with usage tracking and pattern detection
- ✅ **Rails-native** using built-in serialization
- ✅ **Scalable** to millions of events with O(log n) lookups

The key insight: **Range queries on lexicographically-sorted JSON strings use B-tree indexes efficiently**, avoiding the need for complex full-text search or custom indexing strategies.

---

**Related Documentation:**
- [Event Ingestion](./event-ingestion.md) (TODO)
- [WAF Rule Engine](./rule-engine.md) (TODO)
- [Analytics Architecture](./analytics.md) (TODO)