@drernie drernie commented Dec 29, 2025

Summary

Documents the EventBridge → SNS → SQS → Lambda routing architecture, replacing the hypothesis-focused analysis with an architectural guide centered on the core problem (SNS message wrapping).

Key Changes:

  • Renamed 10-input-transformer-hypothesis.md → 10-eventbridge-routing.md
  • Restructured content to focus on the core problem (SNS message wrapping) rather than one solution approach
  • Added detailed Lambda processing patterns from code review
  • Expanded testing strategy with isolation requirements and common mistakes

Core Documentation Sections

Message Flow Architecture

  • EventBridge → SNS → SQS → Lambda transformation chain
  • Example message structures at each layer
  • Required unwrapping steps (SQS → SNS → Event payload)
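
The unwrapping chain above can be sketched as follows. This is a minimal illustration, not the Lambda code from the platform: it assumes a Python runtime, assumes raw message delivery is disabled on the SNS → SQS subscription (so the SNS envelope is present), and uses a hypothetical bucket, key, and helper name (`unwrap_sqs_record`).

```python
import json


def unwrap_sqs_record(record: dict) -> dict:
    """Peel SQS -> SNS -> event payload for a single SQS record."""
    # Layer 1: the SQS record body is a JSON string holding the SNS envelope.
    sns_envelope = json.loads(record["body"])
    # Layer 2: the SNS envelope's "Message" field is itself a JSON string
    # holding the event that EventBridge published to the topic.
    return json.loads(sns_envelope["Message"])


# Hypothetical CloudTrail-style event as EventBridge would forward it.
event = {
    "source": "aws.s3",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventName": "PutObject",
        "requestParameters": {"bucketName": "example-bucket", "key": "data/file.csv"},
    },
}

# Rebuild what the Lambda receives: the event wrapped by SNS, then by SQS.
sns_envelope = {"Type": "Notification", "Message": json.dumps(event)}
lambda_event = {"Records": [{"eventSource": "aws:sqs", "body": json.dumps(sns_envelope)}]}

payloads = [unwrap_sqs_record(r) for r in lambda_event["Records"]]
```

A handler that skips the second `json.loads` sees the SNS envelope instead of the event, which matches the failure mode described for ManifestIndexer.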

Lambda Processing Patterns

Analyzed 4 Lambda types from actual code:

  • ManifestIndexer (≤1.65): Missing SNS unwrapping → crashes
  • SearchHandler: Correct SNS unwrapping → works with S3 Records
  • ⚠️ EsIngest: EventBridge variant format → untested
  • ⚠️ Iceberg: EventBridge format → untested

The Two Event Sources Problem

Testing was flaky because production environments may have both event sources active:

  1. Direct S3 Notifications (instant, S3 Records format)
  2. EventBridge Route (3-5 min delay, EventBridge format)

Critical insight: Files appeared in search via direct S3 notifications, masking EventBridge routing failures.
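
One way to check whether a bucket still has direct S3 notifications before testing the EventBridge route is to inspect its notification configuration. The sketch below is a pure function over the dictionary shape returned by boto3's `get_bucket_notification_configuration`; the helper name and the sample ARN are hypothetical.

```python
# Maps each direct-notification section of the S3 bucket notification
# configuration to the field holding its target ARN.
ARN_FIELDS = {
    "TopicConfigurations": "TopicArn",
    "QueueConfigurations": "QueueArn",
    "LambdaFunctionConfigurations": "LambdaFunctionArn",
}


def direct_notification_targets(config: dict) -> list[str]:
    """Return the ARNs of every target wired directly to the bucket."""
    targets = []
    for section, arn_field in ARN_FIELDS.items():
        for entry in config.get(section, []):
            targets.append(entry[arn_field])
    return targets


# Hypothetical configuration with one direct SNS notification still in place.
sample = {
    "TopicConfigurations": [
        {
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:example-topic",
            "Events": ["s3:ObjectCreated:*"],
        }
    ]
}

# The EventBridge route is only isolated when no direct targets remain.
isolated = not direct_notification_targets(sample)
```

In a real check, `sample` would come from `boto3.client("s3").get_bucket_notification_configuration(Bucket=...)`. A non-empty result means success in search may be coming from the direct route.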

Solutions & Approaches

Solution 1: Fix Lambda Code (Recommended)

  • Add SNS unwrapping to ManifestIndexer
  • Minimal code change, maintains backward compatibility
  • Platform 1.66+

Solution 2: Input Transformers (Limited Use)

  • Converts EventBridge → S3 Records format
  • Helps SearchHandler process EventBridge events
  • ❌ Cannot solve SNS unwrapping issue (transforms BEFORE SNS wrapping)
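
The ordering limitation can be illustrated with a small simulation. This is not real EventBridge behavior in code form, only a stand-in under stated assumptions: `input_transform` is a hypothetical Python equivalent of an Input Transformer, the field mapping is illustrative, and the SNS/SQS wrapping assumes raw message delivery is disabled.

```python
import json


def input_transform(eb_event: dict) -> dict:
    """Stand-in for an Input Transformer: reshape to S3 Records format."""
    params = eb_event["detail"]["requestParameters"]
    return {
        "Records": [
            {
                "eventSource": "aws:s3",
                "eventName": "ObjectCreated:Put",
                "s3": {
                    "bucket": {"name": params["bucketName"]},
                    "object": {"key": params["key"]},
                },
            }
        ]
    }


def deliver_via_sns_sqs(payload: dict) -> dict:
    """Wrap the (already transformed) payload the way SNS and SQS do."""
    sns_envelope = {"Type": "Notification", "Message": json.dumps(payload)}
    return {"Records": [{"eventSource": "aws:sqs", "body": json.dumps(sns_envelope)}]}


# Hypothetical CloudTrail-style event.
eb_event = {
    "detail": {
        "eventName": "PutObject",
        "requestParameters": {"bucketName": "example-bucket", "key": "data/file.csv"},
    }
}

# The transform runs first; the wrapping is applied afterwards.
lambda_event = deliver_via_sns_sqs(input_transform(eb_event))
first = lambda_event["Records"][0]

# The S3 record is not at the top level; it is still two JSON layers down.
inner = json.loads(json.loads(first["body"])["Message"])
```

Here `first` has a `body` field and no `s3` field, so a Lambda that reads `Records[0]["s3"]` directly still fails even though the transformer produced a valid S3 Records payload.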

Solution 3: Dual Format Support (Comprehensive)

  • Enhanced Lambdas detect and handle multiple formats
  • Handles all event sources gracefully
  • Platform 1.66+
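
A rough sketch of what dual format detection could look like, assuming a Python Lambda and an already unwrapped payload. The function name and the choice of fields are assumptions for illustration, not the platform's implementation.

```python
def extract_objects(payload: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from either supported payload shape."""
    # S3 Records format (direct S3 notifications, or Input Transformer output).
    if "Records" in payload:
        return [
            (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
            for r in payload["Records"]
            if "s3" in r
        ]
    # EventBridge format carrying a CloudTrail-style "detail" section.
    if "detail" in payload:
        params = payload["detail"].get("requestParameters") or {}
        if "bucketName" in params and "key" in params:
            return [(params["bucketName"], params["key"])]
    # Unknown shape: report nothing rather than crash.
    return []


# Hypothetical payloads in each format.
s3_payload = {
    "Records": [{"s3": {"bucket": {"name": "example-bucket"}, "object": {"key": "a.csv"}}}]
}
eb_payload = {
    "detail": {"requestParameters": {"bucketName": "example-bucket", "key": "b.csv"}}
}

s3_objects = extract_objects(s3_payload)
eb_objects = extract_objects(eb_payload)
```

Returning an empty list for unrecognized shapes is one design choice; a real handler might prefer to log and raise so that malformed events reach a dead-letter queue.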

Testing Strategy

Critical Testing Principle: ALWAYS isolate event sources to avoid false positives

4 Test Scenarios:

  1. Baseline - Direct S3 Notifications
  2. EventBridge WITHOUT Input Transformer
  3. EventBridge WITH Input Transformer
  4. Platform 1.66 with SNS Unwrap Fix

Common Testing Mistakes:

  • ❌ Relying on intermediate metrics (EventBridge triggered ≠ Lambda success)
  • ❌ Not isolating event sources (success via S3, not EventBridge)
  • ❌ Not waiting for CloudTrail (3-5 min delay)
  • ❌ Assuming success from invocation count (may be crashing)
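
The mistakes above share one root: treating an intermediate signal as the result. A hedged sketch of a stricter check follows; the function and its labels are hypothetical, and the counts are assumed to come from the Lambda `Invocations` and `Errors` CloudWatch metrics plus a separate end-to-end check (for example, the package actually appearing).

```python
def classify_run(invocations: int, errors: int, indexed_end_to_end: bool) -> str:
    """Classify a test run using end-to-end evidence, not invocation counts."""
    if invocations == 0:
        return "not-invoked"
    if errors > 0:
        # The Lambda ran but crashed at least once.
        return "invoked-but-failing"
    if not indexed_end_to_end:
        # No errors reported, yet the expected result never appeared.
        return "invoked-no-result"
    return "ok"


# A run where every invocation crashed still shows a healthy invocation count.
crashing = classify_run(invocations=5, errors=5, indexed_end_to_end=False)
healthy = classify_run(invocations=5, errors=0, indexed_end_to_end=True)
```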

Production Deployment

Version-Specific Recommendations:

  • Platform ≤1.65: ManifestIndexer will NOT work with EventBridge routing
  • Platform 1.66: ManifestIndexer works, SearchHandler needs Input Transformer
  • Platform 1.66+: All Lambdas work with dual format support

Impact

This documentation:

  • ✅ Explains why EventBridge routing breaks certain Lambdas
  • ✅ Provides clear testing methodology to avoid false positives
  • ✅ Documents three solution approaches with version guidance
  • ✅ Includes S3 Event Notification management commands
  • ✅ Focuses on architectural understanding, not one failed solution

Related Issues

  • Resolves confusion about Input Transformer effectiveness
  • Documents the "two event sources" testing trap
  • Provides version-specific migration path

🤖 Generated with Claude Code

drernie and others added 6 commits December 29, 2025 11:00
Customer (FL109) followed the EventBridge docs, but package creation is not working.
Analysis shows:
- EventBridge rule firing correctly
- File indexing works
- Package indexing broken
- Input transformer missing (sending raw CloudTrail format)
- PackagerQueue not subscribed to SNS

Documentation gaps identified for testing and resolution.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Based on customer interaction transcript:
- No Input Transformer was added (confirmed @ 52:20)
- PackagerQueue subscriptions handled automatically by Quilt
- SNS policy fix (events.amazonaws.com) was the only change needed
- Quilt processes raw CloudTrail events natively
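
For reference, an SNS topic policy statement that lets EventBridge publish generally has the shape sketched below. The topic ARN and `Sid` are hypothetical, and this is not the customer's actual policy.

```python
import json


def eventbridge_publish_statement(topic_arn: str) -> dict:
    """Build a policy statement allowing EventBridge to publish to a topic."""
    return {
        "Sid": "AllowEventBridgePublish",  # hypothetical statement ID
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sns:Publish",
        "Resource": topic_arn,
    }


# Hypothetical topic ARN, for illustration only.
topic_arn = "arn:aws:sns:us-east-1:123456789012:example-topic"
policy = {"Version": "2012-10-17", "Statement": [eventbridge_publish_statement(topic_arn)]}
policy_json = json.dumps(policy, indent=2)
```

When editing a live topic, the new statement should be merged into the existing policy rather than replacing it, since other statements there may grant access the pipeline depends on.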

Test plan created for quilt-staging environment:
- Bucket: aneesh-test-service (us-east-1)
- Tests EventBridge → SNS → SQS without Input Transformer
- Verifies SNS policy is the critical configuration
- Captures actual event format for documentation

Ready to execute test to confirm findings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
## Summary
Customer's S3 events weren't reaching Quilt indexer via EventBridge.
Root cause: EventBridge rule 'cloudtrail-to-sns' was disabled.

## Solution
Enabled the rule with one command:
```bash
aws events enable-rule --name cloudtrail-to-sns --region us-east-1
```

## Test Results
✅ EventBridge rule triggered: 1 event
✅ SNS published: 1 message
✅ SQS received and processed successfully

## Key Findings
- CloudTrail→EventBridge integration is automatic with event selectors
- Infrastructure was already correctly configured
- Always check rule states before investigating complex issues
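
The "check rule states first" step can be scripted. The sketch below is a pure function over the response shape of boto3's `describe_rule`; the helper name is hypothetical, and the prefix check assumes every enabled state starts with `ENABLED`.

```python
def rule_is_enabled(describe_rule_response: dict) -> bool:
    """Return True if the EventBridge rule is in an enabled state."""
    # A missing "State" is treated as not enabled.
    return describe_rule_response.get("State", "").startswith("ENABLED")


# Hypothetical responses, trimmed to the fields used here.
disabled_rule = {"Name": "cloudtrail-to-sns", "State": "DISABLED"}
enabled_rule = {"Name": "cloudtrail-to-sns", "State": "ENABLED"}

needs_enabling = not rule_is_enabled(disabled_rule)
```

In practice the response would come from `boto3.client("events").describe_rule(Name="cloudtrail-to-sns")`.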

## Changes
- Add SUCCESS-REPORT.md with complete resolution details
- Add config-quilt-eventbridge-test.toml (working configuration)
- Reorganize folder: backup-policies/, test-artifacts/, obsolete-reports/
- Rewrite README.md for concise, standalone reference
- Archive superseded investigation documents

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Files renamed to show investigation progression:
01-customer-issue-summary.md (2025-12-29 11:00)
02-local-test-setup.md (2025-12-29 11:01)
03-test-plan-staging.md (2025-12-29 11:08)
04-config-quilt-eventbridge-test.toml (2025-12-29 12:28)
05-ACTION-ITEMS.md (2025-12-29 12:43)
06-SUCCESS-REPORT.md (2025-12-29 13:03) - Initial fix (enabled rule)
07-README.md (2025-12-29 21:18) - Complete fix summary
08-FAILURE_REPORT.md (2025-12-29 22:17) - Deep dive into Lambda issue
09-documented-steps.md (2025-12-29 22:18) - Public documentation

Chronological order shows:
1. Customer report & initial investigation
2. Test planning & execution
3. First success (enabling EventBridge rule)
4. Discovery of deeper Lambda compatibility issue
5. Documentation of all findings
Major updates:
- 07-README.md: Remove false 'RESOLVED' status, clarify 3-layer problem
- 10-input-transformer-hypothesis.md: Complete analysis of Input Transformers

Key findings:
- Infrastructure fixes complete (EventBridge + SNS subscriptions)
- Application issue identified: ManifestIndexer lacks SNS unwrapping
- Input Transformers transform BEFORE SNS wrapping (insufficient alone)
- Lambda code fix required in Platform 1.66+

Lessons learned:
1. Metrics ≠ end-to-end success (intermediate success, final failure)
2. Input Transformers cannot eliminate SNS wrapping layer
3. Test with real workflows (package creation), not synthetic events
4. Two event sources caused flaky testing (S3 direct + EventBridge)
5. All Lambdas need consistent SNS message handling

Documentation includes:
- Complete Lambda code audit (4 Lambdas analyzed)
- Rigorous testing strategy (5 tests with isolation requirements)
- S3 Event Notification management (when/how to disable)
- Version-specific behavior (≤1.65 vs ≥1.66)
- Production deployment guidance with rollback plans
@drernie changed the title from "Resolve EventBridge routing issue - Enable disabled rule" to "EventBridge routing investigation: Infrastructure fixed, Lambda code issue identified" on Dec 30, 2025
Comprehensive documentation of EventBridge → SNS → SQS → Lambda flow,
replacing hypothesis-focused analysis with architectural guide.

Key content:
- Message transformation chain and SNS wrapping behavior
- Lambda processing patterns from code review (4 Lambda types)
- Two event sources problem and testing isolation requirements
- Three solution approaches with version-specific guidance
- Complete testing strategy with common mistakes to avoid
- S3 Event Notification management and production deployment

Core findings:
- ManifestIndexer (≤1.65) crashes due to missing SNS unwrapping
- Input Transformers help SearchHandler but don't solve SNS issue
- Testing requires isolating event sources to avoid false positives

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@drernie changed the title from "EventBridge routing investigation: Infrastructure fixed, Lambda code issue identified" to "Document EventBridge routing architecture and Lambda message processing" on Dec 30, 2025