Serverless, event-driven malware scanning for AWS S3 using ClamAV. Automatically scan files uploaded to S3 buckets, tag clean files, quarantine or delete infected files, and maintain an audit trail.
- Automated Scanning: Automatically scan files when uploaded to S3 using event-driven architecture
- ClamAV Integration: Industry-standard open-source antivirus engine
- Efficient Definitions Management: ClamAV virus definitions cached in S3 and updated automatically
- Flexible Handling: Tag clean files, quarantine or delete infected files
- Audit Trail: All scan results stored in DynamoDB for compliance and tracking
- Webhook Support: Optional webhook notifications for scan results
- Serverless: No servers to manage, scales automatically
- Cost Effective: Pay only for what you use with Lambda and on-demand DynamoDB
- Infrastructure as Code: Complete AWS CDK deployment
- Architecture
- Prerequisites
- Quick Start
- Configuration
- Deployment
- Operations
- How It Works
- Cost Considerations
- Security
- Contributing
- License
S3 (ObjectCreated) ──▶ SQS ──▶ Lambda(ClamAV) ──▶ DynamoDB (audit)
                                   │
                                   ├─▶ Tag clean object
                                   ├─▶ Delete infected
                                   └─▶ Copy to Quarantine (if provided)
                                   │
                                   └─▶ Optional Webhook
EventBridge (rate) ──▶ Updater Lambda ──▶ S3 (ClamAV DB cache)
                                          ▲ │
                                          └────── Scanner Lambda pulls definitions
- S3 Source Bucket: Your existing S3 bucket to be scanned
- SQS Queue: Receives S3 event notifications with dead-letter queue for failed messages
- Scanner Lambda: Docker container with ClamAV that scans files
- Definitions Cache: S3 bucket storing ClamAV virus definitions
- Updater Lambda: Periodically updates virus definitions using freshclam
- DynamoDB Table: Stores audit records of all scans
- Quarantine Bucket (optional): Stores infected files for investigation
- Webhook (optional): Receives POST notifications of scan results
- AWS Account with appropriate permissions
- Node.js 18.x or later
- AWS CDK 2.x (`npm install -g aws-cdk`)
- Docker installed and running (for building Lambda container images)
- AWS CLI configured with credentials
The deploying user/role needs permissions to:
- Create/manage Lambda functions
- Create/manage S3 buckets and notifications
- Create/manage SQS queues
- Create/manage DynamoDB tables
- Create/manage IAM roles and policies
- Create/manage EventBridge rules
git clone https://github.com/yourusername/s3_virus_scanner.git
cd s3_virus_scanner
npm ci
cp env.example .env
Edit .env and set your configuration:
# Required
SOURCE_BUCKET_ARN=arn:aws:s3:::your-bucket-name
# Optional
QUARANTINE_BUCKET_ARN=arn:aws:s3:::your-quarantine-bucket-name
WEBHOOK_URL=https://your-webhook-endpoint.com/scan-results
ENVIRONMENT=prod
cdk bootstrap
cdk deploy
Review the changes and confirm the deployment.
| Variable | Description | Example |
|---|---|---|
| `SOURCE_BUCKET_ARN` | ARN of the S3 bucket to scan | `arn:aws:s3:::my-uploads-bucket` |
| Variable | Description | Default | Example |
|---|---|---|---|
| `QUARANTINE_BUCKET_ARN` | ARN of bucket for infected files | - | `arn:aws:s3:::my-quarantine-bucket` |
| `WEBHOOK_URL` | HTTPS URL for scan result notifications | - | `https://api.example.com/webhook` |
| `ENVIRONMENT` | Environment name for resource naming | `dev` | `prod`, `staging` |
| `ACCOUNT_ID` | AWS Account ID | CDK default | `123456789012` |
| `REGION` | AWS Region | CDK default | `us-east-1` |
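As a rough illustration of how the variables above might be read and validated, here is a minimal TypeScript sketch. The helper names `bucketNameFromArn` and `loadConfig` are hypothetical and not taken from this project. The sketch also assumes ARNs in the standard `aws` partition.

```typescript
// Hypothetical helper: extracts the bucket name from an S3 bucket ARN
// such as the SOURCE_BUCKET_ARN value described above.
// Assumption: the ARN uses the standard "aws" partition.
export function bucketNameFromArn(arn: string): string {
  const prefix = "arn:aws:s3:::";
  if (!arn.startsWith(prefix)) {
    throw new Error(`Not an S3 bucket ARN: ${arn}`);
  }
  const name = arn.slice(prefix.length);
  // A bucket ARN has no "/" segment; an object ARN would.
  if (name.length === 0 || name.includes("/")) {
    throw new Error(`Expected a bucket ARN, got: ${arn}`);
  }
  return name;
}

// Hypothetical config loader: reads the variables listed in the tables above
// and applies the documented default for ENVIRONMENT.
export function loadConfig(env: Record<string, string | undefined>) {
  const sourceArn = env.SOURCE_BUCKET_ARN;
  if (!sourceArn) {
    throw new Error("SOURCE_BUCKET_ARN is required");
  }
  return {
    sourceBucket: bucketNameFromArn(sourceArn),
    quarantineBucket: env.QUARANTINE_BUCKET_ARN
      ? bucketNameFromArn(env.QUARANTINE_BUCKET_ARN)
      : undefined,
    webhookUrl: env.WEBHOOK_URL,
    environment: env.ENVIRONMENT ?? "dev",
  };
}
```

Calling `loadConfig(process.env)` early in a deployment script would surface a missing or malformed ARN before any resources are synthesized.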
When configured, the webhook uses the Standard Webhooks envelope:
{
  "id": "28fb6e8f-8841-42db-98f1-6e6e1a2db468",
  "source": "s3-virus-scanner",
  "event": "clamav.scan.clean",
  "created_at": "2025-10-15T12:34:56.789Z",
  "data": {
    "id": "28fb6e8f-8841-42db-98f1-6e6e1a2db468",
    "bucket": "example-input-bucket",
    "key": "uploads/report.pdf",
    "scannedAt": "2025-10-15T12:34:56.789Z",
    "status": "clean",
    "signature": "" // malware signature (present only when status === "infected")
  }
}

Each delivery includes Standard Webhooks headers:
| Header | Description |
|---|---|
| `Webhook-Id` | Unique identifier for the delivery (matches `data.id`) |
| `Webhook-Source` | Source identifier (defaults to `s3-virus-scanner`, configurable via `WEBHOOK_SOURCE`) |
| `Webhook-Event` | Event name (`clamav.scan.clean`, `clamav.scan.infected`, or `clamav.scan.error`) |
| `Webhook-Timestamp` | UNIX epoch (seconds) when the message was generated |
| `Webhook-Version` | Spec version (`1`) |
| `Webhook-Signature` | HMAC-SHA256 signature `v1=<hex>` computed with `WEBHOOK_SECRET` over `<timestamp>.<body>` |
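To make the `Webhook-Signature` row concrete, the following sketch computes a `v1=<hex>` value over `<timestamp>.<body>` with Node's built-in `crypto` module. It follows the table's description only. The function names are hypothetical, and using the secret as a raw UTF-8 string is an assumption, so check the scanner's actual encoding before relying on it.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Computes the `v1=<hex>` value described in the table above:
// HMAC-SHA256 over "<timestamp>.<body>" keyed with the shared secret.
// Assumption: the secret is used as a raw UTF-8 string.
export function signPayload(secret: string, timestamp: string, body: string): string {
  const digest = createHmac("sha256", secret)
    .update(`${timestamp}.${body}`)
    .digest("hex");
  return `v1=${digest}`;
}

// Compares a received Webhook-Signature header against the expected value
// in constant time to avoid leaking information through timing.
export function signatureMatches(
  secret: string,
  timestamp: string,
  body: string,
  received: string,
): boolean {
  const expected = Buffer.from(signPayload(secret, timestamp, body), "utf8");
  const actual = Buffer.from(received, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

The body passed in must be the exact bytes received on the wire. Re-serializing parsed JSON can change whitespace or key order and invalidate the signature.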
You can use the official standardwebhooks JavaScript library to verify requests:
import { Webhook } from 'standardwebhooks';

// WEBHOOK_SECRET is the same shared secret configured for the scanner Lambda.
// The Standard Webhooks library expects the secret to be base64-encoded.
const base64Secret = Buffer.from(process.env.WEBHOOK_SECRET!, 'utf8').toString('base64');
const wh = new Webhook(base64Secret);

export function verify(requestBody: string, headers: Record<string, string>) {
  try {
    // verify throws if the signature or timestamp is invalid
    const event = wh.verify(requestBody, headers);
    // event contains { id, source, event, created_at, data }
    return event;
  } catch (err) {
    throw new Error('Invalid webhook signature');
  }
}

`requestBody` must be the raw string payload (before JSON parsing) and `headers` must include the Standard Webhooks headers (`Webhook-Id`, `Webhook-Source`, `Webhook-Timestamp`, `Webhook-Event`, `Webhook-Version`, `Webhook-Signature`).
Reject requests when:
- `Webhook-Timestamp` is outside an acceptable skew (e.g., ±5 minutes)
- `Webhook-Source` is not the expected value
- `Webhook-Signature` verification fails
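The timestamp and source checks above could look like the sketch below. The function names, the lowercase header keys, and the 300-second default tolerance are illustrative assumptions. Signature verification is handled separately.

```typescript
// Returns true when the Webhook-Timestamp value (UNIX epoch seconds) is
// within the allowed skew of the receiver's clock.
export function isTimestampFresh(
  headerValue: string,
  nowSeconds: number,
  toleranceSeconds = 300, // assumed tolerance: ±5 minutes
): boolean {
  if (!/^\d+$/.test(headerValue)) {
    return false; // missing or non-numeric timestamps are rejected
  }
  return Math.abs(nowSeconds - Number(headerValue)) <= toleranceSeconds;
}

// Returns a rejection reason, or null when the timestamp and source checks pass.
// Assumption: header names have already been lowercased by the HTTP framework.
export function rejectionReason(
  headers: Record<string, string | undefined>,
  nowSeconds: number,
  expectedSource = "s3-virus-scanner",
): string | null {
  if (!isTimestampFresh(headers["webhook-timestamp"] ?? "", nowSeconds)) {
    return "timestamp outside acceptable skew";
  }
  if (headers["webhook-source"] !== expectedSource) {
    return "unexpected source";
  }
  return null;
}
```

A receiver would typically call `rejectionReason(headers, Math.floor(Date.now() / 1000))` and respond with a 4xx status when the result is not `null`.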
# Install Dependencies
npm install
# Run Unit Tests (uses standard-webhooks spec-compliant headers)
npm run test
# Synthesize CloudFormation template
cdk synth
# Compare deployed stack with current state
cdk diff
# Deploy to AWS
cdk deploy
# Destroy the stack
cdk destroy
- First deployment takes longer due to Docker image builds
- Lambda containers are automatically built and pushed to ECR
- S3 notifications are configured automatically using custom resources
- Virus definitions are downloaded on first scanner invocation
- Scanner Memory: 3 GB (supports large files)
- Scanner Timeout: 15 minutes
- Scanner Ephemeral Storage: 2 GB (for virus definitions)
- Updater Memory: 2 GB
- Updater Timeout: 15 minutes
- Updater Schedule: Every 4 hours
- Scanner: Downloading virus definitions from S3 (or fallback to freshclam)
- Updater: Downloading virus definitions from ClamAV mirrors
Options:
- Deploy Lambda in public subnet (not recommended for production)
- Deploy Lambda in private subnet with NAT Gateway (recommended)
- Use S3 definitions cache to minimize internet dependency
Lambda functions log to CloudWatch Logs:
- Scanner: `/aws/lambda/clamav-scanner-{environment}-{account}-{region}`
- Updater: `/aws/lambda/clamav-defs-updater-{environment}-{account}-{region}`
Failed messages are sent to the DLQ after 5 retry attempts. Investigate DLQ messages for:
- Permission errors
- Timeout issues
- freshclam failures
- File size exceeding limits
Query the audit table to:
- View scan history
- Track infection rates
- Produce compliance reports
- Original file remains in source bucket
- Tags added: `scan-status=clean`, `scannedAt`, `engine=ClamAV`
- With quarantine bucket: File copied to quarantine, then deleted from source
- Without quarantine bucket: File deleted from source
- Not tagged (file is removed)
- File remains in source bucket
- Message retried (up to 5 times)
- Eventually moved to DLQ
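Reading the three lists above as the clean, infected, and scan-error outcomes, the handling rules can be summarized as a small decision function. This is an illustrative sketch. The `planActions` name and the action labels are hypothetical and are not taken from the scanner code.

```typescript
type ScanStatus = "clean" | "infected" | "error";
type Action =
  | "tag-object"
  | "copy-to-quarantine"
  | "delete-from-source"
  | "retry-message";

// Maps a scan result to the ordered list of follow-up actions described above.
export function planActions(status: ScanStatus, hasQuarantineBucket: boolean): Action[] {
  switch (status) {
    case "clean":
      // Clean files stay in place and only receive tags.
      return ["tag-object"];
    case "infected":
      // Copy first, then delete, so the file is never lost before quarantine.
      return hasQuarantineBucket
        ? ["copy-to-quarantine", "delete-from-source"]
        : ["delete-from-source"];
    case "error":
      // The file is left untouched; SQS redelivers the message until the
      // retry limit is reached, after which it lands in the DLQ.
      return ["retry-message"];
  }
}
```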
- Max file size: 2 GB (configurable in scanner code)
- S3 tags limit: 10 tags per object (ensure you have available tag slots)
- Lambda ephemeral storage: 2 GB (for virus definitions)
- SQS message retention: 4 days
- DLQ retention: 14 days
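Because of the 10-tag limit, it can help to check for free tag slots before writing scan tags. The sketch below merges the scan tags into an object's existing tag set and fails if the result would exceed the limit. `mergeScanTags` is a hypothetical helper, and the tag keys are the ones this README lists for clean files.

```typescript
const S3_MAX_TAGS_PER_OBJECT = 10;

// Merges scan tags into an object's existing tags. Keys that already exist
// are overwritten rather than counted twice.
export function mergeScanTags(
  existing: Record<string, string>,
  scanTags: Record<string, string>,
): Record<string, string> {
  const merged = { ...existing, ...scanTags };
  const count = Object.keys(merged).length;
  if (count > S3_MAX_TAGS_PER_OBJECT) {
    throw new Error(
      `Object would have ${count} tags; S3 allows at most ${S3_MAX_TAGS_PER_OBJECT}`,
    );
  }
  return merged;
}

// Example scan tags, using the keys documented for clean files.
export const exampleScanTags: Record<string, string> = {
  "scan-status": "clean",
  scannedAt: "2025-10-15T12:34:56.789Z",
  engine: "ClamAV",
};
```

Merging matters in practice because writing an object's tag set replaces all of its existing tags, so the existing tags have to be read and carried over.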
- File uploaded to S3 bucket triggers S3 Event Notification
- S3 sends message to SQS queue
- Lambda function triggered by SQS message
- Lambda downloads file to `/tmp`
- Lambda scans file with ClamAV using local virus definitions
- Based on scan result:
- Clean: Add tags to S3 object
- Infected: Quarantine/delete object
- Error: Retry (or send to DLQ)
- Write audit record to DynamoDB
- Send webhook notification (if configured)
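As an illustration of the first steps in this flow, the sketch below pulls bucket and key pairs out of an SQS message body that wraps an S3 event notification. The `extractObjects` name is hypothetical. The `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the S3 notification format, in which object keys arrive URL-encoded with `+` standing in for spaces.

```typescript
interface S3ObjectRef {
  bucket: string;
  key: string;
}

// Minimal shape of the S3 notification fields this sketch reads.
interface S3Notification {
  Records?: Array<{
    s3?: { bucket?: { name?: string }; object?: { key?: string } };
  }>;
}

// Parses one SQS message body and returns the objects it refers to.
export function extractObjects(sqsBody: string): S3ObjectRef[] {
  const parsed = JSON.parse(sqsBody) as S3Notification;
  const refs: S3ObjectRef[] = [];
  // Messages without Records (for example S3 test events) yield no objects.
  for (const record of parsed.Records ?? []) {
    const bucket = record.s3?.bucket?.name;
    const rawKey = record.s3?.object?.key;
    if (!bucket || !rawKey) {
      continue;
    }
    // Decode the URL-encoded key; "+" represents a space.
    refs.push({ bucket, key: decodeURIComponent(rawKey.replace(/\+/g, " ")) });
  }
  return refs;
}
```

Skipping the decode step is a common source of "NoSuchKey" errors for files whose names contain spaces or special characters.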
- Updater Lambda runs every 4 hours via EventBridge Schedule
- Downloads latest virus definitions using `freshclam`
- Uploads definitions (CVD files) to S3 definitions cache
- Scanner Lambda downloads definitions from S3 cache on cold start
- Falls back to `freshclam` if cache is unavailable (requires internet)
Estimated monthly costs for scanning 10,000 files/month (average 5 MB each):
| Service | Cost | Notes |
|---|---|---|
| Lambda (Scanner) | ~$15 | 3 GB memory, ~30s per scan |
| Lambda (Updater) | <$1 | Runs every 4 hours |
| S3 (Definitions) | <$1 | ~150 MB storage |
| DynamoDB | ~$1.25 | On-demand, 10K writes/month |
| SQS | <$1 | 10K messages |
| Data Transfer | Variable | Depends on file downloads |
Total: ~$18-25/month for 10,000 file scans
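The scanner line item dominates the estimate, and its arithmetic is easy to reproduce. The sketch below assumes a Lambda compute rate of $0.0000166667 per GB-second, which is an assumption to check against current AWS pricing for your region and architecture. It ignores per-request charges and any free tier.

```typescript
// Assumed Lambda compute rate in USD per GB-second; verify against current pricing.
const ASSUMED_USD_PER_GB_SECOND = 0.0000166667;

// Estimates monthly Lambda compute cost as scans x duration x memory x rate.
export function scannerComputeCost(
  scansPerMonth: number,
  secondsPerScan: number,
  memoryGb: number,
  usdPerGbSecond: number = ASSUMED_USD_PER_GB_SECOND,
): number {
  return scansPerMonth * secondsPerScan * memoryGb * usdPerGbSecond;
}

// 10,000 scans x 30 s x 3 GB = 900,000 GB-seconds, roughly $15 at the assumed rate.
const monthlyEstimate = scannerComputeCost(10000, 30, 3);
console.log(`Estimated scanner compute cost: $${monthlyEstimate.toFixed(2)}`);
```

Plugging in your own scan volume, average duration, and memory setting gives a quick sense of how the total scales.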
💡 Tips to reduce costs:
- Use S3 definitions cache to avoid freshclam on every cold start
- Adjust Lambda memory based on your file sizes
- Use S3 Lifecycle policies on quarantine bucket
- Consider provisioned capacity for DynamoDB at higher volumes
- Use quarantine bucket to investigate infected files safely
- Enable S3 bucket encryption (enabled by default for definitions bucket)
- Restrict IAM permissions to least privilege
- Enable AWS CloudTrail for audit logging
- Review CloudWatch Logs regularly
- Monitor DLQ for persistent failures
- Update virus definitions frequently (configured to run every 4 hours)
- Use VPC endpoints for S3 access from Lambda (optional)
- Enable versioning on source bucket for recovery
Please see SECURITY.md for information on reporting security vulnerabilities.
We welcome contributions! Please see CONTRIBUTING.md for details on:
- How to submit issues
- How to submit pull requests
- Code style guidelines
- Development setup
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with ❤️ by the open source community