Infrastructure
Sigil Bot
An autonomous scanner that monitors PyPI, npm, ClawHub, and GitHub for new and updated packages, scans them with all eight Sigil phases, and publishes results to the public scan database. Runs 24/7 — no human input required.
What Sigil Bot does
Sigil Bot watches public package registries for newly published and updated packages. When a new package appears, the bot downloads it into quarantine, runs all eight scan phases, stores the results, and publishes a report page at sigilsec.ai/scans.
┌──────────────────────┐
│ SIGIL BOT │
│ │
│ Monitors registries │
│ Downloads packages │
│ Runs Sigil scans │
│ Stores results │
└──────────┬───────────┘
│
┌────────────────┼────────────────┐
│ │ │
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Scan DB │ │ Badges │ │ Threat │
│ /scans/* │ │ /badge/* │ │ Feed │
│ pages │ │ SVGs │ │ RSS + API │
└────────────┘ └────────────┘ └────────────┘
Public scan database
Every scanned package gets a report page. Each page is an SEO surface that AI models and search engines can cite.
Real-time threat feed
New scans published as they happen via RSS feed, API endpoint, and alerts for HIGH RISK and CRITICAL RISK findings.
Badge generation
Automatically generates and caches SVG badges for every scanned package. Badges update when packages are rescanned.
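As a rough sketch, a cached badge can be produced from a plain SVG string template. The colors, dimensions, and `render_badge` helper below are illustrative assumptions, not Sigil's actual badge design:

```python
# Hypothetical sketch: render a flat SVG badge for a scan verdict.
# Verdict names match the feed filters; colors are assumptions.
VERDICT_COLORS = {
    "low_risk": "#4c1",          # green
    "medium_risk": "#dfb317",    # yellow
    "high_risk": "#fe7d37",      # orange
    "critical_risk": "#e05d44",  # red
}

def render_badge(verdict: str) -> str:
    label = "sigil"
    color = VERDICT_COLORS.get(verdict, "#9f9f9f")  # grey for unknown
    text = verdict.replace("_", " ").upper()
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" height="20" width="140">'
        f'<rect width="50" height="20" fill="#555"/>'
        f'<rect x="50" width="90" height="20" fill="{color}"/>'
        f'<text x="25" y="14" fill="#fff" text-anchor="middle" '
        f'font-family="Verdana" font-size="11">{label}</text>'
        f'<text x="95" y="14" fill="#fff" text-anchor="middle" '
        f'font-family="Verdana" font-size="11">{text}</text>'
        f'</svg>'
    )
```

The rendered string can be cached as-is and invalidated whenever the package is rescanned.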
Downstream integrations
The GitHub App, MCP server, and CLI threat intelligence all consume scan data produced by the bot.
Monitored registries
Four registries are monitored continuously. Each has a dedicated watcher process with optimised polling for that registry's API.
PyPI
Polls every 5 min. Watches the RSS feeds for new packages and version updates, plus the changelog serial API for incremental event tracking. Packages are downloaded via pip download --no-deps — no code is installed or executed.
| Feed | pypi.org/rss/packages.xml + changelog serial API |
| Scope | AI ecosystem packages (langchain, openai, anthropic, mcp, agent, etc.) |
| Volume | ~200–400 relevant packages/day |
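A minimal sketch of consuming the packages RSS feed with the standard library. The item-title format ("name added to PyPI") is an assumption based on the public feed, and the sample document is fabricated for illustration:

```python
import xml.etree.ElementTree as ET

# Fabricated sample of pypi.org/rss/packages.xml for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>langchain-widgets added to PyPI</title>
    <link>https://pypi.org/project/langchain-widgets/</link></item>
</channel></rss>"""

def new_packages(rss_xml: str) -> list[str]:
    """Extract package names from feed item titles."""
    root = ET.fromstring(rss_xml)
    names = []
    for item in root.iter("item"):
        title = item.findtext("title", "")
        # Assumed title format: "<name> added to PyPI".
        names.append(title.split()[0])
    return names
```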
npm
Polls every 60 sec. Follows the CouchDB _changes stream from the npm replication endpoint. Packages in @langchain/*, @anthropic/*, @openai/*, and @modelcontextprotocol/* scopes are scanned regardless of keyword matches.
| Feed | replicate.npmjs.com/registry/_changes |
| Scope | AI ecosystem packages + all MCP-related scopes |
| Volume | ~300–600 relevant packages/day |
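The scan decision for a _changes event might look like the following sketch. `should_scan`, the abbreviated keyword list, and the minimal event shape are illustrative assumptions (CouchDB _changes rows carry the document name in their `id` field):

```python
import json

# Scoped packages in these namespaces bypass the keyword filter.
ALWAYS_SCAN_SCOPES = ("@langchain/", "@anthropic/", "@openai/",
                      "@modelcontextprotocol/")
AI_KEYWORDS = ("langchain", "openai", "mcp", "agent")  # abbreviated list

def should_scan(change_line: str) -> bool:
    change = json.loads(change_line)
    name = change.get("id", "")
    if name.startswith(ALWAYS_SCAN_SCOPES):
        return True
    return any(k in name for k in AI_KEYWORDS)
```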
ClawHub
Polls every 6 hours. Pages through the REST API sorted by update time. All skills are scanned — no keyword filtering needed. The entire registry is relevant because every skill has direct access to the user's environment.
| Feed | clawhub.ai/api/v1/skills?sort=updated |
| Scope | All skills (no filtering) |
| Volume | ~50–100 new/updated skills per day |
GitHub (MCP Servers)
Sweeps every 12 hours. Queries the GitHub Search API for repositories matching MCP server patterns, plus the Events API for push events to known repos between sweeps. Repositories are cloned with git clone --depth 1 into quarantine.
| Feed | api.github.com/search/repositories + /events |
| Scope | MCP server repos (>0 stars or >1 commit) |
| Volume | ~20–50 new/updated repos per day |
Scan pipeline
Every scan follows the same five-stage pipeline: watch, queue, scan, store, publish.
WATCHER ──▶ QUEUE ──▶ SCANNER ──▶ STORE ──▶ PUBLISHER

Poll feeds    Redis      Download     Postgres   Report page
Deduplicate   Priority   Extract      Findings   Badge cache
Filter        Retry      Sigil scan   Metadata   RSS feed
Enqueue       Backoff    All phases              Alerts
Deduplication
Key: {ecosystem}:{name}:{version}:{content_hash}. If the exact same content has been scanned, it's skipped. If the version is the same but the content hash differs (re-upload), it's rescanned.
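A sketch of the key construction and the skip/rescan decision. The hash algorithm (truncated SHA-256) is an assumption; the source only specifies a content hash:

```python
import hashlib

def dedup_key(ecosystem: str, name: str, version: str, content: bytes) -> str:
    # Truncated SHA-256 as the content hash is an assumption.
    content_hash = hashlib.sha256(content).hexdigest()[:16]
    return f"{ecosystem}:{name}:{version}:{content_hash}"

def action(seen: set, key: str) -> str:
    # Exact same content already scanned: skip. A re-upload of the same
    # version produces a different content hash and thus a new key: rescan.
    if key in seen:
        return "skip"
    seen.add(key)
    return "scan"
```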
Priority levels
| Priority | SLA | Criteria |
|---|---|---|
| critical | Immediate | Typosquatting patterns, suspicious new publisher names |
| high | 5 min | MCP scopes, ClawHub skills, popular packages with new versions |
| normal | 30 min | Everything else matching AI keyword filters |
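The table above could be implemented as a simple classifier. The function name and flag parameters are hypothetical; the criteria follow the table:

```python
def priority(ecosystem: str, name: str, *, typosquat: bool = False,
             suspicious_publisher: bool = False,
             popular_update: bool = False) -> str:
    # Typosquats and suspicious publishers jump the queue entirely.
    if typosquat or suspicious_publisher:
        return "critical"
    # MCP scopes, ClawHub skills, and popular-package updates get 5-min SLA.
    if (ecosystem == "clawhub"
            or name.startswith("@modelcontextprotocol/")
            or popular_update):
        return "high"
    # Everything else matching the AI keyword filters.
    return "normal"
```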
Scan isolation
Each scan runs in a fresh temporary directory. No network access during the scan — Sigil is static analysis only. No code is installed or executed. The quarantine directory is destroyed after scanning.
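A minimal sketch of this isolation pattern using Python's tempfile module; `scan_in_quarantine` and the archive handling are illustrative:

```python
import pathlib
import tempfile

def scan_in_quarantine(archive: bytes, scan) -> dict:
    # The temporary directory (the quarantine) is removed when the
    # with-block exits, even if scan() raises.
    with tempfile.TemporaryDirectory(prefix="sigil-quarantine-") as qdir:
        target = pathlib.Path(qdir) / "package.tar.gz"
        target.write_bytes(archive)  # written to disk, never executed
        return scan(target)          # static analysis only, no network
```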
Typosquatting detection
New packages with names within edit distance 2 of popular AI packages are automatically boosted to critical priority. This catches common squatting patterns before developers encounter them.
Target packages monitored for typosquats:
langchain, openai, anthropic, transformers,
huggingface, crewai, autogen, llamaindex,
pinecone, chromadb, fastapi, streamlit
Detection patterns:
Character substitution: langch4in, openal
Character insertion: langchainn, openaai
Character deletion: langchai, opena
Transposition: langchian, openia
Flagged packages receive an additional finding in the Provenance phase noting the name similarity.
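The check can be sketched with plain Levenshtein distance (note a transposition counts as 2 under plain Levenshtein, which still falls within the threshold); the target list is abbreviated:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

TARGETS = ["langchain", "openai", "anthropic", "transformers"]  # abbreviated

def typosquat_of(name: str):
    """Return the monitored package this name squats on, if any."""
    for target in TARGETS:
        if name != target and edit_distance(name, target) <= 2:
            return target
    return None
```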
Threat feed
Scan results are published to multiple output channels for downstream consumption.
RSS feed
Standard RSS 2.0 feed at sigilsec.ai/feed.xml. Contains the latest 100 scan results. Supports filtered variants:
All scans: sigilsec.ai/feed.xml
Threats only: sigilsec.ai/feed.xml?verdict=high_risk,critical_risk
ClawHub only: sigilsec.ai/feed.xml?ecosystem=clawhub
PyPI only: sigilsec.ai/feed.xml?ecosystem=pypi
npm only: sigilsec.ai/feed.xml?ecosystem=npm
API endpoint
GET /api/v1/feed?ecosystem={eco}&verdict={v}&limit={n}&since={iso_datetime}
Returns a JSON array of recent scans, with the same filtering as the RSS feed. This is what the MCP server queries, the GitHub App looks up, and third-party integrations consume.
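A small helper for building feed queries; the parameter names come from the endpoint signature above, while the helper itself is hypothetical:

```python
from urllib.parse import urlencode

API = "https://sigilsec.ai/api/v1/feed"

def feed_url(**params) -> str:
    # Drop unset parameters so the query carries only explicit filters.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{API}?{query}" if query else API
```

For example, `feed_url(ecosystem="clawhub", verdict="critical_risk", limit=20)` builds the query for the twenty most recent critical ClawHub findings.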
Alerts
HIGH RISK and CRITICAL RISK findings trigger alerts to subscribed webhook endpoints. Only findings with a risk score of 25 or above generate alerts.
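The alert gate reduces to a two-condition check; the function name and constant names are illustrative:

```python
ALERT_VERDICTS = {"high_risk", "critical_risk"}
ALERT_THRESHOLD = 25  # minimum risk score, per the rule above

def should_alert(verdict: str, risk_score: int) -> bool:
    return verdict in ALERT_VERDICTS and risk_score >= ALERT_THRESHOLD
```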
Scan attestations
Every scan produced by Sigil Bot is cryptographically signed and recorded in a public transparency log. This lets anyone verify that a scan result is genuine and untampered.
Ed25519 signatures
Each scan is wrapped in a DSSE envelope and signed with an Ed25519 key. The public key is published at /.well-known/sigil-verify.json.
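DSSE signatures cover a pre-authentication encoding (PAE) of the payload rather than the raw bytes. The encoding below follows the public DSSE v1 specification; key management and the actual Ed25519 signing step are omitted:

```python
def pae(payload_type: str, payload: bytes) -> bytes:
    # PAE(type, body) = "DSSEv1" SP LEN(type) SP type SP LEN(body) SP body
    # where LEN is the decimal byte length. This is what gets signed.
    t = payload_type.encode()
    return b"DSSEv1 %d %s %d %s" % (len(t), t, len(payload), payload)
```

Verifiers recompute this encoding from the envelope's payloadType and payload, then check the Ed25519 signature against the key published at /.well-known/sigil-verify.json.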
in-toto attestations
Attestations follow the in-toto Statement v1 format with a custom predicate type for Sigil scan results.
Transparency log
Signed attestations are recorded in the Sigstore Rekor transparency log. Each scan report links to its log entry.
Verification API
Verify any scan via GET /api/v1/verify?scan_id=... or fetch the raw attestation from GET /api/v1/attestation/{id}.
For full verification steps, public keys, and SDK usage, see the Attestation docs.
AI ecosystem filtering
The bot doesn't scan every package on PyPI and npm — it targets the AI agent supply chain. Packages are matched if their name, description, or keywords contain any of these terms:
Frameworks: langchain, crewai, autogen, llamaindex, haystack, dspy
LLM providers: openai, anthropic, cohere, mistral, groq, together
MCP / agents: mcp, model-context-protocol, agentic, tool-use
RAG: rag, retrieval, vector, embedding, pinecone, chroma
ML: transformers, huggingface, diffusers, torch, tensorflow
Skills: skill, plugin, extension, chatgpt-plugin, copilot-extension
Expected volume
| Registry | Scans/day | Avg time | Compute |
|---|---|---|---|
| PyPI (AI-filtered) | 200–400 | ~5 sec | ~30 min |
| npm (AI-filtered) | 300–600 | ~5 sec | ~50 min |
| ClawHub | 50–100 | ~3 sec | ~5 min |
| GitHub MCP | 20–50 | ~8 sec | ~7 min |
| Total | 570–1,150 | — | ~90 min |
Bot identity
The bot operates under a dedicated sigil-bot account, separate from NOMARK staff activity. Automated outputs are clearly labeled as automated.
- GitHub: The GitHub App acts as sigil-bot[bot]
- Scan database: Report pages show “Scanned by Sigil Bot” with a timestamp
- Threat feed: RSS and API entries attributed to the bot identity
Dispute a result
Packages are scanned automatically from public registries without author consent. If you believe a scan result is incorrect, you can:
- Use the “Request a review” link on any scan report page
- Email security@sigilsec.ai directly
Disputes are acknowledged within 48 hours. See the full dispute process in our Terms of Service.
See also
- Scan Database — browse all published scan results
- Methodology — how scans work, eight phases, detection criteria
- API Reference — consume scan data programmatically
- Scan Attestations — verification steps, public keys, SDK usage
- Agent Discovery — A2A agent card, WebMCP, structured data for AI agents
- Terms of Service — automated scanning, badge usage, dispute process
Need help?
Ask a question in GitHub Discussions or check the troubleshooting guide.