How Bixel Reads the Web
The public reference for BixelBot. Site operators, anti-bot vendors, and our own engineering team share it as one source of truth about what Bixel does, what it won't do, and how to talk to us.
Operator
- Bot name: BixelBot
- Operator: Bixel, Inc. (United States)
- Reference URL: https://bixel.com/bot
- Contact: bot-abuse@bixel.com
- Category: Aggregators
- Verification: Web Bot Auth via /.well-known/http-message-signatures-directory. See Section 08.
Compliance
BixelBot is operated in compliance with Cloudflare’s Verified Bots Policy, Amazon’s Acceptable Use Policy, Google Cloud Platform Terms of Service, and Microsoft Azure’s Terms of Use. Where any of those policies tighten in the future, our behavior tightens with them.
Service purpose
Bixel is a data platform for the private company economy. BixelBot reads publicly accessible web pages and contributes structured observations to a continuously-updated, cross-company reference dataset.
The output is verified company data used by investors, analysts, and the companies themselves to bring clarity to private markets. Companies can claim their Bixel profiles, verify their data, and control how they are represented.
What BixelBot is not: a scraper built to advantage any single company against another. We do not crawl shopping carts, dashboards, transactional surfaces, or anything behind authentication. We are a documentation reader, not a discovery scanner.
What BixelBot reads
BixelBot reads pages that are accessible to anyone with a web browser. If a page is publicly served without authentication, a paywall, or an interactive challenge, we may read it. We do not sign in, fill in forms, click through CAPTCHAs, or otherwise interact with sites beyond a plain GET request.
We do not read:
- Anything behind authentication
- Login or sign-up walls
- CAPTCHA or interactive challenges
- Paywalled content
- Member dashboards or admin surfaces
- Shopping carts or checkout flows
- Transactional commerce surfaces
- Forum or comment threads identifying users
- Form submission or POST endpoints
- API endpoints not intended for public discovery
- Sensitive paths (see Section 04)
- Domains that have opted out
If a page requires a session or token, we stop.
Sensitive paths
BixelBot is a documentation crawler, not a discovery scanner. It does not request paths that resemble admin surfaces, internal tools, source repositories, configuration files, or credentials, even when robots.txt is silent on them. Examples of paths we never request:
- Admin and console paths: /admin, /wp-admin, /console, /dashboard.
- Configuration and version-control surfaces: /.env, /.git, /config.json, /backup.sql.
- Search and tracking endpoints generated dynamically per visitor.
- Anything robots.txt explicitly disallows, including paths a site owner has marked sensitive.
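A minimal sketch of how the example paths above could be screened before a request is made. The prefix list is copied from the examples in this section; the names `SENSITIVE_PREFIXES` and `is_sensitive_path` are hypothetical, and this is an illustration rather than BixelBot's actual filter.

```python
from urllib.parse import urlsplit

# Hypothetical denylist built only from the example paths listed above.
SENSITIVE_PREFIXES = (
    "/admin", "/wp-admin", "/console", "/dashboard",
    "/.env", "/.git", "/config.json", "/backup.sql",
)


def is_sensitive_path(url: str) -> bool:
    """Return True if the URL path equals a listed prefix or sits beneath it."""
    path = urlsplit(url).path.lower() or "/"
    for prefix in SENSITIVE_PREFIXES:
        # Match "/admin" and "/admin/users", but not "/administrators-guide".
        if path == prefix or path.startswith(prefix + "/"):
            return True
    return False
```

Matching on whole path segments rather than raw string prefixes keeps ordinary documentation pages such as /administrators-guide readable while still excluding everything under /admin.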
Crawling etiquette
- 05.1
We obey robots.txt.
We honor both Disallow directives and Crawl-delay hints. We apply User-agent: BixelBot rules first, then fall back to *. The file is cached per domain for 24 hours and refreshed on the next visit after expiration.
- 05.2
We rate-limit to one request per second per domain.
Default cap is 1 req/sec/domain with 300-900ms jitter added to every request. If your robots.txt sets a longer Crawl-delay, we honor that instead. See Section 06 for the full bandwidth posture.
- 05.3
We identify as BixelBot in every request.
No rotating identities. No User-Agent spoofing. No pretending to be a browser. You see us coming and you can block us in one line of robots.txt.
- 05.4
We solve zero interactive challenges.
If your site returns a Turnstile, hCaptcha, reCAPTCHA, or Arkose challenge, we log the block and stop. The line between legitimate crawling and evading intentional defenses is one we do not cross.
- 05.5
We back off when you tell us to.
HTTP 429 and 503 responses trigger an automatic backoff with exponential delay before any retry. Repeated 429s on a domain pause crawling there for at least an hour.
- 05.6
We follow publicly-discovered hostnames only.
When BixelBot encounters additional hostnames through public sources (links on a page, sitemaps, RSS feeds, publicly-announced subdomains like docs.*, blog.*, or status.*), it may make unauthenticated GET requests to read whatever the server returns publicly. We do not enumerate subdomains from wordlists, brute-force hostname patterns, scan for unannounced surfaces, POST credentials, attempt to authenticate, bypass an auth gate, or follow any link that requires a session.
We discover additional hostnames exclusively through public channels: links on pages we have already been permitted to read, published sitemaps, RSS feeds, and publicly-announced subdomains. We do not probe for hostnames that have not been publicly referenced.
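The backoff rule in 05.5 can be sketched as follows. Only the one-hour pause comes from this page; the base delay, the per-retry ceiling, and the number of 429s that counts as "repeated" are assumptions chosen for illustration, and `next_delay` is a hypothetical name.

```python
from typing import Optional

BASE_DELAY_S = 2.0        # assumed delay before the first retry
MAX_DELAY_S = 600.0       # assumed ceiling for a single retry delay
PAUSE_AFTER_429S = 3      # assumed threshold for "repeated" 429s
DOMAIN_PAUSE_S = 3600.0   # "at least an hour", as stated in 05.5


def next_delay(status: int, attempt: int, consecutive_429s: int) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no backoff applies."""
    if status not in (429, 503):
        return None
    if status == 429 and consecutive_429s >= PAUSE_AFTER_429S:
        # Repeated 429s pause the whole domain instead of retrying.
        return DOMAIN_PAUSE_S
    # Exponential delay: 2s, 4s, 8s, ... capped at MAX_DELAY_S.
    return min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
```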
Rate, bandwidth, and frequency
BixelBot is a low-volume, low-frequency crawler by design. The standard posture against any single domain:
- Request rate: Average 1 request per second per host, burst maximum 4 requests per second per host, plus 300-900ms random jitter. The same numbers are published in our signature-agent-card.json.
- Concurrency: Maximum 1 concurrent connection per host, maximum 4 concurrent connections across the bot. Distinct hosts are crawled in parallel, but never with multiple connections to the same origin.
- Pages per visit: Typically under twenty pages per domain in a single batch pass. Most domains see fewer than ten.
- Refresh cadence: Most domains are revisited weekly.
- Bandwidth: Compressed responses preferred. We send Accept-Encoding: gzip, br and use If-Modified-Since to skip unchanged resources.
If our default rate is still too aggressive for your infrastructure, set Crawl-delay in robots.txt or email bot-abuse@bixel.com. We will accommodate.
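As a rough illustration of how a Crawl-delay interacts with the default pacing, the sketch below computes the gap before the next request to a domain. It assumes the jitter is added on top of whichever is longer, the default one-second gap or the site's Crawl-delay; `request_gap` is a hypothetical name, not a published interface.

```python
import random
from typing import Optional

DEFAULT_GAP_S = 1.0           # default cap: 1 request per second per domain
JITTER_RANGE_S = (0.3, 0.9)   # 300-900 ms of random jitter


def request_gap(crawl_delay: Optional[float] = None,
                rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before the next request to the same domain."""
    rng = rng or random.Random()
    # A longer Crawl-delay replaces the default; a shorter one is ignored.
    base = max(DEFAULT_GAP_S, crawl_delay or 0.0)
    return base + rng.uniform(*JITTER_RANGE_S)
```

Under these assumptions, a site that sets Crawl-delay: 10 would see successive requests spaced between 10.3 and 10.9 seconds apart.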
User-Agent string
All Bixel crawl traffic uses a single User-Agent string. No rotation, no spoofing, no masquerading as a browser. The string references this page in the +URL comment.
Crawl traffic
Mozilla/5.0 (compatible; BixelBot/1.0; +https://bixel.com/bot)
Used for every BixelBot request: discovery, refresh, fingerprinting, all GET requests against documented surfaces.
To match BixelBot in robots.txt or a firewall rule, the token to use is BixelBot.
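For operators writing their own log filters, a minimal sketch of matching the token in a User-Agent header is shown below. The function name `claims_bixelbot` is hypothetical, and the case-insensitive, whole-token match is a design assumption. A match only means the request claims to be BixelBot; Section 08 describes how to verify that claim.

```python
import re

# Match "BixelBot" as a whole token so that strings such as
# "NotBixelBotty" are not treated as a claim to be BixelBot.
BIXELBOT_TOKEN = re.compile(r"(?<![A-Za-z0-9])BixelBot(?![A-Za-z0-9])", re.IGNORECASE)


def claims_bixelbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the BixelBot token."""
    return bool(BIXELBOT_TOKEN.search(user_agent or ""))
```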
Verification and provenance
BixelBot is cryptographically verifiable via Web Bot Auth. Every BixelBot request carries an HTTP Message Signature (RFC 9421) signed with an Ed25519 key whose public half is published at our key directory. The required headers are Signature, Signature-Input, and Signature-Agent, constructed per the IETF Web Bot Auth draft.
Key directory
https://bixel.com/.well-known/http-message-signatures-directory
JSON Web Key Set with our active Ed25519 public keys. Served over HTTPS with the application/http-message-signatures-directory+json content type and a self-signed directory signature.
Egress IP list
https://bixel.com/.well-known/bixelbot-ips.json
Authoritative list of dedicated egress IPs, served in the OpenAI gptbot.json schema. Today the list contains a single AWS Lightsail IPv4 in us-east-1; the file’s source comments commit to publishing any new prefix before traffic exits from it.
Source IP alone is not a complete verification signal — combine the IP check with the Web Bot Auth signature above. A request that claims to be BixelBot but originates from outside the published list and fails the signature check is not us. We do not object to such traffic being blocked.
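A sketch of the cheap checks an operator might run before full signature verification. It assumes the IP list uses a top-level `prefixes` array with `ipv4Prefix` or `ipv6Prefix` entries, which is how the gptbot.json layout is understood here; the sample payload uses a documentation address (192.0.2.10), not a real BixelBot egress IP. The sketch only checks that the three signature headers are present. It does not verify the Ed25519 signature, which should be done with an RFC 9421 implementation against the published key directory.

```python
import ipaddress
import json
from typing import Dict, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Illustrative payload only; field names are an assumption about the schema.
SAMPLE_IP_LIST = '{"prefixes": [{"ipv4Prefix": "192.0.2.10/32"}]}'

REQUIRED_HEADERS = ("signature", "signature-input", "signature-agent")


def load_networks(payload: str) -> List[Network]:
    """Parse the published IP list into network objects."""
    networks = []
    for entry in json.loads(payload).get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_is_published(ip: str, networks: List[Network]) -> bool:
    """Return True if the source IP falls inside a published prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def has_signature_headers(headers: Dict[str, str]) -> bool:
    """Presence check only; this does NOT validate the signature itself."""
    present = {name.lower() for name in headers}
    return all(name in present for name in REQUIRED_HEADERS)


def passes_prechecks(ip: str, headers: Dict[str, str], networks: List[Network]) -> bool:
    """Both cheap checks must pass before attempting signature verification."""
    return ip_is_published(ip, networks) and has_signature_headers(headers)
```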
What happens to the data
Captured HTML is parsed into structured observations that populate company profiles on bixel.com. We retain raw response bodies for re-parsing and quality auditing.
- The Bixel dataset describes companies and their publicly-visible characteristics. It does not describe, track, or attempt to identify individual people who visit those companies’ websites.
- Personally identifying information that appears in public company pages (executive names, public author bylines) is treated as part of the company’s public record. We do not cross-reference, enrich, or build profiles of individuals.
- Companies can request access at bixel.com/access to review what we have captured about their company, correct inaccuracies, and add context.
- We do not sell raw HTML, redistribute page-level content, or provide bulk content exports. Our dataset is structured observations about companies, not republished web pages.
- We do not use captured content to train general-purpose language models. Our internal models are scoped to the Bixel dataset and run on Bixel infrastructure.
Anti-abuse commitments
Bixel commits, in writing, that BixelBot will not be used for any of the following:
- No scalping. No inventory probing, ticket-buying, drop monitoring, or transactional commerce activity of any kind.
- No credential stuffing. We never POST credentials, reuse leaked password lists, or probe authentication endpoints for valid logins.
- No directory traversal or vulnerability scanning. We do not scan for /.env, /backup.sql, exposed /.git directories, or known-CVE paths. We are not a security scanner.
- No DDoS. Our hard rate limit and per-domain concurrency cap exist to guarantee we are never load-significant for any origin.
- No transactional commerce activity. We do not crawl shopping carts, SKU pages, regional commerce pulls, or any transactional surface.
- No targeted surveillance. Bixel is a broad market dataset. We do not run targeted crawls against any single company at the request of another company. Every company in our dataset is observed with the same methodology, at the same cadence, under the same rules.
- No challenge bypass. We do not solve, fingerprint-around, or otherwise evade CAPTCHA, Turnstile, hCaptcha, reCAPTCHA, Arkose, JavaScript challenges, or browser fingerprinting defenses.
- No identity rotation to avoid detection. The single User-Agent string in Section 07 is the only one we use. We do not spin up alt identities to keep crawling after a block.
Stop crawling my site
Enter a domain. We will stop crawling it within the hour, across all current and future Bixel pipelines. No account required, no email required, no justification required. We believe opt-out should be as easy as possible.
Prefer robots.txt? Add User-agent: BixelBot followed by Disallow: /. We pick it up on the next robots.txt refresh, within 24 hours. To remove existing pages from our published surfaces as well, email bot-abuse@bixel.com with the domain and we will purge within five business days.
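To confirm that a robots.txt opt-out reads the way you intend, the rules can be checked locally with Python's standard-library parser. The sketch below embeds an example file that blocks BixelBot and leaves other crawlers unaffected; the helper name `allowed` and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block BixelBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: BixelBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(allowed(ROBOTS_TXT, "BixelBot", "https://example.com/"))
    print(allowed(ROBOTS_TXT, "OtherBot", "https://example.com/"))
```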
Legal basis
Reading public web pages is lawful in the United States and in most jurisdictions where Bixel operates. Extracting publicly observable information from publicly accessible pages is squarely within the safe zone established by hiQ Labs v. LinkedIn (9th Cir. 2022) and reinforced by Meta v. Bright Data (N.D. Cal. 2024).
Bixel does not access non-public data, does not bypass authentication or paywalls, does not circumvent technical access controls, does not violate the Computer Fraud and Abuse Act through unauthorized access, and respects opt-outs from site operators. Where applicable, Bixel honors the EU’s text-and-data-mining opt-out under Article 4(3) of the DSM Directive when the site signals it.
If you have a specific legal concern about your site, email legal@bixel.com with the domain and details. Every message is read.
Reporting issues
- Bot misbehavior: If BixelBot exceeded its rate limit, ignored your robots.txt, or hit a path it shouldn’t have, send logs to bot-abuse@bixel.com. We treat these as production bugs.
- Impersonation: If traffic claiming to be BixelBot fails the verification checks in Section 08, it is not us. Forward suspicious requests to security@bixel.com.
- Coordinated disclosure: Vulnerability reports go to security@bixel.com. We respond within two business days and credit researchers in the changelog with permission.
- Takedown / removal: Email bot-abuse@bixel.com with the domain. Crawl stops within an hour; published pages purged within five business days.
Contact
- General: bot-abuse@bixel.com
- Security: security@bixel.com
- Legal: legal@bixel.com
- Response time: Two business days. Longer over US holidays.
- Handled by: A person, not a bot. The mailbox is monitored.
- Related: See Terms of Service and Privacy Policy for the broader operator commitments.