Methodology

How the data on this site is sourced, transformed, refreshed, and validated.

Primary sources

Source                                      | What we get                         | Refresh
Companies House (BasicCompanyDataAsOneFile) | Core registry (5.7M companies)      | Monthly
Companies House (Public Data API)           | Officers, filings, charges, PSCs    | On-demand, 7-day cache
UK SIC 2007                                 | Industry classification reference   | Static (SIC 2007 revision)
Open Government Licence v3.0                | Licence under which we use the data | n/a

Refresh process

  1. On the first Monday of each month, a scheduled job discovers the latest BasicCompanyDataAsOneFile-YYYY-MM-DD.zip from the Companies House download page.
  2. The job downloads the ZIP (~500 MB) and extracts the underlying CSV (~2.7 GB, 5.7 million rows).
  3. Each row is transformed: dates parsed from DD/MM/YYYY to ISO, SIC codes extracted from the “12345 - Description” format, addresses flattened, previous names rolled into JSONB.
  4. Rows stream into a new table, companies_new, via Postgres COPY (no inserts go through an ORM).
  5. Indexes are built on the new table after loading completes (faster than incremental updates during load).
  6. A single atomic ALTER TABLE … RENAME swaps the new table in. The old table is preserved for 24 hours as a rollback safety net.
  7. SIC code aggregates are recomputed.
  8. Total runtime: under 5 minutes end-to-end on the production Hetzner box. The user-facing site stays available throughout.
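The row transformation in step 3 can be sketched as follows. This is a minimal illustration, not the production code: the column names (CompanyNumber, CompanyName, IncorporationDate, SICCode.SicText_1 to _4, PreviousName_N.CompanyName, PreviousName_N.CONDATE) are assumed to match the bulk CSV headers after stripping whitespace, and the output keys are hypothetical.

```python
import json
import re
from datetime import datetime

# Matches the "12345 - Description" format described in step 3.
SIC_PATTERN = re.compile(r"^\s*(\d{5})\s*-\s*(.+?)\s*$")


def parse_uk_date(value):
    """Convert DD/MM/YYYY to ISO YYYY-MM-DD; blank values become None."""
    value = value.strip()
    if not value:
        return None
    return datetime.strptime(value, "%d/%m/%Y").date().isoformat()


def parse_sic(value):
    """Split '12345 - Description' into (code, description), or return None."""
    match = SIC_PATTERN.match(value)
    if match is None:
        return None
    return match.group(1), match.group(2)


def transform_row(row):
    """Turn one CSV row (a dict keyed by header) into a load-ready dict.

    Column names and output keys are assumptions made for this sketch.
    """
    sic_codes = []
    for i in range(1, 5):
        parsed = parse_sic(row.get(f"SICCode.SicText_{i}", ""))
        if parsed:
            sic_codes.append(parsed[0])

    previous_names = []
    for i in range(1, 11):
        name = row.get(f"PreviousName_{i}.CompanyName", "").strip()
        if name:
            changed_on = parse_uk_date(row.get(f"PreviousName_{i}.CONDATE", ""))
            previous_names.append({"name": name, "changed_on": changed_on})

    return {
        "company_number": row["CompanyNumber"].strip(),
        "name": row["CompanyName"].strip(),
        "incorporated_on": parse_uk_date(row.get("IncorporationDate", "")),
        "sic_codes": sic_codes,
        # Serialised so the value can be loaded into a JSONB column.
        "previous_names": json.dumps(previous_names),
    }
```

Values that do not fit the expected SIC format are skipped rather than raising, so one malformed cell does not abort a multi-million-row load.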

Live data via the API

When a user opens a company profile, we fetch its officers, filings, charges, and persons-with-significant-control from the Companies House Public Data API. These responses are cached locally for seven days, so repeat visits don't re-hit the API.
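The seven-day cache can be pictured as a read-through cache keyed by company number. The sketch below is an in-memory illustration only: ProfileCache, its fetch callable, and the injected clock are hypothetical names, and the real site may persist cached responses elsewhere.

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # cache lifetime in seconds


class ProfileCache:
    """Read-through cache: call fetch only when no fresh entry exists."""

    def __init__(self, fetch, ttl_seconds=SEVEN_DAYS, clock=time.time):
        self._fetch = fetch      # hypothetical callable: company_number -> payload
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # company_number -> (stored_at, payload)

    def get(self, company_number):
        now = self._clock()
        entry = self._entries.get(company_number)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # fresh hit: no API call
        payload = self._fetch(company_number)
        self._entries[company_number] = (now, payload)
        return payload
```

Passing the clock in as a parameter keeps the expiry logic testable without waiting a week.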

Rate limit: 600 requests per 5 minutes per API key. With caching, our real traffic stays comfortably below this limit.
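A budget of 600 requests per 5 minutes averages out to 2 requests per second. One way to stay inside it on the client side is a sliding-window counter, sketched below. SlidingWindowLimiter is a hypothetical helper, and how Companies House measures its window is not stated here, so treat this as a conservative guard rather than a model of their enforcement.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls in any window_seconds span."""

    def __init__(self, max_requests=600, window_seconds=300.0, clock=time.monotonic):
        self._max = max_requests
        self._window = window_seconds
        self._clock = clock          # injectable for testing
        self._stamps = deque()       # timestamps of accepted requests, oldest first

    def try_acquire(self):
        """Return True and record the request if budget remains, else False."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) >= self._max:
            return False
        self._stamps.append(now)
        return True
```

A caller that receives False can serve a cached or partial profile instead of calling the API.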

Known limitations

  • The bulk dataset contains only companies currently on the live register. Fully dissolved companies are removed from it, so dissolution counts on this site understate the true number.
  • The bulk dataset is published once a month, so newly incorporated or recently dissolved companies (within ~30 days) may not yet be reflected in our search index. Their profile pages still load fresh data from the live API.
  • SIC code descriptions are extracted from the bulk dataset itself, so any new SIC text variations introduced by Companies House propagate through naturally.
  • The site uses Postgres trigram fuzzy matching for typos, but very short queries (1–2 characters) won't return useful results.
  • Postcode fields that contain non-postcode strings (Companies House allows values like “NOT APPLICABLE”) are passed through as-is, so the postcode filter occasionally surfaces unusual values.
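The short-query limitation in the trigram bullet above follows from how trigram similarity is computed. The sketch below approximates the scheme documented for the Postgres pg_trgm extension (lowercase, pad each word with two leading spaces and one trailing space, compare sets of three-character substrings). It is an illustration in Python, not the extension itself, and the 0.3 cut-off mentioned afterwards is pg_trgm's documented default rather than a confirmed setting of this site.

```python
import re


def trigrams(text):
    """Return the set of padded three-character substrings for each word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "  # two spaces before, one after
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by total distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

With this scheme, the typo "tescco" scores 5/8 against "tesco", well above a 0.3 cut-off. The single-character query "t" produces only two trigrams and scores 1/7 against "tesco", which is why very short queries rarely match anything useful.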

Corrections

If you spot an error in our presentation of the data, email [email protected] and we'll fix or remove the issue within 48 hours.

For corrections to the underlying record (directors, status, registered office), you need to file with Companies House directly. We reload the data each month, so your changes propagate automatically at the next refresh.
