Identify website visitors with Clearbit Reveal and create HubSpot companies using code
Prerequisites
- Python 3.9+ or Node.js 18+
- Clearbit Reveal API key (legacy) or HubSpot account with Breeze Intelligence add-on
- HubSpot private app token with crm.objects.companies.read and crm.objects.companies.write scopes
- Access to server logs or an analytics pipeline that captures visitor IP addresses
Clearbit was acquired by HubSpot and rebranded as Breeze Intelligence. The standalone Reveal API is being sunset. This guide covers both the legacy API approach (for existing Clearbit customers) and the HubSpot-native Breeze approach. New users should start with Breeze.
Why code?
Code gives you the most flexibility for parsing log formats, filtering IPs, and batch processing. You can handle server logs in any format (Apache, nginx, JSON), implement custom ICP scoring logic, and process thousands of IPs in a single run. The script can also be hosted for free on GitHub Actions.
The trade-off is setup complexity. You need server access to capture IPs, a log parsing pipeline, and comfort with Python or Node.js. But once set up, the script runs reliably on a schedule with no per-execution cost.
How it works
- Log parser extracts unique IPs from server logs, filtering for high-intent pages (pricing, demo, contact)
- Clearbit Reveal API resolves each IP to a company (name, domain, industry, employee count)
- ICP filter checks company size and sector against your criteria, discarding non-matches
- HubSpot API deduplicates by domain and creates new company records with enrichment data
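The ICP filter step above checks both company size and sector. A minimal sketch of such a check follows, assuming each resolved company is a dict with domain, employees, and industry keys. The function name matches_icp_with_sector, the default threshold, and the industry labels in the usage example are illustrative assumptions, not part of any API.

```python
def matches_icp_with_sector(company, min_employees=50, allowed_industries=None):
    """Return True when a resolved company passes the size and sector checks.

    Hypothetical helper: thresholds and industry labels are placeholders.
    """
    if not company.get("domain"):
        return False  # without a domain there is nothing to deduplicate on
    employees = company.get("employees") or 0
    if employees < min_employees:
        return False
    if allowed_industries is None:
        return True  # no sector restriction configured
    industry = (company.get("industry") or "").strip().lower()
    return industry in {name.strip().lower() for name in allowed_industries}


example = {"domain": "acme.example", "employees": 120, "industry": "Internet Software & Services"}
print(matches_icp_with_sector(example, allowed_industries=["Internet Software & Services"]))
```

Comparing industry labels case-insensitively keeps the check tolerant of small formatting differences in the enrichment data.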
Step 1: Set up the project
# Test your Clearbit API key
# (203.0.113.42 is a reserved documentation address; substitute a real public IP)
curl -s "https://reveal.clearbit.com/v1/companies/find?ip=203.0.113.42" \
  -H "Authorization: Bearer $CLEARBIT_API_KEY" | head -c 300
# Test your HubSpot token
curl -s "https://api.hubapi.com/crm/v3/objects/companies?limit=1" \
  -H "Authorization: Bearer $HUBSPOT_ACCESS_TOKEN" | head -c 200
Step 2: Extract unique IPs from your logs
Before calling Clearbit, extract and deduplicate visitor IPs. This example reads from a common log format, but adapt it to your analytics pipeline.
import re

def extract_ips_from_log(log_path, min_visits=2):
    """Return {ip: [pages]} for IPs that visited key pages multiple times (shows intent)."""
    ip_pages = {}
    target_pages = ["/pricing", "/demo", "/contact", "/enterprise"]
    with open(log_path) as f:
        for line in f:
            # Common log format: the client IPv4 address leads the line,
            # and the request path follows the opening "GET
            match = re.match(r'^(\d+\.\d+\.\d+\.\d+).*"GET (\S+)', line)
            if not match:
                continue
            ip, page = match.groups()
            if any(page.startswith(p) for p in target_pages):
                ip_pages.setdefault(ip, []).append(page)
    # Only return IPs with multiple visits to high-intent pages
    return {ip: pages for ip, pages in ip_pages.items() if len(pages) >= min_visits}
Step 3: Resolve IPs to companies via Clearbit Reveal
import os
import time

import requests

CLEARBIT_API_KEY = os.environ["CLEARBIT_API_KEY"]
HUBSPOT_ACCESS_TOKEN = os.environ["HUBSPOT_ACCESS_TOKEN"]
HS_HEADERS = {
    "Authorization": f"Bearer {HUBSPOT_ACCESS_TOKEN}",
    "Content-Type": "application/json",
}

def reveal_company(ip):
    """Resolve an IP to a company via Clearbit Reveal, or return None."""
    resp = requests.get(
        "https://reveal.clearbit.com/v1/companies/find",
        params={"ip": ip},
        headers={"Authorization": f"Bearer {CLEARBIT_API_KEY}"},
        timeout=10,
    )
    if resp.status_code == 404:
        return None  # no match for this IP
    resp.raise_for_status()
    data = resp.json()
    company = data.get("company")
    # The match type (company, education, government, isp) is reported on the
    # top-level response; the nested company object's "type" describes the
    # organization itself (private, public, ...), so check the top level.
    if not company or data.get("type") != "company":
        return None
    # Nested objects can be null, so fall back to an empty dict before .get()
    category = company.get("category") or {}
    metrics = company.get("metrics") or {}
    geo = company.get("geo") or {}
    return {
        "domain": company.get("domain"),
        "name": company.get("name"),
        "industry": category.get("industry"),
        "employees": metrics.get("employees"),
        "city": geo.get("city"),
        "state": geo.get("state"),
        "country": geo.get("country"),
        "description": company.get("description"),
    }
Only 20-30% of B2B visitor IPs resolve to a company. Consumer ISPs (Comcast, AT&T), VPNs, and mobile carriers almost never resolve to one. Filter your IP list to corporate-looking traffic before calling the API to save credits.
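One cheap way to trim the IP list before spending lookups is to drop addresses that can never belong to a visitor's company network. The sketch below uses only Python's standard ipaddress module. The function name filter_lookup_candidates and the skip_networks parameter are assumptions for illustration; skip_networks is where you would list ranges you already know are not worth a lookup, such as your own office.

```python
import ipaddress


def filter_lookup_candidates(ips, skip_networks=()):
    """Drop IPs that cannot resolve to a company before calling the API.

    Hypothetical helper: skip_networks is a caller-supplied list of CIDR strings.
    """
    blocked = [ipaddress.ip_network(net) for net in skip_networks]
    candidates = []
    for raw in ips:
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            continue  # malformed entry in the log
        if not addr.is_global:
            continue  # private, loopback, link-local, or otherwise reserved
        if any(addr in net for net in blocked):
            continue  # inside a range the caller asked to skip
        candidates.append(raw)
    return candidates


print(filter_lookup_candidates(["8.8.8.8", "10.0.0.5", "127.0.0.1"]))
```

This does not identify consumer ISPs on its own; that would need an external dataset of ISP ranges, which you could feed in through skip_networks.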
Step 4: Filter for ICP and deduplicate against HubSpot
def matches_icp(company, min_employees=50):
    """Check if a resolved company matches your ICP criteria."""
    if not company.get("domain"):
        return False
    employees = company.get("employees") or 0
    return employees >= min_employees

def company_exists_in_hubspot(domain):
    """Return the ID of the HubSpot company with this domain, or None if there is none."""
    resp = requests.post(
        "https://api.hubapi.com/crm/v3/objects/companies/search",
        headers=HS_HEADERS,
        json={
            "filterGroups": [{"filters": [{
                "propertyName": "domain",
                "operator": "EQ",
                "value": domain,
            }]}],
            "limit": 1,
        },
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    return results[0]["id"] if results else None
Step 5: Create companies in HubSpot
def create_hubspot_company(company, pages_visited):
    """Create a new company in HubSpot from the resolved Clearbit data."""
    properties = {
        "domain": company["domain"],
        "name": company.get("name"),
        # HubSpot's default industry property is a dropdown with a fixed option
        # list; if the API rejects Clearbit's label, map it to a valid option
        # or store it in a custom text property instead.
        "industry": company.get("industry"),
        "numberofemployees": company.get("employees"),
        "city": company.get("city"),
        "state": company.get("state"),
        "country": company.get("country"),
        "description": company.get("description"),
    }
    # pages_visited is not written here; add a custom property if you want it in HubSpot.
    # Drop missing values so None is never sent as the string "None".
    properties = {k: str(v) for k, v in properties.items() if v not in (None, "")}
    resp = requests.post(
        "https://api.hubapi.com/crm/v3/objects/companies",
        headers=HS_HEADERS,
        json={"properties": properties},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["id"]

# --- Main execution ---
ip_pages = extract_ips_from_log("/var/log/nginx/access.log")
print(f"Found {len(ip_pages)} IPs with high-intent visits")
created = 0
skipped = 0
unresolved = 0
for ip, pages in ip_pages.items():
    company = reveal_company(ip)
    if not company:
        unresolved += 1
        continue
    if not matches_icp(company):
        skipped += 1
        continue
    existing = company_exists_in_hubspot(company["domain"])
    if existing:
        print(f"  EXISTS: {company['name']} ({company['domain']})")
        skipped += 1
        continue
    company_id = create_hubspot_company(company, pages)
    employees = company.get("employees") or "?"
    print(f"  CREATED: {company['name']} - {employees} employees - visited {', '.join(pages)}")
    created += 1
    time.sleep(0.2)  # brief pause between HubSpot writes
print(f"\nDone. Created: {created}, Skipped: {skipped}, Unresolved: {unresolved}")
Step 6: Schedule with cron or GitHub Actions
# .github/workflows/identify-visitors.yml
name: Identify Website Visitors
on:
  schedule:
    - cron: '0 8 * * *' # Daily at 8 AM UTC
  workflow_dispatch: {}
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install requests
      - run: python identify_visitors.py
        env:
          CLEARBIT_API_KEY: ${{ secrets.CLEARBIT_API_KEY }}
          HUBSPOT_ACCESS_TOKEN: ${{ secrets.HUBSPOT_ACCESS_TOKEN }}
If your logs aren't accessible from GitHub Actions, pipe IPs to a file (S3, GCS) during the day and download it in the workflow. Or use a webhook-based approach where your server pushes IPs to an API endpoint in real time.
Breeze Intelligence alternative
If you're using HubSpot Breeze Intelligence, the IP-to-company resolution happens automatically within HubSpot — no code needed for that step.
What code adds on top of Breeze:
- Custom ICP filtering — Breeze identifies all visitors, but you may want stricter filters
- Custom properties — Enrich the auto-created records with data from your logs (pages visited, visit count, referrer)
- Routing logic — Assign companies to sales reps based on territory, industry, or company size
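Routing logic of the kind listed above can start as an ordered rule table. The sketch below is one possible shape, assuming companies are dicts with employees and country keys. The rule thresholds, country names, and owner IDs are placeholders rather than real HubSpot values.

```python
# Ordered (predicate, owner_id) pairs; the first matching rule wins.
# Owner IDs and thresholds are hypothetical placeholders.
ROUTING_RULES = [
    (lambda c: (c.get("employees") or 0) >= 1000, "owner-enterprise"),
    (lambda c: c.get("country") in {"United States", "Canada"}, "owner-north-america"),
]
DEFAULT_OWNER = "owner-unassigned"


def pick_owner(company):
    """Return the owner ID of the first rule the company satisfies."""
    for predicate, owner_id in ROUTING_RULES:
        if predicate(company):
            return owner_id
    return DEFAULT_OWNER


print(pick_owner({"employees": 80, "country": "Canada"}))
```

Keeping the rules in a list makes precedence explicit: company size is checked before territory here, and reordering the list changes that.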
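# Example: Enrich Breeze-created companies with visit metadata
# Poll for recently created companies and update them
# (reuses requests and HS_HEADERS from Step 3)
import time

# createdate is compared as a Unix timestamp in milliseconds
twenty_four_hours_ago_ms = int((time.time() - 24 * 60 * 60) * 1000)

resp = requests.post(
    "https://api.hubapi.com/crm/v3/objects/companies/search",
    headers=HS_HEADERS,
    json={
        "filterGroups": [{"filters": [{
            "propertyName": "createdate",
            "operator": "GTE",
            "value": str(twenty_four_hours_ago_ms),
        }]}],
        "properties": ["domain", "name"],
        "limit": 100,
    },
    timeout=10,
)
resp.raise_for_status()
recent_companies = resp.json().get("results", [])
Troubleshooting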
Common questions
What percentage of IPs will resolve to a company?
Expect 20-30% for B2B traffic. Consumer ISPs, VPNs, mobile carriers, and work-from-home traffic almost never resolve. If you're getting under 10%, check that you're reading the correct header for the client IP (not your load balancer's IP).
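When the application sits behind a load balancer, the visitor's address usually arrives in the X-Forwarded-For header rather than as the connecting address. The sketch below shows one common way to pick it: trust the header only when the connection comes from a known proxy, then walk the header from right to left until an untrusted address appears. The function name client_ip and the trusted_proxies parameter are assumptions; the CIDR ranges in the example are placeholders for your own infrastructure.

```python
import ipaddress


def client_ip(remote_addr, forwarded_for, trusted_proxies):
    """Return the visitor IP, trusting X-Forwarded-For only behind known proxies."""
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def parse(value):
        try:
            return ipaddress.ip_address(value)
        except ValueError:
            return None  # not a valid IP string

    def is_trusted(value):
        addr = parse(value)
        return addr is not None and any(addr in net for net in trusted)

    if not is_trusted(remote_addr):
        return remote_addr  # direct connection; the header could be spoofed
    hops = [h.strip() for h in (forwarded_for or "").split(",") if h.strip()]
    for hop in reversed(hops):
        if parse(hop) is None:
            continue  # skip malformed entries
        if not is_trusted(hop):
            return hop  # first address that is not one of our proxies
    return remote_addr


print(client_ip("10.0.0.2", "198.51.100.7, 10.0.0.9", ["10.0.0.0/8"]))
```

If every logged IP falls inside your proxy ranges, the log is recording the load balancer rather than the visitor, which would explain a very low resolve rate.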
How do I handle log rotation?
Implement a checkpoint — save the last-processed log line offset or timestamp to a file, then resume from that point on the next run. Or rotate logs daily and only process the current day's file with a cron job that runs at end of day.
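A byte-offset checkpoint can be sketched in a few lines. The version below assumes a single log file and a small text file holding the last processed offset; the function name read_new_lines and both paths are hypothetical. It resets to the start of the file when the saved offset is larger than the file, which is what a rotation or truncation looks like from the reader's side.

```python
import os


def read_new_lines(log_path, checkpoint_path):
    """Return complete log lines appended since the last run and save the new offset."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = int(f.read().strip() or 0)
    # A saved offset beyond the end of the file means the log was rotated or truncated
    if offset > os.path.getsize(log_path):
        offset = 0
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume only up to the last newline so a half-written line waits for the next run
    end = data.rfind(b"\n") + 1
    with open(checkpoint_path, "w") as f:
        f.write(str(offset + end))
    return data[:end].decode("utf-8", errors="replace").splitlines()
```

The returned lines can then go through the same regex as the full-file parser. Note that a rotated file that has already grown past the old offset would not be detected by the size check alone.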
How much does Clearbit Reveal cost?
The legacy Reveal API is volume-based, typically starting around $99/mo for 2,500 lookups. Breeze Intelligence (the HubSpot-native replacement) is included with Professional+ plans and priced per credit. Check your HubSpot contract for specifics.
Cost
- Hosting: Free on GitHub Actions or ~$5/mo on Railway
- Clearbit Reveal (legacy): Volume-based pricing, typically starting ~$99/mo for 2,500 lookups
- Breeze Intelligence: Included with HubSpot Professional+, priced per credit
- HubSpot API: Free with any plan that supports private apps