Introduction

I recently tried to get one of my old projects, BreweriesNearMe, working again. After multiple issues getting the site running (details will be in another post soon), I realised the API that I was using, brewerydb.com, no longer exists. I found OpenBreweryDB pretty quickly and had it hooked up and running. However, it had no data for Australia (it’s an open, crowd-sourced data source). I then thought, “How hard can that be to get?”

When I started this task, my goal was deceptively simple: collect a comprehensive list of breweries in Australia and format them for ingestion into the OpenBreweryDB schema. The catch? There’s no single authoritative source — and every source that does exist presents its own challenges.

This post is a deep dive into how I designed and built a fully automated data pipeline to tackle that problem. It’s not a tutorial, but a breakdown of the engineering decisions, mistakes, and solutions that got us from “let’s scrape some websites” to “production-ready, enriched, validated data.”


Architecture Overview

At a high level, the scraper evolved into a modular system with three major components:

  1. Data Extraction Layer – Scrape and parse multiple upstream sources.
  2. Normalization & Cleaning Layer – Standardize the fields into a unified schema.
  3. Enrichment & Post-Processing Layer – Use the Google Places API to add structured location data, validate results, and filter noise.

The guiding principle was “merge many imperfect sources into one high-quality dataset.”
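To make the merge idea concrete, here is a minimal sketch, assuming each source yields a list of dicts with at least a "name" key. The merge_sources helper and the "first non-empty value wins" rule are illustrative assumptions, not necessarily what the real scraper does.

```python
def merge_sources(*sources):
    """Merge record lists from several sources, keyed on a normalized name."""
    merged = {}
    for records in sources:
        for record in records:
            # Assumed matching rule: case-insensitive name, whitespace collapsed.
            key = " ".join(record["name"].casefold().split())
            target = merged.setdefault(key, {})
            for field, value in record.items():
                # Keep the first non-empty value seen for each field.
                if value not in (None, "") and field not in target:
                    target[field] = value
    return list(merged.values())
```

Sources passed first take priority, so the most trusted source would go first in the argument list.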


1. Data Extraction Layer

The first major step was building scrapers for four key sources:

Each of these posed different challenges.

Craft Cartel: Parsing Unstructured Text

The simplest-looking source turned out to be tricky. The brewery names weren’t in HTML tables or lists — they were buried inside a block of text. Once extracted, each needed to be split and cleaned. That got us names, but no addresses, phones, or coordinates — we’d deal with that later.
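A minimal sketch of that split-and-clean step, assuming the names arrive as one block separated by commas, semicolons, bullets, or newlines. The delimiter set and the split_brewery_names helper are illustrative guesses rather than the scraper's actual code.

```python
import re

def split_brewery_names(text_block: str) -> list[str]:
    """Split a free-text block into cleaned, de-duplicated brewery names."""
    # Assumed delimiters: commas, semicolons, bullet characters, newlines.
    parts = re.split(r"[,;\n\u2022]+", text_block)
    names = []
    seen = set()
    for part in parts:
        # Collapse runs of whitespace, then trim stray spaces, dots, hyphens.
        name = re.sub(r"\s+", " ", part).strip(" .-")
        key = name.casefold()
        if name and key not in seen:
            seen.add(key)
            names.append(name)
    return names
```

For example, split_brewery_names("Alpha Brewing, Beta Ales\n alpha brewing.") keeps the two distinct names and drops the differently-cased duplicate.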

IBA: Paginated and Noisy Data

IBA (the Independent Brewers Association) was the richest source, but its member listing is paginated (/page/2/, /page/3/, etc.). Handling this meant writing a loop that crawled until no more pages existed. I also had to filter out non-brewery blocks like “Brewery Members” and “Want to be a member?” that appeared on each page.
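The crawl-until-empty loop can be sketched as follows. The fetch_page callable is hypothetical: it stands in for whatever downloads and parses one page, and is assumed to return a list of block titles, or None once the page no longer exists. The max_pages cap is an extra safety assumption.

```python
from typing import Callable, Iterator, Optional

# Non-brewery blocks named in the text above, compared case-insensitively.
NOISE_TITLES = {"brewery members", "want to be a member?"}

def crawl_pages(
    fetch_page: Callable[[int], Optional[list[str]]],
    max_pages: int = 200,
) -> Iterator[str]:
    """Yield brewery names page by page until a page is missing or empty."""
    for page in range(1, max_pages + 1):
        titles = fetch_page(page)  # hypothetical fetcher, None when page is gone
        if not titles:
            break
        for title in titles:
            cleaned = title.strip()
            if cleaned.casefold() not in NOISE_TITLES:
                yield cleaned
```

Passing the fetcher in as an argument keeps the loop testable without touching the network.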

Wikipedia: Tables That Weren’t Really Tables

Wikipedia’s structure was the most brittle. The tables I needed were buried after specific <h2> headings, but weren’t direct siblings. My first attempt returned nothing. The fix was to walk forward through the DOM until the next <table class="wikitable"> appeared, then flatten it — accounting for rowspan and column misalignment.
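The rowspan part is the fiddly bit, so here is a rough sketch of it. It assumes the table has already been located and parsed into rows of (text, rowspan) pairs; that intermediate shape and the expand_rowspans name are my own illustration. The forward walk itself is not shown (in BeautifulSoup, for instance, find_next can do that kind of search from a heading).

```python
def expand_rowspans(rows):
    """Flatten rows of (text, rowspan) cells into a rectangular grid."""
    grid = []
    pending = {}  # column index -> (text, number of further rows to fill)
    for row in rows:
        cells = iter(row)
        out = []
        col = 0
        while True:
            if col in pending:
                # A cell from an earlier row still spans this position.
                text, left = pending[col]
                out.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
            else:
                cell = next(cells, None)
                if cell is None:
                    if not any(c > col for c in pending):
                        break
                    cell = ("", 1)  # pad so later carried cells stay aligned
                text, rowspan = cell
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
            col += 1
        grid.append(out)
    return grid
```

A state cell with rowspan 2 followed by a row that only has a brewery name comes out as two full rows, each carrying the state.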

In the end, Wikipedia provided high-value signals: ownership classification and historical data that wasn’t available elsewhere.


2. Normalization & Cleaning

With data now flowing from three sources, the next challenge was making it consistent. OpenBreweryDB expects a schema like:

id, name, brewery_type, address_1, address_2, city, state_province, postal_code, country, phone, website_url, longitude, latitude

This meant handling dozens of edge cases:

Normalization is where most of the real engineering time went: 70% of the total effort.
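As a rough illustration of projecting a scraped record onto that schema: the field list comes from the schema above, but the specific cleanup rules here (whitespace collapsing, defaulting the country, prefixing bare URLs with https://) are assumptions chosen for the example, not the scraper's actual edge-case handling.

```python
SCHEMA_FIELDS = [
    "id", "name", "brewery_type", "address_1", "address_2", "city",
    "state_province", "postal_code", "country", "phone", "website_url",
    "longitude", "latitude",
]

def normalize_record(raw: dict) -> dict:
    """Return a dict with exactly the schema fields, lightly cleaned."""
    record = {}
    for field in SCHEMA_FIELDS:
        value = raw.get(field)
        if isinstance(value, str):
            # Collapse internal whitespace; treat empty strings as missing.
            value = " ".join(value.split()) or None
        record[field] = value
    # Assumption: every record in this pipeline is Australian.
    if record["country"] is None:
        record["country"] = "Australia"
    # Assumption: bare domains should get an https:// scheme.
    url = record["website_url"]
    if url and not url.startswith(("http://", "https://")):
        record["website_url"] = "https://" + url
    return record
```

Unknown keys in the raw record are dropped and missing ones become None, so every output row has the same columns in the same order.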


3. Enrichment with Google Places API

The raw scraped data was still thin — we had names and maybe a rough suburb. The next step was to enrich it.

For each brewery, I constructed a text query like:

<name> brewery Australia

and passed it to the Google Places API. From the response, I extracted:

The enrichment step transformed the dataset from “interesting” to “useful.”
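For reference, building that query might look like the sketch below. It assumes the legacy Places Text Search endpoint and its query, region, and key parameters; the post does not say which Places endpoint the scraper actually calls, and build_text_search_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed endpoint: the legacy Places Text Search web service.
TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"

def build_text_search_url(name: str, api_key: str) -> str:
    """Build a Text Search request URL for '<name> brewery Australia'."""
    params = {
        "query": f"{name} brewery Australia",
        "region": "au",  # bias results towards Australia
        "key": api_key,
    }
    return f"{TEXT_SEARCH_URL}?{urlencode(params)}"
```

The function only builds the URL; sending the request and reading the JSON response is left to whichever HTTP client is in use.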

Filtering Non-Australian Results

Many names overlapped with breweries overseas. Adding a strict "country: Australia" filter and rejecting any result not geocoded inside Australia cleaned up the data dramatically.
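One way to express that filter, as a sketch: it assumes each result is a dict shaped like a Places search result, with formatted_address and geometry.location.lat/lng. The bounding box is a deliberately rough approximation of Australia, not an exact border.

```python
# Rough bounding box for Australia (approximate, illustrative values).
AU_BOUNDS = {"min_lat": -44.0, "max_lat": -9.0, "min_lng": 112.0, "max_lng": 154.0}

def is_australian(result: dict) -> bool:
    """Accept a result only if its address and its coordinates both say Australia."""
    address = result.get("formatted_address", "")
    location = result.get("geometry", {}).get("location", {})
    lat, lng = location.get("lat"), location.get("lng")
    if lat is None or lng is None:
        return False  # no coordinates: cannot confirm, so reject
    in_box = (
        AU_BOUNDS["min_lat"] <= lat <= AU_BOUNDS["max_lat"]
        and AU_BOUNDS["min_lng"] <= lng <= AU_BOUNDS["max_lng"]
    )
    return address.rstrip().endswith("Australia") and in_box
```

Requiring both checks is the strict option: a result with an Australian-looking address but overseas coordinates is still rejected.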


4. Reliability, Error Handling & Backoff

The first full run failed halfway through: ConnectionResetError: [Errno 54] Connection reset by peer. That turned out to be a network-level reset during Places API calls — likely due to too many requests too quickly.

The solution was threefold:

With these improvements, I could run enrichment across 500+ breweries without a single crash.
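Since the section title mentions backoff, here is a generic sketch of retrying with exponential backoff and jitter. The call_with_backoff helper, the retry count, and the delays are illustrative assumptions rather than the scraper's actual settings.

```python
import random
import time

def call_with_backoff(func, *, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on connection errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return func()
        except (ConnectionResetError, TimeoutError):
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            # Wait base, 2x base, 4x base, ... plus random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

HTTP client libraries often wrap the low-level reset in their own exception type, so the except clause would need to match whatever the client actually raises. The sleep parameter exists so the delay can be stubbed out in tests.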


5. Lessons Learned & Future Work

This project ended up being far more complex than a “simple web scraper.” Along the way, I learned a few key lessons:

Future improvements include:


Conclusion

What started as a simple scraper evolved into a production-ready data pipeline — one that fetches, normalizes, enriches, and validates hundreds of Australian breweries. The final result is a clean, geocoded dataset that can slot directly into OpenBreweryDB, providing far richer data than any single source on its own.

If you’re building something similar, my advice is simple: treat scraping as just the first step. The real value lies in how you clean, enrich, and harden that data for downstream use.

Final note: If you want, check out the source of the scraper here: https://github.com/simonmackinnon/breweriesnearme/blob/master/data/scrape_au_breweries.py