<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://theclouddevopslearningblog.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://theclouddevopslearningblog.com/" rel="alternate" type="text/html" /><updated>2025-10-16T11:43:36+00:00</updated><id>https://theclouddevopslearningblog.com/feed.xml</id><title type="html">The Cloud DevOps Learning Blog</title><entry><title type="html">How I Hosted a JavaScript Single-Page App (BreweriesNearMe) on a Subdomain with S3, CloudFront and Route 53</title><link href="https://theclouddevopslearningblog.com/devops/aws/s3/cloudfront/frontend/2025/10/07/breweries-spa-subdomain.html" rel="alternate" type="text/html" title="How I Hosted a JavaScript Single-Page App (BreweriesNearMe) on a Subdomain with S3, CloudFront and Route 53" /><published>2025-10-07T00:00:00+00:00</published><updated>2025-10-07T00:00:00+00:00</updated><id>https://theclouddevopslearningblog.com/devops/aws/s3/cloudfront/frontend/2025/10/07/breweries-spa-subdomain</id><content type="html" xml:base="https://theclouddevopslearningblog.com/devops/aws/s3/cloudfront/frontend/2025/10/07/breweries-spa-subdomain.html"><![CDATA[<p>One of the best things about static sites hosted on AWS is how easy it is to extend them. My main Jekyll blog (<code class="language-plaintext highlighter-rouge">theclouddevopslearningblog.com</code>) runs from an S3 bucket behind CloudFront, but I recently wanted to host a <strong>standalone JavaScript single-page application (SPA)</strong> on a subdomain:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://breweriesnearme.theclouddevopslearningblog.com
</code></pre></div></div>

<p>Here’s exactly how I set it up.</p>

<hr />

<h2 id="1-creating-a-new-s3-bucket-for-the-subdomain">1. Creating a New S3 Bucket for the Subdomain</h2>

<p>The first step was to create a dedicated S3 bucket to hold the SPA build. Following AWS best practices:</p>

<ul>
  <li>Bucket name: <code class="language-plaintext highlighter-rouge">breweriesnearme.theclouddevopslearningblog.com</code></li>
  <li><strong>Block all public access</strong>: ✅ ON</li>
  <li><strong>Static website hosting</strong>: ❌ Disabled (we’ll use CloudFront instead)</li>
</ul>

<p>I uploaded the build artifacts (<code class="language-plaintext highlighter-rouge">index.html</code>, <code class="language-plaintext highlighter-rouge">main.js</code>, <code class="language-plaintext highlighter-rouge">assets/</code>, etc.) directly into the bucket root.</p>

<blockquote>
  <p>💡 <strong>Tip:</strong> Make sure <code class="language-plaintext highlighter-rouge">index.html</code> is at the root of the bucket, not in a subfolder like <code class="language-plaintext highlighter-rouge">dist/</code>, unless you plan to set a CloudFront origin path.</p>
</blockquote>

<hr />

<h2 id="2-requesting-an-ssltls-certificate-in-acm">2. Requesting an SSL/TLS Certificate in ACM</h2>

<p>CloudFront requires certificates to be in the <code class="language-plaintext highlighter-rouge">us-east-1</code> region, so I switched to <strong>N. Virginia</strong> and requested a new cert for:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>breweriesnearme.theclouddevopslearningblog.com
</code></pre></div></div>

<p>I used DNS validation and, because the domain is managed in Route 53, ACM automatically created the necessary CNAME record. Once validation succeeded, the certificate was ready to attach.</p>

<hr />

<h2 id="3-setting-up-the-cloudfront-distribution">3. Setting Up the CloudFront Distribution</h2>

<p>Next, I created a <strong>new CloudFront distribution</strong> to serve the SPA.</p>

<p><strong>Key settings:</strong></p>

<ul>
  <li><strong>Origin Domain:</strong> S3 <em>regional endpoint</em> (not the website endpoint)</li>
  <li><strong>Origin Access Control (OAC):</strong> Enabled, to keep the S3 bucket private</li>
  <li><strong>Viewer Protocol Policy:</strong> Redirect HTTP → HTTPS</li>
  <li><strong>Alternate Domain Name (CNAME):</strong> <code class="language-plaintext highlighter-rouge">breweriesnearme.theclouddevopslearningblog.com</code></li>
  <li><strong>SSL Certificate:</strong> Custom ACM certificate from step 2</li>
  <li><strong>Default Root Object:</strong> <code class="language-plaintext highlighter-rouge">index.html</code> (⚠️ no leading slash)</li>
  <li><strong>Compression:</strong> Enabled</li>
</ul>

<h3 id="-spa-friendly-error-pages">🔁 SPA-Friendly Error Pages</h3>

<p>Because SPAs handle routing client-side, I needed to configure CloudFront to serve <code class="language-plaintext highlighter-rouge">index.html</code> even when a 403 or 404 occurs:</p>

<ul>
  <li><strong>403 → 200</strong> → <code class="language-plaintext highlighter-rouge">/index.html</code></li>
  <li><strong>404 → 200</strong> → <code class="language-plaintext highlighter-rouge">/index.html</code></li>
</ul>

<p>This ensures deep links like <code class="language-plaintext highlighter-rouge">/brewery/42</code> work correctly.</p>

<hr />

<h2 id="4-bucket-policy-for-cloudfront-access">4. Bucket Policy for CloudFront Access</h2>

<p>With OAC enabled, I updated the S3 bucket policy to allow CloudFront to read objects:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2012-10-17"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"Statement"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"Sid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"AllowCloudFrontOACRead"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Principal"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nl">"Service"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cloudfront.amazonaws.com"</span><span class="w"> </span><span class="p">},</span><span class="w">
      </span><span class="nl">"Action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"s3:GetObject"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"arn:aws:s3:::breweriesnearme.theclouddevopslearningblog.com/*"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"Condition"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
        </span><span class="nl">"StringEquals"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
          </span><span class="nl">"AWS:SourceArn"</span><span class="p">:</span><span class="w"> </span><span class="s2">"arn:aws:cloudfront::&lt;ACCOUNT_ID&gt;:distribution/&lt;DISTRIBUTION_ID&gt;"</span><span class="w">
        </span><span class="p">}</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<hr />

<h2 id="5-adding-the-subdomain-in-route-53">5. Adding the Subdomain in Route 53</h2>

<p>Finally, I added a new DNS record in Route 53:</p>

<ul>
  <li><strong>Record type:</strong> A (Alias)</li>
  <li><strong>Name:</strong> <code class="language-plaintext highlighter-rouge">breweriesnearme</code></li>
  <li><strong>Target:</strong> My CloudFront distribution</li>
</ul>

<p>Once propagation completed, the subdomain pointed to CloudFront and the SPA became publicly accessible.</p>

<hr />

<h2 id="6-common-gotchas-and-how-i-fixed-them">6. Common Gotchas (and How I Fixed Them)</h2>

<ul>
  <li><strong>AccessDenied at root:</strong> I initially saw an <code class="language-plaintext highlighter-rouge">AccessDenied</code> error when visiting <code class="language-plaintext highlighter-rouge">/</code>. The fix was making sure <strong>Default Root Object</strong> was set to <code class="language-plaintext highlighter-rouge">index.html</code> (without a slash).</li>
  <li><strong>Certificate validation not showing:</strong> The first time I requested the cert, I did it in the wrong region (<code class="language-plaintext highlighter-rouge">ap-southeast-2</code>). Certificates for CloudFront must be in <code class="language-plaintext highlighter-rouge">us-east-1</code>.</li>
</ul>

<hr />

<h2 id="-cicd-deployment">🧰 CI/CD Deployment</h2>

<p>I set up a GitHub Actions pipeline to automatically build and deploy the SPA to S3 and invalidate the CloudFront cache:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">Deploy Breweries Near Me SPA</span>

<span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
    <span class="na">branches</span><span class="pi">:</span> <span class="pi">[</span> <span class="nv">master</span> <span class="pi">]</span>

<span class="na">permissions</span><span class="pi">:</span>
  <span class="na">contents</span><span class="pi">:</span> <span class="s">read</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">deploy</span><span class="pi">:</span>
    <span class="na">name</span><span class="pi">:</span> <span class="s">Build and Deploy SPA</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout repository</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Setup Node.js</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/setup-node@v4</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">node-version</span><span class="pi">:</span> <span class="m">20</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Install dependencies</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">npm ci</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build application</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">npm run build</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Configure AWS credentials</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">aws-actions/configure-aws-credentials@v4</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">aws-access-key-id</span><span class="pi">:</span> <span class="s">$</span>
          <span class="na">aws-secret-access-key</span><span class="pi">:</span> <span class="s">$</span>
          <span class="na">aws-region</span><span class="pi">:</span> <span class="s">ap-southeast-2</span>

      <span class="c1"># Upload static assets from dist/ (immutable cache)</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Upload dist/ to S3</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">aws s3 sync dist/ s3://$/ \</span>
            <span class="s">--delete \</span>
            <span class="s">--cache-control "public,max-age=31536000,immutable"</span>

      <span class="c1"># Upload images from img/ (long cache but not immutable)</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Upload img/ to S3</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">aws s3 sync src/img/ s3://$/img/ \</span>
            <span class="s">--delete \</span>
            <span class="s">--cache-control "public,max-age=31536000"</span>

      <span class="c1"># Upload index.html separately with no-cache</span>
      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Upload index.html</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">aws s3 cp src/index.html s3://$/index.html \</span>
            <span class="s">--cache-control "no-store" \</span>
            <span class="s">--content-type "text/html"</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Invalidate CloudFront cache</span>
        <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
          <span class="s">aws cloudfront create-invalidation \</span>
            <span class="s">--distribution-id $ \</span>
            <span class="s">--paths "/*"</span>

</code></pre></div></div>

<hr />

<h2 id="7-application-changes-updating-the-app-and-migrating-to-openbrewerydb">7. Application Changes: Updating the App and Migrating to OpenBreweryDB</h2>

<p><img src="/media/breweriesnearme.png" alt="BreweriesNearMe" /></p>

<p>I originally wrote BreweriesNearMe years ago as a standalone app, as part of a Functional Programming in JavaScript course that used the Rambda library, and it was previously deployed and hosted on a Raspberry Pi at home. The code for this app is pretty cool: it uses the <code class="language-plaintext highlighter-rouge">hyperscript-helpers</code> library, which lets you assign the CSS classes for each element right where it is created, via helper functions. That keeps the styling much more readable than a separate stylesheet and, for my purposes, was definitely sufficient, e.g.:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>function fieldSet(labelText, inputValue, oninput) {
    return div({ className: 'w-80'},
    [
        label({ className: 'db mb1 mw-80' }, labelText),
        input({ 
            className: 'pa2 input-reset ba w-100 mb2 br3',
            id: 'addressSearch',
            type: 'text',
            value: inputValue,
            oninput
        }),
    ]);
}
</code></pre></div></div>

<p>Switching from the proprietary BreweryDB API to the open, community-driven <a href="https://www.openbrewerydb.org/">OpenBreweryDB</a> required several code and data handling changes. In my previous post, I wrote about how I scraped, enriched and then added data to this project, so I won’t go over it again today. However, while cutting the source over was straightforward, there were a few things that I had to update to get everything working smoothly:</p>

<h3 id="-api-endpoint--data-model-changes">🔄 API Endpoint &amp; Data Model Changes</h3>

<ul>
  <li><strong>API URL:</strong> Updated all fetch calls to use the OpenBreweryDB REST endpoints instead of the old BreweryDB URLs.</li>
  <li><strong>Field Names:</strong> OpenBreweryDB uses different field names (e.g., <code class="language-plaintext highlighter-rouge">name</code>, <code class="language-plaintext highlighter-rouge">street</code>, <code class="language-plaintext highlighter-rouge">city</code>, <code class="language-plaintext highlighter-rouge">state</code>, <code class="language-plaintext highlighter-rouge">postal_code</code>, <code class="language-plaintext highlighter-rouge">website_url</code>, <code class="language-plaintext highlighter-rouge">latitude</code>, <code class="language-plaintext highlighter-rouge">longitude</code>). I refactored the code to map and display these new fields.</li>
  <li><strong>No API Key Needed:</strong> Removed all authentication logic and API key handling, since OpenBreweryDB is public. In fact, my previous authenticated calls through to BreweryDB.com were proxied via an AWS Lambda function to abstract the API auth side of things altogether.</li>
</ul>

<h3 id="️-address--location-handling">🗺️ Address &amp; Location Handling</h3>

<ul>
  <li><strong>Address Formatting:</strong> Adjusted the address formatting logic to handle OpenBreweryDB’s fields, which sometimes differ from BreweryDB (e.g., <code class="language-plaintext highlighter-rouge">address_1</code> vs <code class="language-plaintext highlighter-rouge">street</code>, <code class="language-plaintext highlighter-rouge">state_province</code> vs <code class="language-plaintext highlighter-rouge">state</code>).</li>
  <li><strong>Geolocation:</strong> Ensured that the app gracefully handles missing or partial location data, since not all breweries in OpenBreweryDB have latitude/longitude.</li>
</ul>

<h3 id="-distance-calculation">📏 Distance Calculation</h3>

<ul>
  <li><strong>Distance Calculation:</strong> Since OpenBreweryDB doesn’t provide distance-from-user, I implemented a Haversine formula in the frontend to calculate the distance between the user’s search location and each brewery’s coordinates.</li>
  <li><strong>Unit Selection:</strong> Preserved support for both kilometers and miles.</li>
  <li><strong>Filtering by Distance:</strong> Updated the logic so that breweries are filtered by the selected radius before rendering, and the “No breweries to display within the selected distance.” message is shown if none match. The distance calculation is now performed for each brewery before filtering and deduplication, ensuring the UI is always accurate and user-friendly.</li>
</ul>
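
<p>For reference, the Haversine formula itself is short. The sketch below is written in Python purely for illustration (the app does this in the JavaScript frontend, and that code is not reproduced here). The function name <code class="language-plaintext highlighter-rouge">haversine</code>, the <code class="language-plaintext highlighter-rouge">unit</code> parameter and the mean Earth radius constant are assumptions made for the example.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant for this sketch)
KM_PER_MILE = 1.609344


def haversine(lat1, lon1, lat2, lon2, unit="km"):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    # Haversine term: sin^2(d_phi/2) + cos(phi1) * cos(phi2) * sin^2(d_lambda/2)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 so rounding error near antipodal points cannot break asin
    distance_km = 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
    return distance_km / KM_PER_MILE if unit == "mi" else distance_km
</code></pre></div></div>

<p>Filtering by radius is then a matter of computing this distance for every brewery that has coordinates and keeping the ones at or under the selected radius.</p>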

<h3 id="-deduplication">🧹 Deduplication</h3>

<ul>
  <li><strong>Duplicate Results:</strong> OpenBreweryDB sometimes returns duplicate or near-duplicate breweries (with slight name or address variations - kind of my bad given I created the initial dataset 💀). I added a deduplication step in the frontend, matching on address, website, and distance, to ensure only unique breweries are shown.</li>
</ul>
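
<p>One way to express a deduplication step like this is to build a normalised key per brewery and keep only the first record for each key. This is a rough Python sketch of the idea, not the app’s actual frontend code. The <code class="language-plaintext highlighter-rouge">distance_km</code> field and the choice to round it to one decimal place are assumptions for the example.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def dedupe_breweries(breweries):
    """Return breweries with duplicates removed, keeping the first of each key."""
    seen = set()
    unique = []
    for brewery in breweries:
        key = (
            # Case and surrounding whitespace should not create "new" breweries
            (brewery.get("address_1") or "").strip().lower(),
            # Treat "https://example.com/" and "https://example.com" as equal
            (brewery.get("website_url") or "").strip().lower().rstrip("/"),
            # Hypothetical precomputed distance, rounded to absorb tiny differences
            round(brewery.get("distance_km", 0.0), 1),
        )
        if key in seen:
            continue
        seen.add(key)
        unique.append(brewery)
    return unique
</code></pre></div></div>

<p>Because the key includes the distance, this has to run after the distance calculation, which matches the ordering described above.</p>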

<h3 id="️-ui--table-rendering">🖥️ UI &amp; Table Rendering</h3>

<ul>
  <li><strong>Table Columns:</strong> Updated the UI to show the new fields, and ensured links (like website and Google Maps) use the correct data. Sadly, I had to remove the Image field, as this information isn’t supported in the OpenBreweryDB API (perhaps room to add improvements later!).</li>
  <li><strong>Error Handling:</strong> Improved error handling for missing data and empty results.</li>
</ul>

<h3 id="-testing--edge-cases">🧪 Testing &amp; Edge Cases</h3>

<ul>
  <li><strong>Deep Links:</strong> Verified that SPA routing still works for direct links to brewery detail pages.</li>
  <li><strong>No Results:</strong> Ensured the app displays a friendly message if no breweries are found for a given search or if none are within the selected distance.</li>
</ul>

<hr />

<p>With these changes, the app now works seamlessly with OpenBreweryDB, is easier to maintain, and is free from API key or quota restrictions.</p>

<hr />

<h2 id="final-thoughts">Final Thoughts</h2>

<p>This project was a great reminder of how flexible AWS’s static hosting model is. With just a few services (S3, CloudFront, ACM, and Route 53), I was able to stand up a completely separate JavaScript application under the same domain as my Jekyll blog, with full HTTPS support, CDN caching, and SPA-friendly routing.</p>]]></content><author><name></name></author><category term="devops" /><category term="aws" /><category term="s3" /><category term="cloudfront" /><category term="frontend" /><summary type="html"><![CDATA[A step-by-step walkthrough of deploying a separate single-page application under a subdomain of an existing Jekyll site using AWS S3, CloudFront, ACM, and Route 53.]]></summary></entry><entry><title type="html">Building an End-to-End Brewery Scraper for Australian Data for OpenBreweryDB</title><link href="https://theclouddevopslearningblog.com/devops/data-engineering/python/scraping/2025/10/04/building-an-end-to-end-brewery-scraper.html" rel="alternate" type="text/html" title="Building an End-to-End Brewery Scraper for Australian Data for OpenBreweryDB" /><published>2025-10-04T00:00:00+00:00</published><updated>2025-10-04T00:00:00+00:00</updated><id>https://theclouddevopslearningblog.com/devops/data-engineering/python/scraping/2025/10/04/building-an-end-to-end-brewery-scraper</id><content type="html" xml:base="https://theclouddevopslearningblog.com/devops/data-engineering/python/scraping/2025/10/04/building-an-end-to-end-brewery-scraper.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>I recently tried to get one of my old projects, <a href="https://github.com/simonmackinnon/breweriesnearme">BreweriesNearMe</a> working again. After multiple issues getting the site running again (details will be in another post soon), I realised the API that I was using, brewerydb.com, no longer exists. I found OpenBreweryDB pretty soon, and had it hooked up and running. However, I found that there was no local data for Australia (it’s an open/crowd sourced data source). I then thought, “How hard can that be to get?”</p>

<p>When I started this task, my goal was deceptively simple: <strong>collect a comprehensive list of breweries in Australia</strong> and format them for ingestion into the <a href="https://www.openbrewerydb.org/">OpenBreweryDB</a> schema. The catch? There’s no single authoritative source — and every source that <em>does</em> exist presents its own challenges.</p>

<p>This post is a deep dive into how I designed and built a fully automated data pipeline to tackle that problem. It’s not a tutorial, but a breakdown of the engineering decisions, mistakes, and solutions that got us from <em>“let’s scrape some websites”</em> to <em>“production-ready, enriched, validated data.”</em></p>

<hr />

<h2 id="architecture-overview">Architecture Overview</h2>

<p>At a high level, the scraper evolved into a modular system with three major components:</p>

<ol>
  <li><strong>Data Extraction Layer</strong> – Scrape and parse multiple upstream sources.</li>
  <li><strong>Normalization &amp; Cleaning Layer</strong> – Standardize the fields into a unified schema.</li>
  <li><strong>Enrichment &amp; Post-Processing Layer</strong> – Use the Google Places API to add structured location data, validate results, and filter noise.</li>
</ol>

<p>The guiding principle was <strong>“merge many imperfect sources into one high-quality dataset.”</strong></p>

<hr />

<h2 id="1-data-extraction-layer">1. Data Extraction Layer</h2>

<p>The first major step was building scrapers for four key sources:</p>

<ul>
  <li><strong>Craft Cartel:</strong> A long-form page with a single block of comma-separated brewery names.</li>
  <li><strong>Independent Brewers Association (IBA):</strong> Paginated cards with names and rough locations.</li>
  <li><strong>Wikipedia:</strong> Tables split into “major company owned” and “microbreweries.”</li>
  <li><strong>Untappd:</strong> Initially attempted, but abandoned due to authentication requirements and closed API.</li>
</ul>

<p>Each of these posed different challenges.</p>

<h3 id="craft-cartel-parsing-unstructured-text">Craft Cartel: Parsing Unstructured Text</h3>

<p>The simplest-looking source turned out to be tricky. The brewery names weren’t in HTML tables or lists — they were buried inside a block of text. Once extracted, each needed to be split and cleaned. That got us names, but no addresses, phones, or coordinates — we’d deal with that later.</p>
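
<p>As an illustration, splitting and cleaning a comma-separated block can look something like the following. This is a minimal sketch rather than the scraper’s exact code. The function name and the rule of dropping case-insensitive repeats are assumptions.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re


def split_brewery_names(text_block):
    """Split a comma-separated block of text into cleaned, unique names."""
    names = []
    seen = set()
    for raw in text_block.split(","):
        # Collapse runs of whitespace (including line breaks) into single spaces
        name = re.sub(r"\s+", " ", raw).strip()
        if not name or name.lower() in seen:
            continue  # skip empty fragments and case-insensitive repeats
        seen.add(name.lower())
        names.append(name)
    return names
</code></pre></div></div>

<p>A naive comma split like this would also break a name that itself contains a comma, so real input may need extra handling.</p>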

<h3 id="iba-paginated-and-noisy-data">IBA: Paginated and Noisy Data</h3>

<p>IBA was the richest source but introduced pagination (<code class="language-plaintext highlighter-rouge">/page/2/</code>, <code class="language-plaintext highlighter-rouge">/page/3/</code>, etc.). Handling this meant writing a loop that crawled until no more pages existed. I also had to filter out non-brewery blocks like “Brewery Members” and “Want to be a member?” that appeared on each page.</p>
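
<p>The crawl-until-empty loop can be sketched like this. To keep the example self-contained, the HTTP request and HTML parsing are hidden behind a caller-supplied <code class="language-plaintext highlighter-rouge">fetch_page</code> function, which is hypothetical, as are the function name and the <code class="language-plaintext highlighter-rouge">max_pages</code> safety limit.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Card titles that show up on every page but are not breweries
NON_BREWERY_TITLES = {"brewery members", "want to be a member?"}


def crawl_paginated(fetch_page, max_pages=200):
    """Collect card titles from page 1, 2, 3, ... until a page comes back empty.

    fetch_page(page_number) is supplied by the caller and returns a list of
    card titles, or an empty list once the page does not exist.
    """
    titles = []
    for page in range(1, max_pages + 1):
        cards = fetch_page(page)
        if not cards:
            break  # no more pages
        titles.extend(
            card.strip() for card in cards
            if card.strip().lower() not in NON_BREWERY_TITLES
        )
    return titles
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">max_pages</code> cap is only there so that a site which never returns an empty page cannot keep the loop running forever.</p>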

<h3 id="wikipedia-tables-that-werent-really-tables">Wikipedia: Tables That Weren’t Really Tables</h3>

<p>Wikipedia’s structure was the most brittle. The tables I needed were buried after specific <code class="language-plaintext highlighter-rouge">&lt;h2&gt;</code> headings, but weren’t direct siblings. My first attempt returned nothing. The fix was to walk forward through the DOM until the next <code class="language-plaintext highlighter-rouge">&lt;table class="wikitable"&gt;</code> appeared, then flatten it — accounting for rowspan and column misalignment.</p>

<p>In the end, Wikipedia provided high-value signals: ownership classification and historical data that wasn’t available elsewhere.</p>

<hr />

<h2 id="2-normalization--cleaning">2. Normalization &amp; Cleaning</h2>

<p>With data now flowing from three sources, the next challenge was making it <strong>consistent</strong>. OpenBreweryDB expects a schema like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id, name, brewery_type, address_1, address_2, city, state_province, postal_code, country, phone, website_url, longitude, latitude
</code></pre></div></div>

<p>This meant handling dozens of edge cases:</p>

<ul>
  <li>Splitting free-form location strings like <code class="language-plaintext highlighter-rouge">"340 Melton Rd, Northgate QLD 4013, Australia"</code> into structured fields.</li>
  <li>Dropping breweries with no Australian presence.</li>
  <li>Deduplicating records by normalizing names and fuzzy matching city/state combinations.</li>
  <li>Removing metadata rows from Wikipedia that weren’t breweries at all.</li>
</ul>
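
<p>To give a feel for the first of those edge cases, here is a small sketch that splits a well-formed Australian address string into schema fields with a regular expression. The pattern and function name are assumptions for illustration. Messier inputs need more rules than this, and the state is left as the abbreviation found in the string.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

# street, city, state abbreviation, 4-digit postcode, optional country
AU_ADDRESS = re.compile(
    r"^(.+?),\s*([^,]+?)\s+(ACT|NSW|NT|QLD|SA|TAS|VIC|WA)\s+(\d{4})(?:,\s*(.+))?$"
)


def parse_au_address(location):
    """Return a dict of address fields, or None if the string does not match."""
    match = AU_ADDRESS.match(location.strip())
    if match is None:
        return None
    street, city, state, postcode, country = match.groups()
    return {
        "address_1": street.strip(),
        "city": city.strip(),
        "state_province": state,
        "postal_code": postcode,
        "country": country.strip() if country else None,
    }
</code></pre></div></div>

<p>Returning <code class="language-plaintext highlighter-rouge">None</code> for anything that does not match makes it easy to route those records to a fallback, such as the enrichment step described below.</p>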

<p>Normalization turned out to be where most of the real engineering time went: about 70% of the total effort.</p>

<hr />

<h2 id="3-enrichment-with-google-places-api">3. Enrichment with Google Places API</h2>

<p>The raw scraped data was still thin — we had names and maybe a rough suburb. The next step was to <strong>enrich it</strong>.</p>

<p>For each brewery, I constructed a text query like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;name&gt; brewery Australia
</code></pre></div></div>

<p>and passed it to the <a href="https://developers.google.com/maps/documentation/places/web-service/overview">Google Places API</a>. From the response, I extracted:</p>

<ul>
  <li>✅ Full structured address</li>
  <li>✅ Phone number</li>
  <li>✅ Official website</li>
  <li>✅ Latitude &amp; longitude</li>
</ul>

<p>The enrichment step transformed the dataset from <em>“interesting”</em> to <em>“useful.”</em></p>

<h3 id="filtering-non-australian-results">Filtering Non-Australian Results</h3>

<p>Many names overlapped with breweries overseas. Adding a strict <code class="language-plaintext highlighter-rouge">"country: Australia"</code> filter and rejecting any result not geocoded inside Australia cleaned up the data dramatically.</p>
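
<p>One simple way to reject overseas matches is to look at the country component of each result. The sketch below assumes the result dict carries an <code class="language-plaintext highlighter-rouge">address_components</code> list in the shape returned by the legacy Place Details endpoint (<code class="language-plaintext highlighter-rouge">short_name</code> plus <code class="language-plaintext highlighter-rouge">types</code>). Other endpoints and API versions name these fields differently, so treat it as an illustration rather than the scraper’s exact check.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def is_australian(place):
    """True if the result has a country component whose short name is AU."""
    for component in place.get("address_components", []):
        is_country = "country" in component.get("types", [])
        if is_country and component.get("short_name") == "AU":
            return True
    # No country component at all is treated as "not Australian"
    return False
</code></pre></div></div>

<p>Treating a missing country as a rejection is the strict option: it may drop a few valid breweries, but it keeps overseas namesakes out of the dataset.</p>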

<hr />

<h2 id="4-reliability-error-handling--backoff">4. Reliability, Error Handling &amp; Backoff</h2>

<p>The first full run failed halfway through: <code class="language-plaintext highlighter-rouge">ConnectionResetError: [Errno 54] Connection reset by peer</code>. That turned out to be a <strong>network-level reset during Places API calls</strong> — likely due to too many requests too quickly.</p>

<p>The solution was threefold:</p>

<ul>
  <li>✅ <strong>Exponential backoff + jitter</strong> on all API calls.</li>
  <li>✅ A global <code class="language-plaintext highlighter-rouge">requests.Session</code> adapter with automatic retries and respect for <code class="language-plaintext highlighter-rouge">Retry-After</code>.</li>
  <li>✅ A configurable <code class="language-plaintext highlighter-rouge">--places-rate</code> flag to control throughput (e.g. <code class="language-plaintext highlighter-rouge">1.5s</code> between calls).</li>
</ul>
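
<p>The backoff idea on its own can be sketched with just the standard library. This is not the scraper’s actual retry code: the function name, parameters and default delays are assumptions, and the “full jitter” variant (sleep a random amount between zero and the current ceiling) is one of several reasonable choices.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random
import time


def call_with_backoff(func, retries=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(), retrying transient errors with exponential backoff and jitter."""
    for attempt in range(retries):
        try:
            return func()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Ceiling doubles on each attempt (1s, 2s, 4s, ...) up to max_delay
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random time between 0 and the ceiling
            sleep(random.uniform(0, ceiling))
</code></pre></div></div>

<p>The built-in <code class="language-plaintext highlighter-rouge">ConnectionError</code> covers <code class="language-plaintext highlighter-rouge">ConnectionResetError</code>. Exceptions raised by an HTTP library may use their own classes, in which case those would need to be passed in via <code class="language-plaintext highlighter-rouge">retry_on</code>.</p>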

<p>With these improvements, I could run enrichment across 500+ breweries without a single crash.</p>

<hr />

<h2 id="5-lessons-learned--future-work">5. Lessons Learned &amp; Future Work</h2>

<p>This project ended up being far more complex than a “simple web scraper.” Along the way, I learned a few key lessons:</p>

<ul>
  <li><strong>Scraping is the easy part.</strong> Normalization and enrichment are where the real complexity lies.</li>
  <li><strong>DOM structures are fragile.</strong> Wikipedia’s layout changes broke three early attempts — defensive parsing is essential.</li>
  <li><strong>Error handling isn’t optional.</strong> A 500-request enrichment job <em>will</em> hit transient network issues. Plan for them.</li>
  <li><strong>Multiple imperfect sources &gt; one perfect one.</strong> The final dataset was only possible by merging three sources and using Places data to fill gaps.</li>
</ul>

<p>Future improvements include:</p>

<ul>
  <li>Adding brewery size classification (micro, regional, large) automatically.</li>
  <li>Building a scheduling layer to re-run the scraper periodically.</li>
  <li>Adding CI tests to validate schema integrity and geolocation coverage.</li>
</ul>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>What started as a simple scraper evolved into a production-ready data pipeline — one that fetches, normalizes, enriches, and validates hundreds of Australian breweries. The final result is a clean, geocoded dataset that can slot directly into OpenBreweryDB, providing far richer data than any single source on its own.</p>

<p>If you’re building something similar, my advice is simple: treat scraping as just the <strong>first step</strong>. The real value lies in how you clean, enrich, and harden that data for downstream use.</p>

<p>Final note: If you want, check out the source of the scraper here: https://github.com/simonmackinnon/breweriesnearme/blob/master/data/scrape_au_breweries.py</p>]]></content><author><name></name></author><category term="devops" /><category term="data-engineering" /><category term="python" /><category term="scraping" /><summary type="html"><![CDATA[A technical deep dive into how I built a robust, multi-source brewery scraper to power Australian data for OpenBreweryDB — including scraping, enrichment, retries, and lessons learned.]]></summary></entry><entry><title type="html">Automating Jekyll Builds and S3 Deployments with GitHub Actions</title><link href="https://theclouddevopslearningblog.com/devops/aws/jekyll/ci-cd/2025/10/01/automating-jekyll-ci-cd-with-reference.html" rel="alternate" type="text/html" title="Automating Jekyll Builds and S3 Deployments with GitHub Actions" /><published>2025-10-01T14:00:00+00:00</published><updated>2025-10-01T14:00:00+00:00</updated><id>https://theclouddevopslearningblog.com/devops/aws/jekyll/ci-cd/2025/10/01/automating-jekyll-ci-cd-with-reference</id><content type="html" xml:base="https://theclouddevopslearningblog.com/devops/aws/jekyll/ci-cd/2025/10/01/automating-jekyll-ci-cd-with-reference.html"><![CDATA[<p>For a long time, I was manually building my Jekyll blog and pushing the generated <code class="language-plaintext highlighter-rouge">_site</code> directory up to S3. It worked, but it was slow, error-prone, and easy to forget. So I finally decided to automate the whole thing with <strong>GitHub Actions</strong> — and in this post, I’ll show you how I did it, including the little gotchas that tripped me up along the way.</p>

<p>This walkthrough builds on some great work others have shared — in particular, <a href="https://pagertree.com/blog/jekyll-site-to-aws-s3-using-github-actions">this excellent guide from PagerTree</a>, which I used as my starting point. I’ve adapted and expanded on it here to match my own workflow and highlight some issues I ran into along the way.</p>

<hr />

<h2 id="-why-automate-your-jekyll-deployments">🧰 Why Automate Your Jekyll Deployments?</h2>

<p>Every time you push to your main branch, you can have GitHub automatically:</p>

<ul>
  <li>Install Ruby and your Jekyll dependencies</li>
  <li>Build your static site</li>
  <li>Upload the generated files to your S3 bucket</li>
  <li>Invalidate your CloudFront cache so changes go live immediately</li>
</ul>

<p>This turns deployment from a manual multi-step process into a simple <strong><code class="language-plaintext highlighter-rouge">git push</code></strong>.</p>

<hr />

<h2 id="️-setting-up-the-workflow">🛠️ Setting Up the Workflow</h2>

<p>The workflow file lives at <code class="language-plaintext highlighter-rouge">.github/workflows/deploy.yml</code> in your repo. Here’s a trimmed-down version of mine:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">CI / CD</span>

<span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
    <span class="na">branches</span><span class="pi">:</span> <span class="pi">[</span> <span class="nv">master</span> <span class="pi">]</span>
  <span class="na">workflow_dispatch</span><span class="pi">:</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">build</span><span class="pi">:</span>
    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-latest</span>
    <span class="na">defaults</span><span class="pi">:</span>
      <span class="na">run</span><span class="pi">:</span>
        <span class="na">working-directory</span><span class="pi">:</span> <span class="s">jekyll-clouddevopslearningblog</span>

    <span class="na">steps</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v4</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Set up Ruby</span>
        <span class="na">uses</span><span class="pi">:</span> <span class="s">ruby/setup-ruby@v1</span>
        <span class="na">with</span><span class="pi">:</span>
          <span class="na">ruby-version</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.2"</span>
          <span class="na">bundler-cache</span><span class="pi">:</span> <span class="kc">true</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Ensure Linux platform support</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">bundle lock --add-platform x86_64-linux</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build the site</span>
        <span class="na">env</span><span class="pi">:</span>
          <span class="na">JEKYLL_ENV</span><span class="pi">:</span> <span class="s">production</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">bundle exec jekyll build --trace</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Deploy to S3</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">aws s3 sync ./_site/ s3://${{ secrets.AWS_S3_BUCKET_NAME }} --delete --acl public-read --cache-control max-age=604800</span>

      <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Invalidate CloudFront cache</span>
        <span class="na">run</span><span class="pi">:</span> <span class="s">aws cloudfront create-invalidation --distribution-id ${{ secrets.AWS_CLOUDFRONT_DISTRIBUTION_ID }} --paths "/*"</span>
</code></pre></div></div>

<hr />

<h2 id="-configuring-secrets">🔐 Configuring Secrets</h2>

<p>You’ll need to store a few secrets in your GitHub repository settings (<code class="language-plaintext highlighter-rouge">Settings → Secrets and variables → Actions</code>):</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">AWS_ACCESS_KEY_ID</code> – your AWS access key</li>
  <li><code class="language-plaintext highlighter-rouge">AWS_SECRET_ACCESS_KEY</code> – your AWS secret key</li>
  <li><code class="language-plaintext highlighter-rouge">AWS_S3_BUCKET_NAME</code> – the name of your bucket</li>
  <li><code class="language-plaintext highlighter-rouge">AWS_CLOUDFRONT_DISTRIBUTION_ID</code> – the ID of your CloudFront distribution</li>
</ul>

<p>These are injected into the workflow automatically and keep sensitive data out of your repo.</p>

<hr />

<h2 id="-common-gotchas-and-how-i-fixed-them">🧩 Common Gotchas (And How I Fixed Them)</h2>

<p>I ran into a few issues that are worth mentioning:</p>

<h3 id="1-could-not-locate-gemfile-or-bundle-directory">1. <code class="language-plaintext highlighter-rouge">Could not locate Gemfile or .bundle/ directory</code></h3>

<p>This happens when the workflow runs in the wrong directory. If your Jekyll site is inside a subfolder (like <code class="language-plaintext highlighter-rouge">jekyll-clouddevopslearningblog</code>), make sure to set:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">defaults</span><span class="pi">:</span>
  <span class="na">run</span><span class="pi">:</span>
    <span class="na">working-directory</span><span class="pi">:</span> <span class="s">jekyll-clouddevopslearningblog</span>
</code></pre></div></div>

<hr />

<h3 id="2-bundler-command-not-found-jekyll">2. <code class="language-plaintext highlighter-rouge">bundler: command not found: jekyll</code></h3>

<p>This one confused me at first. The message is printed by <code class="language-plaintext highlighter-rouge">bundle exec</code> itself, and it means Bundler couldn’t find a <code class="language-plaintext highlighter-rouge">jekyll</code> executable among the gems it installed. Usually the cause is that Jekyll isn’t listed in your <code class="language-plaintext highlighter-rouge">Gemfile</code>, so it never gets installed into the bundle.</p>

<p>✅ Fix: Make sure your <code class="language-plaintext highlighter-rouge">Gemfile</code> includes:</p>

<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gem</span> <span class="s2">"jekyll"</span><span class="p">,</span> <span class="s2">"~&gt; 4.3"</span>
</code></pre></div></div>

<p>And build the site like this:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">run</span><span class="pi">:</span> <span class="s">bundle exec jekyll build --trace</span>
</code></pre></div></div>

<hr />

<h3 id="3-you-must-add-the-platform-x86_64-linux-to-your-lockfile">3. <code class="language-plaintext highlighter-rouge">You must add the platform x86_64-linux to your lockfile</code></h3>

<p>If you created your <code class="language-plaintext highlighter-rouge">Gemfile.lock</code> on macOS, the Linux runner on GitHub Actions won’t install some gems. You can fix this by adding a step before installation:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Ensure Linux platform support</span>
  <span class="na">run</span><span class="pi">:</span> <span class="s">bundle lock --add-platform x86_64-linux</span>
</code></pre></div></div>

<p>Alternatively, run the same command locally and commit the updated lockfile; once that is in the repo, you can remove this step.</p>

<h3 id="4-you-must-copy-the-vendor-postshtml-to-the-repo-if-youve-added-custom-code">4. <code class="language-plaintext highlighter-rouge">You must copy the vendor post.html to the repo if you've added custom code</code></h3>

<p>I had added some custom code directly to the bundle’s copy of <code class="language-plaintext highlighter-rouge">post.html</code> on my machine, so when GitHub Actions built and deployed the site, those changes were missing. Copying the file into the repo at <code class="language-plaintext highlighter-rouge">_includes/post.html</code> means the customised version is used when the site is built remotely.</p>

<hr />

<h2 id="-going-further">🚀 Going Further</h2>

<p>Some ideas for future improvements:</p>

<ul>
  <li>Use <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html">OIDC</a> and <code class="language-plaintext highlighter-rouge">aws-actions/configure-aws-credentials</code> instead of storing long-lived AWS keys.</li>
  <li>Add a <code class="language-plaintext highlighter-rouge">paths:</code> filter so the workflow only runs if Jekyll files change.</li>
  <li>Automate CloudFront invalidations conditionally to save API calls.</li>
</ul>

<hr />

<h2 id="-final-thoughts">🎉 Final Thoughts</h2>

<p>This setup has completely changed my workflow: now I can just commit and push, and within a minute or two, the live site updates automatically. It’s one of those small bits of DevOps automation that pays off quickly — especially if you’re constantly tweaking content or adding new posts.</p>

<p>If you’re still deploying manually, give this a try. Once you see how easy it is, you’ll never want to go back.</p>

<hr />

<p>Have questions or run into different errors? Drop them in the comments below. I’d love to hear how you’ve set up your own Jekyll CI/CD pipeline!</p>]]></content><author><name></name></author><category term="devops" /><category term="aws" /><category term="jekyll" /><category term="ci-cd" /><summary type="html"><![CDATA[How I set up a GitHub Actions workflow to build my Jekyll site and deploy it automatically to an S3 bucket and CloudFront.]]></summary></entry><entry><title type="html">How I Fixed an Expired SSL Certificate on My S3 + CloudFront Static Site</title><link href="https://theclouddevopslearningblog.com/aws/cloudfront/s3/devops/troubleshooting/2025/09/29/how-i-fixed-an-expired-ssl-cert-on-my-site.html" rel="alternate" type="text/html" title="How I Fixed an Expired SSL Certificate on My S3 + CloudFront Static Site" /><published>2025-09-29T14:00:00+00:00</published><updated>2025-09-29T14:00:00+00:00</updated><id>https://theclouddevopslearningblog.com/aws/cloudfront/s3/devops/troubleshooting/2025/09/29/how-i-fixed-an-expired-ssl-cert-on-my-site</id><content type="html" xml:base="https://theclouddevopslearningblog.com/aws/cloudfront/s3/devops/troubleshooting/2025/09/29/how-i-fixed-an-expired-ssl-cert-on-my-site.html"><![CDATA[<p>When I first built <a href="https://theclouddevopslearningblog.com">theclouddevopslearningblog.com</a>, I chose one of the most common ways to host a static website on AWS:</p>

<ul>
  <li>A <strong>Jekyll site</strong> stored in an <strong>S3 bucket</strong></li>
  <li>Served through <strong>CloudFront</strong> as a CDN and to support <strong>HTTPS</strong></li>
  <li>A <strong>free SSL certificate</strong> from <strong>AWS Certificate Manager (ACM)</strong></li>
</ul>

<p>It worked perfectly… until one day, my site suddenly started showing security warnings, and browsers said the connection was “<strong>Not Secure</strong>.”</p>

<p>Here’s how I diagnosed the problem and fixed it — and how you can do the same if it happens to you.</p>

<hr />

<h2 id="step-1-spot-the-problem">Step 1: Spot the Problem</h2>

<p>The first sign something was wrong was when I tried to run a simple test:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-I</span> https://theclouddevopslearningblog.com
</code></pre></div></div>

<p>This gave me:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl: (60) SSL certificate problem: certificate has expired
</code></pre></div></div>

<p>This error means the certificate that encrypts traffic to my site was no longer valid — so HTTPS wasn’t working.</p>

<hr />

<h2 id="step-2-check-what-certificate-is-being-used">Step 2: Check What Certificate Is Being Used</h2>

<p>To see exactly what certificate CloudFront was serving, I used <code class="language-plaintext highlighter-rouge">openssl</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>openssl s_client <span class="nt">-servername</span> theclouddevopslearningblog.com   <span class="nt">-connect</span> theclouddevopslearningblog.com:443 <span class="nt">-showcerts</span> &lt;/dev/null 2&gt;/dev/null   | openssl x509 <span class="nt">-noout</span> <span class="nt">-issuer</span> <span class="nt">-subject</span> <span class="nt">-dates</span> <span class="nt">-ext</span> subjectAltName
</code></pre></div></div>

<p>The output showed this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>issuer=C=US, O=Amazon, CN=Amazon RSA 2048 M02
subject=CN=theclouddevopslearningblog.com
notBefore=Mar 10 00:00:00 2024 GMT
notAfter=Apr  8 23:59:59 2025 GMT
</code></pre></div></div>

<p>The important part is <code class="language-plaintext highlighter-rouge">notAfter</code> — the certificate expired on <strong>April 8, 2025</strong>. Mystery solved!</p>

<hr />

<h2 id="step-3-why-didnt-it-renew-automatically">Step 3: Why Didn’t It Renew Automatically?</h2>

<p>Certificates from ACM usually renew automatically, but <strong>only if the DNS validation records are still in place</strong>.<br />
When I checked the certificate in the <strong>ACM console (in <code class="language-plaintext highlighter-rouge">us-east-1</code>)</strong>, I saw that:</p>

<ul>
  <li>The certificate had <strong>expired</strong></li>
  <li>Renewal <strong>failed</strong> because the DNS validation records had been deleted</li>
</ul>

<p>This is a very common mistake — if you remove the validation CNAME records after issuing the certificate, AWS can’t confirm you still own the domain. And without that, it won’t renew.</p>

<hr />

<h2 id="step-4-request-a-new-certificate">Step 4: Request a New Certificate</h2>

<p>The fix was simple:</p>

<ol>
  <li>Go to <strong>ACM → Request a public certificate</strong></li>
  <li>Add both:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">theclouddevopslearningblog.com</code></li>
      <li><code class="language-plaintext highlighter-rouge">*.theclouddevopslearningblog.com</code></li>
    </ul>
  </li>
  <li>Choose <strong>DNS validation</strong></li>
  <li>Add the CNAME records ACM gives you into your DNS (for example, in Route 53)</li>
  <li>Wait until the certificate status changes to <strong>“Issued”</strong></li>
</ol>

<p>⚠️ <strong>Tip:</strong> Leave those DNS CNAMEs in place forever. They’re needed for future renewals too.</p>

<hr />

<h2 id="step-5-attach-the-new-certificate-in-cloudfront">Step 5: Attach the New Certificate in CloudFront</h2>

<p>Just creating the new certificate isn’t enough — CloudFront still needs to know to use it.</p>

<p>Here’s how:</p>

<ul>
  <li>Go to <strong>CloudFront → Distributions → [your distribution]</strong></li>
  <li>Under <strong>Alternate domain names (CNAMEs)</strong>, make sure your domain is listed</li>
  <li>Under <strong>Viewer certificate</strong>, choose <strong>“Custom SSL certificate”</strong> and select the new one</li>
  <li>Save your changes</li>
</ul>

<p>This will trigger a new deployment, which usually takes a few minutes.</p>

<hr />

<h2 id="step-6-test-again">Step 6: Test Again</h2>

<p>After the update finished, I tested again:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-I</span> https://theclouddevopslearningblog.com
</code></pre></div></div>

<p>Now I got:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>HTTP/2 200
server: CloudFront
</code></pre></div></div>

<p>And <code class="language-plaintext highlighter-rouge">openssl</code> showed the new certificate expiry date was in <strong>2026</strong>.</p>

<hr />

<h2 id="lessons-learned">Lessons Learned</h2>

<p>This was a small problem, but it taught me a few useful lessons that are worth sharing:</p>

<ol>
  <li><strong>Leave DNS validation records in place.</strong> They’re essential for automatic renewal.</li>
  <li><strong>Set a renewal alert.</strong> You can use EventBridge + SNS to send yourself an email before a certificate expires.</li>
  <li><strong>Remember the region.</strong> CloudFront only works with ACM certificates in <code class="language-plaintext highlighter-rouge">us-east-1</code>.</li>
  <li><strong>Add monitoring.</strong> A simple <code class="language-plaintext highlighter-rouge">curl</code> script in a cron job can catch certificate problems before users do.</li>
</ol>
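
<p>The monitoring idea in point 4 can also be sketched in Python rather than <code class="language-plaintext highlighter-rouge">curl</code>. This is a minimal sketch that assumes only the Python 3 standard library; the function names <code class="language-plaintext highlighter-rouge">fetch_not_after</code> and <code class="language-plaintext highlighter-rouge">days_until_expiry</code> are hypothetical, chosen just for this illustration.</p>

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(hostname, port=443, timeout=10.0):
    """Connect over TLS and return the served certificate's notAfter string.

    Hypothetical helper. With the default context the handshake itself fails
    if the certificate has already expired, so this only returns for
    certificates that are still valid.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now):
    """Whole days between `now` (timezone-aware) and a notAfter string.

    `not_after` uses the same format openssl prints,
    e.g. "Apr  8 23:59:59 2025 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days
```

<p>Run from cron, something like <code class="language-plaintext highlighter-rouge">days_until_expiry(fetch_not_after("theclouddevopslearningblog.com"), datetime.now(timezone.utc))</code> gives the number of days left, and a script could alert when that drops below a threshold of your choosing. With the default verification settings, an already-expired certificate makes the TLS handshake fail with an <code class="language-plaintext highlighter-rouge">ssl.SSLCertVerificationError</code>, which is a useful alarm in its own right.</p>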

<hr />

<h2 id="final-thoughts">Final Thoughts</h2>

<p>Static sites on S3 + CloudFront are incredibly powerful and cost-effective, but even “serverless” websites need a little maintenance. SSL certificates are one of those things you <em>don’t</em> want to ignore — and now I’ll never forget to keep an eye on mine!</p>

<p>Hopefully, this guide helps you fix an expired certificate quickly and with confidence.</p>

<hr />

<p><em>Have you run into similar CloudFront issues? Let me know — I might write a follow-up post on automating SSL monitoring!</em></p>]]></content><author><name></name></author><category term="aws" /><category term="cloudfront" /><category term="s3" /><category term="devops" /><category term="troubleshooting" /><summary type="html"><![CDATA[When I first built theclouddevopslearningblog.com, I chose one of the most common ways to host a static website on AWS:]]></summary></entry><entry><title type="html">Enabling Blog Comments (Again)</title><link href="https://theclouddevopslearningblog.com/blog/jekyll/2024/12/13/enabling-blog-comments-again.html" rel="alternate" type="text/html" title="Enabling Blog Comments (Again)" /><published>2024-12-13T14:00:00+00:00</published><updated>2024-12-13T14:00:00+00:00</updated><id>https://theclouddevopslearningblog.com/blog/jekyll/2024/12/13/enabling-blog-comments-again</id><content type="html" xml:base="https://theclouddevopslearningblog.com/blog/jekyll/2024/12/13/enabling-blog-comments-again.html"><![CDATA[<h1 id="enabling-blog-comments-again">Enabling Blog Comments (Again)</h1>

<p>It’s been a while between drinks hey?</p>

<p>When I created this blog, I initially used the default Disqus support for comments. After a few issues with privacy and the like, I switched to a free trial of an alternative, HyvorTalk. This worked fine, but once the free trial ended my comments were disabled, and I didn’t get around to looking at it until the last few days.</p>

<p>I moved the comments back to Disqus. However, I hadn’t maintained this blog for about 4-5 years, so making changes was problematic.</p>

<p>In fact, I’d hardly used my personal development machine in about that time too, so there were a few things that were broken:</p>

<ul>
  <li>Jekyll wouldn’t run because my Ruby version was out of date</li>
  <li>Homebrew wouldn’t re-install Ruby because it was out of date</li>
</ul>

<p>So after a complete re-install of Homebrew, Ruby and Jekyll, I was able to get Jekyll running. The next issue was that Bundler failed with various errors. I got it running by deleting the Gemfile.lock and changing the theme gem back to the default. However, I was still getting errors about undefined methods, etc. when trying to serve the site. My solution was to re-initialise the Jekyll site, update the parameters needed for the theme to work, and copy the posts across. This got it working fine.</p>

<p>At this point, I was able to add the Disqus comments section again via the default support (I’d previously played around with that code to enable HyvorTalk, so I had to wipe those changes). I was able to build and run the site locally and get the comments section to load. However, after I built the site and pushed it to my static site S3 bucket, the comments section wasn’t loading on the live posts. After a while playing around with the site files, I realised I was being served different files from the ones I had uploaded. Having not looked at my site for several years, I’d completely forgotten that I had put a CDN (CloudFront) in front of the site (mainly to get HTTPS working). After pushing a ‘/*’ invalidation to the distribution, the site was serving up the right content (dull yay).</p>

<p>So, what’s next? I’m going to be trying to post a bit more of some of the stuff I’m working on and learning, and maybe a refresh of the site altogether at some point.</p>]]></content><author><name></name></author><category term="blog" /><category term="jekyll" /><category term="blog" /><category term="comments" /><category term="jekyll" /><category term="aws" /><category term="cloudfront" /><category term="s3" /><summary type="html"><![CDATA[Enabling Blog Comments (Again)]]></summary></entry><entry><title type="html">Netflix Style Recommendation Engine with Amazon SageMaker #CloudGuruChallenge</title><link href="https://theclouddevopslearningblog.com/aws/sagemaker/2020/11/04/cloud-guru-challenge-2.html" rel="alternate" type="text/html" title="Netflix Style Recommendation Engine with Amazon SageMaker #CloudGuruChallenge" /><published>2020-11-04T14:00:00+00:00</published><updated>2020-11-04T14:00:00+00:00</updated><id>https://theclouddevopslearningblog.com/aws/sagemaker/2020/11/04/cloud-guru-challenge-2</id><content type="html" xml:base="https://theclouddevopslearningblog.com/aws/sagemaker/2020/11/04/cloud-guru-challenge-2.html"><![CDATA[<h1 id="cloud-guru-challenge---october-2020">Cloud Guru Challenge - October 2020</h1>

<h2 id="background">Background</h2>
<ul>
  <li>Goal: Build a Netflix Style Recommendation Engine with Amazon SageMaker</li>
  <li>Outcome: Gain real machine learning and AWS skills while getting hands-on with a real-world project to add to your portfolio
https://acloudguru.com/blog/engineering/cloudguruchallenge-machine-learning-on-aws</li>
</ul>

<h1 id="tldr">TL;DR:</h1>
<p>I built:</p>
<ul>
  <li>Movie Recommendation Engine (K-Means clustering using AWS SageMaker)</li>
  <li>Serverless API and Website for users to view recommendations for selected movies (using API Gateway, Lambda, DynamoDB, S3 &amp; CloudFront)</li>
  <li>Visit <a href="https://moviesforme.net/">https://moviesforme.net/</a> to try out the recommendations!</li>
</ul>

<p>Here’s the architecture I implemented:
<img src="/media/v1-of-CloudGuruChallenge.October2020.png" alt="&quot;Architecture(v1)&quot;" /></p>

<h1 id="machine-learning-in-the-cloud">Machine Learning in the Cloud</h1>

<h2 id="steps">Steps</h2>

<h3 id="1-determine-use-case-and-obtain-data">1. Determine use case and obtain data.</h3>

<p>I thought about using GoodReads data to build a book recommendation engine, more so because it would be a point of differentiation. However, my interests align with TV and film more than literature, so I decided movie datasets would be a better fit for me.</p>

<p>The first decision to make was which data to use. In the brief, Kesha Williams suggested movie datasets from IMDB as an example.</p>

<p>The datasets are all available for download here: <a href="https://datasets.imdbws.com/">https://datasets.imdbws.com/</a></p>

<p>The sets Kesha recommended were title.akas (for grouping alternate titles’ info), title.basics (basic information about the titles) and title.ratings (rating information for the titles). These could all be merged on the “titleid” column. I used the ‘requests’ Python library to download these, and then converted them to Pandas dataframes for analysis / ML training.</p>
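
<p>The download-and-load step can also be sketched without third-party packages. This is a minimal sketch that swaps in the standard library for requests and Pandas, so it is an illustration rather than the notebook’s actual code. It assumes the files are gzipped, tab-separated text and that missing values appear as a literal <code class="language-plaintext highlighter-rouge">\N</code> (as in the genres output further down); <code class="language-plaintext highlighter-rouge">parse_imdb_tsv</code> is a hypothetical helper name.</p>

```python
import csv
import gzip
import io


def parse_imdb_tsv(gz_bytes):
    """Decompress a gzipped TSV export and return its rows as dicts.

    Hypothetical helper: the header row supplies the dict keys, and the
    literal backslash-N marker for missing values is converted to None.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for row in reader:
        rows.append(
            {key: (None if value == "\\N" else value) for key, value in row.items()}
        )
    return rows
```

<p>The bytes passed in could come from something like <code class="language-plaintext highlighter-rouge">requests.get(url).content</code>, assuming the download URLs follow the pattern <code class="language-plaintext highlighter-rouge">https://datasets.imdbws.com/title.basics.tsv.gz</code>. For files this size, loading into Pandas as described above is still the more practical route for analysis.</p>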

<p>A further dataset that was considered for use was name.basics. This data shows actors (and relevant info) and some titles (csv of titleid values) that the actor is known for. This information would be very useful.</p>

<p>Other information, such as plot and reviews, could easily be scraped from IMDB itself. While this information would most likely improve the quality of the recommendations, the type of machine learning required to use it is beyond the scope of this exercise.</p>

<p>In general, there are a few main ways of grouping this type of data for recommendations:
   (a) Simple recommendations. Recommending the same items regardless of user, normally based on highest rating or sales data.
   (b) Content-based filtering. Finding commonalities between data attributes, e.g. movie genres, actors, plot, ratings.
   (c) Collaborative filtering. This is more user-behaviour driven, grouping data based on interactions (users who gave similar ratings to one title can be used to recommend another).</p>

<p>The data that I selected will allow for some fairly simple content-based filtering.</p>
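
<p>To make the idea of content-based filtering concrete, here is a small sketch that ranks titles by how many genres they share, using Jaccard similarity. It is a simpler technique than the K-Means clustering used for the actual engine, and the titles and genres in <code class="language-plaintext highlighter-rouge">catalogue</code> are invented for the example; <code class="language-plaintext highlighter-rouge">jaccard</code> and <code class="language-plaintext highlighter-rouge">most_similar</code> are hypothetical names.</p>

```python
def jaccard(a, b):
    """Share of genres two titles have in common, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def most_similar(title, catalogue, top_n=3):
    """Rank the other titles in `catalogue` by genre overlap with `title`."""
    scores = [
        (other, jaccard(catalogue[title], genres))
        for other, genres in catalogue.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical catalogue: these titles and genres are made up for the example.
catalogue = {
    "Movie A": ["Action", "Sci-Fi"],
    "Movie B": ["Action", "Sci-Fi", "Thriller"],
    "Movie C": ["Romance", "Drama"],
    "Movie D": ["Action", "Comedy"],
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">most_similar("Movie A", catalogue)</code> puts the title sharing the most genres first. Comparing every pair like this gets expensive on a large catalogue, which is one reason to group titles into clusters up front instead.</p>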

<h3 id="2-create-jupyter-hosted-notebook">2. Create Jupyter hosted notebook</h3>

<p>My experience with Jupyter is pretty minimal. I’d played with it very briefly when doing an AWS DeepRacer lab over a year ago now. I wanted to get a good understanding of how Jupyter works, so I did the following course on A Cloud Guru: <a href="https://learn.acloud.guru/course/introduction-to-jupyter-notebooks">https://learn.acloud.guru/course/introduction-to-jupyter-notebooks</a></p>

<p>I installed Jupyter on my local machine to begin with and to learn about how the notebooks work.</p>

<p>A really cool advantage of Jupyter notebooks is the reproducible nature of the runs, meaning anyone can run the same experiment, even when the underlying data changes.</p>

<p>The course was a very interesting introduction to some of the good data science tools that are available in Python, as well as how to use hosted notebooks in the cloud.</p>

<p><img src="/media/jupyter-course-roc.png" alt="Record of Completion" /></p>

<p>I highly recommend this course if you’re keen on learning how to use Jupyter.</p>

<p>While I’m on the subject, the real advantage of using hosted Jupyter notebooks is that you can perform data science experiments and ML training using resources that you normally wouldn’t have access to, and only need to pay for the infrastructure as you use it!</p>

<p>Being able to see visualisations generated inline with the code is really advantageous as well, making the connection between the context, the code and the information really straightforward.</p>

<p>However, a downside of running data science scripts on AWS-hosted infrastructure is the cost. Pandas loads dataframes into memory (much like other statistics software), so more data means a bigger instance type. To load the data I chose, I required an instance type of ml.t2.xlarge… not a cheap instance. Couple that with the cost of SageMaker instances, and costs can quickly add up, especially if you’re just doing this as a training exercise (excuse the pun!).</p>

<h3 id="3-inspect-and-visualize-data">3. Inspect and visualize data</h3>

<p>To understand the data I had downloaded, I used the Pandas and Matplotlib Python libraries to analyse and visualise it.</p>

<p>The real value of this is seeing the relationship between different variables. A good example is the number of titles in the data vs. the year of release for each movie.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="c1"># count titles per year once, so the x values and the bar heights stay aligned
</span>   <span class="n">titles_per_year</span> <span class="o">=</span> <span class="n">df_titles</span><span class="p">.</span><span class="n">year</span><span class="p">.</span><span class="nf">value_counts</span><span class="p">().</span><span class="nf">sort_index</span><span class="p">()</span>
   <span class="n">plt</span><span class="p">.</span><span class="nf">bar</span><span class="p">(</span><span class="n">titles_per_year</span><span class="p">.</span><span class="n">index</span><span class="p">,</span> <span class="n">titles_per_year</span><span class="p">.</span><span class="n">values</span><span class="p">)</span>
   </code></pre></figure>

<p><img src="/media/nummoviesperyear.png" alt="Number of Movies Per Year" /></p>

<p>Another good relationship to view is the one between the number of votes per title and the average rating.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="n">plt</span><span class="p">.</span><span class="nf">figure</span><span class="p">(</span><span class="n">figsize</span> <span class="o">=</span> <span class="p">(</span><span class="mi">10</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
   <span class="n">sns</span><span class="p">.</span><span class="nf">scatterplot</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="n">df_titles</span><span class="p">[</span><span class="sh">'</span><span class="s">numvotes</span><span class="sh">'</span><span class="p">],</span> <span class="n">y</span> <span class="o">=</span> <span class="n">df_titles</span><span class="p">[</span><span class="sh">'</span><span class="s">averagerating</span><span class="sh">'</span><span class="p">])</span>
   <span class="n">plt</span><span class="p">.</span><span class="nf">xlabel</span><span class="p">(</span><span class="sh">'</span><span class="s">number of votes</span><span class="sh">'</span><span class="p">)</span>
   <span class="n">plt</span><span class="p">.</span><span class="nf">ylabel</span><span class="p">(</span><span class="sh">'</span><span class="s">average rating of movie</span><span class="sh">'</span><span class="p">)</span>
   </code></pre></figure>

<p><img src="/media/ratingsvsvotes.png" alt="Rating vs. Number of Votes" /></p>

<h3 id="4-prepare-and-transform-data">4. Prepare and transform data</h3>

<p>As I was unfamiliar with AWS SageMaker and K-Means clustering before this challenge, I used the following AWS-provided example Jupyter Notebook as a guide for performing my own clustering: <a href="https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans/sagemaker-countycensusclustering.ipynb">https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans/sagemaker-countycensusclustering.ipynb</a></p>

<p>The information in the data is quite useful for classifying movies. Columns such as ‘genres’ allow us to see movies with the same genre, for instance, which is likely going to be a solid basis for grouping movies. However, K-Means clustering works on numeric features rather than descriptive text, so we need to transform the data.</p>

<p>The genres column contains CSV data of the genres for each movie. Each movie can have no, one, or several genres.</p>

<p>There are many different combinations of genres in the dataset (1258 unique values):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="c1"># we can see there's lots of unique values, as each genre can be combined with others
</span>   <span class="n">df_titles</span><span class="p">.</span><span class="n">genres</span><span class="p">.</span><span class="nf">unique</span><span class="p">()</span>
   <span class="nf">array</span><span class="p">([</span><span class="sh">'</span><span class="s">Romance</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">Biography,Drama</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="se">\\</span><span class="s">N</span><span class="sh">'</span><span class="p">,</span> <span class="p">...,</span> <span class="sh">'</span><span class="s">Fantasy,History,War</span><span class="sh">'</span><span class="p">,</span>
         <span class="sh">'</span><span class="s">Documentary,Family,Sci-Fi</span><span class="sh">'</span><span class="p">,</span> <span class="sh">'</span><span class="s">Horror,Musical,Thriller</span><span class="sh">'</span><span class="p">],</span>
         <span class="n">dtype</span><span class="o">=</span><span class="nb">object</span><span class="p">)</span>
   <span class="n">df_titles</span><span class="p">.</span><span class="n">genres</span><span class="p">.</span><span class="nf">unique</span><span class="p">().</span><span class="n">shape</span>
   <span class="p">(</span><span class="mi">1258</span><span class="p">,)</span>
   </code></pre></figure>

<p>To convert this to data that an ML-algorithm can use, we need to transform it. Firstly, I converted the CSV data to a list of strings:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="c1"># let's convert the csv column to a pandas list object in a new column
</span>   <span class="n">df_titles</span><span class="p">[</span><span class="sh">'</span><span class="s">genres_list</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df_titles</span><span class="p">.</span><span class="n">genres</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="nf">split</span><span class="p">(</span><span class="sh">'</span><span class="s">,</span><span class="sh">'</span><span class="p">).</span><span class="nf">tolist</span><span class="p">()</span>
   </code></pre></figure>

<p>I then used the Pandas ‘get_dummies()’ function to perform one-hot encoding on the individual genres:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="c1"># get the one hot encoded values for genre. 
</span>   <span class="c1"># (this table is relatively sparse)
</span>   <span class="n">genres_one_hot_encoded</span> <span class="o">=</span> <span class="n">df_titles</span><span class="p">.</span><span class="n">genres_list</span><span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="nf">join</span><span class="p">(</span><span class="sh">'</span><span class="s">|</span><span class="sh">'</span><span class="p">).</span><span class="nb">str</span><span class="p">.</span><span class="nf">get_dummies</span><span class="p">().</span><span class="nf">add_prefix</span><span class="p">(</span><span class="sh">'</span><span class="s">genre_</span><span class="sh">'</span><span class="p">)</span>
   <span class="n">genres_one_hot_encoded</span><span class="p">.</span><span class="n">shape</span>
   <span class="p">(</span><span class="mi">254179</span><span class="p">,</span> <span class="mi">29</span><span class="p">)</span>
   <span class="n">genres_one_hot_encoded</span><span class="p">.</span><span class="nf">head</span><span class="p">()</span>
   </code></pre></figure>

<p><img src="/media/genresonehotencoded.png" alt="Genres One Hot Encoded Data" /></p>

<p>From this, we’ve binarised the data for each genre: each movie gets a ‘1’ if it has that genre, and a ‘0’ if it doesn’t. We can also see that there are actually only 29 unique genre columns, which is a big reduction from 1258!
   We can then join this data to the main dataframe, and drop the existing descriptive columns, as well as the one-hot-encoded columns for n/a or null values (in the IMDB dataset, these are represented by ‘\N’). This process can be repeated for any descriptive attribute. In my example data, I also included movie language, although this didn’t need string splitting first.</p>
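<p>To make the encode, drop and join steps concrete, here’s a minimal sketch on a toy dataframe. The title IDs and rows are made up for illustration, and it passes the separator straight to ‘str.get_dummies()’ rather than going through a list column first, so treat it as an alternative shape for the same idea rather than the notebook’s exact code.</p>

```python
import pandas as pd

# Toy stand-in for the IMDB titles data (hypothetical IDs and rows).
# '\N' is the marker IMDB uses for "no value".
df_titles = pd.DataFrame(
    {"genres": ["Romance", "Biography,Drama", "\\N", "Drama,Romance"]},
    index=["tt001", "tt002", "tt003", "tt004"],
)

# One-hot encode the comma-separated genres in a single step.
genres_one_hot_encoded = (
    df_titles["genres"].str.get_dummies(sep=",").add_prefix("genre_")
)

# Drop the column that only records missing values, then join the
# encoded columns onto the main dataframe and drop the descriptive original.
genres_one_hot_encoded = genres_one_hot_encoded.drop(columns=["genre_\\N"])
df_encoded = df_titles.join(genres_one_hot_encoded).drop(columns=["genres"])

print(df_encoded)
```

<p>A title with no genre simply ends up with a ‘0’ in every genre column.</p>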

<p>Now the data for the other numerical attributes (runtime, number of votes and average rating) can be scaled using MinMaxScaler. Distance-based analytical methods need every numerical column on the same scale; otherwise a column with a large range (such as number of votes) would dominate the distances and the feature columns couldn’t be compared fairly.</p>

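<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="kn">from</span> <span class="n">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">MinMaxScaler</span>

   <span class="n">scaler</span><span class="o">=</span><span class="nc">MinMaxScaler</span><span class="p">()</span>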
   <span class="n">df_titles_scaled</span><span class="o">=</span><span class="n">pd</span><span class="p">.</span><span class="nc">DataFrame</span><span class="p">(</span><span class="n">scaler</span><span class="p">.</span><span class="nf">fit_transform</span><span class="p">(</span><span class="n">df_titles</span><span class="p">))</span>
   <span class="n">df_titles_scaled</span><span class="p">.</span><span class="n">columns</span><span class="o">=</span><span class="n">df_titles</span><span class="p">.</span><span class="n">columns</span>
   <span class="n">df_titles_scaled</span><span class="p">.</span><span class="n">index</span><span class="o">=</span><span class="n">df_titles</span><span class="p">.</span><span class="n">index</span>

   <span class="n">df_titles_scaled</span><span class="p">.</span><span class="nf">describe</span><span class="p">()</span>
   </code></pre></figure>

<p><img src="/media/scaleddata.png" alt="Scaled Data" /></p>
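<p>If it helps to see what the scaling step actually does to a column, here’s a plain-Python sketch of min-max scaling. The function name and the sample numbers are mine, not from the notebook, and returning 0.0 for a constant column is an assumption about how you’d want to handle that edge case.</p>

```python
def min_max_scale(values):
    """Rescale a list of numbers so the smallest is 0.0 and the largest is 1.0."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        # A constant column has no spread to rescale; map every value to 0.0.
        return [0.0 for _ in values]
    return [(value - lowest) / value_range for value in values]


# Made-up values on very different scales.
runtimes = [90, 120, 150, 210]
votes = [10, 1000, 50000, 250000]

print(min_max_scale(runtimes))
print(min_max_scale(votes))
```

<p>After scaling, both columns live in the same 0 to 1 range, so neither one dominates a distance calculation just because of its units.</p>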

<p>At this point the data has a really high dimensionality (95 columns!), so I used principal component analysis (PCA) to reduce it.</p>

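<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="kn">from</span> <span class="n">sagemaker</span> <span class="kn">import</span> <span class="n">PCA</span>

   <span class="n">num_components</span><span class="o">=</span><span class="mi">95</span>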

   <span class="n">pca_SM</span> <span class="o">=</span> <span class="nc">PCA</span><span class="p">(</span><span class="n">role</span><span class="o">=</span><span class="n">role</span><span class="p">,</span>
      <span class="n">instance_count</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
      <span class="n">instance_type</span><span class="o">=</span><span class="sh">'</span><span class="s">ml.c4.xlarge</span><span class="sh">'</span><span class="p">,</span>
      <span class="n">output_path</span><span class="o">=</span><span class="sh">'</span><span class="s">s3://</span><span class="sh">'</span><span class="o">+</span> <span class="n">bucket_name</span> <span class="o">+</span><span class="sh">'</span><span class="s">/titles/</span><span class="sh">'</span><span class="p">,</span>
      <span class="n">num_components</span><span class="o">=</span><span class="n">num_components</span><span class="p">)</span>
   </code></pre></figure>

<p>I then used the PCA job output to transform the original data. Once transformed, it was ready for training!</p>

<p>Looking at which attributes make up the resulting components, they are mostly driven by genre, with some variation based on release year, language and popularity.</p>

<p><img src="/media/attributesbycentroid.png" alt="Attributes By Centroid" /></p>
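<p>One common way to decide how many principal components to keep is to look at how much of the total variance the largest ones explain. The sketch below assumes you have the singular values from a PCA model; the numbers and the 80% target are made up, and this is not necessarily how the component count was chosen for this project.</p>

```python
def explained_variance(singular_values, n_kept):
    """Fraction of the total variance captured by the n_kept largest components."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    return sum(squared[:n_kept]) / sum(squared)


def components_needed(singular_values, target=0.8):
    """Smallest number of components whose explained variance reaches target."""
    for n_kept in range(1, len(singular_values) + 1):
        if explained_variance(singular_values, n_kept) >= target:
            return n_kept
    return len(singular_values)


# Hypothetical singular values, for illustration only.
singular_values = [4.0, 3.0, 2.0, 1.0]
print(components_needed(singular_values, target=0.8))
```

<p>The training data for the clustering step would then only keep that many transformed columns.</p>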

<h3 id="5-train">5. Train</h3>

<p>Once the data was transformed, I was able to call the SageMaker Python library to perform segmentation using unsupervised clustering, like this:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="kn">import</span> <span class="n">sagemaker</span>
   <span class="kn">from</span> <span class="n">sagemaker</span> <span class="kn">import</span> <span class="n">KMeans</span>

   <span class="n">num_clusters</span> <span class="o">=</span> <span class="mi">40</span>
   <span class="n">kmeans</span> <span class="o">=</span> <span class="nc">KMeans</span><span class="p">(</span><span class="n">role</span><span class="o">=</span><span class="n">role</span><span class="p">,</span>
                  <span class="n">instance_count</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
                  <span class="n">instance_type</span><span class="o">=</span><span class="sh">'</span><span class="s">ml.c4.xlarge</span><span class="sh">'</span><span class="p">,</span>
                  <span class="n">output_path</span><span class="o">=</span><span class="sh">'</span><span class="s">s3://</span><span class="sh">'</span><span class="o">+</span> <span class="n">bucket_name</span> <span class="o">+</span><span class="sh">'</span><span class="s">/titles/</span><span class="sh">'</span><span class="p">,</span>              
                  <span class="n">k</span><span class="o">=</span><span class="n">num_clusters</span><span class="p">)</span>
   </code></pre></figure>

<p>After this has been run, the cluster labels can be mapped back onto the original data. The distribution of the clusters looks like so:</p>

<p><img src="/media/distributionofclusters.png" alt="Distribution of Clusters" /></p>
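<p>Conceptually, mapping a cluster label back to each title means finding the closest centroid, and the distribution is just a count per label. Here’s a small plain-Python sketch of both steps; the points and centroids are invented two-dimensional values, whereas the real data has many more dimensions and the labels come from the SageMaker model.</p>

```python
import math
from collections import Counter


def nearest_cluster(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Made-up centroids and already-transformed data points.
centroids = [(0.0, 0.0), (1.0, 1.0)]
points = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.1), (0.8, 0.9), (0.15, 0.15)]

labels = [nearest_cluster(point, centroids) for point in points]
distribution = Counter(labels)

print(labels)
print(distribution)
```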

<h3 id="6-recommend">6. Recommend</h3>

<p>I wanted to present the outcome of the recommendation engine to real users. For this, I needed a website, or at a minimum, an API.</p>

<p>The basic flow of this information is:</p>

<ul>
  <li>SageMaker Notebook writes trained model to CSV file in S3 Bucket</li>
  <li>Scheduled Lambda loads CSV data into DynamoDB table</li>
  <li>API Gateway using Lambda proxy queries the data to find titles and returns a sample of titles in the same cluster as the chosen title.</li>
  <li>Static React JS website (hosted in S3, served up via CloudFront) allows users to search for movies and request recommendations based on this. Don’t judge on the styling!</li>
</ul>
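<p>The recommendation step in the flow above boils down to “pick a few other titles with the same cluster label”. Here’s a minimal sketch of that logic, using a plain dictionary in place of the DynamoDB table. The function name, titles and cluster labels are all hypothetical.</p>

```python
import random


def recommend(chosen_title, cluster_by_title, sample_size=3, rng=None):
    """Return up to sample_size other titles from the chosen title's cluster."""
    rng = rng or random.Random()
    cluster = cluster_by_title[chosen_title]
    candidates = sorted(
        title
        for title, label in cluster_by_title.items()
        if label == cluster and title != chosen_title
    )
    return rng.sample(candidates, min(sample_size, len(candidates)))


# Made-up titles and cluster labels, standing in for the real table.
cluster_by_title = {
    "Space Saga": 7,
    "Galaxy Run": 7,
    "Star Drift": 7,
    "Moon Base": 7,
    "Baking Days": 12,
    "Kitchen Stories": 12,
}

print(recommend("Space Saga", cluster_by_title, rng=random.Random(42)))
```

<p>Sampling rather than returning the whole cluster keeps the response small and gives a different set of suggestions on each request.</p>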

<p><img src="/media/moviesforme.net.png" alt="MoviesForMe.net" /></p>

<h3 id="7-source-control">7. Source control</h3>

<p>You can view all the code for the training notebook, app infrastructure and API/website here:</p>

<p><a href="https://github.com/simonmackinnon/cloudguruchallenge-2020-10">https://github.com/simonmackinnon/cloudguruchallenge-2020-10</a></p>

<h3 id="8-clean-up-resources">8. Clean up resources</h3>

<p>When dealing with Machine Learning, instances and SageMaker endpoints can incur large costs very quickly. An important thing to check is the “Endpoints” section of the SageMaker console. I’ve added code at the end of the Notebook to delete the endpoints.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">   <span class="n">sagemaker</span><span class="p">.</span><span class="nc">Session</span><span class="p">().</span><span class="nf">delete_endpoint</span><span class="p">(</span><span class="n">kmeans_predictor</span><span class="p">.</span><span class="n">endpoint</span><span class="p">)</span>
   </code></pre></figure>

<p>However, if there is an error earlier on, it’s worth manually checking that the endpoints really are deleted!</p>
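<p>One way to reduce that risk is to wrap the work that uses the endpoint in a try/finally, so the delete still runs when something fails part-way through. The sketch below uses a fake predictor class so it can run anywhere; with a real SageMaker predictor you’d pass that object in instead. The class and function names are mine.</p>

```python
class FakePredictor:
    """Hypothetical stand-in for a deployed SageMaker predictor."""

    def __init__(self):
        self.deleted = False

    def predict(self, data):
        raise RuntimeError("simulated failure part-way through the notebook")

    def delete_endpoint(self):
        self.deleted = True


def run_with_cleanup(predictor, data):
    """Run a prediction and always delete the endpoint afterwards."""
    try:
        return predictor.predict(data)
    finally:
        # Runs whether predict() succeeded or raised.
        predictor.delete_endpoint()


predictor = FakePredictor()
try:
    run_with_cleanup(predictor, [[0.1, 0.2]])
except RuntimeError as error:
    print("prediction failed:", error)

print("endpoint deleted:", predictor.deleted)
```

<p>It’s still worth a glance at the console afterwards, since this only covers errors raised inside the wrapped code.</p>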

<p>The Notebook instance size is also considerable. It’s REALLY worth stopping it when not in use. Or, you can run it on your local machine if it has the required memory resources (my MacBook Pro has 16GB RAM, which is more than enough for this exercise). If you’re planning on doing that, make sure that the configured AWS user that you use on your machine has the ability to assume the SageMaker execution role (the jobs require you to pass it in as a variable).</p>

<h3 id="9-impovements">9. Improvements</h3>

<p>At the start of this blog I included an architecture diagram for the whole solution. I also proposed a second version of the application, which would allow users to log into the website, select movies they had previously watched (stored in DynamoDB) and filter those movies out of the recommended results. Here’s an example of how this would work:</p>

<p><img src="/media/v2-of-CloudGuruChallenge.October2020.png" alt="&quot;Architecture(v2)&quot;" /></p>
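<p>The filtering part of that idea is simple enough to sketch. This assumes the watched titles for a user have already been fetched (from DynamoDB in the proposed design); the function name and titles are made up.</p>

```python
def filter_watched(recommended_titles, watched_titles):
    """Drop titles the user has already watched, keeping the original order."""
    watched = set(watched_titles)
    return [title for title in recommended_titles if title not in watched]


# Hypothetical data for one user.
recommended = ["Galaxy Run", "Star Drift", "Moon Base"]
already_watched = ["Star Drift", "Baking Days"]

print(filter_watched(recommended, already_watched))
```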

<p>Another improvement I’d make would be to add CodeBuild jobs for automated deployments. I didn’t set up a CI/CD pipeline for anything, so this will definitely be part of V2!</p>

<p>I wanted to include movie posters in the recommendations and title searches. While this information is obtainable via web-scraping of IMDB, or 3rd-party API calls, tying these calls into the API for title recommendations really slowed the site down. I have some ideas for how this could work, namely storing the images for all titles in S3, iterating over the records in the database using Step Functions.</p>

<p>Finally, a major improvement I’d make would be to the clustered data. I think using some Natural Language Processing to group movie titles based on plot text would be a fantastic way to approach this. Another way would be to get user rating and viewing data and perform collaborative clustering.</p>

<h2 id="if-youre-on-the-machine-learning-journey-take-the-train">If You’re on the Machine-Learning Journey, Take The Train</h2>

<p>I’m really a Machine Learning and Data Science beginner. That being said, the documentation and (especially) the out-of-the-box tools that AWS SageMaker provides for performing Machine Learning are REALLY awesome!</p>

<p>I had a fantastic time learning about what’s required to get data ready for training, what the outputs of ML jobs mean, and particularly, how to validate how good my model is.</p>

<p>There’s a lot more involved in getting this all working, so please reach out if there’s anything in the code that you want me to explain, or provide references for!</p>

<h4 id="references">References:</h4>
<ul>
  <li><a href="https://github.com/aws/amazon-sagemaker-examples/tree/master/introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans">https://github.com/aws/amazon-sagemaker-examples/tree/master/introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans</a></li>
  <li><a href="https://longjp.github.io/statcomp/projects/clusteringimdb.pdf">https://longjp.github.io/statcomp/projects/clusteringimdb.pdf</a></li>
  <li><a href="https://www.imdb.com/interfaces/">https://www.imdb.com/interfaces/</a></li>
  <li><a href="https://learn.acloud.guru/course/introduction-to-jupyter-notebooks">https://learn.acloud.guru/course/introduction-to-jupyter-notebooks</a></li>
  <li><a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html">https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html</a></li>
</ul>]]></content><author><name></name></author><category term="aws" /><category term="sagemaker" /><category term="aws" /><category term="sagemaker" /><category term="acloudguru" /><category term="python" /><category term="jupyter" /><category term="machine-learning" /><category term="recommendation" /><category term="imdb" /><category term="open-data" /><summary type="html"><![CDATA[Cloud Guru Challenge - October 2020]]></summary></entry><entry><title type="html">COVID Data Engineering in AWS and Python #CodeGuruChallenge</title><link href="https://theclouddevopslearningblog.com/aws/serverless/cloudgurchallenge/2020/10/16/cloud-guru-challenge.html" rel="alternate" type="text/html" title="COVID Data Engineering in AWS and Python #CodeGuruChallenge" /><published>2020-10-16T04:15:00+00:00</published><updated>2020-10-16T04:15:00+00:00</updated><id>https://theclouddevopslearningblog.com/aws/serverless/cloudgurchallenge/2020/10/16/cloud-guru-challenge</id><content type="html" xml:base="https://theclouddevopslearningblog.com/aws/serverless/cloudgurchallenge/2020/10/16/cloud-guru-challenge.html"><![CDATA[<h2 id="cloud-guru-challenge---september-2020">Cloud Guru Challenge - September 2020</h2>

<p>I gave the #CloudGuruChallenge by Forrest Brazeal (A Cloud Guru) for September a go! I started 1 day before the deadline, so rushed to finish a bit. The next challenge will be worked on straight away, and the Readme files will definitely have more than just titles. “Working software over comprehensive documentation” right?</p>

<p>Here’s some of the stuff that I did!
Read the challenge details here:
<a href="https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl">https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl</a></p>

<p>Here’s the basic architecture of the solution:
<img src="/media/etl-job-architecture.jpg" alt="The Graph" /></p>

<p>See my code here:
<a href="https://github.com/simonmackinnon/cloudguruchallenge/tree/main/2020-09">https://github.com/simonmackinnon/cloudguruchallenge/tree/main/2020-09</a></p>

<h3 id="etl-job-using-python">ETL Job using Python</h3>
<p>The job runs automatically using a CloudWatch scheduled rule, once per day. Setting this up was pretty straightforward. One gotcha for people is that a CloudWatch rule needs permission to invoke Lambda functions.</p>
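<p>For anyone hitting that gotcha, here’s roughly what granting the permission looks like with boto3’s ‘add_permission’ call. This is a sketch rather than what’s in my templates: the function name, rule ARN and statement ID are placeholders, and a small recording class stands in for the real client so the example runs without AWS credentials.</p>

```python
def allow_rule_to_invoke(lambda_client, function_name, rule_arn):
    """Let a CloudWatch Events rule invoke the given Lambda function."""
    return lambda_client.add_permission(
        FunctionName=function_name,
        StatementId="AllowScheduledRuleInvoke",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )


class RecordingClient:
    """Stand-in for boto3.client('lambda') that only records the call."""

    def __init__(self):
        self.calls = []

    def add_permission(self, **kwargs):
        self.calls.append(kwargs)
        return {"Statement": "{}"}


client = RecordingClient()
allow_rule_to_invoke(
    client,
    "etl-job-function",  # placeholder function name
    "arn:aws:events:ap-southeast-2:123456789012:rule/daily-etl",  # placeholder ARN
)
print(client.calls[0])
```

<p>With real credentials you’d pass in ‘boto3.client("lambda")’ instead. Scoping the permission to the rule’s ARN means only that rule can invoke the function.</p>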

<p>It then loads the data and puts it into a DynamoDB table. I had a pretty fun time coding this. I hadn’t used the Pandas Python library before, so it was good to see the power of it. 
Some time (what feels like almost a lifetime) ago, I learnt to use R to do data science. This was a pretty similar experience (although zero-indexing helps!).
Some of the nuances of the challenge were around only loading the most recent day’s data, so some smarts had to be built into it for that to work. The merge functionality helped to speed up the work.</p>
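<p>As a rough illustration of the merge and the “only load what’s new” logic, here’s a small Pandas sketch. The column names, figures and the way the last loaded date is tracked are assumptions for the example, not a copy of the job’s code.</p>

```python
import pandas as pd

# Two made-up daily sources that share a date column.
cases = pd.DataFrame(
    {"date": ["2020-09-01", "2020-09-02", "2020-09-03"], "cases": [100, 150, 210]}
)
recoveries = pd.DataFrame(
    {"date": ["2020-09-01", "2020-09-02", "2020-09-03"], "recovered": [10, 25, 40]}
)


def rows_to_load(cases, recoveries, last_loaded_date):
    """Merge both sources on date and keep only rows newer than the last load."""
    merged = pd.merge(cases, recoveries, on="date", how="inner")
    merged["date"] = pd.to_datetime(merged["date"])
    if last_loaded_date is None:
        # First run: nothing has been loaded yet, so load everything.
        return merged
    return merged[merged["date"] > pd.Timestamp(last_loaded_date)]


new_rows = rows_to_load(cases, recoveries, "2020-09-02")
print(new_rows)
```

<p>The inner merge also quietly drops any date that’s missing from one of the sources, which is usually what you want before writing a combined row.</p>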

<p>It sends SNS notifications for different status updates. I set this up as per the brief, although I was pushing to my topic for every row that couldn’t be read correctly, which proved to be a little verbose (at one point I was sending hundreds of error notifications due to a bug in my code). It was pretty fun and easy to set up. Having CloudFormation pass the topic’s output ARN into the environment variables of the Lambda function kept this relatively re-deployable.</p>
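<p>In hindsight, collecting the row errors and publishing a single summary would have been much quieter. Here’s a sketch of that idea using the SNS ‘publish’ call; the subject line, message format and topic ARN are made up, and a recording class stands in for the real client so it runs without AWS access.</p>

```python
def publish_error_summary(sns_client, topic_arn, row_errors, max_listed=5):
    """Publish one notification summarising all row errors, or nothing if none."""
    if not row_errors:
        return None
    listed = row_errors[:max_listed]
    lines = [f"{len(row_errors)} row(s) could not be read:"]
    lines += [f"- {error}" for error in listed]
    remaining = len(row_errors) - len(listed)
    if remaining > 0:
        lines.append(f"... and {remaining} more")
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject="ETL job: rows skipped",
        Message="\n".join(lines),
    )


class RecordingSns:
    """Stand-in for boto3.client('sns') that only records publish calls."""

    def __init__(self):
        self.published = []

    def publish(self, **kwargs):
        self.published.append(kwargs)
        return {"MessageId": "example"}


sns = RecordingSns()
errors = [f"row {number}: bad date" for number in range(200)]
publish_error_summary(
    sns, "arn:aws:sns:ap-southeast-2:123456789012:etl-status", errors
)
print(sns.published[0]["Message"])
```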

<p>For reporting, I used API Gateway to expose the DynamoDB table data, and then consumed it in JavaScript. Some of the gotchas here: (a) while the API request method for performing Scan operations is GET, the HTTP method for DynamoDB service calls is always POST; (b) getting the Integration Request Mapping Template and the Integration Response Mapping Template right is a little difficult; (c) ensuring the calls to the API had the correct headers to avoid a CORS / preflight error is always difficult (and something I should spend some time learning about, as it always trips me up). I built a simple vanilla JS demo site (due to time constraints) to retrieve (and sort) the data, then display it using <a href="https://www.chartjs.org/">Chart.js</a>.</p>
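<p>Part of what makes the response mapping fiddly is that DynamoDB returns every attribute wrapped in a type code, such as ‘S’ for strings and ‘N’ for numbers (which arrive as strings). The real work here happens in the API Gateway mapping template, but this Python sketch shows the transformation it needs to perform. The attribute names and values are made up, and only the ‘S’ and ‘N’ types are handled.</p>

```python
def flatten_item(item):
    """Turn {'cases': {'N': '100'}} style attributes into plain values."""
    flat = {}
    for name, typed_value in item.items():
        type_code, raw = next(iter(typed_value.items()))
        if type_code == "N":
            # Numbers are sent as strings in DynamoDB's JSON format.
            flat[name] = float(raw) if "." in raw else int(raw)
        else:
            flat[name] = raw
    return flat


# Hypothetical Scan response with two items, returned out of date order.
scan_response = {
    "Items": [
        {"date": {"S": "2020-09-02"}, "cases": {"N": "150"}},
        {"date": {"S": "2020-09-01"}, "cases": {"N": "100"}},
    ],
    "Count": 2,
}

rows = sorted(
    (flatten_item(item) for item in scan_response["Items"]),
    key=lambda row: row["date"],
)
print(rows)
```

<p>A Scan doesn’t return items in date order, which is why the demo site sorts the data before charting it.</p>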

<p><img src="/media/covidGraph.png" alt="The Graph" /></p>

<p>Anyway, you can access the data at this URL:</p>

<p><a href="https://2tp0wsvdr2.execute-api.ap-southeast-2.amazonaws.com/live/cumulativedata">https://2tp0wsvdr2.execute-api.ap-southeast-2.amazonaws.com/live/cumulativedata</a></p>

<p>And you can see the graph output here:</p>

<p><a href="http://simonmackinnon.com/cloudguruchallenge-2020-09.html">http://simonmackinnon.com/cloudguruchallenge-2020-09.html</a></p>

<h3 id="infrastructure-as-code">Infrastructure as Code</h3>
<p>Everything is defined in CloudFormation (except uploading the function package to S3 and publishing new versions).
Some of this was HAAARD… especially setting up API Gateway to expose the DynamoDB data without using Lambda. This is relatively easy in the console, but I found some of the settings difficult in YAML/CloudFormation. As mentioned above, setting the Mapping Templates for the request/response continuously led to formatting issues… until it didn’t.</p>

<h3 id="tbd">TBD:</h3>
<ul>
  <li>Lambda layers: the package built was a little big, and some of that could be reduced by using layers, especially for the Pandas library</li>
  <li>VPC infrastructure was a little overboard for a single Lambda</li>
  <li>CodePipeline to test and publish ETL job function package</li>
  <li>CodePipeline to update infrastructure on update</li>
  <li>Build React site to display more interactive/multiple graphs</li>
  <li>API Keys / Security</li>
</ul>]]></content><author><name></name></author><category term="aws" /><category term="serverless" /><category term="cloudgurchallenge" /><category term="cloudguruchallenge" /><category term="cloudformation" /><category term="aws" /><category term="acloudguru" /><category term="python" /><category term="etl" /><category term="lambda" /><category term="dynamodb" /><category term="quicksight" /><category term="codebuild" /><summary type="html"><![CDATA[Cloud Guru Challenge - September 2020]]></summary></entry><entry><title type="html">A Shortcoming of the AWS Lambda CLI - EventSourceMappings</title><link href="https://theclouddevopslearningblog.com/aws/lambda/cli/2020/06/20/a-shortcoming-of-the-aws-lambda-cli.html" rel="alternate" type="text/html" title="A Shortcoming of the AWS Lambda CLI - EventSourceMappings" /><published>2020-06-20T04:15:00+00:00</published><updated>2020-06-20T04:15:00+00:00</updated><id>https://theclouddevopslearningblog.com/aws/lambda/cli/2020/06/20/a-shortcoming-of-the-aws-lambda-cli</id><content type="html" xml:base="https://theclouddevopslearningblog.com/aws/lambda/cli/2020/06/20/a-shortcoming-of-the-aws-lambda-cli.html"><![CDATA[<h2 id="a-shortcoming-of-the-aws-lambda-cli----eventsourcemappings">A Shortcoming of the AWS Lambda CLI -  EventSourceMappings</h2>

<p>This one only slightly annoyed me, but still thought it was worth mentioning.</p>

<p>Some things are really easy to do in the console vs. via API calls. AWS Lambda “triggers” are a perfect example of this. The general steps are: create a Lambda function, open the Lambda function configuration in the console, click “Add Trigger”, select source service and configure. Done!</p>

<p><img src="/media/codecommittriggerconsole.png" alt="CodeCommit Trigger" /></p>

<p>Being able to set up Lambda triggers for a multitude of services from the Lambda console is a really nice (read: simple) way of configuring, and importantly, viewing, which services should be invoking the function, without having to navigate to each of the respective services’ consoles themselves. When you add Lambda triggers in this way, you can see a visual list of all of the triggers as one of the first things in the Lambda console. Great, really nice UI/UX!</p>

<p><img src="/media/codecommittriggercreated.png" alt="CodeCommit Trigger Created" /></p>

<p>So, now we want to replicate that experience using CloudFormation or API/CLI commands. You would be clever in thinking you can do all of this using the Lambda CLI, given you can do all this in the Lambda Console. And you’d also be wrong. The API call to produce this (listening) trigger is <a href="https://docs.aws.amazon.com/lambda/latest/dg/API_CreateEventSourceMapping.html"><em>CreateEventSourceMapping</em></a>, and the respective CLI command is <a href="https://docs.aws.amazon.com/cli/latest/reference/lambda/create-event-source-mapping.html">create-event-source-mapping</a>. If you look at this documentation, you’ll see that (at the time of writing) the only services for which you can create such a mapping, like you can in the console, are DynamoDB, Kinesis and SQS. Only those three… This is because the Lambda service can essentially “read” events from these services, rather than be asynchronously or synchronously invoked by the triggering service.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">aws lambda create-event-source-mapping <span class="se">\</span>
    <span class="nt">--function-name</span> CodeCommitLambda-lambdacodecommit-OT2Z33UZKD9O <span class="se">\</span>
    <span class="nt">--batch-size</span> 5 <span class="se">\</span>
    <span class="nt">--starting-position</span> LATEST <span class="se">\</span>
    <span class="nt">--event-source-arn</span> arn:aws:dynamodb:ap-southeast-2:366389342275:table/TestTable/stream/2020-06-20T04:50:40.178</code></pre></figure>

<p><img src="/media/dynamomappingcreated.png" alt="Dynamo Trigger Created" /></p>
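<p>For completeness, here’s roughly what the same mapping looks like through the Python SDK’s ‘create_event_source_mapping’ call. It’s a sketch: the account ID in the ARN is a placeholder, and a recording class stands in for the real Lambda client so the example runs without credentials.</p>

```python
def map_stream_to_function(lambda_client, function_name, stream_arn):
    """Create an event source mapping from a DynamoDB stream to a function."""
    return lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        BatchSize=5,
        StartingPosition="LATEST",
    )


class RecordingLambda:
    """Stand-in for boto3.client('lambda') that only records the call."""

    def __init__(self):
        self.mappings = []

    def create_event_source_mapping(self, **kwargs):
        self.mappings.append(kwargs)
        return {"State": "Creating"}


client = RecordingLambda()
map_stream_to_function(
    client,
    "CodeCommitLambda-lambdacodecommit-OT2Z33UZKD9O",
    # Placeholder account ID in the stream ARN.
    "arn:aws:dynamodb:ap-southeast-2:123456789012:table/TestTable/stream/2020-06-20T04:50:40.178",
)
print(client.mappings[0])
```

<p>The parameters mirror the CLI flags one-to-one, which is handy when moving a script from bash to Python.</p>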

<p>And, of course, you can set up triggers for each service respectively from the API calls for those services, but it only creates the one-way mapping. The Lambda function(s), in this case, have no knowledge or ownership of the triggers set up, for example, from CodeCommit.</p>

<figure class="highlight"><pre><code class="language-bash" data-lang="bash">aws codecommit put-repository-triggers <span class="se">\</span>
    <span class="nt">--repository-name</span> my-webpage <span class="se">\</span>
    <span class="nt">--triggers</span> <span class="nv">name</span><span class="o">=</span>MyLambdaTrigger,destinationArn<span class="o">=</span><span class="s2">"arn:aws:lambda:ap-southeast-2:123456789012:function:CodeCommitLambda-lambdacodecommit-OT2Z33UZKD9O"</span>,customData<span class="o">=</span><span class="s2">""</span>,branches<span class="o">=</span>master,events<span class="o">=</span>all</code></pre></figure>

<p><img src="/media/codecommittrigger.png" alt="CodeCommit Trigger" /></p>

<p><img src="/media/lambdanotriggers.png" alt="No Mapping in Lambda" /></p>

<p>Given this, it’s disappointing that the Lambda console respects the mapping for invoke-type triggers, but there’s no way of even listing these kinds of “mappings” if you’re creating functions programmatically.</p>]]></content><author><name></name></author><category term="aws" /><category term="lambda" /><category term="cli" /><category term="aws" /><category term="cli" /><category term="lambda" /><summary type="html"><![CDATA[A Shortcoming of the AWS Lambda CLI - EventSourceMappings]]></summary></entry><entry><title type="html">Creating CodeCommit HTTPS Security Credentials With CloudFormation Lambda-based Custom Resource</title><link href="https://theclouddevopslearningblog.com/aws/cloudformation/2020/06/06/creating-codecommit-https-credentials-using-cloudformation-custom-lambda-resource.html" rel="alternate" type="text/html" title="Creating CodeCommit HTTPS Security Credentials With CloudFormation Lambda-based Custom Resource" /><published>2020-06-06T00:00:00+00:00</published><updated>2020-06-06T00:00:00+00:00</updated><id>https://theclouddevopslearningblog.com/aws/cloudformation/2020/06/06/creating-codecommit-https-credentials-using-cloudformation-custom-lambda-resource</id><content type="html" xml:base="https://theclouddevopslearningblog.com/aws/cloudformation/2020/06/06/creating-codecommit-https-credentials-using-cloudformation-custom-lambda-resource.html"><![CDATA[<h2 id="creating-codecommit-https-security-credentials-with-cloudformation-lambda-based-custom-resource">Creating CodeCommit HTTPS Security Credentials With CloudFormation Lambda-based Custom Resource</h2>

<p><img src="/media/httpscreds.png" alt="the outputs of the stack" /></p>

<p>As I have written previously, I’ve just committed myself to achieving the AWS DevOps Professional certification. As part of my study, I’m attempting to work through a <a href="https://www.udemy.com/course/aws-certified-devops-engineer-professional-hands-on/">hands-on online course</a>. I’ve also committed to doing all demos using only the AWS CLI, SDK or CloudFormation. To force myself to do this, I’ve only granted programmatic access to my IAM user in my training account. The rationale is this: the console makes deploying things easy, and sets appropriate default values for most required fields in API calls. Provisioning in an automated way means these values have to be understood, which gives a better understanding of the services being used.</p>

<p>As I said, I started this course primed to only work using scripts and Infrastructure as Code (with the aim of using CloudFormation primarily, to ensure easy removal of deployed resources). The very first part of the very first demo: create an IAM User and create HTTPS CodeCommit Security Credentials for it. Easy, right? This is a two-second job in the console.</p>

<p>And, while there exists an API for this, <a href="https://docs.aws.amazon.com/IAM/latest/APIReference/API_CreateServiceSpecificCredential.html">CreateServiceSpecificCredential</a>, CloudFormation doesn’t support this IAM feature. Enter, CloudFormation Custom Resources!</p>

<p>The steps needed for this resource could have been really simple, as the API call only requires an existing IAM user’s username, and the endpoint of the AWS service to create the credentials for. I wanted to create a simple automation sequence to allow multiple users to be created with this stack.</p>

<p>I don’t have a lot of experience writing Lambda code for CFN Custom Resources, so I used <a href="https://github.com/aws-cloudformation/custom-resource-helper">crhelper</a> to help build out the function scaffolding. This library does a crazy amount of the undifferentiated heavy lifting. All that was required was to pass through the username to the create and reset credentials API calls (I used the Python SDK for this).</p>

<p><img src="/media/events.png" alt="the outputs of the stack" /></p>

<p>The code was really simple:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="n">crhelper</span> <span class="kn">import</span> <span class="n">CfnResource</span>
<span class="kn">import</span> <span class="n">boto3</span><span class="p">,</span> <span class="n">json</span>

<span class="n">helper</span> <span class="o">=</span> <span class="nc">CfnResource</span><span class="p">()</span>
<span class="n">iamclient</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="nf">client</span><span class="p">(</span><span class="sh">'</span><span class="s">iam</span><span class="sh">'</span><span class="p">)</span>

<span class="nd">@helper.create</span>
<span class="k">def</span> <span class="nf">create_https_credentials</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">_</span><span class="p">):</span>
    <span class="n">user</span> <span class="o">=</span> <span class="n">event</span><span class="p">[</span><span class="sh">'</span><span class="s">ResourceProperties</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">]</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">iamclient</span><span class="p">.</span><span class="nf">create_service_specific_credential</span><span class="p">(</span>
        <span class="n">UserName</span><span class="o">=</span><span class="n">user</span><span class="p">,</span>
        <span class="n">ServiceName</span><span class="o">=</span><span class="sh">'</span><span class="s">codecommit.amazonaws.com</span><span class="sh">'</span>
    <span class="p">)</span>

    <span class="n">helper</span><span class="p">.</span><span class="n">Data</span><span class="p">[</span><span class="sh">'</span><span class="s">ServiceUserName</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="sh">'</span><span class="s">ServiceSpecificCredential</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">ServiceUserName</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">helper</span><span class="p">.</span><span class="n">Data</span><span class="p">[</span><span class="sh">'</span><span class="s">ServicePassword</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="sh">'</span><span class="s">ServiceSpecificCredential</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">ServicePassword</span><span class="sh">'</span><span class="p">]</span>

<span class="nd">@helper.update</span>
<span class="k">def</span> <span class="nf">reset_https_credentials</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">_</span><span class="p">):</span>
    <span class="n">user</span> <span class="o">=</span> <span class="n">event</span><span class="p">[</span><span class="sh">'</span><span class="s">ResourceProperties</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">user</span><span class="sh">'</span><span class="p">]</span>
    
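    <span class="c1"># The reset call identifies a credential by its ID, not by service name,</span>
    <span class="c1"># so look up the user's existing CodeCommit credential first.</span>
    <span class="n">existing</span> <span class="o">=</span> <span class="n">iamclient</span><span class="p">.</span><span class="nf">list_service_specific_credentials</span><span class="p">(</span>
        <span class="n">UserName</span><span class="o">=</span><span class="n">user</span><span class="p">,</span>
        <span class="n">ServiceName</span><span class="o">=</span><span class="sh">'</span><span class="s">codecommit.amazonaws.com</span><span class="sh">'</span>
    <span class="p">)</span>
    <span class="n">credential_id</span> <span class="o">=</span> <span class="n">existing</span><span class="p">[</span><span class="sh">'</span><span class="s">ServiceSpecificCredentials</span><span class="sh">'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="sh">'</span><span class="s">ServiceSpecificCredentialId</span><span class="sh">'</span><span class="p">]</span>

    <span class="n">response</span> <span class="o">=</span> <span class="n">iamclient</span><span class="p">.</span><span class="nf">reset_service_specific_credential</span><span class="p">(</span>
        <span class="n">UserName</span><span class="o">=</span><span class="n">user</span><span class="p">,</span>
        <span class="n">ServiceSpecificCredentialId</span><span class="o">=</span><span class="n">credential_id</span>
    <span class="p">)</span>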

    <span class="n">helper</span><span class="p">.</span><span class="n">Data</span><span class="p">[</span><span class="sh">'</span><span class="s">ServiceUserName</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="sh">'</span><span class="s">ServiceSpecificCredential</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">ServiceUserName</span><span class="sh">'</span><span class="p">]</span>
    <span class="n">helper</span><span class="p">.</span><span class="n">Data</span><span class="p">[</span><span class="sh">'</span><span class="s">ServicePassword</span><span class="sh">'</span><span class="p">]</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="sh">'</span><span class="s">ServiceSpecificCredential</span><span class="sh">'</span><span class="p">][</span><span class="sh">'</span><span class="s">ServicePassword</span><span class="sh">'</span><span class="p">]</span>

<span class="nd">@helper.delete</span>
<span class="k">def</span> <span class="nf">no_op</span><span class="p">(</span><span class="n">_</span><span class="p">,</span> <span class="n">__</span><span class="p">):</span>
    <span class="k">pass</span>

<span class="k">def</span> <span class="nf">handler</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">context</span><span class="p">):</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Started execution of HTTPS Credentials Creator Lambda...</span><span class="sh">"</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Function ARN %s</span><span class="sh">"</span> <span class="o">%</span> <span class="n">context</span><span class="p">.</span><span class="n">invoked_function_arn</span><span class="p">)</span>
    <span class="nf">print</span><span class="p">(</span><span class="sh">"</span><span class="s">Incoming Event %s </span><span class="sh">"</span> <span class="o">%</span> <span class="n">json</span><span class="p">.</span><span class="nf">dumps</span><span class="p">(</span><span class="n">event</span><span class="p">))</span>
    
    <span class="nf">helper</span><span class="p">(</span><span class="n">event</span><span class="p">,</span> <span class="n">context</span><span class="p">)</span></code></pre></figure>

<p>You can check out (and use) the code for this here: <a href="https://github.com/simonmackinnon/codecommit-httpscreds-cloudformation">https://github.com/simonmackinnon/codecommit-httpscreds-cloudformation</a>. This repo has CloudFormation templates to deploy single-time resources, as well as to create an IAM user and output the corresponding Access Keys and the CodeCommit HTTPS Security Credentials. Feedback super welcome.</p>

<p>Anyway, at this rate, the 20-hour long course will probably take me about a year to complete, ha ha ha!</p>]]></content><author><name></name></author><category term="aws" /><category term="cloudformation" /><category term="aws" /><category term="cloudformation" /><category term="iam" /><category term="lambda" /><summary type="html"><![CDATA[Creating CodeCommit HTTPS Security Credentials With CloudFormation Lambda-based Custom Resource]]></summary></entry><entry><title type="html">Course Review: A Cloud Guru, Advanced AWS CloudFormation- Adrian Cantrill</title><link href="https://theclouddevopslearningblog.com/aws/cloudformation/2020/05/28/acg-advanced-cloudformation-course-review.html" rel="alternate" type="text/html" title="Course Review: A Cloud Guru, Advanced AWS CloudFormation- Adrian Cantrill" /><published>2020-05-28T10:15:00+00:00</published><updated>2020-05-28T10:15:00+00:00</updated><id>https://theclouddevopslearningblog.com/aws/cloudformation/2020/05/28/acg-advanced-cloudformation-course-review</id><content type="html" xml:base="https://theclouddevopslearningblog.com/aws/cloudformation/2020/05/28/acg-advanced-cloudformation-course-review.html"><![CDATA[<h2 id="course-review-a-cloud-guru-advanced-aws-cloudformation--adrian-cantrill">Course Review: A Cloud Guru, Advanced AWS CloudFormation- Adrian Cantrill</h2>
<h3 id="course-url-httpslearnacloudgurucourseaws-advanced-cloudformationdashboard">Course URL: <a href="https://learn.acloud.guru/course/aws-advanced-cloudformation/dashboard">https://learn.acloud.guru/course/aws-advanced-cloudformation/dashboard</a></h3>

<h3 id="tldr">TL;DR</h3>
<p>Do this course! Awesome and fun content using practical templates that are provided and evolve to match the skills being taught. Perfect course introducing some complex and advanced topics for AWS CloudFormation. Thanks Adrian!</p>

<h3 id="long-version">Long Version</h3>

<p>After completing the AWS Associate Certification trifecta late last year, and Azure Fundamentals earlier this year, I took a break from study to figure out what path of learning I wanted to do next. Given I work as an AWS Cloud Engineer, I thought the AWS DevOps Professional certification would be highly relevant as well as an awesome opportunity to learn some new concepts and technology.</p>

<p><img src="/media/record-of-completion.png" alt="Proof!!" /></p>

<p><a href="https://medium.com/@apzuk3/what-it-takes-to-pass-the-aws-certified-devops-engineer-professional-exam-40453cf0e3d4">This blog</a> is a really good starting point (I think) to what needs to be learnt/studied for this certification. I love Infrastructure as Code, and this post recommended doing the A Cloud Guru - Advanced AWS CloudFormation course to brush up on CloudFormation skills.</p>

<p>I loved this course. It framed the material as business-challenge case studies set in two fictitious companies, which made the learning much more realistic.</p>

<p>The course content provides the templates to be deployed. For the first case-study, I re-wrote this, iterating on it as the course progressed. This meant that I got hands-on experience writing the CFN templates, and importantly experienced all of the troubleshooting that comes along with doing so.</p>

<p>For those who are unfamiliar, Infrastructure as Code is a way of declaring, in a text file of some kind, the infrastructure resources you want created, as well as their configuration. There are many different libraries, frameworks and services to do so. For AWS, <a href="https://aws.amazon.com/cloudformation/">CloudFormation</a> is the native service that they provide to manage this. Some of the advantages of this service over its competitors are the easy integration into your AWS account, ease of learning/setup, as well as a slight security win (looking at you Terraform with your plain-text state-files!).</p>
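
<p>To make that concrete, here is a minimal sketch of such a declaration. It is not taken from the course: it simply builds a one-resource CloudFormation template as a Python dictionary and prints it as JSON. The function name <code class="language-plaintext highlighter-rouge">minimal_template</code> and the logical name <code class="language-plaintext highlighter-rouge">NotesBucket</code> are made-up placeholders.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import json


def minimal_template(bucket_logical_name):
    """Return a CloudFormation template declaring a single S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Smallest useful example: one S3 bucket",
        "Resources": {
            # The logical name is how the rest of the template refers to this resource.
            bucket_logical_name: {"Type": "AWS::S3::Bucket"},
        },
    }


if __name__ == "__main__":
    # "NotesBucket" is a hypothetical logical name used only for illustration.
    print(json.dumps(minimal_template("NotesBucket"), indent=2))</code></pre></figure>

<p>The printed JSON could be saved to a file and deployed as a stack; everything beyond the single bucket (parameters, outputs and so on) is left out to keep the shape of a template visible.</p>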

<p>Some really cool concepts are taught in this course. One of my favourites is how <a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-hup.html">cfn-hup</a> is explained, as this is something that has always seemed confusing to me. Being able to have EC2 resources detect changes in their own metadata and run some specified commands is really cool. Tying this to re-running the cfn-init process after a change is detected is a powerful mechanism for re-applying instance setup commands when a stack is updated.</p>
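
<p>As a rough illustration of that mechanism (my own sketch, not the course’s templates), the two small config files cfn-hup reads can be rendered like this. They conventionally live at <code class="language-plaintext highlighter-rouge">/etc/cfn/cfn-hup.conf</code> and <code class="language-plaintext highlighter-rouge">/etc/cfn/hooks.d/cfn-auto-reloader.conf</code>. The function names, stack name, region and the <code class="language-plaintext highlighter-rouge">WebServer</code> logical ID are all assumptions for the example.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">def render_cfn_hup_conf(stack_name, region):
    """Main cfn-hup config: which stack to poll and in which region."""
    return "\n".join([
        "[main]",
        f"stack={stack_name}",
        f"region={region}",
        "",
    ])


def render_auto_reloader_hook(stack_name, region, resource):
    """Hook that re-runs cfn-init when the resource's Init metadata changes."""
    return "\n".join([
        "[cfn-auto-reloader-hook]",
        "triggers=post.update",
        f"path=Resources.{resource}.Metadata.AWS::CloudFormation::Init",
        f"action=/opt/aws/bin/cfn-init -v --stack {stack_name} --resource {resource} --region {region}",
        "runas=root",
        "",
    ])


if __name__ == "__main__":
    # "demo-stack", "ap-southeast-2" and "WebServer" are placeholder values.
    print(render_cfn_hup_conf("demo-stack", "ap-southeast-2"))
    print(render_auto_reloader_hook("demo-stack", "ap-southeast-2", "WebServer"))</code></pre></figure>

<p>In a real template these files would normally be written by the cfn-init <code class="language-plaintext highlighter-rouge">files</code> section rather than by a script; rendering them in Python here just makes it easy to see what ends up on the instance.</p>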

<p>The course was, I believe, recorded around 2017/18, so some of the screens in the console are a little out-of-date, although nothing has changed dramatically since then. At one point, we are required to create some Google web authentication credentials to use in an app we create. The steps around this have changed slightly, but the accompanying instructions from ACG helped to navigate these changes.</p>

<p>Another area of learning in this course, that piqued my interest, was CloudFormation custom resources using Lambda. I’ve known about this feature of CFN for some time, and the idea had always interested me. Adrian teaches this content in a very simple manner, especially how the resource lifecycle works using the resource properties/attributes and what the functions’ responses need to contain for it to all work. From these small and simple demos, we automatically allocated CIDR ranges for a multi-environment application within a VPC, a task that normally would require networking knowledge and manual entry. Through this example, Adrian showed the awesome power of extending CloudFormation using Lambda-based custom resources.</p>
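
<p>To give a flavour of the idea, here is a minimal sketch of the two pieces involved: carving per-environment CIDR ranges out of a VPC range, and the response body a custom resource hands back to CloudFormation. This is my own simplification rather than the course’s code; the function names, the fallback physical ID <code class="language-plaintext highlighter-rouge">cidr-allocation</code> and the sample event values are assumptions.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import ipaddress
import itertools


def allocate_subnets(vpc_cidr, environments, new_prefix):
    """Give each environment the next free /new_prefix block inside vpc_cidr."""
    network = ipaddress.ip_network(vpc_cidr)
    blocks = list(itertools.islice(network.subnets(new_prefix=new_prefix), len(environments)))
    if len(blocks) != len(environments):
        raise ValueError("VPC range is too small for the requested environments")
    return {env: str(block) for env, block in zip(environments, blocks)}


def build_response(event, data):
    """Shape of the JSON body CloudFormation expects back from a custom resource."""
    return {
        "Status": "SUCCESS",
        # Hypothetical fallback ID; a create event has no PhysicalResourceId yet.
        "PhysicalResourceId": event.get("PhysicalResourceId", "cidr-allocation"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        # Values under Data become readable in the template via Fn::GetAtt.
        "Data": data,
    }


if __name__ == "__main__":
    # Hypothetical event; real ones also carry RequestType, ResponseURL and more.
    event = {"StackId": "stack-id", "RequestId": "request-id", "LogicalResourceId": "CidrAllocator"}
    cidrs = allocate_subnets("10.0.0.0/16", ["dev", "test", "prod"], 24)
    print(build_response(event, cidrs))</code></pre></figure>

<p>In a real function this body is sent with an HTTP PUT to the pre-signed URL in the event’s <code class="language-plaintext highlighter-rouge">ResponseURL</code> field, and a failure would use <code class="language-plaintext highlighter-rouge">"Status": "FAILED"</code> together with a <code class="language-plaintext highlighter-rouge">Reason</code>.</p>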

<p><img src="/media/custom-resources-lambda-slide.png" alt="Custom Resources Slide" /></p>

<p>Overall, the design/architecture pattern implemented could be used as the foundations for your own projects, etc. even in a work/production setting. Definitely templates that I’ll be hanging onto for some time!!!</p>

<h4 id="amazon-linux-1-ami-usage-and-upgrade-issue">Amazon Linux 1 AMI Usage and Upgrade Issue:</h4>
<ul>
  <li>Only one real issue (other than superficial issues related to POC nature of apps/environments). The EC2 instances used in the templates were based off of the <a href="https://aws.amazon.com/amazon-linux-ami/">Amazon Linux 1 AMI</a>. Given <a href="https://aws.amazon.com/blogs/aws/update-on-amazon-linux-ami-end-of-life/">this image type is flagged for End-Of-Life at the end of 2020</a> this is somewhat problematic. For the first case-study, I updated the template(s) to use Amazon Linux 2, which proved difficult. The cfn-init config packages command has difficulty installing an appropriate version of PHP for WordPress to run when the yum ‘php’ package is used. If the default packages are used, the following error occurs in WordPress: <br />
<br />
<em><strong>“Your server is running PHP version 5.4.16 but WordPress 5.2 requires at least 5.6.20.”</strong></em> <br />
<br />
To overcome this, we need to install PHP v7.2 or later using amazon-linux-extras. Unfortunately, this isn’t available in the cfn-init configuration packages section. To get this to install, I had to add the following command to my install_wordpress configuration:</li>
</ul>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">commands</span><span class="pi">:</span>
    <span class="na">enable_php</span><span class="pi">:</span>
        <span class="na">cwd</span><span class="pi">:</span> <span class="s2">"</span><span class="s">~"</span>
        <span class="na">command</span><span class="pi">:</span> <span class="s2">"</span><span class="s">amazon-linux-extras</span><span class="nv"> </span><span class="s">install</span><span class="nv"> </span><span class="s">php7.2"</span></code></pre></figure>

<p><br />
In any case, that seemed to be one of the only issues when upgrading the instance to Amazon Linux 2.</p>

<h3 id="overall">Overall</h3>
<p>Pretty stoked to get through this. As with any Infrastructure course, the time taken to get through the content if you do the demos yourself is always a lot longer than the course length, with lots of waiting for stacks to provision/update/delete. Great starting point to move to automation in an AWS native way!</p>]]></content><author><name></name></author><category term="aws" /><category term="cloudformation" /><summary type="html"><![CDATA[Course Review: A Cloud Guru, Advanced AWS CloudFormation- Adrian Cantrill Course URL: https://learn.acloud.guru/course/aws-advanced-cloudformation/dashboard]]></summary></entry></feed>