Cloud Guru Challenge - October 2020

2020-11-05T01:00:00+11:00

Background

Goal: Build a Netflix Style Recommendation Engine with Amazon SageMaker
Outcome: Gain real machine learning and AWS skills while getting hands-on with a real-world project to add to your portfolio https://acloudguru.com/blog/engineering/cloudguruchallenge-machine-learning-on-aws

TL;DR:

I built:

Movie Recommendation Engine (K-Means clustering using AWS SageMaker)
Serverless API and Website for users to view recommendations for selected movies (using API Gateway, Lambda, DynamoDB, S3 & CloudFront)
Visit https://moviesforme.net/ to try out the recommendations!

Here’s the architecture I implemented:

Machine Learning in the Cloud

Steps

1. Determine use case and obtain data.

I thought about using GoodReads data to build a book recommendation engine, mostly because it would be a point of differentiation. However, my interests align with TV and film more than literature, so I decided movie datasets would be a better fit for me.

The first decision to make was to do with what data to use. In the brief, Kesha Williams made the example suggestion of movie datasets from IMDB.

The datasets are all available for download here: https://datasets.imdbws.com/

The sets Kesha recommended were title.akas (for grouping alternate titles’ info), title.basics (basic information about the titles) and title.ratings (rating information for the titles). These can all be merged on the “titleid” column. I used the ‘requests’ Python library to download these, and then converted them to Pandas dataframes for analysis / ML training.
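
As a rough illustration of the merge step, here is a sketch that joins two tiny tab-separated samples on the title id. It uses only the standard library rather than Pandas so the join logic is visible. The sample rows and the `read_tsv` / `join_on` helper names are made up for illustration, and the id column is called `tconst` here because that is how the IMDb basics and ratings files label it.

```python
import csv
import io

# Tiny stand-ins for title.basics.tsv and title.ratings.tsv (made-up rows).
BASICS_TSV = "tconst\tprimaryTitle\tgenres\ntt001\tFirst Film\tDrama,Romance\ntt002\tSecond Film\t\\N\n"
RATINGS_TSV = "tconst\taverageRating\tnumVotes\ntt001\t7.1\t1200\n"


def read_tsv(text):
    """Parse tab-separated text into a list of dicts, mapping IMDb's '\\N' marker to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [{k: (None if v == "\\N" else v) for k, v in row.items()} for row in reader]


def join_on(left, right, key):
    """Inner join two lists of dicts on `key` (the same idea as a default Pandas merge)."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]


titles = join_on(read_tsv(BASICS_TSV), read_tsv(RATINGS_TSV), "tconst")
print(titles)
```

Titles without a rating row drop out of the result, which is what an inner join does. Whether that is desirable depends on how you want to treat unrated titles.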

A further dataset that was considered for use was name.basics. This data shows actors (and relevant info) and some titles (csv of titleid values) that the actor is known for. This information would be very useful.

Other information could be easily scraped from imdb itself, such as plot, reviews, etc. While this information is most likely going to improve the quality of the recommendations, the type of Machine Learning required to do this is beyond the scope of this exercise.

In general, there are a few main ways of grouping this type of data for recommendations: (a) Simple recommendations. Recommending the same items regardless of user, normally based on highest rating or sales data. (b) Content-based filtering. This finds commonalities between data attributes, e.g. movie genres, actors, plot, ratings. (c) Collaborative filtering. This is more user-behaviour driven, grouping data based on interactions (similar ratings for one title would allow for recommending another one).
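
To make content-based filtering a bit more concrete, here is a minimal sketch that ranks titles by how much their genre sets overlap (Jaccard similarity). This is not what the clustering later in this post does; it is just one simple way to compare attributes. The titles and genres are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical titles and genres, purely for illustration.
GENRES = {
    "Space Heist": {"Action", "Sci-Fi"},
    "Robot Uprising": {"Action", "Sci-Fi", "Thriller"},
    "Quiet Harbour": {"Drama", "Romance"},
}


def most_similar(title, catalogue):
    """Rank every other title by genre overlap with `title`, best match first."""
    target = catalogue[title]
    others = [(jaccard(target, genres), name) for name, genres in catalogue.items() if name != title]
    return [name for score, name in sorted(others, key=lambda pair: -pair[0])]


print(most_similar("Space Heist", GENRES))
```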

The data that I selected will allow for some fairly simple content-based filtering.

2. Create Jupyter hosted notebook

My experience with Jupyter is pretty minimal. I’d played with it very briefly when doing an AWS DeepRacer lab over a year ago now. I wanted to get a good understanding of how Jupyter works, so I did the following course on A Cloud Guru: https://learn.acloud.guru/course/introduction-to-jupyter-notebooks

I installed Jupyter on my local machine to begin with and to learn about how the notebooks work.

A really cool advantage of Jupyter notebooks is the reproducible nature of the runs, meaning anyone can run the same experiment, even when the underlying data changes.

The course was a very interesting introduction to some of the good data science tools that are available in Python, as well as how to use hosted notebooks in the cloud.

I highly recommend this course if you’re keen on learning how to use Jupyter.

While I’m on the subject, the real advantage of using hosted Jupyter notebooks is that you can perform data science experiments and ML training using resources that you normally wouldn’t have access to, and only need to pay for the infrastructure as you use it!

Being able to see visualisations generated inline with the code is really advantageous as well, making the connection between the context, the code and the information really straightforward.

However, a downside of running data science scripts on AWS hosted infrastructure is the cost. Pandas loads dataframes into memory (much like other statistics software), and more data means a bigger instance type. To load the data I chose, I required an instance type of ml.t2.xlarge… not a cheap instance. Couple that with the cost of SageMaker instances, and costs can quickly add up, especially if you’re just doing this as a training exercise (excuse the pun!).

3. Inspect and visualize data

To understand the data I had, I used the Pandas, Matplotlib and Seaborn Python libraries to analyse and visualise it.

The real value of this is to see the relationship between different variables. A good example of this is the number of titles in the data vs. the year of release for each movie.

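   # count titles per year, sorted by year, so the x values and bar heights line up
   year_counts = df_titles.year.value_counts().sort_index()
   plt.bar(year_counts.index, year_counts.values)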

Another good relationship to view is the number of votes per title vs. the average rating.

   plt.figure(figsize = (10,8))
   sns.scatterplot(x = df_titles['numvotes'], y = df_titles['averagerating'])
   plt.xlabel('number of votes')
   plt.ylabel('average rating of movie')

4. Prepare and transform data

As I was, previous to this challenge, unfamiliar with AWS SageMaker and K-Means Clustering, I used the following AWS provided example Jupyter Notebook as a guide on performing my own clustering: https://github.com/aws/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans/sagemaker-countycensusclustering.ipynb

The information in the data is quite useful for classifying movies. Columns such as ‘genres’ allow us to see movies with the same genre, for instance, which is likely going to be a solid basis for grouping movies. However, K-Means clustering algorithms don’t work with descriptive data, so we need to transform the data.

The genres column contains CSV data of the genres for each movie. Each movie can have no, one, or several genres.

There are many different combinations of genres (1258 unique values) in the dataset:

   # we can see there's lots of unique values, as each genre can be combined with others
   df_titles.genres.unique()
   array(['Romance', 'Biography,Drama', '\\N', ..., 'Fantasy,History,War',
         'Documentary,Family,Sci-Fi', 'Horror,Musical,Thriller'],
         dtype=object)
   df_titles.genres.unique().shape
   (1258,)

To convert this to data that an ML-algorithm can use, we need to transform it. Firstly, I converted the CSV data to a list of strings:

   # let's convert the csv column to a pandas list object in a new column
   df_titles['genres_list'] = df_titles.genres.str.split(',').tolist()

And then used the Pandas ‘get_dummies()’ function to perform ‘One-Hot-Encoding’ on the individual genres

   # get the one hot encoded values for genre. 
   # (this table is relatively sparse)
   genres_one_hot_encoded = df_titles.genres_list.str.join('|').str.get_dummies().add_prefix('genre_')
   genres_one_hot_encoded.shape
   (254179, 29)
   genres_one_hot_encoded.head()

From this, we’ve binarised the data for each genre, where each movie gets a ‘1’ if it has that genre, and a ‘0’ if it’s absent. We can also see that there are actually only 29 unique genres, which is a big reduction from 1258! We can then join this data to the main dataframe, and drop the existing descriptive columns, as well as the one-hot-encoded columns for n/a or null values (in the IMDB dataset, these are represented by ‘\N’). This process can be repeated for any descriptive attribute. In my example data, I also included movie language, although this didn’t need string splitting first.
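
If it helps to see what the encoding is doing without Pandas, here is a small hand-rolled sketch of the same idea. The `one_hot_genres` helper is hypothetical, and unlike the steps above it skips the ‘\N’ marker up front instead of creating a column for it and dropping it afterwards.

```python
def one_hot_genres(genre_strings, null_marker="\\N"):
    """Return (sorted genre names, one 0/1 row per title) for comma-separated genre strings."""
    split = [[] if g == null_marker else g.split(",") for g in genre_strings]
    genres = sorted({name for row in split for name in row})
    rows = [[1 if name in row else 0 for name in genres] for row in split]
    return genres, rows


columns, encoded = one_hot_genres(["Romance", "Biography,Drama", "\\N", "Drama,Romance"])
print(columns)
for row in encoded:
    print(row)
```

Four distinct genre strings collapse into three columns here, which is the same effect as the 1258 to 29 reduction above, just on a toy scale.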

Now the data for the other numerical attributes (runtime, number of votes and average rating) can be scaled using MinMaxScaler. We need to standardise the scaling of the numerical columns in order to use any distance based analytical methods so that we can compare the relative distances between different feature columns.

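   from sklearn.preprocessing import MinMaxScaler

   # scale every column to the 0-1 range, keeping the original column names and index
   scaler=MinMaxScaler()
   df_titles_scaled=pd.DataFrame(scaler.fit_transform(df_titles))
   df_titles_scaled.columns=df_titles.columns
   df_titles_scaled.index=df_titles.index

   df_titles_scaled.describe()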
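
For anyone wondering what min-max scaling actually does to a column, here is a plain-Python sketch of the formula, (x - min) / (max - min). The `min_max_scale` helper and the sample numbers are made up; the notebook itself relies on MinMaxScaler.

```python
def min_max_scale(values):
    """Rescale numbers to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


runtimes = [90, 120, 150]          # minutes (made-up values)
votes = [10, 5_000, 1_000_000]     # raw vote counts (made-up values)
print(min_max_scale(runtimes))
print(min_max_scale(votes))
```

Without this step, a column like vote count would dominate any distance calculation simply because its raw numbers are so much bigger than a 0/1 genre flag.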

The dimensionality of the data is then really large (95 columns!). I used principal component analysis (PCA) to reduce the dimensionality of the data.

   from sagemaker import PCA

   num_components=95

   pca_SM = PCA(role=role,
      instance_count=1,
      instance_type='ml.c4.xlarge',
      output_path='s3://'+ bucket_name +'/titles/',
      num_components=num_components)

I then used the PCA job output to transform the original data. Once transformed, it was ready for training!

When viewing what attributes make up the components found, it’s mostly Genre, with some variation based on release year, language and popularity.

5. Train

Once the data was transformed, I was able to call the SageMaker Python library to perform segmentation using unsupervised clustering, like this:

   import sagemaker
   from sagemaker import KMeans

   num_clusters = 40
   kmeans = KMeans(role=role,
                  instance_count=1,
                  instance_type='ml.c4.xlarge',
                  output_path='s3://'+ bucket_name +'/titles/',              
                  k=num_clusters)
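
To give a feel for what a K-Means job is doing, here is a tiny sketch of the textbook version (Lloyd's algorithm) on made-up 2-D points. This is only an illustration of the general technique, not what SageMaker runs internally, and the `kmeans` function and its parameters are my own naming.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: returns (centroids, cluster label for each point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Two obvious groups of made-up 2-D points.
points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0), (5.2, 5.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The label numbers themselves are arbitrary; what matters is which points end up sharing a label.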

After this has been run, the original data can have the cluster label mapped back to it. The distribution of the clusters looks like so:

6. Recommend

I wanted to present the outcome of the recommendation engine to real users. For this, I needed a website, or at a minimum, an API.

The basic flow of this information is:

SageMaker Notebook writes trained model to CSV file in S3 Bucket
Scheduled Lambda loads CSV data into DynamoDB table
API Gateway using Lambda proxy queries the data to find titles and return a sample of titles in the same cluster as the chosen title.
Static React JS website (hosted in S3, served up via CloudFront) allows users to search for movies and request recommendations based on this. Don’t judge on the styling!
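
The lookup at the heart of the API step can be sketched roughly as below: find the cluster of the chosen title, then sample other titles from the same cluster. The in-memory dictionary stands in for the DynamoDB table, and the ids, cluster numbers and `recommend` function are hypothetical rather than the deployed Lambda code.

```python
import random

# Hypothetical shape of the stored data: title id -> cluster label.
CLUSTER_BY_TITLE = {"tt001": 3, "tt002": 3, "tt003": 7, "tt004": 3, "tt005": 7}


def recommend(title_id, cluster_by_title, count=2, rng=random):
    """Return up to `count` random titles sharing a cluster with `title_id`."""
    cluster = cluster_by_title[title_id]
    candidates = [t for t, c in cluster_by_title.items() if c == cluster and t != title_id]
    return rng.sample(candidates, min(count, len(candidates)))


print(recommend("tt001", CLUSTER_BY_TITLE))
```

Sampling means the same title can give different recommendations on each request, which keeps the results from feeling static.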

7. Source control

You can view all the code for the training notebook, app infrastructure and API/website here:

https://github.com/simonmackinnon/cloudguruchallenge-2020-10

8. Clean up resources

When dealing with Machine Learning, instances and SageMaker endpoints can incur large costs very quickly. An important thing to check is the “Endpoints” section in the SageMaker console. I’ve added code at the end of the Notebook to delete the endpoints.

   sagemaker.Session().delete_endpoint(kmeans_predictor.endpoint)

However, if there is an error earlier on, it’s worth manually checking that the endpoints really are deleted!
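
One way to do that check from code is sketched below. It takes a SageMaker client and deletes every endpoint it lists, so only point it at a throwaway training account. The `delete_all_endpoints` function and the fake client are my own illustration; with real credentials you would pass `boto3.client("sagemaker")` instead of the fake.

```python
def delete_all_endpoints(sm_client):
    """Delete every SageMaker endpoint visible to `sm_client`; return the names deleted."""
    deleted = []
    kwargs = {}
    while True:
        page = sm_client.list_endpoints(**kwargs)
        for endpoint in page["Endpoints"]:
            sm_client.delete_endpoint(EndpointName=endpoint["EndpointName"])
            deleted.append(endpoint["EndpointName"])
        if "NextToken" not in page:
            return deleted
        # More results to fetch: ask for the next page.
        kwargs = {"NextToken": page["NextToken"]}


class FakeSageMakerClient:
    """Stand-in used to exercise the function without touching AWS."""

    def __init__(self, names):
        self.names = list(names)

    def list_endpoints(self, **kwargs):
        return {"Endpoints": [{"EndpointName": n} for n in self.names]}

    def delete_endpoint(self, EndpointName):
        self.names.remove(EndpointName)


fake = FakeSageMakerClient(["kmeans-demo", "pca-demo"])
print(delete_all_endpoints(fake))
```

This only removes the endpoints themselves. Endpoint configurations and models are separate resources and would need their own clean-up.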

The Notebook instance size is also considerable. It’s REALLY worth stopping it when not in use. Or, you can run it on your local machine if it has the required memory resources (my MacBook Pro has 16GB RAM, which is more than enough for this exercise). If you’re planning on doing that, make sure that the configured AWS user that you use on your machine has the ability to assume the SageMaker execution role (the jobs require you to pass it in as a variable).

9. Improvements

At the start of this blog I included an architecture diagram for the whole solution. I also proposed a second version of the application, which would allow users to log into the website, select movies they had previously watched (stored in DynamoDB) and filter those movies out of the recommended results. Here’s an example of how this would work:

Another improvement I’d make would be to add CodeBuild jobs for automated deployments. I didn’t set up a CI/CD pipeline for anything, so this will definitely be part of V2!

I wanted to include Movie posters in the recommendations and title searches. While this information is obtainable via web-scraping of IMDB, or 3rd-party API calls, tying these calls into the API for title recommendations really slowed the site down. I have some ideas for how this would work, namely storing the images in S3 for all titles, iterating over the records in the database using step functions.

Finally, a major improvement I’d make would be to the clustered data. I think using some Natural Language Processing to group movie titles based on plot text would be a fantastic way to approach this. Another way would be to get user rating and viewing data and perform collaborative filtering.

If You’re on the Machine-Learning Journey, Take The Train

I’m really a Machine Learning and Data Science beginner. That being said, the documentation and (especially) the out-of-the-box tools that AWS SageMaker provides for performing Machine Learning are REALLY awesome!

I had a fantastic time learning about what’s required to get data ready for training, what the outputs of ML jobs mean, and particularly, validating how good my model is.

There’s a lot more involved in getting this all working, so please reach out if there’s anything in the code that you want me to explain, or provide references for!

Cloud Guru Challenge - September 2020

2020-10-16T15:15:00+11:00

I gave the #CloudGuruChallenge by Forrest Brazeal (A Cloud Guru) for September a go! I started 1 day before the deadline, so rushed to finish a bit. The next challenge will be worked on straight away, and the Readme files will definitely have more than just titles. “Working software over comprehensive documentation” right?

Here’s some of the stuff that I did! Read the challenge details here: https://acloudguru.com/blog/engineering/cloudguruchallenge-python-aws-etl

Here’s the basic architecture of the solution:

See my code here: https://github.com/simonmackinnon/cloudguruchallenge/tree/main/2020-09

ETL Job using Python

The job runs automatically using a CloudWatch scheduled rule, once per day. Setting this up was pretty straightforward. One gotcha for people is that a CloudWatch rule needs permission to invoke Lambda functions.
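
I set that permission up in CloudFormation, but the equivalent API call looks roughly like the sketch below. The helper only builds the arguments for Lambda's `add_permission` call so it can run without credentials; the function name and rule ARN are hypothetical.

```python
def events_invoke_permission(function_name, rule_arn, statement_id="allow-scheduled-rule"):
    """Build the keyword arguments for Lambda's add_permission call."""
    return {
        "FunctionName": function_name,
        "StatementId": statement_id,
        "Action": "lambda:InvokeFunction",
        "Principal": "events.amazonaws.com",  # the CloudWatch Events service
        "SourceArn": rule_arn,                # restrict the grant to this one rule
    }


# Hypothetical names; substitute your own function and rule ARN.
params = events_invoke_permission(
    "etl-job", "arn:aws:events:ap-southeast-2:123456789012:rule/daily-etl"
)
print(params)
# With credentials configured: boto3.client("lambda").add_permission(**params)
```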

It then loads the data and puts it into a DynamoDB table. I had a pretty fun time coding this. I hadn’t used the Pandas Python library before, so it was good to see the power of it. Some time (what feels like almost a lifetime) ago, I learnt to use R to do data science. This was a pretty similar experience (although zero-indexing helps!). Some of the nuances of the challenge were around only loading the most recent day’s data, so some smarts had to be built into it for that to work. The merge functionality helped to speed up the work.
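
The “only load what’s new” logic boils down to something like this sketch: remember the latest date already in the table and keep only rows after it. The row shape and the `rows_to_load` helper are made up for illustration; the real job worked on Pandas dataframes.

```python
from datetime import date


def rows_to_load(rows, last_loaded):
    """Keep rows dated after `last_loaded`; load everything when the table is empty (None)."""
    if last_loaded is None:
        return list(rows)
    return [row for row in rows if row["date"] > last_loaded]


# Made-up sample rows.
rows = [
    {"date": date(2020, 9, 1), "value": 10},
    {"date": date(2020, 9, 2), "value": 14},
    {"date": date(2020, 9, 3), "value": 21},
]
print(rows_to_load(rows, date(2020, 9, 2)))
```

Handling the empty-table case separately means the very first run does a full load, and every run after that is incremental.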

It sends SNS notifications for different status updates. I set this up as per the brief, although I was pushing to my topic for every row that couldn’t be read correctly, which proved to be a little verbose (at one point I was sending hundreds of error notifications due to a bug in my code). Pretty fun and easy to set up. Then getting CloudFormation to pass the output ARN to the environment variables of the Lambda function kept this relatively re-deployable.

For Reporting, I used API Gateway to expose the DynamoDB table data, and then consumed it in JavaScript. Some of the gotchas in this: (a) while the API request type for performing Scan operations is GET, the HTTP method for DynamoDB service calls is always POST; (b) getting the Integration Request Mapping Template and the Integration Response Mapping Template right for this is a little difficult; (c) ensuring the calls to the API have the correct headers to avoid a CORS / preflight error is always difficult (and something I should spend some time learning about, as it always trips me up). I built a simple vanilla JS demo site (due to time constraints) to retrieve (and sort) the data, then display it using Chart.js.

Anyway, you can access the data URL here:

https://2tp0wsvdr2.execute-api.ap-southeast-2.amazonaws.com/live/cumulativedata

And you can see the graph output here:

http://simonmackinnon.com/cloudguruchallenge-2020-09.html

Infrastructure as Code

Everything is defined in CloudFormation (except uploading the function package to S3 and publishing new versions). Some of this was HAAARD… especially setting up API Gateway to expose the DynamoDB data without using Lambda. This is relatively easy in the console, but I found some of the settings difficult in YAML/CloudFormation. As mentioned above, setting the Mapping Templates for the request/response continuously led to formatting issues… until it didn’t.

TBD:

Lambda layers: the package built was a little big, and some of that could be reduced by using layers, especially for the Pandas library
VPC infrastructure was a little overboard for a single Lambda
CodePipeline to test and publish ETL job function package
CodePipeline to update infrastructure on update
Build React site to display more interactive/multiple graphs
API Keys / Security

A Shortcoming of the AWS Lambda CLI - EventSourceMappings

2020-06-20T14:15:00+10:00

This one only slightly annoyed me, but still thought it was worth mentioning.

Some things are really easy to do in the console vs. via API calls. AWS Lambda “triggers” are a perfect example of this. The general steps are: create a Lambda function, open the Lambda function configuration in the console, click “Add Trigger”, select the source service and configure. Done!

Being able to set up Lambda triggers for a multitude of services from the Lambda console is a really nice (read: simple) way of configuring, and importantly, viewing, which services should invoke the function, without having to navigate to each of the respective services’ consoles. When you add Lambda triggers in this way, you can see a visual list of all of the triggers as one of the first things in the Lambda console. Great, really nice UI/UX!

So, now we want to replicate that experience using CloudFormation or API/CLI commands. You would be clever in thinking you can do all of this using the Lambda CLI, given you can do all this in the Lambda Console. And you’d also be wrong. The API call to produce this (listening) trigger is CreateEventSourceMapping, and the respective CLI command is create-event-source-mapping. If you look at this documentation, you’ll see that the only services for which you can create such a mapping, like you can in the console, are DynamoDB, Kinesis and SQS. Only those three… This is because the Lambda service can essentially “read” events from these services, rather than be asynchronously or synchronously invoked by the triggering service.

aws lambda create-event-source-mapping \
    --function-name CodeCommitLambda-lambdacodecommit-OT2Z33UZKD9O \
    --batch-size 5 \
    --starting-position LATEST \
    --event-source-arn arn:aws:dynamodb:ap-southeast-2:366389342275:table/TestTable/stream/2020-06-20T04:50:40.178

And, of course, you can set up triggers for each service respectively from the API calls for those services, but it only creates the one-way mapping. The Lambda function(s), in this case, have no knowledge or ownership of the triggers set up, for example, from CodeCommit.

aws codecommit put-repository-triggers \
    --repository-name my-webpage \
    --triggers name=MyLambdaTrigger,destinationArn="arn:aws:lambda:ap-southeast-2:123456789012:function:CodeCommitLambda-lambdacodecommit-OT2Z33UZKD9O",customData="",branches=master,events=all

Given this, it’s disappointing that the Lambda console respects the mapping for invoke-type triggers, but there’s no way of even listing these kinds of “mappings” if you’re doing function creation programmatically.

Creating CodeCommit HTTPS Security Credentials With CloudFormation Lambda-based Custom Resource

2020-06-06T00:00:00+10:00

As I have written previously, I’ve just committed myself to achieving the AWS DevOps Professional certification. As part of my study, I’m attempting to work through a hands-on online course. I’ve also committed to doing all demos using only the AWS CLI, SDK or CloudFormation. To force myself to do this, I’ve only granted programmatic access to my IAM user in my training account. The rationale is this: the console makes deploying things easy, and sets most default values for required fields in API calls appropriately. To get a better understanding of the services being used, provisioning in an automated way ensures these values need to be understood.

As I said, I started this course, primed to only work using scripts and Infrastructure as Code (with the aim of using CloudFormation primarily to ensure easy removal of deployed resources). The very first part of the very first demo: create an IAM User and create HTTPS CodeCommit Security Credentials for it. Easy, right? This is a two second job in the console.

And, while there exists an API for this, CreateServiceSpecificCredential, CloudFormation doesn’t support this IAM feature. Enter, CloudFormation Custom Resources!

The steps needed for this resource could have been really simple, as the API call only requires an existing IAM user’s username, and the endpoint of the AWS service to create the credentials for. I wanted to create a simple automation sequence to allow multiple users to be created with this stack.

I don’t have a lot of experience writing Lambda code for CFN Custom Resources, so I used crhelper to help build out the function scaffolding. This library does a crazy amount of the undifferentiated heavy lifting. All that was required was to pass through the username to the create and reset credentials API calls (I used the Python SDK for this).

The code was really simple:

from crhelper import CfnResource
import boto3, json

helper = CfnResource()
iamclient = boto3.client('iam')

@helper.create
def create_https_credentials(event, _):
    user = event['ResourceProperties']['user']

    response = iamclient.create_service_specific_credential(
        UserName=user,
        ServiceName='codecommit.amazonaws.com'
    )

    helper.Data['ServiceUserName'] = response['ServiceSpecificCredential']['ServiceUserName']
    helper.Data['ServicePassword'] = response['ServiceSpecificCredential']['ServicePassword']

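@helper.update
def reset_https_credentials(event, _):
    user = event['ResourceProperties']['user']
    
    # reset_service_specific_credential takes the credential ID rather than the
    # service name, so look up the user's existing CodeCommit credential first
    existing = iamclient.list_service_specific_credentials(
        UserName=user,
        ServiceName='codecommit.amazonaws.com'
    )['ServiceSpecificCredentials']

    response = iamclient.reset_service_specific_credential(
        UserName=user,
        ServiceSpecificCredentialId=existing[0]['ServiceSpecificCredentialId']
    )

    helper.Data['ServiceUserName'] = response['ServiceSpecificCredential']['ServiceUserName']
    helper.Data['ServicePassword'] = response['ServiceSpecificCredential']['ServicePassword']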

@helper.delete
def no_op(_, __):
    pass

def handler(event, context):
    print("Started execution of HTTPS Credentials Creator Lambda...")
    print("Function ARN %s" % context.invoked_function_arn)
    print("Incoming Event %s " % json.dumps(event))
    
    helper(event, context)

You can check out (and use) the code for this here: https://github.com/simonmackinnon/codecommit-httpscreds-cloudformation. This repo has CloudFormation templates to deploy single-time resources, as well as to create an IAM user and output the corresponding Access Keys and the CodeCommit HTTPS Security Credentials. Feedback super welcome.

Anyway, at this rate, the 20-hour long course will probably take me about a year to complete, ha ha ha!

Course Review: A Cloud Guru, Advanced AWS CloudFormation- Adrian Cantrill

2020-05-28T20:15:00+10:00

Course URL: https://learn.acloud.guru/course/aws-advanced-cloudformation/dashboard

TL;DR

Do this course! Awesome and fun content using practical templates provided and evolved to match the skills being taught. Perfect course introducing some complex and advanced topics for AWS CloudFormation. Thanks Adrian!

Long Version

After completing the AWS Associate Certification trifecta late last year, and Azure Fundamentals earlier this year, I took a break from study to figure out what path of learning I wanted to do next. Given I work as an AWS Cloud Engineer, I thought the AWS DevOps Professional certification would be highly relevant as well as an awesome opportunity to learn some new concepts and technology.

This blog is a really good starting point (I think) to what needs to be learnt/studied for this certification. I love Infrastructure as Code, and this post recommended doing the A Cloud Guru - Advanced AWS CloudFormation course to brush up on CloudFormation skills.

I loved this course. It posed some business challenge case studies, in two fictitious companies. This made the learning much more realistic.

The course content provides the templates to be deployed. For the first case-study, I re-wrote this, iterating on it as the course progressed. This meant that I got hands-on experience writing the CFN templates, and importantly experienced all of the troubleshooting that comes along with doing so.

For those who are unfamiliar, Infrastructure as Code is a way of declaring in a text file (of some kind) the infrastructure resources, as well as their configuration, that you desire to be created. There are many different libraries, frameworks and services to do so. For AWS, CloudFormation is the native service that they provide to manage this. Some of the advantages of this service over its competitors are the easy integration into your AWS account, ease of learning/setup, as well as a slight security win (looking at you Terraform with your plain-text state-files!).

Some really cool concepts are taught in this course. One of my favourites is how cfn-hup is explained, as this is something that has always seemed confusing to me. Being able to have EC2 resources detect changes in their own metadata and run some specified commands is really cool. Tying this to re-running the cfn-init process after a change is detected is a powerful mechanism for triggering a reload of the instance setup commands when a stack is updated.

The course was, I believe, recorded around 2017/18, so some of the screens in the console are a little out-of-date, although nothing had changed dramatically since then. At one point, we are required to create some Google web authentication credentials to use in an app we create. The steps around this had changed slightly, but the accompanying instructions from ACG helped to navigate these changes.

Another area of learning in this course, that piqued my interest, was CloudFormation custom resources using Lambda. I’ve known about this feature of CFN for some time, and the idea had always interested me. Adrian teaches this content in a very simple manner, especially how the resource lifecycle works using the resource properties/attributes and what the functions’ responses need to contain for it to all work. From these small and simple demos, we automatically allocated CIDR ranges for a multi-environment application within a VPC, a task that normally would require networking knowledge and manual entry. Through this example, Adrian showed the awesome power of extending CloudFormation using Lambda-based custom resources.
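
To give a flavour of the CIDR allocation idea (this is my own minimal sketch, not the course’s code), Python’s built-in `ipaddress` module can carve a VPC range into equally sized blocks. A Lambda-backed custom resource could return values like these to the stack; the VPC range and environment names below are hypothetical.

```python
import ipaddress


def allocate_subnets(vpc_cidr, names, new_prefix=24):
    """Hand out consecutive /new_prefix blocks from vpc_cidr, one per name."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return {name: str(next(blocks)) for name in names}


# Hypothetical VPC range and environment names.
print(allocate_subnets("10.0.0.0/16", ["dev", "test", "prod"]))
```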

Overall, the design/architecture pattern implemented could be used as the foundations for your own projects, etc. even in a work/production setting. Definitely templates that I’ll be hanging onto for some time!!!

Amazon Linux 1 AMI Usage and Upgrade Issue:

Only one real issue (other than superficial issues related to POC nature of apps/environments). The EC2 instances used in the templates were based off of the Amazon Linux 1 AMI. Given this image type is flagged for End-Of-Life at the end of 2020 this is somewhat problematic. For the first case-study, I updated the template(s) to use Amazon Linux 2, which proved difficult. The cfn-init config packages command has difficulty installing an appropriate version of PHP for WordPress to run when the yum ‘php’ package is used. If the default packages are used, the following error occurs in WordPress:

“Your server is running PHP version 5.4.16 but WordPress 5.2 requires at least 5.6.20.”

To overcome this, we need to install PHP v7.2 or later using amazon-linux-extras. Unfortunately, this isn’t available in the cfn-init configuration packages section. To get this to install, I had to add the following command to my install_wordpress configuration:

commands:
    enable_php:
        cwd: "~"
        command: "amazon-linux-extras install php7.2"

In any case, that seemed to be one of the only issues when upgrading the instance to Amazon Linux 2.

Overall

Pretty stoked to get through this. As with any Infrastructure course, the time taken to get through the content if you do the demos yourself is always a lot longer than the course length, with lots of waiting for stacks to provision/update/delete. Great starting point to move to automation in an AWS native way!

Why (the Hell) is Snapshot the CloudFormation RDS Default Deletion Policy??

2020-05-24T14:00:00+10:00

This one blew me away. If you create an RDS instance using any other method, the default deletion behaviour is just to delete. If you create the same configured instance using CloudFormation, the default deletion behaviour is to create a manual snapshot.

I was happily working through an A Cloud Guru CloudFormation course, spinning up CFN stack after CFN stack. In between I was deleting the stacks, thinking this was cleaning up my environment. Little did I know, there were RDS snapshots being created each time I killed my “immutable” infrastructure.

So, after plugging away at the course, I found that the default Deletion Policy for resources created using CloudFormation is to delete… Except for RDS… Grrrr!

In order to ensure my test infrastructure was properly cleaned up after being done, I had to ensure the DeletionPolicy field was manually set to “Delete” for the RDS instances being created:

Resources:
    DB:
        Type: "AWS::RDS::DBInstance"
        DeletionPolicy: Delete
        Properties:
            ...

The AWS documentation for Deletion Policies says that:

“The default policy is Snapshot for AWS::RDS::DBCluster resources and for AWS::RDS::DBInstance resources that don't specify the DBClusterIdentifier property.”
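
A quick way to catch this before deploying is to scan the template for RDS resources that don’t set DeletionPolicy at all. The sketch below assumes the template has already been parsed into a dictionary; the `resources_defaulting_to_snapshot` helper and the sample resources are made up for illustration.

```python
SNAPSHOT_BY_DEFAULT = {"AWS::RDS::DBInstance", "AWS::RDS::DBCluster"}


def resources_defaulting_to_snapshot(template):
    """Return logical IDs of RDS resources that don't set DeletionPolicy explicitly."""
    flagged = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") not in SNAPSHOT_BY_DEFAULT:
            continue
        # Instances that belong to a cluster don't get the Snapshot default themselves.
        is_cluster_member = (
            resource["Type"] == "AWS::RDS::DBInstance"
            and "DBClusterIdentifier" in resource.get("Properties", {})
        )
        if "DeletionPolicy" not in resource and not is_cluster_member:
            flagged.append(logical_id)
    return flagged


# A template as it would look after parsing the YAML/JSON into a dict (made-up resources).
template = {
    "Resources": {
        "DB": {"Type": "AWS::RDS::DBInstance", "Properties": {}},
        "CleanDB": {"Type": "AWS::RDS::DBInstance", "DeletionPolicy": "Delete", "Properties": {}},
        "Bucket": {"Type": "AWS::S3::Bucket"},
    }
}
print(resources_defaulting_to_snapshot(template))
```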

What’s crazy is that this isn’t evident unless you see the snapshots being created. The CFN events for the stack don’t show the snapshot being created, just the DB being deleted:

I can (kind of) understand why AWS would want a snap to be taken before deleting a stack. Data loss is likely a bigger issue for them than an extra storage cost would be. I just think they could have done a better job of making this default behaviour salient, especially for those just getting started with CloudFormation.

Alright, off to delete some RDS snapshots…
