AWS Solutions Architect Professional Certification
Published on October 24, 2024
Solutions Architect Professional: Why?
Amazon Web Services is the juggernaut of cloud computing. When I work with large tech-first companies, the question isn’t whether they use AWS; it’s whether they use anything else in addition to AWS. AWS is what Microsoft was in the ’90s.
You could learn AWS by studying on your own, but the certificate gives you a concrete goal to study for and a little proof of your expertise at the end. So why not do it?
Out of all the AWS certificates, I chose the Solutions Architect Professional certificate because I like a challenge. I’m hoping to get equivalent certs on the other two major clouds soon.
The Challenge
I did have a background in AWS before I started, but no more than the average software engineer. I had never been responsible for running a large AWS deployment, and I had never used the vast majority of AWS functionality.
I studied throughout summer 2024, and took the test in September.
My AWS resume (before starting)
Just so you know where I was coming from before I started studying for this test:
I heard about AWS for the first time in 2009, and immediately built a static site to host thousands of my own photos. I also messed around with using it as a workstation with VNC (this didn’t work terribly well). In 2011 I was hired on a team to build out a high-performance backup product that supported S3. In 2016 I was the security lead for a product running on AWS. In 2020-2022 I did some security consulting at companies that use AWS heavily.
In all of these roles, my primary job responsibility was writing code, not maintaining cloud infra. From that starting point, it still took a lot of studying to get to the SA Pro level.
If you have less experience than me, it might be a good idea to start with an easier exam.
The Process
I used three primary resources to study for the exam:
- AWS Documentation. AWS spends an enormous amount of time crafting high-quality documentation, so it makes sense to use it. It is well-written and even calls out potential “gotcha” areas. But it is a mountain of documentation, so you have to be selective.
- ACloudGuru’s courses. I watched all the videos for the associate-level and professional-level exams. They are roughly 30 hours of content, combined, and include practice exams and hands-on labs.
- Adrian Cantrill’s course. This one is about 70 hours. It also includes a few practice exams.
Watch at 2x speed
Flash cards
What to study
History and context
It helps to know a bit of history. S3, EC2, and SQS were the original three AWS services; EBS, IAM, and VPCs were bolted on later, and everything else came after that. This context explains why S3 has its own permissions mechanism separate from IAM, or why instance stores exist outside EBS.
VPCs were added in 2009, but only made mandatory in 2021! Before VPCs, instances were connected directly to the Internet. EC2 without a VPC is called “EC2-Classic.” It’s likely a few non-VPC instances are still running today, and various bits of AWS do not actually depend on you having a VPC.
There have been many high-profile breaches caused by insufficient S3 access permissions. Hence the multiple attempts at adding better security to S3.
Compute
- There is less on the exam about core compute than I originally expected.
- Paravirtual (PV) vs. HVM virtualization
- AMIs
- Placement groups
- Behavior around rebooting, stopping, and editing instance details
- Launch templates
- IMDSv2 (see the sketch after this list)
- Reserved instances (convertible and not), and selling reserved instances
- Spot instances
- Dedicated instances
- Fabric (Elastic Fabric Adapter)
- Amazon Linux 2
- Graviton
- GPU instances
- Memory encryption (SEV)
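IMDSv2 comes up constantly in security questions, so it’s worth seeing the token handshake concretely. Here’s a minimal sketch in Python, stdlib only; it can only actually run on an EC2 instance, and the endpoint and header names are the documented ones:

```python
# Minimal IMDSv2 flow: fetch a short-lived session token first, then
# present it on every metadata request. IMDSv1 skipped the token step,
# which is what made it vulnerable to SSRF attacks.
import urllib.request

IMDS = "http://169.254.169.254/latest"

# Step 1: PUT to get a session token (21600 seconds is the maximum TTL).
req = urllib.request.Request(
    f"{IMDS}/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
)
token = urllib.request.urlopen(req).read().decode()

# Step 2: GET metadata, presenting the token.
req = urllib.request.Request(
    f"{IMDS}/meta-data/instance-id",
    headers={"X-aws-ec2-metadata-token": token},
)
print(urllib.request.urlopen(req).read().decode())
```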
Queues
- SQS (see the send/receive sketch after this list)
- SNS (easy to mix up with SQS!)
- Amazon MQ
- Kinesis
- Firehose. Firehose was originally called Kinesis Firehose, which was very confusing since it had little to do with Kinesis.
- Managed Kafka
- IoT MQTT
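The easiest way to keep SQS and SNS straight is to have pushed a message through each at least once. A minimal boto3 round trip for SQS, assuming a queue named demo-queue (a made-up name) already exists and credentials are configured:

```python
# SQS round trip: send, long-poll receive, explicit delete.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="demo-queue")["QueueUrl"]

sqs.send_message(QueueUrl=queue_url, MessageBody="hello")

# Long polling (WaitTimeSeconds > 0) is almost always what you want.
resp = sqs.receive_message(
    QueueUrl=queue_url, WaitTimeSeconds=10, MaxNumberOfMessages=1
)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # Received messages must be deleted explicitly, or they reappear
    # after the visibility timeout (a classic exam point).
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```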
Networking
- AWS system structure:
- Three major partitions (public/commercial, GovCloud, and China)
- Regions
- Availability zones
- Local zones
- Wavelength (super-local zones)
- VPCs
- CIDR blocks
- Internet gateways (regional)
- NAT gateways (zonal) (expensive!)
- Gateway endpoints (used for private S3 and DynamoDB access)
- VPC Peering
- IPv6 addressing and egress
- Client VPN endpoint; link aggregation for VPNs; authentication for VPNs
- IPAM
- Routing tables
- Subnet allocation
- Splitting subnets
- What to do when peering two VPCs with overlapping subnets
- Transit through a VPC is (mostly) not allowed
- Recommended architecture: VPC per project with a central Shared Services VPC
- Overall, need to have a strong understanding of which VPC resources are zonal and which are regional
- Flow logs
- Mirroring
- NO multicast support!
- DNS
- Customizing VPC DNS through DHCP Option Sets
- Split DNS
- Configuring DNS with instance names
- Route 53 registration and zone hosting
- DNS based load balancing based on latency and geography
- DNS health checks
- Client VPN
- Site-to-site VPN
- Transit gateway
- Global accelerator (GAX)
- Direct Connect (DX)
- DX is hard/impossible to experiment with on your own!
- MACsec
- Link aggregation with two or more DX connections
- Differences between dedicated DX and managed service provider DX
- ENIs
- Security groups
- Using one security group as a source in a different security group’s rules (see the sketch after this list)
- Difference between Security Group and NACL
- Performance basics: accelerated network interfaces, fabric
- Attaching one EC2 instance to multiple VPCs
- Shield / Shield Advanced
- Auto Scaling Groups
- Especially behavior in balancing across zones
- Load balancers
- NLBs
- ALBs
- TLS termination
- Interaction with Auto Scaling Groups
- In practice, you want most things in AWS to be behind some kind of load balancer, as you want fine-grained control
- API Gateway
- REST mode understands requests and provides more detailed functionality (doesn’t actually HAVE to be REST)
- Non-REST mode just uses HTTP
- VPC Lattice
- This is not actually on the test yet, but it seems to be AWS’ big new networking thing, so it probably will be soon.
- CloudFront
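The security-group referencing trick flagged above is easier to remember once you’ve seen the API shape. A boto3 sketch, where both group IDs are hypothetical placeholders:

```python
# Allow an app-tier security group to accept traffic from a web-tier
# security group, by referencing the group ID instead of a CIDR block.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0aaaaaaaaaaaaaaaa",  # hypothetical app-tier SG
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        # Reference another SG rather than an IP range; membership is
        # evaluated live, so Auto Scaling can add and remove instances
        # without any rule changes.
        "UserIdGroupPairs": [{"GroupId": "sg-0bbbbbbbbbbbbbbbb"}],
    }],
)
```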
Storage
- FSx – understand all modes
- Storage gateway – again, understand all modes
- Instance store
- EBS
- Basic performance and reliability characteristics
- Expanding a volume
- Snapshots
- RAID arrays of EBS volumes and when to use
- S3
- S3 is the core storage service for AWS.
- S3 gateways in the VPC
- Bucket policies & IAM
- Origin access control
- Replication between regions
- Tiering/service levels
- Object lifecycle management
- Mounting with Amazon’s Mountpoint for S3 (mount-s3) utility, or syncing with the AWS CLI
- Pre-signed requests for upload and download (see the sketch after this list)
- Transfer
- Basically a serverless FTP/SFTP endpoint that can talk to your buckets.
- DataSync
- Sophisticated tool for moving data between S3 and other services on a schedule or driven by events.
- S3 Object Lambda
- Triggering events from S3 actions (both within S3, and also CloudWatch Events)
- S3 Express One Zone
- This is a very different service that happens to be under the S3 brand name. It has totally different semantics from S3.
- EFS
- Use on Linux (Windows isn’t supported; that’s what FSx for Windows File Server is for)
- Transit encryption – this actually just uses stunnel at the application layer!
- Pricing (this one’s expensive)
- Snowball Edge
- This started as a storage device, but now has a lot of compute capabilities too.
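Pre-signed requests (flagged in the S3 list above) are a favorite exam topic. A quick boto3 sketch, with placeholder bucket and key names:

```python
# Generate pre-signed URLs for download and upload.
import boto3

s3 = boto3.client("s3")

# Anyone holding this URL can GET the object until it expires,
# using the permissions of the identity that signed it.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket", "Key": "photos/cat.jpg"},
    ExpiresIn=3600,  # seconds
)
print(download_url)

# For browser uploads, a pre-signed POST lets the form supply the
# filename while you control the bucket and expiry.
upload = s3.generate_presigned_post(
    Bucket="my-example-bucket",
    Key="uploads/${filename}",
    ExpiresIn=3600,
)
print(upload["url"], upload["fields"])
```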
Databases
- RDS
- Data migration
- Schema migration
- Supported databases
- MySQL
- Postgres
- SQL Server
- Oracle (it works a bit differently)
- DB2 (also a bit odd)
- Gaps in functionality between these, e.g. a good deal of stuff is not available on DB2
- Babelfish
- Replication and promotion
- Maintenance windows
- Backup RTO and RPO
- Aurora
- Despite the marketing, Aurora is basically RDS with an improved storage layer.
- Global databases
- Authentication for these
- DynamoDB
- Read the Dynamo paper!
- Partition and sort keys (see the query sketch after this list)
- Quotas and performance management options
- DAX
- AWS recommends at least 3 nodes for production!
- DynamoDB gateways in the VPC (again)
- Managed OpenSearch (formerly Elasticsearch)
- Athena
- Redshift
- Redshift Spectrum
- Caches:
- Managed Memcached
- Managed Redis/Valkey
- You almost always want Redis/Valkey. Memcached is usually a red herring.
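For the partition/sort key material flagged above, a small sketch helps: a DynamoDB Query must pin exactly one partition key value and can only range-filter on the sort key. The table and attribute names here are invented:

```python
# Query a composite-key table: partition key "customer_id",
# sort key "order_date".
import boto3
from boto3.dynamodb.conditions import Key

ddb = boto3.resource("dynamodb")
table = ddb.Table("orders")  # hypothetical table

resp = table.query(
    KeyConditionExpression=(
        Key("customer_id").eq("cust-42")          # must be an equality
        & Key("order_date").begins_with("2024-")  # sort key may be a range
    )
)
for item in resp["Items"]:
    print(item)
```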
Security
- Accounts, users, and organizations
- ARN format
- Cognito
- Identity federation for applications
- Identity federation into AWS itself (via roles)
- IAM
- Need to be able to write and read IAM policies fluently.
- Understand resource and user policies
- Understand roles
- Understand using roles across accounts
- IAM Federation using SAML and OIDC
- Tricky stuff: you can never block all services in the us-east-1 region, even if you want to limit users to only one other region, because global services live there (see the policy sketch after this list)!
- Auditing and checking that policies work as expected
- Permissions boundaries
- Service control policies
- MFA setup
- AWS-managed policies
- How to share resources between accounts
- PrivateLink endpoint interface permissions
- STS
- Simple AD (basically Samba)
- Managed Active Directory (basically Windows Active Directory)
- Federating with your own AD server across DX or VPN
- RAM
- Share a VPC between accounts
- Certain resources can be shared with RAM
- Kind of a kludge compared to using proper IAM policies to accomplish the same thing
- Control Tower: know basics
- Landing Zone: know basics (but really you should just use this)
- EC2 Instance Login
- Login to instances with Instance Connect
- Login to instances with Systems Manager Session Manager
- Behavior of SSH keys in the EC2 console
- GuardDuty
- General threat detection service
- Macie
- Special tool for S3 security only (strange name)
- Firewall Manager
- Web Application Firewall
- Config
- Tracks versions for every AWS object in your account
- Remediates configuration drift
- Expensive
- CloudTrail
- Logs every action that affects AWS objects
- Understand differences between CloudTrail, CloudWatch, and Config
- Security Hub
- Inspector
- Limiting what users can deploy using Service Catalog or CloudFormation Stacks
- Tag policies
- PCA
- Run a CA for your own stuff
- KMS
- Heart of all encryption functionality in the AWS universe
- HSM
- Really for compliance requirements; very expensive
- Roles Anywhere
- Access APIs with certificates instead of STS tokens
- IRSA
- Access roles from Kubernetes
- Shared responsibility model
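To make the us-east-1 gotcha above concrete, here is roughly what a region-lockdown policy looks like: you have to exempt the global services homed in us-east-1 or you lock everyone out of them. The region choice and the NotAction list are illustrative, not exhaustive:

```python
# An SCP-style policy denying all actions outside eu-west-1, with a
# carve-out for global services that physically live in us-east-1.
import json

region_lockdown = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideEuWest1",
        "Effect": "Deny",
        # NotAction: denying these global services would break IAM
        # logins, role assumption, DNS, and CDN management everywhere.
        "NotAction": [
            "iam:*", "sts:*", "organizations:*",
            "route53:*", "cloudfront:*", "support:*",
        ],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1"]}
        },
    }],
}
print(json.dumps(region_lockdown, indent=2))
```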
Billing
- AWS Budgets
- AWS Cost Explorer
- Having a single billing (management) account within an org
- Marketplace
- Billing for APIs that you create using API Gateway
Serverless
- ECR
- ECS
- Fargate
- Can run under ECS or EKS
- Can even run locally now
- Lambda (see the handler sketch after this list)
- Versioning
- 15-minute limit
- Attaching to VPC
- Attaching to EFS
- Roles
- Base images
- Building new base images
- Step Functions
- Mini programming language for building state machines out of multiple lambda functions
- Can build an entire system using lambda and step functions
- Batch
- Using CloudWatch Events (now EventBridge) to auto-trigger things
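To ground the Lambda items above, here is a minimal handler. This sketch assumes an S3 ObjectCreated trigger; other event sources deliver differently shaped payloads:

```python
# Minimal Lambda handler for an S3 ObjectCreated trigger.
import json

def handler(event, context):
    # Each S3 record names the bucket and the object key.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")
    # API Gateway-style invocations expect a statusCode/body response;
    # S3 triggers ignore the return value.
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```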
IoT
- Surprisingly, there were several questions on this, even though it’s a very niche area
- Provisioning with certificates
- Queueing messages with MQTT
- Device updates
CloudFormation
- You do not have to know how to write a template
- But you do have to know how stacks deploy, how to use parameters, and how to nest templates (see the sketch below)
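You won’t write templates on the exam, but seeing one deploy call clarifies where parameters fit. A boto3 sketch, where the template file and parameter name are hypothetical:

```python
# Deploy a CloudFormation stack with a parameter override.
import boto3

cfn = boto3.client("cloudformation")

with open("network.yaml") as f:  # hypothetical template file
    template_body = f.read()

cfn.create_stack(
    StackName="demo-network",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "VpcCidr", "ParameterValue": "10.0.0.0/16"},
    ],
    # Required whenever the template creates named IAM resources.
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack finishes creating (or fails and rolls back).
cfn.get_waiter("stack_create_complete").wait(StackName="demo-network")
```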
Migration
- Know the services AWS provides for migration
- Application discovery on VMware
- Migration Hub
- Mainframe migration tools
- The 3 R’s (rehost, replatform, refactor), and building a migration strategy
Observability
Other services
- Workspaces
- EMR basics
- Elastic Transcoder & other video services
- Polly
- Glue
- Marketplace basics
- VMware Cloud on AWS - why use it?
- EKS - extreme basics only, this is not a Kubernetes test
- Elastic Beanstalk - getting old but still relevant
- Support tiers
- Outposts
- Look at all the stuff AWS offers for running locally on your own hardware
- License Manager
Don’t bother learning
- CloudFormation details
- Terraform, Ansible, Chef, Puppet (although having a basic knowledge of all would be useful)
- Specific command lines
- More than basic knowledge of using the console
- AWS APIs
- Pricing details
- Names or characteristics of specific instance types
- Very advanced topics in networking and performance
- HPC
Vestigial bones
Just like humans have a few vestigial bones from back when we had tails and ate raw plants, AWS has some vestigial bones from its early architecture. It’s evolved over time, but you can see traces of the path it took to get there.
Traces of AWS evolution
Some parts of AWS include redundant or confusing functionality, and are clearly intended for backward-compatibility and stability. Other clouds which came after AWS have fewer such warts.
- S3 has substantially different permissions semantics from all other AWS services.
- S3 also has ad-hoc links to other services rather than using EventBridge.
- PrivateLink for S3 and DynamoDB works completely differently from PrivateLink for every other service, probably because those two were supported first, before a design change.
- Security groups can restrict access from other security groups, but not from VPCs (because security groups were invented before VPCs).
- NACLs and security groups have overlapping functionality.
- SNS and SQS are somewhat redundant to each other, and neither really provides the pub/sub message queue that is the standard on other cloud platforms.
- Availability zone IDs were originally completely hidden from the user and randomized per account. Later, people wanted resources in two different accounts to land in the same AZ, so AWS now exposes AZ IDs, but it still uses the old randomized naming most of the time.
- AWS first had accounts, then had multiple users within one account, and then added organizations as groupings of accounts. At least 90% of the time, you want your company to have one locked-down organization structure and make it impossible to share things outside that organization. This is pretty difficult to achieve with AWS.
It’s useful to know roughly when AWS added new features, in order to understand why it has surprising idiosyncrasies.
Pricing
The test does not cover pricing. This makes some sense, since most companies have negotiated agreements that come with substantial discounts, and AWS changes pricing all the time.
But you always need to remember that AWS wants you to run up as large a bill as possible. Often, the “recommended” approaches will incur hundreds or thousands of dollars a month in unnecessary expenses, when a simpler and cheaper solution is available.
AWS is a bit like Amazon.com: They already have your credit card number, and make it very easy to spend a lot of money that you don’t really need to. And they don’t necessarily want to train their certified experts in how to reduce costs.
For real-world use, remember:
- AWS systematically under-prices compute and storage, and then makes up the difference on network bandwidth pricing. I suppose they figure that if your company is just starting out, it doesn’t really need much bandwidth yet; then when it’s successful you’ll have money to spend on bandwidth (and are already locked in). If you have bandwidth-heavy stuff like video though, use something else!
- The exam’s networking sections cover PrivateLink (Gateway and Interface Endpoints) and Transit Gateways. Both carry significant added cost, even though they feel like convenience features that shouldn’t cost extra.
- Lambda is unbelievably expensive compared to just hosting the same thing yourself on EC2. It’s convenient, and useful for lots of smaller projects, but anything that’s going to blow up probably needs to be moved back to traditional hosting. Containers are a nice middle ground: you can start with Lambda as the runtime for an image, then switch the same image to run on ECS (on Fargate or EC2), which is much cheaper once you have real volume.
- RDS and Aurora get really expensive, primarily because they need a ton of memory, which is often the most expensive variable in AWS instances. If at all possible, avoid them or minimize their use.
- Advanced services around security are all incredibly expensive. For example, I know companies that have accidentally spent tens of thousands of dollars in a month just by turning on Config. Presumably they have customers who really want those services, and are willing to pay big money.
The cloud can be cheaper than running things in a data center, but it’s not automatically cheaper. Many AWS features are designed to lock you in to AWS, so your bill can only grow. Any real architect needs to understand the pricing structure and steer away from the money pits.
The most important thing
The test is extremely time-limited. The questions and answer choices are long, complicated, and often confusing. You only have about two minutes per question, so you’ll have to read and understand very quickly. Many people who know all this material won’t be quick enough to answer the questions in the time allotted.
It’s vital to take practice tests in order to get into the rhythm of answering these long, winding, multi-part questions. I found ACloudGuru’s practice tests to be very good, and did all of them (some multiple times).
What can we surmise about AWS internals?
Naturally, AWS doesn’t tell us very much about how their system works under the hood. However, it’s fun to think about how AWS must work internally, in order to have the observed behavior. I’ve made a few educated guesses that help me reason about how AWS must work.
Guesses about AWS internals
All of this is speculation. I have no inside information.
AWS’ networking must have some kind of metadata for each packet. It’s apparent that various features know the originating ENI, security group, VPC ID, and account ID of individual packets flowing across their network, so I would guess there is an encapsulation layer which wraps each individual packet across their network.
While we can’t know for sure, it seems like around 2016, AWS switched from a Xen hypervisor to a KVM-based hypervisor. This is clear from the types of Linux drivers used in newer instance types, which are similar to KVM. They must still have some Xen around though, since you can still launch the old instances.
While AWS whitepapers talk about a single node having “Nitro cards,” it is probably just one off-CPU chip that can do TPM, network encapsulation, storage initiator work (local SCSI to network), and other functionality. It must be an interesting piece of hardware, since it can handle 40 Gbps at line speed. There are some network cards (DPUs) that can do this kind of thing now, but they are much more recent than Nitro is.
It’s pretty clear to me that RAM is the primary limiting factor in the number of EC2 instances that can be deployed per node. Cost of instances varies closely with RAM allocation.
AWS core services seem to be totally dependent on the us-east-1 region. Almost every global service depends on this region. I would guess IAM, STS, S3 metadata, and DynamoDB are based there. If you block access to this region in an IAM policy, a bunch of things break.
Every single thing in AWS needs KMS: encrypting storage, RAM, and network traffic, plus explicit calls to KMS itself, all in real time (or everything starts to fail). I would guess this is one of the most widely replicated services in the entire system. Perhaps every rack has its own KMS instance.
Access Key IDs and Secret Access Keys probably started as opaque random tokens. Then, as the system grew more complex, they added session tokens which are cryptographically verifiable assertions. Is there a big global replicated database of all the Access Key IDs and Secret Access Keys, or have they switched that to signed assertions too? It’s hard to say.
S3 likely uses erasure coding to store objects. If there are 3 sub-copies, it can likely survive the loss of any one at a time, with some background process rebuilding a damaged sub-copy.
S3 Express One Zone has the semantics of a regular directory on a standard filesystem. It probably is just a regular directory on a standard filesystem.
EBS has fast snapshots and also seems to be able to incrementally load data as systems use blocks. It is probably similar to bcache on Linux, implementing an LSM-based block store. I would guess there are dedicated nodes for serving EBS with lots of NVRAM storage.
All features of Aurora seem to be implemented at the storage layer, not the database layer. I would guess Aurora is really just a fancy edition of EBS.
I suspect Lambda runs on separate physical nodes from EC2. The reason is that Lambda VMs use Firecracker to start up extremely quickly, which isn’t possible on ordinary EC2. I would expect that other major services like S3 and SQS run on separate physical hardware as well.
Both NAT Gateways and PrivateLink are zonal and rather expensive. I suspect they are just implemented as EC2 instances running NAT code under the hood, rather than being built into the VPC architecture itself.
Glacier probably uses a combination of a Massive Array of Idle Disks (MAID) and tape storage for extremely infrequently used data. Its 4-hour SLA is typical of tape retrieval times.
IAM policy evaluation is extremely complicated. I suspect they have whole teams of people just designing and testing policies. It seems very easy for a slip-up to accidentally expose key services to attacks. There are newer approaches to policy design, such as Zanzibar and OPA, which are equally expressive but easier for humans to understand.
Is this a useful certificate?
Yes! Understanding the array of options for building things on AWS is extremely helpful. You could easily save a company millions of dollars by going down the right path instead of the wrong one.
However, just taking a test is very different from hands-on experience. You don’t have to find a problem with a failing network configuration, or know how to actually write a query for any of the many databases. You could come out of this test knowing how to whiteboard a complex system perfectly, but without the experience to actually do it.
I wish the material covered modern immutable infrastructure and declarative configuration in more detail. Unless you’re migrating old stuff, there’s absolutely no reason to have traditional always-on services configured through a console in a cloud environment.
Fun things I learned along the way
- Adrian Cantrill is an incredible teacher. I do my own technical trainings sometimes, and I hope I can make mine half as good!
- NAT gateways are pretty expensive for personal use (at least $30/month). Can I avoid them by using IPv6 (so no NAT is required)? YES! Although a few key sites like GitHub still don’t work on IPv6.
- You can create multiple accounts on AWS using username+account1@gmail.com, username.account2@gmail.com, etc. Each gets $300 in AWS starting credits.
- ChatGPT is a great study buddy. It has very good knowledge of AWS, perhaps because so many people write about it.
- I posted about passing the test on Twitter, and got what really appears to be a personal congratulations from AWS. I didn’t tag anyone there. Very kind of them!
- Apparently if you pass all the AWS tests, they send you an exclusive gold jacket. This would be a fun challenge, especially if you can get someone else to pay for all the courses and exam fees!
Cost of getting the certification
I was paying for all of this out-of-pocket, so cost was a concern. In total, I spent around $580:
- Around $80 for Adrian Cantrill’s course (one-time fee)
- $35/month for ACloudGuru access, adding up to about $150
- $300 exam fee
- About $50 of AWS fees on top of the $300 in free starting credits they give you (I accidentally left something running for a month)
I think it’s worth it, but I understand not everyone has that much cash available up-front. If you can, see if your employer will pay for these expenses.
If you pass one exam from AWS, they give you a 50% off discount on your next exam. So by taking an easier (and cheaper) exam first, you may be able to save $150.
Conclusion
This was a fantastic journey, and I’m glad I did it. I went from having detailed knowledge about a few specific AWS areas, to deep knowledge about almost every user-facing service they provide. It took me around 200 hours of study over a few months to get to that level. I would compare it to roughly the same effort as an advanced-level college course.
I’m not sure if I’ll ever be responsible for AWS in an operational capacity in the future, but I’m confident I can at least communicate with the people who are, and speak the same language.
Plus it was a fun challenge!
What’s next?
I’m hoping to get high-level certifications on all 3 clouds, do the CKA, and pick up a few security certs! So it’s a long road ahead for me.