AWS Solutions Architect Professional Certification
Published on October 24, 2024
Solutions Architect Professional: Why?
Amazon Web Services is the juggernaut of cloud computing. When I work with large tech-first companies, the question isn’t whether they use AWS; it’s whether they use anything else in addition to AWS. AWS is what Microsoft was in the ’90s.
You could learn AWS by studying on your own, but the certificate gives you a concrete goal to study for and a little proof of your expertise at the end. So why not do it?
Out of all the AWS certificates, I chose the Solutions Architect Professional certificate because I like a challenge. I’m hoping to get equivalent certs on the other two major clouds soon.
The Challenge
I did have a background in AWS before I started, but no more than the average software engineer. I had never been responsible for running a large AWS deployment, and I had never used the vast majority of AWS functionality.
I studied throughout summer 2024, and took the test in September.
My AWS resume (before starting)
Just so you know where I was coming from before I started studying for this test:
I heard about AWS for the first time in 2009, and immediately built a static site to host thousands of my own photos. I also messed around with using it as a workstation with VNC (this didn’t work terribly well). In 2011 I was hired on a team to build out a high-performance backup product that supported S3. In 2016 I was the security lead for a product running on AWS. In 2020-2022 I did some security consulting at companies that use AWS heavily.
In all of these roles, my primary job responsibility was writing code, not maintaining cloud infra. From that starting point, it still took a lot of studying to get to the SA Pro level.
If you have less experience than me, it might be a good idea to start with an easier exam.
The Process
I used three primary resources to study for the exam:
- AWS Documentation. AWS spends an enormous amount of time crafting high-quality documentation, so it makes sense to use it. It is well-written and even calls out potential “gotcha” areas. But it is a mountain of documentation, so you have to be selective.
- ACloudGuru’s courses. I watched all the videos for the associate-level and professional-level exams. They are roughly 30 hours of content, combined, and include practice exams and hands-on labs.
- Adrian Cantrill’s course. This one is about 70 hours. It also includes a few practice exams.
Watch at 2x speed
Flash cards
What to study
History and context
It helps to know a bit of history. S3, EC2, and SQS were the original three AWS services; EBS, IAM, and VPCs were bolted on later, and everything else came after that. This context explains why S3 has its own permissions mechanism separate from IAM, or why instance stores exist outside EBS.
VPCs were added in 2009, but only made mandatory in 2021! Before VPCs, instances were connected directly to the Internet. EC2 without a VPC is called “EC2-Classic.” It’s likely a few non-VPC instances are still running today, and various bits of AWS do not actually depend on you having a VPC.
There have been many high-profile breaches caused by insufficient S3 access permissions. Hence the multiple attempts at adding better security to S3.
Compute
- There is less on the exam about core compute than I originally expected.
- Paravirtual (PV) vs. HVM virtualization
- AMIs
- Placement groups
- Behavior around rebooting, stopping, and editing instance details
- Launch templates
- IMDSv2 (see the sketch after this list)
- Reserved instances (convertible and not), and selling reserved instances
- Spot instances
- Dedicated instances
- Fabric (Elastic Fabric Adapter)
- Amazon Linux 2
- Graviton
- GPU instances
- Memory encryption (SEV)
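IMDSv2 comes up constantly in security questions, so it’s worth seeing the token handshake concretely. Here’s a minimal sketch in Python, stdlib only; it can only actually run on an EC2 instance, and the endpoint and header names are the documented ones:

```python
# Minimal IMDSv2 flow: fetch a short-lived session token first, then
# present it on every metadata request. IMDSv1 skipped the token step,
# which is what made it vulnerable to SSRF attacks.
import urllib.request

IMDS = "http://169.254.169.254/latest"

# Step 1: PUT to get a session token (21600 seconds is the maximum TTL).
req = urllib.request.Request(
    f"{IMDS}/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
)
token = urllib.request.urlopen(req).read().decode()

# Step 2: GET metadata, presenting the token.
req = urllib.request.Request(
    f"{IMDS}/meta-data/instance-id",
    headers={"X-aws-ec2-metadata-token": token},
)
print(urllib.request.urlopen(req).read().decode())
```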
Queues
- SQS (see the send/receive sketch after this list)
- SNS (easy to mix up with SQS!)
- Amazon MQ
- Kinesis
- Firehose. Firehose was originally called Kinesis Firehose, which was very confusing since it had little to do with Kinesis.
- Managed Kafka
- IoT MQTT
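The easiest way to keep SQS and SNS straight is to have pushed a message through each at least once. A minimal boto3 round trip for SQS, assuming a queue named demo-queue (a made-up name) already exists and credentials are configured:

```python
# SQS round trip: send, long-poll receive, explicit delete.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="demo-queue")["QueueUrl"]

sqs.send_message(QueueUrl=queue_url, MessageBody="hello")

# Long polling (WaitTimeSeconds > 0) is almost always what you want.
resp = sqs.receive_message(
    QueueUrl=queue_url, WaitTimeSeconds=10, MaxNumberOfMessages=1
)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # Received messages must be deleted explicitly, or they reappear
    # after the visibility timeout (a classic exam point).
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```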
Networking
- AWS system structure:
- Three major partitions (public/commercial, GovCloud, and China)
- Regions
- Availability zones
- Local zones
- Wavelength (super-local zones)
- VPCs
- CIDR blocks
- Internet gateways (regional)
- NAT gateways (zonal) (expensive!)
- Gateway endpoints (used for private S3 and DynamoDB access)
- VPC Peering
- IPv6 addressing and egress
- Client VPN endpoint; link aggregation for VPNs; authentication for VPNs
- IPAM
- Routing tables
- Subnet allocation
- Splitting subnets
- What to do when peering two VPCs with overlapping subnets
- Transit through a VPC is (mostly) not allowed
- Recommended architecture: VPC per project with a central Shared Services VPC
- Overall, need to have a strong understanding of which VPC resources are zonal and which are regional
- Flow logs
- Mirroring
- NO multicast support!
- DNS
- Customizing VPC DNS through DHCP Option Sets
- Split DNS
- Configuring DNS with instance names
- Route 53 registration and zone hosting
- DNS based load balancing based on latency and geography
- DNS health checks
- Client VPN
- Site-to-site VPN
- Transit gateway
- Global accelerator (GAX)
- Direct Connect (DX)
- DX is hard/impossible to experiment with on your own!
- MACsec
- Link aggregation with two or more DX connections
- Differences between dedicated DX and managed service provider DX
- ENIs
- Security groups
- Using one security group as a source in a different security group’s rules (see the sketch after this list)
- Difference between Security Group and NACL
- Performance basics: accelerated network interfaces, fabric
- Attaching one EC2 instance to multiple VPCs
- Shield / Shield Advanced
- Auto Scaling Groups
- Especially behavior in balancing across zones
- Load balancers
- NLBs
- ALBs
- TLS termination
- Interaction with Auto Scaling Groups
- In practice, you want most things in AWS to be behind some kind of load balancer, as you want fine-grained control
- API Gateway
- REST mode understands requests and provides more detailed functionality (doesn’t actually HAVE to be REST)
- Non-REST mode just uses HTTP
- VPC Lattice
- This is not actually on the test yet, but it seems to be AWS’ big new networking thing, so it probably will be soon.
- CloudFront
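The security-group referencing trick flagged above is easier to remember once you’ve seen the API shape. A boto3 sketch, where both group IDs are hypothetical placeholders:

```python
# Allow an app-tier security group to accept traffic from a web-tier
# security group, by referencing the group ID instead of a CIDR block.
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0aaaaaaaaaaaaaaaa",  # hypothetical app-tier SG
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        # Reference another SG rather than an IP range; membership is
        # evaluated live, so Auto Scaling can add and remove instances
        # without any rule changes.
        "UserIdGroupPairs": [{"GroupId": "sg-0bbbbbbbbbbbbbbbb"}],
    }],
)
```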
Storage
- FSx – understand all modes
- Storage gateway – again, understand all modes
- Instance store
- EBS
- Basic performance and reliability characteristics
- Expanding a volume
- Snapshots
- RAID arrays of EBS volumes and when to use
- S3
- S3 is the core storage service for AWS.
- S3 gateways in the VPC
- Bucket policies & IAM
- Origin access control
- Replication between regions
- Tiering/service levels
- Object lifecycle management
- Mounting with Amazon’s Mountpoint for S3 (mount-s3) utility, or syncing with the AWS CLI
- Pre-signed requests for upload and download (see the sketch after this list)
- Transfer
- Basically a serverless FTP/SFTP endpoint that can talk to your buckets.
- DataSync
- Sophisticated tool for moving data between S3 and other services on a schedule or driven by events.
- S3 Object Lambda
- Triggering events from S3 actions (both within S3, and also CloudWatch Events)
- S3 Express One Zone
- This is a very different service that happens to be under the S3 brand name. It has totally different semantics from S3.
- EFS
- Use on Linux (Windows isn’t supported; that’s what FSx for Windows File Server is for)
- Transit encryption – this actually just uses stunnel at the application layer!
- Pricing (this one’s expensive)
- Snowball Edge
- This started as a storage device, but now has a lot of compute capabilities too.
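Pre-signed requests (flagged in the S3 list above) are a favorite exam topic. A quick boto3 sketch, with placeholder bucket and key names:

```python
# Generate pre-signed URLs for download and upload.
import boto3

s3 = boto3.client("s3")

# Anyone holding this URL can GET the object until it expires,
# using the permissions of the identity that signed it.
download_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket", "Key": "photos/cat.jpg"},
    ExpiresIn=3600,  # seconds
)
print(download_url)

# For browser uploads, a pre-signed POST lets the form supply the
# filename while you control the bucket and expiry.
upload = s3.generate_presigned_post(
    Bucket="my-example-bucket",
    Key="uploads/${filename}",
    ExpiresIn=3600,
)
print(upload["url"], upload["fields"])
```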
Databases
- RDS
- Data migration
- Schema migration
- Supported databases
- MySQL
- Postgres
- SQL Server
- Oracle (it works a bit differently)
- DB2 (also a bit odd)
- Gaps in functionality between these, e.g. a good deal of stuff is not available on DB2
- Babelfish
- Replication and promotion
- Maintenance windows
- Backup RTO and RPO
- Aurora
- Despite the marketing, Aurora is basically RDS with an improved storage layer.
- Global databases
- Authentication for these
- DynamoDB
- Read the Dynamo paper!
- Partition and sort keys (see the query sketch after this list)
- Quotas and performance management options
- DAX
- AWS recommends at least 3 nodes for production!
- DynamoDB gateways in the VPC (again)
- Managed OpenSearch (formerly Elasticsearch)
- Athena
- Redshift
- Redshift Spectrum
- Caches:
- Managed Memcached
- Managed Redis/Valkey
- You almost always want Redis/Valkey. Memcached is usually a red herring.
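For the partition/sort key material flagged above, a small sketch helps: a DynamoDB Query must pin exactly one partition key value and can only range-filter on the sort key. The table and attribute names here are invented:

```python
# Query a composite-key table: partition key "customer_id",
# sort key "order_date".
import boto3
from boto3.dynamodb.conditions import Key

ddb = boto3.resource("dynamodb")
table = ddb.Table("orders")  # hypothetical table

resp = table.query(
    KeyConditionExpression=(
        Key("customer_id").eq("cust-42")          # must be an equality
        & Key("order_date").begins_with("2024-")  # sort key may be a range
    )
)
for item in resp["Items"]:
    print(item)
```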
Security
- Accounts, users, and organizations
- ARN format
- Cognito
- Identity federation for applications
- Identity federation into AWS itself (via roles)
- IAM
- Need to be able to write and read IAM policies fluently.
- Understand resource and user policies
- Understand roles
- Understand using roles across accounts
- IAM Federation using SAML and OIDC
- Tricky stuff: you can never block all services in the us-east-1 region, even if you want to limit users to only one other region, because global services live there (see the policy sketch after this list)!
- Auditing and checking that policies work as expected
- Permissions boundaries
- Service control policies
- MFA setup
- AWS-managed policies
- How to share resources between accounts
- PrivateLink endpoint interface permissions
- STS
- Simple AD (basically Samba)
- Managed Active Directory (basically Windows Active Directory)
- Federating with your own AD server across DX or VPN
- RAM
- Share a VPC between accounts
- Certain resources can be shared with RAM
- Kind of a kludge compared to using proper IAM policies to accomplish the same thing
- Control Tower: know basics
- Landing Zone: know basics (but really you should just use this)
- EC2 Instance Login
- Login to instances with Instance Connect
- Login to instances with Systems Manager Session Manager
- Behavior of SSH keys in the EC2 console
- GuardDuty
- General threat detection service
- Macie
- Special tool for S3 security only (strange name)
- Firewall Manager
- Web Application Firewall
- Config
- Tracks versions for every AWS object in your account
- Remediates configuration drift
- Expensive
- CloudTrail
- Logs every action that affects AWS objects
- Understand differences between CloudTrail, CloudWatch, and Config
- Security Hub
- Inspector
- Limiting what users can deploy using Service Catalog or CloudFormation Stacks
- Tag policies
- PCA
- Run a CA for your own stuff
- KMS
- Heart of all encryption functionality in the AWS universe
- HSM
- Really for compliance requirements; very expensive
- Roles Anywhere
- Access APIs with certificates instead of STS tokens
- IRSA
- Access roles from Kubernetes
- Shared responsibility model
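To make the us-east-1 gotcha above concrete, here is roughly what a region-lockdown policy looks like: you have to exempt the global services homed in us-east-1 or you lock everyone out of them. The region choice and the NotAction list are illustrative, not exhaustive:

```python
# An SCP-style policy denying all actions outside eu-west-1, with a
# carve-out for global services that physically live in us-east-1.
import json

region_lockdown = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyOutsideEuWest1",
        "Effect": "Deny",
        # NotAction: denying these global services would break IAM
        # logins, role assumption, DNS, and CDN management everywhere.
        "NotAction": [
            "iam:*", "sts:*", "organizations:*",
            "route53:*", "cloudfront:*", "support:*",
        ],
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1"]}
        },
    }],
}
print(json.dumps(region_lockdown, indent=2))
```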
Billing
- AWS Budgets
- AWS Cost Explorer
- Having a single billing (management) account within an org
- Marketplace
- Billing for APIs that you create using API Gateway
Serverless
- ECR
- ECS
- Fargate
- Can run under ECS or EKS
- Can even run locally now
- Lambda (see the handler sketch after this list)
- Versioning
- 15-minute limit
- Attaching to VPC
- Attaching to EFS
- Roles
- Base images
- Building new base images
- Step Functions
- Mini programming language for building state machines out of multiple lambda functions
- Can build an entire system using lambda and step functions
- Batch
- Using CloudWatch Events (now EventBridge) to auto-trigger things
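To ground the Lambda items above, here is a minimal handler. This sketch assumes an S3 ObjectCreated trigger; other event sources deliver differently shaped payloads:

```python
# Minimal Lambda handler for an S3 ObjectCreated trigger.
import json

def handler(event, context):
    # Each S3 record names the bucket and the object key.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New object: s3://{bucket}/{key}")
    # API Gateway-style invocations expect a statusCode/body response;
    # S3 triggers ignore the return value.
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```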
IoT
- Surprisingly, there were several questions on this, even though it’s a very niche area
- Provisioning with certificates
- Queueing messages with MQTT
- Device updates
CloudFormation
- You do not have to know how to write a template
- But you do have to know how stacks deploy, how to use parameters, and how to nest templates (see the sketch below)
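You won’t write templates on the exam, but seeing one deploy call clarifies where parameters fit. A boto3 sketch, where the template file and parameter name are hypothetical:

```python
# Deploy a CloudFormation stack with a parameter override.
import boto3

cfn = boto3.client("cloudformation")

with open("network.yaml") as f:  # hypothetical template file
    template_body = f.read()

cfn.create_stack(
    StackName="demo-network",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "VpcCidr", "ParameterValue": "10.0.0.0/16"},
    ],
    # Required whenever the template creates named IAM resources.
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack finishes creating (or fails and rolls back).
cfn.get_waiter("stack_create_complete").wait(StackName="demo-network")
```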
Migration
- Know the services AWS provides for migration
- Application discovery on VMware
- Migration Hub
- Mainframe migration tools
- The 3 R’s (rehost, replatform, refactor), and building a migration strategy
Observability
Other services
- Workspaces
- EMR basics
- Elastic Transcoder & other video services
- Polly
- Glue
- Marketplace basics
- VMware Cloud on AWS - why use it?
- EKS - extreme basics only, this is not a Kubernetes test
- Elastic Beanstalk - getting old but still relevant
- Support tiers
- Outposts
- Look at all the stuff AWS offers for running locally on your own hardware
- License Manager
Don’t bother learning
- CloudFormation details
- Terraform, Ansible, Chef, Puppet (although having a basic knowledge of all would be useful)
- Specific command lines
- More than basic knowledge of using the console
- AWS APIs
- Pricing details
- Names or characteristics of specific instance types
- Very advanced topics in networking and performance
- HPC
Vestigial bones
Just like humans have a few vestigial bones from back when we had tails and ate raw plants, AWS has some vestigial bones from its early architecture. It’s evolved over time, but you can see traces of the path it took to get there.
Traces of AWS evolution
Some parts of AWS include redundant or confusing functionality, and are clearly intended for backward-compatibility and stability. Other clouds which came after AWS have fewer such warts.
- S3 has substantially different permissions semantics from all other AWS services.
- S3 also has ad-hoc links to other services rather than using EventBridge.
- PrivateLink for S3 and DynamoDB works completely differently from PrivateLink for every other service, probably because those two were supported first, before a design change.
- Security groups can restrict access from other security groups, but not from VPCs (because security groups were invented before VPCs).
- NACLs and security groups have overlapping functionality.
- SNS and SQS are somewhat redundant to each other, and neither really provides the pub/sub message queue that is the standard on other cloud platforms.
- Availability zone IDs were originally completely hidden from the user and randomized per account. Later, people wanted resources in two different accounts to land in the same AZ, so AWS now exposes AZ IDs, but it still uses the old randomized naming most of the time.
- AWS first had accounts, then had multiple users within one account, and then added organizations as groupings of accounts. At least 90% of the time, you want your company to have one locked-down organization structure and make it impossible to share things outside that organization. This is pretty difficult to achieve with AWS.
It’s useful to know roughly when AWS added new features, in order to understand why it has surprising idiosyncrasies.
Pricing
The test does not cover pricing. This makes some sense, since most companies have negotiated agreements that come with substantial discounts, and AWS changes pricing all the time.
But you always need to remember that AWS wants you to run up as large a bill as possible. Often, the “recommended” approaches will incur hundreds or thousands of dollars a month in unnecessary expenses, when a simpler and cheaper solution is available.
AWS is a bit like Amazon.com: They already have your credit card number, and make it very easy to spend a lot of money that you don’t really need to. And they don’t necessarily want to train their certified experts in how to reduce costs.
For real-world use, remember:
- AWS systematically under-prices compute and storage, and then makes up the difference on network bandwidth pricing. I suppose they figure that if your company is just starting out, it doesn’t really need much bandwidth yet; then when it’s successful you’ll have money to spend on bandwidth (and are already locked in). If you have bandwidth-heavy stuff like video though, use something else!
- The exam’s networking sections cover PrivateLink (Gateway and Interface Endpoints) and Transit Gateways. Both carry significant added cost, even though they feel like convenience features that shouldn’t cost extra.
- Lambda is unbelievably expensive compared to just hosting the same thing yourself on EC2. It’s convenient, and useful for lots of smaller projects, but anything that’s going to blow up probably needs to be moved back to traditional hosting. Containers are a nice middle ground: you can start with Lambda as the runtime for an image, then switch the same image to run on ECS (on Fargate or EC2), which is much cheaper once you have real volume.
- RDS and Aurora get really expensive, primarily because they need a ton of memory, which is often the most expensive variable in AWS instances. If at all possible, avoid them or minimize their use.
- Advanced services around security are all incredibly expensive. For example, I know companies that have accidentally spent tens of thousands of dollars in a month just by turning on Config. Presumably they have customers who really want those services, and are willing to pay big money.
The cloud can be cheaper than running things in a data center, but it’s not automatically cheaper. Many AWS features are designed to lock you in to AWS, so your bill can only grow. Any real architect needs to understand the pricing structure and steer away from the money pits.
The most important thing
The test is extremely time-limited. The questions and answer choices are long, complicated, and often confusing. You only have about two minutes per question, so you’ll have to read and understand very quickly. Many people who know all this material won’t be quick enough to answer the questions in the time allotted.
It’s vital to take practice tests in order to get into the rhythm of answering these long, winding, multi-part questions. I found ACloudGuru’s practice tests to be very good, and did all of them (some multiple times).
What can we surmise about AWS internals?
Naturally, AWS doesn’t tell us very much about how their system works under the hood. However, it’s fun to think about how AWS must work internally, in order to have the observed behavior. I’ve made a few educated guesses that help me reason about how AWS must work.
Guesses about AWS internals
All of this is speculation. I have no inside information.
AWS’ networking must have some kind of metadata for each packet. It’s apparent that various features know the originating ENI, security group, VPC ID, and account ID of individual packets flowing across their network, so I would guess there is an encapsulation layer which wraps each individual packet across their network.
While we can’t know for sure, it seems like around 2016, AWS switched from a Xen hypervisor to a KVM-based hypervisor. This is clear from the types of Linux drivers used in newer instance types, which are similar to KVM. They must still have some Xen around though, since you can still launch the old instances.
While AWS whitepapers talk about a single node having “Nitro cards,” it is probably just one off-CPU chip that can do TPM, network encapsulation, storage initiator work (local SCSI to network), and other functionality. It must be an interesting piece of hardware, since it can handle 40 Gbps at line speed. There are some network cards (DPUs) that can do this kind of thing now, but they are much more recent than Nitro is.
It’s pretty clear to me that RAM is the primary limiting factor in the number of EC2 instances that can be deployed per node. Cost of instances varies closely with RAM allocation.
AWS core services seem to be totally dependent on the us-east-1 region. Almost every global service depends on this region. I would guess IAM, STS, S3 metadata, and DynamoDB are based there. If you block access to this region in an IAM policy, a bunch of things break.
Every single thing in AWS needs KMS: encrypting storage, RAM, and network traffic, plus explicit calls to KMS itself, all in real time (or everything starts to fail). I would guess this is one of the most widely replicated services in the entire system. Perhaps every rack has its own KMS instance.
Access Key IDs and Secret Access Keys probably started as opaque random tokens. Then, as the system grew more complex, they added session tokens which are cryptographically verifiable assertions. Is there a big global replicated database of all the Access Key IDs and Secret Access Keys, or have they switched that to signed assertions too? It’s hard to say.
S3 likely uses erasure coding to store objects. If there are 3 sub-copies, it can likely survive the loss of any one at a time, with some background process rebuilding a damaged sub-copy.
S3 Express One Zone has the semantics of a regular directory on a standard filesystem. It probably is just a regular directory on a standard filesystem.
EBS has fast snapshots and also seems to be able to incrementally load data as systems use blocks. It is probably similar to bcache on Linux, implementing an LSM-based block store. I would guess there are dedicated nodes for serving EBS with lots of NVRAM storage.
All features of Aurora seem to be implemented at the storage layer, not the database layer. I would guess Aurora is really just a fancy edition of EBS.
I suspect Lambda runs on separate physical nodes from EC2. The reason is that Lambda VMs use Firecracker to start up extremely quickly, which isn’t possible on ordinary EC2. I would expect that other major services like S3 and SQS run on separate physical hardware as well.
Both NAT Gateways and PrivateLink are zonal and rather expensive. I suspect they are just implemented as EC2 instances running NAT code under the hood, rather than being built into the VPC architecture itself.
Glacier probably uses a combination of a Massive Array of Idle Disks (MAID) and tape storage for extremely infrequently used data. Its 4-hour SLA is typical of tape retrieval times.
IAM policy evaluation is extremely complicated. I suspect they have whole teams of people just designing and testing policies. It seems very easy for a slip-up to accidentally expose key services to attacks. There are newer approaches to policy design, such as Zanzibar and OPA, which are equally expressive but easier for humans to understand.
Is this a useful certificate?
Yes! Understanding the array of options for building things on AWS is extremely helpful. You could easily save a company millions of dollars by going down the right path instead of the wrong one.
However, just taking a test is very different from hands-on experience. You don’t have to find a problem with a failing network configuration, or know how to actually write a query for any of the many databases. You could come out of this test knowing how to whiteboard a complex system perfectly, but without the experience to actually do it.
I wish the material covered modern immutable infrastructure and declarative configuration in more detail. Unless you’re migrating old stuff, there’s absolutely no reason to have traditional always-on services configured through a console in a cloud environment.
Fun things I learned along the way
- Adrian Cantrill is an incredible teacher. I do my own technical trainings sometimes, and I hope I can make mine half as good!
- NAT gateways are pretty expensive for personal use (at least $30/month). Can I avoid them by using IPv6 (so no NAT is required)? YES! Although a few key sites like GitHub still don’t work on IPv6.
- You can create multiple accounts on AWS using username+account1@gmail.com, username.account2@gmail.com, etc. Each gets $300 in AWS starting credits.
- ChatGPT is a great study buddy. It has very good knowledge of AWS, perhaps because so many people write about it.
- I posted about passing the test on Twitter, and got what really appears to be a personal congratulations from AWS. I didn’t tag anyone there. Very kind of them!
- Apparently if you pass all the AWS tests, they send you an exclusive gold jacket. This would be a fun challenge, especially if you can get someone else to pay for all the courses and exam fees!
Cost of getting the certification
I was paying for all of this out-of-pocket, so cost was a concern. In total, I spent around $580:
- Around $80 for Adrian Cantrill’s course (one-time fee)
- $35/month for ACloudGuru access, adding up to about $150
- $300 exam fee
- About $50 of AWS fees on top of the $300 in free starting credits they give you (I accidentally left something running for a month)
I think it’s worth it, but I understand not everyone has that much cash available up-front. If you can, see if your employer will pay for these expenses.
If you pass one exam from AWS, they give you a 50% off discount on your next exam. So by taking an easier (and cheaper) exam first, you may be able to save $150.
Conclusion
This was a fantastic journey, and I’m glad I did it. I went from having detailed knowledge about a few specific AWS areas, to deep knowledge about almost every user-facing service they provide. It took me around 200 hours of study over a few months to get to that level. I would compare it to roughly the same effort as an advanced-level college course.
I’m not sure if I’ll ever be responsible for AWS in an operational capacity in the future, but I’m confident I can at least communicate with the people who are, and speak the same language.
Plus it was a fun challenge!
What’s next?
I’m hoping to get high-level certifications on all 3 clouds, do the CKA, and pick up a few security certs! So it’s a long road ahead for me.