MLS-C01 Exam Cheat Sheet

Last thing to read before the exam. Dense tables, no prose.


Algorithm Quick Reference — All 17+ Built-In

| Algorithm | Problem | Input | Compute | Key Hyperparams | Use When |
|---|---|---|---|---|---|
| XGBoost | Class/Reg/Rank | Tabular CSV/libsvm | CPU (GPU opt) | num_round, max_depth, eta, subsample, alpha, lambda | Tabular data, default first choice |
| Linear Learner | Class/Reg | Tabular RecordIO/CSV | CPU | learning_rate, l1, wd, mini_batch_size | Large tabular, linear relationships, fast baseline |
| KNN | Class/Reg/Anomaly | Tabular RecordIO | CPU | k, sample_size, feature_dim | Fast inference, low latency, small-medium data |
| Factorization Machines | Class/Reg | Sparse RecordIO-protobuf (Float32; CSV not supported for training) | CPU | num_factors, lr, epoch | Sparse data, click prediction, recommendations |
| DeepAR | Forecasting | JSON Lines time series | GPU/CPU | context_length, prediction_length, num_layers | Multiple related time series, cold start |
| LDA | Topic modeling | Word-count vectors (RecordIO/CSV) | CPU | num_topics, alpha0 | Text corpus, discover hidden topics |
| NTM | Topic modeling | Count vectors | GPU/CPU | num_topics, encoder_layers | Similar to LDA, neural approach |
| BlazingText | Text class / Word2Vec | Text file (space-delimited) | GPU/CPU | mode (supervised/cbow/skipgram), vector_dim | Fast text classification, word embeddings, millions of docs |
| Object2Vec | Embeddings | Pair of tokens/sequences | GPU | enc_dim, num_layers | Entity relationships, recommendation, similarity |
| Semantic Segmentation | Image segmentation | JPG images + PNG annotation masks, or augmented manifest | GPU | backbone, algorithm (FCN/PSP/DeepLab) | Pixel-level classification of images |
| Image Classification | Multi-class images | RecordIO / augmented manifest | GPU | num_classes, num_layers, lr, epochs | Classify whole images, transfer learning available |
| Object Detection | Bounding boxes | RecordIO / augmented manifest | GPU | num_classes, base_network (VGG/ResNet) | Locate + classify objects in images |
| Seq2Seq | Sequence translation | Tokenized RecordIO | GPU | num_layers, num_embed, rnn_num_hidden | Machine translation, summarization |
| IP Insights | Anomaly (IP) | CSV (entity, IP) | GPU/CPU | num_entity_vectors, vector_dim | Detect suspicious IP usage patterns |
| Random Cut Forest (RCF) | Anomaly detection | RecordIO/CSV | CPU | num_trees, num_samples_per_tree | Real-time streaming anomaly detection |
| PCA | Dimensionality reduction | RecordIO/CSV | CPU | num_components, algorithm_mode (regular/randomized) | Reduce features, handle multicollinearity |
| K-Means | Clustering | RecordIO/CSV | GPU/CPU | k, init_method, max_iter | Unsupervised grouping, customer segmentation |

Exam tip — “tabular data with no info on structure” → XGBoost first. Images → Image Classification or Object Detection. Streaming anomaly → RCF.


Metric Quick Reference

Classification Metrics

| Metric | Formula | Use When |
|---|---|---|
| Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ | Balanced classes, equal cost of errors |
| Precision | $\frac{TP}{TP+FP}$ | Cost of FP is high (spam filter, fraud alert) |
| Recall | $\frac{TP}{TP+FN}$ | Cost of FN is high (cancer detection, fraud missed) |
| F1 | $\frac{2 \cdot P \cdot R}{P+R}$ | Imbalanced classes, balance P and R |
| AUC-ROC | Area under ROC curve | Rank-ordering, imbalanced, threshold-agnostic |
| Log Loss | $-\frac{1}{N}\sum\left[y\log\hat{p} + (1-y)\log(1-\hat{p})\right]$ | Probabilistic classifier quality |
| PR-AUC | Area under P-R curve | Highly imbalanced data (better than ROC) |

Accuracy is MISLEADING on imbalanced data: a model that always predicts the majority class still scores high. Always flag this on the exam.
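The accuracy trap can be checked from confusion-matrix counts. The sketch below is illustrative only; the counts (1000 samples, 10 positives) are made up, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced data: 1000 samples, only 10 positives.
# Model A predicts "negative" for everything.
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)
# Model B finds 8 of the 10 positives at the price of 20 false alarms.
useful = classification_metrics(tp=8, fp=20, fn=2, tn=970)

print(always_negative)  # high accuracy, zero recall
print(useful)           # slightly lower accuracy, far better recall
```

Model A wins on accuracy while catching no positives at all, which is why recall, F1, or PR-AUC is the safer answer for imbalanced scenarios.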

Regression Metrics

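| Metric | Formula | Use When |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y-\hat{y})^2$ | Penalize large errors heavily |
| RMSE | $\sqrt{MSE}$ | Same scale as target, sensitive to outliers |
| MAE | $\frac{1}{n}\sum\lvert y-\hat{y}\rvert$ | Robust to outliers, interpretable |
| $R^2$ | $1 - \frac{SS_{res}}{SS_{tot}}$ | Proportion of variance explained (at most 1; negative if worse than predicting the mean) |
| MAPE | $\frac{100}{n}\sum\frac{\lvert y-\hat{y}\rvert}{\lvert y\rvert}$ | Percentage error, scale-independent; undefined when $y = 0$ |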

Other Metrics

| Problem | Metric | Notes |
|---|---|---|
| Clustering | Silhouette score | Ranges -1 to 1, higher = better clusters |
| Clustering | Inertia (SSE) | Within-cluster sum of squares, lower = better |
| Ranking | NDCG | Normalized Discounted Cumulative Gain |
| Object Detection | mAP | Mean Average Precision across classes |
| Segmentation | IoU / mIoU | Intersection over Union |
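IoU is easiest to remember from a small worked case. The following is a minimal sketch for axis-aligned boxes, assuming the `(x_min, y_min, x_max, y_max)` convention; `box_iou` is an illustrative name, not a SageMaker API.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = box_iou((0, 0, 1, 1), (5, 5, 6, 6))
print(overlap, disjoint)
```

mIoU for segmentation applies the same ratio per class on pixel masks and then averages across classes.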

Key Formulas

Classification

$$\text{Precision} = \frac{TP}{TP+FP} \qquad \text{Recall} = \frac{TP}{TP+FN}$$

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP+FP+FN}$$

$$F_\beta = (1+\beta^2)\frac{P \cdot R}{\beta^2 P + R}$$
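To see how $\beta$ shifts the balance, here is a small sketch of the $F_\beta$ formula above with made-up precision and recall values; `f_beta` is a hypothetical helper.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical model with low precision and high recall.
p, r = 0.5, 0.8
f_half = f_beta(p, r, beta=0.5)  # leans toward precision
f_one = f_beta(p, r, beta=1.0)   # plain F1
f_two = f_beta(p, r, beta=2.0)   # leans toward recall
print(f_half, f_one, f_two)
```

Because recall is the stronger of the two numbers here, the score rises as $\beta$ grows.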

Regression

$$MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 \qquad RMSE = \sqrt{MSE}$$

$$MAE = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \qquad R^2 = 1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$$
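The regression formulas above can be checked by hand on a tiny example. This is a sketch on made-up numbers; `regression_metrics` is an illustrative helper, not a library function.

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2 computed directly from the definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical targets and predictions; the last prediction is a large miss.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 13.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The single large miss (error of 4) dominates MSE and RMSE far more than MAE, which is the outlier sensitivity the tables describe.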

Gradient Descent

$$\theta := \theta - \alpha \nabla_\theta J(\theta)$$
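A minimal sketch of the update rule on a one-dimensional quadratic, assuming the toy loss $J(\theta) = (\theta - 3)^2$ (chosen only for illustration). It also shows the "learning rate too high" failure mode.

```python
def gradient_descent(grad, theta0, alpha, steps):
    """Repeat theta := theta - alpha * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad_j(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)


# A small learning rate converges to the minimum at theta = 3.
converged = gradient_descent(grad_j, theta0=0.0, alpha=0.1, steps=100)
# A learning rate that is too large overshoots further on every step.
diverged = gradient_descent(grad_j, theta0=0.0, alpha=1.1, steps=100)
print(converged, diverged)
```

For this loss the distance to the minimum is multiplied by $(1 - 2\alpha)$ each step, so it shrinks for $\alpha = 0.1$ and grows for $\alpha = 1.1$.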

Regularization

$$L1 \text{ (Lasso)}: J + \lambda\sum|\theta_j| \qquad L2 \text{ (Ridge)}: J + \lambda\sum\theta_j^2$$

$$\text{Elastic Net}: J + \lambda_1\sum|\theta_j| + \lambda_2\sum\theta_j^2$$

Decision Trees

$$\text{Gini} = 1 - \sum_{i=1}^c p_i^2 \qquad \text{Entropy} = -\sum_{i=1}^c p_i \log_2 p_i$$

$$\text{Information Gain} = H(\text{parent}) - \sum_k \frac{|S_k|}{|S|} H(S_k)$$
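The impurity formulas can be verified on a tiny made-up split. This sketch uses only the standard library; the labels and the split are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no", split into two children of 5.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(gini(parent), entropy(parent), information_gain(parent, children))
```

A perfectly mixed binary node has Gini 0.5 and entropy 1 bit; the split above recovers roughly 0.28 bits of information.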

Probability

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
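A worked instance of Bayes' rule, assuming a hypothetical screening test (1% prevalence, 95% sensitivity, 5% false positive rate). The numbers are invented for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' rule, with P(B) expanded by total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# P(disease) = 0.01, P(positive | disease) = 0.95, P(positive | healthy) = 0.05
p_disease_given_positive = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(round(p_disease_given_positive, 3))
```

Even with an accurate test, a positive result here implies only about a 16% chance of disease, because the low prior dominates.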


Service Quick Reference — “When to Use What”

Data Storage

| Service | Use Case | Key Detail |
|---|---|---|
| S3 | Everything, always the default | Objects, scalable, SageMaker native |
| EFS | Shared filesystem across instances | NFS, multi-AZ, training clusters |
| FSx for Lustre | HPC/ML high-throughput | Links to S3, fastest I/O for training |
| EBS | SageMaker notebook persistence | Block storage, single instance |
| Redshift | Data warehouse, SQL analytics | Columnar, petabyte scale |
| DynamoDB | Low-latency K-V lookups | Millisecond, managed NoSQL |

Streaming & Ingestion

| Service | Use Case | Key Detail |
|---|---|---|
| Kinesis Data Streams | Custom consumers, replay, low latency | Shard-based, 24h default retention (extendable up to 365 days) |
| Kinesis Data Firehose | Delivery to S3/Redshift/OpenSearch (ES) | No code, auto-scale, no replay |
| Kinesis Data Analytics | Real-time SQL/Flink on streams | Windowed queries, anomaly detection |
| MSK (Kafka) | Kafka-compatible streaming | Bring existing Kafka ecosystem |
| SQS | Decoupled queue, async processing | At-least-once, no ordering (std) |

ETL & Processing

| Service | Use Case | Key Detail |
|---|---|---|
| AWS Glue | Serverless ETL, PySpark | Crawlers, Data Catalog, job bookmarks |
| EMR | Full Hadoop/Spark control | EC2-based, complex transforms, cost |
| Athena | Ad-hoc SQL on S3 | No server, pay per query, quick |
| Redshift Spectrum | Query S3 from Redshift | Extend DW without loading |
| AWS Batch | Large batch compute | Non-ML batch jobs |
| Step Functions | Workflow orchestration | ML pipelines, state machines |

AI/ML Managed Services

| Service | Problem | Key Detail |
|---|---|---|
| Rekognition | Image/video analysis | Object detection, face, celebrity, moderation |
| Comprehend | NLP (text analysis) | Sentiment, entities, key phrases, topics |
| Transcribe | Speech → text | ASR, medical version available |
| Polly | Text → speech | TTS, multiple voices/languages |
| Translate | Language translation | Neural MT, 75+ languages |
| Textract | OCR + document parsing | Tables, forms, beyond simple OCR |
| Forecast | Time series forecasting | AutoML, no ML expertise needed |
| Personalize | Recommendation engine | User-item interactions, real-time |
| Lex | Chatbot / conversational AI | Same technology as Alexa, ASR + NLU |
| Kendra | Intelligent search | Enterprise search, FAQ extraction |
| Fraud Detector | Fraud detection | Online fraud, account takeover |
| Lookout for Metrics | Anomaly in business metrics | No ML required, time series |

SageMaker Deployment Modes

| Mode | Latency | Payload | When to Use |
|---|---|---|---|
| Real-time endpoint | Low (ms) | < 6 MB | Interactive apps, low-latency APIs |
| Serverless inference | Variable (cold starts) | < 4 MB | Intermittent traffic, unpredictable spikes |
| Asynchronous inference | Minutes | Up to 1 GB | Large payloads, video/audio, long inference |
| Batch transform | Offline | Unlimited | Bulk offline prediction, no persistent endpoint |

SageMaker Architecture

┌─────────────────────────────────────────────────────────────┐
│                    SAGEMAKER ECOSYSTEM                       │
│                                                             │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐               │
│  │   Data   │──▶│ Feature  │──▶│ Training │               │
│  │  Sources │   │  Store   │   │   Job    │               │
│  └──────────┘   └──────────┘   └──────────┘               │
│       │                              │                      │
│       │         ┌────────────────────┘                      │
│       │         ▼                                           │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐               │
│  │  S3 /    │   │  Model   │──▶│  Model   │               │
│  │  Ground  │   │ Registry │   │ Monitor  │               │
│  │  Truth   │   └──────────┘   └──────────┘               │
│  └──────────┘        │                                      │
│                       ▼                                      │
│              ┌─────────────────────────────┐               │
│              │     DEPLOYMENT OPTIONS       │               │
│              │ ┌──────┐ ┌──────┐ ┌──────┐ │               │
│              │ │ RT   │ │Batch │ │Async │ │               │
│              │ │Endpt │ │Trans │ │Endpt │ │               │
│              │ └──────┘ └──────┘ └──────┘ │               │
│              └─────────────────────────────┘               │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │    TOOLS: Studio | Pipelines | Experiments | Clarify │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Instance Type Selection

| Workload | Instance Family | Notes |
|---|---|---|
| Training deep learning | ml.p3, ml.p4, ml.g4dn, ml.g5 | GPU required for CV/NLP |
| Training tree models | ml.m5, ml.c5 | XGBoost CPU-optimized |
| Inference high-throughput | ml.c5, ml.c6g | CPU inference is cheaper |
| Inference GPU required | ml.g4dn | Real-time DL inference |
| Notebook exploration | ml.t3, ml.m5 | Notebooks don’t need GPU |
| Large model training | ml.p4d.24xlarge | Multi-GPU, NVLink |

/opt/ml/ Directory (Custom Containers)

/opt/ml/
├── input/
│   ├── config/          ← hyperparameters.json, resourceConfig.json
│   └── data/
│       └── {channel}/   ← training data from S3
├── model/               ← save model artifacts HERE
├── output/
│   └── failure          ← write failure reason here on error
└── code/                ← your training script (if using Script Mode)
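The layout above can be exercised with a small training entry point. This is a minimal sketch, not a full container: `run_training` and the `prefix` parameter are hypothetical (the parameter exists so the logic can be tried outside a container), the channel name `training` is an assumption, and the "model" is a placeholder JSON file.

```python
import json
import traceback
from pathlib import Path


def run_training(prefix="/opt/ml"):
    """Read config and data from the /opt/ml layout, write a model or a failure file."""
    root = Path(prefix)
    try:
        # Hyperparameter values arrive as strings in hyperparameters.json.
        hp_path = root / "input" / "config" / "hyperparameters.json"
        hyperparams = json.loads(hp_path.read_text())
        epochs = int(hyperparams.get("epochs", "1"))

        # Assumed channel name "training"; SageMaker copies the S3 data here.
        train_dir = root / "input" / "data" / "training"
        n_files = len(list(train_dir.glob("*")))

        # Placeholder artifact; anything saved under model/ is uploaded to S3.
        model_dir = root / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        summary = {"epochs": epochs, "files_seen": n_files}
        (model_dir / "model.json").write_text(json.dumps(summary))
        return 0
    except Exception:
        # On error, the failure reason goes to output/failure.
        out_dir = root / "output"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "failure").write_text(traceback.format_exc())
        return 1
```

Inside a real container the entry point would call `run_training()` with the default prefix and exit with its return code, so a non-zero exit marks the job as failed.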

Data Format Decision Tree

What is your data?
│
├─ Tabular data
│   ├─ Need fast SageMaker training?  → RecordIO (protobuf) + Pipe mode
│   ├─ XGBoost?                       → CSV or libsvm
│   └─ General purpose?               → CSV (simplest)
│
├─ Image data
│   ├─ SageMaker built-in algo?       → RecordIO, or raw images + augmented manifest
│   └─ Custom model?                  → Raw files in S3
│
├─ Text data
│   ├─ BlazingText?                   → One sentence per line, space-delimited
│   ├─ Seq2Seq?                       → Tokenized integer sequences
│   └─ General NLP?                   → JSON Lines / CSV
│
└─ Time series
    ├─ DeepAR?                         → JSON Lines (target + timestamp)
    └─ Forecast service?               → CSV or Parquet

RecordIO + Pipe mode = fastest SageMaker training (streams from S3, no copy to disk)


Overfitting vs Underfitting Quick Card

| Symptom | Diagnosis | Fix |
|---|---|---|
| Train acc high, test acc low | Overfitting | Add dropout, regularization (L1/L2), more data, early stopping, simpler model |
| Train acc low, test acc low | Underfitting | More complex model, more features, more epochs, less regularization, larger network |
| Train acc = test acc, both low | High bias | Feature engineering, polynomial features, different algorithm |
| Train acc oscillates | LR too high | Reduce learning rate, use LR scheduler |
| Train loss never decreases | LR too low / bad init | Increase LR, check data normalization |
| Good train/val, bad production | Data leakage or distribution shift | Fix data split, monitor drift |
| Model works on seen, fails on unseen | Poor generalization | Cross-validation, augmentation, regularization |

Feature Engineering Quick Card

Numerical Features

  • Scale: StandardScaler (mean=0, std=1) for linear/neural models; MinMaxScaler (0-1) for bounded
  • Log transform: $\log(x+1)$ for skewed/long-tail distributions
  • Binning: Convert continuous → ordinal buckets (age groups)
  • Polynomial: $x^2, x_1 x_2$ to capture non-linear relationships
  • Clipping: Remove/cap extreme outliers before scaling
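The numerical transforms above, sketched with the standard library only. The helper names and the `incomes` values are made up; in practice scikit-learn's `StandardScaler` and `MinMaxScaler` do the same job and should be fit on training data only.

```python
import math
import statistics


def standardize(xs):
    """Z-score scaling: subtract the mean, divide by the (population) std."""
    mu, sigma = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max(xs):
    """Rescale to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def clip(xs, lo, hi):
    """Cap values outside [lo, hi]."""
    return [min(max(x, lo), hi) for x in xs]


def log1p_all(xs):
    """log(x + 1) to compress a long right tail."""
    return [math.log1p(x) for x in xs]


# Hypothetical skewed feature with one extreme outlier.
incomes = [20.0, 25.0, 30.0, 35.0, 1000.0]
scaled_raw = min_max(incomes)                       # outlier squashes the rest near 0
scaled_clipped = min_max(clip(incomes, 0.0, 100.0))  # clipping first spreads values out
print(scaled_raw)
print(scaled_clipped)
print(log1p_all(incomes))
```

Comparing the two scaled lists shows why clipping (or a log transform) usually comes before scaling when outliers are present.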

Categorical Features

  • Label encoding: ordinal categories (low/med/high → 0/1/2)
  • One-hot encoding: nominal categories, low cardinality (< ~20 values)
  • Target encoding: high cardinality, replaces category with mean(target) — risk of leakage
  • Embedding: high cardinality in deep learning (entity embeddings)
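One-hot and target encoding can be sketched in a few lines. The data and helper names below are hypothetical; the target-encoding means are computed on the same rows purely for illustration, which is exactly the leakage risk noted above, so real pipelines compute them on the training fold only.

```python
from collections import defaultdict


def one_hot(values):
    """Return one 0/1 column per distinct category (columns in sorted order)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def target_encode(values, targets):
    """Replace each category with the mean target observed for that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {v: sums[v] / counts[v] for v in sums}
    return [means[v] for v in values], means


# Hypothetical nominal feature and binary target.
colors = ["red", "blue", "red", "green"]
clicked = [1, 0, 0, 1]
rows, categories = one_hot(colors)
encoded, means = target_encode(colors, clicked)
print(categories, rows)
print(encoded)
```

One-hot grows one column per category, which is why it is reserved for low cardinality; target encoding stays at one column regardless.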

Text Features

  • BoW / TF-IDF: sparse bag of words, good for linear models
  • Word2Vec / FastText: dense embeddings, captures semantics
  • BERT / Transformers: contextual embeddings, state-of-the-art
  • n-grams: capture phrases, increase vocabulary size

Time Series Features

  • Lag features: $x_{t-1}, x_{t-7}$ (yesterday, last week)
  • Rolling statistics: rolling mean, rolling std, rolling max
  • Decomposition: trend + seasonality + residual
  • Differencing: $x_t - x_{t-1}$ to remove trend and make the series stationary
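Lag, rolling, and differencing features can be sketched on a plain list. The helper names and the `sales` series are made up; positions without enough history are left as `None` rather than filled, so no future value leaks backwards.

```python
def lag(xs, k):
    """Value from k steps earlier; None where there is no history yet."""
    return [xs[i - k] if i >= k else None for i in range(len(xs))]


def rolling_mean(xs, window):
    """Mean of the current value and the previous window-1 values."""
    return [
        sum(xs[i - window + 1 : i + 1]) / window if i >= window - 1 else None
        for i in range(len(xs))
    ]


def difference(xs):
    """First difference x_t - x_{t-1}; removes a linear trend."""
    return [None] + [xs[i] - xs[i - 1] for i in range(1, len(xs))]


# Hypothetical daily sales with a steady upward trend.
sales = [10, 12, 14, 16, 18]
print(lag(sales, 1))
print(rolling_mean(sales, 3))
print(difference(sales))
```

The differenced series is constant, which shows the linear trend has been removed.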

Missing Data Strategies

| Strategy | When |
|---|---|
| Mean/median imputation | MCAR (missing completely at random), numerical |
| Mode imputation | Categorical, low proportion missing |
| KNN imputation | MCAR/MAR, preserve relationships |
| Model-based imputation | Complex dependencies |
| Indicator variable | When missingness itself is informative |
| Drop column | > 60-70% missing, no signal |
| Drop row | Very few missing, MCAR |
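Median imputation and an indicator variable are often combined. A minimal sketch, assuming missing values are represented as `None`; the helper name and the `ages` column are hypothetical.

```python
import statistics


def impute_median_with_indicator(values):
    """Fill missing entries with the median and return a 0/1 'was missing' flag."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


# Hypothetical numeric column with two missing entries.
ages = [25, None, 40, 31, None]
filled, was_missing = impute_median_with_indicator(ages)
print(filled)
print(was_missing)
```

The median should be computed on the training split and reused for validation and test data; the indicator column lets the model learn from the missingness itself.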

Security Quick Card

Encryption

| Layer | Mechanism | Service |
|---|---|---|
| At rest (S3) | SSE-S3, SSE-KMS, SSE-C | KMS, S3 |
| At rest (EBS/EFS) | KMS CMK | KMS |
| In transit | TLS 1.2+ | ACM |
| Inter-container | Enable inter-container encryption | SageMaker training config |
| Model artifacts | KMS key on S3 bucket | KMS + S3 |

Network

| Component | Purpose |
|---|---|
| VPC + private subnet | Isolate training/inference from internet |
| S3 VPC Gateway Endpoint | Access S3 without internet traversal |
| SageMaker VPC Interface Endpoint | Access SageMaker API without internet |
| Security Groups | Control inter-instance traffic |
| NAT Gateway | Outbound internet from private subnet |

Access Control

| Layer | Mechanism |
|---|---|
| SageMaker execution role | IAM role attached to notebook/training job |
| Bucket policy | S3 access from specific role/VPC only |
| Resource-based policy | Cross-account access |
| KMS key policy | Who can use encryption keys |
| CloudTrail | Audit all API calls |

Monitoring

| Tool | Purpose |
|---|---|
| CloudWatch Metrics | Resource utilization, invocation counts |
| CloudWatch Logs | Training/inference logs |
| CloudTrail | API call audit trail |
| SageMaker Model Monitor | Data drift, model quality, bias, explainability |
| SageMaker Clarify | Bias detection, feature importance (SHAP) |

Cost Optimization Quick Card

Training

| Technique | Savings | Notes |
|---|---|---|
| Spot instances | Up to 90% | Requires checkpointing for recovery |
| Right-size instances | 20-50% | Profile before choosing large instance |
| Warm pools | Reduce startup time | Keep instances ready between jobs |
| Managed spot training | Built-in | SageMaker handles interruption |
| Compile with Neo | Speed → cost | Optimize model for inference |

Inference

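| Technique | Savings | Notes |
|---|---|---|
| Auto-scaling | Pay only for load | Scale to zero with serverless |
| Serverless inference | No idle cost | Cold start tradeoff |
| Multi-model endpoint | 1 endpoint, N models | Share resources across models |
| Elastic Inference | GPU fraction | Attach fraction of GPU to CPU instance (legacy; closed to new customers since April 2023) |
| Inferentia (Inf1) | Lower cost than GPU | Custom ML chip, high throughput |
| Graviton2 (ml.c6g) | Lower CPU cost | ARM-based, cheaper CPU inference (g4dn is a GPU family, not Graviton) |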

Storage

| Technique | Savings | Notes |
|---|---|---|
| S3 Intelligent-Tiering | Auto-tier | For unknown access patterns |
| S3 Lifecycle rules | Move to Glacier | Archive old model artifacts |
| Delete unused endpoints | Immediate | Endpoints bill while running |
| Shared feature store | Avoid recompute | Feature Store offline store on S3 |

The “X vs Y” Quick Reference

| Pair | X | Y | Key Discriminator |
|---|---|---|---|
| Kinesis Data Streams vs Firehose | Custom processing, replay, sub-second | Delivery to S3/Redshift, no code | Need replay or sub-second? → Streams |
| Glue vs EMR | Serverless, PySpark, catalog | Full Hadoop/Spark ecosystem, EC2 | Operational overhead concern? → Glue |
| Athena vs Redshift | Ad-hoc SQL on S3, no loading | DW with complex joins, frequent queries | One-off queries on S3? → Athena |
| Rekognition vs SM Image Class | No ML expertise, managed | Custom training, specific domain | Need customization? → SageMaker |
| Random Forest vs XGBoost | Bagging, parallel, robust | Boosting, sequential, often better | XGBoost generally better, RF more robust to overfit |
| L1 vs L2 | Sparse, feature selection (zero weights) | Prevents large weights, smooth | Want to zero out features? → L1 |
| PCA vs t-SNE | Linear, preserve variance, production | Non-linear, visualization only | Production pipeline? → PCA |
| LSTM vs GRU | More params, better for long sequences | Fewer params, faster, similar perf | Limited compute? → GRU |
| Batch Norm vs Dropout | Normalize activations, training speed | Randomly drop neurons, regularize | Both can be used together |
| Precision vs Recall | Minimize FP (cost of false alarm) | Minimize FN (cost of missing) | What error is worse? |
| Bagging vs Boosting | Parallel, reduces variance, Random Forest | Sequential, reduces bias, XGBoost/AdaBoost | High variance? → Bagging. High bias? → Boosting |
| File Mode vs Pipe Mode | Data copied to instance disk | Streams directly from S3, faster | Large dataset + RecordIO? → Pipe |
| Data Parallelism vs Model Parallelism | Replicate model, split data across GPUs | Split model layers across GPUs | Model too large for 1 GPU? → Model parallel |
| Real-time vs Serverless | Always-on, consistent latency | Intermittent, variable latency, no idle cost | Need < 100ms guaranteed? → Real-time |
| Batch vs Async | Offline, no endpoint, huge scale | Large payload, async with notification | Payload > 6MB or very long inference? → Async |
| SageMaker Clarify vs Model Monitor | Bias detection + SHAP explainability | Data drift + model quality over time | Drift detection? → Monitor. Bias? → Clarify |
| Step Functions vs Airflow (MWAA) | AWS-native, serverless, simple orchestration | Complex DAGs, Python ecosystem | Complex Python DAG? → MWAA |

Data Pipeline Architecture Patterns

Batch ML Pipeline:
┌──────┐    ┌──────┐    ┌──────┐    ┌──────┐    ┌──────┐
│  S3  │───▶│ Glue │───▶│  S3  │───▶│  SM  │───▶│  S3  │
│ Raw  │    │ ETL  │    │Clean │    │Train │    │Model │
└──────┘    └──────┘    └──────┘    └──────┘    └──────┘

Streaming ML Pipeline:
┌──────┐    ┌───────┐    ┌─────────┐    ┌──────┐
│Kafka/│───▶│Kinesis│───▶│Analytics│───▶│Lambda│
│Source│    │Streams│    │(Flink)  │    │/SM   │
└──────┘    └───────┘    └─────────┘    └──────┘

Real-time Inference:
┌──────┐    ┌──────┐    ┌──────────┐    ┌──────┐
│Client│───▶│ API  │───▶│SageMaker │───▶│Result│
│      │    │  GW  │    │Real-time │    │      │
└──────┘    └──────┘    └──────────┘    └──────┘

SageMaker Pipelines Key Components

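| Component | Purpose |
|---|---|
| Pipeline | DAG of steps |
| ProcessingStep | Data prep/validation using SageMaker Processing |
| TrainingStep | Train a model |
| TuningStep | Hyperparameter tuning (HPO) |
| Evaluation (via ProcessingStep) | Evaluate metrics; there is no dedicated EvaluationStep, the evaluation script runs as a ProcessingStep |
| ModelStep | Create model from artifacts |
| ConditionStep | Branch on condition (accuracy > threshold) |
| RegisterModel | Push to Model Registry |
| TransformStep | Batch inference |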

Hyperparameter Tuning Quick Reference

HPO Strategy

| Strategy | Description | Use When |
|---|---|---|
| Grid search | Try all combinations | Small parameter space |
| Random search | Random samples | Medium space, better than grid |
| Bayesian optimization | Model-guided search | Expensive training, large space |
| Hyperband | Early stopping of bad trials | Large space, long training |
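Grid and random search differ only in how candidate points are chosen. The sketch below uses a made-up `objective` standing in for a validation loss (real tuning would launch a training job per trial); the parameter names and ranges are assumptions for illustration.

```python
import itertools
import random


def objective(lr, depth):
    """Hypothetical validation loss with its minimum at lr=0.1, depth=6."""
    return 100 * (lr - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def grid_search(lrs, depths):
    """Evaluate every combination and keep the best one."""
    return min(itertools.product(lrs, depths), key=lambda p: objective(*p))


def random_search(n_trials, seed=0):
    """Sample lr log-uniformly in [1e-3, 1] and depth uniformly in [2, 10]."""
    rng = random.Random(seed)
    trials = [(10 ** rng.uniform(-3, 0), rng.randint(2, 10)) for _ in range(n_trials)]
    return min(trials, key=lambda p: objective(*p))


best_grid = grid_search([0.001, 0.01, 0.1, 1.0], [2, 6, 10])  # 12 evaluations
best_random = random_search(n_trials=12)                      # same budget
print(best_grid, best_random)
```

Grid search cost multiplies with every added hyperparameter, while random search keeps a fixed budget and explores more distinct values per dimension; Bayesian optimization goes further by using earlier results to pick the next trial.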

SageMaker Automatic Model Tuning

  • Objective metric: specify training job metric to maximize/minimize
  • Parameter ranges: continuous, integer, categorical
  • Max jobs: total trials budget
  • Max parallel jobs: concurrent trials (reduces wall time, but Bayesian search learns less because fewer finished trials inform each new one)
  • Warm starting: continue previous tuning job, transfer learning for HPO

Common Hyperparameter Effects

| Hyperparameter | Effect when increased |
|---|---|
| Learning rate | Faster convergence, risk of instability |
| Batch size | More stable gradients, less noise, more memory use |
| Num epochs | Better convergence, risk of overfitting |
| Regularization (lambda) | More regularization, lower variance, higher bias |
| Max depth (trees) | More complex model, risk of overfitting |
| Num estimators | Better ensemble, diminishing returns |
| Dropout rate | More regularization |
| Hidden units | More capacity, risk of overfitting |

Model Monitoring Quick Reference

| Monitor Type | What it Detects | Baseline |
|---|---|---|
| Data Quality | Feature drift, schema violations | Baseline statistics from training data |
| Model Quality | Prediction drift, accuracy degradation | Ground truth labels needed |
| Bias Drift | Fairness metric changes over time | Clarify baseline |
| Feature Attribution Drift | SHAP value changes | Clarify baseline |

Key: All Model Monitor types need a baseline job first, followed by a scheduled monitoring job.


Mnemonics & Memory Aids

CART vs ID3: CART uses Gini, ID3 uses Entropy (C = Continuous splits, I = Information)

Precision vs Recall with “Spam”:

  • Spam filter: Precision matters — FP means missing real email
  • Cancer detection: Recall matters — FN means missed cancer

L1 vs L2 = “Lasso vs Ridge”:

  • L1 Lasso = “L-one = L-ose features” (zeros out weights)
  • L2 Ridge = “Ridge = gentle slope, no zeros”

Bagging vs Boosting:

  • Bagging = Bootstrap samples + Broad parallel trees
  • Boosting = Build on mistakes sequentially

SageMaker endpoint types — “RSBA”:

  • Real-time (low latency, online)
  • Serverless (intermittent, no idle cost)
  • Batch transform (offline, no endpoint)
  • Asynchronous (large payload, long running)

DeepAR vs classical forecasting:

  • DeepAR = Deep learning for All related series at once Recurrently
  • Use when you have many related time series (products, stores)

RCF for anomaly: Random Cut Forest = Real-time streaming anomalies Cut trees Find outliers

BlazingText modes:

  • Supervised = text classification
  • cbow / skipgram = word embeddings (like Word2Vec)

PCA vs t-SNE rule: PCA in production, t-SNE only for visualization (t-SNE is non-parametric, cannot transform new points)

Kinesis shard capacity:

  • Ingestion: 1 MB/s or 1000 records/s per shard
  • Consumption: 2 MB/s per shard, shared by all consumers (enhanced fan-out gives each consumer its own 2 MB/s)
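Shard-count questions reduce to taking the largest of three ratios. A minimal sketch using the per-shard limits above and assuming standard (shared) consumers; `required_shards` and the example workload are hypothetical.

```python
import math


def required_shards(ingest_mb_per_s, records_per_s, egress_mb_per_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    by_ingest_bytes = math.ceil(ingest_mb_per_s / 1.0)  # 1 MB/s write per shard
    by_ingest_records = math.ceil(records_per_s / 1000)  # 1000 records/s per shard
    by_egress = math.ceil(egress_mb_per_s / 2.0)         # 2 MB/s read per shard
    return max(1, by_ingest_bytes, by_ingest_records, by_egress)


# Hypothetical workload: 5 MB/s in, 3000 records/s, 12 MB/s total read
# (for example, two consumers each reading most of the stream).
shards = required_shards(ingest_mb_per_s=5, records_per_s=3000, egress_mb_per_s=12)
print(shards)
```

Here the read side is the bottleneck (6 shards), even though ingestion alone would need only 5.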

SageMaker Training job data (Script Mode): read each input channel from os.environ['SM_CHANNEL_<NAME>'] (SM_CHANNEL_TRAINING for a channel named "training") and save the model to os.environ['SM_MODEL_DIR']
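A common way to wire these environment variables into a training script is through argparse defaults. This is a sketch under assumptions: the `--epochs` hyperparameter and the channel name `training` are hypothetical, and the fallback paths only exist so the script can also run outside SageMaker.

```python
import argparse
import os


def parse_args(argv=None):
    """Read hyperparameters from the CLI and data/model paths from SM_* variables."""
    parser = argparse.ArgumentParser()
    # Hypothetical hyperparameter; SageMaker passes hyperparameters as CLI flags.
    parser.add_argument("--epochs", type=int, default=10)
    # Input channel named "training" (an assumption) and the model output directory.
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # parse_known_args tolerates extra flags a framework container may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "3"])
print(args.epochs, args.train, args.model_dir)
```

Anything the script writes under `args.model_dir` is packaged and uploaded to S3 when the job finishes.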

Spot training key requirement: MUST enable checkpointing — job can be interrupted and resumed

Ground Truth labeling: annotation consolidation combines several workers' labels → better label quality; automated data labeling (active learning) → lower cost, humans label only the uncertain examples

Clarify = Bias + Explainability, Monitor = Drift detection over time


Cross-Validation Patterns

| Pattern | Use When |
|---|---|
| k-fold CV | Standard, general purpose |
| Stratified k-fold | Imbalanced classes |
| Time series split | Sequential data; NEVER random split |
| Leave-one-out (LOO) | Very small datasets |
| Group k-fold | Prevent data leakage from grouped data |

Time series: ALWAYS split by time, never random. Future data must never appear in training.
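An expanding-window split makes the "never random" rule concrete. The sketch below is a hand-rolled illustration (`time_series_splits` is a hypothetical helper; scikit-learn's `TimeSeriesSplit` offers the same idea) and assumes samples are already sorted by time.

```python
def time_series_splits(n_samples, n_splits, test_size):
    """Expanding-window splits: each test block comes strictly after its training data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    splits = []
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        test_end = test_start + test_size
        train_idx = list(range(test_start))          # everything before the test block
        test_idx = list(range(test_start, test_end))
        splits.append((train_idx, test_idx))
    return splits


# 10 time-ordered samples, 3 folds, 2 test points per fold.
for train_idx, test_idx in time_series_splits(10, n_splits=3, test_size=2):
    print(train_idx, test_idx)
```

Every training index is smaller than every test index in its fold, so no future observation leaks into training.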


SageMaker Ground Truth

Labeling Workforce Options:
┌──────────────────────────┬────────────────────────────────────┐
│  Amazon Mechanical Turk  │  Public crowd, cheap, fast         │
├──────────────────────────┼────────────────────────────────────┤
│  Vendor managed          │  Third-party professional labelers │
├──────────────────────────┼────────────────────────────────────┤
│  Private                 │  Your own team, sensitive data     │
└──────────────────────────┴────────────────────────────────────┘

Active Learning Loop:
  Human labels sample → Train model → Auto-label easy examples
  → Send uncertain examples back to humans → Repeat

Ensemble Methods Quick Card

| Method | Algorithm | Technique | Reduces |
|---|---|---|---|
| Bagging | Random Forest | Bootstrap + average | Variance |
| Boosting | XGBoost, AdaBoost, GBM | Sequential correction | Bias |
| Stacking | Meta-learner | Train on predictions of base models | Both |
| Voting | Multiple models | Majority / soft vote | Variance |

Transfer Learning Decision

Do you have labeled data?
│
├─ Lots of data (similar domain)   → Fine-tune all layers
├─ Lots of data (different domain) → Train from scratch
├─ Little data (similar domain)    → Freeze base, train head only
└─ Little data (different domain)  → Risky; reuse only early-layer features and train the rest, or collect more data

AWS Data Processing Quick Sizing

| Data Volume | ETL Cadence | Recommended Service |
|---|---|---|
| < 1 GB | Infrequent | Lambda + S3 |
| 1 GB - 1 TB | Daily batch | Glue (serverless) |
| 1 TB - 10 TB | Daily batch | Glue or EMR |
| > 10 TB | Complex jobs | EMR (full control) |
| Real-time | Streaming | Kinesis + Analytics |

Quick Recall: Domain Weights

| Domain | Weight | Focus |
|---|---|---|
| 1: Data Engineering | 20% | S3, Kinesis, Glue, ETL, formats |
| 2: EDA & Feature Engineering | 24% | Stats, imbalance, scaling, imputation |
| 3: Modeling | 36% | Algorithms, training, evaluation (HEAVIEST) |
| 4: ML Implementation & Ops | 20% | SageMaker, security, deployment, monitoring |