This project tackles binary sentiment analysis on the IMDB movie reviews dataset using a from‑scratch RNN built with low‑level TensorFlow operations. It emphasizes understanding the recurrent update equations, padding/truncation strategies, vocabulary limits, and practical training challenges for sequence data.
- Load IMDB via `tf.keras.datasets.imdb` (50k reviews: 25k train / 25k test, binary labels).
- Inspect the indexed sequences; optionally reconstruct text via `get_word_index()`.
- Represent words as one‑hot vectors (sequence length × vocabulary size) and discuss why simple integer inputs are unsuitable.
- Handle variable sequence lengths with padding (and explore pre vs post padding) and truncation (e.g., to 200 tokens).
- Limit vocabulary size (e.g., keep the top 20k words, map rare words to an UNK token) to reduce memory and improve learning (a data‑preparation sketch for these steps follows this list).
- Implement an RNN from scratch (no Keras `RNNCell`/high‑level RNN layers):
  - Loop over time steps, updating the hidden state from the previous state and the current input (`tf.matmul` + nonlinearity).
  - Use a many‑to‑one setup (final‑step output) or aggregate across steps.
- Train with `tf.GradientTape` (BPTT) and Keras optimizers/losses/metrics (recurrence and training‑step sketches follow this list).
- Compare output formulations (a short side‑by‑side snippet follows this list):
  - 2‑unit logits + softmax + sparse categorical cross‑entropy vs 1‑unit sigmoid + binary cross‑entropy.
- Explore training issues: slow starts, vanishing gradients, initialization, learning rate, and using all time steps (averaging states/logits).
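
A minimal data‑preparation sketch for the loading, vocabulary, padding, and one‑hot steps above, assuming a 20k‑word vocabulary and a 200‑token cap (the constant names and the pre/post choices here are illustrative, not prescribed by the notebook):

```python
import tensorflow as tf

VOCAB_SIZE = 20_000   # keep the 20k most frequent words; rarer words become the OOV index
MAX_LEN = 200         # pad/truncate every review to 200 tokens

# load_data re-indexes words by frequency; with num_words set, rare words map to the OOV index.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)

# Optionally rebuild readable text: indices 0-2 are reserved (pad/start/OOV), so shift by 3.
word_index = tf.keras.datasets.imdb.get_word_index()
index_word = {i + 3: w for w, i in word_index.items()}
print(" ".join(index_word.get(i, "<UNK>") for i in x_train[0][:20]))

# Fixed-length sequences: 'pre' vs 'post' padding matters for a many-to-one RNN, because it
# decides whether the final time step holds real words or padding.
x_train = tf.keras.preprocessing.sequence.pad_sequences(
    x_train, maxlen=MAX_LEN, padding="pre", truncating="post")
x_test = tf.keras.preprocessing.sequence.pad_sequences(
    x_test, maxlen=MAX_LEN, padding="pre", truncating="post")

# One-hot a batch on the fly: (batch, MAX_LEN) ints -> (batch, MAX_LEN, VOCAB_SIZE) floats.
# Materializing all 25k training reviews this way would be far too large, so do it per batch.
def one_hot_batch(batch_ints):
    return tf.one_hot(batch_ints, depth=VOCAB_SIZE, dtype=tf.float32)
```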
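
The recurrence itself can stay at the level of `tf.Variable`, `tf.matmul`, and a `tanh` nonlinearity; this is one possible many‑to‑one sketch (the hidden size, initializer, and variable names are assumptions, not the notebook's exact choices):

```python
import tensorflow as tf

HIDDEN = 64           # hidden-state size (illustrative)
VOCAB_SIZE = 20_000   # must match the one-hot depth used for the inputs

# Parameters of the recurrence h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh + b_h)
# plus a final classification layer producing one logit from the last hidden state.
init = tf.keras.initializers.GlorotUniform()
W_xh = tf.Variable(init((VOCAB_SIZE, HIDDEN)))
W_hh = tf.Variable(init((HIDDEN, HIDDEN)))
b_h = tf.Variable(tf.zeros((HIDDEN,)))
W_hy = tf.Variable(init((HIDDEN, 1)))
b_y = tf.Variable(tf.zeros((1,)))

def rnn_forward(x_onehot):
    """Many-to-one forward pass: (batch, time, vocab) one-hot input -> (batch, 1) logits."""
    h = tf.zeros([tf.shape(x_onehot)[0], HIDDEN])
    for t in range(x_onehot.shape[1]):   # loop over time steps
        x_t = x_onehot[:, t, :]
        h = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(h, W_hh) + b_h)
    return tf.matmul(h, W_hy) + b_y      # logit from the final hidden state only
```

With 200 unrolled tanh steps, initialization and learning rate matter a lot; the slow starts and vanishing gradients listed above usually show up first in this loop.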
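Training then wraps that forward pass in `tf.GradientTape`, with Keras pieces only around the edges; this sketch assumes the `rnn_forward`, `one_hot_batch`, and variable names from the snippets above:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
accuracy = tf.keras.metrics.BinaryAccuracy(threshold=0.0)   # threshold 0 because we feed logits

trainable = [W_xh, W_hh, b_h, W_hy, b_y]   # variables from the recurrence sketch above

def train_step(x_int, y):
    x = one_hot_batch(x_int)                       # (batch, time, vocab)
    y = tf.cast(y, tf.float32)[:, None]            # (batch, 1) float labels for BCE
    with tf.GradientTape() as tape:
        logits = rnn_forward(x)
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, trainable)         # backprop through every time step (BPTT)
    optimizer.apply_gradients(zip(grads, trainable))
    accuracy.update_state(y, logits)
    return loss

# Example mini-batch loop over the padded integer sequences:
# for i in range(0, len(x_train), 64):
#     loss = train_step(x_train[i:i + 64], y_train[i:i + 64])
```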
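The two output formulations differ only in the width of the final layer and the loss; a tiny side‑by‑side comparison (the logit and label values are made up):

```python
import tensorflow as tf

# Option A: 2-unit logits + softmax + sparse categorical cross-entropy (integer labels 0/1).
logits_2 = tf.constant([[1.2, -0.3]])      # (batch, 2)
y_int = tf.constant([0])
loss_a = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(y_int, logits_2)
probs_a = tf.nn.softmax(logits_2)          # two class probabilities summing to 1

# Option B: 1-unit logit + sigmoid + binary cross-entropy (float labels 0.0/1.0).
logit_1 = tf.constant([[0.7]])             # (batch, 1)
y_float = tf.constant([[0.0]])
loss_b = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y_float, logit_1)
prob_b = tf.nn.sigmoid(logit_1)            # probability of the positive class
```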
- Why is padding to the global max length wasteful? Smarter batching/padding schemes?
- Truncate long sequences vs remove them—trade‑offs?
- Alternatives to one‑hot (e.g., learned embeddings) to avoid huge vectors? (An embedding‑lookup sketch follows this list.)
- Pre vs post padding—why does it matter when using the last time‑step output?
- Masking padded steps—how to skip updates for padding within a batch?
- Ways to leverage all time steps (averaging logits/states/probabilities)—pros/cons? (A masked‑update sketch touching on these last two questions follows this list.)
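
On the embedding question above: a trainable lookup table is the usual alternative to huge one‑hot vectors; a minimal sketch (the sizes and names are illustrative):

```python
import tensorflow as tf

VOCAB_SIZE = 20_000
EMBED_DIM = 64   # far smaller than a 20k-dimensional one-hot vector

# Looking up row i of this matrix is equivalent to one_hot(i) @ embedding,
# but the one-hot vector is never materialized.
embedding = tf.Variable(tf.random.uniform((VOCAB_SIZE, EMBED_DIM), -0.05, 0.05))

def embed_batch(x_int):
    """(batch, time) integer ids -> (batch, time, EMBED_DIM) dense vectors."""
    return tf.nn.embedding_lookup(embedding, x_int)
```

Feeding `embed_batch(x_int)` into the recurrence would shrink the input projection from a vocabulary‑sized matrix to an `(EMBED_DIM, HIDDEN)` one.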
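And on the last two questions: one way to handle both is to freeze the hidden state on padded positions and average per‑step logits over the real tokens only. This sketch assumes the variables from the recurrence sketch earlier and that padding uses index 0 (the `pad_sequences` default):

```python
import tensorflow as tf

PAD_ID = 0   # pad index produced by pad_sequences

def rnn_forward_masked(x_onehot, x_int):
    """Masked many-to-one pass: skip state updates on padding, average logits over real steps."""
    mask = tf.cast(tf.not_equal(x_int, PAD_ID), tf.float32)      # (batch, time); 1 = real token
    h = tf.zeros([tf.shape(x_onehot)[0], W_hh.shape[0]])
    step_logits = []
    for t in range(x_onehot.shape[1]):
        m_t = mask[:, t][:, None]                                # (batch, 1)
        h_new = tf.tanh(tf.matmul(x_onehot[:, t, :], W_xh) + tf.matmul(h, W_hh) + b_h)
        h = m_t * h_new + (1.0 - m_t) * h                        # keep the old state where padded
        step_logits.append(tf.matmul(h, W_hy) + b_y)             # (batch, 1) logit per step
    logits = tf.stack(step_logits, axis=1)                       # (batch, time, 1)
    denom = tf.maximum(tf.reduce_sum(mask, axis=1, keepdims=True), 1.0)
    return tf.reduce_sum(logits * mask[:, :, None], axis=1) / denom
```

Averaging logits over all real steps is only one of the options listed above; using just the last real step, or averaging hidden states or probabilities instead, are equally reasonable variants to compare.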
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab rnn-text-classification.ipynb

- Consider sequence length caps (e.g., 200) and vocabulary limits (e.g., 20k) for speed and stability.
- You may use Keras optimizers/losses/metrics, but the RNN recurrence itself is implemented with low‑level ops.