This project tackles binary sentiment analysis on the IMDB movie reviews dataset using a from‑scratch RNN built with low‑level TensorFlow operations. It emphasizes understanding the recurrent update equations, padding/truncation strategies, vocabulary limits, and practical training challenges for sequence data.
- Load IMDB via `tf.keras.datasets.imdb` (50k reviews: 25k train / 25k test, binary labels).
- Inspect the indexed sequences; optionally reconstruct text via `get_word_index()`.
- Represent words as one‑hot vectors (sequence length × vocabulary size) and discuss why simple integer inputs are unsuitable.
- Handle variable sequence lengths with padding (and explore pre vs post padding) and truncation (e.g., to 200 tokens).
- Limit vocabulary size (e.g., keep the top 20k words, map rare words to an UNK token) to reduce memory and improve learning (a data‑preparation sketch for these steps follows this list).
- Implement an RNN from scratch (no Keras `RNNCell`/high‑level RNN layers):
  - Loop over time steps, updating the hidden state from the previous state and the current input (`tf.matmul` + nonlinearity).
  - Use a many‑to‑one setup (final‑step output) or aggregate across steps.
- Train with `tf.GradientTape` (BPTT) and Keras optimizers/losses/metrics (recurrence and training‑step sketches follow this list).
- Compare output formulations (a short side‑by‑side snippet follows this list):
  - 2‑unit logits + softmax + sparse categorical cross‑entropy vs 1‑unit sigmoid + binary cross‑entropy.
- Explore training issues: slow starts, vanishing gradients, initialization, learning rate, and using all time steps (averaging states/logits).
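
A minimal data‑preparation sketch for the loading, vocabulary, padding, and one‑hot steps above, assuming a 20k‑word vocabulary and a 200‑token cap (the constant names and the pre/post choices here are illustrative, not prescribed by the notebook):

```python
import tensorflow as tf

VOCAB_SIZE = 20_000   # keep the 20k most frequent words; rarer words become the OOV index
MAX_LEN = 200         # pad/truncate every review to 200 tokens

# load_data re-indexes words by frequency; with num_words set, rare words map to the OOV index.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)

# Optionally rebuild readable text: indices 0-2 are reserved (pad/start/OOV), so shift by 3.
word_index = tf.keras.datasets.imdb.get_word_index()
index_word = {i + 3: w for w, i in word_index.items()}
print(" ".join(index_word.get(i, "<UNK>") for i in x_train[0][:20]))

# Fixed-length sequences: 'pre' vs 'post' padding matters for a many-to-one RNN, because it
# decides whether the final time step holds real words or padding.
x_train = tf.keras.preprocessing.sequence.pad_sequences(
    x_train, maxlen=MAX_LEN, padding="pre", truncating="post")
x_test = tf.keras.preprocessing.sequence.pad_sequences(
    x_test, maxlen=MAX_LEN, padding="pre", truncating="post")

# One-hot a batch on the fly: (batch, MAX_LEN) ints -> (batch, MAX_LEN, VOCAB_SIZE) floats.
# Materializing all 25k training reviews this way would be far too large, so do it per batch.
def one_hot_batch(batch_ints):
    return tf.one_hot(batch_ints, depth=VOCAB_SIZE, dtype=tf.float32)
```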
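
The recurrence itself can stay at the level of `tf.Variable`, `tf.matmul`, and a `tanh` nonlinearity; this is one possible many‑to‑one sketch (the hidden size, initializer, and variable names are assumptions, not the notebook's exact choices):

```python
import tensorflow as tf

HIDDEN = 64           # hidden-state size (illustrative)
VOCAB_SIZE = 20_000   # must match the one-hot depth used for the inputs

# Parameters of the recurrence h_t = tanh(x_t @ W_xh + h_{t-1} @ W_hh + b_h)
# plus a final classification layer producing one logit from the last hidden state.
init = tf.keras.initializers.GlorotUniform()
W_xh = tf.Variable(init((VOCAB_SIZE, HIDDEN)))
W_hh = tf.Variable(init((HIDDEN, HIDDEN)))
b_h = tf.Variable(tf.zeros((HIDDEN,)))
W_hy = tf.Variable(init((HIDDEN, 1)))
b_y = tf.Variable(tf.zeros((1,)))

def rnn_forward(x_onehot):
    """Many-to-one forward pass: (batch, time, vocab) one-hot input -> (batch, 1) logits."""
    h = tf.zeros([tf.shape(x_onehot)[0], HIDDEN])
    for t in range(x_onehot.shape[1]):   # loop over time steps
        x_t = x_onehot[:, t, :]
        h = tf.tanh(tf.matmul(x_t, W_xh) + tf.matmul(h, W_hh) + b_h)
    return tf.matmul(h, W_hy) + b_y      # logit from the final hidden state only
```

With 200 unrolled tanh steps, initialization and learning rate matter a lot; the slow starts and vanishing gradients listed above usually show up first in this loop.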
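Training then wraps that forward pass in `tf.GradientTape`, with Keras pieces only around the edges; this sketch assumes the `rnn_forward`, `one_hot_batch`, and variable names from the snippets above:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
accuracy = tf.keras.metrics.BinaryAccuracy(threshold=0.0)   # threshold 0 because we feed logits

trainable = [W_xh, W_hh, b_h, W_hy, b_y]   # variables from the recurrence sketch above

def train_step(x_int, y):
    x = one_hot_batch(x_int)                       # (batch, time, vocab)
    y = tf.cast(y, tf.float32)[:, None]            # (batch, 1) float labels for BCE
    with tf.GradientTape() as tape:
        logits = rnn_forward(x)
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, trainable)         # backprop through every time step (BPTT)
    optimizer.apply_gradients(zip(grads, trainable))
    accuracy.update_state(y, logits)
    return loss

# Example mini-batch loop over the padded integer sequences:
# for i in range(0, len(x_train), 64):
#     loss = train_step(x_train[i:i + 64], y_train[i:i + 64])
```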
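The two output formulations differ only in the width of the final layer and the loss; a tiny side‑by‑side comparison (the logit and label values are made up):

```python
import tensorflow as tf

# Option A: 2-unit logits + softmax + sparse categorical cross-entropy (integer labels 0/1).
logits_2 = tf.constant([[1.2, -0.3]])      # (batch, 2)
y_int = tf.constant([0])
loss_a = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(y_int, logits_2)
probs_a = tf.nn.softmax(logits_2)          # two class probabilities summing to 1

# Option B: 1-unit logit + sigmoid + binary cross-entropy (float labels 0.0/1.0).
logit_1 = tf.constant([[0.7]])             # (batch, 1)
y_float = tf.constant([[0.0]])
loss_b = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y_float, logit_1)
prob_b = tf.nn.sigmoid(logit_1)            # probability of the positive class
```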
- Why is padding to the global max length wasteful? Smarter batching/padding schemes?
- Truncate long sequences vs remove them—trade‑offs?
- Alternatives to one‑hot (e.g., learned embeddings) to avoid huge vectors? (An embedding‑lookup sketch follows this list.)
- Pre vs post padding—why does it matter when using the last time‑step output?
- Masking padded steps—how to skip updates for padding within a batch?
- Ways to leverage all time steps (averaging logits/states/probabilities)—pros/cons? (A masked‑update sketch touching on these last two questions follows this list.)
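
On the embedding question above: a trainable lookup table is the usual alternative to huge one‑hot vectors; a minimal sketch (the sizes and names are illustrative):

```python
import tensorflow as tf

VOCAB_SIZE = 20_000
EMBED_DIM = 64   # far smaller than a 20k-dimensional one-hot vector

# Looking up row i of this matrix is equivalent to one_hot(i) @ embedding,
# but the one-hot vector is never materialized.
embedding = tf.Variable(tf.random.uniform((VOCAB_SIZE, EMBED_DIM), -0.05, 0.05))

def embed_batch(x_int):
    """(batch, time) integer ids -> (batch, time, EMBED_DIM) dense vectors."""
    return tf.nn.embedding_lookup(embedding, x_int)
```

Feeding `embed_batch(x_int)` into the recurrence would shrink the input projection from a vocabulary‑sized matrix to an `(EMBED_DIM, HIDDEN)` one.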
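And on the last two questions: one way to handle both is to freeze the hidden state on padded positions and average per‑step logits over the real tokens only. This sketch assumes the variables from the recurrence sketch earlier and that padding uses index 0 (the `pad_sequences` default):

```python
import tensorflow as tf

PAD_ID = 0   # pad index produced by pad_sequences

def rnn_forward_masked(x_onehot, x_int):
    """Masked many-to-one pass: skip state updates on padding, average logits over real steps."""
    mask = tf.cast(tf.not_equal(x_int, PAD_ID), tf.float32)      # (batch, time); 1 = real token
    h = tf.zeros([tf.shape(x_onehot)[0], W_hh.shape[0]])
    step_logits = []
    for t in range(x_onehot.shape[1]):
        m_t = mask[:, t][:, None]                                # (batch, 1)
        h_new = tf.tanh(tf.matmul(x_onehot[:, t, :], W_xh) + tf.matmul(h, W_hh) + b_h)
        h = m_t * h_new + (1.0 - m_t) * h                        # keep the old state where padded
        step_logits.append(tf.matmul(h, W_hy) + b_y)             # (batch, 1) logit per step
    logits = tf.stack(step_logits, axis=1)                       # (batch, time, 1)
    denom = tf.maximum(tf.reduce_sum(mask, axis=1, keepdims=True), 1.0)
    return tf.reduce_sum(logits * mask[:, :, None], axis=1) / denom
```

Averaging logits over all real steps is only one of the options listed above; using just the last real step, or averaging hidden states or probabilities instead, are equally reasonable variants to compare.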
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab rnn-text-classification.ipynb

- Consider sequence length caps (e.g., 200) and vocabulary limits (e.g., 20k) for speed and stability.
- You may use Keras optimizers/losses/metrics, but the RNN recurrence itself is implemented with low‑level ops.