@andrestobelem

While I was reading Modern Software Engineering: Doing What Works to Build Better Software Faster by Dave Farley, I ran into a small but annoying issue that felt very familiar.

What happened

I was working with a bunch of .epub files and, like most people, I assumed they would all follow the usual rule: an EPUB is just a ZIP file with a specific structure inside.

Turns out that wasn’t always true.

Some of the files were actually TAR archives (sometimes compressed), just renamed to .epub. As expected, standard EPUB libraries couldn’t read them at all.

There were a couple of extra complications too:

  • Non-standard directory layouts: the book contents were inside nested folders instead of being at the root, which broke relative paths.
  • System files mixed in: hidden OS metadata and resource files showed up in the archive and occasionally caused validation or parsing issues.
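
To make the second point concrete, here is a small sketch of how such entries could be recognized. The description above does not list the exact files involved, so the names below are an assumption based on what macOS and Windows commonly add to archives (`__MACOSX` folders, `._*` resource forks, `.DS_Store`, `Thumbs.db`, `desktop.ini`). The helper `is_system_entry` is hypothetical, not part of this pull request.

```python
from pathlib import PurePosixPath

# Assumed examples of OS-generated entries; adjust to what actually shows up.
JUNK_DIRS = {"__MACOSX"}
JUNK_FILES = {".DS_Store", "Thumbs.db", "desktop.ini"}


def is_system_entry(name: str) -> bool:
    """Return True if an archive entry looks like OS metadata (hypothetical helper)."""
    parts = PurePosixPath(name).parts
    if not parts:
        return False
    # Anything inside a known metadata folder is junk.
    if any(part in JUNK_DIRS for part in parts):
        return True
    base = parts[-1]
    # "._name" files are AppleDouble resource forks created by macOS.
    return base in JUNK_FILES or base.startswith("._")
```

Entries for which this returns True could simply be skipped while reading the archive.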

What I ended up doing

Rather than special-casing everything later, I added a small pre-processing step to clean things up before reading the files:

  • Check the real file type
    Instead of trusting the file extension, the code looks at the file signature (the leading magic bytes) to see what it actually is.

  • Convert in memory
    If the file is a TAR, it gets converted to a ZIP on the fly, using in-memory buffers only.

  • Figure out the actual root
    The required mimetype file, which a valid EPUB keeps at the archive root, is used as an anchor to find the real root of the EPUB, and any extra container directories are stripped out.

  • Keep the rest simple
    After that, the rest of the system always works with a normal ZIP-based EPUB, no matter how the file was originally packaged.
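
The steps above can be sketched roughly as follows. This is a minimal illustration and not the code from this pull request: the names `normalize_epub`, `_read_entries`, and `_find_root` are hypothetical, and the sketch assumes the whole file fits in memory and uses only Python's standard `zipfile` and `tarfile` modules.

```python
import io
import tarfile
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature at the start of a ZIP
EPUB_MIMETYPE = b"application/epub+zip"


def _read_entries(data: bytes) -> dict[str, bytes]:
    """Read all regular files from a ZIP or (optionally compressed) TAR."""
    buf = io.BytesIO(data)
    if data[:4] == ZIP_MAGIC:
        with zipfile.ZipFile(buf) as zf:
            return {i.filename: zf.read(i) for i in zf.infolist() if not i.is_dir()}
    try:
        # Mode "r:*" lets tarfile detect gzip, bzip2, or xz compression itself.
        with tarfile.open(fileobj=buf, mode="r:*") as tf:
            return {
                m.name.removeprefix("./"): tf.extractfile(m).read()
                for m in tf.getmembers()
                if m.isfile()
            }
    except tarfile.ReadError as exc:
        raise ValueError("not a ZIP or TAR archive") from exc


def _find_root(files: dict[str, bytes]) -> str:
    """Return the directory prefix that contains the EPUB mimetype file."""
    candidates = [
        name
        for name, payload in files.items()
        if name.rsplit("/", 1)[-1] == "mimetype" and payload.strip() == EPUB_MIMETYPE
    ]
    if not candidates:
        raise ValueError("no EPUB mimetype file found")
    # Prefer the shallowest match in case a nested copy exists.
    shallowest = min(candidates, key=lambda name: name.count("/"))
    return shallowest[: -len("mimetype")]  # "" or e.g. "wrapper/book/"


def normalize_epub(data: bytes) -> bytes:
    """Return ZIP-based EPUB bytes with the mimetype file at the root."""
    files = _read_entries(data)
    root = _find_root(files)
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w") as zf:
        # The EPUB container format expects mimetype first and uncompressed.
        zf.writestr(
            zipfile.ZipInfo("mimetype"), EPUB_MIMETYPE, compress_type=zipfile.ZIP_STORED
        )
        for name, payload in files.items():
            if not name.startswith(root) or name == root + "mimetype":
                continue  # outside the EPUB root, or already written
            zf.writestr(name[len(root):], payload, compress_type=zipfile.ZIP_DEFLATED)
    return out.getvalue()
```

In this sketch a caller would pass the raw bytes of the `.epub` file and hand the result to the usual EPUB reader, for example via `io.BytesIO(normalize_epub(raw))`. ZIP input goes through the same path, so a ZIP with a nested root is also flattened.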

Nothing fancy, but it made the whole flow a bit more resilient and avoided a bunch of edge cases later on.

@andrestobelem marked this pull request as draft January 6, 2026 11:25
@andrestobelem marked this pull request as ready for review January 6, 2026 11:36