Extract id3 tags by regex #175

kookster · 2025-10-18T19:12:11Z

implements #174

lib/porter-util/index.d.ts

src/lambdas/inspect/audio.js

kookster · 2025-10-18T19:14:28Z

src/lambdas/inspect/ffprobe.js

        "-show_streams",
        "-show_format",
+        "-show_entries",
+        "format_tags",


This is what actually causes ffprobe to detect and return the id3 tags
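To illustrate the effect of the extra flags, here is a minimal sketch that builds an ffprobe argument list including `-show_entries format_tags` and reads the tags out of JSON output. The helper names (`buildFfprobeArgs`, `extractFormatTags`) and the sample output are assumptions for illustration only, not the code in `ffprobe.js`.

```javascript
// Hypothetical helper: arguments for an ffprobe call that returns format-level tags.
function buildFfprobeArgs(filePath) {
  return [
    "-v", "error",
    "-print_format", "json",
    "-show_streams",
    "-show_format",
    "-show_entries", "format_tags",
    filePath,
  ];
}

// Hypothetical helper: read format.tags from ffprobe's JSON output.
// Returns an empty object when the file has no format-level tags.
function extractFormatTags(stdout) {
  const parsed = JSON.parse(stdout);
  return (parsed.format && parsed.format.tags) || {};
}

// Sample output shaped like ffprobe JSON (values are made up for the example).
const sampleOutput = JSON.stringify({
  streams: [],
  format: { format_name: "mp3", tags: { comment: "AIS_AD_BREAK_1=2000,0;" } },
});
console.log(extractFormatTags(sampleOutput));
```

The argument list would be passed to whatever process runner the lambda already uses; that part is left out here because it depends on the existing code.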

kookster · 2025-10-18T19:14:59Z

src/lambdas/inspect/index.js

- * @property {string} Type
+ * @property {string} [Type]
 * @property {boolean} [EBUR128]
+ * @property {string} [MatchTags]


Added a Task attribute to pass in a regex that can be used to select and return only certain tags

kookster · 2025-10-18T19:15:20Z

src/lambdas/inspect/audio.js

+
+      // use regex to extract only the matching tags
+      Object.keys(tags).forEach((key) => {
+        if (regex.test(key) || regex.test(tags[key])) {


Checks if there is a matching tag by key or value
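A minimal sketch of the key-or-value check, assuming the tags arrive as a plain object of strings and the pattern arrives as a string. The function name `matchTags` and the sample tags are hypothetical.

```javascript
// Keep only the tags whose key or value matches the pattern.
function matchTags(tags, pattern) {
  // Build the regex without the "g" flag: with "g", RegExp.prototype.test
  // keeps state in lastIndex, so repeated calls can miss matches.
  const regex = new RegExp(pattern);
  const matched = {};
  for (const [key, value] of Object.entries(tags)) {
    if (regex.test(key) || regex.test(String(value))) {
      matched[key] = value;
    }
  }
  return matched;
}

// Made-up tags for the example.
const exampleTags = {
  title: "Episode 12",
  comment: "AIS_AD_BREAK_1=2000,0;",
};
console.log(matchTags(exampleTags, "AIS_AD_BREAK_"));
```

Note that a pattern matching either side returns the whole tag, so a job cannot tell from the result whether the key or the value matched.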

farski · 2025-10-21T16:00:30Z

A few thoughts about the API for this feature:

I think it would be good to allow the job definition to explicitly decide if it's matching on tag key and/or tag value

I can imagine wanting to provide comparison methods other than regex down the road, so for future-proofing it may be worth making the API more verbose about that, so we don't end up with a single parameter that paints us into a corner

Right now we're only operating on tags coming back from FFprobe, and we don't run every inspectable file through FFmpeg. So for example, if a job inspects a JPEG, which can have tag-like things, this wouldn't do anything, even if the job asks for tags that are present on the file. We should make sure the docs are very clear about what types of metadata this will work with, and, if necessary, for which file formats. Right now this wouldn't even return tags for video files, which we do inspect with FFprobe, so there's a bit of ambiguity in how this feature works at the moment.

I think it'd be best to avoid making this something like MatchId3Tags at the API level, since that's not really what it's doing even now. But if we do decide that this isn't a general-purpose "find metadata key/values" feature, I'd want to make sure the API reflects that.

My preference would be to have reasonable support for standard types of KV metadata on audio, video, and images. For audio and video I think we can rely on the FFprobe stuff already in here, and the sharp lib we're already using for images has solid support for XMP/EXIF, so it's probably not hard to expand this over to that. I think those two tools normalize access to tags/metadata across enough different formats to call that a good starting point.

I'm imagining something like this for the API (structure, not naming):

    Task: {
      Type: "Inspect",
      ReturnMetadata: {
        Keys: { StringLike: "AIS_AD_BREAK_" },
        Values: { StringLike: "AIS_AD_BREAK_" }
      }
    }
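One way the proposed structure could be evaluated is sketched below. Everything here is an assumption layered on the proposal: `StringLike` is treated as a case-sensitive substring check, a tag is returned when either its key or its value satisfies the corresponding condition, and the names `stringLike` and `selectMetadata` are hypothetical.

```javascript
// Assumption: StringLike means "contains this substring" (case-sensitive).
function stringLike(condition, text) {
  return (
    Boolean(condition) &&
    typeof condition.StringLike === "string" &&
    String(text).includes(condition.StringLike)
  );
}

// Assumption: a tag is returned when its key OR its value satisfies the
// corresponding condition; omitting Keys or Values skips that check.
function selectMetadata(tags, returnMetadata) {
  const { Keys, Values } = returnMetadata || {};
  const selected = {};
  for (const [key, value] of Object.entries(tags)) {
    if (stringLike(Keys, key) || stringLike(Values, value)) {
      selected[key] = value;
    }
  }
  return selected;
}

// Made-up tags for the example.
const fileTags = {
  title: "Episode 12",
  comment: "AIS_AD_BREAK_1=2000,0;",
  AIS_AD_BREAK_2: "4000,0;",
};
const valuesOnly = { Values: { StringLike: "AIS_AD_BREAK_" } };
console.log(selectMetadata(fileTags, valuesOnly));
```

Keeping `Keys` and `Values` as separate objects leaves room for other comparison operators later without changing the shape of the task.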

I agree with most of this. I'm not sure about expanding the scope of this ticket to cover how we handle metadata on anything that might go through Porter, but I agree the current limits could be documented, and issues added for building out image or other support.

Adobe XMP is also included in the id3 data of many of the files I looked at, but ffprobe on its own doesn't handle it well: the XMP is returned hex encoded, and ffprobe neither decodes it nor parses the XML. Additional processing or a different tool might.
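As a rough idea of the additional processing mentioned above, this sketch decodes a hex-encoded tag value into text using Node's `Buffer`. It assumes the value is a plain hex string; the exact form ffprobe returns is not confirmed here, the helper name `decodeHexTag` is hypothetical, and parsing the resulting XML is left out.

```javascript
// Hypothetical helper: decode a hex-encoded tag value into UTF-8 text.
// Returns null when the input is not plausible hex, so callers can fall
// back to the raw value.
function decodeHexTag(value) {
  const hex = String(value).trim();
  if (hex.length === 0 || hex.length % 2 !== 0 || !/^[0-9a-fA-F]+$/.test(hex)) {
    return null;
  }
  return Buffer.from(hex, "hex").toString("utf8");
}

// Round trip with a made-up XMP fragment to show the idea.
const xmpFragment = '<x:xmpmeta xmlns:x="adobe:ns:meta/"></x:xmpmeta>';
const encodedXmp = Buffer.from(xmpFragment, "utf8").toString("hex");
console.log(decodeHexTag(encodedXmp) === xmpFragment);
```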

kookster · 2025-10-18T19:16:52Z

test/unit/lambdas/InspectLambdaFunction.test.js

+  );
+  expect(result.Task).toEqual("Inspect");
+  expect(result.Inspection.Audio.Tags).toEqual({
+    comment: "AIS_AD_BREAK_1=2000,0;",


returns the comment that matches

kookster · 2025-10-18T19:17:28Z

CONTRIBUTING.md

 The project includes a `Makefile` to centralize various commands used in development and operation. When new commands are required, be sure to add them to the `Makefile`, even if it calls out to something else, like `npm run-script`.

-Besides language runtimes, package managers, and command line programs, tooling should be written to assume that libraries are installed within the package, not globally available on the system. I.e, do not asusme the `eslint` or `prettier` are installed globally.
+Besides language runtimes, package managers, and command line programs, tooling should be written to assume that libraries are installed within the package, not globally available on the system. I.e, do not assume the `eslint` or `prettier` are installed globally.


Just some spell check changes on here

kookster · 2025-10-18T19:17:53Z

package.json

 {
  "name": "porter",
  "version": "1.0.0",
+  "type": "module",


this is what got me past import errors!

farski

Some additional thinking around how best to handle this feature for the majority of files that get inspected.

Also, need to update the README for this new functionality.

If additional tests are going to be added, I would do them in a branch off this branch, so this PR doesn't get too overloaded.

lib/porter-util/index.d.ts

src/lambdas/inspect/audio.js


farski

I think this is getting pretty close

src/lambdas/inspect/audio.js

README.md

farski · 2025-10-31T22:56:03Z

src/lambdas/inspect/audio.js

 /** @typedef {import('./index.js').InspectTask} InspectTask */

+/**
+ * @typedef {{ [key: string]: string }} Tags


Do you think it makes sense to elevate metadata to the top level of the Inspection object? I can't think of any cases where individual streams within a file would have different metadata like what we're dealing with here (e.g., tags). I guess technically everything Inspect returns is metadata in a way, but we don't currently call things like bit depth or video resolution that.

So for example, once we support tag matching on other file types, does it make sense that those tags end up in different places in the results, depending on if the file is video/audio/image?

It could be useful to say that we treat this type of returnable metadata as a file-level aspect, and we normalize things so that no matter where in the file or which part of the inspection process it comes from, things like tags would always end up under Inspection.Metadata.Tags.

I do not have a strong opinion. I think almost everything we return is a kind of metadata, so I would perhaps just call these "tags" as they are arbitrary data added to the file.

The issue I see with pulling this metadata up out of Audio is that it changes how data is currently returned from each of the different inspect tasks, so index.js would have to be reworked.

Another way to put it: a single inspect task can include multiple return values, for example when an audio file includes an image in its id3 data, so it makes sense to keep the metadata for the audio and the image separate.

I'm going to resolve this unless you feel strongly, and have a way to handle combined metadata from each inspect type?
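For reference, one possible way to combine tags from each inspect type without reworking the per-type results is to gather them into a single list that records where each tag came from. This is only a sketch of that option: the section names other than `Audio`, the `{ source, key, value }` shape, and the name `collectTags` are all assumptions.

```javascript
// Hypothetical: gather tags from each per-type section into one list while
// recording the section each tag came from.
function collectTags(inspection, sections = ["Audio", "Video", "Image"]) {
  const collected = [];
  for (const section of sections) {
    const tags = (inspection[section] && inspection[section].Tags) || {};
    for (const [key, value] of Object.entries(tags)) {
      collected.push({ source: section, key, value });
    }
  }
  return collected;
}

// Made-up inspection result: an audio file with an embedded image.
const exampleInspection = {
  Audio: { Tags: { comment: "AIS_AD_BREAK_1=2000,0;" } },
  Image: { Tags: { Comment: "cover art" } },
};
console.log(collectTags(exampleInspection));
```

Because each entry carries its source, the audio and image metadata stay distinguishable even in a combined list.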

kookster · 2025-11-01T02:04:54Z

src/lambdas/inspect/audio.js

 import { inspect as ebur128 } from "./ebu-r-128.js";
 import { inspect as ffprobe } from "./ffprobe.js";
 import { inspect as mpck } from "./mpck.js";
+import { ffprobeTags } from "./tags.js";


I put the tags logic in a shared util file, so the types and methods could be reused

kookster · 2025-11-01T02:05:31Z

src/lambdas/inspect/audio.js

 * @property {number} [LoudnessTruePeak]
 * @property {number} [LoudnessRange]
 * @property {number} [UnidentifiedBytes]
+ * @property {Tag[]} [Tags]


This used to be a hash, but is now a list of objects, as the same key may be used for more than one tag.
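The difference can be seen in a small sketch. The `{ key, value }` shape used for a tag object and the name `matchTagList` are assumptions for illustration; the actual `Tag` type is defined in the PR's shared tags file.

```javascript
// Entries as they might arrive from a file, with a repeated "comment" label.
const rawEntries = [
  ["comment", "AIS_AD_BREAK_1=2000,0;"],
  ["comment", "AIS_AD_BREAK_2=4000,0;"],
  ["title", "Episode 12"],
];

// A map keeps only the last value for a repeated key.
const asMap = Object.fromEntries(rawEntries);

// A list of objects keeps every entry. The { key, value } shape is assumed.
const asList = rawEntries.map(([key, value]) => ({ key, value }));

// Regex matching over the list, by key or value.
function matchTagList(list, pattern) {
  const regex = new RegExp(pattern);
  return list.filter((tag) => regex.test(tag.key) || regex.test(tag.value));
}

console.log(Object.keys(asMap).length); // 2: one "comment" was overwritten
console.log(matchTagList(asList, "AIS_AD_BREAK_").length); // 2: both comments kept
```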

kookster · 2025-11-03T14:48:16Z

A few changes here:

- Added tag support for video as well as audio, and extracted the common ffmpeg tag extraction
- Added support for PNG comments extracted by sharp, which operate like tags
- Because comments and possibly id3 tags can have repeated labels, they are now returned as an array of objects instead of a map
- Renamed StringIncludes to StringMatches to reflect the regex functionality

@farski I think we're even closer?

README.md

kookster added 9 commits October 18, 2025 12:56

Move jest config

aee40c3

Small edits

320c486

Add jest and aws mock support

a2b54dd

Add a test for inspect, including tags

f2ef781

Allow specifying bin directory

a0fa61a

Extract id3 tags based on task regex

998e36e

Add override for bin executable location

6f82b75

Use binDir consistently

56d9a13

Fix format

e3abce6


Update test desc

2283b20

farski requested changes Oct 21, 2025


kookster changed the title ~~Extract id3 tags by regex~~ WIP: Extract id3 tags by regex Oct 21, 2025

kookster added 4 commits October 26, 2025 22:24

Upgrade jest and mocks

8c2cdf3

Support individual bin paths

35a9493

Improve task request structure

8bcfea0

Update test to new task structure

7e55c5e

kookster mentioned this pull request Oct 29, 2025

Import id3 timings PRX/feeder.prx.org#1365

Merged

6 tasks

Add new task attribute docs

dc0acb5

kookster changed the title ~~WIP: Extract id3 tags by regex~~ Extract id3 tags by regex Oct 31, 2025

Fix lint error

990b36c

kookster requested a review from farski October 31, 2025 19:09

Add loudness test to Inspect

2030136

farski requested changes Oct 31, 2025


kookster added 5 commits October 31, 2025 20:03

Rename includes to matches, update docs

6cbcab9

Add png comments to tags checked

900d280

Add tag support for video

58ca18a

linting

3593477

Update video and png image support in docs

9491e06

kookster commented Nov 1, 2025


kookster requested a review from farski November 3, 2025 14:51

farski approved these changes Nov 3, 2025


More docs on the regex

e61638f

kookster merged commit 6313dab into main Nov 4, 2025
4 checks passed

kookster deleted the feat/id3_breaks branch November 4, 2025 17:31
