-
Notifications
You must be signed in to change notification settings - Fork 334
hist(group): add validation & empty-data handling; docstring, guide and tests #658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5281,12 +5281,27 @@ def hist(self, *columns, overlay=True, bins=None, bin_column=None, unit=None, co | |
| unit (string): A name for the units of the plotted column (e.g. | ||
| 'kg'), to be used in the plot. | ||
|
|
||
| group (column name or index): A column of categories. The rows are | ||
| grouped by the values in this column, and a separate histogram is | ||
| generated for each group. The histograms are overlaid or plotted | ||
| separately depending on the overlay argument. If None, no such | ||
| grouping is done. Note: `group` cannot be used together with `bin_column` or when plotting | ||
| multiple columns. An error will be raised in these cases. | ||
| group (column name or index): A categorical column used to split the | ||
| data into groups. A separate histogram is generated for each | ||
| unique value in this column. Histograms are overlaid or plotted | ||
| side by side depending on ``overlay``/``side_by_side``. If ``None``, | ||
| no grouping is applied. | ||
|
|
||
| Constraints and behavior: | ||
| - ``group`` cannot be combined with ``bin_column``. | ||
| - ``group`` requires exactly one histogram value column. If more | ||
| than one value column is passed, a ``ValueError`` is raised. | ||
| - If ``group`` does not reference an existing column (by label or | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not needed |
||
| index), a ``ValueError`` is raised. | ||
|
|
||
| Usage examples: | ||
| >>> t = Table().with_columns( | ||
| ... 'height', make_array(160, 170, 180, 175), | ||
| ... 'gender', make_array('F', 'M', 'M', 'F')) | ||
| >>> t.hist('height', group='gender') # doctest: +SKIP | ||
| <two histograms comparing height distributions by gender> | ||
| >>> t.hist('height', group='gender', side_by_side=True) # doctest: +SKIP | ||
| <two histograms shown side by side for comparison> | ||
|
|
||
| side_by_side (bool): Whether histogram bins should be plotted side by | ||
| side (instead of directly overlaid). Makes sense only when | ||
|
|
@@ -5386,6 +5401,16 @@ def hist(self, *columns, overlay=True, bins=None, bin_column=None, unit=None, co | |
| if counts is not None and bin_column is None: | ||
| warnings.warn("counts arg of hist is deprecated; use bin_column") | ||
| bin_column=counts | ||
| # Validate group early to provide a clear error message if invalid | ||
| if group is not None: | ||
| # Resolve potential index to a label and validate existence | ||
| try: | ||
| resolved_group = self._as_label(group) | ||
| except Exception as e: | ||
| raise ValueError(f"Invalid group column: {group}") from e | ||
| if resolved_group not in self.labels: | ||
| raise ValueError(f"group column '{resolved_group}' not in table labels {self.labels}") | ||
| group = resolved_group | ||
| if columns: | ||
| columns_included = list(columns) | ||
| if bin_column is not None: | ||
|
|
@@ -5429,6 +5454,8 @@ def prepare_hist_with_group(group): | |
| warnings.warn("It looks like you're making a grouped histogram with " | ||
| "a lot of groups ({:d}), which is probably incorrect." | ||
| .format(grouped.num_rows)) | ||
| if grouped.num_rows == 0: | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the argument for this change in behavior? |
||
| return [] | ||
| return [("{}={}".format(group, k), (v[0][1],)) for k, v in grouped.index_by(group).items()] | ||
|
|
||
| # Populate values_dict: An ordered dict from column name to singleton | ||
|
|
@@ -5461,6 +5488,10 @@ def draw_hist(values_dict): | |
| "following code: `np.set_printoptions(legacy='1.13')`", UserWarning) | ||
| # This code is factored as a function for clarity only. | ||
| n = len(values_dict) | ||
| if n == 0: | ||
| # Create an empty figure to maintain a no-error contract on empty groups | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the argument for this change in behavior? Why is it important to avoid exceptions in this case? |
||
| plt.figure(figsize=(width, height)) | ||
| return | ||
| colors = [rgb_color + (self.default_alpha,) for rgb_color in | ||
| itertools.islice(itertools.cycle(self.chart_colors), n)] | ||
| hist_names = list(values_dict.keys()) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,31 @@ | ||
| # Grouped Histograms with `Table.hist` | ||
|
|
||
| This project supports grouped histograms via the `group` parameter on `Table.hist`. Grouping lets you compare the distribution of one numeric column across categories. | ||
|
|
||
| Minimal example: | ||
|
|
||
| ```python | ||
| from datascience import Table, make_array | ||
|
|
||
| t = Table().with_columns( | ||
| 'height', make_array(160, 170, 180, 175), | ||
| 'gender', make_array('F', 'M', 'M', 'F') | ||
| ) | ||
|
|
||
| # Compare height distributions by gender (overlaid) | ||
| t.hist('height', group='gender') | ||
|
|
||
| # Show the grouped histograms side by side | ||
| t.hist('height', group='gender', side_by_side=True) | ||
| ``` | ||
|
|
||
| Interpretation: | ||
| - When `group='gender'`, the table splits rows by each unique value in `gender` and draws a separate histogram for the `height` values in each group. | ||
| - Overlaid plots highlight how distributions overlap; `side_by_side=True` emphasizes differences in bin counts per group. | ||
|
|
||
| Notes and constraints: | ||
| - `group` cannot be used together with `bin_column`. | ||
| - `group` expects exactly one numeric value column (e.g., `'height'`). Passing multiple value columns raises a `ValueError`. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think the exception raised should be part of the API specification. |
||
| - If `group` does not reference an existing column label or index, a `ValueError` is raised. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think the exception raised should be part of the API specification. |
||
| - If the data are empty for all groups, `hist` creates an empty figure and returns without error. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why? |
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| import numpy as np | ||
| import pytest | ||
|
|
||
| import datascience as ds | ||
|
|
||
|
|
||
| def test_hist_group_normal_no_error(): | ||
| t = ds.Table().with_columns( | ||
| 'value', ds.make_array(1, 2, 3, 2, 5), | ||
| 'cat', ds.make_array('a', 'a', 'a', 'b', 'b') | ||
| ) | ||
| # Should not raise | ||
| t.hist('value', group='cat') | ||
|
|
||
|
|
||
| def test_hist_group_invalid_label_raises_value_error(): | ||
| t = ds.Table().with_columns( | ||
| 'value', ds.make_array(1, 2, 3), | ||
| 'cat', ds.make_array('x', 'y', 'x') | ||
| ) | ||
| with pytest.raises(ValueError): | ||
| t.hist('value', group='missing_col') | ||
|
|
||
|
|
||
| def test_hist_group_empty_data_no_error(): | ||
| # Empty table after filtering | ||
| t = ds.Table().with_columns( | ||
| 'value', ds.make_array(), | ||
| 'cat', ds.make_array() | ||
| ) | ||
| # Should not raise; creates an empty figure | ||
| t.hist('value', group='cat') | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to specify which exception is raised if a constraint is violated.