popmon.pipeline package

Submodules

popmon.pipeline.amazing_pipeline module

class popmon.pipeline.amazing_pipeline.AmazingPipeline(histogram_path, **kwargs)

Bases: Pipeline

__init__(histogram_path, **kwargs)

Initialization of the pipeline

Parameters

modules (list) – modules of the pipeline.
logger – logger to be used by each module.

popmon.pipeline.amazing_pipeline.run()

Example that runs a self-reference pipeline and produces a monitoring report.

Return type: None

popmon.pipeline.dataset_splitter module

popmon.pipeline.dataset_splitter.split_dataset(dataset, split, time_axis)

Split a dataset into a reference and remaining part based on split params.

Parameters

dataset (pd.DataFrame|pyspark.sql.DataFrame) – dataset as input
split (Any) – split details; the meaning depends on the type:
    if an integer, the reference consists of the first split instances;
    if a float, split is used as a ratio (e.g. 0.5 returns a 50/50 split);
    otherwise, split is interpreted as a condition: the records for which the condition is true form the reference, and the other records form the remaining dataset.
time_axis (str) – the time axis

Returns

tuple of (reference, remaining dataset)
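The three split modes described above can be sketched in plain Python. This is a minimal illustration, not popmon's implementation: split_records is a hypothetical helper, the records are assumed to be already ordered along the time axis, and the condition is modelled as a Python callable, which is an assumption made only for this sketch.

```python
from typing import Any, List, Sequence, Tuple


def split_records(records: Sequence[dict], split: Any) -> Tuple[List[dict], List[dict]]:
    """Hypothetical stand-in that mimics the documented `split` semantics."""
    rows = list(records)
    if isinstance(split, bool):
        raise TypeError("split must be an int, a float, or a condition")
    if isinstance(split, int):
        # Integer: the first `split` records form the reference.
        return rows[:split], rows[split:]
    if isinstance(split, float):
        # Float: `split` is the fraction of records used as reference.
        n_ref = int(len(rows) * split)
        return rows[:n_ref], rows[n_ref:]
    if callable(split):
        # Assumption: the condition is a callable returning True for reference rows.
        reference = [row for row in rows if split(row)]
        remaining = [row for row in rows if not split(row)]
        return reference, remaining
    raise TypeError("split must be an int, a float, or a condition")


rows = [{"t": i} for i in range(10)]
ref_int, rest_int = split_records(rows, 3)
ref_frac, rest_frac = split_records(rows, 0.5)
ref_cond, rest_cond = split_records(rows, lambda row: row["t"] < 4)
```

In every mode the two parts together contain each input record exactly once.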

popmon.pipeline.metrics module

popmon.pipeline.metrics.df_stability_metrics(df, settings=None, time_width=None, time_offset=0, var_dtype=None, reference=None, **kwargs)

Create a data stability monitoring datastore for a given pandas or Spark dataframe.

Parameters

df – input pandas/spark dataframe to be profiled and monitored over time.
settings (popmon.config.Settings) – popmon configuration object

time_width – bin width of the time axis, given as a str or a number (ns). Note: bin_specs takes precedence. (optional)

Examples: '1w', 3600e9 (number of ns),
          anything understood by pd.Timedelta(time_width).value

time_offset (int) – bin offset of the time axis, given as a str or a number (ns). Note: bin_specs takes precedence. (optional)

Examples: '1-1-2020', 0 (number of ns since 1-1-1970),
          anything parsed by pd.Timestamp(time_offset).value

var_dtype (dict) – dictionary with specified datatype per feature. auto-guessed when not provided.
reference – reference dataframe or histograms. default is None
kwargs – residual keyword arguments, passed on to stability_report()

Returns

dict with results of metrics pipeline
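To make the nanosecond units of time_width and time_offset concrete, the sketch below converts a one-week width and a 1-1-2020 offset to nanoseconds using only the standard library, then assigns a timestamp to a time bin. The helpers to_ns and time_bin_index are hypothetical, and the floor-division binning is an assumption about how a width and an offset define bins, not a statement about popmon internals.

```python
from datetime import datetime, timedelta, timezone

NS_PER_SECOND = 1_000_000_000


def to_ns(delta: timedelta) -> int:
    """Convert a timedelta to an integer number of nanoseconds."""
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * NS_PER_SECOND + delta.microseconds * 1_000


def time_bin_index(timestamp_ns: int, time_width_ns: int, time_offset_ns: int = 0) -> int:
    """Assumed binning rule: floor of (timestamp - offset) / width."""
    return (timestamp_ns - time_offset_ns) // time_width_ns


epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

one_week_ns = to_ns(timedelta(weeks=1))  # same duration as '1w'
one_hour_ns = int(3600e9)                # the numeric example from the text

offset_ns = to_ns(datetime(2020, 1, 1, tzinfo=timezone.utc) - epoch)
stamp_ns = to_ns(datetime(2020, 1, 10, tzinfo=timezone.utc) - epoch)

# Nine days after the offset falls in the second weekly bin (index 1).
bin_index = time_bin_index(stamp_ns, one_week_ns, offset_ns)
```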

popmon.pipeline.metrics.stability_metrics(hists, settings, reference=None)

Create a data stability monitoring datastore for given dict of input histograms.

Parameters

hists (dict) – input histograms to be profiled and monitored over time.
settings (popmon.config.Settings) – popmon configuration object
reference – histograms used as reference. default is None

Returns

dict with results of metrics pipeline

popmon.pipeline.metrics_pipelines module

class popmon.pipeline.metrics_pipelines.ExpandingReferenceMetricsPipeline(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example metrics pipeline for comparing test data with itself (expanding test set)

Parameters: hists_key (str) – key to test histograms in datastore. default is ‘test_hists’
Returns: assembled expanding reference pipeline

class popmon.pipeline.metrics_pipelines.ExternalReferenceMetricsPipeline(settings, hists_key='test_hists', ref_hists_key='ref_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists', ref_hists_key='ref_hists')

Example metrics pipeline for comparing test data with other (full) external reference set

Parameters

hists_key (str) – key to test histograms in datastore. default is ‘test_hists’
ref_hists_key (str) – key to reference histograms in datastore. default is ‘ref_hists’

Returns

assembled external reference pipeline

class popmon.pipeline.metrics_pipelines.RollingReferenceMetricsPipeline(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example metrics pipeline for comparing test data with itself (rolling test set)

Parameters: hists_key (str) – key to test histograms in datastore. default is ‘test_hists’
Returns: assembled rolling reference pipeline

class popmon.pipeline.metrics_pipelines.SelfReferenceMetricsPipeline(settings, hists_key)

Bases: Pipeline

__init__(settings, hists_key)

Example metrics pipeline for comparing test data with itself (full test set)

Parameters: hists_key (str) – key to test histograms in datastore. default is ‘test_hists’
Returns: assembled self reference pipeline
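The self, expanding, and rolling reference pipelines above differ in which time slots serve as the reference for a given test slot. The sketch below is one plausible reading, stated as an assumption: "self" uses the full test set, "expanding" uses all preceding slots, and "rolling" uses a fixed number of preceding slots. The function reference_slots and its window parameter are hypothetical and do not appear in the API above.

```python
from typing import List


def reference_slots(strategy: str, index: int, n_slots: int, window: int = 1) -> List[int]:
    """Return the time-slot indices used as reference for slot `index`."""
    if strategy == "self":
        # Full test set, including the slot under test.
        return list(range(n_slots))
    if strategy == "expanding":
        # Assumption: every slot before the one under test.
        return list(range(index))
    if strategy == "rolling":
        # Assumption: the `window` slots directly before the one under test.
        return list(range(max(0, index - window), index))
    raise ValueError(f"unknown strategy: {strategy!r}")


self_ref = reference_slots("self", index=3, n_slots=5)
expanding_ref = reference_slots("expanding", index=3, n_slots=5)
rolling_ref = reference_slots("rolling", index=3, n_slots=5, window=2)
```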

popmon.pipeline.metrics_pipelines.get_dynamic_bound_modules(pull_rules)

Generate dynamic traffic light boundaries, based on traffic lights for normalized residuals, used for plotting in popmon_profiles report.

Return type: list[Module | Pipeline]
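A normalized residual (pull) is commonly defined as (value - mean) / std, so thresholds on the pull can be mapped back to bounds on the metric itself. The sketch below shows that inversion under this assumption. The helper dynamic_bounds is hypothetical, and the rule layout [red high, yellow high, yellow low, red low] is assumed for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def dynamic_bounds(reference_values: Sequence[float], pull_rule: Sequence[float]) -> List[float]:
    """Translate pull thresholds into bounds on the metric itself."""
    mu = mean(reference_values)
    sigma = pstdev(reference_values)
    # Assumption: pull = (value - mu) / sigma, hence value = mu + pull * sigma.
    return [mu + threshold * sigma for threshold in pull_rule]


# Reference values with mean 10 and population standard deviation 1.
bounds = dynamic_bounds([9, 11], [7, 4, -4, -7])
```

Because the bounds depend on the reference mean and spread, they change whenever the reference does, in contrast to fixed numeric limits.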

popmon.pipeline.metrics_pipelines.get_splitting_modules(hists_key, features, time_axis)

Split the test histograms along the time axis. Each histogram with datetime i is compared with histogram i-1, which results in a chi2 comparison of consecutive histograms.

Return type: list[Module | Pipeline]
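The comparison of each time slice with its predecessor can be sketched as follows. The chi2 formula used here is the textbook two-sample statistic for binned data with unequal totals. It is an assumption for illustration and may differ from the statistic popmon computes. Both function names are hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence


def chi2_two_hists(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Two-sample chi2 for binned counts that may have different totals."""
    n_prev, n_curr = sum(prev), sum(curr)
    if n_prev == 0 or n_curr == 0:
        raise ValueError("both histograms need at least one entry")
    scale_prev = sqrt(n_curr / n_prev)
    scale_curr = sqrt(n_prev / n_curr)
    chi2 = 0.0
    for a, b in zip(prev, curr):
        if a + b == 0:
            continue  # bins that are empty in both histograms carry no information
        chi2 += (scale_prev * a - scale_curr * b) ** 2 / (a + b)
    return chi2


def consecutive_chi2(hists_by_time: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Compare the histogram of each time slot i with that of slot i-1."""
    keys = sorted(hists_by_time)
    return {
        curr: chi2_two_hists(hists_by_time[prev], hists_by_time[curr])
        for prev, curr in zip(keys, keys[1:])
    }


hists = {
    "2020-01-01": [10, 20, 30],
    "2020-01-08": [10, 20, 30],
    "2020-01-15": [30, 20, 10],
}
chi2_by_slot = consecutive_chi2(hists)
```

The first time slot has no predecessor, so it receives no comparison value.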

popmon.pipeline.metrics_pipelines.get_static_bound_modules(pull_rules)

Generate static traffic light boundaries, based on traffic lights for normalized residuals, used for plotting in popmon_profiles report.

Return type: list[Module | Pipeline]

popmon.pipeline.metrics_pipelines.get_traffic_light_modules(monitoring_rules)

Expand all (wildcard) static traffic light bounds and apply them. The bounds are applied to both the profiles and the comparisons datasets.

Return type: list[Module | Pipeline]
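Wildcard expansion followed by applying the bounds can be illustrated with the standard fnmatch module. This is a sketch under assumptions: rules are keyed by a glob pattern, each rule holds four bounds in the order [red high, yellow high, yellow low, red low], and the colours are encoded as 0 (green), 1 (yellow), 2 (red). The functions expand_rules and traffic_light are hypothetical names.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Sequence


def expand_rules(
    rules: Dict[str, Sequence[float]], metric_names: Sequence[str]
) -> Dict[str, List[float]]:
    """Give each metric the bounds of the first pattern that matches its name."""
    expanded: Dict[str, List[float]] = {}
    for name in metric_names:
        for pattern, bounds in rules.items():
            if fnmatchcase(name, pattern):
                expanded[name] = list(bounds)
                break
    return expanded


def traffic_light(value: float, bounds: Sequence[float]) -> int:
    """Assumed encoding: 0 = green, 1 = yellow, 2 = red."""
    red_high, yellow_high, yellow_low, red_low = bounds
    if value > red_high or value < red_low:
        return 2
    if value > yellow_high or value < yellow_low:
        return 1
    return 0


rules = {"*_pull": [7, 4, -4, -7]}
expanded = expand_rules(rules, ["mean_pull", "count"])
lights = [traffic_light(v, expanded["mean_pull"]) for v in (0, 5, 8, -7.5)]
```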

popmon.pipeline.metrics_pipelines.get_trend_modules(window)

Look for significant rolling linear trends in selected features/metrics.

Return type: list[Module | Pipeline]
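A rolling linear trend can be sketched as an ordinary least-squares slope fitted over a sliding window of a metric's values. The helpers below are hypothetical and omit the significance test. They only show the windowed fit, assuming equally spaced time slots.

```python
from typing import List, Sequence


def ols_slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their positions 0..n-1."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    return sxy / sxx


def rolling_slopes(values: Sequence[float], window: int) -> List[float]:
    """Slope of the last `window` values, evaluated at each full window."""
    if window < 2:
        raise ValueError("window must contain at least two points")
    return [
        ols_slope(values[end - window:end])
        for end in range(window, len(values) + 1)
    ]


rising = rolling_slopes([1, 2, 3, 4, 5], window=3)
flat = rolling_slopes([2, 2, 2, 2], window=2)
```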

popmon.pipeline.report module

class popmon.pipeline.report.StabilityReport(datastore, read_key='html_report')

Bases: object

Representation layer of the report.

The stability report module wraps the representation functionality of the report after the pipeline has run and the report has been generated. The report can be represented as an HTML string, an HTML file, or a Jupyter notebook’s cell output.

__init__(datastore, read_key='html_report')

Initialize an instance of StabilityReport.

Parameters: read_key (str) – key of HTML report data to read from data store. default is html_report.

regenerate(store_key='html_report', sections_key='report_sections', settings=None)

Regenerate HTML report with different plot settings.

Parameters

sections_key (str) – key to store sections data in the datastore. default is ‘report_sections’.
store_key (str) – key to store the HTML report data in the datastore. default is ‘html_report’
settings (Settings) – configuration to regenerate the report

Return HTML

HTML report in an iframe

to_file(filename)

Store HTML report in the local file system.

Parameters: filename (str) – filename for the HTML report
Return type: None

to_html(escape=False)

HTML code representation of the report (represented as a string).

Parameters: escape (bool) – escape characters which could conflict with other HTML code. default: False
Return str: HTML code of the report
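The effect of the escape flag can be illustrated with the standard library's html.escape. Whether popmon uses this exact function is an assumption. The render helper below is hypothetical and only mirrors the documented behaviour of returning the HTML either raw or escaped.

```python
from html import escape as html_escape


def render(report_html: str, escape: bool = False) -> str:
    """Return the report HTML, optionally escaped for embedding in other HTML."""
    # Assumption: escaping replaces &, <, > and quotes with HTML entities.
    return html_escape(report_html) if escape else report_html


raw = render("<p>a & b</p>")
escaped = render("<p>a & b</p>", escape=True)
```

Escaping is useful when the report source is placed inside an attribute or another page, where raw tags would otherwise be interpreted by the browser.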

to_notebook_iframe(width='100%', height='100%')

HTML representation of the class (report) embedded in an iframe.

Parameters

width (str) – width of the frame to be shown
height (str) – height of the frame to be shown

Return HTML

HTML report in an iframe

popmon.pipeline.report.df_stability_report(df, settings=None, time_width=None, time_offset=0, var_dtype=None, reference=None, split=None, **kwargs)

Create a data stability monitoring html report for given pandas or spark dataframe.

Parameters

df – input pandas/spark dataframe to be profiled and monitored over time.
settings (popmon.config.Settings) – popmon configuration object

time_width – bin width of the time axis, given as a str or a number (ns). Note: bin_specs takes precedence. (optional)

Examples: '1w', 3600e9 (number of ns),
          anything understood by pd.Timedelta(time_width).value

time_offset (int) – bin offset of the time axis, given as a str or a number (ns). Note: bin_specs takes precedence. (optional)

Examples: '1-1-2020', 0 (number of ns since 1-1-1970),
          anything parsed by pd.Timestamp(time_offset).value

var_dtype (dict) – dictionary with specified datatype per feature. auto-guessed when not provided.
reference – reference dataframe or histograms. default is None

Returns

dict with results of reporting pipeline

popmon.pipeline.report.stability_report(hists, settings=None, reference=None, **kwargs)

Create a data stability monitoring html report for given dict of input histograms.

Parameters

hists (dict) – input histograms to be profiled and monitored over time.
settings (popmon.config.Settings) – popmon configuration object
reference – histograms used as reference. default is None
kwargs – when settings=None, parameters such as features and time_axis can be passed

Returns

dict with results of reporting pipeline

popmon.pipeline.report_pipelines module

class popmon.pipeline.report_pipelines.ExpandingReference(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example pipeline for comparing test data with itself (expanding test set)

Parameters: hists_key (str) – key to test histograms in datastore. default is ‘test_hists’
Returns: assembled expanding reference pipeline

class popmon.pipeline.report_pipelines.ExternalReference(settings, hists_key='test_hists', ref_hists_key='ref_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists', ref_hists_key='ref_hists')

Example pipeline for comparing test data with other (full) external reference set

Parameters

hists_key (str) – key to test histograms in datastore. default is ‘test_hists’
ref_hists_key (str) – key to reference histograms in datastore. default is ‘ref_hists’

Returns

assembled external reference pipeline

class popmon.pipeline.report_pipelines.ReportPipe(settings, sections_key='report_sections', store_key='html_report')

Bases: Pipeline

Pipeline of modules for generating sections and a final report.

__init__(settings, sections_key='report_sections', store_key='html_report')

Initialize an instance of Report.

Parameters

settings (Settings) – the configuration object
sections_key (str) – key to store sections data in the datastore
store_key (str) – key to store the HTML report data in the datastore

transform(datastore)

Central function of the pipeline.

Calls transform() of each module in the pipeline. Typically, transform() of a module takes something from the datastore, does something to it, and puts the results back into the datastore again, to be passed on to the next module in the pipeline.

Parameters: datastore (dict) – input datastore
Returns: updated output datastore
Return type: dict
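The transform() contract described above, where each module reads from the datastore, processes the data, and writes the result back for the next module, can be sketched with a few toy classes. SumValues, Scale, and MiniPipeline are hypothetical names used only for illustration and are not part of popmon.

```python
from typing import Iterable


class SumValues:
    """Toy module: reads a list from the datastore and stores its sum."""

    def __init__(self, read_key: str, store_key: str) -> None:
        self.read_key = read_key
        self.store_key = store_key

    def transform(self, datastore: dict) -> dict:
        datastore[self.store_key] = sum(datastore[self.read_key])
        return datastore


class Scale:
    """Toy module: reads a number and stores it multiplied by a factor."""

    def __init__(self, read_key: str, store_key: str, factor: float) -> None:
        self.read_key = read_key
        self.store_key = store_key
        self.factor = factor

    def transform(self, datastore: dict) -> dict:
        datastore[self.store_key] = datastore[self.read_key] * self.factor
        return datastore


class MiniPipeline:
    """Calls transform() of each module in order, passing the datastore along."""

    def __init__(self, modules: Iterable) -> None:
        self.modules = list(modules)

    def transform(self, datastore: dict) -> dict:
        for module in self.modules:
            datastore = module.transform(datastore)
        return datastore


pipeline = MiniPipeline([SumValues("values", "total"), Scale("total", "half", 0.5)])
result = pipeline.transform({"values": [1, 2, 3]})
```

The second module depends on the key written by the first, which is why the order of modules in the pipeline matters.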

class popmon.pipeline.report_pipelines.RollingReference(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example pipeline for comparing test data with itself (rolling test set)

Parameters: hists_key (str) – key to test histograms in datastore. default is ‘test_hists’
Returns: assembled rolling reference pipeline

class popmon.pipeline.report_pipelines.SelfReference(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example pipeline for comparing test data with itself (full test set)

Parameters: hists_key (str) – key to test histograms in datastore. default is ‘test_hists’
Returns: assembled self reference pipeline