popmon.pipeline package

Submodules

popmon.pipeline.amazing_pipeline module

class popmon.pipeline.amazing_pipeline.AmazingPipeline(histogram_path, **kwargs)

Bases: Pipeline

__init__(histogram_path, **kwargs)

Initialization of the pipeline

Parameters
  • modules (list) – modules of the pipeline.

  • logger – logger to be used by each module.

popmon.pipeline.amazing_pipeline.run()

Example that runs a self-reference pipeline and produces a monitoring report.

popmon.pipeline.dataset_splitter module

popmon.pipeline.dataset_splitter.split_dataset(dataset, split, time_axis)

Split a dataset into a reference and remaining part based on split params.

Parameters
  • dataset (pd.DataFrame|pyspark.sql.DataFrame) – dataset as input

  • split (Any) – split details; the meaning depends on the type. If an integer, the reference is the first split instances. If a float, split is used as a ratio (e.g. 0.5 returns a 50/50 split). Otherwise, split is interpreted as a condition: the records for which the condition is true form the reference, and the other records form the remaining dataset.

  • time_axis (str) – the time axis

Returns

tuple of reference, dataset
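
The three split modes can be illustrated with a small sketch on plain Python records. This is not the library's implementation: split_records is a hypothetical helper, the condition is assumed to be a callable, and sorting by the time axis and truncating the float ratio are assumptions made for the illustration.

```python
from typing import Any, Callable, Dict, List, Tuple, Union

Record = Dict[str, Any]


def split_records(
    records: List[Record],
    split: Union[int, float, Callable[[Record], bool]],
    time_axis: str,
) -> Tuple[List[Record], List[Record]]:
    """Hypothetical stand-in for the three split modes described above."""
    # Assumption: order by the time axis so that "first" means "earliest".
    ordered = sorted(records, key=lambda r: r[time_axis])
    if isinstance(split, bool):
        raise TypeError("split must be an int, a float or a condition")
    if isinstance(split, int):
        # Integer: the first `split` records form the reference.
        return ordered[:split], ordered[split:]
    if isinstance(split, float):
        # Float: `split` is the fraction of records that go to the reference.
        cut = int(len(ordered) * split)
        return ordered[:cut], ordered[cut:]
    # Otherwise: a condition; records for which it is true are the reference.
    reference = [r for r in ordered if split(r)]
    remaining = [r for r in ordered if not split(r)]
    return reference, remaining


rows = [{"day": d, "value": d * 10} for d in range(1, 7)]
ref_int, rest_int = split_records(rows, 2, "day")
ref_frac, rest_frac = split_records(rows, 0.5, "day")
ref_cond, rest_cond = split_records(rows, lambda r: r["day"] <= 4, "day")
```

With six records, the integer split yields 2 reference records, the ratio 0.5 yields 3, and the condition yields the 4 records with day <= 4.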

popmon.pipeline.metrics module

popmon.pipeline.metrics.df_stability_metrics(df, settings=None, time_width=None, time_offset=0, var_dtype=None, reference=None, **kwargs)

Create a data stability monitoring datastore for a given pandas or spark dataframe.

Parameters
  • df – input pandas/spark dataframe to be profiled and monitored over time.

  • settings (popmon.config.Settings) – popmon configuration object

  • time_width

    bin width of time axis. str or number (ns). note: bin_specs takes precedence. (optional)

    Examples: '1w', 3600e9 (number of ns),
              anything understood by pd.Timedelta(time_width).value
    

  • time_offset

    bin offset of time axis. str or number (ns). note: bin_specs takes precedence. (optional)

    Examples: '1-1-2020', 0 (number of ns since 1-1-1970),
              anything parsed by pd.Timestamp(time_offset).value
    

  • var_dtype (dict) – dictionary with specified datatype per feature. auto-guessed when not provided.

  • reference – reference dataframe or histograms. default is None

  • kwargs – residual keyword arguments, passed on to stability_report()

Returns

dict with results of metrics pipeline
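
Both time_width and time_offset accept plain numbers in nanoseconds. The sketch below works out the values quoted in the examples using only the standard library; timedelta_to_ns and timestamp_to_ns are hypothetical helper names, and the offset is assumed to be interpreted in UTC.

```python
from datetime import datetime, timedelta, timezone

NS_PER_SECOND = 1_000_000_000


def timedelta_to_ns(delta: timedelta) -> int:
    """Convert a timedelta to an integer number of nanoseconds."""
    # timedelta has microsecond resolution, so count microseconds exactly.
    return (delta // timedelta(microseconds=1)) * 1_000


def timestamp_to_ns(moment: datetime) -> int:
    """Nanoseconds between 1970-01-01 (UTC) and `moment`."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return timedelta_to_ns(moment - epoch)


# A bin width of one week, and of one hour (the 3600e9 from the example).
one_week_ns = timedelta_to_ns(timedelta(weeks=1))
one_hour_ns = timedelta_to_ns(timedelta(hours=1))

# A bin offset of 1-1-2020, as nanoseconds since 1-1-1970 (assumed UTC).
offset_ns = timestamp_to_ns(datetime(2020, 1, 1, tzinfo=timezone.utc))
```

These integers should correspond to what pd.Timedelta(time_width).value and pd.Timestamp(time_offset).value return for the same inputs.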

popmon.pipeline.metrics.stability_metrics(hists, settings, reference=None)

Create a data stability monitoring datastore for a given dict of input histograms.

Parameters
  • hists (dict) – input histograms to be profiled and monitored over time.

  • settings (popmon.config.Settings) – popmon configuration object

  • reference – histograms used as reference. default is None

Returns

dict with results of metrics pipeline

popmon.pipeline.metrics_pipelines module

class popmon.pipeline.metrics_pipelines.ExpandingReferenceMetricsPipeline(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example metrics pipeline for comparing test data with itself (expanding test set)

Parameters

hists_key (str) – key to test histograms in datastore. default is ‘test_hists’

Returns

assembled expanding reference pipeline

class popmon.pipeline.metrics_pipelines.ExternalReferenceMetricsPipeline(settings, hists_key='test_hists', ref_hists_key='ref_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists', ref_hists_key='ref_hists')

Example metrics pipeline for comparing test data with other (full) external reference set

Parameters
  • hists_key (str) – key to test histograms in datastore. default is ‘test_hists’

  • ref_hists_key (str) – key to reference histograms in datastore. default is ‘ref_hists’

Returns

assembled external reference pipeline

class popmon.pipeline.metrics_pipelines.RollingReferenceMetricsPipeline(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example metrics pipeline for comparing test data with itself (rolling test set)

Parameters

hists_key (str) – key to test histograms in datastore. default is ‘test_hists’

Returns

assembled rolling reference pipeline
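
The difference between a rolling and an expanding reference can be sketched as a choice of which earlier time slots are compared against the current one. The function names, the window and shift parameters, and the exact slot boundaries below are assumptions for illustration only; the library's own settings may define them differently.

```python
from typing import List


def rolling_reference(slots: List[str], index: int, window: int, shift: int = 1) -> List[str]:
    """Reference = the `window` slots that end `shift` slots before `index`."""
    end = max(index - shift + 1, 0)
    start = max(end - window, 0)
    return slots[start:end]


def expanding_reference(slots: List[str], index: int, shift: int = 1) -> List[str]:
    """Reference = every slot up to `shift` slots before `index`."""
    return slots[: max(index - shift + 1, 0)]


weeks = ["w1", "w2", "w3", "w4", "w5"]
rolling = rolling_reference(weeks, index=4, window=2)
expanding = expanding_reference(weeks, index=4)
first_slot = rolling_reference(weeks, index=0, window=2)
```

For the last slot (w5), the rolling reference covers only w3 and w4, while the expanding reference covers w1 through w4. The first slot has no earlier data, so its reference is empty.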

class popmon.pipeline.metrics_pipelines.SelfReferenceMetricsPipeline(settings, hists_key)

Bases: Pipeline

__init__(settings, hists_key)

Example metrics pipeline for comparing test data with itself (full test set)

Parameters

hists_key (str) – key to test histograms in datastore. default is ‘test_hists’

Returns

assembled self reference pipeline

popmon.pipeline.metrics_pipelines.get_dynamic_bound_modules(pull_rules)

Generate dynamic traffic light boundaries, based on traffic lights for normalized residuals, used for plotting in popmon_profiles report.

Return type

List[Union[Module, Pipeline]]
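
As a rough illustration of how bounds on normalized residuals (pulls) can be turned into bounds on the metric itself, the sketch below inverts pull = (value - mean) / std. The helper name, the rule keys and the use of a population standard deviation are assumptions, not the library's implementation.

```python
from statistics import mean, pstdev
from typing import Dict, List


def dynamic_bounds(reference_values: List[float], pull_bounds: Dict[str, float]) -> Dict[str, float]:
    """Translate bounds on the pull into bounds on the metric value."""
    mu = mean(reference_values)
    sigma = pstdev(reference_values)
    # pull = (value - mu) / sigma, so value = mu + pull * sigma.
    return {name: mu + pull * sigma for name, pull in pull_bounds.items()}


# Hypothetical pull rules: red beyond 7 sigma, yellow beyond 4 sigma.
pull_rules = {"red_high": 7.0, "yellow_high": 4.0, "yellow_low": -4.0, "red_low": -7.0}
bounds = dynamic_bounds([8.0, 12.0, 8.0, 12.0], pull_rules)
```

Because the mean and standard deviation come from reference data, the resulting bounds move with the reference, which is what makes them dynamic.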

popmon.pipeline.metrics_pipelines.get_splitting_modules(hists_key, features, time_axis)

Split the test histograms. Each histogram at time index i is compared with the histogram at index i-1, which results in a chi2 comparison of consecutive histograms.

Return type

List[Union[Module, Pipeline]]
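
The comparison of each time slot with its predecessor can be sketched as follows. The chi2 used here is a simple symmetric form, sum of (a - b)^2 / (a + b) over bins, chosen for illustration; it is an assumption and not necessarily the statistic the library computes. Both helper names are hypothetical, and histograms are assumed to share the same binning.

```python
from typing import Dict, List


def chi2_distance(current: List[float], previous: List[float]) -> float:
    """Symmetric chi2 between two histograms with identical binning."""
    total = 0.0
    for a, b in zip(current, previous):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return total


def compare_with_previous(hists_by_time: Dict[str, List[float]]) -> Dict[str, float]:
    """Compare the histogram of slot i with that of slot i-1."""
    ordered = sorted(hists_by_time)
    return {
        slot: chi2_distance(hists_by_time[slot], hists_by_time[prev])
        for prev, slot in zip(ordered, ordered[1:])
    }


hists = {
    "2020-01": [10.0, 20.0, 30.0],
    "2020-02": [10.0, 20.0, 30.0],
    "2020-03": [20.0, 20.0, 20.0],
}
chi2 = compare_with_previous(hists)
```

The first slot has no predecessor and therefore gets no value; identical consecutive histograms give a chi2 of zero.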

popmon.pipeline.metrics_pipelines.get_static_bound_modules(pull_rules)

Generate static traffic light boundaries, based on traffic lights for normalized residuals, used for plotting in popmon_profiles report.

Return type

List[Union[Module, Pipeline]]

popmon.pipeline.metrics_pipelines.get_traffic_light_modules(monitoring_rules)

Expand all (wildcard) static traffic light bounds and apply them. Applied to both the profiles and comparisons datasets.

Return type

List[Union[Module, Pipeline]]
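
A minimal sketch of wildcard expansion followed by applying the bounds is shown below. It assumes shell-style patterns, a bound order of [red_high, yellow_high, yellow_low, red_low], and the codes 0, 1 and 2 for green, yellow and red; expand_rules and traffic_light are hypothetical helpers, not the library's modules.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

GREEN, YELLOW, RED = 0, 1, 2


def expand_rules(rules: Dict[str, List[float]], metric_names: Iterable[str]) -> Dict[str, List[float]]:
    """Resolve wildcard patterns into one set of bounds per concrete metric."""
    expanded = {}
    for name in metric_names:
        for pattern, bounds in rules.items():
            if fnmatchcase(name, pattern):
                expanded[name] = bounds
                break  # the first matching pattern wins
    return expanded


def traffic_light(value: float, bounds: List[float]) -> int:
    """Map a value to green, yellow or red given four static bounds."""
    red_high, yellow_high, yellow_low, red_low = bounds
    if value > red_high or value < red_low:
        return RED
    if value > yellow_high or value < yellow_low:
        return YELLOW
    return GREEN


monitoring_rules = {"*_pull": [7.0, 4.0, -4.0, -7.0]}
metrics = {"mean_pull": 5.0, "std_pull": -8.0, "max_pull": 0.5, "count": 100.0}

expanded = expand_rules(monitoring_rules, metrics)
lights = {name: traffic_light(metrics[name], b) for name, b in expanded.items()}
```

Metrics that match no pattern, such as count here, receive no traffic light.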

popmon.pipeline.metrics_pipelines.get_trend_modules(window)

Look for significant rolling linear trends in selected features/metrics.

Return type

List[Union[Module, Pipeline]]
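
A rolling linear trend can be sketched as an ordinary least-squares slope fitted over each window of consecutive values. The helper names are hypothetical, and this sketch only computes the slope; how the library assesses significance is not covered here.

```python
from typing import List


def slope(ys: List[float]) -> float:
    """Least-squares slope of ys against the positions 0, 1, ..., n-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def rolling_slopes(values: List[float], window: int) -> List[float]:
    """Slope of a straight-line fit over each full window of values."""
    if window < 2:
        raise ValueError("window must contain at least two points")
    return [
        slope(values[i - window + 1 : i + 1])
        for i in range(window - 1, len(values))
    ]


series = [1.0, 2.0, 3.0, 5.0, 7.0]
slopes = rolling_slopes(series, window=3)
```

For this series the slope grows from 1.0 to 2.0 across the three windows, reflecting the accelerating increase.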

popmon.pipeline.report module

class popmon.pipeline.report.StabilityReport(datastore, read_key='html_report')

Bases: object

Representation layer of the report.

The stability report module wraps the representation functionality of the report after the pipeline has run and the report has been generated. The report can be represented as an HTML string, an HTML file, or a Jupyter notebook’s cell output.

__init__(datastore, read_key='html_report')

Initialize an instance of StabilityReport.

Parameters

read_key (str) – key of HTML report data to read from data store. default is html_report.

regenerate(store_key='html_report', sections_key='report_sections', settings=None)

Regenerate HTML report with different plot settings.

Parameters
  • sections_key (str) – key to store sections data in the datastore. default is ‘report_sections’.

  • store_key (str) – key to store the HTML report data in the datastore. default is ‘html_report’

  • settings (Settings) – configuration to regenerate the report

Return HTML

HTML report in an iframe

to_file(filename)

Store HTML report in the local file system.

Parameters

filename (str) – filename for the HTML report

to_html(escape=False)

HTML code representation of the report (represented as a string).

Parameters

escape (bool) – escape characters which could conflict with other HTML code. default: False

Return str

HTML code of the report
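
To show why escaping matters when report HTML is embedded inside other HTML, the sketch below uses the standard library's html.escape. Whether to_html relies on this exact function is an assumption; the report string and the iframe are made up for the example.

```python
import html

# A made-up report fragment containing characters that are special in HTML.
report_html = '<div class="report">mean & std</div>'

# Escaping replaces <, >, & and quotes with character references.
escaped = html.escape(report_html)

# The escaped string can be placed inside an attribute value without
# closing the attribute or the surrounding tag early.
iframe = f'<iframe srcdoc="{escaped}"></iframe>'

# Unescaping restores the original markup.
restored = html.unescape(escaped)
```

Unescaped output is appropriate when the string is written out as a standalone page; escaped output is for nesting it inside another document.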

to_notebook_iframe(width='100%', height='100%')

HTML representation of the class (report) embedded in an iframe.

Parameters
  • width (str) – width of the frame to be shown

  • height (str) – height of the frame to be shown

Return HTML

HTML report in an iframe

popmon.pipeline.report.df_stability_report(df, settings=None, time_width=None, time_offset=0, var_dtype=None, reference=None, split=None, **kwargs)

Create a data stability monitoring html report for a given pandas or spark dataframe.

Parameters
  • df – input pandas/spark dataframe to be profiled and monitored over time.

  • settings (popmon.config.Settings) – popmon configuration object

  • time_width

    bin width of time axis. str or number (ns). note: bin_specs takes precedence. (optional)

    Examples: '1w', 3600e9 (number of ns),
              anything understood by pd.Timedelta(time_width).value
    

  • time_offset

    bin offset of time axis. str or number (ns). note: bin_specs takes precedence. (optional)

    Examples: '1-1-2020', 0 (number of ns since 1-1-1970),
              anything parsed by pd.Timestamp(time_offset).value
    

  • var_dtype (dict) – dictionary with specified datatype per feature. auto-guessed when not provided.

  • reference – reference dataframe or histograms. default is None

Returns

dict with results of reporting pipeline

popmon.pipeline.report.stability_report(hists, settings=None, reference=None, **kwargs)

Create a data stability monitoring html report for a given dict of input histograms.

Parameters
  • hists (dict) – input histograms to be profiled and monitored over time.

  • settings (popmon.config.Settings) – popmon configuration object

  • reference – histograms used as reference. default is None

  • kwargs – when settings=None, parameters such as features and time_axis can be passed

Returns

dict with results of reporting pipeline

popmon.pipeline.report_pipelines module

class popmon.pipeline.report_pipelines.ExpandingReference(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example pipeline for comparing test data with itself (expanding test set)

Parameters

hists_key (str) – key to test histograms in datastore. default is ‘test_hists’

Returns

assembled expanding reference pipeline

class popmon.pipeline.report_pipelines.ExternalReference(settings, hists_key='test_hists', ref_hists_key='ref_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists', ref_hists_key='ref_hists')

Example pipeline for comparing test data with other (full) external reference set

Parameters
  • hists_key (str) – key to test histograms in datastore. default is ‘test_hists’

  • ref_hists_key (str) – key to reference histograms in datastore. default is ‘ref_hists’

Returns

assembled external reference pipeline

class popmon.pipeline.report_pipelines.ReportPipe(settings, sections_key='report_sections', store_key='html_report')

Bases: Pipeline

Pipeline of modules for generating sections and a final report.

__init__(settings, sections_key='report_sections', store_key='html_report')

Initialize an instance of Report.

Parameters
  • settings (Settings) – the configuration object

  • sections_key (str) – key to store sections data in the datastore

  • store_key (str) – key to store the HTML report data in the datastore

transform(datastore)

Central function of the pipeline.

Calls transform() of each module in the pipeline. Typically transform() of a module takes something from the datastore, does something to it, and puts the results back into the datastore again, to be passed on to the next module in the pipeline.

Parameters

datastore (dict) – input datastore

Returns

updated output datastore

Return type

dict
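
The datastore-passing pattern described for transform() can be sketched in a few lines. SketchModule and SketchPipeline are hypothetical classes written for this illustration and are not the library's Module and Pipeline; the keys bin_totals and the lambda steps are likewise made up.

```python
from typing import Any, Callable, Dict, List

Datastore = Dict[str, Any]


class SketchModule:
    """Reads one key from the datastore, transforms it, stores the result."""

    def __init__(self, read_key: str, store_key: str, func: Callable[[Any], Any]):
        self.read_key = read_key
        self.store_key = store_key
        self.func = func

    def transform(self, datastore: Datastore) -> Datastore:
        datastore[self.store_key] = self.func(datastore[self.read_key])
        return datastore


class SketchPipeline:
    """Calls transform() of each module in order, passing the datastore on."""

    def __init__(self, modules: List[SketchModule]):
        self.modules = list(modules)

    def transform(self, datastore: Datastore) -> Datastore:
        for module in self.modules:
            datastore = module.transform(datastore)
        return datastore


pipeline = SketchPipeline([
    # Step 1: sum the bin counts of every histogram.
    SketchModule("test_hists", "bin_totals",
                 lambda hists: {k: sum(v) for k, v in hists.items()}),
    # Step 2: turn the totals into text sections for a report.
    SketchModule("bin_totals", "report_sections",
                 lambda totals: [f"{k}: {v} entries" for k, v in sorted(totals.items())]),
])
result = pipeline.transform({"test_hists": {"age": [1, 2, 3], "height": [4, 5]}})
```

Each module only needs to agree with its neighbours on key names, which is why the constructors in this package expose parameters such as hists_key, sections_key and store_key.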

class popmon.pipeline.report_pipelines.RollingReference(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example pipeline for comparing test data with itself (rolling test set)

Parameters

hists_key (str) – key to test histograms in datastore. default is ‘test_hists’

Returns

assembled rolling reference pipeline

class popmon.pipeline.report_pipelines.SelfReference(settings, hists_key='test_hists')

Bases: Pipeline

__init__(settings, hists_key='test_hists')

Example pipeline for comparing test data with itself (full test set)

Parameters

hists_key (str) – key to test histograms in datastore. default is ‘test_hists’

Returns

assembled self reference pipeline