v0.14.0 #880

Open
wants to merge 507 commits into
base: main
cf040b0
Deliverables sprint 07
FelipeTrost Dec 3, 2024
90940ca
Merge remote-tracking branch 'origin/develop' into feature/17-dimenti…
FelipeTrost Dec 4, 2024
31e2b17
test(pipeline): dimensionality reduction
FelipeTrost Dec 4, 2024
ad9820e
dimensionality reduction: improved doc
FelipeTrost Dec 4, 2024
8f290d6
#84: Removed print
dh1542 Dec 4, 2024
17fae52
#84: Added differences form core setup.py
dh1542 Dec 4, 2024
555d0e3
#84: Added differences form core
dh1542 Dec 4, 2024
3de7862
#84: Added differences form core
dh1542 Dec 4, 2024
3197876
#84: linted
dh1542 Dec 4, 2024
42a12a1
#84: Added differences form core
dh1542 Dec 4, 2024
f07acbe
#84: Fixed wrong passing of FastAPI into httpx async client
dh1542 Dec 4, 2024
9730324
#84: Fixed wrong passing of FastAPI into httpx async client
dh1542 Dec 4, 2024
36f377c
#84: Changed back files that were taken from core repo
dh1542 Dec 4, 2024
f1e99ac
Merge pull request #93 from amosproj/#84-Fix-broken-API-test
mollle Dec 4, 2024
99199f4
Merge remote-tracking branch 'origin/develop' into feature/17-dimenti…
FelipeTrost Dec 7, 2024
8aae6df
Merge branch 'develop' into feature/040_finishing_ARIMA
Timm638 Dec 9, 2024
87befbb
refactoring unit tests
mollle Dec 10, 2024
957e3b2
fixed path to test data file
mollle Dec 10, 2024
9fd3825
fixed broken log collection tests that broke because of changes to Id…
mollle Dec 10, 2024
2cd1d9b
Merge pull request #95 from amosproj/refactor/068_067_069
dh1542 Dec 10, 2024
756fa7f
feat: schema validation for anomaly detection
FelipeTrost Dec 10, 2024
5f98b56
refactors arima.py for preparation of exogenous variables
Timm638 Dec 10, 2024
f0ed1b8
refactors AutoArima to its own Component
Timm638 Dec 10, 2024
31615ae
Merge remote-tracking branch 'origin/develop' into refactor/63-anomal…
FelipeTrost Dec 11, 2024
b47380f
Merge remote-tracking branch 'origin/develop' into feature/17-dimenti…
FelipeTrost Dec 11, 2024
ff9f55b
removed unused imports
FelipeTrost Dec 11, 2024
a7c0ef1
Revert "Deleted deliverables folder"
FelipeTrost Dec 11, 2024
d877f29
adds auto-recognition of column names to ARIMA
Timm638 Dec 11, 2024
a93b9f8
Adds Sprint 8 Delieverables
Timm638 Dec 11, 2024
b616330
#62: Removed log statement
dh1542 Dec 14, 2024
523f46c
#62: Added large dataset test
dh1542 Dec 14, 2024
7ddfcd7
#62: Added invalid datatype test
dh1542 Dec 14, 2024
45f32d1
#62: linted
dh1542 Dec 14, 2024
4420e23
Merge pull request #100 from amosproj/feature/#62-Interval-filtering-…
chris-1187 Dec 15, 2024
3da1ef3
cleans up RegEx
Timm638 Dec 15, 2024
e4510af
tries to implement exog
Timm638 Dec 16, 2024
2601544
splits up Arima into two files
Timm638 Dec 16, 2024
899e73e
Merge pull request #91 from amosproj/feature/17-dimentionality-reduction
mollle Dec 16, 2024
30eb9d3
Merge pull request #98 from amosproj/97-restore-deliverables
mollle Dec 16, 2024
14aaa3a
Removed AMOS specific folder
dh1542 Dec 16, 2024
fbba3ff
prepare test data in test_arima.py
Timm638 Dec 16, 2024
d2bad25
Merge remote-tracking branch 'origin/develop' into refactor/63-anomal…
FelipeTrost Dec 16, 2024
8668431
test: ksigma anomaly detection, wrong values and large data set
FelipeTrost Dec 16, 2024
7689d08
refactorings on #61 and #92
chris-1187 Dec 16, 2024
738b1cf
actions adjustment
chris-1187 Dec 16, 2024
b26682a
#65 ARIMA refactorings
chris-1187 Dec 16, 2024
cbca20f
cleans up test
Timm638 Dec 16, 2024
8448a76
Merge pull request #102 from amosproj/refactor/#92-MVI_#61-DD
dh1542 Dec 16, 2024
c3e185b
#66 add datatype validation and prediction on large dataset
Dec 16, 2024
2242257
validate datatype
Dec 16, 2024
ba5873d
refactor invalid datatype
Dec 17, 2024
cd0640d
Merge pull request #103 from amosproj/refactor/066_linear_regression_…
dh1542 Dec 17, 2024
a0f2575
documents code & finishes up column-conversion
Timm638 Dec 17, 2024
5132629
Merge branch 'develop' into feature/040_finishing_ARIMA
Timm638 Dec 17, 2024
b960d1d
adapted test cases to new ARIMA structure
Timm638 Dec 17, 2024
1019bfa
applies Black linter
Timm638 Dec 17, 2024
ba31917
applies workaround to test dataframes
Timm638 Dec 17, 2024
099fa56
fixes import
Timm638 Dec 17, 2024
c7658a0
Merge pull request #94 from amosproj/feature/040_finishing_ARIMA
dh1542 Dec 17, 2024
cb036bc
Merge pull request #101 from amosproj/refactor/63-anomaly-detection-t…
dh1542 Dec 17, 2024
94570ec
Updated gitignore after merge
dh1542 Dec 17, 2024
392d221
Merge remote-tracking branch 'origin/develop' into develop
dh1542 Dec 17, 2024
0277a5c
Added deliverables for sprint-09
dh1542 Dec 17, 2024
27b293f
refactor(normalization): use input validator
FelipeTrost Jan 6, 2025
3321dc6
refactor(test/normalization): add tolerance for data frame comparison
FelipeTrost Jan 6, 2025
94d335c
test(normalization): test idempotence with large data set
FelipeTrost Jan 6, 2025
869db2f
test(normalization): test wrong type
FelipeTrost Jan 6, 2025
f0d1aa7
Merge branch 'develop' into feature/096_rtdip_demo_pipeline
Timm638 Jan 7, 2025
97ac7c7
fixes imports by re-adding them to each data quality component
Timm638 Jan 7, 2025
22ed2fb
fixes imports by re-adding them to each data quality component
Timm638 Jan 7, 2025
26be410
fixes imports by re-adding them to each data quality component
Timm638 Jan 7, 2025
9867d5f
Merge remote-tracking branch 'origin/feature/096_rtdip_demo_pipeline'…
Timm638 Jan 7, 2025
92b93ce
renamed imports
Timm638 Jan 7, 2025
3c0f404
fixes imports by re-adding them to each data quality component
Timm638 Jan 7, 2025
91c7827
Merge remote-tracking branch 'origin/feature/115_fixing_missing_impor…
Timm638 Jan 7, 2025
f0e6d14
Applies black to files
Timm638 Jan 7, 2025
0f3e866
Applies black to files
Timm638 Jan 7, 2025
6499181
fixes imports & writes down pipeline concept
Timm638 Jan 7, 2025
e7eac02
Merge pull request #114 from amosproj/refactor/64-de-normalization-tests
dh1542 Jan 8, 2025
6193b3d
added sprint 10 deliverables folder
mollle Jan 8, 2025
4069e0e
added sprint 10 deliverables
mollle Jan 8, 2025
9ad3fa4
remove unnecessary file
mollle Jan 8, 2025
b4c369f
delete out of value ranges done #109
mollle Jan 8, 2025
f085d14
Merge pull request #116 from amosproj/feature/115_fixing_missing_import
mollle Jan 8, 2025
fddd03c
imporved documentation
mollle Jan 8, 2025
21b726f
flaltline filter done
mollle Jan 9, 2025
41aab5c
added documentation for my components
mollle Jan 9, 2025
cba1a8a
Merge pull request #116 from amosproj/feature/115_fixing_missing_import
mollle Jan 8, 2025
1fedcce
Merge branch 'develop' into feature/109_delete_out_of_range_values
mollle Jan 9, 2025
c5b11bc
Merge pull request #119 from amosproj/feature/109_delete_out_of_range…
mollle Jan 9, 2025
c9c9b7a
Merge pull request #120 from amosproj/feature/113_remove_flatlining_rows
mollle Jan 9, 2025
db1ca1d
fixed documentation
mollle Jan 9, 2025
62a49f5
fix value range test
mollle Jan 9, 2025
d966148
Upgrade Jinja2 package (#855)
GBBBAS Jan 13, 2025
4bfcd29
Update fastapi (#856)
GBBBAS Jan 13, 2025
57ae7a7
docs: added docs entries for amos components
FelipeTrost Jan 13, 2025
3648db9
fix: return type annotations in linear_regression
FelipeTrost Jan 14, 2025
4941426
updates showcase_notebook to current version of rtdip
Timm638 Jan 14, 2025
6763442
starts working on pipeline_showcase.ipynb
Timm638 Jan 14, 2025
f7ae600
fix: type annotations for older python versions in linear_regression
FelipeTrost Jan 14, 2025
6bf8c94
Added deliverables for sprint-11
chris-1187 Jan 14, 2025
c42247c
Merge pull request #121 from amosproj/110-docs
mollle Jan 15, 2025
be0bceb
renamed DeleteOutOfRangeValues to OutOfRangeValueFilter
mollle Jan 15, 2025
c1cf5e4
Resample Query Join and Aggregation Updates (#857)
GBBBAS Jan 17, 2025
60b7a5a
Update Resample Query (#858)
GBBBAS Jan 17, 2025
eea867a
fixed tests after renaming OutOfRangeValueFilter
mollle Jan 17, 2025
a488563
107 moving average done
mollle Jan 17, 2025
1524b06
fixing typo in docs file
mollle Jan 17, 2025
26569b1
#106: Added files
dh1542 Jan 17, 2025
76271df
#106: Implemented gaussian smoothing wip
dh1542 Jan 17, 2025
da5f3e2
Merge remote-tracking branch 'upstream/develop' into core-develop-clone
dh1542 Jan 18, 2025
5827a80
Merge remote-tracking branch 'origin/develop' into merge/to-core
dh1542 Jan 18, 2025
f272438
Merged amos develop to merge branch
dh1542 Jan 18, 2025
56826db
fixed build docu
mollle Jan 19, 2025
0bdeec2
docs: add nav entries for amos components
FelipeTrost Jan 19, 2025
ecaef29
Merge remote-tracking branch 'origin/develop' into 110-docs
FelipeTrost Jan 19, 2025
19e0cb9
docs: correct reference to value filter
FelipeTrost Jan 19, 2025
f2ea173
docstring: fixes for arima
FelipeTrost Jan 19, 2025
2ed654d
docs: add arima
FelipeTrost Jan 19, 2025
cf0c3cf
saves in progress pipeline_showcase changes
Timm638 Jan 20, 2025
2e7b670
Merge pull request #127 from amosproj/110-docs
mollle Jan 20, 2025
4aa3347
test if htm files are readable in github
mollle Jan 20, 2025
b8eb445
adds Sensor Data Stage
Timm638 Jan 20, 2025
f0b52f3
fixes copy bug in ARIMA
Timm638 Jan 20, 2025
332c181
push latest pipeline changes
Timm638 Jan 20, 2025
e3aafa3
completes draft of notebook
Timm638 Jan 21, 2025
cc012f5
Merge branch 'develop' into feature/107_MovingAverage
mollle Jan 21, 2025
3018a3d
Merge pull request #125 from amosproj/feature/107_MovingAverage
dh1542 Jan 21, 2025
2df81c0
Merge branch 'refs/heads/develop' into feature/#106-Gaussian-Smoothing
dh1542 Jan 21, 2025
9fea923
#106: Implementation and refactoring gaussian smoothing
dh1542 Jan 21, 2025
0419455
#106: Added correct documentation for gaussian smoothing
dh1542 Jan 21, 2025
61ebc77
#106: Linted
dh1542 Jan 21, 2025
fbe7155
changed dataset presented to be a more periodic one
Timm638 Jan 21, 2025
ed0a8bf
cleaned up draft more, each cell except for Linear Regression functions
Timm638 Jan 21, 2025
3170bd3
Add deliverables for sprint-12
Jan 21, 2025
d114ba3
Merge branch 'develop' of https://github.com/amosproj/amos2024ws01-rt…
Jan 21, 2025
542dd96
adjusts plot to be more descriptive
Timm638 Jan 22, 2025
ec394c9
adds Example Data to git
Timm638 Jan 22, 2025
697b02e
finishes up texts
Timm638 Jan 22, 2025
4ceb318
improving/fixing documentation
mollle Jan 22, 2025
4fcf3a7
fixed/improved docu, renamed "machine_learning" into "forecasting"
mollle Jan 22, 2025
9cd320d
fixing tests that broke because of refactoring
mollle Jan 22, 2025
1b9e5ab
fixed arima test
mollle Jan 22, 2025
9d062e7
#105 refactor for only euclidean distance
Jan 22, 2025
d80ec71
#105 add parameters for time series prediction
Jan 23, 2025
254f90f
#105 add unit test for KNN
Jan 23, 2025
9b25452
refactor
Jan 23, 2025
11c6db9
linted
Jan 23, 2025
eb35dae
debug import error
Jan 23, 2025
083fab3
#105 Rename machine_learning to forecasting folder
Jan 23, 2025
9aa0a6a
Merge branch 'develop' into feature/105_KNN
kristen149 Jan 23, 2025
3eef96c
Refactor Time Series Builder Queries for Raw, Resample and Interpolat…
GBBBAS Jan 24, 2025
3c11b35
Update tests (#860)
GBBBAS Jan 24, 2025
e500c5a
Delete tests/sdk/python/rtdip_sdk/pipelines/forecasting/__init__.py
kristen149 Jan 26, 2025
d3a32ce
Delete tests/sdk/python/rtdip_sdk/pipelines/forecasting/spark/__init_…
kristen149 Jan 26, 2025
a8f521e
fix imports for builds
FelipeTrost Jan 26, 2025
cef5823
Linted formatting
kristen149 Jan 26, 2025
ae07063
updates nbformat
Timm638 Jan 27, 2025
7bc67ec
Merge branch 'develop' into feature/096_rtdip_demo_pipeline
Timm638 Jan 27, 2025
f17b8b9
fixed imports
Timm638 Jan 27, 2025
603df22
removes redundant .gitignore
Timm638 Jan 27, 2025
44a1995
applies black
Timm638 Jan 27, 2025
919d603
fixes missing preparation from pandas DF to spark DF
Timm638 Jan 27, 2025
d8e330f
fix more imports for build
FelipeTrost Jan 27, 2025
87a5aca
initial notebook for shell demo
FelipeTrost Jan 27, 2025
2e6b496
reapplies black
Timm638 Jan 28, 2025
5c9c8a4
Merge pull request #136 from amosproj/feature/105_KNN
mollle Jan 28, 2025
8eb8168
Merge branch 'develop' into merge/to-core
dh1542 Jan 28, 2025
66ee048
#126: Merged new changes from develop
dh1542 Jan 28, 2025
f147a8c
Merge branch 'rtdip:develop' into merge/core-develop-clone
dh1542 Jan 28, 2025
a1ea222
Merge remote-tracking branch 'upstream/develop' into merge/to-core
dh1542 Jan 28, 2025
6492153
Merge remote-tracking branch 'origin/develop' into feature/#106-Gauss…
dh1542 Jan 28, 2025
28cb5e0
#106: Removed converting dfs in tests
dh1542 Jan 28, 2025
e43cc4b
#106: Linted
dh1542 Jan 28, 2025
25781ea
Merge remote-tracking branch 'upstream/develop' into develop
dh1542 Jan 28, 2025
08faaf1
#106: Changed implementation without converting to panda df
dh1542 Jan 28, 2025
e37fd07
#106: Fixed static method
dh1542 Jan 28, 2025
21a6c6a
#106: Fixed import
dh1542 Jan 28, 2025
06b9977
feat: demo day slide
luccalb Jan 28, 2025
ad015b4
Merge pull request #134 from amosproj/feature/#106-Gaussian-Smoothing
kristen149 Jan 28, 2025
f273297
Adds demo video
Timm638 Jan 28, 2025
fc5f26c
updates imports to new ones
Timm638 Jan 29, 2025
c2feea5
feat: update demo day slide
luccalb Jan 29, 2025
a87b358
Merge remote-tracking branch 'origin/develop' into demo-notebook
FelipeTrost Jan 29, 2025
0857cbc
sprint 13 deliverables
FelipeTrost Jan 29, 2025
4064a08
Merge pull request #138 from amosproj/sprint-13-amos-deliverables
FelipeTrost Jan 29, 2025
cdffe9f
Merge branch 'develop' into feature/096_rtdip_demo_pipeline
Timm638 Feb 2, 2025
1842e5d
adds variation of notebook with outputs
Timm638 Feb 2, 2025
9d39d8e
Merge remote-tracking branch 'origin/develop' into demo-notebook
FelipeTrost Feb 2, 2025
538597f
fix: print correct column name in error
FelipeTrost Feb 2, 2025
d46c9f1
use relative imports to fix build
FelipeTrost Feb 2, 2025
063810c
final demo notebook
FelipeTrost Feb 2, 2025
450d8d4
fix typo
FelipeTrost Feb 2, 2025
e8c85a4
docu except guassian smoothing complete
mollle Feb 3, 2025
a8a9aed
format
FelipeTrost Feb 3, 2025
0e2657d
user documentation
FelipeTrost Feb 3, 2025
1ae8020
Merge remote-tracking branch 'origin/demo-notebook' into feature/096_…
Timm638 Feb 3, 2025
6750990
merges changes from Felipe's Notebook to this branch
Timm638 Feb 3, 2025
6eab4df
changes something to trick GitHub into showing the pull request as su…
Timm638 Feb 3, 2025
c3e91f9
Merge pull request #139 from amosproj/demo-notebook
Timm638 Feb 3, 2025
480d497
Pulls changes from notebook branch without the notebooks
Timm638 Feb 3, 2025
61d2505
Merge pull request #142 from amosproj/feature/141-import-bugfixes-cha…
mollle Feb 3, 2025
f3e738c
Added missing documentation for gaussian smoothing
dh1542 Feb 3, 2025
a51d5bc
fixed docu build warning
mollle Feb 3, 2025
928461f
RTDIP Blog Post
chris-1187 Jan 20, 2025
1c05132
Agile illustration
chris-1187 Jan 20, 2025
dda27e7
basic md blog post files
chris-1187 Jan 28, 2025
2674853
image adjustments
chris-1187 Feb 3, 2025
c72388f
Added code and charts
chris-1187 Feb 3, 2025
490f95d
Merge remote-tracking branch 'origin/develop' into 124/documentation
FelipeTrost Feb 3, 2025
df8b290
deliverables docs: gaussian smoothing
FelipeTrost Feb 3, 2025
70391b6
Merge pull request #140 from amosproj/124/documentation
mollle Feb 4, 2025
f6c88dc
fixing spelling issues, closes #145
mollle Feb 4, 2025
0db35c3
Merge branch 'develop' into merge/to-core
dh1542 Feb 4, 2025
e8b43a1
Fixed spelling mistakes
dh1542 Feb 4, 2025
d583a74
fixing spelling issues, closes #145
mollle Feb 4, 2025
299570f
Merge pull request #143 from amosproj/documentation/123_Article_RTDIP…
dh1542 Feb 4, 2025
8a3d236
Upload sprint 14 deliverables
Timm638 Feb 4, 2025
5180a96
Merge remote-tracking branch 'origin/develop' into develop
Timm638 Feb 4, 2025
f769af9
edit example documentation for KNN
Feb 4, 2025
dbe1f48
edit example documentation for KNN
Feb 4, 2025
440f68f
Merge branch 'develop' into merge/to-core
dh1542 Feb 5, 2025
8ca1038
Merge changes from develop
dh1542 Feb 5, 2025
f0ff17a
Merge pull request #126 from amosproj/merge/to-core
dh1542 Feb 5, 2025
34543a1
Package Updates for Pandas and Langchain (#867)
GBBBAS Mar 7, 2025
53bee23
Required changes to data quality
Amber-Rigg Mar 14, 2025
75c0a50
Update with forecasting changes
Amber-Rigg Mar 14, 2025
a925d0f
Further code review
Amber-Rigg Mar 14, 2025
6a3932e
Final Code Review- Amber L Rigg
Amber-Rigg Mar 17, 2025
a0a2ad8
Merge branch 'develop' into merge/core-develop-clone
Amber-Rigg Mar 17, 2025
69e9bca
reverted the sonar github action path
TugceOzberkYener Mar 18, 2025
b32205e
added the init for the forecasting test folder
TugceOzberkYener Mar 18, 2025
9ec8e4c
import updates
TugceOzberkYener Mar 18, 2025
a3df93d
Update to transformers to include pre and post validations
TugceOzberkYener Mar 18, 2025
d847436
PR comments fixed
TugceOzberkYener Mar 18, 2025
064e331
import fix
TugceOzberkYener Mar 18, 2025
dddfcaa
fixed setup.py
TugceOzberkYener Mar 18, 2025
8515ecd
fixed documentation for linear_regression
TugceOzberkYener Mar 18, 2025
276b259
Unit test fixes
TugceOzberkYener Mar 18, 2025
b1a3034
sonarqube reliability issues fixed
TugceOzberkYener Mar 18, 2025
1f8addd
sonar high severity maintainability checks fixes
TugceOzberkYener Mar 18, 2025
f2081ef
mkdocs fixes for df removal
TugceOzberkYener Mar 18, 2025
bf597ca
sonarqube high severity issue fixes
TugceOzberkYener Mar 18, 2025
7f635e4
test fixes
TugceOzberkYener Mar 18, 2025
3180a31
fixed for unit tests
TugceOzberkYener Mar 19, 2025
2447a6f
added missing import
TugceOzberkYener Mar 19, 2025
03d7fca
sonarqube quality gate fixes
TugceOzberkYener Mar 19, 2025
7508c38
Merge pull request #862 from amosproj/merge/core-develop-clone
GBBBAS Apr 10, 2025
32d02e9
Langchain import fix (#879)
GBBBAS Jun 27, 2025
c5b64f5
Merge branch 'main' into hotfix/merges_v0140
GBBBAS Jun 27, 2025
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -27,7 +27,7 @@ jobs:
matrix:
os: [ubuntu-latest]
python-version: ["3.9", "3.10", "3.11", "3.12"]
pyspark: ["3.3.0", "3.3.1", "3.3.2", "3.4.0", "3.4.1", "3.5.0", "3.5.1"]
pyspark: ["3.3.0", "3.3.1", "3.3.2", "3.4.0", "3.4.1", "3.5.0", "3.5.1"] # 3.5.2 does not work with conda
exclude:
- pyspark: "3.5.1"
python-version: "3.9"
5 changes: 4 additions & 1 deletion .gitignore
@@ -136,4 +136,7 @@ spark-warehouse/
spark-checkpoints/

# Delta Sharing
config.share
config.share

# JetBrains
.idea/
6 changes: 5 additions & 1 deletion docs/blog/.authors.yml
@@ -24,4 +24,8 @@ authors:
GBARAS:
name: Amber Rigg
description: Contributor
avatar: https://github.com/Amber-Rigg.png
avatar: https://github.com/Amber-Rigg.png
TUBCM:
name: Christian Munz
description: Contributor
avatar: https://github.com/chris-1187.png
1,827 changes: 1,827 additions & 0 deletions docs/blog/images/agile.svg
Binary file added docs/blog/images/amos_mvi.png
Binary file added docs/blog/images/amos_mvi_raw.png
94 changes: 94 additions & 0 deletions docs/blog/posts/enhancing_data_quality_amos.md
@@ -0,0 +1,94 @@
---
date: 2025-02-05
authors:
- TUBCM
---

# Enhancing Data Quality in Real-Time: Our Experience with RTDIP and the AMOS Project

<center>

![blog](../images/agile.svg){width=60%}
<small>1</small>
</center>

Real-time data integration and preparation are crucial in today's data-driven world, especially when dealing with time series data from distributed, heterogeneous data sources. Because data scientists often spend no less than 80%<sup>2</sup> of their time finding, integrating, and cleaning datasets, automated ingestion pipelines are becoming increasingly important. Building such ingestion and integration frameworks can be challenging and can incur all sorts of technical debt, such as glue code, pipeline jungles, or dead code paths, so these systems need to be designed and developed with care. Modern software development approaches try to mitigate technical debt and improve quality by adopting agile, iterative methodologies, which are designed to foster rapid feedback and continuous progress.

<!-- more -->

As part of the Agile Methods and Open Source (AMOS) project, we had the unique opportunity to work in a SCRUM team consisting of students from TU Berlin and FAU Erlangen-Nürnberg, to build data quality measures for the RTDIP Ingestion Pipeline framework. With the goal of enhancing data quality, we got to work and built modular pipeline components that aim to help data scientists and engineers with data integration, data cleaning, and data preparation.

But what does it mean to work in an agile framework? The Agile Manifesto is above all a set of guiding values, principles, ideals, and goals. The overarching goal is to gain performance and be most effective while adding business value. By prioritizing the right fundamentals like individuals and interactions, working software, customer collaboration, and responding to change, cross-functional teams can ship viable products easier and faster.

How did that work out for us in building data quality measures? True to the motto "User stories drive everything," we got together with contributors from the RTDIP Team to hear about the concepts, the end users' stake in the project, and the current state, so that we could set realistic expectations for ourselves. With that, we planned our first sprint, and we soon saw how an agile process exposes deficiencies in the way a team works. Through regular team meetings, we fostered a culture of continuous feedback and testing, leveraging reviews and retrospectives to identify roadblocks and drive the changes needed to improve the overall development process.

## Enhancing Data Quality in RTDIP's Pipeline Framework

Coming up with modular steps that enhance data quality was the initial and arguably most critical step to start off a successful development process. So the question was: what exactly do the terms data integration, data cleaning, and data preparation entail? To expand on the key parts of that, this is what we did to pour these aspects into RTDIP components.

### Data Validation and Schema Alignment

Data validation and schema alignment are critical for ensuring the reliability and usability of data, serving as a foundational step before implementing other quality measures. For the time series data at hand, we developed an InputValidator component to verify that incoming data adheres to predefined quality standards, including compliance with an expected schema, correct PySpark data types, and proper handling of null values, raising exceptions when inconsistencies are detected. Additionally, the component enforces schema integration, harmonizing data from multiple sources into a unified, predefined structure. To maintain a consistent and efficient workflow, we required all data quality components to inherit the validation functionality of the InputValidator.
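
To make the idea concrete, here is a minimal sketch of schema and null checking in plain Python. It is not the InputValidator implementation: the `EXPECTED_SCHEMA` mapping and the `validate_rows` helper are hypothetical names chosen for illustration, and the real component works on PySpark DataFrames and PySpark data types rather than on lists of dictionaries.

```python
from datetime import datetime

# Hypothetical expected schema: column name -> (Python type, nullable).
EXPECTED_SCHEMA = {
    "TagName": (str, False),
    "EventTime": (datetime, False),
    "Status": (str, True),
    "Value": (float, True),
}


def validate_rows(rows, schema=EXPECTED_SCHEMA):
    """Return the rows unchanged, or raise ValueError on the first violation."""
    for i, row in enumerate(rows):
        # Every expected column must be present in the row.
        missing = set(schema) - set(row)
        if missing:
            raise ValueError(f"row {i}: missing columns {sorted(missing)}")
        for column, (expected_type, nullable) in schema.items():
            value = row[column]
            if value is None:
                # Nulls are only accepted in columns declared nullable.
                if not nullable:
                    raise ValueError(f"row {i}: {column} must not be null")
            elif not isinstance(value, expected_type):
                raise ValueError(
                    f"row {i}: {column} expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    return rows
```

Failing fast with an exception mirrors the behaviour described above: downstream components can assume the schema holds once validation has passed.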

### Data Cleansing

Data cleansing is a vital process in enhancing the quality of data within a data integration pipeline, ensuring consistency, reliability, and usability. We implemented functionalities such as duplicate detection, which identifies and removes redundant records to prevent skewed analysis, and flatline filters, which eliminate constant, non-informative data points. Interval and range filters are employed to validate the time series data against predefined temporal or value ranges, ensuring conformity with expected patterns. Additionally, a K-sigma anomaly detection component identifies outliers based on statistical deviations, enabling the isolation of erroneous or anomalous values. Together, these methods ensure the pipeline delivers high-quality, actionable data for downstream processes.
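
As an illustration of the K-sigma idea, the following sketch flags values that lie more than `k` sample standard deviations from the mean. It is a simplified, plain-Python version under our own assumptions (the function name `k_sigma_outliers` and the default `k=3.0` are hypothetical); the RTDIP component operates on Spark DataFrames and may differ in its details.

```python
from statistics import mean, stdev


def k_sigma_outliers(values, k=3.0):
    """Return the indices of values further than k standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        # A constant series has no spread, so nothing can be an outlier.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]
```

Note that a single extreme value also inflates the mean and standard deviation it is compared against, which is why robust variants based on the median are often considered for short or heavily contaminated series.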

### Missing Value Imputation

With a dataset refined to exclude unwanted data points and accounting for potential sensor failures, the next step toward ensuring high-quality data is to address any missing values through imputation. The component we developed first identifies and flags missing values by leveraging PySpark’s capabilities in windowing and UDF operations. With these techniques, we are able to dynamically determine the expected interval for each sensor by analyzing historical data patterns within defined partitions. Spline interpolation allows us to estimate missing values in time series data, seamlessly filling gaps with plausible and mathematically derived substitutes. By doing so, data scientists can not only improve the consistency of integrated datasets but also prevent errors or biases in analytics and machine learning models.
To show how this works with the new RTDIP component, here is a short example of how a few lines of code can enhance an exemplary time series load profile:
```python
from rtdip_sdk.pipelines.data_quality.data_manipulation.spark.missing_value_imputation import MissingValueImputation
from pyspark.sql import SparkSession
import pandas as pd

spark_session = SparkSession.builder.master("local[2]").appName("test").getOrCreate()

# Load the raw load profile and convert it to a Spark DataFrame
source_df = pd.read_csv('./solar_energy_production_germany_April02.csv')
incomplete_spark_df = spark_session.createDataFrame(source_df, ['Value', 'EventTime', 'TagName', 'Status'])

# Before Missing Value Imputation
incomplete_spark_df.show()

#Execute RTDIP Pipeline component
clean_df = MissingValueImputation(spark_session, df=incomplete_spark_df).filter_data()

#After Missing Value Imputation
clean_df.show()
```
To illustrate this visually, plotting the before-and-after DataFrames reveals that all gaps have been successfully filled with meaningful data.

<center>

![blog](../images/amos_mvi_raw.png){width=70%}

![blog](../images/amos_mvi.png){width=70%}

</center>


### Normalization

Normalization is a critical step in ensuring data quality within data integration pipelines with various sources. Techniques like mean normalization, min-max scaling, and z-score standardization help transform raw time series data into a consistent scale, eliminating biases caused by differing units or magnitudes across features. It enables fair comparisons between variables, accelerates algorithm convergence, and ensures that data from diverse sources aligns seamlessly, supporting possible downstream processes such as entity resolution, data augmentation, and machine learning. To offer a variety of use cases within the RTDIP pipeline, we implemented normalization techniques like mean normalization, min-max scaling, and z-score standardization as well as their respective denormalization methods.
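
The sketch below shows min-max scaling and z-score standardization together with their inverse operations, in plain Python. The function names are hypothetical and the code only illustrates the arithmetic; it is not the RTDIP normalization API, which works on Spark DataFrame columns.

```python
from statistics import mean, pstdev


def minmax_normalize(values):
    """Scale values into [0, 1]; also return the parameters needed to undo it."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant series")
    return [(v - lo) / (hi - lo) for v in values], (lo, hi)


def minmax_denormalize(scaled, params):
    """Invert minmax_normalize using the stored (min, max) pair."""
    lo, hi = params
    return [s * (hi - lo) + lo for s in scaled]


def zscore_normalize(values):
    """Shift to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant series")
    return [(v - mu) / sigma for v in values], (mu, sigma)


def zscore_denormalize(scaled, params):
    """Invert zscore_normalize using the stored (mean, std) pair."""
    mu, sigma = params
    return [s * sigma + mu for s in scaled]
```

Returning the fitted parameters alongside the scaled values is what makes denormalization possible later, for example to map model predictions back to the original units.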

### Data Monitoring

Data monitoring is another aspect of enhancing data quality within the RTDIP pipeline, ensuring the reliability and consistency of incoming data streams. Techniques such as flatline detection identify periods of unchanging values, which may indicate sensor malfunctions or stale data. Missing data identification leverages predefined intervals or historical patterns to detect and flag gaps, enabling proactive resolution. By continuously monitoring for these anomalies, the pipeline maintains high data integrity and supports accurate downstream analysis.
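
Flatline detection can be sketched as a search for runs of identical consecutive values. The helper below is a hypothetical plain-Python illustration (the name `find_flatlines` and the `min_run` threshold are our own assumptions); the RTDIP monitoring component applies the same idea per sensor on Spark DataFrames.

```python
def find_flatlines(values, min_run=3):
    """Return (start, end) index pairs, inclusive, for runs of identical
    consecutive values that are at least min_run samples long."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the series or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = i
    return runs
```

For real sensor data, an exact equality test is usually too strict; comparing against a small tolerance would be a natural extension.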

### Data Prediction

Forecasting based on historical data patterns is essential for making informed decisions on a business level. Linear Regression is a simple yet powerful approach for predicting continuous outcomes by establishing a relationship between input features and the target variable. However, for time series data, the ARIMA (Autoregressive Integrated Moving Average) model is often preferred due to its ability to model temporal dependencies and trends in the data. The ARIMA model combines autoregressive (AR) and moving average (MA) components with differencing (the "integrated" part), which removes trends so that the series becomes stationary. ARIMA with automatic parameter selection takes this a step further by choosing the model’s order (p, d, q) using techniques like grid search or statistical criteria, so that the model fits the data’s underlying structure and yields more accurate predictions. To cover both cases, we incorporated both an ARIMA component and an AUTO-ARIMA component, enabling the prediction of future time series data points for each sensor.
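
To illustrate the differencing step (the "I" in ARIMA), the sketch below computes first-order differences and restores the original series from them. It is a plain-Python illustration with hypothetical helper names and is not part of the ARIMA or AUTO-ARIMA components.

```python
from itertools import accumulate


def difference(series):
    """First-order differencing: y[t] - y[t-1] for t >= 1."""
    return [b - a for a, b in zip(series, series[1:])]


def undifference(first_value, diffs):
    """Rebuild the original series from its first value and its differences."""
    return list(accumulate(diffs, initial=first_value))
```

Applying `difference` once turns a linear trend into a constant series, and applying it twice does the same for a quadratic trend; the number of applications corresponds to the parameter d.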

<br>

Working on the RTDIP Project within AMOS has been a fantastic journey, highlighting the importance of people and teamwork in agile development. By focusing on enhancing data quality, we’ve significantly boosted the reliability, consistency, and usability of the data going through the RTDIP pipeline.

To look back, our regular team meetings were the key to our success. Through open communication and collaboration, we tackled challenges and kept improving our processes. This showed us the power of working together in an agile framework and growing as a dedicated SCRUM team.

We’re excited about the future and how these advancements will help data scientists and engineers make better decisions.

<br>

<small>1 Designed by Freepik</small><br>
<small>2 Michael Stonebraker, Ihab F. Ilyas: Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull. 41(2) (2018)</small>
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.dimensionality_reduction
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.duplicate_detection
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.flatline_filter
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.gaussian_smoothing
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.interval_filtering
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.k_sigma_anomaly_detection
@@ -0,0 +1,2 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.missing_value_imputation

@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.denormalization
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_mean
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_minmax
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.normalization.normalization_zscore
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.data_manipulation.spark.out_of_range_value_filter
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.check_value_ranges
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.flatline_detection
@@ -2,4 +2,4 @@

Great Expectations is a Python-based open-source library for validating, documenting, and profiling your data. It helps you to maintain data quality and improve communication about data between teams.

- ::: src.sdk.python.rtdip_sdk.pipelines.monitoring.spark.data_quality.great_expectations_data_quality
+ ::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.great_expectations_data_quality
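Reviewer note: the Great Expectations reference moves from the `...pipelines.monitoring.spark.data_quality` package to `...pipelines.data_quality.monitoring.spark`, so any documentation or code that still uses the old dotted path would need updating. A minimal sketch of that rewrite follows; the helper name `migrate_module_path` is hypothetical and not part of the SDK, and only the two prefixes shown in this diff are assumed.

```python
# Prefixes copied from the mkdocstrings identifiers in this diff.
OLD_PREFIX = "src.sdk.python.rtdip_sdk.pipelines.monitoring.spark.data_quality"
NEW_PREFIX = "src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark"


def migrate_module_path(dotted_path: str) -> str:
    """Rewrite a dotted path from the old package layout to the new one.

    Hypothetical helper for illustration; paths outside the old package
    are returned unchanged.
    """
    if dotted_path == OLD_PREFIX or dotted_path.startswith(OLD_PREFIX + "."):
        return NEW_PREFIX + dotted_path[len(OLD_PREFIX):]
    return dotted_path


old = OLD_PREFIX + ".great_expectations_data_quality"
print(migrate_module_path(old))
```

Matching on the full prefix plus a trailing dot avoids rewriting unrelated modules that merely share a name fragment.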
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_interval
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.identify_missing_data_pattern
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.data_quality.monitoring.spark.moving_average
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.arima
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.auto_arima
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.data_binning
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.k_nearest_neighbors
@@ -0,0 +1 @@
::: src.sdk.python.rtdip_sdk.pipelines.forecasting.spark.linear_regression
2 changes: 2 additions & 0 deletions environment.yml
@@ -68,6 +68,8 @@ dependencies:
- black>=24.1.0
- joblib==1.3.2,<2.0.0
- great-expectations>=0.18.8,<1.0.0
+ - statsmodels>=0.14.1,<0.15.0
+ - pmdarima>=2.0.4
- pip:
- databricks-sdk>=0.20.0,<1.0.0
- dependency-injector>=4.41.0,<5.0.0
38 changes: 33 additions & 5 deletions mkdocs.yml
@@ -235,10 +235,38 @@ nav:
- Azure Key Vault: sdk/code-reference/pipelines/secrets/azure_key_vault.md
- Deploy:
- Databricks: sdk/code-reference/pipelines/deploy/databricks.md
- - Monitoring:
- - Data Quality:
- - Great Expectations:
- - Data Quality Monitoring: sdk/code-reference/pipelines/monitoring/spark/data_quality/great_expectations.md
+ - Data Quality:
+ - Monitoring:
+ - Check Value Ranges: sdk/code-reference/pipelines/data_quality/monitoring/spark/check_value_ranges.md
+ - Great Expectations:
+ - Data Quality Monitoring: sdk/code-reference/pipelines/data_quality/monitoring/spark/great_expectations.md
+ - Flatline Detection: sdk/code-reference/pipelines/data_quality/monitoring/spark/flatline_detection.md
+ - Identify Missing Data:
+ - Interval Based: sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_interval.md
+ - Pattern Based: sdk/code-reference/pipelines/data_quality/monitoring/spark/identify_missing_data_pattern.md
+ - Moving Average: sdk/code-reference/pipelines/data_quality/monitoring/spark/moving_average.md
+ - Data Manipulation:
+ - Duplicate Detection: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/duplicate_detection.md
+ - Out of Range Value Filter: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/out_of_range_value_filter.md
+ - Flatline Filter: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/flatline_filter.md
+ - Gaussian Smoothing: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/gaussian_smoothing.md
+ - Dimensionality Reduction: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/dimensionality_reduction.md
+ - Interval Filtering: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/interval_filtering.md
+ - K-Sigma Anomaly Detection: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/k_sigma_anomaly_detection.md
+ - Missing Value Imputation: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/missing_value_imputation.md
+ - Normalization:
+ - Normalization: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization.md
+ - Normalization Mean: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_mean.md
+ - Normalization MinMax: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_minmax.md
+ - Normalization ZScore: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/normalization_zscore.md
+ - Denormalization: sdk/code-reference/pipelines/data_quality/data_manipulation/spark/normalization/denormalization.md
+ - Forecasting:
+ - Data Binning: sdk/code-reference/pipelines/forecasting/spark/data_binning.md
+ - Linear Regression: sdk/code-reference/pipelines/forecasting/spark/linear_regression.md
+ - Arima: sdk/code-reference/pipelines/forecasting/spark/arima.md
+ - Auto Arima: sdk/code-reference/pipelines/forecasting/spark/auto_arima.md
+ - K Nearest Neighbors: sdk/code-reference/pipelines/forecasting/spark/k_nearest_neighbors.md

- Jobs: sdk/pipelines/jobs.md
- Deploy:
- Databricks Workflows: sdk/pipelines/deploy/databricks.md
@@ -330,4 +358,4 @@ nav:
- blog/index.md
- University:
- University: university/overview.md


5 changes: 4 additions & 1 deletion setup.py
@@ -1,4 +1,4 @@
- # Copyright 2022 RTDIP
+ # Copyright 2025 RTDIP
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -46,6 +46,8 @@
"langchain>=0.2.0,<0.3.0",
"langchain-community>=0.2.0,<0.3.0",
"openai>=1.13.3,<2.0.0",
+ "statsmodels>=0.14.1,<0.15.0",
+ "pmdarima>=2.0.4",
]

PYSPARK_PACKAGES = [
@@ -71,6 +73,7 @@
"joblib>=1.3.2,<2.0.0",
"sqlparams>=5.1.0,<6.0.0",
"entsoe-py>=0.5.10,<1.0.0",
+ "numpy>=1.23.4,<2.0.0",
]

EXTRAS_DEPENDENCIES: dict[str, list[str]] = {
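Reviewer note: the ranges added to `setup.py` in this change, such as `statsmodels>=0.14.1,<0.15.0` and `numpy>=1.23.4,<2.0.0`, can be sanity-checked against a concrete version with a small stdlib-only sketch. The helpers `parse_version` and `satisfies` are hypothetical names introduced here for illustration, and they assume plain numeric release versions with the same number of components on both sides. This is a simplification and does not implement full PEP 440 semantics, which real tooling should get from a dedicated library.

```python
import operator
import re

# Comparison operators supported by this simplified checker.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

_CLAUSE = re.compile(r"\s*(>=|<=|==|>|<)\s*([0-9]+(?:\.[0-9]+)*)\s*")


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.14.1' into (0, 14, 1); numeric release versions only."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, spec: str) -> bool:
    """Return True if version meets every comma-separated clause in spec."""
    candidate = parse_version(version)
    for clause in spec.split(","):
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError(f"unsupported specifier clause: {clause!r}")
        op, bound = match.groups()
        if not _OPS[op](candidate, parse_version(bound)):
            return False
    return True


# numpy is pinned to 1.26.4 in src/api/requirements.txt and ranged in setup.py.
print(satisfies("1.26.4", ">=1.23.4,<2.0.0"))
```

A check like this makes it easy to spot a pin in `src/api/requirements.txt` that falls outside the range declared in `setup.py`.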
4 changes: 2 additions & 2 deletions src/api/requirements.txt
@@ -1,6 +1,6 @@
# Do not include azure-functions-worker as it may conflict with the Azure Functions platform
azure-functions==1.18.0
- fastapi==0.110.0
+ fastapi==0.115.6
pydantic==2.6.0
# turbodbc==4.11.0
pyodbc==4.0.39
@@ -10,7 +10,7 @@ azure-identity==1.17.0
oauthlib>=3.2.2
pandas>=2.0.1,<3.0.0
numpy==1.26.4
- jinja2==3.1.4
+ jinja2==3.1.5
pytz==2024.1
semver==3.0.2
xlrd==2.0.1