
Conversation

Hooverdan96
Member

Addresses #2650.

High-level intent

Migrate Rockstor service management from supervisord to systemd, add dedicated systemd unit files for several components, genericize the task scheduler name, remove the supervisor dependency from packaging, and update code to call systemd instead of supervisorctl. Also adjust logging filenames and prepare the codebase for moving from WSGI toward ASGI (hence the more generic "SGI" naming).

Key changes (what was added/edited)

  • Added systemd unit files in conf:

    • replication.service (new)
    • rockstor-data-collector.service (new)
    • rockstor-scheduling.service (new)
    • Tightened rockstor.service to run gunicorn directly under systemd (replacing supervisord usage), added stdout/stderr files and restart rules.
  • Logging changes:

    • gunicorn.conf.py: access log renamed from ./var/log/gunicorn.log to ./var/log/sgi.log.
    • data_collector.py: internal log map updated to use sgi.log, sgi_stdout.log, sgi_stderr.log; removed references to supervisord log.
    • rockstor.service systemd unit now writes to sgi_* log filenames.
  • Supervisor → systemd migration in code:

    • Replaced usages of supervisorctl / superctl with a systemctl approach:
      • debugmode.py: calls systemctl instead of supervisorctl.
      • Many views switched import/use from system.services.superctl to system.services.systemctl (e.g., dc_service.py, replication_service.py, sm_service.py, ts_service.py, ztaskd_service.py → renamed).
    • services.py:
      • Removed supervisor-related constants and helper functions (SUPERCTL_BIN, SUPERVISORD_CONF, set_autostart, superctl).
      • Adjusted service_status to query systemd for replication/data-collector/scheduling.
        Note: this is the only place that uses the out, err, rc = systemctl(service_name, "status") form; other call sites use the run_command construct directly. Should these be aligned to one or the other? (See the sketch after this list.)
      • Added new service names to internal supported list.
  • Renames / genericization of scheduler:

    • ZTaskd renamed to rockstor-scheduling across fixtures, tests and views:
      • Files renamed: ztask_helpers.py → scheduling_helpers.py, ztaskd_service.py → scheduling_service.py
      • Fixtures/JSON and test data: replaced ztask-daemon/ZTaskd with rockstor-scheduling/Rockstor Scheduling.
      • URLs and UI text updated (e.g., services.py, src/rockstor/storageadmin/static/.../services.js).
      • Task imports updated to use scheduling_helpers.restart_rockstor.
  • init / registration updates:

    • initrock.py: ROCKSTOR_SYSTEMD_SERVICES now includes rockstor-data-collector.service, replication.service, and rockstor-scheduling.service.
    • prep_db.py: service display/config mapping adjusted; ZTaskd mapped to rockstor-scheduling.
  • Packaging / dependency changes:

    • pyproject.toml: supervisor dependency commented out (effectively removed).
  • Tests updated:

    • test_services.py and other fixtures/tests updated to include/expect the new services and names.
  • Misc:

    • rockon_id.py switched to systemctl and now references rockstor-scheduling.
    • Various code changes to replace references to ztask/ztask-daemon and to rely on systemd-managed services.
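For reference, the two status-query styles compared in the services.py note above look roughly like this. A sketch only: it assumes the SYSTEMCTL path constant and the run_command helper from system/osi.py, and the wrapper shown is the one quoted later in this conversation, not a verbatim excerpt of the PR.

```python
from system.osi import run_command  # existing Rockstor helper

SYSTEMCTL = "/usr/bin/systemctl"    # assumed binary path used by services.py


def systemctl(service_name, switch):
    # Wrapper as used in services.py after this PR (quoted later in this thread).
    arg_list = [SYSTEMCTL, switch, service_name]
    if switch == "status":
        arg_list.append("--lines=0")
    return run_command(arg_list, log=True)


# Style 1 - the single occurrence noted above (in service_status), via the wrapper:
out, err, rc = systemctl("rockstor-scheduling", "status")

# Style 2 - the direct run_command construct used at other call sites:
out, err, rc = run_command([SYSTEMCTL, "status", "rockstor-scheduling", "--lines=0"], log=True)
```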

Notable renames / symbols

  • ztask-daemon / ZTaskd → rockstor-scheduling / Rockstor Scheduling
  • ztask_helpers → scheduling_helpers
  • superctl / supervisorctl → systemctl usage in code
  • gunicorn.log → sgi.log, and stdout/stderr files renamed to sgi_stdout.log / sgi_stderr.log

Cosmetic Changes

Adjusted end-of-line characters from CRLF (Windows) to LF (Unix). Some files that were not changed as part of the content of this PR turned out to have inconsistent EOLs, so they were aligned here to a consistent EOL setup.

Testing

  • Installation on a VM that contained a previous Rockstor installation (removed the original Rockstor user to be able to reuse it)
  • Successful Rockstor start - initial login, user creation
  • Imported previous btrfs disks
  • Activated Rock-on service (log shows successful start of the docker service)
  • Ability to configure and turn the replication service on/off (systemctl status shows the corresponding changes, as does the Web-UI)
  • Data collection seems to be running fine (no error messages)
  • Dashboard is populated
  • Renamed logs are accessible

Ran unit tests:

----------------------------------------------------------------------
Ran 292 tests in 80.386s

OK
Destroying test database for alias 'default' ('test_storageadmin')...
Destroying test database for alias 'smart_manager' ('test_smartdb')...

update services list
Move to systemd & cleanup supervisor files
Defer socket-based approach to later
Switch to .env file for new services
Add service files for gunicorn, replication, data-collector and huey.
update data_collector.py log paths for gunicorn
add individual services formerly managed by supervisord
change dependency to rockstor-gunicorn
Update log file references to more generic in prep to moving to other asgi from wsgi
Align initrock.py with changed service name
update services test
change return parameters for former supervisord services
add services to systemd support list
additional systemd alignment in prep_db and services
replace superctl with systemctl method
remove supervisor from pyproject.toml and update lock file using poetry 2.2.0
move new services to using env file
genericize task scheduling
update replication service
cleanup - adjust End-of-Line from CRLF (Windows) to LF (Unix)
Comment on lines 3 to 4
After=rockstor-pre.service
Requires=rockstor-pre.service
Member


@Hooverdan96 I think the following dependencies are missing in all of your suggested services.

After=rockstor-pre.service
Requires=rockstor-pre.service

I.e. we depend upon a cascade of systemd dependencies starting at the top with rockstor-bootstrap.
Here in rockstor.service, and likely in some of your new partially independent services, there is a hard dependency on rockstor-pre.service, which in turn depends on rockstor-build, which builds our .venv. All our services (except rockstor-build) depend on that, as they all use the .venv Python env.
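For illustration, the kind of [Unit] dependency block being suggested would look something like the following in one of the new unit files (a sketch only; the Description shown is taken from the replication unit discussed later in this thread, and which units ultimately carry these directives is still being settled):

```
[Unit]
Description=Rockstor Replication Service
After=rockstor-pre.service
Requires=rockstor-pre.service
```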

Member Author

Hooverdan96 commented Oct 6, 2025


So, if I add these things in, the dependency diagram would look something like this (there are obviously others like docker, etc. but for the purpose of this argument):

[image: dependency diagram]

Should the rockstor-bootstrap also complete before the data collection and replication start (since they were part of the original rockstor.service/supervisor service unit)? Or the other way around?

The bootstrap performs a btrfs device scan, secret refresh and qgroup cleanup. If the scan or the secret refresh fail, then not much will work, right?

Member


Should the rockstor-bootstrap also complete before the data collection and replication starts (since they were part of the original rockstor.service/supervisor service unit)?

I think it is more:

Or the other way around

The bootstrap performs a btrfs device scan, secret refresh and qgroup cleanup. If the scan or the secret refresh fail, then not much will work, right?

There is an existing conundrum here, i.e. on a fresh boot, if bootstrap fails we can end up with unmounted Pools. Note that it retries several times to give things time to be ready. Likely we would be better off re-imagining how things currently work, which is the core of this issue in essence. But remember we can have spin-offs (i.e. the JSON EOL corrections/normalisations may have been a candidate for that, given we have yet to address all that is related to dropping our supervisord).

We do have an existing doc entry for services: https://rockstor.com/docs/contribute/contribute.html#code-build which reads as follows:

rockstor-build # runs build.sh if .venv dir is not found
rockstor-pre # starts .venv/bin/initrock after rockstor-build
rockstor # starts .venv/bin/supervisord -c etc/supervisord.conf after rockstor-pre
rockstor-bootstrap # starts .venv/bin/bootstrap after rockstor

And there is a simple chain between them, i.e. each one depends on the one above it, and we have both After and Requires. Currently these are a simple sequence. But as you say, there is a conundrum re bootstrap in that it prepares things that others depend upon (disks and mounts), yet it may use API calls that are stood-up by what is left of the rockstor.service, now nicely announced as "Rockstor SGI Service".

That is why, in bootstrap, we wait on availability - previous to systemd (Rockstor pre-dates systemd) we had to wait on stuff we needed. Your work here on splitting out each component brings these interdependencies to light. But the focus of this issue is to remove supervisord and establish something close to what worked previously, even if it has yet to take full advantage of these separations of concerns. And keep in mind that docs/forum and rpm all consider bootstrap to be the last element to be started - which in turn invokes, via a chain dependency, every other service.

I think we are nearly there on this PR - but I would like to maintain bootstrap as the overall last service, as it has been for some time now. Otherwise we have to change our execution order - when we are attempting here to remove supervisord, not re-architect our services (though yes, there is a bit of that): one step at a time.

There is more than one cross-dependency within our architecture currently - this issue is (should be) only one step towards availing ourselves of more modern managers, and of using only one of them.

Any API call needs our SGI; it in turn needs nginx and postgres. But our current rockstor-build depends on postgres - presumably due to "django-admin collectstatic" - so any service that depends on build also has an implied postgres dependency. Incidentally, once you have all these in a working state - from a clean install - we can always refine thereafter. Now we are down to a single manager in systemd. But the first step is to replace supervisord with systemd - you have gone further than I had intended against this issue, which is good. But it has in turn brought up more fine-grained issues of interdependency.

Let's get this booting with no .venv and no DB and no .initrock first. Then look at exactly what systemd is doing with the definitions at that time. Prior work in this area by me indicates the use of systemd graphing to analyse boot-ups in detail.

In short we have two cross-dependencies I know of - Disk/Pool models and how to refresh them, and likely also rockstor and rockstor-bootstrap as was. You have teased out very nicely the scheduler (needs SGI/rockstor) and the data-collector (our dashboard), which will likely be thrown out later in favour of a Prometheus backend used by Grafana, OpenTelemetry, etc. But for now it uses our DB and API, so it is also dependent on SGI. All of what you have done thus far here is great - we just need to get a working arrangement in practice. We have in-code waits/retries from the olden days of starting from rc.local, so we have some wiggle room.

Agreed we have at least one existing conundrum but the work against this issue is to collect our existing co-dependencies under a single process manager - not to simultaneously fix all of our potential existing race conditions.

Try to keep things simple initially at least. Where one service depends on the last most needed service and only that service. If an earlier service fails all subsequent ones will also. That is OK. We can do parallel startup later once we have everything out of supervisord which I think you have almost done.

Re the config save/restore question we may just be able to establish a conditional on restore (if we do use service name) where if there is no instance of the new name, look for the old name. If everything else is the same that should help with backward compatibility which would be much preferred.

@phillxnet
Member

@Hooverdan96

This is all looking quite promising. I've made a note re systemd services ultimately depending (via a cascade) on the lowest-level one, rockstor-build. All other services must ensure, either directly or indirectly, that rockstor-build has run successfully.

The tell here is to stop all your services on the test system, then rm -rf /opt/rockstor/.venv, then try starting bootstrap. This is what each and every rpm update does. So we must recover from that and end-up with all our intended services starting from that worst-case scenario. We do this to effect a from-scratch rebuild of the .venv before we embark on using it. This setup helps us to transition our .venv during rpm upgrade from one instance to another: where there is also likely a Python update as well - and on occasion a Poetry update also.

Also, there are a number of js files here that don't look involved in these changes - you may have an MS EOL issue or something like that going on in there somehow. Sorting that will also help with narrowing all changes to those you have intended, which will help with reviewing this.

All quite exciting this. Can hardly wait to be down to just a single process manager - and a native Linux one at that.

@FroggyFlox
Member

FroggyFlox commented Oct 6, 2025

Thank you so much for all this thorough work!

Reading the changes in service names makes me wonder about our config restore mechanics. If I remember correctly, we do attempt to restore the status of services as well, in which case I wonder if the restore of a config backup created before the name changes will error out on a system post name changes (as matches by name would now fail)... Is that indeed the case?

@Hooverdan96
Member Author

Hooverdan96 commented Oct 6, 2025

I think the following dependencies are missing in all of your suggested services.

I probably got lucky that this didn't cause issues. I will take a look at that, but it makes sense that these dependencies should be in there.

Also there are a number of js files here that don't look involved in these changes - you may have an MS EOL or something like that going on in there some-how. Sorting that will also help with narrowing all changes to those you have intended which will help with reviewing this.

Yes, I pointed that out in the initial description: while checking my own work for possible EOL issues, I found other files - which I had never touched before - having EOL issues (i.e. they had the Windows CRLF instead of the Unix LF only, which may or may not cause issues, in my experience). So I went ahead and fixed those here.

@FroggyFlox

I wonder if the restore of a config backup created before the names changes will error out on a system post names changes

Good point, I have not tested that, but you are probably right. This would mean that, in the code, this "change" needs to be carried forward to ensure that pre-systemd services are recognized and started appropriately, correct?

@Hooverdan96
Member Author

Hooverdan96 commented Oct 6, 2025

@FroggyFlox, what is your recommendation w.r.t to the config backup impact?

The most direct way would be to insert logic right into the config restore area:

def restore_services(ml: list):

But would it make more sense to leave that portion alone and rather map old to new during ingestion of the json file? That way the actual logic for restoration would not be cluttered. I am thinking of the case where more services end up being renamed, or other config elements unrelated to services get an overhaul for some reason.

I could imagine another call to another (new) function after the consumption of the config file, before the actual restoration, where a type of legacy mapping would be maintained.

def restore_config(cbid):
    cbo = ConfigBackup.objects.get(id=cbid)
    fp = os.path.join(settings.MEDIA_ROOT, "config-backups", cbo.filename)
    gfo = gzip.open(fp)
    lines = gfo.readlines()
    sa_ml: list = json.loads(lines[0])
    sm_ml: list = json.loads(lines[1])
    gfo.close()
    restore_users_groups(sa_ml)
    restore_samba_exports(sa_ml)
    restore_nfs_exports(sa_ml)
    restore_services(sm_ml)
    # restore_dashboard(ml)
    # restore_appliances(ml)
    # restore_network(sa_ml)
    restore_scheduled_tasks(sm_ml, sa_ml)
    # N.B. the following is also a Huey task in its own right.
    restore_rockons(sa_ml)

I can break this out into a separate issue (or @phillxnet can make this a sub-issue, because I apparently cannot), and in the current documentation of this PR topic, the interim "solution" is to perform a search and replace in the json file (for ztask-daemon, replacing it with rockstor-scheduling and ZTaskd with Scheduling), before zipping it back up again and uploading it.

{"model": "smart_manager.service", "pk": 13, "fields": {"name": "ztask-daemon", "display_name": "ZTaskd", "config": null}}

to

{"model": "smart_manager.service", "pk": 13, "fields": {"name": "rockstor-scheduling", "display_name": "Scheduling", "config": null}}

The expectation would then be that this is addressed before the next release as well, so that manual manipulation becomes obsolete.
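A minimal sketch of what such a legacy-mapping step could look like, applied to the smart_manager model list after it is loaded from the backup and before restore_services() runs (the function name, mapping dictionary, and call site are hypothetical, not part of this PR):

```python
# Hypothetical legacy-name shim: old service name -> (new name, new display name).
LEGACY_SERVICE_NAMES = {
    "ztask-daemon": ("rockstor-scheduling", "Scheduling"),
}


def map_legacy_services(sm_ml: list) -> list:
    """Rewrite pre-rename service entries in the loaded smart_manager model list."""
    for entry in sm_ml:
        if entry.get("model") != "smart_manager.service":
            continue
        fields = entry.get("fields", {})
        if fields.get("name") in LEGACY_SERVICE_NAMES:
            new_name, new_display = LEGACY_SERVICE_NAMES[fields["name"]]
            fields["name"] = new_name
            fields["display_name"] = new_display
    return sm_ml


# e.g. in restore_config(), before restore_services(sm_ml):
# sm_ml = map_legacy_services(sm_ml)
```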

@Hooverdan96
Member Author

Hooverdan96 commented Oct 7, 2025

The tell here is to stop all your services on the test system, then rm -rf /opt/rockstor/.venv, then try starting bootstrap.

Just to clarify, stop all of these running services

rockstor
rockstor-data-collector
rockstor-scheduling

and

rm -rf /opt/rockstor/.venv

Reboot. The initial rockstor-build.service is skipped because of the missing .venv environment. rockstor-pre fails for the same reason (and cannot execute initrock). The other services also fail because of the prerequisite on rockstor-pre, which is not there. So, based on the above changes, this test outcome should drive all units based on running rockstor-bootstrap: while rockstor-pre and rockstor will have started (as required), the other three will not, as there is no required dependency maintained for rockstor-scheduling, replication, data-collector (and, incidentally, docker).

Sorry, just want to ensure I get it.

@phillxnet
Member

phillxnet commented Oct 7, 2025

@Hooverdan96 Re:

Reboot. The initial rockstor-build.service is skipped because of the missing .venv environment.

I think this is a typo as rockstor-build.service runs only if there is no .venv. It builds the .venv after all. So if it finds none it sets about building it.

@phillxnet
Member

phillxnet commented Oct 7, 2025

@Hooverdan96 your last comment here is encouraging. We do need to establish a working set of services here. They may not be entirely without race-conditions - but what we did do did work. And establishing a strict sequence, even if not strictly required, is closer to what we had, which has worked. We can then tease out optimisations and parallelisation etc.

@phillxnet
Member

@Hooverdan96 & @FroggyFlox Re:

I could imagine another call to another (new) function after the consumption of the config file, before the actual restoration, where a type of legacy mapping would be maintained.

I like this idea myself. As it abstracts the legacy to current, making it cleaner - hopefully.

Do keep in mind that we break this out into a spin-off issue. GitHub sub-issue stuff is unimportant - as long as we have inter-dependencies established via links. It may be we release a first testing rpm with (announced) broken restore for this service alone. That is OK. If in years' time we have to parse 12 reasons for this PR we may regret it. But also we are a do-ocracy: if a simple "if no new-name, get old-name" can be done then great. But we are not limited in the number of issues we can raise - each with a clear definition of what is being attempted.

The complexity of this issue is why I had assigned myself - but you are doing an excellent job here; so no worries at all really.

Re scheduler and data-collector just have them wait and depend on rockstor. Leaving bootstrap as last (with some questions already addressed). That may get us to a working state again.

@Hooverdan96
Member Author

I think this is a typo as rockstor-build.service runs only if there is no .venv
Actually, it was not. I thought that was strange ... and it was late in the evening, so I turned everything off and started over this morning 😄

So, here's the testing

Before removing .venv - critical path

~# systemd-analyze critical-chain

multi-user.target @1min 17.999s
└─docker.service @1min 13.773s +4.224s
  └─rockstor-bootstrap.service @55.959s +17.803s
    └─rockstor.service @55.929s
      └─rockstor-pre.service @16.388s +39.525s
        └─chrony-wait.service @9.148s +7.160s
          └─chronyd.service @8.500s +637ms
            └─network.target @8.480s
              └─NetworkManager.service @7.959s +519ms
                └─network-pre.target @7.956s
                  └─soft-reboot-cleanup.service @7.458s +497ms
                    └─basic.target @7.410s
                      └─dbus-broker.service @7.292s +113ms
                        └─dbus.socket @7.270s
                          └─sysinit.target @7.261s
                            └─auditd.service @7.000s +261ms
                              └─systemd-tmpfiles-setup.service @6.521s +162ms
                                └─systemd-journal-flush.service @5.904s +600ms
                                  └─var.mount @5.800s +93ms
                                    └─dev-sda4.device

test start:

systemctl stop rockstor replication rockstor-data-collector rockstor-scheduling postgresql
rm -rf /opt/rockstor/.venv
reboot

after removing .venv and reboot - critical path

~# systemd-analyze critical-chain

multi-user.target @1min 33.900s
└─docker.service @1min 30.474s +3.421s
  └─rockstor-bootstrap.service @1min 13.732s +16.718s
    └─rockstor.service @1min 13.723s
      └─rockstor-pre.service @43.474s +30.242s
-->     └─rockstor-build.service @7.311s +36.151s
          └─NetworkManager-wait-online.service @4.926s +2.357s
            └─NetworkManager.service @4.535s +383ms
              └─network-pre.target @4.533s
                └─soft-reboot-cleanup.service @4.188s +330ms
                  └─basic.target @4.146s
                    └─dbus-broker.service @4.061s +81ms
                      └─dbus.socket @4.041s
                        └─sysinit.target @4.029s
                          └─auditd.service @3.838s +191ms
                            └─systemd-tmpfiles-setup.service @3.519s +118ms
                              └─systemd-journal-flush.service @2.879s +625ms
                                └─var.mount @2.648s +163ms
                                  └─dev-sda4.device

The rockstor-build is running, doing its thing and the .venv is restored and Rockstor is up and running again 🎉


Here's a visual of the systemd load ... a bit doctored up for people without eagle vision...
[image: annotated systemd service graph]

@phillxnet
Member

@Hooverdan96 Just a small note on the naming of the new replication.service; all our existing core systemd services, and the others you have added here, begin with "rockstor". Might we want to continue that consistency? It does make for easier identification, plus one can do such things as:

systemctl stop rockstor*

At least when not in a directory with filename matches.

But not so if a core systemd service of ours does not begin with "rockstor".

I'm thinking mainly here at the systemd level of things - names within rockstor code and the DB are of course another thing given they have more context. Also, if we maintain the "rockstor*" naming we can also more easily wipe all existing services by filename for example; using the same wildcard "rockstor*".

@phillxnet
Member

phillxnet commented Oct 8, 2025

@Hooverdan96 Further context re:

all our existing core systemd services ...

We are obliged by systemd norms/function, for non-core override services, to start the file name with a priority number - see our nginx override config in the install lines within our https://github.com/rockstor/rockstor-rpmbuild repo.

https://github.com/rockstor/rockstor-rpmbuild/blob/8d6a3d997b843db1bab121fbad21f0f13a64b3d6/rockstor.spec#L249-L260

Also note that we will require a matching issue/PR set within that repo, on the testing branch, to add the install/management of these new services via rpm once we have the final names in place here. That way we can properly assess the changes in this repo, as our backend can build with simultaneous PRs from both this repo and the rpm repo - a requirement when we get to finally assess the changes here.

@Hooverdan96
Member Author

Hooverdan96 commented Oct 8, 2025

Might we want to continue that consistency?

Yes, I tried it initially, but ran into issues where the unit name and the actual service name are not the same. This resulted in a missing configuration icon/option on the services screen, and I couldn't start/stop the service from there either. I will investigate further, but it seems more convoluted than it was for rockstor-scheduling (unless of course I missed some major thing there that's not apparent in my testing).

We are obliged by systemd norms/function with non-core override servcies to start with a priority number

Sorry, you have to explain that a little bit more.
Do you mean that services that are started/restarted with an override config (like nginx) should be completed prior to the core rockstor service? In the context of nginx, I would think nginx does not need to start at all until gunicorn is up? We currently don't (and previously didn't either) have a sequence in the service units (as a Before/After/Requires), correct? However, it's implicit in the sequence through the rockstor-pre contents itself.

Also note that we will require a matching issue/pr set [rpm repo]

Understood that it will be required there as well. That will be another learning experience for me, if that's the expectation.

@phillxnet
Member

@Hooverdan96 Re:

Sorry, you have to explain that a little bit more.

My fault, all I meant was that it is expected to precede an override file, such as we use for nginx, with a number to inform systemd of the order in which they are to be run within the associated directory.

... should be completed prior to the core rockstor service?

It is just our specific override file required to use our nginx config rather than the default one. Other than the config there is no effect - but we do end up restarting nginx, I think in initrock.

Re:

We currently don't (and previously didn't either) have a sequence in the service units (as a before/after/required), correct?

I think we enforce this via initrock, as you intimate. We have no hard nginx dependency. Doing it via initrock enables us to pop-in our override - but we do this also via the rpm install. However initrock does the restart or stop/start to pick-up the required config we pop in via the override directory and .conf file.

Re:

... if that's the expectation.

Either way, if you are game then great. But once we have the changes here finalised then I can pop in the required changes in our rpm spec file if you would prefer.

Re:

yes, I tried it initially, but did run into issues on how the unit name and the actual service are not the same.

It is likely something silly, let us know if it proves to be elusive. We can keep its service name internally as replication, and just change its systemd file name only, and all associated calls of course. No worries with 'replication' within the code - but it would be nice if this translates to a rockstor-replication.service file externally.

poetry.lock Outdated
@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 2.1.4 and should not be changed by hand.
+# This file is automatically @generated by Poetry 2.2.0 and should not be changed by hand.
Member


@Hooverdan96 How come your Poetry is 2.2.0 when testing is currently using 2.1.4? We have an open issue assigned to me to move to 2.2.0, but that is not part of this issue/PR and could lead to merge conflicts. I'm about to move us to 2.2.0 but that hasn't happened yet, and such a change will be under, and should only be under:

Just trying to avoid merge conflicts as we polish off what you are doing here.

Member Author


Ah, yes, that's not what should be there.
I thought I had fixed that after my initial draft, where I had assumed your poetry upgrade would already be done by the time I finished this (which might still be the case), and had used a VM that had the higher version of poetry installed. I missed that when I moved the branch.
I'll fix that here, as during current testing it installs and runs poetry 2.1.4.

    return is_sftp_running(return_boolean=False)
elif service_name in ("replication", "data-collector", "ztask-daemon"):
    return superctl(service_name, "status", throw=False)
elif service_name in ("replication", "data-collector", "scheduling"):
Member

phillxnet commented Oct 9, 2025


@Hooverdan96 should these not tally with the new, and preferred (rockstor-replication|data-collector|scheduling) naming? Given the resulting service_name is then passed to systemctl() to gain status info.

[EDIT] See my later comment re adding an internal to systemd naming shim. That may well be what we are missing here.

@phillxnet
Member

phillxnet commented Oct 9, 2025

@Hooverdan96 I'm a little reluctant to ask :). But how about we go with:

rockstor-collector rather than rockstor-data-collector, given we are starting out afresh on this one.

We can add a dictionary shim, or the like, to transition between internal and systemd filenames if need be.
I.e. systemctl() could transition "replication", "data-collector", "scheduling" to:

  • rockstor-replication
  • rockstor-collector
  • rockstor-scheduling

That way we can maintain the shorter internal names that make more sense in display and existing DB entries or what-have-you.

On the dropping of rockstor-data-collector - it seems a bit sinister :).

[EDIT] Where we would need a shim for internal to systemd naming of services:

def systemctl(service_name, switch):
    arg_list = [SYSTEMCTL, switch, service_name]
    if switch == "status":
        arg_list.append("--lines=0")
    return run_command(arg_list, log=True)

Which we use for start, stop, and status presumably.
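A minimal sketch of the dictionary shim being proposed, assuming it sits alongside systemctl() in system/services.py where SYSTEMCTL and run_command are already defined/imported (the mapping constant, its contents, and the rockstor-collector name are all proposals, not committed code):

```python
# Proposed internal/DB name -> systemd unit name translation (names are proposals only).
INTERNAL_TO_SYSTEMD = {
    "replication": "rockstor-replication",
    "data-collector": "rockstor-collector",
    "scheduling": "rockstor-scheduling",
}


def systemctl(service_name, switch):
    # Translate internal service names to their systemd unit names, if mapped.
    unit_name = INTERNAL_TO_SYSTEMD.get(service_name, service_name)
    arg_list = [SYSTEMCTL, switch, unit_name]
    if switch == "status":
        arg_list.append("--lines=0")
    return run_command(arg_list, log=True)
```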

@Hooverdan96
Member Author

On the dropping of rockstor-data-collector - it seems a bit sinister :).

Just in time for Halloween :).
After starting to look into the replication change, it seems we will need a shim/explicit call for all systemd-related interactions on this one (lest I end up having to do more work in the js portion of all this, which, considering the possible revamp of the UI, might be less value-added). So, yes, I will revisit the data-collector renaming as well.

@phillxnet
Member

@Hooverdan96 I've popped in two fixes that narrow us down to just the return code accommodation re the replication service. Ergo the tick now in #3034 (comment)

Hooverdan96 and others added 2 commits October 16, 2025 08:06
Co-authored-by: Philip Guyton <phillxnet@users.noreply.github.com>
Co-authored-by: Philip Guyton <phillxnet@users.noreply.github.com>
@phillxnet
Member

Regression now in testing !!

======================================================================
ERROR: test_init_service_op_valid (rockstor.system.tests.test_services.ServicesTests.test_init_service_op_valid)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/rockstor-buildbot/worker/Poetry-Build-on-Leap15-6/rpmbuild/rockstor-core-5.5.0-3034/src/rockstor/system/tests/test_services.py", line 90, in test_init_service_op_valid
    init_service_op(service_name=s, command=c)
  File "/home/buildbot/rockstor-buildbot/worker/Poetry-Build-on-Leap15-6/rpmbuild/rockstor-core-5.5.0-3034/src/rockstor/system/services.py", line 77, in init_service_op
    raise Exception("unknown service: {}".format(service_name))
Exception: unknown service: replication

@phillxnet
Member

As per the finding in: #3034 (comment)

We have from: https://www.freedesktop.org/software/systemd/man/latest/systemctl.html#Exit%20status

The expected confirmation that we get rc=3 from our status check in:

/usr/bin/systemctl status rockstor-replication

○ rockstor-replication.service - Rockstor Replication Service
     Loaded: loaded (/usr/lib/systemd/system/rockstor-replication.service; disabled; preset: disabled)
     Active: inactive (dead)

echo $?
3

Is as follows:

| Value | Description in LSB | Use in systemd |
|-------|--------------------|----------------|
| 0 | "program is running or service is OK" | unit is active |
| 1 | "program is dead and /var/run pid file exists" | unit not failed (used by is-failed) |
| 2 | "program is dead and /var/lock lock file exists" | unused |
| 3 | "program is not running" | unit is not active |
| 4 | "program or service status is unknown" | no such unit |

Co-authored-by: Philip Guyton <phillxnet@users.noreply.github.com>
- use more comprehensive init_service_op function
- rockstor-collector does not need to go through this process

Co-authored-by: Philip Guyton <phillxnet@users.noreply.github.com>
@phillxnet
Member

@Hooverdan96 With the last two suggestions (first one now committed by the looks of it) we should be in a better state.

See what you think of your branch's function thereafter. From a quick look we have at least a conundrum re offering to turn off collection without actually supporting this. But we can separately issue and approach what we are to do with that. It is after all currently a dependency - similarly we offer to turn off rockstor!!! I'd also like to reconsider the use of the bare-bones Type=simple, upping things to at least Type=exec (so that we wait for program execution to finish before moving on). But such changes would have to be tested of course. Also possibly, as per bootstrap, we could consider using Type=oneshot for stuff that exits but has nevertheless done its job.
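For reference, a minimal sketch of the service-type variants mentioned above, as they would appear in a unit file's [Service] section (illustrative only, not the committed unit files):

```
[Service]
# Current bare-bones choice: the unit counts as started as soon as the main process is forked.
Type=simple

# Stricter alternative: start-up only counts as complete once the main binary has been executed.
#Type=exec

# For units that run to completion and then exit, e.g. bootstrap-style jobs.
#Type=oneshot
```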

Anyway - kind of back to you I think on this one. Just wanted to help out re getting rid of the outstanding errors found thus far from what was there. Once you have committed that last suggestion we can rebuild the rpm to see where we are at, end-to-end wise, again. I can report anything obvious I notice thereafter.

Incidentally - we can now turn the scheduler service (Huey) off and on again from within the Web-UI. I wasn't sure that would work given the systemd interdependencies.

Co-authored-by: Philip Guyton <phillxnet@users.noreply.github.com>
@Hooverdan96
Member Author

@phillxnet thanks for your persistence in resolving these nagging errors, very helpful I think. I will do more testing. And you are right, I was wondering why the option to start/stop the collector existed but was never integrated into the WebUI; the Rockstor on/off, however, has always been there, though it wouldn't make much sense.

I will check out the service type thing again, especially w.r.t. the simple option.

Finally, not for this PR of course, but I think there has to be a better way to define/separate the service definition, corresponding filenames, and their connection to the WebUI. It feels like overkill for a change like the scheduling one to have to touch quite a few places where it's not easy to discern (well, at least for me) which change is absolutely required to run a service vs. how the functionality interacts with the WebUI. Especially since a database is used for storing all kinds of properties, having lists of hard-coded values seems superfluous.

@phillxnet
Member

@Hooverdan96 Re:

I will check out the service type thing again, especially w.r.t. the simple option.

I'm thinking, given what we have here now basically works, that we get this existing work squashed down and re-presented in a clean PR for posterity; we can then spin off a few issues that have arisen from this ambitious but productive interpretation of my initial proposal re use of additional ExecStarts. We are now further along, which is good, but we have created a few more potential ripples.

I think there has to be a better way to define/separate the service definition

There is always a better way - and in fact this and your preceding draft PR have contributed to improving what we had. Do keep in mind that systemd is younger than Rockstor. We at one point used nothing more than supervisord and rc.local. So as-is this branch is likely the best candidate thus far, given it uses a single process manager all-in: systemd. The core aim of the original issue. Only we got further than I had planned.

It feels like overkill for a change like the scheduling to have to touch quite a few places where it's not easy to discern (well, at least for me) which change is absolutely required to run a service vs. how the functionality is interacting with the WebUI. Especially, since a database is used for storing all kinds of properties, having lists of hard-coded values seems superfluous.

I mostly agree - and may have taken a different approach myself regarding these additional services and their respective systemd service names. I.e. we still have inconsistencies: only some for historical reasons. But that's fine, as constant refactoring is all part of what we are required to do here. But most importantly I think it's best we understand why we do a thing and what we think we have done. Larger issues that experience scope creep (which is inevitable sometimes) are the most challenging in this regard. I'm going to create some spin-off issues referencing this one with some of the problems encountered.

Our project initially maintained systemd-to-DB service name consistency - but over time, in situations such as ZTaskD to Huey, it was impractical to take on such deep modifications - with consequences to such things as config save and restore - a deal breaker for our stable release channel. As such, things like your:

are critical to address before our next stable release - but with your enthusiastic input into rockstor-core of late, such things can be more easily considered. As you have done here re the name changes. As for DB entries re services - we are a Django project, and as such anything we manage must have a DB counterpart. Everything else revolves around that. The front end and all middleware must interact with the DB, often through our API. This brings great flexibility and capability but also brings with it an inevitable complexity.

having lists of hard-coded values seems superfluous.

Wow, that clause is doing a lot :). Let's just say again we can always improve, and I think we have done so within this and the preceding draft PR. Let's move to creating more spin-offs to address the potential fall-out, and do it in clear steps that help us understand progressively better as we go.

Personally I would not want a JS line endings issue rolled into this one as it serves as a smoke screen when someone (us in the future possibly) views a file changed for no apparent reason related to the issue: completing our move away from supervisord to systemd. We just open a new issue defining we have inconsistency in js file line endings and address that. It will really help in the future - in my experience anyway.

Consider copying over all changed files that are not JS line-ending related and creating a new PR with the result. We can then look back to that as removing supervisord. I'll create an issue to address the line ending also.

@Hooverdan96
Member Author

Hooverdan96 commented Oct 17, 2025

given what we have here now basically works

sounds good. I'll leave that one alone.

Do keep in mind that systemd is younger than Rockstor

Yes, sorry, I didn't mean to knock the history and past decisions. Obviously, with a project that has been around for such a long time, decisions were driven by experience, available options, etc.

Wow that clause is doing a lot :)

Yes, somewhat intentional 😄 - it was in the same context as your remark on Django and database entries. I hope we can get to relying on the DB entries only, as opposed to having an additional dictionary/list in the code to restrict or map a thing like a service (which in turn would make "plugging in" a future service easier as well). But, to quote your favorite line, "bit by bit".

I will do some squashing and clean up and present a new PR (or should this be in this PR?), excluding the js line ending issues (if I don't mess it up).

The only other thing I was planning is to update the copyright year for any files that are modified as part of this PR. Or are you planning on doing that wholesale for all files in a separate issue/PR?

And, I just noticed the spin-off issues you have created, thank you!

@phillxnet
Member

@Hooverdan96 Thanks - that sounds great - I've covered the issue re avoiding the DB as canonical for services in an associated spin-off, as we have a chicken-and-egg on that front: plus the DB access is heavy-weight for one of our uses. But that is for the specific issue :).

The only other thing I was planning is to update the copyright year

I wouldn't bother with that; now that we have normalised the formatting we can likely do this en masse, as per the last change of that type. Plus you have done quite enough juggling here, what with my sporadic and potentially circular contributions.

I will do some squashing and clean up and present a new PR (or should this be in this PR?), excluding the js line ending issues (if I don't mess it up).

Thanks, and I think it would be easier to create a new one; it can in turn reference this and its draft predecessor for historical context.

It may be easier to just set up a new git checkout of the latest testing, and overwrite the relevant files with the non-line-ending-changed files from this git repo - plus any js changes you did end up popping in. Don't worry about preserving attribution of changes in this PR - we were basically thinking out loud during this PR's discussion and changes.

I'm working on wording one more spin-off, and I think we should then be ready for final review of what you are about to present post-squash etc.

Incidentally - did you think the SYSTEM-Services page worked way faster after your changes? I'm pretty sure it feels more nippy now. That would be nice.

@Hooverdan96
Member Author

Incidentally - did you think the SYSTEM-Services page worked way faster after your changes? I'm pretty sure it feels more nippy now.

I have to check that out; I did not compare it. Now that you wrote it, I want to say yes, but I did not really pay attention. Would be great of course.

@Hooverdan96
Member Author

Hooverdan96 commented Oct 17, 2025

I am somehow now not able to get the 5 Windows-vs-Unix line-feed files back to their original state. Not sure what has changed in my dev setup (if anything), but when I do the comparison back to the testing branch, these files still show up as changed. Not sure what I should do next on that... any recommendations? I've tried pulling the testing branch completely independently via git, as well as changing branches, copying all files out, switching back to the new branch and copying the 5 files back in. Neither approach seems to make a difference... the only other option is to do the reverse: take a clean testing clone and copy the 30-odd relevant changed files back into the branch, which is probably what I will end up doing, unless you have a tip on how I can avoid this.

@FroggyFlox
Member

Incidentally - did you think the SYSTEM-Services page worked way faster after your changes? I'm pretty sure it feels more nippy now. That would be nice.

We have a toolbar to check these things now 😉.

Hooverdan96 added a commit to Hooverdan96/rockstor-core that referenced this pull request Oct 17, 2025
- remove supervisor references and config
- genericize scheduling service (currently huey, formerly ztaskd)
- add new corresponding services, update references in various files
- reflect change to systemd in relevant test cases
- harmonize rockstor-specific service unit nomenclature
- genericize log files
- update poetry project toml and lock file (commenting out supervisor for future removal)
- changes based on prior PRs rockstor#3031 & rockstor#3034
@Hooverdan96
Member Author

Hooverdan96 commented Oct 17, 2025

Very true, I tried it too, but I didn't have a benchmark from the previous install. However, with the latest build I just did, it appears pretty darn fast, though the config popups are still slow.

I will convert this PR to draft and then close it, since I opened a fresh one just now.

Hooverdan96 marked this pull request as draft October 17, 2025 23:50
@Hooverdan96
Member Author

Closing this as I cleaned up the development into a new PR, #3040, as mentioned above.
