Operations task management software?

Hi all, curious if anyone has recommendations on software that helps manage routine duties assigned to operations staff?

For example, let’s say we have a P&P that says someone from the netops group must check that Rancid is successfully backing up all router configs bi-weekly. Ideally, the software would send an email reminder to this pre-defined group of people saying hey, it’s Monday, someone needs to check this and come acknowledge the task as having been completed. If that doesn’t occur, pre-defined manager X is notified on Tuesday. If manager X doesn’t get someone to complete the task, director Y is notified, and so on. Then, perhaps periodically it emails manager X anyway and says hey, it’s been three months, you need to audit netops to ensure they’re actually doing the Rancid check and not just marking it as done. The same could be applied to the staff who check on backup failures, backup internet circuit status, out of band interfaces, etc.
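
The escalation chain described above could be sketched roughly like this. It is only an illustration: the function name and recipient labels are placeholders, and the rule that director Y is notified from the second overdue day onward is an assumption, since the timing of that step is not specified.

```shell
#!/bin/sh
# Hypothetical sketch: decide who gets nagged about an unacknowledged task.
# $1 = number of days since the task came due (0 = the due day itself).
# Recipient labels are placeholders for real mail aliases.
escalation_target() {
    days_overdue=$1
    if [ "$days_overdue" -le 0 ]; then
        echo "netops-group"   # due day (Monday in the example): remind the whole group
    elif [ "$days_overdue" -eq 1 ]; then
        echo "manager-x"      # one day late (Tuesday): notify the manager
    else
        echo "director-y"     # assumed: two or more days late goes to the director
    fi
}
```

A daily cron job could call `escalation_target` for each task that has not been acknowledged and mail whoever it names.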

A data center I looked at recently had QR code stickers on all of their infrastructure stuff and there were staff assigned to check and log certain displayed values each day. The software would at least ensure they actually visited the equipment by requiring they scan the relevant QR code when in front of it. So I figure something that does what I’m looking for properly already exists.

Thanks,

David

Been meaning to dig into this one https://www.upguard.com/blog/guardrail-tasks-a-lightweight-tracking-system-for-ops

--srs

> Hi all, curious if anyone has recommendations on software that helps manage
> routine duties assigned to operations staff?

Have computers do the routine scut work - not people.

> For example, let’s say we have a P&P that says someone from the netops group
> must check that Rancid is successfully backing up all router configs
> bi-weekly.

You've got the source code for rancid, so change rancid-run to do something like
  LOGFILE=$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S`; export LOGFILE
change the
  ) >$LOGDIR/$GROUP.`date +%Y%m%d.%H%M%S` 2>&1
to
  ) >$LOGFILE 2>&1

and then in control_rancid do something like
  # count each distinct login failure recorded in this run's log
  grep "clogin error:" "$LOGFILE" | sort | uniq -c >"$TMP.fail"
  if [ -s "$TMP.fail" ]; then
     # got some output, mail the report
     ...
  fi
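
As a standalone illustration of the same idea, here is a minimal sketch that wraps the grep/sort/uniq step in a function. The function name is made up, the mail address in the usage comment is a placeholder, and it assumes only that failed logins show up in the log as lines containing "clogin error:", as in the snippet above.

```shell
#!/bin/sh
# Summarize "clogin error:" lines from one rancid log file.
# Prints a count per distinct error line.
# Returns 0 if any errors were found, 1 if there were none.
summarize_clogin_errors() {
    logfile=$1
    summary=$(grep "clogin error:" "$logfile" | sort | uniq -c)
    if [ -n "$summary" ]; then
        printf '%s\n' "$summary"
        return 0
    fi
    return 1
}

# Possible use (address is a placeholder):
# if report=$(summarize_clogin_errors "$LOGFILE"); then
#     printf '%s\n' "$report" | mail -s "rancid login failures" netops@example.com
# fi
```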

Do the same type of thing for checking on

> backup failures, backup internet circuit status, out of band interfaces, etc.

Automate the checks, put the scripts in crontab & mail out an
"OhNoes!" or "all clear" msg at the end. At which point you're left
with the problem of making sure the managers are looking at the emails
& making sure whatever problems are found actually get fixed :)
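
A rough sketch of such a wrapper, assuming each check is a command or script that exits non-zero on failure. The function name, the script path, and the mail address in the crontab comment are all placeholders.

```shell
#!/bin/sh
# Run every check command passed as an argument and collect the ones that fail.
# Prints "all clear" if everything passed, otherwise an "OhNoes!" line naming
# the failed checks. Returns 1 if any check failed.
run_checks() {
    failed=""
    for check in "$@"; do
        if ! "$check" >/dev/null 2>&1; then
            failed="$failed $check"
        fi
    done
    if [ -n "$failed" ]; then
        echo "OhNoes! failed checks:$failed"
        return 1
    fi
    echo "all clear"
}

# Possible crontab entry (path and address are placeholders), daily at 06:00:
#   0 6 * * * /usr/local/bin/ops-checks.sh | mail -s "ops checks" netops@example.com
```

Mailing the "all clear" as well as the failures means a missing email is itself a signal that the job did not run.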

Regards,
Lee

Full automation is planned but does not eliminate the need for the software. Zero human auditing of fully automated processes and data collection is not acceptable to various certifying entities, the relevant auditors, or the inevitably involved lawyers, and automation alone won’t pick up on bad data, like a bad thermometer or SNMP counter that says a CRAC is 65 degrees when it’s really 90. So I’m still going to need a management solution to the issue, whether it’s to tell someone to do the work or to tell someone to check the automated work.

David

You have a ticketing system - right? Create a cron job that creates a
ticket to check whatever.
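
One wrinkle with a bi-weekly task is that cron has no "every second week" field. A common workaround, sketched here under assumptions, is to run the job weekly and skip alternate weeks. The function name and script path are placeholders, the ticket-creation step is deliberately left open, and `date +%s` is widely supported but worth checking on your platform.

```shell
#!/bin/sh
# is_due_week EPOCH_SECONDS
# Succeeds on even-numbered weeks counted from the Unix epoch, fails on odd ones.
# 604800 is the number of seconds in a week.
is_due_week() {
    [ $(( $1 / 604800 % 2 )) -eq 0 ]
}

# Possible crontab entry (path is a placeholder), every Monday at 08:00:
#   0 8 * * 1 /usr/local/bin/rancid-audit-ticket.sh
#
# Inside that script:
# if is_due_week "$(date +%s)"; then
#     : # create the ticket here, by whatever means your ticketing system offers
# fi
```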

Regards,
Lee

Jira works well as a task tracking system for ops. Customizable workflows,
decent integration with LDAP, etc. Also good for tracking software
projects. Having both software and ops tasks in one place has many benefits.

We use Redmine, combined with scripts that call its API to create automated tickets/tasks that NOC or engineers need to attend to.
It has email notifications, wiki, documents, files, code repo, calendar, and customisable fields all built in.
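
For anyone curious what such a script might look like, here is a minimal sketch, not the poster's actual setup. It builds the JSON body for Redmine's REST endpoint for creating issues (`POST /issues.json`). The function name, host name, project id, and `REDMINE_KEY` variable are placeholders, and the escaping handles only backslashes and double quotes in the subject.

```shell
#!/bin/sh
# redmine_issue_payload PROJECT_ID SUBJECT
# Prints a JSON body for creating a Redmine issue.
# PROJECT_ID is assumed to be numeric; only \ and " in SUBJECT are escaped.
redmine_issue_payload() {
    project_id=$1
    subject=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"issue":{"project_id":%s,"subject":"%s"}}\n' "$project_id" "$subject"
}

# Possible use (host, project id, and key variable are placeholders):
# curl -s -H "Content-Type: application/json" \
#      -H "X-Redmine-API-Key: $REDMINE_KEY" \
#      -d "$(redmine_issue_payload 7 'Check rancid backups')" \
#      https://redmine.example.com/issues.json
```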

Hey,

> Hi all, curious if anyone has recommendations on software that helps manage routine duties assigned to operations staff?

I'd solicit opinions as well. There are a few features I'd like to see:

1) ability to create parent+child tickets: if all children are closed, the
parent closes; if the parent is closed, the children close

2) ability to create dependencies, perhaps I have some design change I
want to make, but it can't be done until a large bunch of operational
work is done, I could create tickets for ops, and then create a ticket
for myself, and make it depend on the ops tickets being solved. It
wouldn't be seen in my work queue, until all solve-dependencies are
solved.

3) user (non-admin) access to API, if the UI is bad, like it probably
is for my very small subset of things I need, I could create my own CLI
UI addressing solely the use cases that are relevant to me, in a
streamlined, low-time-cost UI to me. In the dream scenario the shipping webUI
is dog-fooding the documented API, so anything I can do there, I can do

There are probably others, but those are the main things I think I need.
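
To make the dependency rule in point 2 concrete, here is a toy sketch over a flat file. The file format and function name are invented for illustration and do not correspond to any real ticketing system. A ticket shows up in the work queue only if it is open and everything it depends on is closed.

```shell
#!/bin/sh
# workable_tickets FILE
# FILE uses a made-up format, one ticket per line:
#   <id> <open|closed> <comma-separated ids it depends on, or "-" for none>
# Prints the ids of open tickets whose dependencies are all closed.
workable_tickets() {
    awk '
        { status[$1] = $2; deps[$1] = $3; order[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++) {
                id = order[i]
                if (status[id] != "open") continue
                blocked = 0
                if (deps[id] != "-") {
                    n = split(deps[id], d, ",")
                    for (j = 1; j <= n; j++)
                        if (status[d[j]] != "closed") blocked = 1
                }
                if (!blocked) print id
            }
        }
    ' "$1"
}
```

With two ops tickets and a design ticket that depends on both, the design ticket stays hidden until both ops tickets are closed.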