DevOps workflow for networking

We are pretty new to these new-age network orchestrators and automation tools, and I am curious what everyone in the community is doing. Sorry for such a long and broad question.

What is your workflow? What tools are your teams using? What is working and what is not? What do you really like and what do you need to improve? How mature do you think your process is? And so on.

Wanted to ask and see what approaches the many different teams here are
taking!

We are going to start working from a GitLab based workflow.

Projects are created, issues entered and developed with a gitflow branching
strategy.

GitLab CI pipelines run package loadings and run tests inside a lab.

Tests are usually Python unit tests that cover both functional checks and service creation, modification, and removal.

For unit testing we typically use Python libraries to open transactions that perform the service modifications (along with functional tests) against physical lab devices.
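The thread doesn't include the actual test code, so here is a minimal self-contained sketch of the shape such unit tests can take, using only the standard library; the service model, function names, and CLI syntax are invented for illustration, and the device transaction itself is left out:

```python
# Hypothetical config generator plus the unit tests that exercise the
# service-creation and service-removal paths (illustrative names only).

def build_l2_service(vlan_id: int, description: str) -> list[str]:
    """Render the CLI snippet that would create a simple L2 service."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN ID {vlan_id} out of range")
    return [f"vlan {vlan_id}", f" name {description}"]

def remove_l2_service(vlan_id: int) -> list[str]:
    """Render the CLI snippet that would tear the service back down."""
    return [f"no vlan {vlan_id}"]

def test_service_creation():
    assert build_l2_service(100, "CUST-A") == ["vlan 100", " name CUST-A"]

def test_service_removal():
    assert remove_l2_service(100) == ["no vlan 100"]

def test_invalid_vlan_rejected():
    try:
        build_l2_service(5000, "BAD")
        assert False, "expected ValueError"
    except ValueError:
        pass
```

In the workflow described above, tests like these would run in the GitLab CI pipeline, with a further layer pushing the rendered snippets to lab devices.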

For our prod deployment we leverage 'push on green' and gating to push
package changes to prod devices.

Thanks

We've been using this tool since we're a lean company, and it actually is a good way to assign and delegate tasks/projects so everyone can see what is going on. Managers can move cards to your active lane or ask why a task/project has stalled.

I'm not sure what exactly you are looking for but as a team management
tool, this has mostly worked for us for the last 3-4 years. YMMV.

https://kanbanize.com/

Kasper

I know that many are embarrassed to share their overly manual processes, and others are keeping their solutions private. It sounds like you have a solution for your needs. I would add some transparency to the process in the form of a dashboard or a status summary, so support staff, security, QA, etc., can see that a particular release was tested, approved, and deployed "days" prior to the customer having an issue vs. "minutes". Inter-organization communication and socialization for the win!

Maybe a poll on workflow would be fun:
1. Break Fix workflow - aka ASAP
2. Whale customer requests only
3. Budget constrained projects only
4. Everything is awesome we get to test all the things AND have single
button rollback
5. All of the above depending on the team and department.

:slight_smile:

Some observations below:

> We are pretty new to these new-age network orchestrators and automation tools,

There are definitely many options out there, some with a considerable amount of sophistication. Fortunately, it is possible to start simple and add layers of abstraction as knowledge and experience are gained.

> I am curious to ask what everyone in the community is doing? Sorry for such a long and broad question.

The brief version: we are working towards a management and orchestration solution that integrates SaltStack with NAPALM.

> What is your workflow? What tools are your teams using? What is working and what is not? What do you really like and what do you need to improve? How mature do you think your process is? And so on.

Things are getting started. I am able to automate the build of servers simply by knowing the MAC address and then PXE-booting the device. The operating system is installed and the device auto-reboots. It then gets its complete configuration applied, again automatically, from a Salt server.

Our operating environment uses Debian. By incorporating the automatic installation of Quagga/FRR, Open vSwitch, KVM/QEMU, and LXC on the appropriate devices, it is possible to build a homogeneous server/router/switch/virtualization solution, with certain devices picking up varying weights of those roles.

The people on this list who are running high-bandwidth networks may not see this as much of a benefit, but for smaller operators, I think there is value.

But then again, when something like NAPALM is incorporated into the mix, automation of the 'big iron' becomes part of the overall solution. I came across a Cloudflare slide deck which shows their perspective on management, implementation, and orchestration: https://ripe72.ripe.net/presentations/58-RIPE72-Network-Automation-with-Salt-and-NAPALM-Mircea-Ulinic-CloudFlare.pdf

And SaltStack has a proxy minion, which enables it to talk to CLI-based devices.
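For reference, a proxy minion is pointed at a device through pillar data; a minimal sketch along the lines of the Salt NAPALM proxy documentation (host and credentials are invented):

```yaml
# pillar for one managed device (illustrative values)
proxy:
  proxytype: napalm
  driver: ios          # vendor driver NAPALM should use for this box
  host: 192.0.2.10
  username: admin
  passwd: example-password
```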

> Wanted to ask and see what approaches the many different teams here are taking!
>
> We are going to start working from a GitLab-based workflow.

Salt uses generic ‘state’ files which are completed with device-specific settings from ‘pillar’ files. Both can be version controlled in git.
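As an illustration of that state/pillar pairing, a minimal (hypothetical) example; the file names and pillar key are invented:

```yaml
# states/motd.sls - generic state, reusable across devices
/etc/motd:
  file.managed:
    - contents_pillar: motd_text

# pillar/edge1.sls - device-specific data that completes the state
motd_text: "edge1 - managed by Salt; local changes will be overwritten"
```

Because both files are plain text, they version-control cleanly in git, which fits a GitLab workflow naturally.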

> Projects are created, issues entered and developed with a gitflow branching strategy.
>
> GitLab CI pipelines run package loadings and run tests inside a lab.

I'm not affiliated with SaltStack, just a happy user. Having said that, various dev/test/prod scenarios can be implemented, with orchestrated workflows and provisioning processes based upon the level of sophistication required.

> Tests are usually Python unit tests that cover both functional checks and service creation, modification, and removal.

Rather than re-inventing the wheel, take a look at SaltStack or Ansible and/or NAPALM. All are Python based and could probably get you to your target faster than using Python natively. When it is necessary to go native on a hairy integration problem, it is no problem to incorporate Python as needed.

> For unit testing we typically use Python libraries to open transactions that perform the service modifications (along with functional tests) against physical lab devices.

NAPALM may get you that next level of sophistication, where configs can be diff'd before roll-out.

> For our prod deployment we leverage 'push on green' and gating to push package changes to prod devices.

Which can be orchestrated.

Thanks

Raymond Burkholder
https://blog.raymond.burkholder.net

To be honest, most companies I've worked at have moved to Amazon, where the networking stack has APIs. I've also seen folks who use CI/CD pipelines to generate configuration files for devices that don't directly support automation.

Possibly a minor nit, but if the devices "don't directly support automation", how is the "D" part of "CI/CD" accomplished there? `integration -ne deployment`. Do you mean something like "there is no API or e.g. netconf interface, but they can generate config off-box, scp it, and `copy start run` to load"?

In a message written on Fri, Aug 11, 2017 at 08:51:25AM -0700, Hugo Slabbert wrote:

> Possibly a minor nit, but if the devices "don't directly support
> automation", how is the "D" part of "CI/CD" accomplished there?
> `integration -ne deployment`. Do you mean something like "there is no API
> or e.g. netconf interface, but they can generate config off-box, scp it,
> and `copy start run` to load"?

More or less. I've worked at places that do this sort of thing.

1) Download config from box.
2) Run script to determine changes necessary to config.
3) Load changes.
4) Download config again.
5) Re-run the script to determine changes necessary, verify there are none.
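Steps 2 and 5 of that loop are essentially a diff between the desired config and the downloaded one; a minimal stdlib sketch of the convergence check (config contents are illustrative):

```python
import difflib

def pending_changes(desired: str, running: str) -> list[str]:
    """Unified-diff lines between running and desired config; empty when converged."""
    return list(difflib.unified_diff(
        running.splitlines(), desired.splitlines(), lineterm=""
    ))

running = "interface e0\n description foo"
desired = "interface e0\n description bar"

assert pending_changes(desired, running)       # step 2: changes are needed
# ... apply changes and re-download (steps 3 and 4) ...
assert not pending_changes(desired, desired)   # step 5: nothing left, converged
```

The same check doubles as the verification step: if the second run produces a non-empty diff, the push did not fully take.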

For a lot of the devices with a Cisco-IOS like interface it's not even
hard. Generate a code snippet:

config terminal
interface e0
description bar
end
write mem

Then tftp the config to a server, have the script see e0 has description
bar.

Hey,

> For a lot of the devices with a Cisco-IOS like interface it's not even
> hard. Generate a code snippet:
>
> config terminal
> interface e0
> description bar
> end
> write mem
>
> Then tftp the config to a server, have the script see e0 has description
> bar.

To me there are two fundamentally different ways to do this
  1) consider world dynamic, incrementally change it
  2) consider world static, generate it from scratch

The first one is like managing servers with Puppet/Chef/Ansible: you ask it to run some set of commands when you decide you want to turn up a new service.
The second one is like using Docker: if you want to change something, you build a new full container and swap it into the network.

The benefit of the second one is that there is an absolute guarantee of the state of the device immediately after the change has been made. The first one assumes there is a known state in the system when the incremental change is pushed.

I am a great proponent of the second way of doing things. Mainly because:
   a) I find it trivial to generate a full config from a database, whereas figuring out how to go from A to B I find complicated (i.e. error prone)
   b) the 2nd mandates that only the system is managing the device, because if someone logs in and does something out-of-system, it will go away on the next change - I think this is a large advantage
   c) I do not need to try to prove the system state is currently correct by implementing more and more tests to figure out the state; instead I prove the system state by setting all of it

The downside of the 2nd method is that it requires a device which supports replacing the whole config; classic IOS(-XE) and SR-OS today do not. JunOS, IOS-XR, EOS (both Compass and Arista) and VRP do. SR-OS is making strides towards solving this. For IOS-XE I'm hoping, but not holding my breath.

The same way we've done it for years ; really hacky expect scripts. :slight_smile:

Awesome!

I gave a presentation on CI/CD for networking last year at the Interop conference; my demo was based on GitLab.

I use Behave for testing, but it is just a front end for Python code under the hood to actually validate that everything is doing what it's supposed to be doing.

I did a little bit of work to try and get Ansible to do checking and validation in a playbook, but since Ansible isn't really a programming language it felt like putting a square peg in a round hole. I would recommend an actual programming language or testing framework.

Likely the biggest challenge you'll encounter is a lack of features in
vendor VMs and the fact you can't change interface names. Generally, in
production, we don't have "eth1, eth2, eth3" as the cabled up interfaces,
so you end up needing to maintain two sets of configs (prod and test) or
something to modify production configs on the fly, both of which are crummy
options.

From a workflow perspective, you can treat configuration like code and run full test suites when pull requests are issued, and then use the test results as the basis for a change review meeting. Don't let humans talk about changes that we already know won't work.
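That "full test suite on pull request" flow maps directly onto a GitLab CI pipeline; a minimal sketch, where the stage names, image, and script paths are all invented for illustration:

```yaml
stages:
  - validate
  - test
  - deploy

validate_configs:          # fast syntax/semantic checks on every push
  stage: validate
  image: python:3
  script:
    - python scripts/check_configs.py configs/

lab_tests:                 # functional tests against the lab on merge requests
  stage: test
  script:
    - python -m pytest tests/
  only:
    - merge_requests

deploy_prod:               # 'push on green', gated behind a manual action
  stage: deploy
  script:
    - python scripts/push_configs.py --target prod
  when: manual
  only:
    - master
```

The `when: manual` gate on the deploy stage is one way to implement the review-then-push step before anything reaches production.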

Glad to hear about other people seriously considering CI/CD in the network
space, good luck!

-Pete

> We are pretty new to these new-age network orchestrators and automation tools,
>
> I am curious to ask what everyone in the community is doing? Sorry for such a long and broad question.
>
> What is your workflow? What tools are your teams using? What is working and what is not? What do you really like and what do you need to improve? How mature do you think your process is? And so on.

The wheels here move extremely slowly, so it's slowly, slowly catchy monkey for us. So far we have been using Ansible and GitLab CI, and the current plan is to slowly engulf the existing network, device by device, into the process/toolset.

> Wanted to ask and see what approaches the many different teams here are taking!
>
> We are going to start working from a GitLab-based workflow.
>
> Projects are created, issues entered and developed with a gitflow branching strategy.
>
> GitLab CI pipelines run package loadings and run tests inside a lab.

Yes, that is the "joy" of GitLab; see below for a more detailed breakdown, but we use Docker images to run CI processes, and we can branch and make merge requests which trigger the CI and CD processes. It's not very complicated and it just works. I didn't compare it with stuff like Bitbucket; I must admit I just looked at GitLab, saw that it worked, tried it, stuck with it, no problems so far.

> Tests are usually Python unit tests that cover both functional checks and service creation, modification, and removal.
>
> For unit testing we typically use Python libraries to open transactions that perform the service modifications (along with functional tests) against physical lab devices.

Again, see below: physical and virtual devices, plus some custom Python scripts for unit tests, like checking that IPv4/IPv6 addresses are valid (not 999.1.2.3 or AA:BB:HH::1), that AS numbers are valid integers of the right size, etc.
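Checks like those need no external libraries; a minimal sketch of such a validation script using the stdlib `ipaddress` module (function names are invented for the example):

```python
import ipaddress

def is_valid_ip(value: str) -> bool:
    """Accept any syntactically valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False

def is_valid_asn(value) -> bool:
    """Accept 32-bit AS numbers (RFC 6793): 0 .. 4294967295."""
    return isinstance(value, int) and 0 <= value <= 2**32 - 1

assert is_valid_ip("192.0.2.1")
assert not is_valid_ip("999.1.2.3")       # octet out of range
assert not is_valid_ip("AA:BB:HH::1")     # 'H' is not a hex digit
assert is_valid_asn(4200000000)
assert not is_valid_asn(2**32)            # one past the 32-bit maximum
```

In the CI pipeline described above, a script like this would walk the YAML files and fail the `validate` stage on the first bad value.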

> For our prod deployment we leverage 'push on green' and gating to push package changes to prod devices.
>
> Thanks

Yeah, that is pretty much my approach too. Device configs are in YAML files (actually multiple files). One git repo stores the constituent YAML files; when you update a file and push to the repo, the CI process starts, which runs syntax and semantic checks against the YAML files (some custom Python scripts, basically).

As Saku mentioned, we also follow the “replace entire device config” approach to guarantee the configuration state (or at least “try” when it comes to crazy old IOS). This means we have Jinja2 templates that render the YAML files into device-specific CLI config files. They live in a separate repo and, again, many constituent Jinja2 files make up one entire device template. So any push to this Jinja2 repo triggers a separate CI workflow which performs syntax and semantic checking of the Jinja2 templates (again, custom Python scripts).

When one pushes to the YAML repo to update a device config, the syntax and semantic checks are run against the YAML files; they are then “glued” together to make the entire device config in a single file, the Jinja2 repo is checked out, the combined YAML is used to feed the Jinja2 templates, the configs are built, and then the vendor-specific config needs to be syntax checked.
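The YAML-to-CLI rendering step described above can be sketched in a few lines; here a plain dict stands in for what `yaml.safe_load()` would return, so the example is self-contained, and the data and template text are invented for illustration:

```python
from jinja2 import Template

# What parsing one device's constituent YAML files might yield:
device = {
    "hostname": "edge1",
    "interfaces": [
        {"name": "ge-0/0/0", "description": "uplink"},
    ],
}

# A tiny stand-in for the per-vendor device template:
TEMPLATE = Template(
    "hostname {{ hostname }}\n"
    "{% for intf in interfaces %}"
    "interface {{ intf.name }}\n"
    " description {{ intf.description }}\n"
    "{% endfor %}"
)

config = TEMPLATE.render(**device)
print(config)
```

The rendered `config` string is what would then go through vendor-specific syntax checking and on to the CD stage.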

The CD part of the process (to a testing area) is still a WIP: for Junos we can push to a device and use “commit check”; for IOS and others we can’t. So right now I’m working on a mixture of pushing the config to virtual IOS devices and to physical kit in the lab, but this also causes problems in that interface / line card slot numbers/names will change, so we need to run a few regex statements against the config to jimmy it into a lab device (pretty ugly and, I hope, temporary).

When the CD to “testing” passes, the CD to “production” can be manually triggered. Another repo stores the running config of all devices (from the previous push). So we can push the candidate config to a live device (using Ansible with NAPALM [1]) and get a diff against the running config, perform the “config replace” action, then download the running config and put it back into the repo. This gives us a locally stored copy of device configs so we can see the diffs between pushes offline. It also provides a record that the process of going from YAML > Jinja2 > device produces the config we expected (although prior to this one will have had to make a branch and then a merge request, which is peer reviewed, to get the CD part to run and push to the device, so there shouldn’t be any surprises this late in the process!).

Is it foolproof? No. It is a young system still being designed and developed. Is it better than before? Hell yes.

Cheers,
James.

[1] Ansible and NAPALM here might seem like overkill, but we use Ansible for other stuff like x86 box management, so configuring a server or a router is abstracted through one single tool for the operator (i.e. playbooks are used irrespective of device type, rather than, say, playbooks for servers but Python scripts for firewalls). We also keep the YAML config files for the x86 boxes in GitLab with a CI/CD process, so again, one set of tools for all.
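For reference, the Ansible-with-NAPALM push-and-diff described above can be sketched as a single task using the `napalm_install_config` module from napalm-ansible; the file paths and variable names here are invented, not the poster's actual setup:

```yaml
# Illustrative playbook task: full config replace with diff capture
- name: Push rendered config with full replace and capture the diff
  napalm_install_config:
    hostname: "{{ inventory_hostname }}"
    username: "{{ napalm_user }}"
    password: "{{ napalm_pass }}"
    dev_os: "{{ napalm_os }}"
    config_file: "build/{{ inventory_hostname }}.conf"
    replace_config: true        # replace the whole config, not merge
    get_diffs: true
    diff_file: "diffs/{{ inventory_hostname }}.diff"
    commit_changes: true
```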

Related: I am working on https://github.com/lathama/Adynaton and hope to get parts into the Python Standard Library with help from some peers. Anyone who wants to help out, ping me off-list.