DPDK and energy efficiency

Hello folks,

I’ve just followed a thread regarding use of CGNAT and noted a suggestion (regarding DANOS) that includes use of DPDK.

As I’m interested in the breadth of adoption of DPDK, and as I’m a researcher into energy and power efficiency, I’d love to hear your feedback on your use of power consumption control by DPDK.

I’ve drawn up a bare-bones, 2-question survey at this link:

https://www.surveymonkey.com/r/J886DPY

Responses have been set to anonymous.

Cheers,

Etienne

I’m very happy to see interest in DPDK and power consumption.

But IMHO, the questions do not cover the actual reality of DPDK.
That characteristic of “100% CPU” depends on several aspects, such as:

  • How old the hardware running DPDK is.
  • What type of DPDK operations are performed (very dynamic, like stateful CGNAT, or static ACLs?).
  • Whether the DPDK input/drop/forwarding measurements are used.
  • Whether CPU affinity is set according to traffic demand.
  • Whether SR-IOV (resource sharing) is used with DPDK.

The way I see it, the questions lead the public to conclude that DPDK ALWAYS has 100% CPU usage, which is not true.

The way I see it, the questions lead the public to conclude that DPDK ALWAYS has 100% CPU usage, which is not true.

I don’t concur.

Every research paper I’ve read indicates that, regardless of whether it has packets to process or not, DPDK PMDs (poll-mode drivers) prevent the CPU from falling into an LPI (low-power idle).

When it has no packets to process, the PMD runs the processor in a polling loop that keeps utilization of the running core at 100%.
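For illustration only, here is a minimal sketch (not taken from any particular application; the port/queue numbers, the force_quit flag and the processing placeholder are all hypothetical) of the kind of receive loop a DPDK worker core typically runs. Even with zero packets arriving, the loop spins, so the core never idles and shows 100% utilization:

    #include <stdbool.h>
    #include <stdint.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    static volatile bool force_quit;   /* set from a signal handler, as in the DPDK samples */

    /* Hypothetical worker loop for port 0, queue 0; EAL and port setup done elsewhere. */
    static int
    lcore_rx_loop(void *arg)
    {
        struct rte_mbuf *bufs[BURST_SIZE];
        (void)arg;

        while (!force_quit) {
            /* Poll the queue; this returns immediately even if nothing has arrived. */
            const uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BURST_SIZE);

            /* With zero traffic nb_rx is 0 and we simply spin around again, so the
             * core never enters a low-power idle state. */
            for (uint16_t i = 0; i < nb_rx; i++) {
                /* application-specific processing would happen here */
                rte_pktmbuf_free(bufs[i]);
            }
        }
        return 0;
    }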

Cheers,

Etienne

It consumes 100% only if you busy poll (which is the default approach).
One can switch between polling and interrupts (or monitor, if supported),
or introduce halt instructions, in case of low/medium traffic volume.
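To make that trade-off concrete, here is a hedged sketch in the spirit of DPDK's l3fwd-power sample: busy poll while traffic flows, and fall back to RX interrupts after a stretch of empty polls. The threshold, timeout and port/queue parameters are illustrative, and it assumes the port was configured with intr_conf.rxq = 1 and the queue's interrupt was registered on the per-thread epoll fd via rte_eth_dev_rx_intr_ctl_q() during setup:

    #include <stdint.h>
    #include <rte_ethdev.h>
    #include <rte_interrupts.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE     32
    #define IDLE_THRESHOLD 4096   /* empty polls before going to sleep (tunable) */

    static void
    rx_loop_with_interrupt_fallback(uint16_t port, uint16_t queue)
    {
        struct rte_mbuf *bufs[BURST_SIZE];
        struct rte_epoll_event ev;
        uint32_t idle_polls = 0;

        for (;;) {
            const uint16_t nb_rx = rte_eth_rx_burst(port, queue, bufs, BURST_SIZE);

            if (nb_rx == 0) {
                if (++idle_polls < IDLE_THRESHOLD)
                    continue;                        /* keep busy polling for now */

                /* Quiet queue: arm the RX interrupt and block until the NIC signals
                 * new packets (or the timeout expires), letting the core drop into
                 * a low-power idle state in the meantime. */
                rte_eth_dev_rx_intr_enable(port, queue);
                rte_epoll_wait(RTE_EPOLL_PER_THREAD, &ev, 1, 10 /* ms */);
                rte_eth_dev_rx_intr_disable(port, queue);
                idle_polls = 0;
                continue;
            }

            idle_polls = 0;
            for (uint16_t i = 0; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);           /* real processing would go here */
        }
    }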

Strictly speaking, DPDK and SR-IOV are orthogonal: DPDK is intended to facilitate cloud-native operation through hardware independence, while SR-IOV presumes SR-IOV-compliant hardware.

Here are a few references.

[1] Z. Xu, F. Liu, T. Wang, and H. Xu, “Demystifying the energy efficiency of Network Function Virtualization,”
in 2016 IEEE/ACM 24th International Symposium on Quality of Service (IWQoS), Jun. 2016, pp. 1–10.
DOI: 10.1109/IWQoS.2016.7590429.

[2] S. Fu, J. Liu, and W. Zhu, “Multimedia Content Delivery with Network Function Virtualization: The Energy Perspective,”
IEEE MultiMedia, vol. 24, no. 3, pp. 38–47, 2017, ISSN: 1941-0166.
DOI: 10.1109/MMUL.2017.3051514.

[3] X. Li, W. Cheng, T. Zhang, F. Ren, and B. Yang, “Towards Power Efficient High Performance Packet I/O,”
IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 4, pp. 981–996, April 2020,
ISSN: 1558-2183. DOI: 10.1109/TPDS.2019.2957746.

[4] G. Li, D. Zhang, Y. Li, and K. Li, “Toward energy efficiency optimization of pktgen-DPDK for green network testbeds,”
China Communications, vol. 15, no. 11, pp. 199–207, November 2018,
ISSN: 1673-5447. DOI: 10.1109/CC.2018.8543100.

It consumes 100% only if you busy poll (which is the default approach).

Precisely.

It is, after all, Intel’s response to the problem that general-purpose scheduling on its processors prevents them from being viable under high networking loads.

Cheers,

Etienne

It totally makes sense to busy poll under high networking load.
By high networking load I mean roughly > 7 Mpps RX+TX per one x86 CPU core.

I partially agree it may be hard to mix DPDK and non-DPDK workloads
on a single CPU, not only because of the advanced power management logic
required in the dataplane application, but also due to LLC thrashing.
It heavily depends on the use case and dataset sizes; for example, an
optimised FIB may fit nicely into cache and use only a tiny, hot part
of the dataset, but a CGNAT mapping table with millions of flows likely
won't fit. For such a use case I would recommend a dedicated CPU or
cache partitioning (CAT), if available.

In the case of low-volume traffic, like 20-40G of IMIX, one can dedicate
e.g. 2 cores and interleave busy polling with halt instructions to
lower the usage significantly (~60-80% core underutilisation).
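As a rough illustration of that interleaving (the constants are made up, not tuned values), one can grow a pause window while the queue stays quiet and collapse it as soon as packets return, trading a little latency for a large drop in effective core usage:

    #include <stdint.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_pause.h>

    #define BURST_SIZE  32
    #define MAX_BACKOFF 1024   /* upper bound on pause iterations per empty poll */

    static void
    rx_loop_with_pause_backoff(uint16_t port, uint16_t queue)
    {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint32_t backoff = 0;

        for (;;) {
            const uint16_t nb_rx = rte_eth_rx_burst(port, queue, bufs, BURST_SIZE);

            if (nb_rx == 0) {
                /* Double the pause window while the queue stays quiet. */
                if (backoff < MAX_BACKOFF)
                    backoff = backoff ? backoff * 2 : 1;
                for (uint32_t i = 0; i < backoff; i++)
                    rte_pause();                     /* PAUSE hint; a real halt or UMWAIT
                                                      * would save more, where supported */
                continue;
            }

            backoff = 0;                             /* traffic is back: full-rate polling */
            for (uint16_t i = 0; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);           /* real processing would go here */
        }
    }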

I forgot to point out that on Friday the 26th I’ll share the collected results, either via a link or as a series of screenshots.

Cheers,

Etienne

DANOS lets you specify how many dataplane cores you use versus control plane cores. So if you put in a 16-core host to handle 2 Gb/s of traffic, you can adjust the number of dataplane worker cores as needed. Control plane cores don’t sit at 100% utilization.

I use that technique, plus DANOS runs on VMware (not oversubscribed), which allows me to use the hardware for other VMs. NICs are attached to the VM via PCI passthrough, which helps eliminate the overhead of the VMware hypervisor itself.

I have an 8-core VM with 4 cores set to dataplane and 4 to control plane. The 4 control plane cores are typically idle, only processing BGP route updates, SNMP, logs, etc.

~Jared

Beyond RX/TX CPU affinity, in DANOS you can further tune power consumption by changing the adaptive polling rate. It doesn’t, per the survey, “keep utilization at 100% regardless of packet activity.” Adaptive polling changes in DPDK optimize for tradeoffs between power consumption, latency/jitter and drops during throughput ramp-up periods. Ideally your DPDK implementation has an algorithm that tries to automatically optimize based on current traffic patterns.

In DANOS refer to the “system default dataplane power-profile” config command tree for adaptive polling settings. Interface RX/TX affinity is configured on a per interface basis under the “interfaces dataplane” config command tree.

-robert

Thanks Jared; that’s very interesting.

Earlier today, I had a private exchange of emails regarding the progressive development of architectures specific to the domain of high-speed networking functions. Your note reinforces the notion that this “hard” partitioning of cores is a key part of the DSA (domain-specific architecture) here.

“set system default dataplane cpu-affinity 3-7” is what I have set for my use case. Technically it’s 5 cores out of 8 total, but 4 are polling cores and 1 manages those 4. Then the control plane is 3 cores plus the leftover cycles of the 1 manager core.

Beyond RX/TX CPU affinity, in DANOS you can further tune power consumption by changing the adaptive polling rate. It doesn’t, per the survey, “keep utilization at 100% regardless of packet activity.”

Robert, you seem to be conflating DPDK
with DANOS’ power control algorithms that modulate DPDK’s default behaviour.

Let me know what you think; otherwise, I’m pretty confident that DPDK does:

"keep utilization at 100% regardless of packet activity.”

Keep in mind that this is a bare-bones survey intended for busy, knowledgeable people (the ones you’d find on NANOG),
not a detailed breakdown of the modes of operation of DPDK or DANOS.
DPDK has been designed for fast I/O that’s unencumbered by the trappings of general-purpose OSes,
and that’s the impression that needs to be at the forefront.
Power control, as well as any other dimensions of modulation,
are detailed modes of operation that are well beyond the scope of a bare-bones 2-question survey
intended to get an impression of how widespread DPDK’s core operating inefficiency is.

Cheers,

Etienne

Sorry, last line should have been:
“intended to get an impression of how widespread knowledge of DPDK’s core operating inefficiency is”,
not:
“intended to get an impression of how widespread DPDK’s core operating inefficiency is”

No, it is not the PMD that runs the processor in a polling loop.
It is the application itself, which may or may not busy loop,
depending on the application programmer’s choice.

No, it is not the PMD that runs the processor in a polling loop.
It is the application itself, which may or may not busy loop,
depending on the application programmer’s choice.

From one of my earlier references [2]:

“we found that a poll mode driver (PMD)
thread accounted for approximately 99.7 percent
CPU occupancy (a full core utilization).”

And further on:

“we found that the thread kept spinning on the following code block:

    for (;;) {
        for (i = 0; i < poll_cnt; i++) {
            dp_netdev_process_rxq_port(pmd, poll_list[i].port, poll_list[i].rx);
        }
    }

This indicates that the thread was continuously
monitoring and executing the receiving data path.”

[2] S. Fu, J. Liu, and W. Zhu, “Multimedia Content Delivery with Network Function Virtualization: The Energy Perspective,”
IEEE MultiMedia, vol. 24, no. 3, pp. 38–47, 2017, ISSN: 1941-0166.
DOI: 10.1109/MMUL.2017.3051514.

"we found that a poll mode driver (PMD)
thread accounted for approximately 99.7 percent
CPU occupancy (a full core utilization)."

Interrupt-driven network drivers generally can't compete with polled-mode drivers at higher throughputs on generic CPU / PCI card systems. On this style of config, you optimise your driver parameters based on what works best under the specific conditions.

Polled mode drivers have been around for a while, e.g.

[base] Revision 87902

Nick

For use cases where DPDK matters, are you really concerned with power consumption?

Probably yeah. Have you assessed the lifetime cost of running a multicore CPU at 100% vs at 10%, particularly as you're likely to have multiples of these devices in operation?

Nick

Probably yeah. Have you assessed the lifetime cost of running a
multicore CPU at 100% vs at 10%, particularly as you’re likely to have
multiples of these devices in operation?

Spot on.