Request for a pointer - Linux modifying DSCP on replies?

Darren_Bolding · August 17, 2009, 9:19pm

I believe this is operational content, but may well be better asked
somewhere else. I would love to have a pointer to another list/website.
I am looking to do some policy routing based on DSCP marking, and I have
this all working inside the networking equipment. I DSCP mark some packets
at ingress and I policy-route others based on ACLs matching those DSCP
markings. This should allow me to solve some problems in a rather elegant
manner, if I do say so myself.

And this works fine for some things: I have verified that pings to a host
work as expected. The ping shows up at the destination host DSCP-marked, and
the ICMP reply leaves with the same DSCP marking.

However, when I do this with apache and mysql connections (TCP 80/3306), the
incoming packets are marked, but the replies are not.
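
One way to confirm what the server actually puts on the wire is to capture a
reply and read the DSCP field, which is the upper six bits of the second byte
of the IPv4 header. A minimal Python sketch, assuming the raw header bytes are
already available from a capture; the function name and the sample header are
illustrative only:

```python
def dscp_from_ipv4_header(header: bytes) -> int:
    """Return the 6-bit DSCP value from a raw IPv4 header."""
    if len(header) < 20:
        raise ValueError("IPv4 header must be at least 20 bytes")
    if header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    # Byte 1 is the former ToS byte: DSCP in the top 6 bits, ECN in the low 2.
    return header[1] >> 2


# Hypothetical 20-byte header: version 4, IHL 5, ToS byte 0x68.
sample = bytes([0x45, 0x68]) + bytes(18)
print(dscp_from_ipv4_header(sample))
```

A reply that leaves unmarked would show a value of 0 here.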

My research into the subject doesn't seem to suggest there is a standard for
whether replies to a TCP connection are required to have the same DSCP
marking, but it seems to make a lot of sense that they would.

I've disabled iptables on the server host to no avail. I've looked around
for an apache or Linux kernel setting and found nothing.

At this point I'm looking for pointers, either to a way to solve this issue
or to a better place to ask.

I've started investigating writing iptables rules to match incoming
connections that have DSCP marking and explicitly mark response traffic, but
that seems, I don't know... wrong.

Linux kernel we are using is 2.6.9-67.ELsmp.

Any help or pointers would be appreciated!

--D

Steve_Miller1 · August 17, 2009, 9:44pm

Would not the end station be considered to be outside of the DS
domain? It does not necessarily make sense (to me) for reply packets
to be marked unless they are appropriately classified and marked on the
return path at the point they re-enter the DS domain.

I would imagine that iptables and the DSCP target would do what you
wanted, yes. I'd consider classifying and marking traffic at whatever
switch you would consider to be at the edge of the DS domain
(connected to this server.)

-Steve

Dan_White · August 17, 2009, 10:12pm

See the linux-net mailing list:

http://www.kernel.org/pub/linux/docs/lkml/

Darren_Bolding · August 17, 2009, 11:08pm

Steve,
Perhaps it is outside the DS domain, and that is the issue. It seems odd,
however, that the behavior with ICMP/ping is different from that with TCP.
I'm not sure which is technically correct, but I am going to follow up on
some of the pointers I've gotten to try and learn that. It just seems natural
to me that connection-oriented traffic would have the same markings on both
sides of the conversation unless explicitly told otherwise.

I would love to be able to mark the traffic at the edge of the DS domain; I
do this at ingress from one location. The challenge I am trying to solve is
that the DS edge switch will not reliably know how to policy-route traffic
unless it has been previously marked.

To clarify, as in many other environments, we have stateful devices such as
firewalls and load balancers. I want to be able to route traffic
that ingressed through one of these devices to egress through it as well.
This is entirely solvable by splitting equipment functionally (a cluster of
servers and associated network equipment, real or virtual associated with a
service) or by employing SNAT solutions. However, for various reasons these
solutions are not preferred in our environment, and I dare say I am not
alone in that viewpoint.

What I am trying to deploy now is a system where the stateful equipment (in
this case a load balancer) has its traffic to the rest of the network tagged
on ingress. Since I am using Cisco 6500s with Sup720s, I can classify and
mark the traffic with a DSCP setting via PFC/DFC hardware. I then inspect
traffic at the layer-3 edge for the various pools of servers. Depending on
the DSCP marking of the packet, I change the next-hop. Since this is
implemented through an extended ACL for a route-map it is handled in
hardware (a good thing). Research shows that I can implement similar
functionality in hardware on L3 switching gear from Juniper, Foundry, etc.
so I am not boxing myself into a vendor.

I don't believe Cisco supports using reflexive ACLs to apply policy
routing, and even if they did, that would likely swamp our Sups' CPUs, so I
don't believe maintaining a stateful filter on the switch is viable.

This all works as expected for pings and the ICMP replies. It breaks down
for TCP http/mysql connections.

It sounds like the correct (per-spec) solution may be to have the Linux
servers track each incoming connection's DSCP setting and mark the outgoing
packets related to that connection. I am not at all certain this will not
hit the servers' CPUs more than desired, or require more
connection-tracking resources than the ones we currently use via
iptables.

Is there some other design option I am not considering?

Thanks to those of you who have replied so far, it is at least a start down
some additional paths of research for me! I haven't been involved in system
networking internals since the days of BSDI, so I have been at a loss as to
whom to even ask!

--D

James_Hess · August 18, 2009, 12:33am

> the ICMP reply leaves with the same DSCP marking.

ICMPs may have special treatment. This is the kernel replying, not a
user application.

> However, when I do this with apache and mysql connections (TCP 80/3306), the
> incoming packets are marked, but the replies are not.

I haven't known Linux to automatically apply DSCP markings.
I believe this behavior may be by design. Not everyone is likely to
want response traffic to have the same markings for all TCP protocols.

HTTP requests are often small request, big response. People might
sometimes want low delay for the request but higher throughput for
HTTP responses (though higher delay compared to other applications
sharing that bandwidth).

If an application developer wants a Linux computer to apply DSCP or ToS bits,
the application needs to set them itself by calling setsockopt() on the
socket descriptor with the IP_TOS option (and, optionally, SO_PRIORITY).
Some values may require the app to be running as superuser (CAP_NET_ADMIN).

Or you may also be able to set the bits using iptables and the mangle table.

e.g.
# iptables -t mangle -I OUTPUT -p tcp --sport 80 -j DSCP --set-dscp 0x1a

You may also be able to use the CONNMARK iptables target to mark a connection,
and then use the mangle table to set the DSCP field of OUTPUT packets
that match the connection mark.
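
For the setsockopt() route, a minimal Python sketch of an application marking
its own socket. It assumes a Linux host, where the option is IP_TOS at the
IPPROTO_IP level, and it reuses the 0x1a value from the iptables example above
purely for illustration. IP_TOS takes the whole ToS byte, so the DSCP value is
shifted left by two bits:

```python
import socket

DSCP = 0x1A  # illustrative value, same as the iptables example

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# DSCP occupies the upper six bits of the ToS byte; the low two bits are ECN.
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP << 2)
# Read the option back to confirm the kernel accepted it.
tos = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
print(hex(tos), "-> DSCP", tos >> 2)
sock.close()
```

This only helps when you control the application's source or configuration;
for stock apache or mysql the iptables approach is likely the more practical
one.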

Darren_Bolding · September 4, 2009, 1:27am

I wanted to go ahead and reply back with what I figured out.
The easiest way to solve this problem turned out to be to use netfilter's
CONNMARK module, which sadly is not available in some older but still used
Linux kernels.

Syntax for this is as follows (the rules belong in the mangle table):
# Set outgoing DSCP on connections to same as incoming DSCP
-A PREROUTING -m dscp --dscp 1 -j CONNMARK --set-mark 1
-A POSTROUTING -m connmark --mark 1 -j DSCP --set-dscp 1
-A PREROUTING -m dscp --dscp 2 -j CONNMARK --set-mark 2
-A POSTROUTING -m connmark --mark 2 -j DSCP --set-dscp 2

And this goes on for all 63 possible non-zero markings.
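
Rather than typing all 126 lines by hand, the pairs can be generated. A
minimal Python sketch, assuming the same rule format as above and that the
connection mark is simply set equal to the DSCP value; the function name is
hypothetical:

```python
def connmark_dscp_rules(values=range(1, 64)):
    """Build rule lines that copy each inbound DSCP value onto replies."""
    lines = []
    for dscp in values:
        if not 1 <= dscp <= 63:
            raise ValueError(f"DSCP out of range: {dscp}")
        # Remember the inbound DSCP value as the connection mark.
        lines.append(
            f"-A PREROUTING -m dscp --dscp {dscp} -j CONNMARK --set-mark {dscp}"
        )
        # Stamp outgoing packets of that connection with the same DSCP.
        lines.append(
            f"-A POSTROUTING -m connmark --mark {dscp} -j DSCP --set-dscp {dscp}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(connmark_dscp_rules()))
```

The output can be pasted into the mangle section of the saved ruleset. Passing
a shorter list of values limits the rules to the markings actually in use.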

This seems to have had negligible performance or memory impact on some very
busy hosts, so it seems like a viable solution.

--D