RE: Juniper hardware recommendation

If accurate interface stats are important to you, note that MXs don’t report accurate SNMP interface utilization, i.e. they don’t comply with RFC 2665/3635. That seems like a fairly basic thing to do, but they decided not to, and it has been impactful to me in the past. Any SNMP monitoring of an interface will always show less utilization than is actually occurring, possibly leading to a false sense of security or a delay in augmentation. It would also affect usage-based billing, if you do that.

https://www.juniper.net/documentation/us/en/software/junos/network-mgmt/topics/topic-map/snmp-mibs-and-traps-supported-by-junos-os.html

For M Series, T Series, and MX Series, the SNMP counters do not count the Ethernet header and frame check sequence (FCS). Therefore, the Ethernet header bytes and the FCS bytes are not included in the following four tables:

ifInOctets

ifOutOctets

ifHCInOctets

ifHCOutOctets

Thanks,

Michael Fiumano

At least it isn’t Arista, where SVI egress counters are disabled by default, and once enabled count everything UNLESS the packet egresses via a LAG! Talk about being “impactful”, we’re having to buy new routers to insert behind them, just to count packets so we can bill accurately, and for that matter, have traffic graphs that work at all. :frowning:

Adam Thompson
Consultant, Infrastructure Services
100 - 135 Innovation Drive
Winnipeg, MB, R3T 6A8
(204) 977-6824 or 1-800-430-6404 (MB only)
athompson@merlin.mb.ca
www.merlin.mb.ca

Hey Michael,

Juniper has worked like this since day one and, shockingly, the world
doesn't care; people really don't care about accuracy. CLI and SNMP are
both L3. If you want to report L2, use 'set chassis fpc N pic N
account-layer2-overhead'.

However, who decided that L2 is right? To me only L1 is right, I don't
care about L2 at all. So any system I'd use, I'd normalise the data to
L1.

Ethernet on minimum size packets
L1 - 100%
L2 - 76%
L3 - 24%
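
The figures above can be reproduced with a quick calculation. This is only a sketch under stated assumptions: a minimum-size untagged frame of 64 bytes (header, padded payload and FCS), 20 bytes of per-frame L1 overhead (7-byte preamble, 1-byte SFD, 12-byte inter-frame gap), and an L3 counter that sees a bare 20-byte IPv4 header with the Ethernet padding not counted. The names are illustrative, not from any vendor tool.

```python
# Per-frame byte counts for a minimum-size Ethernet frame (assumptions, see lead-in).
L1_OVERHEAD = 7 + 1 + 12   # preamble + SFD + inter-frame gap
L2_FRAME = 64              # minimum frame incl. 14-byte header and 4-byte FCS
L3_PACKET = 20             # bare IPv4 header; Ethernet padding not counted


def reported_share(counted_bytes, l2_frame=L2_FRAME, l1_overhead=L1_OVERHEAD):
    """Fraction of the L1 line rate a counter reports when the link is full."""
    return counted_bytes / (l2_frame + l1_overhead)


for layer, counted in (
    ("L1", L2_FRAME + L1_OVERHEAD),
    ("L2", L2_FRAME),
    ("L3", L3_PACKET),
):
    print(f"{layer} - {reported_share(counted):.0%}")
```

With larger packets the gap shrinks, since the fixed per-frame overhead becomes a smaller share of what is on the wire.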

Not sure why 76 is better than 24. Both are wrong and will cause
operational confusion because people think the link is not congested.
This is extremely poorly understood even by professionals, so poorly
that people regularly think you can't get 100% utilisation, because
you can't unless you normalise stats to L1 rate.

Because end users will demand compensation and lawyer time for only getting 195Mbps on their 200Mbps service. 195Mbps is not 200Mbps.

I've seen operators over-provision services simply to quiet-down the noise, i.e., they'll provision 210Mbps for a 200Mbps service. We don't do this, but I encourage all of my competitors to do so.

The example I always give is that if there were no seats on an aircraft, it'd carry significantly more people than otherwise advertised.

We try hard to educate customers about how the higher layers eat away at the lower ones re: capacity, and that's just how the system works. There probably isn't a single man-made technology that offers 100% efficiency. So I'm not about to go out of business giving you the optical illusion that my corner of earth will make it so. In the end, it's easier to just let those customers go than spend human hours and money placating them.

Mark.

Customers and operators both have very little idea what they are
doing. Most people have no idea what the policers are accounting for.
And everything still works, without anyone understanding what they are
doing. So mostly it's not a problem whether you're doing L1, L2 or L3.

Of course your 100M physical interface is limited to an L1 rate of 100M.
If you provision that as a VLAN with a 100M service, should you now sell
L1, L2 or L3 of 100M? What are you doing? (Not you you, the passive you;
you are not representative, NANOG is not representative. The passive you
doesn't know which they are selling, and which they are selling
changes with hardware upgrades, and they don't know it.)

To echo Alain's comments earlier, the Juniper QFX 5100 series is stable, once you figure out all the shortcomings of the chipset. We aren't doing anything fancy, but have certainly bumped into our share of issues that have no workaround because it's a limitation of the physical hardware. Since we're talking about counters, see if you can spot the error with IPv6 accounting in the output from our 5100 below (about 50% of our traffic is v6):

    Transit statistics:
     Input bytes : 284315487788005 412457312 bps
     Output bytes : 39937401090441 29417528 bps
     Input packets: 231391925059 39552 pps
     Output packets: 88278182551 10809 pps
     IPv6 transit statistics:
      Input bytes : 0
      Output bytes : 0
      Input packets: 0
      Output packets: 0

:wink:

I believe the 5100 just announced EOL (QFX Series Hardware Dates & Milestones - Juniper Networks); I haven't had time to look at the replacement models to see if they behave any better.

Jason

Looks like its replacement is the 5120 series. The question is whether the 5120 has the same limitations and a similar chipset.

All sounds like a bit of Broadcom to me :-).

Mark.

Severely limited TCAM makes use of ACLs challenging.

Hi!

That's the way one of my employers did it, and I can't think of a better way.

bytes += PPS*overhead

Overhead is likely 20 bytes (preamble, SFD, IFG). But it could also be
24 bytes (the FCS/CRC might be missing from what is otherwise claimed to
be L2). You may need a lab to confirm what exactly is being counted.

This adjustment could be done in the DB or at render time; both have
pros and cons.
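
A minimal sketch of that adjustment, assuming you already have byte and packet counter deltas for the same polling interval. The function names and the 20-byte default are illustrative; the right overhead depends on what the box actually counts (24 if the FCS is missing, and more again if the Ethernet header is also excluded, as in the Juniper documentation quoted earlier).

```python
def l1_bps(octets_delta, packets_delta, interval_s, overhead_bytes=20):
    """Normalise a byte counter to L1 rate: bytes += packets * overhead."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    l1_bytes = octets_delta + packets_delta * overhead_bytes
    return l1_bytes * 8 / interval_s


def l1_utilisation(octets_delta, packets_delta, interval_s, link_bps,
                   overhead_bytes=20):
    """L1-normalised utilisation as a fraction of the physical link rate."""
    rate = l1_bps(octets_delta, packets_delta, interval_s, overhead_bytes)
    return rate / link_bps


# Hypothetical sample: 1000 minimum-size frames (64 bytes each) in 10 seconds.
print(l1_bps(64_000, 1_000, 10))                      # counter is true L2
print(l1_bps(64_000, 1_000, 10, overhead_bytes=24))   # counter omits the FCS
```

Doing this at render time keeps the raw counters intact in case the overhead guess turns out to be wrong; doing it before storage keeps every graph and alert consistent without further effort.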

Good monitoring software lets you do "preprocessing" before storing the monitored data in the database.

Saku's formula should work well in this case.

I use Zabbix for monitoring big infrastructure. It has many advantages like:

- Push or pull metrics (dmz friendly)
- Can use many proxies (scale well)
- preprocessing of data (fix vendors mess)
- alert based on business logic through templates (proactive instead of reactive)
- open source with enterprise support available (always nice to be able to call 1800 zabbix in case of emergency)
- agent, agentless, discovery, snmp, java/jmx, telnet, ipmi, web scenarios, etc (never faced a corner case that couldn't be monitored so far)

Really awesome at infrastructure level.

Jean