UCSF Network Admin??

Hello,

Is there anyone with clue from UCSF on-list? Or if someone knows how to
put me in contact with them, that would be great.

We are not able to query their DNS servers from our network. We've got
users not able to access anything UCSF due to this.

Thus far, their response has been to manually put DNS entries into our
users' hosts files, not to actually fix the real issue.

Thanks,
-Robert

I am querying them OK. I am in US AZ. I am also able to reach
manana.garlic.com.

[hyperion]/usr/local# dig www.ucsf.edu @ucsfns2.ucsf.edu
www.ucsf.edu. 3600 IN A 64.54.132.50
ucsf.edu. 3600 IN NS ucsfns1.ucsf.edu.
ucsf.edu. 3600 IN NS adns2.Berkeley.edu.
ucsf.edu. 3600 IN NS adns1.Berkeley.edu.
ucsf.edu. 3600 IN NS ucsfns2.ucsf.edu.
adns1.Berkeley.edu. 172800 IN A 128.32.136.3
adns1.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::3
adns2.Berkeley.edu. 172800 IN A 128.32.136.14
adns2.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::e
ucsfns1.ucsf.edu. 3600 IN A 128.218.254.10
ucsfns2.ucsf.edu. 3600 IN A 128.218.254.40
;; Query time: 41 msec
;; SERVER: 128.218.254.40#53(128.218.254.40)
;; WHEN: Wed Aug 1 11:02:46 2012
;; MSG SIZE rcvd: 270

Ditto on that from TWTC in Milwaukee, WI.

# dig www.ucsf.edu @ucsfns2.ucsf.edu

; <<>> DiG 9.8.1-P1 <<>> www.ucsf.edu @ucsfns2.ucsf.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49793
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 6
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;www.ucsf.edu. IN A

;; ANSWER SECTION:
www.ucsf.edu. 3600 IN A 64.54.132.50

;; AUTHORITY SECTION:
ucsf.edu. 3600 IN NS adns1.Berkeley.edu.
ucsf.edu. 3600 IN NS ucsfns2.ucsf.edu.
ucsf.edu. 3600 IN NS adns2.Berkeley.edu.
ucsf.edu. 3600 IN NS ucsfns1.ucsf.edu.

;; ADDITIONAL SECTION:
adns1.Berkeley.edu. 172800 IN A 128.32.136.3
adns1.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::3
adns2.Berkeley.edu. 172800 IN A 128.32.136.14
adns2.Berkeley.edu. 3600 IN AAAA 2607:f140:ffff:fffe::e
ucsfns1.ucsf.edu. 3600 IN A 128.218.254.10
ucsfns2.ucsf.edu. 3600 IN A 128.218.254.40

;; Query time: 63 msec
;; SERVER: 128.218.254.40#53(128.218.254.40)
;; WHEN: Wed Aug 1 13:48:51 2012
;; MSG SIZE rcvd: 259

Thus far, their response has been to manually put DNS entries into our
users' hosts files, not to actually fix the real issue.

Please, please don't misuse "DNS entries". Hosts files DO NOT and
NEVER HAVE taken "DNS entries". They contain hostname/address
mappings, but they are not and never have been DNS entries.
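
To make the distinction concrete, here is a minimal sketch in Python of what a hosts file actually holds. The function name parse_hosts is made up for illustration, and the sample line reuses an address that appears in the dig output elsewhere in this thread. Each line is just an address followed by one or more names; there is no TTL, class, or record type as there would be in a DNS resource record.

```python
def parse_hosts(text: str) -> dict[str, str]:
    """Map each hostname or alias in hosts-file text to its address."""
    mapping: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            # Keep the first mapping seen for a name, ignore later duplicates.
            mapping.setdefault(name.lower(), address)
    return mapping


if __name__ == "__main__":
    sample = "64.54.46.99 www.ucsfmedicalcenter.org webmcb06  # manual override\n"
    print(parse_hosts(sample))
```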

I should have been a little more forthcoming with information.

We are having issues with getting responses from these servers:

NSMEDCTR1.UCSFMEDICALCENTER.ORG
NSMEDCTR2.UCSFMEDICALCENTER.ORG

Which are authoritative for "ucsfmedctr.org" and "ucsfmedicalcenter.org".

We ARE able to resolve ucsf.edu and things associated with that entity,
just NOT the medical center.

Thanks,
-Robert

Hi all,

I am looking for literature on the (monetary) costs of misconfigurations in an operational ISP network. Are there any such studies I can benefit from?

In a larger context, are there any thorough studies exploring the cost of building and running a large ISP network?

Best,

-Murat

Hi Murat,

I never saw any literature about this topic. But I think it is not too
difficult to calculate (or estimate).

A misconfiguration will have at least two impacts: network outage and
rework. For the network outage, you use the SLAs to calculate the cost
(how much customer revenue you lost) due to that outage. On the other
hand, there is the time spent fixing the misconfiguration. The fix
could mean removing the misconfig and applying a new, correct config,
or just editing the misconfig toward the correct config. This rework
translates into time, and time can be translated into money spent.
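
As a rough illustration of that estimate, here is a minimal sketch in Python. Everything in it is an assumption made up for the example: the Incident fields, the idea that the SLA credits a fixed fraction of monthly revenue per outage hour (capped at the full month), and all of the numbers. A real SLA will have its own credit schedule.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    outage_hours: float              # how long affected customers were down
    affected_monthly_revenue: float  # monthly revenue from those customers
    credit_per_hour: float           # hypothetical SLA credit, fraction of a month per hour
    engineer_hours: float            # total time spent finding and fixing the misconfig
    hourly_rate: float               # loaded cost of one engineer hour


def sla_credit(incident: Incident) -> float:
    """Revenue refunded under the SLA, capped at one full month."""
    fraction = min(1.0, incident.outage_hours * incident.credit_per_hour)
    return incident.affected_monthly_revenue * fraction


def rework_cost(incident: Incident) -> float:
    """Cost of the time spent fixing the misconfiguration."""
    return incident.engineer_hours * incident.hourly_rate


def direct_cost(incident: Incident) -> float:
    """Sum of the two 'known' components: SLA credits plus rework."""
    return sla_credit(incident) + rework_cost(incident)


if __name__ == "__main__":
    example = Incident(2, 50_000, 0.05, 6, 100)
    print(round(direct_cost(example), 2))
```

This covers only the two components named above; it says nothing about customers who leave.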

Regards

Isn't the largest cost omitted (or at least glossed over) here?
Namely, lost customers due to the outage. That's why people have SLAs
and rework the network at all -- to avoid that cost.

I am looking for literature on the (monetary) costs of
misconfigurations in an operational ISP network. Are there any such
studies I can benefit from?

jgs, who should know, says 42 quatloos

randy

Hi Darius,

You are right: the loss of a customer due to those things. However, I
would classify this as an unknown (in risk-analysis terms), because the
others I mentioned can be calculated and estimated (they are known).
It is very hard to estimate whether a customer will cancel the contract
because of 1 or n network outages. In theory, if the customer's SLA is
repeatedly not met, there is a real chance he will cancel the contract.
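
Since the cancellation probability is the unknown, one way to handle it is to not pick a single value and instead look at how the expected loss moves across a range of guesses. A minimal sketch in Python, where the function names and the probabilities are purely hypothetical:

```python
def expected_churn_loss(p_cancel: float, monthly_revenue: float,
                        months_remaining: float) -> float:
    """Expected revenue lost if the customer cancels with probability p_cancel."""
    if not 0.0 <= p_cancel <= 1.0:
        raise ValueError("p_cancel must be between 0 and 1")
    return p_cancel * monthly_revenue * months_remaining


def sensitivity(monthly_revenue: float, months_remaining: float,
                probabilities=(0.01, 0.05, 0.20)) -> dict:
    """Expected loss for each guessed cancellation probability."""
    return {
        p: expected_churn_loss(p, monthly_revenue, months_remaining)
        for p in probabilities
    }


if __name__ == "__main__":
    for p, loss in sensitivity(1000.0, 12).items():
        print(f"p={p:.2f} expected loss={loss:.2f}")
```

The spread between the low and high guesses is itself a measure of how uncertain this component is compared with the SLA and rework figures.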

Regards

Those servers respond to my queries from here in AZ:

# dig www.ucsfmedicalcenter.org @nsmedctr2.ucsfmedicalcenter.org
www.ucsfmedicalcenter.org. 86400 IN CNAME webmcb06.ucsfmedicalcenter.org.
webmcb06.ucsfmedicalcenter.org. 86400 IN A 64.54.46.99
;; Query time: 41 msec
;; SERVER: 64.54.50.50#53(64.54.50.50)
;; WHEN: Wed Aug 1 17:36:36 2012
;; MSG SIZE rcvd: 93

# dig www.ucsfmedicalcenter.org @nsmedctr1.ucsfmedicalcenter.org
www.ucsfmedicalcenter.org. 86400 IN CNAME webmcb06.ucsfmedicalcenter.org.
webmcb06.ucsfmedicalcenter.org. 86400 IN A 64.54.46.99
;; Query time: 54 msec
;; SERVER: 64.54.42.50#53(64.54.42.50)
;; WHEN: Wed Aug 1 17:37:41 2012
;; MSG SIZE rcvd: 93

also responds here in Ohio on TW
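
For anyone wanting to repeat the dig checks above from another vantage point without dig installed, here is a minimal sketch in Python that sends one non-recursive A query straight to a given nameserver address over UDP and reports whether anything came back. The function names (build_query, parse_header, probe) are made up for illustration, and it decodes only the DNS header, not the answer records.

```python
import socket
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (RFC 1035 wire format) for name."""
    # Header: ID, flags (0 = standard query, recursion not requested),
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE (1 = A) and QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_header(packet: bytes) -> dict:
    """Decode the parts of the 12-byte DNS header useful for triage."""
    query_id, flags, _qd, an, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    return {
        "id": query_id,
        "is_response": bool(flags & 0x8000),    # QR bit
        "authoritative": bool(flags & 0x0400),  # AA bit
        "rcode": flags & 0x000F,                # 0 = NOERROR
        "answers": an,
    }


def probe(server_ip: str, name: str, timeout: float = 3.0):
    """Return the parsed reply header, or None if the query timed out."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server_ip, 53))
        try:
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return None
    return parse_header(reply)
```

For example, probe("64.54.50.50", "www.ucsfmedicalcenter.org") targets the address dig reported for nsmedctr2 above; that address comes from the 2012 output and may no longer be current. A None result is the "no response" symptom, which points at a path or filtering problem between the two networks rather than at the zone data.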

I think it's more complicated than that; the cost of misconfiguration
is almost inseparable in some cases from the cost of configuration in
general. Not all misconfigs are equal, so you might want to concentrate
on a specific kind of misconfiguration, or a specific misconfig impact,
e.g. "an erroneous filter is applied, causing routes to be accepted
from an EGP peer without restriction". Especially with
misconfigurations that might not have an immediately discovered impact,
the business impact beyond the cost to discover and resolve may not be
apparent. That depends on details of the misconfig, such as how trivial
or 'obvious' the error should be and how consistent the problems it
causes are.

At least if you concentrate on a certain specific type of misconfig and
specific impact, you can have a basis for comparison and
approximation, though only for that type.

The "fix" to some types of misconfigs might sometimes be to update the
design documentation, so the "misconfig" is no longer a
misconfiguration; so then you can start asking about how you
define "misconfig" in the first place, and the costs of having
erroneous or missing documentation.

Which is hard, because the "costs" of updating documentation and
finding errors, less-than-optimal practices, or possible improvements
in configurations are affected by the long-term "costs" or
loss of efficiency resulting from failing to correct
documentation and failing to review and improve arguably
suboptimal configurations.

Some misconfigs or suboptimal configs are discovered by review or
other measures before there is any operational impact. Some
misconfigs are "safe" or "harmless" by coincidence, but can cause
issues later when the network is expanded further according to a design
that does not anticipate the misconfig, so the cost there is
increased risk.

Not all possible misconfigurations of a network cause an outage; some
misconfigurations are actually design errors, not operator errors.
Not all network issues are outages; some configuration errors are
just things like

"Some entries in an access-list are dead weight, e.g. can never
be reached, or are not necessary"; the impact of this error is
wasted memory resources, or increased complexity / more unnecessary
stuff for humans to look at.

(The entry might not have been dead-weight when originally added.)
Correcting the deadweight ACL entry situation then is an improvement
in efficiency.
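
A minimal sketch in Python of how the "can never be reached" case might be detected, assuming a simplified ACL model invented for this example: entries are (action, source prefix) pairs evaluated first-match, with no ports or protocols. An entry is flagged when a single earlier entry already covers its whole prefix. Entries shadowed only by the combination of several earlier entries are not caught.

```python
import ipaddress
from typing import List, Tuple

Entry = Tuple[str, str]  # (action, source prefix); hypothetical simplified model


def shadowed_entries(acl: List[Entry]) -> List[int]:
    """Return indexes of entries that can never match under first-match rules."""
    seen = []   # networks of all earlier entries, in order
    dead = []   # indexes of entries fully covered by one earlier entry
    for index, (_action, prefix) in enumerate(acl):
        net = ipaddress.ip_network(prefix)
        # subnet_of() raises TypeError across IP versions, so check version first.
        if any(net.version == earlier.version and net.subnet_of(earlier)
               for earlier in seen):
            dead.append(index)
        seen.append(net)
    return dead


if __name__ == "__main__":
    acl = [
        ("permit", "10.0.0.0/8"),
        ("deny", "10.1.0.0/16"),      # covered by the /8 above
        ("permit", "192.0.2.0/24"),
        ("deny", "0.0.0.0/0"),
        ("permit", "198.51.100.0/24"),  # covered by the catch-all deny
    ]
    print(shadowed_entries(acl))
```

Note that a flagged entry with the opposite action to the one shadowing it (like the deny under the permit here) may point at an intent that was never enforced, which is more than just wasted memory.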

Not all misconfigurations are detected, either; sometimes not even
misconfigs that caused issues.

An example of a misconfiguration that would occur frequently in some
kinds of environments and might not break an uptime SLA would be one
that causes suboptimal performance or reduced cost-effectiveness (e.g.
an early upgrade required due to an unrecognized misconfiguration).

Or configuration deadweight utilizing so much memory, that hardware
upgrades become needed. On some networks, there might not be a
formal SLA, and the end user might not notice or take issue with it.

Loss of fault resilience (e.g. the failover path won't work): no SLA is
violated if the fault tolerance wasn't required by the SLA, and the
configuration error might go undetected for years if no regular
failover testing was performed.

It might be corrected before there is an issue... then the cost of
"increased risk" during the period in which the misconfig wasn't
service-affecting could be quite nebulous.

I never saw any literature about this topic. But I think it is not too
difficult to calculate (or estimate).

[snip]

A misconfiguration will, at least, impact on two points: network
outage and re-work. For the network outage, you have to use the SLAs
to calculate the cost (how much you lost from the customers' revenue)
due to that outage. On the other hand, there is the time efforts spent
to fix the misconfiguration. Under the fix, it could be removing the

[snip]

On the end customer side, I've done a bunch of reliability / risk cost
assessments for various customers over the years. It's never easy.

For an ISP... customers are fairly locked in, but for big networks and
customers, especially multihoming customers, business goes where they
want it.

SLA costs are easy. Predicting the final financial impact is hard.

Quantifying the business costs would be very complex.

Here are some reports and research papers that may be a starting point:
[1] Juniper Networks, Inc., “What's Behind Network Downtime?,” pp.
1–12, May 2008.
[2] R. Mahajan, D. Wetherall, and T. Anderson, “Understanding BGP
misconfiguration,” Proceedings of the 2002 conference on Applications,
2002.
[3] A. Medem, R. Teixeira, N. Feamster, and M. Meulle, “Joint analysis
of network incidents and intradomain routing changes,” Network and
Service Management (CNSM), 2010 International Conference on, pp.
198–205, 2010.
[4] D. Turner, K. Levchenko, A. C. Snoeren, and S. Savage, “California
fault lines: understanding the causes and impact of network failures,”
presented at the SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010
conference on SIGCOMM, 2010.
[5] Z. Yin, X. Ma, J. Zheng, Y. Zhou, L. N. Bairavasundaram, and S.
Pasupathy, “An empirical study on configuration errors in commercial
and open source systems,” presented at the SOSP '11: Proceedings of
the Twenty-Third ACM Symposium on Operating Systems Principles, 2011.
[6] Z. Kerravala, “As the Value of Enterprise Networks Escalates, So
Does the Need for Configuration Management,” cs.princeton.edu,
01-Jan.-2004. [Online]. Available:
https://www.cs.princeton.edu/courses/archive/fall10/cos561/papers/Yankee04.pdf.
[Accessed: 09-May-2012].
[7] W. Enck, P. McDaniel, S. Sen, and P. Sebos, “Configuration
management at massive scale: System design and experience,” USENIX
'07, Jun. 2007.
[8] R. D. Doverspike, K. K. Ramakrishnan, and C. Chase, “Structural
overview of ISP networks,” Guide to Reliable Internet Services and
Applications, pp. 19–93, 2010.

I do not think occasional outages cause significant loss of customers. Customers get angry easily, but once an issue is fixed, they get happy quickly. Customers have very short memories and the cost and hassle of changing services is often significant. Outages are never good, but it is better to concentrate on fixing the issue than panic about customers canceling their service.

Many times the cause of an outage is totally out of your control. For example, most of our outages are caused by Verizon's aging and neglected copper cable plant. I often wish some company had the balls to file a class action lawsuit over Verizon's neglect of their copper plant, but NOBODY wants to piss off their ILEC, including us.

The misconfiguration cost is usually not calculable in itself. But I
think the more important issue is, "How do we prevent it?" I would
spend more time on prevention than assessing the cost.

I can think of several minor provisioning issues that cost us more in
customer relations than everything else put together, and a couple of
significant ones after which it seemed like nothing had happened. And I
am not sure I could have predicted the outcome the day before the event
if someone had handed me the scenario to assess. The reason: when it
happens, the CURRENT situation is as much a driver of the impact as the
actual event. It even goes back to the emotional state of the customer:
whether his toast was burned this morning, whether he/she had a fight
with the spouse, who flipped him the bird during his drive in, and a
lot of other things that dictate mental state.

I would be very loath to use a vendor who takes the approach that all
they are concerned about is what an error costs them. I want them to be
more concerned about what that costs their customer (me) and what they
can do to prevent it.

Proper Prior Preparation Prevents Piss Poor Performance.

Training, sound processes, good management practices, good maintenance,
good personnel selection go a long way.

To quote Chief Gassaway (fire chief with good stuff on the web for any
business) "Luck validates bad practices." The REB translation, "We did
it this way for years and nothing bad happened."

In Chief Gassaway's business, bad practices cause Line of Duty Deaths.
In ours they cause outages, lost revenue and possibly bankruptcy.
Remember, if your company goes belly up, you are out of a job...

http://www.samatters.com/2012/07/31/positive-reinforcement-of-undesirable-behavior/
     
Ralph Brandt

Lots of people have developed best practices on these topics. The
problem is pushing against the business side and keeping these in
place, and not letting the bar be low at your upstream and peers.

There is a secondary issue that is still unaddressed. Some vendors
still send all routes they receive out to all external peers in the
absence of a policy. This is something I want to see corrected as it
will require a bit more intelligence when it comes to BGP policy to
provide the expected behavior.
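
To illustrate the difference in defaults, here is a minimal sketch in Python of a "no policy means no export" rule. It is a toy model rather than any vendor's implementation: routes are plain prefix strings, a policy is any function that returns True for routes allowed out, and the names export_routes and own_prefixes_only are made up for the example.

```python
from typing import Callable, Iterable, List, Optional

Route = str                       # a prefix such as "192.0.2.0/24"
Policy = Callable[[Route], bool]  # returns True if the route may be exported


def export_routes(routes: Iterable[Route],
                  policy: Optional[Policy] = None) -> List[Route]:
    """Routes to advertise to an external peer under a default-deny rule."""
    if policy is None:
        # No export policy configured: advertise nothing instead of everything.
        return []
    return [route for route in routes if policy(route)]


def own_prefixes_only(own: Iterable[Route]) -> Policy:
    """Build a policy that only allows an explicit set of prefixes."""
    allowed = set(own)
    return lambda route: route in allowed


if __name__ == "__main__":
    table = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
    print(export_routes(table))
    print(export_routes(table, own_prefixes_only(["192.0.2.0/24"])))
```

With this default, forgetting to attach a policy results in a session that advertises nothing, which is noticed quickly, instead of one that quietly leaks the full table.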

- Jared