CVSS – Vulnerability Scoring Gone Wrong

By Patrick Toomey

If you have been in the security space for any stretch of time, you have undoubtedly run across the Common Vulnerability Scoring System (CVSS). CVSS attempts to provide an “objective” way to calculate a measure of risk associated with a given vulnerability, based on a number of criteria the security community has deemed worthwhile. While I admire the goals of such a scoring system, in practice I think it falls short and over-complicates the issue of assigning risk to vulnerabilities. Before we get into my specific issues with CVSS, let’s briefly review how a CVSS score is calculated. Put simply, the calculation tries to take into account criteria such as:

  • Exploitability Metrics (i.e. probability)
  • Impact Metrics (i.e. severity)
  • Temporal Metrics (extra fudge factors for probability)
  • Environmental Metrics (extra fudge factors for severity)

Each of the above categories is composed of a number of questions/criteria that are used as input into a calculation that results in a value between 0.0 and 10.0. This score is often reported with publicly disclosed vulnerabilities as a means of conveying the relative importance of fixing/patching the affected software. The largest source of public CVSS scores is the National Vulnerability Database (NVD), which publishes XML documents containing a CVSS score for every CVE from 2002 to 2012. In addition to the NVD, I’ve also seen CVSS used by various security tools, as well as internally by numerous organizations, since it doesn’t require reinventing the wheel when ranking vulnerabilities. So, what’s wrong with CVSS?
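
For a sense of what that calculation actually looks like, the sketch below is a rough paraphrase of the published CVSS v2 base equation in Python. The metric weights are the ones from the public v2 specification, but treat the rounding and edge cases as approximate rather than authoritative:

    # Rough sketch of the CVSS v2 base score equation (weights from the public spec).
    # Not an authoritative implementation; see first.org for the real definition.
    ACCESS_VECTOR     = {"local": 0.395, "adjacent": 0.646, "network": 1.0}
    ACCESS_COMPLEXITY = {"high": 0.35, "medium": 0.61, "low": 0.71}
    AUTHENTICATION    = {"multiple": 0.45, "single": 0.56, "none": 0.704}
    CIA_IMPACT        = {"none": 0.0, "partial": 0.275, "complete": 0.660}

    def cvss_v2_base(av, ac, au, c, i, a):
        impact = 10.41 * (1 - (1 - CIA_IMPACT[c]) * (1 - CIA_IMPACT[i]) * (1 - CIA_IMPACT[a]))
        exploitability = 20 * ACCESS_VECTOR[av] * ACCESS_COMPLEXITY[ac] * AUTHENTICATION[au]
        f = 0.0 if impact == 0 else 1.176
        return round(((0.6 * impact) + (0.4 * exploitability) - 1.5) * f, 1)

    # A remote, low-complexity, unauthenticated flaw that fully compromises
    # confidentiality, integrity, and availability scores a 10.0.
    print(cvss_v2_base("network", "low", "none", "complete", "complete", "complete"))

The point here is not the exact constants; it is that a handful of categorical inputs gets multiplied through a set of magic weights and comes out the other side looking like a precise decimal.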

There are so many things I dislike about CVSS, though I will freely admit I am not steeped in CVSS lore, and would be open to hearing/discussing the reasoning behind the scoring system.  That said, here are my issues with CVSS in no particular order.

We don’t measure football fields in inches for a reason

Nobody cares that the distance between goal lines on an American football field is 3600 inches. Why? Because it is a useless unit of measurement when we are talking about football. Nobody cares if someone has made 2 inches of progress on the field, as yards are the only thing that matters. Similarly, what is an organization supposed to take away from a CVSS score that can take on 100 potential values? Is a 7.2 any better than a 7.3 when someone is deciding whether or not to fix something? A reasonable argument against CVSS being too fine-grained is that you can always bubble the result up into a more coarse unit of measure. But, that leads to my second complaint.

The “fix” is broken

So, sure, 100 distinct values is overkill for ranking vulnerabilities, and CVSS acknowledges this to some degree by mapping the overall score to a “severity score” of High, Medium, or Low. On the surface this seems reasonable, as it abstracts the ugly sausage-making details of the detailed CVSS score into a very actionable severity score. But, I feel like they managed to mess this up as well. They started with a pretty fine granularity and bubbled up to something that is too coarse, as it tends to blur together various high-severity vulnerabilities. I’ve always been a fan of a four-point scale that breaks down as follows:

  • Critical – The vulnerability needs to have been fixed yesterday.  The entire team responsible will not sleep until the vulnerability has been fixed.
  • High – This vulnerability is serious and we are going to fix it in the near term, but we also don’t need to make everyone lose sleep over it.
  • Medium – This vulnerability is worth fixing, and we will set a relatively fixed date in the near future for when it will be fixed.
  • Low – This vulnerability is on our radar and if it fits in our next release schedule we will fix it.

As it happens, a fairly large project manages to get by pretty well using a system roughly analogous to the one described above. Google’s Chrome project has used a similar rating system and I haven’t heard anyone complain. I was curious how this mapping would work against CVSS scores, so I plotted all of the CVSS scores for every CVE within the NVD from 2002 until 2012. The results are as follows:

As can be seen, there are some pretty obvious groupings of scores within this data. Without staring at the data too hard, you can see that there are clearly four clusters that would map very cleanly to the four-point system I mentioned earlier.

The main thing to make note of here is that there is a vast chasm between each grouping and its nearest neighbor(s).  There is very little chance of mistaking a low vulnerability for a medium vulnerability.  In contrast, with the current CVSS scoring system the grouping looks more like this:

There are some seemingly arbitrary dividing lines between High, Medium, and Low scores. Particularly troubling is the dividing line between Medium and High: anything scored below a 7.0 is a Medium risk and anything at 7.0 or above is a High (a minimal sketch of this banding appears below). Unfortunately, there is a fair bit of data clustered at exactly that juncture. This leads to my final complaint against CVSS.
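
The sketch below spells out the NVD-style banding (the published cutoffs are 0.0-3.9 Low, 4.0-6.9 Medium, and 7.0-10.0 High); the example values show how two nearly indistinguishable scores land in different buckets:

    # Sketch of the NVD severity banding layered on top of a CVSS score
    # (cutoffs as published by NVD: 0.0-3.9 Low, 4.0-6.9 Medium, 7.0-10.0 High).
    def nvd_severity(score: float) -> str:
        if score < 4.0:
            return "Low"
        if score < 7.0:
            return "Medium"
        return "High"

    # Two scores that CVSS itself would treat as essentially equivalent fall on
    # opposite sides of the cliff, right where the data clusters.
    print(nvd_severity(6.9))  # Medium
    print(nvd_severity(7.0))  # High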

Objectivity is in the eye of the beholder

As mentioned at the beginning of this blog entry, a CVSS score starts from a set of base metrics, but can be adjusted using a number of “Temporal” and “Environmental” metrics. In other words, given a base score, you can tweak it however you want using a number of fuzzy criteria. This, compounded with the coarse High, Medium, and Low severity scores, leads to a troubling amount of score fiddling. I am not going to go all conspiracy theory on you and claim people are fudging numbers for publicly disclosed CVEs. But, I have seen internal groups within companies leveraging these additional metrics to make the data fit their desired outcome. I can’t blame them, as it is almost a requirement. When presented with a vulnerability, there is generally an internal consensus about how serious it is to the organization and whether it is a Critical, High, Medium, or Low (as I defined them above). However, once they enter all of the base metrics into the CVSS calculator, there is a reasonable chance it is going to produce a score that doesn’t mesh with their gut. So, adjustments are made to the temporal and environmental metrics until the calculator gives them the appropriate score. Again, I blame nobody for “fudging” the data, as often the base score just doesn’t work (the sketch at the end of this section shows how much room those knobs leave).

One could argue that the temporal and environmental scores could be adjusted in a reliable/repeatable way for a given application/environment. Then, anytime a vulnerability is identified in that specific application, the same temporal/environmental adjustments could be used to create reliable/repeatable scores. In reality, this doesn’t happen. An organization should be praised for using any kind of scoring system at all. Trying to enforce an extra level of unnecessary/burdensome process is neither worthwhile nor realistic.
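
To give a sense of how much room the temporal knobs alone leave, the sketch below applies the CVSS v2 temporal equation, which simply multiplies the base score by three factors. The multiplier values are the ones from the public v2 specification; the rounding is approximate:

    # Sketch of the CVSS v2 temporal adjustment: three multipliers applied to
    # the base score (multiplier values from the public v2 spec).
    EXPLOITABILITY    = {"unproven": 0.85, "proof-of-concept": 0.90,
                         "functional": 0.95, "high": 1.00, "not defined": 1.00}
    REMEDIATION_LEVEL = {"official-fix": 0.87, "temporary-fix": 0.90,
                         "workaround": 0.95, "unavailable": 1.00, "not defined": 1.00}
    REPORT_CONFIDENCE = {"unconfirmed": 0.90, "uncorroborated": 0.95,
                         "confirmed": 1.00, "not defined": 1.00}

    def cvss_v2_temporal(base, exploitability, remediation, confidence):
        return round(base * EXPLOITABILITY[exploitability]
                          * REMEDIATION_LEVEL[remediation]
                          * REPORT_CONFIDENCE[confidence], 1)

    # Choosing the most charitable values drags a 7.5 ("High" under the NVD
    # banding) down to a 5.0 ("Medium") -- exactly the knob-turning described above.
    print(cvss_v2_temporal(7.5, "unproven", "official-fix", "unconfirmed"))  # 5.0

In other words, a perfectly “by the book” use of the temporal metrics can move a vulnerability across a severity boundary, which is all the cover a motivated team needs.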

Conclusion

Even with all the above being said, as soon as you pitch the idea of using a four-point scoring system you run into the problem of objectivity. How do we decide what criteria delineate a Critical from a High vulnerability? I am sure that is how CVSS started, as it provided an approach for scoring things objectively. But, as we already discussed, it is only superficially objective, as there are numerous ways to adjust the score using subjective metrics. So, why bother? I think following a model similar to the Chrome severity guidelines makes more sense. The Chrome team has developed some specific criteria they use to group vulnerabilities. Given that they are only trying to place a vulnerability into one of four buckets, it isn’t that difficult. Most organizations could come up with a similar set of organization-specific criteria for assigning a vulnerability score. In the end, while I am a fan of standardization in general, I am not a fan of the current standard for vulnerability scoring. Not to be too cliché, but an Albert Einstein quote sums up my thoughts pretty well: “Everything should be made as simple as possible, but no simpler”. I think CVSS could use a little simplifying.

12 thoughts on “CVSS – Vulnerability Scoring Gone Wrong”

  1. Hi Patrick,

    Nice and extensive article, thanks.
    As opposed to you I actually like the CVSS rating:
    *) It is very difficult to have one vulnerability rating method that applies to so many different objects: client software and server software, as well as operating systems and proprietary firmware. In that regard, I think CVSS does a really great job.
    *) CVSS is one of the few standards that is actually widely used (although a lot of vendors have their own rating system, sometimes besides using the CVSS). That makes it easier to compare vulnerabilities.
    *) As far as I know, all vendors supply at least the CVSS base score. That’s the objective metric. ‘Fudgeable’ scores like the temporal and environmental are optional – You can choose not to use them (I don’t).
    *) It’s relatively easy to apply for vulnerabilities compared to other rating systems, for instance the OWASP Risk Rating methodology.
    *) The fine-grained scale from 0.0 to 10.0 actually helps to rank the vulnerabilities.

    Having said that, it could be improved upon:

    *) I find the breakdown into three categories (medium, high and low) a bit too broad as well. Adobe and Microsoft use a four-point scale as well, and I too think that it sometimes better explains the severity. You can however name all vulnerabilities having a score of 9.0 and higher critical – the score stays objective.
    *) On one hand I like the simplicity that you can use it for so many objects, but on the other hand: can you really compare a remotely automated exploitable server vulnerability to a browser vulnerability?
    *) Microsoft’s severity rating system, for example, does a better job in that aspect.

    Regards,

    Peter Mosmans

  2. Patrick,

    Thanks very much for writing this. I wanted to let you know that FIRST (the organization that manages CVSS) takes CVSS feedback very seriously. In fact, right now there are two ongoing calls (for Participants and Subjects) to improve CVSS version 2 via the release of CVSS version 3.

    I’d encourage you to review the options that are currently available for participating in CVSS’ improvement:

    * You may apply to participate in the development of CVSS v3 (Call for Participants) now through May 4, 2012 — that’s Friday.
    * You may submit ideas to the Special Interest Group (Call for Subjects) now through June 16, 2012.

    For more details on how to do that, please see the FIRST.org press release:

    I don’t think that a full reply here in the comments on your blog would quite do justice. Briefly, while a “four bucket” system may work for Google’s needs, as many vendor systems work well for their own software, CVSS has different goals — e.g. letting organizations compare vulnerabilities in client-side apps to server software, operating systems, and even virtualization and network infrastructure, from any vendor. Even still, we’re convinced CVSS could use improvement on doing what it is trying to do, which is why we’re looking for exactly this kind of input going into the development of v3.

    That said, I’ve left my email here if you would like to dive a bit further into some discussion should you have particular questions about why one approach was chosen over another. For anything you’d like to see changed, while I’m happy to take emails on those, answering the Call for Subjects is the best way to go.

    Thanks again for your interest in CVSS,
    Seth Hanford
    Chair, CVSS Special Interest Group

    • Looks like the URL was stripped for the Call for Participants and Call for Subjects news release.

      http://www.first.org/newsroom/releases/20120322

      First dot org, newsroom, releases, 20120322 (you can see it from first dot org’s frontpage, in the right-hand news column).

    • Thanks for some insight into the CVSS process. Related to your comment, CVSS obviously has different goals than Google. As a result, no vulnerability risk ranking scheme can have criteria as specific as Google’s if the intent is to calculate risk across a broad array of software/operating systems/etc. But the central difficulty I have is figuring out how a 100-point scale is useful. Obviously a 1000-point scale would be useless, as it conveys no extra information. I have a gut feeling that 100 points is similarly overreaching, as I don’t know what I am supposed to do with a 7.2 vs. a 7.4. Conceptually, from a macro perspective, I could see some utility in measuring the score with such precision, as it might give you some keen insight into vulnerability metrics/trending/etc. But, as shown in the post, so few scores are actually being used that I don’t think we are capturing any real extra information. Again, thanks for the comment.

  3. CVSS doesn’t have low, moderate, and high categories. That’s a proprietary convention introduced by the National Vulnerability Database (nvd.nist.gov). You can use whatever categories you want with CVSS scores. If you like your four categories, use them. That’s in no way contradicting the CVSS specification.

    There’s a CVSS History document posted at http://www.first.org/cvss/history that explains how the scores themselves were developed. The most important thing to point out from that document is that any score differences less than 0.5 are intended to be statistically insignificant. So a 7.3 can’t be considered more severe than a 7.2, but would be considered more severe than a 6.2.
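
    A minimal sketch of that convention, treating any difference under 0.5 as a tie (the 0.5 threshold is the one described in the history document above):

        # Sketch of the "differences under 0.5 are insignificant" rule: only a
        # gap of 0.5 or more counts as a real difference in severity.
        def more_severe(a: float, b: float, threshold: float = 0.5) -> bool:
            return (a - b) >= threshold

        print(more_severe(7.3, 7.2))  # False -- effectively a tie
        print(more_severe(7.3, 6.2))  # True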

    I’d be happy to answer any other questions or concerns you have about CVSS. I’m one of the co-authors of the CVSS v2 documentation, and I’ve done some rather extensive research into CVSS scoring.

    Karen Scarfone
    karen dot scarfone at cox dot net

  4. I have researched CVSS since the publication of CVSSv2, including the CVSS-SIG minutes published at http://www.first.org/cvss and have published a number of related links to http://www.delicious.com/cmlh/

    In relation to the post (not comment) above:

    1. The NVD is independent (within reason) of any vendor, hence their scoring is objective. It would be of value to explore whether their severity score differs from that of each vendor, and by what margin.

    2. The CVSS score is dependent on the environmental metrics, and hence the end user is the most influential in determining the priority of implementing patches and/or workarounds based on the scoring of each vuln when considered in the context of the attack surface (i.e. number of hosts x affected software, cost to repair, whether the residual risk is accepted, etc.) – i.e. the end user can create new environmental metrics specific to their context.

    3. The selection of values was sampled based on those most common with *all* software and attack vector(s). Hence it is possible to compare Cisco to Microsoft to Oracle as an example.

    4. Sampling based on severity is irrelevant if I have a single host with a high severity compared to several hosts with a medium severity, since the attack surface (i.e. number of hosts) is larger, or the values of the temporal metrics increase or decrease, etc. (you referenced the second point in your post above).

    The major issue with CVSS from the perspective of an end user is there is no “real time” feed of the temporal metrics from FIRST members, vendors, etc.

    • That’s a good point. Temporal metrics do kind of imply they vary over time :-) Without some “real time” update, that information is probably not very useful. And as it relates to my experience with internal teams doing vulnerability scoring, I have mostly seen temporal metrics used as fudge factors more than as time-based inputs to the calculation.

  5. As far as I can tell, the rankings of Low, Medium and High (for scores 0.0-3.9, 4.0-6.9, and 7.0-10.0, respectively) are not part of CVSS. Instead, they are part of the NVD Vulnerability Severity Ratings (see http://nvd.nist.gov/cvss.cfm).
    The CVSS generates the score from between 0.0 and 10.0, and you are free to apply whatever rating system you want, including Google’s Chrome project rating system.
    You do provide a good argument for a four-level rating system. Going by your cluster graphs, a natural grouping would be Low for CVSS scores of 0.0-2.9, Medium for 3.0-5.9, High for 6.0-7.9, and Critical for 8.0-10.0.

    • Excellent writeup. I definitely wouldn’t consider myself a security metrics expert, but I think some of my observations, along with your detailed analysis, hint that CVSS doesn’t feel quite right. I often feel that the more complex a scoring system gets, the more likely its output is to clash with my gut. I am sure there is valid research behind how CVSS calculations are performed, but it always seems like metrics that involve multiplication of magic numbers have edge cases that just don’t make sense. Again, good writeup. I look forward to your future thoughts on the topic.

  6. This point was mentioned above, but not so explicitly. All of the metrics used in CVSS originate either in another tool or with the user; CVSS is a way of relating them. What CVSS brings to the table is that it allows one to take a rating of (what I call) fragility, a vulnerability severity, and relate it to the criticality of your technology and the mission it performs. It is actually a very broad stroke. Try reading MORDA or other risk scoring methods.

    The environmental group is a key part of CVSS. It is impossible to relate a CVSS score without reference to the individual mission or service. One could give a score for the components of a CVSS score, i.e. the temporal scores or exploitability, but not an actual CVSS score. That would be the final computation of all groups.

    Regarding temporal scores, they are temporal because they are likely to change with time. Existence of exploit code will change with time. Knowledge of the existence of exploit code may also be dependent upon your own intelligence assets.

    In full disclosure, I championed use of CVSS in the Navy, I work at CMU, and know one of the authors.

    Thierry Zoller makes some good observations in his article but I’ll address them over there.

  7. It is given in CVSS that the Environmental_Score should be less than the Temporal_Score. But why, in most vulnerability scoring, is the Environmental_Score higher than the Temporal_Score?
    e.g. CVE-2003-0818, CVE-2002-0392.

    Can anyone answer my question????
