Report Groups and Quality Ratings in Apdex-G

I’m writing a series of posts about Generalizing Apdex. This is #13. To minimize confusion, section numbers in the current spec are accompanied by the section symbol, like this: §1. The corresponding section numbers in the generalized spec, Apdex-G, are enclosed in square brackets, like this: [1].

In my previous post in this series, I reviewed the first part of section §5 Reporting in the Apdex-G spec, introducing Interval Notation and proposing a new comma-separated format for the Uniform Output File. For the full story, see Configurable Reporting in Apdex-G. In this post, I cover the remainder of section §5.

First I will explore the question of what data should be made available to an independent reporting tool that queries an Apdex-based data analysis service. The current spec requires tools to report Apdex scores in Uniform Output format, but does not require any identifiers or context to be supplied with those scores. This strikes me as an omission, especially since section §3.2 Report Groups contains mandatory rules for defining report groups, which are “the foundation for an Apdex calculation”.

The Importance of Context

In The Big Book of Key Performance Indicators, Eric Peterson writes (on page 8) that “Key performance indicators are always rates, ratios, averages or percentages; they are never raw numbers. Raw numbers are valuable to web analytics reporting to be sure, but because they don’t provide context, are less powerful than key performance indicators,” and (on page 11), “Raw numbers are not key performance indicators. I know that many smart people disagree with me on this point but, well, they’re wrong” [PETE06].

Among those who do disagree is Brian Clifton, author of the book Advanced Web Metrics with Google Analytics. In his blog, Measuring Success, he writes that “a KPI is not always an average, ratio or percentage – sometimes raw numbers are better” [CLIF08].

Having surveyed the field of (Key) Performance Indicators, I am not surprised to find people disagreeing about what is and is not a KPI. But why not have both derived metrics and raw numbers? A derived metric like Apdex is ideal for tracking progress against targets, while the underlying data, like sample counts, supply important context.

Today Apdex recognizes this (see section §5.2 below) by demanding an asterisk when the sample count is below 100, but this does not always substitute for knowing the actual sample count. When comparing a pair of Apdex scores, 0.95 is “excellent” while 0.90 is merely “good” (see section §5.4 below) — the first is clearly better, right? But what if those two scores were based on the measurements of customer satisfaction shown in Table 1 below?

Application | Description | Total Samples | Satisfied Customers | Tolerating Customers | Frustrated Customers | Apdex Score
A           | Search      | 20,000        | 18,500              | 1,000                | 500                  | 0.95
B           | Purchase    | 200           | 160                 | 40                   | 0                    | 0.90

Table 1. Comparison of Two Apdex Scores
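
An Apdex score is the satisfied sample count plus half the tolerating count, divided by the total. A minimal Python sketch (the function name is mine, not the spec’s) reproduces the two scores in Table 1:

```python
def apdex_score(satisfied: int, tolerating: int, frustrated: int) -> float:
    """Apdex = (satisfied + tolerating / 2) / total sample count."""
    total = satisfied + tolerating + frustrated
    if total == 0:
        raise ValueError("no samples (NS) -- see section [5.2]")
    return (satisfied + tolerating / 2) / total

print(apdex_score(18_500, 1_000, 500))  # 0.95 (Application A)
print(apdex_score(160, 40, 0))          # 0.9  (Application B)
```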

A management dashboard probably needs to draw attention to the 500 frustrated customers of application A, despite its higher Apdex score. Size does matter, for two reasons. First, whenever we measure something that varies, we need enough samples to establish that any apparent effect we observe is truly a property of the real world, and not simply a chance occurrence. Second, we need to determine the magnitude of that effect. In the field of statistics, these two aspects are called ‘statistical significance’ and ‘practical (or scientific) significance’, and the two are frequently confused by scientists and non-scientists alike [ZILI08, ZILI09].
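
To put rough numbers on that distinction: if each sample contributes 1 (satisfied), 0.5 (tolerating), or 0 (frustrated), the Apdex score is the mean contribution, and its standard error shrinks with the square root of the sample count. The following sketch is my own illustration, not part of any Apdex spec:

```python
import math

def apdex_standard_error(satisfied: int, tolerating: int, frustrated: int) -> float:
    """Standard error of an Apdex score, treating the score as the mean of
    per-sample contributions (1.0 satisfied, 0.5 tolerating, 0.0 frustrated)."""
    n = satisfied + tolerating + frustrated
    mean = (satisfied + 0.5 * tolerating) / n
    variance = (satisfied * (1.0 - mean) ** 2
                + tolerating * (0.5 - mean) ** 2
                + frustrated * (0.0 - mean) ** 2) / n
    return math.sqrt(variance / n)

print(apdex_standard_error(18_500, 1_000, 500))  # ~0.0013 (Application A)
print(apdex_standard_error(160, 40, 0))          # ~0.014  (Application B)
```

Application B’s score of 0.90 carries roughly ten times the statistical uncertainty of Application A’s 0.95, which is exactly the kind of context a raw sample count lets a reader reconstruct.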

Supporting Management Dashboards
[Figure: Neil Gunther’s Apdex List]

Neil Gunther’s paper, The Apdex Index Revealed [GUNT09], lists a possible criticism of Apdex for not specifying how to present results for “hundreds of applications”:

In a typical commercial enterprise, one may have to manage hundreds or perhaps thousands of servers running multiple applications per server. Even if viewing all of them simultaneously is unlikely, it may be necessary to look at a large subset of them. How do you digest hundreds of Apdex Indexes? How do you digest changes across hundreds of Apdex Indexes? Other than simple tabulation, the Apdex Alliance does not seem to have addressed this issue.
— Neil Gunther, The Apdex Index Revealed, 2009. Section 3.6.

Actually, the Apdex spec deliberately chooses to address this (in sections §5 and §5.1) by leaving the form of any “report” of Apdex scores to be determined by the tool doing the reporting. I think this is the right place to draw the line. Apdex is primarily focused on the data analysis aspect of reporting, not on data visualization. But the spec should support visualization tools by supplying the data they need.

Visualization tools need to implement natural ways to highlight the most important results, and suppress visual clutter. Sorting the lowest Apdex scores to the top could be one way. Another might be to sort by the number of samples underlying an Apdex score. That would distinguish scores based on 20,000 samples from those based on 200 samples, so that problems indicated in the larger group can be given a higher priority. Other possibilities include sorting displays by frustrated sample count, by application, by user, by a timestamp, or by report group name. I am proposing that the Apdex-G spec extend the Uniform Output format to supply these kinds of identifiers.
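
As a sketch of what those identifiers would enable, here is how a hypothetical dashboard backend might rank report-group results once names and counts travel with each score. The record structure and field names below are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class GroupResult:
    report_group: str
    application: str
    sample_count: int
    frustrated_count: int
    apdex: float

results = [
    GroupResult("web-search", "Search", 20_000, 500, 0.95),
    GroupResult("checkout", "Purchase", 200, 0, 0.90),
]

# Surface the biggest problems first: most frustrated customers,
# then lowest Apdex score as the tie-breaker.
for r in sorted(results, key=lambda r: (-r.frustrated_count, r.apdex)):
    print(f"{r.application:10} n={r.sample_count:6,} "
          f"frustrated={r.frustrated_count:5,} Apdex={r.apdex:.2f}")
```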

Neil suggests a display format (the figure above) in which many Apdex results are lined up side by side for easy comparison. I agree with this direction, which is typical of many dashboard displays, but it can be improved upon. It devotes too much ink to the colored bars, and not enough to the Apdex scores (the black lines). This aspect of data visualization is a key design consideration for dashboards. Edward Tufte [TUFT03] and Stephen Few [FEW10] both emphasize the crucial importance of maximizing the data-ink ratio, namely:

“…the proportion of ink (or pixels, when displaying information on a screen) that’s used to present actual data, without redundancy, compared to the total amount of ink (or pixels) used in the entire display, such as in a table or graph. The goal is to design a display that has the highest possible data-ink ratio (that is, as close to the total of 1.0 or 100% as possible), without eliminating something that is necessary for effective communication.”
— Stephen Few, Elegance Through Simplicity, 2004 [FEW04]

[5.1.2] Report Group Section

Section [5.1.2] is a new section that specifies an extension of the Uniform Output spec defined previously in section [5.1.1]. The additional elements, labeled 2-15, summarize the report group and distribution of samples underlying the Apdex metric. At this stage I am undecided on whether some or all of these fields should be made mandatory within the Uniform Output. Once we decide which elements are to be optional or mandatory, I believe we can design the fields of the Uniform Output Header Record to describe the contents of a file. This will allow any tool processing a Uniform Output file to identify the presence or absence of optional fields.

Because this proposal extends the current spec, there is no corresponding text in section §5. So the two columns below show the current section §3.2 Report Groups on the left, and the draft proposal for section [5.1.2] on the right.

Current spec:
3.2 Defining a Report Group

The report group is a specified set of individual measurement samples (of Task Time or Task Chain Time) that will form the foundation for an Apdex calculation. A report group’s set of measurement samples may be unique or may have overlapping measurement samples with other report groups. Report groups can be defined in many ways, but the following are required:

Type
Task or Task Chain; measurements of Task Time and Task Chain Time may not be combined in one report group.
Application
An application as selected by the technician. At a minimum, this is the application that the tool can interpret to the Task level (see above).
User Group
The technician must be able to define various user groups of an application (e.g., geography, organization).
Time Period
The technician must be able to define time of day periods for which the index will be calculated.

The report group is one of the fundamental controls available to a technician. The report group may be defined as broadly as all of the samples for an application, or as narrowly as a single sample. Single samples are useful for diagnostic purposes.

First draft:
[5.1.2] Report Group Section

The Report Group Section, defined in Table 5 below, is an [optional?] extension of the uniform output data record described in section [5.1.1]. Note: Elements 1 and 16-22 repeat the contents of section [5.1.1] for convenience in this draft.

Element Number | Definition                 | Type      | Content
1              | Apdex metric identifier    | Literal   | Apdex

Report Group Identifiers
2              | Report Group Name          | Name      | Defined by tool or user
3              | Report Group Description   | Name      | Defined by tool or user
4              | Metric Type                | Name      | Addendum type (e.g. R for Apdex-R)
5              | Metric Subtype             | Name      | As defined within an addendum
6              | Application                | Name      | Defined by tool or user
7              | User Group                 | Name      | Defined by tool or user
8              | Time Period Start          | Timestamp | ISO 8601: [YYYY][MM][DD]T[hh][mm][ss]Z
9              | Time Period End            | Timestamp | ISO 8601: [YYYY][MM][DD]T[hh][mm][ss]Z

Input Summary
10             | Sample Count               | Number    | Integer
11             | Satisfied Zone Count       | Number    | Integer
12             | Tolerating Zone Count      | Number    | Integer
13             | Frustrated Zone Count      | Number    | Integer
14             | Earliest Sample Timestamp  | Timestamp | ISO 8601: [YYYY][MM][DD]T[hh][mm][ss]Z
15             | Latest Sample Timestamp    | Timestamp | ISO 8601: [YYYY][MM][DD]T[hh][mm][ss]Z

Apdex Index
16             | Apdex Index                | Number    | Decimal in range [0.00, 1.00]
17             | Satisfied Zone Identifier  | Literal   | S
18             | Satisfied Threshold(s)     | Group     | Interval Group, see Table 3
19             | Tolerating Zone Identifier | Literal   | T
20             | Tolerating Threshold(s)    | Group     | Interval Group, see Table 3
21             | Frustrated Zone Identifier | Literal   | F
22             | Frustrated Threshold(s)    | Group     | Interval Group, see Table 3

Table 5. Layout of an Extended Uniform Output Data Record

Report Group Section Header Record
[To be defined]
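
To make the layout concrete, here is a sketch of emitting one extended data record in the element order of Table 5, using the comma-separated Uniform Output format proposed in the previous post. All field values are invented, and the interval groups for elements 18, 20, and 22 use a simplified placeholder rather than the full Interval Group notation of Table 3:

```python
import csv
import sys

# One extended Uniform Output data record, in the element order of Table 5.
record = [
    "Apdex",                   # 1  Apdex metric identifier
    "web-search",              # 2  Report Group Name
    "Search pages, EU users",  # 3  Report Group Description
    "R",                       # 4  Metric Type (e.g. R for Apdex-R)
    "",                        # 5  Metric Subtype
    "Search",                  # 6  Application
    "EU",                      # 7  User Group
    "20100301T000000Z",        # 8  Time Period Start (ISO 8601)
    "20100301T010000Z",        # 9  Time Period End (ISO 8601)
    20000, 18500, 1000, 500,   # 10-13  Sample / Satisfied / Tolerating / Frustrated counts
    "20100301T000003Z",        # 14 Earliest Sample Timestamp
    "20100301T005957Z",        # 15 Latest Sample Timestamp
    0.95,                      # 16 Apdex Index
    "S", "[0,4)",              # 17-18  Satisfied zone and threshold interval
    "T", "[4,16)",             # 19-20  Tolerating zone and threshold interval
    "F", "[16,inf)",           # 21-22  Frustrated zone and threshold interval
]
csv.writer(sys.stdout).writerow(record)
```

Note that Python’s csv module quotes any field containing a comma, such as a multi-threshold interval group; how the Uniform Output format should handle embedded commas is one of the details the header-record design will need to settle.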

[5.3] Describing General Cases

I have reversed the order in which I review sections §5.3 and §5.2 in this post, to eliminate a forward reference. Eventually I will make the same change in the Apdex-G spec, but for now it’s probably less confusing to retain the current spec’s numbering.

Because Apdex-G has generalized the definition and format of thresholds, I propose that Apdex-G should support only generic references to thresholds, rather than the specific forms defined in the current spec. I propose to drop the example of subscripted format because it is not consistent with the defined Apdex-G terminology. It may be included in Apdex-R.

Current spec:
§5.3 When T is a General Case

General Apdex value discussions that reflect any target of T are written with “T” as shown in the following examples.

Uniform Output: “Everyone should understand that 0.90 [T] is a better value than 0.80 [T].”
Subscripted Output: “Everyone should understand that 0.90ₜ is a better value than 0.80ₜ.”

First draft:
[5.3] Describing General Cases

In general discussions of Apdex values, references to unspecified performance thresholds are written using the notation [T], as shown in the following examples.

“Everyone should understand that 0.90 [T] is a better value than 0.80 [T].”
“Apdex scores in the range 0.85 to 0.93 [T] are rated Good.”

For more examples, see sections [5.2] and [5.4] of this document, which also use this notation.

[5.2] Sample Size

To adapt the current section §5.2 for the Apdex-G spec, I have made the language consistent with the notation for general thresholds defined in [5.3], and allowed for an addendum to override the definition of a ‘small group’ of samples.

Current spec:
§5.2 Indicating Sample Size

Apdex values are calculated based upon a set of measurements (samples) in the report group. If there are a small number of samples, the tool must still present a result. However, a result for such a small report group must be clearly marked.

A small report group is defined as any number of samples between 0 and 99. Apdex tools will clearly indicate that the result is based upon one of the following scenarios:

No Samples
The Apdex calculation could not be performed because there were no samples (NS) within the report group. Where the calculated Apdex value would normally appear, the tool will show an output of NS. Examples: NS [4.0], NS₄
Small Group
When an Apdex value is the output of a small group (1 to 99) calculation, an asterisk (*) must be appended to that value. Examples: 0.80 [4.0]*, 0.80₄*.

First draft:
[5.2] Indicating Sample Size

Apdex values are calculated based upon a set of measurements (samples) in the report group. If there are a small number of samples, the tool must still present a result. However, a result for such a small report group must be clearly marked.

A small report group is defined as one having fewer than 100 samples. An addendum may modify this definition to be appropriate for a particular measurement domain. Apdex tools will clearly indicate that the result is based upon one of the following scenarios:

No Samples
The Apdex calculation could not be performed because there were no samples (NS) within the report group. Where the calculated Apdex value would normally appear, the tool will show an output of NS [T], where [T] is the normal threshold display (see section [5.3]).
Small Group
When an Apdex value is the output of a small group calculation, an asterisk (*) must be appended to that value, for example: 0.84* [T], where [T] is the normal threshold display (see section [5.3]).
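
A minimal sketch of this rule, assuming the threshold display [T] of section [5.3] is passed in as a preformatted string; the function name and the small_group_limit parameter (which an addendum could override) are mine:

```python
from typing import Optional

def format_apdex(score: Optional[float], sample_count: int,
                 threshold_display: str = "[T]",
                 small_group_limit: int = 100) -> str:
    """Render an Apdex value per draft section [5.2]: NS for an empty
    report group, a trailing * for a small one."""
    if sample_count == 0:
        return f"NS {threshold_display}"
    marker = "*" if sample_count < small_group_limit else ""
    return f"{score:.2f}{marker} {threshold_display}"

print(format_apdex(None, 0))       # NS [T]
print(format_apdex(0.84, 57))      # 0.84* [T]
print(format_apdex(0.95, 20_000))  # 0.95 [T]
```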

[5.4] Quality Ratings

To adapt the current section §5.4 for the Apdex-G spec, I have:

  • introduced the term “Quality Ratings” to describe this feature
  • clarified the layout and wording of the introductory paragraphs
  • made the notation in the table consistent with the notation defined for general thresholds in [5.3]
  • allowed for an addendum to override the color specifications if appropriate

The actual rating band values are those defined in Apdex today, which I have carried forward into Apdex-G. I have left open for further discussion the question of whether an addendum may specify domain-specific values to override these.

Current spec:
§5.4 Additional Reporting Rules

Some tool vendors may wish to add graphical aids to report the Apdex value. This is an optional feature, but, if implemented, it must follow these guidelines.

Two forms of alternative representations are permitted: the rating (a word), and a color indication. The following table shows the fixed set of alternative modes of representing the Apdex. The table shows examples where the target threshold (T) is 4 seconds.

The color indication can be determined by the vendor in line with their existing product set, however a legend must clearly indicate which color represents each Apdex rating.

Apdex Value Range | Rating                  | Color Indication
0.94₄ to 1.00₄    | Excellent₄              | Determined by vendor (with a 4 plus a color indication)
0.85₄ to 0.93₄    | Good₄                   | Determined by vendor (with a 4 plus a color indication)
0.70₄ to 0.84₄    | Fair₄                   | Determined by vendor (with a 4 plus a color indication)
0.50₄ to 0.69₄    | Poor₄                   | Determined by vendor (with a 4 plus a color indication)
0.00₄ to 0.49₄    | Unacceptable₄ or UNAX₄  | Determined by vendor (with a 4 plus a color indication)

Low Sample Cases
0.NS₄             | NoSample₄               | Determined by vendor (with a 4 plus an NS inside color indication)
0.85₄*            | Good₄*                  | Determined by vendor (with a 4 plus an * inside color indication)

Table 3. Apdex Qualitative Reporting Rules (examples where T=4)

Note: In the current specification, the ‘Apdex Value Range’ column of Table 3 contains a typo in the ‘NoSample’ row, which (as shown above) reads ‘0.NS₄’. That cell should read ‘NS₄’.

First draft:
[5.4] Apdex Quality Ratings

Some tool creators may wish to assign quality ratings to Apdex value ranges, and to present those ratings graphically. This is an optional feature, but, if implemented, it must follow these guidelines.

Two alternative representations are permitted for quality ratings: a rating word or a color indication. Table 6 below lists the value ranges to be used when assigning a rating to an Apdex value. The table shows examples for the target threshold [T], where [T] is the normal threshold display as described in section [5.3].

Colors may be selected by the vendor for consistency with other products, or based on user-supplied preferences. However, a legend must clearly indicate which color represents each Apdex rating.

Apdex Value Range | Rating Word                   | Color Indication
0.94 to 1.00 [T]  | Excellent [T]                 | Determined by vendor (with [T] plus a color indication)
0.85 to 0.93 [T]  | Good [T]                      | Determined by vendor (with [T] plus a color indication)
0.70 to 0.84 [T]  | Fair [T]                      | Determined by vendor (with [T] plus a color indication)
0.50 to 0.69 [T]  | Poor [T]                      | Determined by vendor (with [T] plus a color indication)
0.00 to 0.49 [T]  | Unacceptable [T] or UNAX [T]  | Determined by vendor (with [T] plus a color indication)

Low Sample Cases
NS [T]            | NoSample [T]                  | Determined by vendor (with [T] plus an NS inside color indication)
0.85* [T]         | Good* [T]                     | Determined by vendor (with [T] plus an * inside color indication)

Table 6. Rules for Apdex Quality Ratings

Note: An addendum may specify alternative colors [TBD: and/or value ranges?] to be applied when reporting Apdex scores for a particular measurement domain.
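
A sketch of the rating lookup implied by Table 6, assuming scores are already rounded to the two decimal places the spec requires; the names are mine, and color selection remains with the vendor:

```python
# Lower bound of each quality rating band in Table 6, highest first.
RATING_BANDS = [
    (0.94, "Excellent"),
    (0.85, "Good"),
    (0.70, "Fair"),
    (0.50, "Poor"),
    (0.00, "Unacceptable"),  # may also be written UNAX
]

def quality_rating(score: float) -> str:
    """Map an Apdex value in [0.00, 1.00] to its Table 6 rating word."""
    for lower_bound, word in RATING_BANDS:
        if score >= lower_bound:
            return word
    raise ValueError("Apdex values cannot be negative")

print(quality_rating(0.95))  # Excellent
print(quality_rating(0.90))  # Good
```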

As usual, all these proposals are open for public discussion. Please use the comment form below to contribute any comments, suggestions, or questions.

References

CLIF08
A KPI is not always an average, ratio or percentage – sometimes raw numbers are better, Brian Clifton, 2008. [Measuring Success, June 2008]
FEW04
Elegance Through Simplicity, Stephen Few, 2004. [Perceptual Edge, October 16, 2004]
FEW10
Perceptual Edge, Stephen Few, 2010. [Perceptual Edge website]
GUNT09
The Apdex Index Revealed, Neil J. Gunther, February 2009. [CMG MeasureIT]
PETE06
The Big Book of Key Performance Indicators, Eric T. Peterson, 2006. [1.1Mb pdf]
TUFT03
Executive Dashboards, Edward Tufte and others, 2003-2009. [Discussion thread on “Ask E.T.”]
ZILI08
The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives, Stephen T. Ziliak and Deirdre N. McCloskey, University of Michigan Press, 2008. [Amazon Books]
ZILI09
The Cult of Statistical Significance, Stephen T. Ziliak and Deirdre N. McCloskey, Joint Statistical Meetings, Washington, DC, August 3, 2009. [179Kb pdf]
