(Top) Refitted experimental assemblages: A) and B) Chert; C) and D) Basalt; E) and F) Quartzite (Scale: 2 cm)

(2nd from Top) Attributes of stone tools encoded in the blind analysis.

(3rd from Top) Results of the Kappa Test for Inter-analyst variation; A) Assemblage level; B) Raw material level. Dotted line: significant k-value (0.41)

(Bottom) Results of the Kappa Test for Closeness of Fit; A) Chert; B) Basalt; C) Quartzite; D) Assemblage level. Dotted line: significant k-value (0.41)

The effect of raw material on inter-analyst variation and analyst accuracy for lithic analysis: a case study from Olduvai Gorge

Tomos Proffitt and Ignacio de la Torre



This study aims to understand what effect different raw material types have on modern technological analyses of lithic assemblages, in terms of inter-analyst variation and analyst accuracy. This is done through a series of blind analysis tests undertaken on experimentally derived assemblages of cores and flakes. Novelties of our approach include the introduction of refit studies as a method to assess analyst accuracy, and the use of statistical tests specifically designed to address inter-analyst variability, common in other disciplines but rarely used in archaeology. The experimental assemblages were produced from raw materials collected at Olduvai Gorge, an archaeological sequence that has been a source for studies of early human technology for several decades, and where re-analyses of the same assemblages have usually offered different interpretations. The results of the blind analyses are compared to the true technological values obtained through full refit analysis of the experimental material, and suggest that there is a significant difference in inter-analyst variability as well as accuracy related to different raw materials. Our paper highlights the interpretative problems posed by difficult-to-analyze raw materials such as quartzite, and stresses the subjectivity present in stone-tool technological studies, which may help to explain differences in the interpretation of Early Stone Age lithic assemblages.


The implementation of lithic technological studies requires the correct identification of numerous common technological characteristics and markers located on lithic material.

Nonetheless, to date surprisingly few studies have attempted to identify the level of inter-analyst accuracy associated with the identification of technological characteristics of lithic assemblages.

This paper reports on an investigation designed to identify how different raw materials available at Olduvai Gorge may affect an analyst’s ability to correctly undertake a technological lithic analysis.

Material and Methods:

Three of the primary raw materials documented in the Olduvai Beds I and II assemblages are quartzite, basalt and chert.

In total, six separate experimental assemblages were produced, two of each raw material (blocks of quartzite, nodules of chert and basalt cobbles). The target of each knapping sequence was to produce core and flake assemblages which were comparable to classic Oldowan assemblages.

Four lithic analysts (BAs) assessed the entire experimental assemblages; these analysts possessed varying levels of experience in lithic analysis, ranging from over 10 years’ experience (BA1 and BA3) to between 3 and 10 years’ experience (BA2 and BA4).

True values were retrieved through the complete refitting of the original nodules/blocks/cobbles, which were then deconstructed again in order to encode the relevant attributes. These became the gold standard against which each analyst’s classification was compared.

The Kappa test returns a value that usually falls between 0 and 1 (negative values are possible when agreement is worse than chance). A value of 0 indicates no agreement beyond what one would expect from chance, and 1 indicates total agreement between analysts.
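To make the statistic concrete, here is a minimal sketch of a multi-rater kappa. The excerpt does not say which kappa variant the authors used; with four analysts rating the same pieces, Fleiss' kappa is one common choice, so it is assumed here. The function name, the category labels and the toy data are all hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per analyst).

    Assumption: every item is rated by the same number of analysts.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})

    # How many analysts chose each category, per item.
    counts = [Counter(item) for item in ratings]

    # Share of all assignments that went to each category.
    p_cat = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]

    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance alone.
    p_expected = sum(p ** 2 for p in p_cat)
    if p_expected == 1:
        return 1.0  # only one category was ever used
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: four analysts record dorsal cortex on four flakes.
toy_ratings = [
    ["cortex", "cortex", "cortex", "cortex"],
    ["none", "none", "none", "none"],
    ["cortex", "cortex", "none", "none"],
    ["cortex", "cortex", "none", "none"],
]
kappa = fleiss_kappa(toy_ratings)
```

In this toy case the analysts agree fully on two flakes and split evenly on the other two, which gives a kappa of about 0.33, below the 0.41 threshold quoted in the text.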

The term ‘closeness of fit’ was used to describe the level of accuracy of each analyst to the true values derived from the refitting, i.e. the closeness of the analyst’s results to the refit data.
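Closeness of fit can be pictured as a two-way comparison between one analyst and the refit data. The sketch below assumes Cohen's kappa for that comparison, which the excerpt does not confirm; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(analyst, refit):
    """Cohen's kappa between one analyst's labels and the refit-derived labels."""
    if len(analyst) != len(refit):
        raise ValueError("both label lists must cover the same pieces")
    n = len(analyst)

    # Proportion of pieces on which the analyst matches the refit value.
    observed = sum(a == r for a, r in zip(analyst, refit)) / n

    # Agreement expected by chance, from each side's label frequencies.
    a_counts = Counter(analyst)
    r_counts = Counter(refit)
    labels = set(a_counts) | set(r_counts)
    expected = sum(a_counts[lab] * r_counts[lab] for lab in labels) / n ** 2

    if expected == 1:
        return 1.0  # both sides used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data: refit values versus one analyst's reading of four flakes.
refit_values = ["cortex", "cortex", "none", "none"]
analyst_values = ["cortex", "none", "none", "none"]
fit = cohen_kappa(analyst_values, refit_values)
```

Here the analyst matches the refit data on three of four flakes (75% raw agreement), but because half of that agreement is expected by chance, the kappa is 0.5.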


Quartzite consistently produced very low levels of inter-analyst agreement, with all attributes gaining a k-value below the significance threshold of 0.41.

When all raw materials were grouped together, an average overall k-value of 0.420 was produced. The attribute which produced the highest level of inter-analyst variation was the identification and quantification of dorsal surface extractions (k-value: 0.344; p-value: <0.001) (Figure 3A). The attributes which elicited the lowest level of inter-analyst variation were dorsal surface cortex (k-value: 0.486; p-value: <0.001), striking platform cortex (k-value: 0.486; p-value: <0.001), and subsequently Toth’s flake classification (k-value: 0.477; p-value: <0.001) (Table 1).

Considering quartzite, basalt and chert together, the average k-values ranged from 0.345 to 0.538 (Table 3) (Figure 4D). The highest level of accuracy was achieved in the identification of dorsal surface cortex (k-value: 0.538, p-value: <0.001), with the lowest level of accuracy in the assessment of knapping accidents (k-value: 0.345, p-value: <0.001).

The quantification of dorsal surface extractions produced the lowest level of closeness to the refit values (average k-value: 0.320) (Table 5), whilst dorsal surface cortex gained the highest level of agreement to the refit data (average k-value: 0.525).
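The k-values quoted above can be placed into descriptive agreement bands. The sketch below assumes the widely used Landis and Koch (1977) scale, on which "moderate" agreement begins at 0.41; the excerpt does not state which scale the authors applied, and the function name is hypothetical.

```python
SIGNIFICANT_K = 0.41  # threshold quoted in the text


def agreement_band(k):
    """Map a kappa value to a descriptive band (assumed Landis and Koch scale)."""
    if k < 0:
        return "poor"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"


# k-values reported in the text, keyed by a short description.
reported = {
    "inter-analyst, dorsal surface extractions": 0.344,
    "inter-analyst, dorsal surface cortex": 0.486,
    "accuracy, dorsal surface cortex": 0.538,
    "closeness of fit, dorsal surface extractions": 0.320,
}
bands = {name: agreement_band(k) for name, k in reported.items()}
above_threshold = {name: k >= SIGNIFICANT_K for name, k in reported.items()}
```

On this assumed scale, even the best assemblage-level values reported here sit in the "moderate" band, while dorsal surface extractions fall in the "fair" band.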


In terms of the individual attributes, only four of them (striking platform cortex, striking platform facets, dorsal surface cortex, Toth’s flake category) gained a significant level of agreement. Such a low level of agreement between all analysts suggests that raw material may be affecting the classification.

(Regarding chert) However, attention should be drawn to the fact that the level of agreement between all analysts still only falls within the moderate and the lower end of the substantial agreement levels. This suggests that even for good-quality raw materials, a wide range of inter-analyst variability may still exist.

The cortex on basalt cobbles from Olduvai is smooth and rounded due to fluvial action, and therefore relatively easy to recognise. One is led to ask, however, whether an analysis of a more weathered basalt assemblage (as found in archaeological contexts), where the distinction between the clearly rounded cortex and the flaked surfaces is far less obvious, would affect the level of inter-analyst variability.

The highest level of inter-analyst variation occurred in the quartzite assemblages, with no attributes gaining a significant level of inter-analyst agreement (see Figure 3).

Many lithic analyses are conducted by analysts of varying experience levels and on different raw materials; the high level of inter-analyst variation seen in this study may therefore well be mirrored in archaeological studies.

In this regard, the results gained from the assessment of analyst accuracy can be seen as somewhat troubling. Comparison of the blind test results with the true values of the refit data shows that raw material is a substantial contributing factor to analyst accuracy.

In striking contrast to chert and basalt, no analyst gained a significant level of accuracy for quartzite, with all attributes falling within the poor and fair agreement categories.

Recent work suggests that hominins’ abilities to overcome substantial knapping accidents are a sign of a competent understanding of flaking mechanics and of manual dexterity (Delagnes and Roche 2005). Results from this study indicate that raw material type will adversely affect the ability to identify these attributes. Not only is the identification of knapping accidents one of the attributes which produced some of the highest inter-analyst variation, but it was also consistently less accurately identified at an assemblage-wide level, particularly in the case of quartzite.

Furthermore, the assessment of directionality of dorsal surface scars is troubling, as chert (the finest-grained raw material) produced the lowest average accuracy. This is important for the wider understanding of lithic analysis: it needs to be better recognised that even when dealing with fine-grained, ‘easy to analyse’ raw materials, analytical misidentifications may affect subsequent interpretation.


The results show substantial issues associated with correctly analyzing Early Stone Age assemblages from Olduvai Gorge. The significantly low level of accuracy associated with the interpretation of quartzite experimental assemblages would almost certainly have a detrimental effect on the analysis of an archaeological assemblage of which quartzite is a component. Consequently, archaeological and behavioural interpretations based on such lithic assemblages will be affected. Additionally, chert and basalt, supposedly easier raw materials to study, exhibited results which could by no means be described as ideal, with substantial errors of analysis occurring frequently in both raw materials.

Each raw material tested in this study affected to different degrees both inter-analyst variability and accuracy to the true values; chert and basalt produced significantly better results in both domains. However, not one raw material produced results which could be categorised as sufficiently close to the true values, both for inter-analyst variation and the two measures of accuracy. Although producing better results than basalt and quartzite, chert did not come close to achieving complete inter-analyst agreement, nor did it produce entirely true results when dealing with accuracy and closeness of fit.

The results presented here highlight that data derived from lithic analysis, be it from fine-grained chert or coarse-grained quartzite, are subject to a considerable degree of error and uncertainty, which should be acknowledged and taken into account when putting forward interpretations and conclusions about human behaviour in Early Stone Age assemblages.

***This is one of those papers that will (should) just stick in your head when you read about interpretations derived from lithic analysis.