The modern Olympic Games have been held for over 100 years, and since 1912 the winning times in the men’s 100-metre freestyle swimming event have shown a decreasing trend. Studying how Olympic swimming times have progressed makes it possible to assess the potential of the human body, which in turn reflects how society has evolved and how it may continue to evolve in the future.
The results presented in this report come from an investigation into the change in the winning time for the men’s Olympic 100-metre freestyle over the last century (1912 to 2012), which is then used to predict the winning time in 2112.
To achieve this, the data provided was graphed, followed by the development of a regression equation and the presentation of a regression line using a trendline function. To evaluate the linearity of the relationship, residuals were calculated and plotted to allow for analysis. The sample of data is provided below.
Year Time (in seconds):
The sample of 16 winning times provided was deemed an appropriate size for investigating the relationship between the winning Olympic time for the men’s 100-metre freestyle event and the year in which it was achieved. This allowed a reasonable prediction to be made of the winning time for the year 2112. The suitability of the sample size was confirmed using the general rule for effective sample size: sample size ≈ √(population size).
While it was concluded that approximately 5 (i.e. √25) Olympic times would be adequate for this analysis, 16 data points were included to allow for more variety within the dataset, given that the men’s 100-metre freestyle event has been held at every modern Summer Olympics since 1896 (The International Olympic Committee, 2019).
The variables allocated for this investigation are as follows:
Prior to the analysis of data, the following observations and assumptions were considered:
Relevant observations are as follows:
Changes to male competitors’ swimwear (the transition away from full-body suits, which generally caused passive drag); changes to pool design, including pool and lane width modifications to eliminate currents and energy-absorbing lane barriers to stop waves from adjacent lanes; the emergence of the standard 50 m pool at the 1924 Paris Games; and the introduction of diving blocks at the 1936 Games (West, 2019). It is also necessary to acknowledge the evolution of timing and recording technology as a contributing factor.
The EXCEL 2016 spreadsheet program was used during the investigation process to formulate graphs, confirm the regression equation and coefficient of determination, and to calculate residuals and plot them accordingly. A Casio fx-82AU PLUS II calculator was also used to calculate statistical measures, which were required in the development of the least-squares regression equation and in order to make a prediction.
The data from Table 1 is represented in Graph 1 below, using a scatterplot to identify a possible association between the year a winning Olympic time is achieved and the winning time (in seconds).
Graph 1: Scatterplot of winning Olympic times against calendar years (1912-2012)
From Graph 1, it is apparent that there is a strong association between the two variables. A strong negative linear relationship was identified as a plausible classification. Based on this conclusion, a linear regression equation was developed using the least-squares method of regression.
The general form of the least-squares regression line is given by:
$$y = a + bx$$
where $b = r\,\dfrac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$, given that $r$ is Pearson’s correlation coefficient, $s_x$ and $s_y$ are the sample standard deviations, and $\bar{x}$ and $\bar{y}$ are the sample means (averages).
To determine the correlation coefficient necessary to find the value of b, the general formula (for n values) was used:
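$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
where $n = 16$, $x_i$ is the $i$-th $x$ data value, $\bar{x} = 1966.5$, $y_i$ is the $i$-th $y$ data value, $\bar{y} = 53.3$, $s_x = 32.59038713$ and $s_y = 5.202350751$ (as determined using the Casio calculator functions).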
Refer to Appendix 1 for statistical measures used and Appendix 3 for complete calculations and solutions.
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$
$r \approx -0.98$ (correct to two decimal places)
Check using the correlation coefficient (r) function on the Casio calculator: -0.98 (correct to two decimal places).
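The correlation formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used in the report (which relied on a calculator and spreadsheet); the function name `pearson_r` and the toy values are assumptions made here, and the toy values are not the Olympic data.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of products of standardised deviations, over n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical toy values (NOT the Olympic data): a perfectly linear, decreasing set.
toy_x = [1, 2, 3, 4]
toy_y = [8.0, 6.0, 4.0, 2.0]
print(round(pearson_r(toy_x, toy_y), 2))  # -1.0
```

A perfectly linear decreasing set gives r = -1, the extreme case of the strong negative association reported here.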
∴ The gradient of the regression line is $b = -0.1566407231$, implying a negative slope.
$$a = \bar{y} - b\bar{x}$$
$$= 53.3 + 0.1566407231 \times 1966.5$$
$$\approx 361.33$$
∴ The linear regression equation developed for the data is:
$y \approx 361.33 - 0.16x$ (coefficients correct to two decimal places).
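As a rough check of the gradient and intercept, the calculation can be reproduced from the summary statistics quoted in the report. One assumption is made here: because only a two-decimal value of r is quoted, r is recovered from the reported coefficient of determination (0.9629), with a negative sign because the times decrease over the years.

```python
from math import sqrt

# Summary statistics quoted in the report.
x_bar, y_bar = 1966.5, 53.3
s_x, s_y = 32.59038713, 5.202350751

# Assumption: r is recovered from the reported R^2 = 0.9629, taken as
# negative because the winning times fall as the year increases.
r = -sqrt(0.9629)

b = r * s_y / s_x      # gradient of the least-squares line
a = y_bar - b * x_bar  # y-intercept

print(f"b = {b:.4f}, a = {a:.2f}")  # b = -0.1566, a = 361.33
```

The small difference from the reported gradient in later decimal places comes from the rounding of the R^2 value used as input.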
From the gradient (b) in the equation above, it was deduced that the winning time for the men’s 100-metre freestyle is expected to decrease by approximately 0.16 seconds each year (correct to two decimal places). In the context of the study, the winning time would be expected to decrease by about 0.64 seconds (0.16 × 4) at each Olympic Games, which are held every four years. The y-intercept (0, 361.33) implies that in the year 0 the winning time would have been 361.33 seconds, which is of course highly unreasonable.
Using the trendline function available in the spreadsheet program EXCEL, the regression line was added to the initial scatterplot (see Graph 2). The equation of the line, as determined by the EXCEL function, confirmed the regression constants calculated using the algebraic formulas.
The distribution of data points followed the trend closely. The calculated correlation coefficient of -0.98 (correct to two decimal places) and the coefficient of determination ($R^2$) of 0.9629 were also confirmed using the relevant calculator and EXCEL functions.
Pearson’s correlation coefficient (r) is a statistical measure of the strength and direction of the association between two quantitative variables (University of the West of England, 2019). When calculating a correlation coefficient, there is an underlying assumption that the relationship between the two variables is linear. Furthermore, a correlation coefficient of one or negative one describes a perfect linear association, while a value of zero implies there is no linear association. The calculated r value of -0.98 (correct to two decimal places) supports the initial assumption that there is a strong negative association between the two variables.
The coefficient of determination, denoted $R^2$, describes how well a linear regression model predicts an outcome. More specifically, $R^2$ indicates the proportion of the variance in the response variable (y) that is explained by the linear regression on the explanatory variable (x) (Enders, 2019). The closer $R^2$ is to one, the stronger the linear relationship between the two variables. In this case, the winning times for the men’s 100-metre freestyle have an $R^2$ value of 0.9629, indicating that 96.29% of the variation in time (seconds) is explained by the variation in the year, which is a very strong correlation.
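To illustrate what "proportion of variation explained" means, the sketch below computes the coefficient of determination as one minus the ratio of residual variation to total variation. The helper names `fit_line` and `r_squared` and the toy values are assumptions for illustration; the toy values are not the Olympic data.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and gradient b for y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

def r_squared(xs, ys):
    """Proportion of the variation in y explained by the fitted line."""
    a, b = fit_line(xs, ys)
    y_bar = mean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total
    return 1 - ss_res / ss_tot

# Hypothetical toy values (NOT the Olympic data).
toy_x = [1, 2, 3, 4, 5]
toy_y = [10.0, 8.5, 8.0, 6.0, 5.5]
print(round(r_squared(toy_x, toy_y), 4))
```

For a simple linear regression with one explanatory variable, this quantity equals the square of Pearson's r, which is why an r of about -0.98 corresponds to an R^2 of about 0.96.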
Using the least-squares regression equation, a prediction was made of the winning Olympic time for the men’s 100-metre freestyle in 100 years’ time.
Substitute the year 2112 for x to determine the predicted winning time (y), in seconds.
It is predicted that in 2112 the winning time for the men’s 100-metre freestyle will be 30.50 seconds. However, this prediction is somewhat unrealistic, as the times would be expected to plateau after a certain point given the physiological limitations of the human body.
In performing the regression analysis, a strong negative linear association was identified between the year in which a winning Olympic time was achieved and the time (in seconds). With most data points falling close to the regression line, it is plausible to assume a strong relationship between the two variables. However, to confirm this with confidence, a residual plot is recommended, as it is considered a more reliable method of checking whether the relationship between two variables is linear.
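The substitution can be sketched as follows, using the unrounded gradient and the sample means quoted earlier in the report. The function name `predict_time` is an assumption made for this illustration.

```python
# Values quoted in the report: the gradient b and the sample means.
b = -0.1566407231
x_bar, y_bar = 1966.5, 53.3
a = y_bar - b * x_bar  # intercept, about 361.33

def predict_time(year):
    """Predicted winning time (seconds) from the fitted line y = a + b*x."""
    return a + b * year

print(round(predict_time(2112), 1))  # 30.5
```

Note that the unrounded coefficients matter here: substituting the rounded equation (361.33 - 0.16 × 2112) would give about 23.4 seconds, because a small rounding error in the gradient is multiplied by a large year value.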
The residual value is the vertical distance between an actual value of the response variable, y_i, and the value predicted by the least-squares regression equation. These residuals, calculated from the available data, are treated as estimates of the model error (Anderson & Sweeney, 2019). Residuals can take on either positive or negative values, revealing if the actual value is greater than or less than the predicted value. In this instance, the residual value represents the number of seconds the winning Olympic time is greater than or less than the predicted winning times, determined by the regression equation.
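The general formula for determining residual values is given by:
$$\text{residual} = y_i - \hat{y}_i$$
where $y_i$ is the actual value of the response variable and $\hat{y}_i$ is the value predicted by the least-squares regression equation.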
1.56 (correct to two decimal places).
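A residual calculation of this kind can be sketched as below, using the gradient and sample means quoted in the report. The input years and times are hypothetical values chosen only to show a positive and a negative residual (neither year was an Olympic year), and the function name `residuals` is an assumption.

```python
# Gradient and sample means as quoted in the report.
b = -0.1566407231
x_bar, y_bar = 1966.5, 53.3
a = y_bar - b * x_bar  # intercept, about 361.33

def residuals(years, times):
    """Residual = actual time minus the time predicted by y = a + b*x."""
    return [t - (a + b * x) for x, t in zip(years, times)]

# Hypothetical inputs for illustration only (NOT the Olympic data).
example_years = [1950, 1990]
example_times = [56.0, 49.0]
for year, res in zip(example_years, residuals(example_years, example_times)):
    side = "slower" if res > 0 else "faster"
    print(f"{year}: residual {res:+.2f} s ({side} than predicted)")
```

Plotting such residuals against the year, rather than inspecting them one at a time, is what reveals whether they scatter randomly or follow a pattern.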
The plot of the residuals against the explanatory variable is used to evaluate the appropriateness of fitting a linear model to the data in a regression analysis. If the points are randomly dispersed about the horizontal axis (x-axis), the relationship is likely linear (Jones, Evans, Lipson, & Staggard, 2019). From Graph 3, it can be observed that the points are scattered in a cyclical fashion, which suggests that the association between the two variables is non-linear. While there is some random behavior, there is a clearly identifiable curve in the scatter of the data. This appears to contradict the large r and $R^2$ values calculated, which imply a strong linear relationship.
In trying to explain the cyclical pattern observed in the residual plot (see Graph 3), the initial assumptions and observations were revisited. In particular, the assumption that all athletes were of relatively similar age, height and weight may have had the greatest impact.
It may also be reasonable to question whether the calendar year directly affects the winning time, or whether other (confounding) variables contribute more to the change in winning times over time.
A strength of the model is the relatively large sample of winning Olympic times used. Another strength of the study is that the least-squares regression line could be calculated using the spreadsheet program EXCEL 2016, which provides an accurate value.
The study, however, was limited to the calendar years provided. Other variables excluded from this investigation, such as anthropometric, biological and physiological variables, could have been considered. The linear model is also limited in terms of extrapolation. The accuracy of the prediction is questionable, as the model will eventually reach a point where the predicted time is impossible to achieve (i.e. zero or negative times). Once the regression line crosses the x-axis, where the predicted time is zero, all subsequent predictions of winning Olympic times become unreasonable.
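The extrapolation limit can be made concrete by solving a + bx = 0 for x. The sketch below uses the gradient and sample means quoted in the report; the variable name `zero_year` is an assumption, and the result is only as reliable as the linear model itself.

```python
# Gradient and sample means as quoted in the report.
b = -0.1566407231
x_bar, y_bar = 1966.5, 53.3
a = y_bar - b * x_bar  # intercept, about 361.33

# Year at which the fitted line predicts a winning time of zero seconds.
zero_year = -a / b
print(round(zero_year))  # 2307
```

Under this model, any prediction beyond roughly the year 2307 would be a negative time, although the predictions stop being physically plausible long before that point.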
This report investigated the relationship between the winning Olympic time (in seconds) and the calendar year in which it was achieved. The strength of this relationship is indicated by the negative linear association, as demonstrated by the linear regression and the high correlation coefficient (r) of -0.98. However, given the cyclical pattern of the residuals and the limitations of extrapolation, it is not possible to draw a reasonable conclusion about whether the winning Olympic time is entirely dependent on the year it was achieved. Therefore, the predicted winning time of 30.50 seconds for the men’s 100-metre freestyle in 100 years’ time may be considered unreasonable.