Title: Statistical Pattern Recognition and daTing using Archaeological Artefacts assemblageS
Description: Statistical pattern recognition and dating using archaeological artefacts assemblages. Package of statistical tools for archaeology. hclustcompro (perioclust): Bellanger Lise, Coulon Arthur, Husi Philippe (2021, ISBN:978-3-030-60103-4). mapclust: Bellanger Lise, Coulon Arthur, Husi Philippe (2021) <doi:10.1016/j.jas.2021.105431>. seriograph: Desachy Bruno (2004) <doi:10.3406/pica.2004.2396>. cerardat: Bellanger Lise, Husi Philippe (2012) <doi:10.1016/j.jas.2011.06.031>.
Authors: Arthur Coulon [aut, cre], Lise Bellanger [aut], Philippe Husi [aut], Bruno Desachy [ctb], Benjamin Martineau [ctb]
Maintainer: Arthur Coulon <[email protected]>
License: GPL (>= 2)
Version: 1.2.4
Built: 2025-03-05 05:49:23 UTC
Source: https://github.com/arliph/spartaas
Statistical Pattern Recognition and daTing using Archaeological Artefacts assemblageS.
Bellanger L., Coulon A., Husi P., 2020 – Perioclust: a new Hierarchical agglomerative clustering method including temporal or spatial ordering constraints. Springer Series, Studies in Classification, Data Analysis, and Knowledge Organization. <doi: 10.1007/978-3-030-60104-1>
Bellanger L., Husi P., Laghzali Y. (2015). Spatial statistic analysis of dating using pottery: an aid to the characterization of cultural areas in West Central France. In: Traviglia A. ed., Across Space and Time, Proceedings of the 41st International Conference on Computer Applications and Quantitative Methods in Archaeology (CAA-2013), Perth (Australia), Amsterdam University Press: 276-282.
From the data of a network, we build a contiguity matrix. Based on this matrix, we generate a dissimilarity matrix. The matrix contains only 0s and 1s: 1 if there is no relationship between two elements and 0 if there is one. The network object is a data frame with two columns. The first column contains the elements of the network and the second column contains, for each element, the list of all other elements related to it. The list is a string consisting of the names of the elements separated by commas (see example).
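For illustration, a minimal network in the format described above (the element names and column names here are invented for the example):

library(SPARTAAS)
# toy network: each row gives an element and a comma-separated
# string of the elements related to it
network <- data.frame(
  nodes = c("A", "B", "C"),
  edges = c("B,C", "A", "A")
)
D <- adjacency(network)
D  # 0 where two elements are related, 1 where they are not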
adjacency(network)
network |
Data frame with 2 columns. The first contains all elements (nodes) and the second a string with all related elements (links). |
D |
Dissimilarity matrix based on adjacency. |
A. COULON
L. BELLANGER
P. HUSI
library(SPARTAAS)
data(datangkor)

## network stratigraphic data (Network)
network <- datangkor$stratigraphy

dissimilarity <- adjacency(network)
dissimilarity
Rounds half away from zero: when the digit to be dropped is a 5, the value is always rounded up in absolute value (unlike base round, which rounds half to even).
arrondi(x, acc)
x |
The number to round. |
acc |
Accuracy (number of digits). A negative number of digits means rounding to a power of ten, e.g. arrondi(x, -2) will round to the nearest hundred. |
res |
Value or vector of values rounded. |
A. COULON
L. BELLANGER
P. HUSI
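The rule can be sketched as follows; this is an illustration of the behaviour, not the package's internal code:

# illustrative re-implementation of the "round half away from zero" rule
arrondi_sketch <- function(x, acc) {
  f <- 10^acc
  sign(x) * floor(abs(x) * f + 0.5) / f
}
arrondi_sketch(c(15, 25, 35), -1)  # 20 30 40 (base round gives 20 20 40)
arrondi_sketch(1.125, 2)           # 1.13    (base round gives 1.12)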
library(SPARTAAS)

x1 <- c(15, 25, 35, 45, 55)
round(x1, -1)
arrondi(x1, -1)

x2 <- c(-15, -25, -35, -45, -55)
round(x2, -1)
arrondi(x2, -1)

x3 <- 1.125
round(x3, 2)
arrondi(x3, 2)

x4 <- seq(-0.55, 0.55, 0.1)
data.frame(
  val = x4,
  round = round(x4, 1),
  arrondi = arrondi(x4, 1),
  equal = (arrondi(x4, 1) == round(x4, 1))
)
Perform a correspondence analysis on a contingency table and then return the distance matrix of the coordinates (you can choose the number of axes used to build the distance matrix with the nPC parameter).
CAdist(df, nPC = NULL, graph = TRUE)
df |
Data.frame, matrix or table with the data for the correspondence analysis. |
nPC |
Number of principal components to be retained for the construction of the distance matrix. Must be between 1 and the minimum of ncol - 1 and nrow - 1. Can also be "max" to keep all axes. |
graph |
Logical parameter for plotting the Correspondence Analysis (Axis1, Axis2). |
D |
The distance matrix |
A. COULON
L. BELLANGER
P. HUSI
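Conceptually, CAdist amounts to a correspondence analysis followed by Euclidean distances between the row coordinates on the retained axes. A rough equivalent, assuming the FactoMineR package is available (CAdist handles this internally):

library(FactoMineR)
data(datangkor, package = "SPARTAAS")
ca <- CA(datangkor$contingency, ncp = 11, graph = FALSE)
D  <- dist(ca$row$coord)  # Euclidean distances on the CA row coordinates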
library(SPARTAAS)
data(datangkor)

## contingency table
cont <- datangkor$contingency

distance <- CAdist(cont, nPC = "max")
distance

## run without printing the plot
distance <- CAdist(cont, nPC = "max", graph = FALSE)
The methodology is based on a statistical and visual approach using two estimated density curves to date each archaeological context. The statistical procedure consists of two steps, each leading to the construction of a density curve. The first estimates a date corresponding to the terminus post quem of the context, a cursor reflecting an event dated in calendar time. This statistical tool allows the archaeologist to easily visualise and analyse chronological patterns.
cerardat(df, row.sup, date, nf = NULL, confidence = 0.95, graph = T)
df |
The data (data.frame) is a contingency table with the observations in the rows and the technical groups in the columns. |
row.sup |
Index of supplementary rows in df (vector). |
date |
The dates of each observation or NA (vector). |
nf |
an integer representing the number of axes retained in the correspondence analysis. If NULL, it is automatically chosen to keep a number corresponding to at least 60% of the inertia. |
confidence |
The desired confidence interval (0.95 for 95%). |
graph |
logical to display the plots or not. |
The corpus data is a contingency table with the observations in the rows and the technical groups in the columns. There are two types of observations: the reference corpus observations and the supplementary observations. The supplementary rows (observations) are identified by the argument 'row.sup'.
Step 1: modelling events dated in calendar time (dateEv)
This step involves estimating the date of an event recorded in the ground (an archaeological context, for the archaeologist) from the pottery assemblage of which it is composed, by fitting a regression model that relates a known date in calendar time, such as the date of issue of a coin, to its pottery profile. The reference corpus is used to fit the regression model. The fitted model is then used to calculate a predicted value for contexts not included in the reference corpus, sometimes stratigraphically separated or poorly documented, with a 95% confidence interval for the predicted date.
A correspondence analysis (CA) is carried out to summarise the information in the reference corpus data, and only the first factorial axes are kept. In this way, the contingency table becomes a table of reduced size, an incomplete reconstruction of the data. This principle is used in many factor analysis techniques to reduce the number of explanatory variables in a linear regression model.
After estimating the beta parameters of the model using the classical results of multiple regression analysis and checking that the model fits the data correctly, we can deduce the estimated date of an observation and also predict the date of another observation that has no coins and is therefore not dated.
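A schematic of step 1, using hypothetical objects ref_corpus (the reference contingency table) and known_dates (the coin dates); the actual cerardat implementation differs in its details:

library(FactoMineR)
# ref_corpus: hypothetical contingency table (contexts x technical groups)
# known_dates: hypothetical vector of coin dates, one per reference context
ca  <- CA(ref_corpus, ncp = 5, graph = FALSE)        # keep the first axes
dat <- data.frame(date = known_dates, ca$row$coord)  # reduced predictors
fit <- lm(date ~ ., data = dat)                      # dateEv regression model
predict(fit, interval = "prediction", level = 0.95)  # reference dates with interval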
Step 2: from event time (dateEv) to accumulation time (dateAc)
We use the results of the first step and the properties of the CA to obtain an estimate of the date of each fabric. We can then define the archaeological time represented, dateAc, in other words the accumulation time of a context, as the weighted sum of the fabric dates, the weights being the proportions of MINVC of each fabric in the context. Assuming that the random variables dateEvj are independent, the distribution of the accumulation time of each context can be approximated by a Gaussian mixture. In this way, for each context, we obtain a plurimodal density curve representing the estimated law of accumulation time based on the mixture of normal densities (the dates of each fabric). By definition, the area under the density curve is 1 (i.e. 100%).
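The dateAc density is thus a weighted mixture of normal densities. A minimal sketch for one context, with hypothetical fabric dates mu, standard errors sigma and MINVC proportions w:

mu    <- c(1150, 1260, 1400)  # hypothetical fabric dates
sigma <- c(40, 30, 55)        # hypothetical standard errors
w     <- c(0.5, 0.3, 0.2)     # hypothetical MINVC proportions (sum to 1)
x <- seq(1000, 1600, by = 1)
dateAc_density <- rowSums(mapply(function(m, s, p) p * dnorm(x, m, s),
                                 mu, sigma, w))
plot(x, dateAc_density, type = "l",
     xlab = "calendar year", ylab = "density")  # total area is 1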
date
In order to estimate a date for the context, it is essential to refer to objects that have been dated by another source, such as coins. These contexts were selected on a very strict basis for their chronostratigraphic reliability, level of domestic occupation or enclosures with long urban stratigraphic sequences, thereby minimising any bias associated with the disparity between the date of the coin and that of the context.
prediction |
Estimated date for each archaeological context (event: dateEv and accumulation: dateAc) with confidence interval. The first two columns show the total count and the number of categories per row (only for columns used in the CA). A date column then shows the known dates. Each dateEv and dateAc model has three columns (value, lower bound, upper bound). |
date_gt |
Estimated date for technical groups with confidence interval (used for dateAc). The first column shows the total count per category in the reference data (only for rows used in the CA). |
lm |
Linear model on the components of the correspondence analysis. |
predict_obj_row |
date prediction of archaeological contexts (rows) using predict.lm. |
predict_obj_col |
date prediction of technical groups (columns) using predict.lm. |
cont_gt |
Contingency table of the reference corpus. |
statistical.summary |
Statistical summary of the model: |
obs_ca_eval |
Quality of row representation in the correspondence analysis. |
check_ref |
Plot of estimated dates (with confidence interval) and real dates of reference data. Only when the real date is known. |
check_sup |
Plot of estimated dates (with confidence interval) and real dates of supplementary data. Only when the real date is known. |
Shapiro_Wilks |
Summary of the Shapiro-Wilk test; see shapiro.test. |
Durbin_Watson |
Summary of the Durbin-Watson test. see dwtest. |
Breusch_Pagan |
Summary of the Breusch-Pagan test. see bptest. |
A. COULON
L. BELLANGER
P. HUSI
Bellanger L. and Husi P. (2012) Statistical tool for dating and interpreting archaeological contexts using pottery. Journal of Archaeological Science, Elsevier, 39 (4), pp.777-790. doi:10.1016/j.jas.2011.06.031.
data("datacerardat") resultat = cerardat(df = datacerardat$df, row.sup = datacerardat$row.sup, date = datacerardat$date, nf = NULL, confidence = 0.95, graph = TRUE ) resultat #The Shapiro-Wilks test is used to verify the normality of the residuals. #The Durbin-Watson test checks for first order autocorrelation. #The Breusch-Pagan test checks for heteroscedasticity. #See the first plot plot(resultat, which = 1, col1=rgb(0.93,0.23,0.23,0.5), col2="black", xlim=NULL, ylim=c(0,0.03) ) #See the first ten plots #plot(resultat, # which = 1:10, # col1=rgb(0.93,0.23,0.23,0.5), # col2="black", # xlim=NULL, # ylim=c(0,0.03) # ) #See all plots #plot(resultat, # which = NULL, # col1=rgb(0.93,0.23,0.23,0.5), # col2="black", # xlim=NULL, # ylim=c(0,0.03) # ) #You can extract the plots and find them in the directory : paste0(getwd(),"/figures") #With the 'extract_results' function #extract_results(resultat,width=480, height=480, path="figures")
data("datacerardat") resultat = cerardat(df = datacerardat$df, row.sup = datacerardat$row.sup, date = datacerardat$date, nf = NULL, confidence = 0.95, graph = TRUE ) resultat #The Shapiro-Wilks test is used to verify the normality of the residuals. #The Durbin-Watson test checks for first order autocorrelation. #The Breusch-Pagan test checks for heteroscedasticity. #See the first plot plot(resultat, which = 1, col1=rgb(0.93,0.23,0.23,0.5), col2="black", xlim=NULL, ylim=c(0,0.03) ) #See the first ten plots #plot(resultat, # which = 1:10, # col1=rgb(0.93,0.23,0.23,0.5), # col2="black", # xlim=NULL, # ylim=c(0,0.03) # ) #See all plots #plot(resultat, # which = NULL, # col1=rgb(0.93,0.23,0.23,0.5), # col2="black", # xlim=NULL, # ylim=c(0,0.03) # ) #You can extract the plots and find them in the directory : paste0(getwd(),"/figures") #With the 'extract_results' function #extract_results(resultat,width=480, height=480, path="figures")
Interactive application; see cerardat.
cerardat_app()
No return value
A. COULON
L. BELLANGER
P. HUSI
library(SPARTAAS)
if (interactive()) {
  cerardat_app()
}
Estimate the correct number of axes to keep in the regression model.
cerardat_estim_nf(df, row.sup, date)
df |
The data (data.frame) is a contingency table with the observations in rows and the technical groups in columns. |
row.sup |
Index of supplementary rows in df (vector). |
date |
The dates of each observation or NA (vector). |
nf |
Number of axes to keep (minimal PRESS value) |
MSE |
plot of the Mean Squared Error. |
PRESS |
plot of the PRediction Error Sum Of Squares. |
Pvalue |
plot of p-values from statistical tests on the hypotheses. |
adj.R_sq |
plot of the Coefficient of determination R². |
data |
data frame of MSE, PRESS and R_sq values. |
data_stat |
data frame of the p-values. |
A. COULON
L. BELLANGER
P. HUSI
Bellanger L. and Husi P. (2012) Statistical tool for dating and interpreting archaeological contexts using pottery. Journal of Archaeological Science, Elsevier, 39 (4), pp.777-790. doi:10.1016/j.jas.2011.06.031.
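For intuition, the PRESS statistic minimised here can be computed for any fitted linear model from its hat values; an illustrative helper, not the package's internal code:

press <- function(fit) {
  # leave-one-out prediction error sum of squares, via deletion residuals
  h <- stats::lm.influence(fit)$hat
  sum((stats::resid(fit) / (1 - h))^2)
}
# sketch: fit one lm per candidate number of CA axes and keep the smallest press()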
library(SPARTAAS)
data(datacerardat)

res <- cerardat_estim_nf(datacerardat$df, datacerardat$row.sup, datacerardat$date)

# Number of axes to keep (minimal PRESS value)
res$nf

# the plots
res$MSE
res$PRESS
res$adj.R_sq
Longitude, latitude, number of thyroid cancers. The data concern two French departments (Loire-Atlantique and Vendee) between 1998 and 2012. For confidentiality reasons, the data are simulated.
data("datacancer")
data("datacancer")
List of two objects with 359 observations on the following 3 variables.
$coord (data.frame):
longitude
a numeric vector: geographical coordinate
latitude
a numeric vector: geographical coordinate
$var (vector):
var
a numeric vector: number of thyroid cancers (simulated)
M. Karakachoff (IR CHU - l'institut du Thorax INSERM UMR 1087 - CNRS UMR 6291) Nantes, France
F. Molinie (resp. Loire-Atlantique-Vendee cancer registry), France
library(SPARTAAS)
data(datacancer)
str(datacancer)
head(datacancer$coord)
str(datacancer$var)
This important dataset comes from the Collective Research Project (CRP) on medieval and modern pottery in the Middle Loire Basin. This is a long-term project: it began in 1996 and has already been the subject of two books (Husi dir. 2003 and 2013), as well as an online logicist publication (Husi dir. 2022).
data("dataceram")
data("dataceram")
List of three objects with 226 observations.
(data.frame) Contingency table of the quantities of 183 types of pottery sherds in the 226 sets.
(data.frame) The first column corresponds to the identifier (sets), the second and the third to the lower and upper limits of the temporal range of the observations.
Vector of the geographical area of each observation.
Husi P. (dir.) – La céramique médiévale et moderne du bassin de la Loire moyenne, chrono-typologie et transformation des aires culturelles dans la longue durée (6e —19e s.), XXe Supplément à la Revue Archéologique du Centre de la France, FERACF, Tours, 2022.
library(SPARTAAS)
data(dataceram)
str(dataceram)
str(dataceram$contingency)
head(dataceram$timerange)
head(dataceram$geographic_area)
Data sets on Tours pottery. Contains three objects:
- Pottery sherd contingency table (data.frame, 280 sets and 391 technical groups)
- The date (if known) of the sets (vector)
- Column index for supplementary sets (vector)
data("datacerardat")
data("datacerardat")
A list with 3 objects.
df
Ceramic contingency table. An integer data.frame.
date
The date (if known) of the sets. A numeric vector, with NA for unknown dates.
col.sup
Column index for supplementary sets. A numeric vector.
data(datacerardat)
datacerardat
## maybe str(datacerardat)
The archaeological data come from excavations carried out at Angkor Thom (Cambodia), the capital of the Khmer Empire between the 9th and 15th centuries (Gaucher, 2004). The dataset consists of the pottery assemblages (quantities of the different types of pottery sherds contained in the sets - ..$contingency) and the stratigraphy of the sets from 3 disconnected archaeological sites (..$stratigraphy).
data("datangkor")
data("datangkor")
List of two objects with 17 observations.
(data.frame) Contingency table of the quantities of 12 types of pottery sherds in the 17 sets.
(data.frame) Stores the stratigraphic network. The first column corresponds to the nodes (sets) and the second to the edges, listing the nodes connected to each one.
Gaucher, J. (2004). Angkor Thom, une utopie réalisée ? Structuration de l’espace et modèle indien d’urbanisme dans le Cambodge ancien. Arts Asiatiques, Volume 59, pp. 58-86.
library(SPARTAAS) data(datangkor) str(datangkor) str(datangkor$contingency) str(datangkor$stratigraphy)
Latitude, longitude, absolute difference between two dates and the name of the archaeological site. The data concern the dating of archaeological contexts in West-central France based on a large collection of medieval pottery finds. Two original statistical models are developed to estimate context dates using pottery. The absolute difference is calculated for each context. The data are based on a collective research project on medieval pottery directed by P. Husi ("La céramique médiévale dans la vallée de la Loire moyenne").
data("datarcheo")
data("datarcheo")
List of three objects with 240 observations on the following 4 variables.
$coord (data.frame):
longitude
a numeric vector: geographical coordinate
latitude
a numeric vector: geographical coordinate
$var (vector):
regionalized_var
a numeric vector: difference between two dating models
$label (vector):
noms
a character vector (factor): name of the archaeological site
P. Husi IR CNRS, UMR CITERES-LAT, CNRS/Tours University, France
library(SPARTAAS)
data(datarcheo)
str(datarcheo)
head(datarcheo$coord)
str(datarcheo$var)
levels(datarcheo$label)
Generates all plots (in jpeg format) in a subfolder (relative path from the working directory).
extract_results(cerardat, width=480, height=480, path="figures", col1 = rgb(0.93,0.23,0.23,0.5), col2 = "black", xlim=NULL, ylim=NULL)
cerardat |
a cerardat output. |
width |
the width of the graphics region in pixels. |
height |
the height of the graphics region in pixels. |
path |
directory where plots are exported (relative path from the working directory). Automatically generates two sub-folders 'ref' for the reference plot and 'sup' for the supplementary plot. |
col1 |
Color of the Event curve (color name, hex code, or the rgb function {grDevices} for transparency). |
col2 |
Color of the Accumulation curve (color name, hex code, or the rgb function {grDevices} for transparency). |
xlim |
Two numeric values, specifying the left limit and the right limit of the scale. |
ylim |
Two numeric values, specifying the lower limit and the upper limit of the scale. |
A. COULON
L. BELLANGER
P. HUSI
data("datacerardat") resultat = cerardat(df = datacerardat$df, row.sup = datacerardat$row.sup, date = datacerardat$date, nf = NULL, confidence = 0.95, graph = TRUE ) #Generates all plots (in jpeg format) in a folder 'figures' #You will find the jpeg in 'ref' and 'sup' subfolder #extract_results(resultat,width=480, height=480, path="figures", # col1=grDevices::rgb(0.93,0.23,0.23,0.5), # col2="black", # xlim=NULL, # ylim=c(0,0.03) # ) #You can extract the plots and find them in the directory : paste0(getwd(),"/figures")
data("datacerardat") resultat = cerardat(df = datacerardat$df, row.sup = datacerardat$row.sup, date = datacerardat$date, nf = NULL, confidence = 0.95, graph = TRUE ) #Generates all plots (in jpeg format) in a folder 'figures' #You will find the jpeg in 'ref' and 'sup' subfolder #extract_results(resultat,width=480, height=480, path="figures", # col1=grDevices::rgb(0.93,0.23,0.23,0.5), # col2="black", # xlim=NULL, # ylim=c(0,0.03) # ) #You can extract the plots and find them in the directory : paste0(getwd(),"/figures")
Overload of hclust for dealing with two dissimilarity matrices. Hierarchical cluster analysis on a set of dissimilarities, with methods for analysing it.
hclust(d, method = "complete", members = NULL, d2 = NULL, alpha = NULL)
d |
a dissimilarity structure as produced by dist. |
method |
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
members |
NULL or a vector with length size of d. See the ‘Details’ section. |
d2 |
a second dissimilarity structure as produced by dist. |
alpha |
The mixing parameter in order to generate the D_alpha matrix on which the classical hclust method is applied. Formula: D_alpha = alpha * d + (1-alpha) * d2. |
Data fusion (for the optimal value of the alpha parameter, see hclustcompro_select_alpha). It is necessary to define the appropriate proportion for each data source. This is the first sensitive point of the method that the user has to consider; a tool is provided to support this decision.
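A minimal sketch of the fusion formula with an illustrative alpha = 0.7 (hclustcompro_select_alpha estimates it from the data); the as.matrix/as.dist coercions are precautions, since the exact classes returned by CAdist and adjacency may differ:

library(SPARTAAS)
data(datangkor)
D1 <- as.matrix(CAdist(datangkor$contingency, nPC = "max", graph = FALSE))
D2 <- as.matrix(adjacency(datangkor$stratigraphy))
alpha   <- 0.7                            # illustrative mixing weight
D_alpha <- alpha * D1 + (1 - alpha) * D2  # D_alpha = alpha * d + (1 - alpha) * d2
tree    <- stats::hclust(as.dist(D_alpha), method = "ward.D2")
plot(tree)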
The hclust function is based on Fortran code contributed to STATLIB by F. Murtagh.
A. COULON
L. BELLANGER
P. HUSI
## The function is currently defined as
function (d, method = "complete", members = NULL, d2 = NULL, alpha = NULL)
{
    if (!is.null(d2)) {
        if (!length(d) == length(d2)) {
            stop("d and d2 have not the same size.")
        }
        if (is.null(alpha)) {
            sa <- hclustcompro_select_alpha(d, d2, method = method, resampling = FALSE)
            alpha <- sa$alpha[1]
        }
        alpha <- as.numeric(alpha)
        if (!(alpha > 0 & alpha < 1)) {
            warning("Alpha must be between 0 and 1.")
            sa <- hclustcompro_select_alpha(d, d2, method = method, resampling = FALSE)
            alpha <- sa$alpha[1]
        }
        d <- dist(alpha * d + (1 - alpha) * d2)
    }
    stats::hclust(d, method, members)
}
Compromise hierarchical bottom-up (agglomerative) clustering method.
The method uses two sources of information. The merging of the two data sources is controlled by a mixing parameter (alpha) that weights each source.
hclustcompro( D1, D2, alpha="EstimateAlphaForMe", k=NULL, title="notitle", method="ward.D2", suppl_plot=TRUE )
D1 |
First dissimilarity matrix (square matrix) or distance matrix. Could be a contingency table (see CAdist). A factorial correspondence analysis is performed using the distances (chi-square metric). |
D2 |
Second dissimilarity matrix (square matrix), same size as D1, or distance matrix. |
alpha |
The mixing parameter in order to generate the D_alpha matrix (in [0;1]). Formula: D_alpha = alpha * D1 + (1-alpha) * D2 |
k |
The number of clusters you want. |
title |
The title to be displayed on the dendrogram plot. |
method |
The agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
suppl_plot |
Logical; whether additional plots (WSS and average silhouette) should be displayed. |
HAC
Data fusion (for the optimal value of the alpha parameter, see hclustcompro_select_alpha). It is necessary to define the appropriate proportion for each data source. This is the first sensitive point of the method that the user has to consider; a tool is provided to support this decision.
Cut dendrogram
The division into classes and subclasses is the second crucial point. It has to be done based on the knowledge of the study area and some decision support tools such as the cluster silhouette or the calculation of the intra-cluster variability (WSS: Within Sum of Square). You can use hclustcompro_subdivide to subdivide a cluster into sub-clusters.
The function returns a list (class: hclustcompro_cl).
D1 |
First dissimilarity matrix (square matrix) |
D2 |
Second dissimilarity matrix (square matrix) |
D_alpha |
The matrix used in the HAC, resulting from the mixing of the two matrices (D1 and D2) |
alpha |
Alpha |
tree |
An object of class hclust, describing the tree generated by the clustering process (see hclust) |
cluster |
The cluster number vector of the selected partition |
cutree |
Plot of the cut dendrogram |
call |
How you call the function |
cont |
Original contingency data (if D1 is a contingency table) |
The hclust function is based on Fortran code contributed to STATLIB by F. Murtagh.
A. COULON
L. BELLANGER
P. HUSI
library(SPARTAAS)
data(datangkor)

# network stratigraphic data (Network)
network <- datangkor$stratigraphy
# contingency table
cont <- datangkor$contingency

# obtain the dissimilarity matrices
distance <- CAdist(cont, nPC = 11)
constraint <- adjacency(network)

# You can also run hclustcompro with the dist matrices directly
hclustcompro(D1 = distance, D2 = constraint, alpha = 0.7, k = 4)
See hclustcompro, hclustcompro_select_alpha, seriograph. You can also check the wiki on the application.
hclustcompro_app()
No return value
A. COULON
L. BELLANGER
P. HUSI
library(SPARTAAS)
if (interactive()) {
  hclustcompro_app()
}
Based on a resampling process, we generate clones and compute the CorCrit_alpha criterion for each of them. The function plots the criterion curve for each set of clones.
hclustcompro_detail_resampling(D1, D2 = NULL, acc = 2, method = "ward.D2", iter = 5)
D1 |
First dissimilarity matrix or contingency table (square matrix). |
D2 |
Second dissimilarity matrix or network data (square matrix), the same size as D1. |
acc |
Number of digits after the decimal point for the alpha value. |
method |
The agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
iter |
The number of clones checked for each observation. |
Definition of the criterion:
A criterion for choosing alpha in [0;1] must be determined by balancing the weights between the two information sources in the final classification. To obtain alpha, we define the following criterion:
The CorCrit_alpha criterion in (1) represents the difference in absolute value between two cophenetic correlations (cophenetic correlation is defined as the correlation between two distance matrices; it is calculated by considering the half distance matrices as vectors, and it measures how faithfully a dendrogram preserves the pairwise distances between the original unmodelled data points). The first correlation is associated with the comparison between D1 and the ultrametric distances from the HAC with alpha fixed, while the second compares D2 and the ultrametric distances from the HAC with alpha fixed. Then, in order to compromise between the information provided by D1 and D2, we estimate alpha with hat(alpha), the value of alpha that minimises CorCrit_alpha (2).
Resampling strategy:
To do this, a set of "clones" is created for each observation i. A clone c of observation i is a copy of observation i for which the adjacency relationships to the others have been modified: the clone has no connection except with j. A set is generated by varying j over all observations except i. A HAC is then carried out using the combination defined in (1), with D1(c) an (n+1)x(n+1) matrix where the observations i and c are identical and D2(c) an (n+1)x(n+1) matrix where the clone c of i has neighbourhood relationships different from those of i. We can create a set of n-1 clones for each of the n elements, so n(n-1) clones in total.
Intuitively, by varying alpha between 0 and 1, we can identify when the clone and the initial observation are separated on the dendrogram. This moment corresponds to the value of alpha above which the weight given to the information on the connections between observations contained in D2 has too much impact on the results compared to that of D1.
Let CorCrit_alpha(c) denote the same criterion as in (1), in which D1 and D2 are replaced by D1(c) and D2(c) respectively. For each clone c, hat(alpha)(c) is the value of alpha that minimises CorCrit_alpha(c) (3).
The estimated alpha is the average of the values estimated for each clone: hat(alpha)* is the average of the hat(alpha)(c) (4). In the same spirit as confidence intervals based on bootstrap percentiles (Efron & Tibshirani, 1993), a percentile confidence interval based on replication can also be obtained using the empirical percentiles of the distribution of hat(alpha)(c).
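Written out (these are the same four equations drawn by the example code below, reading the plotmath min as a minimisation over alpha):

\mathrm{CorCrit}_{\alpha} = \left| \mathrm{Cor}(d_{cophenetic}, d_{ceramic}) - \mathrm{Cor}(d_{cophenetic}, d_{stratigraphic}) \right| \quad (1)
\hat{\alpha} = \arg\min_{\alpha} \mathrm{CorCrit}_{\alpha} \quad (2)
\hat{\alpha}^{(c)} = \arg\min_{\alpha} \mathrm{CorCrit}^{(c)}_{\alpha} \quad (3)
\hat{\alpha}^{*} = \frac{1}{n(n-1)} \sum_{c=1}^{n(n-1)} \hat{\alpha}^{(c)} \quad (4)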
plot |
The interactive plot: the CorCrit_alpha criterion for each resampling dataset |
A. COULON
L. BELLANGER
P. HUSI
###################################
#    For viewing the equations    #
###################################
plot(c(.6, .6, .6, .6), c(.9, .5, -.3, -.7),
     xlim = c(.6, 1.4), ylim = c(-1.1, 1),
     axes = FALSE, main = "Equations:",
     xlab = "", ylab = "", pch = 1)
text(.65, .9, "( 1 )")
text(.65, .5, "( 2 )")
text(.65, -.3, "( 3 )")
text(.65, -.7, "( 4 )")
text(1, .9, expression(CorCrit[alpha] == abs(Cor(dist[cophenetic], dist[ceramic]) -
                                             Cor(dist[cophenetic], dist[stratigraphic]))))
text(1, .5, expression(hat(alpha) == min(CorCrit[alpha], alpha)))
text(1, -.3, expression(hat(alpha)^(c) == min(CorCrit[alpha]^(c), alpha)))
text(1, -.7, expression(hat(alpha)^"*" == frac(1, n(n-1)) * sum(hat(alpha)^(c), c == 1, n(n-1))))

#################################
# library(SPARTAAS)
# data(datangkor)
#
# # network stratigraphic data (Network)
# network <- datangkor$stratigraphy
# # contingency table
# cont <- datangkor$contingency
#
# hclustcompro_detail_resampling(D1 = CAdist(cont, nPC = "max"),
#                                D2 = adjacency(network))
Select the mixing parameter alpha. The following criterion "balances" the weight of D1 and D2 in the final clustering. The alpha value is only a point estimate; the confidence interval gives a range of possible values. Based on a resampling process, we generate clones and recalculate the criterion for each of them (see below).
hclustcompro_select_alpha( D1, D2, acc=2, resampling=TRUE, method="ward.D2", iter=5, suppl_plot=TRUE )
D1 |
First dissimilarity matrix or contingency table (square matrix). |
D2 |
Second dissimilarity matrix or network data (square matrix) of the same size as D1. |
acc |
Number of digits after the decimal point for the alpha value. |
resampling |
Logical for estimating the confidence interval with a resampling strategy. If you have a lot of data, you can save computation time by setting this option to FALSE. |
method |
The agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
iter |
The number of clones checked per observation. (200 observations iter=1: ~30 sec, 1000 observations iter=1: ~40 min). |
suppl_plot |
Logical; whether additional plots should be displayed. |
Definition of the criterion:
A criterion for choosing alpha in [0;1] must be determined by balancing the weights between the two sources of information in the final classification. To obtain alpha, we define the criterion CorCrit_alpha given in (1) (the equations are written out under hclustcompro_detail_resampling).
The criterion in (1) represents the difference in absolute value between two cophenetic correlations (cophenetic correlation is defined as the correlation between two distance matrices; it is calculated by considering the half distance matrices as vectors, and it measures how faithfully a dendrogram preserves the pairwise distances between the original unmodelled data points). The first correlation is related to the comparison between D1 and the ultrametric distances of the clustering with alpha fixed, while the second compares D2 and the ultrametric distances of the clustering with alpha fixed. Then, in order to compromise between the information provided by D1 and D2, we estimate alpha with hat(alpha), the value of alpha that minimises CorCrit_alpha (2).
Resampling strategy:
This is done by creating a set of "clones" for each observation i. A clone c of observation i is a copy of observation i for which the distances from the second source have been modified. The modification is made by copying the distances of the second source from another observation j. A clustering is then performed using the combination defined in (1), with D1(c) an (n+1)x(n+1) matrix where the observations i and c are identical and D2(c) an (n+1)x(n+1) matrix where the clone c of i has distances different from those of i. A set of clones is generated by varying j over all observations except i. We can generate a set of n-1 clones for each of the n elements, so n(n-1) clones in total.
Intuitively, by varying alpha between 0 and 1, we can identify when the clone and the original observation are separated on the dendrogram. This moment corresponds to the value of alpha above which the weight given to the information about the connections between observations contained in D2 has too much influence on the results compared to that of D1.
Let CorCrit_alpha(c) define the same criterion as in (1), where D1 and D2 are replaced by D1(c) and D2(c) respectively. For each clone c, hat(alpha)(c) is the value of alpha that minimises CorCrit_alpha(c) (3).
The estimated alpha, hat(alpha)*, is the mean of the hat(alpha)(c) (4). In the same spirit as confidence intervals based on bootstrap percentiles (Efron & Tibshirani, 1993), a percentile confidence interval based on replication can also be obtained using the empirical percentiles of the distribution of hat(alpha)(c).
Warnings:
It is possible to observe an alpha value outside the confidence interval. In some cases, this problem can be solved by increasing the number of iterations or by changing the number of axes used to construct the matrix D1 after the correspondence analysis. If the alpha value remains outside the interval, it means that the data are noisy and the resampling procedure is affected.
The function returns a list (class: selectAlpha_obj).
alpha |
The estimated value of the alpha parameter (min CorCrit_alpha) |
alpha.plot |
The CorCrit curve for alpha between 0 and 1 |
If resampling = TRUE
sd |
The standard deviation |
conf |
The confidence interval of alpha |
boxplot |
The boxplot of alpha estimation with resampling |
values |
All potential alpha values obtained from each clone |
A. COULON
L. BELLANGER
P. HUSI
library(SPARTAAS)
data(datangkor)

# network stratigraphic data (Network)
network <- datangkor$stratigraphy
# contingency table
cont <- datangkor$contingency

dissimilarity <- CAdist(cont, nPC = "max", graph = FALSE)
constraint <- adjacency(network)

hclustcompro_select_alpha(D1 = dissimilarity, D2 = constraint)
hclustcompro_select_alpha(D1 = dissimilarity, D2 = constraint, acc = 3, resampling = TRUE)
Allows the user to split one cluster into sub-clusters. This function only works with 'hclustcompro_cl' objects returned by the hclustcompro function.
hclustcompro_subdivide(hclustcompro_cl,cluster,nb_class)
hclustcompro_cl |
A hclustcompro_cl object |
cluster |
The number of the cluster. Numbered from left to right on the dendrogram (1, 2, ...) |
nb_class |
The number of sub-clusters you want |
hclustcompro_cl |
A new, updated hclustcompro_cl object; see hclustcompro |
A. COULON
L. BELLANGER
P. HUSI
library(SPARTAAS)
data(datangkor)

# network stratigraphic data (Network)
network <- datangkor$stratigraphy
# contingency table
cont <- datangkor$contingency

# obtain the dissimilarity matrices
distance <- CAdist(cont, nPC = 11)
constraint <- adjacency(network)

# number of clusters: 7
clustering <- hclustcompro(D1 = distance, D2 = constraint, alpha = 0.7, k = 7)
# subdivide cluster 5 into 2 sub-clusters
clustering <- hclustcompro_subdivide(clustering, cluster = 5, nb_class = 2)

# subdivide more than one cluster
clustering2 <- hclustcompro(D1 = distance, D2 = constraint, 0.7, k = 7)
clustering2 <- hclustcompro_subdivide(clustering2, cluster = c(5, 7), nb_class = c(2, 2))
This function performs a divisive hierarchical clustering on a regionalised variable using the spatial patches algorithm (Woillez et al. 2007; Woillez, Rivoirard and Petitgas 2009). It is a top-down hierarchical clustering with a geographical constraint. It is possible to cut the tree by clicking on the dendrogram at the desired level. The results include a description of the clusters and graphics. When slicing the dendrogram, you can look at the two plots (WSSPlot and AveSilPlot) that indicate the relative quality of the partitions. The first shows the Within Sum of Square (WSS) for each partition, and you can use the elbow approach to select a partition. The second shows the average silhouette width. This index is between -1 and 1; the closer it is to 1, the better the partition. See silhouette.
If you want to cut the dendrogram at a different dlim or number of clusters, you can do so without re-running mapclust() by using mapclust_cut_tree.
mapclust( coord, var, label = "Nolabel", iter = 20, Plabel = TRUE, lonlat = TRUE, positive_var = FALSE, n = NULL )
coord |
The x (longitude) and y (latitude) coordinates: a data.frame or matrix with 2 columns. |
var |
The regionalised variable(s) of interest |
label |
(optional) The names of the samples or an identifier. Must be a factor. |
iter |
The number of iterations. The number of different dlim you want to test (must be greater than 10). |
Plabel |
Logical parameter to activate or not the printing of labels on the dendrogram. |
lonlat |
Logical parameter to activate or not the cartography in lonlat system with leaflet (base map). |
positive_var |
logical parameter that defines whether your variable of interest is positive or not. |
n |
Number of groups. If NULL, you can select the number of groups by clicking on the dendrogram. |
dlim is the selected minimum distance from a sample to the patch centre, used to identify patches (units are those of the coordinates). dlim is automatically initialised and does not need to be set by the user. The minimum input data is a data frame or matrix with at least 3 columns.
The function returns a list.
Plot:
dendrogram |
The global dendrogram (hclust object) |
dendrogram_ggplot |
The global dendrogram (ggplot2 object) |
cuttree |
The cut dendrogram |
map |
The map of the selected partition |
AveSilPlot |
The average silhouette width plot (for each partition) |
WSSPlot |
The Within Sum of Square plot (for each partition) |
silhouette |
The silhouette plot of the selected partition |
Value:
X |
The x-coordinate data you have used |
Y |
The y-coordinate data you have used |
var |
The regionalized variable(s) data you have used |
label |
The label vector you have used |
density |
The estimated density based on var. Equal to var if you used a unidimensional density variable. |
cluster |
The cluster vector of the selected partition |
Plabel |
Logical parameter to activate or not the printing of labels on the dendrogram |
fullhist |
The compositional cluster for each observation |
hist |
The compositional cluster without duplicates (corresponds to the splits on the dendrogram) |
dlim |
The vector of the different limit distances |
cutdlim |
The selected dlim for the cut of the current partition |
DiMatrix |
The weighted Euclidean distance matrix |
silhouetteData |
The silhouette data of the selected partition |
AveSilData |
The average silhouette value for each partition |
Moran |
The Moran index for each group for each partition |
lonlat |
Logical parameter, whether your coordinates are in latitude-longitude format or not |
A. COULON L. BELLANGER P. HUSI
Bellanger L., Coulon A. and Husi P. (2021) Determination of cultural areas based on medieval pottery using an original divisive hierarchical clustering method with geographical constraint (MapClust), Journal of Archaeological Science, Volume 132 doi:10.1016/j.jas.2021.105431.
Bellanger L., Husi P., Laghzali Y. (2015). Spatial statistic analysis of dating using pottery: an aid to the characterization of cultural areas in West Central France. In: Traviglia A. ed., Across Space and Time, Proceedings of the 41st International Conference on Computer Applications and Quantitative Methods in Archaeology (CAA-2013), Perth (Australia), Amsterdam University Press: 276-282.
Woillez M., Poulard J.C., Rivoirard J., Petitgas P., Bez N. (2007). Indices for capturing spatial patterns and their evolution in time, with application to European hake (Merluccius merluccius) in the Bay of Biscay. ICES J. Mar. Sci. 64, 537-550.
Woillez M., Rivoirard J. and Petitgas P. (2009) Notes on survey-based spatial indicators for monitoring fish populations, Aquatic Living Resources, 22 :155-164.
###################################
## loading data
library(SPARTAAS)
data(datarcheo)
data(datacancer)

###################################
### Example: 1
## Function "mapclust"
# object <- mapclust(coord = ..., var = ..., label = ...)
classification <- mapclust(datarcheo$coord, datarcheo$var, datarcheo$label, n = 4)

# Global dendrogram
classification$dendrogram
# Cut dendrogram
classification$cuttree
# silhouette of the selected partition
classification$silhouette

# You can cut the dendrogram at another dlim
NewCut <- mapclust_cut_tree(classification, dlim = 0.30)

# See the evaluation using silhouette width by running:
NewCut$silhouette
# If the plot is empty, try increasing the height of the window (full screen)
# See a summary of the data by running:
summary(NewCut$silhouetteData)

###################################
## kmeans comparison
# prepare data (only geographical data)
datakmeans <- datarcheo$coord

# kmeans
number_cluster <- 4
cl <- kmeans(datakmeans, number_cluster)
plot(datakmeans, col = cl$cluster)
See mapclust. You can also check the wiki on the application.
mapclust_app()
No return value
A. COULON
L. BELLANGER
P. HUSI
# open the application
# library(SPARTAAS)
# if (interactive()) {
#   mapclust_app()
# }
The function returns the new map, a dendrogram with the cut line, the silhouette widths and the new cluster vector. You must call mapclust beforehand to get a mapclust_cl object that can be used by mapclust_cut_tree.
mapclust_cut_tree(classification, nb_grp = NA, dlim = NA)
classification |
The object returned by mapclust |
nb_grp |
The number of groups you want to have on the partition. Must be an integer. (don't use dlim in this case) |
dlim |
The value of dlim at which you wish to cut the dendrogram. You can enter any value (numerical) and the function will select the nearest lower dlim with the same partition. (do not use nb_grp in this case). |
The function returns a list.
Plot:
dendrogram |
The global dendrogram (hclust object) |
dendrogram_ggplot |
The global dendrogram (ggplot2 object) |
cuttree |
The cut dendrogram |
map |
The map of the selected partition |
AveSilPlot |
The average silhouette width plot (for each partition) |
WSSPlot |
The Within Sum of Square plot (for each partition) |
silhouette |
The silhouette plot of the selected partition |
Value:
X |
The x-coordinate data you have used |
Y |
The y-coordinate data you have used |
var |
The regionalised variable data you have used |
label |
The label vector you have used |
density |
The estimated density based on var. Equal to var if you used a unidimensional density variable |
cluster |
The cluster vector of the selected partition |
Plabel |
Logical parameter to activate or not the printing of labels on the dendrogram |
fullhist |
The compositional cluster for each observation |
hist |
The compositional cluster without duplicates (corresponds to the splits on the dendrogram) |
dlim |
The vector of the different limit distances |
cutdlim |
The selected dlim for the cut of the current partition |
DiMatrix |
The weighted Euclidean distance matrix |
silhouetteData |
The silhouette data of the selected partition |
AveSilData |
The average silhouette value for each partition |
Moran |
The Moran index for each group for each partition |
lonlat |
Logical parameter, whether your coordinates are in latitude-longitude format or not |
A. COULON L. BELLANGER P. HUSI
## loading data
library(SPARTAAS)
data(datarcheo)

## First you need to run the mapclust function.
# This function allows you to obtain a partition.
# object <- mapclust(coord = ..., var = ..., label = ...)
OldClassif <- mapclust(datarcheo$coord, datarcheo$var, datarcheo$label, n = 4)

## To cut the dendrogram at another dlim
NewCut <- mapclust_cut_tree(classification = OldClassif, dlim = 0.37)

## To cut the dendrogram into another number of clusters
NewCut2 <- mapclust_cut_tree(classification = OldClassif, nb_grp = 4)

# See the evaluation using silhouette width by running:
NewCut$silhouette
# If the plot is empty, try increasing the height of the window (full screen).
# See a summary of the data by running:
summary(NewCut$silhouetteData)
The overlap index is the ratio between the internal overlap and the total overlap of two time ranges. We define the total overlap limits as the minimum of the two lower limits and the maximum of the two upper limits, and the internal overlap limits as the maximum of the lower limits and the minimum of the upper limits.
overlap(temporal)
temporal |
A data frame with three columns: the name of the element, the lower limit and the upper limit. |
The lower and upper limits must be numeric.
The dissimilarity between time periods is calculated as the ratio of the overlap of the time periods (distance in the case of disjoint time periods) to the cumulative extent of the two time periods.
As the ratio is bounded between -1 and 1, we add 1 to make it positive and normalise it so that it is between 0 and 1.
This overlap index then needs to be transformed into a dissimilarity index between sets. To do this we use 1 minus the normalised ratio: the dissimilarity is equal to 0 if the two time periods are identical and tends to 1 as they become infinitely different.
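A worked sketch of the index for a single pair of hypothetical time ranges, following the definitions above (overlap computes this for all pairs):

a <- c(1100, 1300)  # hypothetical time range (lower, upper)
b <- c(1250, 1400)
total    <- max(a[2], b[2]) - min(a[1], b[1])  # total overlap extent
internal <- min(a[2], b[2]) - max(a[1], b[1])  # internal overlap (< 0 if disjoint)
ratio <- internal / total       # bounded between -1 and 1
d <- 1 - (ratio + 1) / 2        # dissimilarity in [0, 1]
d  # 0 for identical ranges, approaching 1 for very distant ones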
D |
The dissimilarity matrix based on the overlap index. |
A. COULON
L. BELLANGER
P. HUSI
library(SPARTAAS)
data(dataceram)
dist <- overlap(dataceram$timerange)
Plot a cerardat model.
## S3 method for class 'cerardat_obj' plot(x, which = NULL, col1 = rgb(0.93,0.23,0.23,0.5), col2 = "black", xlim=NULL, ylim=NULL,...)
x |
a cerardat output. |
which |
Vector containing the plots you want (row numbers in the contingency table: numeric). |
col1 |
Color of the Event curve (color name, hex code, rgb function {grDevices} for transparency). |
col2 |
Color of the Accumulation curve (color name, hex code, rgb function {grDevices} for transparency). |
xlim |
Two numeric values, specifying the left limit and the right limit of the scale. |
ylim |
Two numeric values, specifying the lower limit and the upper limit of the scale. |
... |
other parameters to be passed through to plotting functions. |
A. COULON
L. BELLANGER
P. HUSI
See seriograph. You can also check the wiki on the application.
serio_app()
No return value
A. COULON
L. BELLANGER
P. HUSI
library(SPARTAAS)
if (interactive()) {
  serio_app()
}
Visualization of contingency data over time. Rows must be individuals (archaeological sites, ...) and columns must be categories (types, ...).
seriograph(cont, order, insert, show, permute)
cont |
Contingency table or hclustcompro object. Note: Your contingency table must have the rows sorted in chronological order (the order parameter allows you to change the order of the rows if necessary). |
order |
Vector to change the order of the rows (a character vector of row names, or of cluster names if cont is a hclustcompro object). The oldest row (at the bottom) must be at the end of the vector. Names omitted from the vector are not plotted, so you can remove a row simply by leaving its name out. |
show |
The element to be plotted. This should be (a unique abbreviation of) one of 'both', 'EPPM' or 'frequency'. |
permute |
Logical: whether to permute the columns in order to reveal the seriation. |
insert |
Vector of positions after which you want to insert one or more hiatuses. It can also be a list with two vectors, position and label, to print custom labels instead of "hiatus" (see the last examples). |
Seriograph
We have chosen the seriograph (Desachy 2004). This tool makes it possible to highlight the evolution of ceramics over time, as well as to understand commercial relations through imported ceramics. The percentages of each category of ceramics per set are displayed, calculated independently for each set (row). Displaying percentages allows the different sets to be compared but provides no information on the differences in counts. To fill this gap, the proportion of the total count carried by each class is displayed on the seriograph (weight column).
This representation can be generalised to other contingency data and to hclustcompro objects.
Visualising the positive deviations from the average percentages reveals a seriation that results from changes in the techniques and materials used in ceramic manufacturing over time.
To make such count tables easier to exploit, we propose here a computerised graphic processing tool (the EPPM seriograph, for Ecart Positif aux Pourcentages Moyens, positive deviation from the average percentage), which requires no specialised statistical skills and is suited to stratified sites, where the study of artefact evolution can rely on the relative chronology provided by the excavation.
The treatment first transforms this table of counts into a table of percentages, the total count of each set (each row) being reduced to 100; the proportions, or frequencies, of the types in the sets can thus be compared.
The display of positive deviations from the mean percentages (EPPM) shows, in black on a lighter background, the percentage shares that exceed the mean percentage of the variable, so as to highlight the most significant part of the values in the table. This display is simply adapted to the seriograph: when a percentage is greater than the average percentage of the type, the excess share (called here the EPPM, positive deviation from the average percentage) is shown in black, centred on the axis of the type, over the grey background of the percentage bar.
The table is then transformed into a graphic matrix where these percentages are expressed, for each type, by horizontal bars centred on the column axis. When the rows are ordered chronologically, the silhouette formed by the superposition of these frequency bars makes it possible to visualise the evolution of the type concerned over time.
The processing technique also applies to sets whose chronological order is unknown; the rows of the graph are then reorganised so as to obtain boat-shaped silhouettes, following the hypothesis of a chronological evolution corresponding to the seriation model.
Positive deviation from the average percentage (EPPM in French)
The average percentage is calculated for each ceramic category (column) over the total number of counts (all classes combined). For each category and each row, we then take the difference between the percentage of the category in that row and the average percentage. The EPPM corresponds to the notion of deviation from independence (between rows and columns, i.e. between categories and time classes) in a chi-squared test approach. Although this approach is fundamental in statistical analysis, the deviations are here purely indicative and are not associated with a p-value that could assess their significance.
Weight
Weight is the number of observations in a row divided by the total number of observations. It indicates, for each row, the percentage of the data set used to calculate the frequencies of that row (see the sketch below).
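The following minimal sketch reproduces these two computations on a toy contingency table (values invented for illustration; this is not the package's internal code):

## Row percentages, average percentages, EPPM and weights.
cont <- matrix(c(10,  5,  1,
                 20, 30,  4,
                  5, 40, 12),
               nrow = 3, byrow = TRUE)
freq     <- prop.table(cont, margin = 1) * 100  # percentages within each set (rows sum to 100)
mean_pct <- colSums(cont) / sum(cont) * 100     # average percentage of each type, all sets combined
eppm     <- pmax(sweep(freq, 2, mean_pct), 0)   # positive deviations, drawn in black on the plot
weight   <- rowSums(cont) / sum(cont) * 100     # share of the total count in each row (weight column)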
Permutation
order argument:
The rows of the contingency table are initially in their order of appearance (from top to bottom). The order argument lets you reorder the classes chronologically (you can also order your contingency table yourself beforehand).
permute argument:
In addition, the ceramic categories (contingency table columns) can be permuted in order to highlight a seriation phenomenon, as sketched below. The matrix permutation uses an algorithm called "reciprocal averages". Each row is assigned a rank from 1 to n, the number of rows. A barycentre is then calculated for each column by weighting the row ranks. Finally, the columns are reordered by sorting them on their barycentres.
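A compact sketch of this column permutation (one pass of the barycentre sort; the exact weighting used by the package may differ):

## Reorder the columns of a contingency table by the barycentre of the
## row ranks, weighted by the counts (illustrative only).
reorder_columns <- function(cont) {
  ranks <- seq_len(nrow(cont))      # row ranks 1..n, assumed in chronological order
  bary  <- apply(cont, 2, function(counts) weighted.mean(ranks, w = counts))
  cont[, order(bary), drop = FALSE] # columns sorted by their barycentre
}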
Insert
It is possible to insert a row in the seriograph in order to represent an archaeological hiatus or another temporal discontinuity.
The function returns a list (class: seriograph).
seriograph |
The seriograph plot |
dendrogram |
If cont is a hclustcompro object, the dendrogram with the period order as labels. |
contingency |
Data frame of the contingency data grouped by cluster. |
frequency |
Data frame of the frequency data grouped by cluster. |
ecart |
Data frame of the deviation (EPPM) data grouped by cluster. |
B. DESACHY
A. COULON
L. BELLANGER
P. HUSI
Desachy B. (2004). Le sériographe EPPM : un outil informatisé de sériation graphique pour tableaux de comptages. In: Revue archéologique de Picardie, n° 3-4, 2004. Céramiques domestiques et terres cuites architecturales. Actes des journées d'étude d'Amiens (2001-2002-2003), pp. 39-56. <doi:10.3406/pica.2004.2396>
library(SPARTAAS)
data(datangkor)

## contingency table
cont <- datangkor$contingency

## default
seriograph(cont)
seriograph(cont, show = "EPPM")
seriograph(cont, show = "frequency")

## do not allow permutation of the columns
seriograph(cont, permute = FALSE)

## insert a hiatus (position 1 -> after the first row from the bottom, i.e. the oldest)
seriograph(cont, insert = 2)
seriograph(cont, insert = c(2, 3))

## insert custom label elements
insert <- list(
  position = c(2, 3),
  label = c("Hiatus.100years", "Missing data")
)
seriograph(cont, insert = insert)

## change the order with cluster names (letters on the dendrogram) to sort them chronologically
seriograph(cont, order = c("AI03","AI09","AI01","AI02","AI04","APQR01","AO05",
                           "AO03","AI05","AO01","APQR02","AI07","AI08","AO02",
                           "AI06","AO04","APQR03"))

## omitting row names deletes the corresponding rows
seriograph(cont, order = c("AI02","AI08","APQR03","AI09"))
Cluster time range visualisation.
timerange(data, cluster = NULL, add = NULL, density = NULL, color = NULL,
          reorder = FALSE)
data |
data.frame containing the identifier, lower and upper bound (id, inf, sup) for each observation |
cluster |
vector containing the cluster number of each observation. |
add |
data.frame containing the information to be displayed on hover. |
density |
vector of the density for each observation. |
color |
vector of the colors for each observation (if you want the colors to correspond to something else than clusters). Character vector of the same length as the number of observations. |
reorder |
Logical to rearrange the colors. If TRUE, the first color corresponds to the leftmost cluster on the plot. If FALSE, the first color is that of cluster number 1, wherever it is. |
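As a small, hedged illustration of the color and reorder arguments (the data values below are invented for this sketch):

library(SPARTAAS)
## invented mini data set: four observations in two clusters
obs <- data.frame(id  = 1:4,
                  inf = c(400, 450, 700, 720),
                  sup = c(600, 640, 900, 950))
grp  <- c(1, 1, 2, 2)
## one color per observation (here simply matching the clusters)
cols <- c("steelblue", "steelblue", "firebrick", "firebrick")
timerange(obs, cluster = grp, color = cols, reorder = TRUE)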
The function returns a list.
plot |
The timerange plot. |
A. COULON
B. MARTINEAU
L. BELLANGER
P. HUSI
library(SPARTAAS)

data <- data.frame(
  id = 1:20,
  Lower_bound = c(400, 401, 401, 350, 500, 460, 482, 432, 399, 489,
                  750, 740, 704, 700, 758, 789, 802, 755, 750, 820),
  Upper_bound = c(550, 689, 755, 700, 780, 700, 700, 699, 650, 850,
                  1100, 1100, 1010, 889, 999, 999, 1050, 1002, 1000, 1100)
)
cluster <- c(rep(1, 10), rep(2, 10))
add <- data.frame(Site = c(rep("Angers", 10), rep("Blois", 10)))

timerange(data, cluster, add)

## with sub-groups (cluster 1 is subdivided in two: 1.1 and 1.2)
cluster_with_sub <- c(rep(1.1, 5), rep(1.2, 5), rep(2, 10))
timerange(data, cluster_with_sub, add)

## with density
density <- c(32, 34, 35, 19, 9, 25, 19, 29, 28, 18, 10, 13, 9, 10, 9, 6, 3, 7, 7, 1)
timerange(data = data, cluster = cluster, density = density)