{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"hierarchique.ipynb","version":"0.3.2","provenance":[],"collapsed_sections":[],"toc_visible":true},"kernelspec":{"name":"python3","display_name":"Python 3"}},"cells":[{"metadata":{"id":"_OsqsO0NPdAv","colab_type":"text"},"cell_type":"markdown","source":["#Clustering - Regroupement hiérarchique\n","\n","##Introduction\n","\n","L'algorithme de *regroupement hiérarchique* est un algorithme de clustering **supervisé**.
\n","On initialise l'algorithme en créant un cluster pour chaque point.
\n","Puis on regroupe les deux clusters les plus proches jusqu'a atteindre le nombre de clusters souhaités. On utilise pour cela la distance minimale, moyenne ou maximale entre les clusters.
\n","\n","##Création des données\n","Pour commencer, nous allons créer les données que nous utiliserons pour la suite dans l'algorithme. L'idée est de créer des données d'initialisation et des données de test (pour évaluer l'efficacité de l'algorithme).
\n","L'explication de la création de ces données est réalisée dans le jupyter de l'algorithme *k-means*, ainsi nous ne nous étendrons pas sur la création de ces dernières."]},{"metadata":{"id":"01BdJscsNuCe","colab_type":"code","colab":{}},"cell_type":"code","source":["# Importation des librairies\n","import numpy as np\n","import matplotlib.pyplot as plt\n","\n","# Les paramètres importants\n","nbr_clusters=3\n","nbr_initialisation=30\n","nbr_evalutation=1000\n","\n","# Les listes\n","init_points=[]\n","init_labels=[]\n","eval_points=[]\n","eval_labels=[]\n","\n","donnees_initialisation=[[] for i in range(nbr_clusters)]\n","\n","barycentres=[]\n","for i in range(nbr_clusters):\n"," barycentres.append(2*np.random.random_sample((2,))-1)\n","\n","#barycentres=[np.array([0.8,0.8]),np.array([0.8,-0.8]),np.array([-0.8,-0.8]),np.array([-0.8,0.8])] \n","\n","# Création des données d'initialisation\n","for i in range(nbr_initialisation):\n"," indice=np.random.randint(0,nbr_clusters)\n"," point=barycentres[indice]+np.random.normal(scale=0.2,size=2) # Répartition normale autour d'un barycentre\n"," init_points.append(point)\n"," init_labels.append(indice)\n"," donnees_initialisation[indice].append(point)\n"," \n","# Création des données de test\n","for i in range(nbr_evalutation):\n"," indice=np.random.randint(0,nbr_clusters)\n"," point=barycentres[indice]+np.random.normal(scale=0.2,size=2) # Répartition normale autour d'un barycentre\n"," eval_points.append(point)\n"," eval_labels.append(indice)"],"execution_count":0,"outputs":[]},{"metadata":{"id":"3LU7Q6UsSvNl","colab_type":"text"},"cell_type":"markdown","source":["Maintenant que nous avons créé les données, je vous propose de les visualiser"]},{"metadata":{"id":"2a6zbFTDSvm3","colab_type":"code","outputId":"ed5bc395-0c46-43d0-fb49-54b1391d097a","executionInfo":{"status":"ok","timestamp":1557386457055,"user_tz":-120,"elapsed":1934,"user":{"displayName":"Lucas Millot","photoUrl":"","userId":"16896163850751379404"}},"colab":{"base_uri":"https://localhost:8080/","height":269}},"cell_type":"code","source":["lim=1.5\n","plt.xlim([-lim,lim])\n","plt.ylim([-lim,lim])\n","\n","# Définition des couleurs des clusters et des barycentres\n","couleurs=['r','g','b','y']\n","couleurs_barycentres=['#990000','#009900','#000099','#009999']\n","\n","for i in range(nbr_initialisation):\n"," point=init_points[i]\n"," label=init_labels[i]\n"," plt.plot(point[0],point[1],color=couleurs[label],marker=\"+\") # On affiche chaque point avec la couleur de son cluster\n"," \n","plt.show()"],"execution_count":17,"outputs":[{"output_type":"display_data","data":{"image/png":"iVBORw0KGgoAAAANSUhEUgAAAYQAAAD8CAYAAAB3u9PLAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAEYZJREFUeJzt3W+MZXV9x/H3p6xgQhpFdgvIH4W6\nFWna+Gey9U9TabEtbhpWKzbwoEIC2VJLfOSDTUzcWRPT2Cc1VlvZIhH7ALAk6lrXUpBubdNiGQwr\nrBRdNxJ2RRnBYkgsFPn2wZzV22X+3Jl77j3nzrxfyc2ce++Z8/v99u6czz2/3zm/k6pCkqRf6LoC\nkqR+MBAkSYCBIElqGAiSJMBAkCQ1DARJEtBSICS5KcnjSR5c4v2LkzyV5P7m8cE2ypUktWdTS9v5\nNPBx4DPLrPOvVfUHLZUnSWpZK0cIVfVV4Mk2tiVJ6kZbRwjDeFOSg8D3gPdX1aHFVkqyE9gJcOqp\np77hwgsvnGAVJWm63XfffT+sqi1r+d1JBcLXgVdU1dNJtgOfB7YutmJV7QX2AszMzNTc3NyEqihJ\n0y/JI2v93YmcZVRVP66qp5vl/cCLkmyeRNmSpOFMJBCSnJkkzfK2ptwnJlG2JGk4rXQZJbkFuBjY\nnOQosBt4EUBVfRK4HPjTJM8BPwGuKKdZlaReaSUQqurKFd7/OAunpUqSesorlSVJgIEgSWoYCJIk\nwECQJDUMBEkSYCBIkhoGgiQJMBAkSQ0DQZIEGAiSpIaBIEkCDARJUsNAkCQBBoIkqWEgSJIAA0GS\n1DAQJEmAgSBJahgIkiTAQJAkNQwESRJgIEiSGgaCJAkwECRJDQNBkgQYCJKkhoEgSQJaCoQkNyV5\nPMmDS7yfJB9LcjjJN5K8vo1yJUntaesI4dPApcu8/3Zga/PYCfxNS+VKklrSSiBU1VeBJ5dZZQfw\nmVpwD/DSJGe1UbYkqR2TGkM4G3h04PnR5jVJUk/0blA5yc4kc0nm5ufnu66OJG0YkwqEY8C5A8/P\naV57garaW1UzVTWzZcuWiVROkjS5QNgHvKc52+iNwFNV9diEypY0ZWZnu67BxtTWaae3AP8BvDrJ\n0STXJLkuyXXNKvuBI8Bh4G+B97ZRrqT1ac+ermuwMW1qYyNVdeUK7xfwZ22UJUkaj94NKkvamGZn\nIVl4wM+X7T6anCx8ee+nmZmZmpub67oakiYsgR7vmnotyX1VNbOW3/UIQZIEGAiSemj37q5rsDEZ\nCJJ6x3GDbhgIkiTAQJAkNQwESRJgIEiSGgaCJAkwECRJDQNBkgQYCJKkhoEgaWheMLa+GQiShjbM\nfQoMjellIEhqlTe3mV4GgqRleZ+CjcNAkLSs2dmFexMcvz/B8eXBQDA01gdvkCNpaMPcuMab23TL\nG+RImog+3qfAo5D2GAiShjbMznfSoeEgdnsMBEmrtlww+I19ehkIklat62/lDmKPh4PKklatTwPH\nfapLHzioLGnspvVbed/r1ycGgqShDHM9wqTrAysPYnfdvTVNDARJq9KXb9zHd/R9qc96YCBIWpU9\ne/p5PcKgae3e6lorgZDk0iQPJzmcZNci71+dZD7J/c3j2jbKldSNLruJhtnR9617a1qMHAhJTgI+\nAbwduAi4MslFi6x6W1W9tnncOGq5kianL9+43dGPVxtHCNuAw1V1pKqeBW4FdrSwXUk9Mc074r53\nb/VJG4FwNvDowPOjzWsneleSbyS5Pcm5S20syc4kc0nm5ufnW6iepC6MOyyG3dFPQ2j1xaQGlb8I\nvLKqfh24E7h5qRWram9VzVTVzJYtWyZUPWljWsvOctgd8bhP93RH3742AuEYMPiN/5zmtZ+pqieq\n6pnm6Y3AG1ooV9KI1rLTdke8frURCPcCW5Ocn+Rk4Apg3+AKSc4aeHoZ8FAL5UqakGFDoC+Dz1qb\nkQOhqp4DrgfuYGFH/9mqOpTkQ0kua1Z7X5JDSQ4C7wOuHrVcSWuzlp32sEcS0zz4LCe3kza0YSeG\nW8sEck461w0nt5PUulG7fzzdc/p4hCBtYLOzw+3g/bY/PTxCkLQm9u1rkIEgaUV2/2wMBoKkFXkk\nsTEYCJIkwECQJDUMBEkSYCBIkhoGgiQJMBAkSQ0DQZIEGAiSpIaBIEkCDARJLfFq5ulnIEhqxbjv\noazxMxAkSYCBIGkE3kN5ffEGOZJa4U10+sEb5EiSRmYgSGqFN9GZfgaCpFY4bjD9DARJEmAgSJIa\nBoIkCTAQJEkNA0GSBLQUCEkuTfJwksNJdi3y/ilJbmve/1qSV7ZRriSpPSMHQpKTgE8AbwcuAq5M\nctEJq10D/KiqXgX8JfCRUcuVJLWrjSOEbcDhqjpSVc8CtwI7TlhnB3Bzs3w7cElyfPYTSVIftBEI\nZwOPDjw/2ry26DpV9RzwFHD6YhtLsjPJXJK5+fn5FqonSRpG7waVq2pvVc1U1cyWLVu6ro4kbRht\nBMIx4NyB5+c0ry26TpJNwEuAJ1ooe7p5rb+kHmkjEO4FtiY5P8nJwBXAvhPW2Qdc1SxfDtxdfZ53\ne1K8xZSkHtk06gaq6rkk1wN3ACcBN1XVoSQfAuaqah/wKeDvkhwGnmQhNCRJPdLKGEJV7a+qX6mq\nX66qDzevfbAJA6rqf6rq3VX1qqraVlVH2ih3KnmLKUk95R3TuuQtpiS1zDumSZJGZiB0yVtMSeoR\nA6FL4xw3cExC0ioZCOuVp7RKWiUDQZIEGAjri6e0ShqBp52uV57SKm1InnYqSRqZgbBeeUqrpFUy\nENYrxw0krZKBIEkCDARJUsNAkCQBBoIkqWEgyAFoSYCBIHDeI0mAgSBJahgIG5XzHkk6gXMZyXmP\npHXEuYzUPx5pSFPHQNB45j1yoFqaOgaC/DYvCTAQ1CYHqqWp5qCyxsOBaqkTDipLkkZmIGg8vEGP\nNHVGCoQkL0tyZ5JvNz9PW2K9nya5v3nsG6VMTQnHDaSpM+oRwi7gK1W1FfhK83wxP6mq1zaPy0Ys\nU5I0BqMGwg7g5mb5ZuAdI25PktSRUQPhjKp6rFn+PnDGEuu9OMlcknuSLBsaSXY2687Nz8+PWD31\nnl1LUm+seNppkruAMxd56wPAzVX10oF1f1RVLxhHSHJ2VR1LcgFwN3BJVX1npcp52ukG4OmpUqtG\nOe1000orVNXblin4B0nOqqrHkpwFPL7ENo41P48kOQC8DlgxECRJkzNql9E+4Kpm+SrgCyeukOS0\nJKc0y5uBtwDfHLFcTTOvaJZ6aaQrlZOcDnwWOA94BPijqnoyyQxwXVVdm+TNwA3A8ywE0Eer6lPD\nbN8uow3ALiOpVWPtMlpOVT0BXLLI63PAtc3yvwO/Nko52kBmZz1SkDrilcrq1olXNDttttQZA0Hd\n8mhA6g0DYRzcya1OzweZZw/Mdl0FaSKc/nocHChdux7+22VPqN39qpO0FKe/1uJ68g1b0nQwENrS\nx26PaRyg7cm02bMHZsmekD0Ln+fxZbuPtJ7ZZTQOfen26Es9ppxdRpomdhnp5/p4pCJpKox0YZqW\n0GW3x+CFXR4htGL3W/vRjSWNm11G65mBIG04dhlpcT0ZoJU0HQyE9cxxA0mrYCBIkgADQR3ynH6p\nXwwEdWbPv0zhhXPSOmYgqBc8WpC6ZyBoKG3tsJeaEsKjBal7BoKG0tYOe/biWWp3/WwqiMFlSd0y\nENQpJ5CT+sOpK7Sk2QOz/+/I4PiOe/dbdzN78ezI2x/cjhPISd1z6goNZdw7bANBaodTV2jqOYGc\n1D0DQUMZ9w67jS4oSaMxEDQUd9jS+mcgqDWeHSRNNwNBrfHiMmm6GQhakt/4pY1lpEBI8u4kh5I8\nn2TJ05ySXJrk4SSHk+wapUxNzjDf+JeaimIwTPoYLH2sk9S1ka5DSPIa4HngBuD9VfWCiwaSnAR8\nC/hd4ChwL3BlVX1zpe17HUK3VnttwFLr9/Eagz7WSWpDZ9chVNVDVfXwCqttAw5X1ZGqeha4Fdgx\nSrkan2G+8UtanyYxhnA28OjA86PNa+qhpSafG+a008FrFfoYLH2sk9QnK3YZJbkLOHORtz5QVV9o\n1jnA0l1GlwOXVtW1zfM/Bn6jqq5forydwE6A88477w2PPPLI8K1Rq9rqVulj90wf6yS1YZQuoxUn\nt6uqt61lwwOOAecOPD+neW2p8vYCe2FhDGHEsjUCp5OQNpZJdBndC2xNcn6Sk4ErgH0TKFcjauvq\n5D4GSx/rJHVt1LOM3gn8FbAF+G/g/qr6/SQvB26squ3NetuBjwInATdV1YeH2b5nGUnS6oy1y2g5\nVfU54HOLvP49YPvA8/3A/lHKkiSNl1cqS5IAA0GS1DAQJEmAgSBJahgImgivBpb6z0DQRHivBKn/\nDARJEmAgaIycTE6aLiNdqTxuXqm8frQxmdzsgdnWptOQ1qvO7ocgTZLjENJ4GQiaCCeTk/rPQNBE\nrLWrx3EIaXIcQ9DU8KY20socQ5AkjcxA0NRwHEIaLwNBU8NTTqXxMhAkSYCBIElqGAiSJMBAkCQ1\nDARJEmAgSJIaBoIkCTAQJEkNA0GSBBgIkqSGgSBJAgwESVJjpEBI8u4kh5I8n2TJ+beTfDfJA0nu\nT+INDiSphzaN+PsPAn8I3DDEur9dVT8csTxJ0piMFAhV9RBAknZqI0nqzKhHCMMq4J+SFHBDVe1d\nasUkO4GdzdNnkjw4iQp2YDOwno+YbN90s33T69Vr/cUVAyHJXcCZi7z1gar6wpDl/GZVHUvyS8Cd\nSf6rqr662IpNWOxtyp5b671B+249tw1s37SzfdNrlHHaFQOhqt621o0PbONY8/PxJJ8DtgGLBoIk\nqRtjP+00yalJfvH4MvB7LAxGS5J6ZNTTTt+Z5CjwJuBLSe5oXn95kv3NamcA/5bkIPCfwJeq6h+H\nLGLJsYZ1YD23DWzftLN902vNbUtVtVkRSdKU8kplSRJgIEiSGr0JhPU+DcYq2ndpkoeTHE6ya5J1\nHEWSlyW5M8m3m5+nLbHeT5vP7v4k+yZdz9Va6fNIckqS25r3v5bklZOv5doM0bark8wPfF7XdlHP\ntUpyU5LHl7qWKQs+1rT/G0leP+k6rtUQbbs4yVMDn90Hh9pwVfXiAbyGhQsqDgAzy6z3XWBz1/Ud\nR/uAk4DvABcAJwMHgYu6rvuQ7fsLYFezvAv4yBLrPd11XVfRphU/D+C9wCeb5SuA27qud4ttuxr4\neNd1HaGNvwW8Hnhwife3A18GArwR+FrXdW6xbRcD/7Da7fbmCKGqHqqqh7uux7gM2b5twOGqOlJV\nzwK3AjvGX7tW7ABubpZvBt7RYV3aMsznMdju24FLMh1zuUzz/7Wh1MLFr08us8oO4DO14B7gpUnO\nmkztRjNE29akN4GwCsenwbivmeZiPTkbeHTg+dHmtWlwRlU91ix/n4XTjRfz4iRzSe5J0vfQGObz\n+Nk6VfUc8BRw+kRqN5ph/6+9q+lOuT3JuZOp2sRM89/bMN6U5GCSLyf51WF+YVJzGQGTnwZj0lpq\nX28t177BJ1VVzbxVi3lF8/ldANyd5IGq+k7bdVUrvgjcUlXPJPkTFo6EfqfjOmk4X2fhb+3pJNuB\nzwNbV/qliQZCrfNpMFpo3zFg8FvYOc1rvbBc+5L8IMlZVfVYc9j9+BLbOP75HUlyAHgdC33ZfTTM\n53F8naNJNgEvAZ6YTPVGsmLbqmqwHTeyME60nvT6720UVfXjgeX9Sf46yeZa4RYEU9VltAGmwbgX\n2Jrk/CQnszBI2fszcRr7gKua5auAFxwRJTktySnN8mbgLcA3J1bD1Rvm8xhs9+XA3dWM6vXcim07\noT/9MuChCdZvEvYB72nONnoj8NRAt+dUS3Lm8bGsJNtY2Nev/EWl69HygVHxd7LQh/cM8APgjub1\nlwP7m+ULWDgb4iBwiIWumM7r3lb7mufbgW+x8K15mtp3OvAV4NvAXcDLmtdngBub5TcDDzSf3wPA\nNV3Xe4h2veDzAD4EXNYsvxj4e+AwC1OzXNB1nVts2583f2cHgX8GLuy6zqts3y3AY8D/Nn971wDX\nAdc17wf4RNP+B1jm7Ma+PYZo2/UDn909wJuH2a5TV0iSgCnrMpIkjY+BIEkCDARJUsNAkCQBBoIk\nqWEgSJIAA0GS1Pg/YIPOp0vtr9oAAAAASUVORK5CYII=\n","text/plain":["
"]},"metadata":{"tags":[]}}]},{"metadata":{"id":"Je2eBgjXTQ8P","colab_type":"text"},"cell_type":"markdown","source":["## Création de l'algorithme\n","\n","### La distance entre clusters\n","Laissons cette base de donnée de coté et repartons avec quelque chose de plus simple. prenons dix points en une dimension, tirés aléatoirement entre 0 et 1000. \n","\n","[653, 801, 887, 347, 638, 449, 999, 377, 703, 225]\n","\n","Nous attribuons un cluster a chaque point.
\n","Ici la distance entre deux éléments a et b c'est simplement |b-a|.
\n","La distance entre deux clusters est définie comme le *minimum* / la *moyenne* / le *maximum* entre les points des deux clusters pris deux a deux.
\n","On trouve les deux clusters les plus proches (distance la plus faible) et le tour est joué, il ne reste qu'a les fusionner."]},{"metadata":{"id":"zgtiHUAhWhci","colab_type":"code","outputId":"a1c81913-ab77-4640-ca74-e865d58c006a","executionInfo":{"status":"ok","timestamp":1557386457057,"user_tz":-120,"elapsed":1774,"user":{"displayName":"Lucas Millot","photoUrl":"","userId":"16896163850751379404"}},"colab":{"base_uri":"https://localhost:8080/","height":70}},"cell_type":"code","source":["def distance_element(e1, e2):\n"," \"\"\"\n"," Fonction distance entre les éléments, qui retourne |b-a| dans cet exemple\n"," \"\"\"\n"," if isinstance(e1, int):\n"," return abs(e1-e2)\n"," else:\n"," return np.linalg.norm(e1-e2)\n","\n","def distance_cluster_fig(c1, c2, parametre_distance_cluster):\n"," \"\"\"\n"," Fonction de distance entre des clusters\n"," \"\"\"\n"," #distance minimale\n"," if parametre_distance_cluster == 'min':\n"," min_distance_cluster = distance_element(c1[0], c2[0])\n"," for element_c1 in c1:\n"," for element_c2 in c2:\n"," if distance_element(element_c1, element_c2) < min_distance_cluster:\n"," min_distance_cluster = distance_element(element_c1, element_c2)\n"," return min_distance_cluster\n"," \n"," #distance maximale\n"," elif parametre_distance_cluster == 'max':\n"," max_distance_cluster = distance_element(c1[0], c2[0])\n"," for element_c1 in c1:\n"," for element_c2 in c2:\n"," if distance_element(element_c1, element_c2) > max_distance_cluster:\n"," max_distance_cluster = distance_element(element_c1, element_c2)\n"," return max_distance_cluster\n"," \n"," #distance moyenne\n"," elif parametre_distance_cluster== 'mean':\n"," dist=0\n"," for element_c1 in c1:\n"," for element_c2 in c2:\n"," dist += distance_element(element_c1, element_c2)\n"," return dist/(len(c1)*len(c2))\n","\n","# On crée les clusters de test\n","c1=[1,2,5]\n","c2=[4,6,9]\n","print(\"La distance minimale entre ces deux clusters est {} (distance entre 4 et 5)\".format(distance_cluster_fig(c1,c2,\"min\")))\n","print(\"La distance maximale entre ces deux clusters est {} (distance entre 1 et 9)\".format(distance_cluster_fig(c1,c2,\"max\")))\n","print(\"La distance moyenne entre ces deux clusters est {}\".format(distance_cluster_fig(c1,c2,\"mean\")))"],"execution_count":18,"outputs":[{"output_type":"stream","text":["La distance minimale entre ces deux clusters est 1 (distance entre 4 et 5)\n","La distance maximale entre ces deux clusters est 8 (distance entre 1 et 9)\n","La distance moyenne entre ces deux clusters est 3.888888888888889\n"],"name":"stdout"}]},{"metadata":{"id":"1XJcsZClX-vC","colab_type":"text"},"cell_type":"markdown","source":["### Fusion des clusters\n","Maintenant que l'on connait les distances inter-cluster il suffit de trouver le minimum de ces distances. Nous travaillons avec une liste où ligne[k] est le numéro du cluster du point k.
\n","La fonction **fusion** passe d'une ligne à la suivante en réduisant de 1 le nombre de clusters.
\n","Pour cela on crée un *dictionnaire de clusters*.
\n","En testant pour tous les clusters, on trouve ainsi les clusters à fusionner (la distance est la plus petite).
\n","On les *fusionne* puis on crée la liste suivant des clusters des points. Il y aura donc une valeur de moins dans cette liste, car il y aura un cluster de moins."]},{"metadata":{"id":"Pz-n-C6uX_X8","colab_type":"code","outputId":"623d1231-82e0-44a7-85c9-880c0d0a7ec5","executionInfo":{"status":"ok","timestamp":1557386457058,"user_tz":-120,"elapsed":1591,"user":{"displayName":"Lucas Millot","photoUrl":"","userId":"16896163850751379404"}},"colab":{"base_uri":"https://localhost:8080/","height":105}},"cell_type":"code","source":["import random as rd\n","\n","liste_aleat=[rd.randint(0,100) for i in range (6)]\n","def fusion(ligne, i, liste_alea):\n"," c = {}\n"," for element in range(len(ligne)):\n"," if ligne[element] in c.keys():\n"," c[int(ligne[element])] = c[int(ligne[element])] + [liste_alea[element]]\n"," else:\n"," c[int(ligne[element])] = [liste_alea[element]]\n","\n"," \n"," min = distance_cluster_fig(c[0], c[1], \"mean\")\n"," indices_min = (0, 1)\n"," for j in range(i-1):\n"," for l in range(j + 1, i):\n"," if distance_cluster_fig(c[j], c[l], \"mean\") <= min:\n"," min = distance_cluster_fig(c[j], c[l], \"mean\")\n"," indices_min = (j, l)\n"," j,l = indices_min\n"," print(\"\")\n"," print (\"il faut fusionner \"+str(c[j])+\" et \"+str(c[l]))\n"," print (\"cela correspond à \"+ str(j)+ \" et \" + str(l))\n"," #on a assuré j