class: inverse, center, middle background-image: url(./img/seurat.jpg) background-size: cover background-position: 50% 50% # Les essentiels de la data science ## R : le couteau suisse des données ### Jour 2 </BR> </BR> </BR> ### Joël Gombin, avec Timothée Gidoin <img src="./img/Logo_DATACTIVIST_TW.png" height="100px" /> .right[.footnote[<a href='https://commons.wikimedia.org/wiki/File%3ASeurat-Gravelines-Annonciade.jpg'>source</a>]] --- class: center, middle Retrouvez les matériaux sur https://github.com/datactivist/numa_R Ces slides en ligne : http://datactivi.st/numa_R/jour2.html Pad collaboratif : https://frama.link/numa_R --- class: center, middle  --- ## Objectifs de la journée - solidifer les acquis d'hier (tidy et transform) - acquérir, préparer et malaxer des données "en conditions réelles" - découvrir la visualisation de données --- ## Exercices - Importer des données en CSV - sélectionner certaines lignes en fonction d'une condition logique - ajouter une nouvelle variable - agréger/résumer les données par groupe - trier le résultat - pivoter des données --- ## Recoder des données - `recode` - `case_when` --- ## Recoder des données ```r suppressPackageStartupMessages(library(tidyverse)) elections <- read_csv("data/Presidentielle_2017_Resultats_Communes_T1_clean.csv") ``` ``` ## Parsed with column specification: ## cols( ## .default = col_double(), ## CodeInsee = col_character(), ## CodeDepartement = col_character(), ## Département = col_character(), ## Commune = col_character(), ## Inscrits = col_integer(), ## Abstentions = col_integer(), ## Votants = col_integer(), ## Blancs = col_integer(), ## Nuls = col_integer(), ## Exprimés = col_integer(), ## `LE PEN` = col_integer(), ## MÉLENCHON = col_integer(), ## MACRON = col_integer(), ## FILLON = col_integer(), ## LASSALLE = col_integer(), ## `DUPONT-AIGNAN` = col_integer(), ## HAMON = col_integer(), ## ASSELINEAU = col_integer(), ## POUTOU = col_integer(), ## ARTHAUD = col_integer() ## # ... with 1 more columns ## ) ``` ``` ## See spec(...) for full column specifications. ``` --- ## Recoder des données ```r elections %>% mutate(region = case_when(CodeDepartement %in% c("75", "77", "78", "91", "92", "93", "94", "95") ~ "Ile-de-France", TRUE ~ "Province")) %>% glimpse ``` ``` ## Observations: 35,496 ## Variables: 52 ## $ CodeInsee <chr> "01001", "01002", "01004", "01005", "01006",... ## $ CodeDepartement <chr> "01", "01", "01", "01", "01", "01", "01", "0... ## $ Département <chr> "Ain", "Ain", "Ain", "Ain", "Ain", "Ain", "A... ## $ Commune <chr> "L'Abergement-Clémenciat", "L'Abergement-de-... ## $ Inscrits <int> 598, 209, 8586, 1172, 99, 1880, 581, 254, 73... ## $ Abstentions <int> 92, 25, 1962, 215, 20, 268, 91, 41, 136, 43,... ## $ Abstentions_ins <dbl> 15.38462, 11.96172, 22.85115, 18.34471, 20.2... ## $ Votants <int> 506, 184, 6624, 957, 79, 1612, 490, 213, 602... ## $ Votants_ins <dbl> 84.61538, 88.03828, 77.14885, 81.65529, 79.7... ## $ Blancs <int> 2, 6, 114, 21, 2, 53, 12, 1, 2, 6, 0, 2, 33,... ## $ Blancs_ins <dbl> 0.3344482, 2.8708134, 1.3277428, 1.7918089, ... ## $ Blancs_vot <dbl> 0.3952569, 3.2608696, 1.7210145, 2.1943574, ... ## $ Nuls <int> 9, 2, 58, 3, 0, 11, 5, 3, 11, 2, 7, 0, 12, 8... ## $ Nuls_ins <dbl> 1.5050167, 0.9569378, 0.6755183, 0.2559727, ... ## $ Nuls_vot <dbl> 1.7786561, 1.0869565, 0.8756039, 0.3134796, ... ## $ Exprimés <int> 495, 176, 6452, 933, 77, 1548, 473, 209, 589... ## $ Exprimés_ins <dbl> 82.77592, 84.21053, 75.14559, 79.60751, 77.7... ## $ Exprimés_vot <dbl> 97.82609, 95.65217, 97.40338, 97.49216, 97.4... ## $ LE PEN <int> 126, 48, 1667, 306, 18, 458, 135, 40, 207, 6... ## $ MÉLENCHON <int> 59, 33, 1412, 126, 19, 296, 89, 39, 103, 30,... ## $ MACRON <int> 119, 37, 1332, 191, 15, 348, 95, 55, 110, 49... ## $ FILLON <int> 110, 34, 1084, 197, 14, 233, 84, 44, 83, 46,... ## $ LASSALLE <int> 2, 0, 60, 6, 1, 13, 3, 8, 14, 4, 2, 0, 21, 7... ## $ DUPONT-AIGNAN <int> 34, 6, 346, 45, 4, 80, 28, 6, 33, 9, 10, 3, ... ## $ HAMON <int> 29, 13, 344, 37, 3, 82, 23, 8, 20, 13, 10, 1... ## $ ASSELINEAU <int> 6, 1, 71, 10, 0, 11, 2, 6, 10, 1, 0, 2, 23, ... ## $ POUTOU <int> 4, 2, 91, 10, 2, 17, 8, 3, 5, 4, 2, 2, 15, 4... ## $ ARTHAUD <int> 4, 2, 40, 5, 1, 9, 3, 0, 3, 4, 2, 2, 9, 3, 5... ## $ CHEMINADE <int> 2, 0, 5, 0, 0, 1, 3, 0, 1, 0, 1, 0, 2, 2, 1,... ## $ LE PEN_ins <dbl> 21.07023, 22.96651, 19.41533, 26.10922, 18.1... ## $ MÉLENCHON_ins <dbl> 9.866221, 15.789474, 16.445376, 10.750853, 1... ## $ MACRON_ins <dbl> 19.899666, 17.703349, 15.513627, 16.296928, ... ## $ FILLON_ins <dbl> 18.394649, 16.267943, 12.625204, 16.808874, ... ## $ LASSALLE_ins <dbl> 0.3344482, 0.0000000, 0.6988120, 0.5119454, ... ## $ DUPONT-AIGNAN_ins <dbl> 5.685619, 2.870813, 4.029816, 3.839590, 4.04... ## $ HAMON_ins <dbl> 4.8494983, 6.2200957, 4.0065222, 3.1569966, ... ## $ ASSELINEAU_ins <dbl> 1.0033445, 0.4784689, 0.8269276, 0.8532423, ... ## $ POUTOU_ins <dbl> 0.6688963, 0.9569378, 1.0598649, 0.8532423, ... ## $ ARTHAUD_ins <dbl> 0.6688963, 0.9569378, 0.4658747, 0.4266212, ... ## $ CHEMINADE_ins <dbl> 0.33444816, 0.00000000, 0.05823433, 0.000000... ## $ LE PEN_exp <dbl> 25.45455, 27.27273, 25.83695, 32.79743, 23.3... ## $ MÉLENCHON_exp <dbl> 11.91919, 18.75000, 21.88469, 13.50482, 24.6... ## $ MACRON_exp <dbl> 24.040404, 21.022727, 20.644761, 20.471597, ... ## $ FILLON_exp <dbl> 22.22222, 19.31818, 16.80099, 21.11468, 18.1... ## $ LASSALLE_exp <dbl> 0.4040404, 0.0000000, 0.9299442, 0.6430868, ... ## $ DUPONT-AIGNAN_exp <dbl> 6.868687, 3.409091, 5.362678, 4.823151, 5.19... ## $ HAMON_exp <dbl> 5.8585859, 7.3863636, 5.3316801, 3.9657020, ... ## $ ASSELINEAU_exp <dbl> 1.2121212, 0.5681818, 1.1004340, 1.0718114, ... ## $ POUTOU_exp <dbl> 0.8080808, 1.1363636, 1.4104154, 1.0718114, ... ## $ ARTHAUD_exp <dbl> 0.8080808, 1.1363636, 0.6199628, 0.5359057, ... ## $ CHEMINADE_exp <dbl> 0.40404040, 0.00000000, 0.07749535, 0.000000... ## $ region <chr> "Province", "Province", "Province", "Provinc... ``` --- ## Fusionner des jeux de données - primary keys et foreign keys - `left_join` - `right_join` - `full_join` - `inner_join` - `semi_join` - `anti_join` --- ## Left join  --- ## Left join  --- ## Inner join  --- ## Outer join .reduite[.center[]] --- ## Semi join  --- ## Anti join  --- ## Exemple (source : https://www.insee.fr/fr/information/2114596) ```r library(readxl) ZE2010 <- read_xls("./data/ZE2010 au 01-01-2017.xls", sheet = "Composition_communale", skip = 5) elections <- elections %>% left_join(ZE2010, by = c("CodeInsee" = "CODGEO")) %>% glimpse() ``` ``` ## Observations: 35,496 ## Variables: 56 ## $ CodeInsee <chr> "01001", "01002", "01004", "01005", "01006",... ## $ CodeDepartement <chr> "01", "01", "01", "01", "01", "01", "01", "0... ## $ Département <chr> "Ain", "Ain", "Ain", "Ain", "Ain", "Ain", "A... ## $ Commune <chr> "L'Abergement-Clémenciat", "L'Abergement-de-... ## $ Inscrits <int> 598, 209, 8586, 1172, 99, 1880, 581, 254, 73... ## $ Abstentions <int> 92, 25, 1962, 215, 20, 268, 91, 41, 136, 43,... ## $ Abstentions_ins <dbl> 15.38462, 11.96172, 22.85115, 18.34471, 20.2... ## $ Votants <int> 506, 184, 6624, 957, 79, 1612, 490, 213, 602... ## $ Votants_ins <dbl> 84.61538, 88.03828, 77.14885, 81.65529, 79.7... ## $ Blancs <int> 2, 6, 114, 21, 2, 53, 12, 1, 2, 6, 0, 2, 33,... ## $ Blancs_ins <dbl> 0.3344482, 2.8708134, 1.3277428, 1.7918089, ... ## $ Blancs_vot <dbl> 0.3952569, 3.2608696, 1.7210145, 2.1943574, ... ## $ Nuls <int> 9, 2, 58, 3, 0, 11, 5, 3, 11, 2, 7, 0, 12, 8... ## $ Nuls_ins <dbl> 1.5050167, 0.9569378, 0.6755183, 0.2559727, ... ## $ Nuls_vot <dbl> 1.7786561, 1.0869565, 0.8756039, 0.3134796, ... ## $ Exprimés <int> 495, 176, 6452, 933, 77, 1548, 473, 209, 589... ## $ Exprimés_ins <dbl> 82.77592, 84.21053, 75.14559, 79.60751, 77.7... ## $ Exprimés_vot <dbl> 97.82609, 95.65217, 97.40338, 97.49216, 97.4... ## $ LE PEN <int> 126, 48, 1667, 306, 18, 458, 135, 40, 207, 6... ## $ MÉLENCHON <int> 59, 33, 1412, 126, 19, 296, 89, 39, 103, 30,... ## $ MACRON <int> 119, 37, 1332, 191, 15, 348, 95, 55, 110, 49... ## $ FILLON <int> 110, 34, 1084, 197, 14, 233, 84, 44, 83, 46,... ## $ LASSALLE <int> 2, 0, 60, 6, 1, 13, 3, 8, 14, 4, 2, 0, 21, 7... ## $ DUPONT-AIGNAN <int> 34, 6, 346, 45, 4, 80, 28, 6, 33, 9, 10, 3, ... ## $ HAMON <int> 29, 13, 344, 37, 3, 82, 23, 8, 20, 13, 10, 1... ## $ ASSELINEAU <int> 6, 1, 71, 10, 0, 11, 2, 6, 10, 1, 0, 2, 23, ... ## $ POUTOU <int> 4, 2, 91, 10, 2, 17, 8, 3, 5, 4, 2, 2, 15, 4... ## $ ARTHAUD <int> 4, 2, 40, 5, 1, 9, 3, 0, 3, 4, 2, 2, 9, 3, 5... ## $ CHEMINADE <int> 2, 0, 5, 0, 0, 1, 3, 0, 1, 0, 1, 0, 2, 2, 1,... ## $ LE PEN_ins <dbl> 21.07023, 22.96651, 19.41533, 26.10922, 18.1... ## $ MÉLENCHON_ins <dbl> 9.866221, 15.789474, 16.445376, 10.750853, 1... ## $ MACRON_ins <dbl> 19.899666, 17.703349, 15.513627, 16.296928, ... ## $ FILLON_ins <dbl> 18.394649, 16.267943, 12.625204, 16.808874, ... ## $ LASSALLE_ins <dbl> 0.3344482, 0.0000000, 0.6988120, 0.5119454, ... ## $ DUPONT-AIGNAN_ins <dbl> 5.685619, 2.870813, 4.029816, 3.839590, 4.04... ## $ HAMON_ins <dbl> 4.8494983, 6.2200957, 4.0065222, 3.1569966, ... ## $ ASSELINEAU_ins <dbl> 1.0033445, 0.4784689, 0.8269276, 0.8532423, ... ## $ POUTOU_ins <dbl> 0.6688963, 0.9569378, 1.0598649, 0.8532423, ... ## $ ARTHAUD_ins <dbl> 0.6688963, 0.9569378, 0.4658747, 0.4266212, ... ## $ CHEMINADE_ins <dbl> 0.33444816, 0.00000000, 0.05823433, 0.000000... ## $ LE PEN_exp <dbl> 25.45455, 27.27273, 25.83695, 32.79743, 23.3... ## $ MÉLENCHON_exp <dbl> 11.91919, 18.75000, 21.88469, 13.50482, 24.6... ## $ MACRON_exp <dbl> 24.040404, 21.022727, 20.644761, 20.471597, ... ## $ FILLON_exp <dbl> 22.22222, 19.31818, 16.80099, 21.11468, 18.1... ## $ LASSALLE_exp <dbl> 0.4040404, 0.0000000, 0.9299442, 0.6430868, ... ## $ DUPONT-AIGNAN_exp <dbl> 6.868687, 3.409091, 5.362678, 4.823151, 5.19... ## $ HAMON_exp <dbl> 5.8585859, 7.3863636, 5.3316801, 3.9657020, ... ## $ ASSELINEAU_exp <dbl> 1.2121212, 0.5681818, 1.1004340, 1.0718114, ... ## $ POUTOU_exp <dbl> 0.8080808, 1.1363636, 1.4104154, 1.0718114, ... ## $ ARTHAUD_exp <dbl> 0.8080808, 1.1363636, 0.6199628, 0.5359057, ... ## $ CHEMINADE_exp <dbl> 0.40404040, 0.00000000, 0.07749535, 0.000000... ## $ LIBGEO <chr> "L'Abergement-Clémenciat", "L'Abergement-de-... ## $ ZE2010 <chr> "8213", "8201", "8201", "8213", "8216", "820... ## $ LIBZE2010 <chr> "Villefranche-sur-Saône", "Ambérieu-en-Bugey... ## $ DEP <chr> "01", "01", "01", "01", "01", "01", "01", "0... ## $ REG <chr> "84", "84", "84", "84", "84", "84", "84", "8... ``` --- ## Exemple ```r elections %>% group_by(ZE2010) %>% summarise_at(vars(Inscrits, Abstentions, Votants, Blancs, Nuls, Exprimés, `LE PEN`:CHEMINADE), funs(sum(.))) ``` ``` ## # A tibble: 305 x 18 ## ZE2010 Inscrits Abstentions Votants Blancs Nuls Exprimés `LE PEN` ## <chr> <int> <int> <int> <int> <int> <int> <int> ## 1 0050 121799 21231 100568 2051 918 97599 18901 ## 2 0051 86349 15302 71047 1314 438 69295 15926 ## 3 0052 57124 12084 45040 786 436 43818 11943 ## 4 0053 104365 20689 83676 1719 690 81267 18395 ## 5 0054 35635 6336 29299 506 186 28607 7590 ## 6 0055 76458 14717 61741 1313 550 59878 19641 ## 7 0056 1000098 238264 761834 13718 5552 742564 155891 ## 8 0057 146393 26366 120027 2584 1442 116001 20371 ## 9 0059 367063 72659 294404 4947 1857 287600 87968 ## 10 0060 434913 93470 341443 6785 2343 332315 80238 ## # ... with 295 more rows, and 10 more variables: MÉLENCHON <int>, ## # MACRON <int>, FILLON <int>, LASSALLE <int>, `DUPONT-AIGNAN` <int>, ## # HAMON <int>, ASSELINEAU <int>, POUTOU <int>, ARTHAUD <int>, ## # CHEMINADE <int> ``` --- ## Exemple ```r elections %>% group_by(ZE2010) %>% summarise_at(vars(Inscrits, Abstentions, Votants, Blancs, Nuls, Exprimés, `LE PEN`:CHEMINADE), funs(sum(.))) %>% mutate_at(vars(`LE PEN`:CHEMINADE), funs(. / Inscrits * 100)) ``` ``` ## # A tibble: 305 x 18 ## ZE2010 Inscrits Abstentions Votants Blancs Nuls Exprimés `LE PEN` ## <chr> <int> <int> <int> <int> <int> <int> <dbl> ## 1 0050 121799 21231 100568 2051 918 97599 15.51819 ## 2 0051 86349 15302 71047 1314 438 69295 18.44376 ## 3 0052 57124 12084 45040 786 436 43818 20.90715 ## 4 0053 104365 20689 83676 1719 690 81267 17.62564 ## 5 0054 35635 6336 29299 506 186 28607 21.29928 ## 6 0055 76458 14717 61741 1313 550 59878 25.68861 ## 7 0056 1000098 238264 761834 13718 5552 742564 15.58757 ## 8 0057 146393 26366 120027 2584 1442 116001 13.91528 ## 9 0059 367063 72659 294404 4947 1857 287600 23.96537 ## 10 0060 434913 93470 341443 6785 2343 332315 18.44921 ## # ... with 295 more rows, and 10 more variables: MÉLENCHON <dbl>, ## # MACRON <dbl>, FILLON <dbl>, LASSALLE <dbl>, `DUPONT-AIGNAN` <dbl>, ## # HAMON <dbl>, ASSELINEAU <dbl>, POUTOU <dbl>, ARTHAUD <dbl>, ## # CHEMINADE <dbl> ``` --- ## À ne pas confondre : binding Ici il s'agit de juxtaposer des jeux de données - `bind_rows` - `bind_cols` --- ## Tips & tricks dplyr ```r ZE2010 %>% group_by(LIBZE2010) %>% summarise(n = n()) %>% arrange(desc(n)) ``` ``` ## # A tibble: 321 x 2 ## LIBZE2010 n ## <chr> <int> ## 1 Toulouse 716 ## 2 Amiens 478 ## 3 Rouen 473 ## 4 Tarbes - Lourdes 451 ## 5 Troyes 449 ## 6 Dijon 447 ## 7 Besançon 420 ## 8 Bordeaux 408 ## 9 Nancy 395 ## 10 Roissy - Sud Picardie 393 ## # ... with 311 more rows ``` --- ## Tips & tricks dplyr ```r ZE2010 %>% count(LIBZE2010, sort = TRUE) ``` ``` ## # A tibble: 321 x 2 ## LIBZE2010 n ## <chr> <int> ## 1 Toulouse 716 ## 2 Amiens 478 ## 3 Rouen 473 ## 4 Tarbes - Lourdes 451 ## 5 Troyes 449 ## 6 Dijon 447 ## 7 Besançon 420 ## 8 Bordeaux 408 ## 9 Nancy 395 ## 10 Roissy - Sud Picardie 393 ## # ... with 311 more rows ``` --- ## Tips & tricks dplyr Numéroter les communes par ordre décroissant de vote Le Pen dans le département ```r elections %>% group_by(CodeDepartement) %>% arrange(desc(`LE PEN_exp`)) %>% mutate(numero = 1:n()) %>% ungroup %>% arrange(desc(`LE PEN_exp`)) %>% select(CodeInsee, Commune, Département, numero, `LE PEN_exp`) ``` ``` ## # A tibble: 35,496 x 5 ## CodeInsee Commune Département numero `LE PEN_exp` ## <chr> <chr> <chr> <int> <dbl> ## 1 52066 Brachay Haute-Marne 1 83.72093 ## 2 ZP051 Tatakoto Polynésie française 1 74.78261 ## 3 10044 Bétignicourt Aube 1 74.07407 ## 4 02043 Bagneux Aisne 1 72.22222 ## 5 ZP041 Rapa Polynésie française 2 70.29703 ## 6 52201 Flammerécourt Haute-Marne 2 69.04762 ## 7 88298 Ménarmont Vosges 1 66.66667 ## 8 ZP030 Napuka Polynésie française 3 66.66667 ## 9 08184 Fromy Ardennes 1 66.07143 ## 10 25552 Sourans Doubs 1 65.59140 ## # ... with 35,486 more rows ``` --- class: inverse, center, middle # Visualisation de données avec ggplot2 --- ## Le panorama des systèmes graphiques de R - base graphics : mélange bas niveau/haut niveau, complexe, pas très cohérent, mais rapide et puissant - lattice : haut niveau, a une grammaire, plus très utilisé - ggplot2 : haut niveau, grammaire cohérente, très puissant, très largement utilisé et [imité](http://ggplot.yhathq.com/) --- ## La notion de grammaire de visualisation - données - correspondances esthétiques (`aes`thetic mapping) - couches (`layer`) : `geom` et `stat` - échelles (`scale`) - un système de coordonnées (`coord`) - petits multiples (`facet`) - un thème (`theme`) --- ## Un premier exemple simple - choisir un jeu de données - faire des graphiques descriptifs (univariés) - faire des graphiques bivariés - changer les `aes` - changer les `geom` - changer les `theme` --- ## Les principaux `geom` - `geom_point` - `geom_line` - `geom_text` - `geom_quantile` - `geom_histogram` - `geom_density` - `geom_smooth` - `geom_bar` - et même `geom_map` (et `geom_sf` en conjonction avec le package `sf`) --- ## Les principaux `aes`thetics - `x` - `y` - `colour` - `fill` - `size` - line width `lwd` - line type `lty` --- class: inverse, center, middle # Point sur les acquis, les projets, les besoins --- class: inverse, center, middle # Évaluation https://frama.link/eval_numa_R_2 --- class: inverse, center, middle # À votre disposition ! joel@datactivi.st