-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdatatable-intro-fr.pot
1190 lines (991 loc) · 37.9 KB
/
datatable-intro-fr.pot
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
#
msgid ""
msgstr ""
#: d.Rmd:block 1 (header)
msgid ""
"title: \"Introduction to data.table\"\n"
"date: \"`r Sys.Date()`\"\n"
"output:\n"
"markdown::html_format\n"
"vignette: >\n"
"%\\VignetteIndexEntry{Introduction to data.table}\n"
"%\\VignetteEngine{knitr::knitr}\n"
"\\usepackage[utf8]{inputenc}"
msgstr ""
#: d.Rmd:block 3 (paragraph)
msgid ""
"This vignette introduces the `data.table` syntax, its general form, how to "
"*subset* rows, *select and compute* on columns, and perform aggregations *by"
" group*. Familiarity with the `data.frame` data structure from base R is "
"useful, but not essential to follow this vignette."
msgstr ""
#: d.Rmd:block 4 (header)
msgid "Data analysis using `data.table`"
msgstr ""
#: d.Rmd:block 5 (paragraph)
msgid ""
"Data manipulation operations such as *subset*, *group*, *update*, *join*, "
"etc. are all inherently related. Keeping these *related operations together*"
" allows for:"
msgstr ""
#: d.Rmd:block 6 (unordered list)
msgid ""
"*concise* and *consistent* syntax irrespective of the set of operations you "
"would like to perform to achieve your end goal."
msgstr ""
#: d.Rmd:block 6 (unordered list)
msgid ""
"performing analysis *fluidly* without the cognitive burden of having to map "
"each operation to a particular function from a potentially huge set of "
"functions available before performing the analysis."
msgstr ""
#: d.Rmd:block 6 (unordered list)
msgid ""
"*automatically* optimising operations internally and very effectively by "
"knowing precisely the data required for each operation, leading to very fast"
" and memory-efficient code."
msgstr ""
#: d.Rmd:block 7 (paragraph)
msgid ""
"Briefly, if you are interested in reducing *programming* and *compute* time "
"tremendously, then this package is for you. The philosophy that `data.table`"
" adheres to makes this possible. Our goal is to illustrate it through this "
"series of vignettes."
msgstr ""
#: d.Rmd:block 8 (header)
msgid "Data {#data}"
msgstr ""
#: d.Rmd:block 9 (paragraph)
msgid ""
"In this vignette, we will use [NYC-"
"flights14](https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv)"
" data obtained from the [flights](https://github.com/arunsrinivasan/flights)"
" package (available on GitHub only). It contains On-Time flights data from "
"the Bureau of Transportation Statistics for all the flights that departed "
"from New York City airports in 2014 (inspired by "
"[nycflights13](https://github.com/tidyverse/nycflights13)). The data is "
"available only for Jan-Oct'14."
msgstr ""
#: d.Rmd:block 10 (paragraph)
msgid ""
"We can use `data.table`'s fast-and-friendly file reader `fread` to load "
"`flights` directly as follows:"
msgstr ""
#: d.Rmd:block 13 (paragraph)
msgid ""
"Aside: `fread` accepts `http` and `https` URLs directly, as well as "
"operating system commands such as `sed` and `awk` output. See `?fread` for "
"examples."
msgstr ""
#: d.Rmd:block 14 (header)
msgid "Introduction"
msgstr ""
#: d.Rmd:block 15 (paragraph)
msgid "In this vignette, we will"
msgstr ""
#: d.Rmd:block 16 (ordered list)
msgid ""
"Start with the basics - what is a `data.table`, its general form, how to "
"*subset* rows, how to *select and compute* on columns;"
msgstr ""
#: d.Rmd:block 16 (ordered list)
msgid "Then we will look at performing data aggregations by group"
msgstr ""
#: d.Rmd:block 17 (header)
msgid "1. Basics {#basics-1}"
msgstr ""
#: d.Rmd:block 18 (header)
msgid "a) What is `data.table`? {#what-is-datatable-1a}"
msgstr ""
#: d.Rmd:block 19 (paragraph)
msgid ""
"`data.table` is an R package that provides **an enhanced version** of a "
"`data.frame`, the standard data structure for storing data in `base` R. In "
"the [Data](#data) section above, we saw how to create a `data.table` using "
"`fread()`, but alternatively we can also create one using the `data.table()`"
" function. Here is an example:"
msgstr ""
#: d.Rmd:block 21 (paragraph)
msgid ""
"You can also convert existing objects to a `data.table` using `setDT()` (for"
" `data.frame` and `list` structures) or `as.data.table()` (for other "
"structures). For more details pertaining to the difference (goes beyond the "
"scope of this vignette), please see `?setDT` and `?as.data.table`."
msgstr ""
#: d.Rmd:block 22 (header)
msgid "Note that:"
msgstr ""
#: d.Rmd:block 23 (unordered list)
msgid ""
"Row numbers are printed with a `:` in order to visually separate the row "
"number from the first column."
msgstr ""
#: d.Rmd:block 23 (unordered list)
msgid ""
"When the number of rows to print exceeds the global option "
"`datatable.print.nrows` (default = `r "
"getOption(\"datatable.print.nrows\")`), it automatically prints only the top"
" 5 and bottom 5 rows (as can be seen in the [Data](#data) section). For a "
"large `data.frame`, you may have found yourself waiting around while larger "
"tables print-and-page, sometimes seemingly endlessly. This restriction helps"
" with that, and you can query the default number like so:"
msgstr ""
#: d.Rmd:block 23 (unordered list)
msgid ""
"`data.table` doesn't set or use *row names*, ever. We will see why in the "
"*\"Keys and fast binary search based subset\"* vignette."
msgstr ""
#: d.Rmd:block 24 (header)
msgid ""
"b) General form - in what way is a `data.table` *enhanced*? {#enhanced-1b}"
msgstr ""
#: d.Rmd:block 25 (paragraph)
msgid ""
"In contrast to a `data.frame`, you can do *a lot more* than just subsetting "
"rows and selecting columns within the frame of a `data.table`, i.e., within "
"`[ ... ]` (NB: we might also refer to writing things inside `DT[...]` as "
"\"querying `DT`\", as an analogy or in relevance to SQL). To understand it "
"we will have to first look at the *general form* of the `data.table` syntax,"
" as shown below:"
msgstr ""
#: d.Rmd:block 27 (paragraph)
msgid ""
"Users with an SQL background might perhaps immediately relate to this "
"syntax."
msgstr ""
#: d.Rmd:block 28 (header)
msgid "The way to read it (out loud) is:"
msgstr ""
#: d.Rmd:block 29 (paragraph)
msgid ""
"Take `DT`, subset/reorder rows using `i`, then calculate `j`, grouped by "
"`by`."
msgstr ""
#: d.Rmd:block 30 (paragraph)
msgid ""
"Let's begin by looking at `i` and `j` first - subsetting rows and operating "
"on columns."
msgstr ""
#: d.Rmd:block 31 (header)
msgid "c) Subset rows in `i` {#subset-i-1c}"
msgstr ""
#: d.Rmd:block 32 (header)
msgid ""
"-- Get all the flights with \"JFK\" as the origin airport in the month of "
"June."
msgstr ""
#: d.Rmd:block 34 (unordered list)
msgid ""
"Within the frame of a `data.table`, columns can be referred to *as if they "
"are variables*, much like in SQL or Stata. Therefore, we simply refer to "
"`origin` and `month` as if they are variables. We do not need to add the "
"prefix `flights$` each time. Nevertheless, using `flights$origin` and "
"`flights$month` would work just fine."
msgstr ""
#: d.Rmd:block 34 (unordered list)
msgid ""
"The *row indices* that satisfy the condition `origin == \"JFK\" & month == "
"6L` are computed, and since there is nothing else left to do, all columns "
"from `flights` at rows corresponding to those *row indices* are simply "
"returned as a `data.table`."
msgstr ""
#: d.Rmd:block 34 (unordered list)
msgid ""
"A comma after the condition in `i` is not required. But `flights[origin == "
"\"JFK\" & month == 6L, ]` would work just fine. In a `data.frame`, however, "
"the comma is necessary."
msgstr ""
#: d.Rmd:block 35 (header)
msgid "-- Get the first two rows from `flights`. {#subset-rows-integer}"
msgstr ""
#: d.Rmd:block 37 (unordered list)
msgid ""
"In this case, there is no condition. The row indices are already provided in"
" `i`. We therefore return a `data.table` with all columns from `flights` at "
"rows for those *row indices*."
msgstr ""
#: d.Rmd:block 38 (header)
msgid ""
"-- Sort `flights` first by column `origin` in *ascending* order, and then by"
" `dest` in *descending* order:"
msgstr ""
#: d.Rmd:block 39 (paragraph)
msgid "We can use the R function `order()` to accomplish this."
msgstr ""
#: d.Rmd:block 41 (header)
msgid "`order()` is internally optimised"
msgstr ""
#: d.Rmd:block 42 (unordered list)
msgid ""
"We can use \"-\" on `character` columns within the frame of a `data.table` "
"to sort in decreasing order."
msgstr ""
#: d.Rmd:block 42 (unordered list)
msgid ""
"In addition, `order(...)` within the frame of a `data.table` uses "
"`data.table`'s internal fast radix order `forder()`. This sort provided such"
" a compelling improvement over R's `base::order` that the R project adopted "
"the `data.table` algorithm as its default sort in 2016 for R 3.3.0 (for "
"reference, check `?sort` and the [R Release "
"NEWS](https://cran.r-project.org/doc/manuals/r-release/NEWS.pdf))."
msgstr ""
#: d.Rmd:block 43 (paragraph)
msgid ""
"We will discuss `data.table`'s fast order in more detail in the "
"*`data.table` internals* vignette."
msgstr ""
#: d.Rmd:block 44 (header)
msgid "d) Select column(s) in `j` {#select-j-1d}"
msgstr ""
#: d.Rmd:block 45 (header)
msgid "-- Select `arr_delay` column, but return it as a *vector*."
msgstr ""
#: d.Rmd:block 47 (unordered list)
msgid ""
"Since columns can be referred to as if they are variables within the frame "
"of a `data.table`, we directly refer to the *variable* we want to subset. "
"Since we want *all the rows*, we simply skip `i`."
msgstr ""
#: d.Rmd:block 47 (unordered list)
msgid "It returns *all* the rows for the column `arr_delay`."
msgstr ""
#: d.Rmd:block 48 (header)
msgid "-- Select `arr_delay` column, but return as a `data.table` instead."
msgstr ""
#: d.Rmd:block 50 (unordered list)
msgid ""
"We wrap the *variables* (column names) within `list()`, which ensures that a"
" `data.table` is returned. In the case of a single column name, not wrapping"
" with `list()` returns a vector instead, as seen in the [previous "
"example](#select-j-1d)."
msgstr ""
#: d.Rmd:block 50 (unordered list)
msgid ""
"`data.table` also allows wrapping columns with `.()` instead of `list()`. It"
" is an *alias* to `list()`; they both mean the same. Feel free to use "
"whichever you prefer; we have noticed most users seem to prefer `.()` for "
"conciseness, so we will continue to use `.()` hereafter."
msgstr ""
#: d.Rmd:block 51 (paragraph)
msgid ""
"A `data.table` (and a `data.frame` too) is internally a `list` as well, with"
" the stipulation that each element has the same length and the `list` has a "
"`class` attribute. Allowing `j` to return a `list` enables converting and "
"returning `data.table` very efficiently."
msgstr ""
#: d.Rmd:block 52 (header)
msgid "Tip: {#tip-1}"
msgstr ""
#: d.Rmd:block 53 (paragraph)
msgid ""
"As long as `j-expression` returns a `list`, each element of the list will be"
" converted to a column in the resulting `data.table`. This makes `j` quite "
"powerful, as we will see shortly. It is also very important to understand "
"this for when you'd like to make more complicated queries!!"
msgstr ""
#: d.Rmd:block 54 (header)
msgid "-- Select both `arr_delay` and `dep_delay` columns."
msgstr ""
#: d.Rmd:block 56 (unordered list)
msgid "Wrap both columns within `.()`, or `list()`. That's it."
msgstr ""
#: d.Rmd:block 57 (header)
msgid ""
"-- Select both `arr_delay` and `dep_delay` columns *and* rename them to "
"`delay_arr` and `delay_dep`."
msgstr ""
#: d.Rmd:block 58 (paragraph)
msgid ""
"Since `.()` is just an alias for `list()`, we can name columns as we would "
"while creating a `list`."
msgstr ""
#: d.Rmd:block 60 (header)
msgid "e) Compute or *do* in `j`"
msgstr ""
#: d.Rmd:block 61 (header)
msgid "-- How many trips have had total delay < 0?"
msgstr ""
#: d.Rmd:block 63 (header)
msgid "What's happening here?"
msgstr ""
#: d.Rmd:block 64 (unordered list)
msgid ""
"`data.table`'s `j` can handle more than just *selecting columns* - it can "
"handle *expressions*, i.e., *computing on columns*. This shouldn't be "
"surprising, as columns can be referred to as if they are variables. Then we "
"should be able to *compute* by calling functions on those variables. And "
"that's what precisely happens here."
msgstr ""
#: d.Rmd:block 65 (header)
msgid "f) Subset in `i` *and* do in `j`"
msgstr ""
#: d.Rmd:block 66 (header)
msgid ""
"-- Calculate the average arrival and departure delay for all flights with "
"\"JFK\" as the origin airport in the month of June."
msgstr ""
#: d.Rmd:block 68 (unordered list)
msgid ""
"We first subset in `i` to find matching *row indices* where `origin` airport"
" equals `\"JFK\"`, and `month` equals `6L`. We *do not* subset the *entire* "
"`data.table` corresponding to those rows *yet*."
msgstr ""
#: d.Rmd:block 68 (unordered list)
msgid ""
"Now, we look at `j` and find that it uses only *two columns*. And what we "
"have to do is to compute their `mean()`. Therefore, we subset just those "
"columns corresponding to the matching rows, and compute their `mean()`."
msgstr ""
#: d.Rmd:block 69 (paragraph)
msgid ""
"Because the three main components of the query (`i`, `j` and `by`) are "
"*together* inside `[...]`, `data.table` can see all three and optimise the "
"query altogether *before evaluation*, rather than optimizing each "
"separately. We are able to therefore avoid the entire subset (i.e., "
"subsetting the columns *besides* `arr_delay` and `dep_delay`), for both "
"speed and memory efficiency."
msgstr ""
#: d.Rmd:block 70 (header)
msgid ""
"-- How many trips have been made in 2014 from \"JFK\" airport in the month "
"of June?"
msgstr ""
#: d.Rmd:block 72 (paragraph)
msgid ""
"The function `length()` requires an input argument. We just need to compute "
"the number of rows in the subset. We could have used any other column as the"
" input argument to `length()`. This approach is reminiscent of `SELECT "
"COUNT(dest) FROM flights WHERE origin = 'JFK' AND month = 6` in SQL."
msgstr ""
#: d.Rmd:block 73 (paragraph)
msgid ""
"This type of operation occurs quite frequently, especially while grouping "
"(as we will see in the next section), to the point where `data.table` "
"provides a *special symbol* `.N` for it."
msgstr ""
#: d.Rmd:block 74 (header)
msgid "g) Handle non-existing elements in `i`"
msgstr ""
#: d.Rmd:block 75 (header)
msgid "-- What happens when querying for non-existing elements?"
msgstr ""
#: d.Rmd:block 76 (paragraph)
msgid ""
"When querying a `data.table` for elements that do not exist, the behavior "
"differs based on the method used."
msgstr ""
#: d.Rmd:block 78 (unordered list)
msgid "**Key-based subsetting: `dt[\"d\"]`**"
msgstr ""
#: d.Rmd:block 78 (unordered list)
msgid ""
"This performs a right join on the key column `x`, resulting in a row with "
"`d` and `NA` for columns not found. When using `setkeyv`, the table is "
"sorted by the specified keys and an internal index is created, enabling "
"binary search for efficient subsetting."
msgstr ""
#: d.Rmd:block 78 (unordered list)
msgid "**Logical subsetting: `dt[x == \"d\"]`**"
msgstr ""
#: d.Rmd:block 78 (unordered list)
msgid ""
"This performs a standard subset operation that does not find any matching "
"rows and thus returns an empty `data.table`."
msgstr ""
#: d.Rmd:block 78 (unordered list)
msgid "**Exact match using `nomatch=NULL`**"
msgstr ""
#: d.Rmd:block 78 (unordered list)
msgid ""
"For exact matches without `NA` for non-existing elements, use "
"`nomatch=NULL`:"
msgstr ""
#: d.Rmd:block 79 (paragraph)
msgid ""
"Understanding these behaviors can help prevent confusion when dealing with "
"non-existing elements in your data."
msgstr ""
#: d.Rmd:block 80 (header)
msgid "Special symbol `.N`: {#special-N}"
msgstr ""
#: d.Rmd:block 81 (paragraph)
msgid ""
"`.N` is a special built-in variable that holds the number of observations "
"*in the current group*. It is particularly useful when combined with `by` as"
" we'll see in the next section. In the absence of group by operations, it "
"simply returns the number of rows in the subset."
msgstr ""
#: d.Rmd:block 82 (paragraph)
msgid ""
"Now that we now, we can now accomplish the same task by using `.N` as "
"follows:"
msgstr ""
#: d.Rmd:block 84 (unordered list)
msgid ""
"Once again, we subset in `i` to get the *row indices* where `origin` airport"
" equals *\"JFK\"*, and `month` equals *6*."
msgstr ""
#: d.Rmd:block 84 (unordered list)
msgid ""
"We see that `j` uses only `.N` and no other columns. Therefore, the entire "
"subset is not materialised. We simply return the number of rows in the "
"subset (which is just the length of row indices)."
msgstr ""
#: d.Rmd:block 84 (unordered list)
msgid ""
"Note that we did not wrap `.N` with `list()` or `.()`. Therefore, a vector "
"is returned."
msgstr ""
#: d.Rmd:block 85 (paragraph)
msgid ""
"We could have accomplished the same operation by doing `nrow(flights[origin "
"== \"JFK\" & month == 6L])`. However, it would have to subset the entire "
"`data.table` first corresponding to the *row indices* in `i` *and then* "
"return the rows using `nrow()`, which is unnecessary and inefficient. We "
"will cover this and other optimisation aspects in detail under the "
"*`data.table` design* vignette."
msgstr ""
#: d.Rmd:block 86 (header)
msgid ""
"h) Great! But how can I refer to columns by names in `j` (like in a "
"`data.frame`)? {#refer_j}"
msgstr ""
#: d.Rmd:block 87 (paragraph)
msgid ""
"If you're writing out the column names explicitly, there's no difference "
"compared to a `data.frame` (since v1.9.8)."
msgstr ""
#: d.Rmd:block 88 (header)
msgid ""
"-- Select both `arr_delay` and `dep_delay` columns the `data.frame` way."
msgstr ""
#: d.Rmd:block 90 (paragraph)
msgid ""
"If you've stored the desired columns in a character vector, there are two "
"options: Using the `..` prefix, or using the `with` argument."
msgstr ""
#: d.Rmd:block 91 (header)
msgid "-- Select columns named in a variable using the `..` prefix"
msgstr ""
#: d.Rmd:block 93 (paragraph)
msgid ""
"For those familiar with the Unix terminal, the `..` prefix should be "
"reminiscent of the \"up-one-level\" command, which is analogous to what's "
"happening here -- the `..` signals to `data.table` to look for the "
"`select_cols` variable \"up-one-level\", i.e., within the global environment"
" in this case."
msgstr ""
#: d.Rmd:block 94 (header)
msgid "-- Select columns named in a variable using `with = FALSE`"
msgstr ""
#: d.Rmd:block 96 (paragraph)
msgid ""
"The argument is named `with` after the R function `with()` because of "
"similar functionality. Suppose you have a `data.frame` `DF` and you'd like "
"to subset all rows where `x > 1`. In `base` R you can do the following:"
msgstr ""
#: d.Rmd:block 98 (unordered list)
msgid ""
"Using `with()` in (2) allows using `DF`'s column `x` as if it were a "
"variable."
msgstr ""
#: d.Rmd:block 98 (unordered list)
msgid ""
"Hence, the argument name `with` in `data.table`. Setting `with = FALSE` "
"disables the ability to refer to columns as if they are variables, thereby "
"restoring the \"`data.frame` mode\"."
msgstr ""
#: d.Rmd:block 98 (unordered list)
msgid "We can also *deselect* columns using `-` or `!`. For example:"
msgstr ""
#: d.Rmd:block 98 (unordered list)
msgid ""
"From `v1.9.5+`, we can also select by specifying start and end column names,"
" e.g., `year:day` to select the first three columns."
msgstr ""
#: d.Rmd:block 98 (unordered list)
msgid "This is particularly handy while working interactively."
msgstr ""
#: d.Rmd:block 99 (paragraph)
msgid ""
"`with = TRUE` is the default in `data.table` because we can do much more by "
"allowing `j` to handle expressions - especially when combined with `by`, as "
"we'll see in a moment."
msgstr ""
#: d.Rmd:block 100 (header)
msgid "2. Aggregations"
msgstr ""
#: d.Rmd:block 101 (paragraph)
msgid ""
"We've already seen `i` and `j` from `data.table`'s general form in the "
"previous section. In this section, we'll see how they can be combined "
"together with `by` to perform operations *by group*. Let's look at some "
"examples."
msgstr ""
#: d.Rmd:block 102 (header)
msgid "a) Grouping using `by`"
msgstr ""
#: d.Rmd:block 103 (header)
msgid ""
"-- How can we get the number of trips corresponding to each origin airport?"
msgstr ""
#: d.Rmd:block 105 (unordered list)
msgid ""
"We know `.N` [is a special variable](#special-N) that holds the number of "
"rows in the current group. Grouping by `origin` obtains the number of rows, "
"`.N`, for each group."
msgstr ""
#: d.Rmd:block 105 (unordered list)
msgid ""
"By doing `head(flights)` you can see that the origin airports occur in the "
"order *\"JFK\"*, *\"LGA\"*, and *\"EWR\"*. The original order of grouping "
"variables is preserved in the result. *This is important to keep in mind!*"
msgstr ""
#: d.Rmd:block 105 (unordered list)
msgid ""
"Since we did not provide a name for the column returned in `j`, it was named"
" `N` automatically by recognising the special symbol `.N`."
msgstr ""
#: d.Rmd:block 105 (unordered list)
msgid ""
"`by` also accepts a character vector of column names. This is particularly "
"useful for coding programmatically, e.g., designing a function with the "
"grouping columns (in the form of a `character` vector) as a function "
"argument."
msgstr ""
#: d.Rmd:block 105 (unordered list)
msgid ""
"When there's only one column or expression to refer to in `j` and `by`, we "
"can drop the `.()` notation. This is purely for convenience. We could "
"instead do:"
msgstr ""
#: d.Rmd:block 105 (unordered list)
msgid "We'll use this convenient form wherever applicable hereafter."
msgstr ""
#: d.Rmd:block 106 (header)
msgid ""
"-- How can we calculate the number of trips for each origin airport for "
"carrier code `\"AA\"`? {#origin-.N}"
msgstr ""
#: d.Rmd:block 107 (paragraph)
msgid "The unique carrier code `\"AA\"` corresponds to *American Airlines Inc.*"
msgstr ""
#: d.Rmd:block 109 (unordered list)
msgid ""
"We first obtain the row indices for the expression `carrier == \"AA\"` from "
"`i`."
msgstr ""
#: d.Rmd:block 109 (unordered list)
msgid ""
"Using those *row indices*, we obtain the number of rows while grouped by "
"`origin`. Once again no columns are actually materialised here, because the "
"`j-expression` does not require any columns to be actually subsetted and is "
"therefore fast and memory efficient."
msgstr ""
#: d.Rmd:block 110 (header)
msgid ""
"-- How can we get the total number of trips for each `origin, dest` pair for"
" carrier code `\"AA\"`? {#origin-dest-.N}"
msgstr ""
#: d.Rmd:block 112 (unordered list)
msgid ""
"`by` accepts multiple columns. We just provide all the columns by which to "
"group by. Note the use of `.()` again in `by` -- again, this is just "
"shorthand for `list()`, and `list()` can be used here as well. Again, we'll "
"stick with `.()` in this vignette."
msgstr ""
#: d.Rmd:block 113 (header)
msgid ""
"-- How can we get the average arrival and departure delay for each "
"`orig,dest` pair for each month for carrier code `\"AA\"`? {#origin-dest-"
"month}"
msgstr ""
#: d.Rmd:block 115 (unordered list)
msgid ""
"Since we did not provide column names for the expressions in `j`, they were "
"automatically generated as `V1` and `V2`."
msgstr ""
#: d.Rmd:block 115 (unordered list)
msgid ""
"Once again, note that the input order of grouping columns is preserved in "
"the result."
msgstr ""
#: d.Rmd:block 116 (paragraph)
msgid ""
"Now what if we would like to order the result by those grouping columns "
"`origin`, `dest` and `month`?"
msgstr ""
#: d.Rmd:block 117 (header)
msgid "b) Sorted `by`: `keyby`"
msgstr ""
#: d.Rmd:block 118 (paragraph)
msgid ""
"`data.table` retaining the original order of groups is intentional and by "
"design. There are cases when preserving the original order is essential. But"
" at times we would like to automatically sort by the variables in our "
"grouping."
msgstr ""
#: d.Rmd:block 119 (header)
msgid "-- So how can we directly order by all the grouping variables?"
msgstr ""
#: d.Rmd:block 121 (unordered list)
msgid ""
"All we did was change `by` to `keyby`. This automatically orders the result "
"by the grouping variables in increasing order. In fact, due to the internal "
"implementation of `by` first requiring a sort before recovering the original"
" table's order, `keyby` is typically faster than `by` because it doesn't "
"require this second step."
msgstr ""
#: d.Rmd:block 122 (paragraph)
msgid ""
"**Keys:** Actually `keyby` does a little more than *just ordering*. It also "
"*sets a key* after ordering by setting an `attribute` called `sorted`."
msgstr ""
#: d.Rmd:block 123 (paragraph)
msgid ""
"We'll learn more about `keys` in the *Keys and fast binary search based "
"subset* vignette; for now, all you have to know is that you can use `keyby` "
"to automatically order the result by the columns specified in `by`."
msgstr ""
#: d.Rmd:block 124 (header)
msgid "c) Chaining"
msgstr ""
#: d.Rmd:block 125 (paragraph)
msgid ""
"Let's reconsider the task of [getting the total number of trips for each "
"`origin, dest` pair for carrier *\"AA\"*](#origin-dest-.N)."
msgstr ""
#: d.Rmd:block 127 (header)
msgid ""
"-- How can we order `ans` using the columns `origin` in ascending order, and"
" `dest` in descending order?"
msgstr ""
#: d.Rmd:block 128 (paragraph)
msgid ""
"We can store the intermediate result in a variable, and then use "
"`order(origin, -dest)` on that variable. It seems fairly straightforward."
msgstr ""
#: d.Rmd:block 130 (unordered list)
msgid ""
"Recall that we can use `-` on a `character` column in `order()` within the "
"frame of a `data.table`. This is possible due to `data.table`'s internal "
"query optimisation."
msgstr ""
#: d.Rmd:block 130 (unordered list)
msgid ""
"Also recall that `order(...)` with the frame of a `data.table` is "
"*automatically optimised* to use `data.table`'s internal fast radix order "
"`forder()` for speed."
msgstr ""
#: d.Rmd:block 131 (paragraph)
msgid ""
"But this requires having to assign the intermediate result and then "
"overwriting that result. We can do one better and avoid this intermediate "
"assignment to a temporary variable altogether by *chaining* expressions."
msgstr ""
#: d.Rmd:block 133 (unordered list)
msgid ""
"We can tack expressions one after another, *forming a chain* of operations, "
"i.e., `DT[ ... ][ ... ][ ... ]`."
msgstr ""
#: d.Rmd:block 133 (unordered list)
msgid "Or you can also chain them vertically:"
msgstr ""
#: d.Rmd:block 134 (header)
msgid "d) Expressions in `by`"
msgstr ""
#: d.Rmd:block 135 (header)
msgid "-- Can `by` accept *expressions* as well or does it just take columns?"
msgstr ""
#: d.Rmd:block 136 (paragraph)
msgid ""
"Yes it does. As an example, if we would like to find out how many flights "
"started late but arrived early (or on time), started and arrived late etc..."
msgstr ""
#: d.Rmd:block 138 (unordered list)
msgid ""
"The last row corresponds to `dep_delay > 0 = TRUE` and `arr_delay > 0 = "
"FALSE`. We can see that `r flights[!is.na(arr_delay) & !is.na(dep_delay), "
".N, .(dep_delay>0, arr_delay>0)][, N[4L]]` flights started late but arrived "
"early (or on time)."
msgstr ""
#: d.Rmd:block 138 (unordered list)
msgid ""
"Note that we did not provide any names to `by-expression`. Therefore, names "
"have been automatically assigned in the result. As with `j`, you can name "
"these expressions as you would for elements of any `list`, like for e.g. "
"`DT[, .N, .(dep_delayed = dep_delay>0, arr_delayed = arr_delay>0)]`."
msgstr ""
#: d.Rmd:block 138 (unordered list)
msgid ""
"You can provide other columns along with expressions, for example: `DT[, .N,"
" by = .(a, b>0)]`."
msgstr ""
#: d.Rmd:block 139 (header)
msgid "e) Multiple columns in `j` - `.SD`"
msgstr ""
#: d.Rmd:block 140 (header)
msgid "-- Do we have to compute `mean()` for each column individually?"
msgstr ""
#: d.Rmd:block 141 (paragraph)
msgid ""
"It is of course not practical to have to type `mean(myCol)` for every column"
" one by one. What if you had 100 columns to average `mean()`?"
msgstr ""
#: d.Rmd:block 142 (paragraph)
msgid ""
"How can we do this efficiently and concisely? To get there, refresh on [this"
" tip](#tip-1) - *\"As long as the `j`-expression returns a `list`, each "
"element of the `list` will be converted to a column in the resulting "
"`data.table`\"*. If we can refer to the *data subset for each group* as a "
"variable *while grouping*, we can then loop through all the columns of that "
"variable using the already- or soon-to-be-familiar base function `lapply()`."
" No new names to learn specific to `data.table`."
msgstr ""
#: d.Rmd:block 143 (header)
msgid "Special symbol `.SD`: {#special-SD}"
msgstr ""
#: d.Rmd:block 144 (paragraph)
msgid ""
"`data.table` provides a *special* symbol called `.SD`. It stands for "
"**S**ubset of **D**ata. It by itself is a `data.table` that holds the data "
"for *the current group* defined using `by`."
msgstr ""
#: d.Rmd:block 145 (paragraph)
msgid ""
"Recall that a `data.table` is internally a `list` as well with all its "
"columns of equal length."
msgstr ""
#: d.Rmd:block 146 (paragraph)
msgid ""
"Let's use the [`data.table` `DT` from before](#what-is-datatable-1a) to get "
"a glimpse of what `.SD` looks like."
msgstr ""
#: d.Rmd:block 148 (unordered list)
msgid ""
"`.SD` contains all the columns *except the grouping columns* by default."
msgstr ""
#: d.Rmd:block 148 (unordered list)
msgid ""
"It is also generated by preserving the original order - data corresponding "
"to `ID = \"b\"`, then `ID = \"a\"`, and then `ID = \"c\"`."
msgstr ""
#: d.Rmd:block 149 (paragraph)
msgid ""
"To compute on (multiple) columns, we can then simply use the base R function"
" `lapply()`."
msgstr ""
#: d.Rmd:block 151 (unordered list)
msgid ""
"`.SD` holds the rows corresponding to columns `a`, `b` and `c` for that "
"group. We compute the `mean()` on each of these columns using the already-"
"familiar base function `lapply()`."
msgstr ""
#: d.Rmd:block 151 (unordered list)
msgid ""
"Each group returns a list of three elements containing the mean value which "
"will become the columns of the resulting `data.table`."
msgstr ""
#: d.Rmd:block 151 (unordered list)
msgid ""
"Since `lapply()` returns a `list`, so there is no need to wrap it with an "
"additional `.()` (if necessary, refer to [this tip](#tip-1))."
msgstr ""
#: d.Rmd:block 152 (paragraph)
msgid ""
"We are almost there. There is one little thing left to address. In our "
"`flights` `data.table`, we only wanted to calculate the `mean()` of the two "
"columns `arr_delay` and `dep_delay`. But `.SD` would contain all the columns"
" other than the grouping variables by default."
msgstr ""
#: d.Rmd:block 153 (header)
msgid ""
"-- How can we specify just the columns we would like to compute the `mean()`"
" on?"
msgstr ""
#: d.Rmd:block 154 (header)
msgid ".SDcols"
msgstr ""
#: d.Rmd:block 155 (paragraph)
msgid ""
"Using the argument `.SDcols`. It accepts either column names or column "
"indices. For example, `.SDcols = c(\"arr_delay\", \"dep_delay\")` ensures "
"that `.SD` contains only these two columns for each group."
msgstr ""
#: d.Rmd:block 156 (paragraph)
msgid ""
"Similar to [part g)](#refer_j), you can also specify the columns to remove "
"instead of columns to keep using `-` or `!`. Additionally, you can select "
"consecutive columns as `colA:colB` and deselect them as `!(colA:colB)` or "
"`-(colA:colB)`."
msgstr ""
#: d.Rmd:block 157 (paragraph)
msgid ""
"Now let us try to use `.SD` along with `.SDcols` to get the `mean()` of "
"`arr_delay` and `dep_delay` columns grouped by `origin`, `dest` and `month`."
msgstr ""
#: d.Rmd:block 159 (header)
msgid "f) Subset `.SD` for each group:"
msgstr ""
#: d.Rmd:block 160 (header)
msgid "-- How can we return the first two rows for each `month`?"
msgstr ""
#: d.Rmd:block 162 (unordered list)
msgid ""
"`.SD` is a `data.table` that holds all the rows for *that group*. We simply "
"subset the first two rows as we have seen [here](#subset-rows-integer) "