assuming the following arrays: a(sex, age) and age_limit(sex)
step 1:
a[age > age_limit]
a[age + clength < age_limit]
b = a * (age > age_limit)
step 2:
a[X.age > age_limit]
# this is also possible ("X.age > age_limit" returns an Expr; the expr is
# evaluated during the binop, with axis references replaced by real axes)
b = a * (X.age > age_limit)
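The deferred-evaluation idea above can be sketched in plain Python/numpy. `AxisRef` and `Expr` are hypothetical stand-in names (not the actual classes), and `age_limit` is simplified to a scalar:

```python
import numpy as np

class Expr:
    """Deferred expression: evaluated only when combined with a real array."""
    def __init__(self, func):
        self.func = func

    def evaluate(self, axes):
        # replace the axis reference by the actual label vector and compute
        return self.func(axes)

class AxisRef:
    """Symbolic reference to an axis by name, like X.age."""
    def __init__(self, name):
        self.name = name

    def __gt__(self, other):
        # the comparison computes nothing yet: it returns a deferred Expr
        return Expr(lambda axes: axes[self.name] > other)

# a(sex, age) reduced to its age labels; age_limit simplified to a scalar
axes = {'age': np.arange(5)}
mask = (AxisRef('age') > 2).evaluate(axes)
# mask is a boolean vector along the age axis
```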
==============
in general:
1) match axes by Axis object => No axis.id because we need to be able to share
the same axis in several collections/arrays.
=> this is slightly annoying for Group.__repr__ which uses axis.id
=> we cannot have twice the same axis object in a collection
(we can have the same name twice though)
2) match axes by name if any, by position otherwise
3) match axes by position
"""
# TODO
# * axes with no name should display as (or even have their name assigned to?)
# their position. Assigning does not work because after aggregating axis 0,
# we get the first axis named "1" which is a no-go.
# it would be much easier to have a .id attribute/property on axis with
# either the name or position in it, but this requires that axes know about
# their AxisCollection. id might not be defined when axis is not attached
# to a Collection
# * add check there is no duplicate label in axes!
# * for NDGroups, we have two options: cross product or intersection.
# Technically, this is easy, we just need to store a boolean and in getitem
# act accordingly (use ix_ or not), but what is the best API for users?
# a different class or a flag? In fact, the same question applies to
# positional vs label (in total, we got 4 different possibilities).
# ? how do you combine a cross-product group with an intersection group?
# a[cpgroup, igroup]
#      -> no problem if they are on different dimensions: the igroup
#         dimension(s) are collapsed into one, the cpgroup dimension(s) stay.
# The index need to be constructed carefully, but it can be done. See
# np_indexing.py
# -> if they are on the same dimensions, we have two options:
# * apply one then the other (left to right)
#      * fail <-- I think this is the safer option, at least to implement
#        first. One after the other can still be achieved by a[pgroup][igroup]
# API for ND groups (my example is mixing label with positional):
# union (bands): X.axis1[5:10] | X.axis2.i[3:4]
# intersection/cross/default: X.axis1[5:10] & X.axis2.i[3:4]
# points:
# * X.axis1[5:10] ^ X.axis2.i[1:6] --> this prevents symmetric difference. This is little used but...
# * Points(X.axis[5:10], X.axis2.i[1:6])
# * X.axis[5:10].combine(X.axis2.i[1:6])
# this is very nice and would have orderedset-like semantics
# it does not seem to conflict with the axis methods (even though that might be
# confusing):
# X.axis1 | X.axis2 would have a very different meaning than
# X.axis1[:] | X.axis2[:]
# Note that the cross product is the default and it is useless to introduce
# another API **except to give a name**, so the & syntax is useless unless
# we allow naming groups after the fact
# => NDGroup((X.axis1[5:10], X.axis2.i[2.5]), 'exports')
# => Group((X.axis1[5:10], X.axis2.i[2.5]), 'exports')
# => (X.axis1[5:10] & X.axis2.i[2.5]).named('exports')
# http://xarray.pydata.org/en/stable/indexing.html#pointwise-indexing
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.lookup.html#pandas.DataFrame.lookup
# I think I would go for
# ARRAY.points[dim0_labels, ..., dimX_labels]
# and
# ARRAY.ipoints[dim0_indices, ..., dimX_indices]
# so that we can have a symmetrical API for get and set.
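The difference between the two behaviours can be shown with plain numpy, which already supports both: np.ix_ gives the cross product, while direct integer-array indexing is pointwise (the behaviour the hypothetical .ipoints would expose):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# cross-product selection: every combination of the row and column keys
cross = a[np.ix_([0, 2], [1, 3])]   # shape (2, 2)

# pointwise selection (what ARRAY.ipoints[...] would do): keys are zipped,
# selecting a[0, 1] and a[2, 3]
points = a[[0, 2], [1, 3]]          # shape (2,)
```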
# I wonder if, for axes subscripting, I could not allow tuples as sequences,
# which would make it a bit nicer:
# X.axis1[5, 7, 10].named('brussels')
# instead of
# X.axis1[[5, 7, 10]].named('brussels')
# since axes are always 1D, this is not a direct problem. However the
# question is whether this would lead to an inconsistent API/confuse users
# because they would still have to write the brackets when no axis is present
# a[[5, 7, 9]]
# a[X.axis1[5, 7, 9]]
# in practice, this syntax is little used anyway
# options
# 1) multiple range in same [] create multiple groups
# =====================================================
# G.age[5, 7, 9] == G.age[5], G.age[7], G.age[9]
# G.age[:9, 10:14, 15:25] == G.age[:9], G.age[10:14], G.age[15:25]
# G[2:7, 'M', ['P01', 'P05']] == G[2:7], G['M'], G['P01', 'P05']
# a[G[2:7, 'M', ['P01', 'P05']]] == a[2:7, 'M', ['P01', 'P05']]
# a[G[5, 7, 9]] == a[5, 7, 9] => key has several values for axis: age
# a[G[[5, 7, 9]]] == a[[5, 7, 9]]
# a[G.union[5, 7, 9]] == a[[5, 7, 9]]
# a[G.union[5, 7:10, 12]] == a[[5, 7, 8, 9, 10, 12]]
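What the hypothetical G.union[...] would compute can be sketched as a plain function that flattens scalars and (inclusive) label slices into one ordered label list:

```python
def union_key(*parts):
    """Hypothetical sketch of G.union[...]: expand scalars and label slices
    into a single flat list of labels (slices are inclusive on both ends,
    as label slices are in this design)."""
    labels = []
    for part in parts:
        if isinstance(part, slice):
            labels.extend(range(part.start, part.stop + 1))
        else:
            labels.append(part)
    # keep the first occurrence of each label (ordered-set semantics)
    seen = set()
    return [x for x in labels if not (x in seen or seen.add(x))]

union_key(5, slice(7, 10), 12)
```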
# a[G.age[2:7, 5:10], 'M', ['P01', 'P05']]
# == a[2:7, 5:10, 'M', ['P01', 'P05']]
# OR rather
# == a[(G[2:7], G[5:10]), 'M', ['P01', 'P05']]
# == a[2:7, 'M', ['P01', 'P05']], a[5:10, 'M', ['P01', 'P05']]
# OR?
# == larray
# age 2:7 5:10
# a[2:7, 'M', ['P01', 'P05']] a[5:10, 'M', ['P01', 'P05']]
# OR?
# == larray with 4 dim (age_group, age, sex, lipro)
# this would currently only work if the slices have the same size,
# but with pandas/MI, this could work even when the slices are different
# a[G[2:7, 5:10, 'M', 'F']] <---- should this be supported? (I think so
# for first step (split in individual
# groups) but fail in a[]
# == a[G[2:7], G[5:10], G['M'], G['F']]
# == a[2:7, 5:10, 'M', 'F']
# == a[2:7, 'M'], a[2:7, 'F'], a[5:10, 'M'], a[5:10, 'F']
#
# or return an larray?
# 2:7 5:10
# M a[2:7, 'M'] a[5:10, 'M']
# F a[2:7, 'F'] a[5:10, 'F']
# a[G[2:7, 5:10], G['M', 'F']]
# == a[(G[2:7], G[5:10]), (G['M'], G['F'])]
# == [(a[2:7], a[5:10]), (a['M'], a['F'])]
# a[G[2:7, 5:10] & G['M', 'F']]
# == a[G[2:7] & G['M'], G[5:10] & G['M'], G[2:7] & G['F'], G[5:10] & G['F']]
# OR
# == a[G[2:7] & G['M'], G[2:7] & G['F'], G[5:10] & G['M'], G[5:10] & G['F']]
# a[(G[2:7], 'M'), (G[2:7], 'F'), (G[5:10], 'M'), (G[5:10], 'F')] = \
# [ 1, 2, 3, 4]
# a[G[2:7, 'M'], G[2:7, 'F'], G[5:10, 'M'], G[5:10, 'F']] = \
# [ 1, 2, 3, 4]
# ==
# a[(G[2: 7], G['M']), (G[2: 7], G['F']),
#   (G[5:10], G['M']), (G[5:10], G['F'])] = \
#     [1, 2,
#      3, 4]
# ==
# a[[(G[2: 7], G['M']), (G[2: 7], G['F']),
#    (G[5:10], G['M']), (G[5:10], G['F'])]] = \
#     [1, 2,
#      3, 4]
indexing: the last (inner-most) level => &,
the other levels create an larray of larrays
# a[[(G[2: 7], G['M']), (G[2: 7], G['F']),
#    (G[5:10], G['M']), (G[5:10], G['F'])]]
== LArray([a[2:7 & 'M'], a[2:7 & 'F'], a[5:10 & 'M'], a[5:10 & 'F']])
# but with scalars instead of slices, one might expect a plain larray
# instead of an larray of larrays
a[(G[2], G['M']), (G[2], G['F']), (G[5], G['M']), (G[5], G['F'])]
== LArray([a[2, 'M'], a[2, 'F'], a[5, 'M'], a[5, 'F']])
but if we want that, it suffices to write:
a[G[[2, 5]], G[['M', 'F']]]
remains to be decided whether we want G to split tuples of scalars:
a[G[2, 5:10], G['M', 'F']]
== a[G[(2, 5:10)], G[('M', 'F')]]
== a[G[[2, 5]], G[['M', 'F']]] (== a[[2, 5], ['M', 'F']])
OR
== a[(G[2], G[5]), (G['M'], G['F'])] (== [[a[2], a[5]], [a['M'], a['F']]])
which amounts to saying that tuples split but lists do not,
or else we use a dedicated method to split (split, groups or
multi):
G.split[2, 5] == G[2], G[5]
G.clength.split[2, 5:10, 20] == G.clength[2], G.clength[5:10], G.clength[20]
G.clength.split[2, 5] == G.clength[2], G.clength[5]
# a[[(G[2: 7], G['M']), (G[2: 7], G['F']),
#    (G[5:10], G['M']), (G[5:10], G['F'])]] =
# a[2:7, 'M'] = 1
# a[2:7, 'F'] = 2
# a[5:10, 'M'] = 3
# a[5:10, 'F'] = 4
# a[2:7, 5:10, 'M', 'F'] = [1, 2, 3, 4]
# multi assignment use case
# a[:] = {(G[2:7], 'M'): 1,
# (G[2:7], 'F'): 2,
# (G[5:10], 'M'): 3,
# (G[5:10], 'F'): 4}
# a.update({(G[2:7], 'M'): 1,
# (G[2:7], 'F'): 2,
# (G[5:10], 'M'): 3,
# (G[5:10], 'F'): 4})
# a.update({(G.age[2:7], 'M'): 1,
# (G.age[2:7], 'F'): 2,
# (G.age[5:10], 'M'): 3,
# (G.age[5:10], 'F'): 4})
# a[:] = [(G[2:7], 'M'), 1,
# (G[2:7], 'F'), 2,
# (G[5:10], 'M'), 3,
# (G[5:10], 'F'), 4]
# a[:] = {G[2:7, 'M']: 1,
# G[2:7, 'F']: 2,
# G[5:10, 'M']: 3,
# G[5:10, 'F']: 4}
# a.update({G[2:7, 'M']: 1,
# G[2:7, 'F']: 2,
# G[5:10, 'M']: 3,
# G[5:10, 'F']: 4})
# >>>> the goals are to avoid repeating the axes names if ambiguous and the
# array name but have the values as close to the labels as possible (assign by
# position does not scale)
# same problem for the LArray constructor BTW
# m = LArray([[1, 2],
# [3, 4]], axes=[Axis('age', R[2:7]), Axis('sex', 'M,F')])
#
# minr_replica = LArray([[0.20, 0.57], [0.46, 0.65]],
# [Axis('sex', ['men', 'women']),
# Axis('stat', ['benef_tot', 'prop_carr'])])
#
# minr_replica = [('men', 'benef_tot'), 0.20,
# ('women', 'benef_tot'), 0.57,
# ('men', 'prop_carr'), 0.46,
# ('women', 'prop_carr'), 0.65],
# ('sex', 'stat')
#
# benef_tot = [('men', 0.20), ('women', 0.57)]
# prop_carr = [('men', 0.46), ('women', 0.65)]
# minr_replica = ('benef_tot', benef_tot), ('prop_carr', prop_carr)
# hierarchical ordered "dict" (compact & readable but hard to write because of
# punctuation overload)
# minr_replica = [('benef_tot', [('men', 0.20), ('women', 0.57)])
# ('prop_carr', [('men', 0.46), ('women', 0.65)])]
# nice but only string labels
# minr_replica = od(benef_tot=od(men=0.20, women=0.57),
# prop_carr=od(men=0.46, women=0.65))
# benef_tot = [('men', 0.20), ('women', 0.57)]
# prop_carr = [('men', 0.46), ('women', 0.65)]
# minr_replica = ('benef_tot', benef_tot), ('prop_carr', prop_carr)
#
# minr_replica = ('benef_tot', benef_tot), ('prop_carr', prop_carr)
# men women
# minr_replica= la([[0.20, 0.57], # benef_tot
# [0.46, 0.65]], # prop_carr
# [Axis('sex', 'men,women'),
# Axis('stat', 'benef_tot,prop_carr')])
# minr_replica= la(['stat', 'sex'], ['men', 'women'],
# 'benef_tot', [ 0.20, 0.57],
# 'prop_carr', [ 0.46, 0.65])
# minr_replica = fromlists(['stat \ sex', 'men', 'women'],
# ['benef_tot', 0.20, 0.57],
# ['prop_carr', 0.46, 0.65])
# minr_replica = fromlists([['stat', 'sex'], 'men', 'women'],
# ['benef_tot', 0.20, 0.57],
# ['prop_carr', 0.46, 0.65])
# minr_replica = fromtuples([( 'stat', 'sex', 'value')
# ('benef_tot', 'men', 0.20),
# ('benef_tot', 'women', 0.57),
# ('prop_carr', 'men', 0.46),
# ('prop_carr', 'women', 0.65)])
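A minimal fromtuples could work roughly as follows — a sketch with plain numpy and lists in place of real Axis objects, assuming the first row gives the axis names and the last column holds the values:

```python
import numpy as np

def fromtuples(rows):
    """Hypothetical fromtuples sketch: the first row gives the axis names
    (last column being the value), each following row gives labels + value."""
    *axis_names, _ = rows[0]
    data_rows = rows[1:]
    # ordered unique labels per axis
    labels = [list(dict.fromkeys(row[i] for row in data_rows))
              for i in range(len(axis_names))]
    arr = np.full([len(axis_labels) for axis_labels in labels], np.nan)
    for *key, value in data_rows:
        idx = tuple(labels[i].index(k) for i, k in enumerate(key))
        arr[idx] = value
    return axis_names, labels, arr

names, labels, arr = fromtuples([
    ('stat',      'sex',   'value'),
    ('benef_tot', 'men',    0.20),
    ('benef_tot', 'women',  0.57),
    ('prop_carr', 'men',    0.46),
    ('prop_carr', 'women',  0.65),
])
```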
# benef_tot = fromlists(['sex', 'men', 'women'],
# ['', 0.20, 0.57])
# benef_tot = fromtuples([( 'sex', 'value')
# ( 'men', 0.20),
# ('women', 0.57)])
# benef_tot = fromlists(['sex', 'men', 'women'],
# [None, 0.20, 0.57])
# benef_tot = fromlists(['sex', 'men', 'women'],
# [ 0.20, 0.57])
# benef_tot = fromtuples([('men', 0.20), ('women', 0.57)], header=False)
# discourage this because we do not know order of ticks of sex
# for zeros, full, ones, etc. it's fine (and encouraged) to reuse axes !
# benef_tot = LArray([0.20, 0.57], [sex])
# xarray uses:
# foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])
# foo = xr.DataArray(data, [('x', ['a', 'b']),
# ('y', [-2, 0, 2])])
# m = {G[2:7, 'M']: 1, G[2:7, 'F']: 2, G[5:10, 'M']: 3, G[5:10, 'F']: 4}
# breaks if combination of axes
# a.set(X.age[m])
2) multiple range in same [] means "and"
=========================================
=> and set op if same axis, ND group otherwise
G.age[5, 7, 9] == G.age[5] & G.age[7] & G.age[9] => EMPTY group !
=> MUST use double brackets: G.age[[5, 7, 9]]
G.age[:20, 10:30] == G.age[:20] & G.age[10:30] == G.age[10:20]
G[2:7, 'M', ['P01', 'P05']] == G[2:7] & G['M'] & G['P01', 'P05']
3) multiple range in same [] means "or"
========================================
=> set op if same axis, ND group otherwise
G.age[5, 7, 9] == G.age[5] | G.age[7] | G.age[9]
G.age[:20, 10:30] == G.age[:20] | G.age[10:30] == G.age[:30]
G[2:7, 'M', ['P01', 'P05']] == G[2:7] | G['M'] | G['P01', 'P05']
the ND variant is a bit silly (it is so rarely needed)
4) multiple range in same [] creates multiple groups except when there are
only scalars
===========================================================================
=> set op if same axis, ND group otherwise
G.age[5, 7, 11] == G.age[[5, 7, 11]]
G.age[5, 7:9, 11] == G.age[5], G.age[7:9], G.age[11]
G.age[:20, 10:30] == G.age[:20], G.age[10:30]
G.age[5, 10, 15:20, 25:30] == G.age[5], G.age[10], G.age[15:20], G.age[25:30]
G.age[5, 10, :20, 10:30] == G.age[5], G.age[10], G.age[:20], G.age[10:30]
G[2:7, 'M', ['P01', 'P05']] == G[2:7], G['M'], G['P01', 'P05']
5) multiple range in same [] means "or" for same axis / "and" for other axis
=============================================================================
=> set op if same axis, ND group otherwise
G.age[5, 7, 11] == G.age[5] | G.age[7] | G.age[11]
G.age[5, 7:9, 11] == G.age[5] | G.age[7:9] | G.age[11]
G.age[:20, 10:30] == G.age[:20] | G.age[10:30] == G.age[:30]
G[5, 7:9, 'M', ['P01', 'P05']]
== (G.age[5] | G.age[7:9]) & G.sex['M'] & G.lipro[['P01', 'P05']]
6) some other combination: doing different stuff whether same axis or not,
or whether slice or scalar
===========================================================================
7) multiple range in same [] are only allowed for same axis (and means "or")
============================================================================
=> set op if same axis, different axis not allowed
=> the definition of a Group is: a list of labels of one axis
implies more or less that we must have a different object for ND Groups
=> implies more or less that we will not support
array['5, 7, 11, P01,P05, M']
array[5, 7, 11, 'P01, P05, M']
this is fine though:
array['5, 7, 11; P01, P05; M']
array[[5, 7, 11], ['P01', 'P05'], 'M']
array[[5, 7, 11], 'P01, P05', 'M']
and maybe this too:
array[[5, 7, 11], 'P01, P05; M']
G.age[5, 7, 11] == G.age[[5, 7, 11]] == G.age[5] | G.age[7] | G.age[11]
G.age[5, 7:9, 11] == G.age[5, 7, 8, 9, 11]
G.age[:20, 10:30] == G.age[:20] | G.age[10:30] == G.age[:30]
G[5, 7:9, 'M', ['P01', 'P05']] --> fails because it tries to find a single axis containing all of those
G[5, 7:9] & G['M'] & G['P01', 'P05'] --> works (returns NDGroup)
G['5, 7:9; P01,P05; M'] --> returns NDGroup (same as above)
G[[5, 7:9], ['P01', 'P05'], 'M'] --> fails: 7:9 is sadly invalid syntax inside a list
G[[5, 7, 8, 9], 'P01, P05', 'M'] --> works too
== age[5, 7, 8, 9] & sex['M'] & lipro['P01', 'P05']
== NDGroup([[5, 7, 8, 9], 'M', ['P01', 'P05']], axes=['age', 'sex', 'lipro'])
OR
== NDGroup({'age': [5, 7, 8, 9],
            'sex': 'M',
            'lipro': ['P01', 'P05']})
'5,7:9; M; P01,P05'
'age[5,7:9]; sex[M]; lipro[P01,P05]'
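The explicit string syntax above could be parsed with a few string splits. A sketch (parse_nd_key is a hypothetical helper; it returns raw label strings and slice objects, leaving label-type conversion to the axes):

```python
def parse_nd_key(s):
    """Hypothetical sketch: parse 'age[5,7:9]; sex[M]; lipro[P01,P05]'
    into a dict {axis_name: list of label strings and slices}."""
    result = {}
    for part in s.split(';'):
        name, _, inner = part.strip().partition('[')
        keys = []
        for item in inner.rstrip(']').split(','):
            item = item.strip()
            if ':' in item:
                start, stop = item.split(':')
                keys.append(slice(start or None, stop or None))
            else:
                keys.append(item)
        result[name] = keys
    return result

parse_nd_key('age[5,7:9]; sex[M]; lipro[P01,P05]')
```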
# use cases
# 1) simple get/set
a['2:7; M; P01,P02']
a[2:7, 'M', ['P01', 'P02']]
# 2) boolean selection
a[(X.age < 10) | (X.clength > 5)]
# 3) simple with ambiguous values
a[G.age[2:7], G.clength[5, 7, 9], 'M', ['P01', 'P02']]
# 4) point-selection
a[G.age[2:4] ^ G.clength[5, 7, 9], 'M', ['P01', 'P02']]
a[G[2, 9, 3] ^ G['M', 'F', 'M'], ['P01', 'P02']]
# set "diagonal" to 0
countries = ...
use[src[countries] ^ dst[countries]] = 0
# 4b) lookup (a form of point-selection), with potentially repeated values
person_age = [5, 1, 2, 1, 6, 5]
person_gender = ['M', 'F', 'F', 'M', 'M', 'F']
person_workstate = [1, 3, 1, 2, 1, 3]
income = mean_income[person_age, person_gender] # <-- FAILS ! (it does a cross product)
income = mean_income[G[person_age] ^ G[person_gender]]
income = mean_income[G[person_age].combine(G[person_gender])]
income = mean_income[G.points[person_age, person_gender]] # <-- requires disallowing an axis named "points"
income = mean_income.points[person_age, person_gender]
# if ambiguous
income = mean_income.points[G.age[person_age], person_gender]
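The lookup use case maps directly onto numpy's pointwise (fancy) indexing once labels are translated to positions — a sketch with a toy mean_income and a hypothetical gender_index label map:

```python
import numpy as np

# mean income by age (rows 0..9) and gender (columns: 'M' -> 0, 'F' -> 1)
mean_income = np.arange(20.0).reshape(10, 2)
gender_index = {'M': 0, 'F': 1}

person_age = [5, 1, 2, 1, 6, 5]
person_gender = ['M', 'F', 'F', 'M', 'M', 'F']

# pointwise lookup: one value per person, repeated keys are fine
rows = np.asarray(person_age)
cols = np.asarray([gender_index[g] for g in person_gender])
income = mean_income[rows, cols]
```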
# 4c) lookup with larger than axis keys
# 4d) lookup with boolean keys
TODO, especially versus set ops on group
income = mean_income[G[person_age] ^ ~G[person_gender]]
# this would fail because & does a set-op, i.e. it will most likely return "False, True"
income = extra_income[G[person_gender] & (G[workstate] == 1)] # <-- fails
# one potential solution might be to introduce yet another concept: G/LG
# could be renamed to (or presented explicitly as) LabelSet. In that case, it
# would make it obvious (for me ;-)) that using that for the "lookup" usecase
# does not fly.
# now, how would we name the "other concept" in here? LabelGroup? LabelKey?
income = extra_income[LK[person_gender] & (LK[workstate] == 1)]
# Q: maybe I don't need an extra concept anyway, just use 1D larray instead?
# A: We do need it, because the values, even if given as larray can be
# ambiguous (valid on several axes)
# Q: maybe we could introduce a different array class "LookupArray" which does
# .points by default.
# A: yes, that's an option but would not solve the "set" problem.
# => I NEED a way to set the axis on an LKey. maybe X.abc[LK] should not
# return an LSet? but an LKey with an axis.
# Q: maybe I could solve the boolop/set problem depending on what is inside the
# LGroup (v1):
# 1) scalars, slices, list: set op
# 2) larray: defer op to it (=> element-wise bool op)
# A: set op on list of bools or scalar bools does not make much sense
# Q: maybe I could solve the boolop/set problem depending on what is inside the
# LGroup (v2):
# 1) non bool stuff (scalars, slices, list, larray, ...): set op
# 2) bool stuff (scalar, list of bools, larray of bools: bool op
# A:
# 5) import-export, a single group on several dimensions
dom = (ARRV[nutsbe] & ARRA[nutsbe]).named('dom')
berow = (ARRV[nutsbe] & ARRA[nutsrow]).named('berow')
rowbe = (ARRV[nutsrow] & ARRA[nutsbe]).named('rowbe')
transit = (ARRV[nutsrow] & ARRA[nutsrow]).named('transit')
# we could define act like this (but that should not be required, as the
# coupling is looser if we can do it manually):
act = Axis('act', (dom, berow, rowbe, transit))
act2 = (dom, berow, rowbe, transit)
AGGTON = stack([TON2.sum(nutsbe, nutsbe), TON2.sum(nutsbe, nutsrow),
TON2.sum(nutsrow, nutsbe), TON2.sum(nutsrow, nutsrow)],
["dom", "berow", "rowbe", "transit"], "act")
AGGTON = TON2.sum((dom, berow, rowbe, transit)).rename('ARRV,ARRA', 'act')
AGGTON = TON2.sum(act=(dom, berow, rowbe, transit))
AGGTON = TON2.sum(act=act)
AGGTON = TON2.sum(act) # <--- this is obviously the shortest but I am unsure
# it's a good idea. On one hand, I dislike the fact that in act2 above, we
# cannot name the resulting axis, but on the other hand, summing on an axis
# not present in the original array, and which is created rather than
# aggregated seems weird at best. It's only after inspecting said axis that we
# can tell that it contains groups on axes which
# are present.
# in the 1D case this is usually not a problem since it is usually fine to have
# maybe doing:
act3 = stack([('dom', dom), ('berow', berow), ('rowbe', rowbe),
('transit', transit)], 'act')
AGGTON = TON2.sum(act3) # <--- I think this makes a lot of sense, and could
# potentially be useful with arrays > 1D: you could e.g. create 2 dimensions
# out of 1: assuming an age axis
act3 = fromlist(['sub \ 40+', '-39', '40+'],
[ 0, G[ : 9], G[40:49]],
[ 1, G[10:19], G[50:59]],
[ 2, G[20:29], G[60:69]],
[ 3, G[30:39], G[70:79]])
act3 = table(['sub', '40+'], [ '-39', '40+'],
[ 0, G[ : 9], G[40:49]],
[ 1, G[10:19], G[50:59]],
[ 2, G[20:29], G[60:69]],
[ 3, G[30:39], G[70:79]])
act3 = table(['sub', '40+', '-39', '40+'],
[ 0, G[ : 9], G[40:49]],
[ 1, G[10:19], G[50:59]],
[ 2, G[20:29], G[60:69]],
[ 3, G[30:39], G[70:79]])
.sum(act3)
# 6) multi slices, aggregate (one group per slice)
# groups = (X.clength[1:15], X.clength[16:25], X.clength[26:30],
# X.clength[31:35], X.clength[36:40], X.clength[41:50])
# agg = arr.sum(groups)
# groups = G.clength[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]
# agg = arr.sum(groups)
# agg = arr.sum(G[1:15, 16:25, 26:30, 31:35, 36:40, 41:50])
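What arr.sum(groups) would compute in this use case can be sketched with plain numpy, one aggregate per (inclusive) label slice — the groups dict and label vector are stand-ins for real larray axes:

```python
import numpy as np

arr = np.arange(1, 51, dtype=float)   # values along a clength axis 1..50
clength = np.arange(1, 51)            # the axis labels themselves

# one group per (inclusive) label slice, as in G.clength[1:15, 16:25, ...]
groups = {'1:15': (1, 15), '16:25': (16, 25), '26:30': (26, 30),
          '31:35': (31, 35), '36:40': (36, 40), '41:50': (41, 50)}

# arr.sum(groups): one aggregate per group, stacked along a new grouped axis
agg = np.array([arr[(clength >= lo) & (clength <= hi)].sum()
                for lo, hi in groups.values()])
```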
# 7) multi slices, assign one value per slice
# multip_mat_min = zeros([clength, year])
# multip_mat_min[X.clength[1:15], X.year[first_year_p:2024]] = 7 / 7
# multip_mat_min[X.clength[16:25], X.year[first_year_p:2024]] = 20 / 20
# multip_mat_min[X.clength[26:30], X.year[first_year_p:2024]] = 27 / 27
# multip_mat_min[X.clength[31:35], X.year[first_year_p:2024]] = 32 / 32
# multip_mat_min[X.clength[36:40], X.year[first_year_p:2024]] = 37 / 37
# multip_mat_min[X.clength[41:50], X.year[first_year_p:2024]] = 42 / 42
# multip_mat_min[X.clength[1:15], X.year[2025:2029]] = 8 / 7
# multip_mat_min[X.clength[16:25], X.year[2025:2029]] = 21 / 20
# multip_mat_min[X.clength[26:30], X.year[2025:2029]] = 28 / 27
# multip_mat_min[X.clength[31:35], X.year[2025:2029]] = 33 / 32
# multip_mat_min[X.clength[36:40], X.year[2025:2029]] = 38 / 37
# multip_mat_min[X.clength[41:50], X.year[2025:2029]] = 43 / 42
# multip_mat_min[X.clength[1:15], X.year[2030:]] = 9 / 7
# multip_mat_min[X.clength[16:25], X.year[2030:]] = 22 / 20
# multip_mat_min[X.clength[26:30], X.year[2030:]] = 29 / 27
# multip_mat_min[X.clength[31:35], X.year[2030:]] = 34 / 32
# multip_mat_min[X.clength[36:40], X.year[2030:]] = 39 / 37
# multip_mat_min[X.clength[41:50], X.year[2030:]] = 44 / 42
#
# # already possible
# m = zeros(clength)
# m[X.clength[1:15]] = 7
# m[X.clength[16:25]] = 20
# m[X.clength[26:30]] = 27
# m[X.clength[31:35]] = 32
# m[X.clength[36:40]] = 37
# m[X.clength[41:50]] = 42
# multip_mat_min = zeros([clength, year])
# multip_mat_min[X.year[:2024]] = m / m
# multip_mat_min[X.year[2025:2029]] = (m + 1) / m
# multip_mat_min[X.year[2030:]] = (m + 2) / m
# >>> very nice for this case but it does not scale very well with number of
# values to set. On the other hand, splitting it in case it does not fit
# on a line is not TOO horrible (just a bit horrible ;-))
# m = zeros(clength)
# m[X.clength[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]] = \
# [ 7, 20, 27, 32, 37, 42]
# multip_mat_min = zeros([clength, year])
# multip_mat_min[X.year[:2024, 2025:2029, 2030:]] = \
# [m / m, (m + 1) / m, (m + 2) / m]
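Underneath, the one-value-per-slice pattern boils down to iterating over (slice, value) pairs — the explicit loop numpy requires today, sketched here with positional slices standing in for the label groups:

```python
import numpy as np

m = np.zeros(51)   # index ~ clength label (index 0 unused)

# inclusive label slices 1:15, 16:25, ... become half-open index slices
slices = [slice(1, 16), slice(16, 26), slice(26, 31),
          slice(31, 36), slice(36, 41), slice(41, 51)]
values = [7, 20, 27, 32, 37, 42]

# one value per slice
for s, v in zip(slices, values):
    m[s] = v
```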
# m = zeros(clength)
# m[:] = {
# G[1:15]: 7, G[16:25]: 20, G[26:30]: 27, G[31:35]: 32, G[36:40]:37,
# G[41:50]: 42
# }
# VERDICT: FAIL ! (see below)
# but m.set(map) could work
# this seems powerful but there might be ambiguities? (do we want to expand
# the groups or treat them as normal "scalars")
# m = zeros(clength)
# m[:] = fromlists(['clength', G[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]],
# ['', 7, 20, 27, 32, 37, 42])
# VERDICT: FAIL ! the problem is that it breaks two basic assumptions
# * that m[:] will be entirely set
# * that value will be broadcast to the result of array[key]
# also think about G (for group) or L (for label):
# a[G[5:10]]
# a[G[5, 7, 9]]
# a[G[5:10].named('brussels')]
# a[G[5, 7, 9].named('brussels')]
# this is ugly
# a[G[5, 7, 9].axed('age')]
# a[G[5, 7, 9].x('age')]
# a[G[5, 7, 9].on('age')]
# a[G[5, 7, 9].with(axis='age')]
# nice for the simple cases, but cannot target anonymous axes G[0] or axes with
# strange names G['strange axis']. would both try to find labels on axes, not
# the axes themselves.
# a[G.age[5, 7, 9]]
# a[G.geo[5, 7, 9].named('brussels')]
# a[X.age[G[5, 7, 9]]]
# a[X.age[G[5, 7, 9].named('brussels')]]
# a[G.get('strange axis')[5, 7, 9].named('Brussels')]
# a[X.age[5, 7, 9]]
# positional groups *without axis* (G.i, P[], or I[]) do not make much sense,
# because they would match all axes, but might be useful as an intermediate
# construct:
# g = G.i[2, 5, 7]
# g2 = X.age[g]
# positional groups *with axis* can be useful as a shorter alternative (but
# not worth it IMO, unless the whole API is more consistent for users):
# g = P.age[2, 5, 7]
# instead of
# g = X.age.i[2, 5, 7]
# we might want to also consider multi-dimensional groups:
# using K (for key) or I (for indexer) or G (without axis obviously):
# g = G[2:7, 'M', ['P01', 'P05']]
# it seems nifty, but that changes the semantics of a group (we would need
# multiple names???)
# but it might be better to allow multiple slices from the same dimension in
# the same group:
# g = G.age[:9, 10:14, 15:25]
# in that case, the above group would be written like:
# g = G[2:7] & G['M'] & G['P01', 'P05']
# ND group with single name vs ND group with name per dim vs multi-group.
# g = G[2:7, 'M', ['P01', 'P05']].named('abc') would be a single ND group
# vs
# g = G[2:7].named('child'), G['M'], G['P01', 'P05'].named('plop')
# g = G[2:7].named('child') & G['M'] & G['P01', 'P05'].named('plop')
# behavior is the same when filtering but when aggregating the multi-name
# version would produce "child & M & plop" while the single-name version
# would produce "abc"
# vs
# g = G[2:7, 9:13].named('abc') would be a single ND group
# in that cases, having NDG to create N dimensional groups might be a good idea
# g = NDG[2:7, 'M', ['P01', 'P05']]
# we also need the best possible syntax to handle, "arbitrary" resampling
# pure_min_w1_comp_agg = zeros(result_axes)
# pure_min_w1_comp_agg[X.LBMosesXLS[1]] = pure_min_w1_comp.sum(X.clength[1:15])
# pure_min_w1_comp_agg[X.LBMosesXLS[2]] = pure_min_w1_comp.sum(X.clength[16:25])
# pure_min_w1_comp_agg[X.LBMosesXLS[3]] = pure_min_w1_comp.sum(X.clength[26:30])
# pure_min_w1_comp_agg[X.LBMosesXLS[4]] = pure_min_w1_comp.sum(X.clength[31:35])
# pure_min_w1_comp_agg[X.LBMosesXLS[5]] = pure_min_w1_comp.sum(X.clength[36:40])
# pure_min_w1_comp_agg[X.LBMosesXLS[6]] = pure_min_w1_comp.sum(X.clength[41:50])
#
# clength_groups = (X.clength[1:15], X.clength[16:25], X.clength[26:30],
# X.clength[31:35], X.clength[36:40], X.clength[41:50])
# pure_min_w1_comp_agg2 = pure_min_w1_comp.sum(clength_groups).rename(
# X.clength, X.LBMosesXLS)
# clength_groups = (L[1:15], L[16:25], L[26:30],
# L[31:35], L[36:40], L[41:50])
# pure_min_w1_comp_agg2 = pure_min_w1_comp.sum(clength_groups).rename(
# X.clength, X.LBMosesXLS)
#
# clength_groups = X.clength[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]
# pure_min_w1_comp_agg2 = pure_min_w1_comp.sum(clength_groups)
#
# clength_groups = G[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]
# pure_min_w1_comp_agg2 = pure_min_w1_comp.sum(clength_groups) \
# .replace(X.clength, LBMosesXLS)
# XXX: what if I want to sum over all the slices (as if they were a single slice)
# clength_groups = G[1:15] | G[16:25] | G[26:30] | G[31:35] | G[36:40] | G[41:50]
# OR
# G.clength.union[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]
#
# pure_min_w1_comp_agg2 = pure_min_w1_comp.sum(clength_groups) \
# I would also like to have a nice syntax for assigning (multiple) values to
# multiple slices (example courtesy of MOSES)
# clength = Axis('clength', range(1, 51))
# year = Axis('year', range(2010, 2050))
# result_axes = AxisCollection([
# clength,
# year
# ])
#
# multip_mat_min = zeros([clength, year])
# multip_mat_min[X.clength[1:15], X.year[first_year_p:2024]] = 7 / 7
# multip_mat_min[X.clength[16:25], X.year[first_year_p:2024]] = 20 / 20
# multip_mat_min[X.clength[26:30], X.year[first_year_p:2024]] = 27 / 27
# multip_mat_min[X.clength[31:35], X.year[first_year_p:2024]] = 32 / 32
# multip_mat_min[X.clength[36:40], X.year[first_year_p:2024]] = 37 / 37
# multip_mat_min[X.clength[41:50], X.year[first_year_p:2024]] = 42 / 42
# multip_mat_min[X.clength[1:15], X.year[2025:2029]] = 8 / 7
# multip_mat_min[X.clength[16:25], X.year[2025:2029]] = 21 / 20
# multip_mat_min[X.clength[26:30], X.year[2025:2029]] = 28 / 27
# multip_mat_min[X.clength[31:35], X.year[2025:2029]] = 33 / 32
# multip_mat_min[X.clength[36:40], X.year[2025:2029]] = 38 / 37
# multip_mat_min[X.clength[41:50], X.year[2025:2029]] = 43 / 42
# multip_mat_min[X.clength[1:15], X.year[2030:]] = 9 / 7
# multip_mat_min[X.clength[16:25], X.year[2030:]] = 22 / 20
# multip_mat_min[X.clength[26:30], X.year[2030:]] = 29 / 27
# multip_mat_min[X.clength[31:35], X.year[2030:]] = 34 / 32
# multip_mat_min[X.clength[36:40], X.year[2030:]] = 39 / 37
# multip_mat_min[X.clength[41:50], X.year[2030:]] = 44 / 42
#
# # already possible
# m = zeros(clength)
# m[X.clength[1:15]] = 7
# m[X.clength[16:25]] = 20
# m[X.clength[26:30]] = 27
# m[X.clength[31:35]] = 32
# m[X.clength[36:40]] = 37
# m[X.clength[41:50]] = 42
# multip_mat_min = zeros([clength, year])
# multip_mat_min[X.year[:2024]] = m / m
# multip_mat_min[X.year[2025:2029]] = (m + 1) / m
# multip_mat_min[X.year[2030:]] = (m + 2) / m
# TODO: it would be nice to be able to say:
# m[X.clength[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]] = [7, 20, 27, 32, 37, 42]
# but I am unsure it is possible/unambiguous
# this kind of pattern is not supported by numpy
# in numpy, you can assign multiple slices the SAME value (not one value per
# slice) using m[np.r_[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]] = 7, but
# this actually constructs a single array of indices and, ultimately,
# it is much slower than repeatedly doing m[slice()] = value
# %timeit j = np.r_[tuple(slice(i, i+5) for i in range(0, 1000, 10))]; a[j] = 9
# 1000 loops, best of 3: 307 µs per loop
# %timeit for i in range(0, 1000, 10): a[i:i+5] = i
# 10000 loops, best of 3: 45.9 µs per loop
# it is technically possible to achieve in numpy (both in a vectorized manner
# and using an explicit loop as above). The trick to vectorize this is to
# create an array of indices (with the slices expanded) and an array of
# values of the same size as the indices array (using np.repeat)
# http://stackoverflow.com/questions/38923763/assign-multiple-values-to-multiple-slices-of-a-numpy-array-at-once
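The vectorized trick described above can be sketched in plain NumPy like this (the `assign_slices` helper name is hypothetical, not an existing numpy/larray function):

```python
import numpy as np

def assign_slices(arr, slices, values):
    """Assign one value per slice in a single vectorized operation."""
    # one flat array of indices covering all slices
    indices = np.concatenate([np.arange(s.start, s.stop) for s in slices])
    # one value per index, via np.repeat with per-slice lengths
    lengths = [s.stop - s.start for s in slices]
    arr[indices] = np.repeat(values, lengths)

m = np.zeros(51)
# NumPy slices exclude the stop bound, hence stop = label + 1
assign_slices(m, [slice(1, 16), slice(16, 26), slice(26, 31),
                  slice(31, 36), slice(36, 41), slice(41, 51)],
              [7, 20, 27, 32, 37, 42])
```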
# but the question is whether that syntax is unambiguous or not
# (and if not, whether we can come up with a syntax that is both nice
# and not ambiguous)
# m[X.clength[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]] = [7, 20, 27, 32, 37, 42]
# multip_mat_min[X.year[:2024, 2025:2029, 2030:]] = [m / m, (m + 1) / m,
# (m + 2) / m]
# for the multi-value case to work I would probably have to make
# m[X.clength[1:15, 16:25, 26:30, 31:35, 36:40, 41:50]]
# return multiple arrays (as a tuple of arrays or an array of arrays)
# with pandas/MI support, we could just return an array with
# a (second) clength axis
# ========================================
# ================ SESSION ===============
# ========================================
# with my idea to dispatch aggregate operations to each element + make
# __eq__ use _binop
# (s1 == s2).all() would return a session with a list of bool scalars
# which is probably not what people expect (+ don't know how to further
# aggregate)
# ideally s.sum() would first sum each array then sum those sums
# and s.sum(X.age) would sum each array along age
# and s.sum(X.arrays) would try to add arrays together (and fail in
# some/most cases)
# the problem is that one important use case is not covered:
# aggregating along all dimensions of the arrays but NOT on X.arrays
# but see below for solutions
# Q: s.elements.sum() (or s.arrays.sum()) vs s.sum() solve this?
# A1: s.arrays.sum() would dispatch to each array and return a new Session
# s.sum() would try to do s.arrays.sum().sum()
# seems doable...
# A2: s.sum_by(X.arrays) (like Pandas default aggregate) would solve
# the issue even more nicely, but this is a bit more work (is it?) and
# can be safely added later.
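The A1 dispatch idea could be sketched as follows, using a plain dict of NumPy arrays as a stand-in for Session (the method names here are illustrative, not an existing API):

```python
import numpy as np

class Session:
    def __init__(self, arrays):
        self.arrays = dict(arrays)

    def sum_by_arrays(self):
        # s.arrays.sum(): dispatch to each array, one result per array
        return {name: arr.sum() for name, arr in self.arrays.items()}

    def sum(self):
        # s.sum(): sum each array, then sum those sums
        return sum(self.sum_by_arrays().values())

s = Session({'a': np.ones((2, 3)), 'b': np.arange(4)})
```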
# Q: does s1.arrays (op) s2.arrays work too?
# A: dispatch op to each array present in either s1 or s2
# so yes, but as we see below, no point
# Q: what happens when you do s1.arrays + s2 ?
# A:
# Q: what happens when you do s1 + s2 ?
# A: same as [a1 + a2 for a1, a2 in zip(s1, s2)]
# if we view s1 as a big array with an extra dimension, it would give
# that result (modulo union of names until we are Pandas based)
# Q: does that solve the == use case acceptably?
# A: to test that two sessions are equal, you'd have to write:
# (s1.arrays == s2.arrays).arrays.all().all()
# which is way too convoluted for users.
# (s1.arrays == s2.arrays).all() would work too though
# and even
# (s1 == s2).all()
# Q: what if I want to know which arrays are equal and which are not?
# A: (s1 == s2).all_by(X.arrays)
# boolean ops
# ===========
# Q: we could implement two(?) very different behaviors:
# set-like behavior: combination of two Sessions; if the same name is
# present in both, check that it is the same array. a Session is more
# ordered-dict-like than set-like, so this might not make a lot of sense.
# an "update" method might make more sense. However, most set-like
# operations do make sense.
# intersection: check common arrays are the same or take left?
# union: check common arrays are the same or take left?
# difference
# A: I think it's best that __bool_ops__ are element-wise, like other __ops__
# but we can/should define methods for the set-like operations
# .isdisjoint(other)
# .issubset(other)
# .issuperset(other)
# .union(*others)
# .intersection(*others)
# .difference(*others)
# .symmetric_difference(other)
# references
# https://docs.scipy.org/doc/numpy/reference/routines.set.html
# https://docs.scipy.org/doc/numpy/reference/routines.logic.html#logical-operations
# https://docs.python.org/3.7/library/stdtypes.html#set
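A few of the set-like methods listed above could be sketched on a dict-based stand-in for Session; "take left" is the policy chosen here for common names (one of the two options discussed, picked arbitrarily for the sketch):

```python
class Session:
    def __init__(self, arrays):
        self.arrays = dict(arrays)

    def isdisjoint(self, other):
        # True when no array name is shared
        return not (set(self.arrays) & set(other.arrays))

    def union(self, *others):
        merged = dict(self.arrays)
        for other in others:
            for name, arr in other.arrays.items():
                merged.setdefault(name, arr)   # take left on conflicts
        return Session(merged)

    def intersection(self, *others):
        common = set(self.arrays)
        for other in others:
            common &= set(other.arrays)
        return Session({name: self.arrays[name] for name in common})

s1 = Session({'a': 1, 'b': 2})
s2 = Session({'b': 99, 'c': 3})
```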
>>> to_key('axis=a,b:d,e ! groupname')
>>> to_key('axis=a,b:d,e # groupname')
>>> to_key('axis=a,b:d,e & groupname')
>>> to_key('axis=a,b:d,e ~ groupname')
>>> to_key('axis=a,b:d,e @ groupname')
>>> to_key('groupname=a,b:d,e @ axis')
>>> to_key('groupname=#1:7a,b:d,e @ axis')
>>> to_key('teens=#1:7,b:d,e @ age')
>>> ext = la({'M': 55, 'F': 56})
# cannot use f-string (or string formatting) because that would be too
# limiting, i.e. ext above wouldn't work
>>> a.sum('10_19=10:19 ; 20_29=20:29; #-1@year ; services=a,c:e,o@c_from ; {ext}@c_to')
>>> a.sum('age=10:19 > 10_19 ; 20:29 > 20_29 ; year=#-1 ; services=a,c:e ; c_from=o ; c_to={ext}')
# names only make sense when doing group aggregates, i.e. within parentheses
>>> a.sum('age=(10:19 > 10_19 ; 20:29 > 20_29); year=#-1 ; services=a, c:e ; c_from=o ; c_to={ext}')
# the parentheses are potentially superfluous since we separated groups
# using ; => means different LGroup
# several groups on same axis => multi group aggregate
>>> a['age=20:29|5 & year=#-1 & c_from=o & c_to={ext}']
>>> a['age=(10:19 >> 10_19 ; 20:29 >> 20_29) & c_from=o & c_to={ext}']
>>> a['age=10:19->10_19 ; age=20:29->20_29 & c_from=o & c_length={ext}']
>>> a['age=(10:19->10_19 ; #10:20) & c_from=o & c_length={ext}']
>>> a['age[20:29].by(5) & year.i[-1] & c_from[o] & c_to[{ext}]']
>>> a['age=(10:19 >> 10_19 ; 20:29 >> 20_29) & c_from=o & c_to={ext}']
>>> a['(age[10:19] >> 10_19, age[20:29] >> 20_29) & c_from[o] & c_to[{ext}]']
>>> a['age=(10:19 >> 10_19 ; 20:29 >> 20_29) & c_from=o & c_to={ext}']
>>> a['age=10:19->10_19 ; age=20:29->20_29 & c_from=o & c_length={ext}']
>>> a['age=(10:19->10_19 ; #10:20) & c_from=o & c_length={ext}']
is it:
* filter = expr | expr
= expr & expr
= expr ^ expr
= expr
* expr =
>>> a['10:60 by 5 & c_from=o & c_to={ext}']
>>> a.sum('10:19 > 10_19 ; 20:29 > 20_29 ; year=#-1')
>>> a.sum('(10:19 > 10_19 ; 20:29 > 20_29) & year=#-1')
>>> teens = G['age=10:19 >> teens']
>>> teens = X.age[10:19].named('teens')
>>> twenties = G['age=20:29']
>>> a.sum('({teens}, {twenties})')
>>> a.sum((teens, twenties))
# will we ever want to support this?
>>> a.sum('age > clength')
>>> a.sum('age > {ext}')
>>> a.sum(X.age > ext)
>>> a.sum('age > 10')
LGroup(['a', 'b', 'c'], name='abc')
expend_flow[X.cat_from['married_women'], X.cat_to['retirement_survival_women'], y] = \
flow[X.cat_from['married_women'], X.cat_to['retirement_survival_women'], y] * \
pension_age_diff_lag['married_men', y] * 1.1 * (45 / average_clength_survival['married_men', y])
expend_flow['cat_from[married_women], cat_to[retirement_survival_women]', y] = \
flow['cat_from[married_women], cat_to[retirement_survival_women]', y] * \
pension_age_diff_lag['married_men', y] * 1.1 * (45 / average_clength_survival['married_men', y])
# ===================================================
# ================ Multi Index/Sparse ===============
# ===================================================
* for each label in an axis that is part of a CombinedAxis, I need a
list/array of positions => the total size of those position arrays will be
equal to N * L, where L is the length of the combined axis and N is the
number of axes in the combined axis
* the question is whether:
arr['M', 10]
lines1 = sex.lines('M')
lines2 = age.lines(10)
total_lines = intersect(lines1, lines2)
is faster than pandas:
arr['M', 10]
filter1 = sex.filter('M')
filter2 = age.filter(10)
total_lines = (filter1 & filter2).nonzero()
* does something like categorical (transforms label -> index) help?
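The two lookup strategies being compared can be sketched on fake data: precomputed per-label position arrays + intersection, vs boolean masks combined with & (the pandas-like approach). Both yield the same line numbers:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 1000
sex_col = rng.choice(['M', 'F'], n)
age_col = rng.randint(0, 100, n)

# strategy 1: per-label position arrays, intersected at lookup time
sex_lines = {label: np.nonzero(sex_col == label)[0] for label in ('M', 'F')}
age_lines = {age: np.nonzero(age_col == age)[0] for age in range(100)}
total_lines = np.intersect1d(sex_lines['M'], age_lines[10],
                             assume_unique=True)

# strategy 2: boolean filters combined with &, then nonzero
total_lines2 = np.nonzero((sex_col == 'M') & (age_col == 10))[0]
```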
# ========================================================
# ================ set operation on groups ===============
# ========================================================
PROBLEM: we want the __sub__ op on groups to be either a set operation or an arithmetic operation, depending on the case.
for y in time[start_year + 1:]:
res = a[y + 1]
for c in sutcode.matching('^...$') + sutcode.matching('^..$') - 'ND':
g = sutcode.startingwith(c) - c
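The set-like half of the ambiguity could be illustrated with a toy group class where "+" concatenates label lists and "-" removes labels (while for a scalar group like y, "-" should stay arithmetic). LabelGroup here is hypothetical, not the larray class:

```python
class LabelGroup:
    def __init__(self, labels):
        self.labels = list(labels)

    def __add__(self, other):
        # set-like: concatenate label lists
        extra = other.labels if isinstance(other, LabelGroup) else [other]
        return LabelGroup(self.labels + extra)

    def __sub__(self, other):
        # set-like: drop labels, keeping the original order
        drop = set(other.labels) if isinstance(other, LabelGroup) else {other}
        return LabelGroup([lab for lab in self.labels if lab not in drop])

g = LabelGroup(['ND', 'NF']) + LabelGroup(['NG']) - 'ND'
```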