๐Ÿ”๏ธ Spike Skipper

AuroraGPT
Author
Affiliation
Published

September 17, 2024

Modified

November 17, 2024

Details

We describe below our implementation for skipping individual steps during training.

๐Ÿ“ Example

Suppose we observe a large spike in our loss curve, as shown below:

spike-skipper

spike-skipper

Seemingly, this spike is being caused by a batch of โ€œbad dataโ€. In order to prevent this โ€œbad dataโ€ sample from corrupting our training, we would like to โ€œskipโ€ that particular training step.

This can be accomplished by passing the keyword argument --train-range-to-skip and specifying the endpoints of the ranges to be skipped.

e.g., if you would like to skip all steps from [10, 20] and from [25, 30], we would specify:

PBS_O_WORKDIR=$(pwd) bash train_aGPT_7B.sh \
    --train-range-to-skip 10 20 25 30

๐Ÿงช Implementation

We discuss below the details of the implementation, and provide some simple results to confirm things are behaving how we expect.

  1. Check if args.train_range_to_skip is not None [here]

    • Assert len(args.train_range_to_skip) % 2 == 0 [here]

    • Zip these up into pairs [here]:

      ranges_to_skip = list(
          zip(
              args.train_range_to_skip[::2],
              args.train_range_to_skip[1::2]
          )
      )
  2. If current iteration is in any of these pairs [here]

Must be even since weโ€™re specifying the endpoints of intervals to skip

โœ… Sanity Check

In order to confirm things are behaving as expected, we can explicitly look at the tokens drawn for each step, and ensure that they are the same regardless of whether or not that iteration was skipped.

  • In particular, we see that:

    • test 1:

      # [2024-09-16 23:09:09.059118][INFO][training:1083] - iteration=2 [0/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor(
          [[ 1858,  3851, 29889,  ...,   500,    13,    13],
           [  349,  6156,  1650,  ...,  5806, 28557,  3519],
           [16554,   304,  1653,  ...,   322,  6934, 14722],
           [ 4955,   310, 10465,  ...,  1438,  3841, 29892]]
      )
      # [2024-09-16 23:09:09.061999][INFO][training:1083] - iteration=2 [1/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor(
          [[  363,  1302, 16453,  ...,  7967, 29891,   484],
           [  367,   766,  4752,  ...,     1, 29871, 30143],
           [29899,   855,  1503,  ...,  3786, 29892,  5100],
           [  465,  1974,   289,  ..., 21588,   533,   304]]
      )
    • test 2:

      # [2024-09-16 22:59:27.752277][INFO][pretrain_gpt_alcf:198] - args.iteration=2:
      data['text'][:10]=tensor(
          [[ 1858,  3851, 29889,  ...,   500,    13,    13],
           [  349,  6156,  1650,  ...,  5806, 28557,  3519],
           [16554,   304,  1653,  ...,   322,  6934, 14722],
           [ 4955,   310, 10465,  ...,  1438,  3841, 29892]]
      )
      # [2024-09-16 22:59:27,755] [INFO] [profiler.py:81:start_profile] Flops profiler started
      # [2024-09-16 22:59:28.568805][INFO][pretrain_gpt_alcf:198] - args.iteration=2:
      data['text'][:10]=tensor(
          [[363,  1302, 16453,  ...,  7967, 29891,   484],
           [  367,   766,  4752,  ...,     1, 29871, 30143],
           [29899,   855,  1503,  ...,  3786, 29892,  5100],
           [  465,  1974,   289,  ..., 21588,   533,   304]]
      )

    as expected.

๐Ÿ” Details

  • First 4 steps:

    tokens:

    • Iteration 0:

      [2024-09-16 22:58:50.168667][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  304,  7344,  5146,  ...,  9776, 29914, 26419],
              [29889,    13,  4706,  ...,  9280, 30004,    13],
              [29943, 20774, 29908,  ...,   304, 27391,   322],
              [ 2645,   445, 29871,  ..., 16888,  4656, 10070]])
      [2024-09-16 22:58:58.866409][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[ 2768,   596,  1788,  ..., 27274,   393, 30010],
              [  278,  5613,  4192,  ...,   362,   310,  1950],
              [28038, 29892,  2022,  ...,  3160,   278,  2087],
              [ 4149,   907, 29888,  ..., 29896, 29892, 29896]])
      [2024-09-16 22:59:02.043059][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  424,   322, 16232,  ...,   366,   748,   467],
              [   13,   462,  1678,  ...,  2084, 29892,  3497],
              [ 7562,   310, 19320,  ...,  8973, 22684,   358],
              [ 2089,  3633,   292,  ..., 13774,   269,  2375]])
      [2024-09-16 22:59:03.456919][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[21411,   322,  3896,  ...,  2610, 29889,   319],
              [ 8003, 29898, 29900,  ...,    12,  6658,   529],
              [  278,  4148,   310,  ...,   263, 12212,   282],
              [ 5977, 29871, 29906,  ..., 15332,   310,  1749]])
      [2024-09-16 22:59:04.596630][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  278,  1473, 24987,  ...,   263,  2217,  3804],
              [ 2973,   263, 18778,  ...,   263,  4642,  6673],
              [  309,   323,   804,  ...,  1063, 15296,   327],
              [  278,  5864,   322,  ...,  9409, 29889,  2178]])
      [2024-09-16 22:59:05.486913][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[29892, 13731,  6617,  ..., 29871, 29896, 29946],
              [ 2892,  1012,  1266,  ...,  4036,  7512,  2068],
              [ 1473,  1556,  3619,  ...,  3762,   338,   263],
              [23353, 29918,  2177,  ...,   501,   567,   814]])
      [2024-09-16 22:59:06.361333][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[ 5400, 14378,  4768,  ...,  2107, 18677, 29889],
              [ 9200, 29887, 29914,  ...,   293, 24235,   322],
              [30143,  4746,  2184,  ..., 11891, 29974, 25760],
              [19263, 29914,   303,  ...,   358, 29889,    13]])
      [2024-09-16 22:59:07.230671][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  309,  1306,   681,  ...,   310, 23186, 21809],
              [29896, 29929,    13,  ..., 29871, 29900,    13],
              [ 9558,   964,   263,  ...,   322,   282,   682],
              [  278, 23904, 21767,  ...,   313, 29929, 29889]])
    • Iteration 1:

      [2024-09-16 22:59:19.287338][INFO][training_log:661] -  iteration=       1/  635782 | consumed_samples=         768 | consumed_tokens=     3145728 | elapsed_time_per_iteration_ms=29570.0 | learning_rate=9.4372e-09 | global_batch_size=768 | lm loss=11.167250 | loss_scale=1.0 | grad_norm=6.363 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=25.972 | tokens_per_gpu_per_second_tgs=4432.597 | [LM]TFLOPs=20.30 | [DS]TFLOPs=26.18 |
      [2024-09-16 22:59:19.289582][INFO][utils:207] - [Rank 0] (after 1 iterations) memory (MB) | allocated: 1894.57666015625 | max allocated: 9752.35498046875 | reserved: 11342.0 | max reserved: 11342.0
      (min, max) time across ranks (ms):
        forward-backward ...............................: (26094.39, 26095.09)
        optimizer ......................................: (3407.56, 3409.92)
      [2024-09-16 22:59:19.297183][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[ 1472, 29892,   408,  ..., 29892,  1584,   363],
            [  967, 19475,  6593,  ...,  8093, 29899, 11249],
            [ 1006,  2218, 13326,  ...,  2355,  1304,   304],
            [29900, 29916, 29947,  ...,   353,  1870, 29936]])
      [2024-09-16 22:59:20.104352][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[ 2354,   274,  1041,  ..., 29892, 13049,  9098],
            [ 8798,  9547, 10353,  ...,   303,  3143, 29889],
            [ 1373,  4056,  7236,  ...,  3186,   297,  5837],
            [ 1738, 29920,  7355,  ...,    13, 29871,  3776]])
      [2024-09-16 22:59:20.977036][INFO][utils:326] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/c4-0000_text_document
      [2024-09-16 22:59:20.977877][INFO][utils:326] -  > building dataset index ...
      [2024-09-16 22:59:20.977147][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[ 2020,   306,  1016,  ...,   322,   920,   372],
            [ 5921,  1749,  7306,  ..., 19252,   297,  5664],
            [  970,   770, 28547,  ...,   970,   894,  2577],
            [ 1907,   363, 14188,  ...,   756,  3646,   287]])
      [2024-09-16 22:59:21.851620][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[  715, 25392,  3104,  ...,   289,  5761,   616],
            [  426,    13,  9651,  ...,  9651,  1815, 22603],
            [ 7714,  1213,    13,  ...,    13, 29876,   457],
            [29889, 28663,  1230,  ...,  1546,   278,  6586]])
      [2024-09-16 22:59:22.720945][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[29929,    13,    13,  ..., 10739,  4770, 11277],
            [ 4528,   304,  2367,  ...,  2501,   385,  4203],
            [  869,   319,   794,  ...,  3158, 29889,  3115],
            [  592,   260,  4125,  ...,   284,  1135, 18655]])
      [2024-09-16 22:59:23.590149][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[14338, 25323,  3321,  ...,  5607,  1806,  1164],
            [  322,   278, 15352,  ...,  6462,   313,  1552],
            [25738,   714, 29889,  ..., 29915, 29879, 24842],
            [ 5122,   399, 29889,  ..., 29947,  7284,  2305]])
      [2024-09-16 22:59:24.457646][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[  367, 19310,  1891,  ...,  2408,   292,   263],
            [  470,  3307,  5713,  ...,   568,  2594, 19385],
            [29953, 29905,  1631,  ...,  1118,   343, 29897],
            [10261,   373,  5490,  ...,   511,   297,  1760]])
      [2024-09-16 22:59:25.326699][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[ 1006,   326, 29901,  ..., 14834,  6694,  9595],
            [12058,  5446, 29892,  ..., 29889,  8246,  3310],
            [ 7483,   310,   278,  ...,   402,  9851,  4423],
            [ 8041,   813,   322,  ...,  3303,  3900,   393]])
    • Iteration 2:

      [2024-09-16 22:59:27.744603][INFO][training_log:661] -  iteration=       2/  635782 | consumed_samples=        1536 | consumed_tokens=     6291456 | elapsed_time_per_iteration_ms=8457.2 | learning_rate=1.88744e-08 | global_batch_size=768 | lm loss=11.164009 | loss_scale=1.0 | grad_norm=6.271 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=90.810 | tokens_per_gpu_per_second_tgs=15498.234 | [LM]TFLOPs=70.98| [DS]TFLOPs=91.53 |
      (min, max) time across ranks (ms):
        forward-backward ...............................: (8384.83, 8385.57)
        optimizer ......................................: (55.03, 55.61)
      [2024-09-16 22:59:27.752277][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[ 1858,  3851, 29889,  ...,   500,    13,    13],
            [  349,  6156,  1650,  ...,  5806, 28557,  3519],
            [16554,   304,  1653,  ...,   322,  6934, 14722],
            [ 4955,   310, 10465,  ...,  1438,  3841, 29892]])
      [2024-09-16 22:59:27,755] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:28.568805][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[  363,  1302, 16453,  ...,  7967, 29891,   484],
            [  367,   766,  4752,  ...,     1, 29871, 30143],
            [29899,   855,  1503,  ...,  3786, 29892,  5100],
            [  465,  1974,   289,  ..., 21588,   533,   304]])
      [2024-09-16 22:59:28,571] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:29.440843][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[29889,    13,  4806,  ...,  3086, 26040,  9220],
            [  293,  7207,   355,  ..., 18131,   520,  1247],
            [ 8619, 29889, 29871,  ...,   304, 10029,   266],
            [  363, 15202, 29892,  ...,   482, 17162, 19104]])
      [2024-09-16 22:59:29,443] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:30.313403][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[25561,   411,   278,  ...,   297,  2898, 26163],
            [22574,  2607, 18134,  ...,    13,  4706,   500],
            [20190, 24820,  1623,  ...,   310,   901, 29892],
            [29892,  1951,  4486,  ...,   869,   887, 30010]])
      [2024-09-16 22:59:30,316] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:31.185339][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[ 5371, 22417, 29892,  ...,    13,  6716,   901],
            [  353,  1565, 29936,  ..., 29878,  3567,  7196],
            [17296,   338,  1985,  ...,  3741,  9089,   422],
            [  694, 13331,   310,  ..., 21180, 29892,   607]])
      [2024-09-16 22:59:31,188] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:32.057207][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[  292,  8818,   267,  ..., 29892, 11275,  7407],
            [ 1870, 29897,    13,  ...,  2697, 29901,    13],
            [29913,   338,   263,  ..., 29892,   591,  3394],
            [ 2253,   472,  1554,  ...,   982,   304,   376]])
      [2024-09-16 22:59:32,060] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:32.930293][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[  391,  2598, 29883,  ..., 22629,   346,   440],
            [29871, 29896, 29906,  ...,   407,   583,  2833],
            [ 4262,  1836,    13,  ...,   310,   263, 10608],
            [ 1199,   411, 24770,  ...,   272,  2153, 29889]])
      [2024-09-16 22:59:32,932] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:33.803567][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[  620, 20503,   428,  ...,   297,  1009,  9443],
            [  950, 25078,   892,  ...,   408, 10636,   284],
            [ 1012,  2003,   364,  ...,  7313, 29912, 19303],
            [29906, 29892, 29945,  ...,   967, 26414,   472]])
    • Iteration 3:

      [2024-09-16 22:59:34.881265][INFO][training_log:661] -  iteration=       3/  635782 | consumed_samples=        2304 | consumed_tokens=     9437184 | elapsed_time_per_iteration_ms=7136.5 | learning_rate=2.83116e-08 | global_batch_size=768 | lm loss=11.164038 | loss_scale=1.0 | grad_norm=6.279 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=107.615 | tokens_per_gpu_per_second_tgs=18366.372 | [LM]TFLOPs=84.12 | [DS]TFLOPs=108.46 |
      (min, max) time across ranks (ms):
        forward-backward ...............................: (7078.48, 7079.28)
        optimizer ......................................: (38.62, 43.08)
      [2024-09-16 22:59:34.888870][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[  496,   313, 29941,  ...,  1316,   408,  4857],
            [29899,  3204, 29889,  ...,  1074,   330,  2547],
            [29916, 29900, 29946,  ..., 18455, 29889,  4002],
            [26406,   338,  1641,  ...,   670,  1914,  6900]])
      [2024-09-16 22:59:35.719630][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[29945, 29900,   867,  ...,  7601, 12091,   310],
            [  975, 29871, 29896,  ...,  3573,   825,   306],
            [29906, 29900,  4638,  ..., 29227, 23145, 29892],
            [  278, 14368,   322,  ..., 14909, 29936, 25913]])
      [2024-09-16 22:59:36.591343][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[  988,   306,  1033,  ...,   437,   408,  1532],
            [  450, 10317,   310,  ...,   322,   752, 13036],
            [11405,  8020, 29889,  ...,   471, 18096,   287],
            [  288,  3594, 19284,  ...,   910,   338,   385]])
      [2024-09-16 22:59:37.463941][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[  322, 15151, 29946,  ..., 11648,  1497, 29889],
            [24233,   362,   467,  ...,  4513,  1353,   322],
            [ 3311, 13605, 29912,  ...,   945, 29899,  4181],
            [ 1951,   366,   508,  ...,  6589,   491,   777]])
      [2024-09-16 22:59:38.343307][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[29889, 29900,    13,  ...,  6017,   424,  1711],
            [  297,  5500,  1489,  ...,   310,  3802,  7875],
            [ 8078,  5314,   515,  ...,   373,   278,  6991],
            [13763,  6204,  6359,  ...,  4706,  2024,  1347]])
      [2024-09-16 22:59:39.214871][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[   13,  4806,  3512,  ...,   278,  7824,  6438],
            [ 2294,   938,   903,  ...,  4537,  3047,   449],
            [ 1230,  4123,   767,  ...,   310,   963, 21003],
            [ 1152,  2319, 10365,  ...,   367, 14040,   363]])
      [2024-09-16 22:59:40.085368][INFO][utils:326] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/tulu_flan-0000_text_document
      [2024-09-16 22:59:40.086224][INFO][utils:326] -  > building dataset index ...
      [2024-09-16 22:59:40.085475][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[27297, 29924,   801,  ..., 28947, 29892,   470],
            [12542,  5568,   703,  ...,   426,    13,  4706],
            [ 6907,   800,   322,  ..., 29892,  1661, 30304],
            [29900, 13630,   293,  ..., 26552,   363,   975]])
      [2024-09-16 22:59:40.958071][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[   13, 29946, 29953,  ..., 29953, 29945, 29871],
            [ 1283, 16578,  1156,  ...,   408,  2215,   408],
            [29906,  4229,  7671,  ...,    13,  1576,  1014],
            [  526,  2898,   304,  ...,   471,  4802, 29991]])
    • Iteration 4:

      [2024-09-16 22:59:42.028460][INFO][training_log:661] -  iteration=       4/  635782 | consumed_samples=        3072 | consumed_tokens=    12582912 | elapsed_time_per_iteration_ms=7147.0 | learning_rate=3.77488e-08 | global_batch_size=768 | lm loss=11.171233 | loss_scale=1.0 | grad_norm=6.272 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=107.458 | tokens_per_gpu_per_second_tgs=18339.524 | [LM]TFLOPs=84.00 | [DS]TFLOPs=108.31 |
      (min, max) time across ranks (ms):
        forward-backward ...............................: (7091.77, 7092.56)
        optimizer ......................................: (39.21, 40.11)
      [2024-09-16 22:59:42.035716][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[  443,   666, 10170,  ...,   278,   619,  7323],
            [   13, 11008,   338,  ...,  2472,   363, 22049],
            [29871,    13, 29938,  ..., 29962,  8521, 29896],
            [ 1165,  2280,   304,  ...,   306,   471,  2086]])
      [2024-09-16 22:59:42.860756][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[  304,   679,   304,  ...,  1475, 29889,  1570],
            [ 2184, 29936,    13,  ...,  4706,   970,  1780],
            [29872,   352, 29901,  ..., 29905,  4915, 29912],
            [16809,   304,  1438,  ..., 13457, 29889,    13]])
      [2024-09-16 22:59:43.731208][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[12015, 29901, 20549,  ...,   322, 10752, 17906],
            [  372, 30010, 29879,  ..., 29892, 14595,   653],
            [18280, 29958,    13,  ..., 18884,   736,  6251],
            [29889,    13,    13,  ...,   599,   373, 17097]])
      [2024-09-16 22:59:44.605047][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[ 3367,   567,   964,  ...,  3353, 24870,  2181],
            [ 1262,  2609,   367,  ..., 29974, 29896,  7570],
            [29871, 29941, 29900,  ...,   341,   555,   265],
            [ 4225,   526,  6041,  ...,  1925,  1623,  2748]])
      [2024-09-16 22:59:45.479433][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[ 2283, 10162,  1496,  ..., 30656, 30317, 30605],
            [29879,  9228,   292,  ...,  7968, 29899,  7052],
            [  884,   599,   367,  ..., 29892,   278,  6054],
            [29879,   411,   278,  ...,   367,  5019,  1183]])
      [2024-09-16 22:59:46.351707][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[ 3867,   281,   761,  ..., 11949,   338,  4922],
            [  297,  1432,  2586,  ...,  5414,   278, 29811],
            [29892,   278, 15562,  ..., 10296,   310,   394],
            [ 1451,  2960,  3505,  ...,   657, 14346,  8003]])
      [2024-09-16 22:59:47.222016][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[ 6601,  2874,   414,  ...,   302,   317,  7390],
            [16415,   297,  5146,  ...,   763,   372,   471],
            [29941, 29906,  1118,  ..., 29900, 29889, 29953],
            [ 4893,   304,  4808,  ...,  2284,  2164, 18690]])
      [2024-09-16 22:59:48.094752][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[  901,   310,  2994,  ..., 29873,  1641,   766],
            [  304,  1716,  2562,  ...,  3489,   304,   367],
            [ 1949,  6736, 29871,  ..., 29965, 29909,   353],
            [   13,    13, 29930,  ..., 16497,   316,   474]])
  • Skipping steps [2, 3]:

    tokens:

    • Iteration 0:

      [2024-09-16 23:08:47.749839][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  304,  7344,  5146,  ...,  9776, 29914, 26419],
        [29889,    13,  4706,  ...,  9280, 30004,    13],
        [29943, 20774, 29908,  ...,   304, 27391,   322],
        [ 2645,   445, 29871,  ..., 16888,  4656, 10070]])
      [2024-09-16 23:08:51.451183][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[ 2768,   596,  1788,  ..., 27274,   393, 30010],
        [  278,  5613,  4192,  ...,   362,   310,  1950],
        [28038, 29892,  2022,  ...,  3160,   278,  2087],
        [ 4149,   907, 29888,  ..., 29896, 29892, 29896]])
      [2024-09-16 23:08:54.073597][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  424,   322, 16232,  ...,   366,   748,   467],
        [   13,   462,  1678,  ...,  2084, 29892,  3497],
        [ 7562,   310, 19320,  ...,  8973, 22684,   358],
        [ 2089,  3633,   292,  ..., 13774,   269,  2375]])
      [2024-09-16 23:08:56.212476][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[21411,   322,  3896,  ...,  2610, 29889,   319],
        [ 8003, 29898, 29900,  ...,    12,  6658,   529],
        [  278,  4148,   310,  ...,   263, 12212,   282],
        [ 5977, 29871, 29906,  ..., 15332,   310,  1749]])
      [2024-09-16 23:08:57.207940][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  278,  1473, 24987,  ...,   263,  2217,  3804],
        [ 2973,   263, 18778,  ...,   263,  4642,  6673],
        [  309,   323,   804,  ...,  1063, 15296,   327],
        [  278,  5864,   322,  ...,  9409, 29889,  2178]])
      [2024-09-16 23:08:58.083935][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[29892, 13731,  6617,  ..., 29871, 29896, 29946],
        [ 2892,  1012,  1266,  ...,  4036,  7512,  2068],
        [ 1473,  1556,  3619,  ...,  3762,   338,   263],
        [23353, 29918,  2177,  ...,   501,   567,   814]])
      [2024-09-16 23:08:58.951793][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[ 5400, 14378,  4768,  ...,  2107, 18677, 29889],
        [ 9200, 29887, 29914,  ...,   293, 24235,   322],
        [30143,  4746,  2184,  ..., 11891, 29974, 25760],
        [19263, 29914,   303,  ...,   358, 29889,    13]])
      [2024-09-16 23:08:59.820234][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  309,  1306,   681,  ...,   310, 23186, 21809],
        [29896, 29929,    13,  ..., 29871, 29900,    13],
        [ 9558,   964,   263,  ...,   322,   282,   682],
        [  278, 23904, 21767,  ...,   313, 29929, 29889]])
    • Iteration 1:

      [2024-09-16 23:09:08.943867][INFO][training_log:661] -  iteration=       1/  635782 | consumed_samples=         768 | consumed_tokens=     3145728 | elapsed_time_per_iteration_ms=21224.4 | learning_rate=9.4372e-09 | global_batch_size=768 | lm loss=11.167250 | loss_scale=1.0 | grad_norm=6.363 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=36.185 | tokens_per_gpu_per_second_tgs=6175.523 | [LM]TFLOPs=28.29 | [DS]TFLOPs=36.47 |
      [2024-09-16 23:09:08.953432][INFO][training:1083] - iteration=1 [0/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 1472, 29892,   408,  ..., 29892,  1584,   363],
        [  967, 19475,  6593,  ...,  8093, 29899, 11249],
        [ 1006,  2218, 13326,  ...,  2355,  1304,   304],
        [29900, 29916, 29947,  ...,   353,  1870, 29936]])
      [2024-09-16 23:09:08.957524][INFO][training:1083] - iteration=1 [1/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 2354,   274,  1041,  ..., 29892, 13049,  9098],
        [ 8798,  9547, 10353,  ...,   303,  3143, 29889],
        [ 1373,  4056,  7236,  ...,  3186,   297,  5837],
        [ 1738, 29920,  7355,  ...,    13, 29871,  3776]])
      [2024-09-16 23:09:08.966648][INFO][training:1083] - iteration=1 [2/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 2020,   306,  1016,  ...,   322,   920,   372],
        [ 5921,  1749,  7306,  ..., 19252,   297,  5664],
        [  970,   770, 28547,  ...,   970,   894,  2577],
        [ 1907,   363, 14188,  ...,   756,  3646,   287]])
      [2024-09-16 23:09:08.969989][INFO][training:1083] - iteration=1 [3/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  715, 25392,  3104,  ...,   289,  5761,   616],
        [  426,    13,  9651,  ...,  9651,  1815, 22603],
        [ 7714,  1213,    13,  ...,    13, 29876,   457],
        [29889, 28663,  1230,  ...,  1546,   278,  6586]])
      [2024-09-16 23:09:08.990736][INFO][training:1083] - iteration=1 [4/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[29929,    13,    13,  ..., 10739,  4770, 11277],
        [ 4528,   304,  2367,  ...,  2501,   385,  4203],
        [  869,   319,   794,  ...,  3158, 29889,  3115],
        [  592,   260,  4125,  ...,   284,  1135, 18655]])
      [2024-09-16 23:09:08.993101][INFO][training:1083] - iteration=1 [5/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[14338, 25323,  3321,  ...,  5607,  1806,  1164],
        [  322,   278, 15352,  ...,  6462,   313,  1552],
        [25738,   714, 29889,  ..., 29915, 29879, 24842],
        [ 5122,   399, 29889,  ..., 29947,  7284,  2305]])
      [2024-09-16 23:09:09.036896][INFO][training:1083] - iteration=1 [6/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  367, 19310,  1891,  ...,  2408,   292,   263],
        [  470,  3307,  5713,  ...,   568,  2594, 19385],
        [29953, 29905,  1631,  ...,  1118,   343, 29897],
        [10261,   373,  5490,  ...,   511,   297,  1760]])
      [2024-09-16 23:09:09.039401][INFO][training:1083] - iteration=1 [7/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 1006,   326, 29901,  ..., 14834,  6694,  9595],
        [12058,  5446, 29892,  ..., 29889,  8246,  3310],
        [ 7483,   310,   278,  ...,   402,  9851,  4423],
        [ 8041,   813,   322,  ...,  3303,  3900,   393]])
    • Iteration 2:

      [2024-09-16 23:09:09.050766][INFO][training_log:661] -  iteration=       2/  635782 | consumed_samples=        1536 | consumed_tokens=     6291456 | elapsed_time_per_iteration_ms=106.8 | learning_rate=1.88744e-08 | global_batch_size=  768 | loss_scale=1.0 | grad_norm=6.363 | actual_seqlen= 4096 | number_of_skipped_iterations=  1 | number_of_nan_iterations=  0 | samples_per_second=7190.781 | tokens_per_gpu_per_second_tgs=1227226.651 | [LM]TFLOPs=5620.92 | [DS]TFLOPs=7247.49 |
      [2024-09-16 23:09:09.055864][INFO][training:1069] - Caught 3 in 'ranges_to_skip', skipping!
      [2024-09-16 23:09:09.057929][INFO][training:1082] - torch.Size([4, 4097]), len(train_data_iterator)=490723200
      [2024-09-16 23:09:09.059118][INFO][training:1083] - iteration=2 [0/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 1858,  3851, 29889,  ...,   500,    13,    13],
      [  349,  6156,  1650,  ...,  5806, 28557,  3519],
      [16554,   304,  1653,  ...,   322,  6934, 14722],
      [ 4955,   310, 10465,  ...,  1438,  3841, 29892]])
      [2024-09-16 23:09:09.061999][INFO][training:1083] - iteration=2 [1/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  363,  1302, 16453,  ...,  7967, 29891,   484],
      [  367,   766,  4752,  ...,     1, 29871, 30143],
      [29899,   855,  1503,  ...,  3786, 29892,  5100],
      [  465,  1974,   289,  ..., 21588,   533,   304]])
      [2024-09-16 23:09:09.065494][INFO][training:1083] - iteration=2 [2/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[29889,    13,  4806,  ...,  3086, 26040,  9220],
      [  293,  7207,   355,  ..., 18131,   520,  1247],
      [ 8619, 29889, 29871,  ...,   304, 10029,   266],
      [  363, 15202, 29892,  ...,   482, 17162, 19104]])
      [2024-09-16 23:09:09.069035][INFO][training:1083] - iteration=2 [3/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[25561,   411,   278,  ...,   297,  2898, 26163],
      [22574,  2607, 18134,  ...,    13,  4706,   500],
      [20190, 24820,  1623,  ...,   310,   901, 29892],
      [29892,  1951,  4486,  ...,   869,   887, 30010]])
      [2024-09-16 23:09:09.072577][INFO][training:1083] - iteration=2 [4/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 5371, 22417, 29892,  ...,    13,  6716,   901],
      [  353,  1565, 29936,  ..., 29878,  3567,  7196],
      [17296,   338,  1985,  ...,  3741,  9089,   422],
      [  694, 13331,   310,  ..., 21180, 29892,   607]])
      [2024-09-16 23:09:09.075789][INFO][training:1083] - iteration=2 [5/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  292,  8818,   267,  ..., 29892, 11275,  7407],
      [ 1870, 29897,    13,  ...,  2697, 29901,    13],
      [29913,   338,   263,  ..., 29892,   591,  3394],
      [ 2253,   472,  1554,  ...,   982,   304,   376]])
      [2024-09-16 23:09:09.079052][INFO][training:1083] - iteration=2 [6/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  391,  2598, 29883,  ..., 22629,   346,   440],
      [29871, 29896, 29906,  ...,   407,   583,  2833],
      [ 4262,  1836,    13,  ...,   310,   263, 10608],
      [ 1199,   411, 24770,  ...,   272,  2153, 29889]])
      [2024-09-16 23:09:09.082739][INFO][training:1083] - iteration=2 [7/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  620, 20503,   428,  ...,   297,  1009,  9443],
      [  950, 25078,   892,  ...,   408, 10636,   284],
      [ 1012,  2003,   364,  ...,  7313, 29912, 19303],
      [29906, 29892, 29945,  ...,   967, 26414,   472]])
    • Iteration 3:

      [2024-09-16 23:09:09.135651][INFO][training_log:661] - iteration= 3/ 635782 | consumed_samples= 2304 | consumed_tokens= 9437184 | elapsed_time_per_iteration_ms=84.7 | learning_rate=2.83116e-08 | global_batch_size= 768 | loss_scale=1.0 | grad_norm=6.363 | actual_seqlen= 4096 | number_of_skipped_iterations= 1 | number_of_nan_iterations= 0 | samples_per_second=9070.783 | tokens_per_gpu_per_second_tgs=1548080.271 | [LM]TFLOPs=7090.49 | [DS]TFLOPs=9142.31 |
      [2024-09-16 23:09:09.143511][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[496,   313, 29941,  ...,  1316,   408,  4857],
        [29899,  3204, 29889,  ...,  1074,   330,  2547],
        [29916, 29900, 29946,  ..., 18455, 29889,  4002],
        [26406,   338,  1641,  ...,   670,  1914,  6900]])
      [2024-09-16 23:09:09.971988][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[29945, 29900,   867,  ...,  7601, 12091,   310],
        [  975, 29871, 29896,  ...,  3573,   825,   306],
        [29906, 29900,  4638,  ..., 29227, 23145, 29892],
        [  278, 14368,   322,  ..., 14909, 29936, 25913]])
      [2024-09-16 23:09:10.843966][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[988,   306,  1033,  ...,   437,   408,  1532],
        [  450, 10317,   310,  ...,   322,   752, 13036],
        [11405,  8020, 29889,  ...,   471, 18096,   287],
        [  288,  3594, 19284,  ...,   910,   338,   385]])
      [2024-09-16 23:09:11.715513][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[322, 15151, 29946,  ..., 11648,  1497, 29889],
        [24233,   362,   467,  ...,  4513,  1353,   322],
        [ 3311, 13605, 29912,  ...,   945, 29899,  4181],
        [ 1951,   366,   508,  ...,  6589,   491,   777]])
      [2024-09-16 23:09:12.584136][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[29889, 29900,    13,  ...,  6017,   424,  1711],
        [  297,  5500,  1489,  ...,   310,  3802,  7875],
        [ 8078,  5314,   515,  ...,   373,   278,  6991],
        [13763,  6204,  6359,  ...,  4706,  2024,  1347]])
      [2024-09-16 23:09:13.450767][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[13,  4806,  3512,  ...,   278,  7824,  6438],
        [ 2294,   938,   903,  ...,  4537,  3047,   449],
        [ 1230,  4123,   767,  ...,   310,   963, 21003],
        [ 1152,  2319, 10365,  ...,   367, 14040,   363]])
      [2024-09-16 23:09:14.317517][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[27297, 29924,   801,  ..., 28947, 29892,   470],
        [12542,  5568,   703,  ...,   426,    13,  4706],
        [ 6907,   800,   322,  ..., 29892,  1661, 30304],
        [29900, 13630,   293,  ..., 26552,   363,   975]])
      [2024-09-16 23:09:15.187191][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[13, 29946, 29953,  ..., 29953, 29945, 29871],
        [ 1283, 16578,  1156,  ...,   408,  2215,   408],
        [29906,  4229,  7671,  ...,    13,  1576,  1014],
        [  526,  2898,   304,  ...,   471,  4802, 29991]])

Citation

BibTeX citation:
@online{foreman2024,
  author = {Foreman, Sam},
  title = {Spike {Skipper}},
  date = {2024-09-17},
  url = {https://samforeman.me/posts/AuroraGPT/spike-skipper},
  langid = {en}
}
For attribution, please cite this work as:
Foreman, Sam. 2024. โ€œSpike Skipper.โ€ September 17, 2024. https://samforeman.me/posts/AuroraGPT/spike-skipper.