flowchart LR
subgraph D["`Data`"]
direction TB
x("`xโ`")
x1("`xโ`")
x2("`xโ`")
end
direction LR
subgraph G0["`GPU0`"]
direction LR
subgraph N0["`NN`"]
end
%%y0("`yโ`")
L0["`Loss`"]
end
subgraph G1["`GPU1`"]
direction LR
subgraph N1["`NN`"]
end
L1["`Loss`"]
end
subgraph G2["`GPU2`"]
direction LR
subgraph N2["`NN`"]
end
L2["`Loss`"]
end
x --> G0
x1 --> G1
x2 --> G2
N0 --> L0
N1 --> L1
N2 --> L2
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383
classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
class x,y0,L0 red
class x1,L1 green
class x2,L2 blue
class x3,ar grey
class D,N0,N1,N2,G0,G1,G2,GU block
class AR block
class bc text
Figure 2: Each GPU receives unique data at each step
Data Parallel Training: Forward Pass
flowchart LR
subgraph D["`Data`"]
direction TB
%%xp("`xโโโ`")
x("`xโ`")
x1("`xโ`")
x2("`xโ`")
end
direction LR
subgraph G0["`GPU0`"]
direction LR
subgraph N0["`NN`"]
end
%%y0("`yโ`")
L0["`Loss`"]
end
subgraph G1["`GPU1`"]
direction LR
subgraph N1["`NN`"]
end
L1["`Loss`"]
end
subgraph G2["`GPU2`"]
direction LR
subgraph N2["`NN`"]
end
L2["`Loss`"]
end
subgraph AR["`Average Grads`"]
direction TB
ar("`(1/n) โ gโ`")
end
x --> G0
x1 --> G1
x2 --> G2
N0 --> L0
N1 --> L1
N2 --> L2
G0 -.-> AR
G1 -.-> AR
G2 -.-> AR
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383
class x,y0,L0 red
class x1,L1 green
class x2,L2 blue
class x3,ar grey
class D,N0,N1,N2,G0,G1,G2,GU block
class AR block
class bc text
Figure 3: Average gradients across all GPUs
Data Parallel Training: Backward Pass
flowchart RL
subgraph D["`Data`"]
direction TB
x("`xโ`")
x1("`xโ`")
x2("`xโ`")
end
subgraph G0["`GPU0`"]
direction RL
subgraph N0["`NN`"]
end
L0["`Loss`"]
end
subgraph G1["`GPU1`"]
direction RL
subgraph N1["`NN`"]
end
L1["`Loss`"]
end
subgraph G2["`GPU2`"]
direction RL
subgraph N2["`NN`"]
end
L2["`Loss`"]
end
subgraph BC["`Send Updates`"]
direction TB
end
BC -.-> G0
BC -.-> G1
BC -.-> G2
L0 ~~~ N0
L1 ~~~ N1
L2 ~~~ N2
G0 ~~~ x
G1 ~~~ x1
G2 ~~~ x2
classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383
class x,y0,L0 red
class x1,L1 green
class x2,L2 blue
class x3,ar grey
class D,N0,N1,N2,G0,G1,G2,GU block
class BC block
class bc text
Figure 4: Send global updates back to each GPU
Data Parallel Training
flowchart LR
subgraph D["`Data`"]
direction TB
x("`xโ`")
x1("`xโ`")
x2("`xโ`")
end
direction LR
subgraph G0["`GPU0`"]
direction LR
subgraph N0["`NN`"]
end
L0["`L0`"]
end
subgraph G1["`GPU1`"]
direction LR
subgraph N1["`NN`"]
end
L1["`L1`"]
end
subgraph G2["`GPU2`"]
direction LR
subgraph N2["`NN`"]
end
L2["`L2`"]
end
subgraph AR["`Average Grads`"]
direction TB
ar("`(1/n) โ gโ`")
bc("`Update Weights`")
ar --> bc
end
x --> G0
x1 --> G1
x2 --> G2
N0 --> L0
N1 --> L1
N2 --> L2
G0 <-.-> AR
G1 <-.-> AR
G2 <-.-> AR
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383
class x,y0,L0 red
class x1,L1 green
class x2,L2 blue
class x3,ar grey
class D,N0,N1,N2,G0,G1,G2,GU block
class AR block
class bc text
Every rank must participate (collective communication)!
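To make this concrete, below is a minimal sketch of one data-parallel step written directly against torch.distributed (in practice, PyTorch's DistributedDataParallel performs the gradient all-reduce for you). It assumes one process per GPU launched with torchrun; the tiny nn.Linear model and the random batches are stand-ins, not part of the original example.

```python
import torch
import torch.distributed as dist

# One process per GPU, launched with e.g. `torchrun --nproc_per_node=3 train.py`
dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank)            # single node: local rank == global rank

torch.manual_seed(0)                   # identical initial weights on every rank
model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

torch.manual_seed(1234 + rank)         # each rank draws unique data -- Figure 2
x = torch.randn(16, 8, device="cuda")
y = torch.randn(16, 1, device="cuda")
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# Average gradients across all GPUs -- Figure 3.
# This is a collective call: every rank must make it, or the others hang.
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world_size

optimizer.step()                       # identical update on every rank -- Figure 4
dist.destroy_process_group()
```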
flowchart TD
subgraph D["`Data`"]
direction LR
x("`xโ`")
x1("`xโ`")
x2("`xโ`")
end
subgraph G0["`GPU0`"]
direction TB
subgraph N0["`NN`"]
end
L0["`Lโ`"]
end
subgraph G1["`GPU1`"]
direction TB
subgraph N1["`NN`"]
end
L1["`Lโ`"]
end
subgraph G2["`GPU2`"]
direction TB
subgraph N2["`NN`"]
end
L2["`Lโ`"]
end
subgraph AR["`Average Grads`"]
direction TB
ar("`(1/n) โ gโ`")
bc("`Update Weights`")
ar --> bc
end
x --> G0
x1 --> G1
x2 --> G2
N0 --> L0
N1 --> L1
N2 --> L2
G0 <-.-> AR
G1 <-.-> AR
G2 <-.-> AR
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383
class x,y0,L0 red
class x1,L1 green
class x2,L2 blue
class x3,ar grey
class D,N0,N1,N2,G0,G1,G2,GU block
class AR block
class bc text
Collective operations have to be called by every rank in order to form a complete collective operation.
Failure to do so will result in the other ranks waiting indefinitely.
AllReduce
Perform reductions on data (e.g. sum, min, max) across ranks and send the result back to everyone.
flowchart TD
subgraph R0["`Rank 0`"]
x0("`x0`")
end
subgraph R1["`Rank 1`"]
x1("`x1`")
end
subgraph R2["`Rank 2`"]
x2("`x2`")
end
subgraph R3["`Rank 3`"]
x3("`x3`")
end
subgraph AR["`Allreduce`"]
xp["`x' = โ xโ `"]
end
subgraph AR3["`Rank 3`"]
xp3("`x'`")
end
subgraph AR2["`Rank 2`"]
xp2("`x'`")
end
subgraph AR1["`Rank 1`"]
xp1("`x'`")
end
subgraph AR0["`Rank 0`"]
xp0("`x'`")
end
x0 --> AR
x1 --> AR
x2 --> AR
x3 --> AR
AR --> xp0
AR --> xp1
AR --> xp2
AR --> xp3
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef pink fill:#E599F7,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
class R0,R1,R2,R3,AR,AR0,AR1,AR2,AR3 block
class xp,xp0,xp1,xp2,xp3, purple
class x0, red
class x1, green
class x2, blue
class x3, yellow
Figure 7: All-Reduce operation: each rank receives the reduction of input values across ranks.
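A minimal torch.distributed sketch of Figure 7, assuming one process per GPU launched with torchrun:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.tensor([float(rank)], device="cuda")   # each rank holds its own xᵢ
dist.all_reduce(x, op=dist.ReduceOp.SUM)         # in place: x becomes x' = ∑ xᵢ on every rank
print(f"rank {rank}: {x.item()}")                # same value printed by every rank
```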
Reduce
Perform a reduction on data across ranks and send the result to an individual rank.
flowchart TD
subgraph R0["`Rank 0`"]
x0("`x0`")
end
subgraph R1["`Rank 1`"]
x1("`x1`")
end
subgraph R2["`Rank 2`"]
x2("`x2`")
end
subgraph R3["`Rank 3`"]
x3("`x3`")
end
subgraph AR["`Reduce`"]
xp["`x'=reduce(x, 2, SUM)`"]
end
subgraph AR3["`Rank 3`"]
end
subgraph AR2["`Rank 2`"]
xp2("`x'`")
end
subgraph AR1["`Rank 1`"]
end
subgraph AR0["`Rank 0`"]
end
x0 --> AR
x1 --> AR
x2 --> AR
x3 --> AR
AR --> AR3
AR --> xp2
AR --> AR1
AR --> AR0
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef pink fill:#E599F7,stroke:#333,stroke-width:1px,color:#000
class R0,R1,R2,R3,AR,AR0,AR1,AR2,AR3, block
class xp,xp2 purple
class x0, red
class x1, green
class x2, blue
class x3, yellow
Figure 8: Reduce operation: one rank receives the reduction of input values across ranks
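A minimal sketch of Figure 8, where only rank 2 receives the sum (same torchrun-per-GPU setup assumed):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

x = torch.tensor([float(rank)], device="cuda")
dist.reduce(x, dst=2, op=dist.ReduceOp.SUM)      # only rank 2 ends up with ∑ xᵢ
if rank == 2:
    print(f"rank 2 holds the sum: {x.item()}")
```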
Broadcast
flowchart TD
subgraph R3["`Rank 3`"]
end
subgraph R2["`Rank 2`"]
x2("`x2`")
end
subgraph R1["`Rank 1`"]
end
subgraph R0["`Rank 0`"]
end
subgraph AR["` `"]
xp["`broadcast(x2, 2)`"]
end
subgraph AR3["`Rank 3`"]
xp3("`x2`")
end
subgraph AR2["`Rank 2`"]
xp2("`x2`")
end
subgraph AR1["`Rank 1`"]
xp1("`x2`")
end
subgraph AR0["`Rank 0`"]
xp0("`x2`")
end
x2 --> AR
AR --> AR3
AR --> AR2
AR --> AR1
AR --> AR0
classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383,font-weight:500
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,font-weight:500,color:#838383
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
class R0,R1,R2,R3,AR0,AR1,AR2,AR3,AR, block
class x2,xp0,xp1,xp2,xp3 blue
class xp, text
Figure 9: broadcast (send) a tensor x from one rank to all ranks
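A minimal sketch of Figure 9, broadcasting rank 2's tensor to all ranks (same setup assumed):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Rank 2 owns the data; the other ranks just allocate a buffer of the same shape
x = (torch.tensor([1.0, 2.0, 3.0], device="cuda") if rank == 2
     else torch.empty(3, device="cuda"))
dist.broadcast(x, src=2)                         # afterwards every rank holds rank 2's x
```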
AllGather
flowchart LR
subgraph R0["`Rank 0`"]
x0("`x0`")
end
subgraph R1["`Rank 1`"]
x1("`x1`")
end
subgraph R2["`Rank 2`"]
x2("`x2`")
end
subgraph R3["`Rank 3`"]
x3("`x3`")
end
subgraph AG["`Allgather`"]
%%xp0["`z=[empty_like(x) for _ in range(4)]`"]
%%xp1["`dist.all_gather(z, x)`"]
end
subgraph AG3["`Rank 3`"]
direction TB
xp03("`x0`")
xp13("`x1`")
xp23("`x2`")
xp33("`x3`")
end
subgraph AG2["`Rank 2`"]
direction TB
xp02("`x0`")
xp12("`x1`")
xp22("`x2`")
xp32("`x3`")
end
subgraph AG1["`Rank 1`"]
direction TB
xp01("`x0`")
xp11("`x1`")
xp21("`x2`")
xp31("`x3`")
end
subgraph AG0["`Rank 0`"]
direction TB
xp00("`x0`")
xp10("`x1`")
xp20("`x2`")
xp30("`x3`")
end
x0 --> AG
x1 --> AG
x2 --> AG
x3 --> AG
AG --> AG0
AG --> AG1
AG --> AG2
AG --> AG3
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,font-weight:500,color:#838383
classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383
class xp0,xp1, text
class AG0,AG1,AG2,AG3,AG,R0,R1,R2,R3, block
class xp00,xp01,xp02,xp03, red
class xp10,xp11,xp12,xp13, green
class xp20,xp21,xp22,xp23, blue
class xp30,xp31,xp32,xp33, yellow
class x0, red
class x1, green
class x2, blue
class x3, yellow
Figure 10: Gathers tensors from the whole group in a list.
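A minimal sketch of Figure 10, gathering every rank's tensor into a list on all ranks (same setup assumed):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank)

x = torch.tensor([float(rank)], device="cuda")            # this rank's tensor
z = [torch.empty_like(x) for _ in range(world_size)]      # output list, one slot per rank
dist.all_gather(z, x)                                      # every rank now holds [x0, x1, x2, x3]
```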
Scatter
flowchart TD
subgraph R3["`Rank 3`"]
end
subgraph R2["`Rank 2`"]
end
subgraph R1["`Rank 1`"]
direction TB
xp03("`x0`")
xp13("`x1`")
xp23("`x2`")
xp33("`x3`")
end
subgraph R0["`Rank 0`"]
end
subgraph S["`Scatter`"]
end
subgraph AG3["`Rank 3`"]
x3("`x3`")
end
subgraph AG2["`Rank 2`"]
x2("`x2`")
end
subgraph AG1["`Rank 1`"]
x1("`x1`")
end
subgraph AG0["`Rank 0`"]
x0("`x0`")
end
%%R0 --> S
R1 --> S
%%R2 --> S
%%R3 --> S
S --> AG0
S --> AG1
S --> AG2
S --> AG3
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,font-weight:500,color:#838383
class AG0,AG1,AG2,AG3,S,R0,R1,R2,R3, block
class xp00,xp01,xp02,xp03, red
class xp10,xp11,xp12,xp13, orange
class xp20,xp21,xp22,xp23, yellow
class xp30,xp31,xp32,xp33, blue
class x0, red
class x1, green
class x2, blue
class x3, yellow
Figure 11: Scatters a list of tensors to the whole group
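A minimal sketch of Figure 11, where rank 1 scatters a list of tensors so that rank i receives the i-th entry (same setup assumed; the backend must support scatter, e.g. Gloo or a recent NCCL):

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
torch.cuda.set_device(rank)

out = torch.empty(1, device="cuda")
# Only the source rank (rank 1, as in Figure 11) provides the list to scatter
scatter_list = ([torch.tensor([float(i)], device="cuda") for i in range(world_size)]
                if rank == 1 else None)
dist.scatter(out, scatter_list, src=1)           # rank i receives scatter_list[i]
```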
flowchart TB
subgraph G0["`GPU 0`"]
direction LR
a0("`Layer 0`")
b0("`Layer 1`")
end
subgraph G1["`GPU 1`"]
direction LR
a1("`Layer 2`")
b1("`Layer 3`")
end
a0 -.-> b0
b0 --> a1
a1 -.-> b1
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
class G0,G1, block
class a0, red
class b0, green
class a1, blue
class b1, yellow
Figure 15: Pipeline Parallelism
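A minimal sketch of the idea behind Figure 15 (naive layer splitting, without the micro-batching that real pipeline parallelism adds): the first layers live on GPU 0, the rest on GPU 1, and activations cross the device boundary in between. The layer sizes here are arbitrary stand-ins.

```python
import torch
import torch.nn as nn

# Layers 0-1 on GPU 0, layers 2-3 on GPU 1 (cf. Figure 15)
stage0 = nn.Sequential(nn.Linear(32, 32), nn.ReLU(),
                       nn.Linear(32, 32), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(32, 32), nn.ReLU(),
                       nn.Linear(32, 1)).to("cuda:1")

x = torch.randn(16, 32, device="cuda:0")
h = stage0(x)                  # forward on GPU 0
y = stage1(h.to("cuda:1"))     # copy activations to GPU 1, then continue there
```

Note that in this naive form GPU 1 sits idle while GPU 0 computes; pipeline parallelism proper splits each batch into micro-batches so both stages can work concurrently.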
Tensor Parallel (TP)
Each tensor is split up into multiple chunks
Each shard of the tensor resides on its designated GPU
During processing, each shard is processed separately (and in parallel) on a different GPU
flowchart LR
subgraph G0["`GPU0`"]
direction TB
a0("`Layer 0`")
b0("`Layer 1`")
c0("`Layer 2`")
d0("`Layer 3`")
end
subgraph G1["`GPU1`"]
direction TB
a1("`Layer 0`")
b1("`Layer 1`")
c1("`Layer 2`")
d1("`Layer 3`")
end
a0 <-.-> a1
b0 <-.-> b1
c0 <-.-> c1
d0 <-.-> d1
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
class G0,G1, block
class a0,a1 red
class b0,b1 green
class c0,c1 blue
class d0,d1 yellow
Figure 16: Tensor Parallel Training
Tensor Parallel (TP)
Suitable when the model is too large to fit onto a single device (CPU / GPU)
Typically more complicated to implement than data parallel training
This is what one may call horizontal parallelism
Communication is required whenever data flows between the two subsets
flowchart LR
subgraph G0["`GPU0`"]
direction TB
a0("`Layer 0`")
b0("`Layer 1`")
c0("`Layer 2`")
d0("`Layer 3`")
end
subgraph G1["`GPU1`"]
direction TB
a1("`Layer 0`")
b1("`Layer 1`")
c1("`Layer 2`")
d1("`Layer 3`")
end
a0 <-.-> a1
b0 <-.-> b1
c0 <-.-> c1
d0 <-.-> d1
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
class G0,G1, block
class a0,a1 red
class b0,b1 green
class c0,c1 blue
class d0,d1 yellow
Figure 17: Tensor Parallel Training
Tensor (/ Model) Parallel Training: Example
Want to compute: $y = \sum_{i} x_{i} W_{i} = x_{0} W_{0} + x_{1} W_{1} + x_{2} W_{2}$, where each GPU has only its portion of the full weights, as shown below (a code sketch follows the diagram)
Compute: $x_{0} W_{0} \rightarrow$ GPU1
Compute: $x_{0} W_{0} + x_{1} W_{1} \rightarrow$ GPU2
Compute: $y = \sum_{i} x_{i} W_{i}$ ✅
flowchart TD
subgraph X2["`GPU2`"]
direction LR
c("`Wโ`")
end
subgraph X1["`GPU1`"]
direction TB
b("`Wโ`")
end
subgraph X0["`GPU0`"]
a("`Wโ`")
end
X0 <-.-> X1
X1 <-.-> X2
t0("`xโ`") --> X0
t1("`xโ`") --> X1
t2("`xโ`") --> X2
classDef redText fill:#CCCCCC02,stroke:#FF8181,stroke-width:2px,color:#838383,font-weight:500
classDef orangeText fill:#CCCCCC02,stroke:#FFC47F,stroke-width:2px,color:#838383
classDef yellowText fill:#CCCCCC02,stroke:#FFFF7F,stroke-width:2px,color:#838383
classDef blueText fill:#CCCCCC02,stroke:#7DCAff,stroke-width:2px,color:#838383
classDef greenText fill:#CCCCCC02,stroke:#98E6A5,stroke-width:2px,color:#838383
classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000
classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000
classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000
classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000
classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000
classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000
classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383
classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383
class a, red
class b, green
class c, blue
class X0,X1,X2, block
%%class t0, redText
%%class t1, greenText
%%class t2, blueText
class a0,b0,c0, text
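Below is a minimal point-to-point sketch of this pass-along computation, assuming three ranks (one per GPU) and a backend with send/recv support (recent NCCL or Gloo); the xᵢ and Wᵢ here are random stand-ins.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank = dist.get_rank()                      # 0, 1, 2 -- one GPU per rank
torch.cuda.set_device(rank)

torch.manual_seed(rank)                     # each rank holds only its own xᵢ and Wᵢ
x_i = torch.randn(1, 4, device="cuda")
W_i = torch.randn(4, 4, device="cuda")

partial = x_i @ W_i                         # this rank's contribution xᵢ Wᵢ
if rank > 0:                                # receive the running sum from the previous GPU
    prev = torch.empty_like(partial)
    dist.recv(prev, src=rank - 1)
    partial += prev
if rank < dist.get_world_size() - 1:        # pass the running sum on to the next GPU
    dist.send(partial, dst=rank + 1)
else:
    y = partial                             # the last rank ends up with y = ∑ xᵢ Wᵢ
```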
Tensor (Model) Parallelism
In Tensor Parallelism, each GPU processes only a slice of a tensor and only aggregates the full tensor for operations that require the whole thing.
The main building block of any transformer is a fully connected layer (nn.Linear) followed by a nonlinear activation (GeLU).
Y = GeLU(XA), where X and Y are the input and output vectors, and A is the weight matrix.
If we look at the computation in matrix form, it's easy to see how the matrix multiplication can be split between multiple GPUs:
Tensor Parallelism
Figure 18: Tensor Parallel GEMM. This information is based on (the much more in-depth) TP Overview by @anton-l
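To make Figure 18 concrete, here is a minimal column-parallel sketch of Y = GeLU(XA) across two GPUs: A is split column-wise, each rank computes GeLU(X Aᵢ) on its shard, and the full Y is only assembled (via all-gather) when an operation actually needs it. The shapes, and building the shard by chunking a full A, are illustrative stand-ins.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

dist.init_process_group(backend="nccl")
rank = dist.get_rank()                              # 0 or 1
world_size = dist.get_world_size()
torch.cuda.set_device(rank)

torch.manual_seed(0)
X = torch.randn(8, 16, device="cuda")               # input X, replicated on every rank
A = torch.randn(16, 32, device="cuda")              # full weight, only used here to carve out a shard
A_i = A.chunk(world_size, dim=1)[rank]              # this rank's column shard of A

Y_i = F.gelu(X @ A_i)                               # each rank computes its slice of Y independently
# Aggregate only when the whole Y is actually required:
shards = [torch.empty_like(Y_i) for _ in range(world_size)]
dist.all_gather(shards, Y_i)
Y = torch.cat(shards, dim=1)                        # Y = GeLU(XA)
```

Because GeLU is applied element-wise, splitting A by columns lets each GPU apply the nonlinearity to its own slice without any communication until the shards are concatenated.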
Figure 29: The simplest, fastest repository for training / finetuning GPT-based models. Figure from karpathy/nanoGPT.
Prepare Data
python3 wordplay/data/shakespeare_char/prepare.py
# Using HF_DATASETS_CACHE=/home/foremans/tmp/polaris-talk/2024-07-17-073327/wordplay/data/shakespeare_char/.cache/huggingface
# length of dataset in characters: 1,115,394
# all the unique characters:
# !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
# vocab size: 65
# train has 1,003,854 tokens
# val has 111,540 tokens
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357