How to properly log in training_step or on_training_batch_end in DDP #20098
Unanswered
huangfu170 asked this question in DDP / multi-GPU / multi-node
I am training an NLP task on 1 node with 4 GPUs under DDP, and I want to log the training loss to TensorBoard at the step level. I use the code below, but it doesn't work: the loss only shows up at the end of each epoch.

```python
def training_step(self, batch, batch_idx):
    (input_ids, attention_mask, label, label_input_ids, label_attention_mask,
     edge_index, cp_input_ids, cp_attention_mask) = self.unzip_batch(batch)
    # the forward pass returns extra values that are not needed here
    sim, outputs, _, _, _, _ = self(input_ids, attention_mask, label_input_ids,
                                    label_attention_mask, edge_index,
                                    cp_input_ids, cp_attention_mask)
    loss_sim = loss_function(sim, label)
    loss_output = loss_function(outputs, label)
    loss = loss_sim + loss_output
    # binarize the logits (dropping column 0) at a 0.8 threshold
    sim = (torch.sigmoid(sim[:, 1:]) >= 0.8)
    outputs = (torch.sigmoid(outputs[:, 1:]) >= 0.8)
    return loss
```
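For reference, a minimal sketch of what per-step logging under DDP usually looks like, assuming the standard Lightning `self.log` API; the class name `StepLoggingModel` and the metric name `train_loss` are illustrative, not taken from the snippet above. `on_step=True` requests a logged point at every step, `sync_dist=True` averages the value across the DDP ranks, and the trainer's `log_every_n_steps` (default 50) gates how often logged values actually reach TensorBoard, so short epochs can make per-step logging look like epoch-only logging:

```python
import torch
import torch.nn.functional as F
import lightning.pytorch as pl


class StepLoggingModel(pl.LightningModule):
    """Toy stand-in for the model above; only the logging call matters."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        # on_step=True writes train_loss at every logged step;
        # on_epoch=True also writes the epoch-level mean;
        # sync_dist=True averages the value across the DDP ranks.
        self.log("train_loss", loss,
                 on_step=True, on_epoch=True,
                 prog_bar=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# log_every_n_steps defaults to 50, so with few batches per epoch the
# per-step points may never be flushed; lower it to see step-level curves.
trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp",
                     log_every_n_steps=10, max_epochs=1)
```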