
failed reproduce llama3-8b result #14

Open
chunniunai220ml opened this issue Jun 17, 2024 · 4 comments

Comments

chunniunai220ml commented Jun 17, 2024

I cannot reproduce the llama3-8b result following your advice; I only got:
{'exact_match': 53.9604, 'num_predicted': 202, 'mean_prediction_length_characters': 1.0, 'LEval_score': 53.9604, 'display_keys': ['exact_match'], 'display': [53.9604]}

Here is what I ran:
python Baselines/llama2-chat-test.py
--metric exam_eval
--task_name quality
--max_length 4k

and I changed llama2-chat-test.py as follows:
elif args.metric == "exam_eval":
context = "Document is as follows. {document} \nQuestion: {inst}. Please directly give the answer without any additional output or explanation "

message="<|begin_of_text|>"+sys_prompt # B_INST + B_SYS + sys_prompt + E_SYS + context + E_INST
message += "\nAnswer:"

ChenxinAn-fdu (Collaborator) commented Jun 17, 2024

Hi! You should use the Instruct version of Llama3 8B and set max_length to 8k.
Please use the chat template of Llama3.

ylsung commented Jul 18, 2024

Hi,

Thank you for providing the codes and tips for reproducing LLaMA 3 results!

I modified the LLaMA 2 codes based on your suggestions:

  1. Use the LLaMA3-Instruct model
  2. Set the max_length to 8k
  3. Use the llama3 template (as shown below)
message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context
message += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
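For clarity, this concatenation can be wrapped in a small helper (a sketch; `build_llama3_prompt` is a hypothetical name, and the special tokens are copied verbatim from the snippet above):

```python
def build_llama3_prompt(sys_prompt: str, context: str) -> str:
    """Assemble a Llama 3 chat prompt: system turn, user turn, then an
    open assistant header so generation continues as the assistant."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        "\n" + sys_prompt +
        "<|eot_id|><|start_header_id|>user<|end_header_id|>"
        "\n" + context +
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )

# Illustrative inputs only.
prompt = build_llama3_prompt("You are a helpful assistant.", "What is 2+2?")
```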

The results I got for the six tasks are

Llama3-8b        TOEFL  QuALITY  Coursera  SFiction  GSM    CodeU
Your Results     82.89  64.85    53.77     69.53     79.00  2.22
My Reproduction  81.04  61.88    52.62     71.09     29.00  4.44

Results on most datasets are within an acceptable gap of yours, but the GSM100k result I got is much worse.
Could you please help me check if my prompt is the same as yours? Or do you have any other suggestions for reproducing the results (such as tuning the decoding hyperparameters)? Thank you very much.

@ChenxinAn-fdu (Collaborator)

Hi! I suggest using:

message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context + "\nAnswer:"

The Question and Answer pair is needed to achieve high performance on math tasks.
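Concretely, the only change from the earlier template is the trailing answer cue in the user turn (a sketch with illustrative values; the special tokens are copied from the snippet above):

```python
# Build the suggested prompt: same Llama 3 headers, but the user turn
# ends with "\nAnswer:" so the model completes the Question/Answer pair.
sys_prompt = "Answer the question based on the document."  # illustrative
context = "Document is as follows. The sky is blue.\nQuestion: What color is the sky?"

message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context + "\nAnswer:"
```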

ylsung commented Jul 19, 2024

Thanks for your reply.

I found that the role special tokens have to be added to every Question/Answer pair in the GSM100k examples, for example:

context = document + "\n\n" + inst

context = context.replace(
    "Question:",
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\nQuestion:"
)

context = context.replace(
    "Let's think step by step",
    "Let's think step by step\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)

message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += context
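The replacement trick can be packaged as a self-contained function (a sketch; `insert_role_tokens` is a hypothetical name, and the toy document below is only for illustration):

```python
def insert_role_tokens(document: str, inst: str, sys_prompt: str) -> str:
    """Wrap every Question/step-by-step pair in the GSM prompt with
    Llama 3 user/assistant role headers."""
    context = document + "\n\n" + inst
    context = context.replace(
        "Question:",
        "<|eot_id|><|start_header_id|>user<|end_header_id|>\nQuestion:",
    )
    context = context.replace(
        "Let's think step by step",
        "Let's think step by step\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
    )
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        "\n" + sys_prompt + context
    )

# Toy one-shot document plus a new question (illustrative only).
doc = "Question: 1+1=?\nAnswer: Let's think step by step. 1+1=2. The answer is 2."
inst = "Question: 2+3=?\nAnswer: Let's think step by step"
msg = insert_role_tokens(doc, inst, "Solve math problems.")
```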

Then the accuracy will be 78!

There is also another option: not using any chat format at all

message = sys_prompt + "\n" + context

In that case, the model acts like a pre-trained language model and keeps generating self-curated questions and answers after the CoT and the answer to the original question. If we parse out the first answer the model generates (which the current code already does), the accuracy is 80.
