
failed reproduce llama3-8b result #14

Open
chunniunai220ml opened this issue Jun 17, 2024 · 4 comments

Comments

chunniunai220ml commented Jun 17, 2024

I cannot reproduce the llama3-8b result following your advice; I only got:
{'exact_match': 53.9604, 'num_predicted': 202, 'mean_prediction_length_characters': 1.0, 'LEval_score': 53.9604, 'display_keys': ['exact_match'], 'display': [53.9604]}

Here is what I ran:
python Baselines/llama2-chat-test.py
--metric exam_eval
--task_name quality
--max_length 4k

and I changed llama2-chat-test.py as follows:
elif args.metric == "exam_eval":
context = "Document is as follows. {document} \nQuestion: {inst}. Please directly give the answer without any additional output or explanation "

message="<|begin_of_text|>"+sys_prompt # B_INST + B_SYS + sys_prompt + E_SYS + context + E_INST
message += "\nAnswer:"

ChenxinAn-fdu (Collaborator) commented Jun 17, 2024

Hi! You should use the Instruct version of Llama3 8B and set max_length to 8k.
Please use the chat template of Llama3.

ylsung commented Jul 18, 2024

Hi,

Thank you for providing the codes and tips for reproducing LLaMA 3 results!

I modified the LLaMA 2 codes based on your suggestions:

  1. Use the LLaMA3-Instruct model
  2. Set the max_length to 8k
  3. Use the llama3 template (as shown below)
message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context
message += "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
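For clarity, this concatenation can be wrapped in a small helper (a sketch; `build_llama3_prompt` is a hypothetical name, and the special tokens are copied verbatim from the snippet above):

```python
def build_llama3_prompt(sys_prompt: str, context: str) -> str:
    """Assemble a Llama 3 chat prompt: system turn, user turn, then an
    open assistant header so generation continues as the assistant."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        "\n" + sys_prompt +
        "<|eot_id|><|start_header_id|>user<|end_header_id|>"
        "\n" + context +
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )

# Illustrative inputs only.
prompt = build_llama3_prompt("You are a helpful assistant.", "What is 2+2?")
```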

The results I got for the six tasks are

Llama3-8b        TOEFL  QuALITY  Coursera  SFiction  GSM    CodeU
Your Results     82.89  64.85    53.77     69.53     79.00  2.22
My Reproduction  81.04  61.88    52.62     71.09     29.00  4.44

Results on most datasets are within an acceptable gap of yours, but the GSM100k result I got is much worse.
Could you please help me check if my prompt is the same as yours? Or do you have any other suggestions for reproducing the results (such as tuning the decoding hyperparameters)? Thank you very much.

@ChenxinAn-fdu (Collaborator)

Hi! I suggest using:

message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context + "\nAnswer:"

The Question and Answer pair is needed to achieve high performance on math tasks.
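Concretely, the only change from the earlier template is the trailing answer cue in the user turn (a sketch with illustrative values; the special tokens are copied from the snippet above):

```python
# Build the suggested prompt: same Llama 3 headers, but the user turn
# ends with "\nAnswer:" so the model completes the Question/Answer pair.
sys_prompt = "Answer the question based on the document."  # illustrative
context = "Document is as follows. The sky is blue.\nQuestion: What color is the sky?"

message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += "<|eot_id|><|start_header_id|>user<|end_header_id|>"
message += "\n" + context + "\nAnswer:"
```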

ylsung commented Jul 19, 2024

Thanks for your reply.

I found that the role special tokens have to be added to every Question/Answer pair in the GSM100k examples, for example:

context = document + "\n\n" + inst

context = context.replace(
    "Question:",
    "<|eot_id|><|start_header_id|>user<|end_header_id|>\nQuestion:"
)

context = context.replace(
    "Let's think step by step",
    "Let's think step by step\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
)

message = ""
message += "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
message += "\n" + sys_prompt
message += context
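The replacement trick can be packaged as a self-contained function (a sketch; `insert_role_tokens` is a hypothetical name, and the toy document below is only for illustration):

```python
def insert_role_tokens(document: str, inst: str, sys_prompt: str) -> str:
    """Wrap every Question/step-by-step pair in the GSM prompt with
    Llama 3 user/assistant role headers."""
    context = document + "\n\n" + inst
    context = context.replace(
        "Question:",
        "<|eot_id|><|start_header_id|>user<|end_header_id|>\nQuestion:",
    )
    context = context.replace(
        "Let's think step by step",
        "Let's think step by step\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
    )
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        "\n" + sys_prompt + context
    )

# Toy one-shot document plus a new question (illustrative only).
doc = "Question: 1+1=?\nAnswer: Let's think step by step. 1+1=2. The answer is 2."
inst = "Question: 2+3=?\nAnswer: Let's think step by step"
msg = insert_role_tokens(doc, inst, "Solve math problems.")
```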

Then the accuracy will be 78!

There is also another option: not using any chat format at all

message = sys_prompt + "\n" + context

In that case, the model acts like a pre-trained language model and keeps generating self-curated questions and answers after the CoT and the answer to the original question. If we parse out the first answer the model generates (which the current code already does), the accuracy is 80.
