Comparative Study: Training OPT-350M and GPT-2 on Anthropic’s HH-RLHF Dataset Using Reward-Based Training