This is an MoE of Llama-3-8b with 4 experts. It does not use semantic routing; it is built on the deepseek-moe architecture. There is no router and no gate: all experts are active on every token (a minimal sketch of what such a gateless layer looks like follows the usage example below).

```python
import torch
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM

model_path = "Crystalcareai/llama-3-4x8b"

# Load in bfloat16 with Flash Attention 2; trust_remote_code is needed because
# the deepseek-moe architecture ships custom modeling code with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Format the prompt to match the Alpaca instruction template
prompt = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Sam is faster than Joe. Joe is faster than Jane. Is Sam faster than Jane? Explain your reasoning step by step.

### Input:

### Response:
"""

tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512,
)
```
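Because there is no gate, an MoE layer in this setup simply runs every expert on every token and combines their outputs. The snippet below is only a minimal sketch of that idea, not the actual modeling code shipped with the checkpoint: the class name `GatelessMoE`, the plain MLP experts, and the mean combination are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GatelessMoE(nn.Module):
    """Toy MoE feed-forward block with no router: every expert sees every token."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int = 4):
        super().__init__()
        # Illustrative expert MLPs; the real checkpoint uses Llama-style expert blocks.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # No gating scores and no top-k selection: every expert processes the
        # full hidden states, and the outputs are combined (here, averaged).
        expert_outputs = torch.stack([expert(hidden_states) for expert in self.experts])
        return expert_outputs.mean(dim=0)
```

With no router there is no load balancing or token dropping to worry about, but every token pays the compute cost of all four experts.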