Tanvir1337/BanglaLLama-3-8b-BnWiki-Instruct-GGUF

This model has been quantized using llama.cpp, a high-performance inference engine for large language models.

System Prompt Format

To interact with the model, use the following prompt format:

{System}
### Prompt:
{User}
### Response:
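
The template above can be filled in programmatically. The following is a minimal sketch, not the author's code: the helper name `build_prompt` is hypothetical, and the single newlines between the parts (including the one after `### Response:`) are an assumption based on how the template is laid out here.

```python
def build_prompt(system: str, user: str) -> str:
    """Fill the card's template: system text, then the Prompt and Response markers.

    Assumption: parts are separated by single newlines, and the model's reply
    is expected to start on the line after "### Response:".
    """
    return f"{system}\n### Prompt:\n{user}\n### Response:\n"


if __name__ == "__main__":
    # Example inputs are illustrative only.
    print(build_prompt("You are a helpful assistant.", "বাংলাদেশের রাজধানী কী?"))
```

The resulting string can be passed as the raw prompt to whichever GGUF-capable runtime you use.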

Usage Instructions

If you're new to using GGUF files, refer to TheBloke's README for detailed instructions.

Quantization Options

The following graph compares various quantization types (lower is better):

[Figure: comparison of quantization types; image not reproduced here]

For more information on quantization, see Artefact2's notes.

Choosing the Right Model File

To select the optimal model file, consider the following factors:

  1. Memory constraints: Determine how much RAM and/or VRAM you have available.
  2. Speed vs. quality: If you prioritize speed, choose a model that fits within your GPU's VRAM. For maximum quality, consider a model that fits within the combined RAM and VRAM of your system.
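
The two factors above can be turned into a small selection rule. The sketch below is only an illustration: `pick_quant` is a hypothetical helper, the file sizes in `EXAMPLE_FILES` are made-up placeholders (read the real sizes from the repository's file list), and the `headroom_gib` margin for context and runtime overhead is an assumption, not something this card specifies.

```python
from typing import Dict, Optional

# Hypothetical sizes in GiB, for illustration only; not the real file sizes.
EXAMPLE_FILES: Dict[str, float] = {
    "Q2_K": 3.2,
    "Q4_K_M": 4.9,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def pick_quant(
    files: Dict[str, float],
    vram_gib: float,
    ram_gib: float = 0.0,
    prefer_speed: bool = True,
    headroom_gib: float = 1.5,  # assumed safety margin, tune for your setup
) -> Optional[str]:
    """Return the largest file that fits the memory budget, or None.

    prefer_speed=True  -> budget is VRAM only (whole model on the GPU).
    prefer_speed=False -> budget is VRAM + RAM (larger file, slower inference).
    """
    budget = vram_gib if prefer_speed else vram_gib + ram_gib
    budget -= headroom_gib
    fitting = {name: size for name, size in files.items() if size <= budget}
    if not fitting:
        return None
    # Larger files generally mean less aggressive quantization.
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    print(pick_quant(EXAMPLE_FILES, vram_gib=8.0))
    print(pick_quant(EXAMPLE_FILES, vram_gib=8.0, ram_gib=16.0, prefer_speed=False))
```

Using file size as a proxy for quality is a simplification; it ignores differences between formats of similar size.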

Quantization formats:

  • K-quants (e.g., Q5_K_M): A good starting point, offering a balance between speed and quality.
  • I-quants (e.g., IQ3_M): Newer and more efficient for their size, but may require a specific backend (e.g., cuBLAS on NVIDIA or rocBLAS on AMD).

Hardware compatibility:

  • I-quants: Not compatible with the Vulkan backend, which is often used with AMD cards. If you have an AMD card, make sure you are using the rocBLAS build or another compatible inference engine.
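
The compatibility note above can be expressed as a simple filter. This is a sketch under stated assumptions: the function names are hypothetical, I-quants are recognized only by the `IQ` prefix in the quantization name, and the backend is identified by a plain string such as `"vulkan"` or `"rocblas"` rather than by querying any real runtime.

```python
from typing import Iterable, List


def is_i_quant(quant_name: str) -> bool:
    """Assumption: I-quant names start with 'IQ' (e.g. IQ3_M), K-quants with 'Q'."""
    return quant_name.upper().startswith("IQ")


def usable_on(quant_name: str, backend: str) -> bool:
    """Apply the rule from this card: I-quants are not usable on Vulkan."""
    return not (is_i_quant(quant_name) and backend.lower() == "vulkan")


def filter_quants(quant_names: Iterable[str], backend: str) -> List[str]:
    """Keep only the quantization types usable on the given backend."""
    return [q for q in quant_names if usable_on(q, backend)]


if __name__ == "__main__":
    print(filter_quants(["Q5_K_M", "IQ3_M", "Q4_K_M"], "vulkan"))
```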

For more information on the features and trade-offs of each quantization format, refer to the llama.cpp feature matrix.

Format: GGUF
Model size: 8.03B params
Architecture: llama
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
