---
license: gemma
library_name: transformers
pipeline_tag: text-generation
extra_gated_heading: Access Gemma on Hugging Face
extra_gated_prompt: >-
  To access Gemma on Hugging Face, you’re required to review and agree to
  Google’s usage license. To do this, please ensure you’re logged in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
tags:
- conversational
base_model: google/gemma-2-27b-it
---

# DataGemma model card

**Resources and Technical Documentation**:

- [DataGemma RAG on Kaggle](https://www.kaggle.com/models/google/datagemma-rag)
- [DataGemma RIG on Kaggle](https://www.kaggle.com/models/google/datagemma-rig)

**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)

**Authors**: Google

# Model Information

### Description

DataGemma is a series of fine-tuned Gemma 2 models that help LLMs access and incorporate reliable public statistical data from Data Commons into their responses. DataGemma RAG is used with Retrieval-Augmented Generation (RAG): it is trained to take a user query and generate natural language queries that can be understood by Data Commons' existing natural language interface. More information can be found in this academic paper (TODO: insert link).

### Inputs and outputs

- **Input**: Text string containing a user query, wrapped in a prompt that asks for statistical questions.
- **Output**: Generated English-language text in response to the input: a list of statistical questions, one per line, that Data Commons' natural language interface can understand.

Here is an example of a prompt used to get statistical questions for the user query `[User Query]`:

```
Your role is that of a Question Generator. Given Query below, come up with a maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health, education, environment, etc. Examples are unemployment rate and life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states, districts, etc.

Your response should only have questions, one per line, without any numbering or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an empty response.

Query: [User Query]
Statistical Questions:
```

### Citation

TODO

# Model Data

The base model was trained on a dataset of text data that includes a wide variety of sources; see the [Gemma 2 documentation](https://ai.google.dev/gemma#gemma-2) for more details. The DataGemma RAG model is fine-tuned on synthetically generated data. More details can be found in the DataGemma technical paper (TODO:url).

# Implementation Information

Like [Gemma](https://ai.google.dev/gemma/docs/model_card#implementation_information), DataGemma RAG was trained on [TPUv5e](https://cloud.google.com/tpu/docs/intro-to-tpu), using [JAX](https://github.com/google/jax).

# Evaluation

The model was evaluated as part of the evaluation of the full RAG workflow. The results are documented in the DataGemma technical paper (TODO:url).

# Ethics and Safety

We are releasing an early version of the models. They are meant for trusted tester use (primarily for academic and research purposes) and are not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this LLM interface.

- We red-teamed and checked the Data Commons Natural Language interface pre-launch against a set of potentially dangerous queries that could result in misleading, controversial, or inflammatory results.
- We ran these same queries against the outputs of the RIG and RAG models, finding a few examples where query responses were controversial, but not dangerous.
- As this model is meant purely for academic and research purposes, it has not been subjected to our usual safety evaluations.

# Usage and Limitations

These models have certain limitations that users should be aware of.

This is a very early version of DataGemma RAG. It is meant for trusted tester use (primarily for academic and research use) and is not yet ready for commercial or general public use. This version was trained on a very small corpus of examples and may exhibit unintended, and at times controversial or inflammatory, behavior. Please anticipate errors and limitations as we actively develop this large language model interface.

Your feedback and evaluations are critical to refining DataGemma's performance and will directly contribute to its training process. Known limitations are detailed in the DataGemma technical paper (TODO:url), and we encourage you to consult it for a comprehensive understanding of DataGemma's current capabilities.
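
As a rough illustration of the input and output format described under "Inputs and outputs", the sketch below fills the question-generator prompt with a user query and splits a model response into individual questions. It is not part of the DataGemma release: the helper names `build_prompt` and `parse_questions` are hypothetical, and the line breaks in the template are an assumption about how the prompt is laid out. Model loading and the call to Data Commons are deliberately left out, since this card does not specify them.

```python
# Hypothetical helpers; names and template line breaks are assumptions.
PROMPT_TEMPLATE = """Your role is that of a Question Generator. Given Query below, come up with a maximum of 25 Statistical Questions that help in answering Query.

These are the only forms of Statistical Questions you can generate:
1. What is $METRIC in $PLACE?
2. What is $METRIC in $PLACE $PLACE_TYPE?
3. How has $METRIC changed over time in $PLACE $PLACE_TYPE?

where,
- $METRIC should be a metric on societal topics like demographics, economy, health, education, environment, etc. Examples are unemployment rate and life expectancy.
- $PLACE is the name of a place like California, World, Chennai, etc.
- $PLACE_TYPE is an immediate child type within $PLACE, like counties, states, districts, etc.

Your response should only have questions, one per line, without any numbering or bullet.

If you cannot come up with Statistical Questions to ask for a Query, return an empty response.

Query: {query}
Statistical Questions:"""


def build_prompt(user_query: str) -> str:
    """Insert the user query into the question-generator prompt."""
    return PROMPT_TEMPLATE.format(query=user_query.strip())


def parse_questions(response: str, limit: int = 25) -> list:
    """Split a response into questions, one per non-empty line.

    Duplicates are dropped and the list is capped at `limit`, mirroring the
    "maximum of 25" instruction in the prompt. An empty response yields [].
    """
    questions = []
    for line in response.splitlines():
        line = line.strip()
        if line and line not in questions:
            questions.append(line)
    return questions[:limit]
```

The string returned by `build_prompt` is what would be passed to the model as input. Each entry returned by `parse_questions` is a candidate query for Data Commons' natural language interface.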