Fundamentals of Data Strategy for More Efficient and Predictive AI Models
- Pravin Shinde
- Mar 1
- 4 min read
Data relevancy, consistency, and cleanliness will be fundamental to building a scalable and competitive indigenous large language model. Today we like to think we are keeping pace with the world, but we are three to four years behind, and unless we build well-structured datasets, our dream of making AI feasible for the ordinary citizen will remain just that.
To run a large language model (LLM), we need a highly scalable and very large text dataset, typically vast amounts of text scraped from the internet, books, articles, code repositories, and other sources. This data is used to train the model to understand and generate human-like text by identifying patterns, context, grammar, and meaning, and this training is crucial for the LLM to function effectively.
If we look at the data India has in the public space, much of it is unstructured, spread across different and disconnected technologies, not cleaned, and mostly not relevant. Healthcare data, for example, is not a good sample for accurate forecasting, because the agencies and healthcare systems collecting it have many loopholes, and some of the data is simply copy-pasted by agencies to claim payments from the states. Yet we must get our LLM running: the race is fast and hyper-competitive, we may be among the last to start, and DeepSeek is already ahead of us.
We have used LLMs for many of our private clients and built use cases, and what we observed is that running the same use case on different LLMs gives somewhat different outcomes. For a country like ours, this means we must first decide what fundamental dataset we will create for our LLM. This is not only about training the model; the dataset must be grounded in our strategic initiatives, whether public healthcare, education, weather prediction for agricultural output, or anything else that touches the life of the ordinary citizen, and not the copy-and-paste approach that DeepSeek and other LLM developers are taking.
To get the best outcomes from the LLM, we must focus on the following:
Large volumes of diverse, structured, and clean data:
LLMs require massive amounts of text data, often in the terabytes range, to learn complex relationships between words and concepts. The success of our indigenous LLM therefore depends on how much data we can gather from real, Indian sources. The government, at least, should not train its LLM on data that originates outside India: in the future there will be many more LLMs, many Indian IT companies will develop their own, and they will inevitably use foreign data that was already used by earlier LLMs or is privately held. The government must use in-house data; otherwise it will take considerable additional time to retrain the LLM on our own data, and the purpose of making AI for the ordinary citizens of India will again go unfulfilled.
Diversity of sources:
Existing LLMs collect data from a variety of sources such as websites, online books, research papers, and code repositories to ensure a broad range of language styles and topics. In our case, internet speeds and access to smartphones have quadrupled the data we store, but the quality and diversity of that data is poor: most of our public-sector data sits in videos or scanned files. Optical character recognition (OCR) tools will help to some extent, but many of the scanned files are in local languages, which will limit and distort the quality and diversity of the data. We need a clear and concise data-diversity strategy and must align our datasets with the needs of the AI model.
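To make the OCR point concrete, here is a minimal sketch of how a scanned page in local languages might be converted to text, assuming the open-source Tesseract engine, its pytesseract wrapper, and the relevant language packs (for example Hindi and Marathi) are installed; the file name is a placeholder.

```python
# Minimal OCR sketch: extract text from a scanned page in local languages.
# Assumes Tesseract is installed with the "hin", "mar" and "eng" language packs;
# "scanned_report.png" is a placeholder file name.
from PIL import Image
import pytesseract

def ocr_page(image_path: str, languages: str = "hin+mar+eng") -> str:
    """Run OCR on one scanned page, trying Hindi, Marathi and English scripts."""
    image = Image.open(image_path)
    # lang accepts a '+'-joined list of installed Tesseract language packs
    return pytesseract.image_to_string(image, lang=languages)

text = ocr_page("scanned_report.png")
print(text[:500])  # inspect the first few hundred characters to judge quality
```

Manual spot checks of the output remain essential, since OCR quality on Indic scripts varies widely and errors here flow straight into the training data.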
LLMs need data in text format:
Traditional LLMs primarily take raw text as input, which is then cleaned and processed into tokens for training. This is a lengthy and costly process. We can use general text data, which includes books, articles, and web pages; these sources provide a broad range of vocabulary and sentence structures. Our education system has done immense work building its literature to date, and that data alone would be enough at this stage to run an LLM and gain deep insight into why the public education system is not performing at the pace of education systems in other countries. We can also start collecting domain-specific data: for specialized applications, data from domains such as medical journals, legal documents, or technical manuals can be very useful. Beyond this, conversational data such as chat logs, transcripts of conversations, and dialogue datasets helps in training models for natural language understanding and generation, but we need to be careful and educate our population on why they should share this type of data despite its personal nature.
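As an illustration of the raw-text-to-tokens step described above, the sketch below trains a small byte-pair-encoding (BPE) tokenizer with the Hugging Face tokenizers library; the corpus file name, vocabulary size, and special tokens are illustrative assumptions, not a recommended configuration.

```python
# Sketch of turning raw text into tokens: train a small BPE tokenizer.
# "corpus.txt" is a placeholder for a cleaned text file; vocab_size and the
# special tokens are illustrative choices.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace and punctuation first

trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

encoding = tokenizer.encode("Public education data can train an indigenous LLM.")
print(encoding.tokens)  # the token strings the model would actually see
print(encoding.ids)     # the integer IDs used during training
```

The same tokenizer must then be reused at inference time, which is one more reason to settle the fundamental dataset and its languages before training begins.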
Quality matters:
While a large amount of data is necessary, the quality of the data significantly impacts the LLM's performance, so careful curation may be required to remove irrelevant or noisy information. Ensuring high-quality training data is crucial for a large language model's performance and reliability. Some key quality parameters to consider:
Accuracy: ensure the data is factually correct and reliable. Inaccurate data can lead to misinformation and errors in the model's output.
Relevancy: the data should be pertinent to the intended application of the LLM. For example, if the LLM is for medical applications, the data should include accurate medical information.
Cleanliness: the data should be free from noise such as grammatical errors, duplicate entries, and irrelevant content. Pre-processing steps like tokenization and normalization are important; a minimal cleaning sketch follows this list.
Ethical considerations: one of the fundamental issues with data is ethics. The dataset should be free from biased, harmful, or offensive content, and ethical guidelines should be followed to avoid perpetuating stereotypes or discrimination.
Consistency: the data should be consistent in formatting, style, and terminology, which helps the model learn patterns more effectively.
Size: last but not least, while size is not strictly a quality parameter, the quantity of data also matters. A larger dataset can help the model learn better, but quality should not be compromised for quantity.
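To make the curation points concrete, here is a minimal sketch of the cleanliness and consistency steps mentioned above (Unicode normalisation, whitespace cleanup, and exact-duplicate removal); the length threshold and sample documents are illustrative assumptions, not a production pipeline.

```python
# Minimal curation sketch: normalise text, drop near-empty records,
# and remove exact duplicates. Threshold and sample docs are illustrative.
import hashlib
import unicodedata

def normalise(text: str) -> str:
    """Apply Unicode NFC normalisation (important for Indic scripts) and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def clean_corpus(documents):
    seen_hashes = set()
    for doc in documents:
        text = normalise(doc)
        if len(text) < 200:            # drop near-empty or trivial records
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:      # exact-duplicate removal
            continue
        seen_hashes.add(digest)
        yield text

docs = ["  Example   record one. " * 20, "Example record one. " * 20]
print(len(list(clean_corpus(docs))))   # -> 1: the duplicate copy is dropped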
