Data Juice Story #3: National Library’s Search Index Goes on a Data Diet

The following is an excerpt from the book, "Data Juice: 101 Real-World Stories of How Organizations Are Squeezing Value From Available Data Assets."

The National Diet Library (NDL) in Tokyo serves as a repository of knowledge, language, and culture, collecting and preserving traditional books and paper materials as well as digital information from Japan and other countries. To make this vast body of information accessible to everyone, anywhere and anytime, the NDL set out to build a digital archive. Development of NDL Search began with Apache Hadoop, an open-source framework for distributed processing, which was used to speed up full-text search indexing and automatic bibliographic grouping. The system ran on more than 30 Hadoop nodes and processed about 5 TB of data, equivalent to tens of millions of items.
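To make the idea of full-text search indexing concrete, here is a minimal sketch of an inverted index built in a map-and-reduce style. It is an illustration only, not the NDL's implementation: it runs in a single Python process rather than on Hadoop, and the function names (`map_document`, `reduce_postings`, `build_index`, `search`) and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct word in a document."""
    terms = set(re.findall(r"\w+", text.lower()))
    return [(term, doc_id) for term in terms]


def reduce_postings(pairs):
    """Reduce step: group document ids by term into sorted posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def build_index(documents):
    """Run the map step over every document, then reduce into one index."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_document(doc_id, text))
    return reduce_postings(pairs)


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample records, made up for this sketch.
documents = {
    "doc1": "History of the Diet of Japan",
    "doc2": "Digital archive of Japan",
    "doc3": "Library history",
}
index = build_index(documents)
print(search(index, "history japan"))
```

In a distributed setting the map step would run in parallel across nodes and the reduce step would merge the postings for each term, which is the part of the workload that benefits from a cluster.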

With these tools, the NDL was able to capture and manage a huge volume of data, build a search index across all of its documents, and reduce the time spent manually searching for materials and information. The benefits of the digital archive are numerous: it lets library users and readers access information more efficiently, and it creates new knowledge by reusing existing information and building a usable knowledge infrastructure.

Creating a massive online volume of data is laudable, but the greater value lies in the ability to distill large sets of data down to their meaningful essence. Any organization with large sets of data should expand access using alternative data structures, such as graphs, to make the exploration of data more fluid. Large sets of data also lend themselves to machine learning (e.g., clustering, classification) to guide users to the most significant pieces of information more readily. Furthermore, organizations should capture usage metadata so that, over time, they can highlight the most popular paths that other users have taken through the data. Finally, governance processes and standards will be important to maintaining and increasing the value of the data and insights over time.
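As one way to picture how usage metadata and a graph structure can work together, the sketch below counts item-to-item transitions in browsing sessions and stores them as a weighted directed graph, so the most common next steps from any item can be surfaced. It assumes sessions are available as ordered lists of item identifiers; the session data and the function names (`count_transitions`, `build_graph`, `top_next`) are hypothetical and do not describe any system at the NDL.

```python
from collections import Counter, defaultdict


def count_transitions(sessions):
    """Count how often users moved directly from one item to the next."""
    counts = Counter()
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            counts[(src, dst)] += 1
    return counts


def build_graph(counts):
    """Store transition counts as a weighted directed graph (adjacency dict)."""
    graph = defaultdict(dict)
    for (src, dst), n in counts.items():
        graph[src][dst] = n
    return graph


def top_next(graph, item, k=3):
    """Return up to k most frequent next items, ties broken alphabetically."""
    neighbours = graph.get(item, {})
    return sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical usage logs: each list is one user's path through the archive.
sessions = [
    ["catalog", "book_a", "book_b"],
    ["catalog", "book_a", "book_c"],
    ["catalog", "book_b"],
    ["book_a", "book_b"],
]
graph = build_graph(count_transitions(sessions))
print(top_next(graph, "book_a"))
```

Plain counting is the simplest possible signal; a production system would likely also weigh recency and apply the governance controls mentioned above to the usage logs themselves.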

The NDL's digital archive serves as an excellent example of the power of data and analytics in managing large volumes of information. By leveraging open-source software and online tools, the NDL was able to create a search index that reduced manual searching and increased the efficiency of access to information. The benefits of this digital archive go beyond efficiency, as it enables the creation of new knowledge by reusing existing information and building a usable knowledge infrastructure.

For organizations with large sets of data, creating a massive online volume of data is just the first step. They must also adopt alternative data structures, apply machine learning, and capture usage metadata to extract meaningful insights from the data. Finally, governance processes and standards are crucial to maintaining and increasing the value of the data and insights over time. By following these steps, organizations can create a robust digital archive that not only streamlines access to information but also creates new knowledge and value.

Douglas Laney
