New AI Tool Searches Millions of Historical Newspaper Pages

30 September 2020

A new search tool uses machine learning to search millions of U.S. newspaper pages for historical pictures.

The U.S. Library of Congress recently launched the tool, called Newspaper Navigator. The online search system is available for free to the public.

The Library of Congress is the world's largest library. It offers materials from the creative record of the United States. The library serves as the main research service for the U.S. Congress.

Newspaper Navigator currently permits users to search more than 16 million pages from newspapers across the country, from 1900 to 1963.

The newspaper pages were digitized for another Library of Congress project, called Chronicling America. This tool also permits searches across the library's 16 million newspaper pages. The pages contain more than 1.5 million images.

A screenshot of the new Newspaper Navigator tool shows an image search for "baseball players."

The Chronicling America system permits users to find and look at full newspaper pages as digitized images. Users can also search the collection by keyword, using optical character recognition -- OCR. OCR is a tool that uses digital cameras to identify printed characters on a page for searches or to produce text.

Because keyword searches work only on the text of a page, people using the Chronicling America site had to look through newspaper pages themselves when trying to find specific images. The new Newspaper Navigator tool offers the ability to search the images in the collection directly.

This is where the machine-learning methods come in. The search system was trained to recognize different kinds of images. For example, it was designed to tell the difference between photos, maps, comics, advertisements, etc. It can also identify similar images and return these in search results.

Benjamin Lee created the system. He is a member of the Library of Congress' Innovator in Residence Program. The program was established to sponsor people from different fields to create new ways to present the library's huge historical collections to the public.

Lee trained a machine-learning model to identify the visual content and then ran the model over all 16 million pages in Chronicling America.

His training model was based on another Library of Congress experiment called Beyond Words. That project invited members of the public to help identify cartoons, drawings, pictures and advertisements in newspapers during World War I.

Lee said that after he learned of the Beyond Words experiment, he saw a great possibility to use that information to power his machine-learning tool. "I began to wonder whether this identified visual content was the key to throwing open the treasure chest of visual content, throughout all 16 million pages in Chronicling America."

Newspaper Navigator works like other search engines. Users enter a search term in the "keyword" box. They can also choose to limit search results by location, as well as by date.

But one of the most powerful tools in the system is the ability to search images by visual similarity. Users of the tool can save images to a personal "collection." They can then use those images as a basis for finding other visually similar images across the library's full collection.
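As a rough illustration of what searching by visual similarity can mean, the sketch below ranks images by how closely their feature vectors point in the same direction as a saved image (cosine similarity). This is an assumption made for illustration, not a description of how Newspaper Navigator is built. The three-number vectors, the image names and the function names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, library):
    """Sort image names from most to least similar to the query vector."""
    scores = {name: cosine_similarity(query, vec) for name, vec in library.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical feature vectors; a real system would compute these from the images.
library = {
    "baseball_team.jpg": [0.9, 0.1, 0.0],
    "city_map.jpg": [0.0, 0.2, 0.9],
    "pitcher_portrait.jpg": [0.8, 0.3, 0.1],
}
saved_image = [1.0, 0.2, 0.0]  # stands in for an image in the user's collection

ranking = rank_by_similarity(saved_image, library)
print(ranking)
```

In this toy example, the two baseball images rank above the map because their vectors lie closest to the saved image's vector.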

The system even permits users to "retrain" the machine learning tool for individual searches. This is done by examining the images that the search returns. By selecting whether images found were similar or not similar to the desired result, the user is "retraining" the system to improve its search performance.
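The article does not say how the tool learns from these choices. One classic way to use "similar" and "not similar" feedback is Rocchio relevance feedback, which moves the search vector toward the images a user accepts and away from the ones a user rejects. The sketch below shows that idea only. It adjusts a search vector rather than retraining a model, and the vectors, image names, weights and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors, component by component."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def refine_query(query, similar, not_similar, beta=0.75, gamma=0.5):
    """Rocchio-style update: move toward accepted images, away from rejected ones."""
    refined = list(query)
    if similar:
        refined = [r + beta * p for r, p in zip(refined, mean_vector(similar))]
    if not_similar:
        refined = [r - gamma * n for r, n in zip(refined, mean_vector(not_similar))]
    return refined

def rank(query, library):
    """Sort image names from most to least similar to the query vector."""
    return sorted(library, key=lambda name: cosine_similarity(query, library[name]),
                  reverse=True)

# Hypothetical feature vectors for three images.
library = {
    "baseball_team.jpg": [0.9, 0.1, 0.0],
    "baseball_ad.jpg": [0.7, 0.7, 0.0],
    "city_map.jpg": [0.0, 0.2, 0.9],
}
query = [0.6, 0.6, 0.0]
before = rank(query, library)

# The user marks the team photo as similar and the advertisement as not similar.
refined = refine_query(query,
                       similar=[library["baseball_team.jpg"]],
                       not_similar=[library["baseball_ad.jpg"]])
after = rank(refined, library)
print(before[0], "->", after[0])
```

Here the advertisement ranks first before the feedback, and the team photo ranks first after it.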

A demonstration of the Newspaper Navigator is available to help users learn more about the tool and how to carry out different searches. The creators hope the tool can be useful for historians, reporters, educators, professional researchers or anyone interested in learning about U.S. history through newspapers.

The Library of Congress notes that all images included in Newspaper Navigator and Chronicling America are in the public domain, meaning people are free to use them as they wish.

I'm Bryan Lynn.

Bryan Lynn wrote this story for VOA Learning English, based on reports from the Library of Congress. Ashley Thompson was the editor.



Words in This Story

page – n. one part of a website

digitize – v. to put information into the form of a series of numbers, usually so that it can be understood by a computer

character – n. a letter, number or other mark or sign used in writing or printing

comics – n. a series of pictures that tell a story

content – n. information contained in a piece of writing, a speech, a movie or on the internet

visual – adj. related to seeing

sponsor – v. to pay for someone to do something or for something to happen

location – n. place where something takes place