AI and image analysis at the BnF: experience feedback

2012-2022: From AlexNet to the BnF Datalab 

Jean-Philippe Moreux

The creation of the Gallica digital library in 1997, the mass digitization begun in 2005, and its status as a research library all led the Bibliothèque nationale de France to participate in French and European research projects in the digital field and to establish a network of scientific collaboration and cooperation. Initially focused in the 2010s on print collections (document analysis, OCR), these concerns then diversified, in particular towards the valorization of the library's iconographic resources. The deep learning turning point of 2012 led to a proliferation of projects from 2015 onwards, with the first concrete impacts on uses and services observed in 2016-2017.

This presentation reviews this period, between continuity and rupture, through the example of several emblematic projects dedicated to image analysis and CBIR (content-based image retrieval), notably GallicaPix and GallicaSnoop, driven by the needs of the Gallica digital library or by the uses arising from the so-called "collections as data" dynamic. In addition to technical and scientific issues, particular emphasis is placed on the operational and strategic issues at the heart of any research and development activity: choice and prioritization, financing, support for change, and the industrialization of research results. Taking them into consideration has led the BnF to carry out a double cross-cutting reflection that has resulted in a new service, the BnF Datalab (2021), already adopted by visual studies teams, and an Artificial Intelligence roadmap (2021).
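
For readers unfamiliar with CBIR, the sketch below illustrates the general principle behind such systems: images are mapped to feature vectors by a pretrained network, and retrieval ranks the collection by similarity to a query image. This is a minimal, hypothetical illustration, not the GallicaPix or GallicaSnoop implementation; the backbone model and file names are assumptions.

    # Minimal CBIR sketch (illustrative only, not the BnF's implementation).
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # Pretrained CNN as a feature extractor; the choice of backbone is an assumption.
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()  # keep the 2048-d pooled features
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def embed(path):
        # Map one image file to a feature vector.
        with torch.no_grad():
            return backbone(preprocess(Image.open(path).convert("RGB")).unsqueeze(0))

    # Rank indexed images by cosine similarity to the query (file names invented).
    query = embed("query.jpg")
    index = torch.cat([embed(p) for p in ["img_a.jpg", "img_b.jpg"]])
    ranking = torch.nn.functional.cosine_similarity(query, index).argsort(descending=True)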

AI4LAM Panel: Building a community of practice, with and beyond libraries

Emmanuelle Bermès, Neil Fitzgerald

View Slides: AI4LAM Panel

The AI4LAM community (Artificial Intelligence for Libraries, Archives and Museums [1]) emerged from the signing of an MOU in 2018 between two libraries, the Stanford University Library and the National Library of Norway. In 2019, three additional founding institutions joined to form the AI4LAM Secretariat, two of which are also libraries and currently co-chair the group: the BnF (the national library of France) and the British Library. This highlights the maturity of AI awareness in the library community, and advocates for a strong relationship between the future IFLA SIG and AI4LAM.

AI4LAM benefits from three years of work, including three international conferences, two online training seasons and the work of several working groups and chapters. We have started building a registry [2] of projects, activities, datasets and models, and we organise monthly community calls with presentations from the community on a variety of AI-related topics. As the IFLA SIG on AI starts its activity, we would be thrilled to have the opportunity to share our feedback on our work to date.
Many library use cases for AI are shared with other institutions: for instance, computer vision or handwritten text recognition (HTR). AI4LAM can bring insights on these shared use cases to the IFLA audience and report on developments that we are aware of. On the other hand, thanks to its broad scope, IFLA is a wonderful place to address issues that cross other sections' or units' interests, like ethics (data privacy, diversity, biases, etc.) or use cases that relate to specific types of collections or services, and to reach professionals beyond Europe and North America, who don't always have the opportunity to attend more technical events. Hence we are convinced that there would be strong mutual benefit in liaising between our two communities.

Artificial intelligence is already in libraries, let's master it

Dr. Mojca Rupar Korošec

View Slides: Artificial intelligence is already in libraries, let's master it

Artificial intelligence is already in libraries and its proper evaluation is particularly important today. We present the links between libraries and the ethical use of AI. We highlight the legal framework of the trends dictated by the European Union for the handling of data in libraries. We also highlight the strategies followed by the institutions dealing with this issue.

Artificial intelligence is a priority for the European Union, as it is predicted to play a key role in the digital transformation of the economy and society.

On 19 May 2021, the European Parliament adopted a report on the use of artificial intelligence in the fields of education, culture, and the audiovisual sector, calling for AI technologies to be designed in a way that avoids gender, social or cultural bias and protects diversity. (https://www.europarl.europa.eu/news/en/headlines/society/20201015STO89417/ai-rules-what-the-european-parliament-wants ; https://www.europarl.europa.eu/news/en/press-room/20210517IPR04135/meps-call-for-an-ethical-framework-to-ensure-ai-respects-eu-values )

We are mindful of intellectual freedom and the related right to information. The European Union Intellectual Property Office (EUIPO) is working on the importance of intellectual property. We need more independent non-profit organizations such as 'DataEthics.EU', whose purpose is to ensure 'individual control over data' based on a European legal and value framework.

As decisions of AI systems only come with a certain, measurable accuracy, the accuracy of human performance should be used as a benchmark to assess the quality of an AI system.

We present some worthwhile ethical frameworks designed so that AI algorithms prevent bias based on gender, social position, or culture and protect diversity. Specific indicators for measuring diversity need to be developed, along with inclusive, ethical datasets for which humans must always take responsibility. We follow an important document published by IFLA and draw inspiration from the UNESCO COMEST document.

Artificial Intelligence: what's our story?

Dr Andrew Cox

View Slides: Artificial Intelligence: what's our story?

Automatic indexing using AI methods – a project insight at the German National Library

Florian Engel

View Slides: Automatic indexing using AI methods

Due to the legal collection mandate of the German National Library, an average of almost 4,000 online publications enter the DNB's collection every day. In order to make these publications retrievable for users and to offer topical access points, the publications must be assigned subject headings. This cataloguing process is automated, owing on the one hand to the sheer number of texts and on the other to limited resources.

This automation results in a problem from the Extreme Multilabel Text Classification (XMTC) problem class. More precisely, subject headings from the vocabulary of the Gemeinsame Normdatei (GND) are to be assigned to each publication, resulting in a classification problem with up to 1.3 million possible target labels. This task, combined with a highly uneven distribution of the subject headings, results in a complex problem that is being addressed in a three-year project at DNB.
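
As a toy illustration of the multilabel setting (invented data, a handful of labels instead of 1.3 million, and not the DNB's actual system), a one-classifier-per-label baseline might look as follows; it also suggests why such baselines stop scaling at XMTC size.

    # Toy multilabel text classification sketch (invented data; not the DNB system).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    texts = [
        "Study on inflation and monetary policy",
        "Deep learning for text classification",
        "Monetary policy after the financial crisis",
    ]
    labels = [
        ["economics", "monetary policy"],
        ["machine learning"],
        ["economics", "finance"],
    ]

    mlb = MultiLabelBinarizer()
    y = mlb.fit_transform(labels)               # one binary column per subject label
    X = TfidfVectorizer().fit_transform(texts)  # bag-of-words document vectors

    # One binary classifier per label: workable here, infeasible for 1.3 million GND labels.
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
    predicted = mlb.inverse_transform(clf.predict(X))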

In this project, algorithms and methods from the field of artificial intelligence will be investigated, selected, combined and adapted in order to measurably improve the quality of machine-based subject cataloguing. In addition, the GND vocabulary is to be processed and suitably represented so that the potential of this knowledge graph can be better exploited. The resulting algorithms are planned to be made available in a flexible and reusable way as open source tools for libraries and institutions with similar tasks, in order to build up or expand AI competencies in cultural institutions.

In the following presentation, the planned procedure for achieving these goals will be shown in detail. For this purpose, the initial situation was first analysed, whereby key parameters were identified and initial work steps derived. The result is a framework of pipelines consisting of preprocessing, evaluation, and corpus management. These pipelines serve as a basic prerequisite for implementing and testing various approaches from the field of artificial intelligence and represent the status quo of the project at the time of the conference. Moreover, a short outlook on the further course of the project will be given, accompanied by an (incomplete) list of methods that require further investigation.

Building Chatbot for libraries

Iman Khamis

First, we will create a numeric representation of our text by using a method to vectorize our documents. To build the chatbot we will create an intent JSON file.

This intent JSON file will help us answer the questions customers have in mind. Intents are essential for the chatbot to work properly, as they give it the ability to analyze a customer's intent and produce a successful interaction.

The JSON file contains a variety of messages that our users might ask in a typical customer-service situation. The chatbot will then map these questions to a group of appropriate responses provided in the JSON file. The tag on each dictionary in the file indicates which group a customer's message belongs to.
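
A minimal intents file might look like the sketch below; the tags, patterns and responses are invented examples for a library setting.

    {
      "intents": [
        {
          "tag": "opening_hours",
          "patterns": ["When are you open?", "What are your opening hours?"],
          "responses": ["The library is open 9am to 8pm on weekdays."]
        },
        {
          "tag": "renew_loan",
          "patterns": ["How do I renew a book?", "Can I extend my loan?"],
          "responses": ["You can renew loans from your account page or at the desk."]
        }
      ]
    }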

We will use this to train a neural network that takes a sentence of words and classifies it as one of the tags in the JSON file. This is how our chatbot will be able to pick a response from these groups and display the right answer to the customer. The more tags, responses, and patterns we provide to the neural network, the better our chatbot's answers will be.

Our neural network must be built to deal with text; that is why we will use a module in PyTorch called Linear (nn.Linear). This layer applies a linear transformation to the incoming data.
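
A minimal sketch of such a network follows, assuming a bag-of-words input vector and one output score per tag (the layer sizes are invented for illustration):

    # Minimal intent classifier sketch built from nn.Linear layers (sizes invented).
    import torch
    import torch.nn as nn

    class IntentClassifier(nn.Module):
        def __init__(self, vocab_size, hidden_size, num_tags):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(vocab_size, hidden_size),  # linear transformation of the input vector
                nn.ReLU(),
                nn.Linear(hidden_size, num_tags),    # one score per intent tag
            )

        def forward(self, x):
            return self.net(x)

    model = IntentClassifier(vocab_size=100, hidden_size=16, num_tags=5)
    logits = model(torch.rand(1, 100))  # the highest-scoring tag selects the response group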

For an ethics of personalised recommendation at the French National Library

Lucie Termignon, Céline Leclaire

View Slides: For an ethics of personalised recommendation at the French National Library

What questions and ethical principles guide the French National Library as it embarks on the development of a personalised content recommendation system powered by artificial intelligence (AI)? How does this approach reflect the institution's overall AI policy?

In order to address the issue of AI ethics, which is the subject of numerous reflections and contributions, we propose to take the example of a service under development at the BnF.

The personalised content recommendation project aims to be implemented in Gallica, the digital library of the BnF and its partners. It is based on an active benchmark and on experiments that began in 2017, focused on the analysis of users' logs (cf. Beaudouin et al.). While the project has only just been launched, these preliminary studies, as well as the ambitious overall context of which it is part (in particular the publication of the BnF's AI roadmap in 2022), lead us to take ethical issues into account right from the design phase by involving legal experts, sociologists, data specialists, and collection specialists.

The integration of recommendation into Gallica appears increasingly necessary for enhancing the value of its rich and diverse collections while responding to users' needs (particularly those of researchers). Beyond the technical challenge, its implementation raises many ethical concerns (trust, transparency, and the avoidance of filter bubbles) and forces the Library to clarify the position it intends to occupy in the documentary landscape. The meaning of recommendation and the very definition of the librarian and their role are at stake.

The presentation will illustrate these reflections with inspiring examples and will describe different approaches to recommendation, particularly those that focus on transparency and on the user's ability to navigate consciously through the collections.

From text to data inside bibliographic records. Entity recognition and entity linking of contributors and their roles from statements of responsibility

Thomas Zaragoza,  Aline Le Provost, Yann Nicolas

Sudoc is the French higher education union catalogue, run by Abes. Like any large database (15 million records), Sudoc has quality issues that can negatively impact the user experience or database maintenance efforts, e.g. the move towards an LRM-compliant catalogue.

Quality issues are diverse: data can be inaccurate, ambiguous, miscategorized, redundant, inconsistent or missing. Sometimes data are not really missing but hidden, lost in text inside the bibliographic record itself. For instance, contributor names and roles are transcribed from the document into MARC descriptive fields (statements of responsibility). Most of them have a corresponding access point that contains the normalized name and a relator code (to express the role), and optionally the identifier of an authority record. But in Sudoc, many records have contributor mentions in descriptive fields that are not identified in access points. Moreover, many access points lack a relator code.

This paper will describe our efforts to extract structured information about contributors and their roles from statements of responsibility in order to automatically generate the following data in access points: last name, first name, relator code and, optionally, an identifier linking to www.idref.fr, the French higher education authority file. The first step is a named entity recognition task implemented through a machine learning (ML) approach. For the recognition of names, a pre-existing generic model (from the spaCy library) is employed and retrained with ad hoc data, annotated by librarians through a dedicated annotation tool (Prodigy). For roles, a model is generated from scratch. The second step is an entity linking task. The linking of contributor names is achieved with Qualinka, a logical rule-based artificial intelligence framework (LE PROVOST, 2017 IFLA conference). The linking of roles is still being debated, with a preference for either an entity linking model or a classification model over a rule-based approach.
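
As a rough sketch of the first step, recognising contributor names in a statement of responsibility with spaCy might start from a generic pretrained model, as below. The model name and example string are illustrative only; the actual pipeline retrains the model on librarian-annotated data and adds a custom label for roles.

    # Illustrative NER sketch (not the Abes pipeline): a generic French spaCy
    # model applied to a statement of responsibility, before any retraining.
    # Requires: python -m spacy download fr_core_news_md
    import spacy

    nlp = spacy.load("fr_core_news_md")
    statement = "texte établi par Jeanne Martin ; préface de Paul Durand"  # invented example
    for ent in nlp(statement).ents:
        if ent.label_ == "PER":  # contributor names; roles would need a custom label
            print(ent.text)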

For Abes, this pipeline is a first experience in adopting machine learning and building a generic approach with the librarian in the loop.

Get everybody on board and get going – the automatization of subject indexing at ZBW

Dr. Anna Kasprzik

View Slides: Get everybody on board and get going 

Subject indexing, i.e., the enrichment of metadata records with descriptors, is one of the core activities of libraries. Due to the proliferation of digital documents it is no longer possible to annotate every single document intellectually, which is why we need to explore the potential of automation at every level.

ZBW hosts and develops its own thesaurus for subject indexing in the domain of economics (Standard-Thesaurus Wirtschaft, STW). At ZBW, efforts to automate the subject indexing process started as early as 2000, and since 2014 the necessary applied research has been done in-house. However, the prototypical machine learning solutions that the researchers developed had yet to be integrated into productive operations at the library. Therefore, in 2020 a pilot phase was initiated (planned to last until 2024) with the task of transferring our solutions into practice by building a suitable software architecture that allows for real-time subject indexing with our trained models and their integration into the other metadata workflows at ZBW.
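
Such an architecture could, for instance, expose the trained models behind a small web service. The sketch below is purely hypothetical (endpoint name, schema and the model stub are assumptions, not ZBW's actual design):

    # Hypothetical real-time suggestion service (not ZBW's actual architecture).
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Document(BaseModel):
        title: str
        abstract: str = ""

    def predict_descriptors(text: str) -> list[str]:
        # Stub standing in for a trained model returning STW descriptor suggestions.
        return ["monetary policy", "inflation"]

    @app.post("/suggest")
    def suggest(doc: Document):
        # Suggestions can feed both the discovery portal and the indexing platform.
        return {"descriptors": predict_descriptors(doc.title + " " + doc.abstract)}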

The output of those models serves two purposes: the descriptors are fed directly into the database underlying the ZBW discovery portal EconBiz, and they are displayed as suggestions within the platform used for intellectual subject indexing at ZBW. Recently, together with the provider of that platform, we have also developed a solution that lets subject librarians give graded feedback on the suggestions, both for individual descriptors and for the set of descriptors pertaining to one document.

In this presentation, in addition to the milestones we have reached and the challenges we have faced (both on the operational and on the strategic level), we describe what communication and cooperation with our subject librarians looked like while building this software architecture, and how we expect this interaction to evolve in the future.

Libraries and AI - practical examples of AI in use and design considerations

Dr Edmund Balnaves

View Slides: Libraries and AI

Artificial Intelligence is the use of computer systems to achieve tasks that would normally have required human interpretive intervention.  This includes:

  • Interpreting images for place, facial or object recognition
  • Interpreting audio for language recognition and translation
  • Analysing data in depth and breadth to discern patterns
  • Controlling and managing movement of robotics in a real (physical) environment
  • Conversational dialogue management that recognises, interprets, and responds appropriately

This presentation covers areas of focus for AI in libraries, ways for libraries to prepare for this technology, practical examples of use, and design considerations. The presentation also explores AI systems at Prosentient Systems for library services, system delivery and web defence in depth.

Steps Towards Building Library AI Infrastructures: Research Data Repositories, Scholarly Research Ecosystems and AI Scaffolding

Dr Ray Uzwyshyn

Artificial Intelligence possibilities for deep learning, machine learning, neural nets and natural language processing present fascinating new AI library service areas. In the future, most of these areas will be integrated into traditional academic library 'information' and 'digital' literacy programs and university research environments to enable research faculty, students and library staff. Most university faculty, graduate students and library staff working outside of Computer Science disciplines will also require help to move their data and research towards new AI possibilities. This research overviews methodologies and infrastructures for building new AI services within the 'third interdisciplinary space' of the academic library.

A library is a very suitable space in which to enable these new 'algorithmic literacy' services. This work draws on the pragmatic steps taken by Texas State University Libraries to set up good foundations. Data-centered steps for setting up digital scholarly research ecosystems are reviewed. Needed groundwork for library AI services is put forward to enable research, data and media towards wider global online AI possibilities. Library AI external scholarly communications services are discussed, as well as educational methodologies involving incremental steps for foundational AI scaffolding.

Bootstrapping tools build on present systems and allow for the later enablement of future AI insights.  This work clarifies pathways from data collection to data cleaning, analytics and data visualization to AI applications. Focused steps are forwarded to move library staff, research faculty and graduate students towards these new AI possibilities.  Data-centred ecosystems, retooling and building on present library staff expertise are reviewed.

Data research repositories, algorithmic and programmatic literacy set good foundations for later AI possibilities.  Preliminary AI library working groups and R&D prototype methodologies for scaling up future library services and human resource infrastructures are considered. Recommended emergent pathways are prescribed to create library AI infrastructures to better prepare for a currently occurring global AI paradigm shift.  

The AI and Libraries Study Circle: how 100 library professionals increased their AI literacy

Karolina Andersdotter

View Slides: The AI and Libraries Study Circle

This paper presents the results of a seven-month-long digital study circle about artificial intelligence (AI) and libraries, in which approximately 100 Swedish librarians and library professionals participated. The study circle was centered around the freely available online course Elements of AI, and each meeting had additional readings on topics relating to AI, libraries, and the information society.

The impact of the study circle on the participants' AI literacy level is measured through self-efficacy questionnaires distributed at the beginning, middle, and end of the study circle. The questionnaire results show that participants gained a better understanding of what AI is and how it works, how it can be applied to various practices within the library (e.g. cataloguing, user services, collection management), and which ethical and political issues arise in relation to AI and libraries. They also gained more confidence in leading library projects with AI elements and in explaining AI-related matters (both in relation to library services and as citizens in an information society) to their colleagues and their users.

A conclusion from the project is that the informal learning environment of a study circle provides sufficient support for librarians who wish to learn more about AI, which can be a complex topic to grasp without support or discussion regarding technological details or ethical aspects.

An important general outcome presented in the paper is how the digital study circle worked as a non-traditional pedagogical format for skill building among librarians, since the format potentially could open up to new forms of building and sharing knowledge in a library community unrestricted by geography, library type, or an individual’s role within their library organisation.

The Ex Libris journey for Artificial Intelligence in Libraries

Itai Veltzman

View Slides: The Ex Libris journey for Artificial Intelligence in Libraries

The impact of Artificial Intelligence in Smart Libraries: An Overview of Global Trends

Sanghamitra Dalbehera

Artificial Intelligence in libraries can be seen as a collection of technologies that enable machines to sense, comprehend, act and learn; it can perform administrative functions and has provided cutting-edge technologies for libraries. Big data analysis and intelligent machine learning are reshaping the way libraries gather, access, and distribute information. From the digitization of information to the Internet of Things (IoT), modern intelligent technologies are changing library professionals' ability to process data, extract meaning from it, and make decisions based on what they find. Applications of artificial intelligence in library systems encompass descriptive cataloguing, subject indexing, reference services, technical services, shelf reading, collection development, information retrieval systems, discovery search, chatbots, text and data mining, etc. Artificial Intelligence has brought changes to smart library services in the Indian subcontinent, such as expert systems in library services, natural language processing in library services, robotics in library services, machine learning in library services, and intelligent interfaces to online databases.

This paper attempts to assess the potential impact of AI on the intelligent libraries of India. It summarizes the application status of artificial intelligence in libraries in three areas: intelligent resource systems, intelligent services, and intelligent knowledge services. The study puts forward existing problems and looks ahead to the application of artificial intelligence in the smart libraries of India. It offers insights on the competencies and skills required of librarians in the AI era and on the role of librarians and the implications for library work. Finally, design thinking is presented as an approach to solving emerging issues with AI and opening up opportunities for this technology at a more strategic level.

Toward Bias Conscious Artificial Intelligence for Student Success in Higher Education

Josette Riep, Dr. Annu Prabhakar

View Slides: Toward Bias Conscious Artificial Intelligence

Artificial Intelligence (AI) has continued to increase its footprint in the area of Human-Computer Interaction (HCI). Systems that span every aspect of daily life have become increasingly reliant on algorithms to identify products, promote opportunities, and guide strategy and operations. The field of higher education has seen a dramatic increase in the use of AI to drive recruitment and retention decisions. Persistence predictors and risk factors, for example, have garnered broad use across institutions, in some cases without a thorough assessment of their impact on underrepresented groups in areas such as STEM.

STEM remains one of the fastest-growing and most segregated professions in the United States. As many STEM fields in the US remain predominately white and male and companies continue to struggle to find enough skilled candidates, we face the reality that underrepresented groups are too often left behind. If we, for example, examine technology as a subset of STEM, we see that African Americans make up less than 5% of the IT workforce and a small percentage of IT graduates. Although there is a general acknowledgment and some investment by both industry and educational institutions, there has been minimal success in changing the demographic landscape.

Systemic challenges span a multitude of areas, including the increased use of AI. Tangible examples of bias exist within AI algorithms, which are too often developed by teams that lack inclusive representation or an inclusive approach to training and design. This presentation analyzes US data and focuses on opportunities to leverage AI to remove bias and inform the design of meaningful solutions that can facilitate innovative pathways towards STEM graduation attainment for an increasingly diverse student body.

Without heading? - fully automated linked subject systems creation

Martin Malmsten

Recent developments in machine learning, such as transformer-based topic modelling, give rise to new tools to understand and explore digital collections. This is of particular interest to libraries, which already have a long history of knowledge organisation through, for example, subject heading and classification systems. These systems, while undoubtedly useful, take a large effort to create and maintain and are prone to bias due to an ever-changing context. A system created from content, on the other hand, can be recreated at a moment's notice. Can a fully automated approach complement, enhance or even replace existing systems?

In this paper we explore the possibility of fully automating the creation of a subject heading system, albeit without an actual singular heading. To achieve this, we use recent machine learning methods based on language understanding in conjunction with topic modelling techniques to create vector-based clusters derived from actual content. This gives us the opportunity to define subjects without the constraint of having to use a single word or phrase to denote some pre-existing, and inherently biased, concept. We do this while retaining the functional requirements of a traditional subject headings system, e.g. the ability to expose the system as linked data and as a tool for end-users.
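
In outline, such a content-derived system might resemble the following sketch: an illustration only, with an assumed embedding model, toy records and an arbitrary cluster count, not the paper's implementation.

    # Illustrative sketch: transformer embeddings clustered into label-free "subjects".
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    records = [
        "A history of Hanseatic trade routes",
        "Machine learning for optical character recognition",
        "Medieval shipping and commerce in the Baltic",
        "Neural networks applied to handwritten text",
    ]

    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(records)
    topics = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
    # Each cluster is a vector-based "subject" derived from content,
    # with no single heading word or phrase attached to it.
    print(topics)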