Chapter 7 Multimodal Data (Images, Video, Audio) with Local LLMs

7.1 Overview

In this chapter, we discuss how social scientists can move beyond traditional data types (e.g., text and numbers) to capture and analyze multimodal data.

Multimodal data includes audio, video, and other non‑textual information that gives a fuller picture of human behavior.

Modern devices such as wearables, smartphones, and online platforms now let researchers collect large amounts of this mixed data.

To make sense of it, we use computational tools that combine the different types:

  • image analysis for video frames,

  • voice‑to‑text software for audio, and

  • machine‑learning models that link text, pictures, and sensor signals.

These tools let researchers ask new questions, for example, how body language and tone of voice jointly shape conversations, or how physiological signals align with reported feelings, and thereby uncover insights that single‑mode studies miss. By integrating multimodal data, social scientists can broaden the depth and reach of their research beyond what conventional single‑mode analysis offers.

7.2 Images

Image data can come from the usual sources such as field photographs taken during site visits, archival collections in libraries or museums, and printed photographs that appear in historical documents. Nowadays, however, images can be collected in many other ways. For example, social media platforms like Instagram, Facebook, and TikTok are rich with user‑generated photos; online photo repositories such as Flickr, Unsplash, and Wikimedia Commons host millions of freely accessible images; news outlets regularly publish photographs to accompany stories; satellite imagery from NASA or ESA provides large‑scale visual data; and everyday smartphone cameras capture images that can be shared in research settings. Note that although we cover some prominent sources, this is by no means an exhaustive list; see the additional resources at the end of the section to dive deeper.

7.3 Analyzing Images

With the advent of Large Language Models (LLMs), we can use their power to analyze images. In this section, we focus on one package that uses local LLMs (which keeps data on your own machine and thus protects privacy) to analyze image files: {kuzco}.

{kuzco} is a simple vision boilerplate built for Ollama in R, on top of {ollamar} and {ellmer}. It is designed as a computer vision assistant, giving local models guidance on classifying images and returning structured data. The goal is to standardize outputs for image classification and to offer LLMs as an alternative to keras or torch. {kuzco} currently supports classification, recognition, sentiment, text extraction, alt-text creation, and custom computer vision tasks.

7.3.1 Setting Up Kuzco

To use {kuzco}, you first need to install Ollama (software for pulling and running local LLMs) and the {ollamar} and {ellmer} packages.

You can install Ollama by downloading the application from its provider's website. The steps are:

  1. Download and install the Ollama app.
  2. Open/launch the Ollama app to start the local server.

After installing Ollama, you will need to install {ollamar} and {ellmer}:

install.packages("ollamar")
install.packages("ellmer")

Once these are installed, install {kuzco} from GitHub (this requires the {devtools} package):

devtools::install_github("frankiethull/kuzco")
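
After installation, you also need a vision-capable model available to Ollama. The following is a minimal sketch using {ollamar} helpers, assuming the Ollama app is running; it pulls qwen2.5vl:7b, the model used in the examples below, though any vision model served by Ollama could be substituted:

library(ollamar)

# Check that the local Ollama server is reachable
test_connection()

# Download the vision-capable model used later in this chapter
pull("qwen2.5vl:7b")

# Confirm the model is now available locally
list_models()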

7.3.2 Image Classification

One important capability the {kuzco} package provides is classifying a given image and returning the objects it contains as a structured data frame.
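
As a minimal sketch of how this works, assuming the setup above is complete and classroom.jpg is a hypothetical image file in your working directory, a single image can be classified as follows:

library(kuzco)

# Classify one image with a local vision model (file path is hypothetical)
single_result <- llm_image_classification(
  llm_model = "qwen2.5vl:7b",
  image = "classroom.jpg",
  backend = "ellmer"
)

# A one-row data frame with columns such as image_classification,
# primary_object, secondary_object, and image_description
str(single_result)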

Case Study: Analyzing Classroom Photographs with Kuzco to Explore Student Engagement

7.3.2.1 Purpose

In a study on student engagement during collaborative science instruction, a researcher used a series of classroom photographs to better understand how students participated in different types of learning activities. Rather than relying solely on manual observation and field notes, the researcher applied the {kuzco} R package to process and interpret visual data. Four key functions—llm_image_classification(), llm_image_recognition(), llm_image_sentiment(), and llm_image_custom()—were used to generate insights about classroom scenes.

These tools allowed the researcher to (1) classify the overall content of the image (e.g., lab work, discussion, presentation), (2) recognize and count key objects or people in the frame (e.g., students, materials, whiteboards), and (3) estimate the emotional tone of the scene based on posture and facial cues. This approach enabled a more systematic and scalable analysis of classroom engagement, providing structured outputs that could be interpreted alongside observational data and interview responses.

7.3.2.2 Research Questions

To investigate the nature of classroom discourse, this study addresses the following research questions:

  • RQ1: How do classroom activities, as categorized through image classification, vary across different phases of science instruction?
  • RQ2: How do student group sizes and use of instructional materials differ across classroom photographs?
  • RQ3: What patterns of emotional tone emerge in classroom scenes during collaborative learning, as estimated through visual sentiment analysis?

7.3.2.3 Methods

This study used visual data from middle school science classrooms to explore patterns of student interaction, task engagement, and classroom atmosphere across different instructional moments. The analysis was supported by large language model (LLM)-based image processing tools from the {kuzco} R package, allowing for efficient classification, recognition, and sentiment estimation without advanced machine learning expertise.

7.3.2.4 Data Source

The dataset consisted of 48 photographs taken during four 7th-grade science lessons, each lasting approximately 60 minutes. Photos were captured every 5–7 minutes by a stationary camera positioned at the back of the room to minimize disruption. All images were de-identified prior to analysis to protect student privacy. Each photo represented a naturally occurring moment of group-based learning and was accompanied by a brief instructional context log maintained by the classroom observer.

7.3.2.5 Data Analysis

Images were processed using the following {kuzco} functions:

  • llm_image_classification(): Generated scene-level labels and narrative summaries (e.g., “students engaged in group discussion around lab materials”).

  • llm_image_recognition(): Identified and counted key visual entities such as students, desks, instructional materials, and gestures.

  • llm_image_sentiment(): Estimated the emotional tone of each scene (e.g., positive, neutral, frustrated), with particular attention to student posture and interaction dynamics.
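
Each of these helpers can also be run on its own. As a quick, hedged sketch (the exact columns returned may vary across {kuzco} versions), the sentiment helper can be applied to a single hypothetical image and its output inspected:

# Estimate the emotional tone of one image (file path is hypothetical)
single_sentiment <- llm_image_sentiment(
  llm_model = "qwen2.5vl:7b",
  image = "classroom.jpg"
)

# Inspect the structure of the returned one-row data frame
str(single_sentiment)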

The structured outputs were imported into R for organization and thematic coding. Using both deductive categories (e.g., group size, task type) and inductive patterns (e.g., collaborative vs. passive positioning), the researcher examined how engagement varied across activities. Triangulation with field notes enhanced interpretive validity, and descriptive summaries were generated to visualize classroom dynamics over time.

For simplicity, we analyze only a small number of photos from a folder; the same batch process scales to larger sets of images.

With the code below, we create a function to batch analyze images:

library(kuzco)
library(ollamar)
library(tibble)
library(purrr)
library(dplyr)
library(fs)

# Set your image folder path
image_folder <- "data/s5_images"  # adjust to your local image folder

# List images (adjust pattern as needed)
image_files <- dir_ls(image_folder, regexp = "\\.(jpg|jpeg|png)$", recurse = FALSE)

# Function to classify, detect objects, and assess sentiment in one step
process_image <- function(img_path) {
  # Classification
  classification <- llm_image_classification(
    llm_model = "qwen2.5vl:7b",
    image = img_path,
    backend = "ellmer"
  )
  
  # Object detection (e.g., people)
  detection <- llm_image_recognition(
    llm_model = "qwen2.5vl:7b",
    image = img_path,
    recognize_object = "people",
    backend = "ellmer"
  )
  
  # Sentiment/emotion (general sentiment; the custom prompt below supplies
  # the classroom-specific sentiment columns used downstream)
  sentiment <- llm_image_sentiment(
    llm_model = "qwen2.5vl:7b",
    image = img_path
  )
  
  # Custom prompt for classroom sentiment and engagement
  customized <- llm_image_custom(
    llm_model = "qwen2.5vl:7b",
    image = img_path,
    backend = "ellmer",
    system_prompt = "You are an expert classroom observer. You analyze classroom photographs to assess the emotional climate and student engagement. Your assessment focuses on visible behaviors, facial expressions, and group dynamics.",
    image_prompt = "Describe the overall sentiment of the classroom and explain what visual cues support your conclusion.",
    example_df = data.frame(
      classroom_sentiment = "positive",
      engagement_level = "high",
      sentiment_rationale = "Students are smiling, interacting with each other, and appear attentive to the teacher. Desks are arranged for group work."
    )
  )
  
  # Return combined tibble
  tibble::tibble(
    file = img_path,
    image_classification = classification$image_classification,
    primary_object = classification$primary_object,
    secondary_object = classification$secondary_object,
    image_description = classification$image_description,
    image_colors = classification$image_colors,
    image_proba_names = paste(unlist(classification$image_proba_names), collapse = ", "),
    image_proba_values = paste(unlist(classification$image_proba_values), collapse = ", "),
    object_recognized = detection$object_recognized,
    object_count = detection$object_count,
    object_description = detection$object_description,
    object_location = detection$object_location,
    classroom_sentiment = customized$classroom_sentiment,
    engagement_level = customized$engagement_level,
    sentiment_rationale = customized$sentiment_rationale
  )
}

Now, we run the analyses:

# Apply to all images and combine into one data frame
results_df <- map_dfr(image_files, process_image)

# View result
print(results_df)


# Select and arrange columns in a logical order
results_clean <- results_df |>
  select(
    image_classification,
    image_description,
    primary_object,
    secondary_object,
    object_recognized,
    object_count,
    object_description,
    image_proba_names,
    image_proba_values,
    classroom_sentiment,
    engagement_level,
    sentiment_rationale
  )

# Save to CSV (optional)
write.csv(results_clean, "image_classification_detection_results.csv", row.names = FALSE)

# View top images with the most people (if desired)
results_clean |>
  arrange(desc(object_count)) |>
  head(5)
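
Finally, to produce the kind of descriptive summaries mentioned in the analysis plan, the structured columns can be tabulated directly. A minimal sketch, assuming results_clean contains the columns created above:

# Count scenes by classification label, sentiment, and engagement level
results_clean |>
  count(image_classification, classroom_sentiment, engagement_level)

# Average number of detected people per scene label
results_clean |>
  group_by(image_classification) |>
  summarise(mean_people = mean(as.numeric(object_count), na.rm = TRUE))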

7.3.2.6 Results and Discussion

The analysis of classroom photographs using the {kuzco} package yielded structured insights across three domains: instructional context (classification), observable features (recognition), and affective tone (sentiment). Below, we summarize preliminary findings from the sample images.

RQ1: Variation in Classroom Activities Across Instructional Moments

Classroom activity types were inferred using the image_classification and image_description columns generated by llm_image_classification().

Four images reflected teacher-led instruction (Images 1, 2, 3, and 7). Although image_classification labeled these scenes generically as classroom, the image_description column emphasized teacher-directed discourse, including phrases such as “a teacher is giving a lesson at the front of the room” (Image 1), “a classroom setting where a person is speaking to students” (Image 3), and “students seated in rows facing the front” (Image 7). These images showed whole-group instructional formats dominated by teacher explanation.

Three images reflected collaborative or interactive activity (Images 0, 6, and 8). The image_description column explicitly referenced peer interaction behaviors, including “a group of students … sitting together… reading a book” (Image 0), “students raising their hands” (Image 6), and “students actively participating and showing enthusiasm” (Image 8). These entries also aligned with primary_object values centered on “students” rather than instructional tools or teacher presence.

Two images depicted individual or independent work (Images 4 and 5). Evidence from image_description highlighted individual task engagement without peer or teacher interaction, such as “students are seated and reading from papers” (Image 4) and “students appear focused and are engaged in individual work” (Image 5).

These findings indicate variability in instructional format across images, with teacher-led instruction most frequent (44%), followed by collaborative interaction (33%) and independent work (22%).

RQ2: Group Size and Use of Instructional Materials

Group size and material use were analyzed using the object_count and object_description columns generated by llm_image_recognition(), along with the primary_object and secondary_object columns from llm_image_classification().

The object_count column suggested observable group sizes ranging from 6 to 18 participants per image, with a median of 11. Teacher-led instruction was associated with larger visible groups (e.g., Images 1 and 3 showed 14–18 detected persons), while collaborative scenes tended to show smaller learning clusters (e.g., Images 0 and 8 with 6–8 persons), consistent with small-group activity structures.

The object_description column indicated consistent use of text-based materials (e.g., “books,” “papers,” “notebooks”) across seven images (Images 0, 2, 4, 5, 6, 7, 8). Instructional displays such as “chalkboard,” “whiteboard,” or “projector screen” appeared in five images (Images 1, 2, 3, 6, 7), primarily during teacher-directed instruction. Only one image (Image 5) contained references to technology, where object_description included “computers” and image_description mentioned “students working at laptops.”

Lab or experimental materials were absent from all images, likely reflecting the general nature of Wikimedia classroom photos rather than subject-specific science labs.

RQ3: Emotional Tone and Engagement Across Classroom Scenes

Emotional tone and behavioral participation were interpreted using the classroom_sentiment, engagement_level, and sentiment_rationale columns generated by the customized prompt passed to llm_image_custom().

Sentiment was most often coded as neutral (classroom_sentiment = “neutral”; 4 images: 1, 2, 4, 5), followed by positive (3 images: 0, 3, 8) and moderately positive (2 images: 6, 7). However, student engagement varied independently of sentiment labels. The engagement_level column revealed a more nuanced pattern:

  • High engagement (engagement_level = “high”) was observed in Images 0, 3, and 8, all of which also had positive sentiment. The sentiment_rationale referenced overt behavioral participation such as “students… interacting” (Image 0) and “raising hands” (Image 3).
  • Moderate engagement (engagement_level = “moderate”) appeared in four images (Images 4, 5, 6, 7), even when sentiment was neutral or moderately positive. Rationales included “students appear focused on their work” (Image 5) and “students raising their hands… paying attention” (Image 6).
  • Low engagement (engagement_level = “low”) occurred in two images (Images 1 and 2), both of which were whole-class lecture scenes with passive student posture. Rationales noted “students appear disengaged; many do not make eye contact with the teacher.”

Together, these findings suggest that engagement was more sensitive to instructional structure than sentiment alone. Collaborative scenes showed the highest engagement, teacher-led instruction showed mixed engagement, and independent work produced moderate engagement with limited visible affect.

Discussion

Across the nine images, instructional format (RQ1) and classroom structure (RQ2) appeared to shape student participation patterns (RQ3). Collaborative activity was consistently associated with smaller group sizes and higher behavioral engagement. Teacher-led instruction involved larger groups and produced more passive engagement patterns. Independent work reflected focused but emotionally neutral learning states. Sentiment alone provided limited insight; however, the combination of engagement_level and observed activity type offered a more reliable indicator of classroom interaction quality.