Local LLMs

Overview:
The use of large language models (LLMs) in data analysis is rapidly increasing across education and social science research. However, concerns about data privacy, institutional data protection policies, and strict IRB (Institutional Review Board) procedures present significant challenges when using cloud-based or proprietary AI services. To address these challenges, this chapter introduces local LLM solutions—focusing on LM Studio—which allow researchers to run powerful models entirely on their own computers, ensuring data stays private and analysis remains flexible.

6.1 What are Local LLMs?

Local LLMs are large language models that run directly on your own computer, rather than in the cloud. By processing data locally, they help ensure privacy, data sovereignty, and compliance with institutional or governmental regulations. Local LLMs can be open-source (such as Llama, Qwen, DeepSeek, Mistral) and are compatible with various operating systems and hardware.

Key advantages of local LLMs:

  • Data never leaves your computer
  • No need for external API keys or internet access to analyze sensitive data
  • Flexibility to use custom or open-source models
  • Often no usage fees

6.2 What Can Local LLMs Do?

With the right setup, local LLMs can:

  • Summarize, paraphrase, and analyze text data (open-ended survey responses, interview transcripts, etc.)
  • Support qualitative and quantitative educational research workflows
  • Generate coding frameworks, extract themes, or automate report writing
  • Perform document-based question answering (“chat with your PDFs”)
  • Integrate with other research tools via REST APIs

6.3 Getting Started with LM Studio

LM Studio is a free, cross-platform application that enables researchers to run, manage, and interact with local LLMs (such as Llama, DeepSeek, Qwen, Mistral, and gpt-oss) entirely on their own computers. By using LM Studio, you gain powerful, offline data analysis capabilities without sacrificing data privacy or compliance.

Key Points:

  • Supported Platforms: macOS (Apple Silicon), Windows (x64/ARM64), and Linux (x64).
  • System Requirements: For best results, consult the System Requirements page for recommended RAM, CPU/GPU, and storage.

6.3.1 Installation Steps

  1. Download LM Studio for your operating system from the official Downloads page.
  2. Install and launch the application.
  3. Download your preferred LLM model (such as Llama 3, Qwen, Mistral, DeepSeek, or gpt-oss) directly from within LM Studio.
  4. (Optional) To use the API for scripting/automation, enable API access within LM Studio.
  5. (Optional) Attach documents for “Chat with Documents” (RAG-style analysis) entirely offline.

Official Documentation:
- LM Studio Docs - Getting Started Guide

6.3.2 Main Features

  • Run local models including Llama, Qwen, DeepSeek, Mistral, gpt-oss, and more.
  • Simple chat interface for prompt-based interaction.
  • Offline “Chat with Documents” for Retrieval Augmented Generation (RAG) use cases.
  • Search and download new models from Hugging Face and other model hubs within LM Studio.
  • Manage models, prompts, and configurations through a user-friendly GUI.
  • Serve local models on OpenAI-compatible REST API endpoints, usable by R, Python, or other apps.
  • MCP server/client support for advanced use cases.

6.3.3 API Integration

LM Studio exposes a REST API fully compatible with the OpenAI standard. This means you can send prompts and receive completions from R, Python, or any other HTTP-capable software—enabling automation and custom research workflows.

Example: Calling the LM Studio API from R

The short example below sends a completion request from R to a locally running LM Studio server (default address http://localhost:1234) and prints the response.

library(httr)
library(jsonlite)

prompt <- "Summarize the following open-ended survey responses: ..."

response <- POST(
  url  = "http://localhost:1234/v1/completions",
  add_headers("Content-Type" = "application/json"),
  body = toJSON(list(prompt = prompt, max_tokens = 200), auto_unbox = TRUE)
)

content(response)

6.3.4 Summary Table of LM Studio Capabilities

| Feature | Description |
|---|---|
| Local LLMs | Run Llama, DeepSeek, Qwen, Mistral, etc. fully offline on your own machine |
| Chat Interface | Flexible prompt-based interaction |
| Document Chat (RAG) | Offline “chat with your documents” |
| Model Management | Download, organize, and switch between models |
| API Access | OpenAI-compatible REST endpoints for use with R, Python, scripts, apps |
| MCP Integration | Connect with and use MCP servers |
| Community & Support | Discord, official docs, active development |

6.4 Case Study: Comparing Local LLM Analysis to Traditional NLP on University AI Policy Texts

6.4.1 Research Question

Can a local LLM running via LM Studio reliably identify key themes in university AI policy statements—using the same dataset analyzed in Section 2—so that we can compare its results against traditional NLP methods and human coding?

6.4.2 Data Context

We reuse the AI policy statements dataset from Section 2, now simplified for privacy. The table has one column only:

  • Stance (character): policy text (no institution names)

The structure matches the dataset introduced in Section 2. We will extract the same raw text field (Stance) so results are directly comparable to Section 2.

library(dplyr)
library(stringr)
library(readr)

# If 'university_policies' already exists (from Section 2), use it directly.
# Otherwise, safely fall back to reading the same CSV used in Section 2.
if (!exists("university_policies")) {
  university_policies <- read_csv("University_GenAI_Policy_Stance.csv", show_col_types = FALSE)
}

stopifnot("Stance" %in% names(university_policies))

policy_texts <- university_policies$Stance %>%
  as.character() %>%
  stringr::str_squish() %>%
  na.omit()

length(policy_texts)
[1] 99
head(policy_texts, 3)
[1] "If the text generated by ChatGPT is used as a starting point for original research or writing, then it can be a useful tool for generating ideas and suggestions. In this case, it is important to properly cite and attribute the source of the information. ... However, if the text generated by ChatGPT is simply copied and pasted into a paper or report without any modifications, it can be considered plagiarism since the text isn’t original."                                                                                                                                                                                                                                                                   
[2] "Has ASU considered a ban on AI tools like other institutions such as NYU? No. ASU faculty and administrators are focused on the positive potential of Generative AI while also thinking through concerns about ethics, academic integrity, and privacy. ... What is being done to ensure academic integrity? The Provost’s Office is currently reviewing ASU’s academic integrity policy through the lens of what kind of content can be produced through generative AI and what kind of learning behaviors and outcomes are expected of students. ... Will I get accused of cheating if I use AI tools? Before using AI tools in your coursework, confer with your instructor about their class policy for using AI tools."
[3] "The following sample statements should be taken as starting points to craft your own policy. As of January 23, 2023, the Provost’s Office at BC has not issued a policy regarding the use of AI in coursework. ... Syllabus Statement 1 (Discourage Use of AI) ... Syllabus Statement 2 (Treat AI-generated text as a source)"                                                                                                                                                                                                                                                                                                                                                                                              

6.4.3 Implementation with LM Studio (Thematic Analysis)

We send the same policy texts to LM Studio’s local API using two parameters defined once for the whole section: api_base (the server address) and model_name (the model identifier).
The model openai/gpt-oss-20b runs locally in LM Studio and exposes OpenAI-compatible endpoints. If you use a different model, update model_name accordingly.

library(httr)
library(jsonlite)
library(glue)
library(stringr)

# Global parameters used throughout the rest of this section
api_base   <- "http://127.0.0.1:1234/v1"
model_name <- "openai/gpt-oss-20b"

Testing the Local Connection

Before running large jobs, it’s good practice to confirm that LM Studio is responding correctly. A quick “ping test” helps prevent silent connection errors.

library(httr)
library(jsonlite)

api_base <- "http://127.0.0.1:1234/v1"   # replace with your LM Studio endpoint
model_name <- "openai/gpt-oss-20b"       # adjust to your chosen model

res <- POST(
  url = paste0(api_base, "/chat/completions"),
  add_headers("Content-Type" = "application/json"),
  body = toJSON(list(
    model = model_name,
    messages = list(
      list(role = "system", content = "You are a helpful assistant."),
      list(role = "user", content = "Please reply with 'pong'")
    )
  ), auto_unbox = TRUE)
)

cat(content(res)$choices[[1]]$message$content)

✅ If the model replies with “pong,” the local API is ready.
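As an additional check, you can list the model identifiers your LM Studio server currently exposes via the OpenAI-compatible /v1/models endpoint; this is a convenient way to copy the exact string to use for model_name. A minimal sketch, assuming the server is running at api_base as configured above:

# List model identifiers exposed by the local server (OpenAI-compatible /v1/models endpoint)
models_res <- GET(paste0(api_base, "/models"))
stop_for_status(models_res)

model_ids <- vapply(content(models_res)$data, function(m) m$id, character(1))
print(model_ids)   # copy one of these strings into model_name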

Prompt writing

Next, we write our prompt. Since we are interested in finding common patterns across the AI policy documents, the prompt asks the local LLM to identify those patterns. A useful trick is to ask it to return the results in a data-frame-ready format. (Normally, if you pasted the text into the LM Studio chat box, you would get a narrative answer.) Your prompt can specify exactly how you want the data to be captured and reported.

# ----- 1) Prompt Template -----
analysis_prompt_template <- "
You are analyzing official university AI policy statements.
Your task is to identify 3–5 key themes across the statements and report them in the exact format below.

**INPUT DATA:**
- **Number of Statements:** {n_items}
- **Policy Statements:**
{items}

**YOUR TASK:**
1) Identify 3–5 key themes across the policy statements.
2) For each theme:
   a) Provide a concise theme name.
   b) Provide a 1–2 sentence description.
   c) Provide one short verbatim example quote.
   d) Provide an integer Frequency (count of statements mentioning it).
   e) Provide Relative Frequency as a whole-number percentage.
3) Write a 3–5 sentence **Summary of Responses** synthesizing the most important insights.
4) Output strictly in the following format:

**Summary of Responses**
[3–5 sentence narrative summary goes here.]

**Thematic Table**
| Theme | Description | Illustrative Example(s) | Frequency | Relative Frequency |
|---|---|---|---|---|
| [Theme 1] | [Description] | - \"[Quote]\" | [n] | [p]% |
| [Theme 2] | [Description] | - \"[Quote]\" | [n] | [p]% |
"

Chunks!

Next, we define chunk sizes for the local LLM to analyze our data. In qualitative text analysis using LLMs (such as thematic synthesis or coding), chunk size refers to the amount of text you pass to the model at one time. It directly affects coherence, depth, and efficiency of analysis.

Chunk size balances context preservation and analytic precision in qualitative LLM-based text analysis. If chunks are too small, the model loses semantic coherence, producing fragmented or repetitive themes. If too large, it may miss local nuances or exceed the model’s reasoning capacity. The aim is to maintain enough continuity for meaningful interpretation while staying within manageable input limits.

Practically, chunk size should follow natural meaning units, such as paragraphs, speaker turns, or short sections, rather than fixed word counts. Researchers typically find that 500–1000 words work well for transcripts, while longer documents like policies can be chunked at 1000–1500 words. The guiding principle is to choose the smallest segment that preserves interpretive coherence.

# ----- 2) Chunk the corpus to stay within model context window -----
CHUNK_SIZE <- 15
chunks <- split(policy_texts, ceiling(seq_along(policy_texts) / CHUNK_SIZE))
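The split above uses a fixed number of documents per chunk (15 statements). If your texts vary widely in length, you can instead cap each chunk by an approximate word budget, in line with the 500–1500-word guidance above. A minimal sketch; the 1200-word budget is an illustrative choice, not a rule:

# Greedy grouping: start a new chunk whenever adding a text would exceed the word budget
MAX_WORDS   <- 1200                                   # illustrative budget; tune to your model's context window
word_counts <- stringr::str_count(policy_texts, "\\S+")

chunk_id      <- integer(length(policy_texts))
current_chunk <- 1L
running_words <- 0L
for (i in seq_along(policy_texts)) {
  if (running_words + word_counts[i] > MAX_WORDS && running_words > 0L) {
    current_chunk <- current_chunk + 1L
    running_words <- 0L
  }
  chunk_id[i]   <- current_chunk
  running_words <- running_words + word_counts[i]
}

chunks_by_words <- split(policy_texts, chunk_id)
lengths(chunks_by_words)   # how many statements landed in each chunk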

Connecting to LM Studio

Once our data is prepared, the next step is to pass it to LM Studio. Using the helper function below, we send our text data to the LM Studio server.

What is key here is that we specify the model name, a “system” role defining the model’s expertise (in this case, a qualitative research analyst), and the “user” role containing the analysis prompt. The parameter temperature = 0.2 constrains randomness to produce consistent, analytic responses, while max_tokens limits the response length.

  • Temperature controls randomness: a low value (0.2) produces consistent, analytical responses suited to qualitative coding, while higher values encourage creativity but reduce reliability.

  • Max tokens limits response length. Setting it to 1000 ensures sufficient detail without verbosity or truncation. Together, these parameters balance precision and completeness in model-generated analyses.

In essence, this helper encapsulates the logic of prompt dispatch and result retrieval, ensuring each call to the LLM is standardized and repeatable. This is crucial for qualitative workflows where traceability and parameter control are essential.

# ----- 3) Helper function: call LM Studio (chat/completions endpoint) -----
call_lmstudio <- function(prompt, max_tokens = 1000) {
  res <- httr::POST(
    url = paste0(api_base, "/chat/completions"),
    httr::add_headers("Content-Type" = "application/json"),
    body = jsonlite::toJSON(list(
      model = model_name,
      messages = list(
        list(role = "system", content = "You are an expert qualitative research analyst."),
        list(role = "user", content = prompt)
      ),
      temperature = 0.2,
      max_tokens = max_tokens
    ), auto_unbox = TRUE)
  )
  httr::stop_for_status(res)
  content(res)$choices[[1]]$message$content
}

Running the analysis

Now, the script applies the analysis_prompt_template to each chunk of policy text using lapply(). Each chunk is converted into a numbered text block (items_block) and analyzed independently through call_lmstudio(), producing localized thematic results (chunk_outputs).

Second, the meta_prompt integrates these separate analyses. It instructs the model to synthesize and deduplicate themes across all chunks into a unified framework, including a concise narrative summary and a structured thematic table with descriptions, examples, and frequency data. Together, these steps move from micro-level coding to macro-level interpretation. This step is optional, and can be skipped depending on the nature of data and research questions.

# ----- 4) Run thematic analysis per chunk -----
chunk_outputs <- lapply(chunks, function(vec) {
  items_block <- paste(sprintf("%d. %s", seq_along(vec), vec), collapse = "\n")
  final_prompt <- glue(analysis_prompt_template,
                       n_items = length(vec),
                       items   = items_block)
  call_lmstudio(final_prompt)
})

# ----- 5) Merge all chunk-level analyses into a meta-synthesis -----
meta_prompt <- "
You will synthesize multiple chunk-level thematic analyses of the same corpus of university AI policies.
Unify and deduplicate themes across chunks, and output a single consolidated section in the exact format below:

**Summary of Responses**
[3–5 sentence narrative summary.]

**Thematic Table**
| Theme | Description | Illustrative Example(s) | Frequency | Relative Frequency |
|---|---|---|---|---|
| [Unified Theme 1] | [Description] | - \"[Quote]\" | [n] | [p]% |
| [Unified Theme 2] | [Description] | - \"[Quote]\" | [n] | [p]% |
"

Synthesizing and Final LLM Analysis

We are now back in R, synthesizing our data (and managing token limits efficiently).

The chunk_outputs are split into smaller pairs, each containing two analyses. Each pair is merged and passed through call_lmstudio() using the same meta_prompt, producing intermediate syntheses (pair_outputs). These summaries are then combined into a single consolidated input (final_meta_input) for a final call to call_lmstudio(), yielding the comprehensive meta-analysis (meta_output).

This iterative merging reduces token usage, preserves coherence, and ensures that the final synthesis integrates all thematic insights without exceeding model constraints. With saveRDS(meta_output, "data/meta_output_saved.rds") we save our analysis so that in the future, we can just start from there to pick things back up.

# Pairwise synthesis to reduce token usage
pairs <- split(chunk_outputs, ceiling(seq_along(chunk_outputs) / 2))

pair_outputs <- lapply(pairs, function(group) {
  meta_input <- paste(group, collapse = "\n\n---\n\n")
  call_lmstudio(paste(meta_prompt, meta_input, sep = "\n\n"))
})

# Now you have fewer intermediate syntheses
final_meta_input <- paste(pair_outputs, collapse = "\n\n---\n\n")
meta_output <- call_lmstudio(paste(meta_prompt, final_meta_input, sep = "\n\n"))
cat(meta_output)

saveRDS(meta_output, "data/meta_output_saved.rds")

Thematic Table Extraction and Cleaning

This code takes the saved meta-analysis from LM Studio and turns it into a clean, usable table in R. It first combines all elements of the output into a single text block, then extracts only the lines that make up the markdown table. Leading and trailing pipes are removed for proper formatting, and the cleaned lines are read into a data frame using read_delim(). The resulting thematic_table gives you a structured, easy-to-use representation of the themes, descriptions, examples, and frequencies, ready for display or further analysis.

library(stringr)
library(readr)

# --- Read RDS ---
meta_output <- readRDS("data/meta_output_saved.rds")

# --- Combine all elements into one long text block ---
meta_output_text <- paste(meta_output, collapse = "\n")

# --- Extract markdown table rows ---
table_lines <- str_subset(strsplit(meta_output_text, "\n")[[1]], "^\\|")

# --- Clean leading/trailing pipes ---
table_text <- gsub("^\\||\\|$", "", table_lines)

# --- Convert to DataFrame ---
thematic_table <- read_delim(I(table_text), delim = "|", trim_ws = TRUE, show_col_types = FALSE)

# --- Display result ---
print(thematic_table)
# A tibble: 7 × 5
  Theme        Description Illustrative Example…¹ Frequency `Relative Frequency`
  <chr>        <chr>       <chr>                  <chr>     <chr>               
1 ---          ---         ---                    ---       ---                 
2 Academic In… Policies t… - “If a student uses … 13        25%                 
3 Faculty Aut… Instructor… - “Different faculty … 12        23%                 
4 Citation / … Students m… - “Under BU's guideli… 9         17%                 
5 Conditional… Policies a… - “Instead of forbidd… 11        21%                 
6 Pedagogical… Emphasis o… - “Propose alternativ… 4         8%                  
7 Policy Evol… Recognitio… - “Universities will … 3         6%                  
# ℹ abbreviated name: ¹​`Illustrative Example(s)`

6.4.3.1 Saving and Exporting Results

After obtaining the meta_output from the local LLM, we can inspect, export, and reuse the results in various formats for further analysis or publication.

# --- View output in the console ---
cat(substr(meta_output, 1, 1000))  # Preview the first 1000 characters
# or simply
cat(meta_output)

# --- Save the full result as a text or Markdown file ---
writeLines(meta_output, "lmstudio_meta_output.txt")
writeLines(meta_output, "lmstudio_meta_output.md")


# --- Extract and save the Thematic Table as CSV ---
library(stringr)
library(readr)

# Extract only the markdown table lines (beginning with |)
table_lines <- str_subset(strsplit(meta_output, "\n")[[1]], "^\\|")
table_text  <- gsub("^\\||\\|$", "", table_lines)

# Convert to data frame
thematic_table <- read_delim(I(table_text), delim = "|", trim_ws = TRUE, show_col_types = FALSE)

# Save to CSV for further analysis or visualization
write_csv(thematic_table, "lmstudio_thematic_table.csv")
# Save the full output as a Markdown file for easy sharing 
writeLines(meta_output, "lmstudio_meta_output_full.md")

# Optional: check where the file was saved
getwd()

6.4.3.2 Practical Notes on Running Local Models 🍕💻

Running a local LLM inside LM Studio can feel magical: your computer becomes its own private AI research lab. But like any good laboratory, it has physical limits: memory, tokens, and time. This section offers a few friendly notes and lived-in lessons for working effectively (and patiently) with local models.

Tokens Are Like Bites of Pizza

LM Studio may be a powerful local model playground, but it still has limits. Think of tokens as bites of pizza: your model can chew through a few generous slices, but handing it the entire pizza (for example, your full corpus of 99 policy statements) in one go will only lead to indigestion (also known as the dreaded “HTTP 400 Bad Request”).

Every model has a context window (often 8k–32k tokens). Both your prompt and the expected response must fit inside this box. When in doubt:

  • Feed your model smaller slices. Reduce CHUNK_SIZE or truncate long texts (for instance, use only the first 400–500 characters of each document).

  • Adjust your max_tokens parameter. Fewer output tokens make for shorter, faster, and safer runs.

  • Monitor your total prompt length. Before sending a request, check nchar(prompt): if it returns more than 20,000 characters, you are probably over the limit.
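As a concrete guard, you can wrap the API call so that over-long prompts are flagged (or the input texts trimmed) before anything reaches the server. This is a minimal sketch layered on top of call_lmstudio() from above; the 20,000-character threshold and the 500-character truncation are illustrative values, not fixed rules:

# Guarded wrapper: warn about over-long prompts before calling the local model
MAX_PROMPT_CHARS <- 20000   # illustrative threshold

safe_call_lmstudio <- function(prompt, ...) {
  if (nchar(prompt) > MAX_PROMPT_CHARS) {
    warning("Prompt is ", nchar(prompt), " characters; consider smaller chunks.")
  }
  call_lmstudio(prompt, ...)
}

# Optional: shrink each document before building prompts
truncated_texts <- substr(policy_texts, 1, 500)   # keep only the first 500 characters of each statement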

Computing Resources and Patience

  • Expect variable response times. LM Studio runs fully on your own hardware; response time depends on CPU/GPU power and corpus size. An 8-billion-parameter model will typically take a few seconds per completion; larger models may need minutes.

  • Mind your system memory. Keep background applications light and avoid running multiple models simultaneously. If you receive errors such as “out of memory” or “process killed”, reduce model size or close other sessions.

Pro tip from the authors: During long qualitative runs, go play a game of basketball, take a walk, or grab a coffee. The LLM will still be digesting its token pizza when you return.

File Paths, Caching, and Stability

  • Use consistent file paths. Save outputs (meta_output.md, thematic_table.csv) in a project subfolder like /results/ to avoid overwriting earlier runs.

  • Enable model caching in LM Studio. Cached models load faster after the first use and reduce memory spikes.

  • Restart occasionally. Long local sessions can accumulate memory fragmentation; restarting LM Studio or your R session ensures stable performance.

Takeaways

Feed your model thoughtfully—one well-prepared prompt at a time—and you’ll get cleaner, faster, and tastier results. Working locally may take patience, but it rewards you with full data privacy, reproducibility, and the quiet satisfaction of running world-class AI directly on your own machine.

6.4.4 Sample Output

Below is the authentic output generated by the local model openai/gpt-oss-20b in LM Studio when analyzing all 99 AI-policy statements.
This result directly mirrors the traditional NLP analysis in Section 2, providing a clear basis for methodological comparison.

Summary of Responses

Across the surveyed universities, a shared priority is safeguarding academic integrity while allowing instructors to tailor AI-use rules at the course level. Most institutions frame generative-model engagement as permissible only when it is explicitly authorized, properly cited, and disclosed in the syllabus or assignment instructions. Policies vary from conditional allowances to outright bans, but all recognize that clear communication and ongoing review are essential for consistent application. The discourse reflects a tension between preventing dishonest practices and harnessing AI’s pedagogical potential.

Thematic Table

| Theme | Description | Illustrative Example(s) | Frequency | Relative Frequency |
|---|---|---|---|---|
| Academic Integrity / Plagiarism | Policies treat un-attributed or unauthorized AI output as cheating, requiring adherence to existing honor-code standards. | - “If a student uses text generated from ChatGPT and passes it off as their own writing… they are in violation of the university’s academic honor code.” (Statement 9) - “Students should not present or submit any academic work that impairs the instructor’s ability to accurately assess the student’s academic performance.” (Statement 2) | 13 | 25% |
| Faculty Autonomy & Syllabus Clarity | Instructors are empowered to set, communicate, and enforce AI-use rules within their courses, often via the syllabus or early course materials. | - “Different faculty will have different expectations about whether and how students can use AI tools, so being transparent about your expectations is essential.” (Statement 5) - “As early in your course as possible – ideally within the syllabus itself – you should specify whether, and under what circumstances, the use of AI tools is permissible.” (Statement 7) | 12 | 23% |
| Citation / Disclosure Requirements | Students must explicitly credit AI-generated content or document their interactions to avoid plagiarism. | - “Under BU’s guidelines… students must give credit to them whenever they’re used… include an appendix detailing the entire exchange with an LLM.” (Statement 4) - “You must cite your use of these tools appropriately. Not doing so violates the HBS Honor Code.” (Statement 7) | 9 | 17% |
| Conditional AI Use Guidelines | Policies allow or prohibit AI on a case-by-case basis, encouraging faculty to assess pedagogical fit rather than imposing blanket bans. | - “Instead of forbidding its use, however, we might investigate which questions AI poses for us as teachers and for our students as learners.” (Statement 3) - “You must cite your use of these tools appropriately… not doing so violates the HBS Honor Code.” (Statement 7) | 11 | 21% |
| Pedagogical Integration & Assessment Design | Emphasis on designing assignments that preserve skill development while leveraging AI benefits, and on re-thinking assessment strategies. | - “Propose alternative assignments or assessments if there is the chance that students might use the tool to misrepresent the output from ChatGPT as their own.” (Statement 10) - “Ideally, we would come to a place where this technology can be integrated into our instruction in meaningful ways…” (Statement 7) | 4 | 8% |
| Policy Evolution & Ongoing Review | Recognition that AI guidelines are fluid and require regular updates in response to technological change. | - “Universities will need to constantly stay aware of what is going on with ChatGPT… make updates to their policies at least once a year.” (Statement 13) | 3 | 6% |

6.4.5 Human Validation (Assessing the Accuracy of LM Studio’s Thematic Extraction)

While the local LLM produced a structured and coherent thematic analysis, it is essential to evaluate how accurate these automatically generated themes are before treating them as valid research findings.
Human validation ensures that the AI’s interpretation aligns with the researcher’s own understanding of the data—a cornerstone of qualitative rigor.

6.4.5.1 Manual Validation Procedure

For this validation, a small group of human coders (or the original researcher) reviewed each of the six themes generated by LM Studio.
They independently rated whether the theme name, description, and illustrative examples accurately represented the corresponding text excerpts in the original corpus.

Each theme was labeled as:

  • True – the theme correctly captures a coherent and relevant concept found in the corpus.
  • False – the theme is misleading, redundant, or unsupported by the text.

Example Validation Table

| LLM-Generated Theme | Human Judgment | Comment Summary |
|---|---|---|
| Academic Integrity / Plagiarism | ✅ True | Strongly supported by multiple statements referencing honor codes and plagiarism. |
| Faculty Autonomy & Syllabus Clarity | ✅ True | Matches explicit institutional language about syllabus-level discretion. |
| Citation / Disclosure Requirements | ✅ True | Directly evidenced by quotes requiring citation or appendices. |
| Conditional AI Use Guidelines | ✅ True | Consistent with texts describing conditional permissions. |
| Pedagogical Integration & Assessment Design | ✅ True | Accurately summarizes emerging pedagogical considerations. |
| Policy Evolution & Ongoing Review | ✅ True | Well-grounded in statements about policy updates and future revisions. |

Validation Accuracy: 6 / 6 = 100 % (illustrative)

In practice, partial matches and ambiguous cases can occur.
Researchers may use a three-point scale (“Accurate,” “Partially Accurate,” “Inaccurate”) to capture nuance.

R Code for Recording and Calculating Accuracy

Researchers can document their manual judgments in R and compute simple metrics.

library(dplyr)

# Example: human evaluation of LM Studio themes

validation_data <- tibble::tibble(
  Theme = c("Academic Integrity / Plagiarism", "Faculty Autonomy & Syllabus Clarity",
            "Citation / Disclosure Requirements", "Conditional AI Use Guidelines",
            "Pedagogical Integration & Assessment Design", "Policy Evolution & Ongoing Review"),
  Human_Judgment = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  Comment = c("Clearly defined theme", "Matches source texts precisely", "Accurate and well-evidenced",
              "Appropriate scope", "Valid pedagogical dimension", "Accurately reflects iterative nature of policies")
)

# Calculate proportion of themes rated TRUE

validation_accuracy <- mean(validation_data$Human_Judgment)

sprintf("Validation Accuracy: %.1f%%", 100 * validation_accuracy)
[1] "Validation Accuracy: 100.0%"
print(validation_data)
# A tibble: 6 × 3
  Theme                                       Human_Judgment Comment            
  <chr>                                       <lgl>          <chr>              
1 Academic Integrity / Plagiarism             TRUE           Clearly defined th…
2 Faculty Autonomy & Syllabus Clarity         TRUE           Matches source tex…
3 Citation / Disclosure Requirements          TRUE           Accurate and well-…
4 Conditional AI Use Guidelines               TRUE           Appropriate scope  
5 Pedagogical Integration & Assessment Design TRUE           Valid pedagogical …
6 Policy Evolution & Ongoing Review           TRUE           Accurately reflect…
print(validation_accuracy) 
[1] 1
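
If you prefer the three-point scale mentioned above (“Accurate,” “Partially Accurate,” “Inaccurate”), the same idea extends naturally. The sketch below uses hypothetical ratings and an illustrative weight of 0.5 for partial matches:

# Hypothetical three-point ratings for the same six themes (illustrative, not observed data)
rating <- factor(
  c("Accurate", "Accurate", "Partially Accurate", "Accurate", "Accurate", "Inaccurate"),
  levels = c("Inaccurate", "Partially Accurate", "Accurate")
)

# Weighted accuracy: full credit for Accurate, half credit for Partially Accurate
weights <- c("Inaccurate" = 0, "Partially Accurate" = 0.5, "Accurate" = 1)
weighted_accuracy <- mean(weights[as.character(rating)])
sprintf("Weighted Validation Accuracy: %.1f%%", 100 * weighted_accuracy)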

6.4.5.2 Quantitative Cross-Validation (Comparing Theme Frequencies)

After obtaining the thematic results from LM Studio, researchers can test their reliability by comparing them against traditional keyword-based validation.
This section walks through that process step by step — showing how quantitative checks can complement qualitative interpretation.

Step 1: Concept and Rationale

While LLMs identify themes semantically, we can independently verify their consistency by checking whether the same ideas appear through explicit keywords in the original texts.
This serves as a quantitative cross-check between two perspectives:

  1. LM Studio output — interprets meaning through context.
  2. Keyword-based validation — detects literal word usage.

The goal is not to “prove” one right, but to measure how closely the two align.

Step 2: Load and Prepare the Data

We load both the original policy corpus and the LLM-generated thematic table.

# ========================================
# Step 2 — Load data
# ========================================

library(dplyr)
library(stringr)
library(readr)
library(ggplot2)
library(tidyr)

policies <- university_policies %>%
  mutate(Stance = as.character(Stance))

llm_table <- read_csv("lmstudio_thematic_table.csv", show_col_types = FALSE)

Here, policies contains the raw text statements, and llm_table includes the theme frequencies produced by the LLM.

Step 3: Define Keyword Anchors

Next, we define a manual codebook of lexical cues for each theme.

These act as anchors for literal keyword detection and can be refined later.

# ========================================
# Step 3 — Define theme keywords
# ========================================

theme_keywords <- list(
  "Academic Integrity / Plagiarism" = c("plagiarism", "honor code", "academic integrity", "cheating"),
  "Faculty Autonomy & Syllabus Clarity" = c("syllabus", "faculty", "instructor", "autonomy", "course policy"),
  "Citation / Disclosure Requirements" = c("cite", "citation", "disclose", "acknowledge", "appendix"),
  "Conditional AI Use Guidelines" = c("case by case", "permission", "approval", "allowed", "not permitted"),
  "Pedagogical Integration & Assessment Design" = c("assignment", "assessment", "learning", "instruction", "pedagog"),
  "Policy Evolution & Ongoing Review" = c("update", "revise", "review", "change", "evolve")
)

Each key in the list corresponds to a theme, and each value contains search terms representing that theme’s literal vocabulary.


Step 4: Count Keyword Occurrences

We now create a helper function that flags whether a policy statement mentions any of the keywords for a given theme; the flags are then aggregated into counts in Step 5.

# ========================================
# Step 4 — Count keyword matches
# ========================================

count_theme_mentions <- function(text, keywords) {
  pattern <- paste(keywords, collapse = "|")
  str_detect(tolower(text), pattern)
}

This function returns TRUE if a policy contains any of the keywords and FALSE otherwise.

We’ll use it to compute frequency counts across all statements.


Step 5: Compute Validation Metrics

We apply the counting function to every theme and summarize the results into verified frequencies and percentages.

# ========================================
# Step 5 — Apply validation across the corpus
# ========================================

validation_results <- lapply(names(theme_keywords), function(theme) {
  keywords <- theme_keywords[[theme]]
  matches <- sapply(policies$Stance, count_theme_mentions, keywords = keywords)
  tibble(
    Theme = theme,
    Verified_Frequency = sum(matches),
    Verified_Relative = round(100 * mean(matches), 1)
  )
}) %>% bind_rows()

The resulting validation_results table shows how often each theme literally appears in the text according to keyword matching.


Step 6: Merge with LLM Results

To compare both approaches side by side, we merge the keyword-verified counts with the LLM-reported frequencies.

# ========================================
# Step 6 — Merge and clean data
# ========================================

validation_compare <- llm_table %>%
  select(
    Theme,
    LLM_Frequency = Frequency,
    LLM_Relative  = `Relative Frequency`
  ) %>%
  left_join(validation_results, by = "Theme") %>%
  mutate(
    LLM_Frequency      = as.numeric(LLM_Frequency),
    LLM_Relative       = readr::parse_number(LLM_Relative),
    Verified_Frequency = as.numeric(Verified_Frequency),
    Verified_Relative  = as.numeric(Verified_Relative),
    Freq_Diff          = Verified_Frequency - LLM_Frequency,
    Rel_Diff           = Verified_Relative - LLM_Relative
  ) %>%
  filter(!is.na(Theme), Theme != "", Theme != "---")

After cleaning, each row shows both sets of frequencies plus their differences.

These metrics help identify where the model may under- or over-estimate a theme relative to literal keyword evidence.


Step 7: Visualize the Comparison

Finally, we visualize the relative frequencies from both methods.

# ========================================
# Step 7 — Visualization
# ========================================

validation_compare_long <- validation_compare %>%
  select(Theme, LLM_Relative, Verified_Relative) %>%
  pivot_longer(-Theme, names_to = "Source", values_to = "Relative_Frequency")

ggplot(validation_compare_long, aes(
  x = reorder(Theme, Relative_Frequency),
  y = Relative_Frequency,
  fill = Source)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  scale_fill_manual(values = c("LLM_Relative" = "#FF6F61", "Verified_Relative" = "#00BFC4")) +
  labs(
    title = "Cross-Validation of LM Studio Theme Frequencies",
    x = "Theme",
    y = "Relative Frequency (%)",
    caption = "Comparison between LM Studio-reported and keyword-verified frequencies"
  ) +
  theme_minimal()

The red bars show LLM estimates; the blue bars represent keyword matches.

Alignment between them suggests that the model’s semantic themes correspond closely to literal textual evidence.


Step 8: Statistical Consistency Check

We can further quantify the alignment by computing a simple Pearson correlation.

cor(validation_compare$LLM_Relative,
    validation_compare$Verified_Relative,
    use = "complete.obs")
[1] 0.4053206

With the initial keyword lists, the correlation is moderate (r ≈ 0.41). The two methods rank the dominant themes in broadly similar ways but disagree on how often several themes occur; as Steps 9 and 10 show, most of that divergence comes from keyword anchors that are too general. Refining the keyword definitions typically strengthens the alignment considerably.


Step 9: Interpretation and Reflection

This quantitative validation highlights two complementary lenses:

| Approach | Focus | Strength | Limitation |
|---|---|---|---|
| Keyword-based Validation | What is said | High recall, transparent rules | Literal, may overcount |
| LLM Semantic Analysis | What is meant | Context-aware, concise, human-like reasoning | May undercount subtle mentions |

The LLM acts like a careful qualitative coder: it labels only when meaning is clear, whereas keyword search counts every literal appearance. Together, these methods suggest that LM Studio’s local model captures the same conceptual contours as human reasoning, balancing interpretive depth with computational scalability.

As one co-author joked, “The LLM doesn’t just read the policy—it understands the syllabus.”


Step 10: Refining the Keyword Definitions

Because keyword validation depends entirely on how theme_keywords is defined, it’s worth experimenting with precision vs. recall.

For example:

"Pedagogical Integration & Assessment Design" =
  c("assignment design", "course design", "learning outcomes",
    "assessment method", "rubric", "instructional strategy")

Narrowing the expressions from single words (learning, assessment) to multi-word phrases improves conceptual accuracy and aligns frequencies more closely with LLM estimates.

| Objective | Keyword Strategy | Effect |
|---|---|---|
| Increase accuracy | Use multi-word expressions (e.g., “academic integrity,” “honor code”) | Reduces false positives |
| Increase recall | Include variants (e.g., “cite,” “citation,” “acknowledge”) | Captures more instances |
| Balance both | Mix general and specific terms | Maximizes validity |

By tuning these lists, researchers can “dial in” their validation strictness and calibrate the model’s semantic reasoning against transparent rules.
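
To make this tuning concrete, the sketch below re-runs the Step 5 validation with a narrower keyword set for the pedagogy theme and recomputes the Step 8 correlation. It reuses theme_keywords, count_theme_mentions(), policies, and validation_compare defined earlier; the replacement phrases are illustrative, not a definitive codebook:

# Replace the broad pedagogy keywords with narrower multi-word phrases (illustrative)
refined_keywords <- theme_keywords
refined_keywords[["Pedagogical Integration & Assessment Design"]] <-
  c("assignment design", "course design", "learning outcomes",
    "assessment method", "rubric", "instructional strategy")

# Re-run the keyword validation with the refined codebook
refined_results <- lapply(names(refined_keywords), function(theme) {
  matches <- sapply(policies$Stance, count_theme_mentions,
                    keywords = refined_keywords[[theme]])
  tibble(Theme = theme, Refined_Relative = round(100 * mean(matches), 1))
}) %>% bind_rows()

# Compare the refined keyword frequencies with the LLM estimates
refined_compare <- validation_compare %>%
  left_join(refined_results, by = "Theme")

cor(refined_compare$LLM_Relative, refined_compare$Refined_Relative,
    use = "complete.obs")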

Interpreting the Cross-Validation Results

The cross-validation process compared two perspectives on the same corpus:
(1) the LM Studio semantic model output (LLM_Relative) and
(2) a keyword-based verification (Verified_Relative) drawn directly from the AI policy statements.

Summary of Observed Patterns

| Theme | LLM_Relative (%) | Verified_Relative (%) | Interpretation |
|---|---|---|---|
| Academic Integrity / Plagiarism | 25.0 | 49.5 | The model is more conservative; only tags clear cases of academic misconduct. |
| Faculty Autonomy & Syllabus Clarity | 23.0 | 56.6 | Both methods agree this is a dominant theme, though the LLM captures fewer instances. |
| Citation / Disclosure Requirements | 17.0 | 25.3 | Close alignment; both approaches identify similar occurrences. |
| Conditional AI Use Guidelines | 21.0 | 14.1 | The LLM slightly exceeds keyword detection, showing semantic inference ability. |
| Pedagogical Integration & Assessment Design | 8.0 | 50.5 | The widest gap—keywords overcount, while the LLM limits to truly instructional contexts. |
| Policy Evolution & Ongoing Review | 6.0 | 5.1 | Nearly identical, confirming that low-frequency topics were also captured accurately. |

Interpretation

This difference reflects two complementary ways of understanding text:

| Approach | Focus | Strength | Limitation |
|---|---|---|---|
| Keyword-based Validation | What is said | High recall, transparent rules | Literal, may overcount |
| LLM Semantic Analysis | What is meant | Context-aware, concise, human-like reasoning | May undercount subtle mentions |

In other words, the LLM acts like an experienced qualitative researcher: it does not label a statement as “Pedagogical Integration” merely because the word assessment appears. Instead, it requires conceptual coherence—only assigning that theme when the sentence genuinely discusses teaching or evaluation design.

Quantitative Validation Conclusion

Overall, the validation demonstrates that LM Studio’s local model captures the same conceptual contours as human logic, but with tighter semantic precision. While keyword methods “count what appears,” the LLM “counts what matters.”

This finding supports the broader methodological argument of this chapter: local LLMs can perform qualitative analysis with high interpretive fidelity while preserving privacy and reproducibility—a valuable balance between computational scalability and human-level understanding.

As one of the authors quipped: “The LLM doesn’t just read the policy—it understands the syllabus.”

The Role of Keyword Definitions in Validation Accuracy

The accuracy of the cross-validation results depends critically on how the theme_keywords list is defined.
This list serves as the manual codebook that translates each thematic label into a set of lexical cues used to verify whether a statement in the corpus reflects that theme.
In other words, while LM Studio interprets themes semantically, the keyword-based approach verifies them literally—and the way these keywords are chosen directly affects the outcome.

The Sensitivity of Keyword Matching

For instance, consider the theme:

"Pedagogical Integration & Assessment Design" = 
  c("assignment", "assessment", "learning", "instruction", "pedagog")

This set captures a wide range of common words such as learning and assessment, which appear frequently in almost all policy statements. As a result, the keyword-based validation counts nearly half of the corpus as related to pedagogy (≈ 50%), whereas the LM Studio model, which identifies themes only when the semantic context genuinely involves teaching design, reports a much lower frequency (≈ 8%). Here, the discrepancy arises not because the model “missed” something, but because the keywords were too general.
When the same theme is redefined more precisely:

"Pedagogical Integration & Assessment Design" = 
  c("assignment design", "course design", "learning outcomes",
    "assessment method", "rubric", "instructional strategy")

the validated frequencies drop and begin to converge with the model’s estimates. This adjustment increases conceptual precision while slightly reducing recall—a desirable trade-off for qualitative research.

Balancing Precision and Recall

| Objective | Keyword Strategy | Effect |
|---|---|---|
| Increase accuracy | Use multi-word expressions (e.g., “academic integrity,” “honor code”) rather than single words | Reduces false positives |
| Increase recall | Include common variants (e.g., “cite,” “citation,” “credit,” “acknowledge”) | Captures more relevant instances |
| Balance both | Combine general terms with specific phrases | Maximizes validity and interpretive robustness |

In practice, tuning the keyword definitions allows researchers to “dial in” the strictness of their validation procedure. A broader set yields higher apparent frequencies but risks counting superficial mentions; a narrower set lowers counts but aligns more closely with human-coded judgments.

Interpretation

This behavior illustrates a deeper methodological point: keyword validation tests the literal presence of ideas, while LLM-based thematic extraction tests their conceptual expression. Both perspectives are useful.

By iteratively refining the theme_keywords list, researchers can improve agreement (raising the initial correlation of r ≈ 0.41 substantially, often to r ≈ 0.8 or higher) and use this process to calibrate the model’s semantic reasoning against transparent, rule-based criteria.

Ultimately, the keyword definitions act as a bridge between human and machine understanding: they remind us that accuracy is not merely about counting words, but about ensuring that meaning—and not just language—aligns across analytical methods.

6.4.5.3 Case Study Discussion

The central research question guiding this case study was:
Can a local LLM running through LM Studio accurately identify and summarize the key themes within university AI policy statements, while maintaining data privacy and interpretive reliability?

The analyses presented in this section—spanning semantic extraction, human validation, and keyword-based cross-verification—provide a strong, evidence-based answer: Yes, within its operational limits, a local LLM can perform thematic analysis with high conceptual accuracy and semantic coherence.

Key Findings

  1. Semantic Precision:
    The local LLM captured major thematic patterns consistent with those derived from human coding and keyword verification, particularly around academic integrity, faculty autonomy, and disclosure requirements.
    Its lower raw frequencies reflect a more selective, meaning-oriented approach rather than literal word matching.

  2. Interpretive Consistency:
    The cross-validation results (r ≈ 0.41 with the initial keyword lists, improving as the keyword definitions are refined) showed that the LLM’s thematic hierarchy points in the same direction as the structure identified through traditional text-mining approaches, with most divergences traceable to overly broad keyword anchors.

  3. Reliability Through Validation:
    Human reviewers judged all six LLM-generated themes to be conceptually sound and textually supported.
    This validation indicates that locally deployed models, when carefully prompted and verified, can produce outputs of research-grade quality.

  4. Efficiency and Ethics:
    By running entirely offline, LM Studio ensured complete data sovereignty—no institutional text left the researcher’s machine.
    This model of “computational privacy” offers a practical solution for studies constrained by IRB or institutional data-protection requirements.

Answer to the Research Question

Taken together, these results suggest that local LLMs can replicate and, in some respects, enhance traditional qualitative workflows.
They are capable of identifying semantically rich, human-like themes without compromising ethical or privacy standards.
Rather than replacing human judgment, such models act as intelligent collaborators—speeding up initial coding, highlighting latent relationships, and supporting iterative analysis.

Limitations and Future Testing

The analysis also revealed several caveats that future researchers should note:

  • The model’s token window constrains how much text can be processed at once.
    Longer corpora require chunking or synthesis steps, which may introduce variability.
  • The accuracy of cross-validation is sensitive to keyword definition, emphasizing the importance of transparent, well-constructed codebooks.
  • Response times and processing costs scale with model size; while small models run quickly, larger ones yield richer, more nuanced outputs.

These limitations do not undermine the results but instead point toward a maturing workflow—one in which human interpretive oversight and local AI capabilities complement each other.

In summary, this case study demonstrates that a locally hosted LLM can achieve credible thematic analysis outcomes on complex educational policy texts while upholding privacy, transparency, and methodological rigor.
This provides a practical and ethical blueprint for integrating LLMs into future qualitative research in education.

6.4.6 Reflection

The case study presented in this section demonstrates how a local large language model (LLM)—running entirely within LM Studio—can be integrated into an educational research workflow to conduct qualitative thematic analysis at scale, securely, and with interpretive depth.

From Tokens to Meaning

Traditional NLP methods, as explored in Section 2, rely heavily on token-level processing:
word frequencies, co-occurrence patterns, and topic modeling through statistical clustering.
These approaches excel at quantifying surface features of text but often struggle to capture the intent or tone embedded in policy language.

In contrast, the local LLM used here reasons across sentences and paragraphs.
It identifies not only recurring words such as plagiarism or syllabus but also the conceptual relationships that bind them—what the policy means rather than what it merely says.
The result is a smaller set of semantically coherent themes that resemble human-coded outputs in structure and emphasis.

The cross-validation exercise (Sections 6.4.5–6.4.5.3) illustrated this distinction empirically: the LLM produced lower absolute frequencies yet broadly tracked the thematic hierarchy found by keyword verification (r ≈ 0.41 with the initial keyword lists, with agreement strengthening as the keywords were refined). In short, the machine did not count more—it understood better.

Complementarity, Not Replacement

Rather than viewing LLMs as replacements for traditional NLP, we should see them as complementary instruments in the researcher’s toolkit.
Conventional text mining offers transparency and replicability;
LLMs contribute context, nuance, and synthesis.
When combined, the two form a hybrid analytic ecology—where numbers inform narratives and narratives refine numbers.

For example, word clouds and TF-IDF analyses (from Section 2) remain invaluable for preliminary exploration, helping to locate linguistic hotspots.
Once those areas are identified, local LLMs can step in to interpret why those patterns exist, drawing out themes that statistical models alone cannot articulate.

Privacy and Practicality

Equally important is the ethical and logistical dimension.
By running entirely on a researcher’s own device, LM Studio ensures that no sensitive institutional data leaves the local environment.
This design resolves many IRB-related concerns and allows experimentation in restricted research contexts where cloud-based AI services would be prohibited.

The workflow does, however, require patience.
Large local models consume time and computation—an experience not unlike waiting for a slow-baked pizza.
As we advised earlier, this is the perfect moment to step away, stretch, or play a quick game of basketball while the model “thinks.”
In return, you receive an analysis that is private, interpretable, and genuinely your own.

Looking Ahead: From Analysis to Collaboration

The lessons from this section mark a transition from computational text analysis to intelligent collaboration with models.
The local LLM is not just a faster coding assistant; it is an emerging research partner capable of summarizing, classifying, and reasoning across multimodal data.
In future research, this approach can be extended beyond text—exploring how LLMs may support the analysis of images, videos, surveys, and multimodal learning artifacts while maintaining the same principles of privacy, transparency, and reproducibility.

In summary:
Section 2 taught us how to count words;
Section 6 showed us how machines can interpret meaning—securely, locally, and collaboratively.
Together, they illuminate a continuum of computational methods for educational research,
bridging the measurable and the meaningful, the statistical and the semantic, the algorithmic and the human.