Local LLMs

Overview:
The use of large language models (LLMs) in data analysis is rapidly increasing across education and social science research. However, concerns about data privacy, institutional data protection policies, and strict IRB (Institutional Review Board) procedures present significant challenges when using cloud-based or proprietary AI services. To address these challenges, this chapter introduces local LLM solutions—focusing on LM Studio—which allow researchers to run powerful models entirely on their own computers, ensuring data stays private and analysis remains flexible.

6.1 What are Local LLMs?

Local LLMs are large language models that run directly on your own computer, rather than in the cloud. By processing data locally, they help ensure privacy, data sovereignty, and compliance with institutional or governmental regulations. Local LLMs can be open-source (such as Llama, Qwen, DeepSeek, Mistral) and are compatible with various operating systems and hardware.

Key advantages of local LLMs:

  • Data never leaves your computer
  • No need for external API keys or internet access to analyze sensitive data
  • Flexibility to use custom or open-source models
  • Often no usage fees

6.2 What Can Local LLMs Do?

With the right setup, local LLMs can:

  • Summarize, paraphrase, and analyze text data (open-ended survey responses, interview transcripts, etc.)
  • Support qualitative and quantitative educational research workflows
  • Generate coding frameworks, extract themes, or automate report writing
  • Perform document-based question answering (“chat with your PDFs”)
  • Integrate with other research tools via REST APIs

6.3 Getting Started with LM Studio

LM Studio is a free, cross-platform application that enables researchers to run, manage, and interact with local LLMs (such as Llama, DeepSeek, Qwen, Mistral, and gpt-oss) entirely on their own computers. By using LM Studio, you gain powerful, offline data analysis capabilities without sacrificing data privacy or compliance.

Key Points:

  • Supported Platforms: macOS (Apple Silicon), Windows (x64/ARM64), and Linux (x64).
  • System Requirements: For best results, consult the System Requirements page for recommended RAM, CPU/GPU, and storage.

6.3.1 Installation Steps

  1. Download LM Studio for your operating system from the official Downloads page.
  2. Install and launch the application.
  3. Download your preferred LLM model (such as Llama 3, Qwen, Mistral, DeepSeek, or gpt-oss) directly from within LM Studio.
  4. (Optional) To use the API for scripting/automation, enable API access within LM Studio.
  5. (Optional) Attach documents for “Chat with Documents” (RAG-style analysis) entirely offline.

Official Documentation:
- LM Studio Docs - Getting Started Guide

6.3.2 Main Features

  • Run local models including Llama, Qwen, DeepSeek, Mistral, gpt-oss, and more.
  • Simple chat interface for prompt-based interaction.
  • Offline “Chat with Documents” for Retrieval Augmented Generation (RAG) use cases.
  • Search and download new models from Hugging Face and other model hubs within LM Studio.
  • Manage models, prompts, and configurations through a user-friendly GUI.
  • Serve local models on OpenAI-compatible REST API endpoints, usable by R, Python, or other apps.
  • MCP server/client support for advanced use cases.

6.3.3 API Integration

LM Studio exposes a REST API fully compatible with the OpenAI standard. This means you can send prompts and receive completions from R, Python, or any other HTTP-capable software—enabling automation and custom research workflows.

Example: Calling the LM Studio API from R

The short example below sends a completion request from R to a locally running LM Studio server (default address http://localhost:1234) and prints the response.

library(httr)
library(jsonlite)

prompt <- "Summarize the following open-ended survey responses: ..."

response <- POST(
  url  = "http://localhost:1234/v1/completions",
  add_headers("Content-Type" = "application/json"),
  body = toJSON(list(prompt = prompt, max_tokens = 200), auto_unbox = TRUE)
)

content(response)

6.3.4 Summary Table of LM Studio Capabilities

| Feature | Description |
|---|---|
| Local LLMs | Run Llama, DeepSeek, Qwen, Mistral, etc. fully offline on your own machine |
| Chat Interface | Flexible prompt-based interaction |
| Document Chat (RAG) | Offline “chat with your documents” |
| Model Management | Download, organize, and switch between models |
| API Access | OpenAI-compatible REST endpoints for use with R, Python, scripts, apps |
| MCP Integration | Connect with and use MCP servers |
| Community & Support | Discord, official docs, active development |

6.4 Case Study: Comparing Local LLM Analysis to Traditional NLP on University AI Policy Texts

6.4.1 Research Question

Can a local LLM running via LM Studio reliably identify key themes in university AI policy statements—using the same dataset analyzed in Section 2—so that we can compare its results against traditional NLP methods and human coding?

6.4.2 Data Context

We reuse the AI policy statements dataset from Section 2, now simplified for privacy. The table has one column only:

  • Stance (character): policy text (no institution names)

The structure matches the dataset introduced in Section 2. We will extract the same raw text field (Stance) so results are directly comparable to Section 2.

library(dplyr)
library(stringr)
library(readr)

# If 'university_policies' already exists (from Section 2), use it directly.
# Otherwise, safely fall back to reading the same CSV used in Section 2.
if (!exists("university_policies")) {
  university_policies <- read_csv("University_GenAI_Policy_Stance.csv", show_col_types = FALSE)
}

stopifnot("Stance" %in% names(university_policies))

policy_texts <- university_policies$Stance %>%
  as.character() %>%
  stringr::str_squish() %>%
  na.omit()

length(policy_texts)
[1] 99
head(policy_texts, 3)
[1] "If the text generated by ChatGPT is used as a starting point for original research or writing, then it can be a useful tool for generating ideas and suggestions. In this case, it is important to properly cite and attribute the source of the information. ... However, if the text generated by ChatGPT is simply copied and pasted into a paper or report without any modifications, it can be considered plagiarism since the text isn’t original."                                                                                                                                                                                                                                                                   
[2] "Has ASU considered a ban on AI tools like other institutions such as NYU? No. ASU faculty and administrators are focused on the positive potential of Generative AI while also thinking through concerns about ethics, academic integrity, and privacy. ... What is being done to ensure academic integrity? The Provost’s Office is currently reviewing ASU’s academic integrity policy through the lens of what kind of content can be produced through generative AI and what kind of learning behaviors and outcomes are expected of students. ... Will I get accused of cheating if I use AI tools? Before using AI tools in your coursework, confer with your instructor about their class policy for using AI tools."
[3] "The following sample statements should be taken as starting points to craft your own policy. As of January 23, 2023, the Provost’s Office at BC has not issued a policy regarding the use of AI in coursework. ... Syllabus Statement 1 (Discourage Use of AI) ... Syllabus Statement 2 (Treat AI-generated text as a source)"                                                                                                                                                                                                                                                                                                                                                                                              

6.4.3 Implementation with LM Studio (Thematic Analysis)

We send the same policy texts to LM Studio’s local API using two parameters defined once for the whole section: api_base (the server address) and model_name (the model identifier).
The model openai/gpt-oss-20b runs locally in LM Studio and exposes OpenAI-compatible endpoints. If you use a different model, update model_name accordingly.

library(httr)
library(jsonlite)
library(glue)
library(stringr)

# Global parameters used throughout the rest of this section
api_base   <- "http://127.0.0.1:1234/v1"
model_name <- "openai/gpt-oss-20b"

Testing the Local Connection

Before running large jobs, it’s good practice to confirm that LM Studio is responding correctly. A quick “ping test” helps prevent silent connection errors.

library(httr)
library(jsonlite)

api_base <- "http://127.0.0.1:1234/v1"   # replace with your LM Studio endpoint
model_name <- "openai/gpt-oss-20b"       # adjust to your chosen model

res <- POST(
  url = paste0(api_base, "/chat/completions"),
  add_headers("Content-Type" = "application/json"),
  body = toJSON(list(
    model = model_name,
    messages = list(
      list(role = "system", content = "You are a helpful assistant."),
      list(role = "user", content = "Please reply with 'pong'")
    )
  ), auto_unbox = TRUE)
)

cat(content(res)$choices[[1]]$message$content)

✅ If the model replies with “pong,” the local API is ready.
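As an additional check, you can list the model identifiers your LM Studio server currently exposes via the OpenAI-compatible /v1/models endpoint; this is a convenient way to copy the exact string to use for model_name. A minimal sketch, assuming the server is running at api_base as configured above:

# List model identifiers exposed by the local server (OpenAI-compatible /v1/models endpoint)
models_res <- GET(paste0(api_base, "/models"))
stop_for_status(models_res)

model_ids <- vapply(content(models_res)$data, function(m) m$id, character(1))
print(model_ids)   # copy one of these strings into model_name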

Prompt writing

Next, we write our prompt. Since we are interested in finding common patterns across the AI policy documents, the prompt asks the local LLM to identify those patterns. A useful trick is to ask it to return the results in a data-frame-ready format. (Normally, if you pasted the text into the LM Studio chat box, you would get a narrative answer.) Your prompt can specify exactly how you want the data to be captured and reported.

# ----- 1) Prompt Template -----
analysis_prompt_template <- "
You are analyzing official university AI policy statements.
Your task is to identify 3–5 key themes across the statements and report them in the exact format below.

**INPUT DATA:**
- **Number of Statements:** {n_items}
- **Policy Statements:**
{items}

**YOUR TASK:**
1) Identify 3–5 key themes across the policy statements.
2) For each theme:
   a) Provide a concise theme name.
   b) Provide a 1–2 sentence description.
   c) Provide one short verbatim example quote.
   d) Provide an integer Frequency (count of statements mentioning it).
   e) Provide Relative Frequency as a whole-number percentage.
3) Write a 3–5 sentence **Summary of Responses** synthesizing the most important insights.
4) Output strictly in the following format:

**Summary of Responses**
[3–5 sentence narrative summary goes here.]

**Thematic Table**
| Theme | Description | Illustrative Example(s) | Frequency | Relative Frequency |
|---|---|---|---|---|
| [Theme 1] | [Description] | - \"[Quote]\" | [n] | [p]% |
| [Theme 2] | [Description] | - \"[Quote]\" | [n] | [p]% |
"

Chunks!

Next, we define chunk sizes for the local LLM to analyze our data. In qualitative text analysis using LLMs (such as thematic synthesis or coding), chunk size refers to the amount of text you pass to the model at one time. It directly affects coherence, depth, and efficiency of analysis.

Chunk size balances context preservation and analytic precision in qualitative LLM-based text analysis. If chunks are too small, the model loses semantic coherence, producing fragmented or repetitive themes. If too large, it may miss local nuances or exceed the model’s reasoning capacity. The aim is to maintain enough continuity for meaningful interpretation while staying within manageable input limits.

Practically, chunk size should follow natural meaning units, such as paragraphs, speaker turns, or short sections, rather than fixed word counts. Researchers typically find that 500–1000 words work well for transcripts, while longer documents like policies can be chunked at 1000–1500 words. The guiding principle is to choose the smallest segment that preserves interpretive coherence.

# ----- 2) Chunk the corpus to stay within model context window -----
CHUNK_SIZE <- 15
chunks <- split(policy_texts, ceiling(seq_along(policy_texts) / CHUNK_SIZE))
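The split above uses a fixed number of documents per chunk (15 statements). If your texts vary widely in length, you can instead cap each chunk by an approximate word budget, in line with the 500–1500-word guidance above. A minimal sketch; the 1200-word budget is an illustrative choice, not a rule:

# Greedy grouping: start a new chunk whenever adding a text would exceed the word budget
MAX_WORDS   <- 1200                                   # illustrative budget; tune to your model's context window
word_counts <- stringr::str_count(policy_texts, "\\S+")

chunk_id      <- integer(length(policy_texts))
current_chunk <- 1L
running_words <- 0L
for (i in seq_along(policy_texts)) {
  if (running_words + word_counts[i] > MAX_WORDS && running_words > 0L) {
    current_chunk <- current_chunk + 1L
    running_words <- 0L
  }
  chunk_id[i]   <- current_chunk
  running_words <- running_words + word_counts[i]
}

chunks_by_words <- split(policy_texts, chunk_id)
lengths(chunks_by_words)   # how many statements landed in each chunk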

Connecting to LM Studio

Once our data is prepared, the next step is to pass it to LM Studio. Using the helper function below, we send our text data to the LM Studio server.

What is key here is that we specify the model name, a “system” role defining the model’s expertise (in this case, a qualitative research analyst), and the “user” role containing the analysis prompt. The parameter temperature = 0.2 constrains randomness to produce consistent, analytic responses, while max_tokens limits the response length.

  • Temperature controls randomness: a low value (0.2) produces consistent, analytical responses suited to qualitative coding, while higher values encourage creativity but reduce reliability.

  • Max tokens limits response length. Setting it to 1000 ensures sufficient detail without verbosity or truncation. Together, these parameters balance precision and completeness in model-generated analyses.

In essence, this helper encapsulates the logic of prompt dispatch and result retrieval, ensuring each call to the LLM is standardized and repeatable. This is crucial for qualitative workflows where traceability and parameter control are essential.

# ----- 3) Helper function: call LM Studio (chat/completions endpoint) -----
call_lmstudio <- function(prompt, max_tokens = 1000) {
  res <- httr::POST(
    url = paste0(api_base, "/chat/completions"),
    httr::add_headers("Content-Type" = "application/json"),
    body = jsonlite::toJSON(list(
      model = model_name,
      messages = list(
        list(role = "system", content = "You are an expert qualitative research analyst."),
        list(role = "user", content = prompt)
      ),
      temperature = 0.2,
      max_tokens = max_tokens
    ), auto_unbox = TRUE)
  )
  httr::stop_for_status(res)
  content(res)$choices[[1]]$message$content
}

Running the analysis

Now, the script applies the analysis_prompt_template to each chunk of policy text using lapply(). Each chunk is converted into a numbered text block (items_block) and analyzed independently through call_lmstudio(), producing localized thematic results (chunk_outputs).

Second, the meta_prompt integrates these separate analyses. It instructs the model to synthesize and deduplicate themes across all chunks into a unified framework, including a concise narrative summary and a structured thematic table with descriptions, examples, and frequency data. Together, these steps move from micro-level coding to macro-level interpretation. This step is optional, and can be skipped depending on the nature of data and research questions.

# ----- 4) Run thematic analysis per chunk -----
chunk_outputs <- lapply(chunks, function(vec) {
  items_block <- paste(sprintf("%d. %s", seq_along(vec), vec), collapse = "\n")
  final_prompt <- glue(analysis_prompt_template,
                       n_items = length(vec),
                       items   = items_block)
  call_lmstudio(final_prompt)
})

# ----- 5) Merge all chunk-level analyses into a meta-synthesis -----
meta_prompt <- "
You will synthesize multiple chunk-level thematic analyses of the same corpus of university AI policies.
Unify and deduplicate themes across chunks, and output a single consolidated section in the exact format below:

**Summary of Responses**
[3–5 sentence narrative summary.]

**Thematic Table**
| Theme | Description | Illustrative Example(s) | Frequency | Relative Frequency |
|---|---|---|---|---|
| [Unified Theme 1] | [Description] | - \"[Quote]\" | [n] | [p]% |
| [Unified Theme 2] | [Description] | - \"[Quote]\" | [n] | [p]% |
"

Synthesizing and Final LLM Analysis

We are now back in R, synthesizing our data (and managing token limits efficiently).

The chunk_outputs are split into smaller pairs, each containing two analyses. Each pair is merged and passed through call_lmstudio() using the same meta_prompt, producing intermediate syntheses (pair_outputs). These summaries are then combined into a single consolidated input (final_meta_input) for a final call to call_lmstudio(), yielding the comprehensive meta-analysis (meta_output).

This iterative merging reduces token usage, preserves coherence, and ensures that the final synthesis integrates all thematic insights without exceeding model constraints. With saveRDS(meta_output, "data/meta_output_saved.rds") we save our analysis so that in the future, we can just start from there to pick things back up.

# Pairwise synthesis to reduce token usage
pairs <- split(chunk_outputs, ceiling(seq_along(chunk_outputs) / 2))

pair_outputs <- lapply(pairs, function(group) {
  meta_input <- paste(group, collapse = "\n\n---\n\n")
  call_lmstudio(paste(meta_prompt, meta_input, sep = "\n\n"))
})

# Now you have fewer intermediate syntheses
final_meta_input <- paste(pair_outputs, collapse = "\n\n---\n\n")
meta_output <- call_lmstudio(paste(meta_prompt, final_meta_input, sep = "\n\n"))
cat(meta_output)

saveRDS(meta_output, "data/meta_output_saved.rds")

Thematic Table Extraction and Cleaning

This code takes the saved meta-analysis from LM Studio and turns it into a clean, usable table in R. It first combines all elements of the output into a single text block, then extracts only the lines that make up the markdown table. Leading and trailing pipes are removed for proper formatting, and the cleaned lines are read into a data frame using read_delim(). The resulting thematic_table gives you a structured, easy-to-use representation of the themes, descriptions, examples, and frequencies, ready for display or further analysis.

library(stringr)
library(readr)

# --- Read RDS ---
meta_output <- readRDS("data/meta_output_saved.rds")

# --- Combine all elements into one long text block ---
meta_output_text <- paste(meta_output, collapse = "\n")

# --- Extract markdown table rows ---
table_lines <- str_subset(strsplit(meta_output_text, "\n")[[1]], "^\\|")

# --- Clean leading/trailing pipes ---
table_text <- gsub("^\\||\\|$", "", table_lines)

# --- Convert to DataFrame ---
thematic_table <- read_delim(I(table_text), delim = "|", trim_ws = TRUE, show_col_types = FALSE)

# --- Display result ---
print(thematic_table)
# A tibble: 7 × 5
  Theme        Description Illustrative Example…¹ Frequency `Relative Frequency`
  <chr>        <chr>       <chr>                  <chr>     <chr>               
1 ---          ---         ---                    ---       ---                 
2 Academic In… Policies t… - “If a student uses … 13        25%                 
3 Faculty Aut… Instructor… - “Different faculty … 12        23%                 
4 Citation / … Students m… - “Under BU's guideli… 9         17%                 
5 Conditional… Policies a… - “Instead of forbidd… 11        21%                 
6 Pedagogical… Emphasis o… - “Propose alternativ… 4         8%                  
7 Policy Evol… Recognitio… - “Universities will … 3         6%                  
# ℹ abbreviated name: ¹​`Illustrative Example(s)`

6.4.3.1 Saving and Exporting Results

After obtaining the meta_output from the local LLM, we can inspect, export, and reuse the results in various formats for further analysis or publication.

# --- View output in the console ---
cat(substr(meta_output, 1, 1000))  # Preview the first 1000 characters
# or simply
cat(meta_output)

# --- Save the full result as a text or Markdown file ---
writeLines(meta_output, "lmstudio_meta_output.txt")
writeLines(meta_output, "lmstudio_meta_output.md")


# --- Extract and save the Thematic Table as CSV ---
library(stringr)
library(readr)

# Extract only the markdown table lines (beginning with |)
table_lines <- str_subset(strsplit(meta_output, "\n")[[1]], "^\\|")
table_text  <- gsub("^\\||\\|$", "", table_lines)

# Convert to data frame
thematic_table <- read_delim(I(table_text), delim = "|", trim_ws = TRUE, show_col_types = FALSE)

# Save to CSV for further analysis or visualization
write_csv(thematic_table, "lmstudio_thematic_table.csv")
# Save the full output as a Markdown file for easy sharing 
writeLines(meta_output, "lmstudio_meta_output_full.md")

# Optional: check where the file was saved
getwd()

6.4.3.2 Practical Notes on Running Local Models 🍕💻

Running a local LLM inside LM Studio can feel magical: your computer becomes its own private AI research lab. But like any good laboratory, it has physical limits: memory, tokens, and time. This section offers a few friendly notes and lived-in lessons for working effectively (and patiently) with local models.

Tokens Are Like Bites of Pizza

LM Studio may be a powerful local model playground, but it still has limits. Think of tokens as bites of pizza: your model can chew through a few generous slices, but handing it the entire pizza (for example, your full corpus of 99 policy statements) in one go will only lead to indigestion (also known as the dreaded “HTTP 400 Bad Request”).

Every model has a context window (often 8k–32k tokens). Both your prompt and the expected response must fit inside this box. When in doubt:

  • Feed your model smaller slices. Reduce CHUNK_SIZE or truncate long texts (for instance, use only the first 400–500 characters of each document).

  • Adjust your max_tokens parameter. Fewer output tokens make for shorter, faster, and safer runs.

  • Monitor your total prompt length. Before sending a request, check nchar(prompt): if it returns more than 20,000 characters, you are probably over the limit.
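As a concrete guard, you can wrap the API call so that over-long prompts are flagged (or the input texts trimmed) before anything reaches the server. This is a minimal sketch layered on top of call_lmstudio() from above; the 20,000-character threshold and the 500-character truncation are illustrative values, not fixed rules:

# Guarded wrapper: warn about over-long prompts before calling the local model
MAX_PROMPT_CHARS <- 20000   # illustrative threshold

safe_call_lmstudio <- function(prompt, ...) {
  if (nchar(prompt) > MAX_PROMPT_CHARS) {
    warning("Prompt is ", nchar(prompt), " characters; consider smaller chunks.")
  }
  call_lmstudio(prompt, ...)
}

# Optional: shrink each document before building prompts
truncated_texts <- substr(policy_texts, 1, 500)   # keep only the first 500 characters of each statement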

Computing Resources and Patience

  • Expect variable response times. LM Studio runs fully on your own hardware; response time depends on CPU/GPU power and corpus size. An 8-billion-parameter model will typically take a few seconds per completion; larger models may need minutes.

  • Mind your system memory. Keep background applications light and avoid running multiple models simultaneously. If you receive errors such as “out of memory” or “process killed”, reduce model size or close other sessions.

Pro tip from the authors: During long qualitative runs, go play a game of basketball, take a walk, or grab a coffee. The LLM will still be digesting its token pizza when you return.

File Paths, Caching, and Stability

  • Use consistent file paths. Save outputs (meta_output.md, thematic_table.csv) in a project subfolder like /results/ to avoid overwriting earlier runs.

  • Enable model caching in LM Studio. Cached models load faster after the first use and reduce memory spikes.

  • Restart occasionally. Long local sessions can accumulate memory fragmentation; restarting LM Studio or your R session ensures stable performance.

Takeaways

Feed your model thoughtfully—one well-prepared prompt at a time—and you’ll get cleaner, faster, and tastier results. Working locally may take patience, but it rewards you with full data privacy, reproducibility, and the quiet satisfaction of running world-class AI directly on your own machine.

6.4.4 Sample Output

Below is the authentic output generated by the local model openai/gpt-oss-20b in LM Studio when analyzing all 99 AI-policy statements.
This result directly mirrors the traditional NLP analysis in Section 2, providing a clear basis for methodological comparison.

Summary of Responses

Across the surveyed universities, a shared priority is safeguarding academic integrity while allowing instructors to tailor AI-use rules at the course level. Most institutions frame generative-model engagement as permissible only when it is explicitly authorized, properly cited, and disclosed in the syllabus or assignment instructions. Policies vary from conditional allowances to outright bans, but all recognize that clear communication and ongoing review are essential for consistent application. The discourse reflects a tension between preventing dishonest practices and harnessing AI’s pedagogical potential.

Thematic Table

| Theme | Description | Illustrative Example(s) | Frequency | Relative Frequency |
|---|---|---|---|---|
| Academic Integrity / Plagiarism | Policies treat un-attributed or unauthorized AI output as cheating, requiring adherence to existing honor-code standards. | - “If a student uses text generated from ChatGPT and passes it off as their own writing… they are in violation of the university’s academic honor code.” (Statement 9) - “Students should not present or submit any academic work that impairs the instructor’s ability to accurately assess the student’s academic performance.” (Statement 2) | 13 | 25% |
| Faculty Autonomy & Syllabus Clarity | Instructors are empowered to set, communicate, and enforce AI-use rules within their courses, often via the syllabus or early course materials. | - “Different faculty will have different expectations about whether and how students can use AI tools, so being transparent about your expectations is essential.” (Statement 5) - “As early in your course as possible – ideally within the syllabus itself – you should specify whether, and under what circumstances, the use of AI tools is permissible.” (Statement 7) | 12 | 23% |
| Citation / Disclosure Requirements | Students must explicitly credit AI-generated content or document their interactions to avoid plagiarism. | - “Under BU’s guidelines… students must give credit to them whenever they’re used… include an appendix detailing the entire exchange with an LLM.” (Statement 4) - “You must cite your use of these tools appropriately. Not doing so violates the HBS Honor Code.” (Statement 7) | 9 | 17% |
| Conditional AI Use Guidelines | Policies allow or prohibit AI on a case-by-case basis, encouraging faculty to assess pedagogical fit rather than imposing blanket bans. | - “Instead of forbidding its use, however, we might investigate which questions AI poses for us as teachers and for our students as learners.” (Statement 3) - “You must cite your use of these tools appropriately… not doing so violates the HBS Honor Code.” (Statement 7) | 11 | 21% |
| Pedagogical Integration & Assessment Design | Emphasis on designing assignments that preserve skill development while leveraging AI benefits, and on re-thinking assessment strategies. | - “Propose alternative assignments or assessments if there is the chance that students might use the tool to misrepresent the output from ChatGPT as their own.” (Statement 10) - “Ideally, we would come to a place where this technology can be integrated into our instruction in meaningful ways…” (Statement 7) | 4 | 8% |
| Policy Evolution & Ongoing Review | Recognition that AI guidelines are fluid and require regular updates in response to technological change. | - “Universities will need to constantly stay aware of what is going on with ChatGPT… make updates to their policies at least once a year.” (Statement 13) | 3 | 6% |

6.4.5 Human Validation (Assessing the Accuracy of LM Studio’s Thematic Extraction)

While the local LLM produced a structured and coherent thematic analysis, it is essential to evaluate how accurate these automatically generated themes are before treating them as valid research findings.
Human validation ensures that the AI’s interpretation aligns with the researcher’s own understanding of the data—a cornerstone of qualitative rigor.

6.4.5.1 Manual Validation Procedure

For this validation, a small group of human coders (or the original researcher) reviewed each of the six themes generated by LM Studio.
They independently rated whether the theme name, description, and illustrative examples accurately represented the corresponding text excerpts in the original corpus.

Each theme was labeled as:

  • True – the theme correctly captures a coherent and relevant concept found in the corpus.
  • False – the theme is misleading, redundant, or unsupported by the text.

Example Validation Table

| LLM-Generated Theme | Human Judgment | Comment Summary |
|---|---|---|
| Academic Integrity / Plagiarism | ✅ True | Strongly supported by multiple statements referencing honor codes and plagiarism. |
| Faculty Autonomy & Syllabus Clarity | ✅ True | Matches explicit institutional language about syllabus-level discretion. |
| Citation / Disclosure Requirements | ✅ True | Directly evidenced by quotes requiring citation or appendices. |
| Conditional AI Use Guidelines | ✅ True | Consistent with texts describing conditional permissions. |
| Pedagogical Integration & Assessment Design | ✅ True | Accurately summarizes emerging pedagogical considerations. |
| Policy Evolution & Ongoing Review | ✅ True | Well-grounded in statements about policy updates and future revisions. |

Validation Accuracy: 6 / 6 = 100 % (illustrative)

In practice, partial matches and ambiguous cases can occur.
Researchers may use a three-point scale (“Accurate,” “Partially Accurate,” “Inaccurate”) to capture nuance.

R Code for Recording and Calculating Accuracy

Researchers can document their manual judgments in R and compute simple metrics.

library(dplyr)

# Example: human evaluation of LM Studio themes

validation_data <- tibble::tibble(
  Theme = c("Academic Integrity / Plagiarism", "Faculty Autonomy & Syllabus Clarity",
            "Citation / Disclosure Requirements", "Conditional AI Use Guidelines",
            "Pedagogical Integration & Assessment Design", "Policy Evolution & Ongoing Review"),
  Human_Judgment = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
  Comment = c("Clearly defined theme", "Matches source texts precisely", "Accurate and well-evidenced",
              "Appropriate scope", "Valid pedagogical dimension", "Accurately reflects iterative nature of policies")
)

# Calculate proportion of themes rated TRUE

validation_accuracy <- mean(validation_data$Human_Judgment)

sprintf("Validation Accuracy: %.1f%%", 100 * validation_accuracy)
[1] "Validation Accuracy: 100.0%"
print(validation_data)
# A tibble: 6 × 3
  Theme                                       Human_Judgment Comment            
  <chr>                                       <lgl>          <chr>              
1 Academic Integrity / Plagiarism             TRUE           Clearly defined th…
2 Faculty Autonomy & Syllabus Clarity         TRUE           Matches source tex…
3 Citation / Disclosure Requirements          TRUE           Accurate and well-…
4 Conditional AI Use Guidelines               TRUE           Appropriate scope  
5 Pedagogical Integration & Assessment Design TRUE           Valid pedagogical …
6 Policy Evolution & Ongoing Review           TRUE           Accurately reflect…
print(validation_accuracy) 
[1] 1
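
If you prefer the three-point scale mentioned above (“Accurate,” “Partially Accurate,” “Inaccurate”), the same idea extends naturally. The sketch below uses hypothetical ratings and an illustrative weight of 0.5 for partial matches:

# Hypothetical three-point ratings for the same six themes (illustrative, not observed data)
rating <- factor(
  c("Accurate", "Accurate", "Partially Accurate", "Accurate", "Accurate", "Inaccurate"),
  levels = c("Inaccurate", "Partially Accurate", "Accurate")
)

# Weighted accuracy: full credit for Accurate, half credit for Partially Accurate
weights <- c("Inaccurate" = 0, "Partially Accurate" = 0.5, "Accurate" = 1)
weighted_accuracy <- mean(weights[as.character(rating)])
sprintf("Weighted Validation Accuracy: %.1f%%", 100 * weighted_accuracy)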

6.4.5.2 Quantitative Cross-Validation (Comparing Theme Frequencies)

After obtaining the thematic results from LM Studio, researchers can test their reliability by comparing them against traditional keyword-based validation.
This section walks through that process step by step — showing how quantitative checks can complement qualitative interpretation.

Step 1: Concept and Rationale

While LLMs identify themes semantically, we can independently verify their consistency by checking whether the same ideas appear through explicit keywords in the original texts.
This serves as a quantitative cross-check between two perspectives:

  1. LM Studio output — interprets meaning through context.
  2. Keyword-based validation — detects literal word usage.

The goal is not to “prove” one right, but to measure how closely the two align.

Step 2: Load and Prepare the Data

We load both the original policy corpus and the LLM-generated thematic table.

# ========================================
# Step 2 — Load data
# ========================================

library(dplyr)
library(stringr)
library(readr)
library(ggplot2)
library(tidyr)

policies <- university_policies %>%
  mutate(Stance = as.character(Stance))

llm_table <- read_csv("lmstudio_thematic_table.csv", show_col_types = FALSE)

Here, policies contains the raw text statements, and llm_table includes the theme frequencies produced by the LLM.

Step 3: Define Keyword Anchors

Next, we define a manual codebook of lexical cues for each theme.

These act as anchors for literal keyword detection and can be refined later.

# ========================================
# Step 3 — Define theme keywords
# ========================================

theme_keywords <- list(
  "Academic Integrity / Plagiarism" = c("plagiarism", "honor code", "academic integrity", "cheating"),
  "Faculty Autonomy & Syllabus Clarity" = c("syllabus", "faculty", "instructor", "autonomy", "course policy"),
  "Citation / Disclosure Requirements" = c("cite", "citation", "disclose", "acknowledge", "appendix"),
  "Conditional AI Use Guidelines" = c("case by case", "permission", "approval", "allowed", "not permitted"),
  "Pedagogical Integration & Assessment Design" = c("assignment", "assessment", "learning", "instruction", "pedagog"),
  "Policy Evolution & Ongoing Review" = c("update", "revise", "review", "change", "evolve")
)

Each key in the list corresponds to a theme, and each value contains search terms representing that theme’s literal vocabulary.


Step 4: Count Keyword Occurrences

We now create a helper function that flags whether a policy statement mentions any of the keywords for a given theme; the flags are then aggregated into counts in Step 5.

# ========================================
# Step 4 — Count keyword matches
# ========================================

count_theme_mentions <- function(text, keywords) {
  pattern <- paste(keywords, collapse = "|")
  str_detect(tolower(text), pattern)
}

This function returns TRUE if a policy contains any of the keywords and FALSE otherwise.

We’ll use it to compute frequency counts across all statements.


Step 5: Compute Validation Metrics

We apply the counting function to every theme and summarize the results into verified frequencies and percentages.

# ========================================
# Step 5 — Apply validation across the corpus
# ========================================

validation_results <- lapply(names(theme_keywords), function(theme) {
  keywords <- theme_keywords[[theme]]
  matches <- sapply(policies$Stance, count_theme_mentions, keywords = keywords)
  tibble(
    Theme = theme,
    Verified_Frequency = sum(matches),
    Verified_Relative = round(100 * mean(matches), 1)
  )
}) %>% bind_rows()

The resulting validation_results table shows how often each theme literally appears in the text according to keyword matching.


Step 6: Merge with LLM Results

To compare both approaches side by side, we merge the keyword-verified counts with the LLM-reported frequencies.

# ========================================
# Step 6 — Merge and clean data
# ========================================

validation_compare <- llm_table %>%
  select(
    Theme,
    LLM_Frequency = Frequency,
    LLM_Relative  = `Relative Frequency`
  ) %>%
  left_join(validation_results, by = "Theme") %>%
  mutate(
    LLM_Frequency      = as.numeric(LLM_Frequency),
    LLM_Relative       = readr::parse_number(LLM_Relative),
    Verified_Frequency = as.numeric(Verified_Frequency),
    Verified_Relative  = as.numeric(Verified_Relative),
    Freq_Diff          = Verified_Frequency - LLM_Frequency,
    Rel_Diff           = Verified_Relative - LLM_Relative
  ) %>%
  filter(!is.na(Theme), Theme != "", Theme != "---")

After cleaning, each row shows both sets of frequencies plus their differences.

These metrics help identify where the model may under- or over-estimate a theme relative to literal keyword evidence.


Step 7: Visualize the Comparison

Finally, we visualize the relative frequencies from both methods.

# ========================================
# Step 7 — Visualization
# ========================================

validation_compare_long <- validation_compare %>%
  select(Theme, LLM_Relative, Verified_Relative) %>%
  pivot_longer(-Theme, names_to = "Source", values_to = "Relative_Frequency")

ggplot(validation_compare_long, aes(
  x = reorder(Theme, Relative_Frequency),
  y = Relative_Frequency,
  fill = Source)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  scale_fill_manual(values = c("LLM_Relative" = "#FF6F61", "Verified_Relative" = "#00BFC4")) +
  labs(
    title = "Cross-Validation of LM Studio Theme Frequencies",
    x = "Theme",
    y = "Relative Frequency (%)",
    caption = "Comparison between LM Studio-reported and keyword-verified frequencies"
  ) +
  theme_minimal()

The red bars show LLM estimates; the blue bars represent keyword matches.

Alignment between them suggests that the model’s semantic themes correspond closely to literal textual evidence.


Step 8: Statistical Consistency Check

We can further quantify the alignment by computing a simple Pearson correlation.

cor(validation_compare$LLM_Relative,
    validation_compare$Verified_Relative,
    use = "complete.obs")
[1] 0.4053206

With the initial keyword lists, the correlation is moderate (r ≈ 0.41). The two methods rank the dominant themes in broadly similar ways but disagree on how often several themes occur; as Steps 9 and 10 show, most of that divergence comes from keyword anchors that are too general. Refining the keyword definitions typically strengthens the alignment considerably.


Step 9: Interpretation and Reflection

This quantitative validation highlights two complementary lenses:

| Approach | Focus | Strength | Limitation |
|---|---|---|---|
| Keyword-based Validation | What is said | High recall, transparent rules | Literal, may overcount |
| LLM Semantic Analysis | What is meant | Context-aware, concise, human-like reasoning | May undercount subtle mentions |

The LLM acts like a careful qualitative coder: it labels only when meaning is clear, whereas keyword search counts every literal appearance. Together, these methods suggest that LM Studio’s local model captures the same conceptual contours as human reasoning, balancing interpretive depth with computational scalability.

As one co-author joked, “The LLM doesn’t just read the policy—it understands the syllabus.”


Step 10: Refining the Keyword Definitions

Because keyword validation depends entirely on how theme_keywords is defined, it’s worth experimenting with precision vs. recall.

For example:

"Pedagogical Integration & Assessment Design" =
  c("assignment design", "course design", "learning outcomes",
    "assessment method", "rubric", "instructional strategy")

Narrowing the expressions from single words (learning, assessment) to multi-word phrases improves conceptual accuracy and aligns frequencies more closely with LLM estimates.

| Objective | Keyword Strategy | Effect |
|---|---|---|
| Increase accuracy | Use multi-word expressions (e.g., “academic integrity,” “honor code”) | Reduces false positives |
| Increase recall | Include variants (e.g., “cite,” “citation,” “acknowledge”) | Captures more instances |
| Balance both | Mix general and specific terms | Maximizes validity |

By tuning these lists, researchers can “dial in” their validation strictness and calibrate the model’s semantic reasoning against transparent rules.
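
To make this tuning concrete, the sketch below re-runs the Step 5 validation with a narrower keyword set for the pedagogy theme and recomputes the Step 8 correlation. It reuses theme_keywords, count_theme_mentions(), policies, and validation_compare defined earlier; the replacement phrases are illustrative, not a definitive codebook:

# Replace the broad pedagogy keywords with narrower multi-word phrases (illustrative)
refined_keywords <- theme_keywords
refined_keywords[["Pedagogical Integration & Assessment Design"]] <-
  c("assignment design", "course design", "learning outcomes",
    "assessment method", "rubric", "instructional strategy")

# Re-run the keyword validation with the refined codebook
refined_results <- lapply(names(refined_keywords), function(theme) {
  matches <- sapply(policies$Stance, count_theme_mentions,
                    keywords = refined_keywords[[theme]])
  tibble(Theme = theme, Refined_Relative = round(100 * mean(matches), 1))
}) %>% bind_rows()

# Compare the refined keyword frequencies with the LLM estimates
refined_compare <- validation_compare %>%
  left_join(refined_results, by = "Theme")

cor(refined_compare$LLM_Relative, refined_compare$Refined_Relative,
    use = "complete.obs")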

Interpreting the Cross-Validation Results

The cross-validation process compared two perspectives on the same corpus:
(1) the LM Studio semantic model output (LLM_Relative) and
(2) a keyword-based verification (Verified_Relative) drawn directly from the AI policy statements.

Summary of Observed Patterns

| Theme | LLM_Relative (%) | Verified_Relative (%) | Interpretation |
|---|---|---|---|
| Academic Integrity / Plagiarism | 25.0 | 49.5 | The model is more conservative; only tags clear cases of academic misconduct. |
| Faculty Autonomy & Syllabus Clarity | 23.0 | 56.6 | Both methods agree this is a dominant theme, though the LLM captures fewer instances. |
| Citation / Disclosure Requirements | 17.0 | 25.3 | Close alignment; both approaches identify similar occurrences. |
| Conditional AI Use Guidelines | 21.0 | 14.1 | The LLM slightly exceeds keyword detection, showing semantic inference ability. |
| Pedagogical Integration & Assessment Design | 8.0 | 50.5 | The widest gap—keywords overcount, while the LLM limits to truly instructional contexts. |
| Policy Evolution & Ongoing Review | 6.0 | 5.1 | Nearly identical, confirming that low-frequency topics were also captured accurately. |

Interpretation

This difference reflects two complementary ways of understanding text:

| Approach | Focus | Strength | Limitation |
|---|---|---|---|
| Keyword-based Validation | What is said | High recall, transparent rules | Literal, may overcount |
| LLM Semantic Analysis | What is meant | Context-aware, concise, human-like reasoning | May undercount subtle mentions |

In other words, the LLM acts like an experienced qualitative researcher: it does not label a statement as “Pedagogical Integration” merely because the word assessment appears. Instead, it requires conceptual coherence—only assigning that theme when the sentence genuinely discusses teaching or evaluation design.

Quantitative Validation Conclusion

Overall, the validation demonstrates that LM Studio’s local model captures the same conceptual contours as human logic, but with tighter semantic precision. While keyword methods “count what appears,” the LLM “counts what matters.”

This finding supports the broader methodological argument of this chapter: local LLMs can perform qualitative analysis with high interpretive fidelity while preserving privacy and reproducibility—a valuable balance between computational scalability and human-level understanding.

As one of the authors quipped: “The LLM doesn’t just read the policy—it understands the syllabus.”

The Role of Keyword Definitions in Validation Accuracy

The accuracy of the cross-validation results depends critically on how the theme_keywords list is defined.
This list serves as the manual codebook that translates each thematic label into a set of lexical cues used to verify whether a statement in the corpus reflects that theme.
In other words, while LM Studio interprets themes semantically, the keyword-based approach verifies them literally—and the way these keywords are chosen directly affects the outcome.

The Sensitivity of Keyword Matching

For instance, consider the theme:

"Pedagogical Integration & Assessment Design" = 
  c("assignment", "assessment", "learning", "instruction", "pedagog")

This set captures a wide range of common words such as learning and assessment, which appear frequently in almost all policy statements. As a result, the keyword-based validation counts nearly half of the corpus as related to pedagogy (≈ 50%), whereas the LM Studio model, which identifies themes only when the semantic context genuinely involves teaching design, reports a much lower frequency (≈ 8%). Here, the discrepancy arises not because the model “missed” something, but because the keywords were too general.
When the same theme is redefined more precisely:

"Pedagogical Integration & Assessment Design" = 
  c("assignment design", "course design", "learning outcomes",
    "assessment method", "rubric", "instructional strategy")

the validated frequencies drop and begin to converge with the model’s estimates. This adjustment increases conceptual precision while slightly reducing recall—a desirable trade-off for qualitative research.

Balancing Precision and Recall

| Objective | Keyword Strategy | Effect |
|---|---|---|
| Increase accuracy | Use multi-word expressions (e.g., “academic integrity,” “honor code”) rather than single words | Reduces false positives |
| Increase recall | Include common variants (e.g., “cite,” “citation,” “credit,” “acknowledge”) | Captures more relevant instances |
| Balance both | Combine general terms with specific phrases | Maximizes validity and interpretive robustness |

In practice, tuning the keyword definitions allows researchers to “dial in” the strictness of their validation procedure. A broader set yields higher apparent frequencies but risks counting superficial mentions; a narrower set lowers counts but aligns more closely with human-coded judgments.

Interpretation

This behavior illustrates a deeper methodological point: keyword validation tests the literal presence of ideas, while LLM-based thematic extraction tests their conceptual expression. Both perspectives are useful.

By iteratively refining the theme_keywords list, researchers can improve agreement (raising the initial correlation of r ≈ 0.41 substantially, often to r ≈ 0.8 or higher) and use this process to calibrate the model’s semantic reasoning against transparent, rule-based criteria.

Ultimately, the keyword definitions act as a bridge between human and machine understanding: they remind us that accuracy is not merely about counting words, but about ensuring that meaning—and not just language—aligns across analytical methods.

6.4.5.3 Case Study Discussion

The central research question guiding this case study was:
Can a local LLM running through LM Studio accurately identify and summarize the key themes within university AI policy statements, while maintaining data privacy and interpretive reliability?

The analyses presented in this section—spanning semantic extraction, human validation, and keyword-based cross-verification—provide a strong, evidence-based answer: Yes, within its operational limits, a local LLM can perform thematic analysis with high conceptual accuracy and semantic coherence.

Key Findings

  1. Semantic Precision:
    The local LLM captured major thematic patterns consistent with those derived from human coding and keyword verification, particularly around academic integrity, faculty autonomy, and disclosure requirements.
    Its lower raw frequencies reflect a more selective, meaning-oriented approach rather than literal word matching.

  2. Interpretive Consistency:
    The cross-validation results (r ≈ 0.41 with the initial keyword lists, improving as the keyword definitions are refined) showed that the LLM’s thematic hierarchy points in the same direction as the structure identified through traditional text-mining approaches, with most divergences traceable to overly broad keyword anchors.

  3. Reliability Through Validation:
    Human reviewers judged all six LLM-generated themes to be conceptually sound and textually supported.
    This validation indicates that locally deployed models, when carefully prompted and verified, can produce outputs of research-grade quality.

  4. Efficiency and Ethics:
    By running entirely offline, LM Studio ensured complete data sovereignty—no institutional text left the researcher’s machine.
    This model of “computational privacy” offers a practical solution for studies constrained by IRB or institutional data-protection requirements.

Answer to the Research Question

Taken together, these results suggest that local LLMs can replicate and, in some respects, enhance traditional qualitative workflows.
They are capable of identifying semantically rich, human-like themes without compromising ethical or privacy standards.
Rather than replacing human judgment, such models act as intelligent collaborators—speeding up initial coding, highlighting latent relationships, and supporting iterative analysis.

Limitations and Future Testing

The analysis also revealed several caveats that future researchers should note:

  • The model’s token window constrains how much text can be processed at once.
    Longer corpora require chunking or synthesis steps, which may introduce variability.
  • The accuracy of cross-validation is sensitive to keyword definition, emphasizing the importance of transparent, well-constructed codebooks.
  • Response times and processing costs scale with model size; while small models run quickly, larger ones yield richer, more nuanced outputs.

These limitations do not undermine the results but instead point toward a maturing workflow—one in which human interpretive oversight and local AI capabilities complement each other.

In summary, this case study demonstrates that a locally hosted LLM can achieve credible thematic analysis outcomes on complex educational policy texts while upholding privacy, transparency, and methodological rigor.
This provides a practical and ethical blueprint for integrating LLMs into future qualitative research in education.

6.4.6 Reflection

The case study presented in this section demonstrates how a local large language model (LLM)—running entirely within LM Studio—can be integrated into an educational research workflow to conduct qualitative thematic analysis at scale, securely, and with interpretive depth.

From Tokens to Meaning

Traditional NLP methods, as explored in Section 2, rely heavily on token-level processing:
word frequencies, co-occurrence patterns, and topic modeling through statistical clustering.
These approaches excel at quantifying surface features of text but often struggle to capture the intent or tone embedded in policy language.

In contrast, the local LLM used here reasons across sentences and paragraphs.
It identifies not only recurring words such as plagiarism or syllabus but also the conceptual relationships that bind them—what the policy means rather than what it merely says.
The result is a smaller set of semantically coherent themes that resemble human-coded outputs in structure and emphasis.

The cross-validation exercise (Sections 6.4.5–6.4.5.3) illustrated this distinction empirically: the LLM produced lower absolute frequencies yet broadly tracked the thematic hierarchy found by keyword verification (r ≈ 0.41 with the initial keyword lists, with agreement strengthening as the keywords were refined). In short, the machine did not count more—it understood better.

Complementarity, Not Replacement

Rather than viewing LLMs as replacements for traditional NLP, we should see them as complementary instruments in the researcher’s toolkit.
Conventional text mining offers transparency and replicability;
LLMs contribute context, nuance, and synthesis.
When combined, the two form a hybrid analytic ecology—where numbers inform narratives and narratives refine numbers.

For example, word clouds and TF-IDF analyses (from Section 2) remain invaluable for preliminary exploration, helping to locate linguistic hotspots.
Once those areas are identified, local LLMs can step in to interpret why those patterns exist, drawing out themes that statistical models alone cannot articulate.

Privacy and Practicality

Equally important is the ethical and logistical dimension.
By running entirely on a researcher’s own device, LM Studio ensures that no sensitive institutional data leaves the local environment.
This design resolves many IRB-related concerns and allows experimentation in restricted research contexts where cloud-based AI services would be prohibited.

The workflow does, however, require patience.
Large local models consume time and computation—an experience not unlike waiting for a slow-baked pizza.
As we advised earlier, this is the perfect moment to step away, stretch, or play a quick game of basketball while the model “thinks.”
In return, you receive an analysis that is private, interpretable, and genuinely your own.

Looking Ahead: From Analysis to Collaboration

The lessons from this section mark a transition from computational text analysis to intelligent collaboration with models.
The local LLM is not just a faster coding assistant; it is an emerging research partner capable of summarizing, classifying, and reasoning across multimodal data.
In future research, this approach can be extended beyond text—exploring how LLMs may support the analysis of images, videos, surveys, and multimodal learning artifacts while maintaining the same principles of privacy, transparency, and reproducibility.

In summary:
Section 2 taught us how to count words;
Section 6 showed us how machines can interpret meaning—securely, locally, and collaboratively.
Together, they illuminate a continuum of computational methods for educational research,
bridging the measurable and the meaningful, the statistical and the semantic, the algorithmic and the human.