LLM-Based Intelligent Public Opinion Analytics Assistant
Skill by ara.so — Data Skills collection.
Overview
This project is a comprehensive public opinion analytics platform that combines real-time data from 26 hot lists across 15 mainstream platforms (Weibo, Bilibili, Zhihu, Baidu, etc.) with large language model (LLM) analysis capabilities. It provides conversational query interfaces for hot searches, topic clustering, sentiment analysis, and multi-channel push notifications (WeChat, Email, Telegram).
Key Capabilities:
- Real-time crawler cluster for 15+ platforms
- LLM-powered content analysis (including video content extraction)
- Natural language query interface
- Topic clustering and sentiment analysis
- Multi-channel alert system (Email, WeChat Work, Telegram)
- Keyboard shortcuts for crawler control
Installation
Prerequisites
- Browser Driver Setup (Required for detail page scraping):
# Check your Chrome/Edge version first
# Chrome: chrome://settings/help
# Edge: edge://settings/help
# Download matching driver:
# ChromeDriver: https://chromedriver.chromium.org/
# EdgeDriver: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
# Linux/macOS - place driver in PATH:
sudo mv chromedriver /usr/local/bin/
sudo chmod +x /usr/local/bin/chromedriver
# Verify installation:
chromedriver --version
- MySQL Database:
# Install MySQL 8.0+
# Create database and user
mysql -u root -p
CREATE DATABASE hotsearch_db CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'hotsearch_user'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL PRIVILEGES ON hotsearch_db.* TO 'hotsearch_user'@'localhost';
FLUSH PRIVILEGES;
- Python Environment:
# Clone repository
git clone https://github.com/hmmnxkl/LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant.git
cd LLM-Based-Intelligent-Public-Opinion-Analytics-Assistant
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Database Initialization
Reference the init.py file to create necessary tables:
# Example table structure (adapt from init.py)
import pymysql
connection = pymysql.connect(
host='localhost',
user='hotsearch_user',
password='your_password',
database='hotsearch_db',
charset='utf8mb4'
)
cursor = connection.cursor()
# Hot search items table
cursor.execute("""
CREATE TABLE IF NOT EXISTS hot_search_items (
id INT AUTO_INCREMENT PRIMARY KEY,
platform VARCHAR(50) NOT NULL,
rank INT,
title VARCHAR(500) NOT NULL,
url VARCHAR(1000),
heat_value VARCHAR(100),
crawl_time DATETIME NOT NULL,
detail_content TEXT,
sentiment VARCHAR(20),
INDEX idx_platform (platform),
INDEX idx_crawl_time (crawl_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
""")
connection.commit()
connection.close()
Configuration
Environment Variables
Create .env file in the project root:
# Database Configuration
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=hotsearch_user
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=hotsearch_db
# LLM API Configuration (OpenAI-compatible format)
OPENAI_API_KEY=your_api_key
OPENAI_API_BASE=https://your-llm-endpoint.com/v1
OPENAI_MODEL=gpt-4
# Huawei Pangu Model (recommended alternative)
PANGU_API_KEY=your_pangu_key
PANGU_API_BASE=https://pangu-api.huaweicloud.com
# Push Notification Channels
# Email (SMTP)
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=your_email@gmail.com
SMTP_PASSWORD=your_app_password
# WeChat Work Bot
WECHAT_WORK_WEBHOOK=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY
# WeChat Work Application
WECHAT_WORK_CORP_ID=your_corp_id
WECHAT_WORK_APP_SECRET=your_app_secret
WECHAT_WORK_AGENT_ID=your_agent_id
# Telegram Bot
TELEGRAM_BOT_TOKEN=your_bot_token
TELEGRAM_CHAT_ID=your_chat_id
Crawler Settings
Edit hotsearchcrawler/settings.py:
# MySQL Connection Pool
MYSQL_CONFIG = {
'host': os.getenv('MYSQL_HOST', 'localhost'),
'port': int(os.getenv('MYSQL_PORT', 3306)),
'user': os.getenv('MYSQL_USER'),
'password': os.getenv('MYSQL_PASSWORD'),
'database': os.getenv('MYSQL_DATABASE'),
'charset': 'utf8mb4',
'autocommit': True
}
# Optional: Platform-specific cookies for authenticated access
PLATFORM_COOKIES = {
'weibo': 'your_weibo_cookies', # Optional, for better access
'bilibili': 'your_bilibili_cookies'
}
# Concurrent requests
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1
# User-Agent rotation
USER_AGENTS = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
]
Usage
Starting the System
# Activate virtual environment
source venv/bin/activate
# Start the main application (web interface + API)
python app.py
# Access web interface at http://localhost:5000
Crawler Management
# Manual crawler test (single platform)
cd hotsearchcrawler
python runspider-test.py
# Start all crawlers (typically triggered via web UI)
python run_spiders.py
Via Web Interface:
- Use keyboard shortcuts to start/stop crawlers
- View real-time crawling status
- Monitor data collection metrics
Natural Language Queries
# Examples of conversational queries via web interface:
# "Show me today's top 10 trending topics on Weibo"
# "What's trending about AI technology across all platforms?"
# "Analyze sentiment for news about electric vehicles"
# "Cluster topics related to economic policy"
# "Compare hot topics between Bilibili and Zhihu"
Programmatic API Usage
from hotsearch_analysis_agent.analyzer import OpinionAnalyzer
from datetime import datetime, timedelta
# Initialize analyzer
analyzer = OpinionAnalyzer()
# Query hot searches
results = analyzer.query_hot_searches(
platforms=['weibo', 'zhihu', 'bilibili'],
time_range=(datetime.now() - timedelta(hours=24), datetime.now()),
keyword='人工智能'
)
# Perform sentiment analysis
sentiment = analyzer.analyze_sentiment(results)
print(f"Overall sentiment: {sentiment['overall']}")
print(f"Positive: {sentiment['positive_ratio']}%")
# Topic clustering
clusters = analyzer.cluster_topics(results, num_clusters=5)
for i, cluster in enumerate(clusters):
print(f"Cluster {i+1}: {cluster['keywords']}")
print(f" Items: {len(cluster['items'])}")
Push Notification Setup
from hotsearch_analysis_agent.push_service import PushService
# Initialize push service
push_service = PushService()
# Create scheduled push task
task = push_service.create_task(
name="AI Technology Daily Report",
keywords=['人工智能', '大模型', '机器学习'],
platforms=['weibo', 'zhihu', 'bilibili'],
schedule='0 8,12,18 * * *', # Cron format: 8am, 12pm, 6pm daily
channels=['wechat_work', 'email'],
threshold={'heat_value': 100000, 'sentiment': 'positive'}
)
# Test push task
python test_push_task.py
Analysis Report Generation
from hotsearch_analysis_agent.report_generator import ReportGenerator
generator = ReportGenerator()
# Generate comprehensive report
report = generator.generate_report(
topic="人工智能与前沿科技",
time_range=(datetime.now() - timedelta(days=7), datetime.now()),
include_sentiment=True,
include_clustering=True,
include_trend_analysis=True
)
# Report includes:
# - Core findings with data highlights
# - Detailed news content with source URLs
# - Sentiment distribution
# - Topic clusters
# - Trend analysis
# - Information spread characteristics
# Save report
report.save_markdown('output/ai_tech_report.md')
report.save_pdf('output/ai_tech_report.pdf')
Common Patterns
Multi-Platform Data Aggregation
from hotsearch_analysis_agent.aggregator import DataAggregator
aggregator = DataAggregator()
# Fetch and merge data from multiple platforms
merged_data = aggregator.aggregate(
platforms=['weibo', 'douyin', 'zhihu', 'bilibili', 'baidu'],
dedup_threshold=0.8, # Similarity threshold for deduplication
sort_by='heat_value',
limit=50
)
# Cross-platform topic correlation
correlations = aggregator.find_correlations(merged_data)
print(f"Found {len(correlations)} cross-platform trending topics")
Video Content Analysis
# The system automatically extracts text from video news
# using browser automation and LLM analysis
from hotsearch_analysis_agent.video_analyzer import VideoAnalyzer
video_analyzer = VideoAnalyzer()
# Analyze video-based hot topics (e.g., from Bilibili, Douyin)
video_topics = video_analyzer.extract_content(
url='https://www.bilibili.com/video/BV13pSoBBEvX/',
extract_comments=True,
max_comments=100
)
print(f"Video title: {video_topics['title']}")
print(f"Description: {video_topics['description']}")
print(f"Top comments sentiment: {video_topics['comments_sentiment']}")
Custom LLM Integration
from hotsearch_analysis_agent.llm_client import LLMClient
# Use Huawei Pangu Model (recommended)
llm = LLMClient(
api_base=os.getenv('PANGU_API_BASE'),
api_key=os.getenv('PANGU_API_KEY'),
model='pangu-embedded-7b'
)
# Or use any OpenAI-compatible endpoint
llm = LLMClient(
api_base=os.getenv('OPENAI_API_BASE'),
api_key=os.getenv('OPENAI_API_KEY'),
model='gpt-4'
)
# Analyze custom content
analysis = llm.analyze(
content=news_content,
task='sentiment_and_summary',
language='zh'
)
Scheduled Monitoring
from hotsearch_analysis_agent.scheduler import MonitorScheduler
scheduler = MonitorScheduler()
# Add monitoring rule
scheduler.add_rule(
name="Tech Company Crisis Monitoring",
keywords=['某公司', '丑闻', '争议'],
alert_conditions={
'heat_spike': 2.0, # 2x normal heat
'sentiment_drop': -0.3, # 30% sentiment decrease
'platforms_count': 3 # Trending on 3+ platforms
},
notification_channels=['wechat_work', 'telegram', 'email'],
urgent=True
)
# Start scheduler
scheduler.start()
Troubleshooting
Browser Driver Issues
# Error: "Message: 'chromedriver' executable needs to be in PATH"
# Solution: Verify driver installation
which chromedriver # Should return path
# If not found, reinstall:
# 1. Check browser version
google-chrome --version # or microsoft-edge --version
# 2. Download exact matching driver version
# 3. Place in /usr/local/bin/ and chmod +x
# Alternative: Specify driver path in settings
CHROMEDRIVER_PATH=/path/to/chromedriver
Database Connection Errors
# Error: "Can't connect to MySQL server"
# Check MySQL service
sudo systemctl status mysql
# Verify credentials
mysql -u hotsearch_user -p -h localhost hotsearch_db
# Check .env file encoding (must be UTF-8 without BOM)
file -I .env # Should show charset=utf-8
# Test connection in Python
import pymysql
try:
conn = pymysql.connect(
host=os.getenv('MYSQL_HOST'),
user=os.getenv('MYSQL_USER'),
password=os.getenv('MYSQL_PASSWORD'),
database=os.getenv('MYSQL_DATABASE')
)
print("Connection successful")
except Exception as e:
print(f"Error: {e}")
Crawler Rate Limiting
# Error: HTTP 429 or blocked requests
# Solution: Adjust crawler settings
# In hotsearchcrawler/settings.py:
CONCURRENT_REQUESTS = 8 # Reduce from 16
DOWNLOAD_DELAY = 2 # Increase delay
# Enable AutoThrottle
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
# Rotate User-Agents and proxies
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
LLM API Timeouts
# Error: Request timeout or rate limit
# Solution: Implement retry logic and fallback
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt):
return llm.analyze(prompt)
# Use batch processing for large datasets
from hotsearch_analysis_agent.batch_processor import BatchProcessor
processor = BatchProcessor(batch_size=10, delay=2)
results = processor.process_items(news_items, analyze_func)
Memory Issues with Large Datasets
# Error: MemoryError or slow processing
# Solution: Use pagination and streaming
from hotsearch_analysis_agent.db_client import DBClient
db = DBClient()
# Stream results instead of loading all at once
for batch in db.stream_hot_searches(batch_size=100):
process_batch(batch)
# Process and discard to free memory
# Use database aggregation instead of in-memory
aggregated = db.aggregate_by_platform(
start_date='2026-01-01',
end_date='2026-05-01'
)
Project Structure Reference
.
├── app.py # Main application entry
├── hotsearch_analysis_agent/ # Analysis system
│ ├── analyzer.py # Core analysis logic
│ ├── llm_client.py # LLM integration
│ ├── report_generator.py # Report generation
│ ├── push_service.py # Notification service
│ └── scheduler.py # Task scheduling
├── hotsearchcrawler/ # Crawler cluster
│ ├── spiders/ # Platform-specific spiders
│ ├── settings.py # Crawler settings
│ └── run_spiders.py # Crawler launcher
├── test_push_task.py # Push notification testing
├── runspider-test.py # Single crawler testing
├── init.py # Database initialization
├── requirements.txt # Python dependencies
└── .env # Environment configuration
Best Practices
- Database Indexing: Ensure indexes on
platform,crawl_time, andtitlecolumns for fast queries - LLM Cost Management: Cache analysis results to avoid redundant API calls
- Crawler Politeness: Respect platform rate limits and robots.txt
- Notification Throttling: Implement cooldown periods to avoid alert fatigue
- Data Retention: Set up automatic archival for data older than 90 days
- Model Choice: Consider Huawei Pangu for better Chinese language understanding and local deployment

