Big Data Recruitment Data Visualization System (Source Code + Paper)
0 Introduction
Over the past few years, graduation projects have become steadily more demanding and difficult. Traditional topics lack innovation and highlights and often fail to meet the requirements of the graduation defense. In recent years, many junior students have told the senior that their project systems did not satisfy their advisors' requirements.
To help everyone pass the graduation project with minimal effort, the senior shares high-quality graduation project topics. Today's topic is
🚩 Big Data Recruitment Data Visualization System (Source Code + Paper)
🥇 The senior gives a comprehensive score for this topic (each item is out of 5 points)
Difficulty: 3 points
Workload: 3 points
Innovation: 4 points
🧿 Project Sharing: See the end of the article!
1 Project Running Effect
Video Effect:
Big Data Recruitment Data Visualization System
2 Topic Background
With the rapid development of technology, data has grown explosively, and no one can avoid dealing with it. The demand for data professionals keeps rising as well. Understanding what kind of talent companies need today, and what skills they require, is therefore very useful for both students and job seekers.
This paper focuses on the Boss Zhipin website and uses the Scrapy framework to scrape postings for positions related to big data, data analysis, data mining, machine learning, and artificial intelligence in major cities across the country. It analyzes and compares the salary and educational requirements of different positions, the demand for relevant talent across regions and industries, and the knowledge and skill requirements of different positions.
3 Project Implementation
3.1 Overview
This project is divided into three subtasks: data collection, data preprocessing, and data analysis/visualization.
Project Flowchart
Project Architecture Diagram
3.2 Data Collection
Scrapy Crawler Introduction
Scrapy is a crawler framework built on Twisted that can extract data from a wide range of sources. Its architecture is clear, its modules are loosely coupled, and it is highly extensible and efficient, so it can flexibly satisfy all kinds of crawling needs and cope with most anti-crawling measures; it is currently the most widely used crawler framework in Python. The framework mainly consists of five components: Scheduler, Downloader, Spider, Item Pipeline, and Scrapy Engine. Their functions are as follows (the minimal skeleton after this list shows how they map to user code):
- Scheduler: Simply put, it can be thought of as a queue of URLs (the addresses of the pages to be crawled). It decides which URL to crawl next and removes duplicate URLs so that no work is wasted. Users can customize the scheduler to their own needs.
- Downloader: It carries the heaviest workload of all the components and is used to download resources from the network quickly. The downloader's code is not especially complex, but it is efficient, mainly because it is built on Twisted's asynchronous model (in fact, the entire framework is built on this model).
- Spider: This is the part users care about most. Users write their own spider (using XPath, CSS selectors, regular expressions, and similar syntax) to extract the information they need from specific pages, i.e., the items. They can also extract links so that Scrapy continues crawling the next page.
- Item Pipeline: Used to process the items extracted by the spider. Its main functions include persisting items, validating them, and removing unnecessary information.
- Scrapy Engine: The engine is the core of the entire framework. It coordinates the Scheduler, Downloader, and Spider; like the CPU of a computer, it controls the whole process.
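To make the division of labor concrete, below is a minimal, generic spider sketch; the site and selectors (quotes.toscrape.com, div.quote, and so on) are placeholders for illustration only and are not part of this project.
import scrapy


class DemoSpider(scrapy.Spider):
    """Minimal illustration of how user code maps onto the five components."""
    name = 'demo'
    # start_urls are handed to the Engine, which queues them in the Scheduler
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # The response arrives here after the Scheduler dispatched the request
        # to the Downloader; extracting data is the Spider's job.
        for quote in response.css('div.quote'):
            # Each yielded dict is an item that is handed to the Item Pipeline
            yield {'text': quote.css('span.text::text').get()}
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            # New requests go back through the Engine to the Scheduler
            yield response.follow(next_page, callback=self.parse)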
Scrapy Official Architecture Diagram
We crawl job data for popular cities on Boss Zhipin and save it in CSV format, as shown in the following figure:
Writing the crawler program
After creating and configuring the Scrapy project, we can write the Scrapy crawler program.
import scrapy
import json
import logging
import random
from example.items import ExampleItem
class ExampleSpider(scrapy.Spider):
name = 'example'
allowed_domains = ['example.com']
start_urls = ['https://www.example.com/api/cityGroup.json'] # URL for popular cities list
    # Set multiple cookies; the recommended number is (pages to crawl / 2) + 1, and at least 4
# Just copy the __zp_stoken__ part
cookies = [ '__zp_stoken__=f330bOEgsRnsAIS5Bb2FXe250elQKNzAgMBcQZ1hvWyBjUFE1DCpKLWBtBn99Nwd%2BPHtlVRgdOi1vDEAkOz9sag50aRNRfhs6TQ9kWmNYc0cFI3kYKg5fAGVPPX0WO2JCOipvRlwbP1YFBQlHOQ%3D%3D', '__zp_stoken__=f330bOEgsRnsAIUsENEIbe250elRsb2U4Bg0QZ1hvW19mPEdeeSpKLWBtN3Y9QCN%2BPHtlVRgdOilvfTYkSTMiaFN0X3NRAGMjOgENX2krc0cFI3kYKiooQGx%2BPX0WO2I3OipvRlwbP1YFBQlHOQ%3D%3D', '__zp_stoken__=f330bOEgsRnsAITsLNnJIe250elRJMH95DBAQZ1hvW1J1ewdmDCpKLWBtBHZtagV%2BPHtlVRgdOil1LjkkR1MeRAgdY3tXbxVORWVuTxQlc0cFI3kYKgwCEGxNPX0WO2JCOipvRlwbP1YFBQlHOQ%3D%3D'
]
# Set multiple request headers
user_agents = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.15',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)',
]
page_number = 1 # Initialize pagination
def random_header(self):
"""
Randomly generate request headers
:return: headers
"""
headers = {'Referer': 'https://www.example.com/c101020100/?ka=sel-city-101020100'}
headers['cookie'] = random.choice(self.cookies)
headers['user-agent'] = random.choice(self.user_agents)
return headers
def parse(self, response):
"""
Parse the list of popular cities on the homepage and select popular cities for crawling
:param response: Dictionary data of popular cities
:return:
"""
# Get server returned content
city_group = json.loads(response.body.decode())
# Get list of popular cities
hot_city_list = city_group['zpData']['hotCityList']
# Initialize an empty list to store printed information
# city_lst = []
# for index,item in enumerate(hot_city_list):
        #     city_lst.append({index + 1: item['name']})
# List comprehension:
hot_city_names = [{index + 1: item['name']} for index, item in enumerate(hot_city_list)]
print("--->", hot_city_names)
# Get city number from keyboard
city_no = int(input('Please select a number from the above city list to start crawling:'))
# Build query interface
# Get city code
city_code = hot_city_list[city_no - 1]['code']
# Build query URL
city_url = 'https://www.example.com/job_detail/?query=&city={}&industry=&position='.format(city_code)
logging.info("<<<<<<<<<<<<< Crawling job data for page _{}_ >>>>>>>>>>>>".format(self.page_number))
yield scrapy.Request(url=city_url, headers=self.random_header(), callback=self.parse_city)
def parse_city(self, response):
"""
Parse job page data
:param response: Job page response data
:return:
"""
if response.status != 200:
logging.warning("<<<<<<<<<<<<< Failed to get city job information, IP has been blocked. Please try again later >>>>>>>>>>>>")
return
li_elements = response.xpath('//div[@class="job-list"]/ul/li') # Locate all li tags
next_url = response.xpath('//div[@class="page"]/a[last()]/@href').get() # Get next page
for li in li_elements:
job_title = li.xpath('./div/div[1]//div[@class="job-title"]/span[1]/a/text()').get()
job_location = li.xpath('./div/div[1]//div[@class="job-title"]/span[2]/span[1]/text()').get()
job_salary = li.xpath('./div/div[1]//span[@class="red"]/text()').get()
company_name = li.xpath('./div/div[1]/div[2]//div[@class="company-text"]/h3/a/text()').get()
company_type = li.xpath('./div/div[1]/div[2]/div[1]/p/a/text()').get()
company_size = li.xpath('./div/div[1]/div[2]/div[1]/p/text()[2]').get()
financial_stage = li.xpath('./div/div[1]/div[2]/div[1]/p/text()[1]').get()
experience_required = li.xpath('./div/div[1]/div[1]/div[1]/div[2]/p/text()[1]').get()
education_level = li.xpath('./div/div[1]/div[1]/div[1]/div[2]/p/text()[2]').get()
job_benefits = li.xpath('./div/div[2]/div[2]/text()').get()
item = ExampleItem(job_title=job_title, job_location=job_location, job_salary=job_salary, company_name=company_name,
company_type=company_type, company_size=company_size,
financial_stage=financial_stage, experience_required=experience_required, education_level=education_level,
job_benefits=job_benefits)
yield item
if next_url == "javascript:;":
logging.info('<<<<<<<<<<<<< Job data for popular cities has been crawled >>>>>>>>>>>>')
logging.info("<<<<<<<<<<<<< A total of _{}_ pages of job data has been crawled >>>>>>>>>>>>".format(self.page_number))
return
next_url = response.urljoin(next_url) # URL concatenation
self.page_number += 1
logging.info("<<<<<<<<<<<<< Crawling job data for page _{}_ >>>>>>>>>>>>".format(self.page_number))
yield scrapy.Request(url=next_url, headers=self.random_header(), callback=self.parse_city)
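The spider above imports ExampleItem from example/items.py, which is not listed in this article. Below is a minimal sketch of what that file might look like, with the field names taken directly from the spider; treat it as an assumption rather than the project's actual file.
import scrapy


class ExampleItem(scrapy.Item):
    # One Field per value the spider extracts from a job listing
    job_title = scrapy.Field()
    job_location = scrapy.Field()
    job_salary = scrapy.Field()
    company_name = scrapy.Field()
    company_type = scrapy.Field()
    company_size = scrapy.Field()
    financial_stage = scrapy.Field()
    experience_required = scrapy.Field()
    education_level = scrapy.Field()
    job_benefits = scrapy.Field()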
Saving Data
from itemadapter import ItemAdapter
class ExamplePipeline:
def process_item(self, item, spider):
"""
Save data to local CSV file
:param item: Data item
:param spider:
:return:
"""
        # Append one comma-separated record per item; end it with a newline so rows do not run together
        with open(file='Popular-City-Job-Data.csv', mode='a+', encoding='utf8') as f:
            f.write(
                '{job_title},{job_location},{job_salary},{company_name},{company_type},{company_size},{financial_stage},{experience_required},'
                '{education_level},{job_benefits}\n'.format(**item))
return item
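The pipeline only runs if it is registered in the project's settings.py. Below is a minimal sketch of the relevant settings; the module path example.pipelines is assumed from the project name, and values such as the download delay are illustrative rather than the project's actual configuration.
# example/settings.py (excerpt)
BOT_NAME = 'example'
ROBOTSTXT_OBEY = False     # fetch the target pages directly
DOWNLOAD_DELAY = 2         # slow down requests to reduce the risk of being blocked
COOKIES_ENABLED = False    # cookies are supplied manually via the request headers
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,   # register the CSV pipeline
}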
Edit the local CSV file and add the following header row as its first line:
job_title,job_location,job_salary,company_name,company_type,company_size,financial_stage,experience_required,education_level,job_benefits
3.3 Data Cleaning and Preprocessing
After writing and running the crawler program, we have the job data for popular cities on Boss Zhipin saved locally. Observing the data, we find a lot of dirty and tightly coupled fields, which must be cleaned and preprocessed before use.
Requirements:
Read the `Popular-City-Job-Data.csv` file
Clean duplicate rows.
Preprocess the `job_location` field. Requirement: Beijing·Haidian·Xibeiwang --> Beijing, Haidian, Xibeiwang. Split into 3 fields
Preprocess the `job_salary` field. Requirement: 30-60K·15薪 (i.e., 15 months' pay) --> Minimum: 30, Maximum: 60
Preprocess the `experience_required` field. Requirement: Experience not required/Internship/Graduate: 0, 1-3 years: 1, 3-5 years: 2, 5-10 years: 3, Over 10 years: 4
Preprocess the `company_size` field. Requirement: Less than 500 people: 0, 500-999: 1, 1000-9999: 2, More than 10000 people: 3
Preprocess the `job_benefits` field. Requirement: Replace Chinese full-width commas '，' in the description with English commas ','
Clean rows with missing values.
Save the processed data to MySQL database
Writing the code for cleaning and preprocessing
# -*- coding:utf-8 -*-
"""
Author: jhzhong
Function: Clean and preprocess job data
Requirements:
1. Read `Popular-City-Job-Data.csv` file
2. Clean duplicate rows.
3. Preprocess the `job_location` field. Requirement: Beijing·Haidian·Xibeiwang --> Beijing, Haidian, Xibeiwang. Split into 3 fields
4. Preprocess the `job_salary` field. Requirement: 30-60K·15薪 (i.e., 15 months' pay) --> Minimum: 30, Maximum: 60
5. Preprocess the `experience_required` field. Requirement: Experience not required/Internship/Graduate: 0, 1-3 years: 1, 3-5 years: 2, 5-10 years: 3, Over 10 years: 4
6. Preprocess the `company_size` field. Requirement: Less than 500 people: 0, 500-999: 1, 1000-9999: 2, More than 10000 people: 3
7. Preprocess the `job_benefits` field. Requirement: Replace Chinese full-width commas '，' in the description with English commas ','
8. Clean rows with missing values.
9. Save the processed data to MySQL database
"""
# Import modules
import pandas as pd
from sqlalchemy import create_engine
import logging
# Read `Popular-City-Job-Data.csv` file
all_city_zp_df = pd.read_csv('../Popular-City-Job-Data.csv', encoding='utf8')
# Clean duplicate rows.
all_city_zp_df.drop_duplicates(inplace=True)
# Preprocess the `job_location` field. Requirement: Beijing·Haidian·Xibeiwang --> Beijing, Haidian, Xibeiwang. Split into 3 fields
all_city_zp_area_df = all_city_zp_df['job_location'].str.split('·', expand=True)
all_city_zp_area_df = all_city_zp_area_df.rename(columns={0: "city", 1: "district", 2: "street"})
# Preprocess the `job_salary` field. Requirement: 30-60K·15薪 --> Minimum: 30, Maximum: 60
all_city_zp_salary_df = all_city_zp_df['job_salary'].str.split('K', expand=True)[0].str.split('-', expand=True)
all_city_zp_salary_df = all_city_zp_salary_df.rename(columns={0: 'salary_min', 1: 'salary_max'})
# Preprocess the `experience_required` field. Requirement: Experience not required/Internship/Graduate: 0, 1-3 years: 1, 3-5 years: 2, 5-10 years: 3, Over 10 years: 4
def fun_experience(x):
    if x == "1-3 years":
        return 1
    elif x == "3-5 years":
        return 2
    elif x == "5-10 years":
        return 3
    elif x == "Over 10 years":
        return 4
    else:  # Experience not required / Internship / Graduate
        return 0
all_city_zp_df['experience_required'] = all_city_zp_df['experience_required'].apply(fun_experience)
# Preprocess the `company_size` field. Requirement: Less than 500 people: 0, 500-999: 1, 1000-9999: 2, More than 10000 people: 3
def fun_company_size(x):
    if x == "500-999 people":
        return 1
    elif x == "1000-9999 people":
        return 2
    elif x == "More than 10000 people":
        return 3
    else:  # Less than 500 people
        return 0
# Apply the mapping to the company_size column
all_city_zp_df['company_size'] = all_city_zp_df['company_size'].apply(fun_company_size)
# Preprocess the `job_benefits` field. Requirement: Replace Chinese full-width commas '，' with English commas ','
all_city_zp_df['job_benefits'] = all_city_zp_df['job_benefits'].str.replace('，', ',')
# Merge all datasets
clean_all_city_zp_df = pd.concat([all_city_zp_df, all_city_zp_salary_df, all_city_zp_area_df], axis=1)
# Delete redundant columns
clean_all_city_zp_df.drop('job_location', axis=1, inplace=True) # Delete original area
clean_all_city_zp_df.drop('job_salary', axis=1, inplace=True) # Delete original salary
# Clean rows with missing values.
clean_all_city_zp_df.dropna(axis=0, how='any', inplace=True)
clean_all_city_zp_df.drop(axis=0,
index=(clean_all_city_zp_df.loc[(clean_all_city_zp_df['job_benefits'] == 'None')].index),
inplace=True)
# Save the processed data to MySQL database
engine = create_engine('mysql+pymysql://root:123456@localhost:3306/example_db?charset=utf8')
clean_all_city_zp_df.to_sql('t_example_info', con=engine, if_exists='replace')
logging.info("Write to MySQL Successfully!")
Run the program and check whether the data has been cleaned successfully and inserted into the database.
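A quick way to verify that the cleaned data reached MySQL is to read the table back with pandas; the snippet below is a minimal sketch that reuses the connection string from the preprocessing script, so adjust it to your own environment.
import pandas as pd
from sqlalchemy import create_engine

# Same connection string as in the preprocessing script
engine = create_engine('mysql+pymysql://root:123456@localhost:3306/example_db?charset=utf8')

# Read back a few cleaned rows and the total row count
print(pd.read_sql('SELECT * FROM t_example_info LIMIT 5', con=engine))
print('Total rows:', pd.read_sql('SELECT COUNT(1) AS n FROM t_example_info', con=engine)['n'].iloc[0])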
4 Data Analysis and Visualization
After successfully running the two stages above, we have high-quality data ready for analysis. We then use Python and SQL scripts to analyze the data along multiple dimensions and use Highcharts for visualization. The whole analysis and visualization layer is served by the lightweight web framework Flask.
Flask Framework Introduction
Flask is a lightweight web application framework based on Werkzeug and Jinja2. Compared with similar frameworks, it is more flexible, lighter-weight, and secure, and it is easy to learn and works well with the MVC pattern. Flask is also highly customizable: developers can add exactly the features they need while keeping the core simple, and its rich ecosystem of extensions makes it possible to build powerful, highly customized websites.
In this project, when the backend is developed with Flask, requests from the front end run into cross-origin (CORS) issues. These can be solved by switching the data type to JSONP, using GET requests, or adding response headers on the Flask side; here we use the Flask-CORS extension, as sketched below. In addition, the front end needs the axios request library.
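A minimal sketch of enabling CORS with Flask-CORS (install the flask-cors package first); applying it to the whole app, as below, is one option, and it can also be restricted to specific routes or origins.
from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # allow cross-origin requests from the front end (e.g., axios calls)

@app.route('/ping')
def ping():
    return 'pong'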
Flask Framework Diagram
Writing Visualization Code
# -*- coding:utf-8 -*-
"""
Author: jhzhong
Function: Data Analysis and Visualization
"""
from flask import Flask, render_template
from example.web.dbutils import DBUtils
import json
app = Flask(__name__)
def get_db_conn():
"""
Get database connection
:return: db_conn database connection object
"""
return DBUtils(host='localhost', user='root', passw='123456', db='example_db')
def msg(status, data='No data loaded'):
"""
:param status: Status code 200 success, 201 no data found
:param data: Response data
:return: Dictionary such as {'status': 201, 'data': 'No data loaded'}
"""
return json.dumps({'status': status, 'data': data})
@app.route('/')
def index():
"""
Homepage
:return: index.html jump to homepage
"""
return render_template('index.html')
@app.route('/getwordcloud')
def get_word_cloud():
"""
Get job benefits word cloud data
:return:
"""
db_conn = get_db_conn()
text = \
db_conn.get_one(sql_str="SELECT GROUP_CONCAT(job_benefits) FROM t_example_info")[0]
if text is None:
return msg(201)
return msg(200, text)
@app.route('/getjobinfo')
def get_job_info():
"""
Get distribution of job locations
:return:
"""
db_conn = get_db_conn()
results = db_conn.get_all(
sql_str="SELECT city,district,COUNT(1) as num FROM t_example_info GROUP BY city,district")
# {"city":"Beijing","info":[{"district":"Chaoyang District","num":27},{"Haidian District":43}]}
if results is None or len(results) == 0:
return msg(201)
data = []
city_detail = {}
for r in results:
info = {'name': r[1], 'value': r[2]}
if r[0] not in city_detail:
city_detail[r[0]] = [info]
else:
city_detail[r[0]].append(info)
for k, v in city_detail.items():
temp = {'name': k, 'data': v}
data.append(temp)
return msg(200, data)
@app.route('/getjobnum')
def get_job_num():
"""
Get number of jobs per city
:return:
"""
db_conn = get_db_conn()
results = db_conn.get_all(sql_str="SELECT city,COUNT(1) num FROM t_example_info GROUP BY city")
if results is None or len(results) == 0:
return msg(201)
data = []
for r in results:
data.append(list(r))
return msg(200, data)
@app.route('/getcomtypenum')
def get_com_type_num():
"""
Get proportion of company types
:return:
"""
db_conn = get_db_conn()
results = db_conn.get_all(
sql_str="SELECT com_type, ROUND(COUNT(1)/(SELECT SUM(t1.num) FROM (SELECT COUNT(1) num FROM t_example_info GROUP BY com_type) t1)*100,2) percent FROM t_example_info GROUP BY com_type")
if results is None or len(results) == 0:
return msg(201)
data = []
for r in results:
data.append({'name': r[0], 'y': float(r[1])})
return msg(200, data)
# Pie chart
@app.route('/geteducationnum')
def geteducationnum():
"""
Get proportion of education levels
:return:
"""
db_conn = get_db_conn()
results = db_conn.get_all(
sql_str="SELECT t1.education,ROUND(t1.num/(SELECT SUM(t2.num) FROM(SELECT COUNT(1) num FROM t_example_info t GROUP BY t.education)t2)*100,2) FROM( SELECT t.education,COUNT(1) num FROM t_example_info t GROUP BY t.education) t1")
if results is None or len(results) == 0:
return msg(201)
data = []
for r in results:
data.append([r[0], float(r[1])])
return msg(200, data)
# Get ranking
@app.route('/getorder')
def getorder():
"""
Get ranking of company recruitment numbers
:return:
"""
db_conn = get_db_conn()
results = db_conn.get_all(
sql_str="SELECT t.com_name,COUNT(1) FROM t_example_info t GROUP BY t.com_name ORDER BY COUNT(1) DESC LIMIT 10")
if results is None or len(results) == 0:
return msg(201)
data = []
for i, r in enumerate(results):
data.append({'id': i + 1,
'name': r[0],
'num': r[1]})
return msg(200, data)
if __name__ == '__main__':
app.run(host='127.0.0.1', port=8080, debug=True)
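The application imports DBUtils from example.web.dbutils, which is not listed here. Below is a minimal sketch of such a helper based on pymysql; it is an assumption about the actual implementation, kept to the get_one/get_all interface and constructor arguments used above.
# example/web/dbutils.py -- minimal sketch based on pymysql
import pymysql


class DBUtils:
    def __init__(self, host, user, passw, db):
        # One connection per DBUtils instance
        self.conn = pymysql.connect(host=host, user=user, password=passw,
                                    database=db, charset='utf8mb4')

    def get_one(self, sql_str):
        """Run a query and return the first row as a tuple (or None)."""
        with self.conn.cursor() as cursor:
            cursor.execute(sql_str)
            return cursor.fetchone()

    def get_all(self, sql_str):
        """Run a query and return all rows as a tuple of tuples."""
        with self.conn.cursor() as cursor:
            cursor.execute(sql_str)
            return cursor.fetchall()

    def close(self):
        self.conn.close()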
Due to space limitations, the more detailed design is available in the design paper.
6 Conclusion
Project Contents
Complete Detailed Design Paper
🧿 Project Sharing: See the end of the article!