
How to Scrape LinkedIn Using Python and Playwright

Scraping LinkedIn can be a challenging task due to its sophisticated anti-scraping mechanisms. However, with the right tools and techniques, it is possible to extract valuable data from LinkedIn profiles. In this blog, we will explore how to use Python and Playwright to scrape LinkedIn effectively.

Understanding the Tools

Before we dive into the scraping process, let's understand the tools we will be using:

  • Python: A versatile programming language that is widely used for web scraping due to its powerful libraries and ease of use.
  • Playwright: A browser automation library that lets us control real web browsers programmatically. Because it executes a site's JavaScript in an actual browser, it sidesteps much of the reverse engineering that client-side encryption would otherwise require [1].

Setting Up the Environment

To scrape LinkedIn using Python and Playwright, you will need to set up your environment first. Ensure you have Python installed on your system. Then, install Playwright using the following command:

pip install playwright

After installing Playwright, you need to run the following command to download the necessary browser binaries:

playwright install
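
Before writing the scraper, you can confirm the installation works with a short smoke test. This is a minimal sketch, not part of the scraper itself; example.com is used here purely as a neutral test page:

from playwright.sync_api import sync_playwright

# Smoke test: launches Chromium, loads a page, and prints its title.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com')
    print(page.title())  # Expect: Example Domain
    browser.close()

If this prints a title, Playwright and its browser binaries are installed correctly.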

Writing the Scraper

Now that we have our environment set up, let's write the scraper. We will use Playwright to automate the browser and navigate through LinkedIn pages.

from playwright.sync_api import sync_playwright

def scrape_linkedin():
    with sync_playwright() as p:
        # Launch the browser (headless=False lets you watch it work)
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        # Log in to LinkedIn
        page.goto('https://www.linkedin.com/login')
        page.fill('input[name="session_key"]', 'your_email@example.com')
        page.fill('input[name="session_password"]', 'your_password')
        page.click('button[type="submit"]')

        # Wait for the post-login navigation to settle
        page.wait_for_load_state('networkidle')

        # Navigate to the profile you want to scrape
        page.goto('https://www.linkedin.com/in/target-profile/')

        # Extract the data you need. LinkedIn's markup changes often,
        # so verify this selector in DevTools. query_selector returns
        # None when nothing matches, so guard before calling inner_text.
        name_element = page.query_selector('div.ph5.pb5 > div.display-flex.mt2 ul li')
        if name_element is not None:
            print(f'Name: {name_element.inner_text()}')
        else:
            print('Name element not found; the selector is likely outdated.')

        # Add more extraction logic as needed

        # Close the browser
        browser.close()

if __name__ == '__main__':
    scrape_linkedin()
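
Save the script as scrape_linkedin.py (the filename is arbitrary), replace the placeholder credentials and profile URL with your own, and run it from the terminal:

python scrape_linkedin.py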

Handling Authentication

LinkedIn requires users to log in to view most profiles. The script above performs a simple credential-based login; in a real project, avoid hard-coding credentials (load them from environment variables instead), handle authentication carefully, and respect LinkedIn's terms of service. Repeated logins can also trigger security checkpoints, so it helps to reuse an authenticated session across runs, as sketched below.
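
Here is a minimal sketch of that session reuse, built on Playwright's storage_state API. The state-file path and the LINKEDIN_EMAIL / LINKEDIN_PASSWORD environment-variable names are assumptions for illustration, not part of the original script:

import os
from pathlib import Path
from playwright.sync_api import sync_playwright

STATE_FILE = 'linkedin_state.json'  # assumed path for the saved session

def open_linkedin_context(p):
    browser = p.chromium.launch(headless=False)
    if Path(STATE_FILE).exists():
        # Reuse cookies/local storage captured during a previous login
        return browser.new_context(storage_state=STATE_FILE)
    context = browser.new_context()
    page = context.new_page()
    page.goto('https://www.linkedin.com/login')
    # Credentials come from environment variables (assumed names)
    page.fill('input[name="session_key"]', os.environ['LINKEDIN_EMAIL'])
    page.fill('input[name="session_password"]', os.environ['LINKEDIN_PASSWORD'])
    page.click('button[type="submit"]')
    page.wait_for_load_state('networkidle')
    # Persist the session so later runs can skip the login form
    context.storage_state(path=STATE_FILE)
    return context

Reusing a saved session means fewer login events for LinkedIn's security systems to flag, though a stale state file will still force a fresh login.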

Data Extraction

Once logged in, you can navigate to the target profile and use Playwright's selectors to extract the data you need. The example provided extracts the profile name, but you can extend this to gather other information such as experience, education, and skills.
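
As a sketch of how that extension might look, the snippet below collects the text of list items under the experience section. The selector is a placeholder assumption; LinkedIn's class names and page structure change frequently, so inspect the live page in DevTools and adjust accordingly:

# Assumes `page` is already logged in and on the target profile.
page.wait_for_load_state('networkidle')

# Placeholder selector: list items inside the section containing the
# element with id="experience". Verify against the live markup.
for item in page.query_selector_all('section:has(div#experience) li'):
    text = item.inner_text().strip()
    if text:
        print(text)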

Ethical Considerations and LinkedIn's Policy

It's crucial to note that scraping LinkedIn or any other website should be done ethically and in compliance with the site's terms of service. LinkedIn has strict policies against scraping, and violating these can lead to legal consequences and account suspension. Always review LinkedIn's terms of service and use scraping tools responsibly.

Conclusion

Scraping LinkedIn with Python and Playwright can be a powerful way to gather data for various legitimate purposes. However, it requires careful handling of authentication and compliance with LinkedIn's policies. The example provided is a starting point, and you can build upon it to create more sophisticated scrapers as needed.

Remember, the key to successful web scraping is not just about the technical aspects but also about respecting the legal and ethical boundaries of the platforms you are scraping from.


πŸ“š Resources

[1] MediaCrawler

⚑A versatile crawler for social media platforms like Xiaohongshu, Douyin, Kuaishou, and Bilibili.
🎯 To scrape videos, images, comments, likes, and shares from various social media platforms.
πŸ’‘ MediaCrawler features include Cookie, QR code, and mobile login; keyword search; data scraping by post ID; session caching; data persistence; and IP proxy pool support. It bypasses core encryption by using Playwright to execute JavaScript, simplifying the reverse-engineering process.
πŸ€– Create a versatile crawler that can handle multiple authentication methods, interact with different social media platforms, and collect a wide range of data, while also managing session state and utilizing IP proxies to avoid blocking.
πŸ”‘ Python, Playwright, JavaScript, SQL
πŸ†

[2] CEH-Exam-Questions

⚑A repository with 125 Certified Ethical Hacker (CEH) exam preparation questions and answers.
🎯 The code provides a study guide for individuals preparing for the CEH certification exam by offering practice questions and answers.
πŸ’‘ The project features a collection of 125 multiple-choice questions and answers, covering various topics of the CEH exam syllabus such as cryptography, system hacking, malware, sniffing, social engineering, and more. It's a valuable resource for self-assessment and exam preparation for aspiring ethical hackers.
πŸ€– Generate a code base for a web application that presents users with practice questions and answers for the Certified Ethical Hacker (CEH) exam, including features for self-assessment, topic-wise categorization, and timed quizzes.
πŸ”‘ HTML, CSS, JavaScript, Web Scraping (for question collection)
πŸ†

[3] hrequests

⚑A Python library providing a feature-rich replacement for the requests library with additional browser automation capabilities.
🎯 To facilitate seamless integration of HTTP requests and headless browser automation with advanced features for network concurrency, realistic browser header generation, and fast HTML parsing.
πŸ’‘ HTTP and headless browser switching, fast HTML parsing, network concurrency with goroutines and gevent, TLS fingerprint replication, JavaScript rendering, HTTP/2 support, realistic browser header generation, fast JSON serialization, cursor movement and typing emulation, extension support, full page screenshots, CORS restriction avoidance, threadsafe operation, minimal standard library dependence, high performance.
πŸ€– Generate a Python library similar to hrequests with features like seamless HTTP and browser automation transition, advanced network concurrency, realistic browser header generation, and high-performance HTML parsing and JSON serialization.
πŸ”‘ Python, Playwright, goroutines, gevent, Go, selectolax, HTML parsing, JavaScript, TLS fingerprinting, HTTP/2
πŸ†

[4] threads-net

⚑An unofficial and reverse-engineered Python API wrapper for Threads.net.
🎯 To provide programmatic access to Threads.net functionality through reverse-engineered APIs.
πŸ’‘ The project offered programmatic interaction with Threads.net for educational and research purposes before it was discontinued.
πŸ€– Generate an unofficial Python API wrapper for educational research on interacting with a social media platform like Threads.net, ensuring respect for legal and ethical standards.
πŸ”‘ Python, API, Reverse Engineering
πŸ†

[5] hrequests

⚑A feature-rich replacement for Python's requests library with headless browser support.
🎯 To provide a high-performance, concurrent web scraping and browser automation tool with realistic browser header and TLS fingerprint generation.
πŸ’‘ HTTP and headless browsing, fast HTML parser, network concurrency with goroutines and gevent, TLS fingerprint replication, JavaScript rendering, HTTP/2 support, browser header generation, faster JSON serialization, browser crawling with human-like automation, extension support, full page screenshots, CORS bypass, and thread safety.
πŸ€– Generate a Python package that offers high-performance, concurrent web scraping and browser automation with features like TLS fingerprinting, JSON serialization, and headless browsing.
πŸ”‘ Python, Playwright, Goroutines, gevent, TLS fingerprinting, HTTP/2, JSON serialization, HTML parsing
πŸ†

[6] epg

⚑A Python3 and Django4 based EPG (Electronic Program Guide) data scraping and publishing system.
🎯 To scrape various online sources of TV program listings and generate xmltv format files for apps like Perfect Player to load EPG information.
πŸ’‘ Features include scraping EPG data from multiple sources, backend channel configuration, auto source switching on failure, API for EPG data publishing, and robust performance tested on standard office computers with high request volumes.
πŸ€– Can you generate a Python and Django based web application that scrapes EPG data from multiple sources, provides a backend for channel management, automatically switches sources on failure, and offers a public API for EPG data consumption, with proven scalability to handle millions of API requests daily?
πŸ”‘ Python3, Django4, Nginx, uWSGI, MySQL, SQLite3, requests, BeautifulSoup
πŸ†

[7] chatgpt_system_prompt

⚑A repository aimed at demonstrating how to extract system prompts and protect GPT instructions.
🎯 The code is intended to provide methods for extracting system prompts from ChatGPT, protecting GPT prompts from being leaked, and retrieving GPT's action schema.
πŸ’‘ The project includes methods to divert ChatGPT's attention to reveal system prompts, instructions to protect prompts from being leaked, steps to exploit sandbox files caching/optimization, and guidelines to contribute to the repository with a consistent format.
πŸ€– Create a guide on how to safely extract system prompts from ChatGPT and protect GPT instructions from being leaked, while also providing a method to access action schema for various GPT models.
πŸ”‘ ChatGPT, Markdown, Python, GitHub, Web Scraping
πŸ†

[8] flyscrape

⚑A standalone and scriptable web scraper that uses Go for speed and JavaScript for flexibility.
🎯 To provide a high-performance, configurable, and scriptable web scraping solution that simplifies data extraction tasks.
πŸ’‘ Configurable options, standalone binary, JavaScript scripting, simple API, fast iteration, request caching, zero dependencies
πŸ€– Generate a JSON object describing a Go-based web scraper with JavaScript integration for custom data extraction logic.
πŸ”‘ Go, JavaScript
πŸ†

[9] BilibiliPotPlayer

⚑A PotPlayer plugin for playing Bilibili videos directly from the website.
🎯 To enable users to play Bilibili videos and live streams through PotPlayer with added features like account login, search, and skipping intro/outro.
πŸ’‘ Allows direct playback of Bilibili content in PotPlayer, account login via cookies, search functionality within PotPlayer, skipping of video intros/outros, display of video thumbnails in playlist, and creation of auto-updating playlists.
πŸ€– Create a PotPlayer plugin that integrates with Bilibili, supports user login, video search, intro/outro skipping, thumbnail display, and has the ability to create auto-updating playlists.
πŸ”‘ PotPlayer Extension, AS Scripting, Web Scraping
πŸ†

[10] bananalyzer

⚑An AI Agent evaluation framework for web tasks with a focus on structured information retrieval.
🎯 To evaluate AI agents' abilities to perform structured data extraction and information retrieval across diverse web pages.
πŸ’‘ Banana-lyzer offers a CLI tool for running evaluations, support for static website snapshots via mhtml, diverse datasets across industries, predefined test intents with structured JSON output, and easy integration for custom agents.
πŸ€– Generate an AI-powered web task evaluation framework that allows for testing agents with static website snapshots, diverse intents, and structured JSON outputs, including CLI tooling and custom agent integration.
πŸ”‘ Python, Playwright, pytest, FastAPI, Poetry, CLI
πŸ†

[11] yt-fts

⚑A command-line tool to scrape YouTube channel subtitles and perform full text searches on them.
🎯 To enable users to search through YouTube channel subtitles for specific keywords or phrases and provide time-stamped URLs to the videos containing them.
πŸ’‘ The project includes features such as downloading subtitles, updating channel data, deleting channels from the database, and performing both regular and semantic searches using OpenAI's embeddings API.
πŸ€– Create a command-line tool for scraping YouTube channel subtitles, storing them in a database, and implementing full-text and semantic search capabilities using OpenAI's embeddings API.
πŸ”‘ Python, yt-dlp, SQLite, OpenAI's embeddings API
πŸ†

[12] Price-Tracking-Web-Scraper

⚑A web scraper for tracking product prices with a user interface, primarily focused on Amazon Canada
🎯 To enable users to automatically track and scrape product prices from e-commerce websites, starting with Amazon.ca, for comparison or alerting purposes.
πŸ’‘ Automated price tracking with a user interface; Configurable for multiple e-commerce sources; Easy setup with provided auth.json template; Backend API with Flask; Frontend with React; Playwright for browser automation; Bright Data integration for enhanced scraping; Automation through scheduler for regular price updates; Windows batch script for task scheduling.
πŸ€– Generate a web-based application that allows users to track product prices from Amazon Canada using React for the frontend, Flask for the backend, Playwright for browser automation, and integrate Bright Data for web scraping. Include authentication setup, dependency installation guides, and automation scripts for both the Flask backend and the price tracking functionality.
πŸ”‘ React, Flask, Playwright, Bright Data
πŸ†

[13] crawlProject

⚑A collection of web crawling projects for educational purposes.
🎯 To provide a variety of web scraping and crawling techniques and examples for educational and practice purposes.
πŸ’‘ The project includes tutorials on using different libraries for web scraping, handling automation with Selenium and Playwright, overcoming captcha challenges, and dealing with javascript obfuscation and environment detection.
πŸ€– Generate a comprehensive web scraping tutorial project with examples using technologies like requests, lxml, Playwright, Selenium, Scrapy, and feapder, including solutions for captchas and JS obfuscation.
πŸ”‘ requests, curl_cffi, lxml, playwright, ddddocr, selenium, scrapy, feapder, pycryptodome, pyexecjs2, m3u8, prettytable, tqdm, loguru, retrying, crypto-js, jsdom, tough-cookie
πŸ†

[14] Mind2Web

⚑A dataset, code, and models to develop and evaluate generalist agents that follow language instructions to complete web tasks.
🎯 To support research on building generalist web agents capable of performing diverse tasks on real-world websites.
πŸ’‘ The project provides a dataset with over 2,000 tasks from 137 websites, code for fine-tuning and evaluating models, and pre-trained models for candidate generation and action prediction.
πŸ€– Develop a comprehensive dataset and modeling framework for training and evaluating generalist web agents capable of performing tasks across various domains and websites.
πŸ”‘ Python, DeBERTa-v3, T5, SentenceTransformer, Huggingface Transformers, Playwright, Globus, Hydra
πŸ†

[15] emploleaks

⚑An OSINT tool for gathering information about company employees and potential leaked credentials.
🎯 To identify and retrieve public and potentially sensitive information about company employees including personal emails and leaked passwords.
πŸ’‘ Uses LinkedIn to find employee information, integrates with GitLab for personal code repositories, and searches a custom COMB database for leaked passwords. Supports connection to databases, cookie-based authentication, and provides an indexed COMB database building guide.
πŸ€– Generate a Python-based OSINT tool that leverages LinkedIn for employee data, integrates with GitLab, and checks for leaked passwords using a custom COMB database.
πŸ”‘ Python, LinkedIn API, COMB database, PostgreSQL, OSINT techniques
πŸ†

[16] google-maps-scraper

⚑A Python-based tool for scraping and extracting lead information from Google Maps.
🎯 To scrape Google Maps for business leads and provide direct contact information for sales and marketing purposes.
πŸ’‘ The scraper can extract emails, social media profiles, and other contact details of businesses listed on Google Maps. It sorts and filters leads by reviews, ratings, and other criteria, supports scraping multiple queries, and resumes from where it left off if interrupted. Additionally, it can scrape listings from all cities in a specified country and provides detailed instructions for non-technical users.
πŸ€– Create a Python-based Google Maps scraping tool that can extract business contact details, filter and sort leads, and support multiple queries and cities.
πŸ”‘ Python, Botasaurus Framework, RapidAPI, Gitpod
πŸ†

[17] FinNLP

⚑A project for scraping and analyzing financial data from various sources.
🎯 To aggregate and analyze financial news, social media sentiment, and company announcements from multiple platforms for US and Chinese markets.
πŸ’‘ FinNLP features data extraction from news platforms and social media, offering insights into financial markets by analyzing news headlines, social media posts, and company announcements.
πŸ€– Generate a Python-based tool that scrapes, processes, and analyzes financial data from news outlets, social media, and company announcements, with the ability to handle various APIs and data formats.
πŸ”‘ Python, Finnhub API, Sina Finance API, Eastmoney API, Stocktwits API, Reddit Streaming API, Weibo API, SEC API, Juchao API
πŸ†
