mirror of
https://github.com/AndyTheFactory/newspaper4k.git
synced 2025-04-04 14:11:16 +08:00
* feat(lang): ⚡ Rework of tokenizer. Additionally implemented new (easier) way of adding languages to the package
* feat(lang): added language dependencies as optional
* feat(lang): add Bengali, Nepali, Tamil, Georgian, Marathi, Telugu, Latvian, Czech, Slovak, Burmese language support
* refactor(lang): moving all language related files in languages folder
* refactor(lang): added valid_languages function that returns available languages
* refactor(misc): ⚡ removed ParsingCandidate, RawHelper, URLHelper classes. Removed link_hash from article (was never used)
* refactor(parse): article.link_hash is no longer available
* fix(cli): json output in stdout missing []
* feat(parse): 🔥 article is now pickleable
* feat(parse): 🔥 Source object is now pickleable
* refactor(parse): ✨ Tidying up the gravity scoring process. No changes in the final score result
* refactor(parse): 🚀 compute word statistics for a node taking children nodes into account
* fix(parse): ⚡ Bug with auto detecting website language. If no language supplied, the detected language was not used
* fix(parse): ⚡ added figure as a tag to be removed before text generation
* fix(parse): 🔥 better article paragraph detection
* fix(parse): ⚡ get feeds fixed, it was not parsing the main page for possible feeds
* fix(misc): ✨ tidying up some code in urls.py
* feat(misc): better typing support and type hinting Author: Tom Parker-Shemilt <palfrey@***.net>
* feat(misc): Simplify favicon return Author: Tom Parker-Shemilt <palfrey@***.net>
* feat(misc): Basic mypy support Author: Tom Parker-Shemilt <palfrey@***.net>
* feat(doc): 📝 adding evaluation results
* feat(doc): 🚀 Documentation Update. Added Examples, documented new features
* refactor(core): Minimum Python now 3.8; Also test 3.10/11/12 Author: Tom Parker-Shemilt <palfrey@***.net>
* refactor(core): run gh actions on PRs. Author: Tom Parker-Shemilt <palfrey@***.net>
* refactor(core): Set SETUPTOOLS_USE_DISTUTILS. setuptools as per numpy recommendations. Upgrade numpy and pandas for >= 3.9. Author: Tom Parker-Shemilt <palfrey@***.net>
* refactor(core): Upgrade regex, virtualenv to avoid breaking pre-commit, distutils for everyone. Author: Tom Parker-Shemilt <palfrey@***.net>
* feat(sources): ✨ new option when building sources. You can limit the article parsing to the source home page only. Other categories or feeds are then ignored
* feat(misc): 📈 added cloudscraper as optional dependency. If installed, it will use cloudscraper as a layer over requests. Cloudscraper tries to bypass Cloudflare protection
* feat(lang): ✨ New integration of Google news using GNews module. You can now use GoogleNewsSource to search and parse news based on keywords, topic, location or website
* fix(parse): ⚡ Better title parsing. Added language specific regex for article titles
* feat(parse): ✨ added filter that limits the source.build to a specific category. Use source.build(url,only_in_path=True) to scrape only stories that are in the starting url path
* fix(parse): 🔥 better binary content detection
* fix(lang): ⚡ better is_highlink_density for non-latin languages
* feat(lang): 📝 added stopwords for af, br, ca, eo, eu, ga, gl, gu, ha, hy, ku, ms, so, st, tl, ur, yo, zu from https://github.com/stopwords-iso
* refactor(parse): 💥 deprecated text_cleaned, clean_doc. Removed clean_top_node, article.clean_top_node is removed. Fails if accessed
* feat(lang): 🚀 added support for another 13 languages
* fix(misc): 🎨 mypy stubs for gnews and cloudscraper + small typing fixes
* fix(parse): 🐛 better feed discovery in Source objects
* fix(parse): 🐛 fixed an issue with non latin high density detection
* docs(doc): 🔥 Added typing and docstrings to most of the code
* fix(types): 🎨 added stubs for gnews
* fix(misc): 🚑 python-setup github action version bump

Co-authored-by: Tom Parker-Shemilt <palfrey@tevp.net>
60 lines
1.8 KiB
YAML
Executable File
fail_fast: true
exclude: '^tests/data/'
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-ast  # checks Python syntax
      - id: check-json  # checks JSON syntax
      - id: check-merge-conflict  # checks for no merge conflict strings
      - id: check-shebang-scripts-are-executable  # checks all shell scripts have executable permissions
      - id: check-toml  # checks TOML syntax
      - id: check-yaml  # checks YAML syntax
      - id: end-of-file-fixer  # checks there is a newline at the end of the file
      - id: mixed-line-ending  # normalizes line endings
      - id: no-commit-to-branch  # prevents committing to main
      - id: trailing-whitespace  # trims trailing whitespace
        args: [--markdown-linebreak-ext=md]

  - repo: https://github.com/psf/black
    rev: 23.9.1
    hooks:
      - id: black-jupyter
        args:
          - "--preview"

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.0.290
    hooks:
      - id: ruff

  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.5
    hooks:
      - id: codespell
        args: ['--ignore-words=.codespell-dictionary.txt']
        exclude: >
          (?x)^(
              .*\.json|
              .*\.pyc|
              .*\.lock|
              .*\.html|
              .*/stopwords.*\.txt|
              tests/data/.*|
              newspaper/languages.py|
              newspaper/languages/.*|
              newspaper/resources/.*
          )$
        additional_dependencies:
          - tomli

  - repo: local
    hooks:
      - id: update-requirements-txt
        name: update-requirements-txt
        description: 'Generate requirements.txt based on poetry.lock'
        entry: poetry
        args: [export, --without-hashes, --without, dev, --without-urls, --format, requirements.txt, --output, requirements_poetry.txt]
        language: system
        pass_filenames: false
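The codespell hook's `exclude` value is a verbose-mode (`(?x)`) regular expression, so the whitespace and line breaks inside it are not part of the pattern. As a rough way to check which paths it skips, the sketch below copies the pattern into Python and tests a few paths. It assumes pre-commit applies `exclude` with `re.search` semantics against repository-relative paths; the helper name `is_excluded` and the sample paths are hypothetical and only for illustration.

```python
import re

# Pattern copied from the codespell hook's `exclude` key.
# (?x) is verbose mode: whitespace and newlines in the pattern are ignored.
CODESPELL_EXCLUDE = re.compile(r"""(?x)^(
    .*\.json|
    .*\.pyc|
    .*\.lock|
    .*\.html|
    .*/stopwords.*\.txt|
    tests/data/.*|
    newspaper/languages.py|
    newspaper/languages/.*|
    newspaper/resources/.*
)$""")


def is_excluded(path: str) -> bool:
    """Return True if the path matches the exclude pattern.

    Hypothetical helper; assumes re.search semantics on a
    repository-relative path with forward slashes.
    """
    return CODESPELL_EXCLUDE.search(path) is not None


if __name__ == "__main__":
    # Sample paths are made up for illustration.
    for sample in (
        "poetry.lock",
        "docs/index.html",
        "newspaper/languages.py",
        "newspaper/article.py",
        "README.md",
    ):
        print(f"{sample}: {'skipped' if is_excluded(sample) else 'checked'}")
```

Note that the `.` in `newspaper/languages.py` is unescaped in the original pattern, so it matches any single character at that position; this is harmless here but slightly broader than a literal file name.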