refactor: reorganize docs and pocs under INFO/

- docs/ → INFO/DOCS/CONTEXT/ (technical documentation in markdown)
- FLUJOS/DOCS/ + FLUJOS_DATOS/DOCS/ → INFO/DOCS/ (architecture txts)
- POCS/ → INFO/POCS/ (proofs of concept)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-21 23:49:33 +02:00 · 2026-04-21 23:49:33 +02:00 · 954f47996f
commit 954f47996f
parent 83f67b76b4
33 changed files with 0 additions and 0 deletions
--- a/INFO/DOCS/CONTEXT/SCRAPER_NOTICIAS.md
+++ b/INFO/DOCS/CONTEXT/SCRAPER_NOTICIAS.md
@@ -0,0 +1,270 @@
+# News Scraper: FLUJOS technical context
+**Date:** 2026-04-21  
+**File:** `FLUJOS_DATOS/NOTICIAS/main_noticias.py`  
+**Environment:** `FLUJOS_DATOS/myenv/` (Python 3.11, venv)
+
+---
+
+## What it does
+
+Recursive web scraper that:
+1. Visits ~90 URLs of international media outlets
+2. Explores their pages recursively up to depth 6
+3. Downloads articles (text) and attached files (PDF, CSV, DOCX, XLSX, ZIP)
+4. Translates the content into Spanish if it is in another language
+5. Cleans it and tokenizes it with BERT
+6. Saves the results to disk; loading into MongoDB (`noticias` collection) happens in a later pipeline phase
+
+---
+
+## Source list (~90 outlets)
+
+```python
+urls = [
+    # Investigative databases
+    'https://reactionary.international/database/',
+    'https://aleph.occrp.org/',           # OCCRP: investigative journalism
+    'https://offshoreleaks.icij.org/',    # ICIJ: tax havens
+
+    # Spanish press
+    'https://www.publico.es/', 'https://www.elsaltodiario.com/',
+    'https://elpais.com/', 'https://www.elmundo.es/', 'https://www.abc.es/',
+    'https://www.lavanguardia.com/', 'https://www.elconfidencial.com/',
+    'https://www.eldiario.es/', 'https://www.rtve.es/', ...
+
+    # International press
+    'https://www.nytimes.com/', 'https://www.theguardian.com/',
+    'https://www.lemonde.fr/', 'https://www.spiegel.de/',
+    'https://www.washingtonpost.com/', 'https://www.aljazeera.com/',
+    'https://www.bbc.com/', 'https://www.reuters.com/',
+    'https://www.ft.com/', 'https://www.economist.com/', ...
+
+    # Tech / security press
+    'https://www.wired.com/', 'https://www.theregister.com/',
+    'https://www.arstechnica.com/', 'https://www.zdnet.com/',
+    'https://www.cyberdefensemagazine.com/', 'https://www.darkreading.com/', ...
+]
+```
+
+Total: ~90 seed URLs. Each one is explored recursively up to 6 levels deep.
+
+---
+
+## Recursive scraping flow
+
+```python
+def explore_and_extract_articles(url, articles_folder, files_folder,
+                                  processed_urls, size_limit, depth=0, max_depth=6):
+```
+
+```
+for each seed URL:
+    explore_and_extract_articles(url, depth=0, max_depth=6)
+        └── HTMLSession.get(url).html.render()  # runs JavaScript in headless Chromium
+            for each link found:
+                if link already processed: skip
+                processed_urls.add(link)
+                
+                if extension is PDF/CSV/DOCX/XLSX/ZIP/HTML/MD:
+                    download_and_save_file(link, files_folder)
+                else:
+                    extract_and_save_article(link, articles_folder)
+                    explore_and_extract_articles(link, depth+1)  # recursive
+                
+                if total size > 50 GB: stop
+    
+    explore_wayback_machine(url, articles_folder)  # Wayback Machine fallback
+```
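The flow above can be sketched as a small, network-free function. This is a minimal sketch under assumptions, not the code in `main_noticias.py`: `crawl`, `get_links`, `on_file` and `on_article` are hypothetical names, the two callbacks stand in for `download_and_save_file` and `extract_and_save_article`, and the exact place where the real scraper checks the depth is assumed.

```python
from typing import Callable, Iterable, Set

# Extensions treated as downloadable files (taken from the flow above).
FILE_EXTENSIONS = ('.pdf', '.csv', '.docx', '.xlsx', '.zip', '.html', '.md')


def crawl(url: str,
          get_links: Callable[[str], Iterable[str]],
          processed_urls: Set[str],
          on_file: Callable[[str], None],
          on_article: Callable[[str], None],
          depth: int = 0,
          max_depth: int = 6) -> None:
    """Depth-limited recursive walk over the links reachable from `url`."""
    if depth > max_depth:
        return  # assumed stop condition: do not expand pages beyond max_depth
    for link in get_links(url):
        if link in processed_urls:
            continue  # already handled, skip
        processed_urls.add(link)
        if link.lower().endswith(FILE_EXTENSIONS):
            on_file(link)      # stands in for download_and_save_file
        else:
            on_article(link)   # stands in for extract_and_save_article
            crawl(link, get_links, processed_urls, on_file, on_article,
                  depth + 1, max_depth)
```

Passing `get_links` as a parameter keeps the traversal testable against an in-memory link graph; in the scraper it would wrap the `HTMLSession` rendering step.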
+
+### JavaScript rendering
+
+Uses `requests-html` with headless Chromium (Pyppeteer) to render pages that load their content with JavaScript. This makes it possible to scrape outlets built as SPAs (React and similar).
+
+```python
+from requests_html import HTMLSession
+
+session = HTMLSession()
+response = session.get(url, timeout=30)
+response.html.render(timeout=30, sleep=1)  # wait 1 s for the JS to load
+links = response.html.absolute_links
+```
+
+---
+
+## Article extraction and cleaning
+
+```python
+import requests
+from bs4 import BeautifulSoup
+
+def extract_and_save_article(url, articles_folder):
+    response = requests.get(url, timeout=30)
+    soup = BeautifulSoup(response.content, 'html.parser')
+    
+    # note: raises AttributeError if the page has no <title>
+    title = soup.find('title').get_text().strip()
+    paragraphs = soup.find_all('p')
+    content = ' '.join([p.get_text() for p in paragraphs])
+    
+    translated = translate_text(content)    # → Spanish
+    cleaned = clean_text(translated)         # → cleaning + stopword removal
+    
+    filename = clean_filename(title) + '.txt'
+    # save to articles_folder/filename
+```
+
+### Automatic translation
+
+```python
+from deep_translator import GoogleTranslator
+
+def translate_text(text):
+    try:
+        return GoogleTranslator(source='auto', target='es').translate(text)
+    except Exception:
+        return text  # on failure, return the original untranslated text
+```
+
+Uses Google Translate via `deep-translator`, with automatic language detection. On failure it returns the original, untranslated text.
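Because the translator is reported to accept only about 5,000 characters per call (see the known limitations), long articles can be split before translation. The following is a minimal sketch under assumptions: `split_into_chunks` and `translate_long_text` are hypothetical helpers, not part of `main_noticias.py`, and the 4,500-character default is an arbitrary safety margin.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 4500) -> List[str]:
    """Split text into pieces of at most max_chars, cutting at spaces when possible."""
    chunks: List[str] = []
    remaining = text.strip()
    while len(remaining) > max_chars:
        # Last space that still leaves the chunk within max_chars.
        cut = remaining.rfind(' ', 0, max_chars + 1)
        if cut <= 0:
            cut = max_chars  # no space found: fall back to a hard cut
        chunks.append(remaining[:cut].strip())
        remaining = remaining[cut:].strip()
    if remaining:
        chunks.append(remaining)
    return chunks


def translate_long_text(text: str,
                        translate: Callable[[str], str],
                        max_chars: int = 4500) -> str:
    """Translate chunk by chunk and join the results with single spaces."""
    return ' '.join(translate(chunk) for chunk in split_into_chunks(text, max_chars))
```

A call such as `translate_long_text(content, translate_text)` would then replace the direct `translate_text(content)` call. Cutting at spaces can still split a sentence in two, which may slightly degrade translation quality at chunk boundaries.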
+
+### Text cleaning
+
+```python
+import re
+from bs4 import BeautifulSoup
+
+# STOPWORDS: set of stopwords defined elsewhere in the module (not shown here)
+def clean_text(text):
+    text = re.sub(r'<!\[\s*CDATA\s*\[.*?\]\]>', '', text, flags=re.S)  # strip CDATA
+    soup = BeautifulSoup(text, 'html.parser')
+    text = soup.get_text(separator=" ")     # HTML → plain text
+    text = text.lower()
+    text = re.sub(r'http\S+', '', text)     # remove URLs
+    text = re.sub(r'[^a-záéíóúñü\s]', '', text)  # letters + whitespace only
+    text = re.sub(r'\s+', ' ', text).strip()
+    words = [w for w in text.split() if w not in STOPWORDS]
+    return ' '.join(words)
+```
+
+---
+
+## Processing downloaded files
+
+```python
+import os
+
+def process_files(files_folder, destination_folder):
+    for root, _dirs, files in os.walk(files_folder):  # os.walk yields (root, dirs, files)
+        for file in files:
+            file_path = os.path.join(root, file)
+            ext = os.path.splitext(file)[1].lower()
+            if ext == '.pdf':    content = read_pdf(file_path)    # PyPDF2
+            elif ext == '.csv':  content = read_csv(file_path)    # csv.reader
+            elif ext == '.txt':  content = open(file_path).read()
+            elif ext == '.docx': content = read_docx(file_path)   # python-docx
+            elif ext == '.xlsx': content = read_xlsx(file_path)   # openpyxl
+            elif ext == '.zip':  content = read_zip(file_path)    # zipfile
+            elif ext in ('.html', '.md'): content = format_content(file_path)  # html2text; argument assumed
+            else: continue  # unsupported extension
+
+            translated = translate_text(content)
+            cleaned = clean_text(translated)
+            tokenize_and_save(cleaned, file, destination_folder)
+```
+
+---
+
+## BERT tokenization
+
+```python
+from transformers import BertTokenizer
+tokenizer = BertTokenizer.from_pretrained('dccuchile/bert-base-spanish-wwm-cased')
+
+def tokenize_and_save(text, filename, destination_folder):
+    tokens = tokenizer.encode(text, truncation=True, max_length=512, add_special_tokens=True)
+    tokens_str = ' '.join(map(str, tokens))
+    with open(f'{destination_folder}/{filename}', 'w') as f:  # closes the file deterministically
+        f.write(tokens_str)
+```
+
+Same Spanish BERT model as the Wikipedia scraper. Input is truncated to 512 tokens, so anything beyond that is dropped.
+
+---
+
+## URL deduplication
+
+```python
+def register_processed_notifications(base_folder, urls):
+    """Reads/writes processed_articles.txt to avoid re-processing URLs."""
+    txt_path = os.path.join(base_folder, "processed_articles.txt")
+    processed_urls = set(open(txt_path).read().splitlines())
+    urls_to_process = [u for u in urls if u not in processed_urls]
+    # Append the new URLs to the file
+    return urls_to_process
+```
+
+Already-processed URLs are stored in `NOTICIAS/processed_articles.txt`. This deduplicates at the seed-URL level only; it does NOT prevent the same article from being saved twice when it is reached through different URLs.
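One possible way to catch the same article arriving from different URLs is to fingerprint the cleaned text. This is a sketch under assumptions, not something the scraper currently does: `content_fingerprint` and `is_duplicate` are hypothetical helpers, and SHA-256 over whitespace-normalized, lowercased text is just one reasonable choice.

```python
import hashlib
import re
from typing import Set


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r'\s+', ' ', text).strip().lower()
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()


def is_duplicate(text: str, seen_fingerprints: Set[str]) -> bool:
    """Return True if an identical text was seen before; otherwise record it."""
    fingerprint = content_fingerprint(text)
    if fingerprint in seen_fingerprints:
        return True
    seen_fingerprints.add(fingerprint)
    return False
```

This only detects texts that are identical after normalization. Near-duplicates, such as the same wire story with a different headline, would need a similarity measure instead of an exact hash.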
+
+---
+
+## Configured limits
+
+```python
+FOLDER_SIZE_LIMIT = 50 * 1024 * 1024 * 1024  # 50 GB max on disk
+max_depth = 6                                   # maximum recursion depth
+```
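A disk limit like this is typically enforced by summing the file sizes under the output folder. The sketch below assumes that approach; `get_folder_size` and `limit_reached` are hypothetical names, and how `main_noticias.py` actually measures the folder is not shown in this document.

```python
import os

FOLDER_SIZE_LIMIT = 50 * 1024 * 1024 * 1024  # same value as the configured limit


def get_folder_size(folder: str) -> int:
    """Total size in bytes of all regular files under `folder`."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):  # skips broken symlinks
                total += os.path.getsize(path)
    return total


def limit_reached(folder: str, size_limit: int = FOLDER_SIZE_LIMIT) -> bool:
    """True once the folder holds more bytes than the limit."""
    return get_folder_size(folder) > size_limit
```

Walking the whole tree on every check gets expensive as the folder grows; keeping a running byte counter and updating it after each save would avoid repeated scans.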
+
+---
+
+## On-disk file layout (ignored by git)
+
+```
+FLUJOS_DATOS/NOTICIAS/
+├── articulos/          # .gitignore: one .txt per scraped article
+├── archivos/           # .gitignore: downloaded PDF, CSV, DOCX, etc.
+├── tokenized/          # .gitignore: BERT token IDs per document
+├── processed_articles.txt  # .gitignore: URLs already processed
+├── noticias_procesadas.txt # .gitignore
+├── main_noticias.py
+└── docs.txt
+```
+
+---
+
+## Generated MongoDB document (`noticias` collection)
+
+Insertion into MongoDB is done by `pipeline_mongolo.py` (Phase 2), not by the scraper itself. The scraper only writes to disk.
+
+```json
+{
+  "_id": ObjectId,
+  "archivo": "titulo-de-la-noticia.txt",
+  "tema": "guerra global",
+  "subtema": "conflictos internacionales",
+  "texto": "texto limpio de la noticia...",
+  "fecha": ISODate | null
+}
+```
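The `fecha` field could be filled from the page's `article:published_time` meta tag, which the scraper does not currently read. The following is a minimal standard-library sketch under assumptions: `extract_published_time` is a hypothetical helper, and it assumes the tag carries an ISO 8601 timestamp.

```python
from datetime import datetime
from html.parser import HTMLParser
from typing import Optional


class _PublishedTimeParser(HTMLParser):
    """Collects the content of the first article:published_time meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.published: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.published is not None:
            return
        attributes = dict(attrs)
        if attributes.get('property') == 'article:published_time':
            self.published = attributes.get('content')


def extract_published_time(html: str) -> Optional[datetime]:
    """Return the publication datetime, or None if missing or unparseable."""
    parser = _PublishedTimeParser()
    parser.feed(html)
    if not parser.published:
        return None
    try:
        # Older Python versions do not accept a trailing 'Z' in fromisoformat.
        return datetime.fromisoformat(parser.published.replace('Z', '+00:00'))
    except ValueError:
        return None
```

Inside `extract_and_save_article`, the same lookup could be done on the existing `soup` object with `soup.find('meta', attrs={'property': 'article:published_time'})`.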
+
+---
+
+## Wayback Machine as a fallback
+
+```python
+def explore_wayback_machine(url, articles_folder):
+    api_url = f"http://archive.org/wayback/available?url={url}"
+    data = requests.get(api_url, timeout=30).json()
+    closest = data.get('archived_snapshots', {}).get('closest')
+    if not closest:
+        return  # no archived snapshot available for this URL
+    extract_and_save_article(closest['url'], articles_folder)
+```
+
+If an outlet is down or blocks the scraper, it tries to fetch the most recent archived version from archive.org.
+
+---
+
+## Known limitations
+
+1. **Slow headless rendering**: `requests-html` uses Pyppeteer/Chromium, at ~2–5 s/page. At that rate, scaling to 20,000 articles takes roughly 11 to 28 hours of rendering alone.
+2. **No robots.txt compliance**: The scraper does not check robots.txt. Some outlets block scraping.
+3. **Paywalls**: Outlets such as FT, NYT and WSJ block access without a subscription. The scraper only gets what is public.
+4. **Translating long texts**: `deep-translator` has a limit of ~5,000 chars per call. Long texts can fail silently.
+5. **No publication date**: The HTML title is parsed, but not the `<meta property="article:published_time">` metadata. The `fecha` field usually ends up empty.
+6. **Recursion with no breadth limit**: A page with 1,000 links triggers 1,000 recursive calls. This can take a long time on large sites.
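For the robots.txt limitation, Python's standard library already ships a parser. The following is a minimal sketch under assumptions: `build_robots_parser` and `is_allowed` are hypothetical helpers, and the `flujos-scraper` user agent is an invented placeholder, not a value taken from the project.

```python
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the raw text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str,
               user_agent: str = 'flujos-scraper') -> bool:
    """True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

Against a live site, the parser can instead be loaded with `parser.set_url('https://example.com/robots.txt')` followed by `parser.read()`, ideally once per domain and then cached.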
+
+---
+
+## Dependencies
+
+```
+requests
+requests-html       # HTMLSession + Pyppeteer
+beautifulsoup4
+html2text
+deep-translator     # unofficial Google Translate API
+PyPDF2
+python-docx
+openpyxl
+transformers        # BertTokenizer
+tqdm
+pymongo
+```